Abstract
Recently, skeleton-based action recognition has become an important topic in computer vision. Accurately modeling a human action and precisely distinguishing similar actions remain challenging tasks. In this paper, an action (skeleton sequence) is represented as a third-order nonnegative tensor time series to capture the original spatiotemporal information of the action. Since a linear dynamical system (LDS) is an efficient tool for encoding spatiotemporal data in various disciplines, this paper proposes a nonnegative tensor-based LDS (nLDS) to model the third-order nonnegative tensor time series. Nonnegative Tucker decomposition (NTD) is utilized to estimate the parameters of the nLDS model. These parameters are used to build the extended observability sequence of the action, which can therefore be considered as its feature descriptor. To avoid the limitations introduced by approximating the extended observability sequence with a finite-order matrix, we represent an action as a point on an infinite Grassmann manifold comprising the orthonormalized extended observability sequences. The classification task is then performed by dictionary learning and sparse coding on the infinite Grassmann manifold. Experimental results on the MSR-Action3D, UTKinect-Action, and G3D-Gaming datasets demonstrate that the proposed approach achieves better performance than state-of-the-art methods.
1. Introduction
Human action recognition based on spatiotemporal data has been one of the most prominent research topics owing to its applications in human-computer interfaces [1], gaming [2], and surveillance systems [3]. Over the past few decades, numerous methods have been proposed for recognizing human actions from monocular RGB videos [4]. However, monocular RGB data are very sensitive to background clutter, occlusion, viewpoint variations, and illumination changes. Thus, despite decades of significant research, accurately recognizing human actions from RGB videos remains a challenging problem. As a human skeleton can be viewed as an articulated system of rigid bodies connected by joints, a human action can be described as the spatiotemporal evolution of a series of skeletons. Therefore, if human skeleton sequences can be accurately extracted from RGB videos, action recognition can be performed by classifying the skeleton sequences. However, reliably extracting the human skeleton from monocular video sensors is extremely difficult. With the development of cost-effective depth sensors [5], it has become easier to extract the three-dimensional (3D) positions of skeletal joints from depth data. Hence, skeleton-based action recognition has once again become an active area of research.
When recognizing human actions, action representations that embody the temporal dynamics can provide a more relevant description than static data [6]. A linear dynamical system (LDS) [7] is an effective tool for capturing spatiotemporal data in various disciplines. Hence, the authors of [6, 8] employed an LDS model to capture the spatiotemporal information of an action (skeleton sequence) and used the singular value decomposition (SVD) [9] or Tucker decomposition [10] to estimate the model parameters. These parameters were used to build a finite observability matrix, and an action, represented by an LDS, was then identified as a point on a finite Grassmann manifold corresponding to the column space of that matrix.
Motivated by the above methods, this paper proposes a novel approach to model and analyze human actions. The overall approach is shown in Figure 1. In this study, an action (a skeleton sequence) is represented as a third-order nonnegative tensor and each skeleton is converted into a second-order nonnegative tensor. To retain the spatiotemporal information of an action to the maximum extent, a nonnegative tensor-based LDS (nLDS) is proposed to model the action. In this work, nonnegative Tucker decomposition (NTD) [11] is used to decompose the third-order nonnegative tensor for improving the accuracy of action recognition. Because the NTD is a powerful tool to extract a part-based representation of high-dimensional tensors, an action can be represented by a linear combination of relevant components.

The parameter tuple (A, C) of the nLDS model is learned from the NTD of an action. The parameters C and A represent the appearance and dynamics of the nLDS model, respectively. Thus, it is appropriate to regard the extended observability sequence built from (A, C) as the feature descriptor of the action. The conventional method approximates this sequence by a finite-order matrix and maps it to a point on a finite Grassmann manifold. However, the chosen order affects both how well the finite matrix captures the asymptotic behavior of the extended observability sequence and the computational complexity. To avoid the limitations introduced by choosing this order, an action is instead represented as a point on an infinite Grassmann manifold consisting of the orthonormalized extended observability sequences. Finally, classification is performed using dictionary learning and sparse coding on the infinite Grassmann manifold.
The main contributions of this study are as follows:
(1) To retain the spatiotemporal information of an action, an nLDS is used to model the action, which is represented as a third-order nonnegative tensor.
(2) Compared with the Tucker decomposition, the uniqueness of NTD makes the representation of an action more discriminative. Thus, to further improve the accuracy of action recognition, NTD is used to estimate the parameters of the nLDS model. The parameters are utilized to build an extended observability sequence that can be considered as the feature descriptor of the action.
(3) To overcome the limitation caused by approximating the extended observability sequence with a finite-order matrix, an action is represented as a point on an infinite Grassmann manifold consisting of the orthonormalized extended observability sequences.
The rest of the paper is organized as follows: Section 2 reviews the related work; Section 3 briefly introduces fundamental concepts of the Tucker model, NTD, and LDS; Section 4 elaborates the nLDS model and describes how to represent an action as a point on the infinite Grassmann manifold; Section 5 presents our experimental results; and Section 6 concludes the paper.
2. Related Work
A brief overview of skeleton-based action recognition approaches is provided in this section. Existing skeleton-based approaches can be categorized into two types.

The first type represents the human skeleton as a set of skeletal joints. Wang et al. [22] employed pairwise relative positions of the joints to represent a human skeleton and used a hierarchy of Fourier coefficients to model the temporal evolution of this representation; to obtain discriminative joint combinations, they used a multiple kernel learning approach to characterize human actions. Li et al. [23] used a graph-based model to represent the relative spatial variations between skeletal joints and utilized the relative variance of the joint relative distance (RVJRD) [24] to indicate the activity level of each joint pair and select the most informative ones. To derive the spatiotemporal compatibility of skeletal joints between different actions, Koniusz et al. [25] used a sequence compatibility kernel (SCK) to capture the spatial and temporal similarities between the skeletons of an action; furthermore, they employed a dynamics compatibility kernel to represent the similarity between pairs of skeletons in a given action and thus capture its spatiotemporal dynamics. Zhu et al. [26] fed the raw 3D skeletal joint locations to an end-to-end fully connected deep long short-term memory (LSTM) network for recognizing skeleton-based human actions. Ding et al. [6] represented a human skeleton as a 3D joint-based tensor, so that a human action became a third-order tensor time series; they proposed a tensor-based LDS (tLDS) to model the tensor time series and estimated the parameters of the LDS model using the Tucker decomposition. To eliminate the effects of noise and occlusion in 3D skeleton data, Liu et al. [17] proposed a spatiotemporal long short-term memory (ST-LSTM) network with trust gates; by analyzing the reliability of the skeleton data, the trust gates dynamically update the long-term context information stored in the memory cell. Lee et al. [14] built an ensemble Temporal Sliding LSTM (TS-LSTM) network composed of short-term, medium-term, and long-term TS-LSTM subnetworks, which capture the temporal dependencies between skeletons and the spatial dependency within each skeleton.

The second type of skeleton-based action recognition approaches represents a human skeleton as a set of connected rigid bodies. The authors of [12] represented human actions as curves in a Lie group; to simplify the classification task, the actions represented in the Lie group were mapped to the corresponding Lie algebra, which is a vector space. In [8], the authors divided a skeleton into smaller body parts and employed certain bioinspired shape features to represent each body part; an LDS was used to learn the temporal evolution of the bioinspired features. Using motion velocities, motion directions, and curvatures of the 3D trajectories, Ding et al. [27] divided actions into two types of action-units, dynamic instants and intervals; they utilized self-organizing maps (SOM) [28] to cluster the action-units with their spatiotemporal features and employed the resulting sequences of discrete symbols to build profile hidden Markov models (PHMMs) [29], capturing the spatiotemporal relations between the action-units of each action. Huang et al. [21] designed a neural network architecture to learn the most informative Lie group representations; based on the proposed network structure, they fed the Lie group features into rotation mapping layers to obtain the desired results.
3. Brief Review of Basic Concepts
A brief overview of the Tucker model, NTD, and LDS is presented here, which will help in understanding the tensor-based LDS. Let $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ denote an N-order tensor, $X_{(n)}$ the mode-n matricization of $\mathcal{X}$, $U \in \mathbb{R}^{J \times I_n}$ a matrix, and $A^{(n)} \in \mathbb{R}^{I_n \times J_n}$ the mode-n matrix of the Tucker model. If all elements of $\mathcal{X}$ are nonnegative (i.e., $x_{i_1 i_2 \cdots i_N} \ge 0$ for every index), $\mathcal{X}$ is called a nonnegative tensor. The mode-n matricization rearranges the elements of $\mathcal{X}$ into the matrix $X_{(n)} \in \mathbb{R}^{I_n \times (I_1 \cdots I_{n-1} I_{n+1} \cdots I_N)}$, whose rows are indexed by the mode-n index.
3.1. Tucker Model
The mode-n product of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ by a matrix $U \in \mathbb{R}^{J \times I_n}$ is defined as the $(I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N)$-tensor given by

$(\mathcal{X} \times_n U)_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_N} = \sum_{i_n = 1}^{I_n} x_{i_1 i_2 \cdots i_N}\, u_{j i_n} \quad (1)$

for all the index values. With the help of the mode-n product, (1) can be rewritten in terms of matrix unfoldings by fixing the n-th mode:

$\mathcal{Y} = \mathcal{X} \times_n U \iff Y_{(n)} = U X_{(n)} \quad (2)$

where $Y_{(n)}$ and $X_{(n)}$ are the mode-n matrix unfoldings of $\mathcal{Y}$ and $\mathcal{X}$, respectively.
The Tucker model decomposes an N-order tensor $\mathcal{X}$ into the mode products of a core tensor $\mathcal{G} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$ and mode matrices as follows:

$\mathcal{X} \approx \mathcal{G} \times_1 A^{(1)} \times_2 A^{(2)} \cdots \times_N A^{(N)} \quad (3)$

where $A^{(n)} \in \mathbb{R}^{I_n \times J_n}$ ($n = 1, \ldots, N$) are the factor (mode) matrices.
The mode-n matricization of $\mathcal{X}$ in (3) can be expressed through the mode-n matricization of the core tensor and the mode matrices:

$X_{(n)} = A^{(n)} G_{(n)} \left(A^{(N)} \otimes \cdots \otimes A^{(n+1)} \otimes A^{(n-1)} \otimes \cdots \otimes A^{(1)}\right)^T \quad (4)$

where $\otimes$ denotes the Kronecker product.
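To make these conventions concrete, the following minimal numpy sketch (the helper names are ours, not from the paper) implements the mode-n product and the mode-n matricization and numerically checks the equivalence stated in (2):

```python
import numpy as np

def unfold(T, n):
    """Mode-n matricization: rows are indexed by mode n."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def mode_n_product(T, U, n):
    """Mode-n product T x_n U: contracts mode n of T with the columns of U."""
    out = np.tensordot(T, U, axes=(n, 1))   # contracted mode is appended as the last axis
    return np.moveaxis(out, -1, n)

# numerical check of the identity (2): Y = X x_n U  <=>  Y_(n) = U X_(n)
rng = np.random.default_rng(0)
X = rng.random((4, 5, 6))
U = rng.random((3, 5))                      # acts on mode n = 1
Y = mode_n_product(X, U, 1)
assert np.allclose(unfold(Y, 1), U @ unfold(X, 1))
```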
3.2. Nonnegative Tucker Decomposition
Given a nonnegative N-order tensor $\mathcal{X}$, the NTD of $\mathcal{X}$ obtains a core tensor $\mathcal{G}$ and mode matrices $A^{(n)}$, all restricted to having only nonnegative elements, such that

$\mathcal{X} \approx \mathcal{G} \times_1 A^{(1)} \times_2 A^{(2)} \cdots \times_N A^{(N)}, \quad \mathcal{G} \ge 0, \; A^{(n)} \ge 0 \quad (5)$
To search for an approximate factorization of tensor $\mathcal{X}$, a cost function is used to quantify the quality of the approximation $\hat{\mathcal{X}} = \mathcal{G} \times_1 A^{(1)} \cdots \times_N A^{(N)}$. The generalized Kullback–Leibler divergence (or I-divergence) is usually used to construct the cost function:

$D\big(\mathcal{X} \,\|\, \hat{\mathcal{X}}\big) = \sum_{i_1, \ldots, i_N} \left( x_{i_1 \cdots i_N} \log \frac{x_{i_1 \cdots i_N}}{\hat{x}_{i_1 \cdots i_N}} - x_{i_1 \cdots i_N} + \hat{x}_{i_1 \cdots i_N} \right) \quad (6)$

To obtain the mode matrices and the core tensor of $\mathcal{X}$, Kim et al. [11] minimized cost function (6) with multiplicative updating algorithms, as follows.
Problem 1. Minimize $D(\mathcal{X} \| \hat{\mathcal{X}})$ with respect to $\mathcal{G}$ and $A^{(1)}, \ldots, A^{(N)}$, subject to the constraints $\mathcal{G} \ge 0$ and $A^{(n)} \ge 0$ for all n.
They collect the Kronecker product of the mode matrices other than $A^{(n)}$, taken in a backward cyclic order, together with the mode-n matricization of the core into a single matrix $S_n = G_{(n)} \big(\bigotimes_{k \ne n} A^{(k)}\big)^T$. The mode-n matricization of the NTD can then be rewritten in the form

$X_{(n)} \approx A^{(n)} S_n$
Kim et al. [11] derived multiplicative updating algorithms for the mode matrices and the core tensor of the NTD as follows.

The update rule for the mode matrices is

$A^{(n)} \leftarrow A^{(n)} \circledast \frac{\big(X_{(n)} \oslash (A^{(n)} S_n)\big)\, S_n^T}{\mathbf{1}\, S_n^T}$

The update rule for the core tensor is

$\mathcal{G} \leftarrow \mathcal{G} \circledast \frac{\big(\mathcal{X} \oslash \hat{\mathcal{X}}\big) \times_1 A^{(1)T} \cdots \times_N A^{(N)T}}{\mathbf{1} \times_1 A^{(1)T} \cdots \times_N A^{(N)T}}$

where $\hat{\mathcal{X}} = \mathcal{G} \times_1 A^{(1)} \cdots \times_N A^{(N)}$, $\oslash$ denotes element-wise division, $\circledast$ denotes the Hadamard (element-wise) product, and $\mathbf{1}$ is a matrix or tensor of the appropriate size whose elements are all 1. The I-divergence is nonincreasing under these update rules.
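For intuition, the following is a minimal numpy sketch of the matrix (NMF) special case of these multiplicative I-divergence updates; each mode-matrix update of the NTD has exactly this form, with the fixed matrix $S_n$ playing the role of H below. The function name and the toy data are ours.

```python
import numpy as np

def kl_nmf(V, r, n_iter=200, eps=1e-9):
    """Multiplicative I-divergence (KL) updates for V ~= W @ H, all factors nonnegative."""
    m, n = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (ones @ H.T + eps)   # same form as the mode-matrix rule
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ ones + eps)   # update of the other factor
        # the I-divergence D(V || W @ H) is nonincreasing across iterations
    return W, H

# toy usage: factor a random nonnegative matrix
V = np.random.default_rng(1).random((30, 20))
W, H = kl_nmf(V, r=5)
```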
3.3. Linear Dynamical Systems
An LDS is a multivariate time series (MTS) model that uses hidden states to indirectly represent the observation sequence. Given an MTS $Y = [y_1, y_2, \ldots, y_\tau]$, an LDS is usually described as

$x_{t+1} = A x_t + w_t, \qquad y_t = C x_t + v_t$

where the discrete variable $t$ is the time index, $x_t$ denotes the d-dimensional hidden state at time $t$, $y_t$ represents the n-dimensional observed state at time $t$, and d is the order of the LDS. $A \in \mathbb{R}^{d \times d}$ is the transition matrix that maps the current hidden state $x_t$ to the next hidden state $x_{t+1}$, and $C \in \mathbb{R}^{n \times d}$ is the observation matrix that maps the hidden state $x_t$ to the observed state $y_t$. The noise components $w_t$ and $v_t$ follow zero-mean multivariate normal distributions with covariance matrices $Q$ and $R$, respectively. In [30], Doretto et al. employed the singular value decomposition (SVD) of the observation sequence, $Y \approx U \Sigma V^T$, to obtain the best estimates of the observation matrix and the hidden state sequence as

$\hat{C} = U, \qquad \hat{X} = [\hat{x}_1, \ldots, \hat{x}_\tau] = \Sigma V^T$

where $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{\tau \times d}$ collect the d leading singular vectors and $\Sigma$ the corresponding singular values.
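A minimal numpy sketch of this SVD-based identification, assuming the observations are stacked column-wise in a matrix (the function name and the toy data are ours):

```python
import numpy as np

def estimate_lds(Y, d):
    """Closed-form LDS identification in the spirit of Doretto et al. [30].

    Y: n x tau matrix whose columns are the observations y_1, ..., y_tau.
    d: order of the LDS (dimension of the hidden state).
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :d]                               # appearance: orthonormal columns
    X = np.diag(s[:d]) @ Vt[:d, :]             # hidden state sequence, d x tau
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])   # dynamics: least-squares fit of x_{t+1} ~= A x_t
    return C, A, X

# toy usage on a synthetic observation sequence
Y = np.random.default_rng(0).random((60, 40))
C, A, X = estimate_lds(Y, d=5)
```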
The parameters of an LDS do not lie in a linear space: the transition matrix A is constrained to be stable, with its eigenvalues lying inside the unit circle, and the observation matrix C has orthonormal columns, i.e., C lies on a Stiefel manifold. The pair (A, C) can be utilized to describe the intrinsic characteristics of the LDS model [31], because A and C represent the dynamics and the spatial appearance, respectively. Therefore, the pair (A, C) can be used to describe a set of joint trajectories of an articulated body model. The extended observability matrix [32] for a tuple (A, C) has the following form:

$O_\infty = \left[\, C^T, \; (CA)^T, \; (CA^2)^T, \; \ldots \,\right]^T$
In current skeleton-based human action research, a human action is usually described as a finite skeleton sequence. Therefore, a human action can be described by a k-order (finite) observability matrix

$O_k = \left[\, C^T, \; (CA)^T, \; \ldots, \; (CA^{k-1})^T \,\right]^T$

Here, k is the total number of frames in the skeleton sequence. The size of $O_k$ is $kn \times d$, and the column space of $O_k$ is a d-dimensional subspace of $\mathbb{R}^{kn}$.
The Grassmann manifold [33] is the set of d-dimensional linear subspaces of $\mathbb{R}^{kn}$. Each point on the Grassmann manifold is a subspace spanned by the columns of a $kn \times d$ matrix with orthonormal columns. To obtain the subspace spanned by the columns of $O_k$, Gram–Schmidt orthonormalization can be used to compute an orthonormal basis. Thus, a human action can be represented as a point on the Grassmann manifold corresponding to the column space of $O_k$.
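Continuing the sketch above, a k-order observability matrix can be built from (A, C) and orthonormalized with a QR factorization (which plays the role of Gram–Schmidt), giving a representative of the corresponding point on the Grassmann manifold:

```python
import numpy as np

def observability_matrix(A, C, k):
    """Stack [C; CA; CA^2; ...; CA^(k-1)], i.e., the k-order observability matrix."""
    blocks, M = [], C.copy()
    for _ in range(k):
        blocks.append(M)
        M = M @ A
    return np.vstack(blocks)            # shape (k*n, d)

def grassmann_point(A, C, k):
    """Orthonormal basis of the column space of O_k: a point on the Grassmann manifold."""
    Ok = observability_matrix(A, C, k)
    Q, _ = np.linalg.qr(Ok)             # QR plays the role of Gram-Schmidt
    return Q

# usage with the (A, C) estimated in the previous sketch:
# Q = grassmann_point(A, C, k=10)
```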
4. Nonnegative Tensor-Based Skeleton Sequence Model
4.1. Nonnegative Tensor Representation of Skeleton Sequence
Owing to the continuity of human motion, a skeleton sequence describing a human action is a combination of interrelated skeletons. Instead of the traditional vectorization of skeleton features, a skeleton sequence is represented as a nonnegative tensor time series. By seeking the nonnegative tensor components of a human action, this representation can better reflect the independence of each skeleton and the variation between different skeletons.

Figure 2(a) shows a human skeleton with 19 rigid bodies and 20 joints. We preprocess the skeletons in the action datasets, which makes it more accurate and easier to represent a skeleton sequence as a nonnegative tensor. We use the preprocessing method elaborated in Section 5.9 to keep the joints of a skeleton in the first octant of the global coordinate system (refer to Figure 2(b)).

Let $S = (J, B)$ be a preprocessed skeleton, i.e., a skeleton located in the first octant of the global coordinate system. $J = \{j_1, j_2, \ldots, j_N\}$ is the set of joints in the first octant, where $j_i = (x_i, y_i, z_i)$ denotes the 3D position of joint i. $B = \{b_1, b_2, \ldots, b_{N-1}\}$ is the set of rigid bodies, where $b_m$ represents the rigid body connecting two adjacent joints, and $(\alpha_m, \beta_m, \gamma_m)$ are the three angles between rigid body $b_m$ and the global x-, y-, and z-axes. Therefore, each skeleton can be represented as a second-order nonnegative tensor (a matrix) that collects the joint positions of $J$ and the angles between the rigid bodies of $B$ and the three global axes. A skeleton sequence can then be represented as a third-order nonnegative tensor $\mathcal{Y} \in \mathbb{R}_+^{I_1 \times I_2 \times \tau}$, where $\tau$ is the number of frames in the skeleton sequence.
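As an illustration, the following sketch builds such a per-frame nonnegative matrix and stacks the frames into a third-order tensor. The exact row layout (joint coordinates followed by rigid-body angle triples) and the helper names are our assumptions; the paper only specifies that the joint positions and the angles to the three global axes are combined into one second-order nonnegative tensor per skeleton.

```python
import numpy as np

def skeleton_tensor(joints, bones):
    """Assumed layout: N joint positions (x, y, z) stacked over N-1 angle triples.

    joints: (N, 3) array of first-octant joint coordinates (all >= 0).
    bones:  list of (i, j) index pairs giving the adjacent joints of each rigid body.
    """
    vecs = np.array([joints[j] - joints[i] for i, j in bones])   # rigid-body vectors
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    angles = np.arccos(np.clip(vecs / norms, -1.0, 1.0))         # angles to the x-, y-, z-axes, in [0, pi]
    return np.vstack([joints, angles])                           # (2N-1) x 3, all entries nonnegative

def sequence_tensor(frames, bones):
    """Stack the per-frame matrices along the third mode: (2N-1) x 3 x tau."""
    return np.stack([skeleton_tensor(f, bones) for f in frames], axis=2)
```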
4.2. Nonnegative Tensor-Based LDS Model
Mathematically, a third-order tensor can be unfolded into a second-order tensor time series, and it is well known that a second-order tensor time series can be modeled as the output of an LDS. Therefore, the third-order nonnegative tensor representing a skeleton sequence is used to build an LDS model, whose parameters are estimated from the NTD of the action tensor. The resulting model is called the nonnegative tensor-based LDS (nLDS), as shown in Figure 3.

A skeleton sequence is represented as a third-order nonnegative tensor $\mathcal{Y} \in \mathbb{R}_+^{I_1 \times I_2 \times \tau}$, where $I_1 \times I_2$ is the size of each second-order skeleton tensor and $\tau$ is the number of frames. The NTD of $\mathcal{Y}$, computed with the update rules proposed by Kim et al. [11], is given by

$\mathcal{Y} \approx \mathcal{G} \times_1 A^{(1)} \times_2 A^{(2)} \times_3 A^{(3)}$

where the core tensor $\mathcal{G} \in \mathbb{R}_+^{J_1 \times J_2 \times J_3}$ and the mode matrices $A^{(1)} \in \mathbb{R}_+^{I_1 \times J_1}$, $A^{(2)} \in \mathbb{R}_+^{I_2 \times J_2}$, and $A^{(3)} \in \mathbb{R}_+^{\tau \times J_3}$ are restricted to having only nonnegative elements in the factorization. This is illustrated in Figure 4. The encoding matrix relating the hidden states to the observations is then obtained from the core tensor and the first two mode matrices, as derived below. The mode-3 matricization of $\mathcal{Y}$ is

$Y_{(3)} = A^{(3)} G_{(3)} \left(A^{(2)} \otimes A^{(1)}\right)^T$

where $G_{(3)}$ is the mode-3 matricization of the core tensor $\mathcal{G}$.

Write the NTD approximation of $\mathcal{Y}$ as the third-order nonnegative tensor $\hat{\mathcal{Y}} = \mathcal{G} \times_1 A^{(1)} \times_2 A^{(2)} \times_3 A^{(3)}$. The mode-3 matricization of $\mathcal{Y}$ is $Y_{(3)} = [y_1, y_2, \ldots, y_\tau]^T$, where $Y_{(3)}$ is a nonnegative matrix and $y_t$ is the vector representation of the t-th skeleton tensor $Y_t$. Similarly, the mode-3 matricization of $\hat{\mathcal{Y}}$ is $\hat{Y}_{(3)} = [\hat{y}_1, \ldots, \hat{y}_\tau]^T$, where $\hat{y}_t$ is the vector representation of the t-th approximated skeleton tensor.

Here, we assume that $x_t$ represents the hidden state and $y_t$ represents the observation at time t. Then, the second-order nonnegative tensor time series $\{Y_t\}$ can be represented by the LDS

$x_{t+1} = A x_t + w_t, \qquad y_t = C x_t + v_t$

where A is the transition matrix, C is the observation matrix, $v_t$ is a zero-mean Gaussian noise modeling the stochastic relation between the states and the observations, and $w_t$ is a zero-mean Gaussian noise modeling the stochastic component of the transition. (We suppose that C is a nonnegative matrix.) The observation equation can then be rewritten in matrix form as

$Y_{(3)}^T = C X + V$

where $X = [x_1, x_2, \ldots, x_\tau]$ and $V = [v_1, v_2, \ldots, v_\tau]$.
Combining the matrix form of the observation equation with the mode-3 matricization of the NTD given above, and transposing both sides, we obtain

$Y_{(3)}^T \approx \left(A^{(2)} \otimes A^{(1)}\right) G_{(3)}^T A^{(3)T}$

Next, we consider the problem of finding estimates of C and X in the Frobenius sense, with C and X restricted to nonnegative matrices:

$\big(\hat{C}, \hat{X}\big) \in \arg\min_{C \ge 0,\, X \ge 0} \big\| Y_{(3)}^T - C X \big\|_F$

Since $Y_{(3)}^T \approx (A^{(2)} \otimes A^{(1)}) G_{(3)}^T A^{(3)T}$, the objective is approximately zero when $C = (A^{(2)} \otimes A^{(1)}) G_{(3)}^T$ and $X = A^{(3)T}$ (here C is an $I_1 I_2 \times J_3$ matrix and X is a $J_3 \times \tau$ matrix); that is, the residual vanishes whenever the NTD approximation error vanishes. Therefore, the tuple $\big((A^{(2)} \otimes A^{(1)}) G_{(3)}^T,\, A^{(3)T}\big)$ is one of the solutions to this problem, i.e., an element of its solution set.
Then, the transition matrix A is obtained by solving the least-squares problem

$\hat{A} = \arg\min_A \big\| X_2 - A X_1 \big\|_F = X_2 X_1^\dagger$

where $X_1 = [x_1, \ldots, x_{\tau-1}]$, $X_2 = [x_2, \ldots, x_\tau]$, and $\|\cdot\|_F$ and $(\cdot)^\dagger$ denote the Frobenius norm and the Moore–Penrose inverse, respectively. Given the above transition matrix A and observation matrix C, the noise covariance matrices can be estimated directly from the residuals. With the parameters estimated via NTD, the LDS model above can be rewritten as

$x_{t+1} = \hat{A} x_t + w_t, \qquad y_t = \left(A^{(2)} \otimes A^{(1)}\right) G_{(3)}^T x_t + v_t$

and we call it the nonnegative tensor-based LDS (nLDS).
The pseudocode in Algorithm 1 summarizes the procedure for building the nLDS model and estimating its parameters.
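As a rough code-level illustration of these steps (not a reproduction of Algorithm 1), the following sketch estimates the nLDS parameters under the reading given above: C is built from the core and the first two mode matrices, and the hidden states are taken from the temporal mode matrix. It relies on tensorly's non_negative_tucker, which minimizes a squared-error objective rather than the I-divergence of [11], so it only approximates the procedure described in this section; the ranks and all names are our assumptions.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_tucker

def fit_nlds(Y, ranks=(10, 3, 8)):
    """Estimate nLDS parameters from a nonnegative action tensor.

    Y: nonnegative array of shape (I1, I2, tau), one skeleton matrix per frame.
    ranks: assumed core dimensions (J1, J2, J3); J3 plays the role of the LDS order.
    """
    core, (A1, A2, A3) = non_negative_tucker(tl.tensor(Y), rank=list(ranks), n_iter_max=300)
    core, A1, A2, A3 = map(tl.to_numpy, (core, A1, A2, A3))
    # observation matrix: column j is the vectorized "basis skeleton" A1 @ G[:, :, j] @ A2^T
    C = np.stack([(A1 @ core[:, :, j] @ A2.T).ravel() for j in range(core.shape[2])], axis=1)
    X = A3.T                                      # hidden states, one column per frame (J3 x tau)
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])      # transition matrix via least squares
    return A, C, X

# usage: A, C, X = fit_nlds(sequence_tensor(frames, bones))  # tensor from the Section 4.1 sketch
```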
4.3. Representing a Human Action as a Point on Infinite Grassmann Manifold
Starting from the initial state $x_0$, the expected observation sequence of an nLDS model with parameters (A, C) is

$E\big[\,y_0, y_1, y_2, \ldots\,\big] = \big[\, C x_0, \; C A x_0, \; C A^2 x_0, \; \ldots \,\big]$

Here, the transition matrix A is stable, i.e., its largest eigenvalue is smaller than 1 in magnitude, and the observation matrix C has orthonormal columns. The expected observation sequence therefore lies in the column space of the extended observability sequence

$O_\infty = \left[\, C^T, \; (CA)^T, \; (CA^2)^T, \; \ldots \,\right]^T$

The column space of $O_\infty$ can be seen as the descriptor of an LDS, because it is invariant to the choice of basis of the state space. In this manner, the nLDS model of an action can be represented by its $O_\infty$, which means that $O_\infty$ can be seen as the feature descriptor of the action.
The traditional methods [6, 30] approximate the extended observability sequence by a finite m-order observability matrix $O_m$, so that an LDS can be represented as a point on a Grassmann manifold corresponding to the column space of $O_m$. The value of the order m influences the approximation: if m is too small, the m-order observability matrix cannot adequately represent the behavior of the extended observability matrix; conversely, the finite observability matrix approaches the extended observability matrix as m increases, but this also increases the computational cost. To avoid these limitations, we use the method proposed in [34] to project infinite-order observability matrices to points on an infinite Grassmann manifold (the infinite Grassmann manifold is defined as in [35]). Let $O_\infty$ be such an infinite-order observability matrix. Its orthonormalization is performed through the Cholesky decomposition $O_\infty^T O_\infty = R^T R$, and the orthonormalized observability matrix is defined as $O_\infty R^{-1}$. The set of orthonormalized infinite observability matrices, considered as a quotient space (i.e., up to the choice of basis), is an infinite Grassmann manifold with an extra intrinsic structure. Thus, an action, represented by $O_\infty$, can alternately be identified as a point on the infinite Grassmann manifold consisting of the orthonormalized extended observability sequences.
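Although $O_\infty$ has infinitely many rows, its Gram matrix $O_\infty^T O_\infty = \sum_{i \ge 0} (A^i)^T C^T C A^i$ is finite and, for a stable A, satisfies the discrete Lyapunov equation $G = A^T G A + C^T C$, so the Cholesky factor R can be computed without truncating $O_\infty$. A minimal scipy-based sketch (the helper name is ours):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def cholesky_of_O_infinity(A, C):
    """Cholesky factor R with O_inf^T O_inf = R^T R, computed without truncation.

    Requires A to be stable (all eigenvalues strictly inside the unit circle).
    """
    G = solve_discrete_lyapunov(A.T, C.T @ C)    # solves G = A^T G A + C^T C
    L = np.linalg.cholesky(G)                    # G = L L^T
    return L.T                                   # upper-triangular factor R
```

Inner products between two different orthonormalized sequences additionally involve the cross-Gram matrix of the two observability sequences, which satisfies an analogous Sylvester-type equation and can be obtained in the same spirit.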
4.4. Sparse Coding and Dictionary Learning on Infinite Grassmann Manifold
To classify the actions projected onto the infinite Grassmann manifold, an efficient method [35] is used to perform sparse coding and dictionary learning on the infinite Grassmann manifold. Given a dictionary of manifold elements, a set of actions, and the corresponding coefficient vectors, the sparse coding objective measures how well each action (a point on the manifold) is reconstructed from the dictionary elements while penalizing non-sparse coefficients. The purpose of dictionary learning is to find a good dictionary that can represent all the actions with a small reconstruction error. Each dictionary element and each action is represented by its parameter tuple, and the dictionary learning problem on the infinite Grassmann manifold is then expressed as an optimization over the Cholesky decomposition matrices associated with the dictionary elements and the actions.
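To convey the structure of such an objective, the following is a sketch of sparse coding with the projection kernel on a (finite-dimensional) Grassmann manifold, solved by a simple ISTA loop. It is not the exact formulation of [35], where the explicit bases are replaced by the Cholesky/Gram matrices of the observability sequences, but the quadratic-plus-sparsity structure is the same; all names and the regularization value are our assumptions.

```python
import numpy as np

def grassmann_sparse_code(X, D_list, lam=0.1, n_iter=200):
    """min_w ||X X^T - sum_j w_j D_j D_j^T||_F^2 + lam * ||w||_1 via ISTA.

    X: (p, d) orthonormal basis of the query subspace; D_list: list of (p, d) dictionary bases.
    Expanding the objective gives the quadratic  w^T K w - 2 w^T b + const  with
    K[j, k] = ||D_j^T D_k||_F^2 and b[j] = ||X^T D_j||_F^2.
    """
    b = np.array([np.linalg.norm(X.T @ Dj, 'fro') ** 2 for Dj in D_list])
    K = np.array([[np.linalg.norm(Di.T @ Dj, 'fro') ** 2 for Dj in D_list] for Di in D_list])
    w = np.zeros(len(D_list))
    step = 1.0 / (2.0 * np.linalg.norm(K, 2) + 1e-12)
    for _ in range(n_iter):
        w = w - step * 2.0 * (K @ w - b)                          # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft threshold
    return w
```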
5. Experiments
In this section, three standard 3D human action datasets, i.e., the MSR-Action3D dataset [36], UTKinect-Action dataset [37], and G3D-Gaming dataset [2], are utilized to evaluate the proposed LDS model and nonnegative tensor representation.
5.1. Alternative Nonnegative Tensor Representation of Skeleton Sequence
To obtain the nonnegative tensor representation more easily, the whole skeleton is transformed to the first octant. Each skeleton in the first octant contains N joint points, N-1 joint angles, and N-1 rigid bodies, and an action sequence consists of $\tau$ skeletons. To verify the effectiveness of our approach, we compare it with the following four alternative nonnegative tensor representations:
Second-Order Nonnegative Joint Positions (2NJP). An action sequence is represented as a second-order nonnegative tensor in which each skeleton is seen as a nonnegative vector consisting of the 3D coordinates of all joint points.
Second-Order Nonnegative Rigid Body Direction (2NRBD). An action sequence is represented as a second-order nonnegative tensor in which each skeleton is seen as a nonnegative vector consisting of the directions of all rigid bodies (the direction of a rigid body is represented by the three angles between the rigid body and the x-, y-, and z-axes, respectively).
Third-Order Nonnegative Joint Angle and Joint Positions (3NJAP). Given two adjacent rigid bodies, we can obtain the joint angle between them and a coordinate tuple comprising the coordinates of the three joints in the two rigid bodies. An action sequence is represented as a third-order nonnegative tensor in which each skeleton is seen as a second-order nonnegative tensor consisting of all the joint angles and coordinate tuples.
Third-Order Nonnegative Joint Angle and Direction (3NJAD). Given two adjacent rigid bodies, we can obtain a direction tuple containing the directions of the two rigid bodies and the joint angle between them. An action sequence is represented as a third-order nonnegative tensor in which each skeleton is seen as a second-order nonnegative tensor consisting of all the joint angles and direction tuples.
5.2. Parameter Estimation
In the MSR-Action3D, UTKinect-Action, and G3D-Gaming datasets, all skeletons contain the same number of joints and rigid bodies (19 rigid bodies and 20 joint points). NTD computes the best low-rank approximation of a nonnegative tensor for specified core dimensions $(J_1, J_2, J_3)$, so the choice of $J_1$, $J_2$, and $J_3$ affects the cost function between the nonnegative tensor and its best low-rank approximation. Each skeleton sequence in the three datasets is represented as a nonnegative tensor whose third dimension equals the length of the skeleton sequence. To compute the best low-rank approximation, we fix $J_1$, $J_2$, and $J_3$ empirically, using a slightly different setting on the MSR-Action3D dataset.
5.3. Experiments on MSR-Action3D Dataset
The MSR-Action3D dataset includes 557 human pose sequences that were captured by 10 subjects performing 20 actions, with each action having 2 or 3 repetitions. Each pose of this dataset provides the 3D locations of 20 joints. Each action sequence includes approximately 50 frames. Experiments on the dataset are challenging because many actions are very similar and the pose sequence of the same action class can have a large intraclass variation owing to the performing style variations.
Following the experimental protocol in [36], the 20 actions in the dataset are categorized into three subsets, AS1, AS2, and AS3, each containing 8 actions. Subsets AS1 and AS2 group actions with similar movements, whereas subset AS3 groups more complex actions. The cross-subject evaluation method, in which half of the subjects are used for training and the remaining subjects for testing, is utilized to perform recognition on each subset. The average recognition rate is reported over 10 different combinations of training and testing sets. Table 1 shows that the proposed approach outperforms various other methods that extract action features from 3D joint positions. Our approach achieves an average accuracy of 97.63% on the MSR-Action3D dataset, outperforming the other action recognition approaches: its average accuracy is 0.41% better than that of Ensemble TS-LSTM [14], 2.78% better than tLDS [6], 5.79% better than Bi-LSTM [15], 1.81% better than 3NJAP-nLDS, and 0.17% better than 3NJAD-nLDS. The superior performance on the three subsets indicates that our approach is better than the other methods at both distinguishing similar actions and recognizing complex actions.
Following the experimental protocol in [8], we test all the actions on the MSR-Action3D dataset. The experiment on the entire dataset is more challenging than that in [36]. Our approach achieves an accuracy of 96.97%, as shown in Table 2.
Figure 5 shows the classification confusion matrix on the entire MSR-Action3D dataset. From the confusion matrix, we find that the recognition rate of most actions reaches 100%. Classification errors mainly occur when two actions are extremely similar, such as draw tick and horizontal arm wave.

5.4. Experiments on UTKinect-Action Dataset
The UTKinect-Action dataset is used to further evaluate our approach. This dataset comprises 10 types of human actions, which are captured by a single stationary Kinect in indoor settings. The 10 actions are walking, sitting down, standing up, picking up, carrying, throwing, pushing, pulling, waving, and clapping hands. Ten different subjects (9 males, 1 female) performed each action twice. Overall, there are 6220 frames of the 199 action sequences. This dataset is very challenging. Firstly, as the body parts of some actions are out of the field-of-view, parts of the human body are invisible. Secondly, different subjects perform the same action with different limbs, such as left-hand waving and right-hand waving. Thirdly, the action sequences captured from different views cause difficulties for action recognition.
To appropriately compare our approach with the state-of-the-art algorithms, the leave-one-sequence-out cross validation (LOOCV) method is applied to perform our experiment on the dataset. For each iteration, we choose an action sequence for testing and use the remaining action sequences for training. Each testing sequence was randomly chosen. The experiment on the dataset was performed ten times. Table 3 presents the experiment results achieved by our approach and other state-of-the-art methods. The recognition rate of our approach on the dataset is 98.23%. It is obvious that our approach outperforms SE3 [12], EigenJoints [13], Grassmann manifold [8], Key-Pose-Motifs [18], learning features combination [16], Ensemble TS-LSTM [14], tLDS [6], and Bi-LSTM [15], which achieve recognition rates of 97.08%, 97.10%, 88.5%, 93.47%, 98.00%, 96.97%, 96.48%, and 96.89%, respectively. The main reason for this may be that our approach more accurately reflects the relationships between the skeletons in the action sequence.
5.5. Experiments on G3D-Gaming Dataset
The G3D-Gaming dataset consists of 663 sequences of 20 different gaming actions captured by a Microsoft Kinect. Ten different subjects performed each gaming action more than twice. The dataset provides three types of data: synchronized video, depth, and skeleton data; only the skeleton data are used in our experiment. Experiments on this dataset are also challenging owing to two factors. First, when body parts are occluded, the Kinect tracker returns inferred joint positions, which lowers the recognition rate for actions such as TennisSwingBackhand, Golf, and ThrowBowlingBall. Second, if the movement range of two different actions is relatively small, the two actions may easily be confused with each other during recognition. The cross-subject evaluation method is used to perform our experiment, and the average recognition results are reported over ten different combinations of training and testing sets. The proposed approach is compared with the state-of-the-art methods reported for the G3D-Gaming dataset, as listed in Table 4. GB-RBM+HMM [19] and LieNet [21] use deep learning to recognize human actions: GB-RBM+HMM incorporates a Gaussian binary-restricted Boltzmann machine (GB-RBM) with a hidden Markov model (HMM) to capture the global and local dynamic features of the joint trajectories, and LieNet combines Lie group structures with a deep network architecture to obtain more appropriate Lie group features for recognition. Our approach achieves a recognition accuracy of 92.56%, outperforming GB-RBM+HMM [19], SE [12], SO [20], LieNet [21], and tLDS [6], which achieve recognition rates of 86.40%, 91.09%, 87.95%, 89.10%, and 90.60%, respectively.
5.6. Discussion about Nonnegative Tensor Representation and Extended Observability Sequence
In this work, an action is represented by a third-order nonnegative tensor. Following the approach proposed in [6], the finite observability matrix built from the nLDS parameters (with a chosen truncation order) can be taken as the feature descriptor of the action, and the subspace spanned by its columns corresponds to a point on a finite Grassmann manifold. Therefore, to verify the effectiveness of the infinite Grassmann manifold, we use dictionary learning and sparse coding on the finite Grassmann manifold to classify the nonnegative tensor-based actions represented in this truncated form, and compare the results with our representation, in which the full extended observability sequence is mapped to a point on the infinite Grassmann manifold. The experimental results, shown in Table 5, demonstrate that mapping the full extended observability sequence to the infinite Grassmann manifold, instead of using a finite truncation, improves the accuracies on the three datasets.

In [6], an action is represented by a general (not necessarily nonnegative) third-order tensor. Such a third-order tensor can also be mapped to a point on an infinite Grassmann manifold using the approach proposed in [34]. Then, to verify the effectiveness of the nonnegative tensor-based action representation, we use dictionary learning and sparse coding on the infinite Grassmann manifold to classify the third-order tensors represented as points on an infinite Grassmann manifold. The experimental results, shown in Table 6, demonstrate that the nonnegative tensor-based action representation is effective in improving the accuracies on the three datasets.
5.7. Evaluating the Effect of Infinite Mapping and NTD on the Accuracy of Action Recognition
The extended observability sequence of a third-order nonnegative tensor is built from the parameters (A, C) estimated by the nonnegative tensor-based LDS (nLDS) model, which obtains its parameters via NTD. The subspace spanned by the columns of the finite observability matrix (with a chosen truncation order) corresponds to a point on a Grassmann manifold, so the actions, represented by third-order nonnegative tensors, can be mapped to points on a Grassmann manifold. Therefore, to verify the effectiveness of NTD, we use LTBSVM [8] to classify the actions represented as points on the Grassmann manifold. The experimental results, shown in Table 7, demonstrate that NTD is effective in improving the accuracies on the three datasets.

In [8], the extended observability matrix of an action is built from parameters estimated by an ARMA model of the action sequence. We map this extended observability matrix to a point on an infinite Grassmann manifold using the approach proposed in [34]. Dictionary learning and sparse coding on the infinite Grassmann manifold are then employed to classify the actions represented as points on the infinite Grassmann manifold, in order to verify the effectiveness of the infinite mapping (i.e., of the infinite Grassmann manifold). The experimental results, shown in Table 8, demonstrate that the infinite mapping is effective in improving the accuracies on the three datasets.
5.8. Computation Complexity and Run Time
Computational complexity comprises time complexity and space complexity. Both the time complexity and the space complexity of our algorithm grow with the size of a single skeleton sequence (an action) and with the total number of skeleton sequences in the dataset.
Matlab is used to run our experiments on a 3.60GHz Intel Core i7-4790 CPU machine. Run time comprises the training time and the testing time. Table 9 shows the run time on the three datasets.
5.9. Preprocessing and Skeleton Translation
The Preliminary Preprocessing and Translation of Skeletons. Before all skeletons in an action dataset are translated to the local coordinate system, we preprocess them as follows.

Preliminary Preprocessing. For each action dataset, all skeletons were transformed to a global coordinate system whose origin is located at the hip center; this makes the skeletal data invariant to the absolute location of the human in the scene. One of the skeletons was chosen as a reference skeleton, and all other skeletons were normalized (without changing their joint angles) such that their body-part lengths are equal to the corresponding body-part lengths of the reference skeleton, which makes the skeletons scale-invariant. Finally, all skeletons were rotated such that the global x-axis is aligned with the ground-plane projection of the vector from the left hip to the right hip, which makes the skeletons view-invariant. Figure 6(a) shows a preprocessed skeleton.
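A minimal sketch of this preliminary preprocessing (the joint indices, the choice of the ground plane, and the omission of the reference-skeleton scale normalization are our simplifying assumptions):

```python
import numpy as np

def preprocess_skeleton(joints, hip=0, left_hip=12, right_hip=16):
    """Hip-center and view-normalize one skeleton; joint indices are assumptions.

    joints: (N, 3) array of 3D joint positions.
    Translates the hip center to the origin and rotates about the vertical axis so that
    the ground-plane projection of the left-hip -> right-hip vector aligns with the x-axis.
    (Here the ground plane is assumed to be the x-y plane; scale normalization against a
    reference skeleton is omitted for brevity.)
    """
    J = joints - joints[hip]                          # location invariance
    v = J[right_hip] - J[left_hip]
    theta = np.arctan2(v[1], v[0])                    # angle of the hip vector in the ground plane
    c, s = np.cos(-theta), np.sin(-theta)
    Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return J @ Rz.T                                   # view invariance
```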

After the preliminary preprocessing, the hip center of each skeleton is located at the origin of the global coordinate system. Consequently, regardless of how much a subject moves around in the scene, the joint coordinates in a preprocessed skeleton sequence stay close to the origin, so the translation required to move all joint coordinates into the first octant is small. In other words, the locomotion of the subjects does not influence the amount of translation, because the translation depends only on the (small) joint coordinates of the preprocessed skeletons. For our approach, the preliminary preprocessing thus imposes no special limitations and also helps to improve the accuracy of action recognition.
Given an action dataset in which all skeletons have been preprocessed by the preliminary preprocessing, let $(x_{i,s,q}, y_{i,s,q}, z_{i,s,q})$ denote the 3D position of joint i of skeleton s in skeleton sequence q, where i ranges over the joints of a skeleton, s over the skeletons of a sequence, and q over the skeleton sequences of the dataset. Let $o = (x_o, y_o, z_o)$ be the origin of the local coordinate system, where $x_o = \min_{i,s,q} x_{i,s,q}$, $y_o = \min_{i,s,q} y_{i,s,q}$, and $z_o = \min_{i,s,q} z_{i,s,q}$. Then, to keep all joints in the first octant, all preprocessed skeletons in the dataset are expressed in the local coordinate system with origin at the point o, so that every joint coordinate becomes nonnegative and the hip center of each preprocessed skeleton is placed at $-o$ in the local coordinate system. Figure 6(b) shows a preprocessed skeleton translated to the local coordinate system.
Discussion about Skeleton Translation. Let $\delta = \sqrt{x_o^2 + y_o^2 + z_o^2}$ denote the distance between the origin of the local coordinate system and the origin of the global coordinate system. Figure 7 shows the relationship between $\delta$ and the recognition rates. We found that the recognition rates gradually decreased as $\delta$ increased, which means that the translation of the skeletons affects the accuracy of action recognition. Therefore, to reduce this negative effect as much as possible, we set the origin of the local coordinate system to the component-wise minimum o defined above, which ensures that the joints of the translated skeletons are located in the first octant while $\delta$ achieves its minimum.
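A minimal sketch of this translation step, assuming each sequence is stored as an array of hip-centered joint positions (the names are ours):

```python
import numpy as np

def first_octant_origin(sequences):
    """Component-wise minimum over every joint of every frame of every sequence.

    sequences: iterable of (tau_q, N, 3) arrays of preprocessed joint positions.
    Subtracting the returned origin o moves all joints into the first octant while
    keeping the translation distance delta = ||o|| as small as possible.
    """
    return np.min([seq.min(axis=(0, 1)) for seq in sequences], axis=0)

def to_first_octant(seq, o):
    return seq - o            # all coordinates become >= 0
```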

6. Conclusion and Future Work
In this paper, an action is represented as a third-order nonnegative tensor. To capture the original spatiotemporal information of the action, an nLDS is used to model the action, and NTD is employed to estimate the parameters of the nLDS model. The extended observability sequence built from these parameters, which is considered as the feature descriptor of the action, is mapped to a point on an infinite Grassmann manifold, and dictionary learning and sparse coding on the infinite Grassmann manifold are used to perform classification. The experimental results demonstrate that our approach achieves better performance than other state-of-the-art skeleton-based action recognition approaches. Following the theory proposed by Zhou et al. [38], future research will focus on the effect of unique and sparse NTD on action recognition.
Data Availability
In this paper, the experiments were performed on three public datasets as follows: (1) MSR-Action3D dataset is an action dataset of depth sequences captured by a depth camera. The dataset can be found in http://research.microsoft.com/en-us/um/people/zliu/actionrecorsrc/. (2) UTKinect-Action3D dataset is an action dataset that was captured using a single stationary Kinect. The dataset can be found in http://cvrc.ece.utexas.edu/KinectDatasets/HOJ3D.html. (3) G3D dataset is an action dataset containing synchronized video, depth, and skeleton data. The dataset can be found in http://dipersec.king.ac.uk/G3D/.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant No. 61850410523, the National Natural Science Foundation of China under Grant No. 61571345, the Nature Science Foundation of Anhui Province under grant No. 1908085MF186, and the Fundamental Research Funds for Xidian University No. XJS18041.