Abstract
Recently, skeleton-based action recognition has become an important topic in computer vision. Accurately modeling a human action and precisely distinguishing similar actions remain challenging tasks. In this paper, an action (skeleton sequence) is represented as a third-order nonnegative tensor time series to capture the original spatiotemporal information of the action. Since a linear dynamical system (LDS) is an efficient tool for encoding spatiotemporal data in various disciplines, this paper proposes a nonnegative tensor-based LDS (nLDS) to model the third-order nonnegative tensor time series. Nonnegative Tucker decomposition (NTD) is utilized to estimate the parameters of the nLDS model. These parameters are used to build the extended observability sequence of the action, which can therefore be considered as its feature descriptor. To avoid the limitations introduced by approximating the extended observability sequence with a finite-order matrix, we represent an action as a point on an infinite Grassmann manifold comprising the orthonormalized extended observability sequences. The classification task is then performed by dictionary learning and sparse coding on the infinite Grassmann manifold. Experimental results on the MSR-Action3D, UTKinect-Action, and G3D-Gaming datasets demonstrate that the proposed approach achieves better performance than state-of-the-art methods.
1. Introduction
Human action recognition based on spatiotemporal data has been one of the most prominent research topics owing to its applications in human-computer interfaces [1], gaming [2], and surveillance systems [3]. Over the past few decades, numerous methods have been proposed for recognizing human actions from monocular RGB videos [4]. However, monocular RGB data are very sensitive to background clutter, occlusion, viewpoint variations, and illumination changes. Thus, despite decades of significant research, accurately recognizing human actions from RGB videos remains a challenging problem. As a human skeleton can be viewed as an articulated system of rigid bodies connected by joints, a human action can be described as the spatiotemporal evolution of a series of skeletons. Therefore, if human skeleton sequences can be accurately extracted from RGB videos, action recognition can be performed by classifying the skeleton sequences. However, reliably extracting the human skeleton from monocular video sensors is extremely difficult. With the development of cost-effective depth sensors [5], it has become easier to extract the three-dimensional (3D) positions of skeletal joints from depth data. Hence, skeleton-based action recognition has once again become an active area of research.
When recognizing human actions, action representations that embody the temporal dynamics can provide a more relevant description than static data [6]. A linear dynamical system (LDS) [7] is an effective tool for capturing spatiotemporal data in various disciplines. Hence, the authors of [6, 8] employed an LDS model to capture the spatiotemporal information of an action (skeleton sequence) and used the singular value decomposition (SVD) [9] or Tucker decomposition [10] to estimate the model parameters. These parameters were used to build a finite observability matrix, and an action, represented by an LDS, was then identified as a point on a finite Grassmann manifold corresponding to the column space of that matrix.
Motivated by the above methods, this paper proposes a novel approach to model and analyze human actions. The overall approach is shown in Figure 1. In this study, an action (a skeleton sequence) is represented as a third-order nonnegative tensor and each skeleton is converted into a second-order nonnegative tensor. To retain the spatiotemporal information of an action to the maximum extent, a nonnegative tensor-based LDS (nLDS) is proposed to model the action. In this work, nonnegative Tucker decomposition (NTD) [11] is used to decompose the third-order nonnegative tensor for improving the accuracy of action recognition. Because the NTD is a powerful tool to extract a part-based representation of high-dimensional tensors, an action can be represented by a linear combination of relevant components.

The parameter tuple (A, C) of the nLDS model is learned from the NTD of an action. The parameters C and A represent the appearance and dynamics of the nLDS model, respectively. Thus, it is appropriate to regard the extended observability sequence built from (A, C) as the feature descriptor of the action. The conventional method approximates this sequence by a finite-order matrix and maps it to a point on a finite Grassmann manifold. However, the chosen order affects both how well the finite matrix captures the asymptotic behavior of the extended observability sequence and the computational complexity. To avoid the limitations introduced by choosing this order, an action is instead represented as a point on an infinite Grassmann manifold consisting of the orthonormalized extended observability sequences. Finally, classification is performed using dictionary learning and sparse coding on the infinite Grassmann manifold.
The main contributions of this study are as follows:
(1) To retain the spatiotemporal information of an action, an nLDS is used to model the action, which is represented as a third-order nonnegative tensor.
(2) Compared with the Tucker decomposition, the uniqueness of NTD makes the representation of an action more discriminative. Thus, to further improve the accuracy of action recognition, NTD is used to estimate the parameters of the nLDS model. The parameters are utilized to build an extended observability sequence that can be considered as the feature descriptor of the action.
(3) To overcome the limitation caused by approximating the extended observability sequence with a finite-order matrix, an action is represented as a point on an infinite Grassmann manifold consisting of the orthonormalized extended observability sequences.
The rest of the paper is organized as follows: Section 2 reviews the related work; Section 3 briefly introduces fundamental concepts of the Tucker model, NTD, and LDS; Section 4 elaborates the nLDS model and describes how to represent an action as a point on the infinite Grassmann manifold; Section 5 presents our experimental results; and Section 6 concludes the paper.
2. Related Work
A brief overview of skeleton-based action recognition approaches is provided in this section. Existing skeleton-based approaches can be categorized into two types.

The first type represents the human skeleton as a set of skeletal joints. Wang et al. [22] employed pairwise relative positions of the joints to represent a human skeleton and used a hierarchy of Fourier coefficients to model the temporal evolution of this representation; to obtain discriminative joint combinations, they used a multiple kernel learning approach to characterize human actions. Li et al. [23] used a graph-based model to represent the relative spatial variations between skeletal joints and utilized the relative variance of the joint relative distance (RVJRD) [24] to indicate the activity level of each joint pair and select the most informative ones. To derive the spatiotemporal compatibility of skeletal joints between different actions, Koniusz et al. [25] used a sequence compatibility kernel (SCK) to capture the spatial and temporal similarities between the skeletons of an action; furthermore, they employed a dynamics compatibility kernel to represent the similarity between pairs of skeletons in a given action and thus capture its spatiotemporal dynamics. Zhu et al. [26] fed the raw 3D skeletal joint locations to an end-to-end fully connected deep long short-term memory (LSTM) network for recognizing skeleton-based human actions. Ding et al. [6] represented a human skeleton as a 3D joint-based tensor, so that a human action became a third-order tensor time series; they proposed a tensor-based LDS (tLDS) to model the tensor time series and estimated the parameters of the LDS model using the Tucker decomposition. To eliminate the effects of noise and occlusion in 3D skeleton data, Liu et al. [17] proposed a spatiotemporal long short-term memory (ST-LSTM) network with trust gates; by analyzing the reliability of the skeleton data, the trust gates dynamically update the long-term context information stored in the memory cell. Lee et al. [14] built an ensemble Temporal Sliding LSTM (TS-LSTM) network composed of short-term, medium-term, and long-term TS-LSTM subnetworks, which capture the temporal dependencies between skeletons and the spatial dependency within each skeleton.

The second type of skeleton-based action recognition approaches represents a human skeleton as a set of connected rigid bodies. The authors of [12] represented human actions as curves in a Lie group; to simplify the classification task, the actions represented in the Lie group were mapped to the corresponding Lie algebra, which is a vector space. In [8], the authors divided a skeleton into smaller body parts and employed certain bioinspired shape features to represent each body part; an LDS was used to learn the temporal evolution of the bioinspired features. Using motion velocities, motion directions, and curvatures of the 3D trajectories, Ding et al. [27] divided actions into two types of action-units, dynamic instants and intervals; they utilized self-organizing maps (SOM) [28] to cluster the action-units with their spatiotemporal features and employed the resulting sequences of discrete symbols to build profile hidden Markov models (PHMMs) [29], capturing the spatiotemporal relations between the action-units of each action. Huang et al. [21] designed a neural network architecture to learn the most informative Lie group representations; based on the proposed network structure, they fed the Lie group features into rotation mapping layers to obtain the desired results.
3. Brief Review of Basic Concepts
A brief overview of the Tucker model, NTD, and LDS is presented here, which will help in understanding the tensor-based LDS. Let $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ denote an N-order tensor, $X_{(n)}$ the mode-n matricization of $\mathcal{X}$, $U \in \mathbb{R}^{J \times I_n}$ a matrix, and $A^{(n)} \in \mathbb{R}^{I_n \times J_n}$ the mode-n matrix of the Tucker model. If all elements of $\mathcal{X}$ are nonnegative (i.e., $x_{i_1 i_2 \cdots i_N} \ge 0$ for every index), $\mathcal{X}$ is called a nonnegative tensor. The mode-n matricization rearranges the elements of $\mathcal{X}$ into the matrix $X_{(n)} \in \mathbb{R}^{I_n \times (I_1 \cdots I_{n-1} I_{n+1} \cdots I_N)}$, whose rows are indexed by the mode-n index.
3.1. Tucker Model
The mode-n product of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ by a matrix $U \in \mathbb{R}^{J \times I_n}$ is defined as the $(I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N)$-tensor given by

$(\mathcal{X} \times_n U)_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_N} = \sum_{i_n = 1}^{I_n} x_{i_1 i_2 \cdots i_N}\, u_{j i_n} \quad (1)$

for all the index values. With the help of the mode-n product, (1) can be rewritten in terms of matrix unfoldings by fixing the n-th mode:

$\mathcal{Y} = \mathcal{X} \times_n U \iff Y_{(n)} = U X_{(n)} \quad (2)$

where $Y_{(n)}$ and $X_{(n)}$ are the mode-n matrix unfoldings of $\mathcal{Y}$ and $\mathcal{X}$, respectively.
The Tucker model decomposes an N-order tensor $\mathcal{X}$ into the mode products of a core tensor $\mathcal{G} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$ and mode matrices as follows:

$\mathcal{X} \approx \mathcal{G} \times_1 A^{(1)} \times_2 A^{(2)} \cdots \times_N A^{(N)} \quad (3)$

where $A^{(n)} \in \mathbb{R}^{I_n \times J_n}$ ($n = 1, \ldots, N$) are the factor (mode) matrices.
The mode-n matricization of $\mathcal{X}$ in (3) can be expressed through the mode-n matricization of the core tensor and the mode matrices:

$X_{(n)} = A^{(n)} G_{(n)} \left(A^{(N)} \otimes \cdots \otimes A^{(n+1)} \otimes A^{(n-1)} \otimes \cdots \otimes A^{(1)}\right)^T \quad (4)$

where $\otimes$ denotes the Kronecker product.
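To make these conventions concrete, the following minimal numpy sketch (the helper names are ours, not from the paper) implements the mode-n product and the mode-n matricization and numerically checks the equivalence stated in (2):

```python
import numpy as np

def unfold(T, n):
    """Mode-n matricization: rows are indexed by mode n."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def mode_n_product(T, U, n):
    """Mode-n product T x_n U: contracts mode n of T with the columns of U."""
    out = np.tensordot(T, U, axes=(n, 1))   # contracted mode is appended as the last axis
    return np.moveaxis(out, -1, n)

# numerical check of the identity (2): Y = X x_n U  <=>  Y_(n) = U X_(n)
rng = np.random.default_rng(0)
X = rng.random((4, 5, 6))
U = rng.random((3, 5))                      # acts on mode n = 1
Y = mode_n_product(X, U, 1)
assert np.allclose(unfold(Y, 1), U @ unfold(X, 1))
```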
3.2. Nonnegative Tucker Decomposition
Given a nonnegative N-order tensor $\mathcal{X}$, the NTD of $\mathcal{X}$ obtains a core tensor $\mathcal{G}$ and mode matrices $A^{(n)}$, all restricted to having only nonnegative elements, such that

$\mathcal{X} \approx \mathcal{G} \times_1 A^{(1)} \times_2 A^{(2)} \cdots \times_N A^{(N)}, \quad \mathcal{G} \ge 0, \; A^{(n)} \ge 0 \quad (5)$
To search for an approximate factorization of tensor $\mathcal{X}$, a cost function is used to quantify the quality of the approximation $\hat{\mathcal{X}} = \mathcal{G} \times_1 A^{(1)} \cdots \times_N A^{(N)}$. The generalized Kullback–Leibler divergence (or I-divergence) is usually used to construct the cost function:

$D\big(\mathcal{X} \,\|\, \hat{\mathcal{X}}\big) = \sum_{i_1, \ldots, i_N} \left( x_{i_1 \cdots i_N} \log \frac{x_{i_1 \cdots i_N}}{\hat{x}_{i_1 \cdots i_N}} - x_{i_1 \cdots i_N} + \hat{x}_{i_1 \cdots i_N} \right) \quad (6)$

To obtain the mode matrices and the core tensor of $\mathcal{X}$, Kim et al. [11] minimized cost function (6) with multiplicative updating algorithms, as follows.
Problem 1. Minimize $D(\mathcal{X} \| \hat{\mathcal{X}})$ with respect to $\mathcal{G}$ and $A^{(1)}, \ldots, A^{(N)}$, subject to the constraints $\mathcal{G} \ge 0$ and $A^{(n)} \ge 0$ for all n.
They collect the Kronecker product of the mode matrices other than $A^{(n)}$, taken in a backward cyclic order, together with the mode-n matricization of the core into a single matrix $S_n = G_{(n)} \big(\bigotimes_{k \ne n} A^{(k)}\big)^T$. The mode-n matricization of the NTD can then be rewritten in the form

$X_{(n)} \approx A^{(n)} S_n$
Kim et al. [11] derived multiplicative updating algorithms for the mode matrices and the core tensor of the NTD as follows.

The update rule for the mode matrices is

$A^{(n)} \leftarrow A^{(n)} \circledast \frac{\big(X_{(n)} \oslash (A^{(n)} S_n)\big)\, S_n^T}{\mathbf{1}\, S_n^T}$

The update rule for the core tensor is

$\mathcal{G} \leftarrow \mathcal{G} \circledast \frac{\big(\mathcal{X} \oslash \hat{\mathcal{X}}\big) \times_1 A^{(1)T} \cdots \times_N A^{(N)T}}{\mathbf{1} \times_1 A^{(1)T} \cdots \times_N A^{(N)T}}$

where $\hat{\mathcal{X}} = \mathcal{G} \times_1 A^{(1)} \cdots \times_N A^{(N)}$, $\oslash$ denotes element-wise division, $\circledast$ denotes the Hadamard (element-wise) product, and $\mathbf{1}$ is a matrix or tensor of the appropriate size whose elements are all 1. The I-divergence is nonincreasing under these update rules.
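For intuition, the following is a minimal numpy sketch of the matrix (NMF) special case of these multiplicative I-divergence updates; each mode-matrix update of the NTD has exactly this form, with the fixed matrix $S_n$ playing the role of H below. The function name and the toy data are ours.

```python
import numpy as np

def kl_nmf(V, r, n_iter=200, eps=1e-9):
    """Multiplicative I-divergence (KL) updates for V ~= W @ H, all factors nonnegative."""
    m, n = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (ones @ H.T + eps)   # same form as the mode-matrix rule
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ ones + eps)   # update of the other factor
        # the I-divergence D(V || W @ H) is nonincreasing across iterations
    return W, H

# toy usage: factor a random nonnegative matrix
V = np.random.default_rng(1).random((30, 20))
W, H = kl_nmf(V, r=5)
```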
3.3. Linear Dynamical Systems
An LDS is a multivariate time series (MTS) model that uses hidden states to indirectly represent the observation sequence. Given an MTS $Y = [y_1, y_2, \ldots, y_\tau]$, an LDS is usually described as

$x_{t+1} = A x_t + w_t, \qquad y_t = C x_t + v_t$

where the discrete variable $t$ is the time index, $x_t$ denotes the d-dimensional hidden state at time $t$, $y_t$ represents the n-dimensional observed state at time $t$, and d is the order of the LDS. $A \in \mathbb{R}^{d \times d}$ is the transition matrix that maps the current hidden state $x_t$ to the next hidden state $x_{t+1}$, and $C \in \mathbb{R}^{n \times d}$ is the observation matrix that maps the hidden state $x_t$ to the observed state $y_t$. The noise components $w_t$ and $v_t$ follow zero-mean multivariate normal distributions with covariance matrices $Q$ and $R$, respectively. In [30], Doretto et al. employed the singular value decomposition (SVD) of the observation sequence, $Y \approx U \Sigma V^T$, to obtain the best estimates of the observation matrix and the hidden state sequence as

$\hat{C} = U, \qquad \hat{X} = [\hat{x}_1, \ldots, \hat{x}_\tau] = \Sigma V^T$

where $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{\tau \times d}$ collect the d leading singular vectors and $\Sigma$ the corresponding singular values.
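A minimal numpy sketch of this SVD-based identification, assuming the observations are stacked column-wise in a matrix (the function name and the toy data are ours):

```python
import numpy as np

def estimate_lds(Y, d):
    """Closed-form LDS identification in the spirit of Doretto et al. [30].

    Y: n x tau matrix whose columns are the observations y_1, ..., y_tau.
    d: order of the LDS (dimension of the hidden state).
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :d]                               # appearance: orthonormal columns
    X = np.diag(s[:d]) @ Vt[:d, :]             # hidden state sequence, d x tau
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])   # dynamics: least-squares fit of x_{t+1} ~= A x_t
    return C, A, X

# toy usage on a synthetic observation sequence
Y = np.random.default_rng(0).random((60, 40))
C, A, X = estimate_lds(Y, d=5)
```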
The parameters of an LDS do not lie in a linear space: the transition matrix A is constrained to be stable, with its eigenvalues lying inside the unit circle, and the observation matrix C has orthonormal columns, i.e., C lies on a Stiefel manifold. The pair (A, C) can be utilized to describe the intrinsic characteristics of the LDS model [31], because A and C represent the dynamics and the spatial appearance, respectively. Therefore, the pair (A, C) can be used to describe a set of joint trajectories of an articulated body model. The extended observability matrix [32] for a tuple (A, C) has the following form:

$O_\infty = \left[\, C^T, \; (CA)^T, \; (CA^2)^T, \; \ldots \,\right]^T$
In current skeleton-based human action research, a human action is usually described as a finite skeleton sequence. Therefore, a human action can be described by a k-order (finite) observability matrix

$O_k = \left[\, C^T, \; (CA)^T, \; \ldots, \; (CA^{k-1})^T \,\right]^T$

Here, k is the total number of frames in the skeleton sequence. The size of $O_k$ is $kn \times d$, and the column space of $O_k$ is a d-dimensional subspace of $\mathbb{R}^{kn}$.
The Grassmann manifold [33] is the set of d-dimensional linear subspaces of $\mathbb{R}^{kn}$. Each point on the Grassmann manifold is a subspace spanned by the columns of a $kn \times d$ matrix with orthonormal columns. To obtain the subspace spanned by the columns of $O_k$, Gram–Schmidt orthonormalization can be used to compute an orthonormal basis. Thus, a human action can be represented as a point on the Grassmann manifold corresponding to the column space of $O_k$.
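Continuing the sketch above, a k-order observability matrix can be built from (A, C) and orthonormalized with a QR factorization (which plays the role of Gram–Schmidt), giving a representative of the corresponding point on the Grassmann manifold:

```python
import numpy as np

def observability_matrix(A, C, k):
    """Stack [C; CA; CA^2; ...; CA^(k-1)], i.e., the k-order observability matrix."""
    blocks, M = [], C.copy()
    for _ in range(k):
        blocks.append(M)
        M = M @ A
    return np.vstack(blocks)            # shape (k*n, d)

def grassmann_point(A, C, k):
    """Orthonormal basis of the column space of O_k: a point on the Grassmann manifold."""
    Ok = observability_matrix(A, C, k)
    Q, _ = np.linalg.qr(Ok)             # QR plays the role of Gram-Schmidt
    return Q

# usage with the (A, C) estimated in the previous sketch:
# Q = grassmann_point(A, C, k=10)
```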
4. Nonnegative Tensor-Based Skeleton Sequence Model
4.1. Nonnegative Tensor Representation of Skeleton Sequence
Owing to the continuity of human motion, a skeleton sequence describing a human action is a combination of interrelated skeletons. Instead of the traditional vectorization of skeleton features, a skeleton sequence is represented as a nonnegative tensor time series. By seeking the nonnegative tensor components of a human action, this representation can better reflect the independence of each skeleton and the variation between different skeletons.

Figure 2(a) shows a human skeleton with 19 rigid bodies and 20 joints. We preprocess the skeletons in the action datasets, which makes it more accurate and easier to represent a skeleton sequence as a nonnegative tensor. We use the preprocessing method elaborated in Section 5.9 to keep the joints of a skeleton in the first octant of the global coordinate system (refer to Figure 2(b)).

Let $S = (J, B)$ be a preprocessed skeleton, i.e., a skeleton located in the first octant of the global coordinate system. $J = \{j_1, j_2, \ldots, j_N\}$ is the set of joints in the first octant, where $j_i = (x_i, y_i, z_i)$ denotes the 3D position of joint i. $B = \{b_1, b_2, \ldots, b_{N-1}\}$ is the set of rigid bodies, where $b_m$ represents the rigid body connecting two adjacent joints, and $(\alpha_m, \beta_m, \gamma_m)$ are the three angles between rigid body $b_m$ and the global x-, y-, and z-axes. Therefore, each skeleton can be represented as a second-order nonnegative tensor (a matrix) that collects the joint positions of $J$ and the angles between the rigid bodies of $B$ and the three global axes. A skeleton sequence can then be represented as a third-order nonnegative tensor $\mathcal{Y} \in \mathbb{R}_+^{I_1 \times I_2 \times \tau}$, where $\tau$ is the number of frames in the skeleton sequence.
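As an illustration, the following sketch builds such a per-frame nonnegative matrix and stacks the frames into a third-order tensor. The exact row layout (joint coordinates followed by rigid-body angle triples) and the helper names are our assumptions; the paper only specifies that the joint positions and the angles to the three global axes are combined into one second-order nonnegative tensor per skeleton.

```python
import numpy as np

def skeleton_tensor(joints, bones):
    """Assumed layout: N joint positions (x, y, z) stacked over N-1 angle triples.

    joints: (N, 3) array of first-octant joint coordinates (all >= 0).
    bones:  list of (i, j) index pairs giving the adjacent joints of each rigid body.
    """
    vecs = np.array([joints[j] - joints[i] for i, j in bones])   # rigid-body vectors
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    angles = np.arccos(np.clip(vecs / norms, -1.0, 1.0))         # angles to the x-, y-, z-axes, in [0, pi]
    return np.vstack([joints, angles])                           # (2N-1) x 3, all entries nonnegative

def sequence_tensor(frames, bones):
    """Stack the per-frame matrices along the third mode: (2N-1) x 3 x tau."""
    return np.stack([skeleton_tensor(f, bones) for f in frames], axis=2)
```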
4.2. Nonnegative Tensor-Based LDS Model
Mathematically, a third-order tensor can be unfolded into a second-order tensor time series, and it is well known that a second-order tensor time series can be modeled as the output of an LDS. Therefore, the third-order nonnegative tensor representing a skeleton sequence is used to build an LDS model, whose parameters are estimated from the NTD of the action tensor. The resulting model is called the nonnegative tensor-based LDS (nLDS), as shown in Figure 3.

A skeleton sequence is represented as a third-order nonnegative tensor $\mathcal{Y} \in \mathbb{R}_+^{I_1 \times I_2 \times \tau}$, where $I_1 \times I_2$ is the size of each second-order skeleton tensor and $\tau$ is the number of frames. The NTD of $\mathcal{Y}$, computed with the update rules proposed by Kim et al. [11], is given by

$\mathcal{Y} \approx \mathcal{G} \times_1 A^{(1)} \times_2 A^{(2)} \times_3 A^{(3)}$

where the core tensor $\mathcal{G} \in \mathbb{R}_+^{J_1 \times J_2 \times J_3}$ and the mode matrices $A^{(1)} \in \mathbb{R}_+^{I_1 \times J_1}$, $A^{(2)} \in \mathbb{R}_+^{I_2 \times J_2}$, and $A^{(3)} \in \mathbb{R}_+^{\tau \times J_3}$ are restricted to having only nonnegative elements in the factorization. This is illustrated in Figure 4. The encoding matrix relating the hidden states to the observations is then obtained from the core tensor and the first two mode matrices, as derived below. The mode-3 matricization of $\mathcal{Y}$ is

$Y_{(3)} = A^{(3)} G_{(3)} \left(A^{(2)} \otimes A^{(1)}\right)^T$

where $G_{(3)}$ is the mode-3 matricization of the core tensor $\mathcal{G}$.

Write the NTD approximation of $\mathcal{Y}$ as the third-order nonnegative tensor $\hat{\mathcal{Y}} = \mathcal{G} \times_1 A^{(1)} \times_2 A^{(2)} \times_3 A^{(3)}$. The mode-3 matricization of $\mathcal{Y}$ is $Y_{(3)} = [y_1, y_2, \ldots, y_\tau]^T$, where $Y_{(3)}$ is a nonnegative matrix and $y_t$ is the vector representation of the t-th skeleton tensor $Y_t$. Similarly, the mode-3 matricization of $\hat{\mathcal{Y}}$ is $\hat{Y}_{(3)} = [\hat{y}_1, \ldots, \hat{y}_\tau]^T$, where $\hat{y}_t$ is the vector representation of the t-th approximated skeleton tensor.

Here, we assume that $x_t$ represents the hidden state and $y_t$ represents the observation at time t. Then, the second-order nonnegative tensor time series $\{Y_t\}$ can be represented by the LDS

$x_{t+1} = A x_t + w_t, \qquad y_t = C x_t + v_t$

where A is the transition matrix, C is the observation matrix, $v_t$ is a zero-mean Gaussian noise modeling the stochastic relation between the states and the observations, and $w_t$ is a zero-mean Gaussian noise modeling the stochastic component of the transition. (We suppose that C is a nonnegative matrix.) The observation equation can then be rewritten in matrix form as

$Y_{(3)}^T = C X + V$

where $X = [x_1, x_2, \ldots, x_\tau]$ and $V = [v_1, v_2, \ldots, v_\tau]$.
Combining the matrix form of the observation equation with the mode-3 matricization of the NTD given above, and transposing both sides, we obtain

$Y_{(3)}^T \approx \left(A^{(2)} \otimes A^{(1)}\right) G_{(3)}^T A^{(3)T}$

Next, we consider the problem of finding estimates of C and X in the Frobenius sense, with C and X restricted to nonnegative matrices:

$\big(\hat{C}, \hat{X}\big) \in \arg\min_{C \ge 0,\, X \ge 0} \big\| Y_{(3)}^T - C X \big\|_F$

Since $Y_{(3)}^T \approx (A^{(2)} \otimes A^{(1)}) G_{(3)}^T A^{(3)T}$, the objective is approximately zero when $C = (A^{(2)} \otimes A^{(1)}) G_{(3)}^T$ and $X = A^{(3)T}$ (here C is an $I_1 I_2 \times J_3$ matrix and X is a $J_3 \times \tau$ matrix); that is, the residual vanishes whenever the NTD approximation error vanishes. Therefore, the tuple $\big((A^{(2)} \otimes A^{(1)}) G_{(3)}^T,\, A^{(3)T}\big)$ is one of the solutions to this problem, i.e., an element of its solution set.
Then, the transition matrix A is obtained by solving the least-squares problem

$\hat{A} = \arg\min_A \big\| X_2 - A X_1 \big\|_F = X_2 X_1^\dagger$

where $X_1 = [x_1, \ldots, x_{\tau-1}]$, $X_2 = [x_2, \ldots, x_\tau]$, and $\|\cdot\|_F$ and $(\cdot)^\dagger$ denote the Frobenius norm and the Moore–Penrose inverse, respectively. Given the above transition matrix A and observation matrix C, the noise covariance matrices can be estimated directly from the residuals. With the parameters estimated via NTD, the LDS model above can be rewritten as

$x_{t+1} = \hat{A} x_t + w_t, \qquad y_t = \left(A^{(2)} \otimes A^{(1)}\right) G_{(3)}^T x_t + v_t$

and we call it the nonnegative tensor-based LDS (nLDS).
The pseudocode in Algorithm 1 summarizes the procedure for building the nLDS model and estimating its parameters.
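As a rough code-level illustration of these steps (not a reproduction of Algorithm 1), the following sketch estimates the nLDS parameters under the reading given above: C is built from the core and the first two mode matrices, and the hidden states are taken from the temporal mode matrix. It relies on tensorly's non_negative_tucker, which minimizes a squared-error objective rather than the I-divergence of [11], so it only approximates the procedure described in this section; the ranks and all names are our assumptions.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_tucker

def fit_nlds(Y, ranks=(10, 3, 8)):
    """Estimate nLDS parameters from a nonnegative action tensor.

    Y: nonnegative array of shape (I1, I2, tau), one skeleton matrix per frame.
    ranks: assumed core dimensions (J1, J2, J3); J3 plays the role of the LDS order.
    """
    core, (A1, A2, A3) = non_negative_tucker(tl.tensor(Y), rank=list(ranks), n_iter_max=300)
    core, A1, A2, A3 = map(tl.to_numpy, (core, A1, A2, A3))
    # observation matrix: column j is the vectorized "basis skeleton" A1 @ G[:, :, j] @ A2^T
    C = np.stack([(A1 @ core[:, :, j] @ A2.T).ravel() for j in range(core.shape[2])], axis=1)
    X = A3.T                                      # hidden states, one column per frame (J3 x tau)
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])      # transition matrix via least squares
    return A, C, X

# usage: A, C, X = fit_nlds(sequence_tensor(frames, bones))  # tensor from the Section 4.1 sketch
```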
4.3. Representing a Human Action as a Point on Infinite Grassmann Manifold
Starting from the initial state $x_0$, the expected observation sequence of an nLDS model with parameters (A, C) is

$E\big[\,y_0, y_1, y_2, \ldots\,\big] = \big[\, C x_0, \; C A x_0, \; C A^2 x_0, \; \ldots \,\big]$

Here, the transition matrix A is stable, i.e., its largest eigenvalue is smaller than 1 in magnitude, and the observation matrix C has orthonormal columns. The expected observation sequence therefore lies in the column space of the extended observability sequence

$O_\infty = \left[\, C^T, \; (CA)^T, \; (CA^2)^T, \; \ldots \,\right]^T$

The column space of $O_\infty$ can be seen as the descriptor of an LDS, because it is invariant to the choice of basis of the state space. In this manner, the nLDS model of an action can be represented by its $O_\infty$, which means that $O_\infty$ can be seen as the feature descriptor of the action.
The traditional methods [6, 30] approximate the extended observability sequence by a finite m-order observability matrix $O_m$, so that an LDS can be represented as a point on a Grassmann manifold corresponding to the column space of $O_m$. The value of the order m influences the approximation: if m is too small, the m-order observability matrix cannot adequately represent the behavior of the extended observability matrix; conversely, the finite observability matrix approaches the extended observability matrix as m increases, but this also increases the computational cost. To avoid these limitations, we use the method proposed in [34] to project infinite-order observability matrices to points on an infinite Grassmann manifold (the infinite Grassmann manifold is defined as in [35]). Let $O_\infty$ be such an infinite-order observability matrix. Its orthonormalization is performed through the Cholesky decomposition $O_\infty^T O_\infty = R^T R$, and the orthonormalized observability matrix is defined as $O_\infty R^{-1}$. The set of orthonormalized infinite observability matrices, considered as a quotient space (i.e., up to the choice of basis), is an infinite Grassmann manifold with an extra intrinsic structure. Thus, an action, represented by $O_\infty$, can alternately be identified as a point on the infinite Grassmann manifold consisting of the orthonormalized extended observability sequences.
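Although $O_\infty$ has infinitely many rows, its Gram matrix $O_\infty^T O_\infty = \sum_{i \ge 0} (A^i)^T C^T C A^i$ is finite and, for a stable A, satisfies the discrete Lyapunov equation $G = A^T G A + C^T C$, so the Cholesky factor R can be computed without truncating $O_\infty$. A minimal scipy-based sketch (the helper name is ours):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def cholesky_of_O_infinity(A, C):
    """Cholesky factor R with O_inf^T O_inf = R^T R, computed without truncation.

    Requires A to be stable (all eigenvalues strictly inside the unit circle).
    """
    G = solve_discrete_lyapunov(A.T, C.T @ C)    # solves G = A^T G A + C^T C
    L = np.linalg.cholesky(G)                    # G = L L^T
    return L.T                                   # upper-triangular factor R
```

Inner products between two different orthonormalized sequences additionally involve the cross-Gram matrix of the two observability sequences, which satisfies an analogous Sylvester-type equation and can be obtained in the same spirit.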
4.4. Sparse Coding and Dictionary Learning on Infinite Grassmann Manifold
To classify the actions projected onto the infinite Grassmann manifold, an efficient method [35] is used to perform sparse coding and dictionary learning on the infinite Grassmann manifold. Given a dictionary of manifold elements, a set of actions, and the corresponding coefficient vectors, the sparse coding objective measures how well each action (a point on the manifold) is reconstructed from the dictionary elements while penalizing non-sparse coefficients. The purpose of dictionary learning is to find a good dictionary that can represent all the actions with a small reconstruction error. Each dictionary element and each action is represented by its parameter tuple, and the dictionary learning problem on the infinite Grassmann manifold is then expressed as an optimization over the Cholesky decomposition matrices associated with the dictionary elements and the actions.
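To convey the structure of such an objective, the following is a sketch of sparse coding with the projection kernel on a (finite-dimensional) Grassmann manifold, solved by a simple ISTA loop. It is not the exact formulation of [35], where the explicit bases are replaced by the Cholesky/Gram matrices of the observability sequences, but the quadratic-plus-sparsity structure is the same; all names and the regularization value are our assumptions.

```python
import numpy as np

def grassmann_sparse_code(X, D_list, lam=0.1, n_iter=200):
    """min_w ||X X^T - sum_j w_j D_j D_j^T||_F^2 + lam * ||w||_1 via ISTA.

    X: (p, d) orthonormal basis of the query subspace; D_list: list of (p, d) dictionary bases.
    Expanding the objective gives the quadratic  w^T K w - 2 w^T b + const  with
    K[j, k] = ||D_j^T D_k||_F^2 and b[j] = ||X^T D_j||_F^2.
    """
    b = np.array([np.linalg.norm(X.T @ Dj, 'fro') ** 2 for Dj in D_list])
    K = np.array([[np.linalg.norm(Di.T @ Dj, 'fro') ** 2 for Dj in D_list] for Di in D_list])
    w = np.zeros(len(D_list))
    step = 1.0 / (2.0 * np.linalg.norm(K, 2) + 1e-12)
    for _ in range(n_iter):
        w = w - step * 2.0 * (K @ w - b)                          # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft threshold
    return w
```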
5. Experiments
In this section, three standard 3D human action datasets, i.e., the MSR-Action3D dataset [36], UTKinect-Action dataset [37], and G3D-Gaming dataset [2], are utilized to evaluate the proposed LDS model and nonnegative tensor representation.
5.1. Alternative Nonnegative Tensor Representation of Skeleton Sequence
To obtain the nonnegative tensor representation more easily, the whole skeleton is transformed to the first octant. Each skeleton in the first octant contains N joint points, N-1 joint angles, and N-1 rigid bodies, and an action sequence consists of $\tau$ skeletons. To verify the effectiveness of our approach, we compare it with the following four alternative nonnegative tensor representations:
Second-Order Nonnegative Joint Positions (2NJP). An action sequence is represented as a second-order nonnegative tensor in which each skeleton is seen as a nonnegative vector consisting of the 3D coordinates of all joint points.
Second-Order Nonnegative Rigid Body Direction (2NRBD). An action sequence is represented as a second-order nonnegative tensor in which each skeleton is seen as a nonnegative vector consisting of the directions of all rigid bodies (the direction of a rigid body is represented by the three angles between the rigid body and the x-, y-, and z-axes, respectively).
Third-Order Nonnegative Joint Angle and Joint Positions (3NJAP). Given two adjacent rigid bodies, we can obtain the joint angle between them and a coordinate tuple comprising the coordinates of the three joints in the two rigid bodies. An action sequence is represented as a third-order nonnegative tensor in which each skeleton is seen as a second-order nonnegative tensor consisting of all the joint angles and coordinate tuples.
Third-Order Nonnegative Joint Angle and Direction (3NJAD). Given two adjacent rigid bodies, we can obtain a direction tuple containing the directions of the two rigid bodies and the joint angle between them. An action sequence is represented as a third-order nonnegative tensor in which each skeleton is seen as a second-order nonnegative tensor consisting of all the joint angles and direction tuples.
5.2. Parameter Estimation
In the MSR-Action3D, UTKinect-Action, and G3D-Gaming datasets, all skeletons contain the same number of joints and rigid bodies (19 rigid bodies and 20 joint points). NTD computes the best low-rank approximation of a nonnegative tensor for specified core dimensions $(J_1, J_2, J_3)$, so the choice of $J_1$, $J_2$, and $J_3$ affects the cost function between the nonnegative tensor and its best low-rank approximation. Each skeleton sequence in the three datasets is represented as a nonnegative tensor whose third dimension equals the length of the skeleton sequence. To compute the best low-rank approximation, we fix $J_1$, $J_2$, and $J_3$ empirically, using a slightly different setting on the MSR-Action3D dataset.
5.3. Experiments on MSR-Action3D Dataset
The MSR-Action3D dataset includes 557 human pose sequences that were captured by 10 subjects performing 20 actions, with each action having 2 or 3 repetitions. Each pose of this dataset provides the 3D locations of 20 joints. Each action sequence includes approximately 50 frames. Experiments on the dataset are challenging because many actions are very similar and the pose sequence of the same action class can have a large intraclass variation owing to the performing style variations.
Following the experimental protocol in [36], the 20 actions in the dataset are categorized into three subsets, AS1, AS2, and AS3, each containing 8 actions. Subsets AS1 and AS2 group actions with similar movements, whereas subset AS3 groups more complex actions. The cross-subject evaluation method, in which half of the subjects are used for training and the remaining subjects for testing, is utilized to perform recognition on each subset. The average recognition rate is reported over 10 different combinations of training and testing sets. Table 1 shows that the proposed approach outperforms various other methods that extract action features from 3D joint positions. Our approach achieves an average accuracy of 97.63% on the MSR-Action3D dataset, outperforming the other action recognition approaches: its average accuracy is 0.41% better than that of Ensemble TS-LSTM [14], 2.78% better than tLDS [6], 5.79% better than Bi-LSTM [15], 1.81% better than 3NJAP-nLDS, and 0.17% better than 3NJAD-nLDS. The superior performance on the three subsets indicates that our approach is better than the other methods at both distinguishing similar actions and recognizing complex actions.
Following the experimental protocol in [8], we test all the actions on the MSR-Action3D dataset. The experiment on the entire dataset is more challenging than that in [36]. Our approach achieves an accuracy of 96.97%, as shown in Table 2.
Figure 5 shows the classification confusion matrix on the entire MSR-Action3D dataset. From the confusion matrix, we find that the recognition rate of most actions reaches 100%. Classification errors mainly occur when two actions are extremely similar, such as draw tick and horizontal arm wave.

5.4. Experiments on UTKinect-Action Dataset
The UTKinect-Action dataset is used to further evaluate our approach. This dataset comprises 10 types of human actions, which are captured by a single stationary Kinect in indoor settings. The 10 actions are walking, sitting down, standing up, picking up, carrying, throwing, pushing, pulling, waving, and clapping hands. Ten different subjects (9 males, 1 female) performed each action twice. Overall, there are 6220 frames of the 199 action sequences. This dataset is very challenging. Firstly, as the body parts of some actions are out of the field-of-view, parts of the human body are invisible. Secondly, different subjects perform the same action with different limbs, such as left-hand waving and right-hand waving. Thirdly, the action sequences captured from different views cause difficulties for action recognition.
To appropriately compare our approach with the state-of-the-art algorithms, the leave-one-sequence-out cross validation (LOOCV) method is applied to perform our experiment on the dataset. For each iteration, we choose an action sequence for testing and use the remaining action sequences for training. Each testing sequence was randomly chosen. The experiment on the dataset was performed ten times. Table 3 presents the experiment results achieved by our approach and other state-of-the-art methods. The recognition rate of our approach on the dataset is 98.23%. It is obvious that our approach outperforms SE3 [12], EigenJoints [13], Grassmann manifold [8], Key-Pose-Motifs [18], learning features combination [16], Ensemble TS-LSTM [14], tLDS [6], and Bi-LSTM [15], which achieve recognition rates of 97.08%, 97.10%, 88.5%, 93.47%, 98.00%, 96.97%, 96.48%, and 96.89%, respectively. The main reason for this may be that our approach more accurately reflects the relationships between the skeletons in the action sequence.
5.5. Experiments on G3D-Gaming Dataset
The G3D-Gaming dataset consists of 663 sequences of 20 different gaming actions captured by a Microsoft Kinect. Ten different subjects performed each gaming action more than twice. The dataset provides three types of data: synchronized video, depth, and skeleton data; only the skeleton data are used in our experiment. Experiments on this dataset are also challenging owing to two factors. First, when body parts are occluded, the Kinect tracker returns inferred joint positions, which lowers the recognition rate for actions such as TennisSwingBackhand, Golf, and ThrowBowlingBall. Second, if the movement range of two different actions is relatively small, the two actions may easily be confused with each other during recognition. The cross-subject evaluation method is used to perform our experiment, and the average recognition results are reported over ten different combinations of training and testing sets. The proposed approach is compared with the state-of-the-art methods reported for the G3D-Gaming dataset, as listed in Table 4. GB-RBM+HMM [19] and LieNet [21] use deep learning to recognize human actions: GB-RBM+HMM incorporates a Gaussian binary-restricted Boltzmann machine (GB-RBM) with a hidden Markov model (HMM) to capture the global and local dynamic features of the joint trajectories, and LieNet combines Lie group structures with a deep network architecture to obtain more appropriate Lie group features for recognition. Our approach achieves a recognition accuracy of 92.56%, outperforming GB-RBM+HMM [19], SE [12], SO [20], LieNet [21], and tLDS [6], which achieve recognition rates of 86.40%, 91.09%, 87.95%, 89.10%, and 90.60%, respectively.
5.6. Discussion about Nonnegative Tensor Representation and Extended Observability Sequence
In this work, an action is represented by a third-order nonnegative tensor. Following the approach proposed in [6], the finite observability matrix built from the nLDS parameters (with a chosen truncation order) can be taken as the feature descriptor of the action, and the subspace spanned by its columns corresponds to a point on a finite Grassmann manifold. Therefore, to verify the effectiveness of the infinite Grassmann manifold, we use dictionary learning and sparse coding on the finite Grassmann manifold to classify the nonnegative tensor-based actions represented in this truncated form, and compare the results with our representation, in which the full extended observability sequence is mapped to a point on the infinite Grassmann manifold. The experimental results, shown in Table 5, demonstrate that mapping the full extended observability sequence to the infinite Grassmann manifold, instead of using a finite truncation, improves the accuracies on the three datasets.

In [6], an action is represented by a general (not necessarily nonnegative) third-order tensor. Such a third-order tensor can also be mapped to a point on an infinite Grassmann manifold using the approach proposed in [34]. Then, to verify the effectiveness of the nonnegative tensor-based action representation, we use dictionary learning and sparse coding on the infinite Grassmann manifold to classify the third-order tensors represented as points on an infinite Grassmann manifold. The experimental results, shown in Table 6, demonstrate that the nonnegative tensor-based action representation is effective in improving the accuracies on the three datasets.
5.7. Evaluating the Effect of Infinite Mapping and NTD on the Accuracy of Action Recognition
The extended observability sequence of a third-order nonnegative tensor is built from the parameters (A, C) estimated by the nonnegative tensor-based LDS (nLDS) model, which obtains its parameters via NTD. The subspace spanned by the columns of the finite observability matrix (with a chosen truncation order) corresponds to a point on a Grassmann manifold, so the actions, represented by third-order nonnegative tensors, can be mapped to points on a Grassmann manifold. Therefore, to verify the effectiveness of NTD, we use LTBSVM [8] to classify the actions represented as points on the Grassmann manifold. The experimental results, shown in Table 7, demonstrate that NTD is effective in improving the accuracies on the three datasets.

In [8], the extended observability matrix of an action is built from parameters estimated by an ARMA model of the action sequence. We map this extended observability matrix to a point on an infinite Grassmann manifold using the approach proposed in [34]. Dictionary learning and sparse coding on the infinite Grassmann manifold are then employed to classify the actions represented as points on the infinite Grassmann manifold, in order to verify the effectiveness of the infinite mapping (i.e., of the infinite Grassmann manifold). The experimental results, shown in Table 8, demonstrate that the infinite mapping is effective in improving the accuracies on the three datasets.
5.8. Computation Complexity and Run Time
Computational complexity comprises time complexity and space complexity. Both the time complexity and the space complexity of our algorithm grow with the size of a single skeleton sequence (an action) and with the total number of skeleton sequences in the dataset.
Matlab is used to run our experiments on a 3.60GHz Intel Core i7-4790 CPU machine. Run time comprises the training time and the testing time. Table 9 shows the run time on the three datasets.
5.9. Preprocessing and Skeleton Translation
The Preliminary Preprocessing and Translation of Skeletons. Before all skeletons in an action dataset are translated to the local coordinate system, we preprocess them as follows.

Preliminary Preprocessing. For each action dataset, all skeletons were transformed to a global coordinate system whose origin is located at the hip center; this makes the skeletal data invariant to the absolute location of the human in the scene. One of the skeletons was chosen as a reference skeleton, and all other skeletons were normalized (without changing their joint angles) such that their body-part lengths are equal to the corresponding body-part lengths of the reference skeleton, which makes the skeletons scale-invariant. Finally, all skeletons were rotated such that the global x-axis is aligned with the ground-plane projection of the vector from the left hip to the right hip, which makes the skeletons view-invariant. Figure 6(a) shows a preprocessed skeleton.
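A minimal sketch of this preliminary preprocessing (the joint indices, the choice of the ground plane, and the omission of the reference-skeleton scale normalization are our simplifying assumptions):

```python
import numpy as np

def preprocess_skeleton(joints, hip=0, left_hip=12, right_hip=16):
    """Hip-center and view-normalize one skeleton; joint indices are assumptions.

    joints: (N, 3) array of 3D joint positions.
    Translates the hip center to the origin and rotates about the vertical axis so that
    the ground-plane projection of the left-hip -> right-hip vector aligns with the x-axis.
    (Here the ground plane is assumed to be the x-y plane; scale normalization against a
    reference skeleton is omitted for brevity.)
    """
    J = joints - joints[hip]                          # location invariance
    v = J[right_hip] - J[left_hip]
    theta = np.arctan2(v[1], v[0])                    # angle of the hip vector in the ground plane
    c, s = np.cos(-theta), np.sin(-theta)
    Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return J @ Rz.T                                   # view invariance
```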

After the preliminary preprocessing, the hip center of each skeleton is located at the origin of the global coordinate system. Consequently, regardless of how much a subject moves around in the scene, the joint coordinates in a preprocessed skeleton sequence stay close to the origin, so the translation required to move all joint coordinates into the first octant is small. In other words, the locomotion of the subjects does not influence the amount of translation, because the translation depends only on the (small) joint coordinates of the preprocessed skeletons. For our approach, the preliminary preprocessing thus imposes no special limitations and also helps to improve the accuracy of action recognition.
Given an action dataset in which all skeletons have been preprocessed by the preliminary preprocessing, let $(x_{i,s,q}, y_{i,s,q}, z_{i,s,q})$ denote the 3D position of joint i of skeleton s in skeleton sequence q, where i ranges over the joints of a skeleton, s over the skeletons of a sequence, and q over the skeleton sequences of the dataset. Let $o = (x_o, y_o, z_o)$ be the origin of the local coordinate system, where $x_o = \min_{i,s,q} x_{i,s,q}$, $y_o = \min_{i,s,q} y_{i,s,q}$, and $z_o = \min_{i,s,q} z_{i,s,q}$. Then, to keep all joints in the first octant, all preprocessed skeletons in the dataset are expressed in the local coordinate system with origin at the point o, so that every joint coordinate becomes nonnegative and the hip center of each preprocessed skeleton is placed at $-o$ in the local coordinate system. Figure 6(b) shows a preprocessed skeleton translated to the local coordinate system.
Discussion about Skeleton Translation. Let $\delta = \sqrt{x_o^2 + y_o^2 + z_o^2}$ denote the distance between the origin of the local coordinate system and the origin of the global coordinate system. Figure 7 shows the relationship between $\delta$ and the recognition rates. We found that the recognition rates gradually decreased as $\delta$ increased, which means that the translation of the skeletons affects the accuracy of action recognition. Therefore, to reduce this negative effect as much as possible, we set the origin of the local coordinate system to the component-wise minimum o defined above, which ensures that the joints of the translated skeletons are located in the first octant while $\delta$ achieves its minimum.
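A minimal sketch of this translation step, assuming each sequence is stored as an array of hip-centered joint positions (the names are ours):

```python
import numpy as np

def first_octant_origin(sequences):
    """Component-wise minimum over every joint of every frame of every sequence.

    sequences: iterable of (tau_q, N, 3) arrays of preprocessed joint positions.
    Subtracting the returned origin o moves all joints into the first octant while
    keeping the translation distance delta = ||o|| as small as possible.
    """
    return np.min([seq.min(axis=(0, 1)) for seq in sequences], axis=0)

def to_first_octant(seq, o):
    return seq - o            # all coordinates become >= 0
```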

6. Conclusion and Future Work
In this paper, an action is represented as a third-order nonnegative tensor. To capture the original spatiotemporal information of the action, an nLDS is used to model the action, and NTD is employed to estimate the parameters of the nLDS model. The extended observability sequence built from these parameters, which is considered as the feature descriptor of the action, is mapped to a point on an infinite Grassmann manifold, and dictionary learning and sparse coding on the infinite Grassmann manifold are used to perform classification. The experimental results demonstrate that our approach achieves better performance than other state-of-the-art skeleton-based action recognition approaches. Following the theory proposed by Zhou et al. [38], future research will focus on the effect of unique and sparse NTD on action recognition.
Data Availability
In this paper, the experiments were performed on three public datasets as follows: (1) MSR-Action3D dataset is an action dataset of depth sequences captured by a depth camera. The dataset can be found in http://research.microsoft.com/en-us/um/people/zliu/actionrecorsrc/. (2) UTKinect-Action3D dataset is an action dataset that was captured using a single stationary Kinect. The dataset can be found in http://cvrc.ece.utexas.edu/KinectDatasets/HOJ3D.html. (3) G3D dataset is an action dataset containing synchronized video, depth, and skeleton data. The dataset can be found in http://dipersec.king.ac.uk/G3D/.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant No. 61850410523, the National Natural Science Foundation of China under Grant No. 61571345, the Nature Science Foundation of Anhui Province under grant No. 1908085MF186, and the Fundamental Research Funds for Xidian University No. XJS18041.