Abstract

Human action recognition models based on spatial-temporal graph convolutional neural networks have developed steadily, and we present an improved spatial-temporal graph convolutional neural network to address the high parameter counts and low accuracy of this type of model. The method mainly draws on the inception structure. First, tensor rotation is added to the graph convolution layer to convert between the graph node dimension and the channel dimension and to enhance the model's ability to capture global information in small-scale graph tasks. Then an inception temporal convolution layer is added to build a multiscale temporal convolution filter that perceives temporal information hierarchically over four different temporal receptive fields. This overcomes the weakness of temporal graph convolutional networks in modeling the relevance of hidden-layer joints and compensates for the information omitted in small-scale graph tasks. It also limits the volume of parameters, lowers the required computation, and speeds up inference. In our experiments, we validate our model on the public dataset NTU RGB + D. Our method reduces the number of model parameters by 50% and achieves an accuracy of 90% under the Cross-Subject (CS) evaluation protocol and 94% under the Cross-View (CV) protocol. The results show that our method not only achieves high recognition accuracy and good robustness in human action recognition applications but also uses a small number of model parameters, which effectively reduces the computational cost.

1. Introduction

Computer vision technology is a key link in the realization of artificial intelligence, and its emergence has given artificial intelligence great potential in visual perception. Human action recognition is among the most challenging tasks in computer vision, and its implementation supports intelligent applications such as pedestrian following and behavior analysis. The most widely researched human action recognition methods are based on the human skeleton, although image-based methods also exist; the two kinds of approaches are very different [1–4]. Skeleton-based methods take human skeleton data as input, focus on analyzing the depth, spatial, and temporal information of human skeletal joints, and then combine all features to produce a behavior prediction. Compared with image-based methods, human skeletal data are more compact and reduce the computational cost by replacing a large number of pixels with a small set of skeletal joints. Skeleton-based action recognition also performs better in working environments with multiple targets and complex backgrounds [5–7].

Traditional skeleton-based methods rely on skeletal joint trajectories, with models built on recurrent neural networks over skeletal joint point data [8–11]. Some researchers prefer deep neural network models, and the literature already documents both their substantial advantages and their shortcomings. The distribution of skeletal data is rather fragmented, and the individual skeletal joint points are not locally linked. Therefore, for deep neural networks, a dedicated network must be tailored to the structured skeletal data so that all skeletal joint points can be processed in coordination [12, 13].

In the deep recurrent network model, only the connections in the feature point space can be analyzed, and the connections between features at the temporal level cannot be obtained. To solve this problem, the literature [14, 15] applied a long short-term memory network (LSTM) [16] to feature extraction from skeletal data: the authors first divided the skeletal data into slices, each corresponding to an individual LSTM unit, and then merged all of them. Such an architectural design improves the model's spatial perception of the skeletal data. However, this method suffers from the limitation of manually predetermined architectural rules, which reduces the generalization ability and robustness of the network [17]. Considering the spatial and temporal features of skeletal data, the literature [18] introduced the graph convolutional network, which breaks the limitation of 2-dimensional grid data and can handle arbitrary graph structures. It transfers the computation of the graph convolutional network to skeletal node data and can integrate the connections between spatial feature points across dimensions. Building on this, the literature [19] presented a spatial-temporal graph convolutional neural network, which represents each skeletal data point in a graph structure and then performs feature extraction in a graph convolutional pattern to obtain the spatial features between skeletal joint points [20–22]. In addition, the model adds a temporal convolution unit to integrate the temporal links between skeletal joint points, estimate the trajectories of skeletal joints, and finally predict the behavior class [23].

Based on preliminary research and experiments, this paper proposes the Inception-ST-GCN (IST-GCN) method, which aims to reduce the complexity of building the neural network architecture while capturing the global information of the graph. A tensor rotation module is added to rotate the graph node dimension into the channel dimension, after which a Conv 1 × 1 convolution captures the global information. A new inception-layer multiscale temporal convolution filter is added and divided into four branches with different temporal receptive fields to capture richer temporal information while greatly decreasing the volume of model parameters. The IST-GCN method thus achieves a compact and efficient network. To test the effectiveness of the method, we perform experimental validation on the public dataset NTU RGB + D. The results show that the number of parameters of our method is greatly reduced compared with the original ST-GCN model, while the accuracy and precision of action recognition are greatly improved.

The remainder of this paper is laid out as follows: Section 2 introduces the construction of the basic network and the principles of its mathematical formulation. Section 3 details the principles and implementation of the improved human action recognition network. Section 4 presents the experimental datasets and an analysis of the results. Finally, Section 5 reviews our findings and outlines directions for further research.

2. Basic Network

Based on our preliminary examination, we adopt the spatial-temporal graph convolutional neural network as the base network; its structure is shown in Figure 1. This network is an upgrade of the graph convolutional network that optimizes the perceptual domain of the graph convolution and extends the network to feature relations at the temporal level. Its main purpose is to encode the skeletal data as a sequence and to predict behavior from the spatial features and temporal associations between skeletal joint points. For skeletal feature acquisition, we use the OpenPose [24] algorithm, which localizes the human body with 25 skeletal points and treats the connections between skeletal points as human joints. The input is usually a video sample in AVI format, and each frame of the sample video corresponds to a set of joint coordinates. The OpenPose algorithm splits and resolves each set of joint coordinates and maps them to the skeletal graph nodes of the human body, using the joints and the natural connections of the human body as edges to build a complete spatial-temporal graph. In other words, the input to the network can be understood as sets of joint coordinates on the skeleton graph, analogous to the 2-dimensional grids of pixel intensities fed to a convolutional neural network. To obtain a wider range of information, graph convolutional layers are stacked, and all outputs are then fed into the classifier in parallel.

The input in Figure 1 is a fixed skeleton sequence. Let $T$ represent the total number of frames in the sequence, $N$ represent the number of skeletal joints, and $G = (V, E)$ denote the constructed skeleton spatial-temporal graph, where $V = \{v_{ti} \mid t = 1, \dots, T;\ i = 1, \dots, N\}$ traverses the skeleton joints along the whole time sequence and denotes the set of all nodes. $E$ denotes the set of connections between joints and consists of $E_S$ and $E_F$. For an arbitrary human skeleton joint $v_{ti}$, $E_S = \{v_{ti} v_{tj} \mid (i, j) \in H\}$ denotes the intra-skeleton connections within frame $t$, where $H$ is the set of naturally connected joint pairs. The subset of intra-skeletal connections is divided into disjoint regions according to the center-of-gravity rule and is represented by the adjacency-matrix encoding $A_k$. $E_F = \{v_{ti} v_{(t+1)i}\}$ denotes the union of connections between the same skeletal joints in consecutive frames. The fusion of the above features results in a sequence graph that extends the spatial mapping along the temporal dimension.
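For concreteness, the following Python sketch shows how such an intra-skeleton adjacency with self-loops can be built and symmetrically normalized as $\Lambda^{-1/2}(A + I)\Lambda^{-1/2}$. The 25-joint edge list approximates the common NTU RGB + D layout and is an illustrative assumption, not necessarily the exact definition used in our implementation.

```python
import numpy as np

# Illustrative 25-joint skeleton (0-indexed), roughly the NTU RGB+D layout;
# the exact edge list is an assumption for this sketch.
NUM_JOINTS = 25
EDGES = [(0, 1), (1, 20), (20, 2), (2, 3),                        # spine and head
         (20, 4), (4, 5), (5, 6), (6, 7), (7, 21), (7, 22),       # left arm and hand
         (20, 8), (8, 9), (9, 10), (10, 11), (11, 23), (11, 24),  # right arm and hand
         (0, 12), (12, 13), (13, 14), (14, 15),                   # left leg
         (0, 16), (16, 17), (17, 18), (18, 19)]                   # right leg

def build_adjacency(num_joints=NUM_JOINTS, edges=EDGES):
    """Intra-skeleton adjacency with self-loops, i.e., A + I."""
    A = np.zeros((num_joints, num_joints), dtype=np.float32)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A + np.eye(num_joints, dtype=np.float32)

def normalize(A):
    """Symmetric normalization Lambda^(-1/2) A Lambda^(-1/2)."""
    inv_sqrt_deg = np.diag(A.sum(axis=1) ** -0.5)
    return inv_sqrt_deg @ A @ inv_sqrt_deg
```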

The literature [25] optimized the spatial submodule of the spatial-temporal graph convolutional neural network and proposed the following graph convolution equation:

$$f_{out} = \sum_{k=1}^{K_v} \Lambda_k^{-\frac{1}{2}} \left(A_k + I\right) \Lambda_k^{-\frac{1}{2}} f_{in} W_k,$$

where $A_k$ denotes the adjacency matrix of internal connections of skeletal nodes, $I$ denotes the identity matrix, $K_v$ denotes the size of the convolution kernel in the spatial dimension, and $W_k$ denotes the training weights. The temporal convolution module is a $K_t \times 1$ convolution. In 2D graph convolution, the perceptual field of the convolution kernel is not considered when operating along the $T$ dimension, where $T$ denotes the number of frames.
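For reference, a minimal PyTorch sketch of this spatial graph convolution follows, realizing the weights $W_k$ as a shared 1 × 1 convolution over the (channel, time, node) tensor, in the style of common ST-GCN implementations; it is a sketch under these assumptions rather than our exact code.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Spatial graph convolution: f_out = sum_k norm(A_k + I) f_in W_k.
    A_norm is a (K_v, N, N) stack of pre-normalized partition matrices."""

    def __init__(self, in_channels, out_channels, A_norm):
        super().__init__()
        self.register_buffer("A", A_norm)
        K_v = A_norm.size(0)
        # W_k for all K_v partitions at once, as a 1x1 convolution.
        self.conv = nn.Conv2d(in_channels, out_channels * K_v, kernel_size=1)

    def forward(self, x):                     # x: (batch, C, T, N)
        K_v, N = self.A.size(0), self.A.size(1)
        y = self.conv(x)                      # (batch, C_out*K_v, T, N)
        b, _, T, _ = y.shape
        y = y.view(b, K_v, -1, T, N)
        # Aggregate over neighbors with each normalized partition matrix A_k.
        return torch.einsum("bkctn,knm->bctm", y, self.A)
```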

The graph structures in graph convolution are predefined; to increase adaptability, the literature [26] supplements the fixed adjacency matrix and proposes an adaptive graph convolution formula as follows:

$$f_{out} = \sum_{k=1}^{K_v} W_k f_{in} \left(A_k + B_k + C_k\right),$$

where $B_k$ denotes an adjacency matrix whose parameters are learned in training and $C_k$ denotes a graph of connected vertices determined by a similarity function over vertex embeddings.
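A hedged sketch of this adaptive variant, in the spirit of 2s-AGCN [26], is shown below. The embedding size and the time-pooled similarity are simplifying assumptions (the original computes similarity over all space-time positions); imports are as in the previous sketch.

```python
class AdaptiveGraphConv(nn.Module):
    """Adaptive graph convolution: f_out = sum_k W_k f_in (A_k + B_k + C_k)."""

    def __init__(self, in_channels, out_channels, A_norm, embed=16):
        super().__init__()
        K_v, N, _ = A_norm.shape
        self.register_buffer("A", A_norm)                # fixed adjacency A_k
        self.B = nn.Parameter(torch.zeros(K_v, N, N))    # learned adjacency B_k
        self.theta = nn.ModuleList(nn.Conv2d(in_channels, embed, 1) for _ in range(K_v))
        self.phi = nn.ModuleList(nn.Conv2d(in_channels, embed, 1) for _ in range(K_v))
        self.conv = nn.ModuleList(nn.Conv2d(in_channels, out_channels, 1) for _ in range(K_v))

    def forward(self, x):                                # x: (b, C, T, N)
        out = 0.0
        for k in range(self.A.size(0)):
            # C_k: pairwise joint similarity from embeddings, pooled over time.
            th = self.theta[k](x).mean(dim=2).permute(0, 2, 1)   # (b, N, embed)
            ph = self.phi[k](x).mean(dim=2)                      # (b, embed, N)
            Ck = torch.softmax(th @ ph, dim=-1)                  # (b, N, N)
            Ak = self.A[k] + self.B[k] + Ck                      # broadcast to (b, N, N)
            out = out + torch.einsum("bctn,bnm->bctm", self.conv[k](x), Ak)
        return out
```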

3. Improved Action Recognition Network

The spatial-temporal graph convolution model uses a predefined structural graph as a topological constraint so that graphs at different time steps share the same topology; such a structure prevents the graph task from fully capturing the relevant features of hidden joint layers. The most common remedy is to build a regional neural network that starts from a local perceptual domain and runs a small-scale graph task in the experimental region, but this easily omits global information. Imitating the way convolutional neural networks compute over pixels, each graph node and its adjacent nodes become the key nodes of the graph convolution computation. Considering the density heterogeneity among neighboring nodes and the narrowness of the local structure, our improved network employs node features of fixed size for feature learning in the temporal dimension, selectively ignoring the size of cluster features and thereby capturing more features at the temporal level. We therefore present the inception spatial-temporal graph convolutional network (IST-GCN), which applies the inception structure to selected network layers to reduce the model parameters, broaden the network width, and enhance the robustness of the model.

3.1. Inception Module

The inception module is a sparse structure proposed in 2015 with excellent feature expression and local topology capabilities. When an image is input, its pixels pass through a series of convolution and pooling operations that obtain features at different scales from convolution kernels of different sizes. All outputs are processed in parallel to filter out the best image features. The original inception structure is shown in Figure 2. It mainly contains convolutional kernels at three scales plus a 3 × 3 pooling; the combination of 1 × 1, 3 × 3, and 5 × 5 kernels can fully acquire both large-scale sparse features and small-scale nonsparse features. Such a structure not only increases the network width but also increases the adaptability of the network to different scales. Finally, all features are combined by a concatenation operation to obtain the nonlinear properties of the features.
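The structure of Figure 2 can be sketched as follows; the branch widths and the 1 × 1 reductions before the larger kernels are illustrative choices based on the standard inception design (imports as in the earlier sketches).

```python
class InceptionBlock(nn.Module):
    """Parallel 1x1, 3x3, and 5x5 convolutions plus 3x3 max pooling,
    concatenated along the channel dimension."""

    def __init__(self, in_ch, out_1, out_3, out_5, out_pool):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, out_1, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, out_3, 1),      # 1x1 reduction
                                nn.Conv2d(out_3, out_3, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, out_5, 1),      # 1x1 reduction
                                nn.Conv2d(out_5, out_5, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, out_pool, 1))

    def forward(self, x):
        # All branches keep the spatial size, so outputs can be concatenated.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)
```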

3.2. Graph Convolutional Layer Improvement Strategy

Our proposed IST-GCN model originates from a two-part optimization of the spatial-temporal graph convolutional network: the first part optimizes the graph convolutional layers; the second part adds the inception layer. In the graph convolutional layer, the original model aims to obtain spatial location information between the human skeletal joint points so as to represent the joints. It starts from the initial neighboring nodes to build a local perceptual domain, within which a large number of sample nodes are generated. Although many false samples appear at this point, adding topological angle restrictions during the subsequent filtering of the sequence in Euclidean space can remove them. Once all sample nodes lie in Euclidean space, they can be regarded as points from a global viewpoint, and the sequence of points can be treated as a one-dimensional vector. In this case, capturing the features of a large number of sample nodes requires a large-scale graph convolution kernel whose size matches the number of nodes. To solve this problem, we propose a tensor rotation strategy: we add a tensor rotation module, which we call the Rotate tensor GCN (R-GCN), at the beginning and end of the graph convolution layer. The detailed network structure is shown in Figure 3.

Through the tensor rotation module, every sample node shares the same set of topological matrices, and all nodes participate in capturing global information. Taking human nodes as an example, each graph contains 25 nodes, and in the fully connected layer we choose a filter of size 25. The rotation tensor module rotates the tensor according to the different nodes so that the node dimension and the channel dimension are exchanged. By tensor rotation, the predefined topological matrix is discarded, and global features of joint relevance are learned adaptively through the self-cycling unit. Finally, the global information is integrated through a trailing Conv 1 × 1 dimensionality reduction. Such a design effectively avoids stacking higher-order polynomial estimates layer by layer to capture higher-order features, thus reducing the number of parameters.
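A minimal sketch of this rotation follows, with layer sizes chosen for illustration (normalization and activation details omitted); imports are as in the earlier sketches.

```python
class RotateTensorGCN(nn.Module):
    """R-GCN sketch: rotate nodes into the channel dimension, mix all 25
    joints with a size-25 filter, rotate back, then reduce with Conv 1x1."""

    def __init__(self, channels, num_joints=25):
        super().__init__()
        # Mixing over the node dimension once it sits in the channel position;
        # joint relevance is learned without a predefined topology matrix.
        self.node_mix = nn.Conv2d(num_joints, num_joints, kernel_size=1)
        self.tail = nn.Conv2d(channels, channels, kernel_size=1)  # trailing Conv 1x1

    def forward(self, x):                    # x: (b, C, T, N)
        x = x.permute(0, 3, 2, 1)            # rotate: (b, N, T, C), nodes as channels
        x = self.node_mix(x)                 # global mixing across all joints
        x = x.permute(0, 3, 2, 1)            # rotate back to (b, C, T, N)
        return self.tail(x)                  # integrate global information
```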

3.3. Inception Layer Design Strategy

We consider using the inception structure to broaden the spatial-temporal graph convolutional network because of its sparse-structure advantage: the sparse layout obtains more feature information while avoiding an increase in the number of parameters. Referring to the optimization of inception from V1 to V4, we adopt its 1 × 1 convolutional dimensionality-reduction method [27–29]. In building the inception temporal convolution network (I-TCN), we observe that widening a conventional temporal convolution layer through exponentially growing dilation coefficients exacerbates the expansion of parameters. In contrast, the inception tiling structure increments the dilation by layers, and each branch is preceded by a Conv 1 × 1 dimensionality reduction; assigning a different dilation setting to each branch grades the time-scale information into the inception branches and integrates information across different temporal dimensions. Through this assignment of temporal coefficients, exponential coefficient growth is avoided and the number of parameters is reduced.

Two two-layer I-TCN blocks are added at the end of each IST-GCN unit, and the TCN is divided into 4 branches according to the hierarchical principle, with each branch producing the output of its corresponding group; the structure is shown in Figure 4. The initial value of the dilation coefficient n is 1; as the network deepens, the layer units increase step by step, and the maximum dilation coefficient is 4. The external connection follows the residual structure and passes through a one-dimensional convolution with a stride of 2 in the middle, a design that avoids the vanishing-gradient problem. Improving the temporal convolutional network by inserting the inception structure captures more time-scale information while greatly reducing the number of network parameters and the computational cost. A compact and efficient temporal feature extraction network is achieved by using different temporal filters to adaptively select the best feature information for the classification problem.
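A hedged sketch of one I-TCN layer follows: four parallel branches, each led by a Conv 1 × 1 reduction and followed by a temporal convolution with linearly graded dilations 1 to 4, plus a strided 1 × 1 residual path. The temporal kernel size and the equal split of channels across branches are assumptions.

```python
class InceptionTCN(nn.Module):
    """I-TCN sketch: 4 temporal branches with dilations 1..4 and a residual
    path through a strided one-dimensional (1x1) convolution."""

    def __init__(self, channels, stride=1, kernel_t=5):
        super().__init__()
        branch_ch = channels // 4            # assumes channels divisible by 4
        self.branches = nn.ModuleList()
        for n in (1, 2, 3, 4):               # linearly graded dilation coefficients
            pad = (kernel_t - 1) * n // 2    # "same" padding for dilation n
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, branch_ch, 1),   # per-branch Conv 1x1 reduction
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(branch_ch, branch_ch, (kernel_t, 1),
                          stride=(stride, 1), padding=(pad, 0), dilation=(n, 1))))
        self.residual = nn.Conv2d(channels, channels, 1, stride=(stride, 1))

    def forward(self, x):                    # x: (b, C, T, N)
        out = torch.cat([branch(x) for branch in self.branches], dim=1)
        return out + self.residual(x)        # residual path eases gradient flow
```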

3.4. IST-GCN Action Recognition Process

The process of human action recognition based on the IST-GCN model is shown in Figure 5. First, the sample video data are input and processed frame by frame during analysis. The positions of human joints change across frames, but the set of all joint points across frames obeys a random distribution. Therefore, in the first layer of the network we apply a batch normalization (BN) module to normalize the joint point data at the temporal and spatial levels, making the input skeletal data more standardized, reducing error volatility, and improving the algorithm's convergence. In the second layer we adopt an attention mechanism (ATT), which connects our new R-GCN layer to the I-TCN layer in the following network. The R-GCN layer relies on the tensor rotation operation to obtain global information, after which the obtained global features are fed into the I-TCN to analyze the temporal linkage among node features, assisted by the ATT mechanism, which weakens features that fall outside the bounded range of the model and filters features of different time scales. The whole network consists of nine sequentially connected IST-GCN units that fully capture and fuse the graph feature information, followed by average pooling, classification through the fully connected layer, and finally output of the behavior prediction according to the classification weights.
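Building on the R-GCN and I-TCN sketches above, the pipeline of Figure 5 can be outlined as follows. The single fixed channel width, the squeeze-style gate standing in for the ATT mechanism, and other layer details are assumptions for illustration.

```python
class ISTGCNUnit(nn.Module):
    """One IST-GCN unit: R-GCN spatial layer, attention gate, I-TCN temporal layer."""

    def __init__(self, channels):
        super().__init__()
        self.gcn = RotateTensorGCN(channels)
        self.att = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.tcn = InceptionTCN(channels)

    def forward(self, x):
        x = self.gcn(x)
        x = x * self.att(x)                  # weaken out-of-range features
        return self.tcn(x)

class ISTGCN(nn.Module):
    """Figure 5 pipeline: input BN, nine stacked units, average pooling, FC classifier."""

    def __init__(self, in_channels=3, num_joints=25, num_classes=60, width=64):
        super().__init__()
        self.data_bn = nn.BatchNorm1d(in_channels * num_joints)  # first-layer BN
        self.embed = nn.Conv2d(in_channels, width, 1)
        self.units = nn.ModuleList(ISTGCNUnit(width) for _ in range(9))
        self.fc = nn.Linear(width, num_classes)

    def forward(self, x):                    # x: (b, C, T, N)
        b, C, T, N = x.shape
        x = self.data_bn(x.permute(0, 1, 3, 2).reshape(b, C * N, T))
        x = self.embed(x.view(b, C, N, T).permute(0, 1, 3, 2))   # (b, width, T, N)
        for unit in self.units:
            x = unit(x)
        return self.fc(x.mean(dim=(2, 3)))   # average pool, then classify
```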

4. Experiment

4.1. Datasets

To validate the performance of our method, we chose the public dataset NTU RGB + D [30] for experimental validation. This dataset is one of the most comprehensive in category coverage for human action recognition studies. It contains three types of actions: two-person interactions, health-related actions, and daily actions. These can be subdivided into 60 action categories, with a total of 56880 sample sequences. All videos are stored under a uniform dataset standard, and each sample video contains at most 300 frames. All sample data are preprocessed by OpenPose human skeleton detection, and the corresponding skeleton data and JSON files are stored separately. In addition, a set of independent evaluation criteria, namely, Cross-Subject (CS) and Cross-View (CV), is defined for this dataset: the CS evaluation splits sequences by the subject ID of the person in the dataset, and the CV evaluation splits sequences by the camera ID. The detailed volumes of the training and testing sets are shown in Table 1.
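A hedged sketch of how the CS and CV splits can be derived from NTU RGB + D sample names (format SsssCcccPpppRrrrAaaa) is given below; the training-subject ID list follows the dataset paper [30] but should be verified against the official release.

```python
# Training subjects for Cross-Subject and training cameras for Cross-View,
# as listed in the NTU RGB+D paper [30] (verify against the official split).
CS_TRAIN_SUBJECTS = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19, 25, 27,
                     28, 31, 34, 35, 38}
CV_TRAIN_CAMERAS = {2, 3}

def split(sample_name, benchmark="CS"):
    """Return 'train' or 'test' for a sample like 'S001C002P003R002A013'."""
    camera = int(sample_name[5:8])    # Cccc field
    subject = int(sample_name[9:12])  # Pppp field
    if benchmark == "CS":
        return "train" if subject in CS_TRAIN_SUBJECTS else "test"
    return "train" if camera in CV_TRAIN_CAMERAS else "test"
```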

4.2. Experimental Details

In the action recognition experiments, we mainly use the action jogging as the control standard to verify whether the recognition results match the real action; each test sample is 300 frames. The experiments are divided into single-person and multiperson action recognition experiments to test the performance of the improved method hierarchically, with the spatial-temporal graph convolutional neural network model as the comparison baseline.

4.2.1. Single Action Recognition Experiment

The single-person recognition results are shown in Figure 6. The recognized action matches the experimentally preset action; the effect is good, and the recognition result is accurate.

Compared with the spatial-temporal graph convolutional neural network model, the single-person action recognition effect differs little; the comparison experiment is shown in Figure 7. Although a few frames recognize the action as a triple jump and occasional misrecognition occurs, the final score-voting result still matches the real action, with little impact on the overall recognition result.

4.2.2. Multiplayer Action Recognition Experiment

The multiperson recognition results are shown in Figure 8. The action recognition effect is good; a few frames are misrecognized, but this does not affect the overall result, and the recognition is accurate.

Compared with the original spatial-temporal graph convolutional neural network model, the recognition effect of our method is superior; the comparison of the action recognition effect is shown in Figure 9.

As shown in Experiment A of Figure 9, two-thirds of the frames of the original ST-GCN method identify the action as a triple jump. Although some frames are identified as the real action, jogging, the overall score for triple jump is higher, so the final recognition result is triple jump. Our method uses time windows of different scales to capture information and has better control of global information, so it performs well in the multiperson experiment, and its recognition results are accurate. In Experiment B of Figure 9, under the original ST-GCN algorithm, one person was occluded; although the skeletal information was detected, the action could not be classified, and the overall action was recognized as roller skating, which does not match the real action. Multiperson action recognition is generally less effective than single-person recognition: as the number of people grows, the accuracy of human skeleton detection and the efficiency of action classification decrease, so we limit the multiperson experiments to at most three people. Our method can recognize and correctly categorize the occluded part of the action, further highlighting the advantages of the proposed IST-GCN method.

4.3. Experimental Results Analysis

Our proposed IST-GCN method involves improvements to two main parts, namely, the rotated tensor module in the graph convolution layer (R-GCN) and the inception structure embedded in the temporal convolution layer (I-TCN). To verify the effect of each, ablation experiments were performed. First, the GCN in ST-GCN was replaced with R-GCN, and this group of experiments was labeled with the letter R to test the efficiency of R-GCN. Second, the TCN in ST-GCN was replaced with I-TCN, and this group was labeled with the letter I to verify the performance of the I-TCN module. Both groups were validated, together with the spatial-temporal graph convolutional neural network and our proposed IST-GCN, on the NTU RGB + D dataset. The results were compared in terms of accuracy (Acc), bone recognition accuracy (Bone), joint recognition accuracy (Joint), and number of parameters (Param), as shown in Table 2.

As shown in Table 2, the R-GCN technique improves overall accuracy by 3.7 percent, and the number of parameters is lowered proportionally. The I-TCN approach improves overall accuracy by 7.5 percent and halves the number of parameters. The results reveal that I-TCN has a greater impact on overall performance than R-GCN; R-GCN, although less effective than I-TCN in overall performance improvement, is indispensable for capturing global features. The two optimizations complement each other and prove the effectiveness of the IST-GCN method.

To verify the effectiveness of our IST-GCN, we compare four different kinds of skeleton-based action recognition models: dynamic skeleton [31], ST-GCN [18], P-LSTM [30], and TCN [32]. Dynamic skeleton represents action recognition models based on hand-crafted features, P-LSTM represents recurrent-neural-network-based methods, TCN represents convolutional-neural-network-based methods, and ST-GCN represents models based on graph convolutional neural networks. These four methods and our method are validated on the NTU RGB + D dataset. The experimental data are shown in Table 3.

The experimental comparison in Table 3 indicates that, in the validation experiments on the NTU RGB + D dataset, the GCN-based action recognition method greatly outperforms the other types of action recognition methods, proving that graph convolutional networks have great advantages. Compared with the spatial-temporal graph convolutional neural network model, our method improves accuracy by 9% under the CS metric, reaching 90%, and by 6% under the CV metric, reaching 94%.

To verify the effectiveness of our method among similar optimizations of graph convolutional neural networks, we compared four algorithms that perform well among current graph convolutional network variants in terms of both the number of parameters (Params) and accuracy (Acc), namely, AS-GCN [33], 2S-AGCN [26], NAS-GCN [34], and Shift-GCN [35]. The validation was carried out on the NTU RGB + D dataset with the CS evaluation metric, and the comparison results are shown in Table 4.

Table 4 reveals the findings of the experimental comparison. Compared with AS-GCN, 2S-AGCN, and NAS-GCN under the CS evaluation index, our method is more efficient in both the number of model parameters and accuracy, reaching an accuracy of 91%. The Shift-GCN method introduces a more complex hyperbolic space structure that further optimizes classification accuracy. Even though our accuracy is not as good as that of Shift-GCN, our improved method adopts the inception structure to form a more compact model, with only one-fifth the parameters of Shift-GCN, which greatly decreases the computational cost. Furthermore, this model has fewer parameters than all the previous ones. All of this demonstrates the efficacy of our strategy.

5. Conclusion

In this paper, we present a deep learning method for human action recognition based on the IST-GCN framework, which improves the recognition accuracy of the model while reducing the model parameters. First, we add a tensor rotation module in the graph convolution layer to better capture the global features of the graph task. Then we add the inception structure in the temporal convolution layer to build a multiscale temporal convolution filter that obtains temporal information over different temporal perceptual domains and reduces computation. Finally, we perform experimental validation on the public dataset NTU RGB + D: the accuracy reaches 90% under the CS evaluation and 94% under the CV evaluation. The results reveal that our optimized method is robust and accurate; it not only improves the efficiency of graph topology learning but also greatly decreases the volume of parameters. Compared with the spatial-temporal graph convolutional neural network model and similar graph convolutional optimization algorithms, the advantages of our method are outstanding.

As the experimental results in Table 4 show, there is still a gap between the accuracy of our method and that of Shift-GCN. Although we have a clear advantage in the number of parameters, accuracy is always the primary index for assessing human action recognition. In future work, we will consider using a hyperbolic spatial structure to optimize accuracy while keeping the volume of parameters small, aiming at a human action recognition model with high accuracy, few parameters, high robustness, and good stability.

Data Availability

The dataset can be accessed upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.