Abstract

This paper investigates the extraction of volleyball players’ skeleton information and provides a deep learning-based solution for recognizing the players’ actions. For this purpose, a convolutional neural network-based approach to recognizing volleyball players’ actions is used. The Lie group skeleton representation used to describe the extracted features has a large data dimension. A convolutional neural network is therefore used for feature learning and classification in order to process the high-dimensional data, reduce the complexity of the recognition process, and speed up the calculation. In the feature extraction stage, this paper uses the Lie group skeleton representation model to extract the geometric features of the skeleton information; in the feature representation stage, it uses the geometric transformation (rotation and translation) between different limbs to represent the volleyball players’ movements. The approach is evaluated using the datasets Florence3D actions, MSR action pairs, and UTKinect action. The average recognition rate of our method is 93.00%, which is higher than that of widely cited existing methods and reflects better accuracy and robustness.

1. Introduction

Volleyball players’ action recognition has attracted considerable attention in recent years, thanks to advances in computer vision, artificial intelligence, and pattern recognition. Virtual reality, medical rehabilitation, game creation, video surveillance, multimedia video retrieval, and other disciplines have all benefited from this field of research. Early studies on volleyball player action recognition relied mostly on standard RGB colour video. The most common approaches were feature representations based on optical flow and motion information, spatiotemporal interest points, descriptive feature representations, and static feature representations based on shape. These approaches must contend with a number of difficulties, including the volleyball players’ high degree of freedom, background clutter, camera movement and zooming, lighting changes, and video noise. As a result, action recognition based on classic RGB colour video alone is no longer a satisfactory option. With the emergence of depth sensors such as Microsoft Kinect and ASUS XTION, the way of obtaining motion information has significantly improved. Using Kinect to rapidly and reliably extract bone information and depth information of human movements not only captures more of the volleyball players’ motion information but also helps to overcome obstacles such as lighting and body temperature changes. External environmental factors influence the changes in volleyball players’ body shapes. Based on the bone information of the players, researchers currently use angle change, joint position change, and the relative geometric relationship between the limbs to describe the players’ activity.

Machine learning is a popular approach to volleyball players’ action recognition. Algorithms such as the support vector machine (SVM) and boosting are widely used. With the increasing convenience of data acquisition, deep learning algorithms have come to the fore. Thanks to its strong feature learning ability, deep learning greatly improves the recognition effect when applied to volleyball players’ action recognition. In human interaction, understanding a person’s action or behavior is an important aid to understanding each other’s intention, and people often spend a lot of time and energy observing and explaining others’ behavior. Today, with the rapid development of the economy and of science and technology, people want machines to understand human behavior, so that human-computer interaction can be carried out more naturally and machines can serve human beings better. The purpose of action recognition of volleyball players is to let the machine automatically analyze the actions of volleyball players from their motion data. The authors of [1] used a motion perception experiment to sense the changes of the joint position and motion of the experimenter. The experimental results show that human vision detects not only the direction of motion but also different types of limb motion patterns, including the activity and speed of different motion patterns. The experiment is also considered pioneering work in the field of volleyball players’ behavior recognition. Zatsiorsky et al. [2] regarded volleyball players as a hinge system connected by joint points and then defined their actions as a continuous time transformation of limbs in space. This research laid the foundation for the development of action recognition of volleyball players based on skeletal limbs.
With the advent and application of Kinect and other depth sensors, the authors carried out pioneering work to estimate the joint position of volleyball players from the depth map [3], which promoted the development of research on action recognition of volleyball players based on bone joints. In the field of computer vision, researchers define a volleyball player’s skeleton as a schematic model composed of the head, trunk, and limbs. At present, two kinds of volleyball player’s schematic models are widely used by the researchers, which are shown in Figure 1.

According to Turaga et al. [4], the behavior recognition of volleyball players is mainly action recognition. Volleyball players’ action recognition mainly includes two stages: feature extraction and feature classification. For feature extraction, early volleyball players’ action recognition based on RGB colour video mainly used manually extracted action features, such as HOG/HOF features, HOG3D features, and SURF features [5]. This feature extraction method is laborious and depends mainly on the experience of researchers, which limits its room for development. With the emergence of Kinect, researchers began to use the joint angle, joint position, and other key parts of volleyball players as movement characteristics. In recent years, a feature extraction method based on the 3D relative geometric relationship between limbs has been proposed. The advantage of this method is that it can better overcome the problems of similarity between movements and intraclass difference. Feature classification is the process of assigning feature data to different actions. The classical classifiers include the support vector machine (SVM) [6–8], hidden Markov model (HMM) [9, 10], and Bayesian networks (BNs) [11, 12]. In [13], the authors proposed deep belief networks (DBNs), which promoted the development of deep learning in academia and industry. As an extension of machine learning, deep learning has achieved great success in the field of image recognition, and it was gradually introduced into the field of dynamic video behavior analysis. The advantage of deep learning lies in deep feature learning for massive data, strong nonlinear fitting ability, and high-dimensional data processing ability, which gives it broad applications in feature extraction and classification. At present, deep learning is used in the fields of action recognition, speech recognition, speech emotion recognition, and text emotion recognition.

The rest of the paper is organized as follows. In Section 2, related work is discussed. In Section 3, the proposed action recognition method of volleyball players using a deep learning approach is discussed. In Section 4, experimental results and analysis are provided. Finally, the paper is concluded in Section 5.

2. Related Work

In this section, we provide the related work. First, the existing technology used for volleyball players’ action recognition is discussed in Section 2.1. Next, the volleyball players’ sports information acquisition technology is discussed in Section 2.2. Finally, the acquisition of bone information of the players is discussed in Section 2.3.

2.1. Technology for Volleyball Players’ Action Recognition

In human communication, the actions of volleyball players, similar to their language, play an important role in conveying information. Research on volleyball players’ action recognition is often carried out in a modular way, that is, action data acquisition, action feature extraction, and feature classification. At present, the popular classic data acquisition methods are Kinect somatosensory technology and motion capture technology [14–19]. The methods of action feature extraction mostly depend on the data source and mainly include (1) feature extraction based on the RGB colour image and depth image, which mainly extracts the spatial features of volleyball players’ movement, and (2) feature extraction based on bone information, which mainly extracts the position coordinates of bones and joints, spatiotemporal changes, and limb angles. Common methods include spatiotemporal points of interest (STIPs), shape context, 3D joint point histogram (HOJ3D), and the nonlinear 3D geometric relationship between limbs. Feature classification is the process of assigning different features to specific actions. At present, the more popular classification methods are the support vector machine (SVM), hidden Markov model (HMM), random forest, and deep learning models, such as CNNs and DBNs.

The framework of the proposed deep learning-based method for volleyball players’ action recognition is shown in Figure 2. In the data acquisition stage, because Kinect makes it easy to extract volleyball players’ bone information, this paper uses Kinect somatosensory technology for data acquisition. In the feature extraction stage, the Lie group skeleton representation model is used to extract the geometric features of the skeleton information, and the geometric transformation (rotation and translation) between different limbs is used to represent the volleyball players’ movements. For feature classification, convolutional neural networks (CNNs) are used for feature learning and classification.
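To make the idea of a rotation and translation between two limbs concrete, the following is a minimal sketch rather than the paper's actual Lie group construction. It assumes each limb is given as a pair of 3D joint positions, computes the rotation that aligns one limb's direction with the other's using the Rodrigues formula, and takes the translation between their starting joints. The function names (`rotation_between`, `relative_transform`) and the joint coordinates are hypothetical, and zero-length limbs are not handled.

```python
import math

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def unit(a):
    length = math.sqrt(dot(a, a))
    return [x / length for x in a]

def matvec(R, v):
    return [dot(row, v) for row in R]

def rotation_between(u, v):
    """3x3 rotation matrix that turns direction u onto direction v."""
    u, v = unit(u), unit(v)
    axis = cross(u, v)
    s = math.sqrt(dot(axis, axis))  # sine of the angle between u and v
    c = dot(u, v)                   # cosine of the angle between u and v
    I = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    if s < 1e-12:
        if c > 0:
            return I  # directions already aligned
        # Opposite directions: rotate by pi about any axis perpendicular to u.
        helper = [1.0, 0.0, 0.0] if abs(u[0]) < 0.9 else [0.0, 1.0, 0.0]
        k = unit(cross(u, helper))
        return [[2.0 * k[i] * k[j] - I[i][j] for j in range(3)] for i in range(3)]
    k = [x / s for x in axis]
    K = [[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]]
    K2 = [[sum(K[i][m] * K[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    # Rodrigues formula: R = I + sin(theta) K + (1 - cos(theta)) K^2
    return [[I[i][j] + s * K[i][j] + (1.0 - c) * K2[i][j] for j in range(3)]
            for i in range(3)]

def relative_transform(limb_a, limb_b):
    """Rotation and translation relating limb_a to limb_b (simplified)."""
    (a_start, a_end), (b_start, b_end) = limb_a, limb_b
    R = rotation_between(sub(a_end, a_start), sub(b_end, b_start))
    t = sub(b_start, a_start)
    return R, t

# Hypothetical joint coordinates for two limbs.
limb_a = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
limb_b = ((0.0, 1.0, 0.0), (0.0, 2.0, 0.0))
R, t = relative_transform(limb_a, limb_b)
print(matvec(R, [1.0, 0.0, 0.0]), t)
```

Stacking such rotation and translation pairs over all limb pairs and all frames gives some intuition for why the resulting feature dimension grows quickly.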

2.2. Volleyball Players’ Sports Information Acquisition Technology

For volleyball players’ action data acquisition, obtaining appropriate action data will greatly promote the effect of action recognition. At present, the main data sources of volleyball players’ actions include RGB-D video data, portable sensor data, depth information of volleyball players’ actions, and bone information of volleyball players’ actions, among which the open databases based on video data include KTH [20, 21], Weizmann [22], UCF sports [23], UCF101 [24], daily living [25], and YouTube [26]. Public databases based on depth information and bone information include MSR action 3D, MSR action pairs [27], NTU RGB + D [28], UTKinect action [29], and G3D gaming [29]. This paper mainly extracts bone information for volleyball players’ action recognition.

2.3. Acquisition of Bone Information of Volleyball Players

The data acquisition of volleyball players’ bone information is the key step for the players’ movement analysis, which is of great value to analyze the changes in their posture and obtain movement information. Kinect’s powerful function is its ability of bone tracking. Within the time delay range allowed by the system, it can quickly build the players’ limbs according to their bone joints. There are two states of the skeleton: (1) when the skeleton is at rest at a certain time, it is a volleyball player’s posture; (2) when the joints or limbs in the bone are in the state of motion in space, they appear as the actions or behaviors of the players.

3. Action Recognition Method of Volleyball Players Based on Deep Learning

Feature classification is one of the key steps of action recognition for volleyball players. The design of the classifier directly affects the results of action recognition. This paper uses the convolutional neural network to learn and classify the action features. The biggest disadvantage of the Lie group skeleton representation model is that, when it represents the volleyball players’ bones, it calculates the three-dimensional geometric relationship between bones and limbs in each frame and then stacks these relationships over the whole action sequence. This results in a relatively high feature dimension. Taking advantage of the high-dimensional data processing ability and feature learning ability of deep learning, this section uses the convolutional neural network to classify the action features. This reduces the complexity of data processing, saves computational cost, and yields a better action recognition effect.

3.1. Convolutional Neural Network

The characteristics of the convolutional neural network (CNN) lie in local area perception, weight sharing, and temporal or spatial sampling. These characteristics make it possible to use fewer training parameters when using the CNN for data training. The CNN model reduces the complexity of the network, improves the calculation speed and generalization ability, makes the model invariant to translation, distortion, and scaling to a certain extent, and makes the model robust and fault-tolerant.
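The effect of weight sharing on the number of trainable parameters can be illustrated with a short calculation. This is a sketch under stated assumptions: a single-channel input of hypothetical size 100 × 100, and the 13 × 13 kernel with 46 feature maps that this paper uses for its first convolutional layer. The helper names are illustrative only.

```python
def conv_params(kernel, in_maps, out_maps):
    """Weights plus biases of a convolutional layer with shared kernels."""
    return kernel * kernel * in_maps * out_maps + out_maps

def dense_params(in_units, out_units):
    """Weights plus biases of a fully connected layer without weight sharing."""
    return in_units * out_units + out_units

# Convolutional layer: 13 x 13 kernels, 1 input map (assumed), 46 output maps.
shared = conv_params(13, 1, 46)

# Fully connected layer producing the same number of outputs from a
# hypothetical 100 x 100 input (the output maps would be 88 x 88 each).
in_units = 100 * 100
out_units = 88 * 88 * 46
unshared = dense_params(in_units, out_units)

print(shared)             # 7820
print(unshared > shared)  # True
```

The convolutional layer needs a few thousand parameters regardless of the input size, whereas the fully connected equivalent grows with the product of input and output sizes.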

In the CNN, multiple feature maps constitute the convolutional layer, and multiple neurons constitute the feature map. Each neuron is locally connected with the feature map of the previous layer through the convolution kernel. In the structure of the CNN, the deeper the network and the greater the number of feature maps, the larger the feature space that the network can represent and the stronger its learning ability. However, greater depth and a larger number of feature maps can lead to overfitting. The convolution kernel is a weight matrix that is learned by the network model and used to extract features automatically. The convolutional layer of the CNN extracts different features by applying convolution kernels to the input data. In the first convolutional layer, low-level features such as edges, lines, and contours are often extracted, so this layer can act as an edge detector. The deeper the convolutional layer, the higher the level of the features it extracts. After convolution, the size of the feature map is calculated as follows.

Let the size of the input feature map be m × n, the convolution kernel be k × k, and the sliding step (stride) of the convolution kernel be s. Without padding, the size of the output feature map is

$$\left(\left\lfloor \frac{m-k}{s} \right\rfloor + 1\right) \times \left(\left\lfloor \frac{n-k}{s} \right\rfloor + 1\right).$$

In the convolution process, the relationship between the input and output is

$$x_j^{l} = f\left(\sum_{i \in M_j} x_i^{l-1} * w_{ij}^{l} + b_j^{l}\right).$$

In this equation, f is the activation function, which is used to change the input signal into the output signal. The commonly used activation functions are the sigmoid function, tanh function, ReLU function, radial basis function, and so on. $x_i^{l-1}$ is the i-th input feature map from layer l − 1, $x_j^{l}$ is the j-th output feature map of layer l, $M_j$ is the set of input feature maps connected to output map j, $w_{ij}^{l}$ is the transform weight, and $b_j^{l}$ is the bias parameter. The convolution process is to slide the convolution kernel over the input matrix, multiply each weight of the convolution kernel by the data at the corresponding position of the input matrix, and add the results to get the final convolution result. The specific process is shown in Figure 3. In Figure 4, the size of the input feature map of the input layer is 4 × 5, the size of the convolution kernel is 2 × 2, and the sliding step is set as 1. At the beginning of sliding, the neurons in the blue box of the input feature map are convolved with the convolution kernel to get the value of the blue box neuron of the output layer. Similarly, when the kernel slides to the red box area of the input layer, the neurons in this area are convolved with the convolution kernel to get the value of the red box neuron of the output layer. Finally, after convolution, the output feature map is smaller than the input feature map, i.e., 3 × 4.
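The sliding-window process and the 4 × 5 example described above can be checked with a small pure-Python sketch. It assumes no padding, omits the bias and activation function for brevity, and, like most CNN implementations, does not flip the kernel. The function names and the all-ones input and kernel are illustrative choices, not values from the paper.

```python
def conv_output_size(m, n, k, s):
    """Output height and width for an m x n input, k x k kernel, stride s."""
    return (m - k) // s + 1, (n - k) // s + 1

def conv2d_valid(x, kernel, stride=1):
    """Slide the kernel over x and sum the element-wise products."""
    m, n = len(x), len(x[0])
    k = len(kernel)
    out_h, out_w = conv_output_size(m, n, k, stride)
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    acc += kernel[a][b] * x[i * stride + a][j * stride + b]
            row.append(acc)
        out.append(row)
    return out

# A 4 x 5 input and a 2 x 2 kernel with stride 1, as in the example above.
x = [[1.0] * 5 for _ in range(4)]
kernel = [[1.0, 1.0], [1.0, 1.0]]
y = conv2d_valid(x, kernel, stride=1)
print(len(y), len(y[0]))  # 3 4
```

Each output value here is the sum of the four input values under the kernel, so every entry of `y` equals 4.0 for the all-ones input.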

Given the high dimension of the action features extracted in this paper, and with reference to the CNN model in [17], the basic structure of the CNN model proposed in this paper is shown in Figure 5. The first layer is a convolutional layer, in which a group of convolution kernels of size 13 × 13 is used to convolve the input features; the number of feature maps is set to 46. The second layer is a pooling layer, which uses max pooling with a pooling kernel of size 4 × 4. This layer reduces the feature dimension while keeping the same number of feature maps as the previous layer. The third layer is the second convolutional layer of the model; its convolution kernel size is set to 8 × 8, and the number of feature maps is 78. The fourth layer is again a max pooling layer with a pooling kernel of size 4 × 4. After these convolution and pooling operations, the feature dimension is greatly reduced. The fifth layer is a fully connected layer, which connects the local features into a 128-dimensional global feature vector. The sixth layer is the output layer, in which the number of neurons equals the number of action categories and which performs the final classification. To avoid overfitting during training, given the large amount of data, this section introduces weight decay in the loss function, i.e., L2 regularization, with coefficient λ = 0.008. In addition, a momentum coefficient of 0.9 is introduced into gradient descent to accelerate convergence. During the experiment, the learning rate of the network is 0.0001.
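The training settings quoted above (L2 coefficient 0.008, momentum 0.9, learning rate 0.0001) can be combined into a single parameter update. The following is a sketch of one common formulation, not necessarily the exact update rule used in the paper: it assumes the L2 penalty is (λ/2)‖w‖², so that its gradient is λw, and it uses the classical momentum form. The function name and the toy weight and gradient values are hypothetical.

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=0.0001, momentum=0.9, weight_decay=0.008):
    """One gradient descent step with momentum and L2 weight decay."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g_total = g + weight_decay * w    # add gradient of (lambda/2) * w^2
        v_next = momentum * v - lr * g_total
        new_velocity.append(v_next)
        new_weights.append(w + v_next)
    return new_weights, new_velocity

# Toy values for illustration only.
weights, velocity = [1.0, -2.0], [0.0, 0.0]
grads = [0.5, -0.25]
weights, velocity = sgd_momentum_step(weights, grads, velocity)
print(weights, velocity)
```

The velocity term accumulates past gradients, which is what speeds up convergence along consistent descent directions, while the weight decay term pulls each weight slightly toward zero.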

In this paper, experiments are carried out on the open database Florence3D action. The size and quantity of each layer’s feature map after its input and convolution and pooling operations are shown in Table 1.
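How the feature-map size changes through the convolution and pooling layers can be traced with a few lines of code. This sketch assumes a stride of 1 and no padding for the convolutional layers, non-overlapping pooling (stride equal to the pooling kernel size), and a hypothetical 100 × 100 single-channel input. The actual input size and per-layer sizes are those reported in Table 1, which this example does not reproduce.

```python
def conv_out(size, kernel, stride=1):
    """Output length along one axis for an unpadded convolution."""
    return (size - kernel) // stride + 1

def pool_out(size, kernel):
    """Output length along one axis for non-overlapping pooling."""
    return (size - kernel) // kernel + 1

def trace_shapes(height, width):
    """Return (layer name, feature maps, height, width) after each layer."""
    layers = [
        ("conv1", "conv", 13, 46),   # 13 x 13 kernels, 46 feature maps
        ("pool1", "pool", 4, None),  # 4 x 4 max pooling
        ("conv2", "conv", 8, 78),    # 8 x 8 kernels, 78 feature maps
        ("pool2", "pool", 4, None),  # 4 x 4 max pooling
    ]
    maps, shapes = 1, []
    for name, kind, kernel, out_maps in layers:
        if kind == "conv":
            height, width = conv_out(height, kernel), conv_out(width, kernel)
            maps = out_maps
        else:
            height, width = pool_out(height, kernel), pool_out(width, kernel)
        shapes.append((name, maps, height, width))
    return shapes

for shape in trace_shapes(100, 100):
    print(shape)
```

Under these assumptions the final pooled output is 78 maps of size 3 × 3, which would be flattened before the 128-dimensional fully connected layer.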

4. Experimental Results and Analysis

Based on the volleyball players’ action feature extraction and classification, this section verifies the accuracy of the proposed recognition method using the open database Florence3D. The experimental results on three databases show that the proposed method can achieve a good action recognition effect on public databases. It should be emphasized that this method transfers well between databases. In other words, once the model has been trained on one database, it can easily be applied to another database for experiments. There is no need to retune network parameters such as the number of network layers, the number of feature maps, the size of the convolution kernel, and the size of the pooling kernel.

4.1. Experimental Analysis on the Florence3D Action Database

Firstly, the action sequence in the Florence3D action database is represented by the Lie group skeleton model for feature extraction. After feature preprocessing, the feature matrix is obtained. Out of the 215 action sequences in the database, 115 action sequences are selected as the training set, and the remaining 100 action sequences are selected as the test set, according to the setting idea of the training set and test set in [14]. The convolutional neural network proposed in this paper is used for feature recognition and classification. The recognition rate changes with the number of iterations, as shown in Figure 6. The average recognition rate of the proposed method is 93.00%. It can be seen from the analysis in Figure 6 that, with the increase of iteration times, the average recognition rate of database actions gradually tends to be stable, indicating that the network training is good. The experimental results in the Florence3D action database are shown in Table 2. According to the comparative experimental results, it is not difficult to see that this method has achieved good recognition results.

From the analysis of Table 2, it can be seen that the action recognition method of volleyball players in this paper achieves a better recognition effect than some other popular action recognition methods. In particular, the average recognition rate of this method is 11% higher than that of L. Seidenari et al. [27], R. Vemulapalli [30], and others, who used the Lie group skeleton representation and the support vector machine (SVM) to recognize volleyball players’ movements. Compared with the existing methods, this paper not only achieves a better recognition effect but also consumes less time over the whole training process. The SVM does not reduce the dimension of high-dimensional data, but the CNN used in this paper can effectively process high-dimensional data, reduce the data complexity, and save computational cost. At the same time, the CNN has excellent feature learning ability, which is conducive to feature learning and classification.

When using the convolutional neural network for feature learning and classification, it is difficult to select the appropriate number of feature maps. If the number of feature maps is too small, some features that are beneficial to network learning may be ignored. If the number of feature maps is too large, the number of network training parameters and the training time increase, which is not conducive to the learning of the network model. In this paper, experiments are carried out with different numbers of feature maps. Table 3 shows the average recognition results for three cases. It can be seen from the table that, for the CNN used in this paper, a better recognition effect is achieved when the number of feature maps of the first convolutional layer is 46 and the number of feature maps of the second convolutional layer is 78.

In the process of network training, the selection of network model parameters will greatly affect the effectiveness of the model, for example, the selection of key parameters such as the weight between the input layer feature map and the output layer feature map and bias parameters, which will produce different recognition results. In this paper, in order to achieve a satisfactory recognition effect, several experiments were carried out by selecting different parameter combinations (mainly including the size of the convolution kernel in different convolutional layers). These experiments took into account the actual situation of the high dimension of action-featured data extracted in this paper and also took the CNN model as a reference. The recognition results of several representative key model parameters are shown in Table 4.

It can be concluded from Table 4 that although several CNN models with different convolution kernel size combinations can effectively classify the Lie group features of volleyball players extracted in this paper, there are differences in the classification effect. The main reason is that convolution kernels of different sizes obtain different feature information when they process the features. Compared with the other combinations, relatively good recognition results are achieved when the convolution kernel size of the first convolutional layer is 13 × 13 and the convolution kernel size of the second convolutional layer is 8 × 8. Therefore, this paper selects these kernel sizes to build the network model. In order to clarify the action recognition performance of this method on the Florence3D action database, this paper presents the correct recognition rate and error recognition rate of each action in the form of a confusion matrix, as shown in Figure 7. According to the results presented by the confusion matrix, among the 9 volleyball players’ movements, 6 movements are identified completely accurately. Of the other three movements, 2 are recognized with more than 80% accuracy. When two actions are similar, they are easily confused during recognition. For example, the action “answer phone” has a 20% probability of being recognized as the action “drink” and a 10% probability of being recognized as the action “read watch.” All three of these actions involve lifting the hands and arms, with similar action trajectories and high similarity between actions.
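The per-action recognition and error rates in a confusion matrix can be computed from true and predicted labels as sketched below. The label lists are made-up toy data, with counts chosen so that the "answer phone" row reproduces the 20% and 10% confusion rates quoted above; they are not the paper's experimental outputs, and the function names are illustrative.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """counts[i][j] = number of samples of class i predicted as class j."""
    index = {name: i for i, name in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    return counts

def row_normalize(counts):
    """Convert counts to per-class rates; each non-empty row sums to 1."""
    rates = []
    for row in counts:
        total = sum(row)
        rates.append([c / total if total else 0.0 for c in row])
    return rates

classes = ["answer phone", "drink", "read watch"]

# Toy labels for illustration only.
true_labels = ["answer phone"] * 10 + ["drink"] * 5 + ["read watch"] * 5
predicted_labels = (["answer phone"] * 7 + ["drink"] * 2 + ["read watch"]
                    + ["drink"] * 5 + ["read watch"] * 5)

rates = row_normalize(confusion_matrix(true_labels, predicted_labels, classes))
for name, row in zip(classes, rates):
    print(name, row)
```

The diagonal entry of each row is the correct recognition rate for that action, and the off-diagonal entries show which other actions it is mistaken for.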

5. Conclusion

This paper examines the recognition of volleyball players’ actions using a convolutional neural network (CNN). It first briefly introduces deep learning and common deep learning models and then delves into the network structure, working principle, and benefits of the CNN before presenting a CNN model and parameter settings for volleyball players’ action recognition. Experiments were carried out on the open database Florence3D action, and the results show that the proposed method based on the Lie group feature and deep learning achieves a good recognition effect and transfers well between databases. At the same time, compared with the existing methods from the literature, it has a better recognition effect and robustness, and the computational cost is greatly reduced.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.