Abstract
With the development of image processing technology and deep learning, recognizing human movement has become feasible, but automatically recognizing and evaluating dance movements, with their artistic expression and emotional content, remains difficult. Aiming at the low efficiency, low accuracy, and unsatisfactory evaluation of dance motion recognition, this paper proposes a deep-learning-based long short-term memory (LSTM) model to recognize dance motion and automatically generate the corresponding features. The paper first introduces related deep learning recognition methods and describes the research background. It then presents the concrete method and process for identifying dance movements. Finally, experiments comparing different dance movements show clear advantages in recognition accuracy, error rate, similarity, and model evaluation.
1. Introduction
Since the start of the twenty-first century, computer technology has made a qualitative leap, and scientific and technological achievements have been widely applied in daily life, continually reshaping the world. Human thought, art, and creativity remain the most unique and difficult things to replicate, and dance is one of the carriers of artistic inheritance. Making the "dance" created by machines without "emotion" recognizable to human beings is both a grand goal and a significant research project. It can be applied to education, entertainment, animation, games, and other fields, freeing human labor and creating new artistic life. To date, computer-generated dance movements are stiff and out of rhythm, and the generation process still remains at the stage of retrieving matches from an existing dance movement database, so new dances cannot be created automatically. In this paper, a data set of music and dance movements containing about 270,000 frames is constructed, dance posture features are extracted by human posture detection technology, a dedicated music feature encoder is designed, and a system for the automatic generation of dance movements is proposed based on deep learning. Literature [1] proposed the deep learning framework DanceNet3D, in which dance is generated with a parameterized motion transformer. A MIDI music emotional annotation system integrates music and dance movements to complete automatic choreography [2]. Literature [3] made an autonomous humanoid robot generate dance from real-time music input. Literature [4] designed an automatic facial animation generation system for dance characters that considers the emotions in dance and music. Literature [5] selected motions to generate dance according to connection similarity. Literature [6] proposed the Music2Dance system to solve automatic music-driven choreography.
Literature [7] introduced a system and method for automatically generating dance notation. Literature [8] generated a human skeleton sequence from music to produce the final video. Literature [9] took violin or piano performance audio as input and output skeleton-predicted video for animated avatars. One method can generate realistic, subject-independent video directly from raw audio [10]. Literature [11] proposed an effective method for real-time 2D pose detection in images using part affinity fields. Literature [12] reviewed the latest trends in video-based human capture and analysis toward automatic visual analysis of motion. Literature [13] automatically generated Labanotation from human motion capture data stored in BVH files. Literature [14] surveyed research on target detection based on convolutional neural networks. Literature [15] designed an LSTM recurrent neural network speech recognition system based on i-vector features.
2. Theoretical Basis
Due to space limitations, only brief explanations are given of deep learning [16], LSTM [17], dance generation algorithms [18], the autoencoder [19], and related topics.
2.1. Deep Learning Model
We divide deep learning models into three categories. The first is the commonly used convolutional neural network (CNN). The second is the self-encoding neural network based on multilayer neurons, which includes autoencoding and sparse coding. The third is the deep belief network (DBN), which mainly uses a multilayer self-encoding neural network for pretraining.
Deep learning has two characteristics. First, it emphasizes the depth of the model structure; second, it clarifies the importance of feature learning, which makes classification and prediction easier.
The high bias or high variance state of the deep network is shown in Figure 1.

In the application of deep learning, a good deal of mathematical knowledge is used. The relevant definitions and formulas are as follows:

(1) Upper and lower bounds: a set S ⊆ ℝ is bounded above by M if x ≤ M for all x ∈ S, and bounded below by m if x ≥ m for all x ∈ S.

(2) Concavity and convexity of functions [20]: f is convex if f(λx₁ + (1 − λ)x₂) ≤ λf(x₁) + (1 − λ)f(x₂) for all λ ∈ [0, 1].

(3) Sigmoid function: σ(x) = 1 / (1 + e^(−x)).

(4) Directional derivative: D_u f(x) = ∇f(x) · u, where u is a unit vector.
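As a quick illustration of two of the formulas above, the following sketch evaluates the sigmoid function and approximates a directional derivative numerically with a finite difference (the example function f(x, y) = x² + y² is our own choice, not one from the paper):

```python
import numpy as np

def sigmoid(x):
    # Sigmoid activation: sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def directional_derivative(f, x, u, eps=1e-6):
    # Finite-difference approximation of D_u f(x) along the unit vector u
    u = u / np.linalg.norm(u)
    return (f(x + eps * u) - f(x)) / eps

# f(x, y) = x^2 + y^2 has gradient (2, 2) at (1, 1),
# so the derivative along u = (1, 0) is 2.
f = lambda v: v[0] ** 2 + v[1] ** 2
d = directional_derivative(f, np.array([1.0, 1.0]), np.array([1.0, 0.0]))
print(float(sigmoid(0.0)))      # 0.5
print(round(float(d), 3))       # 2.0
```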
2.2. Dance Generation Algorithm
As shown in Table 1, we summarized some dance generation algorithms.
2.3. LSTM
A recurrent neural network is referred to as an "RNN." It can effectively deal with data in time series format.
The basic structure of the recurrent neural network is shown in Figure 2. A recurrent neural network has a closed loop, through which time series information can be fed into the network layer continuously at different times. This recurrent structure reflects the close relationship between RNNs and time series data.

In Figure 2, ht is called the state or hidden state, and A is the hidden layer, that is, the recurrent layer.
The long short-term memory (LSTM) network is a special type of RNN. Although RNNs have achieved impressive results in recent years, ordinary recurrent networks are trained mainly by the back propagation algorithm, and learning long-term dependence is very difficult because of vanishing and exploding gradients. LSTM can solve the long-term dependence problem and works well on time series.
The cell state in each network layer, and the horizontal line passing through the cell, are the key to LSTM. The cell structure is like a conveyor belt: data run directly along the chain with only a small amount of linear interaction.
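The gated structure described above can be sketched as a single LSTM step in numpy. This is a minimal illustration of the standard LSTM cell equations, not the paper's implementation; the weight shapes and random inputs are our own assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One LSTM step. W stacks the four gate weight matrices
    (input, forget, candidate, output) applied to [h_prev, x]."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(z[0:H])        # input gate
    f = sigmoid(z[H:2*H])      # forget gate
    g = np.tanh(z[2*H:3*H])    # candidate cell state
    o = sigmoid(z[3*H:4*H])    # output gate
    c = f * c_prev + i * g     # cell state: the "conveyor belt", only linear interaction
    h = o * np.tanh(c)         # hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4                    # input and hidden sizes (illustrative)
W = rng.normal(0, 0.1, (4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):   # run a 5-step sequence
    h, c = lstm_cell(x, h, c, W, b)
print(h.shape)  # (4,)
```

Because the hidden state is o · tanh(c), each of its components stays in (−1, 1) regardless of sequence length, which is part of what keeps training stable.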
2.4. Self-Encoder
The autoencoder consists of an encoder and a decoder, as shown in Figure 3.

The detailed process is as follows: the encoder maps an input x to a code h = σ(Wx + b), and the decoder reconstructs x′ = σ(W′h + b′), wherein W and b are the weights and biases of the encoding and W′ and b′ are the weights and biases of the decoding. The whole training process uses the traditional gradient-based training method.
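The encode-decode pass can be sketched directly from these equations. The dimensions and random weights below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
D, H = 8, 3                                         # input dim, bottleneck dim
W,  b  = rng.normal(0, 0.1, (H, D)), np.zeros(H)    # encoder weights/biases
Wp, bp = rng.normal(0, 0.1, (D, H)), np.zeros(D)    # decoder weights/biases

x = rng.random(D)
h     = sigmoid(W @ x + b)          # code:           h  = sigma(W x + b)
x_hat = sigmoid(Wp @ h + bp)        # reconstruction: x' = sigma(W' h + b')
loss  = np.mean((x - x_hat) ** 2)   # reconstruction (MSE) loss to minimize
print(h.shape, x_hat.shape)
```

Gradient-based training would then backpropagate the reconstruction loss through both layers; that step is omitted here for brevity.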
The characteristics of the self-encoder are shown in Table 2.
2.5. Attention Mechanism
The core idea of the attention model is to introduce attention weights over the input sequence and to learn them by adding an extra feedforward neural network to the Sequence-to-Sequence architecture. The encoder hidden states and the previous decoder state are used as the input of this network, and the attention weight is then learned.
The attention mechanism has three advantages. First, it improves the interpretability of neural networks. Second, it mitigates the performance degradation that long input sequences cause in plain recurrent networks. Third, it improves deep learning performance by allowing the model to dynamically focus on the relevant parts of the input.
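A minimal sketch of this additive (feedforward) attention, assuming a toy encoder sequence and randomly initialized score-network weights of our own choosing: each encoder state is scored against the previous decoder state, the scores are normalized with a softmax into attention weights, and a context vector is formed as their weighted sum.

```python
import numpy as np

def additive_attention(s_prev, enc_h, Wa, Ua, va):
    """Score each encoder state against the previous decoder state with a
    small feedforward net, then softmax into attention weights."""
    scores = np.tanh(enc_h @ Ua.T + s_prev @ Wa.T) @ va   # one score per time step
    e = np.exp(scores - scores.max())                     # stable softmax
    alpha = e / e.sum()                                   # attention weights, sum to 1
    context = alpha @ enc_h                               # weighted sum of encoder states
    return alpha, context

rng = np.random.default_rng(2)
T, H, A = 6, 4, 5              # sequence length, hidden size, attention size (illustrative)
enc_h  = rng.normal(size=(T, H))
s_prev = rng.normal(size=H)
Wa, Ua, va = rng.normal(size=(A, H)), rng.normal(size=(A, H)), rng.normal(size=A)

alpha, context = additive_attention(s_prev, enc_h, Wa, Ua, va)
print(round(float(alpha.sum()), 6))  # 1.0
```

The weights alpha make the model's focus inspectable per time step, which is the interpretability advantage mentioned above.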
3. System Design and Implementation
3.1. General System Design
This chapter discusses the overall design of the dance automatic generation system, focusing on the temporal characteristics of the generated dance data and the music data. If music and dance merely coexist without being well integrated, the design of the system has failed; the overall design is shown in Figure 4.

The system first extracts audio and motion features. It then feeds the audio features into the dance generator and, in parallel, passes them through the autoencoder module. In this way we obtain the predicted dance pose with its MSE loss against the real pose, as well as the audio reconstruction loss. Finally, the system feeds the predicted and real dance poses into the discriminator to train the model.
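The training objective implied by this pipeline can be sketched as the sum of the two losses. The tensors below are random stand-ins with dimensions of our own choosing (50 frames, 18 keypoints × 2 coordinates, a 32-dimensional audio feature); the real system would substitute the generator's and autoencoder's actual outputs:

```python
import numpy as np

rng = np.random.default_rng(3)
T, P, F = 50, 36, 32               # frames, pose dim (18 joints x 2), audio feature dim

audio     = rng.normal(size=(T, F))
real_pose = rng.normal(size=(T, P))
pred_pose = rng.normal(size=(T, P))   # stand-in for the dance generator's output
audio_rec = rng.normal(size=(T, F))   # stand-in for the autoencoder's reconstruction

mse_loss = np.mean((pred_pose - real_pose) ** 2)   # pose prediction loss
rec_loss = np.mean((audio_rec - audio) ** 2)       # audio reconstruction loss
total    = mse_loss + rec_loss                     # combined objective before the
                                                   # discriminator's adversarial term
print(float(total) > 0.0)
```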
3.2. Music Feature Extraction
Music feature extraction is divided into prosody and rhythm. For prosody, we mainly choose the 24-dimensional Mel-frequency cepstral feature and the 8-dimensional Tempogram feature, which are closest to human auditory processing. These vectors represent the melody of the audio and give the sound a more satisfactory feature representation, as shown in Table 3.
Rhythm features are introduced because both music and dance have fixed beats and rhythms. Without this feature, the generated music and dance would be arranged in a messy and inconsistent way. In addition, because sound features slowly fade in a deep network, it is necessary to introduce rhythm features, as shown in Table 4.
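One common way to capture rhythm, sketched here as an assumption rather than the paper's exact feature, is spectral flux: the positive frame-to-frame change in the magnitude spectrum, which peaks at note onsets and beats. The frame and hop sizes below are illustrative:

```python
import numpy as np

def onset_strength(signal, frame=256, hop=128):
    """Rhythm feature sketch: spectral flux, the positive change in the
    magnitude spectrum between consecutive windowed frames."""
    frames = [signal[i:i + frame] * np.hanning(frame)
              for i in range(0, len(signal) - frame, hop)]
    mags = np.abs(np.fft.rfft(np.array(frames), axis=1))
    flux = np.diff(mags, axis=0)
    return np.maximum(flux, 0.0).sum(axis=1)   # one onset-strength value per frame

sr = 8000
t = np.arange(sr) / sr
# A 440 Hz tone gated on and off ~4 times per second, a crude "beat"
click_train = np.sin(2 * np.pi * 440 * t) * (np.sin(2 * np.pi * 4 * t) > 0.99)
env = onset_strength(click_train)
print(env.shape[0] > 0)
```

Tempogram-style features can then be derived from the periodicity of this envelope.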
3.3. Open Pose Attitude Detection
Considering the budget and usage of this research, we chose among several of the most popular human posture estimation algorithms and selected the OpenPose open-source library; human posture is described with connected coordinates. It can effectively detect the 2D movements of single or multiple people in dance videos.
Pose detection is shown in Figures 5 and 6.


Eighteen key points are selected. In this way, the feature representation can be well partitioned to represent the dance posture, as shown in Table 5.
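A plausible reading of the Join + Line feature fusion used later in the experiments is: concatenate the raw keypoint coordinates ("Joint") with the vectors along limbs between connected keypoints ("Line"). The limb connection list below is a hypothetical skeleton, not necessarily the one the paper uses:

```python
import numpy as np

# Hypothetical limb pairs (indices into the 18 keypoints); the actual
# OpenPose skeleton connections may differ.
LIMBS = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
         (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13)]

def joint_line_features(keypoints):
    """Fuse 'Joint' features (raw 2D coordinates) with 'Line' features
    (vectors along limbs between connected keypoints)."""
    joints = keypoints.reshape(-1)                               # 18 x 2 = 36 values
    lines = np.concatenate([keypoints[b] - keypoints[a] for a, b in LIMBS])
    return np.concatenate([joints, lines])

pose = np.random.default_rng(4).random((18, 2))                  # one frame of keypoints
feat = joint_line_features(pose)
print(feat.shape)  # (36 + 2 * len(LIMBS),) = (62,)
```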
Training data is shown in Table 6.
Finally, the optimization goal of the model is to minimize the pose MSE loss together with the audio reconstruction loss described in the overall system design.
3.4. Automatic Generation Design of Dance
In this module, the generator transforms the feature vector to obtain the required dance posture. Simply generating each posture of the dance is not enough: we introduce an improved Pix2Pix algorithm to render the dance as a real person's posture, making it smoother and more realistic.
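As a simple stand-in for the temporal smoothing that makes generated pose sequences less jittery (the paper itself uses an improved Pix2Pix for realistic rendering, which is far beyond this sketch), a moving average over the time axis already reduces frame-to-frame variation:

```python
import numpy as np

def smooth_sequence(poses, k=5):
    """Moving-average smoothing over the time axis of a pose sequence.
    poses: (frames, features) array; k: window size (illustrative)."""
    kernel = np.ones(k) / k
    out = np.empty_like(poses)
    for j in range(poses.shape[1]):
        out[:, j] = np.convolve(poses[:, j], kernel, mode="same")
    return out

rng = np.random.default_rng(7)
seq = rng.random((60, 36))               # 60 frames of flattened pose vectors
smoothed = smooth_sequence(seq)
# Smoothing reduces the mean frame-to-frame change
print(np.abs(np.diff(smoothed, axis=0)).mean()
      < np.abs(np.diff(seq, axis=0)).mean())  # True
```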
Feature dimension vectors are mapped by human posture as shown in Figure 7.

4. Experimental Analysis
4.1. Experimental Environment Setting
The specific experimental environment settings are shown in Table 7.
4.2. Performance Comparison of LSTM
In this part, we test the performance of LSTM with different features and select the best ones. Test data for the left and right legs are intercepted; the method is fixed as LSTM and only the features differ, as shown in Figure 8.

We find that LSTM based on Join + Line features performs best, 2.5% higher than the single-feature methods. Therefore, in the feature part of dance generation research, the Join + Line feature fusion method is the best choice.
To better highlight the advantages of this method, we compare the performance of the LSTM method adopted in this paper with other common methods. Test data for the left and right legs are intercepted, and the accuracy on the two data sets is compared. The LSTM method in this paper has the highest accuracy, averaging 94.33%, as shown in Figure 9.

4.3. Loss Function Analysis
The experiment in this section targets the music feature extraction stage. We analyze different data set processing methods and compare their influence on the final loss function. Filtering erroneous data is very important for the final result: if the wrongly extracted key coordinate points of the human body are eliminated in the extraction stage, the final result improves significantly, as shown in Figure 10.
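A minimal sketch of this filtering step, under the assumption that the pose detector reports a per-keypoint confidence (OpenPose does, returning 0 for missed joints) and that a whole frame is discarded when any keypoint is unreliable; the threshold and data are illustrative:

```python
import numpy as np

def filter_bad_frames(poses, conf, min_conf=0.1):
    """Drop frames whose detected keypoints are unreliable: any keypoint
    with confidence below min_conf marks the whole frame as erroneous."""
    keep = (conf >= min_conf).all(axis=1)
    return poses[keep], keep

rng = np.random.default_rng(5)
poses = rng.random((100, 18, 2))        # 100 frames of 18 keypoints
conf = rng.random((100, 18))
conf[::10] = 0.0                        # simulate a missed detection every 10th frame
clean, keep = filter_bad_frames(poses, conf)
print(clean.shape[0], int(keep.sum()))
```

Stricter alternatives (interpolating dropped frames from neighbors instead of discarding them) would preserve sequence continuity for the generator.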

After filtering the wrong data, the Generator-only model has the lowest loss, averaging 11.93; the average loss of the Generator + Discriminator model is 14.61; and with the autoencoder added, the average loss is as high as 17.36, which has the greatest influence and the worst fitting ability, as shown in Figure 11.

4.4. Analysis of Dance Sequence Results
The dance effect is measured by similarity, computed as a distance between the generated and the real dance, so a lower value is better. From the figure, we find that LSTM-PCA has the highest value, up to 0.205, while the Generator + Discriminator + Autoencoder model has the best effect, with a value as low as 0.063, superior to the other methods. The details are shown in Figure 12.
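One way such a similarity-as-distance metric can be computed, sketched here as an assumption rather than the paper's exact definition, is the mean Euclidean distance between corresponding keypoints of the generated and real pose sequences:

```python
import numpy as np

def pose_distance(gen, real):
    """Mean Euclidean distance between corresponding keypoints of two
    pose sequences of shape (frames, joints, 2); lower means the
    generated dance is closer to the real one."""
    return float(np.linalg.norm(gen - real, axis=-1).mean())

rng = np.random.default_rng(6)
real  = rng.random((120, 18, 2))                     # 120 frames of 18 2D keypoints
close = real + rng.normal(0, 0.01, real.shape)       # small perturbation of the real dance
far   = rng.random((120, 18, 2))                     # unrelated pose sequence
print(pose_distance(close, real) < pose_distance(far, real))  # True
```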

4.5. System Usage Analysis
The basic performance of the system is tested professionally.
4.5.1. Evaluation of Dance Authenticity
Because the dances in our system are arranged and designed by a computer, we invite an audience to score and evaluate their authenticity; the scores also test the integrity, smoothness, and rationality of the dance.
We first invited 20 ordinary spectators who volunteered to participate, along with five experts in dance and music. They scored the authenticity of the dances generated by our system. The dances watched are divided into 5 types, with 15 dance fragments in total. The realism score ranges from 0 (lowest) to 10 (highest). Model 5 has the best effect, with an average authenticity of 8 points, as shown in Figure 13.

4.5.2. Music Consistency Evaluation
What this system emphasizes most is the consistency of music and dance. We selected three music data sets, Kpop, Poppin, and Hip-hop, for testing and asked the audience to score each model. The Kpop data set receives the highest scores across the five models: 4.54, 5.61, 6.54, 8.01, and 9.01 points, respectively. This shows that the Kpop music type is most suitable for this system, with high consistency between the generated dance and the music, as shown in Figure 14.

4.6. Image Quality Evaluation
The quality of the generated images is evaluated. Generator + Discriminator + Autoencoder is rated as high as 40.37, with the best image quality, more detail, and clear dance posture, as shown in Table 8.
5. Conclusion
The results show that:

(1) The LSTM method in this paper is more efficient than other existing methods and has the highest accuracy, averaging 94.33%; the Join + Line feature performs best.

(2) Erroneous data have the greatest influence. After filtering error data, the Generator-only model has the lowest loss, averaging 11.93, and the best fitting ability.

(3) Measured by the similarity between the generated dance and the real dance, the Generator + Discriminator + Autoencoder model has the best effect, with a value as low as 0.063, superior to the other methods.

(4) Model 5 is best in the evaluation of dance authenticity, averaging 8 points. The Kpop music type is most suitable for dance generation in this system, with high music consistency.
The system designed in this paper basically meets the requirements. However, its content is not yet rich enough and remains somewhat simplistic, while the deep learning structure is overly complex. In practice, people exhibit many different movement postures, and the automatic dance generation system still has gaps and shortcomings. In the future, more dance data sets should be added for training and better models should be sought.
Data Availability
The experimental data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest regarding this work.