Abstract
With the development of image processing technology and deep learning, recognizing human movement has become feasible, but automatically recognizing and evaluating dance movements, with their artistic expression and emotional content, remains difficult. Aiming at the low efficiency, low accuracy, and unsatisfactory evaluation of dance motion recognition, this paper proposes a deep-learning-based long short-term memory (LSTM) model to recognize dance motion and automatically generate the corresponding features. The paper first introduces related deep learning recognition methods and describes the research background. It then presents the concrete method and process for identifying dance movements. Finally, experiments comparing different dance movements show clear advantages in recognition accuracy, error rate, similarity, and model evaluation.
1. Introduction
Since the start of the twenty-first century, computer technology has made a qualitative leap, and scientific and technological achievements have been widely applied in daily life, continually reshaping the world. Human thought, art, and creativity remain the most unique and difficult things to replicate, and dance is one of the carriers of artistic inheritance. Making the "dance" created by machines without "emotion" recognizable to human beings is both a grand goal and a significant research project. It can be applied to education, entertainment, animation, games, and other fields, freeing human labor and creating new artistic life. To date, computer-generated dance movements are stiff and out of rhythm, and the generation process still remains at the stage of retrieving matches from an existing dance movement database, so new dances cannot be created automatically. In this paper, a data set of music and dance movements containing about 270,000 frames is constructed, dance posture features are extracted by human posture detection technology, a dedicated music feature encoder is designed, and a system for the automatic generation of dance movements is proposed based on deep learning. Literature [1] proposed the deep learning framework DanceNet3D, in which dance is generated with a parameterized motion transformer. A MIDI music emotional annotation system integrates music and dance movements to complete automatic choreography [2]. Literature [3] made an autonomous humanoid robot generate dance from real-time music input. Literature [4] designed an automatic facial animation generation system for dance characters that considers the emotions in dance and music. Literature [5] selected motions to generate dance according to connection similarity. Literature [6] proposed the Music2Dance system to solve automatic music-driven choreography.
Literature [7] introduced a system and method for automatically generating dance notation. Literature [8] generated a human skeleton sequence from music to produce the final video. Literature [9] took violin or piano performance audio as input and output skeleton-predicted video for animated avatars. One method can generate realistic, subject-independent video directly from raw audio [10]. Literature [11] proposed an effective method for real-time 2D pose detection in images using part affinity fields. Literature [12] reviewed the latest trends in video-based human capture and analysis toward automatic visual analysis of motion. Literature [13] automatically generated Labanotation from human motion capture data stored in BVH files. Literature [14] surveyed research on target detection based on convolutional neural networks. Literature [15] designed an LSTM recurrent neural network speech recognition system based on i-vector features.
2. Theoretical Basis
Due to space limitations, only brief explanations are given of deep learning [16], LSTM [17], dance generation algorithms [18], the autoencoder [19], and related topics.
2.1. Deep Learning Model
We divide deep learning models into three categories. The first is the commonly used convolutional neural network (CNN). The second is the self-encoding neural network based on multilayer neurons, which includes autoencoding and sparse coding. The third is the deep belief network (DBN), which mainly uses a multilayer self-encoding neural network for pretraining.
Deep learning has two characteristics. First, it emphasizes the depth of the model structure; second, it clarifies the importance of feature learning, which makes classification and prediction easier.
The high bias or high variance state of the deep network is shown in Figure 1.

In the application of deep learning, a good deal of mathematical knowledge is used. The relevant definitions and formulas are as follows:

(1) Upper and lower bounds: a set S ⊆ ℝ is bounded above by M if x ≤ M for all x ∈ S, and bounded below by m if x ≥ m for all x ∈ S.

(2) Concavity and convexity of functions [20]: f is convex if f(λx₁ + (1 − λ)x₂) ≤ λf(x₁) + (1 − λ)f(x₂) for all λ ∈ [0, 1].

(3) Sigmoid function: σ(x) = 1 / (1 + e^(−x)).

(4) Directional derivative: D_u f(x) = ∇f(x) · u, where u is a unit vector.
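As a quick illustration of two of the formulas above, the following sketch evaluates the sigmoid function and approximates a directional derivative numerically with a finite difference (the example function f(x, y) = x² + y² is our own choice, not one from the paper):

```python
import numpy as np

def sigmoid(x):
    # Sigmoid activation: sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def directional_derivative(f, x, u, eps=1e-6):
    # Finite-difference approximation of D_u f(x) along the unit vector u
    u = u / np.linalg.norm(u)
    return (f(x + eps * u) - f(x)) / eps

# f(x, y) = x^2 + y^2 has gradient (2, 2) at (1, 1),
# so the derivative along u = (1, 0) is 2.
f = lambda v: v[0] ** 2 + v[1] ** 2
d = directional_derivative(f, np.array([1.0, 1.0]), np.array([1.0, 0.0]))
print(float(sigmoid(0.0)))      # 0.5
print(round(float(d), 3))       # 2.0
```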
2.2. Dance Generation Algorithm
As shown in Table 1, we summarized some dance generation algorithms.
2.3. LSTM
A recurrent neural network is referred to as an "RNN." It can effectively deal with data in time series format.
The basic structure of the recurrent neural network is shown in Figure 2. A recurrent neural network has a closed loop, through which time series information can be fed into the network layer continuously at different times. This recurrent structure reflects the close relationship between RNNs and time series data.

In Figure 2, ht is called the state or hidden state, and A is the hidden layer, that is, the recurrent layer.
The long short-term memory (LSTM) network is a special type of RNN. Although RNNs have achieved impressive results in recent years, ordinary recurrent networks are trained mainly by the back propagation algorithm, and learning long-term dependence is very difficult because of vanishing and exploding gradients. LSTM can solve the long-term dependence problem and works well on time series.
The cell state in each network layer, and the horizontal line passing through the cell, are the key to LSTM. The cell structure is like a conveyor belt: data run directly along the chain with only a small amount of linear interaction.
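The gated structure described above can be sketched as a single LSTM step in numpy. This is a minimal illustration of the standard LSTM cell equations, not the paper's implementation; the weight shapes and random inputs are our own assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One LSTM step. W stacks the four gate weight matrices
    (input, forget, candidate, output) applied to [h_prev, x]."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(z[0:H])        # input gate
    f = sigmoid(z[H:2*H])      # forget gate
    g = np.tanh(z[2*H:3*H])    # candidate cell state
    o = sigmoid(z[3*H:4*H])    # output gate
    c = f * c_prev + i * g     # cell state: the "conveyor belt", only linear interaction
    h = o * np.tanh(c)         # hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4                    # input and hidden sizes (illustrative)
W = rng.normal(0, 0.1, (4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):   # run a 5-step sequence
    h, c = lstm_cell(x, h, c, W, b)
print(h.shape)  # (4,)
```

Because the hidden state is o · tanh(c), each of its components stays in (−1, 1) regardless of sequence length, which is part of what keeps training stable.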
2.4. Self-Encoder
The autoencoder consists of an encoder and a decoder, as shown in Figure 3.

The detailed process is as follows: the encoder maps an input x to a code h = σ(Wx + b), and the decoder reconstructs x′ = σ(W′h + b′), wherein W and b are the weights and biases of the encoding and W′ and b′ are the weights and biases of the decoding. The whole training process uses the traditional gradient-based training method.
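The encode-decode pass can be sketched directly from these equations. The dimensions and random weights below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
D, H = 8, 3                                         # input dim, bottleneck dim
W,  b  = rng.normal(0, 0.1, (H, D)), np.zeros(H)    # encoder weights/biases
Wp, bp = rng.normal(0, 0.1, (D, H)), np.zeros(D)    # decoder weights/biases

x = rng.random(D)
h     = sigmoid(W @ x + b)          # code:           h  = sigma(W x + b)
x_hat = sigmoid(Wp @ h + bp)        # reconstruction: x' = sigma(W' h + b')
loss  = np.mean((x - x_hat) ** 2)   # reconstruction (MSE) loss to minimize
print(h.shape, x_hat.shape)
```

Gradient-based training would then backpropagate the reconstruction loss through both layers; that step is omitted here for brevity.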
The characteristics of the self-encoder are shown in Table 2.
2.5. Attention Mechanism
The core idea of the attention model is to introduce attention weights over the input sequence and to learn them by adding an extra feedforward neural network to the Sequence-to-Sequence architecture. The encoder hidden states and the previous decoder state are used as the input of this network, and the attention weight is then learned.
The attention mechanism has three advantages. First, it improves the interpretability of neural networks. Second, it mitigates the performance degradation that long input sequences cause in plain recurrent networks. Third, it improves deep learning performance by allowing the model to dynamically focus on the relevant parts of the input.
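A minimal sketch of this additive (feedforward) attention, assuming a toy encoder sequence and randomly initialized score-network weights of our own choosing: each encoder state is scored against the previous decoder state, the scores are normalized with a softmax into attention weights, and a context vector is formed as their weighted sum.

```python
import numpy as np

def additive_attention(s_prev, enc_h, Wa, Ua, va):
    """Score each encoder state against the previous decoder state with a
    small feedforward net, then softmax into attention weights."""
    scores = np.tanh(enc_h @ Ua.T + s_prev @ Wa.T) @ va   # one score per time step
    e = np.exp(scores - scores.max())                     # stable softmax
    alpha = e / e.sum()                                   # attention weights, sum to 1
    context = alpha @ enc_h                               # weighted sum of encoder states
    return alpha, context

rng = np.random.default_rng(2)
T, H, A = 6, 4, 5              # sequence length, hidden size, attention size (illustrative)
enc_h  = rng.normal(size=(T, H))
s_prev = rng.normal(size=H)
Wa, Ua, va = rng.normal(size=(A, H)), rng.normal(size=(A, H)), rng.normal(size=A)

alpha, context = additive_attention(s_prev, enc_h, Wa, Ua, va)
print(round(float(alpha.sum()), 6))  # 1.0
```

The weights alpha make the model's focus inspectable per time step, which is the interpretability advantage mentioned above.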
3. System Design and Implementation
3.1. General System Design
This chapter discusses the overall design of the dance automatic generation system, focusing on the temporal characteristics of the generated dance data and the music data. If music and dance merely coexist without being well integrated, the design of the system has failed; the overall design is shown in Figure 4.

The system first extracts audio and motion features. It then feeds the audio features into the dance generator and, in parallel, passes them through the autoencoder module. In this way we obtain the predicted dance pose with its MSE loss against the real pose, as well as the audio reconstruction loss. Finally, the system feeds the predicted and real dance poses into the discriminator to train the model.
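The training objective implied by this pipeline can be sketched as the sum of the two losses. The tensors below are random stand-ins with dimensions of our own choosing (50 frames, 18 keypoints × 2 coordinates, a 32-dimensional audio feature); the real system would substitute the generator's and autoencoder's actual outputs:

```python
import numpy as np

rng = np.random.default_rng(3)
T, P, F = 50, 36, 32               # frames, pose dim (18 joints x 2), audio feature dim

audio     = rng.normal(size=(T, F))
real_pose = rng.normal(size=(T, P))
pred_pose = rng.normal(size=(T, P))   # stand-in for the dance generator's output
audio_rec = rng.normal(size=(T, F))   # stand-in for the autoencoder's reconstruction

mse_loss = np.mean((pred_pose - real_pose) ** 2)   # pose prediction loss
rec_loss = np.mean((audio_rec - audio) ** 2)       # audio reconstruction loss
total    = mse_loss + rec_loss                     # combined objective before the
                                                   # discriminator's adversarial term
print(float(total) > 0.0)
```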
3.2. Music Feature Extraction
Music feature extraction is divided into prosody and rhythm. For prosody, we mainly choose the 24-dimensional Mel-frequency cepstral feature and the 8-dimensional Tempogram feature, which are closest to human auditory processing. These vectors represent the melody of the audio and give the sound a more satisfactory feature representation, as shown in Table 3.
Rhythm features are introduced because both music and dance have fixed beats and rhythms. Without this feature, the generated music and dance would be arranged in a messy and inconsistent way. In addition, because sound features slowly fade in a deep network, it is necessary to introduce rhythm features, as shown in Table 4.
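One common way to capture rhythm, sketched here as an assumption rather than the paper's exact feature, is spectral flux: the positive frame-to-frame change in the magnitude spectrum, which peaks at note onsets and beats. The frame and hop sizes below are illustrative:

```python
import numpy as np

def onset_strength(signal, frame=256, hop=128):
    """Rhythm feature sketch: spectral flux, the positive change in the
    magnitude spectrum between consecutive windowed frames."""
    frames = [signal[i:i + frame] * np.hanning(frame)
              for i in range(0, len(signal) - frame, hop)]
    mags = np.abs(np.fft.rfft(np.array(frames), axis=1))
    flux = np.diff(mags, axis=0)
    return np.maximum(flux, 0.0).sum(axis=1)   # one onset-strength value per frame

sr = 8000
t = np.arange(sr) / sr
# A 440 Hz tone gated on and off ~4 times per second, a crude "beat"
click_train = np.sin(2 * np.pi * 440 * t) * (np.sin(2 * np.pi * 4 * t) > 0.99)
env = onset_strength(click_train)
print(env.shape[0] > 0)
```

Tempogram-style features can then be derived from the periodicity of this envelope.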
3.3. Open Pose Attitude Detection
Considering the budget and usage of this research, we chose among several of the most popular human posture estimation algorithms and selected the OpenPose open-source library; human posture is described with connected coordinates. It can effectively detect the 2D movements of single or multiple people in dance videos.
Pose detection is shown in Figures 5 and 6.


Eighteen key points are selected. In this way, the feature representation can be well partitioned to represent the dance posture, as shown in Table 5.
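A plausible reading of the Join + Line feature fusion used later in the experiments is: concatenate the raw keypoint coordinates ("Joint") with the vectors along limbs between connected keypoints ("Line"). The limb connection list below is a hypothetical skeleton, not necessarily the one the paper uses:

```python
import numpy as np

# Hypothetical limb pairs (indices into the 18 keypoints); the actual
# OpenPose skeleton connections may differ.
LIMBS = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
         (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13)]

def joint_line_features(keypoints):
    """Fuse 'Joint' features (raw 2D coordinates) with 'Line' features
    (vectors along limbs between connected keypoints)."""
    joints = keypoints.reshape(-1)                               # 18 x 2 = 36 values
    lines = np.concatenate([keypoints[b] - keypoints[a] for a, b in LIMBS])
    return np.concatenate([joints, lines])

pose = np.random.default_rng(4).random((18, 2))                  # one frame of keypoints
feat = joint_line_features(pose)
print(feat.shape)  # (36 + 2 * len(LIMBS),) = (62,)
```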
Training data is shown in Table 6.
Finally, the optimization goal of the model is to minimize the pose MSE loss together with the audio reconstruction loss described in the overall system design.
3.4. Automatic Generation Design of Dance
In this module, the generator transforms the feature vector to obtain the required dance posture. Simply generating each posture of the dance is not enough: we introduce an improved Pix2Pix algorithm to render the dance as a real person's posture, making it smoother and more realistic.
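As a simple stand-in for the temporal smoothing that makes generated pose sequences less jittery (the paper itself uses an improved Pix2Pix for realistic rendering, which is far beyond this sketch), a moving average over the time axis already reduces frame-to-frame variation:

```python
import numpy as np

def smooth_sequence(poses, k=5):
    """Moving-average smoothing over the time axis of a pose sequence.
    poses: (frames, features) array; k: window size (illustrative)."""
    kernel = np.ones(k) / k
    out = np.empty_like(poses)
    for j in range(poses.shape[1]):
        out[:, j] = np.convolve(poses[:, j], kernel, mode="same")
    return out

rng = np.random.default_rng(7)
seq = rng.random((60, 36))               # 60 frames of flattened pose vectors
smoothed = smooth_sequence(seq)
# Smoothing reduces the mean frame-to-frame change
print(np.abs(np.diff(smoothed, axis=0)).mean()
      < np.abs(np.diff(seq, axis=0)).mean())  # True
```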
Feature dimension vectors are mapped by human posture as shown in Figure 7.

4. Experimental Analysis
4.1. Experimental Environment Setting
The specific experimental environment settings are shown in Table 7.
4.2. Performance Comparison of LSTM
In this part, we test the performance of LSTM with different features and select the best ones. Test data for the left and right legs are intercepted; the method is fixed as LSTM and only the features differ, as shown in Figure 8.

We find that LSTM based on Join + Line features performs best, 2.5% higher than the single-feature methods. Therefore, in the feature part of dance generation research, the Join + Line feature fusion method is the best choice.
To better highlight the advantages of this method, we compare the performance of the LSTM method adopted in this paper with other common methods. Test data for the left and right legs are intercepted, and the accuracy on the two data sets is compared. The LSTM method in this paper has the highest accuracy, averaging 94.33%, as shown in Figure 9.

4.3. Loss Function Analysis
The experiment in this section targets the music feature extraction stage. We analyze different data set processing methods and compare their influence on the final loss function. Filtering erroneous data is very important for the final result: if the wrongly extracted key coordinate points of the human body are eliminated in the extraction stage, the final result improves significantly, as shown in Figure 10.
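A minimal sketch of this filtering step, under the assumption that the pose detector reports a per-keypoint confidence (OpenPose does, returning 0 for missed joints) and that a whole frame is discarded when any keypoint is unreliable; the threshold and data are illustrative:

```python
import numpy as np

def filter_bad_frames(poses, conf, min_conf=0.1):
    """Drop frames whose detected keypoints are unreliable: any keypoint
    with confidence below min_conf marks the whole frame as erroneous."""
    keep = (conf >= min_conf).all(axis=1)
    return poses[keep], keep

rng = np.random.default_rng(5)
poses = rng.random((100, 18, 2))        # 100 frames of 18 keypoints
conf = rng.random((100, 18))
conf[::10] = 0.0                        # simulate a missed detection every 10th frame
clean, keep = filter_bad_frames(poses, conf)
print(clean.shape[0], int(keep.sum()))
```

Stricter alternatives (interpolating dropped frames from neighbors instead of discarding them) would preserve sequence continuity for the generator.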

After filtering the wrong data, the Generator-only model has the lowest loss, averaging 11.93; the average loss of the Generator + Discriminator model is 14.61; and with the autoencoder added, the average loss is as high as 17.36, which has the greatest influence and the worst fitting ability, as shown in Figure 11.

4.4. Analysis of Dance Sequence Results
The dance effect is measured by similarity, computed as a distance between the generated and the real dance, so a lower value is better. From the figure, we find that LSTM-PCA has the highest value, up to 0.205, while the Generator + Discriminator + Autoencoder model has the best effect, with a value as low as 0.063, superior to the other methods. The details are shown in Figure 12.
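One way such a similarity-as-distance metric can be computed, sketched here as an assumption rather than the paper's exact definition, is the mean Euclidean distance between corresponding keypoints of the generated and real pose sequences:

```python
import numpy as np

def pose_distance(gen, real):
    """Mean Euclidean distance between corresponding keypoints of two
    pose sequences of shape (frames, joints, 2); lower means the
    generated dance is closer to the real one."""
    return float(np.linalg.norm(gen - real, axis=-1).mean())

rng = np.random.default_rng(6)
real  = rng.random((120, 18, 2))                     # 120 frames of 18 2D keypoints
close = real + rng.normal(0, 0.01, real.shape)       # small perturbation of the real dance
far   = rng.random((120, 18, 2))                     # unrelated pose sequence
print(pose_distance(close, real) < pose_distance(far, real))  # True
```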

4.5. System Usage Analysis
The basic performance of the system is tested professionally.
4.5.1. Evaluation of Dance Authenticity
Because the dances in our system are arranged and designed by a computer, we invite an audience to score and evaluate their authenticity; the scores also test the integrity, smoothness, and rationality of the dance.
We first invited 20 ordinary spectators who volunteered to participate, along with five experts in dance and music. They scored the authenticity of the dances generated by our system. The dances watched are divided into 5 types, with 15 dance fragments in total. The realism score ranges from 0 (lowest) to 10 (highest). Model 5 has the best effect, with an average authenticity of 8 points, as shown in Figure 13.

4.5.2. Music Consistency Evaluation
What this system emphasizes most is the consistency of music and dance. We selected three music data sets, Kpop, Poppin, and Hip-hop, for testing and asked the audience to score each model. The Kpop data set receives the highest scores across the five models: 4.54, 5.61, 6.54, 8.01, and 9.01 points, respectively. This shows that the Kpop music type is most suitable for this system, with high consistency between the generated dance and the music, as shown in Figure 14.

4.6. Image Quality Evaluation
The quality of the generated images is evaluated. Generator + Discriminator + Autoencoder is rated as high as 40.37, with the best image quality, more detail, and clear dance posture, as shown in Table 8.
5. Conclusion
The results show that:

(1) The LSTM method in this paper is more efficient than other existing methods and has the highest accuracy, averaging 94.33%; the Join + Line feature performs best.

(2) Erroneous data have the greatest influence. After filtering error data, the Generator-only model has the lowest loss, averaging 11.93, and the best fitting ability.

(3) Measured by the similarity between the generated dance and the real dance, the Generator + Discriminator + Autoencoder model has the best effect, with a value as low as 0.063, superior to the other methods.

(4) Model 5 is best in the evaluation of dance authenticity, averaging 8 points. The Kpop music type is most suitable for dance generation in this system, with high music consistency.
The system designed in this paper basically meets the requirements. However, its content is not yet rich enough and remains somewhat simplistic, while the deep learning structure is overly complex. In practice, people exhibit many different movement postures, and the automatic dance generation system still has gaps and shortcomings. In the future, more dance data sets should be added for training and better models should be sought.
Data Availability
The experimental data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest regarding this work.