Abstract
Labanotation can represent virtually all human movements and is simple to read and preserve. However, recording labanotation manually takes a long time, so how to use labanotation to record and preserve traditional dance movements accurately and quickly is a key research question. This study proposes an automatic labanotation generation algorithm based on DL (deep learning). The BVH file is first parsed, and the data are then converted. On this foundation, a CNN (convolutional neural network) algorithm for generating the labanotation of human lower-limb movements is proposed; the network learns the spatial information of actions well and performs well in classification and recognition. Finally, a spatial segmentation-based automatic labanotation generation algorithm is proposed. First, every frame of data is converted into a symbol according to spatial rules, resulting in a very dense symbol sequence. The sequence is then regularized according to the minimum beat of the motion, which is obtained through wavelet analysis. Lastly, a classifier determines whether each symbol is retained. As a result, more accurate labanotation can be generated for simple human movements.
1. Introduction
Labanotation is a dance notation system created by the Hungarian Rudolf Laban that records human movements with various elements and symbols. Like a music staff, it is made up of lines and symbols. It is widely used in recording and teaching dance movements [1] because it can accurately and conveniently record and analyze human movements, and it is indispensable for the dissemination and communication of culture [2]. Using labanotation as a recording method is therefore extremely beneficial to the preservation and transmission of Chinese traditional folk dance.
Western dance art theory has progressed significantly since the turn of the century, with notable achievements in recording dance scores, dance steps, and rhythm. The most influential of these is labanotation, which not only has a significant impact on dance art but is also widely used in other fields involving human body movements [3, 4]. When motion capture technology first became available, it was quickly applied to the recording of dance movements. Research on data compression and preservation by computer has matured, and motion capture technology converts three-dimensional human motion into binary data that computers can process. Storing human motion data with mathematical models, however, is not only expensive but also easily damaged, difficult to read, and difficult to publicize. How to keep records of national dance movements is thus an important research topic.
The main work of this study is to develop an efficient algorithm for the automatic generation of labanotation from motion capture data. First, skeletal features that conform to the topological structure of the human body are designed according to the characteristics of human motion. Second, an efficient DL (deep learning) algorithm [5] is proposed for the first time to recognize human upper- and lower-limb movements. Finally, the designed features and the upper- and lower-limb movement recognition algorithm are integrated into an automatic labanotation generation platform, realizing an end-to-end system from motion capture data to labanotation.
The novelty of this study is that the labanotation analysis of lower-limb movements focuses on the dynamic process of movements, whereas CNN (convolutional neural network) can capture the context in time-series data very well. This study therefore examines and proposes the use of CNN to recognize lower-limb movements. Experimental comparison with other recognition algorithms demonstrates the superiority of the proposed CNN-based method in processing dynamic time-series data, as well as its high efficiency in generating labanotation for lower-limb movements.
2. Related Work
The advancement of computer science has influenced many fields [6–8], and labanotation researchers have likewise investigated the use of computers for drawing and displaying scores. Literature [9] used motion capture equipment to collect data from a 41-point model, focused on movements such as rotation, stepping, jumping, lifting, and descending, and output the notation through the Laban Writer interface; the movements studied therefore lacked universality. Literature [10] analyzes supporting and unsupported actions separately and, following the labanotation analysis principle, proposes a knowledge-rule-based method of automatically generating labanotation, detailing each part of the method and implementing a platform for automatic generation. Literature [11] realized the transformation process of labanotation with automatic segmentation and recognition, but the segmentation and action recognition rates were too low for practical application. Literature [12] refined the conversion with a rule-based labanotation conversion system, which is only suitable for some basic actions. To address inaccurate segmentation, literature [13] improves the segmentation algorithm and adopts an action segmentation algorithm based on rhythm analysis, which is only applicable to motions with a strong sense of rhythm, is complex, and cannot be applied in practice.
To generate labanotation automatically by computer, a technology that can capture human actions is essential. Before motion capture technology matured, the automatic generation of labanotation had not been formally studied. Literature [14] puts forward a method of generating labanotation for upper-limb movements by mapping labanotation symbols into three-dimensional coordinate space, which is a spatial analysis based on the Laban notation principle. However, that research concentrated on the expression of upper-limb movements in labanotation, while the more complex lower-limb movements and the movement of the human center of gravity were not analyzed or discussed. Literature [15] analyzes motion capture data with a rule-based method, but the pauses and segmentation of the motion are not considered when generating labanotation, so the generated labanotation is of low quality. Literature [16] summarizes this experience and completes the whole automatic labanotation conversion process, but its recognition rate is very low and far from the application level. Literature [17] optimized it and proposed a rule-based labanotation conversion system. However, there is still a long way to go before it reaches the level needed for practical teaching.
3. Overview of Labanotation and Motion Capture
3.1. Introduction to Labanotation
Labanotation is composed of a staff and symbols for movement elements. First, just as every note in music is written as a musical symbol, the style of dance movements is closely related to beats. Laban proposed to use rectangular symbols to represent each movement beat, with the length of a symbol representing the duration of the movement. Second, whereas a piece of music is written horizontally, Laban analyzed how dance movements unfold over time and across the limbs and proposed to record a dance in vertical columns.
Labanotation is based on science and vividly restores all human activities [18]. Its influence on the communication and dissemination of different dances in different countries is obvious. At present, labanotation has become a common language in dance circles.
Labanotation records the movements of the main parts of the human body in eleven vertical columns. These 11 columns are arranged on both sides of the center line and are used to describe the movements of various body parts. The left half of the human body and the head are recorded and described using columns on the left side of the center line; the right half of the human body and the head are recorded and described using columns on the right side of the center line. This notation method reflects the basic principles of human anatomy as well as the characteristics of standing and walking. More importantly, it makes it easier for people to read labanotation scores and comprehend the description of each movement in each section.
3.2. The Principle of Labanotation Drawing Based on Motion Capture Data
The automatic drawing of labanotation is realized by processing motion capture data. Each dance symbol in labanotation corresponds to an action, while each frame of data in BVH file contains the movement of 26 joint points of human body. Therefore, in the process of automatically generating labanotation based on motion capture data, BVH files need to be analyzed and processed, as shown in Figure 1.

Firstly, the BVH file is parsed, the semantics of each joint are uniquely determined by a quadruple, and the data area of the file is converted into position data.
Next, the motion capture data representing human motion are segmented and extracted. Four kinds of segmentation algorithms are adopted: the speed-threshold algorithm and its improved variant, PCA (principal component analysis) and its extended algorithm, the beat algorithm, and the feature coupling algorithm. In this way, the whole human motion is segmented into basic element actions [19, 20].
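The simplest of the segmentation strategies named above, the speed threshold, can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function name `segment_by_speed`, the single-joint input, and the threshold value are all assumptions.

```python
import math

def segment_by_speed(positions, frame_time, threshold):
    """Split one joint trajectory into motion segments with a speed threshold.

    positions : list of (x, y, z) world coordinates, one per frame
    frame_time: seconds per frame
    threshold : speed below which the joint is treated as at rest (assumed value)
    Returns a list of (start_frame, end_frame) pairs, end exclusive.
    """
    # Speed between each pair of consecutive frames.
    speeds = [math.dist(positions[i], positions[i - 1]) / frame_time
              for i in range(1, len(positions))]
    segments, start = [], None
    for i, s in enumerate(speeds, start=1):
        moving = s >= threshold
        if moving and start is None:
            start = i - 1                  # motion begins at the previous frame
        elif not moving and start is not None:
            segments.append((start, i))    # motion ended before frame i
            start = None
    if start is not None:                  # motion continues to the last frame
        segments.append((start, len(positions)))
    return segments

# Usage: rest, move along x for three frames, rest again.
track = [(0, 0, 0)] * 3 + [(1, 0, 0), (2, 0, 0), (3, 0, 0)] + [(3, 0, 0)] * 2
segments = segment_by_speed(track, frame_time=1.0, threshold=0.5)
```

A single global threshold is sensitive to noise and to slow movements, which is presumably why the text also lists an improved variant and three other algorithms.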
The segmented data are then processed by posture analysis, in which the posture transformation of the supported and unsupported parts is analyzed. Finally, the human motion transformation is recorded in the form of labanotation.
4. Research Method
On the basis of the dynamic characteristics of lower-limb movements in labanotation records, this study proposes for the first time to use a CNN model, a DL algorithm, to generate the labanotation of lower-limb movements from motion capture data. At the same time, with reference to Laban notation rules, this study puts forward a spatial segmentation method for upper-limb movement recognition.
4.1. Preprocessing of 3D Motion Capture Data
Considering the research difficulties in this study, it is particularly important to extract effective motion features from the experimental video data; the quality of feature selection directly affects the computational cost, recognition accuracy, and robustness of the platform, as shown in Figure 2.

As shown above, appropriate motion features are selected, but they cannot be calculated directly from the original BVH data, so the feature space must be converted, that is, the BVH format file must be converted [21].
The idea is to take a point expressed in the object coordinate system and transform it into the inertial coordinate system by multiplying it by a rotation matrix constructed between the two systems. The data in the experiment are rotated about the three coordinate axes in turn, so the overall rotation matrix is the product of the three single-axis rotation matrices. The rotation matrix about one axis is built from the sine and cosine of the rotation angle about that axis, and the matrices for the other two axes are obtained in the same way.
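As a sketch of this transformation in standard notation, the rotation can be written as below. The symbols (p, p', R, and the angles α, β, γ), the column-vector convention, and the Z-X-Y rotation order are assumptions made for illustration; in a BVH file the actual order is given by the CHANNELS declaration of each joint.

```latex
\[
\mathbf{p}' = R\,\mathbf{p}, \qquad
R = R_z(\gamma)\,R_x(\alpha)\,R_y(\beta)
\]
\[
R_z(\gamma) =
\begin{pmatrix}
\cos\gamma & -\sin\gamma & 0 \\
\sin\gamma & \cos\gamma & 0 \\
0 & 0 & 1
\end{pmatrix},
\quad
R_x(\alpha) =
\begin{pmatrix}
1 & 0 & 0 \\
0 & \cos\alpha & -\sin\alpha \\
0 & \sin\alpha & \cos\alpha
\end{pmatrix},
\quad
R_y(\beta) =
\begin{pmatrix}
\cos\beta & 0 & \sin\beta \\
0 & 1 & 0 \\
-\sin\beta & 0 & \cos\beta
\end{pmatrix}
\]
```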
According to the above formula, the transformation from object coordinate system to inertial coordinate system is completed, and then the world coordinate system is obtained by translation.
Because adjacent nodes in a BVH file are linked in a parent-child hierarchy, the position of a child node can be obtained from the position of its parent node. Given the accumulated inertial rotation matrix of all preceding nodes of a node, the offset of this node relative to its predecessor in the inertial coordinate system is obtained by applying that accumulated rotation to the local offset of the node.
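The parent-to-child position computation can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names (`rot_z`, `mat_vec`, `mat_mul`, `child_world_position`), the column-vector convention, and the two-link example are hypothetical, and only rotation about one axis is shown.

```python
import math

def rot_z(deg):
    """Rotation matrix about the Z axis (column-vector convention)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def mat_vec(m, v):
    """Multiply a 3 x 3 matrix by a 3-vector."""
    return [sum(m[r][k] * v[k] for k in range(3)) for r in range(3)]

def mat_mul(a, b):
    """Multiply two 3 x 3 matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def child_world_position(parent_pos, accumulated_rot, offset):
    """World position of a child node.

    accumulated_rot is the product of the rotations of all ancestors up to
    and including the parent; offset is the child's local OFFSET vector.
    """
    rotated = mat_vec(accumulated_rot, offset)
    return [p + o for p, o in zip(parent_pos, rotated)]

# Usage: a two-link chain (hip -> knee -> ankle), each link of length 1.
hip = [0.0, 1.0, 0.0]
hip_rot = rot_z(90)
knee = child_world_position(hip, hip_rot, [1.0, 0.0, 0.0])
knee_rot = mat_mul(hip_rot, rot_z(-90))   # accumulate the knee's own rotation
ankle = child_world_position(knee, knee_rot, [1.0, 0.0, 0.0])
```

Walking the hierarchy from the root and carrying the accumulated matrix downward gives the world position of every joint in a frame.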
Assuming that the world coordinates of a node at the previous moment and its world coordinates after ten frames are known, the displacement, velocity, and acceleration of the node follow from the change in position over that interval. The channel values needed here exist in the data block of the BVH file and can be read and used directly. Displacement, velocity, and acceleration reflect the posture information of the human body, and the angle reflects its rotation information. The included angle of the knee joint is taken as an example (the included angle of the elbow joint is computed in the same way), as shown in Figure 3:

Given the world coordinates of the three leg nodes that span the knee joint, the side lengths of the triangle formed by these nodes follow from the Euclidean distance between each pair, and the included angle of the knee joint follows from the law of cosines.
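The features discussed in this subsection can be sketched in a few lines. The original formulas are not reproduced here, so the finite-difference form below is one plausible reading rather than the study's exact definition; the function names and the choice of hip, knee, and ankle as the three leg nodes are assumptions.

```python
import math

def finite_differences(positions, frame_time, step=10):
    """Displacement, speed, and acceleration magnitudes of one node.

    positions : list of (x, y, z) world coordinates, one per frame
    frame_time: seconds per frame (the "Frame Time" value of a BVH file)
    step      : frame gap used for the difference (ten frames in the text)
    """
    dt = step * frame_time
    disp = [math.dist(positions[i], positions[i - step])
            for i in range(step, len(positions))]
    speed = [d / dt for d in disp]
    # Consecutive speed samples are one frame apart.
    accel = [(speed[i] - speed[i - 1]) / frame_time
             for i in range(1, len(speed))]
    return disp, speed, accel

def included_angle(a, b, c):
    """Included angle at node b (in degrees) of triangle a-b-c, law of cosines."""
    ab, bc, ac = math.dist(a, b), math.dist(b, c), math.dist(a, c)
    cos_b = (ab ** 2 + bc ** 2 - ac ** 2) / (2 * ab * bc)
    cos_b = max(-1.0, min(1.0, cos_b))     # guard against rounding error
    return math.degrees(math.acos(cos_b))

# Usage: a node moving at constant speed, and two knee configurations.
track = [(0.1 * i, 0.0, 0.0) for i in range(30)]
disp, speed, accel = finite_differences(track, frame_time=0.01)
knee_bent = included_angle((0, 1, 0), (0, 0, 0), (1, 0, 0))      # hip, knee, ankle
knee_straight = included_angle((0, 2, 0), (0, 1, 0), (0, 0, 0))
```

Because the angle depends only on distances between nodes, it is unaffected by where the performer stands or which way the body faces.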
4.2. Automatic Generation Algorithm of Labanotation Based on DL
4.2.1. Generation of Lower-Limb Labanotation
For the dynamic changes of lower-limb movements in labanotation records, this study proposes for the first time to use a powerful DL algorithm, the CNN (convolutional neural network). The most distinctive features of CNN are its weight-sharing mechanism and its convolution operation, which allow the network to learn spatial information fully and thus achieve good classification and recognition.
The main feature of CNN is that its network structure uses convolution calculations that other networks do not. It is one of the representative algorithms of deep learning. It is used in image processing [22], target detection [23], behavior recognition [24], natural language processing, and other fields [25–27] and has made extraordinary achievements [28].
On the whole, CNN is similar to other neural network structures in that it is composed of an input layer, hidden layers, and an output layer.
The convolution kernels of a CNN can learn the spatial information of human body movements well, and different convolution kernels capture different information. Therefore, compared with traditional machine-learning methods, the CNN structure learns and represents tasks more comprehensively and can achieve better performance.
In this study, a CNN-based method for human motion recognition in labanotation is proposed for the first time. Figure 4 is the framework diagram of human motion recognition based on CNN proposed in this study.

After the vector features of human skeleton nodes are extracted from the motion capture data, the dimension of each input sample is 20 × 14, which is smaller than the input dimension used in complex global human behavior recognition, so the CNN structure proposed in this study is adapted accordingly. The entire network is optimized in terms of recognition accuracy and response time. As shown in Figure 4, the proposed CNN structure is not particularly deep, and it is well suited to the task of automatic labanotation generation in both recognition performance and time performance.
In the CNN-based method for generating the labanotation of lower limbs proposed in this study, the network structure consists of two convolution layers, a pooling layer, and three fully connected layers. The output of the softmax function is the probability of each human action category, and the category with the largest probability is taken as the prediction of the network [29].
In order to prevent overfitting, a dropout layer is connected behind each fully connected layer. The whole network is optimized with the stochastic gradient descent algorithm to accelerate convergence. The objective of the whole network is to minimize the cross-entropy loss function [30], which compares the true label category of each sample with the category probabilities predicted by the network.
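The output stage described above (softmax, largest-probability prediction, cross-entropy loss) can be sketched in plain Python. This illustrates the standard definitions only; the function names and example scores are hypothetical, and the network that produces the scores is not shown.

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(probs):
    """Index of the category with the largest probability."""
    return max(range(len(probs)), key=probs.__getitem__)

def cross_entropy(probs, true_class):
    """Cross-entropy loss of one sample: -log of the true-class probability."""
    return -math.log(probs[true_class])

def mean_loss(batch_logits, labels):
    """Average cross-entropy over a batch, the quantity the optimizer minimizes."""
    losses = [cross_entropy(softmax(z), y) for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

# Usage: three action categories, true class 0.
probs = softmax([2.0, 1.0, 0.1])
label = predict(probs)
loss = cross_entropy(probs, 0)
```

The loss is small when the network assigns high probability to the true category and grows without bound as that probability approaches zero.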
4.2.2. Upper-Limb Motion Segmentation and Labanotation Generation
In labanotation, when a dancer raises a hand forward, a forward symbol is recorded; when the hand returns to its original position, however, the notation does not record a backward symbol as it would for the foot but records the original symbol. The recognition of arm movements therefore depends on the current spatial position of the arm.
When Laban notation records upper-limb movements, the symbol is judged according to the spatial position of the current motion state, and the symbol changes when the spatial position changes. Based on this rule, this study proposes a spatial segmentation method.
The first step, after the motion vector of each frame of arm movement is obtained, is to determine the symbol level, that is, the type in the vertical direction. Because people differ in height, and because an arm movement made while standing and one made while squatting may correspond to the same symbol, the judgment in the vertical direction still needs to be made using the included angle of the vector.
The motion vector is expressed in the object coordinate system, whose three axes are mutually perpendicular, so each axis can be regarded as the normal vector of the plane determined by the other two axes. The projection of the motion vector onto each of the three planes can then be calculated from the motion vector and the corresponding plane normal vector.
The angle between the motion vector and each of the three axes of the object coordinate system can then be calculated. Forward, left-front or right-front, left or right, left-back or right-back, and backward motions can be distinguished by the included angle between the motion vector and the axis, and whether the motion is to the left or to the right can be further determined by judging the included angles with the axes, so that a horizontal symbol can be determined.
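The first step can be sketched as a function that maps one motion vector to a level and a horizontal direction. This is an illustration under several assumptions that are not taken from the study: the axis convention (x to the right, y up, z forward), the 22.5° tolerances, the symbol names, and the use of a "place" symbol for nearly vertical vectors.

```python
import math

DIRECTIONS = ["forward", "right-forward", "right", "right-back",
              "back", "left-back", "left", "left-forward"]

def classify(v, place_tol=22.5, level_tol=22.5):
    """Return (horizontal symbol, level) for a motion vector v = (x, y, z).

    Assumed object axes: x points right, y points up, z points forward.
    """
    x, y, z = v
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0:
        raise ValueError("zero-length motion vector")
    # Included angle between the motion vector and the vertical axis.
    polar = math.degrees(math.acos(max(-1.0, min(1.0, y / norm))))
    if polar < 90 - level_tol:
        level = "high"
    elif polar > 90 + level_tol:
        level = "low"
    else:
        level = "middle"
    # A nearly vertical vector has no meaningful horizontal direction.
    if polar < place_tol or polar > 180 - place_tol:
        return "place", level
    # Projection onto the horizontal plane: drop y, measure from forward.
    azimuth = math.degrees(math.atan2(x, z)) % 360    # 0 = forward, 90 = right
    sector = int(((azimuth + 22.5) % 360) // 45)      # eight 45-degree sectors
    return DIRECTIONS[sector], level

# Usage
arm_forward = classify((0, 0, 1))
arm_up = classify((0, 1, 0))
arm_diagonal = classify((1, 1, 1))
```

Working with angles rather than absolute coordinates is what makes the result independent of the performer's height, as the text requires.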
The second step yields a dense motion sequence. The result has the characteristics of a dance recorded in Laban symbols, but because every frame of data is converted, the sequence obtained is particularly dense. The next step is to integrate these dense symbol sequences.
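One simple way to begin integrating a dense per-frame sequence is to merge consecutive identical symbols into runs. The sketch below assumes this run-length view; the function name `collapse` and the symbol strings are hypothetical.

```python
from itertools import groupby

def collapse(symbols):
    """Merge consecutive identical per-frame symbols into runs.

    Returns a list of (symbol, start_frame, length_in_frames).
    """
    runs, start = [], 0
    for sym, group in groupby(symbols):
        length = sum(1 for _ in group)
        runs.append((sym, start, length))
        start += length
    return runs

# Usage: six frames collapse into three runs.
frames = ["place"] * 3 + ["forward"] * 2 + ["place"]
runs = collapse(frames)
```

Each run's length in frames is what the following beat-based step regularizes.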
The third step is to integrate the sequences based on the beat method. Both music and dance movements are inseparable from the beat. Once the minimum beat of a dance is obtained, it can be used to regularize the signal. Because a dance is not chaotic, its movements can be assumed to be carried out on the basis of this minimum beat, that is, the duration of every movement in the dance is an integral multiple of the minimum beat.
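The integral-multiple rule can be sketched as a quantization of run lengths. Here the minimum beat is simply passed in as a number of frames (the study obtains it through wavelet analysis, which is not shown), and dropping runs shorter than half a beat is an illustrative simplification rather than the study's rule.

```python
def quantize_to_beat(runs, min_beat):
    """Snap run lengths to whole multiples of the minimum beat.

    runs    : list of (symbol, length_in_frames)
    min_beat: minimum beat length in frames (assumed to be known)
    Returns a list of (symbol, length_in_beats).
    """
    out = []
    for sym, length in runs:
        beats = int(length / min_beat + 0.5)   # round to the nearest beat count
        if beats == 0:
            continue                           # shorter than half a beat: dropped
        if out and out[-1][0] == sym:
            out[-1] = (sym, out[-1][1] + beats)   # merge with identical neighbour
        else:
            out.append((sym, beats))
    return out

# Usage: a 3-frame "forward" blip between two "place" runs is removed.
runs = [("place", 31), ("forward", 3), ("place", 28), ("forward", 62)]
score = quantize_to_beat(runs, min_beat=30)
```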
In this study, SVM (support vector machine) is used for the classification judgment. An SVM classifier finds a hyperplane that separates positive and negative samples, and its training can be understood as maximizing the minimum margin. In the objective function, each sample carries a category label, and the hyperplane equation determines on which side of the hyperplane a sample falls. The final result is obtained through iterative optimization.
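For reference, the textbook hard-margin form of this objective is given below. The symbols are assumptions chosen for illustration (x_i is a sample, y_i ∈ {−1, +1} its category, and w, b the hyperplane parameters); the study may use a soft-margin or kernel variant.

```latex
\[
\max_{\mathbf{w},\,b}\ \min_{i}\
\frac{y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right)}{\lVert \mathbf{w} \rVert}
\quad\Longleftrightarrow\quad
\min_{\mathbf{w},\,b}\ \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{s.t.}\quad
y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1 \ \ \text{for all } i
\]
```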
Figure 5 is a speed frame diagram of an arm rotating at a constant speed. It can be seen that different motions can be well distinguished by using angular velocity components in different directions. After the classifier is obtained, the Laban symbol sequence obtained in the previous step is sent to the classifier, which judges whether each symbol is retained or deleted, so as to obtain the final accurate result.

5. Result Analysis and Discussion
5.1. Algorithm Analysis of Lower-Limb Movement Labanotation Generation
In this study, the proposed CNN-based lower-limb motion recognition method is verified on two published data sets of human lower-limb motion capture. The two data sets are as follows:
(1) Data set I: The data set contains 16 types of lower-limb movements with 400 action samples in each category, so the total number of samples is 6400. Each sample is 20 frames long, so the whole data set has 128,000 frames.
(2) Data set II: The data set is established by expanding data set I in this study. It enriches the low-order and high-order actions in labanotation and contains 48 kinds of human actions. Its sample distribution is consistent with that of data set I, and all categories are uniformly distributed, which eliminates doubts about recognition accuracy that an unbalanced distribution of sample categories might cause. The final data set contains 19,500 samples and 379,000 frames.
The size of the convolution kernel has a clear impact on image processing. If the kernel is small, the enhancement of the action information of interest in the image is not obvious, and noise in the same frequency band may be amplified, obscuring some of the image's original details. If the kernel is large, the amount of convolution computation increases, resulting in high computational complexity. Choosing the right convolution kernel size for the entire video image is therefore crucial.
According to a large number of 2D CNN applications in human motion recognition, the convolution network structure with a convolution kernel size of 3 × 3 has the best effect, as shown in Figure 6. The convolution kernel of a 3D CNN generally has a time dimension of three, so the network structure proposed in this study uses a 3D convolution kernel with a size of 3 × 3.
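To make the effect of kernel size concrete, the sketch below applies a single convolution to one 20 × 14 input sample, assuming a 3 × 3 kernel. Stride 1 and no padding are assumptions, since the text does not state them, and the averaging kernel stands in for learned weights.

```python
def conv2d_valid(image, kernel):
    """2D cross-correlation with stride 1 and no padding, as in a CNN layer."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(ow)]
            for r in range(oh)]

# Usage: a 20 x 14 sample of ones and a 3 x 3 averaging kernel.
sample = [[1.0] * 14 for _ in range(20)]
kernel = [[1.0 / 9] * 3 for _ in range(3)]
out = conv2d_valid(sample, kernel)
shape = (len(out), len(out[0]))
```

Under these assumptions each layer shrinks a 20 × 14 input by two in each dimension, and the cost per output value grows with the square of the kernel width, which is the trade-off the preceding paragraph describes.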

The CNN structure proposed in this study can be effective for both data set I and data set II, which shows that the network structure has high robustness. Figures 7 and 8 show the accuracy of different algorithms in human motion recognition.


On data set I, the recognition accuracy of the proposed CNN-based method is about 13% higher than that of the traditional template matching method in reference [22] and about 5% higher than that of the method in reference [23].
Even on data set II, which contains more action types and more action samples, the proposed method still achieves the highest recognition accuracy, with an average recognition rate of 95.58%. This is better than the previous methods and demonstrates the advantages of CNN in feature learning and representation compared with traditional machine-learning methods.
Compared with the recognition method in reference [24], the CNN-based method proposed in this study learns more spatial information and therefore achieves better results. Even when the number of data samples is enlarged, CNN achieves very good performance with the network structure and parameter settings completely unchanged, which shows that the CNN-based method is efficient and robust in learning the spatial information of human motion.
Figures 9 and 10 show the time performance comparison between the proposed lower-limb motion recognition method based on CNN and other methods.


It can be seen that the CNN-based method is very fast: it takes only 73 seconds on data set I, which contains 6400 samples, while the method in reference [22] takes 382 seconds. The method in reference [23] takes even longer, about 2000 seconds, because it needs to build a model for each type of action.
When the number of data samples increases to 9,600, the method in reference [23] takes 61 times as long as the method proposed in this study, which demonstrates the efficiency of the proposed CNN-based method. The CNN-based recognition method is also three times faster than the recognition method in reference [24], because the network structure in reference [24] contains self-connections, and the update calculation of its central unit is time consuming. Therefore, compared with the existing methods, the proposed CNN-based method achieves both the best recognition accuracy and the best time performance.
5.2. Analysis of Upper-Limb Motion Segmentation and Labanotation Generation Algorithm
More than 400 upper-limb dance movement samples were collected in the experiment, including 232 left-arm movements and 222 right-arm movements, each about 200 frames long, for a total of tens of thousands of frames in the data set. In this experiment, 50% of the samples are used for training and the remaining 50% for testing, and the final result is the average of 10 experimental runs.
The recognition accuracy of human upper-limb movements is first compared with that of the normalized human-orientation feature from literature [28], using the same spatial segmentation method, as shown in Figure 11. Feature1 denotes the original action data, Feature2 the features with normalization of human body orientation added, Feature3 the bone features of other nodes of the human body relative to the original nodes, and Feature4 the bone features of other nodes relative to the original nodes with normalization of human body orientation.

As can be seen from Figure 11, the bone features designed in this study achieve the best recognition results, reaching about 94% recognition accuracy, which is about 2% higher than the features extracted from literature [28] and demonstrates the effectiveness of the bone node vector features proposed in this study.
Figure 12 shows the accuracy calculated for different dance movements. In the experiment, 50 pieces of dance data were collected, each about 3 minutes long with 30,000 items of human body movement data. Each dance corresponds to 50 Laban symbols and includes a series of actions such as walking, running, jumping, and arm rotation. The spatial segmentation method achieves much higher accuracy than the other two algorithms in the action recognition of upper-limb nodes.

The comparative analysis shows that the DL-based automatic labanotation generation algorithm designed in this study can efficiently generate labanotation from motion capture data and basically meets the needs of practical application.
6. Conclusion
In this study, the notation rules and fundamental elements of labanotation are thoroughly discussed, as well as the structural analysis and data conversion of BVH files. On this foundation, an algorithm for converting motion capture data to labanotation is proposed, and automatic labanotation generation is realized. The DL-based segmentation algorithm and the spatial segmentation-based recognition algorithm are introduced for the first time, effectively improving recognition accuracy. CNN is primarily used to learn and express the spatial characteristics of human body movements, and it can comprehensively learn the spatial position information of human body movements. To generate a Laban symbol sequence, the coordinates in the world coordinate system are projected into three-dimensional space, and the minimum beat unit is then used to regularize the Laban symbol sequence. Labanotation keeps track of the human body's movement in the simplest form possible. Finally, in order to obtain an accurate recognition result, this study introduces a classifier to eliminate the interfering intermediate symbols. Segmentation and recognition are performed directly on the spatial sequence in this study, which introduces a novel route from data to labanotation.
As labanotation research is still at a preliminary stage in China, it needs continued exploration and study. Labanotation has become a universal communication tool in the international dance field, and it is hoped that automatic Laban generation systems will continue to improve.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This study was supported by Research Start-up Fund for Young Teachers of Shenzhen University (project no. 000002111006) and Research Start-up Fund for Young Teachers of Shenzhen University (project no. 000002111105).