Abstract

Real-time tactical sign language recognition enables communication in silent environments and beyond visual range, and it also enables human-computer interaction (HCI). Although existing methods achieve high accuracy, the complexity of their models prevents convenient deployment on portable systems. In this paper, we present MyoTac, a user-independent real-time tactical sign language classification system that makes the network lightweight through knowledge distillation, balancing high accuracy against execution efficiency. We design a tactical convolutional neural network (TCNN) and a bidirectional long short-term memory (B-LSTM) network to capture the spatial and temporal features of the signals, respectively, and extract soft targets with knowledge distillation to compress the neural network by nearly four times without sacrificing accuracy. We evaluate MyoTac on 30 tactical sign language (TSL) words with data from 38 volunteers: 25 provided offline training data and 13 conducted online tests. For new users, MyoTac achieves an average accuracy of 92.67% with an average recognition time of 2.81 ms. These results show that our approach outperforms other algorithms proposed in the literature, reducing real-time recognition time by 84.4% while achieving higher accuracy.

1. Introduction

Gestures are one of the most common ways for humans to convey their intentions [1], and they often carry a large share of transmitted information. Tactical sign language (TSL) comprises the gestures used in combat operations, communicating through movements of the arms, fingers, and palms. By recognizing TSL, simple gestures can deliver rich semantic information to interact with devices, and people can communicate in silent environments and beyond visual range. Traditional gesture recognition is based on computer vision [2, 3], but the battlefield environment is so complex that image-based approaches are inadequate, as they can be seriously affected by lighting, shadows, and background. In addition, the required infrastructure is inconvenient for users and unsuitable for carrying around. In contrast, it is far more convenient to collect signal data with an armband combining electromyography (EMG) acquisition and an inertial measurement unit (IMU). Among classifiers, deep neural networks (DNN) have been widely applied to gesture recognition [4, 5] because their hidden layers extract buried features automatically. Since tactical sign language gestures are complex, the system must balance motion-capture accuracy with real-time operating speed, so a lightweight model with few parameters is essential. However, DNN models for gesture recognition typically stack layers to improve accuracy, producing large models that sacrifice real-time speed.

In this paper, we present MyoTac, a user-independent, highly real-time tactical sign language classification system that makes the network lightweight through knowledge distillation and replacement of the fully connected layer (FCL), so as to balance high accuracy and execution efficiency. Knowledge distillation refers to transferring the knowledge in an ensemble of models to a single model [6]. We utilize a portable device, Myo, which transmits data via Bluetooth Low Energy (BLE). The Myo armband acquires 8-channel EMG signals and 9-channel IMU signals, the latter comprising 3-channel accelerometer, 3-channel gyroscope, and 3-channel orientation signals. When people wear the Myo armband, our system converts the received data into commands, so that machines can understand human intentions and human-computer interaction proceeds naturally and harmoniously.

Motivated by the excellent performance of DNNs in image recognition, speech classification, and natural language processing, we develop a hybrid neural network to recognize 30 types of TSL. First, the multimodal mixed data is fed into the TCNN to obtain the global characteristics of each channel. Then, the learned features are passed to the B-LSTM to model temporal dependencies. Training a large model allows its knowledge to be extracted: the soft targets it produces are combined with the actual labels to train the small model. Through the rich information entropy of the soft targets, our small model acquires more knowledge and thus reaches higher accuracy.

To evaluate the classification performance of MyoTac, we invited 25 volunteers to perform 30 commonly used tactical sign language gestures and collected 37,500 samples. Volunteers only need to wear the Myo armband on the left arm and repeat the arm and finger movements. By fusing the IMU and EMG signals, with upsampling for synchronization, the fine movements of the arms and fingers can be analysed comprehensively, and gestures that cannot be distinguished by a single signal can be separated more accurately. When dealing with new users, MyoTac achieves an average accuracy of 92.67%, demonstrating the effectiveness of our algorithm. To further reduce the model scale and improve real-time recognition speed, we also compare the large model with the small model obtained through knowledge distillation; the results confirm the benefit of lightening the network through knowledge distillation.

The contributions of this paper are summarized as follows:
(1) We propose a multimodal hybrid neural network that combines TCNN and B-LSTM to classify user-independent, multichannel, time-varying EMG signals of tactical sign language. The TCNN part extracts spatial features across channels, and the B-LSTM extracts temporal correlations over the time series.
(2) We present a novel method based on knowledge distillation to lighten the TCNN and B-LSTM network, exploiting the fact that the generalization ability of a large model can be transferred to a small one. This method reduces the space and time complexity of the network and achieves higher accuracy at the same network scale.
(3) We design a nonintrusive real-time tactical sign language recognition system, usable for silent human-computer interaction on the battlefield as well as for common sign language recognition (code is available at https://github.com/YifanZhangchn/MyoTac.git). Participants only need to wear the Myo armband and perform military sign language movements, without any other pretraining, to convey instructions to the machine. In addition, we collected a TSL data set with 30 symbol samples and conducted real-time gesture recognition experiments. Experiments on our sign language recognition and network lightweight methods show excellent recognition performance.

The remainder of this paper is organized as follows. We first review related research in Section 2, followed by materials and methods in Section 3, including the data set, signal processing, and the specific deep neural networks. Section 4 presents the results, and Section 5 presents the discussion. Finally, the conclusions of the paper are drawn in Section 6.

2. Related Work

This section reviews related research on sign language recognition and lightweight neural networks.

2.1. Sign Language Recognition

To the best of our knowledge, classifiers for sign language recognition mainly comprise traditional machine learning methods [7, 8] and neural network models. In early work, researchers extracted handcrafted features and fed them into classical machine learning classifiers. Wang et al. extracted shape, depth, and bone trajectory features of the hand to recognize isolated gestures through support vector machines (SVM) [9, 10]. Zhuang et al. fused sEMG and accelerometer signals from the back of the hand and extracted posture-related features into a linear discriminant analysis (LDA) model to recognize 18 isolated Chinese sign language (CSL) signs [11]. The K-nearest neighbour (K-NN) algorithm [12, 13] is also commonly used in recent research. Manually extracting features requires expert domain knowledge, in-depth analysis and heuristic thinking about the problem, and combining and experimenting with many refined features, which makes it difficult to obtain features with high discriminative power. Classical machine learning models are compact, but constructing and selecting features is difficult, so they suffer from low accuracy and weak generalization.

Since neural networks do not require manually extracted features, they can capture hidden features that are difficult to detect and sort by hand, and are therefore widely used in sign language recognition [14]. Liu et al. [15] constructed a 100-word Chinese sign language dataset, captured the motion trajectories of four skeletal joints, and fed the data into an LSTM model. Liang et al. merged multimodal video streams and applied a 3D-CNN model to extract spatial and temporal features in real-time to capture motion information [16]. Beyond image or video data, biological signals have also been widely used [17–19]. Among them, back-propagation (BP) neural networks, pulse-coupled neural networks (PCNN) [20], and probabilistic neural networks (PNN) [21] have been adopted to recognize these complex data. Chen et al. [22] segmented the data into overlapping segments with a 260 ms sliding window and applied the continuous wavelet transform (CWT) with a scale of 32 to map the data into the time-frequency domain; they then built a compact CNN model named EMGNet. This model can distinguish static gestures, but it is insufficient for inferring temporal tactical sign language gestures. Hybrid neural networks combining recurrent neural networks (RNN) with multinetwork models [23] have been applied to distinguish EMG signals of different sign language actions with high accuracy. Zhang et al. [24] presented MyoSign, which combines inertial and EMG sensor signals: it infers American Sign Language with a model integrating multimodal CNN and CTC. However, such portable deep learning systems for bioelectric signals usually contain a large number of parameters, which hurts their real-time performance.

2.2. Lightweight Neural Network

With the emergence of different types of neural networks, their capabilities have become increasingly powerful, and their scale has expanded accordingly. To improve the operating efficiency of neural networks and deploy them on small embedded or mobile devices, the network must be made lightweight. Current approaches fall into two categories: lightweight network models and network lightweight methods.

The former is mainly achieved by changing the convolution mode and exchanging information between convolution layers. The SqueezeNet model from Berkeley and Stanford [25] adopts the stacking idea of the Visual Geometry Group (VGG) network and emphasizes 1×1 convolution kernels to compress feature maps. Google's MobileNet model [26] replaces traditional convolution with depth-wise separable filters. The ShuffleNet model [27] borrows MobileNet's idea of splitting the convolutional layer into two parts, retaining the depth-wise layer to convolve each channel individually, and then introduces a shuffle layer that permutes the channels to ensure that feature-map information circulates across channels.

The latter chiefly compresses the number and depth of parameters to reduce the space and time complexity of an existing network model without noticeably affecting its accuracy. Pruning is one of the most commonly used model compression methods [28]; its premise is the overparameterization of deep neural networks. Hao et al. [29] combined the loss with a regularizer during training to make the weights sparse; the importance of each parameter is then evaluated by its absolute value, and parameters below a threshold are removed. Quantization is another useful method, premised on the high precision of the parameters. Gupta et al. [30] used two simple schemes to quantize parameters: one rounds to the nearest value, and the other rounds up or down with a certain probability.
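For concreteness, the two rounding schemes can be sketched as follows; the fixed-point step size and function names are our own illustrative assumptions, not taken from [30].

import numpy as np

def round_nearest(w, step=2**-8):
    """Deterministic quantization: snap each weight to the nearest grid point."""
    return np.round(w / step) * step

def round_stochastic(w, step=2**-8, rng=None):
    """Probabilistic quantization: round up or down with probability
    proportional to the distance from each neighbouring grid point."""
    rng = rng or np.random.default_rng()
    lower = np.floor(w / step) * step
    frac = (w - lower) / step          # in [0, 1): distance above the lower point
    return lower + step * (rng.random(w.shape) < frac)

The stochastic scheme is unbiased: its expected value equals the original weight, which is why it tends to preserve accuracy at low precision.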

3. Materials and Methods

3.1. Dataset

To evaluate the performance of the recognition system, we selected 30 commonly used tactical sign language gestures for data collection. The selected 30 gestures (see Table 1) are divided into 5 categories. Twenty-five volunteers (18 males and 7 females) were recruited for our experiment. Each volunteer wears the Myo armband on the forearm of the left hand (see Figure 1), because TSL is performed one-handed with the left hand. EMG and IMU signals are collected synchronously. Each recording consists of raising the hand, the corresponding tactical sign language movement, and relaxing the arm.

Before data collection, we informed the volunteers of the purpose of the study, the collection procedure, and its duration. During collection, each subject was required to wear the Myo armband in the same position and orientation on the left hand. They completed the corresponding tactical sign language gesture within 2 seconds and rested for 1 second, repeating these steps 50 times. A total of 37,500 samples were collected, and the standard Myo software development kit (SDK) was used to collect the sensor data.

3.2. MyoTac System Architecture

MyoTac is a user-independent sign language recognition system that classifies 30 military sign language commands in real-time. Figure 2 shows the system structure. The left side of Figure 2 performs the offline model training. Data is collected by a portable sensor device, Myo, and transmitted in real-time via BLE. EMG data and IMU data are divided into 2-second segments and, after synchronization by upsampling, merged into a 17 × 400 matrix: the 17 channels correspond to the 9 IMU signals plus the 8 EMG signals, and 400 is the number of time-series samples collected within 2 s at a rate of 200 Hz. The formatted data is first passed through the TCNN to extract hidden spatial features; then, the B-LSTM collects context information across the timing modules. In knowledge distillation, soft targets are the outputs obtained after training a complex network, while hard targets are the real data labels [31]. Finally, by combining soft and hard targets, the network scale is reduced by 86.3% and the running time by 33.6% compared with the MyoTac-original model.

The right side of Figure 2 shows the online recognition. We use the offline-trained model to process new data so that the real-time performance of the system can be evaluated. New users need no registration or other setup; they simply wear the armband on the left arm and perform the corresponding military sign language actions. The system monitors the short-term energy of the EMG data in real-time; when the value rises above a threshold, it is treated as a user action and the corresponding active segment is intercepted. The EMG and IMU data are again merged into a matrix and fed into the offline-trained model, which outputs the classification result and the identification time in real-time.

3.3. Data Segmentation and Fusion by Upsampling

The data obtained from the Myo armband has already been denoised by filtering and by eliminating power-frequency interference. Commonly used methods include Butterworth filtering to remove high- and low-frequency interference, wavelet denoising, and band-stop filtering at 50 Hz to remove power-frequency interference.
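A minimal SciPy sketch of such a denoising pipeline is shown below; the pass band, filter order, and notch quality factor are illustrative assumptions, since Myo applies its own built-in filtering.

from scipy import signal

def denoise_emg(emg, fs=200):
    """Band-pass Butterworth filter plus a 50 Hz notch for a
    (channels x samples) EMG array; cut-offs are assumptions."""
    b, a = signal.butter(4, [10, 90], btype="bandpass", fs=fs)
    emg = signal.filtfilt(b, a, emg, axis=-1)
    b50, a50 = signal.iirnotch(50.0, Q=30.0, fs=fs)
    return signal.filtfilt(b50, a50, emg, axis=-1)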

The EMG signal from Myo and the corresponding short-term energy (see Figure 3) show two similar partial waves when the participant performs the action twice; the parts inside the two dotted boxes are valid data. Completing a military sign language action generally takes no more than 2 seconds, so we set the effective action time to 2 seconds. During data collection, our program reminds the volunteers of the beginning and end of each two-second period. Analysis of the data for each gesture shows that the short-term energy reaches a higher value during the arm-activity phase. The short-term energy of the EMG signal at time $t$ is defined as

$$E(t)=\sum_{i=1}^{8}\sum_{n=t-N+1}^{t}\left[x_i(n)\,w(n)\right]^2, \qquad (1)$$

where $N$ is the window size, $x_i(n)$ is the EMG time-domain signal of the $i$th channel, and $w(n)$ is the window function, which we set to a rectangular window. In the recognition part, we therefore slide a time window to continuously monitor the short-term energy of the EMG data; when the short-term energy exceeds a threshold, the signal is judged to be in an active segment. With an EMG sampling rate of 200 Hz and 8 channels, we obtain an 8 × 400 EMG signal matrix containing the active segment, which can be viewed as a picture with a resolution of 8 × 400.
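Below is a minimal NumPy sketch of this detection step (Equation (1) with a rectangular window); the 0.05 s window size matches Algorithm 1, while the threshold is an assumption to be calibrated per deployment.

import numpy as np

def short_time_energy(emg, t, N):
    """Equation (1): summed squared amplitude of all 8 channels over a
    rectangular window of N samples ending at sample index t."""
    return float(np.sum(emg[:, t - N + 1 : t + 1] ** 2))

def detect_active_segment(emg, threshold, fs=200, window_s=0.05):
    """Slide the energy window over an (8 x samples) stream and return the
    8 x 400 active segment (t - 0.05 s to t + 1.95 s) once E(t) > threshold."""
    N = int(window_s * fs)                       # 10 samples at 200 Hz
    for t in range(N - 1, emg.shape[1] - 2 * fs):
        if short_time_energy(emg, t, N) > threshold:
            start = t - N + 1
            return emg[:, start : start + 2 * fs]
    return None                                  # no activity detected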

In the tactical sign language instruction set, numerous instructions share the same arm movements but differ in finger movements. For example, both the “me” and “female” gestures raise the hand to the chest, and both the “hurry up” and “shotgun” gestures raise the arm and move it up and down. Conversely, some instructions share finger movements but differ in arm movements: the “suspect” and “message received” gestures both cock three fingers; the “you” and “me” commands both straighten the index finger and bend the others; and “pistol” and “rifle” both shape the index finger and thumb into a gun. Only by using the IMU signals to distinguish the gross arm movements and the EMG signals to discriminate the explicit finger movements can different sign language instructions be well separated. With an inertial sampling rate of 50 Hz and 9 channels, the inertial signal yields a 9 × 100 data matrix. Because the inertial and EMG sampling rates differ, upsampling is adopted: linear interpolation of the inertial sensor data yields a 9 × 400 matrix, ensuring time synchronization of the sampled data. At each sampling point, combining the 9 × 400 inertial matrix with the 8 × 400 EMG matrix yields the 17 × 400 matrix R, which fuses the 9-channel inertial samples and the 8-channel EMG samples. Data processing is illustrated in Algorithm 1.

Input: real-time EMG data S_EMG, IMU data S_IMU, short-time energy threshold H
Output: fused matrix R (17 × 400)
1: for each t > 0.05 s do
2:   E(t) := short-time energy of S_EMG at t, per Equation (1)
3:   if E(t) > H then
4:     M_EMG := S_EMG(t − 0.05 : t + 1.95)
5:     M_IMU := S_IMU(t − 0.05 : t + 1.95)
6:     M_IMU := Interp(M_IMU)
7:     R := merge(M_EMG, M_IMU)
8:     return R
9:   end if
10: end for
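A NumPy sketch of the upsampling fusion in lines 6–7 of Algorithm 1 follows; it assumes the EMG and IMU segments cover the same 2 s span.

import numpy as np

def fuse_emg_imu(emg_seg, imu_seg):
    """Linearly interpolate a 9 x 100 IMU segment (50 Hz) up to the
    8 x 400 EMG segment's 200 Hz time base, then stack into R (17 x 400)."""
    t_imu = np.linspace(0.0, 2.0, imu_seg.shape[1])   # IMU sample times
    t_emg = np.linspace(0.0, 2.0, emg_seg.shape[1])   # EMG sample times
    imu_up = np.stack([np.interp(t_emg, t_imu, ch) for ch in imu_seg])
    return np.vstack([imu_up, emg_seg])               # shape (17, 400)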
3.4. TCNN Model Construction Based on Correlation between Channels

Traditional methods extract artificially designed features, i.e., signal statistics calculated from the sensor data, which can be independently selected as inputs to a machine learning classifier, such as mean absolute value (MAV), root mean square (RMS), zero crossing (ZC), and waveform length (WL); a sketch of these features follows below. However, it is not always possible to directly find salient manual features that generalize well across sensors and users. Given the superior performance of convolutional neural networks (CNN), we use one to process the multimodal sensor signals. To reflect the timing property (see Section 3.5 for details), we separate the 17 × 400 matrix into five clips, yielding 5 data matrices of size 17 × 80.
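For reference, the four classical features have simple closed forms; the sketch below computes them for a single channel.

import numpy as np

def handcrafted_features(x):
    """Classical time-domain EMG features for one channel x (1-D array)."""
    mav = np.mean(np.abs(x))                              # mean absolute value
    rms = np.sqrt(np.mean(x ** 2))                        # root mean square
    zc = int(np.sum(np.diff(np.signbit(x).astype(int)) != 0))  # zero crossings
    wl = np.sum(np.abs(np.diff(x)))                       # waveform length
    return mav, rms, zc, wl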

The muscle signal generated during arm movement reflects the intensity of muscle activity. In Figure 4, (a) and (b) show the eight-channel time-domain and frequency-domain signals of two “see” gestures, and (c) and (d) show the signals of two “two-way column” gestures. We find that when the “see” gesture is performed, the muscles around channels 4 and 7 are more active, and the generated signals concentrate in the 60–100 Hz frequency range. For “two-way column,” the muscles near channels 3 and 7 are more active, and the EMG signal collected on channel 3 is also distributed in the low-frequency domain. Notably, because EMG noise arises from power-frequency interference at 50 Hz and its harmonics, the built-in algorithm of the Myo armband filters this interference out, so the signal distribution around 50 Hz is attenuated. The time-domain and frequency-domain EMG signals generated by the same gesture are very similar, while those of different gestures differ considerably.

Figure 5 shows the distribution of the forearm muscles. When performing tactical sign language, muscle fibres are activated and generate electrical signals that propagate to the skin surface. The electrode pads of the Myo armband rest on the skin, so the signal collected by each pad is a superposition of signals generated by multiple muscle fibres. Analysing the correlation between EMG data channels, we find that the Pearson coefficients of adjacent channels all exceed 0.3, while those between nonadjacent channels do not exceed 0.2. This is consistent with the fact that electrical signals captured by adjacent electrodes are relatively similar.
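This channel-correlation analysis can be reproduced with a few lines of NumPy; treating the 8 electrodes as a ring and using the diametrically opposite pad as the non-adjacent example is our own simplification.

import numpy as np

def channel_correlations(emg):
    """Mean Pearson coefficients of adjacent vs. non-adjacent channel
    pairs for an 8 x N EMG matrix whose electrodes form a ring."""
    corr = np.corrcoef(emg)                          # 8 x 8 Pearson matrix
    adjacent = [corr[i, (i + 1) % 8] for i in range(8)]
    opposite = [corr[i, (i + 4) % 8] for i in range(8)]
    return float(np.mean(adjacent)), float(np.mean(opposite))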

To capture the relationships between channels, we design a 3-layer TCNN (see Figure 6). In the first layer, we use a convolution filter whose height spans the channel dimension; we find that when this height is set to 1, the information of each channel is processed separately and the highest discrimination is achieved. A second convolutional layer then learns higher-level representations. After the convolutional layers have extracted the data features, the researchers in [32] adopt an FCL to adjust the tensor shape for input to the subsequent RNN layer. Because an FCL contains a large number of parameters, it would greatly increase the model scale; we therefore resize the feature map with an additional convolutional filter instead. We apply a max pooling layer after each convolutional layer to reduce the computational complexity of the network. For all layers, we use the rectified linear unit (ReLU) as the activation function; ReLU alleviates the vanishing-gradient problem in deep neural networks and thereby speeds up learning.

At the end of the TCNN module, a flattening layer is applied to flatten the output of the last convolutional layer so the features can be fed into the B-LSTM module.
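The description above leaves some hyperparameters unspecified, so the PyTorch sketch below fixes them with illustrative values; only the height-1 kernels of the first layer, the per-layer max pooling, the ReLU activations, the FCL-replacing convolution, and the final flatten follow the text.

import torch
import torch.nn as nn

class TCNN(nn.Module):
    """Sketch of the 3-layer TCNN; channel counts and kernel widths are
    assumptions, while the structure follows Section 3.4."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # layer 1: height-1 kernels treat each sensor channel separately
            nn.Conv2d(1, 16, kernel_size=(1, 5), padding=(0, 2)), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
            # layer 2: higher-level representation across channels
            nn.Conv2d(16, 32, kernel_size=(3, 3), padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
            # layer 3: convolution that resizes the feature map, replacing an FCL
            nn.Conv2d(32, 8, kernel_size=(3, 3), padding=1), nn.ReLU(),
        )
        self.flatten = nn.Flatten()

    def forward(self, clip):            # clip: (batch, 1, 17, 80)
        return self.flatten(self.features(clip))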

3.5. Construction of Bidirectional-LSTM Network Based on Temporal Correlation

Both the EMG and IMU signals describe all the movements over the whole time sequence. Here, we take the EMG signals of the actions “doorway” and “assemble” as examples. Parts (a) and (b) of Figure 7 describe “doorway” and “assemble,” respectively. The TSL expression for “doorway” is to raise the hand and use the index finger to draw the shape of a door: up, left, then down. As the hand moves in different directions, the most active muscle mass changes, and the signal strength on different channels changes accordingly; in part (a), three active waveforms clearly correspond to the hand moving in the three directions. The expression for “assemble” is to raise the hand, extend the index finger skyward, and rotate it twice before retracting. The signal in (b) divides into four segments corresponding to raising the hand, the first rotation, the second rotation, and the withdrawal. The waveforms of the middle two segments are similar, and the channels become active in sequence as the finger rotates. Because the signal is highly correlated with time, we divide the received signal into five segments for processing in the RNN, capturing the dynamic behaviour over the time series.

An RNN is a neural network capable of processing sequence information; the connections between its nodes form a directed graph along the sequence. LSTM, which uses input, forget, and output gates to mitigate long-term dependency problems, can only compute from past information. But many tactical sign language gestures begin in the same way, and in such cases the inability of an LSTM to access future information can cause recognition errors. We therefore deploy a B-LSTM as the temporal modelling layer. A B-LSTM comprises two LSTM layers running in opposite directions in time, so the output at a given node depends on both the preceding and the following hidden states of the sequence, ensuring that the sign language classification at any point in time depends on the entire sequence.

The data must be divided into multiple sequence steps, and the B-LSTM network accepts one step at a time. We experimented with dividing the data into 3, 4, 5, 6, 8, and 100 parts; the model achieves its best accuracy with five parts. The action information in the data also comprises five phases: rest, raising the arm, the sign language action, lowering the arm, and rest. Concretely, before convolution the data is transformed into a sequence of contiguous time-series matrices (see Figure 8), and each convolution processes one matrix at a time. After the TCNN output is obtained, these time-series blocks are recombined in order and fed into the bidirectional LSTM network. After the LSTM layers, we use a dropout layer to reduce overfitting; a parameter-selection experiment over dropout rates from 0.2 to 0.5 showed 0.5 to be optimal.
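Putting the pieces together, a hedged PyTorch sketch of the five-segment TCNN to B-LSTM pipeline might look as follows; it reuses the TCNN sketch above (whose output size fixes feat_dim), the hidden size is an assumption, and the five-way split, bidirectional LSTM, 0.5 dropout, and 30-class output follow the text.

import torch
import torch.nn as nn

class MyoTacNet(nn.Module):
    """17 x 400 input -> five 17 x 80 clips -> shared TCNN -> B-LSTM ->
    dropout -> 30-class classifier (softmax applied inside the loss)."""
    def __init__(self, feat_dim=2720, hidden=64, n_classes=30):
        super().__init__()
        self.tcnn = TCNN()                       # sketch from Section 3.4
        self.blstm = nn.LSTM(feat_dim, hidden,
                             batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                        # x: (batch, 1, 17, 400)
        clips = torch.chunk(x, 5, dim=3)         # five (batch, 1, 17, 80) clips
        seq = torch.stack([self.tcnn(c) for c in clips], dim=1)
        out, _ = self.blstm(seq)                 # (batch, 5, 2 * hidden)
        return self.fc(self.dropout(out[:, -1])) # logits for 30 classes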

Here, an FCL is commonly used to integrate the data fed into the softmax layer. To avoid the FCL, we also tried reshaping the output feature map of the B-LSTM and converting it through a convolution into a 30-channel map, where 30 is the number of classes, before reshaping it and feeding it into the softmax. In practice, however, the model sizes of the two methods are almost identical, so the conventional FCL is retained. Finally, we apply the softmax function to normalize the output vector, which is interpreted as the classification probabilities over the symbol labels.

3.6. Lightweight Network through Knowledge Distillation

People intuitively acquire knowledge from fixed real targets, which obscures a more abstract view of knowledge: knowledge is a mapping from input vectors to output vectors [33], and it can be extracted from a bulky model set into a small model. The bulky model can be a series of individually trained models or a single large model. Once the cumbersome model is trained, we can deploy another stage of training, knowledge distillation, to transfer what the large model has learned to the student model. Because of this knowledge-transfer relationship between the models, the method is also called the teacher-student neural network. Especially when recognizing military sign language from a target user never seen before, the transferred knowledge discriminates better because it is liberated from specific instances.

When neural networks are trained for classification, a softmax output is typically applied, and the loss function is defined as

$$L=-\sum_{i} y_i \log \frac{\exp(z_i/T)}{\sum_{j}\exp(z_j/T)},$$

where $y_i$ represents the target label: if the real value belongs to category $i$, $y_i$ equals 1; otherwise, $y_i$ equals 0. Here $z_i$ is the logit for category $i$, and $T$ is the distillation intensity, set to 1 during training and testing and to 8 during distillation; a higher value produces a relatively softer target distribution. We first train a model similar to the final model, but with more convolution kernels and more parameters. Training this model yields knowledge about the degree of similarity between actions. We then train the small model on the same training set. The logits of the bulky model contain rich information and can transfer its generalization ability to the small model, giving the small model sufficient classification ability. Combined with the distillation intensity $T$, the soft target provided by the bulky model is obtained as

$$q_i=\frac{\exp(z_i/T)}{\sum_{j}\exp(z_j/T)}. \qquad (2)$$

The hard target is the actual action corresponding to the signal in the training set. Since the information entropy of soft targets is higher, each training instance provides much more information than a hard target. There are two ways to employ soft and hard targets together: the first uses the real targets to revise the model obtained by soft-target training, while the other uses a weighted average of the two objective functions. Considering soft and hard targets together enables the small model to match the soft target provided by the bulky model while predicting the real target, and achieves the best results.
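A PyTorch sketch of the weighted-average variant is given below; the temperature T = 8 and the 0.8 hard-target weight come from this paper (see Section 4.3), while the T² rescaling of the soft term is standard distillation practice included here as an assumption.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T=8.0, hard_weight=0.8):
    """Weighted average of hard-target cross-entropy and the soft-target
    term built from Equation (2); T^2 rescaling keeps the soft gradients
    at a magnitude comparable to the hard ones."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)       # Equation (2)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return hard_weight * hard_loss + (1.0 - hard_weight) * soft_loss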

4. Results

4.1. Results of Signal Fusion

First, we examine the importance of fusing the EMG and IMU sensor signals. To fully capture the fine arm and finger movements, we fused the inertial sensor data and the EMG signal data. Figure 9 shows the confusion matrices when only the EMG signals, only the IMU signals, and the fused signals are used; the overall recognition accuracy over the thirty gestures in the three cases is 48.2%, 69.1%, and 92.4%, respectively. Table 2 lists the per-gesture accuracy in the three cases. It is evident that some gestures, such as “commander,” cannot be distinguished when the EMG or the IMU signal is used alone, but accuracy improves greatly when the two are combined. The IMU and EMG signals are thus complementary, measuring arm movements and muscle activities, respectively, and together they enable accurate recognition of sign language instructions.

Zhang et al. [24] apply multiple CNNs to extract features from the accelerometer, gyroscope, orientation, and EMG sensors separately and then merge the resulting tensors so the modalities can interact, which is another signal fusion method. Following this method, we built a three-layer CNN for each sensor, merged the multi-CNN outputs into a multichannel tensor, passed it through a convolutional layer, a flattening layer, and a dropout layer, and fed it into the LSTM network to extract temporal features. Taking the data of volunteer 25 as the test set, Figure 10 compares that method with ours. The best accuracy of our single convolutional network with the upsampling method is 92.67%, higher than the multi-CNN approach, mainly because the two types of data are well synchronized after upsampling and the features in each channel are not strongly correlated.

4.2. Comparison of Classification Systems

We use leave-one-user-out cross-validation to evaluate the recognition accuracy of MyoTac on the 30 TSL gestures, as shown in Figure 11 (a sketch of this protocol follows below). The recognition accuracy for volunteers no. 1 and no. 21 is relatively high because their movements and postures are accurate. Volunteer no. 8's accuracy is slightly lower than the others' because her arm is too thin to keep the armband in the same position, and volunteer no. 11 has a large body weight and high fat content, resulting in relatively low accuracy. Overall, the average accuracy over the 25 volunteers is 92.4% with a standard deviation of 2.3%. Volunteers differ not only in the outer shape of the arm, including fat thickness and arm circumference, but also in behavioural habits, including the force of the action and subtle movement differences; MyoTac still achieves satisfactory results, with every volunteer's accuracy above 88%. This illustrates the generalization ability of MyoTac across users.
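The evaluation protocol can be expressed compactly with scikit-learn's LeaveOneGroupOut; fit_and_score is a hypothetical caller-supplied train/evaluate routine.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_user_out(X, y, user_ids, fit_and_score):
    """Each fold trains on 24 volunteers and tests on the held-out one;
    fit_and_score(X_tr, y_tr, X_te, y_te) -> accuracy is caller-supplied."""
    accs = []
    for tr, te in LeaveOneGroupOut().split(X, y, groups=user_ids):
        accs.append(fit_and_score(X[tr], y[tr], X[te], y[te]))
    return float(np.mean(accs)), float(np.std(accs))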

To further demonstrate the accuracy and real-time performance, our model is compared with several state-of-the-art studies [22, 24]. Researchers have applied deep learning to EMG-based gesture recognition and explored several effective network frameworks.

EMGNet [22] builds a compact CNN model that processes time-frequency data to obtain classification results; we choose it for comparison because it also lightens a gesture classification model based on EMG signals. MyoSign [24] is an American Sign Language recognition system that feeds the processed data into a complicated model; to realize end-to-end sign language recognition, it adds a 10-width connectionist temporal classification (CTC) beam decoder at the end of the model.

These two models are compared with MyoTac on MyoDataSet [34] and on the tactical sign language instruction data set we collected. MyoDataSet is a seven-class data set collected with Myo, covering seven gestures: neutral, hand close, wrist extension, ulnar deviation, hand open, wrist flexion, and radial deviation. In the first setting, after shuffling MyoDataSet, the training and test sets are split at a ratio of 7/3. In the second setting, the data of two volunteers of different genders in MyoDataSet serve as the test set and the data of the others as the training set. A user-independent split is likewise applied to our tactical sign language instruction data. For MyoSign, since the data set and model parameters are not disclosed in [24], we use our tactical sign language instruction set with the same training/test split to compare the two models; both models use the same training batches, total epochs, and learning rate. The comparison results are shown in Table 3, and the accuracy over the iterative process of the three models on our military sign language instruction data set is shown in Figure 12.

From Table 3, our model attains higher accuracy than the other methods. The gestures in MyoDataSet remain static during collection, whereas tactical sign language commands correspond to changing movements. Although the EMGNet model is small, Table 3 shows that it sacrifices the ability to process time-varying data containing timing information: its accuracy in distinguishing tactical sign language is only slightly better than random. Our method handles both dynamic and static data well, and the experimental results confirm the necessity of combining B-LSTM to build a temporal dynamic model. Compared with MyoSign, our model reduces the parameters through knowledge distillation and replacement of the FCL while maintaining high accuracy. MyoSign uses 3D convolution, which adds a dimension to the computation required for each layer's output relative to 2D convolution and slows the model down. Moreover, CTC is usually applied where a fixed-length sequence is converted to a variable-length one; it is not really applicable to sign language classification, where the output length is fixed, and avoiding it further reduces the processing time.

4.3. Results of Knowledge Distillation

In this section, we evaluate the effect of knowledge distillation on lightening the hybrid network. First, we train a single cumbersome recognition network and use the distillation intensity $T$ to convert its output into a soft target with rich information entropy, which is provided to the small model for learning. Table 4 compares several ways of training the small model. Here, the data of the volunteer with the highest accuracy in cross-training is used as the test set and the data of the other volunteers as the training set. Using only soft targets gives an accuracy of 94.2%, and using only hard labels gives 95.6%. We then apply hard and soft targets together, either as a weighted average of the two objectives or as hard-target correction, in which the model is trained with the soft target and then revised with the hard target; the accuracy of these two methods is 97.1% and 97.0%, respectively.

From Table 4, training with soft targets alone performs worst, followed by hard targets alone, while combining the two achieves better results. The highest accuracy is obtained by the weighted average of soft and hard targets, where we put a relative weight of 0.8 on the cross-entropy for the hard targets. Soft targets define a rich similarity structure over the gestures, encoding which gestures are more similar and which differ more, so the small model learns quickly during training, much as a student learns expeditiously under a teacher's guidance. However, because of the gap between the soft target and the real label, training entirely on soft targets leads to a large deviation, just as a student who only listens to the teacher without self-study cannot succeed. When soft and hard targets are combined, the small model has soft targets providing rich knowledge and hard targets providing accurate classification, like a student who combines the teacher's guidance with his own diligence and reaches the highest level.

We evaluate the runtime and model size of three models (see Table 5): the original MyoTac model, the model after knowledge distillation, and the model without the FCL. A volunteer makes only one sign language action at a time, so the inference model only needs to process and classify one test sample in a given period. The smallest model has only 3.7 M parameters, and the average running time over all gestures is 2.37 ms. Without affecting accuracy, it computes faster and requires less storage than before.

4.4. Results of Real-Time Recognition

To achieve natural human-computer interaction, MyoTac must complete sign language inference in real-time. The capacity of the lithium battery in the Myo armband is 220 mAh, and the average measured operating current is about 9 mA, so the battery lasts about 24.4 hours. We verified online operation with 13 participants who were new to our system. Before the experiment, participants learned the tactical sign language gestures to avoid nonstandard actions influencing the recognition results. During the experiment, participants selected one of the thirty tactical sign language pictures at a time, without repetition, until all 30 gestures were tested; each participant performed this process four times.

The experimental results are shown in Figures 13 and 14. The overall average accuracy of real-time military sign language inference is 92.67%, and the average runtime is 2.81 ms. For 53.3% of the TSL gestures, inference accuracy exceeds 95% in this experiment, and every gesture except “message received” exceeds 75%. The “message received” gesture has the most incorrect judgements, with an accuracy of only 63.46%. This is because its arm trajectory is the same as those of “heard,” “pistol,” and “two-way column,” so finger movements must carry the inference, yet the finger movements of “message received” and “heard” are also extremely similar. The signals from the eight-channel Myo armband are not sufficient to distinguish such similar finger movements under the influence of the larger movements.

5. Discussion

This study proposes a lightweight hybrid neural network that classifies tactical sign language using multimodal EMG and IMU data; real-time classification accuracy does not drop for users new to the system. The study shows that neural networks carry high redundancy: parameters and layers can be reduced while still achieving high accuracy, and a network structure designed around the data achieves better classification at the same parameter count. For network lightening, we combine soft and hard targets in knowledge distillation, showing that the soft target obtained by training the large model can guide the training of small models and can be applied to a hybrid network of convolutional and B-LSTM layers.

The results also show that more data benefits the algorithm's accuracy. We varied the amount of training data and tested recognition for the same subject; accuracy rises as the training set grows, because the network extracts the key information of gesture movement and recognizes movement differences across groups. The importance of standardizing signal acquisition is equally clear: when the first experimental data set was collected, the way different participants wore the Myo armband was not specified in detail, and this variable greatly affected the classification results, so we refined the data-collection criteria and rebuilt a second data set.

Section 4.1 discussed the necessity of signal fusion: EMG signals sampled at 200 Hz must be fused with IMU signals to achieve a good classification effect. In principle, however, classification could use EMG signals alone: since EMG can distinguish fine finger movements, it should also provide some degree of discrimination for arm movements. One reason fusing the IMU signal is necessary is that the useful EMG band spans 0–500 Hz, while the sampling rate of Myo is only 200 Hz.

Our experiments on network compression are not yet exhaustive. We will further study training methods that combine soft and hard targets; in particular, we plan to evaluate a step-by-step training method that trains each stage separately without affecting the weights of the second-stage network. We also have not determined the optimal distillation intensity $T$ for producing the soft target. In addition, given the particular convolutional network used, methods such as depth-wise convolution and channel pruning could be combined for further network compression.

Although the experimental results are satisfactory, MyoTac still has limitations. First, for some people with thin arms, the armband cannot be fixed stably, resulting in unstable signals. Second, new gestures cannot be adapted dynamically; in the future, we will work to turn the classification model into a miniaturized adaptive classification system, and we will design a multimodal sensor suitable for various arm thicknesses to make the system more robust. In the sign language recognition experiments, the volunteers sat while collecting data; we also analysed the standing posture and found it has little effect on the results. For walking scenarios, because movement distorts the inertial sensor signal, a separate sensor may be needed to measure the influence of walking and correct the signal data.

6. Conclusions

We present MyoTac, an EMG-based sign language recognition system that fuses EMG and IMU multimodal signals and classifies sign language with a lightweight neural network obtained through knowledge distillation. We collected standard tactical sign language data from 25 volunteers to build a multimodal data set, designed a hybrid network combining convolution with B-LSTM, and reduced the network scale through knowledge distillation and reduced use of the FCL. Our system effectively distinguishes different sign languages while remaining user-independent: the average accuracy of real-time classification inference is 92.67%, and the average running time is 2.81 ms. The encouraging performance of MyoTac demonstrates its potential for silent human-computer interaction applications.

Data Availability

Data is available at https://github.com/YifanZhangchn/MyoTac.git.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank all volunteers who participated in data collection for their collaboration. This research was funded by the National Natural Science Foundation of China (no. 61702018).