Abstract

Artificial intelligence and Internet of Things (IoT) devices are experiencing explosive growth. Commonly used gesture recognition methods are expensive and difficult to deploy, so this paper uses Channel State Information (CSI) for Chinese sign language recognition. To address the problems of current gesture recognition methods, such as strong dependence on the user, high computational resource consumption, and low robustness, we propose a Chinese sign language gesture recognition method named Air-CSL. In this method, the Local Outlier Factor (LOF) algorithm and the Discrete Wavelet Transform (DWT) are used to reduce noise in the data, and the subcarriers that best represent the gesture data are selected by principal component analysis. After denoising, mathematical statistics are extracted from the gesture waveform as feature values, and the features are fused by a Deep Restricted Boltzmann Machine (DBM). Finally, the gesture classification and recognition results are obtained by a Gated Recurrent Unit (GRU), so that the model realizes both the modeling and the classification of sign language gestures. The results show that the proposed method can effectively recognize Chinese sign language gestures of different people in different environments and has good robustness.

1. Introduction

Wireless communication technology has evolved to the point where traffic carried by WiFi and mobile devices accounts for 68% of total global Internet traffic [1]. In 2011, Halperin et al. obtained the Channel State Information (CSI) of WiFi signals from Intel 5300 NICs, and CSI-based research on human perception and indoor localization has since become a focus of academic attention [2]. Some scholars have focused on gesture recognition and worked on real-life applications, and wireless communication technology is now receiving attention for its application in the field of special education [3]. According to statistics, there are about 1.57 billion people with hearing impairment worldwide [4]. In China, the number of deaf people exceeds 20.8 million, accounting for 1.69% of the country's total population. As a universal language for deaf people, sign language is a necessary means of communication and learning for the hearing impaired and speech impaired. With the advancement of gesture recognition technology, wireless Internet of Things environments enable people to interact freely with environmental devices.

In special education, deaf people learn Chinese characters through Pinyin and ultimately use this to learn sign language. Meanwhile, as a special language, if sign language can be converted into the corresponding characters, it will greatly facilitate communication between deaf people and hearing people. Thus, applying sign language recognition based on wireless communication technology to the teaching of sign language to special groups such as deaf people is a very effective way of teaching.

Sign language recognition systems based on multiple methods can break the barrier that exists between deaf people and people who do not know sign language. For example, some sign language recognition systems use cameras [5–9] or Kinect [10, 11] to capture sign language gesture data. However, such data acquisition equipment is affected by lighting conditions, and capturing data with a camera violates personal privacy. Some gesture recognition systems propose the use of somatosensory controllers for gesture data acquisition; however, this data collection method is sensitive to the distance and displacement between the device and the person. Other works propose that sensors can be used for motion data acquisition [12–14]. However, for gesture recognition, the user needs to wear the sensor on a finger or an arm, which greatly affects the user's daily life.

Since 2013, academia has paid increasing attention to CSI-based research on human perception and indoor positioning, and human behavior research based on CSI has achieved good results. Currently, CSI is used for intrusion detection [15–17] and gesture recognition [18–20]. Moreover, WiFi devices open new paths for gesture recognition because they are simple and easy to deploy. At present, most gesture recognition work using WiFi targets American Sign Language gestures, daily actions, numbers, and so on. Our goal is to recognize Chinese sign language gestures, which are large in number and complicated in action. When different experimenters perform the gestures many times, the data volume and computational overhead become large, so we need a recognition algorithm with high accuracy and low computational cost.

Although there has been some progress in gesture recognition, CSI is very sensitive to the indoor environment, and the whole transmission process is affected by the multipath effect. Moreover, feature extraction for gesture recognition is challenging, the actual application environment greatly affects the recognition results, and when the amount of gesture data is large, the computational cost is high. To solve the above problems, we propose a sign language gesture recognition method based on CSI, which can recognize the Hanyu Pinyin sign language of Chinese deaf people; the Hanyu Pinyin sign language gestures are shown in Figure 1. Hanyu Pinyin differs from other languages and has a total of thirty letters. The sign language gesture actions are described by the CSI measurements. After denoising the raw data, Air-CSL classifies and recognizes the preprocessed CSI values. We validated Air-CSL in both laboratory and classroom environments and verified that the method achieves good robustness in perceiving and recognizing Pinyin sign language gestures. In summary, we make four contributions.
(1) To mitigate the multipath effect, this paper combines the Local Outlier Factor (LOF) removal algorithm with the Discrete Wavelet Transform (DWT) to remove noise from the gesture data.
(2) Commonly used mathematical statistics such as kurtosis cannot fully describe the characteristics of sign language gestures, resulting in insufficient gesture information, and the large number of gesture types easily leads to misclassification. Therefore, the Deep Restricted Boltzmann Machine (DBM) is used to train the mathematical statistics and extract more abstract and comprehensive sign language gesture features.
(3) The sample size of the collected sign language gesture data is large and the computation time is long, which makes the feedback time of human-computer interaction extremely challenging. Therefore, the Gated Recurrent Unit (GRU) is used for Chinese sign language gesture recognition.
(4) Finally, we propose the Air-CSL model based on commercial WiFi devices. Through a large number of experiments, the superiority of the model in recognizing different people's sign language gestures in different experimental environments is verified, and the recognition rate in the empty hall reaches 91.77%.

2. Related Work

Currently, researchers have proposed various perceptual techniques for human action recognition, mainly based on sensors, computer vision, and wireless devices.

The first category recognizes human actions through sensors. Yuan et al. proposed the use of sensor-equipped accessories as data acquisition devices to capture gestural actions, with recognition results of more than 90% in all cases [21]. However, wearable device-based recognition methods require the user to wear a special device, which affects the action description and greatly reduces the comfort of the user.

The second category recognizes actions through computer vision, for example gesture recognition with the Microsoft Kinect sensor [22, 23]. For instance, Liu et al. used double and tenfold cross-validation to achieve more than 91% recognition of Arabic numerals (0-9) and English letters (A-Z). In [24], a projector-camera system is used to extract the spatial information of human actions and thereby control an Augmented Reality (AR) model with dynamic gestures; four gestures are recognized and combined with projected AR environment construction, correction, and registration to achieve surgical guidance. In [25], a surgical robot is placed on a mobile platform, and the advantages of AR visual guidance, the surgeon's experience, and robotic surgery are integrated to enhance the surgical field of view; the eight gestures in that work control the robot and the AR system, achieving high-quality interaction with surgical information. The work in [26] recognizes hand gestures from video: by modeling the human skeleton and retaining key frames, time-dimension information is used as the classification key, and for static gestures the method achieves a very satisfactory recognition rate. However, visual recognition usually requires good lighting conditions and involves personal privacy issues, which limits its practical use.

The third category, wireless device-based action recognition, can be realized with ultra-wideband radar [27], radio frequency identification (RFID) technology [28], received signal strength indication (RSSI), or CSI [29]. Among them, RFID and ultra-wideband radar require dedicated equipment with high deployment complexity, so RSSI or CSI is currently mostly used for context awareness. Such sensing removes the need for people to constantly wear sensor devices. Researchers have used WiFi for indoor person behavior detection [30, 31].

Kang et al. [32] used an adversarial learning scheme and a feature disentanglement module to remove the influence of irrelevant factors in gestures and used an attention scheme based on the output of the source-domain discriminator to reflect the similarity differences between multiple source domains and the target domain, thereby reducing negative transfer. On the Widar 3.0 dataset, their model improved the evaluation results by an average of 3% to 12.7%. WiTrace [33] used a synthesized signal to derive the phase of the hand-reflected signal and measured the phase change to obtain the movement distance in 1D; for 2D space, WiTrace used a Kalman Filter to suppress tracking noise. The method achieved an average accuracy of 6.23 cm for initial position estimation and average tracking errors of 1.46 cm and 2.09 cm for 1D and 2D tracking, respectively. DFGR used a deep network that learns discriminative deep features and can use transferable similarity to evaluate its ability under test conditions [34]. The WiGAN system proposed by Jiang et al. uses generative adversarial networks to extract and generate gesture features, fuses the features, and classifies human activities with support vector machines, with a final average recognition accuracy of over 95% [35].

Through experiments, we found that when the gesture data reach a certain magnitude, using support vector machines (SVM) for feature extraction and gesture recognition requires a large overhead, and existing gesture recognition work lacks sufficient research on Chinese gestures. In this paper, the LOF algorithm is used to remove outliers from the sign language gesture data, and the DWT is used to process the low-frequency information in the gesture data. After preprocessing, the subcarrier that best represents the sign language gesture information is selected through PCA, which effectively removes part of the environmental interference and reduces the computational overhead. To address the problems of large computational overhead and low recognition accuracy, this paper extracts statistical features of the CSI amplitude and fuses them with the DBM. Finally, the fused gesture features are input into the GRU to obtain the recognition results.

3. Overview of the Gesture Recognition Method

Sign language gesture recognition by CSI requires four steps: sign language gesture data sensing, noise removal, feature extraction, and sign language gesture recognition, and the workflow is shown in Figure 2. We used two laptops configured with Intel 5300 NIC for data acquisition, one working in IEEE 802.11n Monitor mode as the transmitter and the other as the receiver.

3.1. Data Acquisition and Preprocessing

There are multiple reflection paths caused by gesture actions. Traditional CSI-based action recognition generally uses the channel features of a single antenna and a single link as the sensed data; however, such an approach tends to lose action feature data. Multiple antennas provide sufficient CSI, and the raw information we collected is shown in Figure 3. The channel impulse response can be used to describe the propagation paths and is expressed as

$$H(f,t) = H_s(f) + \sum_{k=1}^{K} a_k(f,t)\, e^{-j 2\pi f_{D_k} t},$$

where $H_s(f)$ denotes the static response, $K$ denotes the number of dynamic reflection paths, $a_k(f,t)$ denotes the complex attenuation factor of the $k$-th path, and $f_{D_k}$ denotes the Doppler shift of the reflected signal.

The amplitude can describe the static response of the CSI. Since the experimenter is on the central link and there is a strong LOS signal in the indoor environment, we choose an antenna with a smaller amplitude. Meanwhile, different antennas have different sensitivities to hand gesture actions; the variance can fully describe the CSI changes caused by gesture actions, and a larger variance indicates a better reflection of the dynamic response. Therefore, we choose the antenna with the maximum CSI variance and a relatively small amplitude.
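For illustration, a minimal NumPy sketch of this antenna selection rule is given below. The paper only states that the chosen antenna should have the maximum variance and a relatively small amplitude; the variance-to-amplitude ratio used as a single score is our own assumption.

```python
import numpy as np

def select_antenna(csi_amp):
    """csi_amp: |CSI| array shaped (antennas, subcarriers, packets).

    Rule from the text: prefer a large CSI variance (strong dynamic response
    to the gesture) and a relatively small mean amplitude (weaker static LOS
    component). The variance-to-amplitude ratio is a hypothetical way to
    combine the two criteria into one score.
    """
    mean_amp = csi_amp.mean(axis=(1, 2))       # static level per antenna
    var_amp = csi_amp.var(axis=(1, 2))         # dynamic response per antenna
    return int(np.argmax(var_amp / (mean_amp + 1e-9)))
```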

The filtered antennas are still affected by the multipath effect and inherent noise, and there are spikes and burrs in the data waveform. To remove the outliers while preserving the gesture information as much as possible, we choose the LOF algorithm to remove outliers from the gesture data, as shown in Figure 4(a). It can be expressed as

$$LOF_k(p) = \frac{\sum_{o \in N_k(p)} \frac{lrd_k(o)}{lrd_k(p)}}{\lvert N_k(p) \rvert},$$

where $\lvert N_k(p) \rvert$ denotes the total number of points in the $k$-distance neighborhood of the point $p$, and $lrd_k(p)$ is the local reachable density of the point $p$.

The principle of anomaly removal is as follows: when $LOF_k(p)$ tends to 1, the density of the neighboring points of the measured point is almost equal to that of the point itself, and they belong to the same cluster; the more $LOF_k(p)$ exceeds 1, the larger the difference between the density of the measured point and that of its neighborhood, and the point is considered an anomaly; if $LOF_k(p)$ is much less than 1, the measured point is considered a dense point.
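A minimal sketch of this outlier-removal step, using scikit-learn's LocalOutlierFactor as a stand-in for the paper's LOF implementation; the neighborhood size and the interpolation over removed points are assumptions, not values from the paper.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def remove_outliers_lof(signal, n_neighbors=20):
    """Mark LOF outliers in a 1-D CSI amplitude stream and interpolate over them."""
    x = np.asarray(signal, dtype=float).reshape(-1, 1)
    labels = LocalOutlierFactor(n_neighbors=n_neighbors).fit_predict(x)  # -1 marks outliers
    cleaned = x.ravel().copy()
    idx = np.arange(len(cleaned))
    bad = labels == -1
    cleaned[bad] = np.interp(idx[bad], idx[~bad], cleaned[~bad])         # replace outliers
    return cleaned
```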

When there is more interference in the environment, the DWT is selected to remove the multipath effect, as in Figure 4(b). The DWT performs multiscale analysis of fine-grained actions and uses Symlet5 to decompose the signal into approximation coefficients and multiple detail coefficients, removing high-frequency noise while retaining the approximate shape and details of the gesture waveform; the detail coefficients describe the random noise of the device and the fine details of the CSI data. This can be expressed as

$$x(n) = cA_J(n) + \sum_{j=1}^{J} cD_j(n),$$

where $cA_J$ is the approximation coefficient, $cD_j$ is the detail coefficient at level $j$, and $n$ denotes the gesture data sample points. A soft thresholding algorithm is applied to the detail coefficients, and the inverse Discrete Wavelet Transform is used to reconstruct the denoised gesture waveform $\hat{x}(n)$, expressed as

$$\hat{x}(n) = cA_J(n) + \sum_{j=1}^{J} \widehat{cD}_j(n),$$

where $\widehat{cD}_j$ denotes the thresholded detail coefficients.
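The following PyWavelets sketch illustrates this denoising, assuming a sym5 wavelet, soft thresholding of the detail coefficients, and a universal threshold; the decomposition level and the threshold rule are assumptions, not values from the paper.

```python
import numpy as np
import pywt

def dwt_denoise(signal, wavelet="sym5", level=4):
    """Decompose with sym5, soft-threshold the detail coefficients, reconstruct."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)       # [cA_J, cD_J, ..., cD_1]
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745            # noise scale from the finest details
    thr = sigma * np.sqrt(2 * np.log(len(signal)))            # universal threshold (assumption)
    coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]       # denoised waveform x_hat
```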

The 30 subcarriers remaining after denoising include subcarriers that are weakly correlated with the gesture actions, so the principal component analysis (PCA) algorithm is used to reduce the dimensionality and select the subcarriers with high similarity to those before the reduction. First, the mean value of the gesture sample set is obtained as

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i,$$

where $N$ is the number of samples $x_i$.

Let $X' = \{x'_i\}$ be the sample set after the normalization process, where $x'_i = x_i - \bar{x}$. The covariance matrix of the reconstructed samples is obtained as

$$C = \frac{1}{N} X'^{\mathsf{T}} X'.$$

Let $\Lambda$ be the eigenvalue matrix of this covariance matrix. The first $m$ eigenvalues are taken in descending order, and $W$ is the eigenvector matrix composed of the eigenvectors corresponding to these eigenvalues. The eigenvector matrix is multiplied with the centered sample set to obtain the reduced-dimensional matrix $Y = X'W$. The first principal component is finally retained as the CSI waveform for gesture recognition, and the results of subcarrier extraction by the principal component analysis algorithm are shown in Figure 4(c).
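A NumPy sketch of the PCA step written out via the covariance matrix, mirroring the equations above; keeping ten components follows Section 4.7, and retaining only the first principal component as the gesture waveform follows the text.

```python
import numpy as np

def pca_first_component(X, n_components=10):
    """X: (packets, 30) matrix of denoised subcarrier amplitudes."""
    Xc = X - X.mean(axis=0)                     # centre each subcarrier (x_i - x_bar)
    C = Xc.T @ Xc / Xc.shape[0]                 # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)        # eigen-decomposition (ascending order)
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]                       # eigenvector matrix W
    Y = Xc @ W                                  # reduced-dimensional matrix Y
    return Y[:, 0]                              # first principal component as the gesture waveform
```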

3.2. Feature Extraction

Different people differ in the manner and speed with which they describe a gesture, and it is difficult to guarantee that the waveforms are identical even when the same person performs the same gesture; waveforms of different gestures performed by different people are shown in Figure 5. To make the waveform of the same gesture consistent and to highlight the differences between different gestures, multiple feature values can be selected, but too many feature values are likely to cause overfitting. Therefore, in this paper we select four feature values: skewness, kurtosis, standard deviation, and peak-to-peak value, as described in Table 1. The skewness is

$$S = \frac{\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^3}{\sigma^3},$$

where $x_i$ denotes the data points in the sample and $\bar{x}$ denotes the mean of the data points. The kurtosis is

$$K = \frac{m_4}{\sigma^4}, \qquad m_4 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^4,$$

where $m_4$ represents the fourth-order central moment, $x_i$ and $\bar{x}$ represent the sample points and sample mean, respectively, and $\sigma$ represents the standard deviation,

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2}.$$

The peak-to-peak value

$$x_{pp} = x_{\max} - x_{\min}$$

refers to the difference between the highest and lowest values of the signal in one cycle, that is, the range between the maximum and the minimum.
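The four statistics of Table 1 can be computed directly with SciPy, as in the following sketch; kurtosis is taken in its non-Fisher form so that it equals the fourth central moment divided by $\sigma^4$, as above.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def basic_features(x):
    """Skewness, kurtosis, standard deviation, and peak-to-peak value of a waveform."""
    return np.array([
        skew(x),                      # S
        kurtosis(x, fisher=False),    # K = m4 / sigma^4
        np.std(x),                    # sigma
        np.ptp(x),                    # x_pp = x_max - x_min
    ])
```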

When there are many gesture samples and many gesture types, single features easily fail to describe the gestures adequately. Restricted Boltzmann Machines usually perform well in extracting high-dimensional features from data. Stacking Restricted Boltzmann Machines (RBM) to form a DBM can fuse and reduce the dimensionality of gesture features, thus compensating for the recognition errors caused by single features. The multilayer nonlinear structure of the DBM can simulate complex nonlinear functions. The key component of the DBM is the RBM: the input data are detected, identified, and classified by combining multiple layers of RBMs with a final classifier. We input the raw mathematical statistics of the data waveform into the DBM, which maps them to deep features through multiple feature reconstructions.

Feature fusion through DBM-3 can be roughly divided into the following three steps:

Step 1. Input the basic feature values of the gesture data (skewness, kurtosis, standard deviation, and peak-to-peak value) into the DBM-1 layer for the first deep fusion to obtain new feature values.

Step 2. Combine the features pairwise and input them into DBM-2 for the second fusion to obtain the second-level feature values.

Step 3. The features are combined and input into DBM-3 to obtain the final fused features, which serve as the input of the GRU in the gesture recognition stage. The feature fusion via the DBM is shown in Figure 6, and a code sketch of this fusion pipeline is given below.
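As a rough illustration of this three-stage fusion, the following sketch stacks three greedily trained BernoulliRBMs from scikit-learn on the four basic statistics. A true DBM is trained jointly, so this DBN-style stand-in only illustrates the data flow; the layer sizes and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import MinMaxScaler

def fuse_features(stats_matrix, layer_sizes=(16, 16, 8)):
    """stats_matrix: (samples, 4) array of skewness, kurtosis, std, peak-to-peak."""
    h = MinMaxScaler().fit_transform(stats_matrix)        # RBMs expect inputs in [0, 1]
    for n_hidden in layer_sizes:                          # DBM-1, DBM-2, DBM-3
        rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05, n_iter=50)
        h = rbm.fit_transform(h)                          # hidden activations feed the next layer
    return h                                              # fused features, the GRU input
```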

3.3. Recognition Model

A recurrent neural network (RNN) processes time-series data through an input layer, hidden layer, and output layer, and its output is determined by the current input and the previous hidden state; however, it ignores the relationship between information separated by long intervals and the current input. To address these deficiencies, the RNN has been extended into models such as the long short-term memory network (LSTM) and the Gated Recurrent Unit (GRU) [36]. However, the LSTM requires four linear layers per unit, and its training efficiency is low. Compared with the RNN and LSTM, sequence modeling with the GRU involves fewer parameters and is easier to train.

The GRU contains an update gate and a reset gate. The update gate determines what information needs to be added to the new state and what information from the previous time step $t-1$ needs to be removed. The reset gate determines the degree to which the hidden state at the previous time step $t-1$ is forgotten. At time $t$, the inputs of both the reset gate and the update gate are the gesture waveform timing features extracted by the DBM and the hidden state $h_{t-1}$ of the previous time step. The network structure of the GRU is shown in Figure 7, and the model is derived from the literature [37].

In the GRU, the reset gate $r_t$, the update gate $z_t$, the candidate hidden state, and the final hidden state are updated by the following equations:

$$r_t = \sigma(W_r x_t + U_r h_{t-1}),$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1}),$$
$$\tilde{h}_t = \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1})\big),$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,$$

where $x_t$ represents the feature value information input at moment $t$, $h_t$ is the hidden state at moment $t$, and $\sigma$ is the sigmoid activation function.

After the feature information enters the GRU, it is combined with the hidden state of the previous moment through the sigmoid activation function to obtain the gating signals $r_t$ and $z_t$. The hidden state $h_{t-1}$ of the previous moment, element-wise multiplied by the reset gate $r_t$, is passed together with $x_t$ through the tanh activation function to obtain the current candidate state $\tilde{h}_t$. The update gate then forgets some unimportant information in $h_{t-1}$ and retains the important information in $\tilde{h}_t$, yielding the new hidden state for the next time step. After several training rounds, the probability of each gesture category is finally derived by the softmax function as

$$p_i = \frac{e^{y_i}}{\sum_{j} e^{y_j}},$$

where $p_i$ is the probability of the $i$-th gesture class and $y_i$ is the corresponding network output. We choose the ReLU activation because its calculation speed is fast, its computational cost is small, and it shortens the training time.
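A minimal PyTorch sketch of the recognition model described above: fused feature sequences go through a GRU, and a softmax over the final hidden state yields the class probabilities $p_i$ for the 30 Pinyin gestures. The hidden size and feature dimension used here are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class GestureGRU(nn.Module):
    """Fused feature sequences -> GRU -> linear layer over 30 gesture classes."""
    def __init__(self, feat_dim=8, hidden_dim=64, n_classes=30):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        _, h_n = self.gru(x)                   # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])                # class logits y_i

    def predict_proba(self, x):
        return torch.softmax(self.forward(x), dim=-1)   # probabilities p_i
```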

The pseudocode of the recognition framework training process of Air-CSL is shown in Algorithm 1. For the test of gesture data, we use the trained model to output the probability.

Input: the fused gesture feature sequence
Output: the training loss value L
Initialize the model parameters, the hidden layer dimension, the number of training rounds E, and the training loss value L
1: initialize the hidden state h_0
2: for t = 1 to T do
3:  use x_t and h_{t-1} to update the reset gate r_t, the update gate z_t, and the candidate state
4:  forget unimportant information and update the hidden state h_t
5: End for
6: pass the data to the tensor
7: for e = 1 to E do
8:  if the current predicted loss is smaller than L
9:   save the model parameters
10:   update L
11:  else
12:   output the training round e and L
13:  End if
14: End for
15: Return L
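The following PyTorch sketch mirrors the structure of Algorithm 1: train for a fixed number of rounds, save the model parameters whenever the loss improves, and return the best loss. The optimizer, learning rate, and checkpoint path are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

def train_air_csl(model, loader, epochs=100, lr=1e-3, ckpt="air_csl_best.pt"):
    """Hedged sketch of Algorithm 1: keep the parameters with the lowest loss seen."""
    criterion = nn.CrossEntropyLoss()                  # applies softmax internally to the logits
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss = float("inf")
    for epoch in range(epochs):                        # training rounds E
        epoch_loss = 0.0
        for x, y in loader:                            # x: fused feature sequences, y: gesture labels
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        epoch_loss /= len(loader)
        if epoch_loss < best_loss:                     # save parameters when the loss improves
            best_loss = epoch_loss
            torch.save(model.state_dict(), ckpt)
    return best_loss
```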

4. Experiments and Evaluation

4.1. Implementation

We collected gesture data with two laptops containing Intel 5300 NICs as a pair of transceivers, with one antenna at the receiving end and three at the transmitting end. Considering the actual lecture environment, we set the horizontal distance between the devices to 2 meters. The experimental environments are an empty hall and a classroom, and the scene layout is shown in Figure 8. To reduce the impact of action completion time on the overall recognition rate, the data acquisition time was set to 10 s: the first 0-3 s were stationary, the experimenter began to describe the gesture in the 4th second, the action took about 2 s, and the hand was retracted in the 7th second. Each action was repeated 20 times. Seventy percent of the experimental data were used for training the recognition model, and 30% were used as the test set. Twelve experimenters were selected, and their height and weight information is shown in Table 2. Experimenters 1-6 are randomly selected volunteers from our college who participated in all the comparative experiments in this paper, while experimenters 7-12 were added later and only participated in the experiments in Section 4.4.

4.2. Influence of Different Heights

A pair of transceivers was chosen as the experimental equipment. Considering the practical situation, we expect the device height to have some effect on gesture recognition. We speculate that, taking the height of the experimenter's average center of gravity as the middle value, the sensitivity of the signal to the gesture decreases as the device is raised or lowered, and when the height reaches a certain limit, the effect of the sign language action on the CSI stream almost disappears. Accordingly, the accuracy of sign language recognition decreases as the device height increases or decreases: a stronger signal responds better to hand movements and hence has higher sensitivity, and conversely the sensitivity drops. The accuracy rates at different heights are shown in Figure 9.

As seen in Figure 9, the sign language recognition performance is best when the device height is 1.35 m, and the performance decreases to different degrees as the height increases or decreases. We calculated the average of the heights considered comfortable by the six subjects and, combined with real-life teaching scenarios, finally chose a device height of 1.25 m. All subsequent comparison experiments used the 1.25 m height setting.

4.3. Influence of Different Packet Sending Rates

Sign language gestures are fine-grained actions, and the signal changes they cause may be weak or brief and fast. A high sampling rate can therefore fully describe each gesture, and collecting enough CSI values benefits feature extraction and classification. To evaluate the performance at different sampling rates, we varied the packet rate from 400 to 1200 packets/s. The experimental results show that the accuracy of the method reaches its maximum at 1000 packets/s, so the packet rate is set to 1000 packets/s in the subsequent experiments. At the same time, we found that the packet rate and the gesture recognition accuracy are roughly positively correlated, but when the packet rate exceeds 1000 packets/s, better hardware is needed to support the storage, as shown in Figure 10.

4.4. Influence of Different Experimenters
4.4.1. Influence of Different Individuals

Different people differ in the manner and time spent describing sign language gestures. To compare the recognition accuracy of different people in the same environment, we compared the recognition results of six experimenters in an empty hall and a classroom. The recognition rates of the different experimenters are shown in Figure 11; there are some differences between them, and the recognition results of female students are better than those of male students due to inherent differences in body size.

The recognition rate reported for each experimenter is the average over the two scenes. We found that the recognition rate is relatively low for slightly overweight people and for people who describe the gestures too quickly, and higher for people with proportional bodies and an even gesture description process. Since the device height is set at 1.25 m, experimenters who are much taller or shorter are also somewhat affected. Overall, the average gesture recognition rate of different people in the open environment stays above 91.77%, which indicates that Air-CSL adapts well to different people.

4.4.2. Influence of Different Genders

To fully study the influence of gender on the recognition rate, we added four males and four females to the existing experimenters and collected data from these additional people in the same experimental scenarios; each gesture was repeated 10 times in each scene. This shows whether Air-CSL achieves good recognition rates even for untrained participants, and it also reflects the influence of gender differences on the gesture recognition rate.

As can be seen from Figure 12, the gesture samples of the untrained participants also achieved good recognition results compared with the trained samples; the average recognition rate of people without sample training was 88.27%. Figure 12(a) shows the recognition rates of the females. Comparing it with Figure 12(b), we find that among the randomly selected participants the recognition rate of females is better than that of males: the average recognition rate was 88.89% for females and 87.66% for males. We attribute this to the following: for the same body mass index (BMI), the women described the motions more slowly than the men; in addition, among the subjects we selected, the women had less fat on their hands, so their fingers were thinner, which is conducive to sign language gesture recognition.

4.5. Influence of Different Experimental Environments

To verify the robustness of the method, we added static and dynamic interference to the two existing experimental environments. Static interference (I): a chair is placed at a horizontal distance of 0.5 m from the transmitter and the receiver, respectively. Dynamic interference (I): an experimenter walks at a uniform speed along a path parallel to the line-of-sight path at a distance of 1 m. Dynamic interference (II): the door of the room is flapped by hand to simulate a door disturbed by external factors in a teaching environment. The recognition rates in the different environments are shown in Figure 13.

The experimental results show that static interference (I) has relatively little effect in both environments: the average recognition rate in the empty hall can still reach 90.78%, whereas the classroom, which contains more furniture, shows a larger decrease. When dynamic interference (I) was added, the gesture recognition rates in the two environments decreased noticeably to 88.38% and 84.32%, respectively, because the large gait motion amplitude interferes with the CSI; however, these values are still within an acceptable range. Dynamic interference (II) had a more obvious impact in both environments, because the human-controlled door swings have a larger amplitude and the swing speed varies each time; the gesture recognition accuracy therefore decreased significantly, to 85.7% and 81.8%. From Figures 13(b) and 13(c), it can be seen that as the interference in the environment increases, experimenters whose original recognition rates were low are affected more seriously.

4.6. Influence of Different Feature Values on Gesture Recognition

Feature extraction directly affects the gesture recognition results. To verify the effectiveness of the selected feature extraction method, we test the unfused mathematical-statistics features and the fused features separately in different environments. Figure 14 shows the average recognition rate of the original feature combinations in the two experimental scenarios: the recognition accuracy is lower when a single statistical value is used as the feature, with recognition rates distributed in [84.24%, 89.36%], because the original statistics contain errors. Figure 15 shows the average recognition rate of the features fused by the proposed method in the same two scenarios. After feature fusion, the gesture recognition rate improves significantly in both environments, with an average recognition rate of 89.85%, which indicates the effectiveness of the feature fusion.

4.7. Influence of Different Data Preprocessing Methods

Due to the large amount of noise in the raw data and to facilitate the extraction of gesture features, we propose the combined LOF, DWT, and PCA data processing scheme. To prove its effectiveness, we remove each preprocessing step in turn; in addition, existing gesture data processing schemes are added for comparison. In the following, we use the gesture data of the initial six experimenters for the evaluation.

(a) We remove the LOF algorithm and compare with the full method; the results are shown in Figure 16. The overall gesture recognition rate decreases after the LOF algorithm is removed. Because data collection is random and the external environment is variable, the collected CSI data contain unknown outliers that are scattered for various reasons and lie far from the normal data. The core of the LOF algorithm is to detect these scattered outliers by computing and comparing the local density of the data distribution, so we use it to preprocess the data and facilitate gesture recognition.

(b) The DWT step is removed and compared with the full method; the results are shown in Figure 16. The gesture recognition rate decreases in both environments, and the decrease is more obvious in the classroom than in the empty hall. This is because the gesture information is low-frequency, and without DWT processing the low-frequency and high-frequency signals are mixed; the high-frequency components include both those caused by hand gestures and those caused by environmental factors. The DWT is chosen to retain the peaks and abrupt changes of the useful signal: the detail and approximation coefficients filter out the high-frequency information caused by environmental noise while retaining the signal changes caused by the gestures.

(c) The PCA step is removed and compared with the full method; the results are shown in Figure 16. The recognition rate decreases significantly without PCA. Environmental factors cannot be avoided during gesture data collection, and PCA uses the covariance matrix to eliminate, as far as possible, the influence of different factors on the CSI data. In the actual data processing stage, we tried different numbers of principal components; the results show that the more principal components are selected, the higher the cumulative contribution rate, but more environmental interference is also retained and the computational cost increases. After several attempts, we finally selected the first ten principal components.

(d) In addition to the self-comparison, we compared with other data preprocessing methods. The first uses a Butterworth low-pass filter with a cutoff frequency of 100 Hz (abbreviated "B") [38]; in the actual comparison, the cutoff frequency is set lower to match the frequency of the gesture motion. The second, following studies that linearly interpolate the raw data, filter with a Butterworth low-pass filter, and finally apply PCA, is abbreviated "B + I + P" [39]. The comparison results are shown in Figure 16 and indicate that LOF + DWT + PCA is more suitable for gesture data preprocessing. We believe that a single filter can only roughly filter the data, and that linear interpolation of the raw data easily leads to problems such as low interpolation accuracy and poor smoothness of the data curve. In summary, the combination of LOF + DWT + PCA processes the data better.

4.8. Influence of Different Algorithms

In recent years, scholars have proposed a variety of methods for gesture recognition, such as LSTM, SVM, and the convolutional neural network (CNN). To fully demonstrate the performance of Air-CSL, Air-CSL (GRU) is compared with existing CNN, LSTM, SVM, k-NN, and random forest (RF) classifiers. These methods are used to recognize the data collected in the two scenarios, as shown in Figure 17.

We evaluate the different algorithms by the cumulative distribution function (CDF) of the recognition error rate. In Figure 17, the x-axis represents the recognition error rate and the y-axis represents the CDF. The device height was the initial 1.25 m. Figure 17 shows that about 85% of the Air-CSL test data have an error rate of less than 10%. When CNN is selected as the recognition algorithm, the performance is poor: only about 57% of the test data have an error rate of less than 20%. This indicates that Air-CSL maintains high accuracy for sign language gesture recognition in the two selected scenes.
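The empirical CDF plotted in Figure 17 can be computed as in the following sketch; the error values used here are placeholders, not the paper's measurements.

```python
import numpy as np

def empirical_cdf(errors):
    """Return sorted error rates and their cumulative fractions for a CDF plot."""
    x = np.sort(np.asarray(errors, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y          # plot y against x: "fraction of tests with error <= x"

err = np.random.rand(200) * 0.2          # hypothetical per-test error rates
x, y = empirical_cdf(err)
print("share of tests with error < 10%:", float((err < 0.10).mean()))
```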

To further compare the performance of the Air-CSL method with other gesture recognition models, we perform cross-validation of the above models. We compare the performance of the different methods in the empty hall by accuracy, recall, and F1-score, as shown in the performance comparison of different algorithms in Table 3. The F1-score is the harmonic mean of precision and recall, and a larger F1 value indicates better model performance. From Table 3, we can conclude that the recognition rate and F1-score of the Air-CSL method are higher than those of the other methods, which proves that Air-CSL can effectively recognize the sign language gestures of deaf people and has better overall performance.
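The metrics of Table 3 can be obtained with scikit-learn as sketched below; y_true and y_pred are placeholders for the 30-class gesture labels and model predictions, and macro averaging is an assumption.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def summarize(y_true, y_pred):
    """Accuracy, recall, and F1 as used in Table 3 (macro-averaged over the 30 classes)."""
    acc = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}
```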

In order to fully demonstrate the superiority and robustness of the proposed method, we also evaluate algorithm performance in terms of processing time. The number of training samples is 3200 and the number of testing samples is 960. As can be seen from Section 4.7, the various methods recognize gestures better in the open environment, while the recognition results in the classroom are generally lower. In terms of computation time, the best algorithm is LSTM, followed by GRU, then CNN and SVM. Compared with the GRU model chosen in this paper, the LSTM algorithm performs better in computation time, but combining this with its recognition rate in different environments, we believe the GRU model is more suitable for the sign language gestures recognized in this paper.

4.9. System Performance Evaluation

The core of Air-CSL is essentially a feature-fusion recognition algorithm: it fuses commonly used mathematical statistics through the DBM and transforms the feature matching of a general gesture recognition scheme into a feature fusion problem. The Loss and Acc curves of model training are shown in Figure 18. As the number of training epochs increases, the loss value gradually decreases, which shows that the trained model converges, and the Acc value gradually increases, which shows that the training accuracy gradually improves and reaches the expected value.

We used the datasets collected from six experimenters in the two environments to conduct multiple groups of comparison tests for evaluation. The recognition rate in the empty hall reaches 91.77%, and the average recognition rate in the classroom is 87.97%. To fully describe the recognition accuracy of the proposed sign language gesture recognition method, we present the comprehensive recognition results of the 30 sign language gestures as a confusion matrix, as shown in Figure 19. In general, the recognition results of all gestures are satisfactory, but for similar gestures, such as "M" and "N" or "CH" and "C," the misjudgment rate is relatively high.

5. Conclusion

This paper proposes a CSI-based sign language gesture recognition method that combines real-life applications with the influence of environmental factors on people's gesture features. The LOF algorithm is used for outlier removal, the collected gesture data are denoised and filtered by the DWT and PCA, and the time-domain information is extracted and fused by the DBM. Finally, the fused gesture features are put into the GRU network for gesture classification and recognition. After various comparison tests, the results show that the average recognition rate of Air-CSL for sign language gestures is 88.93%.

Although we achieved satisfactory accuracy, there are still some limitations. Follow-up work will focus on the following aspects: (1) improving the robustness of the method and applying it to continuous sign language gesture recognition in different environments; (2) adding gesture features with frequency-domain information on the existing basis to describe human gesture features as comprehensively as possible; (3) building on existing recognition models such as CNN, LSTM, k-NN [40], and SVM, trying their upgraded versions and comparing them with our recognition model in future work.

Data Availability

The data that support the findings of this study are available from the corresponding author upon reasonable request. However, as the data involved in this paper are still being used in other studies, only part of the data can be provided.

Conflicts of Interest

The authors declare no potential competing interests in the paper.

Acknowledgments

This work was supported by the research on node localization and coverage technology of the Gansu Provincial Key Research and Development Program of Science and Technology Support: WiFi indoor positioning technology based on group intelligence perception (20YF8GA048), the National Natural Science Foundation of China (62162056), and the multiwireless signal collaborative cross-domain sensing technology in the Internet of Things and the 2019 “Light of the West” Talent Program of the Chinese Academy of Sciences: Topological Reliability Research of Ecological Monitoring of Internet of Things (201923). Special thanks are due to the volunteers who participated in the experimental data collection for this paper.