Abstract
The conventional CNN-LSTM network has limited generalization ability and does not capture the backward correlation between sign language actions. In this work, a convolutional self-encoding temporal network fused with an attention mechanism, namely, the convolutional block attention module (CBAM), is proposed. The model first designs a convolutional self-encoding network for pretraining to obtain feature vectors of smaller dimensions. Second, it uses batch normalization (BN) to speed up the training process and enhance the generalization ability of the network. Then, the encoder part of the pretrained convolutional autoencoder is combined with the embedded attention mechanism to further focus the weights on the important parts of the image features, and a Bi-LSTM is added to form a CNN-Bi-LSTM network. Compared with the traditional CNN-LSTM model, the proposed method continuously expands the training samples through the pretrained network to improve the generalization performance. The experimental results show that the proposed method effectively recognizes sign language videos. The recognition rate reaches 89.90%, which is higher than that of the compared methods. These results verify the feasibility and effectiveness of the proposed method.
1. Introduction
Sign language is a unique and meaningful means of conversation created by human beings. Deaf and mute individuals are able to communicate with the outside world by using sign language. In addition, sign language is one of the main means of communication in special military missions and special education. Therefore, research and applications based on sign language have been a focus of the research community in the field of human-computer interaction [1].
Sign language recognition from dynamic video is difficult for various reasons, such as differences in recording equipment, background, and the usage habits of sign language users. Moreover, sign language is continuous and includes not only hand shapes but also a series of arm movements. These hand movements are unstable and diverse and are distinguished based on the coherence of the movements, which are generally divided into two situations, namely, static and dynamic sign language [2]. Static sign language recognition, also known as gesture recognition, focuses on image features at a certain point in time. On the other hand, dynamic sign language recognition focuses on movements over a time span.
The CNN architecture is suitable for image feature extraction but is unable to extract temporal features from videos [3]. The extraction of temporal information requires a lot of data processing, and the resulting models are relatively complex. Earlier studies on sign language recognition were mainly based on wearable equipment [4]. However, motion trackers and data gloves are not only expensive but also of limited practical value. With the rapid development of deep learning and its computer vision applications [5–7], sign language research based on public RGB images has become a hotspot. Recurrent neural networks (RNNs) are very suitable for processing sequential data. However, they are unable to effectively extract features from images. Therefore, several research works employ CNNs to extract features from images, which are then used as the input of an RNN [8]. It should be noted that RNNs are prone to gradient explosion for long sequences. The long short-term memory (LSTM) network effectively addresses this problem [9]. The CNN-RNN and CNN-LSTM combinations can be effectively applied in the field of action recognition. Since sign language can be regarded as action recognition under certain conditions, these networks can also be used for sign language recognition [10]. As dynamic sign language data is mainly obtained from videos, the sign language information has strong forward and backward correlation. Therefore, a bidirectional LSTM network is used for sign language recognition [11, 12].
In this work, based on unsupervised learning, we first obtain the sign language image representation using a convolutional self-coding network. Then, the coding network is optimized, and the features are extracted by embedding the CBAM network, making the image representations more discriminative. The extracted image information is sorted and reorganized to highlight the spatial and temporal features. We propose a convolutional self-coding time-domain network based on CBAM for extracting deep continuous feature information from videos.
The major contributions of this work are as follows:
(1) We apply convolutional autoencoder networks to dynamic sign language recognition for the first time. The proposed method compresses the images to reduce the training time and learn deeper latent information.
(2) The proposed method improves the learning weight of the key regions of sign language images.
(3) We propose a novel self-coding sequential network with CBAM for sign language recognition.
2. Related Works
CNNs are widely used to process dynamic video data. Action recognition is an important aspect of dynamic video research. In [13], a temporal convolution network architecture with memory capability is proposed. The spatiotemporal features of video data can be extracted by using a CNN-RNN. In [14], a temporal convolution network based on LSTM cells is proposed. This network is able to effectively process long sequences and mitigates exploding gradients. In [15], the authors combine the channel attention mechanism and spatial attention mechanism of a CNN to increase the weight of the useful parts of the image. In [16], an unsupervised image compression network is designed, and an encoder structure is proposed, which enables the network to autonomously reconstruct and learn information without labels. The research works presented in the literature show that combining image data compression, attention weighting, and the correlation between successive actions is still a very challenging task.
Dynamic sign language recognition is very similar to traditional action recognition, as both mainly focus on arm movements and hand gestures. However, action recognition considers the actions of the whole body. Considering the semantics of sign language, continuous sign language actions need not only forward correlation but backward correlation as well. In [17], the features and spatiotemporal information are extracted using two modules. The experimental results show that, as compared with the unidirectional LSTM, the bidirectional Bi-LSTM effectively improves the accuracy of sign language recognition. The authors in [18] combine CNN and LSTM and use 3D skeleton data as input to extract the temporal and spatial motion characteristics. In [19], optical flow and RGB images are used as the input of a CNN-LSTM network, and feature fusion is performed in the fully connected layer to enhance the sign language action representation. Similarly, in [20], skeleton point information and image information are used as the input of a CNN-LSTM network to enhance the sign language action representation. The researchers in [21] use a 3D-CNN for extracting the spatiotemporal information from the videos.
Please note that the identification accuracy is determined by the quality of the feature extraction. As the number of layers in a neural network increases, the gradient vanishing problem becomes more severe. The authors in [22] use ResNet to avoid this problem. When training on sign language data, convergence is hard to guarantee. Moreover, the training process is time-consuming due to the large amount of information, and the image feature weights are diffuse and cannot be correlated with the sign language action. In order to address these issues, this paper combines the currently popular self-supervised learning and attention mechanism methods. A residual convolutional self-coding network is used to compress the images and reconstruct them in an unsupervised manner. Consequently, an effective characterization of the image is obtained. The channel attention information and spatial attention information are integrated to increase the weight of the important parts of the image features. Furthermore, the local image feature weights are increased to reconstruct the extracted image features along the time series and fuse them with the Bi-LSTM neural network. Finally, the processed video data is handled by the CBAM-based autoencoder time series neural network to facilitate sign language video classification.
3. Basic Network
In this work, a long- and short-term continuous sign language recognition network based on a convolutional self-coding network and a fused attention mechanism is proposed. The purpose of the proposed method is to enable the convolutional self-coding network to learn information from the data in a self-supervised manner and to retain the weights of the convolutional coding part of the network. We embed CBAM to further focus on deep feature information. Finally, we decode the features using Bi-LSTM and classify them using Softmax to obtain the final sign language recognition accuracy.
3.1. Residual Neural Network
ResNet is a landmark convolutional network [23]. Please note that the depth of the network influences the recognition accuracy. The initial parameters of a neural network are generally close to zero. In a deep network, when these parameters are updated during the training process, gradient vanishing may occur. As a result, the shallow parameters cannot be updated, resulting in poor accuracy. The residual block forms the basis of ResNet [23] and is presented in Figure 1.

The output of the residual block is mathematically expressed as follows:
$z_{l+1} = W_{l+1} a_l$ (1)
$a_{l+1} = \sigma(z_{l+1})$ (2)
where the input $a_l$ is linearly transformed through the weight layer $W_{l+1}$ to represent $z_{l+1}$ and is used as the input of the next layer after passing through the activation function $\sigma$.
The mathematical expression for $a_{l+2}$ is similar to (2):
$z_{l+2} = W_{l+2} a_{l+1}, \quad a_{l+2} = \sigma(z_{l+2})$ (3)
where $z_{l+2}$ denotes the output corresponding to $a_{l+1}$ after linear transformation. After introducing the shortcut path, the network output is represented as follows:
$a_{l+2} = \sigma(z_{l+2} + a_l)$ (4)
The generalized equation for $a_{l+2}$ is mathematically expressed as follows:
$a_{l+2} = \sigma\left(W_{l+2}\,\sigma(W_{l+1} a_l) + a_l\right)$ (5)
If $W_{l+1} = 0$ and $W_{l+2} = 0$, then
$a_{l+2} = \sigma(a_l) = a_l$ (6)
As presented in (4), (5), and (6), after adding the shortcut structure, the network output eliminates the phenomenon of gradient disappearance.
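For illustration, a minimal PyTorch sketch of a residual block following (1)-(6) is given below; the channel count and layer configuration are illustrative and not necessarily the exact settings used in this paper.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x), cf. Eq. (4)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                                # shortcut path
        out = self.relu(self.bn1(self.conv1(x)))    # first weight layer + activation, Eqs. (1)-(2)
        out = self.bn2(self.conv2(out))             # second weight layer, Eq. (3)
        return self.relu(out + identity)            # add the shortcut, Eq. (4)

# quick check: the block preserves the input shape
x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```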
3.2. Two-Way Long Short-Term Memory Network
The LSTM solves the problem of gradient explosion and vanishing faced by traditional RNNs on long input sequences. The LSTM accomplishes this by using a gate mechanism. The structure of a basic LSTM neuron is shown in Figure 2.

Each LSTM neuron has three inputs and three outputs. $x_t$ is the current input, and $h_{t-1}$ and $c_{t-1}$ are the representations from the previous time step. The cell state $c_t$ retains information over a longer span and is called long-term memory, while the hidden state $h_t$ mainly reflects the current moment and is called short-term memory. Collectively, they are referred to as the long short-term memory network. The computation process is mathematically expressed by the following equations [24]:
$f_t = \sigma\left(W_f\left[h_{t-1}, x_t\right] + b_f\right)$ (7)
$i_t = \sigma\left(W_i\left[h_{t-1}, x_t\right] + b_i\right)$ (8)
$\tilde{c}_t = \tanh\left(W_c\left[h_{t-1}, x_t\right] + b_c\right)$ (9)
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ (10)
$o_t = \sigma\left(W_o\left[h_{t-1}, x_t\right] + b_o\right)$ (11)
$h_t = o_t \odot \tanh(c_t)$ (12)
where $i_t$, $f_t$, and $o_t$ represent the input gate, forget gate, and output gate, respectively, and $W$ and $b$ represent the weight and offset parameters, respectively.
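The gate computations (7)-(12) correspond to what PyTorch's built-in `nn.LSTMCell` performs; a minimal sketch follows (the input and hidden sizes of 512 are only illustrative).

```python
import torch
import torch.nn as nn

# one LSTM cell: x_t plus the previous (h_{t-1}, c_{t-1}) produce (h_t, c_t)
cell = nn.LSTMCell(input_size=512, hidden_size=512)

x_t = torch.randn(1, 512)          # current input
h_prev = torch.zeros(1, 512)       # short-term memory from the previous step
c_prev = torch.zeros(1, 512)       # long-term memory from the previous step

h_t, c_t = cell(x_t, (h_prev, c_prev))   # the gates of Eqs. (7)-(12) are applied internally
print(h_t.shape, c_t.shape)              # torch.Size([1, 512]) torch.Size([1, 512])
```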
The Bi-LSTM consists of forward and backward LSTMs, and it considers both the historical sign language information and future sign language information [25]. The structure of Bi-LSTM is shown in Figure 3.

The basic neuron of Bi-LSTM is a parallel structure, which processes the signal forward and backward simultaneously and splices the outputs of the two layers as the output of the hidden layer. The computations of Bi-LSTM are mathematically expressed as
$\overrightarrow{h}_t = f\left(W_x x_t + W_h \overrightarrow{h}_{t-1} + b_1\right)$ (13)
$\overleftarrow{h}_t = f\left(W_x x_t + W_h \overleftarrow{h}_{t+1} + b_2\right)$ (14)
$y_t = W_y\left[\overrightarrow{h}_t ; \overleftarrow{h}_t\right] + b_3$ (15)
where $W_x$, $W_h$, and $W_y$ represent the weight matrices of the input layer, hidden layer, and output layer of the neuron, respectively, and $b_1$, $b_2$, and $b_3$ represent the bias of each layer, respectively.
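A minimal sketch of the bidirectional combination in (13)-(15) using PyTorch's `nn.LSTM`; the sequence length of 30 and feature size of 512 mirror the setup described later but are illustrative here.

```python
import torch
import torch.nn as nn

# bidirectional LSTM: the forward and backward passes are computed in parallel
bilstm = nn.LSTM(input_size=512, hidden_size=256, bidirectional=True)

seq = torch.randn(30, 1, 512)      # (time steps, batch, features), e.g. 30 video frames
out, (h_n, c_n) = bilstm(seq)

# the two directions are concatenated per time step, cf. Eq. (15)
print(out.shape)                   # torch.Size([30, 1, 512])  -> 2 x hidden_size
```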
3.3. BN Layer
As the depth of a network increases, the training process becomes more and more complex. During forward propagation, the parameters of each layer affect the parameters of the following layer, and the data in different batches change the parameters differently. The neural network needs to learn different data samples in each iteration. Consequently, it is difficult for the network to reach an optimal solution. In order to address these problems, the authors in [26] proposed batch normalization, which divides the dataset into several batches to accelerate the training process. The computation is mathematically expressed as follows:
$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$ (16)
$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2$ (17)
$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$ (18)
$y_i = \gamma \hat{x}_i + \beta$ (19)
Equation (16) calculates the mean of the input batch samples, (17) calculates the variance of the input batch samples, (18) standardizes the batch samples, and (19) scales and offsets the data of the input batch samples. $\gamma$ and $\beta$ are trainable parameters.
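Equations (16)-(19) can be written directly in PyTorch; the short sketch below implements them and checks the result against `nn.BatchNorm1d` (the batch size of 16 and feature size of 500 are illustrative).

```python
import torch
import torch.nn as nn

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(dim=0)                        # Eq. (16): batch mean
    var = x.var(dim=0, unbiased=False)        # Eq. (17): batch variance
    x_hat = (x - mu) / torch.sqrt(var + eps)  # Eq. (18): standardization
    return gamma * x_hat + beta               # Eq. (19): scale and shift

x = torch.randn(16, 500)                      # a batch of 16 feature vectors
gamma = torch.ones(500)
beta = torch.zeros(500)

manual = batch_norm(x, gamma, beta)
reference = nn.BatchNorm1d(500, affine=False)(x)   # training-mode BN with batch statistics
print(torch.allclose(manual, reference, atol=1e-5))  # True
```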
3.4. CBAM Attention Mechanism
The superiority of the attention mechanism has been verified in various research works. The attention mechanism changes global attention into local attention, obtains model weights for key regions, and improves the representation ability of specific regions. As compared with the SENET network, which only focuses on channel attention [27], the CBAM network integrates spatial attention and channel attention, as shown in Figure 4.

The spatial attention refers to the distribution of features in the spatial domain, while the channel attention refers to attention over the feature channels. The aim is to extract the most important features of the input by combining the spatial and channel information. The channel attention is computed as
$M_c(F) = \sigma\left(\mathrm{MLP}\left(\mathrm{AvgPool}(F)\right) + \mathrm{MLP}\left(\mathrm{MaxPool}(F)\right)\right)$ (20)
where $M_c$ is the channel attention mechanism and MLP is a multilayer perceptron. The input feature $F$ is subjected to average pooling and max pooling so that the important channels of the image receive a large weight and the insignificant parts receive a smaller weight. The nonlinear activation is performed using the $\sigma$ function to achieve channel attention.
In the spatial attention part, the positions of interest are extracted, and average pooling and maximum pooling are applied along the channel dimension. The two resulting two-dimensional feature maps are stacked as the input of a convolution layer. Then, the weight optimization is performed as follows:
$M_s(F) = \sigma\left(\mathrm{Conv}\left(\left[\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)\right]\right)\right)$ (21)
where $M_s$ is the spatial attention mechanism and Conv is the convolution operation. The channel attention map $M_c(F)$ is first obtained from the input feature $F$; $M_c(F)$ is then multiplied elementwise with $F$, and the result is further multiplied with the spatial attention map $M_s$. Finally, we obtain
$F'' = M_s\left(M_c(F)\otimes F\right)\otimes\left(M_c(F)\otimes F\right)$ (22)
where $F$ is the input feature and $F''$ is the output feature.
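A compact PyTorch sketch of a CBAM block implementing (20)-(22); the reduction ratio of 16 and the 7 × 7 spatial kernel follow the original CBAM design and are assumptions here, as the exact settings are not stated in the text.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # channel attention, Eq. (20): shared MLP over avg- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # spatial attention, Eq. (21): convolution over stacked channel-wise avg/max maps
        self.conv = nn.Conv2d(2, 1, kernel_size=spatial_kernel,
                              padding=spatial_kernel // 2, bias=False)

    def forward(self, f):
        b, c, _, _ = f.shape
        avg = self.mlp(f.mean(dim=(2, 3)))                 # MLP(AvgPool(F))
        mx = self.mlp(f.amax(dim=(2, 3)))                  # MLP(MaxPool(F))
        mc = torch.sigmoid(avg + mx).view(b, c, 1, 1)      # channel attention map M_c(F)
        f1 = mc * f                                        # F' = M_c(F) ⊗ F

        s_avg = f1.mean(dim=1, keepdim=True)               # channel-wise average
        s_max = f1.amax(dim=1, keepdim=True)               # channel-wise maximum
        ms = torch.sigmoid(self.conv(torch.cat([s_avg, s_max], dim=1)))
        return ms * f1                                     # F'' = M_s(F') ⊗ F', Eq. (22)

x = torch.randn(1, 512, 7, 7)
print(CBAM(512)(x).shape)  # torch.Size([1, 512, 7, 7])
```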
3.5. Convolutional Self-Coding Networks
A self-coding network is an unsupervised network comprising coding and decoding networks connected by fully connected neurons. As presented in Figure 5, the network obtains feature codes through nonlinear mapping. Due to the use of fully connected layers, it is difficult to optimize all the parameters of the model.

In addition, due to the fully connected structure, the local spatial features of the image are ignored. With a CNN, the image feature extraction becomes more reasonable, and the parameter sharing of the convolutional kernels greatly reduces the number of model parameters. Following the same idea, deconvolution decoding is used for the reconstruction operation after feature extraction. The structure of the convolutional self-coding network is shown in Figure 6.

The mathematical expressions for the encoding and decoding structures are as follows:
$h = f_{\theta}(x) = \sigma\left(Wx + b\right)$ (23)
$\hat{x} = g_{\theta'}(h) = \sigma\left(W'h + b'\right)$ (24)
where $f_{\theta}$ denotes the encoding network, $g_{\theta'}$ denotes the decoding network, $\theta = \{W, b\}$ denotes the encoding structure parameters, $\theta' = \{W', b'\}$ denotes the decoding structure parameters, and the feature vector $h$ is obtained by training the convolutional autoencoder.
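A minimal sketch of a convolutional encoder/decoder pair in the spirit of (23)-(24) and Figure 6; the channel counts and layer depths are illustrative and do not reproduce the ResNet-based configuration of Table 1.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # f_theta: convolutional encoder, cf. Eq. (23)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # g_theta': deconvolutional decoder, cf. Eq. (24)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)          # compressed feature representation
        return self.decoder(h), h    # reconstruction and code

x = torch.randn(16, 3, 224, 224)
recon, code = ConvAutoencoder()(x)
print(recon.shape, code.shape)       # torch.Size([16, 3, 224, 224]) torch.Size([16, 64, 56, 56])
```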
4. Experimental Results and Analysis
4.1. Datasets
The dataset used in this work is the SLR-500 dynamic isolated sign language dataset published by the visual sign language research group of USTC. The SLR-500 dataset consists of sign language videos captured by Kinect 2.0 [28]. Three modalities are available in this dataset: first, RGB video with a resolution of 1280 × 720 and a frame rate of 30 fps; second, depth video with a resolution of 512 × 424 and a frame rate of 30 fps; and third, skeleton data, which provides the positions of 25 skeleton joints of the human body. The SLR-500 dataset contains a total of 500 everyday words, each of which is demonstrated by 50 performers. Each performer stands 1.5 m away from the Kinect and performs each word 5 times. The dataset therefore contains 125,000 samples.
4.2. Data Preprocessing
4.2.1. Local ROI Extraction
The sign language video contains a background and a foreground demonstration area. The foreground demonstration area is the focus of the sign language video. ROI extraction refers to extracting the parts of the image that need to be identified, which increases the differences between different categories of sign language videos and reduces the differences between sign language samples of the same category. The ROI area is shown in Figure 7.

The Haar feature classifier is used for sign language ROI extraction. The central coordinates of the sign language action region are obtained by detecting the human face [29], as indicated by the green rectangular area in Figure 7. Since the ratio of the length of the human arm to the height is close to 2 : 3, a corresponding coordinate transformation of the central coordinates is performed to obtain the blue rectangular ROI region shown in Figure 7, which is then used as the input of the neural network during the training process. Please note that in the figure, the green box marks the location of the face, and the blue box marks the area on which ROI processing is performed. The processed area is shown in Figure 8.
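A sketch of this face-anchored ROI extraction: OpenCV's pretrained Haar cascade is used for face detection, while the scaling factors that turn the face box into the sign language ROI are illustrative placeholders, since the exact offsets are not given in the text.

```python
import cv2

# OpenCV's pretrained Haar cascade for frontal faces
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_roi(frame):
    """Locate the face, then derive a sign language ROI around it (illustrative factors)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return frame                                  # fall back to the full frame
    x, y, w, h = faces[0]                             # green box: face location
    cx, cy = x + w // 2, y + h // 2                   # central coordinates
    # blue box: widen and lower the window to cover the arms (arm length : height ≈ 2 : 3)
    x0, x1 = max(cx - 2 * w, 0), min(cx + 2 * w, frame.shape[1])
    y0, y1 = max(cy - h, 0), min(cy + 3 * h, frame.shape[0])
    return frame[y0:y1, x0:x1]
```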

4.2.2. RGB Video Preprocessing
The original video length in the SLR-500 sign language dataset is about 2 s. After ROI processing, each video is decomposed into continuous frames at 30 fps, yielding approximately 60 images. In order to facilitate batch processing, we select 30 images as the benchmark. First, we use multiprocessing to improve the average response time and calculate the number of images cut out of the video at 30 fps. Then, based on equal-difference sampling, the common difference is calculated for the fixed base of 30 frames, and the images are selected starting from the first frame [30]. Through this isometric selection, we not only retain the core motion information of the sign language but also reduce the size of the dataset and the impact of unnecessary data on the experiment. As the sizes of the images extracted by ROI are not consistent, the images are resized to a uniform 224 × 224 by using the transform functions of PyTorch.
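A sketch of the equal-difference sampling and resizing step, assuming the clip is decoded with OpenCV; the fixed base of 30 frames and the 224 × 224 target size follow the text, while the helper name `sample_frames` and the fallback behaviour are assumptions.

```python
import cv2
import torch
from torchvision import transforms

to_tensor = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),     # unify the sample size, as described in the text
    transforms.ToTensor(),
])

def sample_frames(video_path, num_frames=30):
    """Read a clip and keep num_frames frames chosen by arithmetic (equal-difference) sampling."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    # common difference (tolerance) for a fixed base of num_frames, assuming enough frames
    step = max(len(frames) // num_frames, 1)
    picked = frames[::step][:num_frames]
    return torch.stack([to_tensor(cv2.cvtColor(f, cv2.COLOR_BGR2RGB)) for f in picked])

# clip = sample_frames("sample_sign.avi")   # hypothetical file -> tensor of shape (30, 3, 224, 224)
```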
4.2.3. Network Pre-Training
In order to make the training process efficient, it is necessary to compress the images while effectively representing their information. The compressed representation vectors of the images can be efficiently obtained by the convolutional self-encoder. A convolutional self-coding network is designed based on ResNet. The characteristics of the encoder and decoder structures are presented in Table 1. During network training, 16 samples are randomly selected for each batch for iterative training. We use the MSE loss function to train the network [31]. Combined with a dynamic learning rate, the learning rate is large in the early stages and small in the later stages. The initial learning rate is 0.01. We use the Adam optimizer and train the network for 100 epochs. After all the samples in the training set have been iterated, the coding part of the model and its weight parameters are retained.
The input image is resized into a tensor of size (3, 224, 224) during the training process. The feature extraction is performed using ResNet. After the fully connected layer, the tensor dimension is compressed to 500, and up-sampling is then performed. Finally, a tensor with the same size of (3, 224, 224) is reconstructed for training. The network architecture is presented in Figure 9.
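A sketch of the pretraining loop under the stated settings (batch size 16, MSE loss, Adam with an initial learning rate of 0.01, 100 epochs); the `ConvAutoencoder` from the sketch above stands in for the ResNet-based network of Figure 9, the dummy dataset replaces the preprocessed ROI frames, and the step-decay schedule is an assumption for the "large early, small later" learning rate.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# dummy stand-in for the set of preprocessed 224 x 224 ROI frames
dataset = TensorDataset(torch.rand(64, 3, 224, 224))
loader = DataLoader(dataset, batch_size=16, shuffle=True)       # 16 samples per batch

model = ConvAutoencoder()                                       # stands in for the network of Figure 9
criterion = nn.MSELoss()                                        # reconstruction loss
optimizer = optim.Adam(model.parameters(), lr=0.01)             # initial learning rate 0.01
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)  # larger early, smaller later

for epoch in range(100):                                        # 100 training epochs
    for (images,) in loader:
        recon, _ = model(images)
        loss = criterion(recon, images)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()

# retain only the coding part of the model and its weight parameters
torch.save(model.encoder.state_dict(), "encoder_pretrained.pth")
```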

The UnConv layers perform up-sampling through transposed convolution kernels, mapping, for example, one value in the input matrix to nine values in the output matrix, which is a one-to-many mapping. Through this up-sampling process, data with the same size as the sign language image is reconstructed for training.
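A small demonstration of this one-to-many mapping: with a 3 × 3 transposed-convolution kernel, a single input value spreads into nine output values.

```python
import torch
import torch.nn as nn

# a single input value mapped through a 3x3 transposed convolution
unconv = nn.ConvTranspose2d(1, 1, kernel_size=3, bias=False)
nn.init.ones_(unconv.weight)

x = torch.tensor([[[[2.0]]]])          # one value, shape (1, 1, 1, 1)
print(unconv(x).squeeze())             # a 3x3 block of 2.0 -> nine output values
```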
4.3. Experimental Process
First, the coding network of the convolutional self-coding model and its parameters are extracted. Second, the CBAM module is embedded at the front and back of the coding network, which is then combined with the Bi-LSTM network. The CNN-BLSTM network with the integrated attention mechanism is shown in Figure 10. The 30 images obtained by equal-difference sampling form one sample unit, and one sample is randomly considered at a time during training. 30 RGB images with a size of 224 × 224 are extracted from each sample. Each image is converted into a feature vector of size [1, 512]. The 30 images are combined into a temporal feature vector of size [30, 1, 512], which is then used as the input of the Bi-LSTM network. The initial learning rate is 0.001, and the cross-entropy loss is used. The training set is iterated over 130 times, and the model parameters are saved after each iteration.
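A simplified sketch of how the pieces could be assembled, reusing the `ConvAutoencoder` and `CBAM` sketches above; the 512-dimensional per-frame feature, the [30, 1, 512] temporal input, the 500 classes, the cross-entropy loss, and the learning rate of 0.001 follow the text, while the pooling/projection details and the CBAM placement are simplified assumptions.

```python
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    """Pretrained encoder + CBAM -> per-frame 512-d features -> Bi-LSTM -> 500-way classification."""
    def __init__(self, encoder, cbam, feat_channels=64, num_classes=500):
        super().__init__()
        self.encoder = encoder                       # pretrained convolutional encoder
        self.cbam = cbam                             # attention over the encoder feature maps
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(feat_channels, 512)    # per-frame feature vector [1, 512]
        self.bilstm = nn.LSTM(512, 256, bidirectional=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, clip):                         # clip: (30, batch, 3, 224, 224)
        t, b = clip.shape[:2]
        feat = self.cbam(self.encoder(clip.flatten(0, 1)))
        feat = self.proj(self.pool(feat).flatten(1)).view(t, b, 512)   # [30, batch, 512]
        out, _ = self.bilstm(feat)
        return self.fc(out[-1])                      # classify from the final time step

model = CNNBiLSTM(ConvAutoencoder().encoder, CBAM(64))   # 64 = encoder output channels in the sketch above
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

clip, label = torch.rand(30, 1, 3, 224, 224), torch.tensor([3])   # dummy sample and label
loss = criterion(model(clip), label)                 # one optimization step on one sample
optimizer.zero_grad()
loss.backward()
optimizer.step()
```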

4.4. Analysis of Experimental Results
This experiment mainly compares the feature fusion based on the convolutional self-encoder with other methods in terms of the recognition results obtained on the SLR sign language dataset. The results are presented in Figures 11–13. Other methods are compared and tested on the SLR-500 dataset, as shown in Table 2. The C3D network has a simple structure and is the earliest CNN applied to video classification and behavior detection. It consists of 8 convolution layers, 5 pooling layers, and 2 fully connected layers. Due to its simple structure, it performs the worst on the SLR sign language dataset. In order to improve video classification and action recognition, researchers gradually deepened the networks. The residual idea is introduced to prevent the gradient problems caused by network depth. As presented in Table 2, 3D-ResNet performs well on the SLR sign language dataset. This shows that increasing the depth of the neural network to a certain extent effectively improves the ability of the network to fit complex data. The I3D network uses a two-stream design in a 3D convolution network: one 3D CNN accepts RGB information and another accepts optical flow information. With the introduction of the LSTM network, a 2D CNN can also recognize related video actions. It is evident from Table 2 that the combination with a time series memory network, i.e., ResNet-LSTM, has a good effect on sign language detection. In addition, due to the forward and backward correlation of sign language actions, the time sequence should include not only the forward direction but also the reverse direction. Therefore, the Bi-LSTM network is used for sign language video. In order to further improve the detection results, the CBAM network is embedded, which can be seamlessly integrated with the above network. It is evident from Table 2 that the ResNet-Bi-LSTM network embedded with the CBAM network achieves the best results. The experimental results show that the proposed method is effective for recognizing sign language.



5. Conclusions
In this paper, a new sign language video recognition network is proposed, which combines the attention mechanism and a self-coding temporal network. The proposed method uses the RGB sign language video information to recognize sign language movements. Through pretraining on the sign language video, the network learns the significant information of the images and compresses the image information. Then, the CBAM network is embedded to increase the weight of the important feature information in the image and improve the representation ability without affecting the performance of the pretrained network. We use batch normalization to speed up the training process and, at the same time, improve the generalization ability of the network. Considering the forward and backward particularity of sign language actions, the Bi-LSTM network is introduced so that the network can fully learn the relevant information before and after each action. In order to strengthen the weight of the sign language region in the video, the original sign language video is preprocessed. We use face detection to assist the ROI extraction, cutting out the useful regions to reduce data interference and enhance the validity of the data. The proposed method is not only effective in the field of sign language action classification but also suitable for general video classification. In addition, the network proposed in this work has some shortcomings. For instance, it does not consider the optical flow and skeleton information of the sign language video. As a result, the input information of the network is limited to a single modality, making it difficult to fully represent sign language actions. Therefore, a sign language recognition network based on multimodal fusion is the focus of future research.
Data Availability
The data used to support the findings of this study are available from https://ustc-slr.github.io/openresources/cslr-dataset-2015/index.html.
Conflicts of Interest
The authors declare that they have no competing interests.