Abstract

This paper proposes an isolated word sign language video recognition model based on an improved selective kernel network-temporal convolutional network (SKResNet-TCN), addressing the problems of excessive parameters, heavy computation, and difficulty in extracting effective features. SKResNet uses grouped convolution to save computational cost while dynamically selecting feature information from different receptive fields, improving the model's ability to extract features from video frame images. TCN introduces causal convolution and dilated convolution, which allow the network to take full advantage of parallel computation, reduce memory overhead during computation, and capture feature information between consecutive frames. Based on these two networks, we design a hybrid SKResNet-TCN model: we adopt hybrid dilated convolution to address the loss of information between data features caused by stacked dilated convolutions, replace adaptive average pooling with adaptive maximum pooling to preserve the salient features of sign language, and use the Mish activation function to improve the generalization ability and accuracy of the model. The model reaches 100% accuracy on the Argentinian LSA64 dataset, and the experimental results show that, compared with traditional 3D convolutional networks and long short-term memory networks, it has fewer parameters, lower computational cost, and higher accuracy in sign language recognition, effectively saving both computation and time.

1. Introduction

According to the World Report on Hearing released by the World Health Organization (WHO), about 450 million people in the world live with disabling hearing loss. As a supplement to spoken language, sign language [1] is a multichannel means of expression. Sign language includes not only hand movements but also body posture and facial expressions, which play an important role in communication between deaf and hearing people. With the technological progress brought about by artificial intelligence, researchers have begun to study how to use artificial intelligence to translate sign language into a language that hearing people can understand, so as to break down the communication barriers between deaf and hearing people. Isolated word sign language recognition is discrete sign language recognition: the hand shape serves as the visual feature, and a simple gesture can represent a single word or character. The modeling process relies on the scene information of the image frames in the sign language video and the action information across consecutive frames. Traditional sign language recognition feeds manually extracted features into time-series models such as hidden Markov models [2] and dynamic time warping [3], which is time-consuming and labor intensive. In recent years, with the development of deep learning, researchers have begun to use convolutional neural networks (CNNs) to extract hand features: 2D-CNNs can extract spatial information from sign language video frames, and 3D-CNNs can extract spatiotemporal information, bringing new progress to sign language recognition. However, many problems remain to be solved in this field, for example: sign language videos are redundant and contain a large number of invalid frames, which makes it difficult to extract keyframes; the parameters of sign language translation models are too large, which hinders the practical deployment of sign language recognition; and changes between consecutive frames of sign language videos are difficult to capture. This paper proposes an isolated word sign language video recognition network based on an improved selective kernel network-temporal convolutional network (SKResNet-TCN), with the following main contributions: (1) we reconstruct the sign language dataset by extracting keyframes and unifying the keyframe count, which reduces the computation of the network model and improves the recognition rate while maintaining high accuracy; (2) we use the improved SKResNet network to save computational cost, dynamically select feature information from different receptive fields, retain the salient features of sign language, and improve the feature extraction ability of the model; (3) we use the TCN network to capture change information between video frame sequences and, in addition, to take full advantage of parallel computation and reduce memory overhead during calculation.

2. Related Work

According to how sign language information is collected, sign language recognition research can be divided into two types: data glove-based and computer vision-based sign language recognition.

2.1. Sign Language Recognition Based on Data Gloves

A sign language recognition system based on data gloves requires users to wear data gloves, which consist of a main control module, an attitude module, a power module, and a wireless module, with various embedded sensor devices. These sensors capture the angle of each joint of the human hand, the trajectory of movement in space, posture, timing, and other information, and accurately record the key features of sign language. In 1985, Zimmerman et al. [4] developed the VPL data glove, which drives a real-time 3D model of the hand through a host computer and allows the user to manipulate computer-simulated objects as if they were real objects; it also provides a visual programming language interface. Lee and Xu [5] of Carnegie Mellon University (CMU) designed and developed a sign language recognition system based on the CyberGlove data glove, which uses a hidden Markov model (HMM) to translate the data transmitted by the glove and can recognize 14 sign language gestures. In addition to recognition, it can also learn new sign language gestures autonomously. Later, with the continuous development of science and technology, products using sign language recognition systems gradually entered the public's field of vision. Takahashi and Kishino [6] of Hitachi laboratories used data gloves to explore the shape and position change information of hands; their system achieved a recognition rate of 83% on a sign language dataset consisting of 10 sign sentences. Gao Wen, Li et al. [7], and others from Harbin Institute of Technology innovatively combined HMMs, artificial neural networks, and dynamic programming to achieve a recognition rate of 91.4%. However, sign language recognition with data gloves requires wearing additional glove equipment, and the gloves are relatively expensive and mostly confined to laboratories, so this approach cannot be widely used in real life.

2.2. Sign Language Recognition Based on Computer Vision

Computer vision-based sign language recognition technology obtains 2D images and videos through a single camera, or RGB (Red, Green, Blue) images through a stereo camera from which depth information is derived, and then uses image processing, deep learning, and other algorithms to perform sign language recognition. Compared with the data glove-based method, acquiring data through a camera is cheaper, easier to promote, simple to operate, easy to popularize in daily life, and better suited to how deaf people communicate. Before 2016, deep learning technology had not been widely applied to sign language recognition; traditional machine learning algorithms were mainly used to solve sign language recognition at a certain scale, and the recognition rates of these models were not high. For example, Grobel and Assan [8], researchers at the RWTH laboratory in Germany, established a language model based on German grammar in 1997, used an HMM to extract features from 262 signs, and output the recognition results after model training, with an accuracy rate of 89.8%. Hassan et al. [9] proposed an ArSLR system based on the HMM and an improved version of k-nearest neighbor (KNN), using traditional feature optimization and statistical measures such as mean, standard deviation, and covariance to analyze sign language. In 2012, the deep learning network AlexNet made a splash on the ImageNet dataset, and deep learning technology gradually attracted attention. Afterward, various neural networks emerged one after another, and more and more research applied neural networks to sign language. Pigou et al. [10] proposed a sign language recognition system based on CNNs and graphics processing unit acceleration; using two CNNs to extract upper body movement features and hand features, it requires no complex manual features and reaches a recognition accuracy of 91.7% on a dataset of 20 Italian signs. Chamgoz et al. [11] designed a two-layer recurrent convolutional network with an attention mechanism, using an encoder-decoder design to translate sign language videos into everyday spoken language end to end. The system can learn spatial representations, the underlying language, mappings between signed and spoken languages, and their different grammars and orderings, and it provides spoken translations for videos from German weather broadcasts. Molchanov et al. [12] used a 3D-CNN to extract feature information from sign language videos, fed the extracted features into a recurrent neural network (RNN), and used the connectionist temporal classification (CTC) algorithm as the objective function to learn the alignment rules from input to output, thereby realizing sign language video classification and completing sign language recognition tasks. Guo et al. [13] found that there is a certain connection between the hidden states and latent states in the HMM, so they used the affinity propagation clustering algorithm to analyze the hidden states of sign language gestures in an attempt to find the relationship between them, and thus proposed an adaptive HMM sign language recognition model.
Zheng and Liang [14] used a pyramid of histograms of oriented gradients computed on 3D motion maps, i.e., motion maps generated from the entire depth video sequence, which project motion information onto three orthogonal planes to represent the appearance information of gestures at different scales. Huang et al. [15] proposed a sequence-to-sequence learning method based on the central segments of video keyframes. Unlike previous studies, only the key information is assigned a salient symbol, and the keyframe-to-word task is transformed into a keyframe-centered clip-to-word task. The recognition rate was 91% on a dataset of 310 Chinese sign language words, which was better than previous SLR systems. Zhou et al. [16] proposed a spatial-temporal multicue network, in which the spatial multicue (SMC) module models spatial information and the temporal multicue (TMC) module models temporal correlation along two parallel routes, exploring multiple cues while ensuring the uniqueness of each. RNNs, whose inputs and outputs are sequences, are well suited to temporal problems and are widely used in machine translation, speech recognition, and sign language recognition. Guo et al. [17] proposed a hierarchical long short-term memory (LSTM) model that uses a 3D CNN to extract temporal and spatial information from sign language videos, a temporal attention mechanism to learn the intrinsic connections among the source visual features, and a two-layer LSTM to encode visual representations and decode word embeddings, achieving 92.9% accuracy on a Chinese sign language dataset.

In this paper, we focus on the algorithm and implementation of computer vision-based sign language recognition, combining different neural networks to build an improved SKResNet-TCN network model, which achieves a 100% recognition rate on the isolated word sign language dataset LSA64.

3. Materials and Methods

3.1. Dataset

In this paper, we use the Argentinian Sign Language dataset LSA64; Figure 1 shows a sample of the dataset. The dataset contains 3,200 videos in which 10 nonexpert volunteers repeat 64 different types of signs with one or two hands. The signs are among the most commonly used signs in the LSA lexicon and include both verbs and nouns. The dataset was collected under different photometric conditions, and few constraints were imposed on the volunteers during recording in order to increase the diversity of the database. Each video has a resolution of 1920 × 1080 at 60 frames per second and was captured with a Sony HDR-CX240 camera.

3.2. Overall Architecture

The algorithm framework of this paper is shown in Figure 2. In order to extract the feature information of the sign language video more effectively, keyframes are first extracted from the dataset using the interframe difference method, and the number of keyframes is unified. SKResNet then performs spatial feature extraction on the processed video frame images to obtain features such as hand shapes and facial expressions; the dynamic selective convolution retains the useful feature information, which is arranged along the time dimension. The TCN network captures the action information between keyframes, and the fully connected layer outputs the recognition result.
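As a minimal sketch of this pipeline (not the authors' exact implementation), the following PyTorch module applies a per-frame 2D backbone to each keyframe, stacks the frame features along the time dimension, and passes them through a temporal convolution and a fully connected classifier; the layer sizes and the `backbone` argument are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SignRecognizer(nn.Module):
    """Sketch of the frame-feature + temporal-convolution pipeline."""
    def __init__(self, backbone: nn.Module, feat_dim=512, num_classes=64):
        super().__init__()
        self.backbone = backbone            # e.g., an SKResNet-style 2D CNN ending in pooling
        self.tcn = nn.Sequential(           # stand-in for the full TCN described in Section 3.5
            nn.Conv1d(feat_dim, 256, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.fc = nn.Linear(256, num_classes)

    def forward(self, clips):               # clips: (batch, time, channels, height, width)
        b, t = clips.shape[:2]
        frames = clips.flatten(0, 1)                    # (batch*time, C, H, W)
        feats = self.backbone(frames).flatten(1)        # (batch*time, feat_dim)
        feats = feats.view(b, t, -1).transpose(1, 2)    # (batch, feat_dim, time)
        return self.fc(self.tcn(feats).squeeze(-1))     # (batch, num_classes)
```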

Obtaining the global information for each channel can be expressed as:

$$s_c = F_{gp}(U_c) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} U_c(i, j). \tag{1}$$

Equation (1) indicates that the c-th element of $s$ is obtained by compressing the feature map $U_c$ over its spatial dimensions $H \times W$. Then, a fully connected operation is performed on the output $s$ to compute the proportion of each channel:

$$z = F_{fc}(s) = \delta(\mathcal{B}(Ws)), \tag{2}$$

where $\delta$ denotes the ReLU function and $\mathcal{B}$ denotes the batch normalization function. Each module is described in detail below.

3.3. Data Preprocessing

First, all the sign language video data need to be processed to extract and save video keyframes. Since the performers in a sign language video subconsciously emphasize the semantic meaning of the sign language and pause briefly at the key moments of each action, the videos also contain a large number of invalid frames, and these useless frames directly affect the recognition results of the subsequent deep neural network.

Let all frame sequences of the video be $F = \{f_1, f_2, \ldots, f_n\}$, and then use Equation (3) to compute the interframe difference and obtain the average interframe difference sequence $D = \{d_1, d_2, \ldots, d_{n-1}\}$:

$$d_i = \frac{1}{H \times W}\sum_{x=1}^{H}\sum_{y=1}^{W}\left| f_{i+1}(x, y) - f_i(x, y) \right|. \tag{3}$$

According to the interframe difference intensity sequence, the local maxima of the average interframe difference sequence are taken as keyframes, and the sequence is denoised by a smoothing operation.
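A minimal sketch of this keyframe selection step is given below, assuming OpenCV and NumPy are available; the smoothing window size and the target keyframe count are illustrative parameters rather than values taken from the paper.

```python
import cv2
import numpy as np

def extract_keyframes(video_path, num_keyframes=32, smooth=5):
    """Pick frames at local maxima of the smoothed average interframe difference."""
    cap = cv2.VideoCapture(video_path)
    frames, diffs = [], []
    ok, prev = cap.read()
    while ok:
        ok, cur = cap.read()
        if not ok:
            break
        frames.append(cur)
        # Average absolute difference between consecutive frames (Equation (3)).
        diffs.append(cv2.absdiff(cur, prev).mean())
        prev = cur
    cap.release()

    # Smooth the difference sequence to suppress noise before peak picking.
    smoothed = np.convolve(diffs, np.ones(smooth) / smooth, mode="same")

    # Local maxima of the smoothed sequence are keyframe candidates.
    peaks = [i for i in range(1, len(smoothed) - 1)
             if smoothed[i] >= smoothed[i - 1] and smoothed[i] >= smoothed[i + 1]]

    # Unify the keyframe count: keep the strongest peaks, or fall back to uniform sampling.
    if len(peaks) >= num_keyframes:
        peaks = sorted(sorted(peaks, key=lambda i: smoothed[i], reverse=True)[:num_keyframes])
    else:
        peaks = list(np.linspace(0, len(frames) - 1, num_keyframes, dtype=int))
    return [frames[i] for i in peaks]
```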

3.4. SKResNet Video Frame Space Feature Extraction

The SKResNet network was proposed by Li et al. [18]; it dynamically adjusts receptive fields of different sizes by nonlinearly fusing feature information from different kernels. Sign language consists not only of a series of hand movements but also of the facial expressions of the performer, which reinforce the meaning of the signs. Given this characteristic of sign language video data, this paper uses the SKResNet network in the spatial feature extraction stage: different convolutional kernels extract different feature information, and the network learns by itself to select the feature information of different receptive fields. This facilitates the extraction of detailed features such as the movements and expressions of sign language performers and thereby improves the spatial feature extraction ability. The convolution kernel of this network contains two branches, as shown in Figure 3.

Here, $X$ denotes the input features and $V$ denotes the output features. When the features are input into the SKResNet convolutional network, two different convolutional kernels first perform feature extraction on the input to obtain two different sets of feature maps, $\tilde{U}$ and $\hat{U}$. For example, the convolutional kernels in the figure have sizes of 3 × 3 and 5 × 5, allowing the network to extract richer feature information; a gating mechanism then controls the flow of information into the different branches of the next convolutional layer. The two sets of feature maps are fused by a simple element-wise summation:

$$U = \tilde{U} + \hat{U}. \tag{4}$$

The obtained feature information is then compressed by global pooling and fully connected layers to reduce the number of features and improve the network's generalization ability. The formula is as follows:

$$s_c = F_{gp}(U_c) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} U_c(i, j). \tag{5}$$

After the global average pooling, a fully connected operation is performed to calculate the proportion of each channel while further reducing the dimensionality to improve efficiency. The formula is as follows:

$$z = F_{fc}(s) = \delta(\mathcal{B}(Ws)), \tag{6}$$

where $W$ is a two-layer fully connected layer that first reduces and then restores the dimensionality, $\delta$ is the ReLU function, and $\mathcal{B}$ denotes batch normalization, which facilitates the subsequent select module in choosing information of different sizes across channels. The select module can be regarded as a soft attention mechanism: it regresses the weight information of each channel by applying Softmax over the two branches of feature information and outputs the final extracted features.
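The selective kernel split-fuse-select mechanism described above can be sketched as follows in PyTorch; the channel counts, reduction ratio, group number, and the use of `nn.AdaptiveMaxPool2d` and `nn.Mish` (the variants adopted later in Section 3.6) are illustrative choices rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SKUnit(nn.Module):
    """Two-branch selective kernel convolution: split, fuse, select."""
    def __init__(self, channels=64, reduction=4):
        super().__init__()
        # Split: 3x3 branch and a dilated 3x3 branch approximating a 5x5 kernel, as in SKNet.
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=8, bias=False),
            nn.BatchNorm2d(channels), nn.Mish())
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2, groups=8, bias=False),
            nn.BatchNorm2d(channels), nn.Mish())
        # Fuse: global pooling followed by a dimension-reducing FC layer (Equations (5) and (6)).
        self.pool = nn.AdaptiveMaxPool2d(1)
        d = max(channels // reduction, 16)
        self.fc_reduce = nn.Sequential(nn.Linear(channels, d), nn.BatchNorm1d(d), nn.Mish())
        # Select: per-branch channel weights normalized by Softmax.
        self.fc_select = nn.Linear(d, channels * 2)

    def forward(self, x):
        u3, u5 = self.branch3(x), self.branch5(x)
        u = u3 + u5                                    # U = U~ + U^  (Equation (4))
        s = self.pool(u).flatten(1)                    # global pooling over H x W
        z = self.fc_reduce(s)
        weights = self.fc_select(z).view(-1, 2, u.size(1))
        a = torch.softmax(weights, dim=1)              # soft attention over the two branches
        return a[:, 0, :, None, None] * u3 + a[:, 1, :, None, None] * u5
```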

3.5. TCN Video Frame Time Feature Extraction

RNNs are generally used to process temporal information: their neurons receive input not only from other neurons but also from their own output at the previous time step. However, sign language video requires attention not only to the features of individual frames but also to the action information between frames, and RNNs do not handle these problems well. Although traditional RNNs have been replaced by LSTMs with some success, LSTMs are also computationally intensive and cannot exploit parallel processing across multiple computing threads. Traditional CNNs are likewise considered unsuitable for this type of problem because the limited size of their convolution kernels prevents them from extracting temporal information. However, Bai et al. [19] found that CNNs outperform RNNs in temporal tasks such as audio synthesis and machine translation, and designed the TCN for extracting temporal information. The TCN can not only extract features from long temporal sequences with flexible receptive fields but also compute results in parallel to improve recognition efficiency, providing a new approach to extracting temporal features from sign language video data.

TCN uses causal convolution to process temporal information. Unlike a traditional convolutional network, a causal convolutional network only attends to the information of the lower layer at moment t and before: information flows in one direction, and the output at moment t depends only on inputs at moment t and earlier [20]. It is therefore called causal convolution, as shown in Figure 4.

Traditional convolution kernels are limited in size when dealing with temporal problems and cannot capture longer dependencies. TCN uses dilated convolution to enlarge the receptive field while maintaining the size of the original feature map.

Figure 5(a) shows an ordinary convolution and Figure 5(b) shows a dilated convolution. As the name implies, dilated convolution expands the span of the original convolution kernel, but the number of kernel units actually involved in the computation remains unchanged: the blue squares in the figure are the kernel units that actually participate in the computation, and the white squares indicate positions filled with zeros.

At the same time, in order to prevent the overfitting caused by an overly deep network, the TCN also uses residual connections so that information can be transferred across layers, and adds WeightNorm and Dropout to the residual block to regularize the network. This partially eliminates the effects of gradient vanishing and gradient explosion and gives the network stronger generalization ability, preventing overfitting.
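A minimal sketch of such a TCN residual block is shown below, assuming PyTorch; the channel sizes, dropout rate, activation choice, and the left-padding trick used to keep the convolution causal follow common TCN implementations rather than the paper's exact settings.

```python
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import weight_norm

class TemporalBlock(nn.Module):
    """Residual block with causal dilated 1D convolutions, WeightNorm, and Dropout."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1, dropout=0.2):
        super().__init__()
        # Pad only on the left so the output at time t never sees inputs after t (causality).
        self.pad = (kernel_size - 1) * dilation
        self.conv1 = weight_norm(nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation))
        self.conv2 = weight_norm(nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation))
        self.act = nn.Mish()
        self.drop = nn.Dropout(dropout)
        # 1x1 convolution matches channel counts for the residual connection if needed.
        self.downsample = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def _causal(self, conv, x):
        x = F.pad(x, (self.pad, 0))        # left padding keeps the convolution causal
        return self.drop(self.act(conv(x)))

    def forward(self, x):                  # x: (batch, channels, time)
        out = self._causal(self.conv2, self._causal(self.conv1, x))
        return self.act(out + self.downsample(x))
```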

3.6. Improvement Based on SKResNet-TCN Hybrid Network

Although the SKResNet-TCN network model improves accuracy while reducing the number of parameters, stacking identical dilated convolutions in the model produces a gridding effect: many features in the grid are never extracted, which affects the final recognition result. As shown in Figure 6, when multiple identical dilated convolutions are stacked, a large number of holes appear in the grid, many pixels are never sampled, and the integrity and continuity of the data are destroyed, which is not conducive to network learning. Dilated convolution also suffers from picking up uncorrelated information when dealing with long-distance dependencies, which affects the consistency of the data.

In view of the above problems of dilated convolution, this paper adopts hybrid dilated convolution instead of ordinary dilated convolution. Hybrid dilated convolution avoids these problems in three ways (a sketch checking the constraints follows this list):
(1) The dilation rates in the network cannot have a common factor greater than 1.
(2) The dilation rates are designed as a sawtooth structure.
(3) The dilation rates should satisfy Equation (7):

$$M_i = \max\left[M_{i+1} - 2r_i,\; 2r_i - M_{i+1},\; r_i\right], \tag{7}$$

where $r_i$ denotes the dilation rate of the network at layer $i$ and $M_i$ denotes the maximum dilation rate at layer $i$.
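The following helper is a sketch based on the hybrid dilated convolution criterion above (not code from the paper); it checks whether a proposed sawtooth dilation schedule such as [1, 2, 5] satisfies the common-factor rule and Equation (7) for a given kernel size.

```python
from math import gcd
from functools import reduce

def check_hdc(dilations, kernel_size=3):
    """Return True if the dilation schedule avoids the gridding effect."""
    # Rule (1): the dilation rates must not share a common factor greater than 1.
    if reduce(gcd, dilations) > 1:
        return False
    # Rule (3): compute M_i backwards, starting from M_n = r_n (Equation (7)).
    m_next = dilations[-1]
    for r in reversed(dilations[:-1]):
        m_next = max(m_next - 2 * r, 2 * r - m_next, r)
    # The HDC criterion requires the final maximum gap to be at most the kernel size.
    return m_next <= kernel_size

print(check_hdc([1, 2, 5]))   # True: a typical sawtooth group
print(check_hdc([2, 4, 8]))   # False: shares a common factor, gridding occurs
```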

Adaptive average pooling is used in the SKResNet base network; this pooling method best preserves the background features of video frames, but it does not fit the characteristics of the sign language videos in this paper. Therefore, adaptive maximum pooling is used instead to preserve the detailed texture features of video frames.

The ReLU activation function is used in the basic network of this section. The purpose of the activation function is to give the network nonlinear fitting ability and enhance its expressive power. However, the ReLU function can cause the gradient to vanish for negative inputs, so this paper introduces Mish, a self-regularizing nonmonotonic activation function that allows better feature information to flow through the network, improving accuracy and generalization. In its original paper, Mish was shown to improve final accuracy over both Swish (+0.494%) and ReLU (+1.671%). The formula of the Mish function is:

$$\text{Mish}(x) = x \cdot \tanh\left(\ln\left(1 + e^{x}\right)\right). \tag{8}$$
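As a small illustration, Equation (8) can be written directly in PyTorch (recent versions also ship it as `torch.nn.Mish`):

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish(x) = x * tanh(ln(1 + exp(x))) = x * tanh(softplus(x))
    return x * torch.tanh(F.softplus(x))

x = torch.linspace(-3, 3, 7)
print(mish(x))      # smooth, nonmonotonic, allows small negative values to pass through
print(F.mish(x))    # built-in equivalent in PyTorch >= 1.9
```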

Also, in order to make better use of the stronger nonlinear fitting ability brought by the Mish activation function, Ranger is used as the optimizer in this paper. The Ranger optimizer combines the advantages of RAdam and Lookahead, dynamically adjusting the adaptive momentum and reducing the need for extensive hyperparameter tuning. The experiments show that the Ranger optimizer gives training a good start while achieving faster convergence with minimal computational overhead.

4. Results and Discussion

4.1. Parameter Selection

Because of the large amount of training data, batch training is chosen. In order to obtain the optimal parameters, experiments are conducted on the choice of batch size (64, 128, and 256), learning rate (0.001 and 0.0001), and number of iterations (500, 1,000, and 1,500), judged by recognition rate and loss value. Table 1 shows the comparison of network recognition results under different parameters. The experimental results show that the network performs best on isolated word sign language recognition when the learning rate is 0.0001, the batch size is 128, and the number of iterations is 1,000, reaching 100% accuracy with the lowest loss value. Therefore, we set the learning rate to 0.0001, the batch size to 128, and the number of iterations to 1,000 for model training.
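This parameter search can be sketched as a simple grid over the candidate values from Table 1; `train_and_evaluate` is a hypothetical helper standing in for a full training run of the SKResNet-TCN model that returns the final accuracy and loss.

```python
from itertools import product

# Candidate values from Table 1.
batch_sizes, learning_rates, iterations = [64, 128, 256], [1e-3, 1e-4], [500, 1000, 1500]

best = None
for bs, lr, iters in product(batch_sizes, learning_rates, iterations):
    # train_and_evaluate is a hypothetical helper: trains with the given settings,
    # then returns (accuracy, loss) on the validation set.
    acc, loss = train_and_evaluate(batch_size=bs, lr=lr, num_iterations=iters)
    if best is None or (acc, -loss) > (best[0], -best[1]):
        best = (acc, loss, bs, lr, iters)

print("best accuracy %.4f, loss %.4f with batch=%d, lr=%g, iterations=%d" % best)
```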

Too many or too few keyframes will affect the final experimental results: too many keyframes adds video frames highly similar to the existing keyframes for sign language videos with few keyframes, while too few removes useful keyframes from sign language videos with many. In order to select the appropriate number of keyframes (29, 30, 31, 32, and 33), this paper conducts experiments evaluating three metrics: model computation (FLOPs), accuracy, and loss value. As shown in Table 2, 32 keyframes is the most suitable number, at which the recognition rate reaches 100% and the loss value drops to its lowest.

The whole training iteration of the model is implemented through the Dataset and DataLoader classes provided by PyTorch, which allow more flexible loading of the dataset, with a 6 : 2 : 2 split for the training, validation, and test sets and random flipping as a data augmentation operation. From Figure 7(a), it can be seen that the training accuracy levels off at around 180 iterations, where the recognition rate exceeds 98%; the model reaches its best state at 656 iterations, where the accuracy reaches 100%. Throughout training, the recognition rates of the training set and the test set follow the same trend and no overfitting occurs, which indicates the excellent performance of the network. As shown in Figure 7(b), the network reduces the loss value quickly: the loss decreases rapidly until the 116th iteration, where the loss on the test set reaches 0.096, and after about 200 epochs the loss curve flattens out. This result is attributed to the design of the network; the loss values on the training set and the test set are consistent in their trends, indicating that the network does not overfit and is relatively robust.
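A minimal sketch of this data pipeline is shown below. The 6 : 2 : 2 split, batch size of 128, and learning rate of 0.0001 follow the settings reported above, while the random stand-in tensors, the toy classifier, and the use of `torch.optim.RAdam` (as a readily available stand-in for the Ranger optimizer) are illustrative assumptions; in practice a Dataset over the extracted LSA64 keyframe clips would be used instead.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in data for illustration: 3,200 clips of 32 keyframes with 512-dim frame features,
# and one of 64 class labels per clip (replace with a real Dataset over LSA64 keyframes).
clips = torch.randn(3200, 32, 512)
labels = torch.randint(0, 64, (3200,))
dataset = TensorDataset(clips, labels)

# 6 : 2 : 2 split into training, validation, and test sets.
n_train, n_val = int(0.6 * len(dataset)), int(0.2 * len(dataset))
n_test = len(dataset) - n_train - n_val
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n_test], generator=torch.Generator().manual_seed(0))

train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
val_loader = DataLoader(val_set, batch_size=128)
test_loader = DataLoader(test_set, batch_size=128)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 512, 64))  # toy classifier
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-4)  # stand-in for Ranger
criterion = torch.nn.CrossEntropyLoss()
```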

4.2. Ablation Experiments

As shown in Table 3, after adding the hybrid dilated convolution, adaptive maximum pooling, and Mish activation function, the recognition accuracy of the model is significantly improved and its generalization ability is enhanced.

4.3. Comparison Experiments

Table 4 shows that the network model designed in this section outperforms previous network models in terms of model parameters and computational cost, and achieves 100% accuracy. The I3D and MEMP networks are similar to the network designed in this paper in terms of model parameters, but there are significant differences in computational cost and accuracy. The (2 + 1)D-SLR network has 2.7 times as many parameters as the model in this paper, and its accuracy is lower. Although the I3D + GLR + CSA network also achieves 100% accuracy and differs from this paper by only 0.6M parameters, its computational cost is 1.8 times that of this paper, mainly because this paper selects keyframes with a unified keyframe count in order to reduce the computational cost of the network. This illustrates that keyframe selection with a unified keyframe count not only reduces the computational cost of the network model but also enables it to make full use of the key video frame information, so that the recognition rate reaches 100%.

Table 5 compares the accuracy of the network model designed in this section with several other mainstream models. The experiments in this paper use the RGB image data from LSA64 rather than the LSA64 pose-based data, yet the network model achieves a 100% recognition rate, which is the best result among both data processing approaches and reaches the best recognition level reported for this dataset.

5. Conclusions

In this paper, we propose a new isolated word sign language video recognition method based on an improved SKResNet-TCN network, which mainly contains three parts: data preprocessing, SKResNet, and the TCN network. Data preprocessing selects the most suitable number of video frames; SKResNet uses grouped convolution to dynamically extract hand and facial expression features from video frame images; and the TCN network extracts action information between consecutive frames, while its causal convolution and dilated convolution take full advantage of parallel computation and reduce memory overhead. In addition, this paper addresses the shortcomings of the SKResNet-TCN hybrid network by using hybrid dilated convolution, adaptive maximum pooling, and the Mish activation function. The experimental results show that the proposed model has fewer parameters, lower computational cost, and higher accuracy. Although this paper has made progress on isolated word sign language recognition, continuous sign language translation still suffers from inconsistency between translation results and natural speech order, which is the next research focus of this work.

Data Availability

The LSA64 data used to support the findings of this study have been deposited in the LSA64: A Dataset of Argentinian Sign Language repository.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

We are very grateful for the guidance we received from our teachers.