Abstract
Human gait recognition has emerged as a branch of biometric identification over the last decade, identifying individuals from walking characteristics such as movement, timing, and clothing. It is also well suited to video surveillance applications. The main issue with existing techniques is the loss of accuracy and time caused by traditional feature extraction and classification. Given the advances of deep learning in a variety of applications, particularly video surveillance and biometrics, we propose a lightweight deep learning method for human gait recognition in this work. The proposed method consists of sequential steps: modification of pretrained deep models, feature selection, and classification. Two lightweight pretrained models are initially considered and fine-tuned by adding layers and freezing some middle layers. Following that, the models are trained using deep transfer learning, and features are engineered on the fully connected and average pooling layers. Fusion is performed using discriminant correlation analysis, and the fused vector is then optimized using an improved moth-flame optimization algorithm. The final optimum features are classified using an extreme learning machine (ELM). The experiments were carried out on two publicly available datasets, CASIA B and TUM GAID, and yielded average accuracies of 91.20% and 98.60%, respectively. Compared to recent state-of-the-art techniques, the proposed method is found to be more accurate.
1. Introduction
Person recognition and identification using gait is of great importance in the fields of machine learning and computer vision [1]. Gait is the walking behavior of a person, but recognizing a person by gait from a distance and in poorly illuminated environments is complicated and difficult [2]. Moreover, compared with other traditional biometric techniques such as fingerprint, face, and iris recognition, gait does not require direct contact with the subject [3]. Due to these discriminative factors, it has attracted considerable attention from researchers and is applied in various applications such as security surveillance, suspicious person detection, and forensics [4, 5].
In early research, gait recognition was categorized into two main categories: model-based and appearance-based [6]. The former methods are more costly, since they implement a human model from high-resolution videos, and give lower average results than the latter; hence, researchers focus on appearance-based methods for gait feature detection [7]. In the model-based method, prior information is used to extract the moving features of the human body [8]. Furthermore, in this method, the movement of the human body is examined using changing factors such as the gait path, the position of the joints, and the torso [9]. This is a challenging method due to its high computational complexity. In the model-free method, the gait cycle is used to extract features from a silhouette, and it is simple to implement due to its lower computational cost [10].
There are various machine learning and computer vision techniques that are used to overcome covariate factors such as angular deviations [11], lighting conditions [12], and clothing and carrying bags [6, 13], but various challenges remain in extracting useful features, which affects the optimal accuracy. Feature extraction is considered the most important step in recognizing gait traits [14]: if the extracted features are relevant to the problem, the system will be able to correctly recognize human gait patterns. In contrast, if irrelevant features are evaluated, the system performance will degrade and it will not give optimal recognition results [10]. In the past, various types of features were used, such as shape-based features [15], geometrical features [16], and statistical features [17]. Deep features are extracted using deep convolutional neural network techniques to overcome these challenges. Deep learning techniques, rather than relying on manual feature extraction, extract automated features from raw images [18, 19]. In this work, we proposed a sequential lightweight deep learning architecture for human gait recognition. Our major contributions are listed as follows:
(i) Two pretrained deep learning models, namely VGG-19 and MobileNet-V2, are modified based on the target dataset classes and their weights are adjusted. Then, both models are trained using transfer learning without freezing any layer, and newly trained models are obtained.
(ii) Feature engineering is performed on fully connected layer 7 (VGG-19) and the global pooling layer (MobileNet-V2), and the features are fused by employing discriminant correlation analysis (DCA).
(iii) A modified moth-flame optimization algorithm is developed for the selection of optimum features, which are finally classified using an extreme learning machine (ELM).
The rest of the article is organized as follows: Section 2 describes the related work. Section 3 discusses the specifics of the selected datasets. The proposed methodology is presented in Section 4. Section 5 discusses and explains the experimental results. Finally, Section 6 concludes the manuscript, followed by the references.
2. Related Works
Identification of humans through gait is a prominent biometric application, and researchers have studied it extensively by extracting feature values [20]. In the literature, various machine learning and computer vision-based techniques have been implemented for human gait recognition [21]. Liao et al. [22] presented PoseGait, a model-based gait recognition method. In this approach, the human 3D pose is estimated using a CNN, and spatiotemporal features extracted from the 3D pose are then used to improve recognition. Two publicly available datasets, CASIA B and CASIA E, are used for experimentation, and the method gives promising results in the presence of covariate factors. Sanjay et al. [23] introduced an automated approach for human gait recognition in the presence of covariate situations. In the first step, basic distinct stances in the gait cycle are detected, which are used to compute gait features related to these detected stances, termed the dynamic gait energy image (DGEI). A generative adversarial network (GAN) is then used to generate the corresponding dynamic gait energy image. These extracted features are compared with the gallery sequences, and the GAN-created DGEI is used for final recognition. Publicly available datasets such as CASIA B, TUM Gait, and OU-ISIR TreadMill B are used to validate the presented approach, and it gives considerably improved results compared with existing methods. Chen et al. [24] introduced a method for cross-view gait recognition using deep learning. A multiview gait generative adversarial network is introduced for creating fake gait data samples to extend the existing data. The method is then used to train each instance of each view involved in single or multiple datasets. Domain alignment using projected maximum mean discrepancy (PMMD) is utilized to minimize the effect of distribution divergence. CASIA B and OUMVLP are used for experimentation, and the achieved results show that the introduced method outperforms existing methods. Hou et al. [25] presented a set residual network-based gait recognition model to detect more discriminative features from the silhouettes. A set residual block is introduced to extract the silhouette-level features at two levels in parallel, and a residual connection is then applied to join the two-level features. Moreover, an efficient method is applied to utilize features from the deep layers. Two datasets, CASIA B and OUMVLP, are used for experimentation. The applied approach gives consistent results compared with existing methods. Gul et al. [13] introduced a machine vision method to extract distinct gait features under covariate factors. Spatiotemporal gait features are extracted and used to train a 3D CNN model to overcome these challenges. The model then uses a holistic method to implement the distinct gait features in the form of gait energy images. Two publicly available datasets, OULP and CASIA B, with large gender and age differences, are used to test the validity of the introduced method. The presented approach gives promising results on the CASIA B dataset compared with existing methods. The methods presented above concentrated on both spatial and temporal data. None of them focused on feature fusion or optimization of extracted features to achieve better results in the shortest amount of time.
As a result, in this article, we proposed a lightweight deep learning framework for human gait recognition that not only improves accuracy but also reduces a system’s computational time.
3. Datasets
3.1. TUM GAID Dataset
The TUM Gait from Audio, Image and Depth (GAID) [26] dataset contains RGB video, audio, and depth data. It comprises 305 subjects recorded in two indoor walking sequences, in which four distinct situations are captured without any view variation through a Microsoft Kinect: six normal walk videos (n1–n6), two videos carrying a bag (b1–b2), and two walking videos wearing coating shoes (s1–s2); in addition, there is an elapsed-time session in which 32 subjects were recorded wearing different clothes. A few sample images are shown in Figure 1.

3.2. CASIA B Dataset
CASIA B [27] is a multiview indoor gait dataset in which 124 subjects were included in the recording session, of which 93 are male and 31 are female. This dataset considers the major covariate factors for gait recognition, that is, variation in view angle, clothing, and carrying conditions, separately. For each subject, videos are captured through a USB camera from 11 different views and include six normal walking sequences (NM), two walking sequences wearing a coat (CL), and two walking sequences carrying a bag (BG). A few sample images are shown in Figure 2.

4. Methodology
The proposed lightweight (LW) human gait recognition framework is presented in this section with detailed mathematical formulations and flow diagrams. The main architecture is shown in Figure 3. As illustrated in this figure, the proposed method consists of several important steps: modification of pretrained CNN models, training of the modified models using transfer learning (TL), feature engineering on the global average pooling layers, fusion of the extracted deep features using discriminative canonical correlation analysis (DCCA), selection of the best features using the modified moth-flame optimization algorithm, and finally classification using the extreme learning machine (ELM). The details of each step are given below.

4.1. Convolutional Neural Network (CNN)
The convolutional neural network (CNN) has become an important tool for recognition tasks in the domain of computer vision. The CNN architecture is employed for feature extraction from an image through several hidden layers. CNNs have many layers, including convolutional, pooling, fully connected, and others. The convolution layer (CL) is the most important layer of a CNN; it performs a 2D convolution between the input and the kernels through a forward pass. The kernel weights in every CL are assigned randomly, and their values are updated at each step through the loss function during network training. In the end, the learned kernels can identify certain types of shapes within the input images. In a CL, three steps are performed: convolution, stacking, and a nonlinear activation function.
Suppose we have an input matrix $X$ and an output $Y$ of the CL, and a set of kernels $\{K_1, K_2, \ldots, K_{C_f}\}$; then the output of the convolution process after step 1 is represented as

$$Y_i = X \ast K_i, \quad i = 1, 2, \ldots, C_f,$$

where $\ast$ refers to the convolution process, which is the product of the filter and the input. Second, all activation maps are combined to create a new 3D activation map:

$$Y = \left[Y_1, Y_2, \ldots, Y_{C_f}\right],$$

where $[\cdot]$ represents the combination operation along the channel direction, and $C_f$ is the total number of filters. Third, the 3D activation map is given as input into the activation function, which gives the resultant activation map as

$$Z = f(Y).$$
The sizes of the three main matrices (input, filter, and output) are taken as

$$X: H_{in} \times W_{in} \times C_{in}, \quad K: H_{f} \times W_{f} \times C_{f}, \quad Z: H_{out} \times W_{out} \times C_{out},$$

where the variables $H$, $W$, and $C$ represent the height, width, and channels of the activation map, and the subscripts $in$, $f$, and $out$ represent input, filter, and output, respectively. It contains two equalities. First, the channels of the input $C_{in}$ equal the channels of the filter. Second, the channels of the output $C_{out}$ equal the number of filters $C_f$. Suppose $p$ represents padding and $s$ represents stride; then the size of the output can be evaluated as

$$H_{out} = \left\lfloor \frac{H_{in} + 2p - H_{f}}{s} \right\rfloor + 1, \quad W_{out} = \left\lfloor \frac{W_{in} + 2p - W_{f}}{s} \right\rfloor + 1,$$

where $\lfloor \cdot \rfloor$ is the floor function. The nonlinear activation generally selects the rectified linear unit (ReLU) function [28]:

$$f(y) = \max(0, y),$$

where $y$ is a component of the activation map $Y$. At present, ReLU is the most used nonlinear activation function compared with the traditional hyperbolic tangent (HT) and sigmoid (SM) functions, which are computed as

$$\tanh(y) = \frac{e^{y} - e^{-y}}{e^{y} + e^{-y}}, \quad \sigma(y) = \frac{1}{1 + e^{-y}}.$$
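To make the layer arithmetic concrete, the following minimal Python sketch computes the output size from the floor formula above and applies ReLU to a single component; the 224 × 224 example mirrors the VGG-style 3 × 3 convolutions discussed in Section 4.3 and is illustrative only.

```python
# A minimal sketch of the convolution-size arithmetic described above,
# assuming the same padding p and stride s on both axes (an assumption).
import math

def conv_output_size(h_in, w_in, h_f, w_f, p, s):
    """Output height/width of a 2D convolution using the floor formula."""
    h_out = math.floor((h_in + 2 * p - h_f) / s) + 1
    w_out = math.floor((w_in + 2 * p - w_f) / s) + 1
    return h_out, w_out

def relu(y):
    """ReLU applied to one component y of the activation map."""
    return max(0.0, y)

# A 224 x 224 input with a 3 x 3 filter, 1-pixel padding, and stride 1
# keeps its spatial size, as in the VGG-19 convolution sets.
print(conv_output_size(224, 224, 3, 3, p=1, s=1))  # (224, 224)
```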
4.2. Transfer Learning
Transfer learning (TL) is the branch of machine learning that transfers the knowledge of one domain to a different domain within less computational time. Given a source domain $\mathcal{D}_s$ with a source task $\mathcal{T}_s$ and a target domain $\mathcal{D}_t$ with a target task $\mathcal{T}_t$, TL aims to learn a better mapping function $f_t(\cdot)$ for the target task by transferring knowledge from the source domain $\mathcal{D}_s$ and task $\mathcal{T}_s$, where $\mathcal{D}_s \neq \mathcal{D}_t$ or $\mathcal{T}_s \neq \mathcal{T}_t$. Hence, the TL task is expressed as follows:

$$\mathrm{TL} = \left\langle \mathcal{D}_s, \mathcal{T}_s, \mathcal{D}_t, \mathcal{T}_t, f_t(\cdot) \right\rangle.$$
Hence, deep transfer learning (DTL) is defined as follows: given a TL task based on $\left\langle \mathcal{D}_s, \mathcal{T}_s, \mathcal{D}_t, \mathcal{T}_t, f_t(\cdot) \right\rangle$, DTL aims to acquire $f_t(\cdot)$ by leveraging a powerful deep learning (DL) process. Visually, the DTL process is shown in Figure 4.

4.3. Modified VGG-19 Model
Visual geometry group (VGG)-19 [29] is a modified version of VGG, consisting of 19 weighted layers, of which 16 are convolutional layers with 64, 128, 256, and 512 filters, a stride of 1, and padding of 1 pixel on each side. The convolutional layers are arranged in five sets: the first set contains 2 layers with 64 filters, the second set contains 2 layers with 128 filters, the third set contains 4 layers with 256 filters, and the last two sets each contain 4 layers with 512 filters. A max pooling layer with a 2 × 2 filter and a stride of 2 pixels follows each set of convolutional layers. The output is then passed to the fully connected layers. Three fully connected (FC) layers and one softmax layer are used for classification. In this work, we removed the last fully connected (FC) layer and added a new FC layer. Then, several hyperparameters are set, such as the learning rate, epochs, mini-batch size, optimizer, and training loss. Based on these hyperparameters, we retrained the modified model through TL and obtained a new model for the gait recognition task only. Later, this modified model is used for the feature engineering task.
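As a concrete illustration, the sketch below performs the head replacement described above on the torchvision implementation of VGG-19; the number of gait classes is a placeholder, not a value taken from the paper.

```python
# A hedged sketch of the VGG-19 modification, assuming the torchvision
# backbone; num_gait_classes is hypothetical and depends on the dataset.
import torch.nn as nn
from torchvision import models

num_gait_classes = 124  # placeholder, e.g., one class per CASIA B subject

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
# classifier[6] is the last FC layer, mapping 4096 features to the 1000
# ImageNet classes; replace it so the output matches the gait dataset.
vgg.classifier[6] = nn.Linear(4096, num_gait_classes)
# No layers are frozen, so transfer learning updates every weight.
for p in vgg.parameters():
    p.requires_grad = True
```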
4.4. Modified MobileNet-V2 Model
MobileNet-V2 [30] is a lightweight CNN-based model specially designed for mobile devices. This architecture can perform well on small datasets, as it can reduce the effect of overfitting, and it also optimizes memory consumption. In this network, 17 inverted residual blocks are used between two convolutional layers and one FC layer, so the depth of the network comprises 53 convolutional layers and one FC layer. The working of this architecture is based on two concepts: depth-wise separable convolution and the inverted residual method. In this architecture, a full convolutional layer is replaced with a factorized version that divides the convolution into two separate operations. The first, named depth-wise convolution, performs lightweight filtering by applying one convolutional filter to each input channel. The second, named point-wise convolution, creates new features by computing linear combinations of the input channels. The depth-wise separable block thus contains 2 convolutional layers: the first uses a 3 × 3 filter, while the other uses a 1 × 1 filter. The other two regular convolutional layers have filter sizes of 3 × 3 and 1 × 1. Moreover, in this architecture, ReLU6 is used instead of ReLU, as it is more robust for low-precision computation. Dropout and batch normalization are then applied, where layer activations are standardized to zero mean and unit variance, followed by a linear transformation [31].
In this work, we removed the last layer and added a new FC layer. Then, several hyperparameters are set: learning rate (0.005), epochs (100), mini-batch size (32), optimizer (Adam), and training loss. Based on these hyperparameters, we retrained the modified model through TL and obtained a new model for the gait recognition task only.
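A minimal training sketch with these hyperparameters is given below; the data loader is a placeholder that would be built from the gait frames, and the class count is again an assumption.

```python
# A sketch of the deep-transfer-learning step for MobileNet-V2 with the
# hyperparameters listed above (lr 0.005, 100 epochs, mini-batch 32, Adam).
import torch
import torch.nn as nn
from torchvision import models

num_gait_classes = 124  # placeholder value (assumption)

net = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
# Replace the final classifier layer (1280 -> 1000) with a new FC layer.
net.classifier[1] = nn.Linear(net.last_channel, num_gait_classes)

optimizer = torch.optim.Adam(net.parameters(), lr=0.005)
criterion = nn.CrossEntropyLoss()

def train(net, loader, epochs=100):
    """loader yields (frames, labels) mini-batches of size 32."""
    net.train()
    for _ in range(epochs):
        for frames, labels in loader:
            optimizer.zero_grad()
            loss = criterion(net(frames), labels)
            loss.backward()
            optimizer.step()
```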
4.5. Feature Engineering
Feature engineering is applied on the global average pooling layers of both models, yielding two feature vectors of dimensions $N \times m$ and $N \times n$. Mathematically, this process is defined as follows:

$$f_1 = \phi\left(\mathrm{VGG19}, \mathit{layer}\right), \quad f_2 = \phi\left(\mathrm{MobileNetV2}, \mathit{layer}\right),$$

where $m$ and $n$ represent the lengths of the feature vectors, and $\mathit{layer}$ defines the selected layer, such as global average pooling. Thereafter, the fusion process is performed using the discriminative canonical correlation analysis approach.
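One way to realize this step, sketched below under the assumption that the fine-tuned torchvision models are available, is to read the activations with forward hooks; the untrained backbones here merely stand in for the trained models, and the layer names are those of the torchvision implementations.

```python
# A hedged sketch of the feature-engineering step using forward hooks.
import torch
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg19(weights=None)        # stand-ins for the fine-tuned models
mnet = models.mobilenet_v2(weights=None)
feats = {}

# classifier[3] is the second 4096-d FC layer ("FC7") in torchvision's VGG-19.
vgg.classifier[3].register_forward_hook(
    lambda m, i, o: feats.update(vgg_fc7=o.detach()))
# MobileNet-V2 pools globally before its classifier; pooling the output of
# the final feature block yields the 1280-d global average pooling vector.
mnet.features.register_forward_hook(
    lambda m, i, o: feats.update(
        mnet_gap=F.adaptive_avg_pool2d(o, 1).flatten(1).detach()))

with torch.no_grad():
    x = torch.randn(8, 3, 224, 224)     # a dummy batch of gait frames
    vgg(x), mnet(x)
print(feats["vgg_fc7"].shape, feats["mnet_gap"].shape)
# torch.Size([8, 4096]) torch.Size([8, 1280])
```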
4.6. Discriminative Canonical Correlation Analysis-Based Fusion
In this work, the DCCA fusion approach is employed for feature fusion. By applying canonical correlation analysis (CCA), the correlated features $x$ and $y$ are extracted and merged for identification. However, the class information of the samples is not utilized, which limits the recognition capability of CCA. Moreover, CCA was originally introduced for modeling rather than recognition: since correlation measures the agreement between $x$ and $y$, CCA was more often utilized for modeling and estimation, for instance image retrieval and parameter prediction. If the extracted features are for recognition, then the class labels of the instances should be utilized to obtain more discriminative features. Therefore, the class information was fused into the CCA framework for combined feature extraction, giving a new approach for fused feature extraction in multimodal recognition, termed discriminative canonical correlation analysis (DCCA). Mathematically, this approach is defined as follows.
Suppose $z$ pairs of mean-normalized pairwise instances $\{(x_k, y_k)\}_{k=1}^{z}$ are available from $p$ classes. DCCA can be systematically represented by the following optimization problem:

$$\max_{w_x, w_y} \; w_x^{T}\left(C_w - \eta C_b\right) w_y, \quad \text{s.t.} \quad w_x^{T} w_x = w_y^{T} w_y = 1,$$

where the matrices $C_w$ and $C_b$ are used to compute the within-class association and the between-class association, respectively (a detailed description is given below), and $\eta$ is an adjustable parameter that shows the relative significance of the within-class association $C_w$ contrasted with the between-class association $C_b$. Moreover, the constraint $w_x^{T} w_x = w_y^{T} w_y = 1$ represents the scale limitation on $w_x$ and $w_y$. The within-class association matrix is

$$C_w = \sum_{i=1}^{p} \sum_{k=1}^{z_i} \sum_{l=1}^{z_i} x_k^{(i)} \left(y_l^{(i)}\right)^{T} = X A Y^{T},$$

where $x_k^{(i)}$ refers to the $k$th instance in the $i$th class, similarly $y_l^{(i)}$, and $z_i$ shows the number of instances of $x$ or $y$ in the $i$th class. The matrix $A$ is represented as

$$A = \operatorname{diag}\left(\mathbf{1}_{z_1}\mathbf{1}_{z_1}^{T}, \ldots, \mathbf{1}_{z_p}\mathbf{1}_{z_p}^{T}\right),$$

where $A$ is a symmetric, positive semidefinite, block-diagonal matrix and $\mathbf{1}_{z_i}$ is the $z_i$-dimensional all-ones vector. In contrast, the between-class association matrix is represented as

$$C_b = \sum_{i=1}^{p} \sum_{\substack{j=1 \\ j \neq i}}^{p} \sum_{k=1}^{z_i} \sum_{l=1}^{z_j} x_k^{(i)} \left(y_l^{(j)}\right)^{T} = -X A Y^{T}.$$

The last equality holds because mean normalization is applied to the instances, hence both $\sum_{i,k} x_k^{(i)} = 0$ and $\sum_{i,k} y_k^{(i)} = 0$ hold. Contrasting the expressions for $C_w$ and $C_b$, the only difference between them is a single negative sign; thus the objective becomes $(1 + \eta)\, w_x^{T} C_w w_y$, and the optimization problem is free of the parameter $\eta$, which can consequently be excluded. Hence, DCCA can be represented as

$$\max_{w_x, w_y} \; w_x^{T} C_w w_y, \quad \text{s.t.} \quad w_x^{T} w_x = w_y^{T} w_y = 1.$$

By applying the Lagrange multiplier technique, it becomes straightforward to obtain the main equations of DCCA, which are represented as

$$C_w C_w^{T} w_x = \lambda^{2} w_x, \qquad C_w^{T} C_w w_y = \lambda^{2} w_y.$$

When the vector pairs $\left(w_x^{(i)}, w_y^{(i)}\right)$, $i = 1, \ldots, d$, corresponding to the first $d$ largest generalized eigenvalues are attained, let $W_x = \left[w_x^{(1)}, \ldots, w_x^{(d)}\right]$ and $W_y = \left[w_y^{(1)}, \ldots, w_y^{(d)}\right]$; then both feature extraction and feature fusion can be performed using FFS-I and FFS-II, respectively, where $(W_x, W_y)$ fulfills the constraints $W_x^{T} W_x = I$ and $W_y^{T} W_y = I$. The formulation returns a fused vector of dimension $N \times d$, where $d \leq \min(m, n)$. Later on, this resultant vector is further improved using a modified moth-flame optimization algorithm.
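A compact NumPy sketch of this fusion is given below. It follows the simplified formulation above (mean-normalized features, unit-norm constraints) and obtains the projection pairs from the SVD of $C_w$, which is equivalent to the eigenproblem derived via the Lagrangian; the concatenation at the end follows the FFS-I style of fusion. The function name and the toy usage are assumptions, not the authors' implementation.

```python
# A minimal DCCA-style fusion sketch under the stated assumptions.
import numpy as np

def dcca_fuse(X, Y, labels, d):
    """X: (m, z), Y: (n, z) feature matrices; labels: (z,) class ids."""
    X = X - X.mean(axis=1, keepdims=True)   # mean normalization
    Y = Y - Y.mean(axis=1, keepdims=True)
    # Within-class correlation C_w = X A Y^T with A block-diagonal, which
    # reduces to a sum of per-class (class-sum of x)(class-sum of y)^T terms.
    Cw = np.zeros((X.shape[0], Y.shape[0]))
    for c in np.unique(labels):
        sx = X[:, labels == c].sum(axis=1, keepdims=True)
        sy = Y[:, labels == c].sum(axis=1, keepdims=True)
        Cw += sx @ sy.T
    # Leading singular pairs of C_w maximize w_x^T C_w w_y under unit norms.
    U, _, Vt = np.linalg.svd(Cw, full_matrices=False)
    Wx, Wy = U[:, :d], Vt[:d, :].T
    return np.vstack([Wx.T @ X, Wy.T @ Y])  # FFS-I: concatenate projections

# Toy usage: random 4096-d and 1280-d features, 40 samples, 4 classes.
rng = np.random.default_rng(0)
Z = dcca_fuse(rng.normal(size=(4096, 40)), rng.normal(size=(1280, 40)),
              np.repeat(np.arange(4), 10), d=32)
print(Z.shape)  # (64, 40)
```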
4.7. Moth-Flame Optimization Algorithm (MFO)
Several nature-inspired optimization algorithms have been introduced in the literature for best feature selection, such as the genetic algorithm, particle swarm optimization, and moth-flame optimization [32]. An improved moth-flame optimization algorithm is utilized in this work for best feature selection. The MFO algorithm was originally presented by Mirjalili [33] and belongs to the population-based metaheuristic algorithms. In this procedure, the data flow of MFO begins by randomly generating moths within the solution space. It then calculates the fitness value of each moth and labels the best position with a flame. Afterwards, the moth positions are updated through a movement function to attain better positions labeled by flames, and the new best individual positions are recorded. The previous process (i.e., updating the moths' locations and generating new locations) is repeated until the termination criteria are met. The MFO algorithm consists of three major steps, as follows.
4.7.1. Creating the Initial Population of Moths
As stated in [33], it is supposed that an individual moth can fly in a 1D, 2D, 3D, or hyperdimensional space. The matrix of moths can be represented as

$$M = \begin{bmatrix} m_{1,1} & m_{1,2} & \cdots & m_{1,a} \\ m_{2,1} & m_{2,2} & \cdots & m_{2,a} \\ \vdots & \vdots & \ddots & \vdots \\ m_{n,1} & m_{n,2} & \cdots & m_{n,a} \end{bmatrix},$$

where $n$ represents the number of moths and $a$ represents the number of dimensions in the solution space. Moreover, the fitness values for all moths are stored in an array represented as

$$OM = \begin{bmatrix} OM_1 & OM_2 & \cdots & OM_n \end{bmatrix}^{T}.$$

The remaining elements in the algorithm are the flames, which are represented in the same $a$-dimensional space together with their fitness values in the following matrices:

$$F = \begin{bmatrix} F_{1,1} & F_{1,2} & \cdots & F_{1,a} \\ F_{2,1} & F_{2,2} & \cdots & F_{2,a} \\ \vdots & \vdots & \ddots & \vdots \\ F_{n,1} & F_{n,2} & \cdots & F_{n,a} \end{bmatrix}, \quad OF = \begin{bmatrix} OF_1 & OF_2 & \cdots & OF_n \end{bmatrix}^{T}.$$
It is important to note that both moths and flames are solutions. The moths are the actual search agents that move around the search space, while the flames are the best positions the moths have obtained so far. Hence, each moth searches around a flame and updates it when it finds a better solution. Following this procedure, a moth never loses its best solution.
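The initialization step can be sketched as follows; the sphere fitness used in the example is purely illustrative, standing in for the feature-selection fitness used in the paper.

```python
# A sketch of MFO initialization: n moths in an a-dimensional space,
# with fitness values OM and flames F sorted by fitness.
import numpy as np

rng = np.random.default_rng(0)

def init_moths(n, a, lb, ub):
    """I function: n random moth positions within [lb, ub]."""
    return (ub - lb) * rng.random((n, a)) + lb

M = init_moths(n=5, a=3, lb=-1.0, ub=1.0)
OM = np.array([np.sum(m ** 2) for m in M])   # toy sphere fitness (assumption)
order = np.argsort(OM)
F, OF = M[order], OM[order]                  # flames: best positions so far
```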
4.7.2. Updating Moths’ Location/Positions
MFO utilizes three distinct functions to converge to the global optimum of the optimization problem. Mathematically, it is defined as follows:

$$\mathrm{MFO} = (I, P, T),$$

where $I$ represents the function that generates the first random positions of the moths ($I: \emptyset \rightarrow \{M, OM\}$), $P$ represents the function that moves the moths in the search space ($P: M \rightarrow M$), and $T$ represents the function that ends the search process ($T: M \rightarrow \{\mathrm{true}, \mathrm{false}\}$). The equation given below represents the $I$ function, which is used for the implementation of the random distribution:

$$M(i, j) = \left(ub(j) - lb(j)\right) \times \mathrm{rand}() + lb(j),$$

where $ub$ and $lb$ refer to the upper and lower bound variables, respectively. As discussed before, the moths fly in the search space by means of transverse orientation. There are three conditions that should be followed when applying a logarithmic spiral: (i) the spiral's starting point should be the moth; (ii) the spiral's endpoint should be the position of the flame; and (iii) the fluctuation of the range of the spiral should not exceed the search space. Thus, in the MFO algorithm, the logarithmic spiral is defined as

$$S\left(M_x, F_y\right) = D_x \cdot e^{bt} \cdot \cos(2\pi t) + F_y,$$

where $D_x = \left|F_y - M_x\right|$ represents the distance between the $x$th moth and the $y$th flame, $b$ is a constant defining the shape of the logarithmic spiral, and $t$ is a random number in $[-1, 1]$.
In MFO, the balance between exploitation and exploration is ensured by the spiral motion of the moth near the flame in the search space. Moreover, to escape from falling into the trap of local optima, the best solutions are kept at each step, and the moths fly around the flames by means of the VP and VH matrices. The update criterion is then defined as follows:

$$M_x = S\left(M_x, F_y\right) = D_x \cdot e^{bt} \cdot \cos(2\pi t) + F_y.$$
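The spiral move itself is a one-liner once $D_x$, $b$, and $t$ are in hand; the sketch below applies it element-wise to a moth-flame pair and is a direct transcription of the equation above.

```python
# A sketch of the logarithmic-spiral update S(M_x, F_y).
import numpy as np

rng = np.random.default_rng(1)

def spiral_update(moth, flame, b=1.0):
    """Move a moth toward its flame; b shapes the spiral (an assumed
    constant here) and t is drawn uniformly from [-1, 1]."""
    D = np.abs(flame - moth)                      # distance D_x
    t = rng.uniform(-1.0, 1.0, size=moth.shape)
    return D * np.exp(b * t) * np.cos(2 * np.pi * t) + flame
```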
4.7.3. Updating the Size of Flames
This part aims to augment the exploitation of the MFO algorithm (i.e., updating the moths' locations at many different positions in the search space may reduce the chance of exploiting the best solutions). Minimizing the number of flames helps to overcome this problem using the following equation:

$$\mathit{flame\;no} = \operatorname{round}\left(N - l \times \frac{N - 1}{T}\right),$$

where $N$ refers to the maximum number of flames, $l$ refers to the current iteration number, and $T$ represents the maximum number of iterations. This equation returns the best features; however, during the analysis stage, it was observed that the best selected features contain some redundant information. Therefore, we tried to overcome this problem and speed up the selection process based on a Newton-Raphson (NR) formulation. Mathematically, the NR method is defined as follows:

$$x_{n+1} = x_n - \frac{f\left(x_n\right)}{f'\left(x_n\right)},$$

where $f(x_n)$ and $f'(x_n)$ are computed over the features selected by the moth-flame algorithm. Through the above formulation, a stop value is obtained that is added to the flame-number equation for the final selection.
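The flame-count schedule and a generic NR step can be written as follows; the functions f and df in the NR helper are placeholders, since the paper applies the formulation to the selected moth-flame features without giving them in closed form.

```python
# The flame-count schedule above, plus a generic Newton-Raphson step.
def flame_count(N, l, T):
    """Flames remaining at iteration l of T, starting from N flames."""
    return round(N - l * (N - 1) / T)

def newton_raphson_step(x, f, df):
    """One NR update x - f(x)/f'(x); f and df are hypothetical stand-ins
    for the functions the paper evaluates on the selected features."""
    return x - f(x) / df(x)

print([flame_count(10, l, 100) for l in (0, 50, 100)])  # [10, 6, 1]
```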
The final selected features are passed to the extreme learning machine (ELM) for classification. A few visual predicted frames are shown in Figure 5.
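For completeness, a minimal ELM is sketched below: hidden weights are drawn at random and the output weights are obtained in closed form with a pseudoinverse, which is what makes ELM fast at both training and test time. The hidden-layer size is an assumption.

```python
# A minimal extreme learning machine (ELM) sketch for final classification.
import numpy as np

class ELM:
    def __init__(self, n_hidden=1000, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        """X: (samples, features); y: integer class labels."""
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = np.tanh(X @ self.W + self.b)      # random hidden activations
        T = np.eye(y.max() + 1)[y]            # one-hot targets
        self.beta = np.linalg.pinv(H) @ T     # closed-form output weights
        return self

    def predict(self, X):
        H = np.tanh(X @ self.W + self.b)
        return np.argmax(H @ self.beta, axis=1)
```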

5. Experimental Results and Analysis
In this section, the detailed experimental process of the proposed method is presented in the form of tables and graphs. The proposed method is tested using two publicly available datasets, CASIA B and TUM GAID. Section 3 contains more information on both datasets. Instead of a 70 : 30 split, the selected datasets are divided 50 : 50 for training and testing. The main reason for this partitioning is to make the validation process more equitable. All of the results are based on 10-fold cross-validation. For the classification results, several classifiers are used, including the extreme learning machine (ELM), support vector machine (SVM), k-nearest neighbors (KNN), ensemble tree (EBT), and decision trees (DTs). The entire proposed framework is implemented in MATLAB R2021b on a personal desktop computer with a Core i7 CPU, 32 GB RAM, and an 8 GB graphics card.
5.1. Results
5.1.1. CASIA B Dataset Results
The results of the proposed method for the CASIA B dataset are presented in the form of numerical values and a time plot in this section. Table 1 provides the classification results of the CASIA B dataset from all perspectives. Usually, researchers choose only a few angles, but in this work we chose all 11 angles to test the capability of the proposed algorithm. Each angle has three classes: normal walk (NM), walk with a bag (BG), and walk wearing a coat (CL). On this dataset, ELM performed better, with average accuracies of 96.89, 93.07, and 83.66% for NM, BG, and CL, respectively. For each angle, the obtained accuracy is above 90%, which shows the effectiveness of the proposed method. A comparison of ELM with other classifiers such as SVM, FKNN, EBT, and DT shows that ELM performed better than all of them. Moreover, the time of each classifier is also noted, as shown in Figure 6. From this figure, it is observed that the ELM and DT classifiers executed faster than the other listed methods.

5.1.2. TUM GAID Dataset Results
The results of the proposed method on the TUM GAID dataset are given in Table 2. In this table, the accuracy is computed for each class of the selected dataset: normal walk, walk with a bag, and walk with shoes. Moreover, the average accuracy of each classifier is also computed. Several classifiers are evaluated, and ELM shows the best average accuracy of 98.60%. The rest of the classifiers obtained average accuracies of 97.25, 96.73, 96.91, and 96.26%, respectively. The computational time of each classifier is also computed and plotted in Figure 7. It can be seen from this figure that ELM has the minimum computation time of 86.43 s compared with the rest of the classifiers. Hence, overall, the ELM classifier performed better with the proposed method on the TUM GAID dataset.

5.2. Discussion and Comparison
A detailed analysis of the proposed framework is conducted in this section based on confidence intervals and the standard error of the mean (SEM). As given in Tables 3 and 4, the proposed LightweightDeep-ELM framework gives better accuracy than the other combinations on the CASIA B dataset. Similarly, the proposed framework (LightweightDeep-ELM) also obtained better results on the TUM GAID dataset. Moreover, the average computational time of each classifier for both datasets is shown in Figures 6 and 7; the ELM execution time is the minimum among the selected classifiers. To further analyze the performance of the ELM classifier, the proposed framework was executed 500 times, and two values were computed: the minimum accuracy and the maximum accuracy. Based on the minimum and maximum accuracy, the standard error of the mean is computed. Through the SEM, a confidence interval is obtained that shows the consistency of the proposed framework.
Table 3 provides the confidence interval-based analysis of the CASIA B dataset. The confidence level and margin of error (MoE) are calculated for each class: walk, bag, and coat. We selected several confidence levels, namely 68.3% (z = 1.0), 90% (z = 1.645), 95% (z = 1.960), and 99% (z = 2.576), and the MoE obtained for each is noted in the table. Based on the MoE, it is observed that the proposed framework showed consistent performance on the CASIA B dataset after 500 iterations. Similarly, Table 4 provides the confidence interval-based analysis of the TUM GAID dataset. From this table, it is also confirmed that the proposed method's accuracy is consistent over the iterations.
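As a worked example of this computation, the snippet below derives the SEM from a hypothetical minimum/maximum accuracy pair (the values are illustrative, not taken from Tables 3 and 4) and reports MoE = z × SEM at the four confidence levels.

```python
# Margin-of-error computation MoE = z * SEM over repeated runs; the
# accuracy pair below is hypothetical, for illustration only.
import statistics

acc_min, acc_max = 90.8, 91.6  # hypothetical min/max accuracy (%)
# SEM of the two-point sample {acc_min, acc_max}: stdev / sqrt(n), n = 2.
sem = statistics.stdev([acc_min, acc_max]) / (2 ** 0.5)
for level, z in [(68.3, 1.0), (90, 1.645), (95, 1.960), (99, 2.576)]:
    print(f"{level}% confidence: MoE = ±{z * sem:.3f}")
```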
At the end, a detailed comparison is conducted with recent techniques on both selected datasets, CASIA B and TUM GAID. Table 5 compares the accuracy of the proposed method with recent techniques on the CASIA B dataset. In this table, the authors of [34] obtained an average accuracy of 51.4% on the CASIA B dataset. The authors in [35] improved the average accuracy to 84.2%, which was later further improved to 87.5% by [36]. Recently, the authors of [37] obtained an average accuracy of 89.66% on the CASIA B dataset, which improves upon the previously noted techniques. Our method achieved an accuracy of 91.20% on the CASIA B dataset, which is better than the existing techniques. Similarly, the comparison on the TUM GAID dataset is given in Table 6. In this table, it is noted that recently achieved accuracies were 84.4%, 96.7%, 97.9%, and 97.73%. Our proposed method obtained an accuracy of 98.60%, which improves upon the recent state-of-the-art (SOTA) techniques.
Finally, the improved moth-flame optimization algorithm is compared with several other nature-inspired algorithms (Figure 8), such as the genetic algorithm, particle swarm optimization, bee colony optimization, ant colony optimization, whale optimization, crow search, and the firefly algorithm. This graph shows that the proposed optimization algorithm outperforms the compared algorithms in terms of accuracy. Moreover, gait analysis is important for several further purposes, such as assisting those suffering from Parkinson's disease [41, 42]. In this work, we used Adam as the optimizer [18] during the training of the deep learning models instead of stochastic gradient descent (SGD). For the gait recognition task, SGD did not perform better than Adam due to the high number of video frames, and Adam is known to be computationally fast, to require less memory, and to need little tuning.

6. Conclusion
Human gait recognition using lightweight deep learning models and an improved moth-flame optimization algorithm has been presented in this work. Two lightweight pretrained CNN models were fine-tuned and trained using deep transfer learning. Features are extracted from the global average pooling layers and fused using the DCCA approach. Furthermore, an optimization algorithm is developed for the selection of the best features. The proposed method was compared across several classifiers, such as ELM and SVM, and ELM was found to be more suitable in terms of accuracy and time. Two publicly available datasets were employed for the validation process, achieving improved average accuracies of 91.20% and 98.60%, respectively. The key findings of this work are as follows: (i) freezing a few middle layers can train a model in less time, but at the risk of sacrificing accuracy; (ii) fusing the features of lightweight models using the DCCA approach is time-consuming, but in the end, better information in the form of features is obtained; and (iii) the improved optimization algorithm provides better accuracy and reduces the computational time. In the future, a new scratch-based CNN model will be developed for human gait recognition.
Data Availability
The datasets used in this work are publicly available (http://www.cbsr.ia.ac.cn/english/Gait%20Databases.asp).
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Authors’ Contributions
All authors contributed equally in this work and have read and agreed to the published version of the manuscript.
Acknowledgments
This research was supported by Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0012724, The Competency Development Program for Industry Specialist) and the Soonchunhyang University Research Fund.