Abstract

Through in-depth study of biological characteristics, the gait of each person has been found to be unique, so gait characteristics can be used to identify a person. However, pedestrian gait information is often affected by interference factors such as viewing angle, dress, and carried objects. To solve this problem, the idea of perspective transformation is proposed: gait images from different perspectives and different states are transformed into side-view gait images under standard conditions. A generative adversarial network (GAN) is adopted for the perspective transformation. Moreover, to address the instability and convergence difficulty of GANs during training, we introduce a slack module that compensates for these shortcomings. Experiments show that under normal walking conditions, our method achieves an average accuracy of 98.32% on the CASIA-B gait dataset.

1. Introduction

Gait is the way a person walks and can be used as a feature to identify an individual. Compared with fingerprints, faces, irises, palm prints, and other biological features, gait features have unique advantages in the collection process, such as being noncontact and hard to forge, and are especially suitable for long-distance human body recognition [1]. In the past few decades, gait recognition algorithms have been widely applied in areas such as video surveillance, crime prevention, and forensic identification. Therefore, research on gait recognition has important theoretical value and practical significance [2].

Because many factors affect gait information, we consider feeding nonstandard gait images into a network that transforms their perspective into a standard one. After comparing candidate models, the generative adversarial network (GAN) is adopted. The method used in this paper can not only generate a realistic gait energy image (GEI) but also maintain the identity information of pedestrians, and its training process converges stably. The challenges of current research are that GANs are difficult to train, struggle to generate realistic images, and have difficulty effectively maintaining the identity information of subjects while generating standard images.

To deal with these problems, the GEI is usually used as the network input. GEI is a template that mixes static and dynamic gait information: by averaging the intensity of contour pixels over a gait cycle, the energy of each pixel in the template is obtained. Different from directly feeding the gait contour map to a deep network, the method in [3] used the GEI as the model input and proposed a two-layer convolutional structure to solve the crossview problem in gait recognition. DeepGait was proposed in [4], using the VGG-16 (Visual Geometry Group-16) pretrained model, which builds a convolutional representation of gait contour features. A gait biometric-based person identification method using a deep convolutional neural network (CNN) was proposed in [5] to learn the critical discriminative gait features for human identification; the technique uses gait energy images of humans for identification. The literature [6] also used a deep convolutional neural network to extract a person's gait features by training the network architecture with GEIs.

The method based on the key points of human body posture has a certain processing ability for self-occlusion, clothing, and carried objects. Processing the key points of the human body posture can alleviate the impact of carrying variations on recognition performance, but the effectiveness of such models has not been verified in the crossview scene. The work in [7] extracted human pose information from the original video sequence, including the positions of six key points (left and right hips, left and right knees, and left and right ankles) in each frame, and then built the pose-based temporal-spatial network (PTSN) on these key points. The literature [8] designed three different prediction models to verify the influence of model structure on gait recognition performance; by introducing 3D convolution and extracting gait temporal information, a single model can achieve accurate recognition. The literature [9] proposed using 3D convolution to capture the spatial and temporal information in the gait sequence and used a multiview three-dimensional convolutional network (MV3DCNN), taking the gait image sequence and gray information as input features, to handle recognition under clothing variation. To solve the problem that a convolutional network cannot handle gait sequences of variable length, the algorithm divides a gait sequence into several short sequences of fixed length. The spatiotemporal deep neural network (STDNN) for multiviewpoint gait recognition was proposed in [10]; STDNN includes a Temporal Feature Network (TFN) and a Spatial Feature Network (SFN). The literature [11] proposed attentive spatial-temporal summary networks to learn salient spatial-temporal and view-independent features for irregular gait recognition. The literature [12] divided the network into three submodules: a Feature Convolutional Neural Network (FCNN), a Mapping Convolutional Neural Network (MCNN), and a fully connected layer.

The traditional generative method needs to estimate the angle of the gait sequence relative to the camera in advance, and a separate model must be trained for each pair of angles, which limits its value in real scenes. To improve the practicability of generative methods in gait recognition, a unified model based on a multilayer autoencoder was proposed in [13] to alleviate the changes of covariates such as perspective, clothing, and carried objects. The literature [14] proposed a stacked sparse autoencoder (SSAE) network to extract view-invariant features of images. Each stacked layer maps the gait energy image (GEI) to a virtual image with a small change in viewing angle; repeating this process gradually reduces the feature changes caused by viewing angle variation. The output of the SSAE can then be used as a feature for identification.

A new gait recognition architecture based on deep learning and mathematical voting algorithms was proposed in [15]. Different from traditional gait feature extraction methods, this method uses a convolutional restricted Boltzmann machine (CRBM) for unsupervised feature extraction and adds voting algorithms to the structure. A clothing-invariant gait recognition method based on CNNs was proposed in [16]; it automatically learns to extract the most discriminative gait features from low-level input data (i.e., GEIs).

The method based on the Gait Generative Adversarial Network (GaitGAN) can simultaneously deal with the impact of influences such as perspective and clothing on recognition performance [17]. The generative adversarial network has proven effective in fitting sample distributions. The literature [18] proposed the GaitGANv2 algorithm, which adds softmax loss and contrastive loss to GaitGAN to increase the interclass distance between different subjects and narrow the intraclass distance of the same subject. The literature [19] proposed a generative adversarial network to address the problem of gait recognition from an incomplete gait cycle; the network can reconstruct a complete GEI from an incomplete one. The literature [20] used a Multiview Gait Generative Adversarial Network (MvGGAN) to generate fake gait samples to extend existing gait datasets, providing adequate samples for deep learning-based crossview gait recognition methods. A generative adversarial network (CA-GAN) was proposed in [21] to map gait images between different views, so that more realistic gait images can be obtained for crossview gait recognition. In CA-GAN, the generative network consists of two branches that simultaneously perceive the human's global context and local body part information. The method of Multitask Generative Adversarial Networks (MGANs) was proposed in [22] to learn view-specific feature representations; to retain more temporal information, a new multichannel gait template called the Periodic Energy Image (PEI) was also proposed. Based on the hypothesis of the view manifold, MGANs can use adversarial training to extract more discriminative features from the gait sequence.

Given all that, in actual applications gait information is often affected by many factors, among which the most common and most pressing is the multiview gait recognition problem. Besides the viewing angle, clothing and carried items also affect the appearance of a pedestrian's gait. A view transformation method based on the slack module is proposed to eliminate the above effects.

Our main contributions can be summarized as follows: (1) The slack module. In order to regulate the influence of the discriminators on the generator, we add a slack module to the network. The slack module enhances the effect of the generator, making the generated images more realistic and yielding a higher recognition rate in the recognition stage. (2) Dynamic learning rate. For the perspective transformation process, a learning rate better suited to the model is designed; this learning rate makes training more stable.

The rest of the paper is organized as follows. Section 2 describes the proposed method. Experiments and evaluations are presented in Section 3. Section 4 gives the conclusions and identifies future work.

2. Slack Allocation Generative Adversarial Network

In order to reduce the influence of view variation on gait recognition, we use a generative adversarial network to perform view transformation. Research shows that the side-view gait image often contains rich gait information. Therefore, our goal is to generate a standard gait image: side view, standard dress, and no carried items. The method proposed in this paper improves on other GAN methods in that we impose constraints on the generator. These constraints let the generator meet different training purposes at different training stages and make both the true-false discriminator and the identity discriminator converge stably. The method adopted in this paper is thus better matched to the view transformation task and improves the recognition accuracy of the algorithm.

2.1. Gait Energy Image (GEI)

In the field of gait recognition, the gait energy image (GEI) is the most common feature. To obtain it, the gait silhouette is extracted frame by frame from the gait video, and the silhouettes over one period are synthesized into a single image by the following formula; such an image is called the gait energy image. The GEI reflects the gait information of a pedestrian over a cycle and can be obtained by simple processing of the gait silhouettes. The silhouettes and GEIs used in the experiments are produced as

$$G_j(x, y) = \frac{1}{N_j} \sum_{t=1}^{N_j} B_{j,t}(x, y),$$

where $B_{j,t}(x, y)$ represents the pixel value of the coordinate $(x, y)$ in the gait silhouette image at time $t$ in the $j$th gait sequence, and $N_j$ is the number of frames in that sequence's gait cycle. As shown in Figure 1, a gait energy image (the rightmost one) is produced by averaging all the silhouettes (all the remaining images on the left) in one gait cycle.
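As a minimal sketch of this computation (assuming the aligned, size-normalized binary silhouettes of one cycle are already available as a NumPy array; the function name is illustrative):

```python
import numpy as np

def gait_energy_image(silhouettes: np.ndarray) -> np.ndarray:
    """Average the silhouettes of one gait cycle into a GEI.

    silhouettes: array of shape (N, H, W) holding the N aligned,
    size-normalized binary silhouette frames of a single gait cycle.
    Returns an (H, W) array whose value at (x, y) is the average
    intensity of that pixel over the cycle.
    """
    return silhouettes.astype(np.float32).mean(axis=0)
```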

2.2. Generative Adversarial Network

The generative adversarial network (GAN) is a deep learning model. In practical applications, the generator and discriminator are usually implemented as convolutional neural networks (CNNs). The traditional generative adversarial network consists of two parts: a generator (G) and a discriminator (D). The generator produces pictures of the same type as real pictures, while the discriminator judges whether an input picture comes from the generator or from the real data. In this way, adversarial competition arises between the two networks: the generator is continually trained to generate higher-quality images, and the discriminator is continually trained to improve its ability to determine whether an image is real or generated.
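A minimal PyTorch sketch of this adversarial competition (G, D, and the two optimizers are assumed to be defined elsewhere; names and shapes are illustrative, and D is assumed to output a probability per image):

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_step(G, D, opt_G, opt_D, real_images, noise):
    ones = torch.ones(real_images.size(0), 1)    # label for real images
    zeros = torch.zeros(real_images.size(0), 1)  # label for fake images

    # Discriminator update: score real images as 1, generated ones as 0.
    fake_images = G(noise).detach()
    loss_D = bce(D(real_images), ones) + bce(D(fake_images), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator update: try to make D score generated images as real.
    loss_G = bce(D(G(noise)), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```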

2.3. Slack-GAN for Gait Recognition

Traditional generative adversarial networks use noise as the input of the generator. Research has found that the input of the generator can instead be an image, so that a pixel-level transfer between the input image and the target image can be realized. PixelDTGAN can convert visual input into a different domain and generate pixel-level images as output [23]. This kind of improvement to GAN can simulate the similarity between the visual scene and the target perceived by the human eye. Two spaces are defined in PixelDTGAN: the source space (images of the target subject under different angles) and the target space (images of the target subject under the standard perspective). The network contains three important parts. The generator takes an image as input and generates a gait image. The authenticity discriminator determines whether the sample generated by the generator is a pedestrian gait sample, and the domain recognition discriminator determines whether the sample generated by the generator belongs to the target subject. Yoo et al. aimed to preserve the category of the generated samples while generating realistic images. However, the training process of the generator is difficult, so the generator needs an additional loss function to constrain the generated target image. The domain recognition discriminator takes the original image and a candidate target image as input and, through training, outputs the probability that the input pair belongs to the same subject. The loss function for the identity discriminator is defined as

$$L_A = -t \log\!\big[A(x_s, x)\big] + (t - 1)\log\!\big[1 - A(x_s, x)\big], \quad t = \begin{cases} 1 & \text{if } x = x_t, \\ 0 & \text{if } x = x_t^-, \\ 0 & \text{if } x = \hat{x}_t, \end{cases}$$

where $x_s$ is the source image, $x_t$ is the ground truth target, $x_t^-$ is the irrelevant target, and $\hat{x}_t$ is the generated image from the converter.

Another component is the real-fake discriminator, which, similar to a traditional GAN, is supervised by real/fake labels so that the entire network produces realistic images. The discriminator outputs a scalar probability indicating whether the image is real. The true-false discriminator's loss function takes the form of binary crossentropy,

$$L_R = -\log\!\big[R(x)\big] - \log\!\big[1 - R(\hat{x})\big], \quad x \in X^{+},\ \hat{x} \in X^{-},$$

where $X^{+}$ contains real training images and $X^{-}$ contains fake images produced by the generator. The two discriminators are fed images and labels so that the network can produce realistic images while preserving the semantic information of individual identity.
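A sketch of the two discriminator losses in PyTorch (A and R are assumed to be networks that output probabilities in (0, 1), with A taking an image pair; the helper names are ours, not the paper's):

```python
import torch
import torch.nn.functional as F

def pair_bce(score, label):
    # Binary cross-entropy of a probability score against a constant label.
    return F.binary_cross_entropy(score, torch.full_like(score, label))

def identity_discriminator_loss(A, x_s, x_t, x_t_neg, x_hat):
    # Ground-truth (source, target) pair -> 1; pairs with an irrelevant
    # target or with the generated image -> 0.
    return (pair_bce(A(x_s, x_t), 1.0)
            + pair_bce(A(x_s, x_t_neg), 0.0)
            + pair_bce(A(x_s, x_hat.detach()), 0.0))

def real_fake_discriminator_loss(R, x_real, x_fake):
    # Real GEIs are labeled 1, generated GEIs 0.
    return pair_bce(R(x_real), 1.0) + pair_bce(R(x_fake.detach()), 0.0)
```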

To reduce the impact of these variations, GAN was used in [17] as a regressor to generate a standard gait image. The generated standard image shows the subject's gait viewed from the side, wearing normal (standard) clothes, and carrying nothing. A gait image from any pose is converted to this standard view because the side view contains richer gait information. Although this is intuitively appealing, the key problem that must be solved is how to preserve the person's identity information in the generated gait image.

The method in this paper draws on the structure of the three-player generative adversarial network shown in Figure 2, optimizes and modifies it, and uses a generator to produce gait images under a standard perspective. In our structure, a three-player generative adversarial network is used to achieve the perspective change. When updating the two discriminators, GaitGAN performs batch sampling on real images, images of different subjects, and target images [17]. In this paper, we adjust the influence of the two discriminators on the generator to make the training process stable and convergent. In the early stage of training, the goal is to generate realistic GEIs; as training progresses, the generator is increasingly required to generate the true GEI of the same subject. Based on this idea, we introduce a slack module to constrain the effect of the discriminators.

We set the GEIs at all viewpoints as the source and the GEIs of normal walking at 90° (side view) as the target, as shown in Figure 3. The perspective transformation problem in this article requires not only generating a realistic GEI but also generating the GEI of the same individual. We also use the GAN as a regressor to achieve the perspective transformation. To solve the problem of GAN training instability, the slack module is added to balance the two discriminators during training. The slack module can be viewed as a function that varies exponentially from 0 to 1. With the slack module, the perspective generation problem mainly generates realistic GEIs in the early stage of training; as training progresses, the training goal becomes generating the GEI of the same individual. This prioritization of tasks makes the view transformation task train stably and converge clearly. The network model is shown in Figure 4. In the structure, the input of the generator is a GEI from any angle, and the output is a standard GEI. The standard GEI generated by the generator is the input of the true-false discriminator and the identity discriminator. The true-false discriminator outputs 1 if the input picture can be regarded as a GEI. The identity discriminator outputs 1 if the generated image can be regarded as the GEI of the same individual as the input image. The slack module in the structure adjusts the influence of the two discriminators on the generator, so as to achieve different training goals at different stages of training.

In the structure, generating realistic GEIs is the early training target, so the influence of the identity discriminator is relatively weak at the early stage. As the generated GEIs come to meet the requirements of the true-false discriminator, the influence of the identity discriminator needs to be further increased, so that the generated image contains richer identity information. The slack factor $\lambda$ varies exponentially from 0 to 1 with the training progress $p$, where $p$ changes linearly from 0 to 1 as training proceeds; this constraint ensures that the priority of synthesizing realistic samples is higher than the priority of synthesizing identity-preserving samples. Because the slack factor gradually increases during training, the generator first learns to produce realistic gait samples and then to produce samples containing richer identity information.
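A sketch of how such a schedule might weight the generator's two objectives (the paper states only that the slack factor rises exponentially from 0 to 1 with the linear progress p; the exact functional form and the constant k below are our assumptions):

```python
import math

def slack_factor(p: float, k: float = 5.0) -> float:
    # Exponential ramp from 0 (at p = 0) to 1 (at p = 1); k controls how
    # strongly the identity objective is delayed. Both are assumptions.
    return (math.exp(k * p) - 1.0) / (math.exp(k) - 1.0)

def generator_loss(loss_real_fake, loss_identity, p):
    # Early in training the real/fake term dominates; the identity term
    # is phased in as p approaches 1.
    return loss_real_fake + slack_factor(p) * loss_identity
```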

In the process of training, it is very important to find a suitable learning rate. If the learning rate is set too large, the training may diverge. If the learning rate is set too small, the training will converge to the final solution, but it will take a long time. In this paper, we also adjust the learning rate. The dynamic learning rate designed for our model starts from the initial learning rate $lr_0$ and decays nonlinearly with the training progress $p$, where $p$ changes linearly from 0 to 1 as training proceeds. In the early stage of iterative optimization, the learning rate is larger and the step is longer, so gradient descent proceeds at a faster speed. In the later stage of iterative optimization, the learning rate and the step size are gradually reduced, which helps the algorithm converge and approach the optimal solution.
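One plausible realization of such a nonlinearly decaying schedule (the exact decay curve and the constant k are assumptions; only the qualitative behavior is specified above):

```python
import math

def dynamic_lr(lr0: float, p: float, k: float = 4.0) -> float:
    # Starts at lr0 when p = 0 and decays nonlinearly as p approaches 1.
    return lr0 * math.exp(-k * p)

# Hypothetical usage with a PyTorch optimizer `opt`, before each step:
#   for group in opt.param_groups:
#       group["lr"] = dynamic_lr(2e-4, step / total_steps)
```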

3. Experiment and Analysis

3.1. Dataset

The CASIA-B gait database was created in January 2005 by the Institute of Automation, Chinese Academy of Sciences, and is one of the largest public gait databases in existence. The database was collected from 124 subjects (31 women and 93 men). The gait sequences were shot from 11 angles, ranging from 0 degrees to 180 degrees with an 18-degree interval between adjacent angles. Each subject has 10 gait sequences, 6 normal walking sequences (“NM”), 2 walking sequences carrying a bag (“BG”), and 2 walking sequences wearing a coat (“CL”), each captured from all 11 viewpoints.

3.2. The Experiment Design

In the experiment, three conditions are considered: normal walking, carrying items, and wearing a coat. The six normal sequences, two carrying sequences, and two coat sequences of the first 62 individuals form the training set, and the remaining 62 individuals form the testing set. In the testing set, the first four normal sequences of each subject are regarded as the gallery set, and the remaining sequences are regarded as the probe set, as shown in Table 1.
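A sketch of this split (the key format and function name are hypothetical; CASIA-B itself fixes only the subject/condition/sequence/view structure):

```python
# Subjects 001-062 train the model; 063-124 are used for evaluation.
subjects = [f"{i:03d}" for i in range(1, 125)]
train_subjects = set(subjects[:62])
test_subjects = set(subjects[62:])

def split_gallery_probe(geis):
    """geis: dict mapping (subject, condition, seq_idx, view) -> GEI.

    Conditions are "nm" (normal), "bg" (bag), "cl" (coat); seq_idx is
    1-based. NM#1-4 of each test subject form the gallery; NM#5-6,
    BG#1-2, and CL#1-2 form the probe set.
    """
    gallery, probe = {}, {}
    for (subj, cond, idx, view), gei in geis.items():
        if subj not in test_subjects:
            continue
        if cond == "nm" and idx <= 4:
            gallery[(subj, cond, idx, view)] = gei
        else:
            probe[(subj, cond, idx, view)] = gei
    return gallery, probe
```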

3.3. Model Parameters

The network uses a structure similar to that of GaitGAN, as shown in Tables 2 and 3. The first four layers of the encoder are the same as those of the real/fake discriminator; after Conv.4, the real/fake and identity discriminators output a binary value. The generative adversarial network is used as a view converter and trained end to end, but the generator can be divided into two parts: an encoder and a decoder. The encoder is composed of four convolutional layers that capture the individual's characteristic attributes. The decoder uses four deconvolution layers to decode the features extracted by the encoder into the generated target image, thereby achieving the viewing angle transformation. The structures of the identity-preserving discriminator and the authenticity discriminator are similar to the encoder in the generator, and both are composed of four convolutional layers.
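A PyTorch sketch of this generator layout (channel widths, kernel sizes, and normalization choices are illustrative assumptions; the paper's exact parameters are given in Tables 2 and 3):

```python
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(cout),
        nn.LeakyReLU(0.2))

def deconv_block(cin, cout):
    return nn.Sequential(
        nn.ConvTranspose2d(cin, cout, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU())

class Generator(nn.Module):
    """Four-layer convolutional encoder plus four-layer deconvolutional
    decoder, mapping a single-channel GEI to a standard side-view GEI."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            conv_block(1, 64), conv_block(64, 128),
            conv_block(128, 256), conv_block(256, 512))
        self.decoder = nn.Sequential(
            deconv_block(512, 256), deconv_block(256, 128),
            deconv_block(128, 64),
            nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))
```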

3.4. Experimental Results on CASIA-B Dataset

After training with the slack module added, the loss curves of the true-false discriminator and the identity discriminator are shown in Figures 5 and 6. At the early stage of training, the true-false discriminator has a great influence on the generator so that it generates realistic GEIs, and the true-false discriminator converges stably. At the later stage of training, the influence of the identity discriminator on the generator increases so that it generates the GEI of the same individual, and the identity discriminator converges stably.

In Figure 7, (a) and (d) are the generated images, (b) and (e) are the target images, and (c) and (f) are the input images. From these results, we can see that the proposed method accomplishes the task of perspective transformation.

In order to evaluate the validity of the model, three experiments were designed for three influencing factors: view, carrying variations, and clothing. In this paper, the GEIs of the first four normal sequences of each subject are put into the gallery set, and the remaining sequences into the probe sets. Since each pedestrian has 11 perspectives, there are 121 results for each experiment. For the multiview problem, the performance of this model is shown in Tables 4–6. The recognition rates of these combinations are listed in Table 4. For the results in Table 5, the main difference from Table 4 lies in the probe sets: the probe data contains images of people carrying bags, so the carrying conditions differ from those of the gallery set. The probe sets for Table 6 contain gait data with coats. The values in the tables represent the average accuracy for a given dataset and tag.

3.5. Comparisons with State of the Art

Our work is based on GaitGAN, and compared with GaitGAN, our method is more stable and converges better. In the same experimental environment, the experimental results are compared as shown in Figure 8. The results show that, under the same experimental conditions, the accuracy of our algorithm is improved over GaitGAN by about 1.5% on normal walking sequences (NM), about 6.5% when walking with a bag (BG), and about 4.52% when walking in a coat (CL).

Through the experiments, we found that our method can not only handle multiview gait recognition but also alleviate the influence of clothing and carried objects to a certain extent.

In order to better analyze the performance of the proposed method, we compare the average recognition rates in the case of no angle change in Figure 9. The same procedure was used to calculate the average recognition accuracy of GEI+PCA [24], SPAE [13], and CNN-LSTM [25]. As can be seen from Figure 9, the recognition accuracy of this model on normal sequences is 98.32%, which is close to that of GEI+PCA and SPAE. Compared with CNN-3DCNN [8] and CNN-LSTM, the proposed model has higher recognition accuracy and can achieve accurate recognition. The model mainly improves the recognition accuracy of walking with a bag and walking in a coat. In the BG case, the accuracy of this model is improved by 16.17% compared with GEI+PCA. In the CL case, the recognition accuracy is improved over GEI+PCA, SPAE, and CNN-LSTM by 29.09%, 0.57%, and 3.62%, respectively; however, it is slightly lower than that of CNN-3DCNN. After comprehensive comparison and analysis, the proposed method is found to handle walking with a bag and walking in a coat well.

In the case of walking in a coat, the recognition accuracy of the proposed model is lower than that of the 3D gait recognition model. However, the proposed method is based on two-dimensional gait images, so the model complexity is lower and the performance is more stable.

In addition, we further compare the proposed method with other methods (MFA [26], CMCC [27], and SVR [28]). The selected probe angles are 54°, 90°, and 126°. In the experiment, the recognition rates at different gallery angles are listed to compare the performance of our method under crossview conditions. The experimental results are shown in Figure 10.

The observation views vary from 0° to 180° at an interval of 18° (11 views in total). The results show that the performance of this method is better than that of the other methods when the angle difference between the gallery and the probe is large.

In Table 7, the results of C3A [29], ViDP [30], SPAE [13], GaitGAN [17], GaitGANv2 [18], and the proposed method are listed. The models are trained with the GEIs of the first 24 subjects. Table 7 reports the average recognition rates at probe angles 54°, 90°, and 126°, where the gallery angles are the remaining 10 angles excluding the corresponding probe angle; the values in the rightmost column are the average rates over the three probe angles. We find that compared with the C3A method, the recognition accuracy of our method is greatly improved, and compared with other perspective transformation methods, it is also slightly improved. In addition, while improving the recognition rate, our method accomplishes the multiview recognition task more stably. We emphasize that for any view, as well as changes in clothing or carrying conditions, the proposed method can obtain comparable results using only one generative model.

4. Conclusions and Future Work

Aiming at the problem of multiview gait recognition, the idea of perspective transformation is adopted to transform GEIs from various angles and states. In this paper, GAN is used for the perspective transformation task. In order to solve the problems of GAN training instability and difficult convergence, the performance of the GAN is improved by adding the new slack module. With the slack module, the generated gait images contain richer identity information. Under the same experimental conditions, the improved method achieves varying degrees of improvement over the original method under different influencing factors, and the loss function converges clearly during training. The improved method in this paper can serve as a reference for the optimization of other methods.

However, there are still some areas that need improvement in future research. First of all, the angles in a typical dataset are fixed; in the future, we may use richer pedestrian perspectives to validate our method. Second, we can further optimize the GAN using a more complex model. We believe that the development of GANs will bring further long-term progress to gait recognition technology.

Data Availability

All data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.