Abstract

In the application of pedestrian reidentification, misjudgment is often caused by low video resolution, illumination variation, and background interference. To solve these problems, this study proposes a pedestrian reidentification algorithm based on local feature fusion. Taking advantage of the inherent structure of the human body, the algorithm attends to pedestrian parts with prominent features and ignores other parts that carry interference information. Feature extraction is carried out for the detected pedestrian parts with significant features, and new fusion features are generated. By calculating the distance measure between image features, pedestrians are classified and recognized. Experimental results show that the accuracy of the proposed algorithm is superior to that of the comparison algorithms on the Market1501, Duke, and CUHK03 datasets, which proves that the proposed algorithm achieves a good pedestrian reidentification effect.

1. Introduction

Pedestrian reidentification refers to the technology of retrieving a target pedestrian from multiple nonoverlapping cameras according to a specified image of that pedestrian. It is widely used in security, image management, e-commerce, and other fields. With this technology, the police can track and retrieve the location of a suspect faster and locate missing persons more easily, users can manage electronic images more conveniently, and managers of unmanned supermarkets can track customer behavior. However, practical use faces problems such as low video resolution, illumination change, background interference, changeable pedestrian movements, and pedestrian occlusion. Pedestrian reidentification requires searching for a specific identified pedestrian in a known pedestrian database under cross-device conditions and finding all matching results [1]. It mainly includes three steps: pedestrian image preprocessing, feature extraction, and similarity calculation. Among them, image preprocessing includes image flipping, clipping, zooming, and pixel normalization. Feature extraction plays a decisive role in algorithm performance. Similarity calculation is realized by computing the Euclidean distance between features.

In recent years, researchers have proposed many pedestrian reidentification methods and achieved important research results. The existing methods can generally be divided into methods based on feature design and methods based on multitask learning. The key to methods based on feature design is to build a reliable and discriminative model to extract robust features from pedestrian images. The model can be either a manually designed model or an end-to-end deep learning model. The features acquired by manually designed models are mainly low-level features such as the HSV color histogram [2] and the scale-invariant local ternary pattern (SILTP) [3]. Literature [4] extracts 11-dimensional color-named descriptors for each local image block and aggregates them into a global vector using the bag-of-words model. Literature [5] proposes a feature representation method that analyzes the occurrence of local features and maximizes it to obtain a representation that is stable under viewpoint changes.

With the development of the convolutional neural network (CNN) [6], feature design methods based on deep learning are constantly being proposed. Literature [7] uses the PCB network to uniformly divide pedestrian features into multiple horizontal regions and outputs a convolution feature jointly composed of these regional features. Literature [8] designs a multilevel factor network (MLFN) to decompose human visual appearance into multiple semantic hierarchical factors, interpreting the input image content through a factor selection module. Literature [9] proposes a pedestrian reidentification method based on regional alignment: feature regions are divided by locating pedestrian body nodes, and all feature regions are fused to obtain the feature representation. In order to learn more discriminative features, some researchers have introduced an attention mechanism into feature extraction. For example, the Mancs deep network was proposed in literature [10]. This network uses an attention mechanism to solve the problem of misalignment of pedestrian images and obtain more stable pedestrian features. These deep learning models acquire high-level pedestrian features by designing network models. Such features generally have stronger representational ability than low-level features.

The method based on multitask learning enhances algorithm performance by combining tasks such as pedestrian attribute prediction [11], image segmentation [12], and image generation [13]. Literature [14] proposes a simple CNN model, which predicts pedestrian attributes while learning the pedestrian representation. This model can effectively improve the performance of the ReID algorithm. Literature [15] uses binary segmentation masks to generate RGB-Mask image pairs and then designs a mask-guided contrastive attention model (MGCAM) to learn the features of pedestrian subjects and background areas, respectively. In order to avoid overfitting during the training of deep network models, literature [16] uses a generative adversarial network (GAN) to generate training data and assigns unified labels to these data, which are then used for model training together with the original data. All the above methods use RGB pedestrian images to extract features, without considering the influence of color factors on the ReID algorithm. However, in practical application scenarios, there are problems such as inconsistent color of pedestrian images with the same identity, similar color of pedestrian images with different identities, different resolutions of pedestrian images, occlusion, and cluttered backgrounds. Therefore, the overall recognition performance is greatly affected.

Pedestrian reidentification rests on the assumption that the appearance of the same person remains basically the same across different situations. At the same time, the rerecognition problem requires matching video images captured at different positions, where changes in position, shooting angle, and illumination produce great differences in the appearance of the same person, thus increasing the difficulty of pedestrian matching. In order to solve these problems, this study proposes a pedestrian reidentification algorithm based on local feature fusion. This algorithm uses a deep convolutional network to extract regional features, fuses the features of different regions, and finally uses a distance measure to realize pedestrian reidentification.

In order to solve the problem of the low pedestrian reidentification rate, this study proposes a pedestrian reidentification algorithm based on local feature fusion. The innovations and contributions of this study are listed as follows:
(1) The detected pedestrian components with significant features are extracted, and new fusion features are generated.
(2) Pedestrians are classified and recognized by calculating the distance measure between image features.

The structure of this study is as follows. Related work is described in Section 2. The proposed method is presented in Section 3. Section 4 focuses on the experiments and analysis. Section 5 concludes the study.

2. Related Work

2.1. Pedestrian Reidentification

Representational learning is applied to pedestrian recognition to learn human physical features. Literature [17] uses a convolutional neural network (CNN) to learn global features. The image is divided into multiple parts to extract distinguishable local features, and local features with few changes can be effectively extracted through horizontal image segmentation. Image segmentation is a common local feature extraction method, but its disadvantage is that it places high requirements on image alignment. If two images are not aligned vertically, the head and upper body are likely to be misaligned, which will mislead the model. The problem of pedestrian body mismatch seriously affects feature matching between different images. In order to solve the problem of image misalignment, some researchers introduced a spatial transformer network (STN) into the rerecognition model to align pedestrians by spatial transformation of pedestrian images. Other researchers introduced human body parsing [18] and pose estimation methods as prior knowledge into the ReID model to align pedestrian images.

2.2. Pedestrian Reidentification Based on Spatial Transformation

The STN [19] is a spatial transformation module that can be introduced into a neural network to provide spatial transformation functions, including translation, scaling, rotation, and so on. The STN is a small network that supports standard backpropagation and end-to-end training without increasing the complexity of the training process. It consists of a localization network, a grid generator, and a sampler. The localization network outputs the transformation parameters from the input feature map. The grid generator calculates, for each output pixel, its position coordinates in the original image. The sampler samples the input according to the grid to generate the output image.
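
To make the three modules concrete, the following is a minimal PyTorch sketch of an STN; the layer sizes, the 224 × 112 input resolution, and all names are illustrative assumptions rather than a configuration taken from any cited paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    """Minimal spatial transformer: localization net -> grid generator -> sampler."""
    def __init__(self):
        super().__init__()
        # Localization network: regresses 6 affine parameters from the input.
        self.loc = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        )
        self.fc_loc = nn.Sequential(
            nn.Linear(10 * 52 * 24, 32), nn.ReLU(), nn.Linear(32, 6),
        )
        # Start from the identity transform so early training applies "no warp".
        self.fc_loc[2].weight.data.zero_()
        self.fc_loc[2].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):  # x: (N, 3, 224, 112) pedestrian crops (assumed size)
        theta = self.fc_loc(self.loc(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)  # grid generator
        return F.grid_sample(x, grid, align_corners=False)          # sampler
```

Because the whole module is differentiable, it can be placed in front of a ReID backbone and trained end to end, as the networks below do.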

Literature [20] proposes a multiscale context-aware network (MSCAN). It extracts variable body parts by combining the STN with a positioning loss, thereby reducing background influence and aligning pedestrian images to some extent. However, the center prior constraint of the positioning loss is based on the integrity of the image subject and on image alignment. Literature [21] proposes a pedestrian alignment network (PAN), which uses the STN to align pedestrian images before the ReID deep convolutional network. However, PAN is trained only with the ReID loss, so its image alignment effect is poor.

2.3. Pedestrian Reidentification Based on Pose Estimation

Literature [22] uses a pose estimation algorithm to predict the key points of the human body, learns the features of each component, and combines the component-level features to form the final descriptor, thus addressing the pose change problem. The pose-driven deep convolution (PDC) model crops body regions through pose information to obtain rotated and resized body parts, which are normalized by a pose transformation network. In literature [23], the pose-invariant embedding (PIE) is used as the pedestrian descriptor: pose estimation is used to locate key points, and a pose box structure is generated by affine transformation of body parts. Literature [24] proposes a pose-sensitive pedestrian ReID model and introduces joint information and coarse orientation information into the convolutional neural network to learn discriminative features. Experimental results show that the detected joint positions and the camera angle are helpful for feature learning. However, these methods embed pose estimation directly into the model, which increases the computational cost and model complexity.

2.4. Local Pedestrian Reidentification

In local pedestrian reidentification, matching a local image against a whole image is a major problem, as only part of the body can be observed in the local image. Sliding-window matching (SWM) uses a sliding window of the same size as the local image to search for the most similar area in each whole image. However, the calculation cost of such local matching is too high. Literature [25] proposes a deep spatial feature reconstruction (DSR) scheme, which uses a fully convolutional network (FCN) to generate spatial feature maps of a certain size to match pedestrian images of different sizes. Compared with the SWM scheme, the DSR scheme greatly reduces the amount of calculation. Literature [26] proposes the visibility-aware part model (VPM), which perceives the visibility of regions through supervised learning, extracts region-level features, and compares the shared regions of two images.

3. Algorithm Flow

This study proposes a pedestrian reidentification algorithm based on feature fusion. The algorithm adopts the strategy of replacing the whole image with local areas: the influence of key local areas on pedestrian reidentification is magnified so as to reduce the interference from component areas with weak differences. First, the Faster R-CNN (faster region-based convolutional neural network) algorithm is optimized to build a detection and positioning model for pedestrian parts (face and height) [27]. Then, the VGG16 model is used as the basic model; the network structure is adjusted for the fusion target, the deep convolution features of the corresponding regions are extracted, and the extracted features are fused. Finally, the Euclidean distance between the query image and each image in the retrieval database is calculated to realize pedestrian reidentification.

3.1. Component Detection and Positioning Model

In this study, the Faster R-CNN algorithm is optimized. In the original Faster R-CNN algorithm, convolution layers first extract features from the whole image. In this study, the five groups of convolution layers of VGG16 are used as the feature extraction operator. The feature maps of the fifth group are then sent to the RPN network and the ROI pooling layer, respectively. The proposals obtained by the RPN network are input into the ROI pooling layer at the same time, and the feature maps of the ROI regions are obtained. After two fully connected layers, the loss function is calculated by the classification and regression layers. The component detection process is shown in Figure 1.
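
As a rough illustration of this detection stage, the sketch below fine-tunes torchvision's off-the-shelf Faster R-CNN for two part classes plus background; it uses a ResNet-50 FPN backbone as a stand-in for the paper's VGG16 pipeline, and the image size and box values are invented for the example:

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Hypothetical label set: 0 = background, 1 = face part, 2 = height part.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_feats = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, num_classes=3)

images = [torch.rand(3, 384, 128)]                 # one pedestrian crop
targets = [{"boxes": torch.tensor([[30., 10., 90., 70.]]),
            "labels": torch.tensor([1])}]          # a face-part annotation
model.train()
loss_dict = model(images, targets)   # RPN losses + ROI-head cls/reg losses
total_loss = sum(loss_dict.values())
total_loss.backward()
```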

The RPN network is used to obtain initial proposals. After a proposal is obtained, a frame regression operation is performed on it to obtain a more accurate positioning frame. However, for the border regression operation, a proposal can serve as a training sample for the linear regression model only when it is close to the ground truth; otherwise, the trained regression model will be invalid. In order to achieve an accurate positioning effect, this study improves the Faster R-CNN. In the proposal generation stage, K-means clustering optimization is added so that the RPN network obtains more accurate proposals. This further promotes the regression effect of the border regression operation, so as to accurately locate the face and height.

First, the training set data are clustered to obtain the image target data. Using the width and height of the object as the coordinate axes, 2D data clustering is carried out. Suppose the number of data object targets is m; then, k objects μ_1, μ_2, …, μ_k are randomly selected as the initial cluster centers. Each remaining object is assigned to the most similar cluster according to its similarity to these cluster centers. The category c^{(x)} to which the x-th data object d_x belongs is calculated as shown in equation (1):

c^{(x)} = \arg\min_{1 \le j \le k} \left\| d_x - \mu_j \right\|^2. \qquad (1)

The mean value of all objects in each cluster is then calculated, and the cluster center of each cluster is updated. The cluster center calculation is shown in equation (2):

\mu_j = \frac{\sum_{x=1}^{m} \mathbb{1}\{ c^{(x)} = j \}\, d_x}{\sum_{x=1}^{m} \mathbb{1}\{ c^{(x)} = j \}}, \qquad (2)

where \mathbb{1}\{\cdot\} represents the indicator function. The cluster centers and object categories are repeatedly updated until the sum of distances between all data objects and their cluster centers meets formula (3), and finally, the cluster centers of the data are obtained:

\sum_{x=1}^{m} \left\| d_x - \mu_{c^{(x)}} \right\|^2 \le \varepsilon. \qquad (3)
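
A minimal NumPy sketch of this clustering step is given below; the function name, the np.allclose convergence test standing in for formula (3), and the sample width/height values are illustrative assumptions:

```python
import numpy as np

def kmeans_wh(boxes_wh, k, iters=100, seed=0):
    """Cluster (width, height) pairs of training boxes into k anchor priors."""
    rng = np.random.default_rng(seed)
    centers = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]  # mu_1..mu_k
    for _ in range(iters):
        # Eq. (1): assign each box to the nearest cluster center.
        dist = np.linalg.norm(boxes_wh[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Eq. (2): recompute each center as the mean of its assigned boxes.
        new_centers = np.array(
            [boxes_wh[labels == j].mean(axis=0) if np.any(labels == j)
             else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # stand-in for the eq. (3) threshold
            break
        centers = new_centers
    return centers

# Example widths/heights of ground-truth part boxes (made-up values).
wh = np.array([[24, 32], [20, 28], [26, 30], [60, 170], [55, 160], [58, 165]], float)
print(kmeans_wh(wh, k=2))  # two anchor (width, height) priors for the RPN
```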

The clustering operation is mainly used to optimize the RPN network so that it obtains more accurate proposals. The schematic diagram of the region proposal network is shown in Figure 2.

When the reference boxes corresponding to the acquired anchors fit the size of the detected object, more accurate proposals can be obtained. Therefore, the K-means clustering operation is used in the positioning model to adjust the size and aspect ratio of the anchor boxes. Positive and negative samples are marked before training the classifier. When the coincidence degree IoU between the reference box corresponding to an anchor and the ground truth is greater than 0.7, the reference box corresponding to the anchor is marked as a positive sample. For some extreme cases, for example, if the IoU between the ground truth and the reference boxes of all anchors is not greater than 0.7, the reference box with the largest IoU value is taken as the positive sample. If the IoU between the reference box corresponding to an anchor and the ground truth is less than 0.3, the corresponding reference box is marked as a negative sample. The loss function for training the RPN is composed of classification loss and regression loss according to certain weights. The specific calculation is shown in formula (4):

L(\{p_x\}, \{t_x\}) = \frac{1}{N_{cls}} \sum_{x} L_{cls}(p_x, p_x^*) + \lambda \frac{1}{N_{reg}} \sum_{x} p_x^* L_{reg}(t_x, t_x^*), \qquad (4)

where x is the index of an anchor, p_x refers to the probability that anchor x belongs to the foreground target class, and p_x^* represents the category probability of the actual ground truth annotation; that is, when the anchor is a positive sample, the value is 1; otherwise, it is 0. t_x is the parameterized coordinate of the prediction box, and t_x^* is the parameterized coordinate of the real box. L_{cls} is the classification loss, where the classification is divided into two main categories (target and nontarget). L_{reg} represents the regression loss. λ represents the weight. N_{cls} is the minibatch size. N_{reg} is the number of anchors. The parameterized coordinates of the regression loss are as follows:

t_i = \frac{i - i_a}{m_a}, \quad t_j = \frac{j - j_a}{b_a}, \quad t_m = \log\frac{m}{m_a}, \quad t_b = \log\frac{b}{b_a},
t_i^* = \frac{i^* - i_a}{m_a}, \quad t_j^* = \frac{j^* - j_a}{b_a}, \quad t_m^* = \log\frac{m^*}{m_a}, \quad t_b^* = \log\frac{b^*}{b_a}, \qquad (5)

where i and j are the coordinates of the center point of the box, and m and b indicate the width and height of the enclosing box. i_a, j_a, m_a, and b_a represent the reference box corresponding to the anchor, and i^*, j^*, m^*, and b^* indicate the ground truth box.
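
The NumPy sketch below shows the eq. (5) parameterization together with the 0.7/0.3 labeling rule; the function names and the (center, width, height) box layout are illustrative assumptions:

```python
import numpy as np

def encode(anchor, gt):
    """Eq. (5): regression targets from an anchor reference box and a gt box,
    both given as (i, j, m, b) = (center x, center y, width, height)."""
    i_a, j_a, m_a, b_a = anchor
    i, j, m, b = gt
    return np.array([(i - i_a) / m_a, (j - j_a) / b_a,
                     np.log(m / m_a), np.log(b / b_a)])

def label_anchors(ious, hi=0.7, lo=0.3):
    """Mark samples from a (num_anchors, num_gt) IoU matrix:
    1 = positive, 0 = negative, -1 = ignored during RPN training."""
    labels = np.full(ious.shape[0], -1)
    best = ious.max(axis=1)
    labels[best < lo] = 0            # IoU < 0.3 with every gt -> negative
    labels[best > hi] = 1            # IoU > 0.7 with some gt  -> positive
    labels[ious.argmax(axis=0)] = 1  # fallback: best anchor per gt is positive
    return labels
```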

Feature maps are fed into the improved RPN network to obtain more accurate candidate regions. This step outputs more boxes, which are divided only into target and nontarget. Then, the candidate regions and the feature maps extracted by the fifth convolution layer are input to the ROI pooling layer at the same time, and the coordinate positions of the ROI region are mapped to the feature map. Then, the pooling operation is carried out in the ROI region of the feature map, and finally, the feature maps of the ROI region (i.e., the proposal region) are obtained. The mapping relationship between the original image and the feature map is shown in equation (6):

p_0 = (f_1 \circ f_2 \circ \cdots \circ f_L)(p_L). \qquad (6)

Among them,

f_u(p) = s_u\, p + \left( \frac{k_u - 1}{2} - pad_u \right), \qquad (7)

where p_L is the receptive field center of the feature map at layer L, L stands for the CNN layer index, s_u represents the stride of the layer-u convolution kernel, k_u represents the convolution kernel size at layer u, and pad_u represents the padding size of layer u.
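
The short sketch below applies this coordinate mapping; the layer list (3 × 3 pad-1 convolutions followed by 2 × 2 stride-2 pooling, repeated four times) is an assumed illustration, not the paper's exact configuration:

```python
def receptive_center(p_L, layers):
    """Eqs. (6)-(7): map a layer-L feature-map coordinate p_L back to the
    original image by applying f_u(p) = s_u * p + ((k_u - 1)/2 - pad_u)
    from the deepest layer outward. layers = [(stride, kernel, pad), ...]."""
    p = p_L
    for stride, kernel, pad in reversed(layers):
        p = stride * p + ((kernel - 1) / 2 - pad)
    return p

# Assumed VGG-style stack: conv3x3/pad1 keeps centers, each 2x2 pool doubles stride.
vgg_like = [(1, 3, 1), (2, 2, 0)] * 4
print(receptive_center(10, vgg_like))  # image-space center of feature cell 10
```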

After two fully connected layers, the network finally enters the classification and regression layers. The selection of positive and negative samples follows the same rules as in the RPN network, and the final classification-regression loss function is also composed of classification loss and regression loss according to certain weights. The loss function is shown in equation (8):

L(p, c, t^c, t^*) = L_{cls}(p, c) + \lambda\, [c \ge 1]\, L_{reg}(t^c, t^*). \qquad (8)

Among them,

p = (p_0, p_1, \ldots, p_z), \qquad (9)

which is divided into z + 1 categories (z target categories plus 1 background category), and c is the category index. t^z denotes the parameterized coordinates of the prediction frame of category z, and t^* represents the parameterized coordinates of the ground truth. When the background region category is defined as 0, the indicator [c \ge 1] is set to 0 for a background region box (negative sample) and 1 for the rest. The parameterized coordinates follow the RPN regression loss, where the reference box corresponding to the anchor is replaced by the ROI frame.
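
A PyTorch sketch of this head loss follows; smooth L1 is used for L_reg, as in standard Faster R-CNN, since the paper does not spell out its regression loss, and the tensor layout (per-ROI deltas already selected for the predicted class) is an assumption:

```python
import torch
import torch.nn.functional as F

def roi_loss(cls_logits, labels, box_deltas, box_targets, lam=1.0):
    """Eq. (8): L_cls over z+1 classes plus [c >= 1]-gated regression loss.
    cls_logits: (N, z+1); labels: (N,) with 0 = background;
    box_deltas/box_targets: (N, 4) eq. (5)-style parameterized coordinates."""
    cls = F.cross_entropy(cls_logits, labels)
    fg = labels >= 1                       # Iverson bracket [c >= 1]
    if fg.any():
        reg = F.smooth_l1_loss(box_deltas[fg], box_targets[fg])
    else:
        reg = cls_logits.new_zeros(())     # no positive ROIs in this batch
    return cls + lam * reg
```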

3.2. Feature Fusion Algorithm

In this study, the training set is used to fine-tune the pretrained VGG16 model. The VGG convolutional neural network consists of five groups of convolution operations (Conv1, Conv2, Conv3, Conv4, and Conv5), two full-connection operations (FC6 and FC7), and one softmax classification layer. A group of convolution operations includes convolution and max-pooling. The principles are shown in equations (10) and (11), respectively:

x_y^l = f\left( \sum_{x} x_x^{l-1} * k_{xy}^{l} + b_y^{l} \right), \qquad (10)

x_y^l = \mathrm{pool}\left( x_y^{l-1} \right), \qquad (11)

where x_y^l is the y-th feature map of the l-th (output) layer, x_x^{l-1} is the x-th feature map of the (l-1)-th (input) layer, k_{xy}^{l} represents the convolution kernel between the x-th feature map of the input layer and the y-th feature map of the output layer at layer l, b_y^{l} is the offset term of the y-th feature map of the l-th layer, and pool(·) is a pooling calculation function.
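
Equations (10) and (11) correspond one-to-one to a framework's convolution and max-pooling primitives; a minimal PyTorch check (with made-up tensor sizes) is:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 56, 56)                # input maps x_x^{l-1} of layer l-1
k = torch.randn(128, 64, 3, 3)                # kernels k_{xy}^l
b = torch.zeros(128)                          # offsets b_y^l
y = torch.relu(F.conv2d(x, k, b, padding=1))  # eq. (10): summed convolutions + bias, then f(.)
y = F.max_pool2d(y, kernel_size=2, stride=2)  # eq. (11): pool(.) as max-pooling
print(y.shape)                                # torch.Size([1, 128, 28, 28])
```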

The calculation of the full-connection layer is shown in equation (12):

x_y^l = f\left( \sum_{x} w_{xy}^{l}\, x_x^{l-1} + b_y^{l} \right). \qquad (12)

Among them, layer l is the fully connected layer, w_{xy}^{l} is the connection parameter between the y-th neuron in layer l and all neurons in the x-th input feature map of layer l-1, and b_y^{l} is the offset term.

The softmax layer is used for the training of the convolutional neural network parameters. Once the training is complete, the softmax layer is removed. The calculation of the softmax layer is shown in equation (13):

\sigma(z)_v = \frac{e^{z_v}}{\sum_{u=1}^{x} e^{z_u}}, \quad v = 1, \ldots, x, \qquad (13)

where x is the number of categories and z is the x-dimensional input vector.

The VGG16 model can already extract good features with its convolution layer structure and has achieved a good classification effect. An excessively deep network structure increases feature training time, whereas too few layers yield features with relatively poor descriptive ability. The VGG16 model pretrained on ImageNet has been publicly released; on this basis, it can be fine-tuned on other datasets and adapts well to them. Using the pretrained model for transfer learning, on the one hand, makes the network converge more easily; on the other hand, it achieves better results on new datasets. In this study, the training set is used to fine-tune the pretrained VGG16 model. The fine-tuning steps are as follows: in the training stage, the training set is used to fine-tune the parameters of the two fully connected layers (FC6/FC7), the parameters of the other convolution layers are fixed, the output size of the softmax layer is set to the number of classes to be classified, and the images are input to start training. The schematic diagram of feature training is shown in Figure 3. In the test phase, the FC7 and softmax layers are removed, and the output of the fully connected layer FC6 is taken as the feature vector.
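
A minimal torchvision sketch of this fine-tuning recipe follows; the identity count of 751 and keeping the ReLU after FC6 in the extractor are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(weights="IMAGENET1K_V1")   # ImageNet-pretrained VGG16
for p in vgg.features.parameters():           # freeze all convolution layers
    p.requires_grad = False

num_ids = 751                                 # assumed number of training identities
vgg.classifier[6] = nn.Linear(4096, num_ids)  # new classification head
optimizer = torch.optim.SGD(vgg.classifier.parameters(), lr=1e-3, momentum=0.9)
# ... train with cross-entropy on the ReID training set (FC6/FC7 + new head) ...

# Test phase: drop FC7 and softmax; take the 4096-d FC6 output as the feature.
feature_extractor = nn.Sequential(
    vgg.features, vgg.avgpool, nn.Flatten(),
    vgg.classifier[0], vgg.classifier[1])     # FC6 linear + its ReLU
feat = feature_extractor(torch.randn(1, 3, 224, 224))  # -> shape (1, 4096)
```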

In order to effectively capture the connection between different component features, the convolution features extracted from the different component regions are fused for pedestrian reidentification. In the proposed fusion model, the feature descriptions of the different parts are connected in series in the fusion stage. The principle is shown in equation (14), and the schematic diagram of feature fusion is shown in Figure 3. This fusion method is computationally simple: it retains the descriptive ability of each component's regional features as much as possible during fusion and reduces the degradation of feature description ability caused by the fusion operation. Therefore, the new fusion feature can represent the regional features of each component. In the rerecognition experiment, the convolution features of the face part and the height part are extracted, respectively; both kinds of regional features learn to represent the differences between different pedestrians. The regional features of the two parts are fused to obtain a new fusion feature, which is used to describe the features of different pedestrians.

F = (w_1 f_1) \oplus (w_2 f_2), \qquad (14)

where F represents the fusion feature obtained after the fusion operation, w_1 and w_2 represent the fusion weights of the different regional features, \oplus represents the series (concatenation) operation, and f_1 and f_2 represent the different regional features.
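
A sketch of eq. (14) with two 4096-d part features; the equal weights are an illustrative assumption, since the paper does not report the values it uses:

```python
import torch

def fuse(face_feat, height_feat, w1=0.5, w2=0.5):
    """Eq. (14): weighted series (concatenation) of the two part features."""
    return torch.cat([w1 * face_feat, w2 * height_feat], dim=1)

f_face = torch.randn(1, 4096)    # FC6 feature of the face region
f_height = torch.randn(1, 4096)  # FC6 feature of the height region
fused = fuse(f_face, f_height)   # -> shape (1, 8192)
```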

3.3. Rerecognition Distance Measurement

In essence, the rerecognition problem is a more fine-grained image recognition problem, that is, a microclassification recognition problem at the individual level. However, image classification is usually carried out by training a classifier, which has certain limitations for the rerecognition problem. On the one hand, classification at the individual level involves many categories, so the training effect of a classifier may not be good. On the other hand, a classifier is trained for a fixed number of classes, which is a hard classification; a new type of image will still be assigned to a previously trained class, resulting in recognition errors and no scalability. In order to enhance the extensibility of the model, a distance measure is introduced: the distance measures between image features are compared and sorted from small to large, and the smaller the distance, the more likely the two images are of the same person. In the specific operation of calculating the feature measurement distance, a pedestrian image Q to be recognized and a retrieval dataset R are given. The distance between the fusion feature of the pedestrian image Q and the fusion features of all images in the retrieval dataset R is calculated as follows:

d(Q, R_v) = \left\| F_Q - F_{R_v} \right\|_2, \qquad (15)

where F_Q is the fusion feature of the query image Q and F_{R_v} is the fusion feature of the v-th image in the retrieval dataset R.
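
A retrieval sketch built on eq. (15); the 8192-d feature size and the gallery size are placeholders matching the fusion example above:

```python
import torch

def rank_gallery(query_feat, gallery_feats):
    """Eq. (15): Euclidean distances to all gallery features, ascending sort."""
    d = torch.cdist(query_feat.unsqueeze(0), gallery_feats).squeeze(0)
    return torch.argsort(d)      # smallest distance first = most likely match

q = torch.randn(8192)            # fusion feature F_Q of the query image Q
R = torch.randn(100, 8192)       # fusion features of the retrieval set R
order = rank_gallery(q, R)
print(order[:5])                 # gallery indices of the five best matches
```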

4. Experiment and Analysis

4.1. Dataset

This study conducted experiments on three datasets: Market1501, Duke, and CUHK03 [28]. The Market1501 dataset contains 1,501 pedestrians with different identities captured by 6 cameras. The Duke dataset contains 34,183 images of 1,404 identities captured by 8 cameras. The CUHK03 dataset contains 13,164 images of 1,467 different pedestrians captured by 6 cameras. Among them, the training set has 7,365 images in CUHK03-labeled and 7,368 images in CUHK03-detected, respectively. Because of the serious dislocation of image bounding boxes and the background clutter in the CUHK03-detected dataset, it places higher demands on ReID. All three datasets were evaluated in the single-query setting in the ReID test stage, that is, each retrieval uses a single query image of the target identity.

4.2. Evaluation Indicators

The evaluation indexes commonly used in pedestrian reidentification tasks include the cumulative matching characteristic (CMC) curve and the mAP. CMC represents the probability that the target person is hit within the k gallery images with the highest similarity to the query image. Given an image q in the query set, if the result of ordering all gallery images by descending similarity to q is \{g_1, g_2, \ldots, g_n\}, the rank-k calculation for q is shown as follows:

\mathrm{rank}\text{-}k(q) =
\begin{cases}
1, & \exists\, v \le k : y_{g_v} = y_q, \\
0, & \text{otherwise},
\end{cases} \qquad (16)

where y denotes the identity label; the CMC value at rank k is the mean of rank-k(q) over all queries.
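
A direct implementation of this definition (names assumed) is:

```python
import numpy as np

def cmc_rank_k(query_ids, gallery_ids, dist, k=1):
    """Eq. (16): fraction of queries whose identity appears in the top-k.
    dist is a (num_query, num_gallery) distance matrix; ascending distance
    is equivalent to the descending-similarity ordering in the text."""
    hits = 0
    for q_id, row in zip(query_ids, dist):
        topk = np.argsort(row)[:k]
        hits += int(np.any(gallery_ids[topk] == q_id))
    return hits / len(query_ids)
```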

The mAP represents the average performance of the algorithm over all test data, which is related to precision (P) and recall (R). Given an image to be queried, the AP calculation method is as follows:

\mathrm{AP} = \int_{0}^{1} P(R)\, \mathrm{d}R. \qquad (17)

The AP represents the area under the corresponding P-R curve. The mAP is obtained by averaging the AP values of all images in the query set.
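
In practice, the integral is evaluated on the ranked list; a small sketch with made-up identities:

```python
import numpy as np

def average_precision(good, ranked):
    """AP for one query: mean precision at each correct match in the ranking."""
    hits, precisions = 0, []
    for rank, g in enumerate(ranked, start=1):
        if g in good:
            hits += 1
            precisions.append(hits / rank)   # precision at this recall point
    return float(np.mean(precisions)) if precisions else 0.0

# mAP is the mean of AP over all query images.
print(average_precision({3, 7}, [3, 1, 7, 2]))  # (1/1 + 2/3) / 2 = 0.833...
```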

4.3. Visualization Results of the Algorithm in This Study

In order to verify the alignment effect of the proposed algorithm on local images, experiments are carried out on the Market1501 dataset. The algorithm in this study can learn the spatial transformation of alignment, and its results on the Market1501 dataset are shown in Figure 4. The results show that the algorithm in this study can not only align the whole image but also accurately align the local image, which verifies the effectiveness of the algorithm module in this study.

4.4. Experimental Results

This study conducted mAP, rank-1, and rank-3 accuracy experiments on the three datasets, and the results are shown in Tables 1–3.

In order to verify the recognition effect of the proposed algorithm on local ReID, the proposed algorithm is compared with other models. For the local ReID datasets, literature [22] obtained rank-1 accuracies of 62.31%, 57.31%, and 62.02% on the Market1501, Duke, and CUHK03 datasets, respectively. Compared with literature [22], the rank-1 accuracy of the algorithm in this study on the Market1501 dataset is 3.7% higher. The advantage over literature [22] lies in the fact that the proposed algorithm extracts the features of each body part to deal with the problem of significantly incomplete body display. However, there is no obvious performance improvement on the Duke dataset; the possible reason is that most bodies are retained in the local images of the Duke dataset, and the degree of misalignment is within the processing range of the CNN.

The proposed algorithm is also compared with other models on the Market1501, Duke, and CUHK03 datasets. The performance of the algorithm presented in this study is better than that of literature [17] and literature [19] and comparable to that of literature [20]. On the Duke dataset, the performance of the proposed algorithm is better than that of literature [17]. Because the algorithm model in this study embeds a strategy of feature extraction from local images, it can improve the recognition performance on local images.

4.5. Visualization of Retrieval Results

For a convolutional neural network, if a local image contains pedestrian body parts with obvious image features, such as yellow clothes, or the image has only a small offset, the network can learn high-level features to differentiate the images. However, without obvious features, a misaligned local image is difficult to match with the whole image. The retrieval results for local pedestrian images are shown in Figure 5. In Figure 5(a), literature [22] could not find the matching image, that is, the matching image ranked after fifth place, while the matching image of the algorithm in this study ranked first. In Figure 5(b), for the same input image, the matching image of literature [22] ranks fourth, while the matching image of the algorithm in this study ranks first. The experimental results show that local image alignment plays an important role in local pedestrian recognition.

5. Conclusions

In pedestrian reidentification, misjudgment is often caused by occlusion of pedestrian body parts and poor quality of pedestrian images. In order to solve this problem, this study proposes a pedestrian reidentification algorithm based on local feature fusion. First, the important parts of the human body are detected by the body structure detection algorithm. Then, the deep convolution features of the corresponding regions are extracted, and the extracted features are fused to form new fusion features. Next, the distance measures between the query image and the image features in the retrieval database are calculated and sorted from small to large. Finally, the category identifier corresponding to the minimum distance measure is used as the recognition mark of the reidentified pedestrian. The validation results on the open datasets Market1501, Duke, and CUHK03 show the superiority of the proposed method. In follow-up work, an attention mechanism will be studied to improve the accuracy of local pedestrian rerecognition by suppressing background and other interference information.

Data Availability

The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest.

Acknowledgments

This work was supported by the General Project of Hunan Natural Science Foundation (no. 2018JJ2147), the part of Youth Project of Hunan Natural Science Foundation (no. 2018JJ3203), the Project of Hunan Science and Technology Department (no. 2019ZK4018), and the Hunan University of Science and Engineering Computer Application Special Subject Funding.