Abstract

Vehicle reidentification has important applications in intelligent monitoring systems. However, factors such as inaccurate vehicle detection and changes in viewing angle make it difficult to obtain discriminative features when a vehicle is reidentified. To improve the recognition ability and robustness of vehicle reidentification, this study proposes a new multiattention part alignment network (MAPANet). The network exploits the fact that different channels of the feature map respond to different characteristics of the image, and it clusters these channels to achieve fine-grained attention to the vehicle. It can automatically locate the distinguishing subregions in the vehicle image and avoids the need for a large number of additional manual preprocessing steps. Moreover, an unsupervised reranking method based on multiple metrics is proposed. Using the k-reciprocal encoding algorithm, the method recalculates the interclass and intraclass distances of vehicle images and improves the ranked list. The experiments are carried out on the VeRi-776 and VehicleID datasets, and the mean average precision (mAP) results on the two datasets are 72.83% and 75.25%, respectively.

1. Introduction

Vehicle reidentification refers to judging, within a traffic monitoring scene of a specific range, whether vehicle images taken at different times in nonoverlapping areas belong to the same vehicle. Vehicle reidentification [1] plays an important role in traffic flow monitoring, city monitoring and tracking of suspicious vehicles, estimating vehicle trajectory and speed [2], improving the safety performance of automated vehicles, and reducing transportation energy consumption [3, 4]. Vehicle reidentification is more fine-grained than generic object recognition because different vehicle instances must be distinguished. Therefore, obtaining distinctive features is particularly challenging in the task of vehicle reidentification.

Existing vehicle reidentification methods mainly rely on the global appearance features of the image, but excessive reliance on the global features extracted by a deep neural network cannot achieve good recognition for the following reasons. First, the similarity within a vehicle category is high. Different vehicles, especially vehicles of the same make and model, are very similar in appearance and color. Second, subtle local cues in the vehicle image, such as interior decoration, tire texture, and other details, are very important for reidentification. However, because the viewing angle of vehicle images changes greatly, these detailed features are difficult to locate accurately. Third, license plates are unreliable. The license plate is the unique identifier of a vehicle and has been widely used in real traffic flow analysis systems [5]. However, license plate recognition depends heavily on the external environment, and persons planning illegal activities may deliberately remove, obscure, or disguise the license plate, which makes license plate recognition technology unreliable. In summary, extracting appearance attributes from whole vehicle images with a deep neural network has great limitations, and a method must attend to smaller discriminative local areas to distinguish vehicles.

The contributions of this study can be summarized as follows:
(1) Introduce local discriminant features into the vehicle recognition task, use different channel response information to obtain local features, and reduce the impact of local part misalignment and the need for a large number of marked key points.
(2) Propose a discriminative and fine-grained network composed of multiple branches, the multiattention part alignment network (MAPANet), which extracts and integrates global and local features in a multibranch network structure. By combining complementary global and local features, the subtle differences between different vehicle logos with similar appearances can be accurately distinguished.
(3) To further improve the recognition performance, design a reranking stage that uses a fusion measurement strategy and the k-reciprocal encoding algorithm, based on the representative features of the multiattention parts and the global image.

2.1. Vehicle Reidentification

The problem of vehicle reidentification has received extensive attention in recent years. Similar to pedestrian reidentification [6–9], it relies on the global appearance characteristics of the image, but the feature descriptions used in pedestrian reidentification apply only to pedestrians. Migrating pedestrian reidentification methods to vehicle reidentification did not achieve the expected results. The appearance attributes of vehicles are highly monotonous, so a network structure designed specifically for vehicle reidentification is needed.

With the development of deep learning models, Liu et al. [10] merged vehicle color and texture data with high-level semantic information extracted by neural networks for vehicle reidentification and constructed the vehicle reidentification dataset VeRi-776 [10]. The results showed that, compared with handcrafted features alone, combining deep features with handcrafted features distinguishes vehicles more clearly. Later, they [11] constructed a large-scale vehicle dataset VehicleID [11] and proposed a progressive vehicle reidentification (PROVID) method. The images were first filtered by appearance information, the license plate information was then used for more refined verification, and finally the images were reordered. In recent years, generative adversarial networks (GANs) have also been widely used for vehicle recognition. Zhou and Shao [12] proposed a viewpoint-aware attention multiview inference model (VAMI) based on visual information, which extracts single-view features of vehicle images, transforms them into global multiview features, and uses a viewpoint-aware attention model with GAN-based viewpoint feature reasoning to achieve effective multiview inference. In addition, Chen et al. [13] designed a segmentation and reunion network structure (PRN) and proposed a 3D segmentation strategy. By using a double-branch structure, the network is divided into “height-channel” branches and “width-channel” branches to avoid certain spatial characteristics. Zheng et al. [14] proposed a multiscale attention framework (MSA) that considers multiscale mechanisms and attention technologies. The multiscale mechanism is applied to the feature map to obtain a comprehensive representation that fuses global and local cues, and an attention mechanism is applied on each subnetwork to mine characteristic information. Chen et al. [15] used the cross-correlation method to propose a local calibration network (PAN) that addresses the vehicle deviation problem and obtains a more robust local feature representation.

In summary, to obtain effective regional information, most methods segment vehicle images or mark key points. As shown in Figures 1(b) and 1(c), although these methods have greatly promoted vehicle reidentification research, the segmentation of parts can easily cause vehicle misalignment because of changes in the vehicle’s perspective, and the use of key points increases the data preprocessing work. To solve these problems, this study proposes a multimetric reordering method based on the multiattention part alignment network (MAPANet) and k-reciprocal encoding. As shown in Figure 1(d), by clustering the feature map channels, the local areas to be considered are obtained, and the distinguishing local feature vectors are extracted. Then, different distance formulas are used for reranking calculations to identify vehicles.

2.2. Multiattention Models in Person and Vehicle Re-ID

The attention mechanism [14] has certain advantages in dealing with local dislocation challenges and has a wide range of applications in computer vision tasks, including target detection [16], saliency detection [17, 18], aesthetic clipping [19], and cross-mode search [20, 21]. Hourglass [22] and cascaded pyramid networks [23] use the attention mechanism to estimate human body poses very well. The deconvolution head network proposed by Xiao et al. [24] uses key point detection models to locate human joints. There are relatively few studies on vehicle reidentification of local key attention. The multiscale attention framework (MSA) proposed by Zheng et al. [14], which considers multiscale mechanisms and attention technologies, uses attention blocks on subnetworks of each scale to mine complementary and distinguishing information. Tumrani et al. [25] proposed local attention and multiple attributes based on appearance features for vehicle reidentification, improved the number of output channels of the deconvolution head network proposed by Xiao et al. [24], and realized the use of vehicle key points to obtain local features.

2.3. Reranking in Person and Vehicle Re-ID

Results obtained by reranking are consistently more representative than the initial retrieval results. Existing metric learning and ranking algorithms have been successfully applied to various reidentification problems. Metric learning methods are mainly based on Mahalanobis metric learning (KISSME) and cross-view quadratic discriminant analysis (XQDA) [26]. For person reidentification, Zhong et al. [27] proposed the k-reciprocal encoding algorithm, introduced the Jaccard distance, and merged it with the initial distance to improve the reidentification results. Reordering techniques for vehicle reidentification have also attracted attention. Wang et al. [26] used the k-reciprocal encoding algorithm to propose a discriminative fine-grained vehicle reidentification network based on a two-stage reordering framework. In the first stage, the k-reciprocal encoding feature is obtained from the fusion feature. In the second stage, the average feature of the sample is formed by extracting the average center of the k-reciprocal nearest neighbors. The final distance is a weighted combination of the original and Jaccard distances.

3. Overview of Vehicle Reidentification and Reranking System Based on MAPANet

Subtle local clues, which contain sufficient distinguishing feature information, are the key to fine-grained image classification algorithms [28]. Figures 2(a)–2(f) show different vehicles with similar appearances obtained from the two datasets. In Figure 2(a), the vehicles are distinguished by the vehicle logo, and in Figures 2(c) and 2(e), by the annual inspection labels on the hood and windshield, respectively. In Figure 2(b), the vehicles are distinguished by whether the wheel hubs are the same, and in Figure 2(f), by whether there is a receiving antenna on the roof; in Figure 2(d), the appearance of the same vehicle differs because of different tasks. The reranking method is also very important in reidentification. Figure 2(g) shows the top ten images returned by reidentification. A red box indicates an incorrect sample, and a green box indicates a correct sample. The figure shows that some false samples achieve higher rankings, while certain true samples achieve lower rankings. Fine-grained vehicle identification is the key to vehicle reidentification because fine-grained networks can detect and extract more detailed and distinguishing information.

To address the difficulty of extracting subtle local clues caused by vehicle misalignment and angle changes, this study proposes a multiattention part alignment network (MAPANet), which extracts distinctive local features of the vehicle. The overall structure of MAPANet and the reidentification pipeline is shown in Figure 3 and is mainly composed of two parts: feature extraction and reordering. The feature extraction part extracts multiattention local features and global features. The reordering part reorders the initial list to lower the ranking of false samples. In Figure 3, part A shows the MAPANet network. Because of the convincing performance of IBN-Net and the well-designed architecture of ResNet50, MAPANet uses the ResNet50 network with the IBN-Net structure as its backbone network. Following the effective modification strategy used in person re-ID models, the convolutional layers of the third to fourth stages of ResNet50-b are copied and divided into five branches, which are grouped as Global Feature and Part Feature. Global Feature, which contains one branch, extracts the deep global features of the vehicle and performs similarity mapping from the image to Euclidean space through recognition and verification losses. Part Feature, which contains four branches, extracts the local regions obtained by channel grouping and pays more attention to local features so that subtle differences between vehicles can be recognized. A vehicle image of size 256 × 256 × 3 is fed into the backbone network, and five feature maps are generated through the five branches: one global feature map and four local feature maps. Because of the channel grouping operation, the local feature vectors have different dimensions, so a 1 × 1 convolution is applied to each of them to align it with the global feature map.
After global average pooling (GAP) is applied to the global feature, the fully connected layer outputs a 512-dimensional vector. For the reordering stage, the local feature maps are concatenated, and the dimension of the resulting vector is unified to 512 using a 1 × 1 convolution. The Global Feature and Part Feature subnetworks share weights during training and combine recognition and verification losses. Part B is the proposed reordering algorithm. Based on the k-reciprocal encoding algorithm, the Mahalanobis distance of the global features is calculated to produce the initial list, and the Jaccard distance is calculated from the multiattention local features. Finally, the two distances of the deep feature vectors are merged for sorting.

4. Vehicle Reidentification Based on MAPANet and k-Reciprocal

4.1. Vehicle Global Feature Extraction

To extract highly discriminative global features, the IBN-Net structure is added to ResNet50, which helps to improve the accuracy of the model without increasing the amount of computation. This network has good feature extraction capabilities and has been widely used in computer vision tasks.

The recognition model regards the task of vehicle reidentification as a multiclassification problem. The cross-entropy loss function is used to train the deep model directly on the vehicle reidentification dataset. Recognition loss pays more attention to learning local semantic features, which can retain enough identifiable information for reidentification. In this way, the classification loss will be more inclined to semantic local features rather than global features during training.

Definition 1 (neural network model symbol). $f = F(x; \theta)$ is used to represent the extracted vehicle features, where $F$ is the MAPANet network model, $x$ is the image of the vehicle to be detected, $f$ is the extracted vehicle feature vector, and $\theta$ denotes all the parameters that the model needs to learn.
The typical softmax loss function is used to train the model:
$$L_{id}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{l=1}^{C} q_{i,l} \log p_{i,l},$$
where $N$ refers to the number of samples in the training set, $\theta$ refers to the training parameters, $C$ refers to the number of classes, $y_i$ refers to the classification target of sample $i$, and $q_{i,l}$ refers to the prior probability of the $l$th target vehicle. If $l = y_i$, then $q_{i,l} = 1$; otherwise, $q_{i,l} = 0$. $p_{i,l}$ refers to the predicted probability of the softmax layer on class $l$.
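
As a minimal sketch of the softmax (cross-entropy) loss described above, the following plain-Python code averages the negative log-probability of the target class over a batch. The function names `softmax` and `identification_loss` are hypothetical and are not taken from the paper; the one-hot target follows the description that the prior is 1 for the target class and 0 otherwise.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def identification_loss(batch_logits, targets):
    """Mean cross-entropy over a batch.

    batch_logits: list of per-sample score lists, one score per vehicle ID.
    targets: list of integer vehicle IDs (one-hot prior on the target class).
    """
    total = 0.0
    for logits, y in zip(batch_logits, targets):
        probs = softmax(logits)
        total += -math.log(probs[y])  # only the target class contributes
    return total / len(targets)
```

With uniform scores over four classes, the loss equals log 4, and it decreases as the score of the target class grows.
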
The input of the verification model is a pair of vehicle images, which determines whether the input has the same identity. In the feature space, images of vehicles with the same identity are closer than images of vehicles with different identities.

Definition 2 (verification loss). where and represent the neural network feature vectors obtained from two vehicle images and minimize the distance between these two features. If , then and have the same ID. Otherwise, they will have different vehicle IDs.

4.2. Vehicles Pay More Attention to Local Feature Extraction

Figure 4 shows the attention part alignment network framework. Different channels in the feature map correspond to different visual information and have different peak response areas; channels with similar response areas are clustered to obtain the local attention areas. The original vehicle image in Figure 4(c) passes through the first stage of ResNet50-b to produce 256 feature channels, and the white area on each feature channel is the peak response of different visual information.

In the experiment, too few clusters make the local features close to the global features, whereas excessive segmentation reduces the weight of the global features. Therefore, in Figure 4(d), the channels are clustered into 4 subregions; testing showed that 4 is a moderate number of clusters. To better optimize the channel clustering, the clustering operation is performed by fully connected layers. The multiattention local feature extraction [29] is adapted to vehicle features, as shown in Algorithm 1.

Definition 3. $W * X$ is the feature map obtained from image block $X$, where $W$ denotes all the training parameters of the network, $*$ denotes the 9 layers of convolution, pooling, activation, and other operations of the first block of ResNet50-b, $w = 56$ is the width, $h = 56$ is the height, and $c = 256$ is the number of channels.

Definition 4. is the maximum response coordinate on the training image.
This coordinate serves as the representative feature of the channel and participates in clustering, where is the coordinate of the maximum response position and is the number of channels.

Definition 5. refers to the number of channel clusters, and is the fully connected layer corresponding to the channel clusters, where the input of each is the feature map of the convolutional layer. is the weight vector obtained by the fully connected layer.

Definition 6. refers to the channel grouping obtained by clustering, refers to the multiattention local feature map, and . refers to the attention local feature vector.
To adapt to the characteristics of the vehicle image, in the multiattention network, we set the channel grouping loss function [29], as shown in the following formula:where is the label of each subarea and is the local area. In the last item representing the proportion of the entire area, the local area obtained in the experiment should be as large as possible and contain more information to prevent loss of details. refers to the classification loss of the local region, which is used for comparison and classification of channel grouping with original features. loss function is used for channel group loss, and the formulas are shown as follows:where makes the parts more concentrated and makes the parts more dispersed so that only a certain proportion of channels can exist in a certain part. The two loss functions strengthen each other and iterate continuously until their value no longer changes.
The different local features obtained in the branch networks are concatenated, and the global and local feature vectors are finally obtained. The ranking result is obtained by comparing the distance between the query image and each image in the gallery set, where the distance exploits the complementarity of global and local features.

Input:
Output: multiattention part feature vectors ()
(1) for in //channel grouping
(2)  for in
(3)    //weight vector
(4)  end for
(5)   //weight vector collection
(6)  if
(7)  
(8)  else
(9)  
(10) end if
(11)end for
(12)for in //generate local attention feature map
(13)//accumulation of similar channels, using sigmoid to generate probability
(14)//loss function training
(15)//features of the jth channel, multiplied element by element
(16) for in
(17)   //using 1 × 1 convolution to change the number of channels
(18)   //enter the remaining block to obtain the local feature map vector
(19)end for
(20)
(21)END
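
The attention steps named in the comments of Algorithm 1 (peak response per channel, accumulation of similar channels with a sigmoid, and element-wise multiplication with the channel features) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the grouping weights are passed in directly instead of being learned by fully connected layers, the feature map is a nested list of shape channels × height × width, and all function names are hypothetical.

```python
import math


def peak_coordinate(channel):
    """Return (row, col) of the maximum response in one h x w channel."""
    best = (0, 0)
    for r, row in enumerate(channel):
        for c, value in enumerate(row):
            if value > channel[best[0]][best[1]]:
                best = (r, c)
    return best


def attention_map(feature_map, weights):
    """Weighted sum of channels followed by a sigmoid: one mask per group."""
    h, w = len(feature_map[0]), len(feature_map[0][0])
    acc = [[0.0] * w for _ in range(h)]
    for channel, d in zip(feature_map, weights):
        for r in range(h):
            for c in range(w):
                acc[r][c] += d * channel[r][c]
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in acc]


def part_feature(feature_map, mask):
    """Multiply each channel by the mask element-wise, then sum spatially."""
    h, w = len(mask), len(mask[0])
    return [
        sum(channel[r][c] * mask[r][c] for r in range(h) for c in range(w))
        for channel in feature_map
    ]
```

For example, with two 2 × 2 channels that peak at opposite corners and weights `[1.0, 0.0]`, the mask is highest at the peak of the first channel, so the resulting part vector emphasizes that region.
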
4.3. Vehicle Reordering Based on k-Reciprocal Encoding

Vehicle recognition is a subproblem of image retrieval. In the retrieval process, reranking can improve the accuracy of the results. However, most studies determine similarity only by calculating the distance between features, and the recognition results obtained are biased. The k-reciprocal encoding [27] algorithm has achieved good results in pedestrian recognition. This study proposes a fusion multimetric vehicle reidentification and reranking method to determine the characteristics and differences between vehicles. If an image in the gallery set $G$ and the query image $p$ are among each other's $k$ nearest neighbors, they are likely to have the same ID; exploiting this relation improves performance on the vehicle reidentification datasets. The distance is defined in two parts. The original distance list Old_list is obtained from the Mahalanobis distance between the global feature vectors, and the Jaccard distance is obtained from the multiattention complementary local area feature vectors. The final Recorder_list list is obtained by aggregating the two distances. The Mahalanobis distance and Jaccard distance are shown in the following formulas, respectively:
$$d_M(p, g_i) = \sqrt{(x_p - x_{g_i})^{\top} \Sigma^{-1} (x_p - x_{g_i})},$$
where $x_p$ and $x_{g_i}$ are the global feature vectors of $p$ and $g_i$ and $\Sigma$ is the covariance matrix.
$$d_J(p, g_i) = 1 - \frac{|R(p, k) \cap R(g_i, k)|}{|R(p, k) \cup R(g_i, k)|},$$
where $g_i$ is the picture in $G$, $R(\cdot, k)$ is the set of k-reciprocal nearest neighbors, and $k$ is the number of neighboring images.

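Definition 7. $d^{*}(p, g_i)$ is the final distance between the query image $p$ and the gallery image $g_i$ used to sort the list, weighted by the original distance and Jaccard distance:
$$d^{*}(p, g_i) = (1 - \lambda)\, d_J(p, g_i) + \lambda\, d_M(p, g_i),$$
where $d_J$ is the Jaccard distance, $d_M$ is the Mahalanobis distance, and $\lambda \in [0, 1]$ is the weighting factor.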
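
The reranking idea can be illustrated with a simplified plain-Python sketch. It assumes two precomputed pairwise distance matrices over all images (one from global features for the original distance, one from local features for the neighbor sets), uses hard k-reciprocal sets without the soft encoding or query expansion of [27], and places the weight `lam` on the original distance as an assumption. All function and parameter names are hypothetical.

```python
def k_nearest(dist_row, k):
    """Indices of the k smallest distances (the image itself included)."""
    return sorted(range(len(dist_row)), key=lambda j: dist_row[j])[:k]


def k_reciprocal_set(dist, i, k):
    """Images j that are in i's top-k and that also have i in their top-k."""
    return {j for j in k_nearest(dist[i], k) if i in k_nearest(dist[j], k)}


def jaccard_distance(dist, i, j, k):
    """One minus the overlap ratio of the two k-reciprocal neighbor sets."""
    a = k_reciprocal_set(dist, i, k)
    b = k_reciprocal_set(dist, j, k)
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def fused_distance(d_global, d_local, i, j, k, lam):
    """Weighted sum of the Jaccard distance and the original distance."""
    return (1.0 - lam) * jaccard_distance(d_local, i, j, k) + lam * d_global[i][j]
```

Sorting the gallery images by `fused_distance` to the query then gives the reranked list. Images that share most of their reciprocal neighbors move up even when their original distance is only moderate.
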

5. Results and Discussion

5.1. Datasets

This study uses VeRi-776 [10] and VehicleID [11] datasets to evaluate the performance of the network structure. A summary of the two datasets is shown in Table 1.

VeRi-776 is a widely used vehicle reidentification dataset. Twenty cameras are arranged in different areas of 1 km² to cover different traffic conditions. The images include lanes, intersections, and different lighting and background conditions while ensuring data quality. After manual processing and annotation, 49360 images of 776 cars were finally generated. Each vehicle is captured by at least two cameras to ensure that the data are suitable for reidentification tasks. The VeRi-776 dataset provides color, model, and make information. The vehicle colors include black, blue, brown, gray, green, gold, red, white, yellow, and purple; vehicle types include bus, sedan, hatchback, MPV, pickup, SUV, and truck; vehicle make information includes more than 30 makes.

VehicleID is a large-scale vehicle reidentification dataset presented in [11] that includes approximately 200,000 images of 26,267 cars. It mainly includes the front and rear perspectives of the vehicle. The dataset contains information such as car models and colors. Among them, there are 250 models, including Audi, Mercedes Benz, and Toyota; vehicle colors include black, blue, gray, red, silver, white, and yellow.

5.2. Network Training

The MAPANet model was trained on the dataset for 24 hours on a single NVIDIA GeForce GTX 1660Ti GPU in the PyTorch environment. Multibranch neural network models are usually difficult to train, because directly training all branches of the multiparameter network together requires a long convergence time. The network structure used in this study includes 5 branches and comprises a multiattention branch network, a global feature network, and a reordering network. Therefore, the experiments are divided into four configurations: “Baseline,” representing a single-branch ResNet50 model for extracting global features; “MAPA + local,” referring to the proposed multiattention area; “MAPANet,” referring to the proposed network; and “MAPANet + FR,” corresponding to the entire network using the reordering method. ResNet50 pretrained on ImageNet was used as the backbone network to accelerate the training process, and pretraining was carried out with the cross-entropy loss. All vehicle images are resized to 256 × 256 to retain sufficient information in each segmented part. To make the proposed model more robust, random horizontal flips and random erasing with a probability of 0.5 are used to augment the data. The 2048-dimensional feature vector output by the initial FC layer is compressed to a 512-dimensional feature vector, on which the L2 distance is computed. The weight decay factor of L2 regularization is set to 5e-4, and the initial learning rate is set to 3e-4 and decays to 3e-6 after 50 epochs. The batch size is set to 64, the Adam optimizer with AMSGrad is used for optimization, and the maximum number of epochs is 60.
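
The hyperparameters above can be collected in a small configuration sketch. The key names and the `learning_rate` helper are hypothetical, and the schedule assumes that the learning rate is 3e-4 for the first 50 epochs (counted from 0) and 3e-6 afterwards, which is one reading of the description.

```python
# Hyperparameters quoted in the text; the key names are illustrative only.
TRAIN_CONFIG = {
    "image_size": (256, 256),
    "batch_size": 64,
    "max_epochs": 60,
    "initial_lr": 3e-4,
    "decayed_lr": 3e-6,
    "decay_epoch": 50,
    "weight_decay": 5e-4,
    "flip_prob": 0.5,   # random horizontal flip
    "erase_prob": 0.5,  # random erasing
}


def learning_rate(epoch, cfg=TRAIN_CONFIG):
    """Step schedule; epochs are assumed to be counted from 0."""
    if epoch < cfg["decay_epoch"]:
        return cfg["initial_lr"]
    return cfg["decayed_lr"]
```

In a PyTorch training loop, values like these would typically be passed to the optimizer and a step scheduler; the helper only makes the assumed schedule explicit.
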

5.3. Evaluation Criteria

Vehicle reidentification is a subproblem of image retrieval. According to widely used protocols, the mean average precision (mAP) and Rank-N are used to evaluate the overall performance of vehicle reidentification on the VeRi-776 and VehicleID datasets.

5.3.1. Mean Average Precision

The mean average precision represents the average of the precision of all search results.

It is an important index for evaluating the overall performance of a reidentification method because it considers both recall and precision. For each query image, the average precision (AP) is calculated as shown in the following equation:
$$AP = \frac{\sum_{k=1}^{n} P(k) \times gt(k)}{N},$$
where $N$ refers to the total number of images of the target vehicle, $k$ refers to the rank of an image in the retrieved list, $n$ refers to the total number of gallery images, $P(k)$ refers to the precision at the $k$th position in the retrieval sequence, and $gt(k)$ indicates whether the $k$th image is the target vehicle. Finally, the mean average precision (mAP) over all query images is calculated as shown in the following equation:
$$mAP = \frac{\sum_{q=1}^{Q} AP(q)}{Q},$$
where $Q$ is the total number of query images.
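
A minimal sketch of the AP and mAP computation, assuming each query is represented by its ranked gallery list encoded as 0/1 flags (1 when the result at that rank is the target vehicle) and that every target image appears somewhere in the list. The function names are hypothetical.

```python
def average_precision(relevance):
    """AP for one query; relevance is a ranked list of 0/1 flags."""
    n_targets = sum(relevance)
    if n_targets == 0:
        return 0.0
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision at rank k, counted only at hits
    return total / n_targets


def mean_average_precision(all_relevance):
    """Mean of the per-query AP values."""
    return sum(average_precision(r) for r in all_relevance) / len(all_relevance)
```

For a ranked list `[1, 0, 1, 0]`, the precision is 1 at the first hit and 2/3 at the second, so AP = (1 + 2/3) / 2 = 5/6.
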

5.3.2. Rank-N Form

When comparing the performance of different methods, if the performance difference between the methods is low, the cumulative matching performance curve will mostly overlap, so it is impossible to accurately judge the performance. To compare the performance of different methods more concisely, Rank-1 and Rank-5 are selected to represent the probability of correct image matching.
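
Rank-N can be sketched in the same setting as the fraction of queries whose first N results contain at least one correct match. The input format (one ranked 0/1 list per query) and the function name are assumptions made for illustration.

```python
def rank_n_accuracy(all_relevance, n):
    """Fraction of queries with at least one correct match in the top n."""
    hits = sum(1 for relevance in all_relevance if any(relevance[:n]))
    return hits / len(all_relevance)
```

Rank-1 therefore measures how often the very first result is correct, while Rank-5 tolerates up to four wrong results ahead of the first correct one.
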

5.4. Performance on the VeRi-776 Dataset

In this section, the proposed method is compared with traditional methods and recent methods on the VeRi-776 dataset. LOMO [30] and BOW-CN [31] are traditional methods. LOMO applies the handcrafted local maximal occurrence representation, which is robust to illumination changes in pedestrian reidentification, to vehicle reidentification. BOW-CN uses the bag-of-words (BOW) model and color names to obtain the color characteristics of the vehicle. Among deep learning methods, GoogLeNet [32] is a neural network architecture that is pretrained on ImageNet and learns vehicle features. Different from conventional deep networks, two-branch convolutional networks are applied in PROVID [33], FACT + Plate-SNN + STR [34], Siamese-CNN [35], and DFN [26]. FACT + Plate-SNN + STR [34] builds on FACT and adds license plate information and spatiotemporal relationships to obtain distinguishing characteristics. DFN [26] uses a fine-grained network to obtain global and local features for vehicle identification. STR + ST [36] and PAMA [25] both process the key points of the vehicle. The former extracts local area features in different directions based on 20 well-aligned key points. PAMA [25] uses multiple vehicle attributes and key points to extract vehicle features. VRSDNet [37] uses the short and dense connection mechanism of a Siamese network to learn vehicle characteristics. VANet [38] uses viewpoint-aware metric learning to learn two metrics, one for similar viewpoints and one for different viewpoints, in two feature spaces and thereby achieves viewpoint awareness. PNVR [39] relies on local regularization to retain discriminative features, which enhances the ability to perceive subtle differences, and merges local and global features through detection branches. Table 2 shows more details of the comparison between the proposed method and the other methods.

Ablation Study. The proposed method includes three main parts, and the performance improves continuously as each part is added. MAPANet with the final reordering (MAPANet + FR) performs best: in mAP/Rank_1/Rank_5, it outperforms Baseline by 13.83/5.75/2.79 (%), MAPA + local by 7.6/3.58/2.01 (%), and MAPANet without reranking by 2.96/2.04/0.63 (%), as shown in Figure 5(a).

5.5. Performance on the VehicleID Dataset

The method in this study is compared with existing methods on the VehicleID dataset. Generally, all methods perform better on small-scale test sets because large test sets introduce more challenging and complex scenarios. EALN [40], VAMI [13], and XVGAN [41] all focus on using multiview information to obtain global features through a GAN in order to improve the results of vehicle reidentification. Among them, VAMI [13], the viewpoint-aware attention multiview inference method, uses the vehicle’s multiview information to extract distinguishing features from single-view features. EALN [40] is an end-to-end embedding adversarial learning network that generates samples located in the embedding space, which greatly improves the network’s ability to distinguish similar vehicles. DLCNN [42] uses a Siamese network to combine verification and recognition losses, which greatly improves the overall performance. GS-TRE [43] uses vehicle ID labels and intraclass variance to merge samples and individuals into intermediate groups; the VGG network is used as the backbone network for clustering, and the triplet loss is used for training. In comparison, MAPANet achieves better performance and is much better than Baseline. Table 3 shows more details of the comparison between the proposed method and the other methods.

Ablation Study. For the VehicleID data, when the multiattention part is added to the global feature extraction method and the reranking algorithm is used, the vehicle recognition performance improves continuously. MAPANet + FR finally achieves the best performance, which is most obvious on the small test set. In mAP/Rank_1/Rank_5, it outperforms Baseline by 3.76/9.05/4.12 (%), MAPA + local by 2.57/4.94/3.39 (%), and MAPANet without reranking by 1.14/2.29/1.19 (%). Figures 5(b)–5(d) show more details. A more intuitive visualization of the MAPANet results is provided in Figure 6. The left column contains different query images, and the other columns are the results obtained by the MAPANet method. The results are sorted by similarity; correct results are displayed in green boxes, and incorrect results are displayed in red boxes.

6. Conclusions

This study proposes an unsupervised reranking method for vehicle reidentification that fuses multiple metrics based on the MAPANet model. The MAPANet model is mainly composed of a multiattention local alignment network, local feature extraction, and global feature extraction. The loss function is designed to include as much detail as possible by taking advantage of the different regions to which the different channels respond. The reordering method is used to further improve the reidentification performance. The method in this study focuses on more details of different attention areas and does not require complex strategies to maintain local matching. The proposed framework is evaluated on two public vehicle reidentification datasets (VeRi-776 and VehicleID) and compared with other current algorithms. The experimental results show the competitive performance of this method.

In the future, research will be conducted in the following areas relative to vehicle reidentification:
(1) Lighten the network structure and reduce the long training time of multibranch networks
(2) Deal with special vehicles in special scenarios, such as ambulances, police cars, school buses, fire trucks, and muck trucks
(3) Improve the local feature extraction method and increase the weight for some important local objects

Data Availability

The data used can be found at https://pan.baidu.com/s/1UFc6VJioavlO3Po2FbJU-g, extraction code: nnrw.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this study.

Acknowledgments

This work was supported by the Natural Science Foundation of Hebei Province, China (Grant No. F2019201329). Portions of the research in this paper used the VehicleID dataset funded by the National Natural Science Foundation of China.