Abstract
In actual traffic scenarios, the environment is complex and constantly changing, and many vehicles are highly similar in appearance, which poses significant challenges to vehicle tracking research based on deep learning. To address these challenges, this article investigates the application of the DeepSORT (simple online and realtime tracking with a deep association metric) multitarget tracking algorithm in vehicle tracking. Because the DeepSORT algorithm depends strongly on target detection, a YOLOv5s_DSC vehicle detection algorithm based on the YOLOv5s algorithm is proposed, which provides accurate and fast vehicle detections to the DeepSORT algorithm. Compared to YOLOv5s, YOLOv5s_DSC differs by no more than 1% in optimal mAP0.5 (mean average precision), precision, and recall, while reducing the number of parameters by 23.5%, the amount of computation by 32.3%, and the size of the weight file by 20%, and increasing the average processing speed per image by 18.8%. After integrating the DeepSORT algorithm, the processing speed of YOLOv5s_DSC + DeepSORT reaches 25 FPS, and the system exhibits better robustness to occlusion.
1. Introduction
The increasing number of vehicles has caused great difficulties in traffic management. Vehicle tracking is an application of target tracking in the field of transportation and can help alleviate the pressure of traffic management [1–3]. At present, the mainstream target tracking method is the discriminative tracking method, which adds a target detection step and makes tracking more accurate. Discriminative tracking methods mainly include tracking methods based on sparse representation [4–6], tracking methods based on correlation filtering [7–9], and tracking methods based on deep learning. Li and Huang [10] proposed the TOD (tracking object based on detector) algorithm, which used YOLOv3 for target detection and tracked the target according to LBP (local binary pattern) features and a color histogram. Bertinetto et al. [11] proposed the SiamFC (Siamese fully convolutional) algorithm, which took the target object in the first frame as one input of the Siamese network and the search area in subsequent frames as the other input and then located the region most similar to the target object to realize tracking. However, the target is easily lost when its size changes. Zhu et al. [12] adopted a distractor recognition model to update the tracking template online, which could cope well with serious occlusion and appearance changes of the target. Li et al. [13] introduced deep networks into the Siamese framework and exploited their representational power through multilayer aggregation.
Multitarget tracking is harder than single-target tracking. Problems such as appearance similarity among targets, occlusion, and the initialization and termination of individual tracks pose significant challenges in the field of multitarget tracking. Bewley et al. [14] proposed the SORT (simple online and realtime tracking) algorithm, which used the Kalman filter to predict the tracking box of the tracked object in the next frame and performed data association with the detection boxes in the next frame to achieve multitarget tracking. The algorithm had a small memory footprint and high speed, but the accuracy was very low when the target was occluded. Wojke et al. [15] proposed the DeepSORT algorithm based on the idea of SORT. The algorithm considered both motion information and appearance information in the tracking process and mitigated the problem of target occlusion. At present, detection-based tracking algorithms still have many problems, such as a lack of datasets, inaccurate target detection, and insufficient realtime performance.
Traditional target detection algorithms rely on artificially designed image features such as color features [19], gradient features [20], and pattern features [21], together with classifiers such as SVM (support vector machine) [16], Adaboost [17], and Random Forest [18]. Target detection algorithms based on deep learning have stronger adaptability to complex scenes and include target detection methods based on candidate regions and target detection methods based on regression. The representative candidate-region-based algorithms are the R-CNN (Region-CNN) series [22–24]. Owing to the need to process a large number of candidate regions, such methods suffer from low efficiency and cannot achieve realtime detection. The regression-based target detection method removes the step of generating candidate regions and improves speed significantly, so it has been widely used for developing realtime target detection systems. The YOLO (you only look once) algorithm [25] proposed in 2016 divided an image into grids and generated a series of initial anchor boxes in each grid cell. By learning to fine-tune these initial boxes, predicted boxes closer to the ground-truth boxes were generated. The YOLOv2 algorithm introduced batch normalization and used DarkNet-19 as the backbone network, which could dynamically adjust the input and achieve better precision for small targets [26]. On this basis, the YOLOv3 algorithm used DarkNet-53 as the backbone network, introduced the FPN (feature pyramid network) structure to obtain feature maps at different scales, and used a logistic classifier to predict target categories [27]. The YOLOv4 algorithm added data augmentation and self-adversarial training methods at the input end [28]. Its backbone used CSPDarkNet53 and improved the loss function of the output layer, which greatly improved speed and accuracy. YOLOv5 matches the performance of YOLOv4 but is faster, reaching a detection speed of 140 FPS on a Tesla P100. Sasagawa and Nagahara [29] used YOLO to locate and identify objects and proposed a method for detecting objects under low illumination by exploiting transfer learning. Krišto et al. [30] applied YOLO to thermal images to improve target detection performance in challenging conditions such as adverse weather, nighttime, and dense areas. Xiao et al. [31] fused context information in the YOLO backbone network to avoid the loss of low-level context features, retain lower spatial features, and address the difficulty of detecting targets under dim light. Guo et al. [32] designed an improved SSD (single shot multibox detector) detector, which used single-sample deformation data augmentation to apply color-gamut and affine transformations to the original data and could detect targets close to each other. To improve feature fusion for small tassel detection, Liu et al. [33] proposed a novel algorithm referred to as YOLOv5-tassel. To enrich feature information and improve feature extraction ability, Bie et al. [34] proposed an improved YOLOv5 algorithm based on a bidirectional feature pyramid network for multiscale feature fusion. Wang et al. [35] proposed a novel vehicle detection and tracking method based on the attention mechanism that achieves high detection and tracking accuracy for small target vehicles.
In summary, research based on the improved YOLOv5 algorithm mainly focuses on the accuracy of small-object detection, while detection speed and occlusion robustness in vehicle tracking still offer great research value. The main contributions of this article are as follows:
(1) To cope with the large number of vehicles, high vehicle speeds, substantial similarity of vehicle appearance, and vehicle occlusion in actual urban traffic scenes, the DeepSORT algorithm is used for vehicle tracking, which has better realtime performance and tracking robustness than traditional vision-based vehicle tracking methods.
(2) To reduce the computation of YOLOv5s, shorten its inference time, and increase its running speed, a YOLOv5s_DSC algorithm with faster inference is proposed.
(3) Combining YOLOv5s_DSC with the DeepSORT algorithm, the occlusion robustness of the proposed algorithm is verified and its realtime performance is tested in cases where vehicles are occluded by foreign objects or by each other.
2. Algorithmic Framework
2.1. Overall Framework
The DeepSORT algorithm adopts a two-stage idea of detection and tracking, using the Kalman filter and the Hungarian algorithm to track targets and introducing a deep convolutional neural network to extract the appearance information of tracked targets for data association, which addresses the difficulty of tracking occluded targets accurately. Stable and accurate vehicle detection results are an important guarantee for the DeepSORT algorithm in the vehicle tracking task. Considering the realtime requirements of realistic application scenarios, the YOLOv5 target detection algorithm is studied in this article. To further reduce the memory and computing resources occupied by the algorithm, a DSC structure with a residual connection is introduced into YOLOv5s, and the YOLOv5s_DSC algorithm, with a smaller model and faster speed, is proposed. YOLOv5s_DSC is used as the detector of the DeepSORT algorithm, and its detection accuracy makes tracking more accurate while providing better realtime performance.
2.2. The DeepSORT Algorithm
Figure 1 is a framework diagram of the DeepSORT algorithm. First, the Kalman filter predicts the tracking box of each tracked target in the next frame, and the target detection algorithm provides all the detection boxes in that frame. Then, the Hungarian algorithm finds the minimum-cost assignment between the detection boxes and the predicted tracking boxes. The cost matrix used in this step contains not only the Mahalanobis distance but also the cosine distance between appearance features extracted by a deep convolutional neural network. Solving this assignment yields the optimal pairing of predicted boxes and detection boxes. The DeepSORT algorithm uses cascade matching: the fewer frames that have elapsed since a track's last successful match, the higher its priority in the current matching round. After a successful match, the tracking box is updated with the associated detection box, and the Kalman filter continues to predict the tracking box of the tracked target in the next frame. For the samples that fail to match, a cost matrix is constructed again from the IOU between the remaining tracking boxes and detection boxes and passed to the Hungarian algorithm for a second round of matching; successfully matched tracks are then updated and predicted as before. Whether a track is confirmed is indicated by a "true" or "false" flag. A detection box that fails to match starts a new track flagged "false" and is examined over the following rounds (age counts the rounds, with max age = 3); if it matches in all three rounds, the flag is changed to "true." A tracking box that fails to match is terminated immediately if flagged "false"; if flagged "true," it is assigned a lifespan, and if it is re-matched within the lifespan, tracking continues, otherwise the track is terminated.
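To make the association step concrete, the following is a minimal sketch of the minimum-cost matching described above, assuming SciPy's Hungarian solver is available; the function name, thresholding, and bookkeeping are illustrative rather than the reference DeepSORT implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def min_cost_matching(cost_matrix, max_cost):
    # Hungarian algorithm: optimal assignment between tracks (rows)
    # and detections (columns); pairs costlier than max_cost are rejected
    # and treated as unmatched.
    row_idx, col_idx = linear_sum_assignment(cost_matrix)
    matches = [(r, c) for r, c in zip(row_idx, col_idx)
               if cost_matrix[r, c] <= max_cost]
    matched_rows = {r for r, _ in matches}
    matched_cols = {c for _, c in matches}
    unmatched_tracks = [r for r in range(cost_matrix.shape[0])
                        if r not in matched_rows]
    unmatched_dets = [c for c in range(cost_matrix.shape[1])
                      if c not in matched_cols]
    return matches, unmatched_tracks, unmatched_dets

# Cascade matching calls min_cost_matching once per age level, so tracks
# matched more recently get priority; the leftovers then go through the
# second, IOU-based round.
```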

State estimation methods mainly include state observers and various linear and nonlinear discrete estimators based on the Kalman filter. Liu et al. [36] proposed a novel vehicle sideslip angle estimation algorithm that fuses a dynamic model and vision for vehicle dynamic control. A vehicle attitude angle observer based on the square-root cubature Kalman filter (SCKF) is designed in [37] to estimate the roll and pitch and reject the gravity components induced by the vehicle roll and pitch. For simplicity, this article uses the Kalman filter for state estimation. The prediction equations of the Kalman filter are as follows:

$$\hat{x}_{k|k-1} = F \hat{x}_{k-1|k-1},$$
$$P_{k|k-1} = F P_{k-1|k-1} F^{T} + Q,$$

where $\hat{x}_{k|k-1}$ is the state estimate at time $k$ predicted from time $k-1$, $\hat{x}_{k-1|k-1}$ is the state estimate at time $k-1$, $P$ is the covariance matrix of the state estimate, $F$ is the state transition matrix, and $Q$ is the process noise covariance. The measurement update equations of the Kalman filter are as follows:

$$K_{k} = P_{k|k-1} H^{T} \left( H P_{k|k-1} H^{T} + R \right)^{-1},$$
$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_{k} \left( z_{k} - H \hat{x}_{k|k-1} \right),$$
$$P_{k|k} = \left( I - K_{k} H \right) P_{k|k-1},$$

where $z_k$ is the measurement vector at time $k$, $H$ is the measurement matrix, $R$ is the covariance matrix of the measurement noise, $K_k$ is the Kalman gain used to correct the state estimate, and $I$ is the identity matrix. The state vector of the DeepSORT algorithm can be described as follows:

$$x = \left( u, v, \gamma, h, \dot{u}, \dot{v}, \dot{\gamma}, \dot{h} \right)^{T},$$

where $u$, $v$, $\gamma$, and $h$ represent the center coordinates, aspect ratio, and height of the target box, respectively, and $\dot{u}$, $\dot{v}$, $\dot{\gamma}$, and $\dot{h}$ represent the corresponding rates of change used by the Kalman filter to predict the next frame. The DeepSORT algorithm uses a cost matrix constructed from the Mahalanobis distance and the cosine distance of appearance features in the first data association. The Mahalanobis distance metric is calculated as follows:

$$d^{(1)}(i, j) = \left( d_{j} - y_{i} \right)^{T} S_{i}^{-1} \left( d_{j} - y_{i} \right),$$

where $d_j$ represents the $j$th detection state, $y_i$ represents the state of the $i$th tracked target predicted for the current frame from the previous frame, and $S_i$ is the covariance between the detected and predicted states. The cosine distance metric of appearance features is as follows:

$$d^{(2)}(i, j) = \min \left\{ 1 - r_{j}^{T} r_{k}^{(i)} \mid r_{k}^{(i)} \in R_{i} \right\},$$

where $r_j$ is the feature vector of the $j$th detection box, $r_k^{(i)}$ is a feature vector of the $i$th tracking box, and $R_i$ stores the feature vectors from the track's most recent successful associations. The DeepSORT algorithm constructs a deep convolutional neural network to extract the appearance features of the tracking target and uses L2 normalization to project the features. The network structure is shown in Table 1.
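Both distance metrics can be computed directly; below is a small sketch assuming NumPy arrays, with the feature gallery $R_i$ of a track stored as a list of L2-normalized vectors (all names are illustrative).

```python
import numpy as np

def mahalanobis_distance(d_j, y_i, S_i):
    # d1(i, j) = (d_j - y_i)^T S_i^{-1} (d_j - y_i)
    diff = d_j - y_i
    return float(diff @ np.linalg.inv(S_i) @ diff)

def appearance_cosine_distance(r_j, gallery_R_i):
    # d2(i, j) = min over the stored features r_k of (1 - r_j^T r_k);
    # with L2-normalized features the dot product equals the cosine.
    return float(min(1.0 - r_j @ r_k for r_k in gallery_R_i))
```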
Due to the nonconstant update frequency of image frames, we use the time difference between two consecutive frames as the time step of the Kalman filter during discretization. This allows the state update rate of the Kalman filter to be adjusted dynamically to the actual frame timing, which helps to track the target better.
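A minimal sketch of this variable-step discretization is given below, assuming a constant-velocity model on the eight-dimensional state defined above; the matrix shapes follow the preceding equations, and the noise matrices $Q$ and $R$ are left to the caller.

```python
import numpy as np

def predict(x, P, Q, dt):
    # Build the transition matrix for a constant-velocity model in which
    # each of (u, v, gamma, h) moves by its velocity times the measured
    # inter-frame gap dt.
    F = np.eye(8)
    for i in range(4):
        F[i, i + 4] = dt
    return F @ x, F @ P @ F.T + Q

def update(x_pred, P_pred, z, H, R):
    # Standard Kalman measurement update, following the equations above.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(x_pred.shape[0]) - K @ H) @ P_pred
    return x_new, P_new
```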
3. Improved Yolov5 Vehicle Detection Method
To achieve high-precision vehicle tracking, the vehicle detection algorithm is studied in this section. To further improve the realtime performance of vehicle detection, the DSC structure with a residual connection is introduced, and the YOLOv5s_DSC vehicle detection algorithm is proposed, which has fewer parameters, less computation, and a faster detection speed.
3.1. Depth Separable Convolution
With the help of grouped convolution, DSC uses pointwise convolution to fuse the feature information of different channels, which achieves a lightweight deep learning network while preserving feature extraction. It is divided into the following two steps:
(1) Channel-by-channel convolution: the input image is $D_F \times D_F \times M$. Each channel is convolved independently with a $D_K \times D_K \times 1$ kernel, yielding $M$ feature maps of size $D_F \times D_F \times 1$. The number of kernel parameters is $D_K \times D_K \times M$. As shown in Figure 2, if a 3-channel image is the input of the channel-by-channel convolution, 3 single-channel feature maps are obtained.
(2) Pointwise convolution: $N$ kernels of size $1 \times 1 \times M$ convolve the output of (1) to obtain a feature map of size $D_F \times D_F \times N$. As shown in Figure 3, the number of parameters of this step is $1 \times 1 \times M \times N$.


The number of parameters of the whole DSC is as follows:

$$D_K \times D_K \times M + 1 \times 1 \times M \times N.$$
This is similar to grouped convolution with the number of groups equal to the number of channels $M$. The difference is that the result of grouped convolution is the concatenation of the group results, whereas the result of DSC is a weighted combination of the group results produced by the pointwise convolution, which makes full use of the feature information of every channel at the same position, as the sketch below illustrates.
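The parameter accounting can be checked in a few lines of PyTorch; the channel sizes below match the first replaced C3 structure discussed in the next subsection, and the concrete layer objects are illustrative.

```python
import torch.nn as nn

M, N, K = 64, 64, 3  # input channels, output channels, kernel size

# Channel-by-channel convolution: groups=M gives one K x K filter per channel.
depthwise = nn.Conv2d(M, M, kernel_size=K, padding=K // 2, groups=M, bias=False)
# Pointwise convolution: N kernels of size 1 x 1 x M fuse the channels.
pointwise = nn.Conv2d(M, N, kernel_size=1, bias=False)

n_dsc = sum(p.numel() for p in depthwise.parameters()) \
      + sum(p.numel() for p in pointwise.parameters())
print(n_dsc)  # K*K*M + M*N = 576 + 4096 = 4672

standard = nn.Conv2d(M, N, kernel_size=K, padding=K // 2, bias=False)
print(sum(p.numel() for p in standard.parameters()))  # K*K*M*N = 36864
```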
3.2. Yolov5s Improvement Strategy
The YOLOv5s model has 283 layers in total, 7,071,633 parameters, and a computation amount of 16.4 GFLOPs. To further simplify the network structure, reduce the amount of calculation, and shorten the inference time of the model, the DSC structure is introduced to replace the C3 structures of the backbone in YOLOv5s. As shown in Figure 4, the first C3 structure in the YOLOv5s network contains five convolutions, whose parameters are shown in Table 2.

From Table 2, it can be calculated that Conv1 and Conv2 have 2,048 parameters each, Conv3 has 4,096 parameters, Conv4 has 1,024 parameters, and Conv5 has 9,216 parameters. Therefore, the number of parameters of the first C3 structure in the YOLOv5s network amounts to 18,432.
DSC performs two convolutions. The first convolution extracts the features of each channel, and the second fuses the position information across channels. To match the first C3 structure in the YOLOv5s network, the input and output channels of the DSC are also set to 64, and the channel-by-channel convolution kernel size is $3 \times 3 \times 1$, so the number of parameters of the channel-by-channel convolution is $3 \times 3 \times 64 = 576$; the point-by-point convolution uses 64 kernels of size $1 \times 1 \times 64$, so its number of parameters is 4,096. Thus, the DSC structure has 4,672 parameters in total, which is 13,760 fewer than the first C3 structure in the YOLOv5s network. The backbone of the YOLOv5s network contains four C3 structures, which are replaced by DSC structures in turn. To avoid the network degradation that replacement with DSC may cause, a residual structure is introduced, as shown in Figure 5.
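A sketch of the DSC block with a residual connection is given below; the exact normalization and activation placement inside the block is not specified in the text, so BatchNorm and SiLU (the YOLOv5 default activation) are assumptions.

```python
import torch
import torch.nn as nn

class DSCResidual(nn.Module):
    # Depthwise separable convolution with an identity shortcut; used here
    # in place of a C3 structure with equal input/output channels.
    def __init__(self, channels=64):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1,
                                   groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        # The shortcut counteracts the degradation that can follow
        # replacing the deeper C3 structure.
        return self.act(self.bn(self.pointwise(self.depthwise(x))) + x)

out = DSCResidual(64)(torch.randn(1, 64, 80, 80))  # spatial size preserved
```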

The introduction of DSC effectively reduces the number of parameters and makes the network model smaller. The comparison of parameter counts after replacement is shown in Table 3. The parameter count of each structure includes the parameters of the convolutions, biases, and batch normalization in the structure. The improved network framework is presented in Table 4.
4. Experimental Results and Analysis
4.1. Dataset Preparation
The VeRi dataset [38] is a large vehicle rerecognition dataset containing vehicle images from multiple angles and under different light intensities, making it suitable for research on vehicle rerecognition. As shown in Figure 6, each folder contains pictures of the same vehicle taken from different angles, with a total of 776 folders. The training set and the test set are split in the proportion of 8 : 1.

UA-DETRAC [39] is a vehicle dataset collected from real traffic environments in Beijing and Tianjin, labeled with the four vehicle categories "Bus," "Car," "Van," and "Others," and includes vehicle images from different angles and periods, covering most traffic conditions. The UA-DETRAC dataset contains 60 image folders collected from different road sections and periods, and each folder corresponds to one XML annotation file. We use a script to extract the annotations corresponding to each image from the XML files and convert them all into TXT format, as sketched below. The training set and the test set are split in the proportion of 9 : 1, giving 73,876 training images and 8,209 test images in total. The data structure of images and labels is shown in Figure 7.
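A hedged sketch of this XML-to-TXT conversion follows, assuming the public UA-DETRAC annotation layout (a `<frame num=...>` element containing `<target>` elements with `<box>` and `<attribute vehicle_type=...>` children); the paths, file naming, and class index order are illustrative choices.

```python
import xml.etree.ElementTree as ET

IMG_W, IMG_H = 960, 540                      # UA-DETRAC frame resolution
CLASSES = ["bus", "car", "van", "others"]    # class index order is a choice

def convert_sequence(xml_path, out_dir):
    root = ET.parse(xml_path).getroot()
    for frame in root.iter("frame"):
        lines = []
        for target in frame.iter("target"):
            box = target.find("box")
            x, y = float(box.get("left")), float(box.get("top"))
            w, h = float(box.get("width")), float(box.get("height"))
            cls = CLASSES.index(target.find("attribute").get("vehicle_type"))
            # YOLO label format: class x_center y_center width height,
            # all normalized to the image size.
            lines.append(f"{cls} {(x + w / 2) / IMG_W:.6f} "
                         f"{(y + h / 2) / IMG_H:.6f} "
                         f"{w / IMG_W:.6f} {h / IMG_H:.6f}")
        with open(f"{out_dir}/img{int(frame.get('num')):05d}.txt", "w") as f:
            f.write("\n".join(lines))
```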

4.2. Training of DeepSORT Deep Convolutional Neural Network
The vehicle rerecognition dataset is used to train the DeepSORT deep convolutional neural network so that it can correctly extract the appearance features of vehicles for calculating the cosine distance of appearance features. Since the task is vehicle tracking, the input size of the network is set according to the aspect ratio of the vehicle images. The network model is built under the PyTorch framework. The initial learning rate is set to 0.1 and is reduced by a factor of 0.1 every 40 epochs. The training loss curve is shown in Figure 8. After 100 epochs, the loss becomes stable, and the accuracy on the test set reaches 88%.
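The stated schedule (initial learning rate 0.1, decayed by a factor of 0.1 every 40 epochs) maps directly onto a PyTorch StepLR scheduler, as sketched below; the SGD optimizer and the placeholder model are assumptions, since only the schedule is stated in the text.

```python
import torch
import torch.nn as nn

# Placeholder standing in for the appearance-embedding network of Table 1.
model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(32, 776))  # 776 = number of VeRi identities

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)

for epoch in range(100):
    # ... one training pass over the VeRi training split ...
    scheduler.step()  # lr: 0.1 -> 0.01 at epoch 40 -> 0.001 at epoch 80
```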

4.3. Yolov5s_DSC Network Training and Result Analysis
The YOLOv5s and YOLOv5s_DSC network models are built under the PyTorch framework and trained on the UA-DETRAC dataset. The batch size is 128, and 50 epochs are trained. The training losses are shown in Figure 9. The regression, classification, and objectness losses of YOLOv5s_DSC decrease as fast as those of YOLOv5s; the lowest values of the three losses are 0.01722, 0.0011741, and 0.02758 for YOLOv5s and 0.01835, 0.0013954, and 0.02933 for YOLOv5s_DSC, which indicates that introducing the DSC structure with residuals does not add much training difficulty to the network.

We compare the performance of YOLOv5s and YOLOv5s_DSC in mAP, precision, and recall. In Figure 10, the curves of the two networks almost coincide, which indicates that introducing the DSC structure with residuals reduces the number of network parameters without reducing the accuracy of the network. YOLOv5s_DSC with KF (Kalman filter) is smoother than YOLOv5s_DSC alone, and YOLOv5s with KF is smoother than YOLOv5s alone, indicating that the KF can dynamically adjust its update rate, which helps to track targets better. In Table 5, a higher mAP (mean average precision) indicates a better detector; except for the optimal mAP0.5:0.95, the differences between YOLOv5s_DSC and YOLOv5s in optimal mAP0.5, precision, and recall are no more than 1%, while the number of parameters is reduced by 23.5%, the amount of computation by 32.3%, and the size of the weight file by 20%. In a hardware environment with an NVIDIA GeForce RTX 3080 graphics card and an Intel(R) Xeon(R) CPU E5-2670 v3, the average processing speed of each image is improved by 18.8%, which proves that the proposed algorithm is faster while maintaining accuracy.
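The parameter and latency comparisons can be reproduced with a profiling routine along these lines; `model` stands for either network, and the warm-up and timing choices are illustrative.

```python
import time
import torch

def profile(model, img_size=640, runs=100, device="cuda"):
    # Counts learnable parameters and measures average per-image latency.
    model = model.to(device).eval()
    n_params = sum(p.numel() for p in model.parameters())
    x = torch.randn(1, 3, img_size, img_size, device=device)
    with torch.no_grad():
        for _ in range(10):                  # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
    ms_per_image = (time.perf_counter() - start) / runs * 1000
    return n_params, ms_per_image
```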

4.4. Verification Experiment
A video of traffic flow captured from the front of an intersection is selected as input. As shown in Figure 11, the YOLOv5s_DSC vehicle detection algorithm can effectively detect and correctly classify the vehicles in this view. Each detection box contains two pieces of information: the vehicle category name and the category confidence. In the hardware environment shown in Table 6, the detection speed of the algorithm reaches 77 FPS. A video of traffic flow captured from an oblique side of the intersection is then selected as input, and the YOLOv5s_DSC vehicle detection algorithm can likewise effectively detect and correctly classify the vehicles in this view, as shown in Figure 12. YOLOv5s_DSC can accurately detect vehicles from different angles. As shown in Figure 12(b), local mutual occlusion between vehicles does not affect the detection performance. Therefore, the YOLOv5s_DSC algorithm can provide realtime and accurate vehicle detection information for vehicle tracking.

To test the vehicle tracking effect and the occlusion robustness of the algorithm, YOLOv5s_DSC is connected to DeepSORT as its detector. As shown in Figure 13, the tracking boxes of different types of vehicles have different colors, and each tracking box includes a tracking ID in addition to the vehicle category and category confidence. In the hardware environment shown in Table 6, the YOLOv5s_DSC + DeepSORT algorithm achieves a processing speed of 25 FPS.
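At a high level, the combined detector + tracker loop looks like the sketch below; `detector` and `tracker` stand for YOLOv5s_DSC and DeepSORT wrappers whose interfaces (`t.box`, `t.color`, `t.class_name`, `t.track_id`) are illustrative, not the actual implementation.

```python
import cv2

def run_tracking(video_path, detector, tracker):
    # detector(frame) -> list of (box, class, confidence);
    # tracker.update(detections, frame) -> tracks with stable IDs.
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        detections = detector(frame)
        for t in tracker.update(detections, frame):
            x1, y1, x2, y2 = map(int, t.box)
            cv2.rectangle(frame, (x1, y1), (x2, y2), t.color, 2)
            cv2.putText(frame, f"{t.class_name} ID {t.track_id}",
                        (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.6,
                        t.color, 2)
        cv2.imshow("tracking", frame)
        if cv2.waitKey(1) == 27:             # Esc quits
            break
    cap.release()
    cv2.destroyAllWindows()
```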

Next, the robustness of the proposed algorithm is verified in occlusion scenes. Two occlusion situations are considered: (1) the target is occluded by foreign objects and (2) the targets occlude each other. First, the robustness of the proposed algorithm is verified when the target is occluded by external objects, testing rerecognition and retracking after the target disappears. A traffic video in which vehicles are blocked by a pillar is used to verify the algorithm. Figure 14 shows four consecutive images. The dark car with tracking ID 3 reappears after being blocked by a pillar and can still be tracked by the algorithm. This result shows that the algorithm exhibits strong robustness and accuracy in occluded scenes, providing strong support for target tracking in practical applications.

We also evaluate the algorithm's performance when targets are occluded by other targets, specifically its ability to track targets that are partially occluded by other targets. A video sequence in which a bus partially occludes a tracked car is selected. As shown in Figure 15, the vehicle with tracking ID 4 is partially blocked by the bus with tracking ID 1. Despite the occlusion, the tracking ID of the car remains unchanged, showing that the proposed algorithm can handle partial occlusions between targets. These results further demonstrate the robustness and effectiveness of the algorithm in occluded scenes, which is crucial for practical target tracking applications.

5. Conclusions
This article investigates the application of the DeepSORT algorithm in vehicle tracking, using traffic flow videos from different scenarios to verify the effectiveness and robustness of the YOLOv5s_DSC vehicle detection algorithm. The YOLOv5s_DSC + DeepSORT algorithm is validated on traffic flow videos in which vehicles disappear behind obstacles and occlude each other. The results show that the algorithm has good rerecognition and retracking ability and is robust against partial occlusion of targets. However, the algorithm in this article does not take into account detection in different weather environments such as rain, fog, and blurred vehicle video. In future work, model compression methods will be studied to further compress the network model while maintaining accuracy and improving inference speed, and algorithms such as environment optimization will be combined to support more application scenes.
Data Availability
The data used to support the findings of this study are available upon request from the corresponding author.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Authors’ Contributions
Lixiong Lin conceptualized the study, proposed the methodology, validated the study, and wrote and prepared the original draft; Hongqin He and Dongjie Wu performed formal analysis, performed investigation, provided resources, and performed data curation; Zhiping Xu wrote, reviewed, and edited the manuscript, visualized the study, and performed project administration. All authors have read and agreed to the published version of the manuscript.
Acknowledgments
The research was supported by the Jimei University Startup Research Project, China, under grant ZQ2022002, by the Education Department Foundation of Fujian Province, China, under grant JAT220169, by the Xiamen Key Laboratory of Marine Intelligent Terminal R&D and Application, China, under grant B18208, and by the Xiamen Ocean and Fishery Development Special Fund Project, China, under grant 21CZB013HJ15.