Abstract
3D multiobject tracking (MOT) is an important component of road condition detection and hazard warning algorithms in roadside systems and autonomous driving systems. A persistent problem in 3D MOT is that the identity of an occluded object switches after it reappears. Motivated by the good performance of 2D MOT, this paper proposes a deep-learning-based 3D MOT algorithm. First, a 3D object detector is used to obtain oriented 3D bounding boxes from point clouds. Second, a 3D Kalman filter is used for state estimation, and a reidentification algorithm is used to match feature similarity. Finally, data association is performed with the Hungarian algorithm. Experiments show that the proposed method can match the original trajectory after an occluded object reappears and runs at 59 FPS, achieving advanced results among existing 3D MOT systems.
1. Introduction
With the rapid development of computer vision, image processing, and other technologies, together with the emergence of deep learning, the field of object detection has advanced greatly. From the high accuracy of the two-stage R-CNN [1], Fast R-CNN [2], and Faster R-CNN [3] to the high speed of the one-stage YOLO [4], YOLOv2 [5], YOLOv3 [6], and SSD [7], and from anchor-based methods [8, 9] to anchor-free methods [10, 11], object detection has made great progress in both accuracy and speed. The development of object detection has also driven progress in related fields, including object tracking. Multiobject tracking is a branch of object tracking that is closely related to the development of object detection [12]. Object tracking algorithms are divided into single-object tracking algorithms [13] and multiobject tracking algorithms [14]. Single-object tracking algorithms are widely used in monitoring and navigation systems. Among them, SiamMask [15] requires only an initialization frame and can then generate object segmentation masks and bounding boxes in the video at up to 35 FPS. SiamRPN++ [16] develops a Siamese tracker based on the ResNet architecture. Chen et al. [17] proposed a multiscale fast correlation filtering tracking algorithm based on a feature fusion model. Zhang et al. [18] exploited spatial and semantic convolutional features extracted from convolutional neural networks for continuous object tracking. Multiobject tracking is widely used in autonomous driving systems because it can associate object detection results over time without switching the identities of multiple targets [19, 20]; an autonomous driving system can estimate the location of an object with the tracking algorithm and avoid accidents. Among MOT algorithms, simple online and realtime tracking (SORT) [21] adopts the Kalman filter and the Hungarian matching algorithm to track objects, which yields fast and accurate tracking, but it may switch the ID of an occluded object after it reappears. To reduce the frequency of ID switches, simple online and realtime tracking with a deep association metric (DeepSORT) [22] was proposed. DeepSORT retains the advantages of SORT and compensates for its defects by adding a pedestrian reidentification network, extracting pedestrian features, and matching feature similarity. Ristani and Tomasi [23] put forward the DeepCC algorithm, and Tang et al. [24] put forward the LMP algorithm; both use reidentification to improve tracking performance by matching the similarity of trajectories. To enhance robustness to the complicated changes of multiple objects and complex background scenes, Chen et al. [25] proposed a visual object tracking algorithm based on an adaptive combination kernel. Tracking also has many other applications, such as tracking in basketball games [26].
Although the related algorithms have become increasingly mature in image processing, image-based object detection cannot escape the limitations of two-dimensional data, and the drawbacks of such data lead to many problems in the algorithms. For example, object detection and tracking algorithms are greatly affected by light and by rain, snow, and haze. Under such conditions, detection accuracy is low, and the recognition results are two-dimensional, containing neither distance nor volume. In contrast, the point clouds acquired by LiDAR are little affected by light and carry distance and volume information, which can overcome the problems above and compensate for the shortcomings of image processing. In recent years, with the decrease of LiDAR cost, more and more researchers have used LiDAR to replace the camera for object detection. Unlike images, however, point clouds are sparse and disordered spatial points, so mature image-processing algorithms cannot be applied to them directly. To solve this problem, many researchers adopted projection methods [27–31] that project 3D objects into multiple views and fuse the features of each view for detection and recognition. The projection method provides a way to transform point clouds into an image-processing problem, but a large number of projections increases computation, while reducing the number of projections loses information. Wu et al. [32] and Le and Duan [33] applied the idea of voxelization to voxelize the point clouds and process them directly, which improved the efficiency of object detection. The development of point cloud object detection also promoted the development of point cloud tracking algorithms. Weng and Kitani [34] extended the two-dimensional SORT to three dimensions and proposed the AB3DMOT algorithm, which performed well on the KITTI dataset [35]. In order to improve the performance of point cloud multiobject tracking and recover the ID information of occluded objects, we combine a pedestrian reidentification algorithm with a 3D Kalman filter and apply them to point clouds. Our contributions are as follows:
(i) The deep-learning-based tracking algorithm from image processing is introduced into point cloud tracking, and a tracking algorithm model based on deep learning is established.
(ii) The proposed model uses the three-channel image composed of the bird's eye view (BEV), density, and intensity maps of the point cloud to train a point cloud reidentification network. The two-dimensional features of the three-channel image are extracted by the reidentification network and cascade-matched with the IOU location features.
(iii) The proposed model performs well in point cloud tracking: the original trajectory can be matched again after occlusion. The proposed model provides a new baseline for point cloud tracking algorithms.
2. Related Works
2.1. 3D Object Detection
3D object detection is an indispensable part of 3D object tracking, and the quality of the detected 3D bounding boxes strongly affects tracking performance. 3D object detection methods can be divided into four categories: image processing methods, voxel-based methods, point-based methods, and fusion methods. Li et al. [36] projected the 3D point cloud onto a 2D image and then used a 2D end-to-end fully convolutional neural network to predict target confidence and 3D bounding boxes through bounding box encoding. Simon et al. [37] transformed point clouds into BEV, density, and intensity maps and used image processing methods for 3D detection. Zhou and Tuzel [38] proposed VoxelNet, which divides point clouds into voxels, uses the voxel feature encoding (VFE) layer to encode features uniformly, and finally applies a region proposal network (RPN) for category classification and 3D bounding box regression. Building on VoxelNet, Yan et al. [39] proposed sparsely embedded convolutional detection (SECOND), which uses sparse convolution to further improve detection accuracy. Qi et al. [40] put forward PointNet, which consumes point clouds directly; it adopts a spatial transformation matrix to align the point clouds and a combined convolutional neural network (CNN) to obtain good results in object segmentation and detection, outperforming two-dimensional image processing for this task. Later, to address the shortcomings of PointNet, Qi et al. [41] put forward PointNet++. Shi et al. [42] put forward PV-RCNN by combining the advantages of voxel-based and point-based methods and achieved the highest score on the KITTI benchmark. In addition, there are multisensor fusion methods: MV3D [43] fuses the BEV and front view of the point cloud with the RGB image; AVOD [44] fuses RGB images with a six-channel BEV map consisting of five equal-height slices and a density map; and F-ConvNet [45] uses 2D regions for end-to-end estimation of bounding boxes in 3D space.
2.2. 3D MOT
The difference between 3D MOT and 2D MOT is that the objects tracked in 3D MOT are three-dimensional and carry height and distance information. Osep et al. [46] proposed a 2D-3D Kalman filter that jointly uses images and the 3D world coordinate system. Baser et al. [47] proposed an online multiobject tracking method based on CNNs. Hu et al. [48] used a long short-term memory (LSTM) learning module to predict long-term motion more accurately. Frossard and Urtasun [49] formulated the problem as a linear programming problem and adopted CNNs for end-to-end detection and matching. Zhang et al. [50] put forward mmMOT to encode point clouds in the data association process and realized the fusion of multimodal data. Shenoi et al. [51] developed JRMOT, which uses a two-dimensional RGB image and a three-dimensional point cloud: the point cloud is used for detection, the RGB image is used for CNN-based reidentification, and multiobject tracking is then achieved. Because the camera shooting angle results in occlusion of objects in an RGB image, we combine the bird's-eye view of the point cloud with the CNN-based reidentification method to match similarity and use the three-dimensional Kalman filter to predict the three-dimensional information of the object's movements.
3. Materials and Methods
According to the characteristics of point clouds, 2D and 3D information are processed separately. We use a 3D Kalman filter to predict the 3D coordinate information of the point cloud objects and extract features of the bird's-eye view with the reidentification network. Our system uses a three-dimensional object detection network such as SECOND to obtain the coordinate information X, Y, Z, L, W, H, and θ; these seven parameters represent the coordinates of the center point, the length, width, and height, and the heading angle of the box. The detection results are transformed into 2D bounding boxes in the three-channel image composed of the BEV, density, and intensity maps and then sent to the reidentification network to extract features. X, Y, Z, L, W, H, and θ are used for state prediction and trajectory matching by the 3D Kalman filter. Finally, the results of feature matching and 3D Kalman filter matching are combined to output the ID information of the current detection results. The flow chart is shown in Figure 1.
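As a concrete illustration of the data association step, the following Python sketch combines appearance features with a BEV location cost and solves the assignment with the Hungarian algorithm. It is a minimal sketch under stated assumptions: the weighting factor lam, the gating threshold max_cost, the axis-aligned IoU simplification, and the combined (rather than cascaded) cost are illustrative choices, not the paper's exact implementation.

```python
# Minimal association sketch (assumed names and thresholds, not the paper's exact code).
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def bev_iou(box_a, box_b):
    """Axis-aligned IoU of two BEV boxes (x, y, l, w); heading is ignored for simplicity."""
    ax1, ax2 = box_a[0] - box_a[2] / 2, box_a[0] + box_a[2] / 2
    ay1, ay2 = box_a[1] - box_a[3] / 2, box_a[1] + box_a[3] / 2
    bx1, bx2 = box_b[0] - box_b[2] / 2, box_b[0] + box_b[2] / 2
    by1, by2 = box_b[1] - box_b[3] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def associate(det_boxes, det_feats, trk_boxes, trk_feats, lam=0.7, max_cost=0.8):
    """Combine appearance (cosine distance) and location (1 - IoU) costs, then match."""
    cost = np.zeros((len(det_boxes), len(trk_boxes)))
    for i in range(len(det_boxes)):
        for j in range(len(trk_boxes)):
            app = 1.0 - np.dot(det_feats[i], trk_feats[j]) / (
                np.linalg.norm(det_feats[i]) * np.linalg.norm(trk_feats[j]) + 1e-12)
            loc = 1.0 - bev_iou(det_boxes[i], trk_boxes[j])
            cost[i, j] = lam * app + (1.0 - lam) * loc
    rows, cols = linear_sum_assignment(cost)
    # Reject assignments whose combined cost is too large (unmatched -> new or lost track).
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]
```

Detections matched to a track inherit its ID and update its Kalman state; unmatched detections start new tracks, as in SORT-style trackers.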

3.1. 3D Object Detection
With the rapid development of 3D object detection, many 3D object detectors have obtained good results on the KITTI dataset. We use advanced 3D detectors on the KITTI dataset for our experiments and directly use their detection results to test tracking performance. A detection result D is obtained by high-precision 3D object detection; D includes {X, Y, W, L, θ, Z, H, S}, where S represents the detection score. Dt is the detection result of frame t, and Dt = {Dt1, Dt2, …, Dtn}, where n is the number of objects detected. Considering detection speed and effect, we choose SECOND as the three-dimensional detector of our tracking system. SECOND uses sparse convolution to significantly improve the speed of training and inference. The structure of SECOND is shown in Figure 2, and its detection performance is shown in Figure 3.
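For illustration, a detection result D of this form can be held in a small container like the following; the field names and the Python representation are assumptions chosen to mirror the notation {X, Y, W, L, θ, Z, H, S}, not part of the original system.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Detection3D:
    """One oriented 3D box from the detector, mirroring D = {X, Y, W, L, theta, Z, H, S}."""
    x: float      # center x in the LiDAR frame (m)
    y: float      # center y (m)
    w: float      # width (m)
    l: float      # length (m)
    theta: float  # heading angle (rad)
    z: float      # center z (m)
    h: float      # height (m)
    score: float  # detection confidence S

# D_t: all detections of frame t, D_t = [D_t1, D_t2, ..., D_tn]
FrameDetections = List[Detection3D]
```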


3.2. 3D Kalman Filter
In order to describe the moving object, we use the Kalman filter to predict its state in the next frame. It predicts the position in the current frame from the historical position of the target and establishes the following state equation, equation (1), for each target:

$T = (x, y, z, \theta, l, w, h)$,  (1)

where $x$, $y$, and $z$ are the coordinates of the object in the point cloud, $\theta$ denotes the course angle, and $l$, $w$, and $h$ denote the length, width, and height of the object, respectively.
By observing the movement law of vehicles and the characteristics of targets in the point cloud, we find that the height and z coordinate of vehicles and pedestrians hardly change during movement. In order to reduce the amount of calculation and improve performance, we ignore the height H and the Z coordinate. In the experiments, we also find that adding the angle causes the radian of the predicted target to increase and the target's angle to be flipped. Therefore, the final state model we use is as follows:
The state of a detection result can be expressed as follows:
The predicted state equation can be expressed as follows:
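For concreteness, a constant-velocity Kalman model consistent with the reductions described above can be written as in the following sketch; the state layout (x, y, l, w, vx, vy), the measurement vector (x, y, l, w), the filterpy dependency, and the noise settings are assumptions rather than the paper's exact formulation.

```python
import numpy as np
from filterpy.kalman import KalmanFilter  # assumed dependency, as in SORT-style trackers

def make_track_filter(x, y, l, w):
    """Constant-velocity Kalman filter over the reduced state (x, y, l, w, vx, vy)."""
    kf = KalmanFilter(dim_x=6, dim_z=4)
    dt = 1.0  # one LiDAR frame
    # State transition: positions advance by velocity; sizes and velocities stay constant.
    kf.F = np.array([[1, 0, 0, 0, dt, 0],
                     [0, 1, 0, 0, 0, dt],
                     [0, 0, 1, 0, 0, 0],
                     [0, 0, 0, 1, 0, 0],
                     [0, 0, 0, 0, 1, 0],
                     [0, 0, 0, 0, 0, 1]], dtype=float)
    # Measurement model: the detector observes (x, y, l, w) only.
    kf.H = np.eye(4, 6)
    kf.R *= 0.1           # measurement noise (assumed)
    kf.P[4:, 4:] *= 10.0  # high initial uncertainty on the unobserved velocities
    kf.Q[4:, 4:] *= 0.01  # process noise on velocities (assumed)
    kf.x[:4, 0] = [x, y, l, w]
    return kf

# Per frame: kf.predict(); if the track is matched to a detection d,
# kf.update(np.array([d.x, d.y, d.l, d.w])).
```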
3.3. Point Cloud Reidentification
Point clouds differ from images in that they have no fine-grained features, and fine details are difficult to distinguish. Although RGB images can be used for reidentification and provide a large number of fine-grained features, they have some problems: objects in the image may be occluded, and the farther the distance, the smaller the target and the less distinctive its features, to the point where they are difficult to distinguish. In contrast, the bird's-eye view of the point cloud has a large field of view and no occlusion of objects, which is conducive to reidentification and avoids the problems existing in images.
Reidentification matches feature similarity within a trajectory, so that when an object reappears after being occluded, the original trajectory can be recovered by comparing against the features stored in the trajectory, whereas the traditional matching method causes an ID jump. We use the three-channel image composed of the BEV map, density map, and intensity map of the point cloud in place of the RGB image for feature extraction. Because the point cloud coordinate system differs from the image coordinate system, equation (5) is used to convert point cloud coordinates to image coordinates, and the transformation diagram is shown in Figure 4:

$x' = x + h, \quad y' = y + w$,  (5)

where $x$ and $y$ represent coordinates in the point cloud coordinate system, $h$ denotes the distance from the point cloud boundary to the $y$-axis, $w$ is the distance from the point cloud boundary to the $x$-axis, and $x'$ and $y'$ represent the coordinates in the image coordinate system.

After the coordinate transformation, the heights of the points are mapped to pixel values to obtain the bird's-eye view, and then the intensity values of the corresponding points in the BEV map are mapped into the intensity map. Finally, the density value of the point cloud at each image location is calculated with equation (6). The resulting three-channel picture is shown in Figure 5:

$\rho_i = \dfrac{N_i - N_{\min}}{N_{\max} - N_{\min}}$,  (6)

where $\rho_i$ represents the density of the $i$th location point, $N_i$ represents the number of points at the $i$th location point, $N_{\min}$ is the minimum number of points at a location point, and $N_{\max}$ denotes the maximum number of points at a location point.
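The following numpy sketch illustrates how such a three-channel image can be rasterized from a point cloud, with the height mapped to the BEV channel, the intensity mapped to the second channel, and a min-max-normalized point count forming the density channel; the detection range, resolution, and per-cell aggregation are assumptions and may differ from the settings used in this work.

```python
import numpy as np

def point_cloud_to_three_channel(points, x_range=(-40.0, 40.0),
                                 y_range=(-40.0, 40.0), res=0.1):
    """points: (N, 4) array of (x, y, z, intensity) in the LiDAR frame."""
    h_off, w_off = -x_range[0], -y_range[0]   # distances from the boundary to the axes
    cols = int((x_range[1] - x_range[0]) / res)
    rows = int((y_range[1] - y_range[0]) / res)
    img = np.zeros((rows, cols, 3), dtype=np.float32)
    counts = np.zeros((rows, cols), dtype=np.int32)

    # Keep only points inside the chosen range.
    m = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
         (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[m]

    # Shift point-cloud coordinates into non-negative image coordinates (cf. equation (5)).
    u = ((pts[:, 0] + h_off) / res).astype(int)
    v = ((pts[:, 1] + w_off) / res).astype(int)

    # Channel 0: maximum height per cell (negative heights clipped to zero in this sketch);
    # channel 1: maximum intensity per cell.
    for ui, vi, zi, ii in zip(u, v, pts[:, 2], pts[:, 3]):
        img[vi, ui, 0] = max(img[vi, ui, 0], zi)
        img[vi, ui, 1] = max(img[vi, ui, 1], ii)
        counts[vi, ui] += 1

    # Channel 2: min-max normalized point density (cf. equation (6)).
    n_min, n_max = counts.min(), counts.max()
    if n_max > n_min:
        img[:, :, 2] = (counts - n_min) / float(n_max - n_min)
    return img
```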

The patch covered by the 2D bounding box is cropped from the converted three-channel image, resized to 128 × 64 × 3, and sent to the reidentification network built on ResNet18 for feature extraction. The reidentification network matches the similarity between the current detection box and the bounding boxes saved in the trajectory to find the trajectory of the target. The input size of the reidentification network is 128 × 64 × 3, and the output feature vector is 1 × 512, as shown in Figure 5.
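The following PyTorch sketch shows one way to obtain a 1 × 512 feature from a 128 × 64 × 3 crop with a ResNet18 backbone, matching the sizes stated above; the use of torchvision's resnet18 and the feature normalization step are assumptions about the implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ReIDNet(nn.Module):
    """ResNet18 backbone with the classification head removed,
    so the output is the 512-dimensional global feature."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18()  # random initialization; trained on BEV crops here
        # Keep everything up to (and including) global average pooling.
        self.features = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, x):            # x: (B, 3, 128, 64)
        f = self.features(x)         # (B, 512, 1, 1)
        f = torch.flatten(f, 1)      # (B, 512)
        return nn.functional.normalize(f, dim=1)  # unit-length features for cosine matching

# Example: extract a feature for one cropped three-channel patch.
net = ReIDNet().eval()
crop = torch.rand(1, 3, 128, 64)     # placeholder for a resized BEV/intensity/density crop
with torch.no_grad():
    feature = net(crop)              # shape (1, 512)
```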
4. Results and Discussion
The experiments in this paper are conducted on Ubuntu 16.04 with a 1080 Ti GPU. We use the KITTI MOT dataset and a roadside 32-line LiDAR dataset collected by our company for evaluation. There are 21 sequences in the KITTI training set and 29 sequences in the test set, which include point clouds, images, and camera parameters. Since there is no label information in the KITTI test set, we use 8008 frames of the training set for testing and, following reference [52], use sequences 1, 6, 8, 10, 12, 13, 14, 15, 16, 18, and 19 for validation. In order to train the point cloud reidentification network, we use the labels in sequences 0, 2, 3, 4, 5, 7, 9, 11, and 20 to extract 354 tracks and convert them into the three-channel images composed of BEV, intensity, and density maps. Part of the training data for the reidentification network is shown in Figure 6.

Tables 1 and 2 show the comparative tracking results obtained using the object detection results provided by AB3DMOT. Because the roadside 32-line LiDAR dataset lacks labels, this paper only shows a qualitative comparison of the recognition effect for it.
It can be seen from Tables 1 and 2 that our method performs better than the FANTrack method. Since our method is mainly intended for the roadside, where occlusion and reappearance occur frequently, and such cases rarely occur in the KITTI dataset, the advantages of our method are not fully reflected on KITTI, where its scores are slightly lower than those of AB3DMOT. In order to show that our method can match the original trajectory and to highlight the advantage of the reidentification network, we examine the occlusions in frames 354 to 360 and frames 372 to 379 of the first sequence of the KITTI dataset. In Figure 7, the vehicle with id 222 in AB3DMOT jumps to id 252 after occlusion, while the id assigned by our method remains 187 after occlusion. In Figure 8, the vehicle with id 262 in frame 372 reappears with id 275 after occlusion in the AB3DMOT method, while our method keeps id 204 throughout.


Figure 9 shows a segment of the roadside data. In our method, the ids of the two objects numbered 4004 and 3985 remain unchanged after occlusion, while id switches occur for the corresponding vehicles in the AB3DMOT method. Whether on KITTI data or roadside data, our method keeps the id number after occlusion, which reflects the advantage of the reidentification method of matching by features when distance information is lacking.

5. Conclusions
This paper introduced a reidentification algorithm into point cloud tracking on the basis of 2D MOT and proposed a deep-learning-based 3D MOT algorithm. We use an object detector to obtain the 3D bounding box of the target, then use a 3D Kalman filter for state estimation combined with the reidentification algorithm to match feature similarity, and finally use the Hungarian algorithm for data association. Our approach achieves competitive results on the KITTI dataset and is even more prominent on the roadside dataset. We believe that our method can be widely used in self-driving and roadside assisted driving.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
The research was supported by the National Key R&D Program for the 13th Five-Year Plan of China (2018YFF0300305 under 2018YFF0300300).