Abstract
Mixed traffic is a common phenomenon in urban environments. In mixed traffic, the detection of traffic obstacles, including motor vehicles, non-motor vehicles, and pedestrians, is an essential task for intelligent and connected vehicles (ICVs). In this paper, an improved YOLO model is proposed for traffic obstacle detection and classification. The YOLO network is used to accurately detect traffic obstacles, while a Wasserstein distance-based loss is used to reduce misclassifications that may cause serious consequences. A newly established dataset containing four types of traffic obstacles, namely vehicles, bikes, riders, and pedestrians, was collected under different time periods and weather conditions in the urban environment of Wuhan, China. Experiments are performed on the established dataset on a Windows PC and an NVIDIA TX2, respectively. The experimental results show that the improved YOLO model has a higher mean average precision than the original YOLO model and can effectively reduce intolerable misclassifications. In addition, the improved YOLOv4-tiny model achieves a detection speed of 22.5928 fps on the NVIDIA TX2, which can basically realize real-time detection of traffic obstacles.
1. Introduction
Traffic conditions in urban areas can be highly complex, since vehicles, pedestrians, and riders may share the same road, especially in developing countries. In recent years, the rapid rise of the bike-sharing and food-delivery industries has aggravated this phenomenon to a certain extent. The coexistence of vehicles, bikes, riders, and pedestrians poses great challenges to driving safety in urban areas. The detection and classification of vehicles, bikes, riders, and pedestrians is therefore essential for ICVs [1].
Vision-based object detection and classification is an important means of achieving traffic obstacle detection and classification. In traditional object detection methods, machine learning techniques such as the scale-invariant feature transform (SIFT) [2] and the histogram of oriented gradients (HOG) [3] extract object features, which are then fed into classifiers such as the support vector machine (SVM) [4] and AdaBoost [5]. The design of these features can be very complicated; in particular, they are handcrafted, and their performance is task-dependent, which does not scale to large applications and can hardly be generalized. At this stage, traditional machine learning object detection methods can barely meet the requirements of practical applications, so new object detection methods are needed. With the development of deep learning, many deep learning techniques have been applied to object detection, among which the deep convolutional neural network (CNN) [6] is the most prominent. Unlike traditional feature extraction algorithms that rely on domain knowledge, CNNs have been shown to be robust to geometric transformation, deformation, and illumination, thus effectively overcoming the difficulties caused by the variability of non-motorized vehicle appearance. They can also adaptively capture complex feature patterns by learning from data, which gives them high flexibility and generalization ability. Many deep learning-based object detection methods have been proposed in recent years, divided into one-stage and two-stage methods, as shown in Figure 1 [7]. One-stage detection algorithms, such as YOLO [8], SSD [9], and RetinaNet [10], do not need to predict region proposals; they directly generate the labels and locations of objects, and the final detection result is obtained in an end-to-end manner after a single pass, so detection is faster. In contrast, two-stage detection algorithms divide the detection problem into two stages: first, region proposals are generated, and then the proposals are classified; in most cases, the predicted positions are also refined. A typical example of the two-stage approach is the R-CNN family of algorithms, which is based on region proposals and includes R-CNN [11], SPPNet [12], Fast R-CNN [13], Faster R-CNN [14], and FPN [15].

Among the aforementioned state-of-the-art object detection algorithms, YOLOv3 [16] and YOLOv4 [17] are arguably the most promising. Proposed by Redmon et al. in 2018 and by Bochkovskiy et al. in 2020, respectively, YOLOv3 and YOLOv4 offer both high detection speed and high accuracy and can be used for the detection and classification of traffic obstacles. Researchers have conducted many studies on traffic obstacle detection based on YOLO [18–22]. Wang et al. [18] used YOLOv3 to detect vehicles, pedestrians, and non-motor vehicles, improving the detection accuracy. Narayanan et al. [19] proposed a model combining HOG and the YOLO algorithm for pedestrian detection in thermal images. Hung et al. [20] performed real-time obstacle detection with the YOLO model on an embedded system. Wang [21] proposed a real-time vehicle detection algorithm that integrates vision and lidar point cloud information, achieving high detection accuracy and good real-time performance. Arvind et al. [22] developed a near-range obstacle sensing system based on a vision sensor, which can ensure early detection and tracking of obstacles. Zhang et al. [23] proposed a classification method for four classes of moving objects using 3D point clouds, which recognized moving objects effectively. Feng et al. [24] presented a 32-layer multibranch method for object detection in traffic scenes, which achieved state-of-the-art performance. Li et al. [25] proposed an improved multivehicle detection method considering traffic flow, which achieved good performance and robustness. Wang et al. [26] presented a vision-based crash detection framework for mixed traffic flow environments, which achieved a high detection rate with a relatively low false alarm rate. Cai et al. [27] presented an improved framework for object detection based on YOLOv4. Hnewa et al. [28] outlined state-of-the-art frameworks for object detection under rainy conditions. Liu et al. [29] proposed a radar and camera information fusion method for object recognition. Bell et al. [30] presented a real-time system for night-time vehicle detection. Satyanarayana et al. [31] proposed a vehicle detection method for heterogeneous and lane-less traffic. However, the above studies seldom performed on-vehicle real-time detection and classification of traffic obstacles based on the target characteristics of real mixed traffic scenes, and both detection accuracy and real-time performance can be further improved.
In the task of traffic obstacle detection and classification, every misclassification is usually treated as equally costly. In actual applications, however, different misclassifications can have significantly different consequences for ICVs: some may only lead to minor mistakes, while others can bring disastrous consequences. To improve the safety of ICVs and avoid disastrous consequences caused by wrong predictions, one may need to assign different weights to different mislabelled results. Recently, the application of the Wasserstein distance in object detection systems has attracted much attention from the machine learning community [32]. The Wasserstein distance [33] is a measure of distance between probability distributions; combining it with the loss function of YOLO can effectively reduce the probability of producing intolerable misclassifications in ICVs, thereby reducing the safety risk caused by misclassification.
In this paper, an improved Wasserstein distance-based loss is proposed for the YOLO model. The main contributions of this paper can be summarized as follows:
(i) A new dataset, containing traffic obstacles including vehicles, bikes, riders, and pedestrians under different time periods and weather conditions in the urban environment of Wuhan, China, is collected and established for detection.
(ii) Based on the YOLO network, an improved model is designed for traffic obstacle detection. The Wasserstein distance-based loss, which assigns different weights to a sample being classified into different classes, so that misclassified objects are assigned to similar classes with higher probability, is combined with the loss function of YOLO to enhance the performance of traffic obstacle detection.
(iii) The improved model is deployed on an NVIDIA TX2 for real-time detection and compared with the original model. Empirical experiments show that the improved model produces more accurate and robust results than the original model, and its real-time performance can basically meet the requirements of real-time detection applications.
The remainder of this paper is organized as follows. In Section 2, the dataset collected in real scenes is described. Section 3 presents the Wasserstein loss-based YOLO model, including the network architecture of the designed model and the loss function for training it. The experimental results are reported in Section 4. Finally, the conclusions are presented in Section 5.
2. Dataset
2.1. Data Acquisition
In order to achieve accurate and efficient traffic obstacle detection, image data for traffic obstacles, including vehicles, bikes, riders, and pedestrians, were collected by a camera at a resolution of 1920 × 1080 pixels in Wuhan, China. The collection was conducted during different time periods, including daytime and nightfall, and under sunny and cloudy weather conditions. 496 images were selected as the original data to establish the dataset.
2.2. Data Classification
In the urban hybrid traffic scenario, vehicle, bike, rider, and pedestrian are the main traffic obstacles that affect the driving safety of intelligent and connected vehicles. Therefore, as shown in Figure 2, the detection objects in the collected data are divided into these four categories.

2.3. Data Augmentation
As shown in Figure 3, in order to enrich the dataset and enhance robustness, data augmentation operations including rotation and brightness transformation were performed on the image data. After augmentation, the dataset contains a total of 2976 images of hybrid traffic scenes.
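As an illustration, the following is a minimal sketch of these two operations using OpenCV; the rotation angles and brightness factors shown here are illustrative assumptions, not the exact values used to build the dataset.

    import cv2
    import numpy as np

    def rotate(image, angle_deg):
        # Rotate around the image center, keeping the original size.
        h, w = image.shape[:2]
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
        return cv2.warpAffine(image, m, (w, h))

    def adjust_brightness(image, factor):
        # Scale pixel intensities and clip to the valid 8-bit range.
        return np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8)

    image = cv2.imread("frame_0001.jpg")  # hypothetical 1920 x 1080 source frame
    augmented = [rotate(image, 10), rotate(image, -10),
                 adjust_brightness(image, 1.3), adjust_brightness(image, 0.7)]

Note that geometric augmentations such as rotation also require the bounding-box annotations to be transformed consistently.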

[Figure 3: examples of data augmentation, panels (a)–(f).]
2.4. Data Annotation
After the above processing, the dataset was manually labelled. In the images, objects with less than 50% of their contour visible and small targets that could not be seen clearly were not labelled. The detailed sample size of each labelled category is shown in Table 1.
3. Methodology
3.1. YOLO Model
In this paper, the YOLO-based detection models, including YOLOv3, YOLOv4, and YOLOv4-tiny, are established. In the YOLOv3 model [16], the image is divided into S × S grid cells, and the grid cell containing the center of an object is responsible for predicting that object. In view of the large number of vehicles, bikes, riders, and pedestrians in the urban hybrid traffic environment and their large variation in scale, the model uses a multi-scale fusion method to make predictions. The features of the three detection scales, with sizes of 13 × 13, 26 × 26, and 52 × 52, are fused so as to be compatible with both large and small objects.
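As a concrete illustration of the grid assignment, the following minimal sketch assumes the default YOLOv3 input resolution of 416 × 416 (an assumption; the paper does not state the input size used), for which the strides of 32, 16, and 8 give exactly the 13 × 13, 26 × 26, and 52 × 52 grids mentioned above.

    def responsible_cell(cx, cy, stride):
        # (cx, cy): object-center coordinates in input-image pixels.
        # The cell whose region contains the center makes the prediction.
        return int(cx // stride), int(cy // stride)

    for stride in (32, 16, 8):
        print(416 // stride, responsible_cell(200.0, 310.0, stride))
    # 13 (6, 9)   -> coarse grid, large objects
    # 26 (12, 19) -> medium grid
    # 52 (25, 38) -> fine grid, small objects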
The network is mainly composed of a series of 1 × 1 and 3 × 3 convolutional layers (each convolutional layer is followed by a BN layer and a LeakyReLU layer). Detection is performed at three points in the network, at downsampling factors of 32 (2^5), 16 (2^4), and 8 (2^3). After the 79th layer of the convolutional network, the feature map passes through several convolutional layers to produce the first scale of detection results; relative to the input image, this feature map is downsampled 32 times. Because of the high downsampling factor, its receptive field is relatively large, so it is suitable for detecting relatively large objects. To achieve fine-grained detection, the feature map of the 79th layer is upsampled and fused (by concatenation) with the feature map of the 61st layer to obtain the fine-grained feature map of the 91st layer, which also passes through several convolutional layers to yield a feature map downsampled 16 times relative to the input image; it has a medium-scale receptive field and is suitable for detecting medium-sized objects. Finally, the 91st layer feature map is upsampled again and fused with the 36th layer feature map to obtain a feature map downsampled 8 times relative to the input image. It has the smallest receptive field and is suitable for detecting small objects.
YOLOv4 [17] makes a series of improvements on the basis of YOLOv3, mainly the following: the backbone feature extraction network is changed from DarkNet53 to CSPDarkNet53 [34], the feature pyramid is replaced with SPP [35] and PAN [36], and the classification and regression head remains the same as in YOLOv3, among other changes.
The YOLOv4-tiny network is a simplified version of YOLOv4: a lightweight model with only about 6 million parameters, roughly one-tenth of the original. As shown in Figure 4, the overall network has 38 layers and uses three residual units; the activation function is LeakyReLU; the classification and regression of targets use two feature layers; and a feature pyramid network (FPN) is used when merging the effective feature layers. It also adopts the CSPNet structure and performs channel splitting on the feature extraction network: the feature map channels output by a 3 × 3 convolution are divided into two parts, and the second part is taken forward. The detection speed of the YOLOv4-tiny model is thus greatly improved, which makes it possible to deploy the model on mobile embedded terminals such as the NVIDIA TX2 for real-time detection.
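The channel-split idea can be sketched as follows. This is a minimal PyTorch sketch written from the description above, not the reference YOLOv4-tiny implementation (the real block also re-concatenates the branches, and the layer widths here are illustrative).

    import torch
    import torch.nn as nn

    class ChannelSplitBlock(nn.Module):
        # A 3 x 3 convolution whose output channels are split in two,
        # with only the second half processed further, as in the text.
        def __init__(self, channels):
            super().__init__()
            conv = lambda c_in, c_out: nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
                nn.BatchNorm2d(c_out),
                nn.LeakyReLU(0.1),
            )
            self.conv1 = conv(channels, channels)
            self.conv2 = conv(channels // 2, channels // 2)

        def forward(self, x):
            x = self.conv1(x)
            # Channel split: keep the second half for further convolution.
            _, second = torch.chunk(x, 2, dim=1)
            return self.conv2(second)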

3.2. Wasserstein Distance-Based Loss
To alleviate the undesirable consequences caused by misclassification, we propose to incorporate the Wasserstein distance into the framework of YOLO and apply it to ICVs. The Wasserstein distance is a metric for measuring the discrepancy or dissimilarity between probability measures, and it calculates the cost of moving one distribution to another [37]. For discrete distributions $P$ and $Q$, the Wasserstein distance between $P$ and $Q$ can be formulated as

$W(P, Q) = \min_{T \in \Pi(P, Q)} \sum_{i, j} T_{ij} M_{ij},$

where $\Pi(P, Q)$ is the set of all possible transport plans between $P$ and $Q$, and $M$ is the distance matrix, whose element $M_{ij}$ measures the distance between class $i$ and class $j$. In particular, an arbitrary transport plan $T \in \Pi(P, Q)$ has to satisfy

$\sum_{j} T_{ij} = P_i, \qquad \sum_{i} T_{ij} = Q_j, \qquad T_{ij} \ge 0,$

which implies that the transport plan can also be interpreted as a joint distribution of $P$ and $Q$. In other words, the Wasserstein distance finds the optimal joint distribution of $P$ and $Q$, the one that produces the minimal cost of transporting $P$ to $Q$. Compared with other distance metrics for probability measures such as the Kullback–Leibler divergence, Hellinger distance, and Jensen–Shannon divergence, the Wasserstein distance has some favourable geometric properties. Firstly, it is a valid distance metric, i.e., it is symmetric and non-negative and satisfies the triangle inequality and the identity of indiscernibles. Secondly, it can capture the geometry of the underlying space [38].
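For a small number of classes, the transport problem above can be solved directly as a linear program. The following sketch uses scipy to make the definition concrete; the 0-1 cost matrix at the end is purely illustrative (with a one-hot target, it reduces the distance to the probability mass placed off the true class).

    import numpy as np
    from scipy.optimize import linprog

    def wasserstein(p, q, m):
        # Minimize <T, M> subject to row sums = p, column sums = q, T >= 0.
        k = len(p)
        a_eq = []
        for i in range(k):                 # row-sum constraints
            row = np.zeros((k, k)); row[i, :] = 1
            a_eq.append(row.ravel())
        for j in range(k):                 # column-sum constraints
            col = np.zeros((k, k)); col[:, j] = 1
            a_eq.append(col.ravel())
        res = linprog(m.ravel(), A_eq=np.array(a_eq),
                      b_eq=np.concatenate([p, q]), bounds=(0, None))
        return res.fun

    p = np.array([0.1, 0.6, 0.2, 0.1])     # predicted class distribution
    q = np.array([0.0, 1.0, 0.0, 0.0])     # one-hot ground truth
    m01 = np.ones((4, 4)) - np.eye(4)      # illustrative 0-1 cost matrix
    print(wasserstein(p, q, m01))          # 0.4: mass placed off the target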
In object detection, we take the source distribution $P$ to be the predicted probability distribution over object labels and $Q$ to be the ground-truth label distribution. More specifically, in this paper, the discrepancy between the predictions produced by the classifier and the ground-truth labels is measured by the Wasserstein distance. In [39], it is proved that if either the source distribution or the target distribution is a one-hot histogram, there is only one possible transport plan, and the Wasserstein distance between the source distribution and the target distribution can be calculated by

$W(P, Q) = \sum_{i=1}^{C} M_{k, i} P_i,$

where $k$ is the index of the one-hot element in $Q$, $C$ is the number of object classes, and $M_{k, \cdot}$ represents the $k$-th row of the distance matrix $M$. The distance matrix, which specifies the distance between categories, needs to be predefined. In this paper, there are four categories in the dataset, namely, vehicle, bike, rider, and pedestrian. As discussed in the Introduction, different misclassifications may result in different consequences, and if the classifier is able to distinguish misclassifications of different severity, disastrous consequences can be avoided. For example, classifying a "bike" as a "rider" may not change the decision made by the autonomous driving system, as bike and rider largely share the same behaviour pattern. However, classifying a "bike" as a "vehicle" is very likely to have a significant influence on the decision-making process of a self-driving vehicle, not only because bike and vehicle are different objects but also because they are expected to have distinct trajectories. To prevent this undesirable problem, in the proposed method, the distance matrix is defined as in Figure 5.
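The one-hot case thus reduces to a single weighted sum. Because Figure 5 is not reproduced here, the matrix in the sketch below is an illustrative assumption only, chosen to reflect the principle just described: a small cost between behaviourally similar classes (bike and rider) and a large cost between dissimilar ones (bike and vehicle).

    import numpy as np

    CLASSES = ["vehicle", "bike", "rider", "pedestrian"]

    # Illustrative distance matrix (NOT the one in Figure 5): symmetric,
    # zero diagonal, small cost between similar classes, large otherwise.
    M = np.array([
        [0.0, 1.0, 1.0, 1.0],   # vehicle
        [1.0, 0.0, 0.1, 0.5],   # bike
        [1.0, 0.1, 0.0, 0.5],   # rider
        [1.0, 0.5, 0.5, 0.0],   # pedestrian
    ])

    def wasserstein_onehot(pred, true_idx):
        # W(P, Q) = sum_i M[k, i] * P[i] when Q is one-hot at index k.
        return float(M[true_idx] @ pred)

    pred = np.array([0.05, 0.15, 0.75, 0.05])   # classifier output for a "bike"
    print(wasserstein_onehot(pred, CLASSES.index("bike")))   # 0.15
    # Confusing bike with rider (0.75 of the mass) is penalized far less
    # than placing the same mass on vehicle would be.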

Denote by $\hat{l}$ and $l$ the predicted and ground-truth locations, by $\hat{c}$ and $c$ the predicted and ground-truth confidences, and by $\hat{p}$ and $p$ the predicted and ground-truth classes. In the original YOLOv3, the loss function is composed of three parts: the location loss $L_{loc}(\hat{l}, l)$, the confidence loss $L_{conf}(\hat{c}, c)$, and the classification loss $L_{cls}(\hat{p}, p)$. In this paper, we propose to use an additional loss, the Wasserstein loss, and thus the modified YOLO loss function becomes

$L = L_{loc}(\hat{l}, l) + L_{conf}(\hat{c}, c) + L_{cls}(\hat{p}, p) + \lambda W(\hat{p}, p),$

where $\lambda$ is a hyperparameter that controls the weight of the Wasserstein distance.
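A minimal sketch of how the extra term could be attached to the outputs of an existing YOLO implementation follows; the base losses are taken as given, and the default value of $\lambda$ below is an assumption (the paper does not report the value it uses).

    import torch

    def wasserstein_term(cls_logits, target_idx, dist_matrix):
        # cls_logits: (N, C) raw class scores; target_idx: (N,) ground-truth
        # class indices; dist_matrix: (C, C) predefined distance matrix.
        probs = torch.softmax(cls_logits, dim=-1)
        rows = dist_matrix[target_idx]      # k-th row of M for each sample
        return (rows * probs).sum(dim=-1).mean()

    def total_loss(loc_loss, conf_loss, cls_loss,
                   cls_logits, target_idx, dist_matrix, lam=0.1):
        # loc/conf/cls losses come from the existing YOLO implementation;
        # only the Wasserstein term is added on top, weighted by lambda.
        return (loc_loss + conf_loss + cls_loss
                + lam * wasserstein_term(cls_logits, target_idx, dist_matrix))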
4. Experimental Results
4.1. Experimental Environment
The models were trained and tested on a Windows PC with two Intel Xeon processors running at 3.5 GHz, 128 GB of DDR4 memory, and an NVIDIA GeForce RTX 2080 with 8 GB of memory. The established dataset was divided into training and test sets at a ratio of 9 : 1. During training, all but the three output layers were first frozen until the loss stabilized; the network was then unfrozen and training continued for fine-tuning. To avoid overfitting, training was terminated when the loss did not decrease within ten epochs. In addition, the original and improved YOLO models were run on an NVIDIA TX2 for real-time detection.
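In outline, this schedule corresponds to the following PyTorch-style sketch; only the freeze/unfreeze split and the ten-epoch patience come from the text, while the attribute name model.heads, the epoch counts, and the learning rates are illustrative assumptions.

    import math
    import torch

    def fit(model, loss_fn, loader, params, epochs, lr, patience=10):
        opt = torch.optim.Adam(params, lr=lr)
        best, stale = math.inf, 0
        for _ in range(epochs):
            epoch_loss = 0.0
            for images, targets in loader:
                opt.zero_grad()
                loss = loss_fn(model(images), targets)
                loss.backward()
                opt.step()
                epoch_loss += loss.item()
            # Early stopping: terminate when the loss has not decreased
            # within `patience` consecutive epochs.
            if epoch_loss < best:
                best, stale = epoch_loss, 0
            else:
                stale += 1
                if stale >= patience:
                    break

    def train(model, loss_fn, loader):
        # Stage 1: freeze all but the output layers (assumed here to be
        # grouped under a hypothetical `model.heads` module).
        for p in model.parameters():
            p.requires_grad = False
        for p in model.heads.parameters():
            p.requires_grad = True
        fit(model, loss_fn, loader,
            [p for p in model.parameters() if p.requires_grad],
            epochs=50, lr=1e-3)

        # Stage 2: unfreeze everything and fine-tune at a lower rate.
        for p in model.parameters():
            p.requires_grad = True
        fit(model, loss_fn, loader, model.parameters(), epochs=200, lr=1e-4)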
4.2. Evaluation Metric
In this study, the precision-recall curve (P-R curve), F1 score, and mean average precision (mAP) were used to evaluate the performance of the model.
The P-R curve plots the precision (P) on the ordinate against the recall (R) on the abscissa, where P is defined as

$P = \dfrac{TP}{TP + FP}$

and R is defined as

$R = \dfrac{TP}{TP + FN},$

where the definitions of TP, FN, and FP are shown in Table 2.
The F1 score, an index that combines the values of P and R to reflect the overall performance of the detection model, is defined as

$F1 = \dfrac{2PR}{P + R}.$

The area under the P-R curve is the average precision (AP), and the mean of the AP values over the four categories of obstacle objects in the hybrid traffic scene is the mAP:

$AP = \int_{0}^{1} P(R)\,dR, \qquad mAP = \dfrac{1}{C} \sum_{c=1}^{C} AP_c,$

where $C = 4$ is the number of categories.
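These quantities are straightforward to compute from detection counts; the sketch below uses 11-point interpolation for AP, which is one common convention (the paper does not state which convention it follows).

    import numpy as np

    def precision_recall_f1(tp, fp, fn):
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        return p, r, 2 * p * r / (p + r)

    def average_precision(recalls, precisions):
        # 11-point interpolated AP: average the maximum precision achieved
        # at recall >= t for t = 0.0, 0.1, ..., 1.0.
        ap = 0.0
        for t in np.linspace(0, 1, 11):
            mask = recalls >= t
            ap += precisions[mask].max() if mask.any() else 0.0
        return ap / 11

    # mAP is simply the mean of the per-class AP values:
    # map_value = np.mean([ap_vehicle, ap_bike, ap_rider, ap_pedestrian])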
4.3. Result of Designed Models on Established Dataset
In order to verify the detection performance of the designed models, the YOLOv3, YOLOv4, and YOLOv4-tiny models were trained and tested on the four categories of obstacle objects. The loss curves of the designed models are shown in Figures 6, 7, and 8, respectively.



It can be seen from the loss curves that the loss value of the improved model is higher than that of the original model at the beginning of training, and that the two become basically the same once the loss stabilizes. This is due to the addition of the Wasserstein distance-based loss in the improved model. The final loss values of the YOLOv3, YOLOv4, and YOLOv4-tiny models are about 24.5, 10, and 11.5, respectively.
The experimental results of the designed models are shown in Table 3, and the P-R curves are shown in Figure 9. It can be seen from the experimental results that the mAP of the improved YOLOv3, YOLOv4, and YOLOv4-tiny models is 98.57%, 98.19%, and 80.39%, respectively, slightly higher than that of each original model, while the F1 score of each improved model is basically the same as that of its original counterpart.

[Figure 9: P-R curves of the designed models on the established dataset, panels (a)–(f).]
4.4. Result of Designed Models on BDD Dataset
BDD is one of the most recently published autonomous driving datasets, featuring dense traffic scenes, and the detection performance of the designed models is also verified on it. In the BDD dataset, there are few objects in the bike and rider categories, so we selected data containing these two categories for testing in order to maintain a relative balance between the categories. The experimental results of the designed models are shown in Table 4, and the P-R curves are shown in Figure 10.

[Figure 10: P-R curves of the designed models on the BDD dataset, panels (a)–(f).]
It can be seen from the experimental results that the mAP of the improved YOLOv3, YOLOv4, and YOLOv4-tiny models is 92.97%, 91.23%, and 77.97%, respectively, higher than that of each original model, while the F1 score of each improved model is basically the same as that of its original counterpart. The detection mAP of the designed models on the BDD dataset is slightly lower than that on the established dataset. This is because the models were trained on the training set of the established dataset, whose scenes resemble those of its test set but differ from those of the BDD dataset. Nevertheless, the detection results on both datasets meet basic application requirements.
4.5. The Application-Oriented Performance on NVIDIA TX2
The NVIDIA TX2 is a mobile terminal that can be deployed directly on a vehicle. The vehicle application scenarios on the NVIDIA TX2 are shown in Figure 11. The trained improved and original models were deployed on the NVIDIA TX2, respectively, and then tested on the established dataset. In addition, an NVIDIA TX2 with a camera was installed on a vehicle for real-time detection to verify the detection effect and real-time performance of the proposed model.

The detection speed of the different models is shown in Table 5. As can be seen from the table, the detection speed of the improved YOLOv3 and YOLOv4 models is between 3 fps and 4 fps on the NVIDIA TX2 and between 8 fps and 9 fps on the Windows PC, which is somewhat poor for real-time use. The detection speed of the improved YOLOv4-tiny model is above 22 fps on the NVIDIA TX2 and above 27 fps on the Windows PC, which can basically realize real-time detection of traffic obstacles.
The real-time detection effect of the improved YOLOv4-tiny model was verified on the NVIDIA TX2 and compared with the original YOLOv4-tiny model. As shown in Figure 12, some misclassifications detected by the original model can be effectively and correctly classified by the improved model, proving that the improved model can effectively reduce intolerable misclassifications between different categories.

[Figure 12: real-time detection results of the original and improved YOLOv4-tiny models, panels (a)–(f).]
5. Conclusions
In this paper, an improved YOLO model for traffic obstacle detection and classification in ICVs is presented. A new dataset containing traffic obstacles collected under different time periods and weather conditions in an urban environment was established. The improved models, which reduce intolerable misclassifications and enhance the performance of traffic obstacle detection by combining the Wasserstein distance-based loss with the YOLO models, were designed and implemented. The improved models were trained and then tested on the established dataset and on selected BDD data and deployed on an NVIDIA TX2 for real-time detection.
Experimental results showed that the mAP values of the improved YOLOv3, YOLOv4, and YOLOv4-tiny models are 98.57%, 98.19%, and 80.39%, respectively, higher than those of each original model. In terms of application-oriented performance on the NVIDIA TX2, the detection speed of the improved YOLOv4-tiny model is 22.5928 fps, which is much better than that of the YOLOv3 and YOLOv4 models and basically meets the real-time detection requirements for traffic obstacles. In addition, in the real-time on-vehicle verification, the improved YOLOv4-tiny model reduced intolerable misclassifications between different categories more effectively than the original model. In practical applications, the improved model could effectively improve the accuracy of decision making for ICVs, thereby improving driving safety. In future work, the dataset could be enriched and the detection model further optimised.
Data Availability
The established dataset used to support the findings of this study is available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This research was supported in part by the National Key R&D Program of China under grant no. 2018YFB0105205 and in part by the Hubei Province Technological Innovation Major Project under grant no. 2019AAA025.