Abstract
There are few studies on the classification and dynamic detection of multiple targets in front of vehicles. To address this problem, a dynamic multitarget detection algorithm is proposed. First, the dynamic multitarget, i.e., a target that may be displaced at any time, is defined; second, a multitarget detection algorithm based on an improved You Only Look Once version 3 (YOLOv3) is proposed for detecting high-probability risk events involving multiple targets in front of the vehicle. The improved YOLOv3 model uses a lightweight backbone network suited to embedded real-time detection. In this paper, the lightweight MobileNetv2 replaces Darknet-53 for feature extraction, and group normalization and optimizer adjustments are applied for multitarget feature extraction. The results show that, compared with the original YOLOv3 algorithm, the missed detection rate of the improved YOLOv3 multitarget detection algorithm is less than 5%, the number of model parameters is reduced by 95%, and the inference time on the CPU is reduced to 78% of the original.
1. Introduction
The introduction of smart Internet of Things technology into traffic management systems is essential in light of the rapid expansion of urban modernization [1]. Effectively adapting to complex and changing traffic management needs is inseparable from the support of emerging technologies. Sophisticated computer systems, Internet of Things innovations, telecommunications, automatic control technology, and other technologies have been combined to create intelligent traffic management systems, whose features include real-time, accurate, and efficient monitoring. To achieve efficient traffic control, the objective of a smart transportation system is to create an evolutionary combination of the three parts of traffic: driver, vehicle, and road [2].
Target detection's major function and task is to accurately locate and recognize the relevant target object information in an image. The main purpose of the Internet of Things is to provide network capabilities to the equipment and supplies used in daily life, forming information networking and interconnection and maintaining the coverage of an information network service [3]. Vehicle signal recognition and vehicle information image acquisition in the Internet of Things mode mainly focus on dynamic multitarget acquisition, which is the mainstream improvement direction of multitarget detection algorithms at present [4]. Traditional target detection technologies, such as histogram of oriented gradients (HOG) features combined with a support vector machine (SVM), have two main problems: first, the sliding-window region selection strategy lacks pertinence and suffers from high time complexity and window redundancy [5]; second, the combination of manually designed features and object detection lacks robustness [6]. Multitarget detection can achieve a technical breakthrough and raise the intelligence level of automobile driving faster by using internationally advanced algorithms that build on the advantages of embedded systems and the deep convolutional neural networks of the Internet of Things [7]. For a unified algorithm architecture for joint pedestrian and vehicle recognition, highly significant region or redundancy strategies must be adopted to select specific targets [8]. Through online self-learning tracking and multitarget tracking methods, multitarget recognition in front of the vehicle can be realized, and the technology can automatically upgrade the recognition algorithm so that the system tracks targets more stably [9]. Some scholars have studied a basic network architecture combining the single-shot detector (SSD) network and visual geometry group (VGG)-16, replacing the backbone with the residual network ResNet-26, which improves the detection algorithm and the real-time accuracy of traffic detection [10]. Other scholars applied fully convolutional network technology to three-dimensional scanning data, combining the three-dimensional point dimension with a two-dimensional grid as the feature extraction method, forming different 2D end-to-end fully convolutional networks through candidate regions to detect the vehicle target and its bounding box, and achieved good results. Scholars of vehicle technology proposed designing a feature convolution kernel library composed of multiple forms and color Gabor kernels [11], training and screening the optimal feature extraction convolution kernel group to replace the low-level convolution kernel group of the original network, so as to improve detection accuracy [12]. In the upgrade from single-image detection to multitarget image detection, an adaptive threshold strategy is added to reduce the missed alarm rate and false alarm rate and realize target detection in complex traffic scenes [13]. In addition, there is innovative research on warning sign recognition for traffic hazards, which turns target detection in a complex traffic environment into detection together with signs, forming an effective network combination method.
The above studies propose relevant methods and specific improvement directions from different angles and technical levels. The main research value of a dynamic multitarget detection algorithm for the area in front of the vehicle, in an embedded system combined with the Internet of Things, lies in the following: much research focuses on how to improve the algorithm, and detection algorithms for multiple targets are continually upgraded, but any improvement must ensure that the multitarget detection algorithm meets the specific requirements and has operational value. Given the trend of traffic development, single-category target detection can no longer meet the needs of the traffic scene in front of the vehicle: traffic scenes have become complex and diverse, and detection accuracy still needs to be improved. In the next few years, road traffic will become more complex, and it is necessary to find approaches that use the Internet of Things to identify targets effectively and accurately. The Internet of Things will become more critical for identifying multiple targets, such as different kinds of vehicles, mixed pedestrians and objects, and electric motorcycles and other two-wheeled vehicles, which often appear and become dangerous targets; it is therefore necessary to upgrade and iterate the multitarget detection algorithms used in automatic driving technology.
Considering the problems stated above and according to the analysis of this paper, the targets to be inspected in front of the vehicle in a complex traffic environment are divided into dynamic targets and static targets. A dynamic target is one that may be displaced in front of the vehicle at any time; the main road users in this category include four-wheeled and two-wheeled vehicles, where cars, trucks, and buses fall into the four-wheeled category, while bicycles, motorcycles, and people are placed in the two-wheeled category. There are many potential safety problems for pedestrians and cyclists. A static target is one that will not be displaced in front of the vehicle; the road auxiliary reference here is the traffic signal. The single-stage YOLOv3 algorithm is taken as the base algorithm, and it is improved to address the problems that the model is large, is not suitable for embedded devices, and has a long inference time on the CPU.
The rest of this paper is organized as follows. In Section 2, the basic principle of the proposed algorithm and its optimization is presented. In Section 3, the experimental results are discussed. Finally, this paper is concluded in Section 4.
2. Algorithm Principle and Optimization
Multitarget detection algorithms based on deep learning can be divided, in terms of detection mode, into the two-stage algorithms represented by the regions with convolutional neural networks (R-CNN) series and the single-stage algorithms represented by the YOLO series. The two-stage detection concept is as follows: first, an algorithm generates candidate regions for location information [14]; second, classification produces the category information. According to the proposed model, the focus is on real-time detection [15]. The single-stage detection method introduces a different concept: the dynamic multitarget image in front of the vehicle is transformed directly into the network output [16] by regressing the bounding-box positions and categories in the output layer [17], transforming the multitarget detection problem into a regression problem and thereby improving the detection speed [18–20].
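To make the regression formulation concrete, the following is an illustrative calculation (not the paper's code; the grid size and anchor count are assumptions, while the class count matches the seven categories studied here). A single-stage detector's output layer is a tensor whose last dimension packs box coordinates, objectness, and class scores:

```python
# Shape of a YOLO-style output tensor: each of S x S grid cells
# regresses B boxes, each with (x, y, w, h, objectness) + C class scores.
S = 13   # grid size (assumed coarsest YOLOv3 scale)
B = 3    # anchor boxes per cell (assumed)
C = 7    # classes: car, bus, truck, bike, motor, rider, person

per_box = 5 + C                   # 4 coordinates + 1 objectness + C classes
output_shape = (S, S, B * per_box)
print(output_shape)               # (13, 13, 36)
```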
2.1. YOLOv3 Algorithm
YOLO is a series of algorithms, and YOLOv3 is its third version. This subsection describes the principle of the YOLOv3 algorithm; its network structure diagram is shown in Figure 1.

The YOLOv3 architecture does not use classic backbone network structures such as VGG-16 or ResNet-50. Instead, a network pretrained on the ImageNet classification task is used as the backbone for object detection; ImageNet is recognized as an authoritative data set for evaluating the capability of deep convolutional neural networks, and many new networks are developed against it to improve on existing ones. The backbone of YOLOv3 is the Darknet-53 feature extraction network, whose structure is shown in Figure 2. The basic YOLOv3 structure has neither pooling layers nor fully connected layers; in forward propagation, size transformation is realized by changing the stride of the convolution layers, so that each downsampling halves the side length of the image and reduces the area to 1/4 of its original size. After five downsampling operations, the feature map is 1/32 of the original image. In its construction, the YOLOv3 algorithm adopts the feature pyramid network (FPN) idea [21, 22], detecting targets of different sizes at multiple scales and outputting feature maps at three scales, 13 × 13, 26 × 26, and 52 × 52, which makes the detection effect of YOLOv3 significantly better than that of earlier YOLO versions.
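To make the downsampling arithmetic concrete, the following minimal tf.keras sketch (an illustration, not the authors' implementation; the 416 × 416 input size and filter counts are assumptions) shows how five stride-2 convolutions reduce the input to the 52 × 52, 26 × 26, and 13 × 13 grids used by the three detection scales:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # Stride-2 convolution replaces pooling: each call halves the
    # spatial side length (the area becomes 1/4 of the original).
    return layers.Conv2D(filters, 3, strides=2, padding="same",
                         activation="relu")(x)

inputs = layers.Input(shape=(416, 416, 3))  # assumed input size
x = conv_block(inputs, 32)        # 416 -> 208
x = conv_block(x, 64)             # 208 -> 104
scale3 = conv_block(x, 128)       # 104 -> 52  (52 x 52 detection grid)
scale2 = conv_block(scale3, 256)  # 52  -> 26  (26 x 26 detection grid)
scale1 = conv_block(scale2, 512)  # 26  -> 13  (13 x 13 detection grid)

model = tf.keras.Model(inputs, [scale1, scale2, scale3])
model.summary()  # confirms the 13/26/52 feature-map sizes (1/32 at the coarsest)
```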

Compared with the Fast R-CNN two-stage detection algorithm, YOLOv3 has an obvious advantage in detection speed, but it also has shortcomings: the trained model is large and not suitable for embedded devices, and the inference time on the CPU is high. To solve these problems, the following methods are adopted to optimize the YOLOv3 algorithm. Model optimization can be carried out through backbone network optimization and model pruning; this paper considers two aspects, backbone network adjustment and normalization adjustment, as shown in Figure 3.

2.2. Lightweight Improved Model
In the design of the lightweight model, MobileNetv2 is utilized as the backbone network, replacing Darknet-53 for feature extraction. MobileNet has been optimized into MobileNetv2 [23–25], and the model helps resolve the compatibility issues of mobile terminals and embedded devices. Based on a streamlined design, MobileNet builds deep neural networks using depthwise separable convolutions. In Figure 3, the first layer is a standard convolution layer, followed by depthwise convolution and pointwise convolution layers; this factorized structure forms the lightweight deep neural network used here.
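As an illustration (a minimal tf.keras sketch, not the paper's exact network; the layer sizes are assumptions), a depthwise separable block factorizes a standard convolution into a per-channel depthwise filter followed by a 1 × 1 pointwise convolution:

```python
import tensorflow as tf
from tensorflow.keras import layers

def depthwise_separable_block(x, out_channels, stride=1):
    # Depthwise convolution: one D_K x D_K filter per input channel
    # (the filtering step, kernel size (D_K, D_K, 1, M)).
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same")(x)
    x = layers.ReLU(6.0)(x)
    # Pointwise convolution: a 1x1 kernel combining the M input channels
    # into N output channels (the channel-conversion step, (1, 1, M, N)).
    x = layers.Conv2D(out_channels, 1, padding="same")(x)
    return layers.ReLU(6.0)(x)

inputs = layers.Input(shape=(224, 224, 32))  # assumed feature-map size
outputs = depthwise_separable_block(inputs, 64)
tf.keras.Model(inputs, outputs).summary()
```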
Let the input feature map of the model be F with size (D_F, D_F, M), let the standard convolution kernel K have size (D_K, D_K, M, N), and let the output feature map be G with size (D_G, D_G, N). The standard convolution is calculated by

$$G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{k+i-1,\,l+j-1,\,m}. \eqno(1)$$
Here, the number of input channels is M and the number of output channels is N. The corresponding computational cost is

$$D_K \cdot D_K \cdot M \cdot N \cdot D_G \cdot D_G. \eqno(2)$$
The standard convolution kernel K of size (D_K, D_K, M, N) is then factorized into a depthwise convolution and a pointwise convolution. Specifically, the depthwise convolution is responsible for filtering; its kernel size is (D_K, D_K, 1, M) and its output feature map is (D_G, D_G, M). The pointwise convolution is responsible for the channel conversion; its kernel size is (1, 1, M, N) and its output is (D_G, D_G, N). The depthwise convolution is calculated by

$$\hat{G}_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot F_{k+i-1,\,l+j-1,\,m}. \eqno(3)$$

In (3), \hat{K} is the depthwise convolution kernel of size (D_K, D_K, 1, M); the mth filter of \hat{K} is applied to the mth channel of F to produce the mth channel of the filtered output \hat{G}.
The combined computational cost of the depthwise convolution and the pointwise convolution is

$$D_K \cdot D_K \cdot M \cdot D_G \cdot D_G + M \cdot N \cdot D_G \cdot D_G. \eqno(4)$$
Dividing (4) by (2) gives a reduction factor of 1/N + 1/D_K^2, so from the perspective of overall computation the reduction is substantial. The derivation above shows that the number of parameters and computations can be significantly reduced by using depthwise separable convolution. On this basis, MobileNet is improved with the linear bottleneck and the inverted residual block, as shown in Figure 4.
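To illustrate the scale of the saving, the following plain-Python check evaluates (2) and (4) for example sizes (the sizes are assumptions for illustration, not measurements from the paper):

```python
# Computational cost of standard vs. depthwise separable convolution,
# following equations (2) and (4). Example sizes are assumptions.
D_K = 3          # kernel side length
M, N = 128, 256  # input / output channel counts
D_G = 52         # output feature-map side length

standard = D_K * D_K * M * N * D_G * D_G                   # equation (2)
separable = D_K * D_K * M * D_G * D_G + M * N * D_G * D_G  # equation (4)

print(f"standard:  {standard:,}")
print(f"separable: {separable:,}")
print(f"ratio: {separable / standard:.3f} "
      f"(theory: 1/N + 1/D_K^2 = {1 / N + 1 / D_K**2:.3f})")
```

The ratio is independent of D_G, which is why the saving holds at every feature-map resolution.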

The inverted residual structure, built on the block described above, introduces ReLU6 activations and 1 × 1 pointwise (PW) convolutions into MobileNet, combining deep convolution with pointwise convolution for feature extraction [26]. Because depthwise (DW) convolution cannot adjust the number of channels, it can only extract features in the low-dimensional space flowing from the upper layer, which limits its effectiveness when that channel count is small. Therefore, a PW convolution is added before each DW convolution to expand the dimension, as illustrated in Figure 5. Since a nonlinear activation effectively increases nonlinearity in high-dimensional space but destroys features in low-dimensional space, the second PW convolution, whose function is dimension reduction, is left linear; this is the linear bottleneck.
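The following tf.keras sketch (an illustration of the inverted residual with linear bottleneck, not the paper's exact layer configuration; the expansion factor and feature-map size are assumptions) shows the expand–filter–project pattern just described:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual(x, out_channels, stride=1, expansion=6):
    in_channels = x.shape[-1]
    # First PW convolution expands to a high-dimensional space, where
    # ReLU6 adds nonlinearity without destroying features.
    h = layers.Conv2D(expansion * in_channels, 1, padding="same")(x)
    h = layers.ReLU(6.0)(h)
    # DW convolution filters each expanded channel independently.
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same")(h)
    h = layers.ReLU(6.0)(h)
    # Second PW convolution reduces dimension and is left linear
    # (no activation) -- the linear bottleneck.
    h = layers.Conv2D(out_channels, 1, padding="same")(h)
    # Residual shortcut only when the shapes allow it.
    if stride == 1 and in_channels == out_channels:
        h = layers.Add()([x, h])
    return h

inputs = layers.Input(shape=(52, 52, 64))  # assumed feature-map size
outputs = inverted_residual(inputs, 64)
tf.keras.Model(inputs, outputs).summary()
```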

2.3. Normalization Adjustment
In the training of deep convolutional neural networks, problems arise as the network deepens: training becomes difficult and convergence slows [27]. Batch normalization (BN) and group normalization (GN) are shown in Figure 6.

Gradients tend to vanish as they propagate back to the lower layers of the neural network. Building on BN, the optimized algorithm adds the group normalization method, which divides the channels of a feature map into groups and normalizes within each group. The normalization operation is therefore independent of the batch size and avoids its influence.
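The following sketch (a minimal NumPy illustration of group normalization, assuming NHWC layout and a group count that divides the channel count; not the paper's implementation) shows why the statistics are independent of batch size:

```python
import numpy as np

def group_norm(x, groups=8, eps=1e-5):
    # x: feature map of shape (N, H, W, C); NHWC layout assumed.
    n, h, w, c = x.shape
    assert c % groups == 0, "channel count must be divisible by groups"
    # Split channels into groups; statistics are computed per sample
    # and per group, never across the batch dimension.
    x = x.reshape(n, h, w, groups, c // groups)
    mean = x.mean(axis=(1, 2, 4), keepdims=True)
    var = x.var(axis=(1, 2, 4), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    return x.reshape(n, h, w, c)

# Works identically whether the batch holds 1 sample or 32.
print(group_norm(np.random.randn(1, 13, 13, 64)).shape)
```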
3. Experimental Analysis and Visualization
The experimental platform is divided into two parts, a hardware platform and a software platform. The hardware comprises two NVIDIA GeForce GTX 1080 Ti GPUs, an Intel Core i7-7700 CPU, and 32 GB of memory; the software comprises the TensorFlow 1.13.1 GPU deep learning framework, the PyCharm Community IDE, and Linux Ubuntu 16.04. To study the dynamic multiple targets in front of the vehicle, the 10 target categories in the BDD100K data set are analyzed, Python script files are generated, and the image data and label files of the seven target categories studied in this paper are extracted.
3.1. Experiment Data Set
The experimental data sets are shown in Table 1. The statistics count the instance objects in BDD100K and in the team's test set. The target categories in the experiment are car, bus, truck, bike, motor, rider, and person.
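As a sketch of the extraction step described above (hypothetical Python, assuming BDD100K-style JSON label files in which each image record carries a `labels` list with `category` fields; the file names are placeholders, not the paper's scripts):

```python
import json

# The seven target categories studied in this paper.
KEEP = {"car", "bus", "truck", "bike", "motor", "rider", "person"}

def filter_labels(label_file, out_file):
    # label_file: BDD100K-style JSON list of per-image records, each
    # with "name" and a "labels" list of {"category": ...} dicts.
    with open(label_file) as f:
        records = json.load(f)
    kept = []
    for rec in records:
        labels = [l for l in rec.get("labels", [])
                  if l.get("category") in KEEP]
        if labels:  # keep images containing at least one target class
            kept.append({**rec, "labels": labels})
    with open(out_file, "w") as f:
        json.dump(kept, f)
    return len(kept)

# Example (placeholder paths):
# n = filter_labels("bdd100k_labels_train.json", "train_7cls.json")
```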
3.2. Visual Analysis
Visual analysis is carried out using the original YOLOv3 algorithm and the optimized YOLOv3 algorithm, respectively. The loss values of the two models during training are recorded, as shown in Figure 7.

The comparison shows that, during training, the loss curve of the optimized algorithm falls more steeply at first and then descends more gently, indicating faster convergence than the original algorithm. In the field of environment perception for unmanned driving, the missed detection rate and false detection rate on the target detection test set are extremely important evaluation indicators for a detection model, as they directly affect the model's trustworthiness.
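For reference, the two indicators can be computed from true positives (TP), false positives (FP), and false negatives (FN); the definitions below are the standard ones, and the counts in the example are assumed, not taken from the paper:

```python
def missed_detection_rate(tp, fn):
    # Fraction of ground-truth targets the model failed to detect.
    return fn / (tp + fn)

def false_detection_rate(tp, fp):
    # Fraction of the model's detections that match no real target.
    return fp / (tp + fp)

# Example with assumed counts:
print(missed_detection_rate(tp=950, fn=50))  # 0.05 -> 5% missed
print(false_detection_rate(tp=950, fp=38))   # ~0.038 -> 3.8% false
```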
3.3. Visualization of Target Detection
The visualization of the target detection effect is shown in Figures 8 and 9, where the improved YOLOv3-MobileNetv2 model is compared with the original YOLOv3 model on dynamic multitarget detection in front of the test vehicle.


In real-scene detection, multitarget detection can be carried out on all detection objects, including cars, trucks, bicycles, pedestrians, and motorcycles. The experiments prove that the effect of multitarget detection in front of vehicles is significantly improved: compared with the original YOLOv3, the YOLOv3-MobileNetv2 model markedly improves the visualized results of dynamic multitarget detection in front of the test vehicle.
4. Conclusion
A dynamic multitarget detection algorithm for the area in front of the vehicle, based on an improved YOLOv3, is proposed in this paper. The lightweight network MobileNetv2 replaces Darknet-53 as the backbone network for feature extraction, and the normalization method and optimizer are adjusted to accelerate the convergence of the network. The original and optimized models are trained and then compared and analyzed from the perspectives of loss visualization, missed detection rate and false detection rate, model size, and inference time. The experiments show that the optimized model improves mAP by 0.5%, reduces the number of parameters by about 89% compared with the basic YOLOv3 model, and reduces the inference time on the CPU to about 70%. The visualization on the test set and in actual scenes qualitatively verifies that the proposed algorithm performs well for dynamic multitarget detection in front of vehicles. In current research on environment perception for intelligent connected vehicles, image-level work using the camera as a sensor covers target detection, target tracking, and semantic or instance segmentation, and there is still room for improvement in accuracy and speed. A common practice in industry is sensor fusion, which matches and fuses the point cloud information obtained by radar with the image information obtained by the camera to achieve higher perception accuracy. The significance of this paper is that it optimizes the speed of camera-based target detection, saving computing time and space for subsequent fusion. The algorithm in this paper targets only embedded systems for environment detection; it has not been transplanted to other embedded platforms, and the robustness of the model has not been specifically analyzed. Follow-up research will focus on these aspects.
Data Availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.