Abstract
Vehicle detection is an essential component of both intelligent transportation and autonomous driving, yet it still faces many problems, such as inaccurate localization and low detection accuracy in complex scenes. FCOS, a representative anchor-free detection algorithm, caused a sensation when it was released, but it now appears slightly dated. Motivated by this situation, we propose an improved FCOS algorithm. The improvements are as follows: (1) we introduce deformable convolution into the backbone to solve the problem that the receptive field cannot cover the whole target; (2) we add a bottom-up information path after the FPN in the neck module to reduce the loss of information during propagation; (3) following the balance principle, we introduce a balance module, which reduces the inconsistent detections of the bbox head caused by the mismatched variances of different feature maps. To strengthen the comparative experiments, we built a dataset from vehicle images extracted from UA-DETRAC, COCO, and Pascal VOC. The experimental results show that our method achieves good results on this dataset.
1. Introduction
In recent years, with the rapid development of the automobile industry, the number of motor vehicles in cities has also grown rapidly. By the end of 2020, the number of motor vehicles in China had exceeded 300 million, bringing enormous traffic pressure and many traffic-regulation problems. As traffic pressure and traffic-control problems become more serious, they inconvenience the production and life of urban residents and restrict the rapid development of cities and towns. On the other hand, with the gradual maturation of artificial intelligence, its application to vehicles has received extensive attention, and vehicle detection, the focus of this article, is one such application. Vehicle detection now covers more and more areas, such as road traffic monitoring, automatic gate control at community entrances, charging-pile parking spaces, and automated driving.
Unattended charging areas are often occupied by pedestrians or by parked private cars, causing inconvenience to new-energy-vehicle owners, and when a vehicle is fully charged, its owner needs to be reminded to move it away from the charging pile. Vehicle detection in autonomous driving also still has many problems: vehicles may be seriously occluded or misidentified in different scenarios, which is one of the leading causes of accidents in self-driving cars. Moreover, before license plate recognition, it is often necessary to detect the vehicle body first to narrow the search region, reduce interference from the surrounding environment, and improve accuracy. Under these conditions, vehicle detection remains a challenging task.
Vehicle detection has two primary goals. The first is to determine whether a vehicle (such as a bus, truck, or car) appears in the video or image and, during detection, to locate and mark it. The second is to determine the specific category: the particular type of vehicle must be identified by analyzing the semantic information (e.g., [1]) of the vehicle in the frame to complete the detection task.
Neural network design often follows several principles. The main one is to shorten the information path and enhance information propagation; for example, residual connections [2] and dense connections [3] have proved very effective. Improving the flexibility and diversity of information paths is also effective, typically through a split-transform-merge strategy [4]. There are also many methods [5–7] that combine high-resolution image information with high-level semantic information.
Driven by these cutting-edge algorithms, we propose an improved FCOS algorithm for vehicle detection, consisting of three main points. First, a car is a rigid structure whose appearance changes laterally under different shooting angles and scene occlusions. The receptive field of a standard convolution kernel is rectangular and extends outward uniformly, so it may not completely cover the entire car. We therefore introduce Deformable ConvNet [8], which enables the convolution kernel to adaptively learn the positional offset of its responses according to the target's deformation. Second, adding a bottom-up module after the original FPN further enhances the information flow between feature layers, shortening the path between the bottom and top layers. Third, Pang et al. [9] earlier proposed Libra R-CNN, arguing that today's detectors all follow region selection and feature extraction and then converge gradually under the guidance of a multitask loss, which directly affects model training. Based on this balance concept, we add a balanced module after the improved FPN: it integrates feature maps of different resolutions, uses the fused feature to strengthen the original pyramid, and then applies nonlocal attention to enhance contextual connections. This operation reduces the inconsistency of bbox head predictions caused by differing variances, and the whole process is nearly cost-free, relying only on interpolation and pooling.
The work of this paper is as follows: (1) we introduce DCN [8] so that the receptive field of the convolution kernel can change adaptively; (2) we add a bottom-up module after the traditional FPN [6] to shorten the path from the bottom layer to the top layer; (3) we add a balanced module after the improved FPN to reduce the inconsistency of bbox head predictions.
2. Related Works
This section mainly introduces the research methods used for vehicle detection and the FCOS algorithm.
2.1. Detection Algorithm
Many methods combining hand-crafted feature extraction with classifiers were previously proposed for vehicle detection. For example, HOG [10] features were used for detection, and HOG [10] and LBP [11] features were later combined to further improve accuracy; Li Xiangfeng et al. proposed a Haar [12] feature algorithm. Although these algorithms achieve good detection results in simple scenarios, they struggle in complex scenes.
After 2012, convolutional neural network-based deep learning algorithms became popular; they extract semantic information about vehicles from different feature layers and, given sufficient data, overcome the insufficient robustness of traditional algorithms.
Detection algorithms fall into three groups. The first is two-stage algorithms (e.g., [9, 13–21]). These algorithms all propose anchors as priors to further improve accuracy and speed up the convergence of the network model, but they are often slower than single-stage detectors. The second is single-stage algorithms (e.g., [22–27]). These algorithms do not use priors such as anchors, so their regression differs in many respects.
The last group is the direction proposed by Facebook and represented by the transformer [28], which introduces the transformer from NLP into computer vision, greatly simplifying the network model. However, it still falls short in accuracy and remains far from engineering deployment, as in [29–32].
2.2. FCOS Algorithm
We choose an anchor-free network because anchor-based algorithms have many limitations: (1) detection is sensitive to the size, number, and aspect ratio of the anchors, which must be retuned for different tasks, hurting generalization; (2) to match the GT boxes well, many anchors must be generated, most of which are marked as negative samples, causing an imbalance between positive and negative examples; (3) IoU values must be computed for all anchors, which consumes considerable computing power, slowing detection and increasing cost.
FCOS [22] is an anchor-free detection algorithm. Earlier anchor-free algorithms lagged well behind anchor-based ones in accuracy, but FCOS [22], through its alternative design, successfully surpassed the anchor-based detectors and became the SOTA of its year.
FCOS [22] defines positive and negative samples quite differently from previous work. If a location $(x, y)$ falls into any GT box, it is a positive sample, and the distances from this point to the four sides of the bounding box, $(l^*, t^*, r^*, b^*)$, are regressed, as shown in the following equation:

$$l^* = x - x_0^{(i)},\quad t^* = y - y_0^{(i)},\quad r^* = x_1^{(i)} - x,\quad b^* = y_1^{(i)} - y, \tag{1}$$

where $(x_0^{(i)}, y_0^{(i)})$ and $(x_1^{(i)}, y_1^{(i)})$ denote the top-left and bottom-right corners of the $i$th GT box.
Earlier anchor-free algorithms have no good solution for overlapping GT box regions, so point regression there is ambiguous. In FCOS [22], this ambiguity is greatly reduced through the feature pyramid. As is well known, the shallow layers of a neural network are rich in detailed features, which benefits small-target detection [33], while higher levels carry more semantic features and are used to detect large targets. To separate objects of very different sizes, each feature level $i$ is given a maximum regression distance $m_i$. If a location satisfies $\max(l^*, t^*, r^*, b^*) > m_i$ or $\max(l^*, t^*, r^*, b^*) < m_{i-1}$, the point is set as a negative sample and is not regressed on that level. Here, $m_2, \ldots, m_7$ are set to 0, 64, 128, 256, 512, and $\infty$, respectively, dividing the scale range into five intervals to reduce the overlapping area. If a point still lies in overlapping boxes on one level, it simply regresses the box with the smallest area.
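To make the assignment concrete, the following PyTorch-style sketch computes the regression targets of Equation (1) and the per-level positive mask. The function names and tensor layouts are ours for illustration, not taken from the official FCOS code:

```python
import torch

def regression_targets(points, gt_box):
    """points: (N, 2) tensor of (x, y) locations; gt_box: (x0, y0, x1, y1)."""
    x, y = points[:, 0], points[:, 1]
    x0, y0, x1, y1 = gt_box
    l = x - x0   # distance to the left side,   l*
    t = y - y0   # distance to the top side,    t*
    r = x1 - x   # distance to the right side,  r*
    b = y1 - y   # distance to the bottom side, b*
    return torch.stack([l, t, r, b], dim=1)

# m_2..m_7 = 0, 64, 128, 256, 512, inf divide the scale range into five intervals.
M = [0, 64, 128, 256, 512, float("inf")]

def is_positive_on_level(targets, i):
    """Positive on level i (1..5) iff the point lies inside the GT box and
    m_{i-1} < max(l*, t*, r*, b*) <= m_i."""
    inside = targets.min(dim=1).values > 0      # all four distances positive
    max_dist = targets.max(dim=1).values
    return inside & (max_dist > M[i - 1]) & (max_dist <= M[i])
```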
To further suppress prediction boxes far from the center of the GT box, FCOS [22] adopts the centerness method (Equation (2)) and uses BCE loss [34] to optimize the centerness branch:

$$\text{centerness}^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}. \tag{2}$$
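A minimal sketch of Equation (2), assuming the targets produced by the snippet above; during training this branch would be optimized against these values with binary cross-entropy:

```python
import torch

def centerness(targets):
    """targets: (N, 4) tensor of (l*, t*, r*, b*) for positive samples."""
    l, t, r, b = targets.unbind(dim=1)
    lr = torch.min(l, r) / torch.max(l, r)
    tb = torch.min(t, b) / torch.max(t, b)
    return torch.sqrt(lr * tb)   # in (0, 1]; equals 1 exactly at the box center
```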
3. Methods
This chapter gives a detailed overview of the improved parts of the algorithm (the complete structure is shown in Figure 1). Deformable convolution [8] with a suppression factor is added to reduce the influence of noise and background. The added bottom-up module significantly reduces the loss of information on its way to the top layers. The balance module feeds the integrated feature map into a nonlocal attention structure and then redivides it to obtain new feature maps. The improved FCOS in this paper greatly increases accuracy without adding much computation.

3.1. Deformable Convolution
The convolution unit commonly used in CNNs at present is a fixed geometric structure: within the same feature layer, the receptive fields of all activation units have the same size. A vehicle, however, changes appearance under different shooting angles and scene occlusions, and the receptive field of a standard convolution kernel is rectangular, extending uniformly outward, so it may not completely cover the entire car. Therefore, to let the detection algorithm adapt to the scale, posture, and geometric changes of the vehicle target, we introduce deformable convolution [8], which adaptively determines the size of the receptive field to improve detection and localization accuracy. Compared with standard convolution, DCN [8] better matches the actual situation. Its principle is shown in Figure 2.

We introduce the deformable convolution of DCN [8] (illustrated in Figure 3). Deformable convolution consists of two main parts: (1) the input feature map is convolved to obtain the offsets; (2) according to these offset values, new sampling coordinates are obtained and the convolution is applied to generate a new feature map. In actual training, the model may focus on areas outside the target, introducing noise that harms detection, so a suppression factor is added to make the model concentrate on the target we need.

For example, let $R = \{(-1, -1), (-1, 0), \ldots, (1, 1)\}$ denote the sampling grid of a 3 × 3 convolution kernel. The convolution is computed as shown in Equation (3):

$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \cdot \Delta m_n, \tag{3}$$

where $\Delta p_n$ is the learned offset and $\Delta m_n$ denotes the suppression factor, which assigns different weights to the target area and the noisy background area. The sampling coordinates of the kernel on the original feature map are thus $p = p_0 + p_n + \Delta p_n$. In the actual calculation, these coordinates are fractional; we use bilinear interpolation to solve this problem, as shown in Equation (4):

$$x(p) = \sum_{q} G(q, p) \cdot x(q), \tag{4}$$

where $q$ enumerates the integral positions on the feature map and $G(\cdot, \cdot)$ is the bilinear interpolation kernel.
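The sketch below shows how a modulated deformable 3 × 3 convolution of this kind can be assembled from `torchvision.ops.deform_conv2d`, which handles the bilinear interpolation of Equation (4) internally. The layer structure and initialization here are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConv3x3(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, 3, 3))
        nn.init.kaiming_uniform_(self.weight, a=1)
        # Predicts 2 offset components (dx, dy) plus 1 modulation scalar per
        # kernel position: 3 * 3 * 3 = 27 output channels.
        self.offset_mask = nn.Conv2d(in_ch, 27, kernel_size=3, padding=1)
        nn.init.zeros_(self.offset_mask.weight)  # start out as a regular conv
        nn.init.zeros_(self.offset_mask.bias)

    def forward(self, x):
        out = self.offset_mask(x)
        offset, mask = out[:, :18], out[:, 18:]
        mask = torch.sigmoid(mask)               # suppression factor in [0, 1]
        return deform_conv2d(x, offset, self.weight, padding=1, mask=mask)
```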
In the experiment, DCN [8] was added to the C3–C5 layers of the backbone, which brought a considerable increase in accuracy, and we also applied it to the C2 feature map to further improve the learning ability of the model.
3.2. Improved FPN
We modified the neck module of FCOS [22]. The original FPN [6] improves detection, especially for small targets, by fusing high- and low-level features. As is well known, high-level features contain semantic information, while low-level features contain more detailed descriptions. Inspired by PAN [7] (the champion of that year's instance segmentation competition), this paper adds a bottom-up path augmentation module after the traditional FPN [6], as in Figure 4. In the FPN [6], the top-down process means that transferring shallow features to the top layer requires dozens or even more than one hundred network layers, so the shallow feature information is seriously lost along the way. The bottom-up path augmentation added in this article connects the shallow features to P2 through the lateral connections of the original FPN and then passes them from P2 to the top layer along the bottom-up path; fewer than 10 layers are traversed, which better preserves the shallow feature information.
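A minimal sketch of such a PAN-style bottom-up path appended after the FPN outputs; the channel count, number of levels, and use of stride-2 convolutions for downsampling are assumptions:

```python
import torch.nn as nn

class BottomUpPath(nn.Module):
    """Short bottom-up augmentation path after the FPN (PAN-style)."""
    def __init__(self, channels=256, num_levels=5):
        super().__init__()
        # One stride-2 3x3 conv per step carries N_i down to the next level.
        self.down = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(num_levels - 1)
        )

    def forward(self, fpn_feats):       # [P2, P3, ...], ordered fine-to-coarse
        outs = [fpn_feats[0]]           # N2 starts from P2
        for conv, p in zip(self.down, fpn_feats[1:]):
            outs.append(p + conv(outs[-1]))  # N_{i+1} = P_{i+1} + down(N_i)
        return outs
```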

3.3. Balance Module
The previous improvements let the original image pass through the FPN [6] for top-down multiscale feature extraction and then through the bottom-up path to enhance localization information. Information from adjacent resolutions is thus aggregated and strengthened, but problems remain: the aggregation of hierarchical feature information across nonadjacent resolutions is not considered, and the variance of each feature map is different, which leads to inconsistency when the maps are sent to the bbox head. Thanks to Libra R-CNN's [9] success, we add a balance module after the improved FPN (Figure 5). It resizes feature maps of different resolutions to the same size, sums them element-wise, and divides by the number of levels to achieve aggregation, as in Equation (5):

$$C = \frac{1}{L} \sum_{l = l_{\min}}^{l_{\max}} C_l, \tag{5}$$

where $L$ is the number of pyramid levels and $C_l$ is the resized feature map of level $l$. The feature information of different scales is aggregated at the N4 feature map and then sent to the following refine structure. The refine structure introduces a nonlocal attention mechanism to capture long-distance dependence, that is, to build connections between two nonadjacent pixels in the image. When computing the response at an arbitrary position, nonlocal attention considers the context of the whole feature map to assign weights adaptively:

$$y_i = \frac{1}{C(x)} \sum_{j} f(x_i, x_j)\, g(x_j), \tag{6}$$

where $i$ is a location on the output feature map, $j$ enumerates the locations of the other positions, $x$ is the input feature map, $f$ is the pairwise function that computes the correlation between the $i$th position and all other positions, $g$ is the unary function that transforms the input, and $C(x)$ is the normalization factor.
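A sketch of the aggregation step of Equation (5), assuming the levels arrive fine-to-coarse; the fused map would then pass through the nonlocal refine step below and be redistributed to each level by the inverse resize operations, at negligible cost:

```python
import torch.nn.functional as F

def balance_levels(feats, ref=2):
    """feats: list of (N, C, H_l, W_l) maps, fine-to-coarse; ref picks the
    reference resolution (the N4 level in the paper's notation)."""
    size = feats[ref].shape[-2:]
    gathered = []
    for f in feats:
        if f.shape[-2] < size[0]:
            f = F.interpolate(f, size=size, mode="nearest")  # upsample coarse maps
        elif f.shape[-2] > size[0]:
            f = F.adaptive_max_pool2d(f, size)               # pool finer maps
        gathered.append(f)
    return sum(gathered) / len(gathered)  # Equation (5): C = (1/L) * sum C_l
```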

Figure 6 shows the specific form of nonlocal attention. First, the input feature map is convolved three times to obtain the θ, φ, and g features, and then the correlation between two positions is calculated through the f function (part 3 in the structure diagram).

Equations (7) and (8) embed the inputs and calculate the Gaussian distance in the embedding space, corresponding to parts 1 and 2 of the structure diagram:

$$\theta(x_i) = W_\theta x_i, \qquad \phi(x_j) = W_\phi x_j, \tag{7}$$

$$f(x_i, x_j) = e^{\theta(x_i)^{\mathsf T} \phi(x_j)}. \tag{8}$$
Then, the three features are reshaped so that all dimensions except the channel dimension are flattened, and the correlation is calculated by matrix multiplication of θ and φ. Finally, the weights are normalized to 0–1 by the softmax operation, as follows:

$$C(x) = \sum_{j} f(x_i, x_j), \tag{9}$$

$$y_i = \frac{1}{C(x)} \sum_{j} e^{\theta(x_i)^{\mathsf T} \phi(x_j)}\, g(x_j). \tag{10}$$
Combining Equations (9) and (10) yields the following equation:

$$y = \text{softmax}\bigl(\theta(x)^{\mathsf T} \phi(x)\bigr)\, g(x). \tag{11}$$
Finally, the attention output $y$ is multiplied by $W_z$, a 1 × 1 convolution that restores the channel count, and the result and the original input feature map are combined by a residual operation (part 4 in the structure diagram), $z_i = W_z y_i + x_i$, to obtain the refined feature map, enhancing the relationship between feature maps and balancing their variances.
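The whole refine structure can be sketched as the embedded-Gaussian nonlocal block below (Equations (6)–(11)); the channel-reduction factor and layer names are our assumptions:

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels, reduction=2):
        super().__init__()
        inter = channels // reduction
        self.theta = nn.Conv2d(channels, inter, 1)
        self.phi = nn.Conv2d(channels, inter, 1)
        self.g = nn.Conv2d(channels, inter, 1)
        self.out = nn.Conv2d(inter, channels, 1)  # W_z restores the channel count

    def forward(self, x):
        n, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)  # (N, HW, C')
        phi = self.phi(x).flatten(2)                      # (N, C', HW)
        g = self.g(x).flatten(2).transpose(1, 2)          # (N, HW, C')
        attn = torch.softmax(theta @ phi, dim=-1)         # f, normalized to 0~1
        y = (attn @ g).transpose(1, 2).reshape(n, -1, h, w)
        return x + self.out(y)                            # residual step (part 4)
```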
3.4. Loss Function
The final loss function is

$$L(\{p_{x,y}\}, \{t_{x,y}\}) = \frac{1}{N_{\text{pos}}} \sum_{x,y} L_{\text{cls}}\bigl(p_{x,y}, c^*_{x,y}\bigr) + \frac{\lambda}{N_{\text{pos}}} \sum_{x,y} \mathbb{1}_{\{c^*_{x,y} > 0\}}\, L_{\text{reg}}\bigl(t_{x,y}, t^*_{x,y}\bigr). \tag{12}$$

In order to highlight the improvements, we did not change the loss function of the original algorithm. Here, $L_{\text{cls}}$ is the focal loss as in [26], which greatly reduces the imbalance between positive and negative samples, and $L_{\text{reg}}$ is the IoU loss as in UnitBox [35], which, unlike a weighted sum of L1 and L2 losses, accounts for the correlation between the coordinates. $N_{\text{pos}}$ denotes the number of positive samples, and $\lambda$, set to 1 in this paper, is the balance weight for $L_{\text{reg}}$. The summation is calculated over all locations $(x, y)$ on the feature maps, and $\mathbb{1}_{\{c^*_{x,y} > 0\}}$ is the indicator function, being 1 if $c^*_{x,y} > 0$ and 0 otherwise.
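As a sketch of how Equation (12) is assembled, where `focal_loss` and `iou_loss` are hypothetical stand-ins for the focal loss [26] and IoU loss [35] implementations:

```python
def fcos_loss(cls_scores, bbox_preds, cls_targets, bbox_targets, lam=1.0):
    """cls_targets holds c* (0 = background). focal_loss/iou_loss are assumed
    to return per-location losses with reduction='none'."""
    pos = cls_targets > 0                     # indicator 1{c* > 0}
    n_pos = max(int(pos.sum()), 1)            # N_pos, clamped to avoid /0
    loss_cls = focal_loss(cls_scores, cls_targets).sum() / n_pos
    loss_reg = iou_loss(bbox_preds[pos], bbox_targets[pos]).sum() / n_pos
    return loss_cls + lam * loss_reg          # Equation (12) with lambda = 1
```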
4. Experiment
Our experiments performed detection on three diverse datasets, UA-DETRAC, MSCOCO2017, and Pascal VOC, used for joint training (from each dataset, only pictures in the car, bus, and truck categories are used).
4.1. Experimental Details
UA-DETRAC (a multitarget tracking dataset recorded on different roads in Beijing and Tianjin, China) covers various weather conditions, such as cloudy, night, sunny, and rainy, and occlusion levels ranging from unoccluded to heavily occluded; the video was recorded at 25 frames per second, giving about 130,000 pictures. Because consecutive pictures are highly similar, this article samples one frame every 40 and increases the contrast to prevent overfitting. The second dataset consists of selected vehicle pictures from Pascal VOC2012. The third dataset is MSCOCO2017; COCO, short for "Common Objects in Context," is a dataset provided by the Microsoft team for image recognition, whose images are divided into training, validation, and test sets. This article samples all vehicle pictures from the training and validation sets, and the total dataset contains about 20,000 images (the specific information is in Table 1). All annotations use the VOC format; 90% is used for training and validation, and the remaining 10% for testing.
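A hypothetical sketch of the 1-in-40 frame sampling used to thin the highly similar consecutive UA-DETRAC frames; the directory layout is an assumption:

```python
from pathlib import Path

# Keep one frame out of every 40 (25 fps video) to reduce near-duplicates.
frames = sorted(Path("UA-DETRAC/images").glob("*.jpg"))
sampled = frames[::40]
```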
All experiments in this article are based on the mmdetection [36] framework developed by SenseTime, which is built on PyTorch. It divides a detection algorithm into several major modules (backbone, neck, head, bbox, encode, decode, and loss), decoupling the connections between them. We train the network on 6 Nvidia TITAN Xp GPUs, and all parameter settings follow the official mmdetection defaults, where the learning rate scales linearly with the number of GPUs.
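The linear scaling rule mentioned above, as a one-line sketch; the 8-GPU base setting is an assumed reference value, not taken from the paper:

```python
base_lr, base_gpus = 0.01, 8          # assumed reference configuration
num_gpus = 6
lr = base_lr * num_gpus / base_gpus   # learning rate scales linearly with GPUs
```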
4.2. Accuracy Experiment
We report our main results on the test split (approximately 2K images) by uploading detection results to the server. We first forward the input image through the network and obtain the predicted bounding boxes with predicted classes. Unless otherwise specified, the postprocessing and data augmentation use the official mmdetection defaults. We hypothesize that the performance of our detector could be improved further by carefully tuning the hyperparameters. We compare against the mainstream algorithms of recent years in Table 2; our method achieves the best performance on this dataset.
4.3. Model Complexity
We also tested the complexity of each model on the dataset, as shown in Table 3. The table shows that one-stage networks often have fewer parameters than two-stage networks, and our model greatly improves accuracy while reducing the parameter count; the GFLOPs indicator rises only slightly.
4.4. Ablation Experiment
This section analyzes the ablation experiments on the improved network (Table 4). DCN [8] can expand the receptive field of the convolution kernel, allowing the sampling points to shift with the deformation of the object. Its visualization is shown in Figure 7, where the feature focus areas clearly differ. The experimental results in Table 4 show that the combination of the three improvements performs best. Figure 8 shows the line charts of the cls and bbox losses: compared with FCOS, our method converges faster and more stably on cls.

4.5. Visualization of Results
Figure 9 visualizes the results of the algorithm. Vehicles are detected whether they are far away or in unusual scenes, even when the image carries little information, which demonstrates the effectiveness of our algorithm.

5. Conclusions
We improved the detection algorithm based on the anchor-free FCOS [22]: we introduced DCN [8] into the original backbone to broaden the receptive field of the convolution kernel, added a bottom-up module to improve the FPN and reduce the loss during information transmission, and added a balance module because the differing variances of the feature pyramid affect accuracy. The experiments show a good effect and demonstrate the superiority of the improved algorithm.
Data Availability
The codes used in this paper are available from the corresponding author upon request (fyan@nuist.edu.cn).
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 61605083) and Jiangsu Provincial Key Research and Development Program (China).