Abstract
Detection and recognition of traffic signs are key to creating advanced driving assistance systems. Making highly precise maps also requires the identification and extraction of road elements such as traffic signs. Traditional detection and recognition methods no longer meet today’s needs, and object recognition algorithms based on deep learning have become the mainstream solution. However, current algorithms have limitations. One-stage algorithms are fast, but their recognition accuracy is not satisfactory, especially for small objects; two-stage algorithms are more accurate, but their recognition speed is extremely slow. This paper addresses these problems with a proposed parallel attention convolution module (PACM), a channel attention feature pyramid network (CAFPN), and a diagonal and center point IoU (DCPIoU) loss function, all based on the YOLOv3 algorithm. The improved models are compared with SSD, YOLOv3, and Faster RCNN. Experimental results show that the proposed models improve on these baselines: the mAP of the model with PACM, CAFPN, and DCPIoU was 76.02%, an improvement of 9.27%, 6.93%, 2.94%, and 5.3% over SSD300, SSD512, Faster RCNN, and YOLOv3, respectively. The FPS of the improved model is essentially the same as that of the original YOLOv3, so real-time performance is not reduced.
1. Introduction
Autonomous driving has recently become a popular research topic. Advanced driving assistance systems are emerging as the first transitional technology toward autonomous driving at level L4 or above. Designing a traffic sign recognition algorithm with high accuracy and robustness is therefore essential.
Automatic recognition of traffic signs using deep learning significantly improves on traditional approaches, which rely on template matching and support vector machines, in both accuracy and robustness. A feature expression method based on multilevel chain code histograms has been proposed for template matching recognition [1]; it is less computationally intensive, runs closer to real time than previous methods, and better discriminates between the shapes of traffic signs. One model combines a multilayer perceptron with the Gabor wavelet transform [2]; it accurately locates traffic signs and performs real-time classification using a single fast Fourier transform. Another method first converts the image from color space to grayscale [3], uses shape templates to obtain the region of interest, represents the features with HOG, and then classifies them with SVM. Although traditional methods can recognize traffic signs, their pipelines are cumbersome, their feature extraction is poor, and they cannot meet the real-time requirements of autonomous driving. Deep learning methods based on convolutional neural networks have therefore emerged.
With its powerful feature extraction, deep learning based on convolutional neural networks has become the mainstream approach to traffic sign recognition. CNN object recognition algorithms are either one-stage or two-stage. Representative one-stage algorithms are YOLOv1–v4 and SSD [4–8]; two-stage algorithms include RCNN, Fast RCNN, and Faster RCNN [9–11]. Traffic signs in photos span a large range of sizes, with large objects covering several hundred pixels and small objects only a dozen, and recognizing small objects is technically difficult. Researchers have optimized models in various ways, such as the backbone, feature fusion, and the loss function, to improve recognition performance.
The backbone is the feature extraction module of the network, and its ability to extract feature information from input images directly affects model prediction. In [12], spatial pyramid pooling (SPP) was added to the YOLOv3 backbone, improving the model’s ability to recognize prohibition signs. The literature [13] adds mixed-depth convolution to the backbone, improving its feature extraction ability through multiscale convolution. The literature [14] incorporates deformable convolution into the backbone to improve feature extraction in complex environments. Another approach improves the one-way backbone structure of Faster RCNN by adding deconvolution in VGG16 to transfer information between feature maps of different depths [15].
Most backbone structures produce multiscale outputs, and the interactive fusion of information from feature maps of different scales also affects model performance. Feature fusion methods have evolved from the initial FPN to PANet [16, 17], from one-way to bidirectional fusion, and then from PANet to BiFPN [18], from manual design to neural architecture search; the performance of fusion structures that merely change the connection pattern has reached a bottleneck. Combining feature fusion with attention modules may be an effective direction for further development. Attention mechanisms in computer vision are also a popular research topic, from the early spatial transformer networks to the later squeeze-and-excitation networks [19, 20]. Another contemporary approach fuses channel and spatial attention (the convolutional block attention module, or CBAM) [21]. The literature [13] replaced the feature fusion of YOLOv5 with attention feature fusion (AAF) [22].
The loss function is another key to model training: a well-designed loss function can accelerate convergence and yield better parameters. Taking the YOLOv3 algorithm as an example, its loss function has three parts: confidence, classification, and anchor box localization. Cross-entropy is used as the loss function for confidence and classification and achieves satisfactory results. The mean square error (MSE) is used for localization, but MSE does not drive the model to its best fit, so a more suitable localization loss function is needed. The literature [23] proposed GIoU based on IoU, which, unlike IoU, accounts for nonoverlapping regions. Other work proposed DIoU and CIoU, which build on GIoU and consider the scale relationship [24]; these achieved satisfactory results with the SSD, YOLO, and RCNN algorithms.
In summary, to further improve model performance, this paper optimizes small-object recognition accuracy by presenting a new feature extraction network with a parallel convolutional structure, constructing a channel attention pyramid with an attention mechanism, adopting the YOLOv3 recognition strategy, and optimizing the anchor box localization loss function. The result is a new one-stage algorithm with improved small-object recognition. The main contributions of this paper are as follows:
(i) A parallel attention convolution module (PACM) that improves the feature extraction of the backbone by fusing a parallel convolution structure with an attention mechanism
(ii) A new feature fusion structure, the channel attention feature pyramid network (CAFPN), which shares channel feature information between feature maps of different sizes and improves the quality of feature fusion
(iii) A diagonal and center point IoU (DCPIoU) loss function, which optimizes the DIoU loss function by adding a diagonal loss penalty term between the prediction box and the true box
2. Materials and Methods
Figure 1 shows the network structure of the proposed modified algorithm. The algorithm follows the same prediction strategy as the YOLOv3 network, but the backbone, the feature fusion, and the anchor box localization loss function of the original algorithm were reconstructed. The vertically connected structure on the left side of Figure 1 is the backbone feature extraction network with the PACM. The input image is processed by four different residual blocks, six base convolution modules, and spatial pyramid pooling (SPP), and three feature maps of different scales are output. The horizontally and vertically connected network on the right side of Figure 1 is the CAFPN feature fusion structure, which combines the bidirectional feature fusion of PANet with a parallel channel attention pyramid. This ensures bidirectional fusion of the feature maps and, alongside it, bidirectional fusion of the channel attention weights of feature maps at different depths. The fused weights are used to weight the channel dimensions of the output feature maps and highlight the useful information. The loss function of the original model was also optimized: a DCPIoU loss function based on DIoU was developed for anchor box localization training. Compared with the original MSE loss function, DCPIoU not only adds the ratio of the distance between the center points of the two boxes to the diagonal of the minimum enclosing rectangle but also adds the difference between the diagonal of the minimum enclosing rectangle and the diagonal of the intersection rectangle of the two boxes, improving model convergence.

2.1. Backbone with Parallel Attention Convolution Module
The backbone consists of three structures: the residual unit, the base convolution module, and spatial pyramid pooling. The base convolution module processes the input data with a 2D convolution, batch normalization, and then LeakyReLU activation. The structure of residual block A is shown in Figure 2(a): the left side of the figure is a large residual edge composed of a base convolution module, and the main path consists of small residual blocks. Each small residual block applies two successive convolutions, and the result is added to the block’s input through a residual connection; the small residual blocks are connected end to end. Finally, the output is fused with the large residual edge to form the output of residual block A. The structure of residual block B is shown in Figure 2(b): it is based on residual block A, with the last small residual block replaced by a PACM. The PACM has a parallel convolution structure: the input is first convolved, then passed through two parallel convolutions with different kernel sizes, and each result is processed by a channel attention module. The outputs of the two channel attention modules are added, the result is added to the large residual edge and the convolution output, ReLU activation is applied, and the data are then passed to a spatial attention module whose result is the output of residual block B. The spatial and channel attention structures in this paper follow the computational approach of the CBAM proposed in the literature [21]. The PACM enhances the feature extraction of the model by connecting a channel attention module after each of the parallel convolutions, and the different receptive fields of the parallel kernels adapt better to extracting features from traffic signs of different sizes in the experimental data. The nested residual structure also ensures the convergence of a deep network. Because the outputs of the three residual blocks B in the backbone are used as inputs for the subsequent feature fusion, a final spatial attention module enhances the extraction of spatial information about the object in each feature map. The SPP structure applies three parallel max-pooling operations with different pool sizes and concatenates their outputs.
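To make the structure concrete, the following is a minimal PyTorch sketch of how such a module could be assembled. It is not the authors' implementation: the kernel sizes of the parallel branches (3×3 and 5×5 here), the channel counts, and the CBAM-style reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM-style channel attention: global avg/max pool -> shared MLP -> sigmoid gate."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx) * x


class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: channel-wise avg/max -> conv -> sigmoid gate."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        attn = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(attn)) * x


class PACM(nn.Module):
    """Sketch of the parallel attention convolution module: a reduction convolution,
    two parallel branches with different kernels (3x3 and 5x5 assumed), channel
    attention on each branch, a residual add, ReLU, then spatial attention."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels, 1, bias=False)
        self.branch_a = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.branch_b = nn.Conv2d(channels, channels, 5, padding=2, bias=False)
        self.ca_a = ChannelAttention(channels)
        self.ca_b = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        r = self.reduce(x)
        fused = self.ca_a(self.branch_a(r)) + self.ca_b(self.branch_b(r))
        out = torch.relu(fused + x)      # residual edge from the block input
        return self.sa(out)


if __name__ == "__main__":
    pacm = PACM(channels=512)
    y = pacm(torch.randn(1, 512, 13, 13))   # same shape in, same shape out
```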

Figure 2: (a) structure of residual block A; (b) structure of residual block B with the PACM.
2.2. Channel Attention Feature Pyramid Network
The backbone processes the input image and outputs three feature maps of different sizes. The original YOLOv3 uses the FPN structure to fuse features of different sizes [16]. This study instead adopts a bidirectional feature fusion structure similar to PANet and embeds a parallel channel attention pyramid to form the CAFPN [17]. The channel weight extraction module in the CAFPN also uses the channel attention calculation proposed in the literature [21]; however, it only computes the weight vector of the input feature map and does not multiply the weight vector back with the input feature map.
After backbone processing, the three feature maps are input to the CAFPN. The CAFPN adopts a bidirectional parallel fusion structure, consisting of upsampling fusion followed by downsampling fusion. While the feature maps are fused bidirectionally, channel weight vectors are extracted at the same time and fused with the same bidirectional strategy, forming a parallel structure. Specifically, convolutions are first applied to the three feature maps output by the backbone, keeping the number of channels constant. The smallest feature map is then upsampled, and a channel weight vector is extracted from the upsampled map. The upsampled map is fused with the feature map at the next scale, a weight vector is extracted from the fused map, and this vector is fused with the vector extracted from the upsampled map. These fusion steps are then repeated. After two rounds of upsampling parallel fusion, fused feature maps at three scales and the corresponding fused channel weight vectors have been obtained; for convenience in the following description, the three fused weight vectors are denoted V1, V2, and V3 and the corresponding fused feature maps P1, P2, and P3.

Downward fusion and output then follow. P1 is first processed by a Convblock, which consists of convolutions and spatial attention. The Convblock output is multiplied by the corresponding weight vector V1, and a convolution then aligns the number of channels with the prediction dimension determined by the number of traffic sign categories; the aligned feature map is the output at this scale. Next, the product with V1 is downsampled, a channel weight vector is extracted from the downsampled map and fused with V2, the fused feature map is processed by a Convblock and multiplied by the fused vector, and the channels are again aligned according to the number of traffic sign categories to give the output at this scale. The same downward fusion step is repeated to obtain the output for the third scale. Finally, the outputs at the three scales are sent to the detection heads; this paper uses the same detection heads as YOLOv3.
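As an illustration, the PyTorch sketch below shows a channel weight extractor of this kind and a single upsampling fusion step. It assumes that all pyramid levels share the same channel count and that both the feature maps and the weight vectors are fused by element-wise addition; neither choice is pinned down in this excerpt, so this is a sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelWeight(nn.Module):
    """Channel attention weight extractor: unlike standard CBAM, it returns only the
    sigmoid weight vector and does not multiply it back into the feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)           # (N, C, 1, 1) weight vector


def up_fuse(small_feat, large_feat, channel_weight):
    """One upsampling fusion step of the parallel pyramid (element-wise addition is
    assumed as the fusion operator for both feature maps and weight vectors)."""
    up = F.interpolate(small_feat, scale_factor=2, mode="nearest")
    w_up = channel_weight(up)                    # weight vector of the upsampled map
    fused = up + large_feat                      # feature-map fusion
    w_fused = channel_weight(fused) + w_up       # weight-vector fusion follows the same path
    return fused, w_fused


# At the output stage, the fused map is processed by a Convblock, multiplied by its
# fused weight vector, and a 1x1 convolution aligns the channels with the prediction size.
if __name__ == "__main__":
    cw = ChannelWeight(channels=256)
    c_small = torch.randn(1, 256, 13, 13)        # example scales, assumed for illustration
    c_large = torch.randn(1, 256, 26, 26)
    p, v = up_fuse(c_small, c_large, cw)         # p: (1, 256, 26, 26), v: (1, 256, 1, 1)
    out = nn.Conv2d(256, 3 * (5 + 3), 1)(p * v)  # channel alignment for 3 anchors, 3 classes (assumed)
```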
2.3. Diagonal and Center Point IoU
The YOLOv3 loss function is divided into anchor box localization loss, confidence loss, and classification loss. The localization loss uses MSE to measure the offsets of the center point coordinates relative to the grid and the translation and scaling of the anchor box; the confidence and classification losses both use cross-entropy. The terms $\text{loss}_{box}$, $\text{loss}_{obj}$, and $\text{loss}_{cls}$ are given in formulas (1) to (3), and the full loss function is given in Equation (4):

$$\text{loss}_{box}=\sum_{i=0}^{S\times S}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(t_x-\hat{t}_x)^2+(t_y-\hat{t}_y)^2+(t_w-\hat{t}_w)^2+(t_h-\hat{t}_h)^2\right], \tag{1}$$

$$\text{loss}_{obj}=-\sum_{i=0}^{S\times S}\sum_{j=0}^{B}\left[\hat{C}_{ij}\log C_{ij}+(1-\hat{C}_{ij})\log(1-C_{ij})\right], \tag{2}$$

$$\text{loss}_{cls}=-\sum_{i=0}^{S\times S}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\sum_{c\in \text{classes}}\left[\hat{p}_{ij}(c)\log p_{ij}(c)+(1-\hat{p}_{ij}(c))\log(1-p_{ij}(c))\right], \tag{3}$$

$$\text{loss}=\text{loss}_{box}+\text{loss}_{obj}+\text{loss}_{cls}, \tag{4}$$

where $t_x$, $t_y$, $t_w$, and $t_h$ are the predicted horizontal and vertical offsets of the prediction box center relative to the grid cell and the predicted translation and scaling of the anchor box, respectively; $\hat{t}_x$, $\hat{t}_y$, $\hat{t}_w$, and $\hat{t}_h$ are the corresponding true values; $\mathbb{1}_{ij}^{obj}$ indicates whether the $j$th anchor box of the $i$th grid cell is responsible for an object; $S\times S$ is the number of grid cells; $B$ is the number of predefined anchor boxes; $C_{ij}$ and $p_{ij}(c)$ are the confidence and category probabilities predicted for the grid cell; and $\hat{C}_{ij}$ and $\hat{p}_{ij}(c)$ are their true values.
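A minimal PyTorch sketch of composing these three terms is shown below; the tensor layout, the masking, and the absence of weighting factors are assumptions for illustration, not the authors' training code.

```python
import torch.nn.functional as F

def yolov3_style_loss(pred_box, true_box, pred_conf, true_conf,
                      pred_cls, true_cls, obj_mask):
    """Compose the localization (MSE), confidence (cross-entropy), and classification
    (cross-entropy) terms. `obj_mask` marks the anchors responsible for an object;
    shapes are assumed to be (N, A, 4), (N, A), and (N, A, num_classes)."""
    # (1) Localization: MSE over (t_x, t_y, t_w, t_h) for responsible anchors only.
    loss_box = F.mse_loss(pred_box[obj_mask], true_box[obj_mask], reduction="sum")
    # (2) Confidence: binary cross-entropy over all anchors (probabilities in [0, 1]).
    loss_obj = F.binary_cross_entropy(pred_conf, true_conf, reduction="sum")
    # (3) Classification: binary cross-entropy for responsible anchors only.
    loss_cls = F.binary_cross_entropy(pred_cls[obj_mask], true_cls[obj_mask], reduction="sum")
    # (4) Total loss.
    return loss_box + loss_obj + loss_cls
```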
The DIoU loss function has achieved satisfactory results in various models [24]. The DIoU function is calculated as

$$\text{DIoU}=\text{IoU}-\frac{\rho^{2}(b,b^{gt})}{c^{2}}, \tag{5}$$

where IoU is the intersection over union between the prediction box and the true box, $c$ is the diagonal length of the smallest rectangle enclosing the prediction box and the true box, and $\rho^{2}(b,b^{gt})$ is the square of the distance between the center points of the prediction box and the true box. The point locations are shown in Figure 3.
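For reference, the sketch below computes the DIoU score directly from this definition for boxes in (x1, y1, x2, y2) corner format; the box format and batching are assumptions, and the corresponding loss would be 1 − DIoU.

```python
import torch

def diou(pred, target, eps=1e-7):
    """DIoU for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4),
    following Zheng et al. [24]: DIoU = IoU - rho^2 / c^2."""
    # Intersection area
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared distance between box centers (rho^2)
    cx_p = (pred[:, 0] + pred[:, 2]) / 2
    cy_p = (pred[:, 1] + pred[:, 3]) / 2
    cx_t = (target[:, 0] + target[:, 2]) / 2
    cy_t = (target[:, 1] + target[:, 3]) / 2
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2

    # Squared diagonal of the smallest enclosing box (c^2)
    ex1 = torch.min(pred[:, 0], target[:, 0])
    ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2])
    ey2 = torch.max(pred[:, 3], target[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    return iou - rho2 / c2
```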

To further improve the anchor box localization loss function, this study adds a penalty term based on the diagonal length difference to DIoU, producing DCPIoU. In the DCPIoU calculation, IoU is the intersection over union between the prediction box and the true box, $c$ is the diagonal length of the smallest rectangle enclosing the prediction box and the true box, $d$ is the diagonal length of the intersection rectangle of the prediction box and the true box, and $\rho^{2}$ is the square of the distance between the center points of the prediction box and the true box.
The modified DCPIoU generalizes better than DIoU. In the scene shown in Figure 4, the prediction box is enclosed by the true box and the two center points are close together. In this case the penalty term in DIoU is very small and has little effect on the weight update, so DIoU degenerates to IoU. Yet the gap between the prediction box and the true box is still large, and the prediction box does not contain the complete object. DCPIoU adds a penalty term based on the diagonal lengths of the two boxes, which still updates the weights effectively and improves convergence, as shown in Figure 4.
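To illustrate how the extra diagonal term keeps the gradient alive in this nested-box case, the sketch below computes a DCPIoU-style score. The exact form of the diagonal-difference penalty is not given in this excerpt, so the (c − d)/c normalization used here is an assumption rather than the authors' exact formula.

```python
import torch

def dcpiou(pred, target, eps=1e-7):
    """Hedged sketch of a DCPIoU-style score for (x1, y1, x2, y2) boxes: the DIoU terms
    plus an assumed (c - d) / c penalty, where c is the enclosing-box diagonal and
    d is the intersection-box diagonal."""
    ix1 = torch.max(pred[:, 0], target[:, 0]); iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2]); iy2 = torch.min(pred[:, 3], target[:, 3])
    iw = (ix2 - ix1).clamp(min=0); ih = (iy2 - iy1).clamp(min=0)
    inter = iw * ih
    d = torch.sqrt(iw ** 2 + ih ** 2)                        # diagonal of the intersection box

    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    c = torch.sqrt((ex2 - ex1) ** 2 + (ey2 - ey1) ** 2) + eps  # enclosing-box diagonal

    # Squared center distance (rho^2)
    rho2 = ((pred[:, 0] + pred[:, 2]) - (target[:, 0] + target[:, 2])) ** 2 / 4 \
         + ((pred[:, 1] + pred[:, 3]) - (target[:, 1] + target[:, 3])) ** 2 / 4

    # Even for nested boxes with nearly coincident centers, (c - d) / c stays nonzero
    # and keeps pushing the prediction box toward the true box.
    return iou - rho2 / c ** 2 - (c - d) / c
```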

3. Results and Discussion
The difficulty of traffic sign recognition lies in the broad range of object sizes in the image, which demands high model robustness, especially in extracting small objects. One-stage models, such as YOLO and SSD, have fast recognition speed but poor ability to recognize small objects. The PACM and CAFPN structures proposed in this paper increase the model’s attention to small objects in feature extraction and feature fusion, respectively, while DCPIoU is proposed to improve model convergence. To verify the feasibility of the proposed improvements, this study constructed a hybrid dataset. The experimental data were selected from the open-source China traffic sign detection dataset (CCTSDB) and the Tsinghua100K dataset and contain 4000 images of small objects. The data are divided into three categories according to the CCTSDB labeling method: prohibitory, warning, and mandatory. The dataset is then randomly divided into a training set and a test set in a 9 : 1 ratio, and the same training set and training strategy are used for all models. The experimental host has an Intel i7-8700 CPU, 16 GB of memory, and a GTX 1080 graphics card. AP and mAP are used to measure model performance, with the IoU threshold set to 0.5. The improved model is compared with SSD (with a VGG16 backbone), Faster RCNN, and YOLOv3; the experimental results are shown in Table 1.
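The 9 : 1 split described here can be reproduced with a simple seeded shuffle; the helper below is an illustrative sketch (the list-of-paths input is an assumption).

```python
import random

def split_dataset(image_paths, train_ratio=0.9, seed=0):
    """Random 9:1 train/test split of the hybrid dataset.
    `image_paths` is assumed to be a list of image or annotation file paths."""
    rng = random.Random(seed)
    paths = list(image_paths)
    rng.shuffle(paths)
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]
```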
Table 1 shows that the models with different improvement modules achieved different degrees of improvement over SSD300, SSD512, and YOLOv3, and some of them outperformed Faster RCNN. The model with only PACM improved 4.37%, 2.03%, and 0.4% over SSD300, SSD512, and YOLOv3, respectively. The model with PACM and CAFPN improved 7.91%, 5.57%, 1.58%, and 3.94% over SSD300, SSD512, Faster RCNN, and YOLOv3, respectively. The model with PACM, CAFPN, and DIoU improved 8.62%, 6.28%, 2.29%, and 4.65% over SSD300, SSD512, Faster RCNN, and YOLOv3, respectively. The model with PACM, CAFPN, and DCPIoU improved 9.27%, 6.93%, 2.94%, and 5.3% over SSD300, SSD512, Faster RCNN, and YOLOv3, respectively. The model with DCPIoU improved 0.65% over the one with DIoU.
To compare recognition speed, the FPS of the above models is compared in Table 2.
The comparison shows that the algorithm in this paper achieves a recognition speed comparable to that of the original YOLOv3 model while improving its accuracy.
To illustrate the advantages of the model more effectively, we selected small-object images from the CCTSDB and Tsinghua100K datasets that were not involved in model training, used them for testing, and drew attention heat maps.
Figures 5–7 show that the improved model structure proposed in this paper detects small objects better than the original YOLOv3. The three traffic signs in Figure 5 are of moderate size, and both YOLOv3 and the improved model recognize them accurately; however, Figure 6 shows that the original YOLOv3 model misses detections as the objects become smaller. As Figure 7 shows, compared with YOLOv3, the PACM and CAFPN structures proposed in this paper increase the model’s attention to the traffic sign area, and this more focused attention has a significant impact on the final recognition result.

Figures 5–7: detection results of YOLOv3 and the improved model on the small-object test images (Figures 5 and 6) and the corresponding attention heat maps (Figure 7).
4. Conclusions
To solve the difficult problem of recognizing small traffic signs, this paper proposes the PACM for feature extraction and the CAFPN for fusing the extracted features, and DCPIoU is proposed as a new localization loss function based on DIoU. To verify the actual performance of the improved algorithm, 4000 pictures with small-object characteristics were selected from two open-source datasets, CCTSDB and Tsinghua100K, to construct a hybrid dataset, which was randomly divided into a training set and a test set in a 9 : 1 ratio. The improved model was then trained on this training set alongside the SSD, YOLOv3, and Faster RCNN models, and all models were evaluated on the same test set. The results show that the improvements proposed in this paper raise model performance to different degrees; compared with the original loss and DIoU, the improved DCPIoU loss function improves model convergence and yields a higher mAP under the same training set and training strategy.
Data Availability
The data used to support the findings of the study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Authors’ Contributions
He Huang did the conceptualization, investigation, and writing—original draft. Qice Liang did the conceptualization, investigation, and writing—review and editing. Dean Luo did the conceptualization and investigation. Dong Ha Lee did conceptualization and investigation.
Acknowledgments
This work was funded by the National Key Research and Development Program of China (no. 2017YFB0503702).