Abstract
Infrared target detection is a popular applied field in object detection as well as a challenging one. This paper proposes the focus and attention mechanism-based YOLO (FA-YOLO), an improved method for detecting infrared occluded vehicles against the complex backgrounds of remote sensing images. Firstly, we use a GAN to create infrared images from visible datasets so as to obtain sufficient data for training, and we also apply transfer learning. Then, to mitigate the impact of useless and complex background information, we propose a negative sample focusing mechanism that concentrates training on confusing negative samples, suppressing false positives and increasing detection precision. Finally, to enhance the features of small infrared targets, we add the dilated convolutional block attention module (dilated CBAM) to CSPDarknet53 in the YOLOv4 backbone. To verify the superiority of our model, we carefully select 318 infrared occluded vehicle images from the VIVID-infrared dataset for testing. The detection accuracy (mAP) improves from 79.24% to 92.95%, and the F1 score improves from 77.92% to 88.13%, demonstrating a significant improvement in detecting small occluded infrared vehicles.
1. Introduction
Infrared target detection is a hot topic in object detection due to its specific characteristics and special demands. Infrared images have some inherent defects. First, infrared targets captured by infrared cameras are not distinct in shape and boundary, so they are easily confused with environmental information. Second, compared with visible images, infrared images contain much more noise, such as Gaussian noise, which can depress detection accuracy if not preprocessed. What is more, infrared remote sensing targets occupy far fewer pixels than targets in ordinary images [1]. All of these features make infrared target detection more challenging than normal detection tasks.
Since infrared remote sensing targets are small and weak, current methods rely on feature fusion [2, 3] and multiscale detection [4] to preserve small-scale features. As for the noise, a common approach is to suppress the background with noise filters, such as median and Robinson filters [5]. Moreover, infrared datasets are not as plentiful as visible datasets, which means the model cannot be trained on infrared images in the same way as on visible images. Thus, transfer learning [6, 7] is a good way to make up for this deficiency.
Nevertheless, current papers focus more on small, dim infrared targets without much confusing background information, while infrared object detection against confusing backgrounds has not been sufficiently studied. In such scenes, the targets are usually occluded by irrelevant information from the wild environment, such as trees, shadows, and other ground features. This background information can invalidate the detection performance of the model and cause false decisions; that is, the detector falsely regards negative samples as targets, resulting in low precision. However, in the current military field, most application scenarios are wild, complex environments; thus, it is of vital practical importance to improve detection performance so that weak and occluded targets can still be detected precisely in complex environments. Last but not least, a good detection model can replace manual labor and increase the efficiency of surveillance and detection, as shown in Figure 1, and our paper tries to solve the detection issues in this field.

In view of the above issues, our paper proposes the focus and attention mechanism-based YOLO (FA-YOLO) model. First of all, to mitigate the impact of confusing background information, we change the YOLOv4 data flow structure and introduce a negative sample focusing mechanism into the training process. After several epochs of training, the model selects a number of false-positive samples, maps them to the corresponding locations in the feature map, and trains on them again. By focusing training on these confusing samples, the model learns to be more precise.
Secondly, to enhance the features of small objects, we reconstruct the backbone network of YOLOv4 by adding an attention mechanism to the CSPDarknet53 network. We plug sequential channel and spatial attention blocks after each residual block; meanwhile, to increase the receptive field, we replace the convolutional kernel in the spatial attention with a dilated convolutional kernel.
Additionally, we use CycleGAN [8] to create infrared images from visible images to make up for the insufficient infrared training dataset. Transfer learning is also used to promote the optimization of the model parameters. To further verify the superiority of our model, we add SSD [9], faster R-CNN [10], and YOLOv3 [11] as comparison models. Compared with the original YOLOv4 [12] model, the detection accuracy (mAP) of FA-YOLO improves from 79.24% to 92.95%, and the F1 score improves from 77.92% to 88.13%, which is state-of-the-art performance.
The main contributions of our work are as follows:
(1) We use a GAN to increase the number of infrared images and transfer learning to promote the training process.
(2) We add a negative sample focusing mechanism to the YOLOv4 model, letting it focus more on negative sample training to reduce the impact of the confusing background and thus improve the detection accuracy of the model.
(3) We integrate the dilated convolutional block attention module (dilated CBAM) into CSPDarknet53 to enhance the features of small targets.
Section 2 surveys the related works. Section 3 explains FA-YOLO in theory. Section 4 presents the experiments, and Section 5 concludes the paper.
2. Related Works
This section briefly surveys the related works on infrared small target detection and attention mechanisms.
2.1. Infrared Small Target Detection
Infrared object detection mainly covers infrared person detection [6, 13–15], infrared vehicle detection [7, 16, 17], infrared aircraft detection [5], and infrared creature recognition and counting [18]. Usually, the lack of infrared datasets for training and the indistinct features of infrared images are the problems that need to be overcome.
Transfer learning [6, 7, 19] is commonly used when training datasets are insufficient; thus, it is also effective for infrared dataset training. Its feasibility mainly relies on the similarity of image features between the two datasets; this similarity and a large pretraining dataset are the two conditions required for transfer learning. The generative adversarial network (GAN) [6] is another method applied to make up for insufficient infrared datasets by generating infrared images in different styles from visible images.
Wang et al. [2] propose the MNET network, using only three downsampling operations to preserve the features of small infrared targets and dense connections between feature maps to keep their sizes consistent; Xu and Wu [3] also use DenseNet and expand YOLOv3 to four scales of anchor boxes; Zhang et al. [20] use a double multiscale feature pyramid network to combine feature levels of different semantics and resolutions.
2.2. Attention Mechanism
CBAM [21] is a simple yet effective attention module for feedforward convolutional neural networks, generating channel and spatial attention maps separately. It is a lightweight and general module that can be integrated into any CNN architecture seamlessly with negligible overhead and is end-to-end trainable along with the base CNN. BAM [22] is also a two-dimensional attention module, placed at each bottleneck of a model where the downsampling of feature maps occurs. AS-YOLO [23] adds the CBAM after the fusion of different-scale feature maps in the PANet so as to enhance the fused features. Gao et al. [24] add a channel attention module (ECANet [25]) after all residual modules of CSPDarknet53 in YOLOv4; their module mainly consists of two parts, namely, dimensionless local cross-channel interaction and a one-dimensional convolution with an adaptively sized kernel. Chen et al. [26] construct a multilevel feature pyramid, use the attention model to obtain the salient features of different levels, and fuse them for SAR ship detection in multiscale and complex scenarios.
3. The Proposed Method
3.1. Work Flow
The whole procedure of FA-YOLO is shown in Figure 2. After pretraining, CycleGAN is used to generate enough infrared images, which are fed to the detection model for the final training. FA-YOLO consists of the dilated CBAM and the negative sample focusing (hard example mining) module, so it can detect small targets and suppress the confusing negative samples.

During the transfer learning process, we use UCAS-AOD as the pretraining dataset; it contains 510 visible vehicle images, which we augment to 3060 images through flipping and adding noise.
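As a rough illustration of such an augmentation pipeline (the exact transforms beyond flipping and noise are not specified here), the sketch below produces six variants per image, which matches the 510-to-3060 expansion; bounding-box labels would need the same flips applied.

```python
import numpy as np
import cv2

def augment(image):
    """Expand one image into six variants by flipping and adding Gaussian
    noise (510 images x 6 = 3060). Flipping the box labels accordingly is
    omitted here for brevity."""
    variants = [image, cv2.flip(image, 1), cv2.flip(image, 0)]  # original, h-flip, v-flip
    noisy = []
    for img in variants:
        noise = np.random.normal(0.0, 10.0, img.shape).astype(np.float32)
        noisy.append(np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8))
    return variants + noisy
```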
Then, we use the CycleGAN network to translate the VIVID-visible images into infrared images, as shown in Figure 3. Overall, the final infrared training dataset contains 500 images generated by CycleGAN from the VIVID dataset.
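At inference time, applying a trained CycleGAN generator to translate visible frames into the infrared style might look like the sketch below; the checkpoint name, tensor layout, and value range are assumptions for illustration only.

```python
import torch

# Hypothetical checkpoint of the visible-to-infrared CycleGAN generator,
# assumed to have been saved as a full module after training.
G_vis2ir = torch.load("cyclegan_vis2ir_generator.pth")
G_vis2ir.eval()

@torch.no_grad()
def to_infrared(visible_batch):
    """visible_batch: float tensor (N, 3, H, W), scaled to [-1, 1] as in CycleGAN."""
    return G_vis2ir(visible_batch)  # synthetic infrared images, same shape
```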

3.2. Negative Sample Focusing
As shown in Figure 1, the vehicles in the VIVID-infrared images we selected are heavily affected by the environment; the features of the vehicles are mixed with confusing background information, which is difficult even for human eyes to recognize. The complex background information can interfere with the detection model by causing too many false-positive examples. To mitigate the impact of the background information and suppress the damage of false positives, we revise the YOLOv4 model with a negative sample focusing mechanism that focuses on training the confusing negative samples and distinguishes the targets from the complex background.
After the NMS of the YOLOv4 model, the YOLO-head layer outputs several predicted boxes with location parameters (x, y, w, h) and class probabilities. Through calculation, we can obtain the IoU of each predicted box with respect to its corresponding target box. In general, for each predicted box, when the IoU exceeds the threshold (0.5), the prediction is correct; otherwise, the box should have been recognized as background but was falsely predicted as a target, that is to say, it is a negative sample. In a detection task, there can be so many such negative samples that they impact the performance of the model.
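For clarity, the sketch below shows how the IoU used in this selection can be computed; the corner-format conversion reflects the (x, y, w, h) center-format boxes output by the YOLO head. This is an illustrative implementation, not code from the paper.

```python
def xywh_to_xyxy(box):
    """Convert a YOLO-style (center x, center y, w, h) box to corner format."""
    x, y, w, h = box
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)
```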
In consequence, we need to revise the model and let it focus more on such negative samples. As shown in equation (1), the predicted boxes whose IoU is below the threshold are selected into the negative-sample set D_NS; these are the negative samples:

D_NS = {b_i | IoU(b_i, g_i) < 0.5},  (1)

where b_i is a predicted box and g_i is its corresponding ground-truth box. Figure 4 shows the negative sample focusing mechanism in the FA-YOLO model. In the training procedure, every time backpropagation is performed, the model takes the four location parameters (x, y, w, h) of the false positives (FP) and uses them to map the FPs to the corresponding areas in the layers before the multiheads (shown as the red areas). In theory, the locations in different layers correspond to each other through the convolution operation, and we can use the reverse convolution operation to find the location relationship between the shallow and deep layers. Then, we transfer them to the corresponding locations in the feature map output by CSPDarknet53 and optimize the model with these samples again.

Every time after normal training for n epochs, we select the first k samples in the negative-sample set D_NS, find the negative samples and their corresponding feature maps output by the backbone, put them through the forward-propagation operation, and optimize the loss values of the negative samples (NS). During this negative sample training optimization, to make the parameters optimize faster, we freeze the backbone parameters and only update the subsequent parameters.
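The following sketch illustrates one plausible form of this focusing step, reusing the iou helper above; model.backbone, model.head_loss, and the batch layout are hypothetical names, since the paper does not publish its code.

```python
import torch

IOU_THRESH = 0.5

def collect_negatives(pred_boxes, gt_boxes):
    """Keep predicted boxes whose best IoU with any ground truth is below
    the threshold; these form the negative-sample set D_NS of equation (1)."""
    return [b for b in pred_boxes
            if max((iou(b, g) for g in gt_boxes), default=0.0) < IOU_THRESH]

def focusing_step(model, hard_batch, optimizer):
    """One round of negative sample focusing: freeze the backbone and
    optimize only the subsequent layers on previously collected false positives."""
    for p in model.backbone.parameters():
        p.requires_grad = False            # freeze backbone parameters
    optimizer.zero_grad()
    loss = model.head_loss(hard_batch)     # hypothetical: loss over the FP regions
    loss.backward()
    optimizer.step()
    for p in model.backbone.parameters():
        p.requires_grad = True             # unfreeze for the next normal epochs
```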
3.3. Dilated CBAM
Given that infrared vehicle targets are small and their features are not clearly distinguishable from the background, it is not easy for the model to extract and preserve these features. Therefore, an attention mechanism comprising channel attention and spatial attention is added to the YOLOv4 network to enhance small targets and make the key features distinguishable.
Our attention module contains both channel attention and spatial attention. Given an input feature map F from the upper layer, the dilated CBAM sequentially infers a 1D channel attention map M_c and a 2D spatial attention map M_s, as illustrated in Figure 5. The overall attention process can be summarized as

F' = M_c(F) ⊗ F,
F'' = M_s(F') ⊗ F',  (2)

where ⊗ denotes element-wise multiplication and F'' is the refined output.
3.3.1. Channel Attention
In channel attention, we use the module from CBAM [21], which aggregates the spatial information of a feature map by using both average pooling and max pooling, generating two different spatial context descriptors: F_avg^c and F_max^c. Both descriptors are forwarded to a shared multilayer perceptron (MLP) to generate two channel attention maps, which are then added and activated by the sigmoid function to obtain the final channel attention map. The channel attention is computed as

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c))),  (3)

where σ is the sigmoid function and W_0 and W_1 are the shared MLP weights.
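A minimal PyTorch sketch of this channel attention module, following the CBAM formulation above; the reduction ratio of 16 is CBAM's default and an assumption here.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention as in CBAM: a shared MLP over the average- and
    max-pooled descriptors, merged by addition and a sigmoid."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),  # W_0
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),  # W_1
        )

    def forward(self, x):                                   # x: (N, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))    # MLP(F_avg^c)
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))     # MLP(F_max^c)
        return torch.sigmoid(avg + mx)                      # M_c(F): (N, C, 1, 1)
```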
3.3.2. Spatial Attention
In spatial attention, we change the convolutional layer in CBAM into a dilated convolution kernel to increase the receptive field so as to link the information of the targets and the background. However, Yu et al. [27] point out that dilated convolutions can cause gridding artifacts, which occur when a feature map has higher-frequency content than the sampling rate of the dilated convolution. To remove the gridding artifacts, we add two more dilated convolutional kernels with smaller dilation rates after the first one with a dilation rate of 4, as shown in the first row of Figure 5.
Firstly, we apply the average pooling and max pooling operations along the channel axis and concatenate the results to generate an efficient feature descriptor [F_avg^s; F_max^s]. Then, the descriptor is forwarded to the dilated convolution layers to generate the spatial attention map. The dilated convolution stack is composed of three 3 × 3 layers with dilation rates of 4, 2, and 1, respectively. In short, the spatial attention is computed as

M_s(F) = σ(f_1^{3×3}(f_2^{3×3}(f_4^{3×3}([F_avg^s; F_max^s])))),  (4)

where f_r^{3×3} denotes a 3 × 3 convolution with dilation rate r.
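A corresponding PyTorch sketch of the dilated spatial attention follows; the single-channel intermediate width of the stacked convolutions is an assumption, since the text only specifies the kernel size and dilation rates.

```python
class DilatedSpatialAttention(nn.Module):
    """Spatial attention with a stack of dilated 3x3 convolutions
    (rates 4, then 2, then 1), replacing CBAM's single convolution;
    the decreasing rates follow the anti-gridding design above."""
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(2, 1, 3, padding=4, dilation=4, bias=False),  # rate 4
            nn.Conv2d(1, 1, 3, padding=2, dilation=2, bias=False),  # rate 2
            nn.Conv2d(1, 1, 3, padding=1, dilation=1, bias=False),  # rate 1
        )

    def forward(self, x):                    # x: (N, C, H, W)
        avg = x.mean(dim=1, keepdim=True)    # F_avg^s: (N, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)     # F_max^s: (N, 1, H, W)
        return torch.sigmoid(self.convs(torch.cat([avg, mx], dim=1)))  # M_s(F)
```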
We plug the dilated CBAM after each residual block of CSPDarknet53, thus obtaining an attention-based CSPDarknet53 feature extraction network; each residual block together with its dilated CBAM forms a new basic unit of the attention-based CSPDarknet53.
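Combining the two sketches above, a basic unit of the attention-based backbone could look like the following; the wrapper class and its wiring are a sketch of the integration described in the text, not the authors' released code.

```python
class DilatedCBAM(nn.Module):
    """Sequential channel-then-spatial attention (dilated CBAM)."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = DilatedSpatialAttention()

    def forward(self, x):
        x = self.ca(x) * x   # F'  = M_c(F)  (*) F
        x = self.sa(x) * x   # F'' = M_s(F') (*) F'
        return x

class AttentionResidualUnit(nn.Module):
    """A CSPDarknet53 residual block followed by a dilated CBAM; `block`
    stands in for the original residual block module."""
    def __init__(self, block, channels):
        super().__init__()
        self.block = block
        self.attn = DilatedCBAM(channels)

    def forward(self, x):
        return self.attn(self.block(x))
```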
4. Experiment
4.1. Dataset and Environment
The pretraining dataset is the UCAS-AOD visible dataset, with a total of 3060 images after augmentation. The final infrared dataset contains 500 images generated by CycleGAN from the VIVID dataset and manually annotated. The testing dataset contains 100 infrared images from the VIVID-infrared dataset, in which the vehicles are heavily occluded and affected by confusing background information. The experiments are run on an RTX 2080Ti GPU.
4.2. Comparison Experiments
To verify the superiority of FA-YOLO, we conduct extensive comparison experiments. SSD, YOLOv3, faster R-CNN, and the original YOLOv4 model are trained and tested on the same dataset. Furthermore, we conduct an experiment to verify the efficiency of transfer learning: as a comparison, we train the YOLOv4 model on the infrared dataset alone, without transfer learning.
4.3. FA-YOLO Experiments
Finally, we apply the negative sample focusing mechanism and the dilated CBAM to the YOLOv4 model sequentially. For the negative sample focusing mechanism, after every 9 rounds of normal training, we select the first 120 negative samples for one round of focused training. As for the dilated CBAM, we add the module to CSPDarknet53; since the structure has changed, we first train the model on VOC-2007 for 1,000 epochs to obtain the weight file. Then, we keep all procedures and parameters consistent with those of the original experiments.
4.4. Experiment Results
The experiment results are shown in Table 1. The mean average precision (mAP) and F1 score are adopted as the metrics of detection accuracy, as shown in the following equations:

Precision = TP / (TP + FP),  (5)
Recall = TP / (TP + FN),  (6)
F1 = 2 × Precision × Recall / (Precision + Recall),  (7)

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives; the AP is the area under the precision-recall curve, and the mAP is the AP averaged over all classes.
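A small helper illustrating how these metrics follow from the TP/FP/FN counts; this is for illustration only (mAP additionally averages the area under the precision-recall curve over classes).

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from detection counts, per equations (5)-(7)."""
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    return precision, recall, f1
```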
From the table and the curves in Figure 6, it can be concluded that our FA-YOLO achieves the highest mAP and F1 score among all models. With the transfer learning strategy, the mAP improves by 11.1% and the F1 score by 9.58%; with the negative sample focusing mechanism, the mAP improves by 12.68% and the F1 score by 10.06%. After adding the dilated CBAM to YOLOv4, the mAP improves by 13.71% to 92.95% and the F1 score by 10.21% to 88.13%.

Figure 7 shows part of the detection results on the FA-YOLO testing set, from which we conclude that the attention module detects small, weak, and occluded targets well. Figure 8 presents the heat-map explanation of CSPDarknet53 with the dilated CBAM; we use Grad-CAM [28] to visualize the output of the backbone for an input image. The attention module focuses on the target information and filters the background information well for most targets, but some confusing background information remains that may mislead the detection model.


Figure 9 shows the comparative detection results of SSD, faster R-CNN, and FA-YOLO on testing images. The blue boxes are the ground-truth boxes (GT), the green boxes are the true positives detected by the models (TP), and the red boxes are background regions falsely recognized as positive samples by the models (FP), in other words, the negative samples. According to formulas (5), (6), and (7), the FPs decrease the detection accuracy, while the TPs are what we really need. Moreover, the comparison across the three rows of images indicates that the baseline models cannot distinguish the confusing background information correctly, while FA-YOLO handles this problem well.

5. Conclusion
In this paper, the FA-YOLO model is proposed for infrared occluded vehicle detection against wild, complex backgrounds, where confusing background information greatly impacts target detection. By using GAN and transfer learning, our model has a sufficient dataset for training and optimization. The negative sample focusing mechanism used during training mitigates the influence of complex background information and occlusion, making the model more accurate at distinguishing targets from the background. Finally, by plugging the attention module into CSPDarknet53, YOLOv4 enhances the features of small targets so as to improve detection accuracy. Through extensive experimental verification and comparison, the detection accuracy (mAP) on the VIVID-infrared occluded vehicles improves by 13.71% and the F1 score increases by 10.21%, which shows a significant improvement and the superiority of the proposed model.
Data Availability
The experiment results and algorithm codes used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.