Abstract
Undercarriage devices are essential components of an aeroplane, and accurately detecting whether the undercarriage is operating normally can effectively help avoid aeroplane accidents. Existing aeroplane undercarriage detection methods suffer from low automation and low accuracy on small targets. To address these problems, an improved YOLO V4 algorithm for aeroplane undercarriage detection is proposed. First, the Inception-ResNet convolutional structure is integrated into the CSPDarkNet53 framework to improve the algorithm's ability to extract semantic information from target features. Then, an attention mechanism is added to the path aggregation network to strengthen the importance and relevance of different features after the concatenation operations. In addition, an aeroplane and undercarriage dataset was constructed, and the partitioned test set was used to compare the improved algorithm with the Faster R-CNN, YOLO V3, and YOLO V4 object detection algorithms. The experimental results show that, compared with YOLO V4, the improved algorithm significantly improves the recall and the mean average precision of small-target detection on our dataset, which verifies the rationality and effectiveness of the proposed improvements.
1. Introduction
Take-off and landing are among the most critical phases of flight: more than 50% of aeroplane safety accidents occur during take-off and landing [1]. During these phases, whether the undercarriage extends and retracts normally is crucial to flight safety. At present, the deployment status of the undercarriage is reported to the pilot only through the aeroplane's internal signals [2]. However, because electronic components are not 100% reliable, a faulty signal can lead the pilot to misjudge whether the undercarriage is deployed, in which case the aeroplane cannot land safely. It is therefore necessary to observe the undercarriage externally to confirm that it has deployed normally and that the aeroplane can land safely.
At present, machine vision and deep learning are developing rapidly and are widely used in various industries, such as face recognition [3–7], defect detection [8–10], remote sensing image detection [11–14], and driverless cars [15–18]. Machine vision serves as the image acquisition method, and deep learning algorithms are then used to identify and detect objects. The YOLO V4 [19] algorithm is an upgraded version of YOLO [20], YOLO V2 [21], and YOLO V3 [22], and it is one of the recently proposed object detection algorithms that use a regression network to perform detection. Unlike traditional algorithms based on candidate regions, such as R-CNN [23], Fast R-CNN [24], and Faster R-CNN [25], the YOLO series are one-stage algorithms. Although one-stage algorithms are generally less accurate than two-stage algorithms, which first generate candidate regions and then perform detection, they are much faster. This is because the YOLO series merges candidate region generation and detection into a single stage and treats detection directly as a regression problem, enabling end-to-end prediction.
However, actual tests show that YOLO V4 does not detect some small targets well enough, and false and missed detections still occur frequently [26]. In addition, because the convolutional network performs multiple downsampling operations, the image resolution decreases progressively, so small targets can be lost. Reference [27] found that replacing the subsampling convolution layers with dilated (expanded) convolution layers changes the convergence result by about 3% in favour of the dilated layers, which shows that subsampling is indeed not conducive to CNN learning. Downsampling has the greatest impact on small objects because it can discard their feature values, so they are subsequently not detected.
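To make the downsampling argument concrete, the sketch below computes how many feature-map cells a square object covers at each stride and how strongly its response is diluted by average pooling. The 16-pixel object and the use of non-overlapping average pooling are illustrative assumptions, not settings from this paper; the YOLO V4 backbone itself downsamples with strided convolutions.

```python
def footprint(object_px: float, stride: int) -> float:
    """Side length, in feature-map cells, of a square object after downsampling by `stride`."""
    return object_px / stride


def diluted_peak(object_px: float, stride: int) -> float:
    """Peak response of a unit-intensity square object on a zero background after
    non-overlapping average pooling with window = stride.

    Assumes the object is aligned to the pooling grid: if it is smaller than one
    window it falls inside a single window; otherwise some window lies fully inside it.
    """
    return min(1.0, (object_px / stride) ** 2)


for s in (8, 16, 32):  # strides of the P3, P4 and P5 feature layers
    print(f"stride {s:2d}: {footprint(16, s):.2f} cells, peak {diluted_peak(16, s):.2f}")
```

Under these assumptions a hypothetical 16 × 16 pixel undercarriage still fills a whole cell at stride 16, but at stride 32 it covers only half a cell and its peak response falls to a quarter of its original value.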
Several methods have recently been proposed to address the loss of image features caused by CNN downsampling and its effect on detection. Liu et al. [28] proposed IPG-Net (Image Pyramid Guidance Network), which is based on FPN [29] (Feature Pyramid Networks). It contains an image pyramid guidance (IPG) subnetwork, a fusion module, and an image-pyramid-based backbone, and it effectively alleviates both the shortage of spatial information in deeper convolution layers and the loss of small targets. Kim et al. [30] proposed a parallel feature pyramid network based on SPP-Net [31] (Spatial Pyramid Pooling in deep convolutional networks for visual recognition), which widens the network rather than deepening it, so as to improve the detection of large, medium, and small targets. Although these improvements enhance small target detection to a certain extent, they do not jointly consider widening the network rather than deepening it, extracting features at multiple scales, and assigning different weights to features according to their importance.
To improve the detection performance for aeroplanes and undercarriages, this paper studies undercarriage detection with an improved YOLO V4 algorithm. First, after the backbone features are output from different feature layers of the network, the Inception-ResNet [32] structure is added to perform multi-scale feature extraction and improve the network's ability to extract small-object features. Second, the attention mechanism [33] is added after the PANet (Path Aggregation Network) [34] structure to strengthen the importance and relevance of different features after the concatenation operation. Finally, the partitioned test set is used to evaluate the detection performance of the algorithm under the different improvements, and the performance is compared with the commonly used Faster R-CNN, YOLO V3, and YOLO V4 algorithms.
The innovations and main contributions of this paper are as follows:
(1) To effectively improve small target detection, an improved YOLO V4 algorithm based on multi-scale feature processing and an attention mechanism is proposed.
(2) According to the different detection targets, the Inception-ResNet convolutional structure, serving as a multi-scale feature extraction module, and the attention mechanism are introduced.
(3) The influence of the number of pixels occupied by a target on the feature extraction ability of the proposed network is analyzed.
(4) A complete set of comparative experiments is designed to test the detection performance of the improved algorithm.
2. Attention Mechanism and Multi-Scale Feature Processing
2.1. Attention Mechanism
A convolutional neural network can extract relatively rich object feature information, but it also extracts much irrelevant information, such as background content and pixel variations caused by different lighting. The concatenation operation used by YOLO V4 simply stacks the feature information along the channel dimension, so the semantic information and importance of the different features are not effectively associated, and the processed features cannot describe the object accurately.
To make the feature maps carry more useful feature information, this paper uses the squeeze-and-excitation (SE) block [33], an attention-based structural unit that assigns different weights to different channels of the feature map. The unit consists of three components: squeeze, excitation, and scaling (Figure 1).

The squeeze part uses global average pooling to compress each H × W feature map into a single value, producing a 1 × 1 × C output for C channels. The excitation part then learns a weight for each channel from this global channel information. The squeeze operation is calculated as follows:
$$ z_c = F_{sq}(u_c) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i, j) \qquad (1) $$
In formula (1), $z_c$ is the value of the $c$-th feature map after global average pooling, and the output format is 1 × 1 × C; $W$ and $H$ are the width and height of the feature maps; $u_c(i, j)$ is the pixel value at position $(i, j)$ in the $c$-th feature map.
In the excitation part, two fully connected layers are used: the first reduces the channel dimension and is followed by the ReLU activation function, and the second restores the original channel dimension and is followed by the Sigmoid activation function, which normalizes each weight to the range (0, 1). Through backpropagation over the whole dataset, the fully connected layers learn the information of all channels, as well as the rules for strengthening and weakening each channel. The excitation operation is calculated as follows:
$$ s = F_{ex}(z, W) = \sigma\left(W_2\, \delta(W_1 z)\right) \qquad (2) $$
In formula (2), $s$ is the weight of each channel, and the output format is 1 × 1 × C; $\delta$ is the ReLU activation function; $\sigma$ is the Sigmoid activation function; $W_1$ is the weight kernel of dimension $(C/r) \times C$; $W_2$ is the weight kernel of dimension $C \times (C/r)$, where $r$ is the reduction (scaling) parameter, set to 16 here [33].
In the scaling part, each channel of the feature maps is multiplied by the corresponding weight from the excitation part to obtain feature maps with different channel weights. The calculation is as follows:
$$ \tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c \qquad (3) $$
In formula (3), $\tilde{x}_c$ is the updated $c$-th feature map.
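The squeeze, excitation, and scaling steps of formulas (1) to (3) can be sketched in NumPy as below. This is a minimal forward pass under stated assumptions: the weights are random placeholders rather than trained values, biases are omitted, and the channel count C = 64 is an arbitrary example (r = 16 follows the text).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def se_block(u, w1, w2):
    """Squeeze-and-excitation on one feature map u of shape (H, W, C).

    w1: (C // r, C) reduction weights and w2: (C, C // r) expansion weights,
    matching the kernel shapes in formula (2); biases are omitted.
    """
    # Squeeze, formula (1): global average pooling over H and W -> shape (C,)
    z = u.mean(axis=(0, 1))
    # Excitation, formula (2): FC -> ReLU -> FC -> Sigmoid -> shape (C,)
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))
    # Scale, formula (3): multiply every channel by its own scalar weight
    return u * s, s


rng = np.random.default_rng(0)
C, r = 64, 16  # r = 16 as in the text; C = 64 is a hypothetical width
u = rng.standard_normal((13, 13, C))
w1 = rng.standard_normal((C // r, C)) * 0.1  # untrained placeholder weights
w2 = rng.standard_normal((C, C // r)) * 0.1
x_tilde, s = se_block(u, w1, w2)
```

Because the Sigmoid keeps every weight in (0, 1), the block can only attenuate channels relative to one another; which channels are emphasized is decided entirely by the learned $W_1$ and $W_2$.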
2.2. Feature Multi-Scale Processing
In a convolutional neural network, recognition accuracy is usually improved by increasing the depth of the convolutional layers. However, as the network becomes deeper, the number of intermediate parameters and the computation time increase, and problems such as exploding gradients may occur. Because the undercarriage features of an aeroplane are easily lost or indistinct, this paper follows the design idea of increasing both network width and depth used in GoogLeNet [32] and YOLO V4 and introduces the Inception-ResNet [32] structural unit (Figure 2).

This structural unit consists of four parallel convolution branches, and the input feature maps are processed by each of them. The first branch passes the input to the total feature fusion layer through a ResNet [35] shortcut. The residual network (ResNet) proposed by He et al. in 2016 keeps the model deep while using residual blocks and batch normalization layers to effectively alleviate problems such as vanishing and exploding gradients. A residual block fuses its input features with the features obtained after two or three convolution operations, which solves the problem of model degradation. The general form of a residual block is shown in Figure 3, where $x$ is the input of the residual block and $y$ is the output after the two-layer convolution operation, which can be expressed as
$$ y = F(x, \{W_1, W_2\}) + W_s x \qquad (4) $$

In formula (4), $F(\cdot)$ is the residual mapping learned by the two convolution layers, $W_1$ and $W_2$ are the weight parameters of the convolution layers, and $W_s$ is the weight parameter of the transformation from input to output (an identity mapping when the input and output dimensions match).
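Formula (4) can be sketched as follows. This is a simplified illustration, not the paper's layer: matrix products stand in for the two convolution layers, and the residual mapping is assumed to be $F(x) = W_2\,\mathrm{ReLU}(W_1 x)$, the form used by He et al.; all weights are random placeholders.

```python
import numpy as np


def residual_block(x, w1, w2, ws=None):
    """y = F(x, {W1, W2}) + Ws x, formula (4).

    F is sketched as w2 @ relu(w1 @ x), with matrix products standing in for
    the two convolution layers. If ws is None the shortcut is the identity,
    which requires F(x) and x to have the same dimension.
    """
    f = w2 @ np.maximum(w1 @ x, 0.0)
    shortcut = x if ws is None else ws @ x
    return f + shortcut


rng = np.random.default_rng(1)
x = rng.standard_normal(8)
# Identity shortcut: input and output both have 8 features.
y_id = residual_block(x, rng.standard_normal((8, 8)), rng.standard_normal((8, 8)))
# Projection shortcut: the output has 4 features, so Ws maps 8 -> 4.
y_proj = residual_block(x, rng.standard_normal((8, 8)), rng.standard_normal((4, 8)),
                        ws=rng.standard_normal((4, 8)))
```

If the learned mapping $F$ collapses to zero, the block still passes $x$ through unchanged, which is why stacking such blocks does not degrade a deep model.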
The second branch is passed to the partial feature fusion layer after one convolution operation. The third branch is passed to the partial feature fusion layer after two successive convolution operations, and the fourth after three successive convolution operations (see Figure 2). The output feature maps of branches 2, 3, and 4 are then fused by a convolution at the partial feature fusion layer. Finally, the output of the first branch and the output of the partial feature fusion layer are fused at the total feature fusion layer.
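The four-branch unit described above can be sketched in NumPy as follows. The kernel sizes (1 × 1; 1 × 1 then 3 × 3; 1 × 1, 3 × 3, 3 × 3) and the 1 × 1 fusion convolution are assumptions borrowed from the Inception-ResNet-A layout of [32], not values confirmed by this paper; the branch width and the random weights are likewise hypothetical.

```python
import numpy as np


def conv2d(x, w):
    """Stride-1, zero-padded ('same') convolution.

    x: (H, W, Cin); w: (k, k, Cin, Cout) with odd k. Returns (H, W, Cout).
    """
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    h, wd, _ = x.shape
    out = np.zeros((h, wd, w.shape[3]))
    for i in range(k):
        for j in range(k):
            # Add the contribution of kernel tap (i, j) at every spatial position.
            out += np.tensordot(xp[i:i + h, j:j + wd, :], w[i, j], axes=([2], [0]))
    return out


def relu(x):
    return np.maximum(x, 0.0)


def inception_resnet_block(x, rng, c_branch=8, scale=0.1):
    """Four-branch Inception-ResNet unit with random (untrained) weights.

    Kernel sizes follow Inception-ResNet-A of [32] and are an assumption.
    """
    cin = x.shape[2]

    def w(k, ci, co):
        return rng.standard_normal((k, k, ci, co)) * scale

    # Branch 2: one convolution (1x1 assumed).
    b2 = relu(conv2d(x, w(1, cin, c_branch)))
    # Branch 3: two convolutions (1x1 then 3x3 assumed).
    b3 = relu(conv2d(relu(conv2d(x, w(1, cin, c_branch))), w(3, c_branch, c_branch)))
    # Branch 4: three convolutions (1x1, 3x3, 3x3 assumed).
    b4 = relu(conv2d(x, w(1, cin, c_branch)))
    b4 = relu(conv2d(b4, w(3, c_branch, c_branch)))
    b4 = relu(conv2d(b4, w(3, c_branch, c_branch)))
    # Partial feature fusion: concatenate branches 2-4, then a 1x1 convolution
    # projects back to cin channels so the shortcut addition is shape-compatible.
    partial = conv2d(np.concatenate([b2, b3, b4], axis=2), w(1, 3 * c_branch, cin))
    # Total feature fusion: branch 1 is the identity (ResNet) shortcut.
    return relu(x + partial)


rng = np.random.default_rng(0)
x = rng.standard_normal((13, 13, 16))
y = inception_resnet_block(x, rng)
```

The parallel branches see the same input through receptive fields of different sizes, which is how the unit widens the network and extracts multi-scale features without simply stacking more layers.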
3. Improvement of YOLO V4 Network Architecture
YOLO V4 is an improved version of YOLO V3. First, the backbone feature extraction network is upgraded from Darknet53 to CSPDarknet53 (Figure 4), which contains five CSP [36] (Cross Stage Partial connections) modules and uses the Mish [37] activation function instead of LeakyReLU. Second, the SPP-Net module is introduced to extract the most salient context features from the deepest feature layer, effectively enlarging the receptive field without noticeably affecting processing speed. PANet then replaces the original FPN as the neck network, performing multi-path feature fusion while extracting more important context features. Finally, the YOLO V3 head is retained as the detection head, as shown in Figure 4.
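The activation change mentioned above can be compared numerically with the sketch below. Mish is defined as $x \tanh(\mathrm{softplus}(x))$ [37]; the LeakyReLU slope of 0.1 is an assumption (the value commonly used in Darknet-based YOLO models), and the sample inputs are arbitrary.

```python
import math


def leaky_relu(x: float, alpha: float = 0.1) -> float:
    """LeakyReLU with negative slope alpha (0.1 assumed here)."""
    return x if x > 0 else alpha * x


def softplus(x: float) -> float:
    # log(1 + e^x), rewritten to avoid overflow for large |x|
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish [37]: x * tanh(softplus(x)); smooth and non-monotonic near zero."""
    return x * math.tanh(softplus(x))


for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{v:5.1f}  leaky={leaky_relu(v):7.4f}  mish={mish(v):7.4f}")
```

Unlike LeakyReLU, Mish is differentiable everywhere and lets small negative values through with a smooth, bounded response, which is the property usually cited for its use in the CSPDarknet53 backbone.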

YOLO V4 uses CSPDarknet53 as the backbone feature extraction network, which outputs the P3, P4, and P5 feature layers with spatial resolutions of 1/8, 1/16, and 1/32 of the input image size, respectively. The P3 and P4 feature layers each pass through a convolution operation before entering the PANet structure for feature fusion, while the P5 feature layer passes through three successive convolution operations before entering the SPP structure for max pooling. To a certain extent, these methods can extract the features of aeroplanes and undercarriages of different sizes and types. However, because the undercarriage is small relative to the aeroplane, its features are easily lost, and using YOLO V4 directly for training and detection gives relatively poor results.
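The strides and the SPP step described above can be sketched as follows. The 416 × 416 input size is a hypothetical example (the paper's input size is not stated here), and the pooling kernels 5, 9, and 13 are those of the original YOLO V4 SPP block [19]; the channel order of the concatenation is arbitrary.

```python
import numpy as np


def feature_map_sizes(input_size: int, strides=(8, 16, 32)):
    """Spatial sizes of P3, P4, P5 for a square input (input_size divisible by 32)."""
    return [input_size // s for s in strides]


def max_pool_same(x, k):
    """Stride-1 max pooling with 'same' padding; x has shape (H, W, C), k is odd."""
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)), constant_values=-np.inf)
    h, w, _ = x.shape
    out = np.full(x.shape, -np.inf)
    for i in range(k):
        for j in range(k):
            out = np.maximum(out, xp[i:i + h, j:j + w, :])
    return out


def spp(x, kernels=(5, 9, 13)):
    """SPP block: concatenate the input with max-pooled copies along channels."""
    return np.concatenate([x] + [max_pool_same(x, k) for k in kernels], axis=2)


sizes = feature_map_sizes(416)  # hypothetical 416 x 416 input
p5 = np.random.default_rng(0).standard_normal((sizes[2], sizes[2], 8))
pooled = spp(p5)
```

Because the pooling is stride 1 with padding, SPP keeps the 13 × 13 grid while quadrupling the channels, and the 13 × 13 kernel lets each cell see the whole P5 map, which is how it enlarges the receptive field cheaply.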
In this paper, the Inception-ResNet structure is added to the convolutional layers that process the P3 and P4 feature layers output by CSPDarkNet53, and the results are then passed to PANet for the next step, as shown in the yellow box in Figure 4. Improving these two feature layers, on the one hand, increases the width, depth, and complexity of the network and strengthens feature extraction; on the other hand, it yields a larger receptive field and more global, higher-level semantic feature information, which improves the success rate of small-object feature extraction. In addition, the squeeze-and-excitation block is applied to the three feature maps of different scales output by PANet. This adds only a small number of parameters and little computation while effectively improving performance. This part of the framework is shown in the orange box in Figure 4.
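The claim that the SE blocks add few parameters can be checked with a quick count. The sketch assumes the three PANet outputs have 128, 256, and 512 channels (the widths of the standard YOLO V4 neck), r = 16 as in formula (2), and biases on both fully connected layers; a single bias-free 3 × 3 convolution with 512 input and output channels is used for comparison.

```python
def se_params(c: int, r: int = 16, bias: bool = True) -> int:
    """Parameters of one SE block: FC C -> C/r and FC C/r -> C (formula (2))."""
    hidden = c // r
    n = c * hidden + hidden * c
    if bias:
        n += hidden + c
    return n


def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a k x k convolution (bias-free, as when followed by batch norm)."""
    return k * k * c_in * c_out


neck_channels = (128, 256, 512)  # assumed widths of the three PANet outputs
se_total = sum(se_params(c) for c in neck_channels)
one_conv = conv_params(3, 512, 512)
print(se_total, one_conv, f"{se_total / one_conv:.1%}")
```

Under these assumptions the three SE blocks add 43,960 parameters in total, under 2% of one 3 × 3 × 512 × 512 convolution.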
4. Network Training
4.1. Training Platform
The experiments in this paper were run on the Windows 10 operating system. The hardware configuration is as follows: CPU: Intel(R) Core(TM) i7-10700K CPU @ 3.80 GHz; GPU: NVIDIA GeForce RTX 2070. The programming environment is PyCharm 2020, and the deep learning framework is Keras.
4.2. Training Dataset
The data used in this experiment consist of aeroplane images from the PASCAL VOC 2007 dataset and aeroplane images collected in the field at an airport. The images contain aeroplanes and undercarriages. Because the amount of data is limited, this paper applies exposure adjustment to produce normal-exposure, overexposed, and underexposed versions of the images, simulating normal weather, strong light at noon, and low light at dusk, respectively, as shown in Figure 5. After image enhancement, the dataset contains 1440 images: 1296 in the training set and 144 in the test set. Because the images collected at the airport differ in size from the aeroplane images in the PASCAL VOC 2007 dataset, all images were resized to a uniform size before training. The labelImg software is used to label the aeroplanes and undercarriages in the images, and information such as the image path, the labels, and the coordinates of the labelled areas is stored in XML files.
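The two data-preparation steps above, exposure adjustment and reading labelImg annotations, can be sketched as follows. The linear gain model, the gains 1.5 and 0.5, the class names, and the sample file name are all hypothetical; labelImg writes PASCAL VOC-style XML with `object`, `name`, and `bndbox` elements, which is what the parser assumes.

```python
import xml.etree.ElementTree as ET

import numpy as np


def adjust_exposure(img, gain):
    """Scale pixel intensities by `gain` and clip to [0, 255] (uint8 image)."""
    return np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)


def exposure_variants(img, over=1.5, under=0.5):
    """Normal, overexposed (noon) and underexposed (dusk) copies; gains are hypothetical."""
    return [img, adjust_exposure(img, over), adjust_exposure(img, under)]


def read_voc_boxes(xml_text):
    """Parse a labelImg (PASCAL VOC format) annotation into (name, box) pairs."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        box = tuple(int(float(bb.find(t).text)) for t in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.find("name").text, box))
    return boxes


# Hypothetical annotation for illustration only.
SAMPLE = """<annotation>
  <filename>plane_0001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object><name>aeroplane</name>
    <bndbox><xmin>100</xmin><ymin>120</ymin><xmax>420</xmax><ymax>260</ymax></bndbox>
  </object>
  <object><name>undercarriage</name>
    <bndbox><xmin>210</xmin><ymin>240</ymin><xmax>232</xmax><ymax>262</ymax></bndbox>
  </object>
</annotation>"""
```

A simple gain with clipping keeps the bounding boxes valid for every variant, since exposure changes do not move pixels; real exposure simulation may instead use gamma curves.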

4.3. Training Parameter Setting
Training parameters must be set before training. In this experiment, the batch size was set to 8 and the number of training epochs to 100. A cosine annealing learning rate schedule is adopted, in which the learning rate first increases and then decreases: it rises linearly from the initial learning rate to the maximum learning rate, then decays along a cosine curve to the minimum learning rate, and this cycle is repeated. Each epoch contains 145 iterations, giving 14,500 iterations over the whole training process. After 100 epochs the model converges, with the loss converging to 7.415 on the training set and 7.036 on the test set. The change in the loss during training is shown in Figure 6.
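The schedule described above can be sketched as a function of the global step. Only the step counts (145 per epoch, 14,500 in total) come from the text; the learning-rate values, the two-epoch warm-up, and the single-cycle setting are hypothetical placeholders, and restarts are modelled by taking the step modulo the cycle length.

```python
import math


def warmup_cosine_lr(step, cycle_steps, warmup_steps, lr_init, lr_max, lr_min):
    """Linear warm-up from lr_init to lr_max, then cosine decay to lr_min,
    repeated every `cycle_steps` steps (requires cycle_steps > warmup_steps)."""
    t = step % cycle_steps
    if t < warmup_steps:
        return lr_init + (lr_max - lr_init) * t / warmup_steps
    progress = (t - warmup_steps) / (cycle_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


STEPS_PER_EPOCH, EPOCHS = 145, 100  # from the text: 14,500 steps in total
TOTAL = STEPS_PER_EPOCH * EPOCHS
LR_INIT, LR_MAX, LR_MIN = 1e-5, 1e-3, 1e-6  # hypothetical values
WARMUP = 2 * STEPS_PER_EPOCH  # hypothetical two-epoch warm-up
CYCLE = TOTAL  # one cycle over the whole run; a smaller value gives restarts

schedule = [warmup_cosine_lr(s, CYCLE, WARMUP, LR_INIT, LR_MAX, LR_MIN) for s in range(TOTAL)]
```

The warm-up avoids large, unstable updates while the randomly initialized layers settle, and the cosine tail lowers the learning rate smoothly so the model can settle into a minimum at the end of each cycle.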

4.4. Evaluation Indicators of Model
In deep learning object detection, the mAP (mean Average Precision) (8) is generally calculated from the precision (5) and recall (6) of the model on the test set, and the AP (Average Precision) (7) evaluates the accuracy of the model on a single detection category. The larger the mAP value, the higher the overall detection accuracy and the better the model's predictions. The mAP is used here as the comprehensive evaluation index of detection accuracy. The calculation equations are as follows:
$$ P = \frac{TP}{TP + FP} \qquad (5) $$
In formula (5), $P$ is the precision; $TP$ is the number of samples that are actually objects and are correctly predicted as objects, namely the number of correct predictions; $FP$ is the number of samples that are actually not objects but are predicted as objects, namely the number of false predictions.
$$ R = \frac{TP}{TP + FN} \qquad (6) $$
In formula (6), $R$ is the recall; $FN$ is the number of samples that are actually objects but are not detected, namely the number of missed detections.
$$ AP = \sum_{k=1}^{N} P(k)\, \Delta R(k) \qquad (7) $$
In formula (7), $N$ is the number of detection results on the test set, ranked by confidence; $P(k)$ is the precision over the top $k$ results; $\Delta R(k)$ is the change in recall from the $(k-1)$th to the $k$th result.
$$ mAP = \frac{1}{C} \sum_{i=1}^{C} AP_i \qquad (8) $$
In formula (8), $C$ is the number of object classes.
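Formulas (5) to (8) can be computed as in the sketch below. It assumes each detection has already been labelled a true or false positive, for example by the common IoU ≥ 0.5 matching rule (the IoU helper is included for that step), and it follows formula (7) literally, without the precision-envelope interpolation used by the official PASCAL VOC toolkits, so values may differ slightly from VOC-style evaluation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, num_gt):
    """AP per formula (7): sum of P(k) * delta R(k) over detections ranked by confidence.

    detections: list of (confidence, is_true_positive) for one class.
    num_gt: number of ground-truth boxes of the class (equals TP + FN).
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    prev_recall = ap = 0.0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)  # formula (5) over the top-k results
        recall = tp / num_gt        # formula (6)
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP per formula (8): the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, three ranked detections (TP, FP, TP) against three ground-truth boxes give AP = 1 × 1/3 + 1/2 × 0 + 2/3 × 1/3 = 5/9, because a false positive lowers precision but adds nothing to recall.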
5. Results and Discussion
5.1. Results
The proposed improvements to YOLO V4 were compared with Faster R-CNN, YOLO V3, and YOLO V4. In addition, to test the two proposed improvements, a set of ablation experiments was added to evaluate the effect of each improvement individually. The overall comparative results are shown in Table 1.
Table 1 shows the overall detection performance of the different algorithms on our dataset.
First, Table 1 shows that both proposed improvements raise the detection performance of YOLO V4. When both improvements are used together, the mAP is the highest, 6.18% higher than that of YOLO V4. As Figures 7–9 show, for the aeroplane class, AP increased by about 2%, precision by about 5%, and recall by about 11%; for the undercarriage class, AP increased by about 10%, precision decreased by about 5%, and recall increased by about 16%. These results verify the rationality and effectiveness of the proposed improvements.


Second, compared with the mainstream two-stage detection network Faster R-CNN, the improved algorithm achieves clear gains in both detection accuracy and detection speed: mAP increases by 4.05%, and detection time decreases by 0.3254 s. Because the anchors of Faster R-CNN are predefined, they cannot be adjusted freely to the size of the object when the object is small, which leads to frequent missed detections. In addition, Faster R-CNN is a two-stage algorithm, and generating candidate regions first costs extra time. Faster R-CNN therefore has a higher mAP than YOLO V4 but a slower detection speed, whereas our algorithm combines high mAP with fast detection.
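The two comparisons above also fix the gap between Faster R-CNN and YOLO V4. Assuming the reported mAP gains are absolute percentage points measured on the same test set, subtracting one gain from the other gives

$$ \mathrm{mAP}_{\mathrm{FRCNN}} - \mathrm{mAP}_{\mathrm{V4}} = (\mathrm{mAP}_{\mathrm{ours}} - \mathrm{mAP}_{\mathrm{V4}}) - (\mathrm{mAP}_{\mathrm{ours}} - \mathrm{mAP}_{\mathrm{FRCNN}}) = 6.18 - 4.05 = 2.13 $$

which is consistent with Faster R-CNN having a higher mAP than YOLO V4. Likewise, combining the 0.3254 s reduction relative to Faster R-CNN with the 0.0039 s increase relative to YOLO V4 reported in the Conclusions gives $t_{\mathrm{FRCNN}} - t_{\mathrm{V4}} = 0.3254 + 0.0039 = 0.3293$ s.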
5.2. Discussion
Figure 10 compares the detection results of our algorithm and the YOLO V4 algorithm on some images from the test set.

In the field-collected images of aeroplanes landing and taking off, the aeroplane is large and the undercarriage is relatively small. When the aeroplane and undercarriage appear against a simple background, occupy a relatively large proportion of the image, and the image is large, the improved algorithm and YOLO V4 perform essentially the same: both detect the positions of the aeroplane and undercarriage well (Figure 10(a)).
When the aeroplanes and undercarriages in the image are small (Figure 10(b)), the image size is 418 × 482 pixels, and the upper and lower aeroplanes measure 204 × 79 pixels and 104 × 47 pixels, accounting for about 7% and 3% of the image, respectively. Both our algorithm and YOLO V4 detect the aeroplanes well. The left undercarriage in the upper part of Figure 10(b) occupies 55 pixels, about 0.02% of the image (Figure 11), which makes it a small object. The proposed algorithm detects this object effectively, but YOLO V4 fails to detect it. The middle undercarriage in the upper part of Figure 10(b) occupies 26 pixels, about 0.01% of the image (Figure 12), and neither our algorithm nor YOLO V4 can detect it.


When the image contains many aeroplanes and undercarriages, the background is relatively complex, and the objects occupy a small proportion of the image, YOLO V4 misses a relatively large number of objects, whereas our algorithm detects the aeroplanes and undercarriages more accurately. Compared with YOLO V4, our algorithm detects essentially the same number of aeroplanes but seven more undercarriages (Figure 10(c)).
6. Conclusions
In this paper, an improved detection algorithm based on the YOLO V4 network is proposed that combines an attention mechanism with multi-scale feature processing to achieve accurate automatic detection of aeroplanes and undercarriages. First, to improve the network's ability to extract small-target features, the convolutional layers following the P3 and P4 feature layers output by CSPDarkNet53 are optimized with the Inception-ResNet structure for multi-scale feature processing. Second, the attention mechanism is added after the path aggregation network to strengthen the importance and relevance of different features after the concatenation operation. Finally, an aeroplane and undercarriage dataset is constructed, and the partitioned test set is used to evaluate the detection performance of the algorithm under the different improvements. On our dataset, the overall experimental results are better than those of YOLO V3 and Faster R-CNN. Compared with YOLO V4, the improved algorithm detects aeroplanes and undercarriages better: mAP improves by 6.18%; for aeroplanes, AP improves by about 2%, precision by about 5%, and recall by about 11%; for undercarriages, AP improves by about 10% and recall by about 16%. The limitations are that small target detection has not improved enough and that detection speed has decreased, with detection time increasing by 0.0016 s compared with YOLO V3 and by 0.0039 s compared with YOLO V4. Future work will focus on reducing the computation of the algorithm and increasing detection speed without reducing detection performance.
Data Availability
The partial data used to support the findings of this study can be found in the online versions at https://host.robots.ox.ac.uk/pascal/VOC/voc2007/devkit_doc_07-Jun-2007.pdf. The other data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 12002115), the Natural Science Foundation of Hebei Province (No. F2020402005), the Natural Science Foundation of Hebei Province (No. F 2021402011), the Science and Technology Research Projects of Colleges and Universities in Hebei Province (No. ZD2018207), the Scientific and Technological Research Projects of Universities in Hebei Province (No. QN2020181), and the Science and Technology Research and Development Project of Handan City (No.19422101008-8).