Abstract
Object detection has become a crucial technology in intelligent vision systems, enabling the automatic detection of target objects. While most detectors perform well on open datasets, they often struggle with small-scale objects. This is because traditional top-down feature fusion weakens the semantic and location information of small objects, leading to poor classification performance. To address this issue, we propose a novel feature pyramid network, the adaptive learnable feature pyramid network (ALFPN). Our approach features an adaptive feature inspection that incorporates learnable fusion coefficients into the fusion of different feature levels, helping the network learn features with less noise. In addition, we construct a context-aligned supervisor that adjusts the feature maps fused at different levels to avoid scaling-related offset effects. Our experiments demonstrate that our method achieves state-of-the-art results and is highly robust for small object detection on the TT-100K, PASCAL VOC, and COCO datasets. These findings indicate that a model's ability to extract discriminative features is positively correlated with its performance in detecting small objects.
1. Introduction
Artificial intelligence studies and develops the theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. With its development, its application fields have expanded widely [1, 2]. Among them, computer vision based on deep learning has been extensively applied in autonomous driving, capturing external information and detecting target objects through cameras or radar. However, in different application scenarios, some technical difficulties remain to be overcome.
In recent years, with the development of deep learning and the evolution of CNNs, CNN-based object detection methods have been widely studied [3–6] and dominate the leaderboards of most detection datasets. However, although overall detection performance has improved, small object detection is still far from solved due to the challenge of feature extraction. Most object detection methods struggle to obtain excellent results on small object datasets, limited by feature aliasing between small and large objects. In other words, even when a high average precision (AP) is achieved on medium and large objects, the performance on small objects remains much worse. Generally speaking, whether in a single-stage [8] or two-stage [7] pipeline, FPN [9] is a crucial tool for object detection, including small object detection. It fuses top-down and bottom-up feature maps without introducing redundant parameters and improves multiscale object detection at a small computational cost [10].
In short, the bottom (high-resolution) feature maps are responsible for predicting small objects, while the top feature maps predict large objects. However, this way of feature fusion can easily confuse the features of large and small objects. A large object retains most of its ontology information after multiple rounds of downsampling. By contrast, a small object, such as one below 32 × 32 pixels, may occupy only a single pixel on the feature map after multiple rounds of downsampling. This phenomenon easily leads to the loss of small object features, resulting in wrong detections of small objects. For example, in Figure 1, the small traffic sign is hard to detect, even for human beings. Our ALFPN performs well in such challenging scenarios, while the FPN-based detector cannot detect the objects. Compared to FPN, we also have a lower missed detection rate in scenes where small objects are densely distributed. The worse behavior of FPN can be attributed to its feature inconsistency across scales, especially in one-stage pipelines. In summary, the detection of large and small objects depends on different feature levels, which can easily lead to feature conflicts and make it difficult for the model to converge when large and small objects appear in the same frame. Therefore, we construct an adaptive learner to resolve this inconsistency in our paper.

In addition, to overcome the lack of semantic information in the bottom feature maps, FPN provides a top-down path that shares the rich semantic information of the top-level features with the underlying features. However, this method still has drawbacks. For example, the semantic gap between features of different scales is very large, and direct fusion loses the multiscale representation ability. Furthermore, downsampling causes feature loss, and during feature extraction, more detailed information is lost as the network deepens. In FPN, the original authors chose direct stacking when fusing feature layers of different levels. We believe this intensifies the aliasing effect due to inconsistent semantic scales, which negatively impacts model learning [11]. Typically, for a single-scale object, not all feature map scales contribute effectively to detection performance. Especially for small-scale objects, directly stacking feature maps of different scales may cause the features of small objects to be covered, so their original features cannot be well preserved. Even when FPN and other feature fusion pyramid networks are added to small object detection, the achieved effect is very limited [12]. Therefore, we propose our adaptive learnable feature pyramid module to address this challenge.
In general, to resolve the feature conflict between objects of different sizes, the other two main types of methods are context information [13, 14] and super-resolution [15–17]. Context-information methods utilize the relationships between different objects in the same scene to assist the detection of small objects with few features [13]. Super-resolution methods rely on generative adversarial networks, learning high-resolution representations and then simulating the generated high-resolution features to enrich the features of small objects [15]. However, both kinds of methods involve a large amount of computation, which is inconvenient for deployment and training. At the same time, the generated features are virtual, and it is easy to ignore the factual feature information extracted by the CNN, which can lead to feature confusion.
To this end, we propose an adaptive learnable feature pyramid network (ALFPN) based on the original FPN, which enables the model to selectively focus on the discriminative features of small objects and overcome the interference caused by feature offset during training. By adding a learnable adaptive weight matrix before traditional feature fusion, the model can give the corresponding feature map a higher weight to obtain richer feature information when facing objects of different scales, thus solving the model's over-attention to single-scale objects. In addition, we add a local fusion guidance mechanism. Bilinear upsampling converts low resolution to high resolution by filling in information from adjacent points. However, this filling easily harms small objects, because small objects are very sensitive to edge and positioning information, and this upsampling method readily causes spatial feature shifts for small objects. Directly stacking feature layers of different scales significantly weakens the edge information of small objects and introduces noise that interferes with training. Therefore, we perform position correction on the upsampled features, reducing the feature loss caused by fusion. We also improve the underlying extraction process for small object features. Specifically, we propose a context-aligned supervisor (CAS), which guides the model to fuse feature information from different scales and makes location corrections to the corresponding feature maps. This avoids the feature shift caused by multiple convolutional downsamplings. In addition, we add adaptive feature inspection (AFI) to the fusion of different-level features, introducing a fusion factor matrix optimized during training that assigns adaptive weights before fusion. This allows the model to selectively focus on layers rich in discriminative features, avoiding the feature loss and confusion caused by direct stacking.
The main contributions of this paper can be summarized as follows:
(i) We propose an adaptive learnable feature pyramid network. This is an innovative feature fusion paradigm that better preserves the feature information of small objects, weakens the inconsistency of features at different levels, and improves the detection performance of object detectors in the small object domain.
(ii) We design an adaptive feature inspection, which alleviates the feature confusion caused by the traditional FPN directly stacking feature maps of different levels and better retains the features of small objects even in extreme scenes.
(iii) We introduce a context-aligned supervisor, which mitigates the feature shift caused by multiple upsamplings and helps the model converge better thanks to the penalty coefficient of the learnable parameters.
In addition, compared with other state-of-the-art methods, our ALFPN performs better on the TT-100K and PASCAL VOC datasets. Furthermore, ALFPN is more robust on small object classification tasks, with higher classification accuracy than other feature pyramid networks.
2. Related Work
Because our work relates to several fields, including deep object detectors, small object detection, and multiscale feature detection, we briefly review these three topics in this section.
2.1. Deep Object Detectors
In recent years, object detection has developed rapidly thanks to deep learning. The task of object detection is to find all objects of interest in an image and determine their categories and positions [18], which is one of the core problems in computer vision. At present, object detection methods fall into two paradigms: one-stage and two-stage. Two-stage methods first select proposals, each of which is sent to a CNN for feature extraction and classification [19]; a representative example is Faster R-CNN [7]. Faster R-CNN generates regions of interest and then regresses the category and location information. Although two-stage detection incurs a colossal computational cost, its performance is usually superior to that of single-stage detectors. Single-stage methods usually solve object detection end to end, represented by the YOLO series [8]. The YOLO series directly regresses coordinate information with the help of predefined anchor boxes and completes both classification and localization with a single network, making it much faster than two-stage methods. Although object detection is well developed, small object detection remains a difficult and active research topic, so solving it is very meaningful. This paper mainly analyzes and studies how to better utilize features of different scales to detect small objects.
2.2. Small Object Detection
Object detection has always been an important branch of computer vision, and small object detection has long been a difficult point within it. It aims to accurately detect small objects (under 32 × 32 pixels) with few visible features in an image [20]. In real scenarios, due to the large number of small objects, small object detection has broad application prospects and plays an essential role in many fields such as autonomous driving, smart medical care, defect detection, and aerial image analysis [21–23]. Facing the difficulty of detecting small objects, many methods different from feature pyramid networks have emerged and achieved specific results. For example, Liu et al. [13] propose a structural inference network called SIN and argue that relational modeling should include two parts: the relationship between objects and scenes, and that between objects and objects. Objects and scenes are used as nodes, and the relationships as edges, to construct a graph; a recurrent network then transmits and updates the information of each node. Noh et al. [15] introduce a GAN-based object detection algorithm, which reconstructs low-resolution images to high resolution and employs a discriminator to continuously optimize the high-resolution image for a better representation. However, many difficulties remain in small object detection, the most prominent of which is that models cannot extract the features of small objects well.
First of all, a small object has fewer features and occupies a smaller proportion of the whole image than a large object. Multiple rounds of downsampling can even shrink a small object to less than one pixel in the top-level features [24]. Second, the proportion of small objects in current mainstream datasets is relatively small, which also causes models to focus more on detecting large objects. The usual way to solve these problems is to introduce a feature pyramid network [25, 26]. Overall, these methods still face feature confusion and spatial inconsistency. To alleviate these problems, we propose a context-aligned supervisor and adaptive feature inspection. Through end-to-end training, fusion coefficients are adaptively generated from the original data and assigned to different feature levels, which maximizes the use of information from each level.
2.3. Multiscale Features
Multiscale features are usually used to extract features at different scales to assist detection, helping the model perform well on objects of different scales [27–31]. For example, SSD [32] predicts at different scales from different output layers to avoid losing some features during downsampling. In addition, one of the main object detection paradigms is the feature pyramid network [1, 26, 33, 34], which fuses image features of different scales through top-down and bottom-up paths, so that bottom features can also share the rich semantic information of top features [35, 36]. For instance, Bi-FPN [37] enhances feature representation by adding residual connections to the original structure and removes nodes without feature fusion to reduce computation.
In recent years, improved feature pyramid networks have emerged, such as PAFPN (path aggregation feature pyramid network) and HRFPN (high-resolution feature pyramid network). The main innovation of PAFPN is a two-way fusion path from top-down to bottom-up, with a shortcut between the bottom and top layers to reduce the loss of low-level features as they pass through the layers. HRFPN, by contrast, intervenes in the fusion of feature maps of different scales, fusing low-resolution feature maps with high-resolution ones to obtain a high-resolution feature representation; generally speaking, it enables the network to retain high-resolution features. The primary innovation of GraphFPN is the use of a context interaction layer and an interlevel interaction layer in a graph neural network to facilitate information interaction at the same and different scales. To enhance the expressive ability of these layers, the authors incorporated two types of local channel attention mechanisms, drawing inspiration from convolutional neural network techniques, to effectively enhance the multiscale features of the fully convolutional feature pyramid network.
Generally speaking, weighted fusion is usually added first to a single-stage detector, which significantly improves feature utilization efficiency [38]. However, these FPN methods struggle to achieve satisfactory results for small object detection: they work well for large objects but are not entirely suited to small ones. Therefore, this paper proposes ALFPN based on the original FPN framework, dedicated to resolving the detection performance degradation caused by the differing characteristics of large and small objects.
3. Method
3.1. Overview
Our proposed adaptive learnable feature pyramid network (ALFPN) is designed based on the original feature pyramid network, as shown in Figure 2. In detail, we mainly focus on optimizing the C2, C3, and C4 feature layers. First, we change the original upsampling path, adopting C2 and C3 as the input of the AFI, which adaptively fuses the feature maps of different levels. Second, according to the calculation results of the AFI, we update the parameter matrices at the same time. Then, the fused feature map is fed to the CAS, which further refines it by calculating the feature offset, obtaining more intrinsic features for subsequent detection. Finally, the CAS results replace the original upsampled fusion output as the final features fed into the detector.

3.2. Adaptive Feature Inspection (AFI)
Previous feature pyramid network methods have overlooked the feature space conflict and inconsistency that arise when merging feature layers of different levels. These methods upsample the small-scale feature layer through bilinear interpolation, adjust the number of channels through convolution, and then add it to the same-scale feature layer following the fusion method recommended by the original authors. In other words, feature layers of different scales are combined to achieve global semantic information sharing [39]. However, previous methods fail to address the disparity between feature layers of different scales, since not all scale features suit a single-scale object. Consequently, direct addition lets irrelevant features from other levels overpower useful features, and this risk increases when both large and small objects are present at the same time. Furthermore, direct addition causes aliasing effects and harms the representation of the features. To address this issue, we incorporate an adaptive coefficient matrix into the fusion process to reduce the discrepancy between feature layers of different levels. The detailed structure is shown in Figure 3.

We believe the previous pyramid networks are ineffective for small object detection because the difference in semantic information between features at different scales is not considered during fusion. In addition, the response of feature layers to large objects is generally higher than to small objects, making it challenging for the model to learn the discriminative features of small objects. Moreover, traditional fusion directly adds feature maps in a one-to-one ratio, letting the features of large objects easily dominate and overshadow those of small objects, which confuses or weakens small object features after fusion.
Therefore, we designed the AFI module to alleviate this problem. The intrinsic data distribution is adaptively learned by introducing adaptive fusion factors and assigning different weights to different levels, so the model can better select the feature scales that require more attention. Specifically, our proposed AFI module introduces two fusion factors, $\alpha$ and $\beta$, corresponding to the fusion coefficients of the top and bottom features, respectively. We believe it is necessary to account for both the strong semantics of the top features and the rich details of the bottom features. By giving the discriminative features higher learning weights, the model can better learn to distinguish objects. At the same time, $\alpha$ and $\beta$ are learned and optimized through backpropagation to find the best fusion. First, we obtain the feature maps of two adjacent levels from the backbone network, such as the C2 and C3 feature maps. We divide them into two branches: the C3 feature map passes through dilated convolutions with dilation rates of 6, 12, and 18, and the results are stacked. Second, the stacked map is expanded to the resolution of C2 by subpixel convolution. Next, the two same-scale feature maps with different semantic information are channel-reduced by a convolution with a fusion supervision coefficient. Then, the two-dimensional fusion matrices $\alpha$ and $\beta$ are obtained through the sigmoid function as fusion coefficient matrices. Finally, we obtain the final output as the Hadamard product of the coefficient matrices and the corresponding input features. We use the matrix Hadamard product with fusion weights instead of simple stacked addition, which reduces the effect of feature confusion and the inconsistency of semantic information between levels.
This method enables the model to obtain clean hierarchical features, which helps the regression of the network converge. Experiments show an effective improvement after adding the AFI module. The AFI process can be represented as follows:

$\alpha = \sigma\left(f\left(\mathrm{PS}\left(D_{6,12,18}(C_3)\right)\right)\right), \quad \beta = \sigma\left(f\left(C_2\right)\right)$

$P = \mathrm{CAS}\left(\alpha \odot \mathrm{PS}\left(D_{6,12,18}(C_3)\right) + \beta \odot f(C_2)\right)$

where $\mathrm{PS}(\cdot)$ is the subpixel convolution, $\odot$ is the matrix Hadamard product, $f(\cdot)$ is the normal convolution operation, $\sigma(\cdot)$ is the sigmoid function, $D_{6,12,18}(\cdot)$ is the stacked dilated convolutions with dilation rates 6, 12, and 18, and $\mathrm{CAS}(\cdot)$ is the CAS module. The formulas above describe how the fusion coefficient matrices are obtained: dilated convolution and subpixel upsampling produce the input of $\alpha$, and ordinary convolution produces the input of $\beta$.
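To make the data flow concrete, the following PyTorch sketch implements the AFI fusion as we read the description above; the channel counts, the single-channel coefficient maps, and the module names are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class AFI(nn.Module):
    """Adaptive feature inspection sketch: learnable coefficient maps replace
    the plain upsample-and-add fusion of FPN (channel sizes are assumptions)."""
    def __init__(self, c2_ch=256, c3_ch=512, out_ch=256):
        super().__init__()
        # Dilated convolutions with rates 6, 12, 18 on the top feature C3.
        self.dilated = nn.ModuleList(
            nn.Conv2d(c3_ch, out_ch, 3, padding=r, dilation=r)
            for r in (6, 12, 18))
        # Subpixel convolution expands the stacked result to the C2 resolution.
        self.subpixel = nn.Sequential(
            nn.Conv2d(3 * out_ch, 4 * out_ch, 3, padding=1),
            nn.PixelShuffle(2))
        self.proj_bottom = nn.Conv2d(c2_ch, out_ch, 1)
        # 1x1 convolutions + sigmoid yield the coefficient matrices alpha, beta.
        self.coef_top = nn.Conv2d(out_ch, 1, 1)
        self.coef_bottom = nn.Conv2d(c2_ch, 1, 1)

    def forward(self, c2, c3):
        top = torch.cat([d(c3) for d in self.dilated], dim=1)  # stacked results
        top = self.subpixel(top)                       # now at the C2 resolution
        bottom = self.proj_bottom(c2)
        alpha = torch.sigmoid(self.coef_top(top))      # learned via backprop
        beta = torch.sigmoid(self.coef_bottom(c2))
        # Hadamard-weighted fusion instead of direct stacked addition; the
        # result is then passed to the CAS module of Section 3.3.
        return alpha * top + beta * bottom
```

For example, with `c2` of shape (1, 256, 200, 200) and `c3` of shape (1, 512, 100, 100), the module returns a (1, 256, 200, 200) fused map.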
3.3. Context-Aligned Supervisor (CAS)
The original FPN method simply adds the upsampled top feature map to the lateral input feature to generate a stable image representation. However, since the feature map has undergone recursive downsampling, the upsampled feature maps may be spatially displaced, and direct addition causes feature aliasing effects. As a result, some correctly represented feature points may be lost [40]. Thus, we introduce the context-aligned supervisor to reduce this aliasing effect and make the features easier to express.
Convolutional computation divides the feature map into patches the same size as the convolution kernel and convolves them step by step. However, the original FPN uses bilinear upsampling, which can shift features in space because it fills new data in between adjacent pixels. This weakens the true data representation: the value of each added pixel is determined by pixel position and the original data distribution, while object features are not necessarily linearly increasing or decreasing. As is well known, the convolution operation exhibits translation invariance, meaning the feature map shows the same pattern regardless of how the object is translated. For this property to work effectively, an efficient, low-noise feature layer is necessary. However, the original stacking method causes spatial displacement, and bilinear interpolation weakens object edge features, hindering the model's ability to learn useful features. To address these issues, we introduce deformable convolution, which adds offsets to the receptive field. More importantly, the receptive field is no longer a fixed square but can be adjusted to match the actual shape of the object, allowing the convolutional region to cover the object's outline more accurately. This approach is particularly useful for objects with extreme aspect ratios and improves on bilinear interpolation, which is not always sensitive to object edges [41]. Inspired by this offset, we design the context-aligned supervisor to learn this relationship, as shown in Figure 4.

Our proposed CAS takes the features fused by AFI as input. Using the deformable convolution formulation, we obtain an offset matrix. Deformable convolution was originally proposed to handle objects of different scales and shapes by relaxing the fixed sampling positions of traditional convolution. Inspired by this, we use the original large-scale feature map as a reference to redesign the offset matrix. By adding two dimensions to the original convolution, we calculate the x- and y-direction offsets for each pixel to achieve spatial correction of the object. The implementation is as follows.
The first step is to obtain the offset position of each sampling point during convolution using the deformable convolution formulation. Next, the pixel value at the corresponding position is computed by bilinear interpolation together with the pixel values of adjacent points. We then obtain weights from the positions of the adjacent points and the interpolation point. Using position weighting and distance weighting, we compute the pixel value from the calculated weights and pixel values, and perform position calibration on the feature map according to the sampling weights. Additionally, we introduce a Gaussian kernel function to measure the gap between the aligned features and the original features, and improve alignment by adding this Gaussian kernel loss to the loss for backpropagation. Thanks to the characteristics of the Gaussian kernel, we can abstract the features into multidimensional features in space and take the highest response point as our resampling reference, which makes feature extraction easier for the model and yields better generalization. In general, we design a supervised output for the original input feature, correcting the feature expression so that the fused output has a more detailed representation. The proposed CAS can be expressed as follows:

$\tilde{F} = G\left(F, \Delta p\right), \quad \Delta p = \mathrm{FC}(F), \quad \mathcal{L}_{\mathrm{align}} = \mathcal{K}\left(\tilde{F}, F\right)$

where $\mathrm{FC}(\cdot)$ is the fully connected layers, $G(\cdot)$ is the guidance module, and $\mathcal{K}(\cdot)$ finds the Gaussian radius and the L1 distance within that radius.
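As a concrete reference, here is a minimal PyTorch sketch of the alignment step built on torchvision's deformable convolution; the convolutional offset head (standing in for the fully connected layers of the formula), the use of the large-scale reference map, and the Gaussian-kernel loss form are our assumptions about one plausible implementation, not the authors' exact code.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class CAS(nn.Module):
    """Context-aligned supervisor sketch: predict per-pixel x/y offsets from
    the fused feature and the large-scale reference, then resample the fused
    feature with a deformable convolution."""
    def __init__(self, ch=256, k=3):
        super().__init__()
        # Two extra channels per sampling point: the x and y offsets.
        self.offset_head = nn.Conv2d(2 * ch, 2 * k * k, 3, padding=1)
        self.align = DeformConv2d(ch, ch, k, padding=k // 2)

    def forward(self, fused, reference):
        # Offsets are estimated against the original large-scale feature map.
        offset = self.offset_head(torch.cat([fused, reference], dim=1))
        return self.align(fused, offset)

def gaussian_kernel_loss(aligned, original, sigma=1.0):
    # Gaussian-kernel gap between aligned and original features, added to the
    # detection loss during backpropagation (sigma is an assumed choice).
    gap = (aligned - original).pow(2).mean()
    return 1.0 - torch.exp(-gap / (2 * sigma ** 2))
```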
Overall, adding the CAS module reduces the aliasing effect when merging feature layers of different levels. In addition, the edge blur caused by the convolution edge effect can be reduced; by retaining complete edge information, small objects can be better detected. Furthermore, the spatial offset introduced by the traditional upsampling method is effectively suppressed, which suits the location-sensitive features of small objects well.
3.4. Implementation Details
To further improve our performance in small object detection, we adopt a subpixel convolution and a channel attention module, which unlock the potential of our proposed AFI and CAS while learning small object features. The details are as follows.
3.4.1. Subpixel Convolution
In reality, when a camera captures images of the real world, it discretizes the continuous physical scene. Objects in the real world are connected everywhere, but the image sensor can only use a small area to represent each color, leaving pixels or subpixels between two sensor pixels [42]. Deconvolution upsamples a low-resolution image to a high-resolution one, but the extra area is filled with zero pixels, and these filled pixels are invalid virtual information that does not benefit gradient optimization. To overcome this limitation, subpixel convolution recombines multiple channels of the convolution output to fill the subpixel area, producing a high-resolution image that avoids this defect. We replace the original bilinear interpolation upsampling with subpixel convolution to restore the real information of small objects. Unlike the traditional way of filling virtual, unreal data, the operation is as follows:

$I^{SR} = \mathrm{PS}\left(W_L \ast f^{L-1}\left(I^{LR}\right) + b_L\right)$

where $\mathrm{PS}$ [42] is a periodic shuffling operator that rearranges the elements of an $H \times W \times C r^2$ tensor into a tensor of shape $rH \times rW \times C$.
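In PyTorch, the $\mathrm{PS}$ operator corresponds to nn.PixelShuffle; a minimal sketch follows, where the upscale factor and channel count are illustrative choices.

```python
import torch
import torch.nn as nn

# Subpixel upsampling: a normal convolution produces r^2 times the channels,
# and PixelShuffle rearranges them into an r-times larger map, so the new
# pixels come from learned channels rather than interpolation or zero filling.
r, c = 2, 256
subpixel_up = nn.Sequential(
    nn.Conv2d(c, c * r * r, kernel_size=3, padding=1),
    nn.PixelShuffle(r),        # (N, c*r^2, H, W) -> (N, c, r*H, r*W)
)

x = torch.randn(1, c, 50, 50)
print(subpixel_up(x).shape)    # torch.Size([1, 256, 100, 100])
```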
3.4.2. Channel Attention Module
We believe that obtaining top-level semantic information is more conducive to detecting small objects. Therefore, we first extract the location information of small objects from the bottom-layer features and then obtain discriminative features from the top layer [43]. In addition, we incorporate an attention mechanism after stacking two adjacent feature levels. This mechanism supervises the semantic differences between levels and assigns different weights to the features, so the model can focus on discriminative information and improve the detection performance for small objects [44]. In detail, we first apply global max pooling and global average pooling to obtain two channel descriptors. These are then sent to a two-layer shared network, and the sigmoid function yields the weight coefficients. Finally, the weight coefficients are multiplied with the input feature to obtain the scaled new feature.
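The described mechanism corresponds closely to CBAM-style channel attention; a minimal PyTorch sketch follows, where the reduction ratio and the summation of the two descriptors before the sigmoid are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Max- and average-pooled channel descriptors pass through a shared
    two-layer network; the sigmoid output rescales the input channels."""
    def __init__(self, channels=256, reduction=16):
        super().__init__()
        self.shared_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.shared_mlp(F.adaptive_avg_pool2d(x, 1))  # global average pool
        mx = self.shared_mlp(F.adaptive_max_pool2d(x, 1))   # global max pool
        weight = torch.sigmoid(avg + mx)   # per-channel weight coefficients
        return x * weight                  # scaled new feature
```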
3.4.3. MMDetection
MMDetection is an open-source object detection toolbox launched by SenseTime and the Chinese University of Hong Kong. Built on PyTorch, MMDetection implements a large number of object detection algorithms and encapsulates dataset construction, model building, and training strategies into modules. Through module invocation, a new algorithm can be implemented with a small amount of code, greatly improving code reuse. All module experiments in this paper are developed on this framework.
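As an illustration of this modularity, the following hedged sketch shows how a custom neck such as ALFPN would be registered under MMDetection v2.x registry conventions; the class body is a placeholder, not the actual ALFPN implementation.

```python
import torch.nn as nn
from mmdet.models.builder import NECKS

@NECKS.register_module()
class ALFPN(nn.Module):
    """Skeleton neck: lateral 1x1 projections stand in for the AFI/CAS logic."""
    def __init__(self, in_channels, out_channels=256, num_outs=5):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, 1) for c in in_channels)

    def forward(self, inputs):
        # The AFI fusion and CAS alignment of Sections 3.2-3.3 would go here.
        return tuple(lat(x) for lat, x in zip(self.laterals, inputs))

# The detector config can then swap necks with one line:
# neck = dict(type='ALFPN', in_channels=[256, 512, 1024, 2048],
#             out_channels=256, num_outs=5)
```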
4. Experiments
4.1. Dataset
4.1.1. Tsinghua-Tencent 100K Dataset
The TT-100K dataset [45] provides 100,000 images containing 30,000 traffic sign instances, covering large changes in light intensity and weather conditions, and is thus well suited to actual driving scenarios. The dataset divides traffic signs into three size categories following the COCO convention: small, medium, and large objects. We select 45 categories for training, removing categories with fewer than 100 instances in total, and finally evaluate precision and recall on the test set at IoU = 0.5.
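A small sketch of this filtering step follows, assuming the standard TT-100K annotations.json layout; the field names "imgs", "objects", and "category" are our assumption about the release format.

```python
import json
from collections import Counter

# Count instances per traffic-sign category and keep those with >= 100,
# mirroring the 45-category selection described above (schema assumed).
with open("annotations.json") as f:
    anno = json.load(f)

counts = Counter(obj["category"]
                 for img in anno["imgs"].values()
                 for obj in img.get("objects", []))
kept_categories = {c for c, n in counts.items() if n >= 100}
print(f"{len(kept_categories)} categories kept")
```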
4.1.2. VOC Dataset
The VOC dataset [46] contains a training set (5011 images) and a test set (4952 images), 9963 images in total across 20 categories. We evaluate the average precision (AP) over the whole test set and the individual accuracy of each category according to the VOC standard, and we further judge module performance by the changes in accuracy across categories.
4.1.3. COCO Dataset
The COCO dataset [47] is an open dataset provided by Microsoft that has been widely used in computer vision. Most visual tasks, such as object detection, semantic segmentation, and object classification, are benchmarked on COCO to obtain model evaluation metrics. The dataset contains 80 classes. We test our method strictly in accordance with the COCO evaluation metrics.
4.2. Evaluation Index
In this paper, we use standard evaluation metrics to measure the model's performance and report final results as AP values. The AP value is computed from the model's precision and recall. We define TP as a correctly predicted positive sample, FP as a negative sample mistakenly detected as positive, FN as a positive sample missed and classified as negative, and TN as a negative sample detected as negative. Precision is TP divided by the sum of TP and FP, while recall is TP divided by the sum of TP and FN. The PR curve plots precision on the vertical axis against recall on the horizontal axis, and the AP value is the area under this curve.
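In formula form, these standard definitions read:

$\mathrm{Precision} = \dfrac{TP}{TP + FP}, \qquad \mathrm{Recall} = \dfrac{TP}{TP + FN}, \qquad AP = \displaystyle\int_{0}^{1} P(R)\,\mathrm{d}R$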
4.3. Training Strategy
All our experiments are implemented with MMDetection. We train and test the detectors at a resolution of (1333, 800) on one NVIDIA RTX 3090 GPU (4 images per GPU). During training, a 1× schedule represents 12 epochs. Our ALFPN can be applied to any FPN-based detector; in this paper we choose Faster R-CNN, Cascade R-CNN, Sparse R-CNN, and RetinaNet as our baseline networks, with ResNet-101 and ResNet-50 as backbones. Our initial learning rate is set to 0.0025 with a weight decay of 0.0001. In addition, the training and validation data in our experiments are the filtered TT-100K dataset. Unless otherwise specified, other settings follow the basic framework.
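For reproducibility, these settings correspond to an MMDetection v2.x config fragment along the following lines; momentum, warmup, and step epochs are assumed framework defaults rather than values stated in the text.

```python
# Sketch of the training configuration described above (mmdet 2.x keys);
# momentum, warmup, and decay steps are assumed defaults, not stated values.
img_scale = (1333, 800)
data = dict(samples_per_gpu=4, workers_per_gpu=2)
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
lr_config = dict(policy='step', warmup='linear', warmup_iters=500,
                 warmup_ratio=0.001, step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)  # the 1x schedule
```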
4.4. Ablation Study
To verify the effectiveness of the proposed modules, we conduct ablation experiments and analyze the results. We evaluate the performance of each component separately on the TT-100K dataset.
4.4.1. Core Components Ablation
We perform two main ablation experiments on the TT-100K dataset to verify the effect of our proposed modules. The results are shown in Table 1. In the first experiment, we added the CAS module alone to measure the performance gain from correcting offsets in feature space, isolating its impact on the overall method. Comparing experiments 1 and 2 in Table 1, adding the CAS module brings an AP improvement of 4.2, which validates our spatial position correction approach. In the second experiment, we added the AFI module alone to determine the influence of adaptively selecting feature maps of different proportions on small object detection. Comparing experiments 1 and 3 in Table 1, adding the AFI module brings an AP improvement of 4.0. We attribute these gains to the following. Detecting small-scale objects has always been challenging in computer vision, largely because the feature information of small objects gradually diminishes after repeated downsampling: the deeper the feature maps, the less friendly they are to smaller objects. Our adaptive coefficient matrix addresses this by adjusting the sampling weights according to the detection scenario, accounting for both large-scale and small-scale objects. The matrix assigns different weights to the feature layers based on their corresponding values, expanding effective features and suppressing invalid ones. This mitigates the negative effects of downsampling on small-scale objects and effectively retains their important information, improving small object detection while maintaining accurate detection of larger objects.
4.4.2. Components Ablation
We also design an ablation experiment to verify the other improved components, namely the subpixel convolution, the attention mechanism, and the dilated convolution, as shown in Table 2. First, comparing No. 1 and No. 2, the subpixel convolution brings a very effective performance improvement: it optimizes the upsampling so that the pixels to be expanded are filled with real data, yielding better feature representation. Second, comparing No. 1 and No. 4, the AP value also improves to a certain extent, showing that dilated convolution expands the receptive field and captures longer dependencies, notably without increasing the computational cost. In addition, comparing No. 5, No. 6, and No. 7, combinations of different modules bring a more significant improvement, and we believe the combination plays a positive role in feature utilization. Finally, the best performance is achieved after adding all three modules, proving their effectiveness.
4.5. Transferability
In addition, we design a set of transferability experiments to prove that our modules are plug-and-play and can be applied to other feature pyramid networks, as shown in Table 3. In this experiment, we use Faster R-CNN as the detector, trained and tested on the TT-100K dataset. We replace the original FPN with three different feature pyramid architectures, namely PAFPN, HRFPN, and AugFPN, and add our proposed AFI and CAS modules to each. First, we conduct a migration test on PAFPN, obtaining better performance. We believe this is because the AFI module adaptively attends to feature maps of different scales, improving the detection of small objects, while the CAS module reduces the interference of background noise on small object features through alignment. Second, we test our core components on HRFPN, achieving an effective improvement; selecting the feature layer corresponding to the current object scale adaptively is helpful for object detection. Finally, we transfer our modules to AugFPN and obtain the best result, which proves the transferability of our proposed method.
Overall, the mechanism of AFI is that when images contain large scale changes, we can properly ignore feature layers that do not match the object scale. This avoids introducing invalid features that would interfere with detection and cover up the feature information of small objects. For the CAS module, the alignment operation reduces the loss of effective information, retains the original features of the object, and also reduces the disturbance caused by localization. Therefore, our proposed plug-and-play modules have rewarding transferability.
4.6. Visualizations
4.6.1. Qualitative Visualization
To verify the effectiveness of ALFPN, we visualize the results of FPN and our proposed ALFPN on the TT-100K dataset. First, Figure 5 shows that ALFPN performs better than FPN when detecting small objects or in scenarios where small objects are clustered, indicating the effectiveness of our method. In detail, when small objects are densely distributed, our model is more robust, and the probability of mixed or missed detections is significantly reduced. In Figure 5, the first, second, and fifth pictures contain small object aggregations; FPN has a high missed detection rate in such extreme scenes, while our ALFPN has a higher detection rate and accuracy. In addition, in the third and fourth pictures, FPN cannot detect the small objects, but ALFPN detects and identifies them correctly. Therefore, our proposed ALFPN is more reliable and stable than FPN.

4.6.2. Attention Visualization
To explain the feature extraction ability, we visualize the heat maps of FPN and ALFPN when detecting the same image in Figure 6. In Figure 6(a), the regions of interest on the FPN heat map are relatively scattered, and the detection of small objects is unsatisfactory. In Figure 6(b), our method makes the model pay more attention to the regions containing discriminative features, producing a more concentrated response for the same object, which facilitates the detection of small objects.

4.6.3. Robustness Comparison
In Figure 7, we report a robustness comparison between our method and similar methods. Compared with three other feature pyramid networks, FPN, PAFPN, and HRFPN, our proposed ALFPN obtains better results on the TT-100K dataset. The performance of these three pyramid networks is unstable and varies across detectors, but our ALFPN achieves higher AP values on all four detectors tested. This also proves that ALFPN is more widely applicable and fits various detection models better.

4.7. Comparison with the State-of-the-Art Methods
We compare our method with state-of-the-art methods on the TT-100K, PASCAL VOC, and COCO datasets. The results show that we achieve higher results than the others, as shown in Tables 4 and 5. We adopt Faster R-CNN with a ResNet-101 backbone and train following the preset training parameters.
4.7.1. Comparison on VOC and COCO Dataset
To further verify the effectiveness of our modules, we compare the results with other feature pyramid networks on the COCO and VOC datasets. The results show that our method is superior in detecting small objects, as shown in Table 4. We again adopt Faster R-CNN with a ResNet-101 backbone and train according to the preset parameters. Since COCO and VOC are relatively general, large datasets covering objects of various scales, we believe the experimental results on these two datasets are representative to a certain extent. To better compare module effects, we use the VOC and COCO evaluation metrics for evaluation and testing, respectively.
First, comparing models No. 1 and No. 5, our proposed method outperforms the FPN module on all metrics and improves detection across all object sizes, including large, medium, and small objects. In particular, our approach achieves significantly better performance on small objects. This improvement stems from the AFI module's adaptive focus on feature maps of different scales, enhancing small object detection by adjusting to their specific layer characteristics. Second, our proposed CAS module reduces the position sensitivity of small objects, mitigating the impact of background noise through spatial alignment and reducing interference in the IoU calculation. Compared to models No. 4 and No. 6, our approach also demonstrates reliable performance on the VOC dataset. This is because our model obtains more effective feature information by adjusting the matching between the scale of the detected object and the feature layer, while the spatial disturbance caused by fusing different feature layers is overcome by the feature alignment.
4.7.2. Comparison on TT-100K Dataset
First, compared with No. 1 and No. 20, our proposed ALFPN module outperforms the baseline on all metrics. Notably, we achieve a 5.1 AP improvement in small object detection without compromising the performance on medium and large objects. This can be attributed to the adaptive feature fusion and context supervision modules, which allow the model to selectively attend to the feature layers relevant to each object scale and reduce the impact of irrelevant layers. Second, our feature alignment operation preserves the original feature information of small objects after fusion, preventing spatial position interference and yielding superior localization performance. Third, our module achieves state-of-the-art performance on the TT-100K dataset compared with other feature pyramid networks, and it demonstrates robustness and transferability when combined with different detectors, leading to significant improvements in detection performance. These results demonstrate the effectiveness of our proposed ALFPN module.
Overall, our ALFPN achieves a more significant accuracy improvement than other feature pyramid networks on the TT-100K, VOC, and COCO datasets. The experiments show that our modules integrate well into feature pyramid networks of various paradigms, and detection performance improves through adaptive fusion and feature alignment.
4.8. Other Experiments
In the field of object detection, it is important not only to consider the detection performance of a model but also to evaluate its parameter count, detection speed, and ease of deployment. To compare the complexity of our modules, we designed the following experiments. All experiments were conducted on the same RTX 3090 with a fixed image input, measured as sketched below.
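The measurement itself can be reproduced with a simple timing loop of the following form; this is a sketch, and `my_detector` and the input size are placeholders rather than the exact benchmarking script.

```python
import time
import torch

def benchmark(model, input_size=(1, 3, 800, 1333), runs=100):
    """Report parameter count (millions) and FPS for a fixed-size input."""
    model.eval().cuda()
    x = torch.randn(*input_size, device="cuda")
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    with torch.no_grad():
        for _ in range(10):               # warm-up before timing
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
    fps = runs / (time.time() - start)
    return params_m, fps

# params_m, fps = benchmark(my_detector)   # hypothetical detector instance
```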
As shown in Table 6, our model's complexity is slightly higher than that of other lightweight models of the same type, sacrificing some FPS to preserve more feature details. In future work, we plan to release a lightweight revision of the model for deployment on edge devices.
5. Conclusion
In this paper, we analyze some problems of existing feature pyramid networks and show that the traditional upsampling approach introduces a certain bias error at the spatial level when expanding resolution. In addition, objects of different scales elicit different responses across feature levels, and a better match between object scale and feature level leads to better detection. Based on these findings, we propose a new feature pyramid network, ALFPN, composed of two main components. First, we design an adaptive feature inspection (AFI) module to reduce the inconsistency of features between different layers. Second, we introduce a context-aligned supervisor (CAS) module to mitigate the feature offset problem, reducing both the interference of background noise on sample features and the disturbance of small objects in the IoU calculation. Our experimental results show that our modules are compatible with mainstream FPN detectors and can be used as plug-and-play components. We achieve a significant improvement in detecting small objects without sacrificing the detection performance of large and medium-sized objects.
Data Availability
The data used to support the findings of this study are available upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Authors’ Contributions
Haolin Chen and Qi Wang equally contributed to this study.
Acknowledgments
This research was supported by the National Natural Science Foundation of China (grant nos. 62162008, 62006046, 32125033, and 31960548), the Innovation and Entrepreneurship Project for Overseas Educated Talents in Guizhou Province ((2022)-04), the Guizhou Provincial Basic Research Program (ZK[2022]-108), the Guizhou University Cultivation Project (grant no. 2021-55), the Natural Science Special Research Fund of Guizhou University (grant no. 2021-24), and the Program of Introducing Talents of Discipline to Universities of China (111 Program, D20023).