Abstract
Since small-target pedestrians account for a small proportion of pixels in images and lack texture features, their feature information is often ignored during feature extraction, leading to reduced accuracy and poor robustness. To improve the accuracy of small-target pedestrian detection and the anti-interference ability of the model, a small-target pedestrian detection model that fuses residual networks and feature pyramids is proposed. First, a residual block with a dropout (discard) layer is constructed to replace the standard residual block in the residual network to reduce computational complexity and alleviate gradient vanishing and explosion in the deep network. Then, feature selection and feature alignment modules are added to the lateral connection part of the feature pyramid to enhance important pedestrian features in the input image and strengthen the model's multiscale feature fusion for small-target pedestrians, thereby improving the detection accuracy of small-target pedestrians and addressing the feature misalignment and ignored multiscale features of the feature pyramid network. Finally, a cascaded autofocus query module is proposed to increase the inference speed of the feature pyramid network through focusing and querying, thus improving the performance and efficiency of small-target pedestrian detection. The experimental results show that the proposed model achieves better detection results than previous models.
1. Introduction
With the development of deep learning and computing hardware, the fields of autonomous driving (AD) and intelligent transportation systems (ITS) have rapidly advanced. Although AD and ITS have achieved great results in some scenarios, collisions between AD vehicles and pedestrians [1] and sensitive ethical and moral issues [2] present serious challenges to pedestrian detection technology, which is crucial for the development of AD and ITS. Pedestrian detection is a technique to determine whether a pedestrian is present in an image or video and provide their precise location and size. Small-target pedestrian detection is a difficult aspect of pedestrian detection because small-target pedestrians produce little signal for sensors and offer little feature information for deep learning models. Accurate pedestrian detection is the basis for AD vehicles and provides operation and guidance strategies that allow AD vehicles to avoid collisions with pedestrians, reduce traffic accidents, and improve safety.
Small-target pedestrian detection is one of the most essential issues in intelligent transportation systems, mainly on urban roads and in places with high pedestrian flow. However, actual traffic environments are large, complex, and contain multiple variables, and many challenges need to be addressed to achieve accurate and robust pedestrian detection using radar and digital image processing techniques. For example, in environments with partial pedestrian occlusion, radar techniques fail to detect occluded pedestrian targets [3], and small-target pedestrians are even more difficult to detect. Deep learning-based detection of small-target pedestrians in these environments requires deeper networks and larger models, which demand considerable computational power. In addition, detection speed remains a challenge.
To address these issues, an increasing number of scholars in this field have turned to deep learning because the development of LIDAR systems is time-consuming and expensive; the most critical issue is that such sensors cannot process images the way the human eye perceives pedestrians. Recently, Tesla proposed building self-driving cars using purely visual methods. The most widely used small-target pedestrian detection models are based on deep learning. Deep learning was proposed in 2006 and is widely used in computer vision, natural language processing, bioinformatics, and other fields because of its human-like analytical learning capabilities. Deep learning has also been used in pedestrian detection, where the goal is to learn the relationships among target pedestrians in different images. Representative networks include the deep residual network (ResNet), the feature pyramid network (FPN), and the you only look once (YOLO) networks. ResNet addresses network degradation well, the FPN has an improved feature fusion capability, and YOLO has a higher pedestrian detection speed. However, due to many factors, the patterns of small-target pedestrians in images are complex and variable. Thus, accurate small-target pedestrian detection is difficult to achieve with only a single shallow network; small-target pedestrian features require both deep networks and feature fusion networks. Considering that deep networks and feature fusion networks can both improve the detection of small-target pedestrians [4], many scholars have studied small-target pedestrian detection from both perspectives. For example, Noh et al. [5] proposed feature superresolution for a small-target pedestrian detection algorithm, and Nie et al. [6] proposed enriched features for a small-target pedestrian detection network. These methods have achieved positive detection results. However, the pedestrian detection speed is not ideal as the models grow, and detecting small-target pedestrians accurately and quickly remains an open problem.
To solve the above problems, this article proposes a small-target pedestrian detection model based on autonomous driving. The main contributions of this study can be summarized as follows:(1)An improved residual network is proposed. By adding a dropout layer to the residual network, the number of model parameters is reduced, and the model generalizability is improved. The model training effect is evaluated through ablation experiments, and the best model parameters are selected.(2)A feature fusion and alignment network is proposed. By adding feature selection and feature alignment modules to the feature pyramid network, the most important features in the feature map are enhanced, and the offset features in the feature extraction and feature fusion processes are corrected and aligned.(3)A cascaded autofocus query (AFQ) module is proposed to increase pedestrian detection speed. This module accelerates small-target pedestrian detection through automatic focusing and querying. Different AFQ modules are constructed according to feature maps of different scales, thus allowing the modules to automatically adapt to different scale features. In addition, the cascade method is used to share data to increase the detection speed of the model.
2. Literature Review
Pedestrian detection is a technology that judges whether there are pedestrians in an image or video and provides their precise positions and sizes. Small-target pedestrian detection is a difficult aspect of pedestrian detection. In AD scenarios, high-precision small-target pedestrian detection can give the car control system sufficient time for early warning and processing [7], which is important in ensuring driving safety [8–10]. According to an overview of domestic and international research, pedestrian detection methods can be roughly divided into two categories: shallow machine learning detection models and deep learning detection models. Moreover, deep learning models can be further divided into two categories: one-stage pedestrian detection algorithms [11] and two-stage pedestrian detection algorithms [12]. These two types of algorithms have distinct advantages but face similar challenges, including occlusion by other pedestrians [13, 14] and traffic signs [15, 16], low image resolution [17], light intensity interference [18], scale transformation issues [16], and many others.
Machine learning implements pedestrian detection by constructing feature models and using these features to train classifiers. Common feature extraction methods include Haar wavelet features, histogram of oriented gradients (HOG) features, grayscale and rotation-invariant features, and local binary pattern (LBP) variants. Common classifiers include the support vector machine (SVM), AdaBoost, and random forests. Machine learning algorithms can achieve accurate pedestrian detection. However, due to the nonrigid nature of pedestrians, the constructed feature models often struggle to adapt to pedestrians with different perspectives, mutual occlusion, and different postures. In particular, small detection targets are easy to miss. Moreover, false detection issues reduce the practicality of these algorithms.
Deep learning can address the above problems. The one-stage pedestrian detection algorithm adopts the core idea of an end-to-end network [19]: a single neural network directly predicts the positions of objects in the image with only one evaluation. Representative works on one-stage pedestrian detection include the SSD algorithm [20] proposed by Liu et al. and the YOLO algorithm [21] proposed by Redmon et al. Since these algorithms do not fully exploit shallow detail and deep semantic information when extracting image features, their detection of small-target pedestrians is not ideal. Therefore, to improve the detection of small-target pedestrians, Yin et al. proposed the FD-SSD algorithm [22]. This algorithm improves the semantic information of shallow feature maps through a multilayer feature fusion module; through a multibranch residual dilated (hole) convolution module, the original resolution of the feature map is maintained, and the context information of the feature map is improved. In addition, deformable convolutions are introduced to fit the shapes of small objects. Fu et al. proposed the DSSD algorithm [23], which imitates the feature pyramid: it adds a ResNet-101 network with deconvolution layers, uses deconvolutions to upsample high-level features and combine them with shallow features, and increases the semantic information of the shallow layers to improve the accuracy of small object detection. Although these multifeature fusion methods improve the detection accuracy of small-target pedestrians to a certain extent, they still do not meet actual needs [24].
The two-stage pedestrian detection algorithm first generates pedestrian candidate regions and then classifies the candidate regions using a convolutional neural network. The conventional two-stage pedestrian detection algorithms are the fast region-based convolutional neural network (Fast R-CNN) proposed by Girshick [25], the faster region-based convolutional neural network (Faster R-CNN) proposed by Ren et al. [26], and the mask region-based convolutional neural network (Mask R-CNN) proposed by He et al. [27]. Similar to the one-stage algorithms, these two-stage algorithms are often not very effective in detecting small-target pedestrians. To address this problem, Zhang et al. [28] analyzed the Faster R-CNN algorithm and found the reason for the unsatisfactory small-target pedestrian detection results: the feature map resolution of the neural network is not sufficient for small-target pedestrians. As a result, the neural network easily ignores these pedestrian features during the learning process. They also showed that using a region proposal network (RPN) and decision forests (DFs) on the shared high-resolution convolutional feature map can effectively improve small-target pedestrian detection. Additionally, Liu and Stathaki [29] proposed a pedestrian detection algorithm combining a Faster R-CNN with a semantic segmentation network. This network uses semantic cues to better detect pedestrians by computing complementary high-level semantic features and integrating them with convolutional features in multiresolution feature maps extracted from different network layers, thus ensuring good detection accuracy for pedestrians of different scales. These algorithms can effectively achieve small-target pedestrian detection; however, feature alignment issues occur during detection due to inaccurate spatial sampling [30].
To address the above problems, this paper proposes a fused residual network and feature pyramid (FRFP) model with an autofocus query mechanism for small-target pedestrian detection. The model uses the two-stage Faster R-CNN model as the framework and ResNet fused with an FPN as the backbone. The model uses a bottom-up path, built from the improved residual network, to generate feature maps of different scales and a top-down path, built from the feature pyramid incorporated into the residual network, to fuse feature maps of different scales and achieve multiscale feature fusion. Finally, a cascaded AFQ module is added behind the feature pyramid. The cascaded AFQ module shares data, reduces the computational cost of determining the spatial information of small-target pedestrians during inference, and passes this information to the next AFQ module to increase the detection speed for small-target pedestrians.
3. Our Approach
To address the problems that small-target pedestrians account for a relatively small amount of image information, that neural networks ignore small-target pedestrian features during feature fusion [31], and that feature pairs become misaligned [32], this paper uses Faster R-CNN as the overall framework of the model and incorporates an FPN into the output layers of the residual blocks of ResNet. This allows the model to mitigate network degradation and increase the accuracy of small-target pedestrian detection. Finally, the AFQ module is proposed to reduce the inference time of the model and increase its detection speed.
3.1. Improved Residual Network
The residual block in the residual network contains only weight layers and activation functions, and each weight layer contains a convolutional layer and a batch normalization layer. The batch normalization layer uses moving averages to estimate the mean and variance during training. As a result, the trained model becomes overly dependent on the training set, so overfitting occurs as the network is stacked deeper. To address this problem, this paper proposes the improved residual network (IResNet), which stacks the neural network by constructing residual blocks with discard (dropout) layers to address network degradation and gradient vanishing. The structure of the improved residual block is shown in Figure 1.

Figure 1 shows the structure of the residual block, where $x$ is the input feature of the residual block, $F(x)$ is the nonlinear mapping in the residual block, and $H(x)$ is the output value of the residual block. If the underlying mapping function is set to $H(x)$, the output of the residual block is

$$H(x) = F(x) + x. \qquad (1)$$

When $F(x) = 0$, the neural network layer in the residual block becomes an identity mapping layer. According to equation (1), the nonlinear mapping of the residual block can be defined as

$$F(x) = H(x) - x. \qquad (2)$$
Equation (2) indicates that the network approaches the optimal solution as $F(x)$ approaches 0, and the phenomenon of network degradation is greatly reduced as the number of network layers increases. In the residual block, the weight layer contains a convolutional layer and a batch normalization layer. The convolutional layer extracts image features, and a pooling layer is added after the convolutional layer to reduce the size of the features and the number of network parameters via downsampling. The feature extraction and pooling processes are described by

$$x_j^{l+1,i} = w_i^l \ast a_j^l + b_i^l, \qquad (3)$$

$$y_j^{l+1,i} = \max_{t \in [(j-1)W+1,\, jW]} a_t^{l,i}, \qquad (4)$$

where $w_i^l$ denotes the weight of the i-th filter in layer $l$; $b_i^l$ denotes the bias of the i-th filter in layer $l$; $a_j^l$ denotes the value of the j-th convolutional region in layer $l$; $x_j^{l+1,i}$ denotes the input to the j-th neuron in the i-th frame in layer $l+1$; $y_j^{l+1,i}$ denotes the value of the corresponding neuron in layer $l+1$ after the pooling operation, where $t \in [(j-1)W+1, jW]$; and $W$ denotes the width of the pooled region.
The activation function in the residual block is the rectified linear unit (ReLU), which is formulated as follows:

$$f(x) = \max(0, x). \qquad (5)$$
Dropout is a simple method proposed by Srivastava et al. [33] to address overfitting in neural networks with a large number of parameters. The dropout layer discards the values of neural units in the network according to a certain probability, i.e., if the output is set to zero, the weights are not updated. A schematic of the dropout process is shown in Figure 2.

The formulation of the neural network in the residual block changes with the introduction of the discard layer and is calculated as follows:

$$r_j^{l} \sim \text{Bernoulli}(p), \qquad (6)$$

$$\tilde{y}^{l} = r^{l} \ast y^{l}, \qquad (7)$$

$$z_i^{l+1} = w_i^{l+1} \tilde{y}^{l} + b_i^{l+1}, \qquad (8)$$

$$y_i^{l+1} = f(z_i^{l+1}), \qquad (9)$$

where $r_j^{l}$ is a random coefficient obeying the Bernoulli distribution; $y^{l}$ is the neuron output in the hidden layer; $\tilde{y}^{l}$ is the neuron output after the discard layer; $z_i^{l+1}$ is the neuron in layer $l+1$ awaiting activation; $w_i^{l+1}$ and $b_i^{l+1}$ are the weight and bias in layer $l+1$, respectively; $y_i^{l+1}$ is the output neuron in layer $l+1$ after the activation function; and $f$ is the activation function in the residual block. Adding a discard layer to the residual block reduces the number of active neurons in the hidden layer and thus the number of features in the intermediate layer, weakening the complex coadaptation among the neural nodes in the network, enhancing the generalizability and robustness of the network, and effectively reducing network degradation.
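As a concrete illustration, the following PyTorch sketch shows one possible form of the residual block with a discard layer. The bottleneck layout and the placement of nn.Dropout after the second activation are assumptions; the paper specifies only that a dropout layer is inserted into the standard residual block.

```python
import torch.nn as nn

class DropoutBottleneck(nn.Module):
    """Sketch of a residual block with a discard (dropout) layer."""
    expansion = 4

    def __init__(self, in_channels, channels, stride=1, p_drop=0.5):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, stride=stride,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.dropout = nn.Dropout(p=p_drop)  # discard layer, eqs. (6)-(7)
        self.conv3 = nn.Conv2d(channels, channels * self.expansion, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels * self.expansion)
        self.relu = nn.ReLU(inplace=True)    # activation f, eq. (5)
        self.downsample = None
        if stride != 1 or in_channels != channels * self.expansion:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, channels * self.expansion, 1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(channels * self.expansion))

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.dropout(out)              # randomly zero units (Bernoulli mask)
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)     # H(x) = F(x) + x, eq. (1)
```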
3.2. Feature-Aligned Pyramid Network
To solve the problems of feature pair misalignment and feature fusion in the feature extraction process of small-target pedestrian detection [34], this paper adopts a feature pyramid network and proposes improvements to it. The feature pyramid network improves small-target pedestrian detection accuracy through multiscale feature map fusion. In this paper, we introduce a feature alignment module (FAM) and a feature selection module (FSM) in the lateral connection part of the feature pyramid to build a network with lateral connections, learn and align important pedestrian features, and enhance the multiscale feature fusion ability of the network, thereby improving small-target pedestrian detection performance. The network structure is shown in Figure 3.

In Figure 3, the image in the lower left corner is the image input to be trained, the multiscale feature map output by the residual block is shown above the image, the feature map after multiscale fusion in the pyramid network is shown on the right, and the part in the dashed box is the lateral connection part of the pyramid network, which contains the 2x up-sampling module, feature selection module, and feature alignment module.
The lateral connection in the conventional FPN performs only a 1 × 1 convolution to keep the number of channels consistent between high-level and low-level features. However, because the saliency of the individual channel features is not assessed, important spatial detail features are easily lost during channel compression. To address this problem, this paper introduces the feature selection module, which models the significant features in the feature mapping process while suppressing and recalibrating redundant feature maps. Figure 4 shows the structure of the feature selection module.

Figure 4 illustrates the structure of the feature selection module. First, the global information $u$ of the input feature map is extracted via global average pooling. The global information is sent to the significant feature construction layer $f_m(\cdot)$, which learns the weight of each channel in the input feature map. These weights are expressed as a feature importance vector $v$ that indicates the saliency of the respective feature maps. The original input feature maps are scaled using the importance vector, and the scaled feature maps are added to the original feature maps to generate rescaled feature maps, which are fed into the feature selection layer $f_s(\cdot)$. This process retains the important feature maps while reducing the number of channels by removing redundant feature maps. The workflow of the feature selection module is given by

$$u_d = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_d(i, j), \qquad (10)$$

$$v = f_m(u), \qquad (11)$$

$$y = f_s(x + v \cdot x), \qquad (12)$$

where the global information $u = [u_1, u_2, \ldots, u_D]$ is obtained from equation (10); $v$ is the feature importance vector; $v_d$ is the significance of the d-th input feature map; $f_m(\cdot)$ is the significant feature construction layer, which consists of a 1 × 1 convolutional layer and a sigmoid activation function; and $f_s(\cdot)$ is the feature selection layer, which consists of a 1 × 1 convolution.
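A minimal PyTorch sketch of equations (10)–(12) follows, assuming $f_m$ is a 1 × 1 convolution followed by a sigmoid and $f_s$ is a 1 × 1 convolution, as described above.

```python
import torch.nn as nn

class FeatureSelectionModule(nn.Module):
    """Sketch of the FSM: global pooling -> importance vector -> rescale -> select."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.fm = nn.Sequential(               # significant feature construction layer
            nn.Conv2d(in_channels, in_channels, 1),
            nn.Sigmoid())
        self.fs = nn.Conv2d(in_channels, out_channels, 1)  # feature selection layer

    def forward(self, x):
        u = x.mean(dim=(2, 3), keepdim=True)   # global information, eq. (10)
        v = self.fm(u)                         # feature importance vector, eq. (11)
        return self.fs(x + v * x)              # rescale, add, and select, eq. (12)
```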
Because the conventional FPN builds its feature hierarchy through repeated downsampling, directly fusing the upsampled high-level features with the low-level features causes contextual misalignment in the fused features. Feature fusion performed in this manner affects the prediction of target boundaries and causes misclassification during prediction. The feature alignment module aligns the upsampled feature maps to a set of reference feature maps by adjusting the sampling positions of the convolutional kernel according to learned offsets.
Figure 5 illustrates the workflow of the feature alignment module, which aligns the upsampled feature map with its reference feature map before feature fusion; i.e., the upsampled feature map is spatially adjusted based on the location information provided by the reference feature map. In Figure 5, $N$ denotes the number of sampling locations of the convolution kernel, $C$ denotes the number of feature channels, and $\Delta$ denotes the offsets of the convolution kernel to be learned.
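The following sketch shows one way to realize the feature alignment module with a deformable convolution from torchvision; learning the offsets $\Delta$ from the concatenation of the upsampled and reference feature maps is an assumption based on the description above.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureAlignmentModule(nn.Module):
    """Sketch of the FAM: learn kernel offsets, resample the upsampled map, fuse."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        n = kernel_size * kernel_size          # N sampling locations of the kernel
        self.offset_conv = nn.Conv2d(2 * channels, 2 * n, kernel_size,
                                     padding=kernel_size // 2)
        self.align = DeformConv2d(channels, channels, kernel_size,
                                  padding=kernel_size // 2)

    def forward(self, upsampled, reference):
        # Delta: offsets predicted from both maps (2 values per sampling location)
        offset = self.offset_conv(torch.cat([upsampled, reference], dim=1))
        aligned = self.align(upsampled, offset)  # resample at shifted positions
        return aligned + reference               # fuse only after alignment
```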

3.3. Autofocus Query Module
Although the combination of the FPN and ResNet increases the detection accuracy for small-target pedestrians, the detection speed and accuracy are still not ideal, especially the detection speed. Because information about small-target pedestrians is very sparse in the image space, the inference and computation of the feature pyramid over small-target pedestrian features are highly redundant, which wastes computation and increases the detection time [35]. In addition, background noise in the image interferes with the features of small-target pedestrians, leading to poor accuracy. To address these problems, this paper proposes the autofocus query (AFQ) module, which performs AFQ operations on feature maps of different scales; its operation process is shown in Figure 6.

Figure 6 illustrates a schematic diagram of the AFQ module, which automatically focuses on the low-resolution feature map input from the pyramid network and predicts the perceptual region. Then, the key locations of small-target pedestrians are calculated by means of a query, and the key location coordinates are passed as key information through the AFQ module to the next, higher-resolution feature map. We set the output vector map of the AFQ module as $V = \{v_{ij}\}$, where $v_{ij}$ denotes the probability that the cell in the i-th row and j-th column of the feature map contains a small-target pedestrian. Then, we define small-target pedestrians in each feature map as objects with scales smaller than a predefined threshold and set the bounding box of a small-target pedestrian in each feature map as $(x, y, h, w)$, where $(x, y)$ is the center point of the small-target pedestrian and $(h, w)$ are its height and width. Next, a binary encoded feature map [36] is generated by calculating the distance from each feature pixel to the nearest small-target center point according to the following distance calculation and judgment equations:

$$d_{ij} = \min_{(x, y)} \sqrt{(i - x)^2 + (j - y)^2}, \qquad (13)$$

$$v_{ij}^{\ast} = \begin{cases} 1, & d_{ij} < \delta, \\ 0, & \text{otherwise}, \end{cases} \qquad (14)$$

where the minimum is taken over the center points of all small-target pedestrians in the feature map, $v_{ij}^{\ast}$ is the binary encoding at location $(i, j)$, and $\delta$ is the distance threshold.
To predict the approximate locations of small-target pedestrians, a parallel query classification and regression module is added to the AFQ module, corresponding to the feature map accepted by each layer of the AFQ module. The regression and prediction values are passed as location information to the next module. Let the set of key locations be $k^{l}$, which can be defined as

$$k^{l} = \{(i, j) \mid v_{ij}^{l} > \sigma\}, \qquad (15)$$

where $v_{ij}^{l}$ is the predicted small-target probability at location $(i, j)$ in layer $l$ and $\sigma$ is the query threshold.
For each layer $l$, the loss function is defined as

$$L^{l} = L_{FL}(c^{l}, c^{l\ast}) + L_{r}(r^{l}, r^{l\ast}) + L_{FL}(s^{l}, s^{l\ast}), \qquad (16)$$

where $c^{l}$ is the classification output, $r^{l}$ is the regression output, $s^{l}$ is the query score output, $c^{l\ast}$ is the true mapping of the classification output, $r^{l\ast}$ is the true mapping of the regression output, $s^{l\ast}$ is the true mapping of the query score, $L_{FL}$ is the focal loss, and $L_{r}$ is the bounding box regression loss [37].
To increase the inference speed, we use a cascade connection between the AFQ modules [38]. The advantage is that the key location set $k^{l}$ is not generated from a single feature map; instead, as $l$ decreases, increasingly more key locations accumulate in the query mapping, as the sketch below illustrates.
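In this sketch, each level keeps the locations whose query score exceeds the threshold $\sigma$ from equation (15) and expands them to the next, higher-resolution level so that only those positions need to be evaluated there; the exact propagation rule is an assumption.

```python
import torch

def cascaded_query(score_maps, sigma=0.15):
    """score_maps: 2D probability maps ordered from low to high resolution,
    each level assumed to be exactly twice the size of the previous one."""
    allowed = None                               # focused region passed down the cascade
    keys = None
    for level, v in enumerate(score_maps):
        mask = v > sigma                         # key locations k^l, eq. (15)
        if allowed is not None:
            mask &= allowed                      # query only previously focused positions
        keys = mask.nonzero(as_tuple=False)      # (num_keys, 2) coordinates
        if level + 1 < len(score_maps):          # propagate keys one level up
            allowed = torch.zeros_like(score_maps[level + 1], dtype=torch.bool)
            for i, j in keys.tolist():
                allowed[2 * i:2 * i + 2, 2 * j:2 * j + 2] = True
    return keys                                  # key locations at the finest level
```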
3.4. FRFP-AFQ
First, to ensure that the model can address network degradation, the residual network is used as the backbone of the model in this paper. Second, to enhance the model's ability to detect small-target pedestrians, the feature pyramid and ResNet are combined. Finally, an AFQ module is proposed to optimize the small-target pedestrian detection performance of the model. Therefore, this paper proposes an autofocus query small-target pedestrian detection model that fuses a residual network and a feature pyramid. The proposed model is termed the FRFP-AFQ model, and its structure is illustrated in Figure 7.

In Figure 7, the leftmost image is the original target detection input, a 640 × 480 pixel RGB image. The dashed box immediately following the arrow contains the residual network and the feature map output by each residual block, where the lowest dimensional feature map has 160 × 120 pixels and 256 channels and the highest dimensional feature map has 20 × 15 pixels and 2048 channels. The dashed box below the residual network shows the structure of the feature pyramid network, which fuses deep high-semantic features and shallow multidetail features through lateral connections; the lateral connections are shown in the lower right corner of Figure 7 and are used to construct the fused shallow and deep feature map, which has 160 × 120 pixels and 256 channels. The deepest feature in each layer contains not only the detailed features of the current dimension but also the high semantic information of the deep layer. The deep feature maps have high semantic information and are suitable for detecting large targets, while the shallow feature maps have multidetail features and are suitable for detecting small targets. Finally, an AFQ operation is applied in each layer of the FPN, and the AFQ operations are cascaded to form the AFQ module. Each AFQ operation includes classification, regression, and query functions to quickly determine the locations of small-target pedestrians. Collectively, the FRFP-AFQ model can address network degradation and achieve superior multiscale feature fusion as well as excellent inference and detection performance.
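To show how the pieces in Figure 7 connect, the sketch below wires one lateral connection of the improved FPN, reusing the FeatureSelectionModule and FeatureAlignmentModule sketches above. The fusion order (select, 2× upsample, align, add) follows the text, and the exact wiring is an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class FRFPLateral(nn.Module):
    """Sketch of one lateral connection: FSM on the bottom-up map,
    2x upsampling of the top-down map, then FAM-based aligned fusion."""
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.fsm = FeatureSelectionModule(in_channels, out_channels)
        self.fam = FeatureAlignmentModule(out_channels)

    def forward(self, bottom_up, top_down):
        reference = self.fsm(bottom_up)        # channel-selected bottom-up features
        upsampled = F.interpolate(top_down, scale_factor=2.0, mode="nearest")
        return self.fam(upsampled, reference)  # FAM aligns, then fuses by addition
```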
3.5. Implementation Steps
The main steps of the FRFP-AFQ-based small-target pedestrian detection model are implemented as follows:
Step 1. The experimental environment uses cloud servers with two Tesla V100 graphics cards, each with computing powers of 15.7 TFLOPS (FP32) and 125 TFLOPS (FP16); a Xeon Gold 6139 CPU; an Ubuntu 18.04 system with 172 GB of memory and 2 × 16 GB of video memory; PyTorch version 1.9.0; CUDA version 11.4; and Python version 3.6.9.
Step 2. The model proposed in this paper was constructed by setting the structures of the convolutional layers, pooling layers, batch normalization layers, and other explicit and implicit layers, and the stochastic gradient descent (SGD) method with momentum was chosen as the optimizer during model training. The network parameters were set by model parameter comparison experiments (see Section 4.3). The final number of epochs was set to 200.
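A minimal configuration sketch under the settings above follows; the model variable is a stand-in for the FRFP-AFQ network.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, 3)        # stand-in for the FRFP-AFQ network
optimizer = torch.optim.SGD(       # SGD with momentum, as chosen in Step 2
    model.parameters(),
    lr=0.01,                       # initial learning rate (model 0, Table 1)
    momentum=0.9,                  # momentum decay
    weight_decay=0.0005,           # weight decay
)
epochs, batch_size = 200, 64
```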
Step 3. The dataset used in this paper was divided into three folders. The first folder was named Annotations and stored all the annotation files in XML format. The second folder was named JPEGImages and stored the image files corresponding to the annotation files in jpg format. The last folder was named ImageSets, which contained a main folder with txt files of the names of the images in the training, test, and validation sets.
Step 4. The pretrained weights were downloaded and unzipped into the pretrained_weights folder. Then, the uploaded dataset was unzipped, and the paths of the training set, test set, and validation set were configured. Then, we returned to the model folder in the terminal command line. Next, we input python train.py to train the model, python test.py to evaluate the trained model, python eval.py to evaluate the training level of the model, and python predict.py to assess the test images.
Step 5. We used equations (3)–(9) to obtain the IResNet model and generate the feature maps, and equations (10)–(12) to complete the FSM function in the IFPN.
Step 6. We calculated the pixel-to-pedestrian center distances of the small targets in the feature vector map using equation (13). Then, we determined the value of the pixel encoding in the new feature vector map using equation (14).
Step 7. We determined the key position information of the small-target pedestrians by using equation (15). Then, the pixel encoding value and position information generated in Step 6 were sent to the next AFQ module by combining them as one key value.
Step 8. We evaluated the trained model according to the loss function shown in equation (16). If the loss value was too large, the AFQ module parameters were fine-tuned, and Steps 6 and 7 were repeated until the loss function value was less than a predefined threshold.
Step 9. We evaluated the data generated during testing to determine whether the detection accuracy reached the expected value. If so, we output the obtained model. Otherwise, we returned to Step 2, fine-tuned the parameters according to the evaluation indices, and repeated Steps 4–8.
Step 10. We calculated the frames per second (FPS) of the model generated in Step 9 and obtained the detection results.
4. Experimental Results and Analysis
4.1. Dataset and Data Processing
The Caltech Pedestrian Dataset is a dataset dedicated to pedestrian detection that was released by Caltech in 2009. The dataset was mainly captured by cars driving on urban streets and contains 10 h of 640 × 480, 30 Hz video with a total of approximately 250,000 frames, 350,000 bounding boxes, and 2300 unique annotated pedestrians. The dataset includes an image dataset (data in seq format) and pedestrian label data (data in vbb format), which mainly contains the pedestrian bounding boxes.
The experimental data processing was implemented in the Python programming language. First, the seq and vbb files were converted to jpg and XML files. The jpg and XML files were placed at the same level as the images and annotations folders and renamed, and unmatched files were deleted. After this processing, we obtained 18,348 images and 18,348 corresponding annotation files. The training, test, and validation sets were generated randomly at a ratio of 6 : 2 : 2.
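A sketch of the random 6 : 2 : 2 split is shown below; the folder and file names are assumptions matching the layout described in Step 3.

```python
import random
from pathlib import Path

names = sorted(p.stem for p in Path("JPEGImages").glob("*.jpg"))
random.seed(0)                          # fix the seed for a reproducible split
random.shuffle(names)
n_train, n_test = int(0.6 * len(names)), int(0.2 * len(names))
splits = {
    "train.txt": names[:n_train],
    "test.txt": names[n_train:n_train + n_test],
    "val.txt": names[n_train + n_test:],   # remaining ~20% for validation
}
out_dir = Path("ImageSets/Main")
out_dir.mkdir(parents=True, exist_ok=True)
for fname, subset in splits.items():
    (out_dir / fname).write_text("\n".join(subset))
```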
4.2. Evaluation Indicators
This paper adopts the evaluation metrics used in the COCO competition [39], including the average precision (AP), AP50, AP75, APS, APM, and APL. The AP is the precision averaged over intersection over union (IOU) thresholds of m%, calculated as shown in equation (17):

$$AP = \frac{1}{10} \sum_{m \in \{0.50, 0.55, \ldots, 0.95\}} P_{IOU = m}. \qquad (17)$$
Equation (17) sums the detection precision over IOU thresholds from 0.5 to 0.95 in steps of 0.05; AP50 and AP75 are the AP values at IOU thresholds of 0.5 and 0.75, respectively. The precision at a given IOU threshold is the percentage of detections that are correctly identified pedestrians and is calculated by the following formula:

$$P_{IOU = m} = \frac{TP}{TP + FP}, \qquad (18)$$

where a true positive (TP) is a positive sample correctly predicted as positive, a false positive (FP) is a negative sample incorrectly predicted as positive, a true negative (TN) is a negative sample correctly predicted as negative, and a false negative (FN) is a positive sample incorrectly predicted as negative.
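For reference, a small sketch of the IOU and precision computations in equations (17) and (18) follows; the box format (x1, y1, x2, y2) is an assumption.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def precision(tp, fp):
    """Equation (18): share of detections that are correct pedestrians."""
    return tp / (tp + fp) if tp + fp > 0 else 0.0

# Equation (17): AP averages the precision over IOU thresholds 0.50:0.05:0.95
iou_thresholds = [0.50 + 0.05 * k for k in range(10)]
```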
APS, APM, and APL are the AP values for small, medium, and large targets, respectively, where the object size conditions are defined according to the COCO evaluation criteria as follows:

$$\text{small}: \text{area} < 32^2, \qquad \text{medium}: 32^2 \le \text{area} \le 96^2, \qquad \text{large}: \text{area} > 96^2, \qquad (19)$$

where area is the number of pixels occupied by the detected object.
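The COCO size buckets translate directly into code, as in this small helper:

```python
def coco_size_bucket(area):
    """Equation (19): COCO size criteria used for APS, APM, and APL."""
    if area < 32 ** 2:
        return "small"     # APS
    if area <= 96 ** 2:
        return "medium"    # APM
    return "large"         # APL
```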
To judge the detection speed of the model, the number of frames per second (FPS) is used as the evaluation index in this paper [40], and its calculation formula is

$$FPS = \frac{FrameNum}{ElapsedTime}, \qquad (20)$$

where FrameNum is the total number of detected images and ElapsedTime is the total time from the start to the end of the detection period.
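Equation (20) can be measured as in the following sketch, where detect is a stand-in for running the trained model on one image.

```python
import time

def measure_fps(detect, images):
    """Equation (20): FPS = FrameNum / ElapsedTime."""
    start = time.perf_counter()
    for img in images:
        detect(img)                        # run detection on one frame
    elapsed = time.perf_counter() - start  # ElapsedTime in seconds
    return len(images) / elapsed           # FrameNum / ElapsedTime
```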
4.3. Comparative Experiments and Analysis of Model Parameters
To obtain a network model suitable for small-target pedestrian detection, this paper sets different network structure parameters based on the Faster R-CNN framework. The five parameters are the learning rate, discard rate, momentum decay, weight decay, and batch size, and the specific modified network parameters and comparison results are shown in Table 1 and Figure 8.

In Table 1, model 0 has a learning rate of 0.01, a discard rate of 0.5, a momentum decay of 0.9, a weight decay of 0.0005, and a batch size of 64; model 1 sets the learning rate to 0.001 on the basis of model 0; model 2 sets the discard rate to 0.3 on the basis of model 0; model 3 sets the momentum decay to 0.8 on the basis of model 0; model 4 sets the weight decay to 0.05 based on model 0; model 5 sets the batch size to 32 based on model 0; model 6 sets the discard rate and momentum decay to 0.2 and 0.8, respectively; and model 7 sets the discard rate, weight decay, and batch size to 0.2, 0.05, and 32, respectively.
Figure 8 shows a comparison of the loss values of the models with different parameters, where M0 and M7 correspond to model 0 and model 7 in Table 1. Figure 8 shows that the lowest loss value of 0.0543 is obtained by M0, while the highest loss value of 0.0657 is obtained by M4. The loss function values of M1, M2, M3, M5, M6, and M7 are 0.0587, 0.0613, 0.0606, 0.0641, 0.0625, and 0.0642, respectively. The results indicate that the model performance is better under the M0 parameters and that the detection capability is excellent. Therefore, the model in this paper uses an initial learning rate of 0.01, a batch size of 64, a momentum decay of 0.9, a weight decay of 0.0005, and a dropout rate of 0.5.
4.4. Ablation Experiments and Analysis
Ablation experiments were conducted to verify the enhancement effect of the dropout layer in the residual network and the FAM module in the feature pyramid network for small-target pedestrian detection. To fairly compare the performance of the models, the ablation experimental frameworks all use Faster R-CNN, and the backbone neural networks are ResNet-50-FPN, ResNet-101-FPN, IResNet-50-FPN, IResNet-101-FPN, ResNet-50-IFPN, ResNet-101-IFPN, IResNet-50-IFPN, and IResNet-101-IFPN. AFQ ablation experiments were also conducted for each backbone network. The above models were trained on the Caltech Pedestrian Dataset to verify the validity of the models according to the COCO evaluation metrics.
The hyperparameters of this ablation experiment were selected as follows: the learning rate was set to 0.01, the batch size was set to 64, the momentum decay was set to 0.9, the weight decay was set to 0.0005, and the dropout rate was set to 0.5. The corresponding model was trained offline, the model was saved in the xxx.pth file format, and the corresponding detection code and detection image were configured. The command python predict.py was input to obtain the detection results of the video, dataset, or camera images. The final detection results are shown in Figure 9, and the evaluation results are shown in Table 2.

As shown in Figure 9 and Table 2, for both small-target and large-target pedestrian detection, the IResNet-IFPN evaluation results are better than those of the ResNet-FPN, IResNet-FPN, and ResNet-IFPN models. Compared with the ResNet-FPN models, the proposed model improved the large-target pedestrian detection accuracy by 21.6% and 32.3%, the small-target pedestrian detection accuracy by 21.6% and 32.3%, the AP value by 17.2% and 24.5%, and the AP50 value by 7.8% and 8.2%. Compared with the models using only the improved residual network (IResNet-FPN), the proposed model improved the large-target pedestrian detection accuracy by 8.7% and 14.4%, the small-target pedestrian detection accuracy by 16% and 10.1%, the AP value by 12.6% and 19.4%, and the AP50 value by 5.5% and 4.6%. Compared with the models using only the improved feature pyramid network (ResNet-IFPN), the proposed model improved the large-target pedestrian detection accuracy by 8.4% and 14.3%, the small-target pedestrian detection accuracy by 14.1% and 22.7%, the AP value by 11.0% and 17.9%, and the AP50 value by 3.4% and 3.6%. The AFQ ablation experiments show that, with the same backbone network, the lowest detection speed increases from 6.9 FPS to 9.8 FPS (a 42.0% improvement) and the highest detection speed increases from 18.5 FPS to 20.1 FPS (an 8.4% improvement); the speed improvement is clear. Although the detection accuracy decreases slightly, the overall accuracy is not affected. The above comparison suggests that the FRFP-AFQ model greatly improves on the original algorithm for all pedestrian targets; the improvement in large-target and integrated pedestrian detection over the models with only the improved residual network or only the improved feature pyramid network is modest, but the small-target pedestrian detection accuracy is greatly improved, which proves that the proposed model improves the comprehensive pedestrian detection capability. Finally, the AFQ module improves the detection speed of the model by 8.4% to 42%. The results show that the FRFP-AFQ model is feasible and effective.
4.5. Comparison with Other Pedestrian Detection Algorithms
The FRFP-AFQ model proposed in this paper is compared with other conventional pedestrian detection algorithms, including MEL [41], SIRA [42], YOLOV3-Promote [43], YOLOV5 [44], and DMSFLN [13], and the detection results are evaluated using the COCO evaluation metrics. All algorithms were trained and evaluated with the same hyperparameters and datasets, as described in Section 4.3. The final detection results are shown in Figure 10, and the evaluation results are shown in Table 3.

As seen in Figure 10 and Table 3, compared with the DMSFLN pedestrian detection algorithm, which uses a VGG-16 backbone network, the proposed model improves the AP50 accuracy by 41.3% and detects pedestrians approximately two times faster. With the same 101-layer residual network, compared with the MEL and SIRA algorithms, the FRFP model improves the large-target pedestrian detection accuracy by 16% and 14.5%, the small-target pedestrian detection accuracy by 26.8% and 20.6%, the AP value by 20.8% and 17.7%, and the AP50 value by 5.5% and 3.6%, respectively. Compared with the conventional YOLO detection algorithms, the detection speed of the proposed model is slightly lower, but the AP50 and AP75 detection accuracies of the model with the IResNet-50-IFPN backbone network are 21.4% and 13.0% better, respectively, than those of the YOLOV3-Promote and YOLOV5 models. When the model uses IResNet-101-IFPN as the backbone network, the AP50 and AP75 values improve by 22.0% and 13.6%, while the detection speed is only 0.1 FPS to 8.4 FPS lower. These results show that the FRFP-AFQ model outperforms the conventional detection algorithms in terms of detection capability and evaluation results for large, small, and integrated targets. In particular, the detection and evaluation results are better for small targets, which shows that the FRFP-AFQ model enhances multiscale feature fusion and feature alignment for small pedestrian targets, so its detection accuracy is higher than that of the MEL and SIRA models. In practical applications, considering multiscale feature fusion and feature alignment is beneficial for improving the detection performance of the model. The FRFP-AFQ model also outperforms the conventional pedestrian detection models in detecting medium and large targets, indicating that the proposed model has a better comprehensive detection capability and more advantages for small targets than the conventional models.
5. Summary
To improve the detection accuracy and robustness of small-target pedestrian detection, the FRFP-AFQ model is proposed, which constructs bottom-up multiscale feature maps via ResNet and performs feature fusion and feature alignment on the multiscale feature maps via an FPN. Multiscale feature fusion combines deep feature maps with high semantic features and shallow feature maps with multidetail features, so the fused feature maps contain both the deep, high-semantic features and the shallow, detailed features. Finally, a cascaded AFQ module is introduced to reduce the inference time and increase the detection speed. Experiments are conducted on the Caltech Pedestrian Dataset. The experimental results show that the model designed in this paper outperforms the conventional YOLOV3-Promote, SIRA, YOLOV5, MEL, and other detection models and has good application prospects.
The detection accuracy of the proposed model is still affected by extreme weather and multitarget pedestrian occlusion, and the small-target pedestrian detection ability is reduced in bad weather such as heavy rain and fog, as well as in the case of high crowd flow. In future studies, we will focus on the effects of bad weather and multitarget pedestrian occlusion on detection, enhance the learning ability and generalizability of the model in the case of extreme weather and multitarget pedestrian occlusion, and improve the robustness of the model for small-target pedestrian detection in extreme situations such as snowstorms and pedestrian occlusion.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (NSFC) under grant 61976055 and the special fund for education and scientific research of Fujian Provincial Department of Finance under grant GY-Z21001.