Abstract

CNN-based crowd counting methods use image pyramids and dense connections to fuse features and thereby address multiscale variation and information loss. However, these operations introduce information redundancy and confusion between crowd and background information. In this paper, we propose a multiscale guided attention network (MGANet) to solve these problems. Specifically, the multilayer features of the network are fused in a top-down manner to obtain multiscale and context information. An attention mechanism then guides the features of each layer in both space and channel so that the network focuses on the crowd in the image, ignores irrelevant information, and integrates the results into a final high-quality density map. Besides, we propose a counting loss function combining SSIM Loss, MAE Loss, and MSE Loss to achieve effective network convergence. Experiments on four major datasets yield good results, and ablation studies confirm the effectiveness of each network module. The source code is available at https://github.com/lpfworld/MGANet.

1. Introduction

Crowd counting estimates the number of people in images or video frames, enabling the effective management of different scenes such as meetings and sports events. It has been widely used in public security, intelligent traffic, video surveillance, etc. [1–5]. Crowd counting can also be applied to counting cells, viruses, and animals, extending the research into medical and behavioral science [6]. However, it remains a challenging task in computer vision due to crowd occlusion, scale variation, uneven data distribution, and so forth (see Figure 1).

With the excellent performance of convolutional neural networks in various fields of computer vision, researchers have made various attempts at crowd counting. Some researchers used multicolumn structures with different convolution kernels to address the multiscale problem [7–10]. Others obtained feature maps at different scales with parallel convolution kernels or fused multiscale information by densely connecting multilayer features [11, 12]. However, in such structures, the features learned by different branches are highly redundant, contributing little to the extraction of multiscale information. At the same time, these redundant features disturb the generation of crowd density maps, so the background or other image content is easily mistaken for the crowd [12, 13]. Some networks therefore apply spatial attention during training to emphasize the crowd in images and mitigate problems such as background interference [13, 14].

The feature maps of different layers in the network contain different scale features and semantic features, so we cascade the features of different layers from top to bottom. Low-level layers contain more detailed information, which helps form density maps for high-density scenes, while high-level layers contain more semantic information, which helps distinguish human heads from background noise. In this way, information at different scales can be obtained without increasing the complexity of the network structure. By comparison, most crowd counting networks use bottom-up multilayer fusion and then rely on deconvolution or dilated convolution to keep the final density map scale unchanged; with our top-down structure, we do not need to consider how to recover the final density map scale. We fuse the multilayer features by cascading them rather than by overlaying the feature maps channel by channel. Cascading across layers effectively prevents information loss, so the final density map retains complete context information.

The fusion of features from different layers yields rich information, but a further selection is needed before generating the density map. Most networks use spatial attention mechanisms to selectively strengthen the human head area [15–17]. However, channel information also carries class-specific responses; in a crowd image, the classes are head and background. Therefore, we optimize the fused features by combining spatial attention with channel attention.

MSE Loss is the most commonly used loss function in crowd counting. However, in crowd scenes, the texture features and pixel correlations of high-density and low-density areas differ. MSE Loss is based on the assumption of pixel independence and ignores the local correlation of the density map. Besides, MSE Loss does not take the global counting error of the input image into account. To optimize the loss function, we combine SSIM Loss, MAE Loss, and MSE Loss as the final loss. This loss measures the local consistency between the predicted density map and the ground truth density map, as well as the difference between the predicted and actual counts, so that the network converges to generate a high-quality density map.

In this paper, we propose a novel network for crowd counting. The network integrates multiscale crowd features from top to bottom and uses spatial and channel attention mechanisms to further guide the features toward high-quality density maps. This network is called MGANet, and its structure is shown in Figure 2. Specifically, our network consists of two parts: MFE (Multiscale Feature Extraction) and MFG (Multiscale Feature Guide). MFE uses the VGG16 backbone to extract multiscale features and fuses the multilayer features step by step in a top-down manner, which not only captures the semantic features contained at different levels but also expresses the detailed features of people at different scales. MFG differentiates head information from background information through channel attention, locates the head area more accurately through spatial attention, and combines the outputs of the two kinds of attention into an effective density map. MFG contains three columns of feature outputs with different resolutions, which are converted into density maps by 1 × 1 convolutions, and the density maps are upsampled to the same resolution. SSIM Loss, MAE Loss, and MSE Loss are combined as the final loss function. Experiments on four datasets (ShanghaiTech [11], UCF_CC_50 [18], WorldExpo’10 [19], and Smart City [12]) demonstrate the effectiveness and robustness of our method.

In summary, the main contributions of our paper are as follows:
(1) We propose a novel crowd counting network, MGANet, consisting of MFE and MFG. It effectively handles the information redundancy and confusion caused by multiscale feature fusion and generates a high-quality density map for accurate crowd counting.
(2) In MFE, we propose a top-down feature fusion scheme that makes the feature information of different layers complementary through a multilayer feature cascade. The network effectively adapts to changes in crowd scale and prevents the loss of context information. MFG combines channel attention and spatial attention, eliminating the influence of redundant features so that the resulting density map focuses on the crowd.
(3) We propose a counting loss function combining SSIM Loss, MAE Loss, and MSE Loss, designed to achieve effective convergence of the network.

2. Related Work

Researchers have summarized a series of excellent crowd counting methods [20, 21]. We briefly review the work related to this paper, including crowd counting methods and attention mechanisms related to the crowd.

2.1. Crowd Counting Methods

Crowd counting methods can be classified into two categories: traditional methods and CNN-based methods.

2.1.1. Traditional Methods

Traditional methods fall into two categories: detection-based methods and regression-based methods. Early work counted crowds by detection, sliding windows over the image and manually extracting whole-body features, such as Haar wavelets, HOG, or pedestrian edge information [4, 22]. But these methods fail in crowded situations. To address this, researchers tried detecting specific body parts rather than the whole body, such as the shoulders or head [23, 24], but such methods still cannot cope with high-density crowd scenes. Regression-based methods count the crowd by learning a mapping from features to counts [25–27]: they first extract low-level features such as texture and edges and then fit a suitable machine learning regression model to map these features to the count. Whether detection-based or regression-based, the output information is relatively limited and the processing steps are relatively complicated.

2.1.2. CNN-Based Methods

In recent years, deep learning has developed rapidly, and great progress has been made in CNN-based crowd counting. Researchers have tried CNN-based object detectors for counting, including YOLO [28] and Faster-RCNN [29]. Although detection-based CNN methods improve greatly on traditional methods, mapping the image to a density map with a fully convolutional network yields better predictions for dense crowds. A detailed survey of CNN-based counting is presented in [30]. MCNN uses a three-column convolutional neural network with a similar structure in each column; the columns contain convolution kernels of different sizes in order to adapt to heads of different scales [10]. Switch-CNN is also a multicolumn structure, using a density classifier to route patches of different densities to the appropriate column [8]. MSCNN uses multiscale blobs to generate scale-dependent features within a single-column structure, achieving high counting performance with good accuracy and cost-effectiveness in practical applications [31]. CSRNet is divided into front-end and back-end networks: VGG16 with the fully connected layers removed serves as the front end, and a dilated convolutional neural network serves as the back end; the receptive field is enlarged while maintaining resolution, producing a high-quality crowd density map [11]. CP-CNN first extracts and classifies features of the whole input image, converts the classification result into a global context related to the density level, and performs the same operation on patches cut from the image to obtain local context; the combined features are then constrained so that the network can adaptively learn density-level-specific features for any image [32]. SANet extracts head information at multiple scales using a module similar to the Inception architecture: convolution kernels of different sizes are used in each convolutional layer simultaneously, and the final high-resolution density map is obtained through deconvolution [33]. DADNet fuses scale-aware attention outputs with different dilation rates to capture different visual granularities of the crowd regions of interest and uses deformable convolution to generate high-quality density maps [15]. CAN fuses features acquired from different receptive fields, learns the importance of each feature in the image, and adaptively encodes the scale of context information to accurately predict crowd density [16]. These studies aggregate multiscale information through multicolumn structures, parallel dilated convolutions, or fusion of features at different levels. MS-GAN combines a multiscale convolutional neural network (generator) with an adversarial network (discriminator) to generate a high-quality density map [34].

2.2. Attention Mechanism

To extract effective features, attention mechanisms have also been studied extensively in crowd counting. DecideNet holds that detection-based methods work better in sparse scenes while regression methods work better in dense scenes, so it adopts an attention mechanism to adaptively weight detection and regression as crowd density changes [35]. DADNet uses multicolumn dilated convolutions to deal with multiscale problems and predicts an attention map for each column to achieve feature selection [15]. SCAR adds a spatialwise attention model and a channelwise attention model to filter image features before integrating them [36]. Focus for Free integrates the tasks of segmentation, counting, and classification, using segmentation to provide spatial attention for counting and classification to provide channel attention, thereby improving counting accuracy [37]. ASNet first generates interest masks at different density levels, then generates density maps and scaling factors, multiplies them to output individual attention-based density maps, and sums these to produce the final density map [20]. Such attention mechanisms improve counting accuracy rather than relying on feature fusion alone. This paper likewise applies channel and spatial attention mechanisms to the multiscale fused features: through a series of attention modules, the attention is progressively refined and noise is gradually eliminated, giving more weight to the truly important areas.

3. Method

We propose MGANet, which accurately counts people under different crowding levels. In this section, we elaborate on four aspects: MFE, MFG, feature fusion, and the loss function.

3.1. MFE (Multiscale Feature Extraction)

In the process of crowd image acquisition, different distances between the crowd and the camera in the same scene cause a perspective effect; that is, a multiscale problem arises. To solve it, we propose a top-down feature fusion method. VGG16 has good feature representation ability and its network is easy to modify, so we choose VGG16 as the front-end feature extractor. Specifically, Conv3_3, Conv4_3, and Conv5_3 output features at 1/4, 1/8, and 1/16 of the original input size. As the network propagates forward, the receptive field of the feature map gradually grows, so the feature maps generated by the front layers should attend to small heads, while those generated by the rear layers should attend to large heads. The receptive field is computed as

$$RF_{l} = RF_{l-1} + (k_{l} - 1)\times S_{l-1}, \qquad S_{l} = S_{l-1}\times s_{l},$$

where $RF_{l}$ is the receptive field of the convolution layer at the $l$-th layer, $S_{l-1}$ is the cumulative stride of the receptive field up to the $(l-1)$-th layer, $s_{l}$ is the convolution stride, and $k_{l}$ is the convolution kernel size. Besides, high-level features contain more semantic information, while low-level features retain the location information of the input image; combining features of different levels is therefore an effective way to improve network performance. Our specific fusion method is

$$F_{i,j} = C\big(F_{i-1,j},\, U(F_{i-1,j+1})\big),$$

where $F_{i,j}$ represents the fused feature at the $i$-th row and $j$-th column, $C(\cdot)$ denotes the cascade operation, and $U(F_{i-1,j+1})$ is the feature at the $(i-1)$-th row and $(j+1)$-th column after the upsampling module. This fusion method not only solves the multiscale problem of the human head but also makes full use of the semantic and location information of the feature maps at different layers. In this work, the upsampling module $U(\cdot)$ is the key to connecting the layers; we name it the upsampled embedded block (UEB). Its detailed design is shown in Figure 3.

First, the feature map output by Conv5_3 is upsampled by the UEB with nearest-neighbor interpolation and fused with the feature map output by Conv4_3; the same operation is applied to Conv4_3 and Conv3_3. The network thus obtains the first level of fused features and then repeats similar operations. Starting from Conv3_3, Conv4_3, and Conv5_3, we obtain three columns of fused features, which are fed into the MFG for further processing.
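For concreteness, the following PyTorch sketch implements one cascade level of MFE under stated assumptions: the UEB is taken to be nearest-neighbor upsampling followed by a 1 × 1 convolution (its exact internal design is given in Figure 3), fusion is channel concatenation, and the `UEB`/`MFE` class names and channel widths are illustrative rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models


class UEB(nn.Module):
    # Upsampled embedded block (sketch): nearest-neighbor upsampling to the
    # size of the shallower map, then a 1x1 conv to match its channel width.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, deep, shallow):
        deep = F.interpolate(deep, size=shallow.shape[2:], mode='nearest')
        return self.conv(deep)


class MFE(nn.Module):
    # Top-down fusion of VGG16 features: Conv3_3, Conv4_3, Conv5_3 give maps
    # at 1/4, 1/8, 1/16 of the input size; each deeper map is upsampled by a
    # UEB and concatenated with its shallower neighbor, i.e.
    # F_{i,j} = C(F_{i-1,j}, U(F_{i-1,j+1})). Only one cascade level is shown;
    # the full model stacks two (cf. the ablation in Table 5).
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights='DEFAULT').features
        self.stage3 = vgg[:16]    # through relu3_3 (256 ch, 1/4)
        self.stage4 = vgg[16:23]  # through relu4_3 (512 ch, 1/8)
        self.stage5 = vgg[23:30]  # through relu5_3 (512 ch, 1/16)
        self.ueb_54 = UEB(512, 512)
        self.ueb_43 = UEB(512, 256)

    def forward(self, x):
        f3 = self.stage3(x)
        f4 = self.stage4(f3)
        f5 = self.stage5(f4)
        g4 = torch.cat([f4, self.ueb_54(f5, f4)], dim=1)  # 1024 ch, 1/8
        g3 = torch.cat([f3, self.ueb_43(f4, f3)], dim=1)  # 512 ch, 1/4
        return g3, g4, f5  # three columns handed to MFG
```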

3.2. MFG (Multiscale Feature Guide)

The features obtained from MFE need further refinement to generate a high-quality density map. This refinement comprises a spatial attention mechanism and a channel attention mechanism.

3.2.1. Spatial Attention

A fully convolutional network regresses the pixels of the density map and does not explicitly pay more attention to the head region; in other words, the background may influence the training loss. To resolve the confusion between crowd and background, we use spatial attention to make the network generate a spatial attention map of crowd head information. This attention map is then applied to suppress information from non-head regions so that the network focuses more on the head region (see Figure 4).

$F_{i}$, the feature map produced by MFE for each column, is used to predict an attention map with a $1 \times 1$ convolution and is then multiplied by that attention map to obtain the spatial attention feature map $F_{i}^{SA}$. The process is

$$F_{i}^{SA} = A_{i} \odot F_{i},$$

where $A_{i}$ is the attention map obtained from $F_{i}$ and $\odot$ is elementwise multiplication. $A_{i}$ is computed as

$$A_{i} = \sigma\big(W \ast F_{i} + b\big),$$

where $F_{i}$ is the feature map of each column, $\ast$ represents the $1 \times 1$ convolution, $W$ and $b$ represent the parameters of the convolution, and $\sigma$ is the activation function.
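A minimal sketch of this operation, assuming the activation $\sigma$ is a sigmoid (the text above only names it as the activation function):

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    # Sketch of the spatial attention above: a 1x1 conv predicts a
    # single-channel map A = sigma(W * F + b), which is broadcast-
    # multiplied with the input to suppress non-head regions.
    def __init__(self, in_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 1, kernel_size=1)

    def forward(self, f):
        a = torch.sigmoid(self.conv(f))  # (B, 1, H, W), values in (0, 1)
        return a * f                     # F^SA = A (.) F
```

The same module is applied independently to each of the three MFE columns.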

3.2.2. Channel Attention

Each channel of a high-level feature can be viewed as a specific category of responses, with different semantic responses linked to each other. By utilizing the dependency relationship between different channels, the interdependence between features can be enhanced and the expression of feature semantics can be improved (see Figure 5).

The dimension of the input feature $F$ is $C \times H \times W$. First, $F$ is reshaped to obtain $\hat{F} \in \mathbb{R}^{C \times N}$, where $N = H \times W$. Then, $\hat{F}$ is multiplied by the transpose of $\hat{F}$. Finally, the result is sent to softmax to obtain the channel attention map $X \in \mathbb{R}^{C \times C}$:

$$x_{ji} = \frac{\exp\big(\hat{F}_{i} \cdot \hat{F}_{j}\big)}{\sum_{i=1}^{C}\exp\big(\hat{F}_{i} \cdot \hat{F}_{j}\big)},$$

where $x_{ji}$ is the effect of the $i$-th channel on the $j$-th channel and $X$ consists of $x_{ji}$. Matrix multiplication of $X$ and $\hat{F}$ followed by reshaping back to $C \times H \times W$ gives the final result combined with channel attention, $F^{CA}$; its expression is

$$F^{CA}_{j} = \beta \sum_{i=1}^{C}\big(x_{ji}\,\hat{F}_{i}\big) + F_{j}.$$

$\beta$ is initialized to 0 and gradually learns to assign more weight. In this process, the computed matrix $X$ also plays the role of attention: each of its rows encodes the dependency between a certain channel and all the other channels. Softmax maps the values into [0, 1]; the greater the value, the stronger the dependency. By multiplying the attention map with $\hat{F}$, dependent channels are selectively integrated, semantic feature expression is improved, and long-distance semantic dependencies between different channels are modeled.
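The following sketch mirrors these equations; the `ChannelAttention` name is ours, and the plain softmax over the channel affinity matrix is assumed:

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    # Sketch of the channel attention above: X = softmax(Fhat @ Fhat^T) is
    # the C x C channel affinity, and beta (initialized to 0) weights the
    # attended features against the identity branch.
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, f):
        b, c, h, w = f.shape
        f_hat = f.view(b, c, h * w)                       # (B, C, N), N = H*W
        energy = torch.bmm(f_hat, f_hat.transpose(1, 2))  # (B, C, C)
        x = torch.softmax(energy, dim=-1)                 # x_ji over channels i
        out = torch.bmm(x, f_hat).view(b, c, h, w)        # sum_i x_ji * Fhat_i
        return self.beta * out + f                        # residual connection
```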

3.2.3. Fusion of Features

For the outputs of spatial attention and channel attention, an elementwise sum completes the feature fusion, and the density map is obtained through a $1 \times 1$ convolution. Since the resulting density maps are 1/4, 1/8, and 1/16 of the original input size, an upsampling operation brings them all to a final size of 1/4 of the original image.
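A sketch of one MFG column head, reusing the two attention sketches above and assuming bilinear interpolation for the upsampling step (the paper does not specify the interpolation mode):

```python
import torch.nn as nn
import torch.nn.functional as F


class MFGHead(nn.Module):
    # Sketch of one MFG column: element-wise sum of the two attention
    # outputs, a 1x1 conv regressing a single-channel density map, and
    # upsampling so every column ends at 1/4 of the input resolution.
    # SpatialAttention / ChannelAttention are the sketches given earlier.
    def __init__(self, in_ch):
        super().__init__()
        self.spatial = SpatialAttention(in_ch)
        self.channel = ChannelAttention()
        self.to_density = nn.Conv2d(in_ch, 1, kernel_size=1)

    def forward(self, f, out_size):
        fused = self.spatial(f) + self.channel(f)  # element-wise sum
        d = self.to_density(fused)
        return F.interpolate(d, size=out_size, mode='bilinear',
                             align_corners=False)
```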

3.3. Loss Function

In a crowd scene, the local features of high-density and low-density areas differ, but MSE Loss is based on the assumption of pixel independence and does not consider the local correlation of the density map. Besides, it does not take the counting error of the image into account. Therefore, SSIM Loss and MAE Loss are added to the loss function. SSIM Loss computes the similarity between two images from local features, so the generated crowd density map can be compared with the ground truth. MAE Loss directly measures the difference between the estimated crowd count and the ground truth count.

3.3.1. SSIM Loss

SSIM is an indicator widely used in the field of image quality assessment. It computes the similarity between two images based on local patterns (including mean, variance, and covariance). SSIM ranges over [−1, 1]; the more similar the two images, the greater the value, and it equals 1 when the two images are identical. First, local statistics are estimated with an $11 \times 11$ normalized Gaussian kernel with a standard deviation of 1.5. The weight set $W = \{W(p) : p \in P\}$ is defined, where $p$ is an offset from the center and $P$ contains all positions in the kernel. For each location $x$, the local statistics of the estimated density map $\hat{y}$ and the corresponding ground truth $y$ are computed. The local mean and variance of $\hat{y}$ are calculated as follows:

$$\mu_{\hat{y}}(x) = \sum_{p \in P} W(p)\,\hat{y}(x+p), \qquad \sigma^{2}_{\hat{y}}(x) = \sum_{p \in P} W(p)\,\big[\hat{y}(x+p) - \mu_{\hat{y}}(x)\big]^{2}.$$

The local mean and variance of $y$ are defined analogously:

$$\mu_{y}(x) = \sum_{p \in P} W(p)\,y(x+p), \qquad \sigma^{2}_{y}(x) = \sum_{p \in P} W(p)\,\big[y(x+p) - \mu_{y}(x)\big]^{2}.$$

We can calculate the local covariance between $\hat{y}$ and $y$ as follows:

$$\sigma_{\hat{y}y}(x) = \sum_{p \in P} W(p)\,\big[\hat{y}(x+p) - \mu_{\hat{y}}(x)\big]\big[y(x+p) - \mu_{y}(x)\big].$$

According to these statistics, SSIM is calculated point by point:

$$\mathrm{SSIM}(x) = \frac{\big(2\mu_{\hat{y}}\mu_{y} + C_{1}\big)\big(2\sigma_{\hat{y}y} + C_{2}\big)}{\big(\mu_{\hat{y}}^{2} + \mu_{y}^{2} + C_{1}\big)\big(\sigma_{\hat{y}}^{2} + \sigma_{y}^{2} + C_{2}\big)},$$

where $C_{1}$ and $C_{2}$ are very small constants that avoid division by zero.

Finally, the SSIM loss is defined as

$$L_{\mathrm{SSIM}} = 1 - \frac{1}{M}\sum_{x}\mathrm{SSIM}(x),$$

where $M$ is the number of pixels in the density map.
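A runnable sketch of this loss; the 11 × 11 window matches the standard SSIM setup, and the stabilizers c1 = 1e-4 and c2 = 9e-4 are the conventional SSIM constants, assumed here since the paper only calls them very small:

```python
import torch
import torch.nn.functional as F


def gaussian_window(size=11, sigma=1.5):
    # Normalized 2-D Gaussian weights W(p), summing to 1 over the kernel.
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, size, size)


def ssim_loss(pred, gt, c1=1e-4, c2=9e-4):
    # L_SSIM = 1 - mean_x SSIM(x) for (B, 1, H, W) density maps. Variances
    # are computed via E[x^2] - E[x]^2, equivalent to the windowed
    # definitions above.
    w = gaussian_window().to(pred.device)
    mu_p = F.conv2d(pred, w, padding=5)
    mu_g = F.conv2d(gt, w, padding=5)
    var_p = F.conv2d(pred * pred, w, padding=5) - mu_p ** 2
    var_g = F.conv2d(gt * gt, w, padding=5) - mu_g ** 2
    cov = F.conv2d(pred * gt, w, padding=5) - mu_p * mu_g
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2))
    return 1 - ssim.mean()
```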

3.3.2. MAE Loss

Mean absolute error loss is introduced to assess the difference between the estimated and ground truth counts:

$$L_{\mathrm{MAE}} = \big|\hat{C} - C\big|, \qquad \hat{C} = \sum_{x}\hat{y}(x),$$

where $\hat{y}$ is the density map generated by MGANet, $\hat{C}$ is the estimated count, and $C$ is the label value.

3.3.3. MSE Loss

The Euclidean loss is used to assess the difference between the ground truth density map and the density map output by the model, helping the model adjust its parameters to produce a density map closer to the real situation. It measures the estimation error at the pixel level:

$$L_{\mathrm{MSE}} = \frac{1}{2N}\sum_{i=1}^{N}\big\|F(X_{i};\Theta) - D_{i}\big\|_{2}^{2},$$

where $F(X_{i};\Theta)$ is the output of MGANet, $\Theta$ denotes the trainable parameters of the model, $X_{i}$ is the input image, $D_{i}$ is the corresponding ground truth density map, and $N$ is the number of training images.

3.3.4. The Final Count Loss

The final loss consists of $L_{\mathrm{MSE}}$, $L_{\mathrm{SSIM}}$, and $L_{\mathrm{MAE}}$. $\alpha$ and $\beta$ are the weighting factors of $L_{\mathrm{SSIM}}$ and $L_{\mathrm{MAE}}$, both set to 0.001:

$$L = L_{\mathrm{MSE}} + \alpha L_{\mathrm{SSIM}} + \beta L_{\mathrm{MAE}}.$$
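Putting the three terms together, a sketch of the final counting loss with α = β = 0.001 as in the equation above; `ssim_loss` is the sketch from Section 3.3.1, and a plain per-pixel mean is used for the MSE term:

```python
import torch


def counting_loss(pred, gt, alpha=0.001, beta=0.001):
    # L = L_MSE + alpha * L_SSIM + beta * L_MAE.
    # L_MAE compares the per-image integrated counts; L_MSE is the
    # pixel-wise squared error between the two density maps.
    l_mse = torch.mean((pred - gt) ** 2)
    l_ssim = ssim_loss(pred, gt)
    l_mae = torch.abs(pred.sum(dim=(1, 2, 3)) - gt.sum(dim=(1, 2, 3))).mean()
    return l_mse + alpha * l_ssim + beta * l_mae
```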

4. Experiment

In this section, we introduce the evaluation metrics, the four standard benchmark datasets, the experimental setup, and the training method in order. Finally, we report experimental results on these datasets.

4.1. Training Details

MGANet is an end-to-end structure implemented with the PyTorch framework, so the training process is simple. We set the training batch size to 1. MGANet uses standard SGD as the optimizer with momentum 0.9. Besides, we employ random Gaussian initialization with a standard deviation of 0.01. The initial learning rate is set to 0.001 and decreases as the number of iterations increases.
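As a sketch of this setup (the `MGANet` top-level module is hypothetical here, and the step decay schedule is one plausible reading of a rate that decreases with iterations):

```python
from torch import nn, optim

model = MGANet()  # hypothetical module wiring MFE into MFG
# Gaussian init, std 0.01 (applied to every conv for brevity; in practice
# the pretrained VGG16 backbone weights would be left intact).
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, std=0.01)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```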

4.2. Evaluation Metrics

Following the existing literature, we adopt the mean absolute error (MAE) and mean squared error (MSE) as the evaluation metrics. MAE indicates the counting accuracy of the model, and MSE reflects its robustness. The formulae are

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\big|C_{i} - \hat{C}_{i}\big|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(C_{i} - \hat{C}_{i}\big)^{2}},$$

where $N$ represents the number of test images, $C_{i}$ is the actual number of people in the $i$-th test image, and $\hat{C}_{i}$ is the corresponding count estimated by our model.
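These metrics reduce to a few lines; a sketch over per-image counts:

```python
import numpy as np


def evaluate(gt_counts, pred_counts):
    # MAE and (rooted) MSE over N test images, matching the formulae above.
    diff = np.asarray(pred_counts, dtype=np.float64) - np.asarray(gt_counts)
    mae = np.abs(diff).mean()
    mse = np.sqrt((diff ** 2).mean())
    return mae, mse
```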

4.3. Ground Truth Generation

So that the density map can adapt to various crowd conditions, an image with $N$ heads is expressed as

$$D(x) = \sum_{i=1}^{N} \delta(x - x_{i}) \ast G_{\sigma_{i}}(x), \qquad \sigma_{i} = \beta\,\bar{d}_{i},$$

obtained by convolving the delta functions with a Gaussian kernel normalized to 1, where $\delta(x - x_{i})$ represents a pedestrian head at pixel $x_{i}$, the sum over all heads reflects the crowd distribution of the images in the dataset, $\beta$ is a constant, and $\bar{d}_{i}$ represents the average distance from the $i$-th target to its $k$ nearest neighbors.

In our experiments, we follow the configuration in CSRNet [11]; certain parameters are set to fixed values. The setups are shown in Table 1.
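A sketch of the geometry-adaptive generation described above; β = 0.3 and k = 3 follow common MCNN/CSRNet practice and are assumptions here (the fixed values actually used are those in Table 1):

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import KDTree


def density_map(points, shape, beta=0.3, k=3):
    # One unit impulse per annotated head, blurred with a Gaussian whose
    # sigma is beta times the mean distance to the k nearest neighbors.
    # Assumes at least k+1 heads; a per-point gaussian_filter call is slow
    # but keeps the sketch clear.
    density = np.zeros(shape, dtype=np.float32)
    if len(points) == 0:
        return density
    tree = KDTree(points)
    dists, _ = tree.query(points, k=k + 1)  # first neighbor is the point itself
    for (x, y), d in zip(points, dists):
        sigma = beta * d[1:].mean()
        impulse = np.zeros(shape, dtype=np.float32)
        row = min(int(y), shape[0] - 1)
        col = min(int(x), shape[1] - 1)
        impulse[row, col] = 1.0
        density += gaussian_filter(impulse, sigma)
    return density
```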

4.4. Comparisons with State of the Art

We evaluate our method on four publicly available crowd counting datasets: ShanghaiTech, UCF_CC_50, WorldExpo’10, and Smart City. These datasets cover different crowd situations, including dense and sparse scenes. Tables 2–5 report the MAE and MSE comparisons, and the test results of the model are visualized in Figures 6–8.

4.4.1. ShanghaiTech Dataset

The ShanghaiTech dataset [11] includes two parts, Part A and Part B, and consists of 1198 crowd images with 330,165 annotated people. Part A is collected from the Internet and contains mostly congested scenes, with 300 training images and 182 testing images. Part B is captured on busy streets and contains relatively sparse scenes, with 400 training images and 316 testing images and counts ranging from 9 to 578.

Test results and visualizations for the ShanghaiTech dataset are shown in Table 6 and Figure 6. In the visualization, we test crowd images from low density to high density in Part A and Part B, respectively. We achieve an MSE of 96.7 on Part A, and the other metrics also perform well. It can be seen that we perform well in both low-density and high-density scenarios.

4.4.2. UCF_CC_50 Dataset

The UCF_CC_50 dataset [18] is very challenging due to its varied perspectives, small size, and varied resolutions. It contains 50 crowd images with 63,974 annotated people; counts range from 94 to 4,543, with an average of 1,280. Fivefold cross-validation is the standard protocol on this dataset. Our results fall behind the 212.2 and 243.7 of CAN [16]: the UCF_CC_50 images are high-density scenes without complex background information, while our MGANet is better suited to scenes with complex backgrounds.

The test results and visualizations on UCF_CC_50 are shown in Table 2 and Figure 7. Our MAE and MSE are 240.8 and 311.5, respectively, a clear improvement over most other methods, showing that our model also performs well on small datasets and in high-density scenarios.

4.4.3. WorldExpo’10 Dataset

WorldExpo’10 [19] was collected from 108 different surveillance cameras and contains 3,980 annotated frames from 1,132 video sequences. It supports cross-scene evaluation of models, and regions of interest (ROI) are provided for all scenes.

The test results and visualizations on the WorldExpo’10 dataset are shown in Table 3 and Figure 8. The test data is divided into five scenes with strong background interference. We test each of the five scenes, and the final average score is 7.86. In scenes S1 and S5, the scores are 2.1 and 3.0, respectively, better than the 2.4 and 4.0 of CAN [16]. It can be seen that our model also performs well on a multiscene crowd counting dataset with complex backgrounds.

4.4.4. Smart City Dataset

Smart City [12] contains 50 images shot from a high camera angle, covering ten scenes such as sidewalks and shopping malls. The images are divided into indoor and outdoor scenes and contain few pedestrians, ranging from 1 to 14 people per image with an average of 7.4.

The test results and visualizations on the Smart City dataset are shown in Table 4 and Figure 9. Relatively speaking, the backgrounds in this dataset are complex. We obtain the best results on it: 8.2 MAE and 10.2 MSE, better than the 9.4 and 11.4 of CAN [16], again confirming that our model performs well on multiscene data with complex backgrounds.

4.5. Ablation Experiment
4.5.1. Top-Down Feature Fusion

In the experiment verifying the feature fusion function, we compare three cascade fusion settings derived from the network architecture: (1) no cascade fusion: only the original features output by VGG16 are used; (2) one-level cascade fusion: a top-down cascade operation is added to the original VGG16 features; (3) two-level cascade fusion: two top-down cascade levels are added to the original VGG16 features. Experiments are performed on ShanghaiTech A and B, and the results are shown in Table 5. As the number of cascade operations increases, the metrics gradually improve; however, model complexity also grows, and the performance gain diminishes.

4.5.2. The Function of the Attention Mechanism

(1) Comparison of Different Attention Modules. In the experiments to verify the effect of attention, we set 4 conditions according to the network architecture: (1) no attention; (2) only channel attention is used; (3) only spatial attention is used; and (4) channel attention and spatial attention are used. Experiments are performed on ShanghaiTech A and B, and the results are shown in Table 7.

(2) Comparison with the Same Form of Attention Mechanism. Channel attention and spatial attention are also included in CBAM [39]. Although the names are similar, there are certain differences. The tasks of CBAM and MFG differ: CBAM focuses on recognizing target objects, which also gives it better interpretability, while MFG focuses on pixel-level attention design and is better suited to crowd counting. For a fair comparison, we add CBAM and MFG to the MCNN [10] network, respectively, and the test results are shown in Table 8. (1) MCNN: the baseline of this paper, a fully convolutional neural network with three columns of convolution kernels of different sizes. (2) MCNN + CBAM: CBAM is added on top of MCNN. (3) MCNN + MFG: MFG is added on top of MCNN. (4) MGANet: our full model, consisting of MFE and MFG.

We find that MFG achieves better performance than CBAM: 84.4/136.7 versus 120.6/184.61 MAE/MSE on ShanghaiTech A, and 14.3/23.3 versus 29.77/49.66 MAE/MSE on ShanghaiTech B. When MFG is embedded into the proposed network, the model attains the best performance in terms of both counting results and density map quality.

4.5.3. Function of Loss

(1) Different Combinations of Loss Functions. In the experiments on loss function combinations, we compare four settings built on MSE Loss: MSE Loss alone, SSIM + MSE Loss, MAE + MSE Loss, and SSIM + MAE + MSE Loss. The experiments are carried out on ShanghaiTech A and B, and the results are shown in Table 9. It can be seen that MSE Loss, the most commonly used loss, still plays the major role in network convergence, but the combination with SSIM Loss and MAE Loss proposed in this paper also promotes network convergence to a certain extent.

(2) Setting of Hyperparameters α and β. We briefly compare different parameter settings of the presented loss function on ShanghaiTech A, as shown in Table 10. It indicates that when α and β are both set to 0.001, the MAE and MSE reach their lowest values.

5. Conclusion

In this paper, we propose MGANet for accurate and effective crowd counting. To obtain multiscale information and prevent the loss of context information, we propose a top-down method that concatenates deep and shallow features. To make the network pay more attention to the spatial and channel information of the crowd in the image while ignoring irrelevant information, we design a combination of spatial attention and channel attention that operates at the pixel level and further guides the features. To obtain a high-quality density map, we combine SSIM Loss and MAE Loss with the commonly used MSE Loss into a loss function that effectively promotes network convergence. Extensive experiments demonstrate the effectiveness and robustness of our method, and the ablation experiments confirm the effectiveness of each module.

Data Availability

All data can be found at https://github.com/lpfworld/MGANet. Other raw data supporting the conclusions of this article can be obtained from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Zhejiang Provincial Technical Plan Project (nos. 2020C03105 and 2021C01129).