Abstract
Aiming at the problems of crowd distribution, scale features, and crowd feature extraction in exhibition centers, this paper proposes a crowd density estimation method using deep learning for passenger flow detection systems in exhibition centers. Firstly, on top of the pixel difference sign feature, the difference magnitude feature and the gray feature of the central pixel are extracted to form the CLBP feature, which provides richer descriptive information about crowd groups. Secondly, the LR activation function is used to add nonlinear factors to the convolutional neural network (CNN), and dense image blocks derived from crowd density estimation are used to train the LR-CNN crowd density estimation model. Finally, experimental results show that the mean absolute error (MAE) and mean square error (MSE) of the proposed method on the UCF_CC_50 dataset are 325.6 and 369.4, respectively. On part_A of the ShanghaiTech dataset, MAE and MSE are 213.5 and 247.1, respectively, and on part_B they are 85.3 and 99.7, respectively. The proposed method effectively improves the accuracy of crowd density estimation in exhibition centers.
1. Introduction
The exhibition service industry abroad has developed into a relatively mature industry, and the domestic exhibition service industry is also developing rapidly. At present, most exhibition service companies still focus on whether an exhibition can be held successfully and provide post-show analysis reports for exhibition organizers. However, there is a lack of research on real-time exhibition hall analysis services, especially in terms of passenger flow detection [1]. The exhibition service industry based on location services has gradually emerged, and various crowd density estimation solutions have appeared [2–4].
Crowd counting and density estimation have great practical significance [5–7] and can be extended to the following three applications:
(1) Public safety supervision: in real-world scenes with dense crowds, staff monitor the crowd's dynamic information in real time through electronic camera equipment, analyze potential safety hazards, and try to avoid them [8, 9].
(2) Intelligence collection and analysis: as far as China is concerned, travel and tourism have become the norm during annual holidays. Collecting and analyzing the crowd flow of major tourist venues in China benefits road traffic management and planning. At the same time, the overall tourism policy can be adjusted according to the travel preferences and interests of crowds in each time period observed in the past [10, 11].
(3) Virtual model construction: it provides a reliable mathematical model for the transformation between virtual reality and reality [12].
Research on crowd counting and density estimation can not only provide important guarantees for the safety of people's lives and property but also help maximize social and economic benefits. It has a wide range of application prospects and important practical significance [13–15]. Therefore, crowd counting and density estimation have gradually become a research hotspot in both academia and industry.
In early research, scholars used the Haar wavelet transform, shape features, histograms of oriented gradients, and texture features to manually extract detection features. Counting was completed by detecting head, body, or whole-body features in crowd images [16–18]. With the improvement of hardware technology and the advancement of deep learning, the performance of many computer vision tasks has improved greatly, and CNNs have played an important role in tasks such as object detection, image classification, and semantic segmentation. Therefore, CNNs came to be widely used in counting tasks, and the related performance improved substantially [19, 20]. Reference [21] designed a multitask framework based on CNN to simultaneously estimate the density level and the number of target crowds, using the former to provide additional information that assists the latter and improves the counting performance of the model. Reference [22] established a multicolumn CNN, using receptive fields of different sizes to obtain target features at different scales; the crowd density map was generated by fusion with a 1 × 1 convolution kernel. Reference [23] used the same network to process input images at different resolutions and generate crowd density maps, while outputting attention maps to supervise the generation of crowd scale predictions. However, this method needed to reason about multiple images of different scales at the same time, which greatly increases the computational cost of the network. Reference [24] introduced an attention mechanism to fuse detection-based and regression-based features, but this method did not perform well in high-density areas and could not achieve real-time prediction. In order to enhance the perception of crowd density areas, reference [25] established a series of attention modules and regression modules, using deformable convolution to build an attention module that detects crowd areas and improves the perception of density maps for crowds of different densities. Reference [26] proposed a self-supervised counting algorithm that exploits the rule that large image blocks always contain at least as many people as the small blocks inside them, establishing a self-supervised learning task on unlabeled data to improve counting performance. Reference [5] proposed an end-to-end crowd density estimation network that generates a high-quality crowd density map and obtains a high-quality map estimate. Reference [27] proposed a crowd counting method based on cross-adversarial loss and global features for high-density scenes of different scales; the cross-adversarial loss was used to generate a residual map, the uniformity problem of the fused density map was solved through consistency between different scales, and a wide range of contextual information was extracted while focusing on the key information in the global spatial features to generate the residual map. In reference [28], a multilevel neural network was constructed to estimate crowd density, and good results were achieved. Reference [29] proposed a multiscale context learning module called the multiscale context aggregation module, which first extracts information at different scales and then adaptively aggregates it to capture the full scale range of the crowd. However, most research is still focused on traditional shallow models. The fitting ability of shallow models is limited, and they perform well only in simple crowd scenes. When the background is more complex, crowd density estimation becomes more difficult, and the extraction of scale features and crowd features is insufficient.
Based on the above analysis, this paper proposes a crowd density estimation method using deep learning for passenger flow detection systems in exhibition centers, in order to solve the problems of crowd distribution, scale features, and crowd feature extraction in the exhibition center scene. Firstly, the difference magnitude feature and the gray feature of the center pixel are extracted and combined with the difference sign feature to form the CLBP feature, which yields more descriptive information about the crowd density. Then the LR activation function is used to add nonlinear factors to the CNN, and the dense blocks obtained by crowd density estimation are used to train the LR-CNN crowd density estimation model.
2. Proposed Model Framework
The primary problem of crowd behavior analysis is to detect areas where large crowds gather and perform the corresponding crowd behavior analysis in those areas. Building on traditional algorithms, this paper uses the complete local binary pattern (CLBP) to extract the characteristics of crowd aggregation. On this basis, a deep learning model is used to detect crowd gathering. CNN is applied to crowd group detection, and the CLBP feature is trained through operations such as convolution and pooling. After the fundamental features are extracted, the prediction result of crowd gathering is obtained. Compared with the prediction results given by human experts, five density levels are obtained: sparse, normal, low density, medium density, and high density. The steps of the crowd density estimation algorithm are shown in Figure 1.

3. Image Preprocessing and CLBP Feature Extraction
The local binary pattern (LBP) feature is one of the most commonly used texture feature descriptors. However, the LBP feature cannot accommodate crowd density detection at every level, and its real-time performance and accuracy in complex scenes are insufficient. Thus, this paper proposes a CLBP feature extraction method. Traditional LBP features use rectangular neighborhoods, which are not rotation invariant. In order to obtain rotation-invariant texture features, a circular neighborhood is adopted. The schematic diagram of the circular neighborhood is shown in Figure 2.

In the circular neighborhood, the neighborhood of the center pixel has a larger selection range. When a sampling point does not fall exactly on a pixel, bilinear interpolation is used to compute its value. For the same radius, the LBP value differs under rotation. In order to obtain the same LBP value, the smallest LBP value among all rotated results should be selected as the LBP value of the neighborhood; that is, a single result that is invariant to all rotations is chosen. Figure 3 shows the rotation results, where black represents "1" and white represents "0." The calculated LBP values are given in parentheses.

The introduction of the circular neighborhood makes the calculation more complicated, so the "uniform mode" calculation method is adopted. A uniform pattern contains at most two bitwise transitions (0 to 1 or 1 to 0), and the following formula calculates the rotation-invariant LBP of the circular neighborhood:

$$\mathrm{LBP}_{P,R}^{riu2}=\sum_{p=0}^{P-1} s\left(g_p-g_c\right),\quad U\left(\mathrm{LBP}_{P,R}\right)\leq 2,\tag{1}$$

where $g_c$ is the gray value of the center pixel, $g_p\ (p=0,\ldots,P-1)$ are the gray values of the $P$ sampling points on a circle of radius $R$, $s(x)=1$ if $x\geq 0$ and $s(x)=0$ otherwise, and $U(\cdot)$ counts the number of bitwise 0/1 transitions in the circular pattern.
Figure 4 shows the circular-neighborhood rotation-invariant LBP calculation for the uniform mode, where the number in the center of the neighborhood represents the uniform-mode LBP value, which equals the number of "1"s in the neighborhood. The LBP value of a nonuniform-mode neighborhood ($U>2$) is $P+1$, and the complete calculation formula is as follows:

$$\mathrm{LBP}_{P,R}^{riu2}=\begin{cases}\sum\limits_{p=0}^{P-1} s\left(g_p-g_c\right), & U\left(\mathrm{LBP}_{P,R}\right)\leq 2,\\ P+1, & \text{otherwise.}\end{cases}\tag{2}$$
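To make equations (1) and (2) concrete, the following minimal Python sketch (the helper name `riu2_lbp` is hypothetical, not from the paper) computes the rotation-invariant uniform LBP value of a single interior pixel, including the bilinear interpolation step described above:

```python
import numpy as np

def riu2_lbp(image, row, col, P=8, R=1.0):
    """Rotation-invariant uniform LBP (equation (2)) at one interior pixel.

    Samples P points on a circle of radius R around (row, col),
    using bilinear interpolation when a point falls between pixels.
    """
    gc = float(image[row, col])
    bits = []
    for p in range(P):
        # Sampling point on the circle around the center pixel.
        y = row - R * np.sin(2 * np.pi * p / P)
        x = col + R * np.cos(2 * np.pi * p / P)
        # Bilinear interpolation between the four surrounding pixels.
        y0, x0 = int(np.floor(y)), int(np.floor(x))
        dy, dx = y - y0, x - x0
        gp = ((1 - dy) * (1 - dx) * image[y0, x0]
              + (1 - dy) * dx * image[y0, x0 + 1]
              + dy * (1 - dx) * image[y0 + 1, x0]
              + dy * dx * image[y0 + 1, x0 + 1])
        bits.append(1 if gp - gc >= 0 else 0)  # s(g_p - g_c)
    # U: number of 0/1 transitions around the circular bit pattern.
    U = sum(bits[p] != bits[(p + 1) % P] for p in range(P))
    return sum(bits) if U <= 2 else P + 1
```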

Traditional LBP only extracts the sign of the difference between the neighborhood pixel values and the center pixel value, so the characteristics it can describe for a crowd are limited. In order to better express the local features of the crowd, this paper also extracts the difference magnitude feature and the gray feature of the center pixel on top of the pixel difference sign feature, forming the CLBP feature. This feature provides more descriptive information about the crowd group. The extraction process of CLBP is shown in Figure 5.

Matrix (a) gives the center pixel and its 8 neighboring pixels. First, the difference between each neighboring pixel and the center pixel is computed to obtain matrix (b). Then the sign of each difference is taken to obtain matrix (c). Finally, the absolute value of every difference in matrix (b) is taken to obtain the difference magnitudes in matrix (d). After this preprocessing is complete, the following steps are taken (see the sketch after this list):
(1) The sign matrix (c) is binarized to obtain a matrix composed of "0" and "1." Then formula (2) above is used to calculate the feature S describing the sign of the difference.
(2) The global average of the elements of the difference magnitude matrix (d), denoted as $\bar{m}$, is calculated. Each element in matrix (d) is compared with $\bar{m}$: if the difference is negative, it is recorded as "0," and if it is non-negative, it is recorded as "1," generating a binary matrix. Equation (2) is used again to calculate the feature M describing the magnitude of the difference.
(3) The global average gray value, denoted as $\bar{g}$, is calculated and used in the same way to binarize the center pixel; then equation (2) is used to calculate the feature C describing the grayscale of the center pixel.
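As an illustration of steps (1)–(3), the following sketch (assuming a 3 × 3 neighborhood and precomputed global averages $\bar{m}$ and $\bar{g}$; all names are hypothetical) produces the binarized S, M, and C components. In practice each binary pattern would then be mapped to its rotation-invariant uniform code via equation (2):

```python
import numpy as np

def clbp_features(patch, mean_magnitude, mean_gray):
    """CLBP components for a 3x3 patch (hypothetical helper, not from the paper).

    patch          : 3x3 array with the center pixel at patch[1, 1]
    mean_magnitude : global average of difference magnitudes (m-bar)
    mean_gray      : global average gray value (g-bar)
    Returns the binary codes S (sign), M (magnitude), and C (center).
    """
    center = float(patch[1, 1])
    # Neighbors visited in a fixed circular order around the center.
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    diffs = np.array([patch[i, j] - center for i, j in order], dtype=float)

    s_bits = (diffs >= 0).astype(int)                        # sign, from matrix (c)
    m_bits = (np.abs(diffs) >= mean_magnitude).astype(int)   # magnitude, from matrix (d)
    c_bit = int(center >= mean_gray)                         # center-pixel component

    to_code = lambda bits: int("".join(map(str, bits)), 2)
    return to_code(s_bits), to_code(m_bits), c_bit
```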
4. CNN Framework Construction of Crowd Density Estimation
4.1. Network Training Framework
After the stable CLBP feature is extracted from the original crowd video sequence, it is necessary to predict crowd groups with a classifier, find crowd groups that exceed the threshold range, and define them as large crowd gatherings. At this stage, traditional methods use shallow learning models for prediction and tracking, including common BP neural networks and SVM classifiers. Shallow models achieve good results when learning from and predicting on a small number of samples. However, when the crowd group detection scene is more complicated and contains occlusions and overlaps, the limited learning ability of shallow models gradually reduces the detection effect and robustness. In recent years, there have been relatively few studies applying deep learning to crowd group detection, but deep learning has made good progress in the fields of image processing and pattern recognition. Therefore, this paper uses a deep learning model to predict and track the CLBP feature to detect crowd gatherings.
CNN is a feedforward neural network inspired by the biological vision system. It simplifies the fully connected neural network: the neurons between adjacent layers are no longer all connected to one another. From a mathematical point of view, the weight matrix between two fully connected layers is overwhelmingly sparse. For example, in image processing, each pixel is only related to the local area around it. By reducing the number of neuron connections, the network can be simplified without affecting the characteristics of the image itself, reducing network complexity and computation time.
Given an input signal $x=(x_1,x_2,\ldots,x_n)$ and a filter $w=(w_1,w_2,\ldots,w_m)$, where the signal length $n$ is much greater than the filter length $m$, the output of the one-dimensional convolution is

$$y_t=\sum_{k=1}^{m} w_k\, x_{t-k+1}.\tag{3}$$
One-dimensional convolution is often used in signal processing. When the filter is $w_k=1/m$, the convolution is equivalent to a moving average of the signal sequence. Two-dimensional convolution is often used in image processing. Given an image $X\in\mathbb{R}^{M\times N}$ and a filter $W\in\mathbb{R}^{m\times n}$, generally with $m\ll M$ and $n\ll N$, the output of the convolution is

$$y_{ij}=\sum_{u=1}^{m}\sum_{v=1}^{n} w_{uv}\, x_{i-u+1,\,j-v+1}.\tag{4}$$
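A brief numerical check of equations (3) and (4), using NumPy and SciPy routines (a minimal sketch; the moving-average filter is the example named above):

```python
import numpy as np
from scipy.signal import convolve2d  # 2-D convolution; "valid" keeps full overlaps only

# 1-D convolution with a moving-average filter w_k = 1/m (equation (3)).
x = np.array([1.0, 3.0, 5.0, 7.0, 9.0, 11.0])
m = 3
w = np.full(m, 1.0 / m)
y = np.convolve(x, w, mode="valid")   # [3., 5., 7., 9.] -- the moving average

# 2-D convolution of an image with a small filter (equation (4)).
X = np.arange(25, dtype=float).reshape(5, 5)
W = np.ones((3, 3)) / 9.0             # 3x3 averaging filter
Y = convolve2d(X, W, mode="valid")    # output size (5-3+1) x (5-3+1) = 3 x 3
print(y, Y.shape)
```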
Figure 6(a) shows a fully connected layer of the network. If there are $n^{(l)}$ neurons in layer $l$ and $n^{(l-1)}$ neurons in layer $l-1$, there are $n^{(l)}\times n^{(l-1)}$ connected edges; that is, the weight matrix has $n^{(l)}\times n^{(l-1)}$ parameters. As the number of neurons increases, the parameters increase and the time complexity of the calculation grows, which greatly reduces training efficiency. As shown in Figure 6(b), the fully connected layer is replaced with a convolutional connection. Each neuron in layer $l$ is only connected to the neurons in a local window of layer $l-1$, forming a locally connected network. The input of neuron $i$ of layer $l$ is defined as

$$a_i^{(l)}=f\left(\sum_{k=1}^{m} w_k\, a_{i+k-1}^{(l-1)}+b\right),\tag{5}$$

where $w$ is an $m$-dimensional filter, $b$ is a bias, and $f(\cdot)$ is the activation function.

The above formula can be simplified to

$$a^{(l)}=f\left(w\otimes a^{(l-1)}+b\right),\tag{6}$$

where $\otimes$ denotes the convolution operation.
It can be seen from formula (6) that the filter $w$ is the same for all neurons. This reveals another extremely important property of CNN: weight sharing. That is, between two adjacent network layers, the same weight vector is reused at every position. Only $m+1$ parameters (the $m$-dimensional filter plus the bias) are needed to describe the mapping from layer $l-1$ to layer $l$, and the number of neurons in layer $l$ is determined as $n^{(l)}=n^{(l-1)}-m+1$.
When processing images, a computer cannot directly recognize the surface features of an image like a human brain; it can only accept and process data. Therefore, a digital image is converted into a two-dimensional matrix, and the position of each pixel is used to describe the entire image. This two-dimensional matrix is used as the input of the neural network, and two-dimensional convolution is required. Assume that $X^{(l)}$ and $X^{(l-1)}$ are the neuron activities of layers $l$ and $l-1$, respectively. Each element of $X^{(l)}$ is

$$x_{ij}^{(l)}=f\left(\sum_{u=1}^{m}\sum_{v=1}^{n} w_{uv}\, x_{i-u+1,\,j-v+1}^{(l-1)}+b\right).\tag{7}$$
After processing by one filter, one kind of image feature can be obtained. By increasing the number of filters, a number of different features can be obtained, thus enhancing the ability of the convolutional layer to represent images. The filter is essentially a feature extractor: due to weight sharing, each set of outputs uses the same filter, and the output of the image processed by a filter is one feature of the image. This process is also called feature mapping. Assume that the number of filters used in layer $l$ is $K^{(l)}$ and the size of each feature map is $M\times N$; the total number of neurons in layer $l$ is then $K^{(l)}\times M\times N$, and the number of feature map groups in layer $l$ is $K^{(l)}$. If it is assumed that the input of each feature map of layer $l$ is all the feature maps of layer $l-1$, then feature map $k$ of layer $l$ is

$$X^{(l,k)}=f\left(\sum_{p=1}^{K^{(l-1)}} W^{(l,k,p)}\otimes X^{(l-1,p)}+b^{(l,k)}\right),\tag{8}$$

where $W^{(l,k,p)}$ represents the filter from the $p$-th feature map of layer $l-1$ to the $k$-th feature map of layer $l$.
It can be seen from the above formula that all the feature maps of layer $l-1$ produce the input of the next layer $l$ through filter convolution and bias adjustment, and different filters yield different inputs. The connection relationship between feature maps can be defined as a connection table $T$. The number of features is adjusted by setting the number of "0"s in the connection table, ensuring that the desired features can be extracted while the computational complexity is reduced.
The convolutional layer is locally connected, which significantly reduces the number of connections compared with the fully connected layer, but the number of neurons does not change much. If the output is fed directly to a classifier, the input dimension of the classifier is still too high, overfitting will still occur, and the input image cannot be accurately classified. The pooling operation is therefore introduced to reduce the dimensionality of the features and avoid overfitting. The feature map obtained by the upper convolutional layer through its filters can be divided into several regions $R_k$, $k=1,2,\ldots,K$. To perform the pooling operation on these regions, a subsampling function is defined as

$$y_k=f\left(w\cdot \operatorname{down}\left(R_k\right)+b\right),\tag{9}$$

where $w$ and $b$ are trainable weight and bias parameters, respectively, and $\operatorname{down}(\cdot)$ is the subsampling function (e.g., the maximum or average over region $R_k$).
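A minimal sketch of the 2 × 2 max pooling used below in Pool2 and Pool4, assuming $w=1$, $b=0$, and an identity $f$ in equation (9):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling: each output value is the maximum gray value
    among the four pixels of a non-overlapping 2x2 region R_k."""
    h, w = feature_map.shape
    assert h % 2 == 0 and w % 2 == 0, "expects even spatial dimensions"
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fm))  # 2x2 output: [[ 5.,  7.], [13., 15.]]
```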
As shown in Figure 7, the LR-CNN model designed in this paper is as follows (a sketch of the architecture follows this list):
(1) Input layer: the input data is a 32 × 32 image block.
(2) Con1: the first convolutional layer, using 8 filters of size 5 × 5; the convolution operation yields 8 feature maps of size 28 × 28.
(3) Pool2: the first subsampling layer, using max pooling; one value is taken for every adjacent 2 × 2 pixel area, namely the largest gray value among the four pixels.
(4) Con3: the second convolutional layer, using 15 filters of size 5 × 5; after the convolution operation, 8 × 15 = 120 feature maps of size 10 × 10 are obtained.
(5) Pool4: the second subsampling layer, using the same subsampling as Pool2.
(6) Con5: the last convolutional layer, using 5 filters of size 5 × 5.
(7) Ful6: a fully connected layer that converts the 600 1 × 1 neurons in Con5 into a feature vector.
(8) Output layer: the feature vector obtained by the CNN is fed into the activation function to obtain the counting result. When training the model, the counting accuracy rate and the loss function layers are also added to estimate the accuracy of the count.
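The following PyTorch sketch is one plausible reading of layers (1)–(8); the grouped convolutions (15 filters per input map in Con3, 5 per map in Con5), the use of Leaky ReLU for the "LR" activation, and the single-value count output are assumptions, not details confirmed by the paper:

```python
import torch
import torch.nn as nn

class LRCNN(nn.Module):
    """Sketch of the LR-CNN layer stack described in Section 4.1."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5),                  # Con1: 32x32 -> 8 x 28x28
            nn.LeakyReLU(0.1),                               # assumed "LR" activation
            nn.MaxPool2d(2),                                 # Pool2: -> 8 x 14x14
            nn.Conv2d(8, 120, kernel_size=5, groups=8),      # Con3: 15 filters per map -> 120 x 10x10
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),                                 # Pool4: -> 120 x 5x5
            nn.Conv2d(120, 600, kernel_size=5, groups=120),  # Con5: 5 filters per map -> 600 x 1x1
            nn.LeakyReLU(0.1),
        )
        self.fc = nn.Linear(600, 1)                          # Ful6/output: count for the block

    def forward(self, x):
        x = self.features(x).flatten(1)                      # 600 1x1 neurons -> feature vector
        return self.fc(x)

# Usage: one grayscale 32x32 image block per sample.
y = LRCNN()(torch.randn(4, 1, 32, 32))                       # shape (4, 1)
```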

In the LR-CNN model proposed in this paper, the convolutional layer reduces the number of parameters by sharing the same filter. However, since each convolutional layer requires multiple filters to extract different features of the image, the total number of neurons increases significantly after the convolution operation, which generates an additional burden. A layer-by-layer increase in the number of neurons and parameters would eventually cause the algorithm to crash and the computer to stop working. Thus, the subsampling layer is necessary; it is an effective means of reducing the number of neurons and parameters, and subsampling layers must appear, intermittently or continuously, throughout the entire network. At the same time, subsampling inevitably discards some feature information. Therefore, the alternating arrangement of convolutional layers and subsampling layers is the best design obtained by weighing these factors.
4.2. Loss Function
This paper uses two loss functions to optimize the model: the Euclidean loss function and the cross-entropy loss function. Let $X_i$ denote the $i$-th training sample and $N$ denote the total number of training samples.
The Euclidean loss function is used for density estimation:

$$L_E=\frac{1}{2N}\sum_{i=1}^{N}\left\|F\left(X_i;\Theta\right)-Y_i\right\|_2^2,\tag{10}$$

where $F(X_i;\Theta)$ represents the estimated density map, $\Theta$ is the weight parameter of the counting model, and $Y_i$ represents the true density map.
The cross-entropy loss function is

$$L_C=-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log p_{i,c},\tag{11}$$

where $C$ is the number of density classes, $y_{i,c}$ is the ground-truth label indicating whether sample $X_i$ belongs to class $c$, and $p_{i,c}$ is the predicted probability of that class.
The total loss function is a linear combination of $L_E$ and $L_C$:

$$L=L_E+\lambda L_C,\tag{12}$$

where the parameter $\lambda$ is a scale factor used to control the proportion of the cross-entropy loss.
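A minimal PyTorch sketch of the combined loss in equations (10)–(12); the density-class head and the default value of λ are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def total_loss(est_density, gt_density, class_logits, gt_class, lam=0.1):
    """L = L_E + lambda * L_C (equation (12)).

    est_density / gt_density : (batch, H, W) estimated and true density maps
    class_logits             : (batch, C) predicted density-class scores
    gt_class                 : (batch,) ground-truth density-class indices
    lam                      : scale factor controlling the cross-entropy share
    """
    n = est_density.shape[0]
    # Euclidean loss, equation (10): squared L2 distance between density maps.
    l_e = ((est_density - gt_density) ** 2).sum() / (2 * n)
    # Cross-entropy loss, equation (11), averaged over the batch.
    l_c = F.cross_entropy(class_logits, gt_class)
    return l_e + lam * l_c
```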
5. Experiment and Analysis
5.1. Dataset
The experiment uses two commonly used datasets, namely the ShanghaiTech dataset and the UCF_CC_50 dataset.
The ShanghaiTech dataset consists of two parts, part_A_final and part_B_final. The images of part_A are crowd images randomly collected from the Internet, while the images of part_B were taken by cameras on the streets of Shanghai. Compared with part_A, the crowd distribution in part_B is sparse, but its scenes are relatively fixed, while the scenes of part_A change greatly. The part_A training set contains 300 images and its test set 182 images; the part_B training set contains 400 images and its test set 316 images, for a total of 1198 images with 330,165 annotated heads.
The images in the UCF_CC_50 dataset are all grayscale images downloaded from the Internet. They contain extremely dense crowds with small scale variations; in many images only head features are visible, and pedestrians are severely occluded. The sample size of this dataset is small, but the number of people varies greatly. In the experiment, 5-fold cross-validation was used to evaluate the performance of the different counting models: the images are randomly divided into 5 parts, with 4 parts used for training and 1 part for testing. Five sets of experiments are carried out, and the average value is taken as the final result.
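A sketch of the 5-fold split described above (the hypothetical `train_and_test` callback stands in for one training/evaluation run):

```python
import numpy as np

def five_fold_cv(image_paths, train_and_test, seed=0):
    """Randomly split the images into 5 parts; each part serves once as the
    test set while the remaining 4 parts are used for training. Returns the
    average of the five test scores, as described in the experiment."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(image_paths))
    folds = np.array_split(idx, 5)
    scores = []
    for k in range(5):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
        scores.append(train_and_test([image_paths[i] for i in train_idx],
                                     [image_paths[i] for i in test_idx]))
    return float(np.mean(scores))
```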
5.2. Evaluation Index
This paper uses MAE and MSE as the two indicators to evaluate the performance of the algorithm; they are the most commonly used standards for measuring counting performance. MAE and MSE are calculated as follows:

$$\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\left|z_i-\hat{z}_i\right|,\qquad \mathrm{MSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(z_i-\hat{z}_i\right)^2},\tag{13}$$

where $N$ represents the total number of test images, $z_i$ is the actual number of people in image $i$, and $\hat{z}_i$ is the number of people estimated by the algorithm.
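The two metrics in NumPy (a direct transcription of equation (13), using the rooted MSE form common in crowd counting):

```python
import numpy as np

def mae_mse(true_counts, est_counts):
    """MAE and MSE over the test images, following equation (13)."""
    z = np.asarray(true_counts, dtype=float)
    z_hat = np.asarray(est_counts, dtype=float)
    mae = np.abs(z - z_hat).mean()
    mse = np.sqrt(((z - z_hat) ** 2).mean())  # rooted form used in crowd counting
    return mae, mse

print(mae_mse([120, 340, 80], [110, 360, 75]))  # approx. (11.67, 13.23)
```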
5.3. Analysis and Comparison
Using the CLBP feature extraction algorithm and the CNN deep model for crowd density estimation and group detection, this paper carried out 2000 training iterations. In order to visualize the results, after the CNN network becomes stable, CLBP features are extracted from 200 verification samples and input into the trained CNN network. Figure 8 shows the comparison between the ground-truth values and the CNN predicted values.

After deep neural network training, the predicted degree of aggregation at each pixel position is obtained. In the actual prediction, the 200 × 200 prediction-result mask is examined in 10 × 10 windows, the average value within each window is calculated, and a threshold Th = 0.5 is set as the criterion (see the sketch below). When the predicted average value of a detection window reaches or exceeds the threshold, the window is regarded as an area where people gather. Through the inverse of the compression process, the position is projected back into the original RGB image, and the corresponding area is marked in the figure. This paper tests both CLBP + CNN and LBP + CNN. It can be seen from the results that the CNN correctly detects most crowd gatherings: the predicted values are almost identical to the ground-truth values. In actual use, this strong result ensures the robustness and accuracy of the method. However, the CNN network requires a lot of training time to obtain weights good enough to predict complex scenes.
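A sketch of this window-averaging and thresholding step, assuming a 200 × 200 aggregation map as described:

```python
import numpy as np

def gathering_windows(agg_map, win=10, th=0.5):
    """Average the 200x200 aggregation map over non-overlapping 10x10
    windows and flag each window whose mean reaches the threshold Th."""
    h, w = agg_map.shape
    means = agg_map.reshape(h // win, win, w // win, win).mean(axis=(1, 3))
    return means >= th  # boolean 20x20 grid of crowd-gathering windows

mask = gathering_windows(np.random.rand(200, 200))
print(mask.shape, mask.sum())  # (20, 20) and the number of flagged windows
```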
The CNN counting model was tested on the UCF_CC_50 and ShanghaiTech datasets, and the results are shown in Figure 9.

The method in this paper is compared with the methods of references [5], [27], and [29] on the UCF_CC_50 dataset and the ShanghaiTech dataset. The experimental results are shown in Tables 1 and 2. The MAE and MSE of the proposed method on the UCF_CC_50 dataset are 325.6 and 369.4, respectively. The MAE and MSE on part_A of the ShanghaiTech dataset are 213.5 and 247.1, respectively, and the MAE and MSE on part_B are 85.3 and 99.7, respectively. The experimental results show that the method proposed in this paper can solve the problem of counting dense crowds within the allowable error range. The comparison results show that the proposed method achieves better counting accuracy than the comparison methods in high crowd density scenarios. This is because the proposed model extracts the difference magnitude feature and the gray feature of the central pixel to form the CLBP feature, which captures more detailed information about the crowd density, whereas the comparison methods lack an equally effective feature extraction step, making their MAE and MSE much higher than those of the proposed method. Besides, using the dense blocks in the image as the training set instead of the entire image provides a feasible way to alleviate the counting problems caused by crowd congestion and scene distortion.
6. Conclusion
Aiming at the problems of crowd distribution, scale features, and crowd feature extraction in exhibition centers, this paper proposes a crowd density estimation method using deep learning for passenger flow detection systems in exhibition centers. The difference magnitude feature and the gray feature of the center pixel are extracted and combined with the difference sign feature to form the CLBP feature, which provides more descriptive information about the crowd density. The LR activation function is used to add nonlinear factors to the CNN, and the dense blocks obtained by crowd density estimation are used to train the LR-CNN crowd density estimation model. The experimental results show that the proposed method achieves the lowest MAE and MSE on the tested datasets, demonstrating that extracting the difference magnitude feature and the gray feature of the center pixel to form the CLBP feature yields more effective information.
However, deep learning has a complex network structure and requires a large amount of computation, which demands faster hardware support. In the future, GPUs can be introduced to increase processing speed, or the concept of parallel computing can be introduced into the CNN so that the execution of the algorithm is accelerated by splitting the workload.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the Yongchuan District Science and Technology Fund (no. Ycstc, 2019rb1202).