Abstract

To address missed detections of small targets, occluded targets, and crowded targets in mask-wearing detection, we propose an end-to-end mask-wearing detection model based on a bidirectional feature fusion network. Firstly, to improve the feature extraction ability of the model, we introduce a modified EfficientNet as the backbone network. Secondly, in the prediction network, we introduce depthwise separable convolution to reduce the number of model parameters. Lastly, to improve performance on small and occluded targets, we propose a bidirectional feature fusion network and introduce a spatial pyramid pooling network. We evaluate the proposed method on a real-world data set, where it achieves a mean average precision of 87.45%. Moreover, the proposed method outperforms the comparison approaches in most cases.

1. Introduction

Recently, the global spread of COVID-19 has severely damaged human health and the economy. Globally, as of 4:52 pm CET, 11 January 2022, there had been 308,458,509 confirmed cases of COVID-19, including 5,492,595 deaths, reported to the WHO [1]. The global economy was projected to contract sharply by 3% in 2020, much worse than during the 2008 global financial crisis [2]. Wearing a mask can effectively curb the spread of the epidemic. Therefore, it is very important to study a model that can efficiently solve the mask-wearing detection task. Mask-wearing detection methods mainly fall into two categories. The first is image classification methods, which are applied to simple scenarios. For example, Loey et al. [3] proposed a hybrid model for mask detection that uses ResNet-50 as the feature extraction network and SVM, decision trees, and ensemble learning as classifiers. However, the real-time performance of this model is too poor for practical use. Qin et al. [4] proposed the SRCNet classification network, which crops the targets obtained after face detection and feeds them into SRCNet for classification; obviously, this is not an end-to-end model. The second is object detection methods, which are applied to complex scenarios. For example, Farady et al. [5] used RetinaNet [6] as a detector for mask-wearing detection. However, the data set used in their experiments contains few images and a single scene, so the generalization ability of the model is poor. Singh et al. [7] used YOLOv3 [8] and Faster R-CNN [9] as detectors for mask-wearing detection, respectively. Wang et al. [10] proposed a hybrid transfer learning and broad learning system for detecting mask wearing. Therefore, for complex scenarios, improving and training a general object detection model may achieve good results. In addition, to improve the training effect of the model, many researchers use image inpainting algorithms [11-13] to improve the quality of the data set.

However, it is very difficult to detect face regions and determine whether masks are worn in complex scenarios. These scenarios involve the following three factors:

(i) Small targets: a small target occupies a small area in the original image, and after several rounds of downsampling in the backbone network much of its information is lost, which leads to poor detector performance on small targets.

(ii) Occluded targets: detection is difficult because part of the feature information disappears once the object is occluded.

(iii) Crowded targets: crowded scenes combine the above two situations, which makes detection even harder.

To address these challenges, we propose a mask-wearing detection model based on the YOLOv3 model to detect mask wearing in crowds. The main contributions of this paper can be summarized as follows:

(i) We propose an end-to-end mask-wearing detection model that employs a modified EfficientNet to improve the ability to extract image features.

(ii) We evaluate the proposed model on a real-world data set. The results demonstrate that our model outperforms the comparison methods.

The remaining sections of this article are organized as follows. Section 2 gives a brief overview of related work on object detection. Section 3 gives a brief overview of the YOLOv3 model. Section 4 illustrates the details of the EfficientNet-MW model. Section 5 presents the experiments evaluating the effectiveness of the EfficientNet-MW model on a real data set, and Section 6 concludes the article.

2. Related Work

As a fundamental problem in computer vision, object detection aims to find the locations and categories of the objects in an image. It can effectively help autonomous and intelligent systems, such as machines, perceive and understand the world better by automatically analyzing digital signals from cameras [14]. With the explosive development of deep learning, object detection methods based on deep learning have gradually become mainstream. In this section, we discuss object detection methods from two aspects: anchor-based methods and anchor-free methods.

2.1. Anchor-Based Methods

Anchor-based methods first lay a large number of anchor boxes evenly over the image, then predict the category of each anchor box, refine the coordinates of these anchor boxes once or twice, and finally output the optimized anchor boxes as the detection results. They can be roughly divided into two-stage methods and one-stage methods. Two-stage methods are based on candidate regions; representatives include Faster R-CNN [9], Mask R-CNN [15], and Cascade R-CNN [16]. In particular, Faster R-CNN is a classic two-stage method that introduces a region proposal network (RPN) on the basis of R-CNN [17] and Fast R-CNN [18]. Firstly, it finds a certain number of regions of interest through the RPN. Secondly, it predicts a class and a box regression refinement for each proposal using convolutional neural networks. Thus, it can predict the bounding box and class score of the objects for each region proposal at the same time. Two-stage methods are characterized by high detection precision but slow speed. One-stage methods are based on regression and are represented by YOLOv3 [8], SSD [19], M2Det [20], RefineDet [21], RetinaNet [6], and RSANet [22]. They directly predict the bounding boxes and their corresponding classes with a single feed-forward network, so the detection speed is significantly improved; however, the detection precision is not as good as that of two-stage methods. The advantages of anchor-based methods are that the training process is stable and the recall rate is high. Their disadvantages are that the sizes of the anchor boxes are difficult to design and that removing redundant bounding boxes during detection is time-consuming.

2.2. Anchor-Free Approaches

Anchor-free methods, as the name suggests, find objects directly without preset anchor boxes. They can be roughly divided into center-based methods and keypoint-based methods. Center-based methods predict, for each position of the feature map, the probability that it is the center point of an object, and then predict the offset of the center point and the distances to the four sides of the bounding box. Representative methods are FSAF [23], FCOS [24], GA-RPN [25], CenterNet (objects as points) [26], and FoveaBox [27]. CenterNet is a well-known center-based method: it predicts the center points of the objects and then directly regresses the width and height of the corresponding bounding boxes. Keypoint-based methods predict several predefined or self-learned keypoints and then generate the bounding boxes of the objects. Representative methods are CornerNet [28], CenterNet (keypoint triplets for object detection) [29], RepPoints [30], Grid R-CNN [31], and ExtremeNet [32]. CornerNet is a classic keypoint-based method that predicts the upper-left and lower-right corners of the objects and then performs corner matching to obtain the bounding boxes. The advantage of anchor-free methods is that they do not need to preset a large number of anchor boxes, which reduces the computation of the model and speeds up detection. However, when the center points of targets coincide, semantic ambiguity easily arises, and the detection results of such models tend to be less stable.

3. A Brief Overview of YOLOv3 Model

YOLOv3 is an excellent object detector proposed by Redmon et al. [8]. Its backbone network is DarkNet-53, and its prediction network is a feature pyramid network. The overall structure of YOLOv3 is shown in Figure 1. Firstly, the YOLOv3 model divides the input image into S × S grids. Secondly, DarkNet-53 is used as the backbone network to extract multiple levels of image features. Finally, the extracted image features are fed into the FPN for feature fusion to perform multiscale prediction. The model contains 76 convolutional layers and 3 prediction branches, and the convolution kernel sizes are 1 × 1 or 3 × 3.

3.1. The Backbone of YOLOv3

The network structure of DarkNet-53 is shown in Figure 2. DarkNet-53 is inspired by ResNet [33]: it stacks many residual structures that control gradient propagation well and avoid situations unfavorable to training, such as gradient vanishing or explosion, which greatly reduces the difficulty of training deep networks. The main part of the network is composed of five residual blocks, and each residual block contains multiple residual units. Each residual unit has two convolutions whose role is to change the number of channels. The unit then adds the input x to the output f(x), and the final step is the ReLU [34] activation function.
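The residual unit described above can be sketched roughly as follows. This is a minimal illustration based only on this section's description (channel halving in the 1 × 1 convolution and the use of ReLU are assumptions), not the reference DarkNet-53 implementation.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Sketch of a DarkNet-style residual unit: a 1x1 convolution that
    halves the channels, a 3x3 convolution that restores them, and a
    skip connection that adds the input x to the output f(x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.ReLU(inplace=True),
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.conv1(x))
        return self.act(out + x)  # f(x) + x, then the activation
```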

3.2. The Feature Fusion Networks of YOLOv3

To overcome the poor small-target prediction of YOLOv2 [35] and to further improve the precision of the model, YOLOv3 made further improvements on the basis of YOLO [36] and YOLOv2 and introduced multiscale prediction based on the feature pyramid network (FPN [37]). As shown in Figure 3, small targets are detected on the shallow feature maps and large targets on the deep feature maps. The DarkNet-53 network performs feature extraction on the image. When the deepest feature is extracted, its output is upsampled and merged with the features of shallower layers. In this way, each layer of feature maps contains deep high-level features (such as object semantic information). Therefore, the new network structure improves detection precision, especially for small targets.

4. The EfficientNet-MW Model

The details of the EfficientNet-MW model are shown in Figure 4. It consists of two parts: a backbone feature extraction network and a prediction layer network. Firstly, we employ a modified EfficientNet as the backbone network to extract the feature information of the image. Secondly, we introduce depthwise separable convolution into the prediction layer network to reduce the number of model parameters. Finally, we add a bottom-up fusion path to the FPN to fully merge the features of each layer and introduce a spatial pyramid pooling network at the end of the backbone network to increase the receptive field.
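As a rough illustration of why depthwise separable convolution reduces the parameter count, the following PyTorch sketch contrasts it with a standard convolution. The channel sizes are arbitrary examples, not values taken from the EfficientNet-MW prediction layers.

```python
import torch.nn as nn

def standard_conv(in_ch, out_ch):
    # Standard 3x3 convolution: in_ch * out_ch * 3 * 3 weights.
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)

def depthwise_separable_conv(in_ch, out_ch):
    # Depthwise 3x3 (one filter per input channel) followed by a
    # pointwise 1x1 convolution: in_ch * 3 * 3 + in_ch * out_ch weights.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
    )

# Example with illustrative channel sizes (256 -> 512):
std = sum(p.numel() for p in standard_conv(256, 512).parameters())              # 1,179,648
dws = sum(p.numel() for p in depthwise_separable_conv(256, 512).parameters())   #   133,376
print(std, dws)
```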

4.1. The New Feature Fusion Network and the Spatial Pyramid Pooling Network

In the feature extraction process, the features extracted by the shallow and deep layers of the network are different: shallow features carry more object location information, while deep features carry more semantic information. The feature fusion network merges the two types of feature information to achieve feature enhancement while avoiding the large information loss caused by using a single kind of feature. We therefore add a bottom-up fusion path on top of the FPN.

The new feature fusion network we propose is shown in Figure 5. After the backbone network, the feature maps a, b, and c are obtained. Firstly, the feature map a is upsampled and merged with the feature map b to obtain the feature map d. Secondly, the feature map d is upsampled and fused with the feature map c to obtain the feature map e. Thirdly, the feature map e is downsampled and fused with the feature map d to obtain the feature map f. Finally, the feature map f is downsampled and merged with the feature map a to obtain the final feature map. In general, the network first gradually merges the deep feature maps into the shallow feature maps, and then gradually merges the fused shallow feature maps back into the deep feature maps, so that the feature maps of each layer contain both shallow and deep features. The specific fusion operation is concatenation followed by a convolution that adjusts the number of channels. The new feature fusion network can improve the performance of the model on small and occluded targets. Furthermore, to enlarge the receptive field and separate significant context features, spatial pyramid pooling [38] is introduced at the end of the backbone network. The specific details are shown in the lower-left corner of Figure 4.
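The two building blocks used here can be sketched as follows. The channel counts, the SPP kernel sizes (5, 9, 13), and the choice of nearest-neighbor upsampling are our own assumptions for illustration; the exact settings in EfficientNet-MW are not specified in this text.

```python
import torch
import torch.nn as nn

class FuseStep(nn.Module):
    """Concatenate two feature maps of the same spatial size, then use a
    1x1 convolution to adjust the number of channels (the fusion
    operation described above)."""
    def __init__(self, in_ch1, in_ch2, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch1 + in_ch2, out_ch, kernel_size=1, bias=False)

    def forward(self, x1, x2):
        return self.conv(torch.cat([x1, x2], dim=1))

class SPP(nn.Module):
    """YOLO-style spatial pyramid pooling: max-pool the input with several
    kernel sizes (stride 1, same padding) and concatenate the results
    with the input."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

# Bidirectional fusion, in outline (upsample = nn.Upsample(scale_factor=2),
# downsample = a stride-2 convolution or pooling):
#   d = fuse(upsample(a), b); e = fuse(upsample(d), c)          # top-down path
#   f = fuse(downsample(e), d); final = fuse(downsample(f), a)  # bottom-up path
```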

4.2. Modified EfficientNet

EfficientNet [39] is a state-of-the-art network in the field of image classification. Most networks, such as ResNet and DenseNet [40], consider only one of the depth of the network, the width of the network, and the resolution of the input image when designing and optimizing the network structure. EfficientNet networks, in contrast, are designed by jointly considering these three factors. Compared with other networks, EfficientNet has fewer model parameters but a better ability to extract image feature information. We therefore introduce a modified EfficientNet as the backbone of the EfficientNet-MW model to improve its ability to extract image feature information.

Although there are many versions of the EfficientNet network, after analyzing the experimental results in Section 5, we introduce the modified EfficientNet-B6 network. As shown in Figure 6, it consists of 8 stages: stage 1 is shown in Figure 7, and stages 2 to 8 are stacked from MBConv (mobile inverted bottleneck convolution) blocks, also shown in Figure 7. The numbers of MBConv blocks in stages 2 to 8 are 3, 6, 6, 8, 8, 11, and 3, respectively. Figure 8(a) shows a typical MBConv structure, whose overall design combines the inverted residual structure [41] with the residual structure. A 1 × 1 convolution is used before the depthwise separable convolution to increase the dimension; a channel attention mechanism [42] is added after the depthwise separable convolution; and finally a 1 × 1 convolution reduces the dimension, with a long residual edge added around the block.
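The following sketch of an MBConv block follows the description above (1 × 1 expansion, depthwise convolution, channel attention, 1 × 1 projection, residual edge). The expansion ratio, squeeze ratio, and SiLU activation are assumptions for illustration, not values taken from the EfficientNet-B6 configuration.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel attention: global average pooling, two 1x1 convs, sigmoid gate."""
    def __init__(self, channels, reduced):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, reduced, 1), nn.SiLU(),
            nn.Conv2d(reduced, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

class MBConv(nn.Module):
    """Sketch of a mobile inverted bottleneck block: expand with a 1x1 conv,
    apply a depthwise conv, apply channel attention, project back with a
    1x1 conv, and add a residual edge when the shapes match."""
    def __init__(self, in_ch, out_ch, stride=1, expand=6, kernel_size=3):
        super().__init__()
        mid = in_ch * expand
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, mid, kernel_size, stride=stride,
                      padding=kernel_size // 2, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU(),
            SqueezeExcite(mid, max(1, in_ch // 4)),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```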

The output parameters of each stage are shown in Table 1. We connect the outputs of stage 4 (52 × 52 × 72), stage 6 (26 × 26 × 200), and stage 8 (13 × 13 × 576) to the three branches of the prediction network.

4.3. Selection of Anchor Box

In this article, the sizes of the anchor boxes are obtained by clustering the training set of our experiment with the k-means method used in YOLOv3, so they are well suited to the mask-wearing detection task. Choosing the number of clusters is not easy, so we conducted a series of experiments in Section 5 to choose an appropriate value. Considering the computational cost and detection speed of the network, the clustering result with K = 9 is finally adopted. The resulting anchor box sizes are (5, 9), (9, 16), (17, 28), (29, 44), (45, 72), (72, 101), (90, 171), (134, 144), and (183, 232). We choose 416 × 416 as the model input size, and the model has three output sizes: 13 × 13, 26 × 26, and 52 × 52. The 13 × 13 feature map is used to detect large targets, with anchor boxes of sizes (90, 171), (134, 144), and (183, 232). The anchor boxes of sizes (29, 44), (45, 72), and (72, 101) are allocated to the 26 × 26 feature map. The anchor boxes of sizes (5, 9), (9, 16), and (17, 28) are assigned to the 52 × 52 feature map, which is used to detect small targets.
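For reference, a minimal sketch of YOLO-style k-means anchor clustering with a 1 - IOU distance is given below. It assumes the ground-truth boxes have already been collected as (width, height) pairs; this is our own simplified implementation, not the exact script used in the experiments.

```python
import numpy as np

def wh_iou(boxes, clusters):
    """IOU between (width, height) pairs, assuming all boxes share a common
    top-left corner (the usual simplification for anchor clustering)."""
    w = np.minimum(boxes[:, None, 0], clusters[None, :, 0])
    h = np.minimum(boxes[:, None, 1], clusters[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
        + (clusters[:, 0] * clusters[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=300, seed=0):
    """Cluster box sizes with distance = 1 - IOU and return k anchors,
    sorted by area."""
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)]
    assignment = np.full(len(boxes), -1)
    for _ in range(iters):
        new_assignment = np.argmax(wh_iou(boxes, clusters), axis=1)
        if np.all(new_assignment == assignment):
            break
        assignment = new_assignment
        for j in range(k):
            members = boxes[assignment == j]
            if len(members) > 0:
                clusters[j] = np.median(members, axis=0)
    return clusters[np.argsort(clusters[:, 0] * clusters[:, 1])]

# Usage: anchors = kmeans_anchors(np.array(all_box_sizes, dtype=float), k=9)
```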

4.4. Pseudocode

Training:
(1) Generate the data set annotation file and the class names file. Row format: image_file_path box1, box2, ..., boxN. Box format: x_min, y_min, x_max, y_max, class_id.
(2) Load the weights that have been pretrained on the VOC data set.
(3) Freeze the backbone of the model and train for some epochs until a plateau is reached.
(4) Unfreeze all the layers and train all the weights while continuously reducing the learning rate until a plateau is again reached.
(5) End.
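As an illustration of the annotation row format in step (1), a small parsing sketch is shown below. It assumes boxes are separated by whitespace and box fields by commas, which is one plausible reading of the format above; the file name in the comment is a made-up example.

```python
# Hypothetical annotation row following the format above:
#   img/0001.jpg 48,240,195,371,0 8,12,352,498,1
def parse_row(row):
    """Split one annotation row into the image path and a list of boxes
    (x_min, y_min, x_max, y_max, class_id)."""
    parts = row.strip().split()
    image_path = parts[0]
    boxes = [tuple(int(v) for v in box.split(",")) for box in parts[1:]]
    return image_path, boxes
```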

Testing on a picture or video using OpenCV:
(1) Capture the video through the video test file.
(2) Pass each frame of the video, or a picture, through the model.
(3) Output the boxes, classes, and scores for the image and draw the boxes on the picture according to the output.
(4) By testing each frame of the video, we obtain dynamic results.
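A minimal OpenCV loop matching the testing steps above is sketched here. The `detect` callable stands in for the trained EfficientNet-MW detector and its post-processed inference call, which are not shown in this text; its return format is an assumption.

```python
import cv2

def run_on_video(video_path, detect):
    """Read a video frame by frame, run the detector on each frame, and
    draw the returned boxes; detect(frame) is assumed to return a list
    of (x_min, y_min, x_max, y_max, class_name, score) tuples."""
    cap = cv2.VideoCapture(video_path)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        for (x1, y1, x2, y2, cls, score) in detect(frame):
            # Blue for a face with mask, red for a bare face (BGR order).
            color = (255, 0, 0) if cls == "face_mask" else (0, 0, 255)
            cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
            cv2.putText(frame, f"{cls} {score:.2f}", (x1, max(0, y1 - 5)),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1)
        cv2.imshow("detections", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()
```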

5. Experimental Results and Analysis

In this section, we conduct experiments to evaluate the performance of the EfficientNet-MW model on the mask-wearing detection task using a real-world data set. Firstly, we introduce the experimental setup. Secondly, we conduct an ablation study to verify the effectiveness of the proposed components. Thirdly, we compare the performance of our method with classical methods such as CenterNet, RetinaNet, and SSD. We then report hyperparameter adjustment experiments. Finally, we compare the detection results of EfficientNet-MW and YOLOv3 on the test set.

5.1. Experimental Setup
5.1.1. Data Sets Description

For the mask-wearing detection task studied in this article, there is currently no well-annotated public data set. MAFA [43] is a data set about facial occlusion, mainly collected by researchers from the Chinese Academy of Sciences and Beihang University. The WIDER FACE data set [44] is a benchmark data set for face detection created at the Chinese University of Hong Kong. Neither data set is well suited to our research task on its own. Therefore, we select images that fit our task from the two public data sets, 10,000 images in total. The final experimental data set contains complicating factors such as small targets, occluded targets, and crowded targets. To better fit the mask-wearing detection task, we relabel these images with open-source labeling software [45]. We randomly select 60% of the data as the training set [46], 10% as the validation set, and the remaining 30% as the test set. There is no overlap among the training, validation, and test sets. The numbers of targets in the data sets are shown in Table 2. In this article, all experiments are based on the data set divided as above. "Face" refers to a face target in the image, and "Face_mask" refers to a face-with-mask target in the image.

5.1.2. Experimental Platform

The software environment is as follows: Windows 10 64-bit operating system, the PyTorch deep learning framework, CUDA 10.2, cuDNN 7.4, and Python 3.7. The hardware environment is as follows: an Intel Core i3-9100F CPU @ 3.60 GHz, 8 GB of memory, and an NVIDIA GeForce GTX 1660 GPU with 6 GB of memory. For fairness, all experiments in this study are carried out in the above environment.

5.1.3. Evaluation Indexes

To evaluate the effectiveness of the model, we must choose evaluation indexes that can accurately assess its performance. In the detection process, the images include positive samples (objects we are interested in) and negative samples (background). We use the intersection over union (IOU) criterion to decide whether a detection result is positive or negative. The equation is as follows:

$$\mathrm{IOU} = \frac{\operatorname{area}(B_{p} \cap B_{gt})}{\operatorname{area}(B_{p} \cup B_{gt})},$$

where $B_{p}$ is the predicted bounding box and $B_{gt}$ is the ground-truth bounding box.

When the IOU of a bounding box is greater than 0.5, we define it as positive; otherwise, it is negative. AP (average precision) is the area under the precision-recall curve, and mAP is the average of the APs over all categories. This index is one of the most important evaluation metrics in the field of object detection. It can be expressed by the following equations:

$$\mathrm{AP} = \int_{0}^{1} P(R)\, dR, \qquad \mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_{i},$$

where P and R denote precision and recall, respectively, and N is the number of object categories we are interested in.
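A small sketch of the IOU computation for two axis-aligned boxes in (x_min, y_min, x_max, y_max) form, matching the criterion above:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection with iou(pred, ground_truth) > 0.5 counts as a positive.
```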

5.1.4. The Details of Model Training

To reduce the training time of the model, we adopt a transfer learning strategy. The specific training process is shown in Figure 9. Firstly, we pretrain EfficientNet-MW on the VOC data set; the resulting weights are used as the initial weights for the next training stage so that the model already has some ability to extract basic object features, which speeds up convergence. Secondly, we freeze the backbone feature extraction network, set the batch size to 8 and the learning rate to 0.001, and train only the prediction layers of the model for about 30,000 iterations. Finally, we unfreeze the backbone feature extraction network, reduce the learning rate to 0.0001 and the batch size to 2, and train the whole network until convergence. The optimizer for the entire training process is Adam [45].
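The freeze/unfreeze schedule can be sketched roughly as follows. The `model.backbone` attribute, the `train_one_epoch` helper, the data loader, and the epoch counts are placeholders introduced for illustration (the paper specifies about 30,000 iterations for the frozen phase, and the batch sizes are properties of the loaders); only the learning rates and optimizer come from the text above.

```python
import torch

def train_with_transfer(model, train_loader, train_one_epoch,
                        frozen_epochs=30, finetune_epochs=70):
    """Two-phase transfer-learning schedule: train the prediction layers
    with a frozen backbone, then fine-tune the whole network."""
    def set_backbone_trainable(trainable):
        for p in model.backbone.parameters():
            p.requires_grad = trainable

    # Phase 1: freeze the backbone, train only the prediction layers
    # (batch size 8 and learning rate 1e-3 in the paper).
    set_backbone_trainable(False)
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-3)
    for _ in range(frozen_epochs):
        train_one_epoch(model, train_loader, optimizer)

    # Phase 2: unfreeze everything and fine-tune the whole network
    # (batch size 2 and learning rate 1e-4 in the paper) until convergence.
    set_backbone_trainable(True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(finetune_epochs):
        train_one_epoch(model, train_loader, optimizer)
```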

5.2. Ablation Study

As shown in Table 3, we can observe the effectiveness of the different components of EfficientNet-MW. A refers to replacing the bidirectional feature fusion network with the FPN network. B refers to removing the SPP module from the model. C refers to changing the depthwise separable convolution in the prediction layer network to standard convolution. D refers to replacing the EfficientNet backbone with DarkNet-53. The table shows that when the bidirectional feature fusion network is replaced by FPN, the mAP of the model decreases by 1.29% and the size of the model hardly changes. After removing the SPP module, the mAP of the model decreases by 0.4% and the size of the model hardly changes. Moreover, after the depthwise separable convolution in the prediction layers is changed to standard convolution, the size of the model increases by 41% (from 168 MB to 237 MB). Lastly, after replacing the backbone network with DarkNet-53, the mAP of the model decreases by 1.66% and the size of the model hardly changes.

5.3. Result Analysis of Different Models

We conducted extensive comparative experiments, including both anchor-based and anchor-free methods, and tested the trained models on the test set. The results are shown in Table 4. APface is the average precision for the face class, and APface_mask is the average precision for the face-with-mask class. Generally speaking, although the detection speed of our proposed model is slightly slower than that of some other models, its overall performance is the best. The Faster R-CNN model, whose backbone network is ResNet50, achieves APface and APface_mask of 79.67% and 95.44%, respectively. However, its detection speed is about 1.550 s per image. Although the average precision of Faster R-CNN is relatively high, its detection speed is too slow to meet the real-time requirements of the mask-wearing detection task.

The RetinaNet model, whose backbone network is ResNet50, has a detection speed of about 0.115 s per image and an mAP of 85.76%. Although its detection speed is fast, its average precision is not very good. The SSD model, whose backbone network is VGG16, has the fastest detection speed at about 0.097 s per image, but its mAP is the lowest at only 83.28%, making it unsuitable for the task. The mAP of the YOLOv3 model is 83.74%, and its detection speed is 0.109 s per image. In this study, CenterNet is an anchor-free method whose backbone is ResNet50; its mAP and detection speed are both better than those of the YOLOv3 model, at 85.30% and 0.105 s, respectively. The mAP of EfficientNet-MW, whose backbone network is EfficientNet-B6, is 87.45%, and its detection speed is about 0.116 s per image. Considering the overall performance of the models, the EfficientNet-MW model is superior to the other models.

5.4. Influence of Different Version Backbone

We choose EfficientNet as the backbone network of the model, but the EfficientNet family has multiple versions, and each version has a different preferred input image resolution. The resolution of the images in our data set is not uniform; we simply resize them to 416 × 416 before they enter the network. Therefore, we cannot directly choose the version that best fits our data set. We selected eight versions of the network for the experiment: B0, B1, B2, B3, B4, B5, B6, and B7. From B0 to B7, the model structure becomes more and more complex, so in theory, the model should perform better and better. As shown in Table 5, the mAP values from B0 to B7 are 81.59%, 82.08%, 82.40%, 82.71%, 83.04%, 83.00%, 87.45%, and 83.14%, respectively. The general trend is that mAP increases up to B6, which achieves the best result, and then drops for B7. This is why we choose EfficientNet-B6 as the backbone network of the EfficientNet-MW model.

5.5. Influence of the Number of Anchor Boxes

The EfficientNet-MW model contains three prediction layers, which divide the input image into 13 × 13, 26 × 26, and 52 × 52 grid cells, respectively. We believe that the number of anchor boxes per cell may influence the final performance of the model. Table 6 reveals that the number of anchor boxes per cell has a considerable impact on detection performance. As the number of anchor boxes per cell increases, the mAP first rises, then drops, and finally rises slightly again. In the range from 1 to 6, 3 anchor boxes per cell achieve the best result (87.45% mAP). This is why we choose 9 as the number of clusters when applying the k-means algorithm to the training set. In addition, we observe that as the number of anchor boxes per cell increases, the detection speed of the model becomes slower and slower because the number of parameters in the detector grows.

5.6. Performance Comparison between YOLOv3 and EfficientNet-MW

The P-R curves of the YOLOv3 and EfficientNet-MW models on the test set are shown in Figures 10(a) and 10(b), respectively. The mAP of EfficientNet-MW is clearly higher than that of the YOLOv3 model. The APface and APface_mask of EfficientNet-MW are 79.33% and 95.56%, respectively, which are 5.57% and 1.85% higher than those of the YOLOv3 model. When the recall rate for "face_mask" reaches 97%, the precision of EfficientNet-MW is still 92%; as the recall rate continues to increase, the precision drops almost linearly.

The test results in various complicated environments are illustrated in Figures 11-13. A red box and a blue box indicate that the model recognizes a face and a face wearing a mask, respectively. As shown in Figure 11, the image contains three objects of interest, including two small objects. Our model correctly detects the two small objects, whereas the YOLOv3 model cannot detect them, which shows that our model performs better on small targets. As shown in Figure 12, the image belongs to the occluded-target situation: the three face targets are blocked by a book, a sticker, and a scarf, respectively. Our model correctly detects the occluded targets, while the YOLOv3 model incorrectly recognizes them as face-with-mask objects. The results show that our model is better than the YOLOv3 model at detecting occluded targets. Figure 13 shows "face" and "face with mask" targets of different sizes and belongs to the crowded-target scenario. There are 12 targets in the image that can be correctly identified by the human eye. The EfficientNet-MW model correctly recognizes 11 targets, whereas the YOLOv3 model recognizes only 8 targets and misclassifies one. This means that our proposed model performs better in crowded scenarios. In summary, the proposed model is superior to the YOLOv3 model in complex scenarios with crowded, occluded, and small targets.

6. Conclusion

In this article, we study the mask-wearing detection task, whose main challenges are accurately detecting small targets, occluded targets, and crowded targets. To address these problems, we propose an end-to-end mask-wearing detection model based on a bidirectional feature fusion network. Firstly, we employ a modified EfficientNet to extract the feature information of the image. Secondly, we design a new feature fusion network to fully merge features and introduce a spatial pyramid pooling network. Finally, we introduce depthwise separable convolution to reduce the number of model parameters. The evaluation results on a real-world data set show that our proposed model outperforms the comparison methods in small-target, occluded-target, and crowded-target scenarios. However, the real-time performance of the proposed model still has considerable room for improvement. In the future, we plan to study anchor-free methods and design a model that improves detection speed without reducing precision.

Data Availability

The data set of this paper comes from the MAFA data set and the WIDER FACE data set, and the selected data are relabeled according to the research content of this paper. Link to the data set of this paper: https://pan.baidu.com/s/1Ehk4xql2hRJupO_Z09mKnw (extraction code: x6we). MAFA data set link: http://www.escience.cn/people/geshiming/mafa.html. WIDER FACE data set link: http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/.

Conflicts of Interest

The authors have no conflicts of interest.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (Nos. 62067002, 62062033 and 61967006), the Natural Science Foundation of Jiangxi Province (No. 20212BAB202008), and the Education Department Project of Jiangxi Province (GJJ190317).