Abstract

Automatic identification and localization of farmland pests is an important direction of target detection research. The wide variety of pests and the similarity between pest categories give automatic identification of farmland pests a high error rate and make identification difficult. To better identify and locate farmland pests automatically, this paper proposes a lightweight pest detection model whose backbone is the EfficientNet network proposed by Google; based on the idea of the classical Yolo target detection algorithm, it achieves the detection of 26 kinds of pests. First, features are extracted through the lightweight backbone; then multiscale feature fusion is performed by PANet; finally, three feature matrices of different sizes are output to predict pests of different sizes. Using CIOU as the loss function of regression prediction better reflects the relative position of the prior box and the real box. The experimental results are compared with other lightweight algorithms and show that the accuracy of the proposed algorithm for identification and localization of agricultural pests is the highest, reaching 93.73%. Moreover, the model is lightweight and can be deployed on low-cost equipment, which reduces equipment cost while accurately predicting the status of pests and diseases in farmland. In practice, the algorithm effectively handles large numbers of pests, pest accumulation, and background interference, and has strong robustness.

1. Introduction

Wheat and corn are the main food crops in North China. Their growth often suffers from pests, which cause enormous economic losses to wheat and corn yields every year. There are many kinds of agricultural pests; they attack the growth of crops and often erupt into disasters, so there is a need for real-time and accurate monitoring of wheat and corn pests and for reasonable prevention and control measures to reduce economic losses. Traditional wheat and corn pest detection methods still require grassroots staff to enter the field, visually observe pest characteristics, and diagnose the pest status of the area. This approach involves a heavy workload and low efficiency. It cannot predict the occurrence of diseases and insect pests in real time or meet the needs of current pest monitoring; it reduces the accuracy of agricultural pest monitoring and is not conducive to large-scale, automated pest detection [1-3].

With the growth of computing resources, deep learning has developed rapidly, especially in the image field, which provides a technical basis for lightweight farmland pest detection [4]. Early pest identification used artificial neural networks, support vector machines, and other methods, relying mainly on the color, texture, morphology, and other characteristics of pests. Such methods place high requirements on the body-shape characteristics of pests in the dataset, can only handle a few categories, and produce very unstable identification results. This approach is essentially a classification problem: only one pest per picture can be handled, which does not match the actual environment [5]. With the growing maturity of deep learning technology [6, 7], a large number of excellent detection models have emerged in the field of target detection, such as SSD, Fast-RCNN, and Yolo [8-10]. These detection models extract features through convolutional neural networks and achieve target positioning based on anchors. Recently, some anchor-free target detection algorithms have emerged, such as Centernet [11]; however, for small objects such as pests, the performance of algorithms without a priori frames is not ideal. Target detection models have been widely used in pedestrian detection, vehicle detection, face detection, driverless vehicles, and other fields, and they are also applicable to the detection of pests. For example, Wei Yang and others realized automatic identification of pests using the two-stage Faster-RCNN detection model [12]. Yuan and others realized automatic recognition and counting of 8 types of insects using the yolov3 model, with a recognition rate of 70.98% [13]; the accuracy of pest recognition still needs to be improved.

Most existing recognition methods use network pictures as training datasets. Although they achieve a good recognition rate, pictures collected from the network can only show one side of the pests [14], leaving a big gap with practical applications: the robustness of the model is not high, and deployment requires devices with high computing resources, which cannot meet current practical needs [15]. This paper proposes a lightweight detection model to solve the problem that existing target detection models require a large amount of computing resources. The model is deployed on low-cost devices, mainly to monitor pests in farmland, so as to achieve large-scale, automated pest monitoring. The detection model mainly follows the idea of the Yolo algorithm. Because pests come in different sizes, the model outputs three feature matrices of different sizes, sets three anchors of different sizes for each feature matrix, and regresses to predict pests of different sizes. The detection model is deployed on a local device, which reduces the waste of computing resources and greatly reduces cost.

2.1. Image Processing

Deep learning is end-to-end in nature, so the construction of datasets is the basis of deep learning. Public pest datasets are scarce and differ greatly from the actual situation. The pest dataset used in this paper was obtained by a telemetry lamp device in Shandong Province. The device uses light to attract pests and kills them in a heating chamber; the pests then fall onto an insect board and are photographed by a high-definition camera. A total of 10,000 pictures of pests were collected, from which 6,144 useful pictures were manually screened out, covering 26 kinds of pests. Some of the samples are shown in Figure 1, and the category and number distribution of pests are shown in Table 1 below. The labelImg tool was used for manual labeling to generate the VOC2007 format [16].

The input of the pest detection model in this paper is a picture of fixed size (416 × 416), while the sizes of the pictures in the dataset vary; so, the data need to be adjusted to a uniform size before being input to the network. Directly resizing distorts a picture and may lose its original characteristics; so, padding is added to the picture to prevent distortion of the dataset. A good target detection model requires massive data for training to avoid overfitting and to enhance robustness, and the size of this dataset is clearly insufficient. A total of seven methods were therefore used to further augment the dataset, namely, rotation, horizontal translation, vertical translation, perspective transformation, scaling, horizontal inversion, and brightness enhancement, thereby enhancing the generalization ability of the model; the dataset was expanded to more than 20,000 images [17].
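This padded resizing ("letterboxing") can be sketched as follows; a minimal example assuming the 416 × 416 network input used in this paper (the function name and gray fill color are illustrative, not taken from the authors' code):

```python
from PIL import Image

def letterbox(image: Image.Image, target: int = 416, fill=(128, 128, 128)):
    """Resize an image to target x target without distortion by scaling
    the long side and padding the remainder with gray bars."""
    w, h = image.size
    scale = min(target / w, target / h)           # keep the aspect ratio
    nw, nh = int(w * scale), int(h * scale)
    resized = image.resize((nw, nh), Image.BICUBIC)
    canvas = Image.new("RGB", (target, target), fill)
    # paste centered so the padding is split evenly on both sides
    canvas.paste(resized, ((target - nw) // 2, (target - nh) // 2))
    return canvas

# usage: img = letterbox(Image.open("pest.jpg"))
```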

To further improve the robustness of the model and enhance its generalization ability, the mosaic data enhancement method is used when loading the dataset. Mosaic data enhancement follows the idea of Cutmix [18]: Cutmix splices two images, whereas mosaic uses four images, enriching the background of objects and increasing the diversity of the data. When computing in the BN (batch normalization) layer, the larger the batch size, the closer the batch statistics are to the mean and variance of the whole dataset, and the better the effect. Due to the limitation of GPU memory, it is impossible to train many pictures at one time; putting four pictures together and inputting them into the network increases the effective batch size in disguise. Mosaic-enhanced images are shown in Figure 2.
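A minimal sketch of the four-image mosaic idea follows. The crop ratios are assumptions (the paper does not specify them), and the remapping of bounding-box labels onto the new canvas, which a real pipeline also needs, is omitted for brevity:

```python
import random
from PIL import Image

def mosaic(images, size: int = 416) -> Image.Image:
    """Stitch four images into one size x size canvas around a random
    split point, enriching backgrounds and, in effect, multiplying the
    per-sample batch statistics seen by the BN layers."""
    assert len(images) == 4
    cx = int(random.uniform(0.3, 0.7) * size)    # random split point
    cy = int(random.uniform(0.3, 0.7) * size)
    canvas = Image.new("RGB", (size, size))
    regions = [(0, 0, cx, cy), (cx, 0, size - cx, cy),
               (0, cy, cx, size - cy), (cx, cy, size - cx, size - cy)]
    for img, (x, y, w, h) in zip(images, regions):
        canvas.paste(img.resize((w, h)), (x, y))  # one image per quadrant
    return canvas
```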

2.2. Target Detection Algorithm

The object detection algorithm in deep learning consists of three parts: backbone, neck, and head. The backbone is mainly used for feature extraction to generate feature maps; examples include VggNet, ResNet, and Densenet [19-21]. The function of the neck is to fuse feature maps of different scales for further feature extraction; examples include FPN, PAN, and BiFPN [22-24]. Finally, the head performs classification and regression prediction to complete target recognition and positioning. Heads are mainly divided into two types. One is anchor-based, such as SSD, Yolo, and Retinanet: anchor boxes are set in advance at feature points of the feature map, and the target is located by adjusting the size and position of the anchor boxes. Anchor-based heads have two main problems: the preset anchor box sizes are fixed, and positive and negative samples are imbalanced. The other type is anchor-free, such as Cornernet [25] and Centernet, which model the target as a point, namely, the center point of the target BBox; key points are used to find the center point, and the other target attributes are regressed. In experimental studies, the accuracy of anchor-based methods is higher than that of anchor-free ones; so, this paper proposes a lightweight target detection algorithm based on Yolo's idea and improves the two shortcomings of anchor-based methods.

2.3. Model Structure

The network architecture draws lessons from the Yolo model structure and uses the one-stage method to build the model. The Yolo series detection models have matured over three generations of iteration; they have the advantages of high calculation speed and high accuracy and are widely used. The network architecture constructed by the algorithm in this paper according to this idea is shown in Figure 3. The darknet-53 backbone of yolov3 is replaced by the improved Efficientnet [26], and the FPN in the neck is replaced by PAN to improve feature fusion.

The backbone in this algorithm is based on the Efficientnet-B2 [26] network, which was proposed by Google in 2019 for image classification. The input of the Efficientnet-B2 network is 260 × 260. In order to better extract the characteristics of the image, the input size is changed to 416 × 416, and the SPP network structure is added in the last block so as to further sample the feature map. The Efficientnet-B2 network is mainly composed of seven kinds of MBConv blocks; the structure is shown in Figure 4. An MBConv block uses an ordinary 1 × 1 convolution for dimension raising, followed by BN and the swish activation function; it then uses a depthwise separable convolution for down sampling, passes through an SE module, uses a 1 × 1 convolution for dimension reduction, and normalizes through a BN layer. Finally, the input characteristic matrix is added to the main-channel characteristic matrix through the shortcut branch to complete the output. The shortcut operation is carried out only when the dimension of the characteristic matrix input to the MBConv structure is the same as that of the output characteristic matrix. In the first dimension-raising convolution layer, the number of convolution kernels is n times the number of channels of the input characteristic matrix, where n is the expansion factor of the block.
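A condensed PyTorch sketch of one MBConv block as described above; this is an illustrative reimplementation following the EfficientNet paper (expansion factor 6 and SE squeeze ratio 1/4 are the standard defaults), not the authors' code:

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    def __init__(self, c_in, c_out, expand=6, k=3, stride=1):
        super().__init__()
        c_mid = c_in * expand
        self.use_shortcut = stride == 1 and c_in == c_out
        self.expand = nn.Sequential(              # 1x1 conv raises the dimension
            nn.Conv2d(c_in, c_mid, 1, bias=False),
            nn.BatchNorm2d(c_mid), nn.SiLU())     # SiLU == swish with beta = 1
        self.dwconv = nn.Sequential(              # depthwise conv for down sampling
            nn.Conv2d(c_mid, c_mid, k, stride, k // 2, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid), nn.SiLU())
        self.se = nn.Sequential(                  # squeeze-and-excitation (Figure 7)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c_mid, c_in // 4, 1), nn.SiLU(),
            nn.Conv2d(c_in // 4, c_mid, 1), nn.Sigmoid())
        self.project = nn.Sequential(             # 1x1 conv reduces the dimension
            nn.Conv2d(c_mid, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out))                # no activation after projection

    def forward(self, x):
        y = self.dwconv(self.expand(x))
        y = self.project(y * self.se(y))
        return x + y if self.use_shortcut else y  # shortcut only when shapes match
```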

The changed network parameters are shown in Table 2: the input size is changed and the SPP structure is added, which enhances the generalization ability of the algorithm and gives the extracted features a broader receptive field.

Depthwise separable convolution [27] is a deformation of traditional convolution. It differs from traditional convolution in that each convolution kernel has a single channel and the number of kernels equals the number of channels of the input characteristic matrix; a subsequent 1 × 1 pointwise convolution then produces the output channels. The structure of ordinary convolution is shown in Figure 5 and that of depthwise separable convolution in Figure 6. Assuming that DF × DF is the size of the input feature matrix, M is the number of input channels, DK is the size of the convolution kernel, and N is the number of output feature matrices, the computational comparison of depthwise separable convolution with ordinary convolution is shown by equation (1): (DK · DK · M · DF · DF + M · N · DF · DF)/(DK · DK · M · N · DF · DF) = 1/N + 1/DK² (1).

Assuming that the size of the convolution kernel is 3 × 3, the formula equals 1/N + 1/9; so, the computation of ordinary convolution is theoretically 8 to 9 times that of depthwise separable convolution, which is visibly much greater.
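The cost comparison in equation (1) can be checked numerically; a small sketch, with the channel counts and feature size chosen arbitrarily for illustration:

```python
import torch.nn as nn

M, N, DK, DF = 32, 64, 3, 52          # channels in/out, kernel size, feature size

ordinary = DK * DK * M * N * DF * DF                 # standard conv multiply-adds
separable = DK * DK * M * DF * DF + M * N * DF * DF  # depthwise + pointwise

print(ordinary / separable)           # ~= 1 / (1/N + 1/9), close to 8-9 for 3x3

# the equivalent PyTorch layers:
depthwise = nn.Conv2d(M, M, DK, padding=DK // 2, groups=M)  # one filter per channel
pointwise = nn.Conv2d(M, N, 1)                              # 1x1 conv mixes channels
```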

The structure of the SE module is shown in Figure 7. The feature map undergoes global average pooling, which transforms the feature matrix to size 1 × 1 × C, and then passes through two fully connected layers. The first fully connected layer uses the swish activation function, and the number of channels becomes 1/4 of the original. The second uses the sigmoid activation function, and its number of nodes equals the number of channels of the output characteristic matrix of the depthwise separable convolution layer; the final output is obtained by multiplying the result with the input feature map. The SE module is similar to the self-attention [28] mechanism and emphasizes the features of interest through the output of the sigmoid function.
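Written out with explicit fully connected layers, as the paragraph describes; an illustrative sketch (EfficientNet itself implements the same computation with 1 × 1 convolutions, as in the MBConv sketch above):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // 4)  # squeeze to 1/4 of channels
        self.fc2 = nn.Linear(channels // 4, channels)  # restore the channel count

    def forward(self, x):                      # x: (B, C, H, W)
        s = x.mean(dim=(2, 3))                 # global average pool -> (B, C)
        s = torch.nn.functional.silu(self.fc1(s))  # swish activation
        s = torch.sigmoid(self.fc2(s))         # per-channel attention weights
        return x * s[:, :, None, None]         # reweight the input feature map
```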

An SPP (spatial pyramid pooling) structure is added to the main feature network; that is, three maximum pooling operations are conducted after the last MBConv block. Because the stride is 1 and padding is added, the size of the feature matrix does not change. The three pooled feature matrices and the input feature matrix are concatenated to obtain a feature matrix with 4 times the depth. The structure is shown in Figure 8.
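A sketch of the SPP block; the paper does not state the pooling kernel sizes, so the 5/9/13 values below, common in Yolo implementations, are an assumption:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, kernels=(5, 9, 13)):     # assumed kernel sizes
        super().__init__()
        # stride 1 with padding k//2 keeps the spatial size unchanged
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels)

    def forward(self, x):
        # concatenate the input with three pooled copies -> 4x the channels
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)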

In the feature fusion stage, the idea of the PANet (path aggregation network) structure is used for multiscale feature fusion. Three feature matrices of different sizes are collected from the backbone feature network, as shown in Figure 3: MBConv5, MBConv7, and MBConv8. First, a 1 × 1 convolution and the SPP operation are performed on MBConv8, followed by up sampling and stacking with MBConv7. The resulting feature layer continues with up sampling and fusion stacking with MBConv5. Through two doublings of length and width, the up-sampling path is completed, and a feature layer with high semantics is obtained. Convolution and down-sampling operations are then carried out, respectively, along the bottom-up path to preserve the localization information of the target, which is more conducive to detecting objects of different sizes, as in the sketch below.
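The two-way fusion can be sketched schematically; the channel counts, 3 × 3 stride-2 downsampling convolutions, and helper function here are placeholders, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

def conv(c_in, c_out, k=1, s=1):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

def panet_fuse(p5, p4, p3, c5=352, c4=120, c3=48):
    """p5/p4/p3: feature maps from MBConv8/7/5 (deep to shallow).
    Top-down path: upsample deep features and stack with shallow ones;
    bottom-up path: downsample and stack again to keep localization cues."""
    up = nn.Upsample(scale_factor=2)
    t5 = conv(c5, 256)(p5)                                  # after 1x1 conv + SPP
    t4 = conv(c4 + 256, 128)(torch.cat([p4, up(t5)], 1))    # top-down fusion
    t3 = conv(c3 + 128, 128)(torch.cat([p3, up(t4)], 1))
    d4 = conv(256, 128)(torch.cat([t4, conv(128, 128, 3, 2)(t3)], 1))  # bottom-up
    d5 = conv(384, 256)(torch.cat([t5, conv(128, 128, 3, 2)(d4)], 1))
    return t3, d4, d5                                       # 52x52, 26x26, 13x13
```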

In the neck, after the fusion stacking is completed, the three characteristic matrices each pass through a final 1 × 1 convolution and output three characteristic matrices of different sizes, namely, (13, 13, 93), (26, 26, 93), and (52, 52, 93). The first two dimensions represent the size of the feature layer and are used to detect objects of different sizes, while 93 indicates that each feature point has three a priori boxes. Each a priori box contains five parameters, namely, the center point coordinates, width, height, and confidence, plus the classification probabilities of the 26 pests detected this time; so, this dimension is 3 × (5 + 26) = 93.

The swish activation function is used in the MBConv block. Swish is an improvement over sigmoid and Relu, similar to a combination of the two, and contains a parameter β, which can be set as a constant or a trainable parameter. It has the characteristics of no upper bound but a lower bound, smoothness, and nonmonotonicity, as shown in formulae (2) and (3): swish(x) = x · sigmoid(βx) (2), where sigmoid(x) = 1/(1 + e^(−x)) (3).

The Swish activation function not only has the advantages of the Relu and Sigmoid functions but also outperforms Relu in deep models. It can be seen as a smooth interpolation between the linear function and the Relu function.
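A minimal sketch of swish with an optionally trainable β; PyTorch's built-in nn.SiLU is the β = 1 special case used elsewhere in the sketches above:

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    def __init__(self, trainable_beta: bool = False):
        super().__init__()
        if trainable_beta:
            self.beta = nn.Parameter(torch.ones(1))       # learnable beta
        else:
            self.register_buffer("beta", torch.ones(1))   # fixed beta = 1

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)  # swish(x) = x * sigmoid(beta * x)
```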

2.4. Loss Function

The loss function in target detection can be roughly divided into three parts: confidence loss, classification loss, and location loss. In the confidence loss function, IOU (intersection over union) is used to judge the relative position between the prediction frame and the real frame, but IOU alone cannot capture the overlapping area, center distance, and aspect ratio between them. Therefore, this paper uses CIOU (complete IOU) [29] instead of IOU and adds a penalty term, so that the regression loss function tends to converge and the divergence of the loss function during training is reduced. The loss function of the model is shown in formula (4): L = λ1 · Lconf + λ2 · Lcls + λ3 · Lloc (4).

Among them, λ1, λ2, and λ3 are equilibrium coefficients; the weight of each loss term is adjusted by setting the size of each λ. The specific formulas of confidence loss, classification loss, and positioning loss are shown in formulas (5), (6), and (7).

In formula (5), the target value represents the CIOU between the prediction frame and the real frame, c is the predicted value, the prediction confidence is obtained by passing c through the sigmoid function, and N is the number of positive and negative samples. In formula (6), an indicator denotes whether there is a target of a given class in the prediction frame, p is the predicted value, the target probability is obtained by passing p through the sigmoid function, and Npos is the number of positive samples. In formula (7), (x, y, w, h) are the center point coordinates and the length and width of the prediction frame, and the corresponding quantities with superscript gt are the information of the real frame. The formula of CIOU is shown in (8): CIOU = IOU − ρ²(b, b^gt)/c² − αv (8),

where b represents the center point of the prediction box, b^gt represents the center point of the real box, ρ(·) represents the Euclidean distance, and c represents the diagonal length of the minimum bounding box enclosing the prediction box and the real box; the penalty factors α and v are added to it, as shown in (9) and (10): v = (4/π²)(arctan(w^gt/h^gt) − arctan(w/h))² (9), α = v/((1 − IOU) + v) (10).

w^gt and h^gt represent the width and height of the real box, and w and h are the width and height of the prediction box. This penalty mainly makes the aspect ratio of the prediction box approach that of the real box as quickly as possible. Finally, we obtain the regression loss function shown in formula (11): LCIOU = 1 − CIOU = 1 − IOU + ρ²(b, b^gt)/c² + αv (11).
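A sketch of the CIOU computation in formulas (8)-(11) for boxes given as (cx, cy, w, h); an illustrative reimplementation following the CIOU paper [29], not the authors' code:

```python
import math
import torch

def ciou(box1, box2, eps=1e-7):
    """box1, box2: tensors (..., 4) as (cx, cy, w, h). Returns CIOU;
    the regression loss of formula (11) is then 1 - ciou(...)."""
    # convert centers/sizes to corners
    b1x1, b1y1 = box1[..., 0] - box1[..., 2] / 2, box1[..., 1] - box1[..., 3] / 2
    b1x2, b1y2 = box1[..., 0] + box1[..., 2] / 2, box1[..., 1] + box1[..., 3] / 2
    b2x1, b2y1 = box2[..., 0] - box2[..., 2] / 2, box2[..., 1] - box2[..., 3] / 2
    b2x2, b2y2 = box2[..., 0] + box2[..., 2] / 2, box2[..., 1] + box2[..., 3] / 2
    # intersection over union
    inter = (torch.min(b1x2, b2x2) - torch.max(b1x1, b2x1)).clamp(0) * \
            (torch.min(b1y2, b2y2) - torch.max(b1y1, b2y1)).clamp(0)
    union = box1[..., 2] * box1[..., 3] + box2[..., 2] * box2[..., 3] - inter
    iou = inter / (union + eps)
    # squared center distance over squared diagonal of the enclosing box
    rho2 = (box1[..., 0] - box2[..., 0]) ** 2 + (box1[..., 1] - box2[..., 1]) ** 2
    c2 = (torch.max(b1x2, b2x2) - torch.min(b1x1, b2x1)) ** 2 + \
         (torch.max(b1y2, b2y2) - torch.min(b1y1, b2y1)) ** 2
    # aspect-ratio penalty v and trade-off factor alpha, formulas (9) and (10)
    v = (4 / math.pi ** 2) * (torch.atan(box2[..., 2] / (box2[..., 3] + eps))
                              - torch.atan(box1[..., 2] / (box1[..., 3] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / (c2 + eps) - alpha * v
```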

The present algorithm uses the anchor-based approach to predict the target position. The main idea is to set anchors of different sizes at each feature point in the feature matrix, which brings the problem of positive and negative sample imbalance during training and thereby reduces model accuracy. For example, an image may produce tens of thousands of candidate boxes, but only a small part of them contains the target; candidate boxes containing the target are positive samples, and those without it are negative samples. The focal loss [30] function solves the problem of positive and negative sample imbalance and also controls the weights of easily classified and hard-to-classify samples. The focal loss function is given by formulas (12), (13), and (14): FL(pt) = −αt(1 − pt)^γ · log(pt) (12), where pt equals p for positive samples and 1 − p otherwise (13), and αt equals α for positive samples and 1 − α otherwise (14).

Among them, (1 − pt)^γ is called the adjustment coefficient. When pt tends to 0, the adjustment coefficient tends to 1 and the contribution to the loss increases; when pt tends to 1, the adjustment coefficient tends to 0, which is equivalent to a small contribution to the total loss. When γ = 0, focal loss reduces to the traditional cross-entropy loss function; the strength of the adjustment coefficient is realized by adjusting γ.
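A sketch of focal loss for binary confidence prediction; the α = 0.25 and γ = 2 defaults come from the focal loss paper [30], not from this paper:

```python
import torch

def focal_loss(pred_logits, target, alpha=0.25, gamma=2.0):
    """pred_logits: raw scores; target: 1 for positive anchors, 0 for negative."""
    p = torch.sigmoid(pred_logits)
    pt = torch.where(target == 1, p, 1 - p)                    # formula (13)
    at = torch.where(target == 1, torch.full_like(p, alpha),
                     torch.full_like(p, 1 - alpha))            # formula (14)
    ce = -torch.log(pt.clamp(min=1e-7))                        # cross-entropy term
    return (at * (1 - pt) ** gamma * ce).mean()  # (1-pt)^gamma damps easy samples
```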

3. Results and Discussion

3.1. Experimental Environment

This experiment uses Pytorch as the deep learning framework to accelerate model training. The hardware configuration is an R5-3600 processor, 16 GB of memory, and an Nvidia RTX3070 graphics card; the software environment is the Windows 10 system, Python 3.7, Pytorch 1.9, CUDA 11.0, and cuDNN 8.0.1.

3.2. Evaluation Criteria

Average precision (AP) is a mainstream measure of target detection models. The AP of each target category is calculated as the area under its precision-recall (P-R) curve, and the mean AP (mAP) over all categories evaluates the whole model. Precision and recall are defined by formulas (15) and (16): precision = TP/(TP + FP) (15), recall = TP/(TP + FN) (16).

TP (true positives) represents samples correctly classified as the target class, FP (false positives) represents negative samples wrongly classified as positive, and FN (false negatives) represents positive samples wrongly classified as negative. mAP has become a recognized evaluation method for target detection and is widely used.
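AP can then be computed from ranked detections; a minimal sketch using all-point interpolation (simplified: a full evaluation also matches each detection to a ground-truth box by IOU before marking it TP or FP):

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """scores: detection confidences; is_tp: 1 if the detection matched a
    ground-truth box, else 0; num_gt: total number of ground-truth boxes."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # rank by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1 - tp)
    recall = cum_tp / num_gt                      # formula (16)
    precision = cum_tp / (cum_tp + cum_fp)        # formula (15)
    # area under the P-R curve with precision made monotonically decreasing
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return np.trapz(precision, recall)
```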

3.3. Results and Analysis

In the process of regression prediction, nine candidate frames are set according to the three characteristic matrices of different sizes. Because training datasets differ, the appropriate candidate frame sizes also differ. In this paper, the k-means clustering algorithm is used to find appropriate prior frame sizes on the training set. The k-means algorithm used here differs from the standard one: it measures the distance between candidate frames through CIOU. The final nine candidate frames are (29, 35), (55, 70), (72, 117), (87, 83), (94, 149), (115, 110), (124, 186), (145, 146), and (182, 220), which fit 81% of the frames of the dataset. During training, the transfer learning method is adopted: the preloaded model is trained on the VOC dataset, the backbone is partially frozen for 50 iterations, and training then continues for another 50 iterations after thawing. Figure 9 shows the decline curve of the loss function over 200 training iterations. The learning rate is reduced by the cosine annealing method through the cosine function: as training progresses, the cosine value first decreases slowly, then declines rapidly, and then decreases slowly again, making it easier for the model to find the optimum. The label smoothing method is also added, which mainly penalizes the classification so that the model does not classify too confidently, preventing overfitting.
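A sketch of IOU-based k-means anchor clustering; for brevity it uses the classic 1 − IOU distance from the Yolo literature, whereas the paper reports using CIOU as the distance, so the distance function here is a simplifying assumption:

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100):
    """wh: (N, 2) array of labeled box widths/heights; returns k anchors."""
    wh = np.asarray(wh, dtype=float)

    def iou(boxes, anchors):              # width/height IOU, boxes at a shared origin
        inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
                np.minimum(boxes[:, None, 1], anchors[None, :, 1])
        union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
                anchors[:, 0] * anchors[:, 1] - inter
        return inter / union

    anchors = wh[np.random.choice(len(wh), k, replace=False)]  # random init
    for _ in range(iters):
        assign = np.argmax(iou(wh, anchors), axis=1)  # nearest = highest IOU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = np.median(wh[assign == j], axis=0)  # robust centroid
    return anchors[np.argsort(anchors.prod(axis=1))]  # sort anchors by area
```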

We divide the dataset into training, validation, and test sets at a 7 : 1 : 2 ratio and then calculate the AP value of each pest species on the test set. Table 3 shows that the mAP value of the model reaches 93.73%, which proves that the detection model has good performance. The accuracy for Scarite pests reaches 100%, while the AP of Agrotis ipsilon is only 62.86%. From the table, this may be caused by the uneven distribution of pest species, or because the characteristics of some pests differ little and the number of samples is too small for the model to extract sufficient features. To further demonstrate the robustness of the model, we tested the detection of the algorithm in real-world situations. Figure 10 shows the detection results in different scenarios. In Figure 10(a), there is little background interference, the number of pests is moderate, and there is no stacking; the algorithm accurately predicts the type and location of the pests. In Figure 10(b), although the number of pests has decreased, the background is seriously disturbed and the pests are flipped; the model still finds the location and classification of the pests accurately. In Figure 10(c), stacks of pests appear and the number and species increase; a few pests are identified as two species, which may be due to the low IOU threshold, but most pests are accurately identified. In Figure 10(d), there are a large number of pests, some of them stacked; compared with the actual number, despite some missed detections, the model still detects the locations of the pests and their corresponding species. These complex cases show that the target detection algorithm has excellent robustness, can cope with a variety of complex environments, and is widely applicable.

3.4. Comparison of Several Models

To evaluate the performance of the algorithm, this paper compares it with yolov3, which adopts Darknet-53 as the backbone feature network, with the original EfficientNet-B2, and with variants of this algorithm using different lightweight backbones: the Google MobileNet series [31, 32], the even lighter ShuffleNet [33, 34], and Huawei's GhostNet [35], all of which are lightweight classification networks. The same training set is used in all experiments, and evaluation is performed on the same test set. Finally, each model is deployed on an industrial tablet with a J1900 processor and 4 GB of memory to test inference speed; the results are shown in Table 4. The yolov3 model has high accuracy, but its parameter count reaches 60 million, its inference is the most time-consuming, and its high performance requirements make it inconvenient for practical deployment. Among the lightweight backbones, ShuffleNet has the fewest parameters (the model is only 36 MB) and the fastest detection speed, but also the lowest accuracy. Compared with the MobileNet series and ShuffleNet, GhostNet shows the best comprehensive performance, with a volume of only 42 MB, accuracy up to 82%, and a detection speed of roughly two pictures per second, giving good timeliness. However, although these models are small, their accuracy does not meet the commercial standard. The algorithm in this paper, compared with the original EfficientNet-B2, reaches an accuracy of 93%, an improvement of almost 5 percentage points, while its volume is only about a quarter larger than GhostNet's; so, this algorithm is both lightweight and highly accurate.

4. Conclusion

In order to achieve automatic recognition and classification of farmland pests, a lightweight target detection model is proposed. Based on the idea of yolov3, the EfficientNet-B2 classification network is used as the main feature extraction network and improved. PANet is added to the neck, and CIOU is used as the loss function of target detection to highlight the relative position between the prediction box and the real box; this avoids the problem that low-confidence prediction boxes are filtered out due to overlap between target boxes. The focal loss function is used to solve the imbalance between positive and negative samples during training. To increase the diversity of the dataset, the mosaic data enhancement method is used, which improves the robustness of the model. Experiments show that the mAP of this algorithm reaches 93.73%, giving good recognition ability, including in complex environments. Compared with other algorithms, this algorithm not only recognizes many classes but also has high accuracy and wide applicability.

Data Availability

The experimental data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding this work.

Acknowledgments

This work is supported by the Major Agricultural Application Technology Innovation Project of Shandong Province. The project number is SD2019ZZ007.