Abstract
Computer vision has been integrated into people's daily lives, but mainstream target detection algorithms deployed on embedded devices with limited hardware resources struggle to meet task requirements in terms of real-time performance and accuracy. We therefore propose YOLO-DFD, a lightweight detection algorithm based on an improved YOLOv4, to address the problem of dog feces in our living environment. The main improvement strategies are as follows: the YOLOv4 backbone network is replaced with MobileNetV3, and the convolutions in the feature enhancement network are replaced by depthwise separable convolutions to further reduce the number of parameters. To enhance detection accuracy, we introduce the convolutional block attention module (CBAM) into the neck network, and the complete intersection over union (CIoU) loss of YOLOv4 is replaced with the SCYLLA intersection over union (SIoU) loss to reduce the false detection rate. The dataset used in this paper consists of pictures of dog feces taken in daily life, and we expanded this self-made dataset with data augmentation techniques. Training results show that the average precision (AP) reaches 98.66%. While maintaining detection performance, YOLO-DFD reduces the number of parameters by 82% and increases FPS by 14 compared with YOLOv4. YOLO-DFD also has fewer parameters and lower computational cost than other algorithms, making it easier to deploy on embedded devices for cleaning up dog feces.
1. Introduction
Nowadays, more and more people keep pet dogs, but as a result dog feces can be seen in parks, streets, and neighborhoods. Bishop and DeBess [1] identified parasites with zoonotic potential in dog feces. Yang et al. [2] discovered antibiotic resistance genes (ARGs) in dog feces that may cause ARG pollution. Timely detection and disposal of dog feces is therefore essential. With the rapid development of machine vision technology, it has become possible to apply detection algorithms to cleaning robots controlled by embedded platforms. Intelligent robots that use target detection algorithms to carry out automatic cleaning tasks have become an important field in smart cities.
Traditional target detection models adopt hand-designed feature extractors, which are slow to detect and perform poorly across domains, so it is difficult for them to meet industry demands. The emergence of AlexNet [3] was a turning point in target detection, and target detection algorithms have since developed from two-stage to single-stage. The former are target detection algorithms based on candidate regions; a detection task involves two processes: candidates are generated and then classified using convolutional neural networks. R-CNN [4], Faster R-CNN [5], and Mask R-CNN [6] are representative target detection algorithms based on candidate regions. These methods usually have high detection accuracy, but their disadvantages are long training time and slow detection speed. In contrast, single-stage detection methods do not need to generate candidate boxes in the target detection process; they transform target border localization into a regression problem. Typical networks include SSD [7–9] and the YOLO series [10–13]. This kind of algorithm has good real-time performance, but its detection precision is lower than that of two-stage algorithms. Among the current mainstream target detection algorithms, YOLOv4 applies a series of improvement strategies to YOLOv3 to improve accuracy and speed, but its large computational cost makes it difficult to deploy on embedded devices with limited hardware resources. YOLOv5 has four versions of different scales, and the smallest, YOLOv5s, can be deployed on embedded devices, but its accuracy on small targets is insufficient for the task of cleaning up dog feces. Therefore, many scholars have improved mainstream target detection algorithms to complete specific tasks. [14] proposed a waste classification model that adds a self-monitoring module to the residual network model; it improves the representation ability of the feature map and increases classification accuracy to 95.87%. [15] introduced the multispatial attention mechanism (MSAM) into YOLOv5 to improve the accuracy of small object detection, [16, 17] incorporated the squeeze-and-excitation (SE) [18] module into the YOLOv4 backbone network, and [19, 20] introduced the efficient channel attention (ECA) [21] module into YOLOv4-tiny. In [22], the backbone network of a modified RetinaNet was replaced from ResNet50 to MobileNetV1, which effectively improved real-time performance. [23] applied the focal loss in YOLOv5 to increase accuracy in detecting tiny targets. In [24], Cai et al. used PAN++ and improved YOLOv4 with five scale detection layers, which effectively improved the detection of small objects. In [25], DenseNet was used to improve the YOLOv4 network to realize cherry fruit detection in digital agriculture. Yu et al. replaced the CSPDarknet of YOLOv4 with a CSResNeXt module to increase detection speed and accuracy [26].
To apply the detection model to devices with limited hardware resources, this article combines the MobileNetV3 [27] network with the following improvements to the YOLOv4 algorithm:
(1) Replace the backbone network of YOLOv4 with MobileNetV3.
(2) Add CBAM [28] to the model and replace the traditional convolutions of the feature enhancement network with depthwise separable convolutions to reduce the number of parameters.
(3) Replace the CIoU loss function of YOLOv4 with the SIoU loss function [29] to enhance detection accuracy.
2. Materials and Methods
2.1. Data Acquisition and Processing
The images in the dataset were taken in daily life. We expanded the dataset to improve the generality of the model and reduce its sensitivity to individual images. In this paper, we used image flipping, Gaussian noise, salt-and-pepper noise, and brightness variation to augment the data. The dataset was expanded to 2240 images by data augmentation. Sample augmented images are shown in Figure 1.
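As a reference for how such augmentations can be realized, the following is a minimal sketch in Python with OpenCV and NumPy; the function names and parameter values are illustrative rather than taken from the paper, and for detection data the bounding box coordinates must be transformed together with the image (e.g., x becomes W - x under a horizontal flip).

```python
# Minimal sketch of the augmentations described above (flip, Gaussian noise,
# salt-and-pepper noise, brightness variation), assuming uint8 BGR images.
import numpy as np
import cv2

def flip(img: np.ndarray) -> np.ndarray:
    return cv2.flip(img, 1)  # 1 = horizontal flip

def gaussian_noise(img: np.ndarray, sigma: float = 15.0) -> np.ndarray:
    noise = np.random.normal(0.0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def salt_and_pepper(img: np.ndarray, amount: float = 0.01) -> np.ndarray:
    out = img.copy()
    mask = np.random.rand(*img.shape[:2])
    out[mask < amount / 2] = 0          # pepper
    out[mask > 1 - amount / 2] = 255    # salt
    return out

def adjust_brightness(img: np.ndarray, factor: float = 1.3) -> np.ndarray:
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)
```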

At the same time, to enrich the background of the detected objects, we used the Mosaic method (read four images at a time; randomly flip, zoom, or change the colour gamut of the four images; and finally combine them into a new picture) to further augment the data. An example of data augmentation by the Mosaic method is shown in Figure 2.
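A compact sketch of the Mosaic idea is given below, assuming four images resized into the four quadrants of one canvas around a random center; bounding box remapping and colour-gamut jitter are omitted for brevity, and the details here are illustrative rather than the paper's exact implementation.

```python
# Simplified Mosaic: four images placed around a random center point.
import random
import numpy as np
import cv2

def mosaic(images, out_size: int = 608) -> np.ndarray:
    assert len(images) == 4
    # Random mosaic center, kept away from the borders.
    cx = int(random.uniform(0.3, 0.7) * out_size)
    cy = int(random.uniform(0.3, 0.7) * out_size)
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        if random.random() < 0.5:            # random horizontal flip
            img = cv2.flip(img, 1)
        canvas[y1:y2, x1:x2] = cv2.resize(img, (x2 - x1, y2 - y1))
    return canvas
```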

2.2. Improved YOLOv4 Object Detection Algorithm
The YOLOv4 backbone network uses CSPDarknet53 [30], and the feature enhancement network uses the SPP (spatial pyramid pooling) structure [31], which greatly increases the receptive field and separates out the most significant contextual features, together with the PANet (path aggregation network) structure [32], which aggregates feature information across levels. The prediction network is the YOLO head, which obtains features at three scales and is responsible for detecting the position of dog feces in the picture.
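To make the SPP idea concrete, the following is a minimal PyTorch sketch of an SPP block; the kernel sizes 5, 9, and 13 follow the common YOLOv4 configuration and are an assumption here, not a value stated in this paper.

```python
# Parallel max-pooling at several kernel sizes enlarges the receptive
# field; the pooled maps are concatenated with the input.
import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in pool_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output has (len(pool_sizes) + 1) times the input channels.
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```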
However, the huge number of parameters of YOLOv4 results in a low FPS on embedded devices, making it difficult to perform normal target detection tasks. To solve this problem, we used MobileNetV3 to replace the YOLOv4 backbone network, replaced the traditional convolutions in the feature enhancement network with depthwise separable convolutions to make the model lighter, and incorporated CBAM into the neck network to increase the accuracy of the model. In the prediction network, we replaced CIoU, the position loss of the original algorithm, with SIoU to make model predictions more accurate. The YOLO-DFD schematic is shown in Figure 3.

2.2.1. The Backbone of YOLO-DFD
CSPDarknet53 is the backbone of YOLOv4; although it has excellent feature extraction ability, its huge number of parameters is unfriendly to embedded devices, so we used MobileNetV3 instead of CSPDarknet53 to reduce the number of parameters. The MobileNetV3 network combines the advantages of MobileNetV1 [33] and MobileNetV2 [34]; it applies depthwise separable convolution within an inverted residual structure.
The depthwise separable convolution divides the convolution process into two parts; the process is shown in Figure 4. Compared with traditional convolution, it uses fewer parameters to achieve similar feature extraction results. Equation (1) gives the computational cost of traditional convolution, and Equation (2) gives that of depthwise separable convolution. As can be seen from Equation (3), the computation of depthwise separable convolution is $1/N + 1/K^2$ that of ordinary convolution:

$$C_1 = W \times H \times C_{in} \times K \times K \times N \tag{1}$$

$$C_2 = W \times H \times C_{in} \times K \times K + W \times H \times C_{in} \times N \tag{2}$$

$$\frac{C_2}{C_1} = \frac{1}{N} + \frac{1}{K^2} \tag{3}$$

where $C_1$ is the traditional convolution's computation, $C_2$ is the depthwise separable convolution's computation, $W$ and $H$ are the input feature map width and height, $C_{in}$ is the number of input channels, $K$ is the convolutional kernel size, and $N$ is the number of convolutional kernels.
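In PyTorch terms, a depthwise separable convolution can be sketched as a depthwise convolution (groups equal to the number of input channels) followed by a pointwise 1 × 1 convolution; the helper below illustrates Equations (1)–(3) and is not the paper's code.

```python
# Depthwise 3x3 convolution followed by pointwise 1x1 convolution.
# For K = 3 and large N, this needs roughly 1/9 of the multiply-adds
# of a standard 3x3 convolution, matching Equation (3).
import torch.nn as nn

def depthwise_separable(c_in: int, n_out: int, k: int = 3) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),
        nn.BatchNorm2d(c_in),
        nn.ReLU6(inplace=True),
        nn.Conv2d(c_in, n_out, 1, bias=False),  # pointwise
        nn.BatchNorm2d(n_out),
        nn.ReLU6(inplace=True),
    )
```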

MobileNetV3 is composed of Bneck blocks; the SE attention mechanism is used to improve network efficiency, and the h_swish activation function is used to reduce the amount of computation and improve network performance. The h_swish activation function is more friendly to embedded devices than the ReLU6 activation function; it is given in Equation (4) and is used in Bneck blocks 7 to 15:

$$\text{h\_swish}(x) = x \cdot \frac{\text{ReLU6}(x + 3)}{6} \tag{4}$$
Figure 5 shows the Bneck structure.

The backbone structure is shown in Table 1.
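A minimal sketch of a Bneck block with the h_swish activation of Equation (4) and an SE module is shown below; the layer widths and hyperparameters are illustrative (the actual values come from Table 1), batch normalization layers are omitted for brevity, and a plain sigmoid is used in SE where MobileNetV3 uses a hard variant.

```python
# Bneck block: expand -> depthwise -> (optional SE) -> project.
import torch
import torch.nn as nn
import torch.nn.functional as F

def h_swish(x: torch.Tensor) -> torch.Tensor:
    # Equation (4): x * ReLU6(x + 3) / 6
    return x * F.relu6(x + 3.0) / 6.0

class SE(nn.Module):
    def __init__(self, c: int, r: int = 4):
        super().__init__()
        self.fc1, self.fc2 = nn.Conv2d(c, c // r, 1), nn.Conv2d(c // r, c, 1)

    def forward(self, x):
        s = F.adaptive_avg_pool2d(x, 1)
        return x * torch.sigmoid(self.fc2(F.relu(self.fc1(s))))

class Bneck(nn.Module):
    def __init__(self, c_in, c_exp, c_out, k=3, stride=1, use_se=True):
        super().__init__()
        self.use_res = stride == 1 and c_in == c_out  # inverted residual
        self.expand = nn.Conv2d(c_in, c_exp, 1, bias=False)
        self.dw = nn.Conv2d(c_exp, c_exp, k, stride, k // 2,
                            groups=c_exp, bias=False)  # depthwise
        self.se = SE(c_exp) if use_se else nn.Identity()
        self.project = nn.Conv2d(c_exp, c_out, 1, bias=False)

    def forward(self, x):
        y = h_swish(self.expand(x))
        y = h_swish(self.dw(y))
        y = self.project(self.se(y))
        return x + y if self.use_res else y
```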
2.2.2. Incorporating the Attention Mechanism
To reduce the probability that YOLO-DFD misses small targets, we add an attention mechanism to the neck network. In the task of detecting dog feces, most of the detected area is background, and dog feces usually occupies only a small part of the picture. Therefore, we incorporated CBAM into the feature enhancement (neck) network.
Compared with the SE and ECA modules, which only focus on the channel dimension, CBAM combines the channel dimension and the spatial dimension to achieve better results. CBAM contains two submodules: the channel attention module and the spatial attention module. In the channel attention module, MaxPool and AveragePool separately process the input feature in the spatial dimension; their outputs are then passed through a shared multilayer perceptron and combined by element-wise addition; finally, a sigmoid function produces the channel attention $M_c$, as shown in

$$M_c(F) = \sigma\left(\text{MLP}\left(\text{AvgPool}(F)\right) + \text{MLP}\left(\text{MaxPool}(F)\right)\right) = \sigma\left(W_1\left(W_0\left(F_{avg}^c\right)\right) + W_1\left(W_0\left(F_{max}^c\right)\right)\right) \tag{5}$$

where $\sigma$ is the sigmoid function, $W_0 \in \mathbb{R}^{C/r \times C}$, $W_1 \in \mathbb{R}^{C \times C/r}$, $C$ is the number of input channels, and $r$ is the reduction ratio. The process is shown in Figure 6.

In Figure 7, the input feature of the spatial attention module is defined as $F' = M_c(F) \otimes F$, where $\otimes$ is element-wise multiplication. First, $F'$ is processed by MaxPool and AveragePool along the channel dimension and their outputs are concatenated; then, a convolution layer with a kernel size of $7 \times 7$ is used to extract the information; finally, we obtain the spatial attention $M_s$, as shown in

$$M_s(F') = \sigma\left(f^{7 \times 7}\left(\left[\text{AvgPool}(F'); \text{MaxPool}(F')\right]\right)\right) \tag{6}$$

where $\sigma$ is the sigmoid function and $f^{7 \times 7}$ is the convolution layer with a kernel size of $7 \times 7$.

The convolutional block attention module combines the channel attention module and the spatial attention module, as shown in

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F' \tag{7}$$

where $F''$ is the output feature of CBAM. The process is illustrated in Figure 8.
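The following PyTorch sketch implements Equations (5)–(7); the reduction ratio r = 16 and the 7 × 7 kernel follow the common CBAM configuration and are assumptions here rather than values stated in this paper.

```python
# CBAM: channel attention (Eq. 5), then spatial attention (Eq. 6),
# applied multiplicatively as in Equation (7).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    def __init__(self, c: int, r: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(          # shared MLP (W0, W1)
            nn.Conv2d(c, c // r, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(c // r, c, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        return torch.sigmoid(avg + mx)     # Equation (5)

class SpatialAttention(nn.Module):
    def __init__(self, k: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.max(dim=1, keepdim=True).values
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Eq. (6)

class CBAM(nn.Module):
    def __init__(self, c: int, r: int = 16):
        super().__init__()
        self.ca, self.sa = ChannelAttention(c, r), SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)      # F' = Mc(F) * F
        return x * self.sa(x)   # F'' = Ms(F') * F'  (Equation (7))
```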

Figure 9 compares the network before and after adding CBAM to the MobileNetV3-YOLOv4 neck network; the original images, the detection results of MobileNetV3-YOLOv4, and the detection results of the improved network with CBAM added to the neck are presented in Figures 9(a)–9(c), respectively. Figure 9 shows that CBAM can significantly increase the attention of the network on the target areas.

2.2.3. The Bounding Box Regression Loss of YOLO-DFD
The loss function calculates the network loss from the difference between the real value and the predicted value; the network parameters are optimized over many iterations so that the predictions approximate the real values step by step. The loss function of YOLOv4 is divided into three parts: bounding box regression loss, confidence loss, and classification loss. In this paper, we substitute the bounding box regression loss. In the YOLOv4 algorithm, the bounding box regression loss adopts the CIoU loss function, which is calculated as

$$L_{CIoU} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha v \tag{8}$$

where $\rho\left(b, b^{gt}\right)$ is the distance between the centroids of the predicted bounding box $b$ and the ground truth box $b^{gt}$, $c$ is the diagonal distance of the smallest closed area that can contain both the prediction box and the ground truth box, and $\alpha v$ is the penalty term that improves the rate of convergence. The penalty term of CIoU is calculated as

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \qquad \alpha = \frac{v}{(1 - IoU) + v} \tag{9}$$
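For reference, Equations (8) and (9) can be implemented as follows for boxes given in (cx, cy, w, h) form; this is a simplified illustration, not the paper's training code.

```python
# CIoU loss for axis-aligned boxes, Equations (8)-(9).
import math
import torch

def ciou_loss(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-7):
    px1, py1 = pred[..., 0] - pred[..., 2] / 2, pred[..., 1] - pred[..., 3] / 2
    px2, py2 = pred[..., 0] + pred[..., 2] / 2, pred[..., 1] + pred[..., 3] / 2
    gx1, gy1 = gt[..., 0] - gt[..., 2] / 2, gt[..., 1] - gt[..., 3] / 2
    gx2, gy2 = gt[..., 0] + gt[..., 2] / 2, gt[..., 1] + gt[..., 3] / 2
    # IoU
    iw = (torch.min(px2, gx2) - torch.max(px1, gx1)).clamp(0)
    ih = (torch.min(py2, gy2) - torch.max(py1, gy1)).clamp(0)
    inter = iw * ih
    union = pred[..., 2] * pred[..., 3] + gt[..., 2] * gt[..., 3] - inter
    iou = inter / (union + eps)
    # Squared center distance rho^2 and enclosing-box diagonal c^2
    rho2 = (pred[..., 0] - gt[..., 0]) ** 2 + (pred[..., 1] - gt[..., 1]) ** 2
    cw = torch.max(px2, gx2) - torch.min(px1, gx1)
    ch = torch.max(py2, gy2) - torch.min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio term v and trade-off alpha, Equation (9)
    v = (4 / math.pi ** 2) * (torch.atan(gt[..., 2] / (gt[..., 3] + eps))
                              - torch.atan(pred[..., 2] / (pred[..., 3] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v   # Equation (8)
```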
CIoU takes into account geometric factors such as the overlap region, centroid distance, and aspect ratio of the predicted and ground truth boxes, but it does not consider the direction of the mismatch between the ground truth box and the predicted box, which influences prediction accuracy. To achieve a better detection effect, we used the SIoU loss function to replace the CIoU loss function. SIoU consists of four parts: angle cost, distance cost, shape cost, and IoU cost. As a new penalty term, the angle cost improves the speed of regression; it is calculated as

$$\Lambda = 1 - 2\sin^2\left(\arcsin(x) - \frac{\pi}{4}\right) \tag{10}$$

where $x$ is defined as $x = c_h / \sigma = \sin(\alpha)$. If $\alpha \leq \pi/4$, the convergence process will first minimize $\alpha$; otherwise, it will minimize $\beta = \pi/2 - \alpha$. $\sigma$ is the distance between the center points of $b$ and $b^{gt}$, and $c_h$ is defined as $c_h = \max\left(b_{c_y}^{gt}, b_{c_y}\right) - \min\left(b_{c_y}^{gt}, b_{c_y}\right)$. Figure 10 shows the calculation scheme of the angle cost contribution in SIoU.

Compared with CIoU, the distance cost of SIoU takes the angle cost into account; it is defined as

$$\Delta = \sum_{t=x,y}\left(1 - e^{-\gamma \rho_t}\right) \tag{11}$$

where $\rho_x = \left(\frac{b_{c_x}^{gt} - b_{c_x}}{c_w}\right)^2$, $\rho_y = \left(\frac{b_{c_y}^{gt} - b_{c_y}}{c_h}\right)^2$, and $\gamma = 2 - \Lambda$, with $c_w$ and $c_h$ the width and height of the smallest enclosing box. When $\alpha$ approaches 0, the effect of the distance cost is reduced; on the contrary, when $\alpha$ approaches $\pi/4$, the effect of the distance cost increases.
The shape cost is defined in

$$\Omega = \sum_{t=w,h}\left(1 - e^{-\omega_t}\right)^{\theta} \tag{12}$$

where $\omega_w = \frac{\left|w - w^{gt}\right|}{\max\left(w, w^{gt}\right)}$ and $\omega_h = \frac{\left|h - h^{gt}\right|}{\max\left(h, h^{gt}\right)}$, and $\theta$ is an important value that controls the effect of the shape cost. The author of SIoU [29] used a genetic algorithm to calculate it and defined its range as 2 to 6; in this paper, the value is 4.
The IoU cost is defined in

$$IoU = \frac{\left|B \cap B^{gt}\right|}{\left|B \cup B^{gt}\right|} \tag{13}$$
The SIoU loss function is given by

$$L_{SIoU} = 1 - IoU + \frac{\Delta + \Omega}{2} \tag{14}$$
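Putting Equations (10)–(14) together, a sketch of the SIoU loss for boxes in (cx, cy, w, h) form might look as follows; this is a simplified illustration under the definitions above, not the paper's exact code.

```python
# SIoU loss: angle, distance, shape, and IoU costs, Equations (10)-(14).
import math
import torch

def siou_loss(pred: torch.Tensor, gt: torch.Tensor, theta: float = 4.0,
              eps: float = 1e-7):
    # IoU (Equation (13)) and smallest enclosing box (width cw, height ch)
    px1, py1 = pred[..., 0] - pred[..., 2] / 2, pred[..., 1] - pred[..., 3] / 2
    px2, py2 = pred[..., 0] + pred[..., 2] / 2, pred[..., 1] + pred[..., 3] / 2
    gx1, gy1 = gt[..., 0] - gt[..., 2] / 2, gt[..., 1] - gt[..., 3] / 2
    gx2, gy2 = gt[..., 0] + gt[..., 2] / 2, gt[..., 1] + gt[..., 3] / 2
    inter = ((torch.min(px2, gx2) - torch.max(px1, gx1)).clamp(0)
             * (torch.min(py2, gy2) - torch.max(py1, gy1)).clamp(0))
    union = pred[..., 2] * pred[..., 3] + gt[..., 2] * gt[..., 3] - inter
    iou = inter / (union + eps)
    cw = torch.max(px2, gx2) - torch.min(px1, gx1)
    ch = torch.max(py2, gy2) - torch.min(py1, gy1)
    # Angle cost (Equation (10)): x = sin(alpha)
    dx = gt[..., 0] - pred[..., 0]
    dy = gt[..., 1] - pred[..., 1]
    sigma = torch.sqrt(dx ** 2 + dy ** 2) + eps      # center distance
    x = (torch.abs(dy) / sigma).clamp(0, 1)
    angle = 1 - 2 * torch.sin(torch.arcsin(x) - math.pi / 4) ** 2
    # Distance cost (Equation (11)) with gamma = 2 - angle
    gamma = 2 - angle
    dist = ((1 - torch.exp(-gamma * (dx / (cw + eps)) ** 2))
            + (1 - torch.exp(-gamma * (dy / (ch + eps)) ** 2)))
    # Shape cost (Equation (12))
    ww = torch.abs(pred[..., 2] - gt[..., 2]) / torch.max(pred[..., 2], gt[..., 2])
    wh = torch.abs(pred[..., 3] - gt[..., 3]) / torch.max(pred[..., 3], gt[..., 3])
    shape = (1 - torch.exp(-ww)) ** theta + (1 - torch.exp(-wh)) ** theta
    return 1 - iou + (dist + shape) / 2              # Equation (14)
```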
3. Results and Discussion
3.1. Experiment Environment
The training platform is a cloud container platform with a GeForce RTX 3060 GPU (12 GB), and the experiment environment is PyTorch 1.8, CUDA 11.0, CUDNN 8.0.4, and Python 3.8.5. The training parameters are shown in Table 2. To train a better model, we applied the Mosaic method, which combines four original images into one image, during training, with a 50% chance of using Mosaic data augmentation per step; because the training images generated by the Mosaic method are far from the real distribution of the original images, we applied the Mosaic method only for the first 70% of epochs. The cosine annealing method is used to decay the learning rate during training.
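The schedule described above can be sketched as two small helpers; the 70% cutoff and 50% probability follow the text, while the learning rate bounds are illustrative placeholders.

```python
# Mosaic gating and cosine-annealed learning rate, as described above.
import math
import random

def use_mosaic(epoch: int, total_epochs: int, p: float = 0.5) -> bool:
    """Mosaic only in the first 70% of epochs, with probability p per step."""
    return epoch < 0.7 * total_epochs and random.random() < p

def cosine_lr(step: int, total_steps: int,
              lr_max: float = 1e-3, lr_min: float = 1e-5) -> float:
    """Cosine annealing of the learning rate from lr_max down to lr_min."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * step / total_steps))
```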
3.2. Evaluation Indexes
In this paper, we use giga floating-point operations (GFLOPs) to represent the computation required by the network and frames per second (FPS) to indicate the speed of the algorithm. Precision represents the ability of the model to find only correct targets, and recall indicates the ability of the model to find all correct targets. They are given by

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN} \tag{15}$$

where TP is true positives, FP is false positives, and FN is false negatives. However, precision and recall alone do not fully reflect the performance of the model, so we also introduce AP to indicate the precision of the model. It is defined in

$$AP = \int_0^1 P(R)\, dR \tag{16}$$
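As an illustration of Equations (15) and (16), the helpers below compute precision and recall from detection counts and AP as the area under the precision-recall curve (all-point interpolation, as in common VOC-style evaluation); they are a sketch, not the paper's evaluation code.

```python
# Precision/recall (Equation (15)) and AP as the area under the
# precision-recall curve (Equation (16)).
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    # Append sentinel values and make precision monotonically decreasing.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```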
3.3. Ablation Experiments
To verify the improvement in network performance from each strategy, we used YOLOv4 as the baseline and combined different improvement strategies in an ablation experiment. As Table 3 shows, in method 1 we tested the metrics of YOLOv4 on the dog feces dataset. In method 2, we replaced the YOLOv4 backbone network with MobileNetV3, which greatly reduced the amount of computation, but its recall was reduced by 7.34%, precision by 7.01%, and AP by 7.17% compared with YOLOv4. In method 3, we added the CBAM module to the network based on method 1. In method 4, we replaced the bounding box regression loss with SIoU based on method 1. Methods 3 and 4 show that the AP improved by 0.6% and 0.28%, respectively, compared with YOLOv4. Method 5 is our algorithm. The experimental results indicate that YOLO-DFD can greatly reduce the number of parameters while preserving detection accuracy. Figure 11 shows the AP curve.

3.4. Comparison with Other Methods
We compared the performance of YOLO-DFD with five other algorithms; the comparison results are shown in Table 4. It can be seen that the FPS, AP, and precision of SSD were higher than those of the other models, but the recall of SSD was much lower. The recall and precision of YOLOv3 and YOLOv7l were higher than ours, but the AP, GFLOPs, parameter count, and FPS of YOLO-DFD were better than theirs. The parameter count and FPS of YOLOv5s are better than those of our model, but its detection accuracy is lower, so it is more likely to miss target objects when performing detection tasks. This indicates that YOLO-DFD is more suitable for deployment on embedded devices. Sample detection results are shown in Figure 12; our algorithm was usually able to detect all target objects in the detection task.

The training loss curves of the seven algorithms are shown in Figure 13. From Figure 13(a), it can be seen that the training loss of the Faster R-CNN and SSD algorithms converges more slowly than that of the YOLO series algorithms. From Figure 13(b), it can be seen that the curve of YOLO-DFD using the SIoU loss function converges faster than those of the other algorithms.

3.5. Experiment in Embedded Device
To test the practicability of YOLO-DFD, we deployed YOLOv4 and YOLO-DFD in a cleaning robot equipped with a Jetson Nano and compared their performance. We tested the two algorithms on the embedded device in four scenarios: street, warehouse, courtyard, and bedroom. As Table 5 shows, the accuracy of YOLO-DFD is higher than that of YOLOv4, and its FPS is 4.93 higher. Sample detection results of the cleaning robot in the four scenarios are shown in Figure 14.

4. Conclusion
In this paper, we introduced a new dataset, the dog feces dataset, whose images were collected in daily life, and proposed an improved lightweight network based on YOLOv4. First, we replaced the backbone network of YOLOv4 with MobileNetV3, whose lower computational cost makes it more suitable for embedded devices, and we replaced the usual convolutions in the feature enhancement network with depthwise separable convolutions to reduce the number of parameters; this strategy increases the detection speed of the algorithm. Second, we added CBAM to the neck network to emphasize valuable features and suppress unnecessary ones, which improves the accuracy of the algorithm. Then, the loss function of YOLOv4 was replaced by SIoU to better locate the target objects. Finally, we compared the performance of several mainstream detection algorithms on the dog feces dataset; the results indicate that YOLO-DFD maintains detection accuracy while minimizing the number of parameters and the amount of computation, which is friendlier to embedded devices. To further verify the effectiveness of the algorithm, we deployed YOLO-DFD and YOLOv4 on the cleaning robot and compared their performance; the real-time detection performance of YOLO-DFD is better than that of YOLOv4 with guaranteed accuracy.
In the future, we will continue to expand the dog feces dataset to improve the detection accuracy of YOLO-DFD. We will also further lighten the algorithm while maintaining its detection accuracy and improve its real-time detection performance on embedded devices.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the Natural Science Foundation of Liaoning Province (Grant number 20180550128) and Scientific Research Fund of Liaoning Provincial Education Department (Grant number LJKZ0448).