Abstract

To accurately detect small defects in urine test strips, the SK-FMYOLOV3 defect detection algorithm is proposed. First, the prediction box clustering algorithm of YOLOV3 is improved: the fuzzy C-means clustering algorithm is used to generate the initial cluster centers, which are then passed to the K-means algorithm to cluster the prediction boxes. To better detect smaller defects, the YOLOV3 feature map fusion is extended from the original three-scale prediction to a four-scale prediction. At the same time, 23 convolutional layers in the YOLOV3 network are replaced with SkNet structures, so that different feature maps can independently select different convolution kernels for training, improving the accuracy of defect classification. We collected and augmented urine test strip images from industrial production and labeled the small defects in the images. A total of 11634 images were used for training and testing. The experimental results show that the algorithm obtains anchor boxes with an average intersection over union (IoU) of 86.57%, while the precision and recall for nonconforming products are 96.8% and 94.5%, respectively. The algorithm can also accurately identify the category of defects in nonconforming products.

1. Introduction

Defect recognition is one of the important applications of machine vision in industrial manufacturing. It can improve factory production efficiency, reduce human labor, and be used to monitor product quality in real time [1]. However, the accurate identification of product defects remains a challenging and actively investigated problem. To date, two main approaches have been used: traditional image recognition methods that extract and classify image features, and deep neural networks applied directly to defect identification.

The traditional defect recognition pipeline includes the following steps: image preprocessing, image segmentation, feature extraction, and classifier training. The goal of image preprocessing is to reduce the noise contained in the images collected in the industrial field [2, 3]. Image segmentation decomposes the image into several areas with different characteristics, such that the image characteristics within each area are the same or similar. Commonly used segmentation methods include threshold-based segmentation [4, 5] and edge-based segmentation [6, 7]; commonly used edge detection operators include the Canny, Sobel, and Roberts operators. Image feature extraction maps a high-dimensional image space to a low-dimensional feature space. Commonly used image features include color, shape, and texture [8, 9]. Color-based feature extraction methods include color histograms [10], color moments, and color sets. Texture-based feature extraction methods mainly include statistical methods [11], spectral methods [12-14], and model-based methods [15]. Commonly used spectral methods include the Fourier transform, wavelet transform, and Gabor transform. Image texture features can also be extracted from the second-order moments of the gray histogram, entropy, inverse moments, contrast, and correlation [16]. Huang et al. proposed a CDD-based defect detection algorithm to detect and classify defects [17]. The classifier learns the mapping relationship between the feature vector and the category through training on features and labels [18-20] and finds the model parameters with the smallest classification error. Commonly used classifiers include the ANN, Bayes, and SVM classifiers.

Traditional image recognition methods are inefficient and inaccurate. To improve the recognition performance, Girshick et al. [21] proposed the R-CNN algorithm, which first introduced deep learning into object detection. He et al. [22] then proposed the SPP-Net algorithm, which solved the problem of object deformation caused by scaling candidate boxes to a uniform size. Girshick [23] proposed fast R-CNN by further addressing the shortcomings of R-CNN and SPP-Net. Ren et al. [24] proposed faster R-CNN, which improves the detection speed while maintaining accuracy. The subsequently developed R-FCN [25] is also a region-based target recognition method.

Later, regression-based target recognition methods, such as the SSD [26] and YOLO [27] series, appeared. Region-based methods have higher positioning accuracy but suffer from low detection speed. The regression-based YOLO network has fast processing speed and high accuracy [28], is easy to deploy in industrial production, and has been widely used.

For the recognition of small targets, shortcomings such as a small field of view, a single aspect ratio, and low detection accuracy persist [29]. To address these problems, many researchers have improved network structures, introduced top-down architectures, and proposed algorithms such as DSSD [30, 31] and YOLOV3 [32]. For example, Tao et al. [33] developed the OYOLO network by increasing the weight of the positioning error function; after combining it with R-FCN, the detection speed is improved, but the weight of the confidence error function is reduced, which affects the confidence prediction of the network. Deng et al. [34] proposed a small target recognition algorithm based on CGAN that performs accurate target recognition but only in a single application scenario. Zheng et al. [35] proposed a dense-YOLO network that improves the recognition of small targets in remote sensing images through feature reuse but has the disadvantage of a huge memory footprint.

To meet the needs of real-time detection in industrial production, this paper improves the detection accuracy of small defects while ensuring fast detection speed. The YOLOV3 network is used as the base detection model, because it performs detection and classification of images simultaneously, greatly improving the detection speed. First, the prediction box clustering algorithm of YOLOV3 is improved to avoid the influence of randomly initialized prediction boxes on the prediction result and to improve the accuracy of the prediction box. Additionally, YOLOV3 tends to miss smaller defects; therefore, this paper adds a scale to the original YOLOV3, detects the target image at 4 scales, and improves the recall rate of small defects. Finally, to improve the precision on small defects, the SkNet structure is added on top of YOLOV3 to raise the scores of small defects and obtain higher recognition accuracy. For the identification of qualified and nonconforming products, the precision rate is 96.8% and the recall rate is 94.5%. Moreover, our method can accurately identify six minor defects in the nonconforming products: “Crooked,” “Stains,” “Marker pen,” “Burr,” “Short,” and “Peeling.” The contributions of this paper are as follows:
(i) A prediction box clustering method combining fuzzy C-means and K-means is proposed.
(ii) A fusion framework for small target recognition is proposed, and the SkNet structure is embedded in the YOLOV3 network model.
(iii) A urine test strip image data set of 11634 images was collected, providing a data basis for future research and demonstrating new approaches for the identification of small defects in industrial products.

Article structure: the remainder of the article is organized as follows. Section 2 introduces the YOLOV3 algorithm and the SkNet structure. Section 3 presents the designed SK-FMYOLOV3 network model. Sections 4 and 5 describe the experimental setup and the data set and analyze the performance of the SK-FMYOLOV3 network model on industrial product defect detection. Section 6 concludes the paper.

2. Preliminaries

2.1. YOLOV3

YOLOV3 is a new end-to-end target detection model after R-CNN, fast R-CNN, and faster R-CNN, as shown in Figure 1. It combines training with target classification and detection and returns the position and category of the target detection box directly at the output layer, transforming the detection problem into a regression problem.

YOLOV3 predicts 4 values for each bounding box in each cell, namely, the box coordinates $(t_x, t_y)$ and the width and height of the target $(t_w, t_h)$, recorded as $(t_x, t_y, t_w, t_h)$. If the cell is offset by $(c_x, c_y)$ from the upper left corner of the image and the anchor box has width and height $(p_w, p_h)$, the revised bounding box is
$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h},$$
where $\sigma(\cdot)$ is the sigmoid function.
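
For reference, this decoding step can be sketched in plain NumPy as follows (the variable names follow the equation above; this is an illustrative sketch, not code from a particular YOLOV3 implementation).

import numpy as np

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Decode raw YOLOV3 outputs (t_x, t_y, t_w, t_h) into box center and size.

    (c_x, c_y) is the offset of the grid cell from the image's upper left corner
    (in grid units), and (p_w, p_h) is the width and height of the anchor prior.
    """
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    b_x = sigmoid(t_x) + c_x   # box center x, in grid units
    b_y = sigmoid(t_y) + c_y   # box center y, in grid units
    b_w = p_w * np.exp(t_w)    # box width
    b_h = p_h * np.exp(t_h)    # box height
    return b_x, b_y, b_w, b_h

# Example: a prediction made in grid cell (3, 5) with a 116 x 90 anchor prior
print(decode_box(0.2, -0.1, 0.05, 0.3, 3, 5, 116, 90))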

Here, the selection of the anchor boxes adopts the method of dimension clustering. Traditional clustering algorithms include hierarchical clustering, K-means clustering, and model-based methods [36].

YOLOV3 uses the K-means clustering algorithm to cluster the sizes of the target boxes in the training set in order to obtain the optimal sizes of the anchor boxes, so that a more accurate target box can be predicted. The distance metric of the K-means clustering algorithm is given by
$$d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid}).$$

Here, box refers to a border size sample in the data set, and centroid refers to a cluster center size. The K-means clustering algorithm randomly selects K target points as the initial cluster centers, where K is the number of clusters. This random initialization increases the variability of the clusters and degrades the clustering effect of the algorithm.
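
As an illustration, the dimension clustering above can be sketched as follows (a simplified NumPy implementation assuming each box is represented by its width-height pair; it is not the exact code used by YOLOV3 or in this paper).

import numpy as np

def iou_wh(boxes, centroids):
    """IoU between width-height pairs, assuming all boxes share the same center."""
    inter_w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = inter_w * inter_h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k, init_centroids=None, iters=100):
    """K-means on box sizes with d(box, centroid) = 1 - IoU(box, centroid)."""
    rng = np.random.default_rng(0)
    centroids = (boxes[rng.choice(len(boxes), k, replace=False)]
                 if init_centroids is None else init_centroids.copy())
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# Toy example: cluster 200 random box sizes into 9 anchors
boxes = np.abs(np.random.default_rng(1).normal(50, 20, size=(200, 2))) + 1
print(kmeans_anchors(boxes, k=9))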

2.2. SkNet

In a conventional neural network, the receptive fields of the neurons in each layer have the same size, whereas in human vision the size of the receptive field changes with the size of the observed object. To let the neurons adaptively adjust their receptive field sizes for input information of different sizes, the selective kernel network (SkNet) [37] module was proposed. This module aggregates kernels of different sizes in a nonlinear way, mixing them together via softmax attention. The size of the receptive field differs in different fusion layers, as shown in Figure 2.

SkNet is divided into three parts: split, fuse, and select. The first is the split operation. For the input $X \in \mathbb{R}^{H \times W \times C}$ (where $C$ is the number of channels, $H$ is the height, and $W$ is the width), different receptive fields are obtained through two convolution kernels of different sizes ($3 \times 3$ and $5 \times 5$ in the original SkNet), and the two resulting feature maps are $\tilde{U}$ and $\hat{U}$. Next is the fuse operation, which adds the two feature maps to obtain $U$:
$$U = \tilde{U} + \hat{U}.$$

To obtain global information, a global average pooling operation is performed:
$$s_c = \mathcal{F}_{gp}(U_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j).$$

At the same time, to improve the accuracy and adaptability of the network, a fully connected layer is added after the pooling layer:
$$z = \mathcal{F}_{fc}(s) = \delta\bigl(\mathcal{B}(W s)\bigr),$$
where $\delta$ is the ReLU function, $\mathcal{B}$ is batch normalization (BN) [38], and $W \in \mathbb{R}^{d \times C}$ performs a squeeze [39] operation, that is, $z$ has a smaller dimension than $s$. The dimension of $z$ is set to $d$, and the value of $d$ is
$$d = \max\!\left(\frac{C}{r}, L\right).$$

Here, $r$ and $L$ are set manually: $r$ is the reduction ratio that compresses the dimension, and $L$ is the minimum value of $d$. The select operation uses soft attention across channels to select information of different scales:
$$a_c = \frac{e^{A_c z}}{e^{A_c z} + e^{B_c z}}, \quad b_c = \frac{e^{B_c z}}{e^{A_c z} + e^{B_c z}}, \qquad V_c = a_c \cdot \tilde{U}_c + b_c \cdot \hat{U}_c, \quad a_c + b_c = 1,$$
where $A, B \in \mathbb{R}^{C \times d}$ are the weight matrices of the fully connected layers and $A_c$, $B_c$ denote their $c$th rows.

To summarize the idea of SkNet: feature maps at several scales are produced; the squeezed feature $z$ is mapped back to $C$ dimensions by several fully connected layers (one per branch); the N fully connected outputs are then stacked, and a softmax is applied across the branches for each channel, so that the same channel at different scales receives different weights.
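
The split-fuse-select procedure can be sketched in PyTorch roughly as follows. This is a minimal two-branch illustration: the 3x3/dilated-3x3 branch choice, the reduction settings, and the layer names are assumptions made for the sketch, not the exact configuration used in SK-FMYOLOV3.

import torch
import torch.nn as nn

class SKConv(nn.Module):
    """Minimal two-branch selective kernel block: split -> fuse -> select."""
    def __init__(self, channels, reduction=16, min_dim=32):
        super().__init__()
        d = max(channels // reduction, min_dim)      # d = max(C / r, L)
        self.branch3 = nn.Sequential(                # 3x3 receptive field
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(                # 5x5 field via dilated 3x3
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.fc = nn.Sequential(nn.Linear(channels, d),
                                nn.BatchNorm1d(d), nn.ReLU(inplace=True))
        self.fc_a = nn.Linear(d, channels)           # attention logits, branch 3
        self.fc_b = nn.Linear(d, channels)           # attention logits, branch 5

    def forward(self, x):
        u3, u5 = self.branch3(x), self.branch5(x)    # split
        u = u3 + u5                                  # fuse
        s = u.mean(dim=(2, 3))                       # global average pooling
        z = self.fc(s)                               # squeeze to d dimensions
        att = torch.stack([self.fc_a(z), self.fc_b(z)], dim=1)
        att = torch.softmax(att, dim=1)              # select: per-channel weights
        a, b = att[:, 0], att[:, 1]
        return u3 * a[:, :, None, None] + u5 * b[:, :, None, None]

# Example: apply the block to a 64-channel feature map
y = SKConv(64)(torch.randn(2, 64, 52, 52))
print(y.shape)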

3. SK-FMYOLOV3

3.1. FMYOLOV3
3.1.1. Predictive Box Clustering Algorithm with Fuzzy C-Means (FYOLOV3)

To reduce the randomness and improve the accuracy of the prediction box, the clustering method is improved. The fuzzy C-means clustering algorithm [40] is used to generate the initial cluster centers. These centers are then passed to the K-means algorithm, and the clustering result gives the initial positions of the anchors.

First, the data are standardized. The set of image classification objects is $X = \{x_1, x_2, \ldots, x_n\}$. Any sample $x_i$ in this set has $p$ indicators, and the characteristic index vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ is used to describe the sample.

In Eq. (3), $x_{ij}$ represents the $j$th characteristic index of the sample $x_i$, and the matrix of the characteristic indicators of the $n$ samples is constrained to values in $[0, 1]$ through data transformation. The algorithm uses local neighborhood information and covariance to construct an objective function. The objective function and constraints of the algorithm are as follows:
$$J = \sum_{i=1}^{N} \sum_{k=1}^{K} u_{ik}^{m}\, d_{M}^{2}(x_i, v_k) + \alpha \sum_{i=1}^{N} \sum_{k=1}^{K} u_{ik}^{m} \sum_{r \in N_i} d_{M}^{2}(x_r, v_k), \qquad \text{s.t.} \quad \sum_{k=1}^{K} u_{ik} = 1,$$
where $N$ is the total number of image pixels, $K$ is the number of image classifications, and $u_{ik}$ represents the degree of membership of the pixel $x_i$ in the $k$th category. $m$ is a fuzzy weighting coefficient greater than 1, and $v_k$ represents the $k$th cluster center. $d_{M}(x_i, v_k)$ represents the Mahalanobis distance from the $i$th data point to the $k$th cluster center, that is, the covariance-weighted distance of the data. $\alpha$ is the balance parameter that controls the influence of the neighboring pixels $N_i$. The Mahalanobis distance is computed as
$$d_{M}^{2}(x_i, v_k) = \left(\det \Sigma_k\right)^{1/p} (x_i - v_k)^{T} \Sigma_k^{-1} (x_i - v_k),$$
where $\det \Sigma_k$ denotes the determinant of the covariance matrix and $p$ denotes the dimension of the problem. The cluster center $v_k$ and the membership $u_{ik}$ of the $i$th pixel can be obtained with the Lagrange multiplier method:
$$v_k = \frac{\sum_{i=1}^{N} u_{ik}^{m} x_i}{\sum_{i=1}^{N} u_{ik}^{m}}, \qquad u_{ik} = \left[\sum_{j=1}^{K} \left(\frac{d_{M}(x_i, v_k)}{d_{M}(x_i, v_j)}\right)^{\frac{2}{m-1}}\right]^{-1}.$$

$V = (v_1, v_2, \ldots, v_K)$ is then used as the set of initial cluster centers of the K-means clustering algorithm, where $K$ is the number of categories. For each sample $x_i$ in the data set, its distance to every cluster center $v_k$ is calculated, and the sample is assigned to the class $C_k$ of the nearest cluster center. For each class $C_k$, the cluster center is then recalculated:
$$v_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i.$$

Repeat the distance calculation and the center update until the positions of the cluster centers no longer change. The algorithm flow is presented in Algorithm 1.

Input: image $X$, number of classifications $K$
Step0: Standardize the input data according to Eqs. (3) and (4), constraining the values to [0, 1].
Step1: Define the objective function and constraints of the algorithm, Eq. (5).
Step2: Obtain the cluster center $v_k$ and the membership $u_{ik}$ of the $i$th pixel by the Lagrange multiplier method.
Step3: Use $V = (v_1, v_2, \ldots, v_K)$ as the initial cluster centers.
Step4: Calculate the distance from each sample to the cluster centers according to Eqs. (6) and (7).
Step5: Assign each sample to the class of the cluster center with the smallest distance.
Step6: Update the cluster centers by Eq. (8).
Step7: Repeat Step5 and Step6 until the cluster centers no longer change.
Output: the prediction box cluster centers
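
A simplified sketch of the combined procedure is given below. It uses plain fuzzy C-means with a Euclidean distance rather than the Mahalanobis/neighborhood variant above, and its output centers would serve as the initial centroids of the 1 - IoU K-means sketched in Section 2.1; the function names are illustrative only.

import numpy as np

def fuzzy_cmeans_centers(data, k, m=2.0, iters=100, seed=0):
    """Plain fuzzy C-means (Euclidean); returns only the final cluster centers."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(data), k))
    u /= u.sum(axis=1, keepdims=True)      # memberships of each sample sum to 1
    for _ in range(iters):
        um = u ** m
        centers = um.T @ data / um.sum(axis=0)[:, None]
        dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        u_new = 1.0 / (dist ** (2 / (m - 1)) *
                       (1.0 / dist ** (2 / (m - 1))).sum(axis=1, keepdims=True))
        if np.allclose(u_new, u, atol=1e-6):
            break
        u = u_new
    return centers

# Width-height pairs of labeled boxes (toy data); the resulting centers are then
# used to initialize the 1 - IoU K-means instead of random points.
boxes = np.abs(np.random.default_rng(1).normal(40, 15, size=(300, 2))) + 1
print(fuzzy_cmeans_centers(boxes, k=9))
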
3.1.2. Multiscale Detection (MYOLOV3)

The YOLOV3 algorithm fuses feature maps at three different scales, exploiting the high resolution of the low-level features and the rich semantic information of the high-level features. By upsampling the features of different layers, objects are detected on three feature layers of different scales. As shown in Figure 3, the detections are made on the most heavily downsampled (coarsest) feature map and on two progressively upsampled feature maps.

The YOLOV3 network downsamples the input detection image by a factor of 32. With such a high downsampling factor, the receptive field of the feature map is relatively large and the shallow information is not fully utilized, resulting in some information loss after the many convolution layers. This deep feature map is therefore suited to detecting large objects in an image. In industrial production, however, the defects of objects are relatively small. For better detection of small defects, the original three-scale detection is extended to four-scale detection.

As shown in Figure 4, when multiscale fusion is performed, an upsampling fusion operation is used: one more scale is added to the fusion, that is, a further upsampled, higher-resolution feature map is introduced. Because of the added scale, the anchor values also must be readjusted, as shown in Table 1.
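
The idea of the added scale can be sketched in PyTorch as follows; the fragment shows only the extra upsample-and-concatenate step, and the channel counts, layer names, and anchor/class numbers are placeholders rather than the exact SK-FMYOLOV3 configuration listed in Table 2.

import torch
import torch.nn as nn

class ExtraScaleHead(nn.Module):
    """Upsample the finest existing YOLO branch and fuse it with an earlier,
    higher-resolution backbone feature map to form a fourth detection scale."""
    def __init__(self, branch_ch, skip_ch, num_outputs):
        super().__init__()
        self.reduce = nn.Conv2d(branch_ch, branch_ch // 2, 1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.head = nn.Sequential(
            nn.Conv2d(branch_ch // 2 + skip_ch, 128, 3, padding=1),
            nn.LeakyReLU(0.1),
            nn.Conv2d(128, num_outputs, 1))  # per-anchor box, objectness, classes

    def forward(self, fine_branch, backbone_skip):
        x = self.up(self.reduce(fine_branch))       # 2x upsampling
        x = torch.cat([x, backbone_skip], dim=1)    # fuse with shallow features
        return self.head(x)

# Example: fuse a 52 x 52 branch with a 104 x 104 backbone feature map,
# assuming 3 anchors per scale and 8 classes
head = ExtraScaleHead(branch_ch=128, skip_ch=128, num_outputs=3 * (5 + 8))
out = head(torch.randn(1, 128, 52, 52), torch.randn(1, 128, 104, 104))
print(out.shape)  # -> torch.Size([1, 39, 104, 104])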

3.2. SK-FMYOLOV3

In industrial images with relatively small defects, the conventional YOLOV3 often misses defects or detects them incorrectly. This is caused by misidentification due to an imbalanced confidence distribution. To enable the network to learn global features and autonomously raise the scores of small defects, the SkNet structure is embedded in the improved YOLOV3 network. This allows the network to select information at different scales. While the detection speed is maintained, the detection accuracy is improved, and the efficiency of real-time quality inspection in industry is improved.

Since the convolutional layers of YOLOV3 and the SkNet structure both contain convolution operations, SkNet can substitute for these layers directly. To maintain the original detection speed, starting from the 4th convolutional layer of YOLOV3, the subsequent convolutional layers were replaced with SkNet structures, replacing a total of 23 layers, as shown in Figure 5. This gives the network different receptive fields for feature maps of different sizes.

A feature map of size $W \times H \times C$ is passed into the convolution layer, where $W$ is the width, $H$ is the height, and $C$ is the number of channels. Different receptive fields are obtained through two convolution kernels of different sizes, and the two resulting feature maps are $\tilde{U}$ and $\hat{U}$. An add operation is performed on the two feature maps to obtain $U$, and then fuse and select operations are performed on $U$ to output the final feature map. The specific parameter configuration of the designed SK-YOLOV3 is provided in Table 2.

SeNet (Squeeze-and-Excitation Networks) and SkNet are network structures proposed by the same team, and both introduce attention to improve the global receptive field: SeNet adaptively selects channels, while SkNet adaptively selects convolution kernels. Therefore, this paper also designed an SE-YOLOV3 network model for experimental comparison.

4. Experiments

To accelerate the convergence of the network and avoid overfitting, 0.9 is used as the momentum constant, 0.0005 as the weight decay coefficient, and 0.0005 as the initial learning rate. The experimental environment is the Ubuntu 14.04 operating system with an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20 GHz processor and 16 GB of RAM, and the GPU is an NVIDIA Tesla K80 with 16 GB of video memory.
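
For reference, these constants correspond to an optimizer setup like the following (a hedged PyTorch sketch; the SGD choice and the placeholder model are assumptions, since the paper lists only the hyperparameter values).

import torch

model = torch.nn.Conv2d(3, 16, 3)   # placeholder for an SK-FMYOLOV3-like network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.0005,           # initial learning rate
    momentum=0.9,        # momentum (impulse) constant
    weight_decay=0.0005  # weight decay coefficient
)
print(optimizer)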

The evaluation indicators are precision and recall. Precision indicates, for each category, the proportion of samples predicted as positive that in fact belong to that category:
$$\text{Precision} = \frac{TP}{TP + FP}.$$

Here, 8 categories, “qualified test paper,” “unqualified test paper,” “Crooked,” “Stains,” “Marker pen,” “Burr,” “Short,” and “Peeling,” are used as detection targets, and predictions are made for each category. TP (True Positive) indicates the number of samples in which a defective target is correctly identified. FN (False Negative) indicates the number of samples in which a defective target is not identified. FP (False Positive) indicates the number of samples in which a defective target is incorrectly identified. Recall represents the ratio of the number of correctly detected targets to the total number of targets in the test set:
$$\text{Recall} = \frac{TP}{TP + FN}.$$

The denominator of recall, true positives plus false negatives, is the total number of ground-truth targets in the test set.
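
A minimal sketch of computing the two indicators from per-class counts is given below; the counts in the example are made up for illustration.

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example with made-up counts for one defect class
print(precision_recall(tp=945, fp=31, fn=55))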

5. Dataset

While the target objects in the images of large public data sets such as COCO [41] are relatively complete, there have been almost no studies of small defect classification for industrial products. Therefore, to verify the practicability of the algorithm, it was necessary to collect and create the data set manually.
(1) The urine test strip data were obtained mainly through high-definition camera shooting, supplemented by crawler technology. A total of 1562 urine test strip images were collected at a fixed resolution.
(2) The following methods were used for data enhancement of the original images: magnification (image width and height enlarged to 1.5 times), reduction (image width reduced to 0.3 times and height to 0.5 times, with the image size kept a multiple of 32), brightness enhancement and reduction, flipping (90° and 180°), and clipping. Finally, 11634 images were obtained.
(3) A labeling tool was used to mark the 8 kinds of urine test paper categories in the 1562 images. As shown in Figure 6, the images are labeled as “qualified test paper,” “unqualified test paper,” “Crooked,” “Stains,” “Marker pen,” “Burr,” “Short,” and “Peeling.” Specifically, every defect in an image is selected with a bounding box, and an XML file in the VOC format is obtained.
(4) The XML files are converted to txt files with the format ‘tag’ + ‘X’ + ‘Y’ + ‘W’ + ‘H’, and 6634 images are randomly selected as the training set and 5000 as the test set (a sketch of the augmentation and conversion steps is given after this list).
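
The sketch below illustrates a subset of the augmentations and the VOC-XML-to-txt conversion described above, using Pillow and the Python standard library; the exact tools, file paths, and brightness adjustment used by the authors are not specified, so these details are assumptions.

import xml.etree.ElementTree as ET
from PIL import Image

def augment(img_path):
    """Return a few augmented variants of one image (a subset of those listed above)."""
    img = Image.open(img_path)
    w, h = img.size
    return {
        "enlarged": img.resize((int(w * 1.5), int(h * 1.5))),
        "reduced": img.resize((max(32, int(w * 0.3) // 32 * 32),
                               max(32, int(h * 0.5) // 32 * 32))),  # multiple of 32
        "rot90": img.rotate(90, expand=True),
        "rot180": img.rotate(180, expand=True),
    }

def voc_xml_to_txt(xml_path):
    """Convert one VOC-format XML annotation into 'tag X Y W H' lines."""
    root = ET.parse(xml_path).getroot()
    lines = []
    for obj in root.iter("object"):
        tag = obj.find("name").text
        box = obj.find("bndbox")
        xmin, ymin = int(box.find("xmin").text), int(box.find("ymin").text)
        xmax, ymax = int(box.find("xmax").text), int(box.find("ymax").text)
        lines.append(f"{tag} {xmin} {ymin} {xmax - xmin} {ymax - ymin}")
    return "\n".join(lines)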

5.1. SK-FMYOLOV3 Convergence Verification

Based on the improved YOLOV3 structure, the SkNet structure is embedded, and the network is trained on the homemade urine test strip data set. After 300 training iterations on the GPU server, the results show that the model quickly converges to a stable state. During training, log information is collected for each iteration of the SK-FMYOLOV3 model. GIoU [42] is used as the loss of the detection task, and the objectness [43] is recorded during training, as are the val GIoU and val objectness of the validation set. GIoU takes into account the nonoverlapping areas that IoU ignores and can reflect the manner in which the predicted box and the ground truth overlap. The objectness value represents the probability that a target is present in the prediction box. The visualized logs show that, as the number of iterations increases, the loss function gradually converges within the first 200 iterations. The GIoU value of the training set and the validation set stabilizes at approximately 1.15, and the objectness value stabilizes at approximately 1.0, as shown in Figure 7.
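
For reference, the GIoU between a predicted box and a ground-truth box, both given in (x1, y1, x2, y2) form, can be computed as in the generic sketch below, which follows the definition in [42] and is not code taken from the paper.

def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    cw = max(ax2, bx2) - min(ax1, bx1)   # enclosing box width
    ch = max(ay2, by2) - min(ay1, by1)   # enclosing box height
    enclose = cw * ch                    # accounts for the nonoverlapping region
    return iou - (enclose - union) / enclose

print(giou((0, 0, 10, 10), (5, 5, 15, 15)))  # two partially overlapping boxes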

5.2. Impact of Different Improvement Strategies on the Prediction Box

The impact of each improvement strategy on the accuracy of the prediction box is evaluated, using the original YOLOV3 as the reference, as shown in Table 3. In Table 3, FYOLOV3 denotes YOLOV3 with the improved prediction box clustering algorithm, MYOLOV3 denotes YOLOV3 with the multiscale improvement, FMYOLOV3 denotes YOLOV3 with both the improved clustering algorithm and the multiscale improvement, and SK-FMYOLOV3 denotes FMYOLOV3 with the added SkNet structure.

Each improvement strategy used in this paper improves the performance of the original YOLOV3 detection network to varying degrees. Among these, the improvement of the prediction box clustering algorithm displays the most significant improvement in the model accuracy, and the average IOU has increased by nearly 6 percentage points. The improvement of the multiscale algorithm leads to the average IOU increase of nearly 4 percentage points. The improved clustering box prediction algorithm and multiscale algorithm based on YOLOV3 increased the average IOU by nearly 7 percentage points. Combining all of the improvement strategies, the final average IOU is improved by nearly 8 percentage points over the original YOLOV3 network.

5.3. Performance Evaluation

The precision rate (P) and recall rate (R) are used as evaluation indicators, and the same data set is used in the same experimental environment. The YOLOV3 variants with different improvement strategies are compared with R-CNN, fast R-CNN, and faster R-CNN. The test results for qualified and unqualified products are shown in Table 4.

As seen from the table, the precision and recall of the SK-FMYOLOV3 network are the highest. The reason is that the SkNet structure allows feature maps to select different convolution kernels for training, improving the scores of small features and, at the same time, the accuracy of classification, which makes the precision and recall of the network higher. The fastest network is YOLOV3, because the improved algorithm increases the number of layers in the network and therefore increases the recognition time. FYOLOV3 is better than MYOLOV3, because the added prediction box clustering algorithm avoids the impact of random initial points on the prediction result. MYOLOV3 is better than YOLOV3, because the annotations in the data set are small defects; after adding a small-scale feature fusion, the recall rate is improved, so that previously unrecognized defects are identified. SK-FMYOLOV3 achieves the highest recall and precision, because the autonomous selection of convolution kernels can be trained according to the sizes of different feature maps, improving the accuracy of classification. By embedding SkNet in the improved YOLOV3 structure, the precision rate is increased by 9 percentage points and the recall rate by 23 percentage points.

The precision and recall of the 8 classifications of the homemade urine test strip data set under SK-FMYOLOV3 are shown in Figure 8. In the first 300 iterations, a high precision is observed, but as the number of iterations increases further, overfitting occurs: the recall increases while the precision decreases. The test results of SK-FMYOLOV3 for the 8 categories are shown in Figure 9.

5.4. Comparison of Experimental Results

In the same experimental environment, the same number of iterations is used for training (epoch = 300). The test results of this method, faster R-CNN, and YOLOV3 on the homemade urine test strip data set are shown below.

The first is the detection of qualified products, as shown in Figure 10. The accuracy rate for qualified urine test paper is 0.99 with this method, 0.73 with YOLOV3, and 0.89 with faster R-CNN; thus, this method is more effective for the detection of qualified products.

Next is the defect detection of nonconforming products, as shown in Figure 11. For the detection of burrs, SK-FMYOLOV3 detected 11 burrs on the urine test strip. YOLOV3 pays less attention to small defects than to large ones; it therefore detected crooked defects on the urine test strips but no burrs. Faster R-CNN detected 7 burrs on the urine test strips. For the detection of crooked defects, as shown in Figure 11, all three algorithms were successful; because crooked defects are relatively large and their features are obvious, all three algorithms perform well. As also shown in Figure 11, all three algorithms detect marker defects well, since the marker is likewise a relatively obvious defect.

The detection of peeling defects is shown in Figure 12. SK-FMYOLOV3 and faster R-CNN can detect these defects, while YOLOV3 cannot. As shown in Figure 12, for the detection of short defects, all three algorithms perform well, but the accuracy of the algorithm in this paper is approximately 0.5, while those of the other two models are approximately 0.2. The detection of stain defects is also shown in Figure 12: all three algorithms detect them, with the accuracy of this algorithm and of faster R-CNN at approximately 0.5 and that of YOLOV3 at 0.45. It is important to note that, in addition to the individual defects, each urine test strip is also detected as a nonconforming product. As can be observed from this group of figures, the accuracy rate for unqualified products is approximately 0.99, showing that the method is suitable for the classification of qualified and unqualified industrial products.

6. Conclusions

To address the problem of detection accuracy for defective products in industrial production, this paper proposes the SK-FMYOLOV3 algorithm based on the YOLOV3 network. First, fuzzy C-means clustering is used to generate the initial cluster centers to avoid the influence of randomly initialized prediction boxes on the detection accuracy. Then, the original three-scale prediction is extended to four scales, making the algorithm more suitable for detecting smaller defects. Finally, the SkNet structure is merged in, so that the feature map selects the appropriate convolution kernel for training through the attention mechanism and the scores of defects that are difficult to identify are raised. With the proposed network structure on the homemade urine test strip data set, a detection precision rate of 96.8% and a recall rate of 94.5% were obtained for distinguishing qualified from unqualified urine test strips. The method can also accurately identify the 6 types of small defects in nonconforming products. In future research, we will consider applying the network structure to other industrial products in order to save human resources and improve production efficiency.

Data Availability

The [Urine dipstick dataset] data used to support the findings of this study were supplied by [Rui Yang] under license and so cannot be made freely available. Requests for access to these data should be made to [Rui Yang, 792481404@qq.com].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Nos. 61772149, U1701267, and 61762028), GUET Excellent Graduate Thesis Program (No. 18YJPYSS15), Guangxi Key Laboratory of Image and Graphic Intelligent Processing Project (No. GIIP2003), and Guangxi Science and Technology Project (Nos. AB20238013, ZY20198016, 2018GXNSFAA294127).