Abstract
Conventional defect detection algorithms perform poorly on magnetic tiles because defect contrast varies and defect features differ widely. This paper proposes an improved YOLOv5-based algorithm for detecting magnetic tile defects with varying features. First, the convolutional block attention module (CBAM) is incorporated into the feature extraction network of YOLOv5. By filtering and weighting the feature vectors, CBAM strengthens the network’s ability to learn features of the target region, so that network processing is dominated by the essential target characteristics. Second, a new loss function for the detection model is proposed according to the properties of magnetic tile images, which increases the confidence of the prediction boxes. Third, data augmentation is introduced to increase the number of data samples. Evaluation on a magnetic tile defect dataset shows that the precision of the proposed approach reaches 98.56%, which is 3.21% and 7.22% higher than that of the original YOLOv5 and Faster R-CNN, respectively, demonstrating the effectiveness and accuracy of the proposed method.
1. Introduction
In the field of computer vision, defect detection is a hot research topic. Magnetic tiles are magnets made of permanent ferrite materials, and they are widely used in industry as one of the essential components of permanent magnet motors. During the batch production and processing of magnetic tiles at the factory, products with various surface defects are inevitably produced. If these defective products are used in the production of permanent magnet motors, the performance of the motors is degraded, and the motors may fail catastrophically. Because the magnetic tile is a vital component of the motor, surface defects have a direct impact on its lifespan and operating conditions. The magnetic tile dataset used in this study comprises five distinct types of defects, namely, blowhole, crack, break, fray, and unevenness. Magnetic tile defect images are often plagued by nonuniform brightness, low contrast, irregular defect shapes, and complex textured backgrounds. Rapid and accurate detection of magnetic tile defects therefore remains a significant challenge in the field of visual detection. On the basis of machine vision, many researchers have used traditional algorithms to detect surface defects on magnetic tiles. Haar cascades [1], SIFT [2], and HOG [3] are widely used for feature extraction, and the multilayer perceptron (MLP) [4], support vector machines (SVM) [5], and AdaBoost [6] are used extensively for defect classification. In practice, however, the contrast of different defects varies, which makes it difficult to highlight the defect features and to design a corresponding feature extraction method; as a result, it is hard for traditional classification algorithms to learn the features of tile defects.
In recent years, researchers have proposed many convolutional neural network algorithms, such as SSD [7], Faster R-CNN [8], and fuzzy neural network [9, 10] for various applications, and in the field of magnetic tile defect detection, they have also obtained many results. Xie et al. [11] introduced a fusion feature CNN model for magnetic tile defect detection; it leverages multiple image augmentation techniques, such as rotation and flipping. Furthermore, the model integrates a novel focal loss function to deal with the problem of class imbalance. Ben Gharsallah and Ben Braiek [12] proposed a novel anisotropic diffusion filtering model. The model considers both gradient magnitude and local difference image feature, in contrast to conventional anisotropic diffusion models that only consider gradient magnitude data. Hu et al. [13] proposed a UPM-DenseNet two-stage detection model. A plug-and-play feature restoration module is additionally proposed to improve the neural network’s capacity to identify regions of interest for defects of various scales.
YOLOv5 is a typical representative of the YOLO series [14, 15] of convolutional neural networks. At the input side, it adopts adaptive anchor box computation to adapt to different datasets. In the backbone, it adopts the focus module to slice and sample the image so that more complete downsampled target feature information can be obtained. In the neck, it adopts a new CSP structure to enhance the feature fusion capability of the network. Its detection accuracy and real-time performance are greatly improved compared with YOLOv3 [16] and YOLOv4 [17].
However, traditional detection algorithms do not function well under uneven illumination and other adverse conditions. This paper therefore aims to develop an improved method that addresses the effectiveness of detection algorithms. The main contributions of the paper can be summarized as follows:
(1) An improved YOLOv5 algorithm is proposed, in which the convolutional block attention module (CBAM) [18] is introduced on the basis of the original algorithm to enrich the feature representation of the defect target, thereby enhancing the detection accuracy of the detection network
(2) A new loss function is proposed that aims to reduce the loss value of the network, thereby accelerating the convergence of the network and enhancing the detection accuracy of the entire detection network
(3) Multiscale detection is improved to overcome the issues of minute defects on the surface of magnetic tiles, complex appearance, inconspicuous boundary features, and complex background. A combination of data augmentation techniques is applied to the dataset to overcome the overfitting problem
(4) The proposed algorithm significantly reduces the missed detection rate compared to the original algorithm, and the detection accuracy is improved by 3.21% compared to the original network, resulting in efficient tile defect detection
The rest of the paper is organized as follows. Section 2 reviews related work on defect detection. Sections 3 and 4 present the original and the improved YOLOv5 models, respectively. Section 5 describes the datasets and covers the experiments and the analysis of the results. Finally, Section 6 draws conclusions and outlines future work.
2. Related Work
2.1. Traditional Methods
Traditional machine learning methods and classical image processing methods are usually modeled for specific problems and are suitable for relatively stable production environments. For example, Çelik et al. [19] proposed to first perform wavelet transform, dual-threshold binarization, and mathematical morphological operations on the fabric image to obtain the defect contour and then complete the defect classification using a gray-level co-occurrence matrix and a feedforward neural network. Saad et al. [20] proposed a multilevel threshold segmentation algorithm for semiconductor wafers to classify wafer defects. The method first maps the multichannel color defect image into a grayscale image with 256 gray levels. Then, this image is nonlinearly smoothed by median filtering. Finally, an improved multilevel threshold segmentation algorithm is applied to the smoothed image, and a better defect segmentation image is obtained.
Traditional machine learning methods usually involve feature selection and classifier construction. For example, Nguyen et al. [21] proposed to use support vector machine, random forest (RF), and K-nearest neighbor algorithms for defect classification in OLED component inspection. This method first designs approximate features of the defects. Principal component analysis and RF are then used for feature selection so that the most discriminative features are selected automatically. The method effectively improves the classification accuracy of true and false defects.
Among the aforementioned techniques, the traditional image processing approach has the advantages of minimal resource usage and fast computation, but some of its parameters must be chosen manually, and the algorithm must be designed for a particular type of image. Because of this limited applicability and low robustness, the internal settings of the algorithm must be changed whenever there are anomalies in the input image [22]. Traditional machine learning methods struggle to extract defect characteristics effectively because the background gradient of tile surface defect images changes greatly. Moreover, their classification performance remains poor, and they offer no advantage in the number of parameters. As a result, the deep learning approach is better suited to magnetic tile surface defect identification.
2.2. Detecting Methods Based on Deep Learning
As the field of defect detection is growing rapidly, several studies and implementations have used both machine learning and deep learning methods to identify defects. This section reviews the literature to clarify the mechanisms of these defect detection techniques.
Tabernik et al. [23] proposed a segmentation-based deep learning method for defect detection. This method requires only about 27 images for training, as opposed to general deep learning methods that require hundreds or thousands of images for training. This enables the application of deep learning methods in industrial settings with limited defect samples. Huang et al. [24] proposed a real-time model called MCuePush U-Net, which is comprised of three major components: MCue, U-Net, and Push network, for the saliency detection of surface defects. Cui et al. [25] proposed the SDDNet network for surface defect detection. Large texture variation and small defect size are primarily addressed by the introduction of feature retention blocks (FRB) and skip dense connection modules (SDCM).
Deep learning methods can achieve automatic feature extraction mainly by building multilayer convolutional neural network models. Since convolutional neural networks have numerous parameters, they are decisive for image feature extraction [26]. Deep learning methods should generally be designed with high classification accuracy, low resource consumption, and short inference time.
YOLOv5 is one of the best comprehensive algorithms in YOLO series [27]; this paper uses YOLOv5 as the base algorithm. However, due to the problems of irregular shapes of defects, uneven brightness, low contrast, and intricately textured backgrounds in the tiles, the original model has been observed to suffer from missed detection. The issue of low confidence in detecting magnetic tile defects may also appear. To address the issues, improvements to the original algorithm have been deemed particularly crucial.
3. YOLOv5s Network Architecture
The YOLOv5 model is composed mainly of the following components: input, backbone, neck, and prediction. The input consists primarily of three components: mosaic data augmentation, adaptive anchor box computation, and adaptive image scaling. Focus, the C3 module, and spatial pyramid pooling (SPP) [28] comprise the backbone. The focus structure slices the image into interleaved sub-images and stacks them along the channel dimension, which yields a twice-downsampled feature map that retains the pixel information of the input. C3 denotes a cross-stage partial (CSP) network [29] structure that includes three Conv blocks [30] and incorporates several residual components. The CSP structure is primarily used to address the issue of excessive computation in inference from a network structure design perspective. C3 further enhances the learning ability of the network by optimizing the gradient backpropagation path, while simultaneously reducing the computational cost and memory overhead. The SPP layer enlarges the receptive field and enhances the nonlinear expression capability of the network by performing maximum pooling after convolving the feature layer three times. The neck of YOLOv5 is an enhancement of the FPN structure designed to strengthen the feature fusion of the network and the propagation of inference information. The prediction part of the original network uses the generalized intersection over union (GIoU) loss [31] to estimate the regression loss of the bounding box of the detection target. The entire network structure of YOLOv5 is depicted in Figure 1. YOLOv5 offers different model sizes based on the width and depth of the network architecture, namely, YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The smaller models have fewer parameters and faster inference speed. To better meet the real-time requirements of defect detection, this paper selects the YOLOv5s model as the baseline model.
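The slicing performed by the focus structure can be illustrated with a minimal, standard-library sketch. It handles a single channel stored as nested lists; the function name `focus_slice` is hypothetical, and the convolution that follows the slicing in YOLOv5 is omitted.

```python
def focus_slice(channel):
    """Split one H x W channel (H and W even) into four H/2 x W/2 sub-channels.

    Every pixel of the input lands in exactly one sub-channel, so the spatial
    size is halved while no pixel information is discarded.
    """
    return [
        [row[0::2] for row in channel[0::2]],  # even rows, even columns
        [row[0::2] for row in channel[1::2]],  # odd rows, even columns
        [row[1::2] for row in channel[0::2]],  # even rows, odd columns
        [row[1::2] for row in channel[1::2]],  # odd rows, odd columns
    ]


# Toy 4 x 4 channel with values 0..15.
image = [[4 * r + c for c in range(4)] for r in range(4)]
sub_channels = focus_slice(image)
```

For a three-channel input, applying the same slicing to each channel would give twelve channels at half the spatial resolution.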

4. Improvement of YOLOv5s Network Architecture
Although YOLOv5 is one of the most efficient algorithms in the YOLO series, it may miss some defects and produce low-confidence predictions when detecting magnetic tile defects [32]. To address these issues, this paper proposes several modifications to the YOLOv5 algorithm. In particular, the convolutional block attention module (CBAM) is added to improve feature representation, the loss function is enhanced to address the problem of low confidence scores, and a small object detection layer is added to the detection layers to improve the detection of smaller defects. These changes are expected to improve the algorithm’s ability to detect magnetic tile defects accurately and effectively.
4.1. Incorporating the Attention Mechanism
To solve the problem of the low saliency of defect targets, the convolutional block attention module (CBAM), which combines channel and spatial attention, is introduced after the concat operation of the YOLOv5 network model. CBAM is a lightweight and general attention mechanism. Compared with the mainstream SE (squeeze-and-excitation) [33, 34] attention mechanism, CBAM adds spatial attention on top of channel attention: channel attention focuses on the semantic information of the input, and spatial attention focuses on the location information. The combination of the two yields a better feature representation. The structure of CBAM is shown in Figure 2.

In Figure 2, given the feature map $F \in \mathbb{R}^{C \times H \times W}$, $C$ is the number of channels of the feature map, and $H \times W$ is the size of the feature map. CBAM first feeds $F$ into the channel attention module, which obtains information about each channel via average and maximum pooling, passes the two pooled vectors through the shared multilayer perceptron (MLP), sums the results, and then activates them via the Sigmoid function to obtain the channel attention features; the calculation formula is shown in Equation (1):
$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(\mathrm{MLP}(F^{c}_{avg}) + \mathrm{MLP}(F^{c}_{max})\big) \tag{1}$$
In Equation (1), $M_c$ denotes channel attention in the convolutional block attention module, $\sigma$ denotes the Sigmoid function, MLP denotes the multilayer perceptron, $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ denote average and maximum pooling operations on the spatial information of the feature map, respectively, and $F^{c}_{avg}$ and $F^{c}_{max}$ denote the outputs of the global average and global maximum pooling operations of the channel attention mechanism, respectively.
As illustrated in Equation (2), after the two pooled feature maps are obtained and concatenated, a convolution operation is applied, and lastly, the Sigmoid function activates the result to obtain the output spatial attention features:
$$M_s(F) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) \tag{2}$$
In Equation (2), $M_s$ denotes spatial attention, $\sigma$ denotes the Sigmoid function, and $f^{7 \times 7}$ denotes a convolution operation using a $7 \times 7$ convolution kernel.
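The two attention steps described above can be sketched in plain Python on nested lists. This is an illustrative sketch rather than the authors' implementation: the weight arguments `W0`, `W1`, and `kernel` are hypothetical inputs, the ReLU inside the shared MLP follows the original CBAM design and is an assumption here, and the spatial kernel size is taken from the shape of `kernel` so that a small toy example can be used.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(F, W0, W1):
    """F: C x H x W nested lists. W0: (C/r) x C and W1: C x (C/r) shared MLP weights."""
    def shared_mlp(v):
        # Hidden layer with ReLU (assumption, as in the original CBAM design).
        hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in W0]
        return [sum(w * h for w, h in zip(row, hidden)) for row in W1]

    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in F]  # global average pooling
    mx = [max(map(max, ch)) for ch in F]                            # global max pooling
    return [sigmoid(a + m) for a, m in zip(shared_mlp(avg), shared_mlp(mx))]


def spatial_attention(F, kernel):
    """kernel: 2 x k x k weights over the [avg; max] maps (k odd, zero padding)."""
    C, H, W = len(F), len(F[0]), len(F[0][0])
    avg = [[sum(F[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)]
    mx = [[max(F[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)]
    maps = [avg, mx]
    k = len(kernel[0])
    pad = k // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for m in range(2):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < H and 0 <= x < W:
                            s += kernel[m][di][dj] * maps[m][y][x]
            out[i][j] = sigmoid(s)
    return out


def cbam(F, W0, W1, kernel):
    """Apply channel attention, then spatial attention, to the feature map F."""
    Mc = channel_attention(F, W0, W1)
    F1 = [[[Mc[c] * v for v in row] for row in F[c]] for c in range(len(F))]
    Ms = spatial_attention(F1, kernel)
    return [[[Ms[i][j] * v for j, v in enumerate(row)] for i, row in enumerate(ch)]
            for ch in F1]
```

In a real network the weights would be learned and the operations would run as tensor operations; the nested loops here only make the order of pooling, weighting, and activation explicit.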
4.2. Improvement of Loss Function
By calculating the difference between predicted and actual values, the loss function plays a crucial role in directing the neural network training process. Through its calculation, network parameters can be iteratively adjusted, accelerating the convergence toward improved model performance and increased prediction accuracy. The GIoU loss is used to calculate the bounding-box regression loss in the original YOLOv5 network, and its formula is provided in Equation (3):
$$L_{GIoU} = 1 - IoU + \frac{|C \setminus (B \cup B^{gt})|}{|C|} \tag{3}$$
In Equation (3), $C$ is the smallest enclosing rectangle of the prediction box $B$ and the ground-truth box $B^{gt}$, and $B \cup B^{gt}$ denotes the union of the two boxes. When one box completely contains the other, or when the two overlapping boxes are aligned so that their union is itself a rectangle, the penalty term vanishes and the loss function degenerates to the IoU (intersection over union) loss. At this point, the relative positions of the prediction and target boxes cannot be determined, resulting in erroneous target positioning and an incorrect convergence direction for the prediction box, which affects the ultimate detection precision.
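The degeneration of GIoU to IoU can be checked numerically with a minimal sketch. Boxes are assumed to be `(x1, y1, x2, y2)` tuples with positive area; the function names are illustrative and not taken from the YOLOv5 code base.

```python
def box_area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou_and_giou(pred, gt):
    """Return (IoU, GIoU) for two axis-aligned boxes with positive area."""
    # Intersection rectangle (zero area if the boxes do not overlap).
    inter = box_area((max(pred[0], gt[0]), max(pred[1], gt[1]),
                      min(pred[2], gt[2]), min(pred[3], gt[3])))
    union = box_area(pred) + box_area(gt) - inter
    iou = inter / union
    # Smallest rectangle enclosing both boxes.
    c_area = box_area((min(pred[0], gt[0]), min(pred[1], gt[1]),
                       max(pred[2], gt[2]), max(pred[3], gt[3])))
    giou = iou - (c_area - union) / c_area
    return iou, giou


def giou_loss(pred, gt):
    return 1.0 - iou_and_giou(pred, gt)[1]


# Nested boxes: the enclosing rectangle equals the union, so GIoU equals IoU.
nested_iou, nested_giou = iou_and_giou((2, 2, 4, 4), (0, 0, 10, 10))

# Disjoint boxes: IoU is zero, but GIoU still reflects how far apart the boxes are.
disjoint_loss = giou_loss((0, 0, 1, 1), (2, 0, 3, 1))
```

In the nested case the loss gives no signal about where inside the larger box the smaller one lies, which is the limitation described above.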
As shown in Equations (4) and (5), we construct a new loss function based on the method for forming loss functions described in the paper [35].
In Equation (4), $\alpha$ is a constant in the objective. In Equation (5), $P(B, B^{gt})$ denotes any penalty term computed based on the prediction box $B$ and the ground-truth box $B^{gt}$.
Equations (4) and (5) consider the distance between the prediction and target boxes, the overlap rate, and the scale effect in order to provide a fast convergence rate and minimal scattering during the training phase. As a result, they are employed as the loss function to calculate the loss between the target and prediction boxes more efficiently.
4.3. Improvement of Detection Layer
In YOLOv5, the terms “3-scale” and “4-scale” refer to the number of detection layers used in the network for detecting small, medium, and large targets. The feature maps extracted by the network contain information about the position and semantics of the targets. The shallow feature maps are larger in size and contain more position information, while the deep feature maps offer additional semantic information.
The original YOLOv5s model uses three detection layers, corresponding to feature maps of size $80 \times 80$, $40 \times 40$, and $20 \times 20$ when the input image is $640 \times 640$ [36]. These feature maps are obtained by downsampling the input image by factors of 8, 16, and 32, respectively. These detection layers are used to detect targets of different sizes.
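The relationship between the downsampling factors and the detection grid sizes is simple integer arithmetic. The sketch below assumes a square input of 640 pixels, the common YOLOv5 default; the helper name and the input size are assumptions for illustration.

```python
def detection_grid_sizes(input_size, strides=(8, 16, 32)):
    """Side length of each detection feature map for a square input image."""
    return [input_size // stride for stride in strides]


# Assumed 640 x 640 input with the three strides stated in the text.
grids = detection_grid_sizes(640)
cells_per_scale = [g * g for g in grids]  # number of grid cells at each scale
```

Shallower layers (smaller stride) produce larger grids, so each cell covers fewer input pixels and small targets are easier to localize.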
However, a dense map may contain defect targets of various sizes, including very small targets. To improve the detection of small targets, a small target detection layer is added to the YOLOv5 network, expanding the original 3-scale detection to 4-scale detection. This results in the addition of a fourth detection layer in the prediction part of the network. This approach enhances the learning ability of the shallow feature information and improves the detection of small target defects.
The original YOLOv5 neck network requires only two upsampling operations (whose outputs are concatenated with layer 4 and layer 6 of the backbone network, respectively), followed by downsampling. In contrast, the proposed algorithm incorporates a 32-fold downsampling detection layer, which provides an additional feature map scale for the detection layer; the entire network structure of the improved YOLOv5 is depicted in Figure 3. This results in four detection layers, and the network depth is increased to draw feature information from a deeper network, which enhances the learning ability of the model at different scales under crowded targets and therefore improves the detection performance of the model in dense scenarios [37].

5. Experiments and Analysis
5.1. Magnetic Tile Datasets
The tile surface defect images used in this paper are from the tile surface defect dataset published by the Institute of Automation, Chinese Academy of Sciences. The dataset covers five common defects on the tile surface (blowhole, crack, break, fray, and uneven) and provides a ground-truth image that marks each defect. As illustrated in Figure 4, we converted the ground-truth images into bounding boxes. There are 1344 images in total. The images in the dataset vary in size, and the targets typically lack a consistent shape. The image samples in Figure 4 are resized to a uniform size for better visualization.

A combination of data augmentation techniques was applied to the dataset in accordance with the varied characteristics of the defects in the original dataset [38, 39]. To avoid overfitting and increase the diversity of the dataset, the image data generator (ImageDataGenerator) of the Python Keras library is employed. A rotation transform with a rotation parameter of 25° is applied to the images. A width shift transform randomly shifts the image to the right or left, with a width shift parameter value of 0.25.
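The two augmentation parameters named above map directly onto keyword arguments of the Keras `ImageDataGenerator` class. The following is a minimal sketch, assuming the class is imported from `tensorflow.keras.preprocessing.image` (it is deprecated in recent TensorFlow releases); the names `AUGMENTATION_CONFIG` and `build_generator` are hypothetical, and any other generator settings the authors may have used are unknown.

```python
# Values taken from the text: rotation parameter of 25 degrees, width shift of 0.25.
AUGMENTATION_CONFIG = {
    "rotation_range": 25,       # random rotations within +/- 25 degrees
    "width_shift_range": 0.25,  # random horizontal shift, as a fraction of image width
}


def build_generator(config=AUGMENTATION_CONFIG):
    """Create a Keras ImageDataGenerator from the configuration above."""
    # Imported lazily so the configuration can be inspected without TensorFlow installed.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    return ImageDataGenerator(**config)
```

Note that geometric transforms of this kind also move the defects, so the bounding-box labels have to be transformed consistently; that step is outside the scope of this sketch.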
The augmented dataset contains a total of 4032 samples, and Table 1 summarizes the number of images in each category. Each category was randomly divided into training and validation sets in a 7 : 3 ratio, yielding 2820 training images and 1212 validation images.
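A per-category random split of this kind can be sketched as follows. The category names and counts in the example are placeholders, since the per-category numbers of Table 1 are not reproduced here, and the rounding rule (integer floor per category) is an assumption.

```python
import random


def split_by_category(samples_by_category, train_parts=7, val_parts=3, seed=0):
    """Randomly split each category into training and validation subsets."""
    rng = random.Random(seed)
    train, val = [], []
    for category, samples in samples_by_category.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        # Integer arithmetic avoids floating-point rounding surprises.
        cut = len(shuffled) * train_parts // (train_parts + val_parts)
        train += [(category, s) for s in shuffled[:cut]]
        val += [(category, s) for s in shuffled[cut:]]
    return train, val


# Placeholder data: image file names per defect category (hypothetical counts).
example = {
    "blowhole": [f"blowhole_{i}.jpg" for i in range(10)],
    "crack": [f"crack_{i}.jpg" for i in range(20)],
}
train_set, val_set = split_by_category(example)
```

Splitting within each category keeps the class proportions of the training and validation sets close to those of the full dataset.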
5.2. Experimental Platform
The experiments in this paper are carried out on the experimental platform illustrated in Table 2.
5.3. Training Parameter
During model training, the momentum factor is set to 0.937 to avoid becoming trapped in a local optimum or skipping over the optimal solution [40, 41]. The learning rate is set to 0.01, and, to prevent the network from overfitting during training, the weight decay regularization term is set to 0.0005. Finally, the model is trained for 300 epochs to obtain the final model weights.
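To make the roles of these three hyperparameters concrete, the sketch below performs one parameter update of stochastic gradient descent with momentum and L2 weight decay. The text does not name the optimizer, so SGD is an assumption here, and the function name is hypothetical.

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=0.01, momentum=0.937, weight_decay=0.0005):
    """One SGD update with momentum and L2 weight decay on flat lists of floats."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w   # weight decay adds a pull toward zero
        v = momentum * v + g       # momentum accumulates past gradients
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity


# Single parameter, single step, starting from zero velocity.
w1, v1 = sgd_momentum_step([1.0], [0.5], [0.0])
```

With a momentum of 0.937 most of the previous update direction is retained, which smooths the trajectory across noisy mini-batch gradients.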
5.4. Evaluation Indicators
Precision (P), recall (R), and mean average precision (mAP) are the main indicators for evaluating model performance. Precision quantifies the accuracy of the model’s detections. Recall measures the comprehensiveness of the model’s detections [42]. The average precision (AP) for a single category is the area enclosed by the precision-recall curve and the coordinate axes. The mAP value is calculated by averaging the AP values over all categories. The IoU (intersection over union) is a metric commonly used in object detection tasks to evaluate the accuracy of a bounding box prediction. It measures the overlap between the predicted bounding box and the ground-truth bounding box by computing the ratio of their intersection area to their union area. Generally, the mAP value is determined at an IoU threshold of 0.5, i.e., mAP@0.5. These quantities are defined in the following equations:
$$IoU = \frac{|B \cap B^{gt}|}{|B \cup B^{gt}|} \tag{6}$$
$$P = \frac{TP}{TP + FP} \tag{7}$$
$$R = \frac{TP}{TP + FN} \tag{8}$$
$$AP = \int_{0}^{1} P(R)\, dR \tag{9}$$
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{10}$$
In Equation (6), $B$ and $B^{gt}$ denote the prediction and ground-truth boxes, respectively, while the numerator denotes the intersection of the two boxes and the denominator denotes their union. In Equations (7) and (8), true positive (TP) refers to the prediction of a positive target as positive, false positive (FP) refers to the inaccurate prediction of a negative target as positive, and false negative (FN) refers to the incorrect prediction of a positive target as negative. Equation (9) is the area under the smoothed precision-recall curve, obtained by integration. In Equation (10), $N$ is the number of categories.
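The evaluation indicators can be sketched in a few lines of Python. The AP routine below uses all-point interpolation of the precision-recall curve, which is one common way to smooth the curve and is an assumption here; detection frameworks differ in the exact interpolation scheme. All function names are illustrative.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Area under the smoothed P-R curve; recalls must be in ascending order."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Smooth the curve: each precision becomes the maximum precision to its right.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_category):
    """Average of the per-category AP values."""
    return sum(ap_per_category) / len(ap_per_category)


# Toy example: 8 correct detections, 2 false alarms, 2 missed defects.
p, r = precision_recall(tp=8, fp=2, fn=2)
ap = average_precision([0.5, 1.0], [1.0, 0.5])
```

For mAP@0.5, a detection counts as a true positive only if its IoU with a ground-truth box of the same category is at least 0.5.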
5.5. Experimental Results and Analysis
Experiments on the dataset were conducted using the proposed YOLOv5 model. The P-R (precision-recall) curves derived from the experiments are given in Figure 5; the closer the P-R curve is to the coordinate position (1, 1), the better the algorithm performs [43]. Table 3 summarizes the performance of the algorithm as determined by the experiments. As shown in Table 3, the precision, recall, and mAP of the proposed algorithm are 98.56%, 98.14%, and 99.49%, respectively. The experimental findings demonstrate that the algorithm described in this paper detects defects in magnetic tile images more accurately.

In magnetic tile defect detection, image brightness may differ under various lighting conditions, which can significantly affect the detection accuracy. Figure 6 illustrates this by showing the inference results for magnetic tile defects under various exposure scenarios. The figure shows that the proposed algorithm is robust in detecting defects under different exposures; most confidence scores reach 0.9.

Figure 6, panels (a) to (d): inference results for magnetic tile defects under various exposure scenarios.
5.6. Ablation Experiments
The proposed algorithm adds a detection layer, introduces CBAM, and optimizes the loss function within the original YOLOv5 detection framework. To validate the role of each component, ablation experiments were conducted, as indicated in Table 4.
The first row of Table 4 contains the detection results for the original YOLOv5 network (with the GIoU loss function). When the small target detection layer is added, the detection precision increases by 2.05% in comparison to the original network; when the loss function is changed, the detection precision increases by 1.42% in comparison to the original network; and when the CBAM is added to the original network, the feature expression ability for defect targets in a complex environment is improved, resulting in a 2.78% increase in detection precision and a 1.61% increase in recall. The algorithm presented in this paper combines the advantages of each module and achieves a detection precision of 98.56% while maintaining real-time speed. The mAP value of the original YOLOv5s is 95.35%, while the proposed algorithm achieves a mAP value of 99.49%.
5.7. Comparison with Other Algorithms
To assess how the proposed algorithm performs relative to others, this paper compares it with commonly used defect detection algorithms such as Faster R-CNN and SSD, using evaluation metrics such as mAP and FPS. The results of this comparison are provided in Table 5. The Faster R-CNN model is at a disadvantage in terms of running speed, mainly because it is a two-stage model with a deeper forward path. Its forward propagation therefore consumes more time than that of the one-stage models, resulting in a lower FPS. Compared with YOLOv3 and YOLOv4, the original YOLOv5 clearly improves both mAP and running speed. The proposed algorithm outperforms the original YOLOv5 by 4.14% with a mAP of 99.49%. The additional detection layer and the CBAM attention module make the detection speed marginally slower, but the algorithm still meets real-time detection requirements.
5.8. Comparison with Original YOLOv5
The proposed algorithm is compared to the original YOLOv5 algorithm, and the results are displayed in Figure 7. As illustrated in Figure 7(a), the precision of the improved YOLOv5 eventually stabilizes at around 0.98; the oscillations of the curve are reduced, and convergence is accelerated. In Figure 7(c), following enhanced feature learning via the attention mechanism, a greater weight is placed on regions containing targets during network learning, resulting in an improved recall rate compared to the original algorithm. Figure 7(b) shows that, after the loss function is optimized, the loss value of the network decreases; it starts at approximately 0.014 and eventually stabilizes at 0.003. The training performance of the detection model proposed in this paper is thus significantly improved. In Figure 7(d), the mAP@0.5 curve of the proposed algorithm stabilizes at around 200 epochs, much faster than that of the original algorithm.

Figure 7: training curves of the original and improved YOLOv5. (a) Precision. (b) Loss. (c) Recall. (d) mAP@0.5.
The inference results show that the original algorithm misses the defects in the upper portion of Figure 8(a). The proposed algorithm detects the target accurately and reduces missed detections, while improving the overall confidence of inference.

Figure 8, panels (a) to (j): inference results of the original and proposed algorithms.
5.9. Discussion
The method above can also be used in other industrial scenarios with similar defect characteristics. The new loss function accounts for how the aspect ratio of the detection box and the distance between center points affect the evaluation outcome, which effectively improves the precision. CBAM enhances the representation of important features and reduces the interference of redundant information. The improved detection layer increases the capacity to detect small defect targets. However, because the number of defect images in the magnetic tile defect dataset is limited, the performance of the algorithm may be compromised in real industrial scenarios.
In the future, we will develop high-quality datasets and augmentation strategies and improve the algorithm’s multiobjective detection performance. The field of deep learning offers many other outstanding network structures, and we plan to apply them to the algorithm described in this paper in future research. Additionally, the proposed algorithm can be deployed on edge devices, which enables on-site defect detection and reduces the need to transport magnetic tiles to a central location for inspection.
6. Conclusion
Because the uneven brightness and poor contrast of magnetic tile defect images render classic defect identification algorithms ineffective, this paper presents an enhanced YOLOv5 algorithm for detecting defects on the surface of magnetic tiles. The algorithm retains more information on small-target and low-contrast defects in the convolution layers and, to a certain extent, reduces missed and incorrect detections. The experimental results indicate that the algorithm significantly increases the accuracy of tile surface defect detection and significantly reduces the false detection and missed detection rates. Compared with Faster R-CNN, the proposed algorithm has an 8.15% higher mAP value, and compared with the original YOLOv5, the mAP value has increased by 4.14%.
Although the CBAM and the extra detection layer make the detection speed slightly slower than that of the original YOLOv5, the proposed algorithm meets the requirement of real-time detection, as its detection frame rate reaches 55.3 frames per second. A key concern for our follow-up research is how to reduce the training burden and optimize the network structure while retaining network performance.
Data Availability
The datasets used to support the findings of this study have been deposited on GitHub (https://github.com/abin24/Magnetic-tile-defect-datasets.).
Conflicts of Interest
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Acknowledgments
This work was supported in part by key research projects of the Natural Science in Colleges and Universities of Anhui Province (2022AH051753).