Abstract
Aiming to solve the problems of poor self-adaptive ability and poor generalization ability of traditional classification methods under big data, a parameter-optimized Convolutional Neural Network (CNN) based on the Sparrow Search Algorithm (SSA) is proposed in this research. Initially, the raw data, a series of bearing vibration signals, are processed with the Fast Fourier Transform (FFT) and the Continuous Wavelet Transform (CWT) to obtain groups of time-frequency maps. Then, Locally Linear Embedding (LLE) and linear normalization are introduced to make these maps suitable as input to the CNN. Next, the preprocessed data sets are utilized as training and testing samples for the CNN, and the accuracy rate of the testing is taken as the fitness of SSA, which searches for the optimal parameter combination of the CNN. Meanwhile, the structure of the CNN is determined by experience and previous research. Finally, a CNN-based fault diagnosis model for bearings is constructed after SSA has determined the appropriate parameters. The model's accuracy rate reaches 99.4 percent after repeated testing with the samples, which is significantly superior to the classic fault detection approaches and to fault diagnosis methods based solely on shallow networks. This experimental result demonstrates that the suggested strategy can significantly increase the model's self-adaptive feature extraction capacity and accuracy rate, implying higher performance in fault diagnosis in the presence of huge data.
1. Introduction
The scale of the modern manufacturing industrial system is becoming larger and larger, which adds uncertainty to the manufacturing process and makes it more complicated to control the condition of various production equipment. In order to ensure the safety and product quality of manufacturing, fault detection based on big data has become one of the hotspots in the field of process control in the era of intelligent manufacturing. For instance, Caggiano et al. [1] created a machine learning system for online fault recognition via automatic image processing to quickly identify material defects caused by process nonconformities in metal powder Selective Laser Melting (SLM). Liu et al. [2] systematically summarized MVCMFD-MTs in order to provide academics and engineers with a theoretical foundation and roadmap for further research or development of MVCMFD-MTs based on machined surface texture information. Xu and Yao [3] described how a laser line scanning sensor was integrated into a robot-based laser-aided additive manufacturing (LAAM) system to allow for on-machine measurement of part geometry.
Bearings are widely used as core components of modern machinery in various rotating machines related to the power supply, automobile, and military industries. However, because these apparatuses are typically used in harsh environments with high temperature, high moisture, and overload [4], bearing failure is unavoidable. The majority of shaft wear is difficult to detect; it will not be noticed unless the machine exhibits a large jumping range, irregular noise, or an abnormal temperature. When these phenomena are discovered, the majority of the rolling bearings have already worn out, resulting in machine shutdown, which will almost certainly cause economic losses and accidents [5, 6]. Bearing condition monitoring and fault diagnostics can thus be used to maintain the equipment, extend its service life, and avoid potentially dangerous mishaps. With auxiliary condition monitoring, it is easier to obtain parameters such as vibration, noise, and temperature and then apply appropriate fault diagnosis methods to diagnose faults. Existing defect detection algorithms fall into two broad categories: approaches based on signal analysis and methods based on machine learning [7]. Duan et al. [8] suggested a time-frequency Kurtosis Spectrum-based bearing fault detection technique to reliably extract the characteristic frequency of rolling bearing damage, which uses the Slice Wavelet Transform to decompose the time-frequency vibration signal and obtain the kurtosis of the amplitude of each frequency component. The kurtosis sequence is used to construct the time-frequency Kurtosis Spectrum of the signal. The corresponding frequency bands are determined according to the frequencies of the larger peaks in the time-frequency Kurtosis Spectrum, and the time-frequency slices are selected in the time-frequency space. Then, the signal components are separated by reconstruction, and the envelope of the reconstructed signals is obtained by Envelope Demodulation. On this basis, the characteristic frequency of the rolling bearing is determined from the equivalent power spectrum of the envelope signal. Li [9] developed a technique for diagnosing bearing faults based on a VMD-based bispectrum. VMD simplifies complicated nonstationary vibration signals by decomposing them into a series of Intrinsic Mode Functions (IMFs). VMD detects the IMFs at their center frequencies using the Alternating Direction Method of Multipliers (ADMM). Bispectrum analysis can be used to determine the presence of phase-coupling effects. The bispectrum is immune to Gaussian and non-Gaussian noise and is hence ideal for identifying local rolling bearing problems. When condition monitoring systems are used to collect real-time data from the equipment, massive amounts of data are accumulated over long-term operation, turning machinery health monitoring into a big data problem. These kinds of big data have features such as large volume, diversity, and high velocity. Traditional methods based on signal analysis have limitations in efficiency, accuracy, and the capability to deal with such massive data. Therefore, how to extract fault features both accurately and efficiently has become a hotspot for many researchers [10, 11].
Machine learning-based approaches that employ powerful tools such as Backpropagation Neural Networks (BPNN), Probabilistic Neural Networks (PNN), and Extreme Learning Machines (ELM) are attracting an increasing number of people to apply them to defect diagnosis. Deep learning is a hot topic among researchers in the machine learning and pattern recognition fields. It has had tremendous success in a wide range of fields, including speech recognition, computer vision, and natural language processing [11]. Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Deep Belief Networks (DBN) are critical components of big data solutions because they can extract useful knowledge from complex systems. In contrast to deep learning, traditional learning methods are thought to use a shallow structured learning architecture. Deep learning refers to machine learning approaches that use supervised or unsupervised schemes to automatically learn hierarchical representations for classification [12]. The application of CNN to image classification has recently achieved great success, and CNN is now one of the most effective methods for detection and identification [13]. Liu et al. [14] proposed a method based on CNN and the time-frequency diagram obtained by applying the Fourier Transform (FT) to each frame of the vibration signal; the time-frequency diagrams are constructed using the Short-Time Fourier Transform (STFT). The time-frequency diagram of the training signal is used as the input of the convolutional neural network to train the network, and rolling bearing faults are identified after the time-frequency diagram of the test signal is fed into the network model; the resulting fault recognition rate is at least 97.64%. As some researchers have already demonstrated, the network structure and training parameters, such as the number of epochs, the minibatch size, and the initial learning rate, have a considerable impact on test accuracy [15, 16]. In order to obtain a better recognition rate, this research uses the Sparrow Search Algorithm (SSA) to optimize the training parameters. SSA is an intelligent optimization algorithm introduced by Xue [33] in 2020, which, compared with traditional optimization algorithms, has the advantages of addressing more general problems and better preventing the search from falling into local optima. Wang et al. [17] estimated the potential threat of an aerial target depending on the commander's emotional state as determined by an SSA-BP model. In their research, they proposed a system for predicting potential threats while taking commander emotion into account (PTP-CE) by combining a Bidirectional LSTM (BiLSTM) network with a Backpropagation Neural Network (BP) optimized with the Sparrow Search Algorithm (SSA). The results show that the prediction accuracy of the SSA-BP is higher than that of the Genetic Algorithm-based Backpropagation Neural Network (GA-BP), the BP, and the General Regression Neural Network (GRNN), which indicates that SSA performs more robustly when solving global optimization problems. Experiments show that the method proposed in this research can effectively improve the identification accuracy of CNN and overcome the shortcomings of traditional methods under big data. This research employs SSA, a novel swarm optimization technique inspired by the foraging and antipredatory behavior of sparrows, to optimize the parameter combination of the CNN, demonstrating the method's application potential and workability.
Moreover, it also establishes a practical and effective method for bearing diagnosis in a big data situation, which can achieve a higher test accuracy rate through a few SSA iterations than other traditional or nonoptimized methods.
The involved theories are introduced in Section 2. Section 3 describes the construction of the bearing fault diagnosis model based on the parameter-optimized CNN. The case studied in this paper is discussed in Section 4. Section 5 concludes the paper.
2. Theory
2.1. Theory of Continuous Wavelet Transform (CWT)
The CWT is an alternative method of transforming the original signal into another domain for analysis and processing [18, 19]. The foundation of the CWT [20, 21] is a family of functions

$$\psi_{\sigma,\tau}(t) = \frac{1}{\sqrt{\sigma}}\,\psi\!\left(\frac{t-\tau}{\sigma}\right), \quad \sigma > 0,\ \tau \in \mathbb{R},$$

where $\psi$ is a fixed function, called the "mother wavelet," which is localized in both time and frequency. The function $\psi_{\sigma,\tau}$ is produced from the mother wavelet by dilation ($\sigma$, the scale) and translation ($\tau$, the time shift). The mother wavelet utilized in this research is the Daubechies wavelet (dbN), which has no explicit expression (except for N = 1, which is the Haar wavelet), but the square modulus of its transfer function h is explicit [22].
The CWT is defined as the inner product of the signal $x(t)$ and the wavelet family $\psi_{\sigma,\tau}(t)$; that is,

$$W_{x}(\sigma,\tau) = \langle x, \psi_{\sigma,\tau}\rangle = \frac{1}{\sqrt{\sigma}}\int_{-\infty}^{+\infty} x(t)\,\psi^{*}\!\left(\frac{t-\tau}{\sigma}\right)\mathrm{d}t,$$

where $\psi^{*}$ is the complex conjugate of $\psi$ and $W_{x}(\sigma,\tau)$ is the time-scale map. The mother wavelet $\psi$ satisfies the normalization condition

$$\int_{-\infty}^{+\infty} |\psi(t)|^{2}\,\mathrm{d}t = 1.$$
In order to ensure the existence of the inverse wavelet transform, the mother wavelet should satisfy the admissibility condition

$$C_{\psi} = \int_{0}^{+\infty} \frac{|\hat{\psi}(\omega)|^{2}}{\omega}\,\mathrm{d}\omega < \infty,$$

where $\hat{\psi}(\omega)$ is the Fourier Transform of $\psi(t)$ and $C_{\psi}$ is a constant depending on the wavelet $\psi$.
The integrand in the admissibility condition has an integrable singularity at $\omega = 0$, which implies that $\hat{\psi}(0) = 0$, i.e., $\int_{-\infty}^{+\infty}\psi(t)\,\mathrm{d}t = 0$. Then, the inverse wavelet transform can be obtained by

$$x(t) = \frac{1}{C_{\psi}}\int_{0}^{+\infty}\!\!\int_{-\infty}^{+\infty} W_{x}(\sigma,\tau)\,\psi_{\sigma,\tau}(t)\,\frac{\mathrm{d}\tau\,\mathrm{d}\sigma}{\sigma^{2}}.$$
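For concreteness, the following is a minimal Python sketch of how a time-frequency map can be computed from one vibration segment with PyWavelets. It is not the authors' code: PyWavelets' cwt() does not support the Daubechies wavelets used in the paper, so the Morlet wavelet is used as a stand-in, and the signal, scales, and sampling settings are illustrative.

```python
# Minimal sketch of the time-frequency map step, not the authors' code.
# PyWavelets' cwt() does not implement Daubechies wavelets (used in the paper),
# so the Morlet wavelet is used here as a stand-in; scales are illustrative.
import numpy as np
import pywt

fs = 48000                      # sampling frequency of the CWRU drive-end data (Hz)
t = np.arange(1000) / fs        # one 1000-point vibration segment (here: synthetic)
x = np.sin(2 * np.pi * 1600 * t) + 0.3 * np.random.randn(t.size)

scales = np.arange(1, 65)                       # 64 scales -> 64 rows in the map
coeffs, freqs = pywt.cwt(x, scales, 'morl', sampling_period=1 / fs)
tf_map = np.abs(coeffs)                         # |W_x(sigma, tau)| time-frequency map
print(tf_map.shape)                             # (64, 1000)
```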
2.2. Theory of Manifold Learning
Manifold learning is a form of unsupervised learning. Compared to standard linear dimensionality reduction approaches, manifold learning successfully uncovers the underlying dimensionality of nonlinear high-dimensional data, hence facilitating dimensionality reduction and data analysis [23, 24]. Manifold learning algorithms may be classified into linear and nonlinear methods. Isomap [25], Laplacian Eigenmaps (LE) [26], and Locally Linear Embedding (LLE) [27] are all examples of nonlinear manifold learning algorithms, while typical linear methods include Principal Component Analysis (PCA) and Multidimensional Scaling (MDS). The LLE approach is used in this study.
The main idea of the LLE algorithm is to preserve the local neighborhood relationships of the data between the original space and the embedded space. Given a data set $X = \{x_1, x_2, \ldots, x_N\}$ in the high-dimensional space $\mathbb{R}^{D}$, the goal is to find a low-dimensional data set $Y = \{y_1, y_2, \ldots, y_N\}$ in the space $\mathbb{R}^{d}$ with $d < D$.
Firstly, a number of adjacent points $x_{ij}$ of every point $x_{i}$ in the data set should be found; these can be taken as the data points inside the sphere that surrounds $x_{i}$ with $\varepsilon$ as the radius, or simply as the $k$ nearest neighbors of $x_{i}$.
Every data point $x_{i}$ can be expressed by a linear combination of its adjacent points $x_{ij}$:

$$x_{i} \approx \sum_{j=1}^{k} w_{ij}\, x_{ij},$$

where $w_{i}$ is a column vector of reconstruction weights and $w_{ij}$ is its $j$-th entry.
Then, the weights are determined by minimizing the following loss function:

$$\varepsilon(W) = \sum_{i=1}^{N}\left\| x_{i} - \sum_{j=1}^{k} w_{ij}\, x_{ij} \right\|^{2}, \qquad \text{s.t.}\ \sum_{j=1}^{k} w_{ij} = 1.$$
The weight coefficients can be obtained by solving the above problem in closed form:

$$w_{ij} = \frac{\sum_{m}\left(C_{i}^{-1}\right)_{jm}}{\sum_{l,m}\left(C_{i}^{-1}\right)_{lm}},$$

where $C_{i}$ is the local covariance matrix of the neighbors of $x_{i}$, with entries $(C_{i})_{jm} = (x_{i} - x_{ij})^{T}(x_{i} - x_{im})$.
Next, it is assumed that, after reducing the original data from the $D$-dimensional point $x_{i}$ to the $d$-dimensional point $y_{i}$, each point can still be expressed as the same linear combination of its adjacent points $y_{ij}$, with the combination coefficients $w_{ij}$ unchanged. Once again, a loss function is minimized:

$$\Phi(Y) = \sum_{i=1}^{N}\left\| y_{i} - \sum_{j=1}^{k} w_{ij}\, y_{ij} \right\|^{2}.$$
Eventually, the data $Y = \{y_{1}, \ldots, y_{N}\}$ in the low-dimensional space are obtained after dimensionality reduction; in practice, they are given by the eigenvectors of $M = (I - W)^{T}(I - W)$ associated with its smallest nonzero eigenvalues.
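As an illustration of the weight-solving step described above, the following NumPy sketch computes the LLE reconstruction weights of a single point from its neighbors. It is a sketch only, not the authors' code; the lle_weights helper and its regularization term are assumptions.

```python
# Minimal NumPy sketch of the LLE weight step described above (not the authors' code).
# For one point x_i with neighbors Z (k x D), solve C w = 1 and normalize so sum(w) = 1.
import numpy as np

def lle_weights(x_i, neighbors, reg=1e-3):
    """Reconstruction weights of x_i from its k neighbors (rows of `neighbors`)."""
    Z = neighbors - x_i                     # shift neighbors so x_i is the origin
    C = Z @ Z.T                             # local covariance (Gram) matrix, k x k
    C += reg * np.trace(C) * np.eye(len(C)) # regularize in case C is singular
    w = np.linalg.solve(C, np.ones(len(C))) # solve C w = 1
    return w / w.sum()                      # enforce sum_j w_ij = 1

# toy usage: 5 neighbors in 3-D space
rng = np.random.default_rng(0)
x = rng.standard_normal(3)
nbrs = x + 0.1 * rng.standard_normal((5, 3))
print(lle_weights(x, nbrs))                 # weights sum to 1
```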
2.3. Theory of Convolutional Neural Network (CNN)
CNNs, which are deep feed-forward artificial neural networks well suited to extracting structural characteristics from two-dimensional signals, can help with both feature learning and time-frequency map recognition. It is also the most commonly employed deep learning model for large-scale image classification and recognition [28, 29]. Figure 1 depicts the general layout of the CNN.

The input of a convolutional layer is an image of size $n \times n$. The convolutional layer consists of filters of size $k \times k$ with $k < n$, so each filter has fewer dimensions than the input image. With a stride of 1, its output is a series of feature maps of size $(n - k + 1) \times (n - k + 1)$. Each filter assigns weights to the pixels in its receptive field and computes a weighted sum, thereby extracting certain features contained in the image. Then, an additive bias is added to the weighted sum, and the pixels of the convolved map are obtained by passing the result through a nonlinear function. Recently, the rectified linear unit (ReLU) has gained considerable popularity as a nonlinear activation function [30]; it is utilized in the convolutional layers of this research and is defined as follows:

$$f(x) = \max(0, x).$$
The output of a certain feature map in the convolutional layer is given as

$$x_{j}^{l} = f\!\left(\sum_{i \in M_{j}} x_{i}^{l-1} * k_{ij}^{l} + b_{j}^{l}\right),$$

where $f$ is the nonlinear activation function; $b_{j}^{l}$ is the scalar bias for the $l$-th layer; $M_{j}$ indexes the selected feature maps in the $(l-1)$-th layer that are added up to form the $j$-th feature map of the $l$-th layer; $*$ indicates the convolutional operator that convolves the activations of the preceding layer; and $k_{ij}^{l}$ is the filter used to perform the convolutional operation.
Then, this is followed by a pooling layer. Each feature map is subjected to region-wise pooling to retain the main features while reducing the number of parameters and the amount of computation, which helps prevent overfitting. The pooling process used here is average pooling; that is, nonoverlapping 2 × 2 regions of the preceding layer's output are replaced by their average value, so the output of the pooling layer has a reduced dimension. After downsampling the feature map of layer $l-1$, the output is given as

$$x_{j}^{l} = \operatorname{down}\!\left(x_{j}^{l-1}, n\right),$$

where $\operatorname{down}(\cdot, n)$ is the downsampling function with factor $n$ (here $n = 2$) and $x_{j}^{l-1}$ is the convolved feature map to be downsampled.
The last layer is a fully connected layer, and its output is given by

$$y = \operatorname{softmax}\!\left(W^{fc} x^{fc} + b^{fc}\right),$$

where $b^{fc}$ is the bias of the output layer; $W^{fc}$ is the weight matrix between the input and output layers of the fully connected layers; $x^{fc}$ is the feature vector fed to the fully connected input layer; and $\operatorname{softmax}(\cdot)$ is the softmax function [30]. The training method applied in this research is stochastic gradient descent (SGD). Every time a datum is read, the stochastic gradient descent algorithm immediately computes the gradient of the cost function to update the parameters. The gradient is calculated using the backpropagation method [31]. All filter weights and biases are updated according to the objective function of each input sample until the best representation of the training samples is obtained. The cost function employed here is the cross entropy, which is defined as

$$L(W, b) = -\frac{1}{m}\sum_{i=1}^{m} \sum_{c} y_{c}^{(i)} \log \hat{y}_{c}^{(i)},$$

where $m$ is the number of training samples; $\hat{y}^{(i)}$ is the value predicted with parameters $W$ and $b$; $y^{(i)}$ is the label of the original training sample, which acts as the standard answer; and $(i)$ denotes the $i$-th sample.
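To make the convolution, ReLU, and average-pooling operations above concrete, the following NumPy sketch runs one filter over a toy image and pools the result (illustrative shapes and values; not the authors' code).

```python
# Minimal NumPy sketch of the conv -> ReLU -> 2x2 average-pooling step above
# (illustrative only; shapes and filter values are arbitrary).
import numpy as np

def conv2d_valid(img, kern, bias=0.0):
    n, k = img.shape[0], kern.shape[0]
    out = np.zeros((n - k + 1, n - k + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(img[r:r + k, c:c + k] * kern) + bias
    return np.maximum(out, 0.0)            # ReLU: f(x) = max(0, x)

def avg_pool2x2(fmap):
    h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    f = fmap[:h, :w]
    return f.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

img = np.random.rand(8, 8)                 # one 8x8 input "image"
kern = np.random.randn(3, 3)               # one 3x3 filter
fmap = conv2d_valid(img, kern, bias=0.1)   # 6x6 feature map
print(avg_pool2x2(fmap).shape)             # (3, 3)
```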
2.4. Theory of Sparrow Search Algorithm (SSA)
SSA, introduced in 2020, is primarily inspired by the foraging and antipredator behavior of sparrows. The technique is new and benefits from fast search and convergence. The fundamental assumptions of the algorithm are as follows [32]. When foraging, sparrows are classified as either discoverers or followers. The discoverers are in charge of finding food for the population and establishing foraging areas and routes for the entire sparrow population, while the followers rely on the discoverers for food. Each sparrow therefore forages with one of two strategies, as a discoverer or as a follower. Individuals in the population monitor one another's behavior, and followers may compete with high-intake discoverers for food resources to increase their own predation rate. Furthermore, when the sparrow population becomes aware of a threat, antipredator behavior develops.
In the $d$-dimensional solution space, the location of each sparrow is $X_{i} = (x_{i,1}, x_{i,2}, \ldots, x_{i,d})$, and its fitness value is $f_{i}$. There are $n$ sparrows in the group. In each generation, the sparrows with the best positions in the population are selected as discoverers, while the remaining sparrows act as followers.
The location update formula for the discoverers in each generation is as follows:

$$x_{i,j}^{t+1} = \begin{cases} x_{i,j}^{t}\cdot \exp\!\left(\dfrac{-i}{\alpha\, T_{\max}}\right), & R_{2} < ST,\\[2mm] x_{i,j}^{t} + Q\cdot L, & R_{2} \geq ST, \end{cases}$$

where $x_{i,j}^{t}$ represents the $j$-th dimension of the position of the $i$-th individual in the $t$-th generation; $T_{\max}$ is the maximum number of iterations; $\alpha$ is a uniform random number in $(0, 1]$; $Q$ is a random number drawn from the standard normal distribution; $L$ is a $1 \times d$ matrix of ones; $R_{2}$ is a uniform random number in $[0, 1]$; and $ST$ is the alarm (safety) threshold, whose value range is $[0.5, 1]$.
The position update formula for the followers is given as

$$x_{i,j}^{t+1} = \begin{cases} Q\cdot \exp\!\left(\dfrac{x_{\mathrm{worst},j}^{t} - x_{i,j}^{t}}{i^{2}}\right), & i > n/2,\\[2mm] x_{P,j}^{t+1} + \left|x_{i,j}^{t} - x_{P,j}^{t+1}\right|\cdot A^{+}\cdot L, & \text{otherwise}, \end{cases}$$

where $x_{\mathrm{worst}}^{t}$ is the worst position in the current population, $x_{P}^{t+1}$ is the best position in the current population (occupied by a discoverer), and $A$ is a $1\times d$ matrix whose elements are randomly assigned $1$ or $-1$, with $A^{+} = A^{T}(AA^{T})^{-1}$.
While sparrows are foraging, some of them remain vigilant. When danger comes near, they will give up their present food. The corresponding location update formula is as follows:

$$x_{i,j}^{t+1} = \begin{cases} x_{\mathrm{best},j}^{t} + \beta\left|x_{i,j}^{t} - x_{\mathrm{best},j}^{t}\right|, & f_{i} \neq f_{g},\\[2mm] x_{i,j}^{t} + K\left(\dfrac{x_{i,j}^{t} - x_{\mathrm{worst},j}^{t}}{\left|f_{i} - f_{w}\right| + \varepsilon}\right), & f_{i} = f_{g}, \end{cases}$$

where $\beta$ is a random number conforming to the standard normal distribution; $K$ is a uniform random number in $[-1, 1]$; $\varepsilon$ is a small number that prevents the denominator from becoming zero; $f_{i}$ is the fitness of the current sparrow; and $f_{g}$ and $f_{w}$ are the fitness values of the sparrows in the best and worst positions, respectively. It can be seen from this formula that, if the sparrow is currently in the globally best position, it will move to a location near itself, with the step depending on the ratio of its distance from the worst position to the difference between its fitness and the worst fitness; otherwise, it will move toward the vicinity of the current best position.
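The paper gives no reference implementation; the following is a minimal NumPy sketch of one SSA generation based on the three update rules above. The discoverer/vigilant ratios, the safety threshold ST, and the minimization convention are assumptions chosen only to make the sketch concrete.

```python
# Compact sketch of one SSA generation based on the update rules above
# (not the authors' code; PD/SD ratios, ST, and bounds handling are illustrative).
import numpy as np

def ssa_step(X, fit, t, T_max, lb, ub, ST=0.8, pd_ratio=0.2, sd_ratio=0.1, rng=np.random):
    n, d = X.shape
    order = np.argsort(fit)                 # smaller fitness = better (minimization)
    X, fit = X[order].copy(), fit[order].copy()
    n_pd = max(1, int(pd_ratio * n))        # number of discoverers

    # discoverers
    for i in range(n_pd):
        if rng.rand() < ST:
            X[i] *= np.exp(-(i + 1) / (rng.rand() * T_max + 1e-12))
        else:
            X[i] += rng.randn() * np.ones(d)

    # followers
    for i in range(n_pd, n):
        if i > n / 2:
            X[i] = rng.randn() * np.exp((X[-1] - X[i]) / ((i + 1) ** 2))
        else:
            A = rng.choice([-1.0, 1.0], size=d)
            X[i] = X[0] + np.abs(X[i] - X[0]) * A / d

    # vigilant sparrows (a random subset becomes aware of danger)
    for i in rng.choice(n, max(1, int(sd_ratio * n)), replace=False):
        if fit[i] > fit[0]:
            X[i] = X[0] + rng.randn() * np.abs(X[i] - X[0])
        else:
            K = rng.uniform(-1, 1)
            X[i] += K * (X[i] - X[-1]) / (abs(fit[i] - fit[-1]) + 1e-50)

    return np.clip(X, lb, ub)               # keep positions inside the search range
```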
2.5. Proposed Method
The steps of the SSA optimization of the CNN parameters are as follows and are shown in Figure 2 (a code sketch of this loop is given after the list):
(1) Set the parameters of SSA and initialize the population.
(2) Train the CNN and use the accuracy rate on the test set as the fitness function; evaluate the fitness of each individual, where an individual encodes the learning rate and batch size, and record the optimal position.
(3) Check whether the current number of iterations has reached the termination condition; if it has, terminate the loop and output the result; otherwise, continue.
(4) Update the positions of the current population, reinitialize the individuals that fall outside the upper and lower bounds, and return to step (2).
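A sketch of this loop is shown below, reusing the hypothetical ssa_step() helper from Section 2.4; train_and_test_cnn() is likewise a hypothetical stand-in for training the CNN of Section 3 and returning its test accuracy, and the search ranges are illustrative.

```python
# Sketch of steps (1)-(4), reusing the ssa_step() sketch from Section 2.4.
# train_and_test_cnn() is a placeholder; replace its body with the real routine.
import numpy as np

def train_and_test_cnn(learning_rate, batch_size):
    return np.random.rand()                          # placeholder accuracy

def fitness(ind):
    lr, batch = ind[0], int(round(ind[1]))
    return -train_and_test_cnn(lr, batch)            # negate: the SSA sketch minimizes

lb = np.array([1e-4, 16.0])                          # illustrative search ranges
ub = np.array([1e-1, 256.0])                         # (learning rate, batch size)
n_sparrows, n_iter = 10, 10

X = lb + (ub - lb) * np.random.rand(n_sparrows, 2)   # step (1): initialize population
best_x, best_f = None, np.inf
for t in range(n_iter):                              # step (3): termination condition
    fit = np.array([fitness(x) for x in X])          # step (2): evaluate fitness
    if fit.min() < best_f:
        best_f, best_x = fit.min(), X[fit.argmin()].copy()
    X = ssa_step(X, fit, t, n_iter, lb, ub)          # step (4): update positions

print("best (learning rate, batch size):", best_x, "accuracy:", -best_f)
```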

3. The Construction of Bearing Fault Diagnosis Model Based on Parameter-Optimized CNN
The proposed bearing diagnosis method, which is based on a parameter-optimized CNN, combines the characteristics of data from big-data monitoring devices with the benefits of deep learning. It combines unsupervised and supervised learning to adaptively extract fault features and identify the running state of devices from large amounts of data. Compared with traditional approaches, it overcomes the limited adaptive capacity for feature extraction and the insufficient generalization capability of shallow networks for fault identification. Figure 3 depicts the entire procedure. The specific steps are as follows:
(1) The bearing's vibration signal is preprocessed using FFT to obtain frequency-domain data.
(2) A two-dimensional Continuous Wavelet Transform is used to generate time-frequency maps of the data, taking both time- and frequency-domain information into account.
(3) The dimension of the acquired data is reduced using LLE, and the data are normalized.
(4) The test accuracy rate of the CNN is maximized by searching for the optimal combination of CNN parameters with SSA, and the suitable architecture of the network is determined by comparison.
(5) Standard samples of the bearing's various states are input into the optimized CNN.
[Figure 3(a) and 3(b): overall procedure of the proposed bearing fault diagnosis method]
After the model for fault identification has been developed, test samples of the bearing in various states are diagnosed.
4. Case Study
4.1. Experiment Setup
The experimental data used in this study are rolling bearing vibration acceleration data from Case Western Reserve University's (CWRU) Bearing Data Center [33]. Figure 4 depicts the bearing test rig. It is equipped with an induction motor, a fan-end bearing, a drive-end bearing, a torque transducer, and a dynamometer (load motor). The signals originate from the accelerometer installed on the bearing housing at the drive end of the induction motor; the accelerometer is connected with the torque sensor and the dynamometer, as shown in Figure 5. The sampling frequency of the considered data is 48000 Hz, collected from the drive-end bearing (6205-2RS JEM SKF), and the samples are collected under four working conditions: (1) normal operating conditions; (2) inner race fault; (3) ball fault; (4) outer race fault. There are a total of 10 types of data, including one normal type and nine fault types with different fault locations and diameters. The different components of the bearing are shown in Figure 5. The details of the data, including motor load and motor speed, are listed in Table 1. 2400 samples were collected for each category, each sample containing 1000 sampling points, so a total of 24000 samples were collected. 19,200 of the samples are selected for training, and the remaining 4800 samples (20% of the total) are used for testing. Table 2 lists the details of the sample sets for each condition and their corresponding status labels.
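The paper does not state how the 1000-point samples are cut from the raw records or which files are used; the sketch below uses a hypothetical record name and .mat key, draws overlapping windows (an assumption) so that 2400 samples per class can be obtained from one record, and performs the 80/20 split.

```python
# Sketch of the sample construction described above (not the authors' code).
# The file name and the 'X*_DE_time' key are illustrative; actual CWRU .mat
# variable names depend on the record number.
import numpy as np
from scipy.io import loadmat

signal = loadmat("130.mat")["X130_DE_time"].ravel()   # hypothetical drive-end record

n_points, n_per_class = 1000, 2400
# overlapping windows so that 2400 samples can be drawn from one record
step = max(1, (signal.size - n_points) // (n_per_class - 1))
starts = np.arange(n_per_class) * step
samples = np.stack([signal[s:s + n_points] for s in starts])

rng = np.random.default_rng(0)
idx = rng.permutation(n_per_class)
split = int(0.8 * n_per_class)                 # 80 % training / 20 % testing per class
train, test = samples[idx[:split]], samples[idx[split:]]
print(train.shape, test.shape)                 # (1920, 1000) (480, 1000)
```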


4.2. Fault Diagnosis Using Parameter-Optimized CNN
4.2.1. Data Preprocessing
As previously stated, the vibration signal of the bearing is preprocessed. Figure 6 depicts the FFT spectra under the different operating conditions. Each signal is a superposition of multiple frequency components that can be separated using frequency-domain analysis. Each batch of data is processed with FFT to obtain 1024 spectral points in order to collect enough training samples and ensure that the model is properly trained. Because of the symmetry of the spectrum, half of the points are used as the feature vector.
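A minimal sketch of this FFT step is shown below, assuming each 1000-point sample is zero-padded to 1024 points (the paper does not state how the 1024 points are obtained).

```python
# Minimal sketch of the FFT step described above (not the authors' code).
import numpy as np

def fft_features(sample, n_fft=1024):
    """Zero-pad a 1000-point sample to n_fft, take the magnitude spectrum,
    and keep half of it (the spectrum of a real signal is symmetric)."""
    spec = np.abs(np.fft.fft(sample, n=n_fft))
    return spec[: n_fft // 2]                  # 512-point feature vector

sample = np.random.randn(1000)                 # one vibration segment (synthetic here)
print(fft_features(sample).shape)              # (512,)
```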
[Figure 6(a)–6(j): FFT spectra of the ten bearing states]
Then, the data go through CWT to obtain a series of time-frequency maps; the maps for the different states are shown in Figure 7. Each map is then reshaped to a uniform size, and the processed samples of the same state are gathered to form a data group, giving a total of ten data groups corresponding to the 10 states.
[Figure 7(a)–7(j): CWT time-frequency maps of the ten bearing states]
4.2.2. Data Dimension-Reduction by LLE
The preprocessed data sets are linearly normalized to shorten the training time, accelerate convergence, and mitigate the effect of noise and aberrant samples on network training. As the CNN can only accept input in a specific format, the maps need to be downsized to an acceptable dimension through LLE (the number of neighbors is 15). Additionally, each group of samples of a given state is dimension-reduced separately to maintain the characteristics of each category of information. The dimension-reduced maps are presented in Figure 8, and each data group is reduced to a correspondingly smaller size.
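A minimal sketch of the normalization and LLE step for one data group is shown below; n_neighbors = 15 follows the paper, while the target dimension and group size are illustrative assumptions.

```python
# Sketch of the normalization + LLE step above (not the authors' code).
# n_neighbors=15 follows the paper; n_components and the group size are illustrative.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

def reduce_group(maps, n_components=10):
    """maps: (n_samples, n_features) flattened time-frequency maps of ONE state."""
    lo, hi = maps.min(), maps.max()
    maps = (maps - lo) / (hi - lo + 1e-12)           # linear normalization to [0, 1]
    lle = LocallyLinearEmbedding(n_neighbors=15, n_components=n_components)
    return lle.fit_transform(maps)                   # each state reduced separately

group = np.random.rand(200, 64 * 64)                 # toy group of flattened maps
print(reduce_group(group).shape)                     # (200, 10)
```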
[Figure 8(a)–8(j): dimension-reduced time-frequency maps of the ten bearing states]
4.2.3. Network Setup and Training
Usually, a CNN uses images as its input data. In this research, the dimension-reduced time-frequency maps derived from CWT and LLE are utilized as the input of the CNN, and the size of the input layer corresponds to the size of these maps. The network architecture is obtained from repeated experiments and references to previous literature. After the input layer, the first convolutional layer has eight filters with ReLU as the activation function, followed by a 2 × 2 mean-pooling layer. This is repeated with a convolutional layer of 16 filters, again using ReLU as the activation function, and another 2 × 2 mean-pooling layer. The next part consists of fully connected layers containing 360 neurons in the first layer and 60 neurons in the second one, with ReLU applied as the activation function in both. Then, the softmax function is used as the classification function. The network is trained with SGD for ten epochs. The structure of the CNN is shown in Figure 9.
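A PyTorch sketch of this architecture is given below. The paper does not specify the convolution kernel sizes or the exact input size, so 3 × 3 kernels and a 32 × 32 single-channel input are assumptions used only to make the sketch run; the softmax is folded into the cross-entropy loss, as is conventional in PyTorch.

```python
# PyTorch sketch of the architecture described above (not the authors' code).
# 3x3 kernels and a 32x32 single-channel input are assumptions.
import torch
import torch.nn as nn

class BearingCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AvgPool2d(2),                       # 2x2 mean pooling
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AvgPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 8 * 8, 360), nn.ReLU(), # 360 then 60 neurons, as in the text
            nn.Linear(360, 60), nn.ReLU(),
            nn.Linear(60, n_classes),              # softmax is folded into the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = BearingCNN()
opt = torch.optim.SGD(model.parameters(), lr=0.01)  # lr/batch size come from SSA
loss_fn = nn.CrossEntropyLoss()                     # cross entropy with softmax
print(model(torch.randn(4, 1, 32, 32)).shape)       # torch.Size([4, 10])
```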

4.2.4. Determination of the Optimal Parameters of CNN with SSA
The optimal values of the learning rate and the batch size of the CNN are searched for by SSA within their respective search ranges. The detailed settings of SSA are presented in Table 3. It is also quite important to evaluate the fitness of the chosen parameters during the SSA, and the parameter determination criterion in this research is the accuracy rate of the testing, which is given as follows:

$$\text{Accuracy} = \frac{\text{number of correctly classified test samples}}{\text{total number of test samples}} \times 100\%.$$
A higher accuracy rate means a better training effect and better parameter fitness, so the training result of the model can be judged by observing the accuracy rate. While increasing the number of iterations improves the outcome of the fault diagnosis, it also significantly increases the computation time. Considering the SSA's effect and the time required, the number of iterations is set at 10. Table 4 summarizes the parameters determined for the CNN. Figure 10 illustrates how the accuracy rate of the best parameter combination of each iteration varies with the number of iterations. Table 5 gives details of the variation of the optimized parameters during this process.

4.2.5. Comparative Analysis of the Results with Other Methods
To demonstrate the suggested method's benefit in fault identification, the accuracy rate of the optimized CNN is compared with that of a CNN whose parameters are set only by experience (each trained ten times). The suggested technique reaches an accuracy rate of 99.4 percent, indicating that the model has the benefits of high accuracy, efficiency, and stability for identifying bearing faults under a variety of conditions. The accuracy rate of the CNN using experience-based parameters is 97 percent, somewhat lower than that of the optimized one. Then, using the same experimental data provided by Case Western Reserve University's Bearing Data Center, the accuracy of DBN is 92 percent [34], BP is 92.5 percent [35], and SVM is 91.7 percent [36]. Compared to shallow network models, deep network models are better suited for adaptive fault identification in huge data sets and complicated environments. In comparison to the deep network model, standard fault detection approaches suffer from limitations in terms of adaptive fault feature extraction, monitoring and diagnostic accuracy, and generalization performance. Figure 11 illustrates the comparison.

5. Conclusion
The combination of parameters including the epoch, batch size, and learning rate of a CNN has a pronounced influence on training. To obtain the optimal combination of the aforementioned parameters, SSA using the accuracy rate of testing as the fitness is proposed for optimizing the parameters of the CNN. Then, the improved method is utilized for fault diagnosis, and its effectiveness is well demonstrated. The detailed conclusions can be drawn as follows:
(1) The high accuracy of the parameter-optimized CNN demonstrates that the approach can adaptively extract fault information from the bearing vibration signal spectrum, reducing the need for a wide variety of signal processing techniques and diagnostic experience. It offers significant benefits in terms of fault diagnostic capability and generalization performance.
(2) This research develops a unique integrated fault diagnostic model based on FFT, CWT, LLE, and a parameter-optimized CNN that outperforms shallow-layer networks and classic feature extraction and pattern recognition techniques. It has practical utility in the era of big data.
Data Availability
All data, models, and code generated or used during the study appear in the submitted paper.
Conflicts of Interest
The author declares no conflicts of interest.