Abstract

Autoencoders are widely used for fault diagnosis in chemical engineering. To improve their performance, researchers have focused on regularization strategies and on designing new, more effective cost functions. However, existing methods modify only a single model. This study offers a new perspective for strengthening the fault diagnosis model: useful information is extracted from one model (the teacher model) and applied to a new model (the student model). The method pretrains the teacher model by fitting the ground truth labels and then uses a sample-wise strategy to transfer knowledge from the teacher model. The knowledge and the ground truth labels are then used together to train a student model whose structure is identical to that of the teacher model. The current student model then serves as the teacher of the next student model. After step-by-step teacher-student reconfiguration and training, the optimal model is selected for fault diagnosis. Knowledge distillation is applied throughout the training procedure. The proposed method is applied to several benchmark problems to demonstrate its effectiveness.

1. Introduction

As a result of the advancement of process control systems, modern industries have switched to automation. With advantages such as low cost, high efficiency, and improved safety, automated control systems have become increasingly popular worldwide, especially in power plants, the aerospace industry, and chemical plants [1]. However, in abnormal situations, these systems are unable to cope on their own and thus require the intervention of human operators. Fault detection and diagnosis (FDD) has been developed to determine where problems lie and prevent machine damage or accidents.

In practical chemical processes, FDD comprises two vital procedures, fault detection and fault diagnosis, that affect the service life of devices and the safety of operators. In fault diagnosis, a model is built to distinguish fault types accurately, and corresponding measures are taken to resolve these faults when necessary. In the early application of FDD, methods based on mathematical models attracted the most attention and developed rapidly. However, model-based FDD methods require a priori knowledge about the chemical process before they can be applied, which has limited their development. With advances in related technology, large volumes of data can now be collected easily. Although the collected data are high-dimensional, they remain useful for FDD methods that learn directly from data; this approach is called the data-driven method. Basic data-driven methods include principal component analysis (PCA) [2, 3], canonical correlation analysis [4], independent component analysis, and Fisher discriminant analysis [5, 6]. These methods have contributed greatly to FDD, but they are not suitable for every situation. For example, all of the aforementioned methods fail to detect faults accurately in nonlinear process monitoring and fault diagnosis.

In the early 1990s, neural networks (NNs) were introduced into FDD as a feature extraction tool [7]. The NNs at that time were shallow and narrow, and they were prone to overfitting once they became deep and wide. Hence, NNs were not widely used for FDD during this period. Since Hinton proposed pretraining and fine-tuning [8], deep learning (DL) has received much attention and has undergone unprecedented development and application. DL has been applied in computer vision, speech recognition, text processing, medicine, finance, advertising, and other fields. Stacks of restricted Boltzmann machines (RBMs), referred to as a deep belief network (DBN), have been found to perform better than traditional data-driven methods (e.g., PCA) in data dimension reduction [9]. As pretraining helps deep neural networks (DNNs) escape local minima, the fact that DBN outperforms PCA may be easily disregarded. In FDD, RBMs and DNNs have been applied to specific processes to initialize offline fault detection models and to detect online time-related samples automatically. Ren et al. used multiple DBN models, one for each working condition, and set an adaptive threshold to detect faulty conditions [10]. Zhang et al. considered extracting features in the spatial and temporal domains [11]. They built two subnetworks using mutual information and trained them with the backpropagation algorithm. In their model, each class has its own network, which considerably improves model performance. Sun et al. trained a DBN with normal samples in offline modeling; online, this DBN discriminates faults according to reconstruction errors [9]. The convolutional neural network (CNN), which is based on the convolution operation, is widely applied in image processing. Roy et al. used a deep CNN with layer-wise training to recognize handwritten Bangla isolated compound characters [12]; the RMSProp algorithm was applied to achieve faster convergence.
CNNs can also be employed in FDD. Lee et al. applied a CNN to multivariate time-series data and linked the output of the CNN to the structural meaning of the data for fault classification [13]. Wen et al. proposed a new CNN-based method [14]. Instead of using time-domain data directly, they converted the raw data into images, from which the CNN can easily extract fault features. An autoencoder (AE) or a stacked autoencoder (SAE) can be trained to extract valuable features for determining whether a machine is faulty. Lee et al. employed a stacked denoising AE for fault classification [15]. Compared with models based on plain SAEs, their model adds noise to the samples before training to prevent overfitting and thereby improves generalization capability. Zhang et al. considered self-correlation in the data. They first extended the data matrix by correlation analysis [16] and then applied a deep SAE to extract the necessary features. Unlike these existing studies, the current work focuses on the results obtained from a trained DL model: it takes the learned knowledge and reuses it to train an identical model.

Much research has explored knowledge distillation (KD). Yim et al. calculated the inner products between two identical DNN models and created a new loss function using the L2 norm to improve classification performance in image processing [17]. Similarly, Furlanello et al. used two identical models and explored two distillation objectives: confidence weighted by Teacher Max and dark knowledge with permuted predictions [18]. They found that KD is suitable for almost all machine learning (ML) models. Xu et al. focused on one ML model and trained it several times [19]. After the first training, they softened the correct targets and reset the mistaken targets to 1. Bae et al. used residual networks as basic models to transfer knowledge in a teacher-student framework [20]. They proposed a layer-wise hint training method and improved the recognition results on several typical datasets.

The rest of the paper is organized as follows. Section 2 provides an overview of SAE and KD. Section 3 introduces the novel model based on SAE with the sample-wise strategy (SAE-SWS). Section 4 discusses the application of the method to several industrial benchmark problems. Finally, Section 5 summarizes the methodology.

2. Preliminaries

2.1. Stacked Autoencoders

An SAE is a stack of autoencoders. However, an SAE cannot be built by simply connecting AEs directly. Its training process includes two procedures: unsupervised pretraining and fine-tuning. In unsupervised pretraining, the features of a dataset are extracted. As shown in Figure 1(a), the input dataset is assumed to be $X \in \mathbb{R}^{n \times m}$, where $n$ is the number of samples and $m$ is the number of observed variables. A sample $x \in X$ undergoes encoding and decoding, which can be presented as

$$h = f(W_1 x + b_1), \tag{1}$$

$$\hat{x} = g(W_2 h + b_2), \tag{2}$$

where $h$ is the hidden layer feature of $x$. $W_1 \in \mathbb{R}^{d \times m}$ and $b_1 \in \mathbb{R}^{d}$ are the parameters of the encoding stage, where $d$ is the number of hidden nodes. $W_2 \in \mathbb{R}^{m \times d}$ and $b_2 \in \mathbb{R}^{m}$ are the parameters of the decoding stage. They can be expressed together by $\theta = \{W_1, b_1, W_2, b_2\}$. $f$ and $g$ denote the activations of the encoding and decoding stages, respectively. Several activations are available, and the rectified linear unit is employed here. $\hat{x}$ is the predictive estimate of the input $x$. The process, including (1) and (2), is trained as

$$\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^2. \tag{3}$$

The hidden feature $h$ learned by the first AE is then used as the input of a second AE to extract a deeper feature $h^{(2)}$, as shown in Figure 1(b). Figures 1(a) and 1(b) both belong to the feature extraction stage.
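The encoding and decoding pass of a single AE can be illustrated with a minimal sketch in plain Python. The names `W1`, `b1`, `W2`, `b2`, `autoencode`, and `reconstruction_error` are assumptions introduced for illustration, the weights are hand-picked rather than trained, and the squared reconstruction error is only one common choice of pretraining objective.

```python
def relu(v):
    """Rectified linear unit applied element-wise to a vector."""
    return [max(0.0, a) for a in v]

def affine(W, b, v):
    """Compute W v + b, with W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + bias for row, bias in zip(W, b)]

def autoencode(x, W1, b1, W2, b2):
    """Encode x into a hidden feature h, then decode h into an estimate of x."""
    h = relu(affine(W1, b1, x))      # encoding stage
    x_hat = relu(affine(W2, b2, h))  # decoding stage
    return h, x_hat

def reconstruction_error(x, x_hat):
    """Squared Euclidean distance between the input and its estimate."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

# Toy example with m = 3 observed variables and d = 2 hidden nodes.
x = [1.0, 2.0, 3.0]
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # hypothetical encoder weights
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # hypothetical decoder weights
b2 = [0.0, 0.0, 0.0]

h, x_hat = autoencode(x, W1, b1, W2, b2)
error = reconstruction_error(x, x_hat)
print(h, x_hat, error)
```

In this toy setting the two hidden nodes keep only the first two variables, so the third variable cannot be reconstructed and contributes the whole error. Pretraining would adjust the weights to reduce this error over all samples.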

In the fine-tuning stage, the whole network is trained for classification or regression. The structure is shown in Figure 1(c). For classification, the SAE assumes a label dataset $Y \in \mathbb{R}^{n \times c}$ ($c$ is the number of classes) corresponding to $X$, which is built with one-hot encoding. The parameters of the SAE are transferred from the preceding pretraining stage, as shown in the figure. A softmax layer with $c$ nodes is then added after the network. $\hat{Y}$ represents the output. Its $i$th row $\hat{y}_i$ denotes the SAE classification probabilities of input $x_i$, which satisfy $\sum_{j=1}^{c} \hat{y}_{ij} = 1$. The SAE is then trained as

$$\min_{\theta} \; -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} y_{ij} \log \hat{y}_{ij}, \tag{4}$$

where $y_{ij}$ and $\hat{y}_{ij}$ denote the $j$th entries of $y_i$ and $\hat{y}_i$, respectively.

2.2. Knowledge Distillation

KD means transferring “knowledge” between ML models. It was first proposed by Hinton to transfer the knowledge learned by a cumbersome model to a much simpler one [21]. Before KD is explained, soft targets and hard targets must be understood. For example, in binary classification, a one-hot label such as $[0, 1]$ is a hard target, whereas a label in which no dimension takes the value 1, so that every entry is a probability strictly between 0 and 1, is a soft target. The latter clearly contains more information. KD softens hard targets to obtain additional information for classification.

KD is closely related to the “softmax” function. Softmax converts multicategory output values into relative probabilities, making them easy to understand and compare. It is usually connected after the ML model. For example, assume that the softmax layer input is $z = [z_1, z_2, \ldots, z_c]$, where $c$ denotes the number of classes. The probability that the input belongs to the $i$th class can be calculated by

$$p_i = \frac{\exp(z_i)}{\sum_{j=1}^{c} \exp(z_j)}, \tag{5}$$

where $i, j \in \{1, 2, \ldots, c\}$ denote the category indexes. By contrast, if softmax in KD (softmax-KD) is applied, then the probability is calculated by

$$p_i^{T} = \frac{\exp(z_i / T)}{\sum_{j=1}^{c} \exp(z_j / T)}, \tag{6}$$

where $T$ denotes the temperature of softmax. A high $T$ equates to soft targets. When $T$ equals 1, softmax-KD becomes the original softmax. In this paper, $T$ is set within the range of 2.5 to 4 during model training.
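The effect of the temperature can be checked numerically with a short sketch. The function name `softmax_kd` and the example logits are assumptions chosen for illustration; subtracting the maximum before exponentiation is a standard numerical-stability step and does not change the result.

```python
import math

def softmax_kd(z, temperature=1.0):
    """Softmax with temperature; temperature = 1 gives the ordinary softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in z]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]            # hypothetical softmax layer input
hard = softmax_kd(logits, 1.0)      # ordinary softmax
soft = softmax_kd(logits, 4.0)      # softened targets at a higher temperature
print(hard)
print(soft)
```

With the higher temperature, the largest probability shrinks and the smaller ones grow, so the relative similarity between classes becomes more visible to the student model.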

3. FDD Model Based on SAE with Sample-Wise Strategy

3.1. Model Construction with Sample-Wise Strategy

Two identical models, namely, a teacher model and a student model, are used in the algorithm. Both models consist of an SAE for feature extraction and a softmax-KD layer for classification. Following the variables used in Section 2.1, the teacher model $M^t$ is shown in Figure 2, where $X^t$ and $\hat{Y}^t$ are the input and output of the teacher model using temperature $T$, respectively.

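$\hat{Y}^t$ represents the set of teacher outputs $\hat{y}^t_i$, $i = 1, \ldots, n$. Given the label dataset $Y^t$, where $Y^t \in \mathbb{R}^{n \times c}$, the teacher model is trained by minimizing the cross-entropy cost function, which can be presented by

$$J(\theta^t) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} y^t_{ij} \log \hat{y}^t_{ij}, \tag{7}$$

where $\theta^t$ denotes the weights and biases in the teacher model $M^t$. $\hat{y}^t_{ij}$ is calculated by

$$\hat{y}^t_{ij} = \frac{\exp(z^t_{ij} / T)}{\sum_{l=1}^{c} \exp(z^t_{il} / T)}, \tag{8}$$

where $z^t_i$ is the input of the softmax-KD layer for the $i$th sample.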

The SWS is conducted after the teacher model $M^t$ has been trained. For the $i$th sample $x_i$, its predicted category $\hat{c}_i = \arg\max_j \hat{y}^t_{ij}$ and true category $c_i = \arg\max_j y^t_{ij}$ are obtained according to the teacher output $\hat{y}^t_i$ and the label $y^t_i$, respectively. If $\hat{c}_i = c_i$, $x_i$ is considered to have been sorted into the correct category. As a result, all the samples are divided into two parts: the correct samples (CSs) and the mistaken samples (MSs). Let $X^m$ denote the MSs and $Y^m$ their labels. For the MSs, the teacher model does not extract the needed features accurately. Hence, the SWS increases the weight of the MSs by adding them to the student model’s dataset; that is, the input set of the student model $M^s$ is $X^s = X^t \cup X^a$, where $X^a$ consists of $k$ copies of $X^m$. $k$ is a multiple; that is, the number of added samples is $k$ times the number of MSs. The label set of the student model is then $Y^s = Y^t \cup Y^a$, where $Y^a$ contains the labels corresponding to $X^a$. Feeding the student model’s input $X^s$ into the teacher model yields $K^t$. In other words, $K^t$, which represents the distribution of the output of the teacher model $M^t$, is obtained from the teacher model by the SWS and then used for training $M^s$. Thus, the student model can be trained with the modified loss function, which can be presented as

$$L(\theta^s) = \alpha \, H(K^t, \hat{Y}^s) + (1 - \alpha) \, H(Y^s, \hat{Y}^s), \tag{9}$$

where $H(\cdot, \cdot)$ denotes the cross-entropy and $\hat{Y}^s$ denotes the softmax-KD output of $M^s$ using the temperature $T$. $K^t$ is called the knowledge between models $M^t$ and $M^s$ using the temperature $T$. $\alpha$ denotes the weight of the cross-entropy between $\hat{Y}^s$ and $K^t$. The novel loss function is composed of two parts: the cross-entropy between the student model’s output and the knowledge $K^t$, and the cross-entropy between the student model’s output and the ground truth label $Y^s$. The first part can be regarded as a regularizing term that boosts the student model’s performance. The ground truth label information, reinforced by the MSs, further improves the discriminative capability. Together, the two parts make full use of the information in the teacher model’s results and strengthen the MSs in the student model training.
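The sample-wise strategy and the two-part loss described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names (`sample_wise_strategy`, `student_loss`), the reading that each mistaken sample is appended `k` times, and the convex weighting `alpha` / `1 - alpha` of the two cross-entropy terms are all assumptions.

```python
import math

def argmax(p):
    """Index of the largest entry of a list."""
    return max(range(len(p)), key=p.__getitem__)

def sample_wise_strategy(X, Y, teacher_probs, k):
    """Build the student's dataset: all samples plus k copies of each mistaken one.

    X: list of samples, Y: one-hot labels, teacher_probs: teacher outputs.
    Assumption: "k times the number of MSs" means k appended copies per MS.
    """
    mistaken = [i for i, (p, y) in enumerate(zip(teacher_probs, Y))
                if argmax(p) != argmax(y)]
    X_s, Y_s = list(X), list(Y)
    for _ in range(k):
        for i in mistaken:
            X_s.append(X[i])
            Y_s.append(Y[i])
    return X_s, Y_s, mistaken

def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(max(q, eps)) for t, q in zip(target, pred))

def student_loss(y, knowledge, student_out, alpha):
    """Weighted sum of the knowledge term and the ground truth term (assumed form)."""
    return (alpha * cross_entropy(knowledge, student_out)
            + (1.0 - alpha) * cross_entropy(y, student_out))

# Toy data: three samples, two classes; the teacher misclassifies sample 1.
X = [[0.0], [1.0], [2.0]]
Y = [[1, 0], [0, 1], [1, 0]]
teacher_probs = [[0.9, 0.1], [0.6, 0.4], [0.7, 0.3]]

X_s, Y_s, mistaken = sample_wise_strategy(X, Y, teacher_probs, k=2)
loss = student_loss(y=[0, 1], knowledge=[0.6, 0.4], student_out=[0.3, 0.7], alpha=0.4)
print(mistaken, len(X_s), loss)
```

Because the appended samples are copies, the teacher's output for each copy equals its output for the original mistaken sample, so the knowledge for the enlarged set can be obtained with one forward pass of the teacher over the student's input set.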

The above procedures constitute step 1 in Figure 3. The SWS can also be applied repeatedly, as shown in Figure 3. $M^s_i$ denotes the $i$th student model, and $N$ denotes the number of steps. Each model used in this paper is an SAE with a softmax-KD layer. In step 1, model $M^s_1$ is trained. SAE-SWS then employs the SWS to mine further knowledge, and this knowledge is applied to train model $M^s_2$. After $N$ steps, the optimal student model is selected according to the classification accuracy on several datasets.

3.2. Flow Chart of the Model for FDD

On the basis of the SWS above, the ML model for FDD is introduced. FDD models include two parts: offline modeling and online monitoring. The procedures of offline modeling and online monitoring are shown in Figure 4.

Offline modeling:
(1) Dataset $X \in \mathbb{R}^{n \times m}$ and label set $Y \in \mathbb{R}^{n \times c}$ are collected, where $n$ refers to the number of samples, $m$ denotes the number of variables, and $c$ denotes the number of classes. Then $X$ is divided into $X_{train} \in \mathbb{R}^{n_1 \times m}$ for training and $X_{test} \in \mathbb{R}^{n_2 \times m}$ for testing, where $n = n_1 + n_2$. Correspondingly, $Y$ is divided into $Y_{train}$ for training and $Y_{test}$ for testing.
(2) Every variable of $X_{train}$ is normalized to zero mean and unit variance. Then every variable of $X_{test}$ is normalized by using the mean value and variance of $X_{train}$. $\tilde{X}_{train}$ and $\tilde{X}_{test}$ denote the normalized training and testing datasets, respectively.
(3) Letting $X^t = \tilde{X}_{train}$ and $Y^t = Y_{train}$, the teacher model $M^t$, which consists of an SAE and a softmax-KD layer, is trained by matching the distribution of the ground truth labels. Let $i = 1$, where $i$ denotes the index of steps.
(4) Transfer knowledge and obtain the input set and label set for the next student model by the SWS.
(5) Train the student model by matching the distribution of the teacher’s output and the ground truth labels.
(6) If $i = N$, jump to step (7); otherwise, let the current student become the teacher of the next student, set $i = i + 1$, and go back to step (4).
(7) Use $\tilde{X}_{test}$ and $Y_{test}$ to calculate the classification accuracy of all trained models. Select the optimal model with the highest accuracy.
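The control flow of the teacher-student iteration in the offline procedure can be sketched as below. The callables `train_teacher`, `train_student`, and `evaluate` are hypothetical placeholders for the actual SAE training and testing routines, and the sketch assumes that the initial teacher is also a candidate when the optimal model is selected.

```python
def successive_training(train_teacher, train_student, evaluate, n_steps):
    """Train a teacher, then n_steps successive students, and pick the best model.

    train_teacher()        -> model fitted to the ground truth labels
    train_student(teacher) -> model trained on the SWS data built from `teacher`
    evaluate(model)        -> classification accuracy on the test set
    All three callables are hypothetical placeholders supplied by the caller.
    """
    models = [train_teacher()]
    for _ in range(n_steps):
        # The most recent model acts as the teacher of the next student.
        models.append(train_student(models[-1]))
    accuracies = [evaluate(m) for m in models]
    best = max(range(len(models)), key=accuracies.__getitem__)
    return models[best], accuracies

# Toy stand-ins: a "model" is just an integer generation counter, and the
# made-up accuracy peaks at generation 2.
best_model, accuracies = successive_training(
    train_teacher=lambda: 0,
    train_student=lambda teacher: teacher + 1,
    evaluate=lambda model: 1.0 - 0.1 * abs(model - 2),
    n_steps=4,
)
print(best_model, accuracies)
```

Keeping every trained model until the end is what allows the final selection step to return an intermediate student rather than the last one.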

Online monitoring:
(1) The new sample $x_{new}$ is normalized by using the mean value and variance of the training data $X_{train}$.
(2) Feed the normalized sample $\tilde{x}_{new}$ into the optimal model and obtain its predicted class.
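The normalization used in both the offline and online stages can be sketched as follows: the statistics are estimated once from the training data and then reused unchanged for test and online samples. The function names `fit_scaler` and `transform` are assumptions, as are the use of the population variance (division by the number of samples) and the guard that leaves constant variables unscaled.

```python
import math

def fit_scaler(X_train):
    """Estimate the per-variable mean and standard deviation from training data."""
    n = len(X_train)
    columns = list(zip(*X_train))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    # Assumption: a constant variable keeps a divisor of 1 to avoid division by zero.
    stds = [s if s > 0 else 1.0 for s in stds]
    return means, stds

def transform(x, means, stds):
    """Normalize one sample with statistics estimated from the training data."""
    return [(v - m) / s for v, m, s in zip(x, means, stds)]

X_train = [[1.0, 10.0], [3.0, 10.0]]    # toy training data, two variables
means, stds = fit_scaler(X_train)

x_new = [3.0, 12.0]                     # hypothetical online sample
x_new_normalized = transform(x_new, means, stds)
print(means, stds, x_new_normalized)
```

Reusing the training statistics keeps the online sample on the same scale that the optimal model saw during offline modeling.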

4. Application to Benchmark Problems

4.1. Continuous Stirred Tank Reactor Process

4.1.1. CSTR Description

The CSTR process is an anaerobic treatment that is widely used in industrial effluent disposal [22]. As a result of the high research value of the CSTR, many modeling methods have been proposed to fill the gaps. In this experiment, the model proposed by Belevi is adopted [23]. The flow chart of the CSTR is shown in Figure 5. CAF denotes feed concentration. QF denotes feed flow. TF denotes feed temperature. QC denotes coolant flow. TCF denotes the coolant inlet temperature. T denotes the temperature in the reactor. h denotes the reactor level. TC denotes the coolant outlet temperature. CA denotes the concentration of component A in the reactor. Q denotes the flow of component A. The variables listed above are the 10 operational variables of the CSTR. They are selected as the monitored variables and are recorded in detail. Table 1 lists and describes 11 faults we set for the experiment. Except for the listed variables, other operational variables are within the allowable range.

4.1.2. Parameter Setting and Results

White noise is added to every variable to simulate actual working conditions. The process parameters are then set following [24]. We select 1,000 samples with a simulation step of 0.2 min for the experiment; 70% of these samples form the training set, and the remaining 30% form the testing set. Conventional classification algorithms, namely support vector machine (SVM), k-nearest neighbor (kNN), and the basic SAE, are chosen for comparison. For the proposed algorithm, SAE-SWS, the performance of the teacher model (SAE-SWS-t) and of the optimal student model (SAE-SWS-s) is reported. In addition, SAE-KD, which removes the SWS from SAE-SWS, is included for comparison. It shares the same teacher with SAE-SWS and then goes through successive teacher-student training; SAE-KD-s is its optimal student model. In this experiment, the SVM uses the radial basis function (RBF) as its kernel function, and the penalty factor is set to 1. For kNN, the number of neighbors is 5, and the Euclidean distance is the metric. For SAE-based algorithms, their structures are . The Adam optimizer is used with a learning rate of 0.001. The batch size is 10, and the number of epochs is 1000. In SAE-SWS, the temperature is set to 2.5 during training according to the reference [22]. In equation (9), the cross-entropy weight is varied to find its optimal value; it is set to 0.4 based on the results shown in Figure 6. How to set the multiple of the MSs remains an open problem in the SWS. Hence, an experiment is conducted to choose it, and its result is shown in Figure 7. According to Figure 7, the multiple is set to 2. Theoretically, the number of steps can be very large. However, limited by equipment, it is set to 4 in this paper. The classification accuracy of each model in SAE-SWS is shown in Figure 8, and the optimal student model is chosen according to this figure.

The comparison results presented in Table 2 show that the SAE-SWS algorithm has the best performance among all algorithms. Comparing SAE-SWS-t with SAE-SWS-s, we find that the latter outperforms the former by 1.69%. This improvement suggests that the proposed SWS enhances feature extraction capability. SAE-SWS-s outperforms SAE-KD-s by 1.03%, which shows the effectiveness of the SWS for knowledge transfer. Table 3 shows the detailed classification accuracy of SAE-SWS-s for each fault.

4.2. TE Process

The Tennessee Eastman (TE) process is an authoritative benchmark problem in FDD. Since it was proposed by Downs and Vogel, it has become a popular tool for testing the performance of FDD models [25]. The entire process consists of five operating units: reactor, condenser, recycle compressor, separator, and stripper. It has 53 variables, including 12 manipulated variables and 41 measured variables. When a fault occurs, it affects almost all variables [26]. Datasets that include a normal state and 17 faults are collected separately for training and testing to build a fault classification model. IDV3, IDV9, IDV15, and IDV21 are removed. The first 160 normal samples are removed from each fault testing set; hence, the test set of each fault includes 800 samples. In this experiment, the kernel function of the SVM is RBF, and the penalty factor is set to 1. The number of neighbors in kNN is 8, and the Euclidean distance is the metric. As for the SAE-based methods, the structure is set to . The Adam optimizer is applied with a learning rate of 0.002. The batch size and the number of epochs are 10 and 1500, respectively. As for the CSTR, a number of experiments are conducted to determine the related parameters in SAE-SWS. The results are shown in Figures 9, 10, and 11. Accordingly, the cross-entropy weight is set to . The multiple of the MSs is set to 2. The optimal student model is .

SAE-SWS is compared with other algorithms. Table 4 shows the performance of the algorithms for the TE dataset. The SAE series outperforms the conventional methods. The comparison of the SAE series shows that SAE-SWS-s achieves the best performance for the TE dataset. SAE-SWS-s outperforms SAE-KD-s by 0.8%, which confirms the effectiveness of the SWS. Table 5 lists the specific accuracy values for SAE-SWS-s for training and testing.

4.3. Sensorless Drive Diagnosis

In sensorless drive diagnosis (SDD), variables are extracted from electric current drive signals. The drive has intact and defective components, which gives 11 classes with different conditions in this dataset. The current signals are measured with a current probe and an oscilloscope on two phases. Empirical mode decomposition was used to generate the 48 variables in the database. More information can be found at http://archive.ics.uci.edu/ml/datasets/Dataset+for+Sensorless+Drive+Diagnosis.

In this experiment, the kernel function of the SVM is RBF, and the penalty factor is set to 1. The number of neighbors in kNN is 6, and the Euclidean distance is the metric. As for the SAE-based methods, the structure is set to . The Adam optimizer is applied with a learning rate of 0.001. The batch size and the number of epochs are 10 and 1000, respectively. As before, a number of experiments are conducted to determine the related parameters in SAE-SWS. The results are shown in Figures 12, 13, and 14. Accordingly, the cross-entropy weight is set to 0.3. The multiple of the MSs is set to 2. The optimal student model is .

The comparison results presented in Table 6 indicate that the SAE-SWS algorithm has the best performance among all algorithms. In addition, the results of SAE-SWS-t and SAE-SWS-s show that the student gains substantially from the teacher. Moreover, SAE-SWS-s outperforms SAE-KD-s by 1.81%, which indicates the effectiveness of the SWS in improving classification performance. The detailed classification accuracy of SAE-SWS-s for each fault is given in Table 7.

5. Conclusions

In this work, a novel strategy called the SWS is proposed for SAEs to deal with fault classification problems. The strategy transfers knowledge between two identical models. After a teacher model is trained, the SWS is applied to obtain the “knowledge”, that is, the output distribution of the teacher model, for training the student model. The student model then becomes the next student’s teacher. After step-by-step teacher-student training, the optimal model, selected from all the trained models, is applied to classify real-time data. The experiments on several datasets demonstrate the effectiveness of the SWS as well as of the proposed SAE-SWS algorithm. The improvements offered by the proposed method are summarized as follows. A powerful DL model, the SAE, is used for feature extraction in FDD. Through successive teacher-student training, the SWS raises the probability of achieving superior performance by increasing the weights of the mistaken samples. KD helps to obtain more useful information about the correlation among classes. However, the method has the following limitations in practical applications. Processes with insufficient training samples hinder feature extraction in SAE-SWS, so the extracted features may not be an intrinsic representation of the data. As the number of steps increases, the number of parameters in SAE-SWS increases dramatically. The application of SAE-SWS therefore requires capable hardware and a much longer offline modeling time than traditional monitoring methods. Further study will focus on reducing the number of parameters as well as the model complexity.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (21878081) and the Fundamental Research Funds for the Central Universities of China (222201917006).