Abstract

With the rapid expansion of big data across all domains, data-driven and deep learning-based fault diagnosis methods for the chemical industry have become a major research topic in recent years. Beyond deep neural networks, deep forest offers a new approach to deep representation learning and overcomes shortcomings of deep neural networks such as strong parameter dependence and high training cost. However, the standard cascade forest does not take the ability of each base classifier into account, which may blunt its discriminative power. In this paper, a multigrained scanning-based weighted cascade forest (WCForest) is proposed and applied to fault diagnosis in chemical processes. In view of the high-dimensional nonlinear data produced by chemical processes, WCForest first designs a set of relatively suitable windows for the multigrained scanning strategy to learn data representations. Next, considering the fitting quality of each forest classifier, a weighting strategy is proposed that calculates the weight of each forest in the cascade structure without additional computational cost, so as to improve the overall performance of the model. To demonstrate the effectiveness of WCForest, it is applied to the benchmark Tennessee Eastman (TE) process. Experiments show that WCForest achieves better results than related approaches across various evaluation metrics.

1. Introduction

Performance improvement and ease of monitoring have become increasingly important in industrial processes. Operating under extreme conditions, modern industrial processes are becoming more and more complex. When monitoring technology was underdeveloped and historical fault data were scarce, diagnosis technology mainly consisted of two types of methods: process-model-based and knowledge-based [1, 2]. These make diagnosis results easy to understand, but their cost is too high for systems with many devices and many state variables [3]. Modern industries, however, are developing toward large scale and high complexity, and with the widespread use of monitoring technology, large volumes of industrial process data have been collected from broadly deployed sensors and other control equipment. Therefore, making maximal use of these massive data to further improve both the accuracy and the speed of fault diagnosis is significant for a complicated process monitoring system.

With the increase of storage capacity and computing power, data-driven fault diagnosis methods have been widely used in chemical processes [4, 5]. Among these, multivariate statistical methods, mainly including principal component analysis (PCA) [6, 7], partial least squares (PLS) [8, 9], independent component analysis (ICA) [10, 11], Fisher discriminant analysis (FDA) [12, 13], random forest (RF) [14], canonical correlation analysis (CCA) [15], exponential discriminant analysis (EDA) [16], and their derivatives [17–22], have made rapid progress. Although these data-driven methods have achieved certain results, two shortcomings remain. On one hand, most of them rely on the assumption of a single data distribution (e.g., a Gaussian distribution) [23, 24], but in actual industrial processes, data do not always strictly follow a given distribution. Expert experience is then needed, or an approximate hypothesis can be adopted to process the data, at the risk of diagnostic errors. On the other hand, in the context of big data, the above methods saturate easily: once the sample size grows beyond a certain scale, it is difficult to exploit the remaining samples to further improve fault diagnosis accuracy.

To maximize the use of massive data, deep learning (DL) has in recent years been applied to various big data fields, and a large number of DL-based fault diagnosis methods have emerged [25–28]. Xie and Bai [29] proposed a hierarchical deep neural network (HDNN) to diagnose faults in the benchmark Tennessee Eastman (TE) process: a supervisory deep neural network (DNN) first assigns faults to groups, and for each group a specially trained DNN is triggered for further diagnosis. Zhang and Zhao [30] presented an extensible deep belief network- (DBN-) based fault diagnosis model, in which DBN subnets extract features of fault data in the spatial and temporal domains and a global back-propagation network performs fault classification. Moreover, a deep convolutional neural network- (DCNN-) based fault diagnosis method was also proposed [31], which achieves better results than the former. However, some shortcomings may limit the application of DNNs in fault diagnosis: (1) DNNs are mainly designed for images and spectrograms in computer vision and speech recognition, so to extract both spatial and temporal features, the input data in fault diagnosis are usually arranged into a two-dimensional matrix covering a period of time [29–31], which may degrade real-time performance. (2) It is well known that DNN performance depends heavily on parameter tuning because of the large number of hyperparameters.

To alleviate the aforementioned shortcomings of DNNs, an alternative, gcForest [32], was proposed in 2017; it achieves comparable or even better results than DNNs in several domains. gcForest has far fewer hyperparameters than a DNN and can be trained easily without elaborate parameter tuning. However, two key issues in gcForest deserve attention: the diversity of the classifiers and the power of each classifier. For the former, different forests can be used, such as random forests and completely random forests. For the latter, this paper proposes a weighted cascade forest (WCForest) model. The main idea of WCForest is to assign a weight to each forest in the cascade structure, amplifying the influence of well-performing forests and suppressing that of poor ones.

The remainder of this paper is organized as follows. Section 2 introduces the principle and mathematical model of gcForest. The WCForest-based fault diagnosis model is proposed in Section 3. The applications of WCForest in the TE process and the comparisons with other fault diagnosis methods are discussed in Section 4. Finally, conclusions are drawn in Section 5.

2. Multigrained Cascade Forest

gcForest consists of two integrated components: multigrained scanning and the cascade forest. Multigrained scanning slides windows over high-dimensional input to extract local contexts, whose representations are learned by different forests. The cascade forest then learns increasingly discriminative representations level by level under the supervision of the input representations, so that the ensemble of forests gives a more accurate prediction.

2.1. Multigrained Scanning

Inspired by the way CNNs exploit feature relationships, gcForest adopts a sliding window-based multigrained scanning strategy. An illustration of the process is given in Figure 1. Suppose that the training dataset contains instances of $c$ classes and the dimension of each instance is $d$. A sliding window of size $k$ is used to scan each instance, so $(d - k + 1)$ $k$-dimensional feature vectors are generated by scanning each raw instance sequentially. All feature vectors extracted from a raw instance are regarded as derived instances of its class. For each $k$-dimensional derived instance, each forest generates a $c$-dimensional class vector. The $(d - k + 1)$ derived instances of each raw instance are input into a random forest and a completely random forest to generate their class-distribution vectors, which are then concatenated into a transformed feature vector of $2c(d - k + 1)$ dimensions. As shown in Figure 1, the training dataset includes three classes, each raw instance has 400 dimensions, and the sliding window size is 100. The above process therefore yields a $2 \times 3 \times 301 = 1806$-dimensional transformed feature vector for each 400-dimensional raw feature vector. Compared with the raw vector, the transformed feature vector has many more dimensions and an enhanced feature representation.
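To make the transformation concrete, the following is a minimal NumPy/scikit-learn sketch of the sliding-window scan described above. The function name is ours, `ExtraTreesClassifier` stands in for the completely random forest, and, unlike gcForest, this sketch predicts on the same data the forests were fitted on rather than using cross-validated class vectors.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

def multigrained_scan(X, y, k, n_trees=100):
    """Transform (n, d) raw instances into (n, 2*c*(d-k+1)) feature vectors
    by scanning with a window of size k and collecting class vectors from
    a random forest and a (here, extremely randomized) forest."""
    n, d = X.shape
    # derived instances: every length-k window of every raw instance
    windows = np.stack([X[:, i:i + k] for i in range(d - k + 1)], axis=1)
    n_win = windows.shape[1]
    flat = windows.reshape(n * n_win, k)
    labels = np.repeat(y, n_win)          # each window inherits the raw label
    parts = []
    for forest in (RandomForestClassifier(n_trees, random_state=0),
                   ExtraTreesClassifier(n_trees, random_state=0)):
        forest.fit(flat, labels)
        # c-dimensional class vector per window, concatenated per instance
        parts.append(forest.predict_proba(flat).reshape(n, -1))
    return np.concatenate(parts, axis=1)

# 3 classes, d = 400, k = 100: output has 2 * 3 * 301 = 1806 dimensions
X = np.random.rand(60, 400)
y = np.random.randint(0, 3, 60)
print(multigrained_scan(X, y, k=100).shape)   # (60, 1806)
```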

2.2. Cascade Forest

In the cascade forest, each layer assembles a number of decision forests, receives the features processed by its previous layer, and passes its processing results to the next layer. Each layer is designed to include different types of forests to encourage diversity. Figure 2 shows the schematic of an example cascade forest, in which two types of forests (random forests in green and completely random forests in blue) are used. The number of forests per layer and the number of trees in each forest are hyperparameters in practice. The instances are input to the cascade layer, and each forest produces an estimate of the class distribution. The class-distribution outputs of all forests in the same layer form a class vector, which is then concatenated with the raw vector as the input of the next layer. Cross-validation is used to evaluate the performance gain of each new layer; when there is no improvement, the expansion automatically terminates.
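As a sketch of this layer-by-layer expansion, the following illustrates, under our own simplified assumptions, how levels can be added until cross-validated accuracy stops improving. A production cascade would also generate the augmented class vectors by cross-validation rather than by refitting on the full training set.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

def grow_cascade(X, y, max_layers=8, n_forests=4, n_trees=100, tol=1e-3):
    """Add cascade layers while k-fold CV accuracy keeps improving."""
    feats, best, layers = X, 0.0, []
    for _ in range(max_layers):
        layer = [RandomForestClassifier(n_trees, random_state=i) if i % 2
                 else ExtraTreesClassifier(n_trees, random_state=i)
                 for i in range(n_forests)]
        # evaluate the candidate layer with cross-validation
        score = np.mean([cross_val_score(f, feats, y, cv=3).mean()
                         for f in layer])
        if score <= best + tol:
            break                          # no improvement: stop expanding
        best = score
        class_vectors = []
        for f in layer:
            f.fit(feats, y)
            class_vectors.append(f.predict_proba(feats))
        layers.append(layer)
        # concatenate the layer's class vectors with the raw input
        feats = np.hstack([X] + class_vectors)
    return layers
```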

For each instance, each forest generates an estimated class-distribution vector by averaging the classification probabilities of all its trees. The classification probability of a tree is the proportion of each class among the training instances at the leaf node into which the concerned instance falls. The derivation of the class-distribution estimate of a random forest is shown in Figure 3. Suppose the class-distribution vector obtained by the $i$th tree in a forest is $p_i = (p_{i,1}, p_{i,2}, \ldots, p_{i,c})$, where $c$ represents the number of classes, and each forest contains $n$ trees; then the class-distribution vector generated by the forest is $P = (P_1, P_2, \ldots, P_c)$, where $P_j = \frac{1}{n} \sum_{i=1}^{n} p_{i,j}$.
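The averaging in this formula is exactly what scikit-learn's random forest does internally, as the following small check on synthetic data illustrates:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=52, n_classes=3,
                           n_informative=10, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# p_i: leaf class proportions of the i-th tree; P_j = (1/n) sum_i p_{i,j}
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
P = per_tree.mean(axis=0)

# predict_proba performs the same averaging over trees
assert np.allclose(P, forest.predict_proba(X))
```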

3. WCForest-Based Fault Diagnosis Method

3.1. Weighted Cascade Forest

As a substitute for a DNN, the cascade forest learns high-level representations at low cost. It does not learn hidden variables through the complex forward and backward propagation used in DNNs; instead, it directly learns class-distribution features by assembling a large number of decision-tree-based forests under the supervision of the input. This layer-wise supervised learning strategy makes cascade forests easy to train. Moreover, the ensemble of forests can acquire more precise class-distribution features, owing to its strong performance in most classification applications. However, in the standard cascade forest model, all forests in each cascade structure contribute equally to the final prediction, so the estimated class distribution may be overly sensitive to poorly fitted forests. To alleviate this problem, this section introduces WCForest, a new variant of the cascade forest.

Inspired by weighted voting, we give higher weights to strong classifiers than to weak ones during the training of the cascade structure. Defining rules to set the weights of the forests in the cascade structure is not straightforward. On one hand, the sample sets used to train the forests are random, so the result of a single classification is not a suitable measure of a forest's quality. On the other hand, extensive calculation and estimation of weights may bring additional costs. In this study, we attempt to set the weights of the forests as objectively as possible without additional cost.

Specifically, the performance of a forest can be measured by its training results on different training subsamples. To mitigate the risk of overfitting, cross-validation is used to evaluate the overall performance of each layer. Therefore, the cross-validation classification accuracy of each forest can be used to estimate its weight. The reasons for using cross-validation to calculate weights are as follows: (1) Cross-validation is already the default way to evaluate the performance of a new layer in the cascade forest, so using it as the weighting strategy incurs no additional computational cost. (2) Cross-validation assesses the classification quality of each forest through multiple verifications, which removes the contingency of a single verification.

Assuming that there are $c$ classes in the training set, the weight of each forest in each layer is estimated by $k$-fold cross-validation. The training set is divided into $k$ subsample sets; one of them is retained as a verification set, and the other $k - 1$ subsample sets are used to train the forest. Cross-validation is repeated $k$ times, each subsample set validating the forest once and yielding a classification accuracy. After training and verifying each level of the cascade forest, the classification accuracy matrix ACC is generated as

$$\mathrm{ACC} = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,k} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,k} \end{pmatrix},$$

where $a_{i,j}$ denotes the accuracy of the $j$th cross-validation of the $i$th forest, $m$ represents the number of random forests in each cascade structure, and $k$ is the number of cross-validations.

According to ACC, the average classification accuracy of the $i$th forest is

$$\bar{a}_i = \frac{1}{k} \sum_{j=1}^{k} a_{i,j},$$

and the weight matrix is defined as $W = (w_1, w_2, \ldots, w_m)$, where $w_i$ represents the weight of the $i$th forest, obtained by normalizing the average accuracies:

$$w_i = \frac{\bar{a}_i}{\sum_{l=1}^{m} \bar{a}_l}.$$
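A minimal sketch of this weighting step follows, assuming (as in the reconstruction above) that the weights are the fold-averaged accuracies normalized to sum to one; the function name is ours.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def forest_weights(forests, X, y, k=5):
    """Build ACC (one row per forest, one column per fold) and return w_i."""
    acc = np.zeros((len(forests), k))
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    for j, (train_idx, val_idx) in enumerate(folds.split(X, y)):
        for i, f in enumerate(forests):
            f.fit(X[train_idx], y[train_idx])
            acc[i, j] = f.score(X[val_idx], y[val_idx])   # a_{i,j}
    mean_acc = acc.mean(axis=1)                            # average accuracy
    return mean_acc / mean_acc.sum()                       # w_i
```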

Given a new instance, each forest produces an estimate of the class distribution as described in Section 2.2. Assuming that the class-distribution vector obtained by the $i$th random forest in a cascade layer is $P_i = (P_{i,1}, P_{i,2}, \ldots, P_{i,c})$, the weighted class-probability vector passed to the next cascade structure is $w_i P_i$, which is concatenated with the raw vector as an input to the next layer.

If the current layer is the last layer of the model, the class-distribution matrix of the cascade forest is

$$P = \begin{pmatrix} P_{1,1} & P_{1,2} & \cdots & P_{1,c} \\ P_{2,1} & P_{2,2} & \cdots & P_{2,c} \\ \vdots & \vdots & \ddots & \vdots \\ P_{m,1} & P_{m,2} & \cdots & P_{m,c} \end{pmatrix}.$$

The weighted class-probability results can then be calculated as

$$\mathrm{prob}_j = \sum_{i=1}^{m} w_i P_{i,j},$$

where $\mathrm{prob}_j$ represents the total probability of class $j$.

Finally, the class with the maximum total probability is chosen as the fault classification result, as shown in Figure 4.
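Putting the last two formulas together, a sketch of the final weighted decision (reusing the hypothetical `forest_weights` helper above) might look like:

```python
import numpy as np

def diagnose(forests, weights, X_new):
    """prob_j = sum_i w_i * P_{i,j}; the argmax is the diagnosed class."""
    # stack each forest's class-distribution estimates: (m, n_samples, c)
    P = np.stack([f.predict_proba(X_new) for f in forests])
    prob = np.tensordot(weights, P, axes=1)   # weighted sum over forests
    return prob.argmax(axis=1), prob
```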

3.2. WCForest-Based Fault Diagnosis Model

The process data of industrial processes are usually high-dimensional and noisy. Generally, the original input space is mapped to a feature space by feature extraction, but the quality of that extraction directly affects classifier performance. The two sources of randomness in a random forest (bootstrap sampling of instances and random selection of candidate split features) give it good noise resistance, and when the input data are high-dimensional, representation learning can be further enhanced by multigrained scanning, which gives WCForest a degree of context or structure awareness. In the WCForest-based fault diagnosis model, the data collected from each monitor in the industrial process are diagnosed to obtain an evaluation of the process state at each time point.

In this paper, the model consists of two parts: multigrained scanning-based feature extraction and weighted cascade forest-based fault diagnosis. After the data are collected, multigrained scanning is used to extract representation vectors from the training and testing sets. Then, the weighted cascade forest classification model is trained on the representation vectors of the training set and validated on those of the testing set. Finally, the classification results of the testing set are obtained. The flow chart of the model is shown in Figure 5. Its diagnostic procedure includes offline modeling and online diagnosis, described as follows (a compact sketch of this flow is given after the steps):

Offline stage:
Step 1. Historical data are collected from the chemical process and preprocessed.
Step 2. The data collected at each time point are composed into $d$-dimensional vectors and labeled with their corresponding classes, either "normal" or one of the fault types.
Step 3. The samples, with their labels, are divided into a training set and a testing set.
Step 4. Given several different sets of windows, the training set is used to select one set of windows for multigrained scanning.
Step 5. The class-probability vectors of the training and testing sets are obtained by multigrained scanning with the selected set of windows.
Step 6. On the training set, $k$-fold cross-validation is used to train the WCForest model and to obtain the weight vector of each layer. The WCForest model is then verified using the class-probability vectors of the testing set.
Step 7. The fault diagnosis result is output and visualized.

Online stage:
Step 1. Online data are collected from the chemical process.
Step 2. Online sample vectors are input to the WCForest, which gives a predicted diagnosis result for each sample vector: either "normal" or one specific fault type.
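The following self-contained sketch traces the offline/online flow above on synthetic stand-in data, omitting the multigrained scanning step (Steps 4 and 5) for brevity; the dataset, forest choices, and inline weighting are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Offline stage (Steps 1-3, 6): collect, label, split, weight, train, verify.
X, y = make_classification(n_samples=2200, n_features=52, n_classes=4,
                           n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
forests = [RandomForestClassifier(400, random_state=0),
           ExtraTreesClassifier(400, random_state=1)]
# weight each forest by its k-fold CV accuracy, normalized to sum to one
acc = np.array([cross_val_score(f, X_tr, y_tr, cv=5).mean() for f in forests])
weights = acc / acc.sum()
for f in forests:
    f.fit(X_tr, y_tr)

# Online stage (Steps 1-2): each incoming sample vector receives a diagnosis.
probs = sum(w * f.predict_proba(X_te) for w, f in zip(weights, forests))
print("verification accuracy:", (probs.argmax(axis=1) == y_te).mean())
```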

4. Experiment Result

In this section, the proposed WCForest-based fault diagnosis model is applied to the TE process. Furthermore, its results are compared with those of other decision tree-based ensemble methods (RF, XGBoost, and AdaBoost), gcForest, and the existing literature.

4.1. Tennessee Eastman Process

As a simulation platform based on a real industrial process, the Tennessee Eastman process is widely used to evaluate the performance of monitoring methods in data-driven fault detection research. Figure 6 shows a flow diagram of the TE process. The process consists of 5 major unit operations: a reactor, a product condenser, a vapor-liquid separator, a recycle compressor, and a product stripper. The process involves four reactants (A, C, D, and E) and an inert component B. The four reactants and the inert component are fed to the reactor, which produces liquid products G and H and a byproduct F. The reactions are irreversible, exothermic, and approximately first-order with respect to the reactant concentrations.

The TE process includes 41 measured variables and 12 manipulated variables. However, one of the manipulated variables, the reactor agitator speed, is held constant and need not be analysed. The remaining 52 variables are used as research variables [33] and are all listed in Table 1; the first 41 are measured variables and the last 11 are manipulated variables. The TE process contains the 21 faults listed in Table 2. The data used for fault classification of the TE simulation system can be downloaded from http://web.mit.edu/braatzgroup. Each state (the normal state and 21 different fault states) is divided into training and testing sections. The data are sampled once every three minutes. The training data comprise 500 samples over 25 hours, with the fault introduced after one hour, so only the remaining 480 fault samples are used to train a fault diagnosis model. The testing data comprise 960 samples over 48 hours, with the fault introduced after 8 hours; that is, the fault samples are collected from the 161st sampling point onward. With these normal and fault sample sets, a complete WCForest model can be trained and tested.

4.2. WCForest Model for TE Process

A WCForest model suitable for TE process fault diagnosis is designed and constructed, in which the construction of the forests and several hyperparameters need to be selected experimentally, such as the number of trees in each forest, the number and types of forests, and the setting of the feature windows. To find a suitable model, we carried out the following experiments.

In constructing a decision tree, information gain and the Gini index are generally used as heuristic functions for feature selection. We tested both selection rules, and the resulting test accuracies were not significantly different. Therefore, the slightly better Gini index was selected as the node-splitting rule of the random forest models in this paper.

For the hyperparameters of the model, the numbers of forests in the multigrained scanning and in the cascade structure follow the settings used for gcForest [32]: 2 forests (1 completely random forest and 1 random forest) are used in multigrained scanning, and 8 forests (4 completely random forests and 4 random forests) are set up in each layer of the weighted cascade forest. However, the number of trees per forest and the setting of the scanning windows have a great influence on diagnostic accuracy, so these parameters need to be optimized. First, we examined the number of trees per forest and found that, beyond a certain size, it has little effect on diagnostic accuracy; considering the time complexity, the number of trees was therefore fixed. It is a common issue that there is no scientific guidance for the setting of windows. To find a proper setting, we tried several window settings: [15, 30], [20, 45], [13, 30, 42], [18, 36, 45], [16, 27, 35, 42], and [18, 25, 36, 47].

For this experiment, 400 samples per class are randomly selected; 80% of the samples of each class form the training dataset, and the rest form the testing dataset. Fault diagnosis is tested on one sample at a time. The average testing diagnostic accuracy and the training/testing times of the different window settings are listed in Table 3. The window setting [18, 25, 36, 47] has the highest average testing diagnostic accuracy (75.9%) and takes 83.4 min for training. With only a small decrease in accuracy (75.6%), the window setting [18, 36, 45] takes 20 min less than [18, 25, 36, 47]. In the following discussion, [18, 36, 45] is therefore chosen as the window setting.

4.3. Fault Diagnosis Result

To present the experimental results of fault diagnosis, two commonly used indicators, the fault detection rate (FDR) and the false positive rate (FPR), are adopted to evaluate the diagnostic performance of the model. They are calculated from the general confusion matrix defined in Table 4 as

$$\mathrm{FDR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}.$$
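For a single fault class, these two rates can be computed from label vectors as in the following sketch (the function name is ours):

```python
import numpy as np

def fdr_fpr(y_true, y_pred, fault):
    """FDR = TP/(TP+FN) and FPR = FP/(FP+TN) for one fault class,
    following the confusion-matrix entries of Table 4."""
    is_fault = (y_true == fault)
    flagged = (y_pred == fault)
    tp = np.sum(flagged & is_fault)
    fn = np.sum(~flagged & is_fault)
    fp = np.sum(flagged & ~is_fault)
    tn = np.sum(~flagged & ~is_fault)
    return tp / (tp + fn), fp / (fp + tn)
```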

Table 5 shows the FDRs of the 21 faults in the TE process obtained by WCForest, gcForest, and three decision tree-based ensemble methods: random forest (RF), XGBoost, and AdaBoost. The parameter settings of gcForest are the same as those of WCForest. The parameters of the remaining algorithms, all chosen after multiple rounds of tuning, are set as follows:
(1) RF: 400 decision trees, Gini index as the splitting rule.
(2) XGBoost: 400 decision trees (i.e., 400 boosting iterations), learning rate = 0.1, softmax loss as the objective function.
(3) AdaBoost: 500 decision trees, learning rate = 0.6.

Compared with the diagnostic results of RF, the FDRs of most faults increase to varying degrees, which is of great significance for industrial production and theoretical research. RF and XGBoost perform well on faults with obvious feature differences, such as faults 1, 2, 6, and 7, and poorly on the notoriously difficult faults 3, 9, and 15. AdaBoost generally has a low diagnostic rate, with the exception of fault 2, which reaches nearly 100%. gcForest enhances the perception of differences between features in the cascade structure and improves the classification ability of the model through multigrained scanning representation learning. The weights in WCForest further improve the robustness and sparsity of the model.

WCForest performs best among the five methods, with an average FDR of 84.13%, about 62.18% higher than that of AdaBoost. The FDRs of faults 6, 7, and 21 increase by 100%, the largest improvement among all 21 faults, and the improvements for more than 70% of the faults exceed 50%. RF and XGBoost have similar diagnostic rates across the faults; relative to them, the average FDR increases by 24.21% and 15.77%, respectively. Compared with gcForest, the performance of WCForest improves by nearly 2%. The FDR for faults 2, 6, 7, and 21 is 100%, meaning there are neither false alarms nor missed alarms. Furthermore, the FDR exceeds 90% for 11 faults and 95% for 6 of them, which is an important achievement.

A performance comparison of the five methods is shown in Figure 7. Clearly, WCForest and gcForest outperform RF, XGBoost, and AdaBoost, and compared with gcForest, the performance of WCForest is slightly improved.

To further demonstrate the validity of WCForest for the TE process, the FPRs are shown in Table 6, again comparing RF, XGBoost, AdaBoost, and gcForest.

As shown in Table 6, WCForest has an average FPR of 2.45%, a 27.58% decrease compared to AdaBoost and a 0.37% decrease compared to gcForest. The FPRs of faults 1, 2, 6, 7, 8, 17, 18, and 21 are zero, which is of great significance in industry. In addition, the FPR of half of the faults is reduced by more than 40% compared to AdaBoost. Figure 8 shows the detailed comparison results.

To further examine the performance of WCForest, we compare it with the methods listed in Table 7, which shows that our fault diagnosis model performs better than the others. Except for faults 3 and 15, the other 19 faults have diagnostic rates of more than 50%; in particular, the rates for faults 3 and 9 are 20% better than those of the other models (except that of [30]). Compared with the DBN-based fault diagnosis model proposed in [30], the FDRs of the 21 faults differ little, and the average FDR is only 1.23% higher. It should be noted that fault 15 has a relatively poor diagnostic result and needs further investigation.

To evaluate the quality of the proposed method more thoroughly, the $F_1$ score is selected as an additional evaluation indicator. It is a classical index in the machine learning field [38] that assesses classifiers based on recall and precision, computed as their harmonic mean.

From the general confusion matrix in Table 4, recall and precision are calculated as

$$\mathrm{recall} = \frac{TP}{TP + FN}, \qquad \mathrm{precision} = \frac{TP}{TP + FP}.$$

Thus, the $F_1$ score can be calculated as

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$
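As a check, these formulas can be computed directly or with scikit-learn's built-in helper; the tiny label vectors below are illustrative only.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def f1_from_counts(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# per-class precision, recall, and F1 for a multiclass diagnosis task
y_true = np.array([0, 1, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 0, 0, 2, 1])
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1, 2])
print(p, r, f1)
```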

The $F_1$ scores of WCForest are shown in Table 8, reflecting the diagnostic ability of the model. Recall and precision both reach almost 100% on faults 1, 2, 6, 7, and 21, indicating excellent true positive and false positive performance. Finally, Figure 9 shows the recall and precision of WCForest, which indicates that the proposed method performs well.

4.4. Hierarchical Representation Learning Visualization

To understand the characterization process of WCForest and the hierarchical representations it learns, it is important to observe the diagnostic results of each layer intuitively. As the learned features are high-dimensional, the results of each layer are difficult to visualize. To address this problem, we use t-distributed stochastic neighbour embedding (t-SNE) [39] to visualize the hierarchical representation learning process of the WCForest model.

The t-SNE method is a variant of stochastic neighbour embedding (SNE) [40]. It uses symmetric SNE, replacing the conditional probabilities between data points in the high-dimensional and low-dimensional spaces with joint probabilities. Meanwhile, a Gaussian distribution is used in the high-dimensional space and a Student's t-distribution with one degree of freedom in the low-dimensional space, which alleviates the crowding problem of SNE. Therefore, t-SNE can better preserve the complex nonlinear relationships among high-dimensional data during dimension reduction.

We use t-SNE to embed the high-dimensional features of each layer into a two- or three-dimensional space, which can be visualized as scatter plots; the feature learning process can then be inspected through the 2D or 3D map of each layer. Through experiments, we found that 3D maps are not well suited to visualizing the WCForest-based fault diagnosis model. Therefore, the high-dimensional output features of each layer are embedded in 2D maps and plotted in the subgraphs of Figure 10.
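A sketch of this visualization step follows, assuming matplotlib and the scikit-learn implementation of t-SNE; the function name and plotting choices are ours.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_layer_embedding(layer_features, labels, title):
    """Embed one layer's high-dimensional output features into 2D with t-SNE
    and colour each point by its actual class label (cf. Figure 10)."""
    emb = TSNE(n_components=2, perplexity=30,
               random_state=0).fit_transform(layer_features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab20", s=8)
    plt.title(title)
    plt.show()
```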

For visualization, 600 samples covering the 22 classes (one normal and 21 faults) were randomly selected from the testing set. The input data form a 600 × 52 matrix, and these 52-dimensional vectors are transformed into 600 two-dimensional vectors by the t-SNE method. In each subgraph of Figure 10, the points are marked with their actual class labels: “Normal” with “0,” “Fault 01” with “1,” and so on. In addition, different colors are used to distinguish the classes for viewing. The output of each layer is likewise converted to two-dimensional vectors by t-SNE so that it can be visualized in 2D (Figures 10(b)–10(h)).

As shown in Figure 10(a), the raw process data samples of all classes are mixed together. The distribution of the features produced by multigrained scanning is shown in Figure 10(b). Then, as the weighted cascade forest learns representations at successive levels, the samples gradually cluster by class label in the t-SNE maps (Figures 10(c)–10(f)). This indicates that the nonlinear expression ability of the WCForest model increases with the number of layers: by deepening the cascade, WCForest maps initially inseparable features into a space where they become separable. This also confirms the rationality of the deep cascade design, and these subgraphs provide strong evidence that the WCForest model is effective for fault diagnosis tasks.

4.5. Model Performance

Is the number of training samples crucial for obtaining good diagnostic performance? To answer this question, we compared the average training and testing accuracies on training datasets of 10560, 8800, 6600, 4400, and 2200 samples, with 600 testing samples, and show the results in Figure 11. The figure shows that the testing accuracy of the WCForest model is strongly affected by the number of training samples, especially in the early stage; although the later gains are smaller, accuracy continues to increase. The training accuracy, by contrast, is hardly affected by the number of training samples.

5. Conclusions

In this paper, an improved deep forest model, WCForest, is proposed for fault diagnosis of chemical processes, with the goals of improving accuracy, reducing the false alarm rate, and handling high-dimensional, nonlinear data. Its main feature is that, without increasing computational complexity, $k$-fold cross-validation is used to calculate the weight of each forest in the cascade structure, boosting the well-performing forests and weakening the poor ones, so as to improve the overall performance of the cascade forest.

To demonstrate the performance of the proposed model, RF, XGBoost, AdaBoost, gcForest, and WCForest were applied to the benchmark TE process, with 16 known faults and 5 unknown faults for testing. The WCForest model achieves an average FDR of 84.13% and an average FPR of 2.45%, a high accuracy and a low false positive rate, comparable to the average diagnostic rates reported in the literature. To provide more information about the performance of the model, the $F_1$ score is also used as an evaluation measure of the integrity and purity of the classifier. Our work shows the validity and efficiency of WCForest, which can perform fault diagnosis in the TE process and can serve as a reference for other chemical processes. In addition, most data samples are clearly and correctly clustered by WCForest in the t-SNE maps.

Because of its excellent fault diagnosis rate and false positive rate, this method has industrial prospects. Data-driven fault diagnosis methods depend on the collection of a large number of process malfunction samples, and our WCForest-based model inevitably shares this drawback. In the near future, our research will focus on fault diagnosis when only a limited number of fault samples are available.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by a grant from the National Natural Science Foundation of China (NSFC) (61562054).