Abstract
Faults occurring in the production line can cause heavy losses. Predicting fault events before they occur or identifying their causes can effectively reduce such losses. A modern production line can provide enough data to solve the problem. However, for complex industrial processes, this problem becomes very difficult when traditional methods are used. In this paper, we propose a new approach based on a deep learning (DL) algorithm to solve the problem. First, we regard these process data as a spatial sequence according to the production process, which differs from traditional time series data. Second, we improve the long short-term memory (LSTM) neural network in an encoder-decoder model to adapt to the branch structure corresponding to the spatial sequence. Meanwhile, an attention mechanism (AM) algorithm is used in fault detection and cause identification. Third, instead of traditional binary classification, the output is defined as a sequence of fault types. The proposed approach has two advantages. On the one hand, treating data as a spatial sequence rather than a time sequence can overcome multidimensional problems and improve prediction accuracy. On the other hand, in the trained neural network, the weight vectors generated by the AM algorithm can represent the correlation between faults and the input data. This correlation can help engineers identify the cause of faults. The proposed approach is compared with some well-developed fault diagnosis methods on the Tennessee Eastman process. Experimental results show that the approach has higher prediction accuracy and that the weight vector can accurately label the factors that cause faults.
1. Introduction
In the modern manufacturing industry, most production processes can be viewed as continuous rolling processes, such as assembly/product lines. Sometimes, unexpected faults occur in control or manufacturing systems, and the entire process breaks down. Before the faults are found and fixed, substantial energy, resources, and time are wasted, which is especially costly for high-energy-consumption process industries. Therefore, fault diagnosis and prognosis have been a subject of intensive research over the past four decades [1]. There are generally two research directions for this problem: first, detecting or predicting faults before they break the process, which helps workers or engineers prepare for production breaks in advance and yields great cost savings, and, second, identifying the causes and improving the production process, which can reduce the occurrence of breaks. With the development of the Industrial Internet of Things (IIoT), we can collect almost all of the production process data, which can be used to predict faults and identify causes. While these two directions may be easy to pursue in simple industrial processes, serious challenges remain in complex industrial processes, especially in complex process industries.
The challenges that complex industries pose to this research lie in the substantial volume and high dimensionality of the input data. These data, referred to as big data, are generated by sensors, production equipment, and testing instruments. In complex process industries, it is common to generate data with thousands of dimensions, even without considering video stream data. These data include control parameters of production equipment, real-time production data, environmental perception, and inspection data. For example, for a medium-sized pulp-and-paper mill, a typical process industry, the entire production process includes 19 processes, 4 key raw materials, and two waste removals. The equipment, instruments, and sensors involved in the production process can generate more than 2000 kinds of data, and the volume will continue to grow over time. Faced with high-dimensional and continuously growing data, machine learning (ML) algorithms can continuously improve performance. Therefore, ML, mainly deep learning (DL) with neural networks, is widely used in big data processing [2], including fault detection based on industrial big data.
Traditional DL-based algorithms consider the input data as time series data, which means that one input item is the data generated by the entire production line at time t, and the next input item is the data at time t + 1. A recurrent DL algorithm can then be used, such as a recurrent neural network (RNN) with gated recurrent units (GRUs) or long short-term memory (LSTM). This is very intuitive because the data collected from the production process are arranged in chronological order. However, because the sampling frequency of the data in each dimension is different, the data obtained at different times from the production line are not comprehensive, which brings difficulties to the construction of a DL model.
In the actual production process, the faults that cause production breaks generally originate at an earlier time, and it is difficult for engineers to identify this time. For example, in the fused magnesia industry, a typical high-energy-consuming complex process industry, an underburning condition of the furnace is a common fault, which will cause the furnace to fail and break production, but the duration before the break is difficult to identify. However, a DL-based model needs this time to label the training data. Traditional DL-based fault detection approaches may perform well in some applications [3], but they cannot help engineers find the cause of the faults.
In this paper, we regard these process data as a sequence in space according to the production process and propose an improved LSTM neural network. Afterwards, an encoder-decoder framework and an attention mechanism (AM) algorithm are used to predict faults before they occur. The input is a sequence in which different types of data are arranged, according to the position of the production process.
The output is still a sequence arranged by different fault types; for a specific output item, its value represents the length of time before fault occurrence. This approach has three advantages: (1) the method can handle long spatial sequences and improve prediction accuracy; (2) the weight vectors in AM can indicate the correlation between faults and input data (note that when the input data is expressed as a time series, this correlation cannot be reflected); (3) the format of the output sequence facilitates the labelling of the training data. Finally, the proposed approach is evaluated on the Tennessee Eastman Process (TEP) [4, 5]. The main contributions of this paper can be summarized as follows: (1) The weight vectors of AM in the trained neural network are used in fault diagnosis for the first time to reflect the correlation between faults and input data. This can help engineers find the cause of faults and improve the production process. (2) Unlike traditional DL models, which treat industrial production data as a time series, we regard industrial production data as a spatial sequence according to the production process and propose a branched LSTM structure. (3) We design the output as a fault type sequence. The value of a specific item represents the length of time before fault occurrence. This output model makes it convenient to label training data.
Experiments show that our approach can achieve a higher accuracy in fault detection than other traditional methods. Moreover, the specific factors causing the faults can be identified.
The rest of this paper is organized as follows: Section 2 gives a brief review of related works. In Section 3, we state the problem and provide some assumptions. Afterwards, Section 4 gives the algorithm details: an improved LSTM-based encoder-decoder model is introduced, and an AM algorithm for identifying factors is described. In Section 5, we test the fault detection approach and evaluate its performance. Finally, Section 6 gives the conclusion and directions for future work.
2. Related Works
Fault prediction or diagnosis is the process of detecting (or predicting) deviations from normal or expected operation [6]. Fault diagnosis has been widely used in industries for cost saving and safe production, and its applications are growing with the development of the IIoT and cyber-physical systems (CPS). Therefore, it has long attracted many researchers.
Statistical analysis techniques are popular traditional signal processing methods, and three algorithms are commonly used for fault detection: principal component analysis (PCA) [7], independent component analysis (ICA) [8], and partial least squares (PLS) [9, 10]. The core idea of PCA is to take the directions of multidimensional data with the largest variance as the main features and make them uncorrelated along different orthogonal directions. This is suitable for fault detection based on multivariate time series (MTS) data. For example, the authors in [11] coupled PCA with a Kalman filter to improve fault detection accuracy, and the key operation was to project the subspace along the fault area. The ICA algorithm considers the data to be linear combinations of statistically independent components; it is a demixing process. PLS is a supervised method that combines the ideas of PCA and canonical correlation analysis. This type of technique has its own limitations in processing nonlinear MTS and imbalanced data [12].
Deep learning is a powerful tool and has been successfully applied in many fields [13–15]. A report mentions that advances in DL techniques are the main enablers of knowledge work automation [16]. MTS data form sequences, so the commonly used DL model is the recurrent neural network (RNN), mainly the LSTM [12]. For example, Park et al. developed an LSTM-based fault detection model called LiReD [17]. They did not focus on how to process the multidimensional input data but on edge computing. Lu et al. introduced an LSTM network to solve the early fault detection problem in high-dimensional sequential data [18]. LSTM performs efficiently on sequential data, and it has been applied to fault detection models in many industries [19–24].
In the industrial production process, it should be noted that fault cases are rare, and, accordingly, the obtained training data contains a few fault examples. This is a class-imbalanced problem, and the proposed approach will also face this problem. There are three basic methods in class-imbalance learning: (1) undersampling [25], (2) synthetic minorities [26], and (3) cost-sensitive learning [27]. There are already well-developed solutions, so we will not go into details in this article.
Identifying, within a fault detection algorithm, the sensor-recorded factors that cause faults is valuable for industry. However, such studies are still scarce. The attention mechanism (AM) was originally introduced to ease the complexity of neural network models [28]: it is not necessary to feed all information into the neural network for calculation, but only to select some task-related information as input [29]. AM was initially used for natural language processing [30], but it was soon applied in the field of image-based deep learning [31, 32]. It has proven to be a very effective tool in a variety of applications such as reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations [33–35].
In this paper, we propose a branched LSTM structure to adapt to the spatial data structure generated by industrial production lines. Moreover, AM is used in the encoder-decoder model for fault detection for the first time to improve accuracy. Most importantly, the weight vectors of AM will be used to represent the attention distribution, which can help engineers identify the specific factors that cause the faults.
3. Problem Statement and Assumptions
3.1. Problem Structure
Data come from a multivariate time series process and are collected by a large number of sensors, equipment, and instruments of various types in manufacturing. They are the inputs of the encoder-decoder model. The training data consists of samples collected at regular time intervals together with event labels. The primary purpose of fault prediction is to build a classification model for different fault types and identify the factors causing the faults.
The sensing data increases with time. For this sequence of data, we start the fault detection program at a certain frequency and use the data from a period preceding the current time as input. In other words, the output of the model at one run will not be used as the input of the next run. The process is shown in Figure 1, where the program is restarted at a fixed time interval and each run takes as input all data sampled over the preceding time window.

The input data is a sequence. Each individual item in the sequence represents independent data collected from a certain position in the production process. We will describe in detail the structure of these items and how they are integrated into the encoder-decoder framework in the next section. The value of a single item is a data vector (time series) collected from a sensor, a piece of equipment, or an instrument in the production line over a period of time T.
Because the underlying structure is different, the sampling frequency and data type of the items in the sequence differ, which means that the data in each item will differ in length, type, and so on. Accordingly, one training sample can be described as a pair (X, Y), where each x_p in X is the input data from position p in the production line, and Y is a class label matrix (output), which indicates the length of time before fault occurrence. The length of each x_p depends on position p, as it is determined by the sampling frequency at position p. The length of Y depends on the number of fault types.
According to the above description, the problem of fault prediction in the industrial production process can be regarded as a sequence-to-sequence (seq2seq) classification problem. The encoder-decoder model can then be used.
3.2. Assumptions
Industrial data used for fault detection is recorded by sensors, equipment, and instruments. The cause of these data anomalies may be faults in the production process or sensor failure. We focus on detecting or predicting faults in production in this paper, so we do not consider sensor failure. In addition, in the process of building a neural network, some basic operations are also involved to improve model performance, such as regularization and normalization. These operations are well-developed and popular technologies. Therefore, we will not describe them in detail in this paper.
The raw data collected from the production line is very rough. Generally, some simple algorithms can be used to reduce dimensionality. For example, a timestamp may be described as a six-dimensional vector, including year, month, day, hour, minute, and second; it can easily be integrated into a one-dimensional scalar. This situation is common in raw data, and such values can easily be integrated according to their logical relationships. This integration algorithm is very simple and needs to be completed according to the actual situation. This article assumes that all input data has undergone such processing. However, readers should note that this step cannot be ignored when applying the algorithm.
4. Architecture and Algorithms
The architecture of the proposed approach works in an LSTM-based encoder-decoder model, and AM is used to improve fault detection accuracy and identify specific factors causing faults.
4.1. Input Sequence and Improved LSTM Structure
A typical encoder-decoder model solves a seq2seq problem. It is a multi-input multioutput model, also known as many-to-many. The structure is illustrated in Figure 2.

According to the description above, the input sequence can be described as X = (x_1, x_2, …, x_P), where x_p is the time series data from position p in the production line. x_p can be described as x_p = (x_p(1), x_p(2), …, x_p(T_p)), where x_p(t) is the data from position p at time t. T_p is the length of x_p, meaning the number of data points generated at position p over a period of time T, and it is related to the sampling frequency.
In the actual production process, the production line is not a simple one-dimensional sequence. There are usually branches, which make it more complicated than the traditional seq2seq problem. Figure 3 introduces a simple production example. The entire production process contains 6 steps, and each step generates production data. As shown in the figure, the steps do not form a simple one-dimensional sequence: there is a branch at Step 5, which can execute only after Steps 2 and 4 have executed in parallel. As a result, the spatial structure of the collected data is not a simple one-dimensional sequence. Thus, we improve the LSTM-based encoder structure to match this spatial structure.

According to the spatial structure, we design a branched LSTM chain, which is illustrated in Figure 4. Each arrow in Figure 4 represents a mapping between different layers of the neural network. Accordingly, the inputs of each cell are the outputs from the previous layer of the neural network. In this encoder structure, there are two cases: one is a traditional LSTM cell, and the other is a cell with branches; they are described separately below.

At first, for a traditional LSTM cell, the layer mapping can be described as

a = g(W_a · a_prev + W_x · x + b),

where g is the activation function, W_a is the weight matrix for the output of the previous layer, W_x is the weight matrix for the input, and b is the bias.
In industrial production, there are deep connections along the time series that matter for fault detection, and the LSTM model is capable of capturing these connections. The LSTM model was proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997 [36]. Compared to a plain RNN, an LSTM cell contains three special-purpose gates for storing and selecting information, and a memory value is passed between cells. The details are shown in Figure 5.

f_t is the forget gate. According to the inputs a_{t−1} and x_t, the forget gate determines which information can be "forgotten." It can be expressed as

f_t = σ(W_f [a_{t−1}, x_t] + b_f),

where a_{t−1} is the output of the previous LSTM cell, x_t is the data at time t, and W_f is the weight matrix. After the sigmoid function σ, information in dimensions where f_t is close to 0 will be "forgotten."
The update gate is u_t, and it determines which information can be "added." o_t is the output gate. They can be expressed as

u_t = σ(W_u [a_{t−1}, x_t] + b_u),
o_t = σ(W_o [a_{t−1}, x_t] + b_o).
The memory cell c_t and the activated vector a_t can be expressed as

c̃_t = tanh(W_c [a_{t−1}, x_t] + b_c),
c_t = f_t ⊙ c_{t−1} + u_t ⊙ c̃_t,
a_t = o_t ⊙ tanh(c_t),

where tanh is the hyperbolic tangent function and ⊙ is the element-wise product. x_t in the equations above is an n-dimensional vector, where n is the number of sensors in production. Accordingly, W_f, W_u, and W_o each map the concatenation [a_{t−1}, x_t] to an m-dimensional vector, where m is the number of cells in the hidden layer; b_f, b_u, and b_o are m-dimensional vectors, and so are c_t and a_t. The dimensions of the weight matrices within each cell are related to the length of the input vector, so each cell uniformly outputs a vector of length m.
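As a minimal NumPy sketch of one forward step through the standard LSTM equations above, the parameter dictionary p below is a hypothetical container for the trained weights and biases (shapes follow the m and n defined above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, a_prev, c_prev, p):
    """One forward step of a standard LSTM cell.

    x_t    : (n,) sensor data at time t (n = number of sensors)
    a_prev : (m,) activated output of the previous cell
    c_prev : (m,) memory passed between cells
    p      : dict with weight matrices Wf/Wu/Wo/Wc of shape (m, m + n)
             and biases bf/bu/bo/bc of shape (m,)
    """
    z = np.concatenate([a_prev, x_t])        # stacked [a_{t-1}; x_t]
    f = sigmoid(p["Wf"] @ z + p["bf"])       # forget gate
    u = sigmoid(p["Wu"] @ z + p["bu"])       # update gate
    o = sigmoid(p["Wo"] @ z + p["bo"])       # output gate
    c_cand = np.tanh(p["Wc"] @ z + p["bc"])  # candidate memory
    c_t = f * c_prev + u * c_cand            # new memory cell
    a_t = o * np.tanh(c_t)                   # activated vector
    return a_t, c_t
```

Since a_t = o_t ⊙ tanh(c_t) with o_t in (0, 1), every component of the output stays strictly inside (−1, 1), which keeps the chained cells numerically stable.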
Secondly, for a branched LSTM cell, the unit structure is illustrated in Figure 6. We suppose the other branch provides the output a′_{t−1} and the memory c′_{t−1}. Accordingly, based on the traditional LSTM cell [36], the key calculation is modified so that a′_{t−1} enters each gate alongside a_{t−1} and x_t. Thus, the forget gate, update gate, and output gate can be expressed as

f_t = σ(W_f [a_{t−1}, a′_{t−1}, x_t] + b_f),
u_t = σ(W_u [a_{t−1}, a′_{t−1}, x_t] + b_u),
o_t = σ(W_o [a_{t−1}, a′_{t−1}, x_t] + b_o).

The memory cell and the activated vector can be expressed as

c̃_t = tanh(W_c [a_{t−1}, a′_{t−1}, x_t] + b_c),
c_t = f_t ⊙ (c_{t−1} + c′_{t−1}) + u_t ⊙ c̃_t,
a_t = o_t ⊙ tanh(c_t).
For ease of description, the LSTM in the example in this section has only one branch. In actual applications, if multiple steps converge to one step, the corresponding branch outputs and memories are simply added in the branched LSTM cell.
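A sketch of such a two-predecessor cell, under the assumption that the branch activation is concatenated into the gate inputs and the two incoming memories are summed (one plausible reading of the merge rule described above; the weights are hypothetical placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def branched_lstm_cell(x_t, a_prev, c_prev, a_br, c_br, p):
    """Branched LSTM step for a cell with two predecessors (main chain + branch).

    Assumed merge rule: gates see the concatenation [a_prev; a_br; x_t],
    and the two incoming memories are summed; weights have shape (m, 2m + n).
    """
    z = np.concatenate([a_prev, a_br, x_t])
    f = sigmoid(p["Wf"] @ z + p["bf"])       # forget gate
    u = sigmoid(p["Wu"] @ z + p["bu"])       # update gate
    o = sigmoid(p["Wo"] @ z + p["bo"])       # output gate
    c_cand = np.tanh(p["Wc"] @ z + p["bc"])  # candidate memory
    c_t = f * (c_prev + c_br) + u * c_cand   # branch memory folded in
    a_t = o * np.tanh(c_t)
    return a_t, c_t
```

Extending to more than one branch only changes the concatenation and the memory sum, so the cell still emits a vector of the same length m as the traditional cell.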
4.2. Output Sequence Structure in the Encoder-Decoder Model
The encoder encodes all input sequences into a unified feature vector. The decoder decodes it and outputs the results. We design the output as a fault type sequence, where the value of a specific item represents the length of time before fault occurrence. This output model makes it convenient to label training data.
The output is defined as a sequence of fault types Y = (y_1, y_2, …, y_K), where y_k is the output for fault type k. The value of y_k is the length of time before the fault occurs, but it is not a numerical value. We define it as a class set:

y_k ∈ {τ_1, τ_2, …, τ_L, normal}.
Each element of the class set represents a time period before fault occurrence. Therefore, the output cell of the neural network is a SoftMax function. The advantage of this model is that, when labelling the training data, it is enough to roughly label the length of time before the fault occurs. However, its drawback is that the number of classes and the time period represented by each class depend on prior knowledge. Obviously, the model is a unidirectional propagation neural network.
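For illustration, a labelling helper of this kind might look as follows; the bin boundaries (10/20/30/60 minutes) are hypothetical and would come from the prior knowledge mentioned above:

```python
def label_time_to_fault(minutes_to_fault, bins=(10, 20, 30, 60)):
    """Map the time before fault occurrence to a class index.

    Class 0 covers (0, 10] minutes, class 1 covers (10, 20], and so on;
    the last class index (here 4) means "no fault within the horizon",
    i.e., normal status. The boundaries are illustrative only.
    """
    if minutes_to_fault is None:   # no fault ahead -> normal class
        return len(bins)
    for i, upper in enumerate(bins):
        if minutes_to_fault <= upper:
            return i
    return len(bins)               # beyond the horizon -> normal
```

Because only the bin, not the exact minute, must be known, the engineer labelling training data need only estimate the time before the break roughly, which is exactly the convenience claimed above.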
4.3. AM for Identifying Factors
In the production process, the amount of data is very large; in other words, the input of the LSTM model is high-dimensional. However, when a fault occurs, the data it affects may span only one or a few dimensions. Therefore, most of the other data is redundant and ineffective, yet we do not know in advance which data is redundant and which is crucial. In this paper, we use an attention mechanism to identify the crucial data. There are at least two benefits. Firstly, LSTM is not good at handling long sequences, and the AM algorithm can help LSTM deal with long sequence inputs to improve prediction accuracy. Secondly, the weight vectors in AM, originally used to identify crucial data, can be used to identify fault factors, which helps industry improve the production process to prevent faults.
The attention mechanism is now widely used in the processing of various types of sequence data. We are the first to use AM in fault detection to handle the problem of overly long input sequences. Meanwhile, the AM weight vectors can reflect the specific factors that cause the faults.
The AM based on the encoder-decoder model is realized by adding an attention weight vector for each output. The outputs of every cell in the LSTM are combined with the weight vector into the output features for the decoder. This is the same for the branched LSTM proposed in this paper. In other words, the encoder provides a feature vector for every output of the decoder instead of one single feature vector. The structure is shown in Figure 7.

The AM in the encoder provides a series of attention weight vectors, which determine the feature matrix. It can be described as

c_i = Σ_{j=1}^{n} α_{ij} h_j,

where c_i is the feature vector for output i, α_{ij} is the weight for sensor j in the attention weight vector α_i, h_j is the output of cell j, and n is the number of sensors.
The attention weights in one single vector need to meet the following constraints:

α_{ij} ≥ 0, Σ_{j=1}^{n} α_{ij} = 1.
The attention weight α_{ij} indicates the amount of attention that output i pays to each activation value h_j. α_{ij} can be described as equation (13), which satisfies the constraint of equation (12):

α_{ij} = exp(e_{ij}) / Σ_{j′=1}^{n} exp(e_{ij′}),

where the score e_{ij} is calculated through the previous layer of the LSTM neural networks.
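The attention computation above can be sketched as a small NumPy routine; here H stacks the cell outputs and E holds the unnormalized scores, both standing in for values a trained network would produce:

```python
import numpy as np

def attention(H, E):
    """Attention weights and feature vectors for a set of outputs.

    H : (n, m) matrix of encoder cell outputs h_j, one row per sensor j
    E : (k, n) unnormalized alignment scores e_ij from the previous layer
    Returns alpha (k, n), each row a softmax summing to 1, and
    C (k, m) with c_i = sum_j alpha_ij * h_j.
    """
    E = E - E.max(axis=1, keepdims=True)  # shift for numerical stability
    alpha = np.exp(E) / np.exp(E).sum(axis=1, keepdims=True)
    C = alpha @ H                         # weighted sum over cell outputs
    return alpha, C
```

After training, inspecting a row of alpha for a given fault output directly ranks the sensors by attention weight, which is how the factor identification in this paper reads off the causes.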
After completing the design of neural networks, the details of a backpropagation algorithm and network training process can be found in [27, 37]. Attention weight vectors in trained networks can be used to identify the specific factors causing the specific faults.
5. Experiment and Evaluation
In this section, we apply the TEP to simulate the process model in MATLAB. Based on data from this model, some other fault detection and diagnosis algorithms are compared with the proposed approach.
5.1. Tennessee Eastman Process Model
TEP is a well-known process simulation in the chemical industry and is a benchmark for fault detection and diagnosis [3]. The latest revision of TEP was proposed in 2015, and it exposes more variables and fault types. The details and source code can be found in [4]. The piping and instrumentation diagram (P&ID) of the revised TEP simulator is shown in Figure 8.

The simulation model uses the input data from the definition of Downs and Vogel, including parameters and signals. The gaseous reactants A, C, D, and E and the inert B are fed to the reactor, where the liquid products G and H are formed. The reactions in the reactor are

A(g) + C(g) + D(g) → G(liq),
A(g) + C(g) + E(g) → H(liq),
A(g) + E(g) → F(liq),
3D(g) → 2F(liq),

where (g) and (liq) indicate the gaseous and liquid states of the raw materials and products.
In the simulator, there are 12 manipulated variables (MVs) considered as control signals. There are 41 measured variables, which serve as the sensing data in the proposed approach; in other words, they are the inputs of the encoder-decoder model. The first 22, XMEAS(1) through XMEAS(22), are measured continuously and sampled every 3 min; they are listed in Table 1. The rest are composition measurements.
There are 21 different types of faults in the simulated production, named Fault1, Fault2, …, Fault21. We selected the first 20 faults; their settings can be found in [38]. For three faults that did not break production until after a period of time, we delayed the labelled time stamp by dozens of minutes. The process data comprise 7670 hours in a fault state and 4000 hours in a normal state. Samples were randomly selected from the process data, for a total of 30,000 samples. According to the encoder-decoder model, we randomly selected a fixed proportion of both fault and normal samples for the training dataset, and the remaining samples were used as the testing dataset. Descriptions of the fault statuses are shown in Table 2.
5.2. Setup for the Encoder-Decoder Model
The input data came from the 41 measured variables and 12 manipulated variables in the TEP simulation, which together form 53-dimensional time series data. Therefore, the length of the input sequence for the encoder was fixed at 53. Similarly, the length of the output sequence needed to equal the number of fault types, with no fault detected in any position indicating normal status; in this simulation, it was fixed at 21. The composition measurements among the 41 measured variables were taken from Streams 6, 9, and 11. The sampling interval and time delay for Streams 6 and 9 were both 6 minutes, and those for Stream 11 were 15 minutes. All the process measurements included Gaussian noise. Based on the analysis of [39], we constructed the LSTM-based encoder-decoder model with one hidden layer.
The length of the input data for one sensor, that is, a single element of the input sequence, depends on the sampling time. It was estimated empirically: its length needed to be greater than the duration before the faults broke production. In this simulation, we labelled several faults with time delays, as shown in Table 2. Moreover, according to [38], within one hour, the longer the time window, the higher the classification accuracy of the deep learning algorithm. We therefore set the maximum sampling window to 1 hour and tested the performance with windows of less than 1 hour. According to the sensor sampling frequencies, the length of the input data for one continuously measured variable was 20 (3-minute sampling over 1 hour), and that for the discrete ones was 10 or 4. These setups of the TEP model are described in Section 5.1. To facilitate matrix operations in deep learning, when the number of discrete samples was 4, only the first 3 data points were taken. The output of the decoder is the time length before a fault breaks production. The output layer is a SoftMax function, so the output is not a continuous variable. When an item of the output sequence takes its maximum class value, the status is normal.
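The per-variable input lengths follow from the sampling intervals in Section 5.1; a small helper makes the arithmetic explicit (the 1-hour window and the truncation to 3 samples are the choices described above, and the function name and keys are illustrative only):

```python
def input_lengths(window_min=60, truncate_to=3):
    """Samples per variable for one encoder window, given TEP sampling intervals.

    Continuous variables are sampled every 3 min, composition measurements
    on Streams 6/9 every 6 min, and Stream 11 every 15 min; the shortest
    discrete series is truncated to ease batched matrix operations.
    """
    return {
        "continuous_3min": window_min // 3,                    # 20 samples/hour
        "streams_6_9_6min": window_min // 6,                   # 10 samples/hour
        "stream_11_15min": min(window_min // 15, truncate_to), # 4 -> keep first 3
    }
```

Shrinking the window below 1 hour scales all three lengths down proportionally, which is how the shorter-window experiments mentioned above were configured.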
5.3. Evaluation
Each element in the output sequence is produced by a multiclassifier, and we used a multiclass evaluation indicator: the macroaverage F1 score [40]. There are four possible results for fault detection, where a detection result is a particular time length before the production break: (1) the result is positive, and the true value is also positive; the symbol TP (True Positive) denotes the number of such results for a given fault type and detection result. (2) The result is positive, but the true value is negative; the symbol used is FP (False Positive). (3) The result is negative, and the true value is also negative; the symbol used is TN (True Negative). (4) The result is negative, but the true value is positive; the symbol used is FN (False Negative). They are shown in Figure 9. Based on the definitions above, we counted TP, FP, and FN for each type of fault. Afterwards, we calculated the precision and recall in equation (15). We also provide three confusion matrices, for the typical Fault1, Fault9, and Fault17.

The F1-score for every output of each fault type can be described as

F1 = 2 · Precision · Recall / (Precision + Recall).

The average F1-score is

macro-F1 = (1/N) Σ F1,

where N is the number of output classes in every element of the output sequence.
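The macroaverage F1 computation above can be written directly from the TP/FP/FN counts per class:

```python
def macro_f1(y_true, y_pred, n_classes):
    """Macro-average F1: per-class precision/recall/F1, then unweighted mean."""
    f1_sum = 0.0
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1_sum += 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return f1_sum / n_classes
```

Because every class contributes equally to the mean, rare time-length classes weigh as much as common ones, which is why the macroaverage suits the class-imbalanced fault data discussed in Section 2.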
Table 3 shows the F1-score for each type of fault. There are low scores for identifying Fault15 and Fault16. The main reason is that the correlation between these faults and the sensing data is very low; therefore, we considered them exceptions and ignored their results. In fact, the F1-score should exceed 0.8 for a classifier to be considered acceptable. However, most of the data shown in Table 3 cannot satisfy this, since the correlation between the data and the faults is not always linearly related to the time before the production break. The ultimate goal of fault detection is a binary classifier that detects whether a fault occurs. Thus, as described above, we chose a threshold to convert the multiclassifier over time lengths into a two-class classifier; the model then performs better. We display the performance of the proposed approach in Table 4 and compare it with other approaches, including a basic Principal Component Analysis (PCA) method, a typical LSTM-based encoder-decoder structure, an optimized LSTM [41], and a Support Vector Machine with a linear kernel plus an autoencoder method. We used the F1-score to evaluate them.
As shown by the experimental results in Table 4, the traditional LSTM structure performs poorly. The main reason is that the input sequence is too long: a traditional LSTM structure lacks global information, and the update and forget gates in LSTM cells suffer from vanishing gradients during propagation. Among the baselines, only the autoencoder model performs better. AM not only improves the accuracy of fault detection, as shown in Table 4, but also identifies the specific factors that cause the faults. In an encoder-decoder model with AM, each output (corresponding to a fault type) is deduced from a specific feature vector, which is calculated from all inputs and a weight vector; this structure is illustrated in Figure 7. The weight vector, that is, the attention weights, indicates the correlation of each input factor with the fault. In the experiments above, illustrated in Figure 10, we show the weight vectors for some faults. The x-axis represents the factors (i.e., sensors), and the y-axis represents the weight values. Accordingly, we can identify the specific factors that cause the faults: factors with a high correlation have high weights. For example, as shown in Figure 10, the specific factors that cause Fault9 are the sensors with IDs 21, 17, and 11.

6. Conclusions
The main goal of this paper is accurate fault prediction and cause identification in the industrial production process. We propose a new spatial input sequence, which is different from a traditional time sequence or time series data. This sequence can solve the problem of input dimension changes in a traditional time series; moreover, each element in the input sequence comes from a different production position, which will provide the possibility of identifying their correlation with faults. According to the spatial sequence, we propose branched LSTM to adapt to the branch structure in the production process. These structures are then used in an encoder-decoder model, and an AM algorithm is used to solve the problem of long sequence inputs. Finally, the weight vectors in AM can be used to indicate the correlation between input data and faults.
Experimental results show that the approach is capable of identifying critical factors and improves prediction accuracy. The main drawback of this approach is that the AM is computationally complicated: the algorithm occupies a large amount of computing resources and has poor real-time performance. Therefore, future work will focus on optimizing the model structure to make it more suitable for fault detection on industrial big data. Another drawback is that the output model requires prior knowledge.
Data Availability
The data generated by the TEP simulation platform are used to support the findings of this study, and the method of obtaining data is described in detail within the article.
Conflicts of Interest
The author declares that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This article was funded by the National Natural Science Foundation of China (61772122).