Abstract
Roof water hazards severely threaten the safety of coal mine production. As the most direct data on the state of the roof, the big monitoring data from hydraulic supports is of great importance to coal mine safety. Drawing on the idea of multiscale temporal feature extraction in the empirical mode decomposition algorithm, this paper proposes a neural network model based on multitimescale feature extraction for predicting the pressure-bearing data of hydraulic supports at coal mine working faces. The proposed model excels at capturing temporal features at multiple scales in a time series, thus improving prediction accuracy. We collected 80 days of monitoring data from 17 hydraulic supports at a working face and carried out comparative experiments with several existing models. The experiments show that the proposed neural network model is stable and reliable, with small prediction errors, and can provide important support for the prevention and control of roof water hazards.
1. Introduction
China is rich in coal resources, with the world’s fourth largest total proven reserves [1, 2]. Coal has always been China’s main source of energy, dominating China’s primary energy production with a share of nearly 70% from 1978 to 2019 [3]. China also accounts for nearly 50% of the world’s coal production, ranking first [4]. With the rise of new energy industries in recent years, the share of coal in China’s primary energy production and consumption is gradually decreasing, but the total production and consumption of coal is still increasing year on year due to the growth in total energy demand. Coal will remain an important energy source and economic pillar for China for some time to come. The safety of coal production has therefore become a major concern for the country.
Water disaster accidents are among the main disasters causing casualties and property damage in coal mines. Due to the complex hydrogeological conditions in China, the threat of water disaster is serious. In recent years in particular, with the growth of China’s energy demand, the scale and depth of coal mining have kept increasing, making the hydrogeological conditions at coal mine working faces ever more complex and the threat of water disaster ever more prominent [5]. The most direct water disaster faced by coal mines is water gushing from the roof of the coal seam, and hydraulic support monitoring data is the most intuitive reflection of the state of the roof. The prediction of hydraulic support pressure data is therefore an important step in the prevention and control of water disaster on the coal mine roof.
Traditional methods are the most widely used. They obtain geological data from the working face by technical means such as geological exploration and then use mechanical or empirical formulas to analyse and estimate quantities such as the development of water-conducting fracture zones. Liu’s “three top layers” theory [6] and Wu’s “three maps-two predictions” method [7, 8] are the most widespread methods currently in use. These methods predict and evaluate the roof water hazard prior to mining by analysing the development of water-conducting fracture zones and the water-richness of water-filled aquifers.
Numerical simulation is also a widely used method. A finite element model of the working face is established in simulation software combined with the relevant mechanical equations, and the entire mining process is simulated to predict the forces on the roof of the working face. Li et al. [9] obtained the groundwater distribution, seepage thickness, and floor failure thickness by the transient electromagnetic method, borehole method, and water injection test method, and then carried out a numerical simulation using a constitutive model. Wang et al. [10] used the Johnson–Holmquist damage constitutive model and introduced the smooth particle hydrodynamics method to simulate the evolution of tunnel burst water under blasting action. Li et al. [11] carried out a numerical simulation based on the Drucker–Prager elastic-plastic damage theory to presimulate the changes in the geology around the working face during mining. However, due to the complexity of the geological conditions, the numerical simulation process requires a series of assumptions about boundary conditions and other values. These assumptions can affect and distort the simulation results.
Machine learning algorithms have been successful in recent years in areas such as task offloading, healthcare, transportation, and anomaly monitoring [12–17]. At the same time, with the development of intelligent mines and unmanned working faces, a large number of sensors have been placed underground and a huge amount of monitoring data has accumulated. As an objective information carrier, monitoring data can truly reflect the real-time status of the monitored object, and machine learning algorithms focus precisely on extracting hidden, valid, and understandable knowledge from huge amounts of data. Researchers have therefore begun to investigate how machine learning algorithms can be used to predict coal mine working face data in order to obtain an accurate picture of the state of the roof. A Prophet + LSTM model has been constructed to predict the mine pressure at the working face, achieving effective prediction of pressure changes as the working face advances [18]. An initial roof pressure prediction model combining backpropagation and genetic algorithms has been developed for the Datong mine [19]. Liu et al. [20] combined the particle swarm optimization algorithm and a radial basis function network to construct a model for predicting the height of the hydraulic conductivity fracture zone. Guo et al. [21] established a fuzzy neural network-based water flow height prediction model for fissure zones by combining several factors, such as coal seam burial depth, inclination angle, and working face size. A random forest regression-based model has been proposed for predicting the height of fracture conductivity zones [22]. Ma et al. [23] developed two exponentially weighted moving average modified grey models for predicting water inflow from faults.
A random forest algorithm with a competing adaptive weighting algorithm is combined to develop a partial least squares regression model based on laser-induced fluorescence spectral data to predict the water inflow from different mines [24].
Empirical Mode Decomposition (EMD) is a commonly used method for analysing time-series data. By decomposing the input data into several intrinsic mode functions, it helps us to understand the variation pattern of the data in different modes and to extract the characteristics of the data [25]. However, the EMD algorithm imposes certain constraints on the data to be analysed. As a result, the number of intrinsic mode functions produced by the decomposition is not controllable, which is detrimental to the prediction of dynamic monitoring data from the working face using neural networks.
To solve the problem of working face pressure prediction, this paper proposes a neural network model based on multitimescale feature extraction for predicting the monitoring data of hydraulic supports at coal mine working faces, which improves prediction performance compared with existing methods. The temporal feature extraction method in the proposed model is borrowed from the EMD algorithm, but it eliminates the data constraints of the EMD algorithm, making it more suitable for neural network models and more convenient for subsequent applications. The experimental results demonstrate that our proposed model excels at capturing temporal features in time series, which improves its prediction performance. The proposed method can provide an important basis for early warning of the roof water hazard in coal mines.
The rest of the paper is organised as follows. Section 2 describes the architecture and algorithms of our proposed model. Section 3 provides a description of the experimental data. Section 4 presents the experimental design, results and analysis, and Section 5 offers the conclusions.
2. Model Architecture and Algorithms
EMD has been proposed as a fundamental part of the Hilbert–Huang transform [26]. The basic idea is to decompose a complex irregular sequence into a number of sequences of a single frequency plus a residual sequence. These single-frequency sequences are the values of the intrinsic mode functions (IMFs) within this data segment. The data decomposition steps are as follows.
First, find all the local maxima in the original data series x(t) and fit them with a spline function to form the upper envelope e_max(t). Similarly, find all the local minima to form the lower envelope e_min(t). Calculate the mean envelope m(t) = (e_max(t) + e_min(t))/2 and subtract it from x(t) to obtain the intermediate sequence h(t) = x(t) - m(t). At this point, it is necessary to determine whether h(t) is the IMF we need. An IMF must satisfy the following two constraints: (1) the number of extreme points and the number of zero crossings must be equal or differ by at most one throughout the data segment; (2) at any point in time, the mean of the upper and lower envelopes is zero; that is, the upper and lower envelopes are locally symmetric about the time axis.
If h(t) does not satisfy these conditions, repeat the above steps with h(t) as the new input sequence until both constraints are met, yielding the first IMF c_1(t). Subtracting c_1(t) from the original sequence gives the residual r_1(t) = x(t) - c_1(t), which is then treated as the new original sequence to obtain the next IMF. This iteration continues until all IMFs are decomposed; it generally stops when the residual contains no more than two extreme values.
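The sifting procedure above can be sketched in a few lines of Python. This is an illustrative sketch rather than a full EMD implementation: linear interpolation (np.interp) stands in for the spline envelope fit, and the helper names sift_once and is_imf are our own.

```python
import numpy as np

def _extrema(x):
    """Indices of local maxima and minima of a 1-D series."""
    d = np.diff(x)
    maxima = np.where((d[:-1] > 0) & (d[1:] < 0))[0] + 1
    minima = np.where((d[:-1] < 0) & (d[1:] > 0))[0] + 1
    return maxima, minima

def sift_once(x):
    """One sifting step: h(t) = x(t) - m(t), with m(t) the mean envelope."""
    t = np.arange(len(x))
    maxima, minima = _extrema(x)
    if len(maxima) < 2 or len(minima) < 2:
        return None  # too few extrema to build envelopes
    upper = np.interp(t, maxima, x[maxima])  # upper envelope e_max(t)
    lower = np.interp(t, minima, x[minima])  # lower envelope e_min(t)
    m = (upper + lower) / 2.0                # mean envelope m(t)
    return x - m                             # candidate h(t)

def is_imf(h):
    """Constraint (1): extrema count and zero-crossing count differ by at most one."""
    maxima, minima = _extrema(h)
    zero_crossings = int(np.sum(h[:-1] * h[1:] < 0))
    return abs(len(maxima) + len(minima) - zero_crossings) <= 1
```

Repeatedly applying sift_once until is_imf holds yields one IMF; subtracting it from the input and restarting yields the next.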
The EMD algorithm can theoretically decompose any type of complex sequence into a finite number of IMFs. Each IMF component contains features from a different time scale of the original series, which is advantageous for time-series prediction: large-scale temporal features help the model understand the overall trend of the series, while small-scale temporal features allow it to capture the details of the data variation. However, the volume of monitoring data gradually accumulates as mining proceeds. This variation in the amount of data makes the number of IMFs produced by the EMD algorithm uncertain, which is unacceptable for a neural network. At the same time, the EMD algorithm requires all extreme points to be involved in the calculation during decomposition, so the data must be decomposed again after every update, which is not conducive to dynamic learning updates. We therefore use neural networks to simulate the decomposition process of EMD and perform multiscale temporal feature extraction on the input data. The overall architecture of the model is shown in Figure 1.

The model consists of two modules. One is the multitimescale feature extraction module, which decomposes the input data into multiple time-scale feature components similar to the IMF components. The other is the feature fusion and prediction module, which fuses the feature components and produces the final prediction results.
2.1. Multitimescale Feature Extraction Module
We use a combination of convolutional and recurrent neural network layers to process the input data step by step, allowing the module to obtain multiple feature components at different time scales. As shown in Figure 2, the module consists of alternating convolutional and GRU layers. The feature points of the sequence are extracted by the convolutional layers, and the temporal features are then abstracted and recorded by the GRU layers. The deeper the layer, the larger the scale of the extracted features.

2.1.1. Convolutional Component
The first component of the feature extraction module is a pooling-free convolutional neural network. The convolutional layers used in this paper are one-dimensional convolutions. The features of the data in different temporal dimensions are extracted layer by layer through multiple one-dimensional convolutional layers. The scale of the temporal features extracted by the convolutional layers becomes larger as the number of layers increases.
The convolution layer consists of filters of width k and height one (since it is a one-dimensional convolution). When the input vector x is passed through a filter of the convolution layer, the output is

h = ELU(W * x + b),

where W is the weight vector learned in the convolutional layer, b is the bias term, and the exponential linear unit (ELU) activation function [27] is calculated as

ELU(z) = z if z > 0, and ELU(z) = α(e^z - 1) if z ≤ 0,

where α is an adjustable hyperparameter controlling the saturation of the ELU for negative inputs; α is normally taken as 1. In the feature extraction module, the output of each convolutional layer is the feature vector h. To ensure that temporal feature extraction is complete and efficient, we only perform valid convolution; that is, we do not zero-pad the input of the convolution.
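As a concrete illustration (a sketch, not the authors' code), a valid one-dimensional convolution followed by the ELU activation can be written directly in NumPy; the names conv1d_valid and elu are our own:

```python
import numpy as np

def elu(z, alpha=1.0):
    """ELU(z) = z for z > 0, alpha * (exp(z) - 1) otherwise."""
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def conv1d_valid(x, w, b=0.0):
    """Valid 1-D convolution (no zero padding): output length is len(x) - k + 1."""
    k = len(w)
    out = np.array([np.dot(x[i:i + k], w) + b for i in range(len(x) - k + 1)])
    return elu(out)
```

Because no zero padding is used, each layer shortens the sequence by k - 1 points, which matches the "effective convolution" described above.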
2.1.2. Recurrent Component
The output of each convolutional layer can be regarded as the feature points of one feature component. In order to record this information and obtain the feature components at different scales while ensuring that the first features to enter the network are not lost, we choose the gated recurrent unit (GRU) as the recurrent component of the feature extraction module and insert it between the convolutional layers.
The GRU, a variant of long short-term memory (LSTM), was proposed by Cho et al. [28]. Both GRU and LSTM are designed to solve the long-term memory problem in traditional recurrent neural networks [29]. The GRU simplifies the network structure and computation of the LSTM to speed up network training. Compared with the LSTM, which has one memory cell and three gate units, the GRU has only two gate units in the hidden cell to control the memory and output of values inside the cell.
As shown in Figure 3, the GRU’s hidden unit fuses the memory cell with the hidden state. A reset gate r_t controls whether the past hidden state is forgotten and is computed as

r_t = σ(W_r x_t + U_r h_{t-1}),

where σ is the sigmoid activation function, x_t is the input, h_{t-1} is the previous hidden state, and W_r and U_r are weight matrices obtained from training. Under the control of the reset gate, a candidate hidden state h̃_t is computed from the past hidden state h_{t-1} and the input x_t:

h̃_t = tanh(W x_t + U(r_t ⊙ h_{t-1})),

where W and U are weight matrices learned during training and ⊙ denotes element-wise multiplication. Finally, the update gate z_t is trained to control how much of the candidate state enters the output h_t. The update gate is calculated in a similar way to the reset gate:

z_t = σ(W_z x_t + U_z h_{t-1}),
h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t,

where W_z and U_z represent the weight matrices obtained from training.
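The gate computations above can be condensed into a single-step function. This is an illustrative NumPy sketch of a standard GRU cell in the formulation of Cho et al. [28], not the authors' implementation; bias terms are omitted and the dictionary key names are our own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step; p holds the weight matrices Wr, Ur, W, U, Wz, Uz."""
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)           # reset gate
    h_cand = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev))  # candidate hidden state
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)           # update gate
    return z * h_prev + (1.0 - z) * h_cand                  # new hidden state
```

With all weights at zero, both gates output 0.5 and the candidate state is zero, so the new hidden state is exactly half of the previous one, which is a quick sanity check of the update equation.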

2.2. Feature Fusion and Prediction Module
The feature extraction module fixes the number of times the input sequence is decomposed. The advantage of this approach is that it maintains a uniform dimensionality of the output data, which facilitates subsequent processing by the neural network. However, a fixed number of decompositions may produce redundant, interfering features when decomposing simple sequences. To solve this problem, we add an attention mechanism to the module: feature fusion and prediction are achieved using a residual structure with an attention mechanism and a GRU layer.
2.2.1. Attention Mechanisms
Attention mechanisms were first proposed in the field of image processing. The original intention of the algorithm is to simulate human attention when processing images, focusing on what is of interest and blurring the parts that are not, with the aim of filtering information efficiently. An attention mechanism is essentially a mapping from queries to a series of key-value pairs. Google’s team used scaled dot-product attention in their transformer model [30], where the sets of queries, keys, and values are each packed into a matrix to speed up the computation of the attention function.
The calculation of the attention function in this paper also takes matrix form. The purpose of the attention function in the module is to focus the algorithm’s attention on useful features and blur redundant ones. We choose self-attention as the attention function in the module: all queries, keys, and values come from the same place, which in this paper is the output of the feature extraction module. The outputs of the feature extraction module are combined and packed into a matrix F with dimensionality n × d, where n is the number of features and d is the dimensionality of a single feature. The output of the attention layer is a matrix of the same n × d dimensionality. The self-attention layer is calculated as

Attention(F) = softmax((F W_Q)(F W_K)^T / √d)(F W_V),

where W_Q, W_K, and W_V are the weight matrices obtained by training the fully connected layers. To let the self-attention layer focus autonomously on valid features, we use fully connected layers to measure the inter-feature correlation by means of feedback learning.
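As an illustration of the layer described above (a sketch under our own naming, with the trained fully connected layers represented by the matrices Wq, Wk, and Wv), scaled dot-product self-attention over the n × d feature matrix F can be written as:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(F, Wq, Wk, Wv):
    """Queries, keys, and values all come from the same input F (n x d)."""
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # inter-feature correlation
    return softmax(scores, axis=-1) @ V      # n x d output, same shape as F
```

Each output row is a convex combination of the value rows, so features that receive high attention weights dominate the fused representation while redundant features are suppressed.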
2.2.2. Residual Blocks
The residual block is calculated by adding the output F(x) of a series of transformations back to the original input x, that is, y = F(x) + x.
Residual blocks were proposed to solve the degradation problem that occurs during the training of deep neural networks. This connection across multiple layers allows the output of the shallow layers to pass directly into the deeper layers. When the shallow-layer result is already optimal, the deep neural network can push the weights of the nonlinear layers in F(x) towards zero to keep the overall result accurate.
The main role of the residual block in our model, however, is not to solve the degradation problem of deep networks. As shown in Figure 4, we add normalisation to the residual calculation. This allows the larger values in the attention layer (features that receive a high level of attention) to be further amplified by the residual calculation, while pushing the redundant information as close to zero as possible.
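A minimal sketch of the residual calculation with normalisation follows. The paper does not specify the exact normalisation used, so the row-wise layer_norm placement here is our assumption, as are the function names.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each row to zero mean and unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def residual_block(x, f):
    """y = Norm(F(x) + x): the transformed features are re-added to the input."""
    return layer_norm(f(x) + x)
```

When f is the attention layer, the normalisation rescales each fused feature row, so strongly attended features stay large relative to redundant ones, which are pushed towards zero.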

3. Experimental Data
The data used in this paper are the hydraulic support pressure monitoring data from an underground working face in a coal mine. The hydraulic support pressure data is an important quantity in coal mine safety: it not only reflects the changes in roof stress above the working face but also records the working state of the hydraulic supports.
In Figure 5, it is clear that there is a periodicity in the strut pressure curve, which is caused by the working cycle of the hydraulic support.

As shown in Figure 6, the working cycle curve of the general hydraulic support pressure can be divided into four stages: (I) the initial support stage, when the support is lifted to begin supporting the roof and reaches the initial support force; (II) the load-bearing stage, when the hydraulic support holds the roof and the pressure on the support rises gradually as the coal miner continues to work; (III) the adjacent hydraulic support moving stage, during which the adjacent hydraulic support is lowered and moved forward, some of the pressure originally carried on the adjacent support is instantaneously transferred to the present support; and (IV) lowering the support stage, the pressure carried by the hydraulic support is rapidly reduced when it is lowered.

In order to reflect the intensity of the mine pressure acting on the roof of the working face and to evaluate the adequacy of the rated working resistance of the support, we need to calculate and filter the pressure data further. The end-of-cycle resistance of the support is an important parameter for evaluating these indicators: it is the maximum working resistance of the support over a period of time before the support is moved. We convert the monitored individual strut pressure data by

P = 2 × (π/4)(p_f d_f² + p_b d_b²),

where P is the working resistance of the hydraulic support obtained by conversion, p_f and p_b represent the pressure values of one front strut and one back strut of the four-column support, and d_f and d_b represent the inner diameters of the front and back struts, respectively. We filter the end-of-cycle resistance data manually. We collated the monitoring data and the end-of-cycle resistance data into separate prediction series. Using the data from moment t-n to moment t, the data at moment t+1 is predicted.
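Since the symbols of the conversion formula were lost in typesetting, the sketch below reconstructs the common four-column conversion as an assumption: each strut force is pressure times piston area, and each of the two monitored struts stands in for a front/back pair. The function name and the unit choices are ours.

```python
import math

def working_resistance(p_front, p_back, d_front, d_back):
    """Support working resistance (MN) from strut pressures (MPa) and strut
    inner diameters (m). Assumes a four-column support (two front, two back
    struts), each strut force = pressure * piston area (pi * d^2 / 4)."""
    area_front = math.pi * d_front ** 2 / 4.0
    area_back = math.pi * d_back ** 2 / 4.0
    return 2.0 * (p_front * area_front + p_back * area_back)
```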
4. Experiments
The experiments were conducted on two data sets: hydraulic support pressure monitoring data and hydraulic support end-of-cycle resistance.
4.1. Training and Validation
The Adam optimization algorithm [31] was used to update the network weights during training. The advantage of the Adam algorithm is that it requires only first-order gradients and can adaptively adjust the learning rate. It is therefore computationally efficient, has low memory requirements, and is suitable for nonstationary objectives and sparse-gradient problems.
Validating the model helps us obtain a reliable and stable model during training. Considering the strong dependence of a sequence prediction model on data ordering, we used nested cross-validation to validate the model. A common nested cross-validation method called forward-chaining splits the training and test sets multiple times and then averages the errors over these splits. The data segmentation method for forward-chaining is illustrated in Figure 7. Its biggest advantage for time-series data is that it preserves the temporal dependencies in the data across the multiple splits, so splitting the time series does not cause information leakage.
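The forward-chaining splits can be sketched as follows (an illustrative helper with our own naming): each split trains on an expanding prefix of the series and tests on the fold that immediately follows it, so test data never precedes training data.

```python
def forward_chain_splits(n, n_splits, min_train):
    """Yield (train_indices, test_indices) for forward-chaining validation
    over a series of length n, starting from a minimum training prefix."""
    fold = (n - min_train) // n_splits  # size of each test fold
    for i in range(n_splits):
        train_end = min_train + i * fold
        yield list(range(train_end)), list(range(train_end, train_end + fold))
```

Averaging the model error over all yielded splits gives the cross-validated error estimate described above.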

The parameters of the model are set as shown in Table 1. The parameters α, β₁, β₂, and ε of the Adam optimization algorithm use the values from the original paper, which its authors experimentally verified to achieve excellent results. The number of GRU units and the number of temporal feature component layers are determined by grid search. We split the data a total of ten times, with the last split having a ratio of approximately 7 : 3 between the training and test sets. The validation results on the two data sets are shown in Figure 8.

4.2. Experimental Results
We experimented with and compared our proposed model against models from a number of existing studies. The comparison models include GRU, the hybrid method coupling artificial neural networks with genetic algorithms (ANN-GA) [32], the temporal convolutional network (TCN) [33], the deep neural network-based traffic flow prediction model (DNN-BTF) [34], and a prediction model combining LSTM and support vector machine (LSTM + SVM) [35]. These models either use algorithms similar to ours or address problems similar to the one studied in this paper. A grid search was also performed for some parameters of each comparison model to ensure its prediction accuracy.
Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are chosen as the metrics for evaluating the prediction results. Since the data used in our experiments differ from those in the original papers of the comparison methods, we modified the input and output dimensions of the comparison models. All prediction errors in the comparison experiments are averaged over the ten splits.
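For reference, the two metrics are computed as follows (a plain-Python sketch):

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error: penalises large errors quadratically."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

RMSE weights large deviations more heavily than MAE, so reporting both gives a fuller picture of the error distribution.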
Hydraulic Support Pressure Monitoring Data Set. This data set collects roughly two and a half months (80 days) of pressure monitoring data from 17 hydraulic supports at the working face. The sensors collected data at five-minute intervals, for a total of 292,294 data points. Using the same parameters and data partitioning methods as in the model validation, we experimented with all the models using data from different hydraulic supports separately. The prediction results of the model proposed in this paper are plotted in Figure 9, and the prediction errors for all models are shown in Table 2.

As previously described during the data processing, the hydraulic support pressure monitoring data has a regular cyclical variation, which is evident from the experimental results. Several models have good prediction results with very similar errors and no great differences in performance. Our model predicts slightly better than the other models.
Hydraulic Support End-of-Cycle Resistance Data Set. This data set is obtained by calculation and filtering from the support pressure monitoring data set. It contains end-of-cycle resistance data for the 17 hydraulic supports, with a total of 20,723 data items. We conducted experiments similar to those on the previous data set. The prediction results of the model proposed in this paper are plotted in Figure 10, and the prediction errors for all models are shown in Table 3.

Comparing these results with the previous experiment, the hydraulic support end-of-cycle resistance data lacks a regular, distinctly cyclical character. Here our proposed model demonstrates its ability to extract temporal features at different scales, achieving the best prediction results on all hydraulic support data. Apart from our model, the GRU model and the DNN-BTF model give the best predictions; both use the GRU network, which has long been proven to handle time-series forecasting problems well. The LSTM + SVM model, although it has a similar network structure, does not perform well, since the SVM algorithm is not good at handling complex sequence prediction problems. The TCN model uses multiple CNN layers for temporal feature extraction at different scales; however, CNNs cannot retain long-term features, and their receptive field is limited by the input step size. The ANN-GA model uses a traditional multilayer perceptron and shows mediocre prediction performance.
5. Conclusions
In this paper, a neural network model based on multitimescale feature extraction is proposed for real-time prediction of hydraulic support pressure and working resistance at coal mine working faces. A series of comparative experiments and analyses shows that the model achieves higher prediction accuracy than the comparison models. These experiments also demonstrate that our model has better time-scale feature extraction capability, while its network structure is more flexible and can adapt to data with different characteristics. Overall, the model proposed in this paper is a stable and reliable method for predicting hydraulic support pressure in coal mine workings. It may also be applicable to other time-series prediction problems, such as malware propagation [36] and the spread of COVID-19 [37, 38].
Our model also has shortcomings, the most prominent of which is slow training. The structure of the multitimescale feature extraction module allows us to extract more complete time-series features, but it also increases the computational cost of the model. In future work, we will further improve the model so that it can extract multiscale temporal features more quickly and accurately. This will improve the accuracy of the prediction results and provide technical support for preventing the roof water hazard and ensuring coal mine safety.
Data Availability
The data used for the experiments in this paper are monitoring data from coal mine production. It is confidential data of the coal mining company. Therefore, data access is not available.
Conflicts of Interest
All authors declare that they have no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported in part by the Natural Science Basic Research Plan of Shaanxi Province (no. 2019JM-348) and in part by the Key Research and Development Program of Shaanxi Province (no. 2019GY-056).