Abstract
This paper proposes a Bayesian-optimized extreme gradient boosting (XGBoost) model to recognize small-scale faults across coalbeds using reduced seismic attributes. Firstly, the seismic attributes of the mining area were preprocessed to remove abnormal and high-noise samples. Secondly, chi-square binning was performed for each feature of the processed attributes. The weight of evidence (WOE) was calculated in each bin, and each feature's information value (IV) was obtained to characterize its importance. Features with low information values were removed to eliminate high-noise attributes. Thirdly, the reduced attributes were decomposed by variational mode decomposition (VMD) to obtain new features. Finally, the optimized XGBoost model and traditional methods were constructed to identify and locate faults across coalbeds. Here, the XGBoost objective function was improved to balance the training weights of asymmetric samples, and the hyperparameters were tuned by Bayesian optimization. Because the standard Bayesian optimization acquisition function does not easily balance "exploitation" and "exploration," it readily falls into local optima; therefore, this paper proposes an adaptive balance-factor adjustment algorithm to overcome this shortcoming. Comparing the identification outcomes, the optimized XGBoost model has higher prediction accuracy than the BP neural network, support vector machines (SVMs), K-means clustering, extreme learning machines (ELMs), and random forests (RFs). In summary, the proposed method can improve the identification accuracy of small-scale faults in coal mining areas.
1. Introduction
Small-scale faults are prone to causing geological accidents during coal mining, including roof falls, water inrushes, and gas outbursts. Therefore, the accuracy of small-scale fault identification is essential for estimating mining risks and improving mining efficiency.
Because their throw ranges from one or two meters to about ten meters, small-scale faults only slightly distort seismic events and/or decrease seismic amplitude in seismic sections. Depending on the fault scale, this phenomenon may extend laterally from tens of meters to hundreds of meters [1]. The "Coal Seismic Prospecting Standard" issued in 2017 stipulates that three-dimensional seismic exploration of mining areas should identify faults with a throw of 5 meters or more, with lateral errors of less than 30 meters. At present, seismic exploration of coal mining areas aims to identify small-scale faults with a throw of less than 5 meters, further increasing mining efficiency and avoiding potential risks. Therefore, it is critically important to develop identification methods for small-scale faults.
Small-scale fault recognition technology includes manual interpretation methods, seismic attribute interpretation technologies [2–4], and automatic recognition methodologies [5–7]. Manual interpretation determines whether a small-scale fault exists by observing, with the naked eye, differences in waveform, amplitude, and travel time across seismic traces in the seismic section. However, because of the limited resolution of the human eye, it is difficult to detect small changes in seismic characteristics, and the method is strictly restricted by the resolution of the seismic data. Seismic waves reflected from deep strata have a low dominant frequency and a narrow frequency band caused by strong absorption and attenuation; their resolution is low, which increases the difficulty of manually identifying small-scale faults. Seismic attribute interpretation uses forward simulation, coherence-cube technology, ant-tracking technology, and other methods to improve the accuracy of small-scale fault identification. However, the process requires interpreters to work with human-computer interaction software according to the characteristics of the relevant seismic waves; the interpretation takes a long time, and the efficiency is low. For small-scale faults whose throw is less than 5 meters, the attribute changes of the seismic reflection are directly related to the interpreter's experience. Automatic identification of small-scale faults is the leading research direction in the literature. The BP neural network and SVM have been used to recognize small-scale faults in many practical projects. However, the prediction accuracy of a neural network depends on the design of its architecture: a deeper network brings increased accuracy but also increases the training time, and the network does not understand the meaning of the data. SVM can achieve good results when the sample size is small, but as data acquisition improves and the data size and collection time increase, its prediction accuracy decreases.
This paper proposes an optimized XGBoost model to identify small-scale faults. The XGBoost objective function is improved to account for the asymmetric distribution of samples. The optimal XGBoost hyperparameters are obtained through Bayesian optimization, and the Bayesian optimization acquisition function is improved to prevent the search from falling into a local optimum. The proposed model improves the accuracy and robustness of small-scale fault identification in coal mining areas, as validated on a forward-modeled seismic section and a practical seismic section.
2. Theory
2.1. WOE and IV
Weight of evidence (WOE) is often used in risk assessment and credit card rating [8, 9]. Information value (IV), obtained as a weighted sum of WOE values, measures the predictive ability of independent variables on dependent variables. Therefore, it is an effective tool for feature selection [10].
F.P. Agterberg proposed WOE in 1989, inspired by the concept of entropy.
For a continuous random variable X with probability density function p(x), the entropy is defined as follows:

H(X) = -\int p(x)\,\ln p(x)\,dx.
For two random variables X and Y, relative entropy describes the difference between their probability distributions and is defined as follows:

D(Y \,\|\, X) = \int p_Y(y)\,\ln\frac{p_Y(y)}{p_X(y)}\,dy = E_Y\!\left[\ln\frac{p_Y}{p_X}\right],

where p_Y and p_X represent the probability densities of Y and X, respectively, and E_Y represents the expectation with respect to the random variable Y.
Information value is a tool based on relative entropy that describes the difference between the distributions of positive and negative samples. It is defined as the sum of the relative entropy of the negative samples with respect to the positive samples and the relative entropy of the positive samples with respect to the negative samples, namely,

IV = D(p_{bad} \,\|\, p_{good}) + D(p_{good} \,\|\, p_{bad}),

where p_{bad} and p_{good} represent the probability densities of the negative and positive samples, respectively.
The evaluation criteria for the correlation between information value and characteristics are shown in Table 1.
For a selected feature X, the weight of evidence is defined as the difference between the log-likelihoods of the positive and negative samples. In practical applications, WOE is computed after binning the feature values. Suppose that the random variable X is divided into k intervals, and Good_i and Bad_i represent the numbers of positive and negative samples in the ith interval, respectively. Then

WOE_i = \ln\frac{Good_i/G}{Bad_i/B},\qquad IV = \sum_{i=1}^{k}\left(\frac{Good_i}{G}-\frac{Bad_i}{B}\right)WOE_i,

where G and B represent the total numbers of positive and negative samples, respectively.
The method of characteristic binning affects the value of IV. Standard binning methods include equal-distance binning and equal-frequency binning. In this paper, the chi-square binning method was used.
Binning refers to discretizing continuous variables, and chi-square binning relies on the chi-square test. It merges the adjacent intervals with the smallest chi-square value until the chi-square values of all adjacent intervals are greater than or equal to the set threshold, or the number of bins reaches the set maximum. The specific steps of chi-square binning are as follows:
Step 1. Sort the values of the variable and place each distinct value into its own initial interval.
Step 2. Merge intervals: calculate the chi-square value of each pair of adjacent intervals and merge the adjacent intervals with the lowest chi-square value until the stopping condition is reached. The chi-square statistic is defined as follows:

\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(A_{ij}-E_{ij})^2}{E_{ij}},

where A_{ij} represents the number of instances of class j in the ith interval, and E_{ij} represents the expected frequency of A_{ij}.
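For illustration, the following is a minimal Python sketch of chi-square binning followed by WOE and IV computation. It is a simplified implementation of the procedure described above; the smoothing constant, the threshold (3.841, the 95% chi-square critical value for one degree of freedom), and the maximum bin count are illustrative choices, not values taken from the paper.

```python
import numpy as np

def chi2_pair(a, b):
    """Chi-square statistic for two adjacent bins, each given as [n_neg, n_pos]."""
    table = np.array([a, b], dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    expected[expected == 0] = 1e-12               # guard against division by zero
    return float(((table - expected) ** 2 / expected).sum())

def chi_merge(x, y, max_bins=6, threshold=3.841):
    """Merge adjacent intervals with the smallest chi-square value (ChiMerge-style)."""
    x, y = np.asarray(x), np.asarray(y)
    cuts = list(np.unique(x))                     # one initial interval per value
    bins = [[int(np.sum((x == v) & (y == 0))),
             int(np.sum((x == v) & (y == 1)))] for v in cuts]
    while len(bins) > 1:
        chis = [chi2_pair(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = int(np.argmin(chis))
        # stop when all adjacent chi-square values reach the threshold
        # or the number of bins reaches the allowed maximum
        if chis[i] >= threshold or len(bins) <= max_bins:
            break
        bins[i] = [bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1]]
        del bins[i + 1], cuts[i + 1]
    return cuts, np.array(bins, dtype=float)

def woe_iv(bins, eps=0.5):
    """WOE per bin and the total IV, given per-bin [n_neg, n_pos] counts."""
    bins = bins + eps                             # smoothing for empty bins
    p_neg = bins[:, 0] / bins[:, 0].sum()
    p_pos = bins[:, 1] / bins[:, 1].sum()
    woe = np.log(p_pos / p_neg)                   # per-bin log-likelihood difference
    iv = float(np.sum((p_pos - p_neg) * woe))
    return woe, iv
```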
2.2. VMD
VMD is an adaptive and nonrecursive mode decomposition method used in signal processing [11]. Its advantage is that the number of modes is determined adaptively: according to the actual situation, VMD determines the number of modal decompositions of the given sequence and, during the subsequent search and solution process, adaptively matches the optimal center frequency and limited bandwidth of each mode. It can effectively separate the intrinsic mode functions (IMFs) of a signal in the frequency domain, obtain the decomposed components of the given signal, and finally obtain the optimal solution of the variational problem. It overcomes the endpoint effect and mode aliasing of the EMD method and has a more solid mathematical foundation. VMD can decompose a highly complex, strongly nonlinear, and nonstationary time series into relatively stable subsequences containing multiple frequency scales, which makes it suitable for nonstationary sequences. The core idea of VMD is to construct and solve a variational problem.
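As an illustration of how an attribute series can be decomposed, the sketch below uses the third-party vmdpy package (an assumption; any VMD implementation with the same interface would do). The toy signal and the parameter values (alpha, tau, K, DC, init, tol) are illustrative rather than the settings used in this study.

```python
import numpy as np
from vmdpy import VMD           # third-party VMD implementation (assumed available)

# a toy nonstationary signal standing in for one seismic-attribute series
t = np.linspace(0, 1, 500)
signal = np.cos(2 * np.pi * 5 * t) + 0.5 * np.cos(2 * np.pi * 40 * t)

alpha = 2000    # bandwidth constraint of each mode
tau = 0.0       # noise-tolerance parameter of the dual ascent
K = 3           # number of modes to extract
DC = 0          # do not impose a DC component
init = 1        # initialize the center frequencies uniformly
tol = 1e-7      # convergence tolerance

# u: the K intrinsic mode functions (one per row); omega: their center frequencies
u, u_hat, omega = VMD(signal, alpha, tau, K, DC, init, tol)
print(u.shape)  # (K, number of samples)
```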
2.3. XGBoost
XGBoost is a large-scale scalable machine learning system proposed by Tianqi Chen of the University of Washington. It is widely used in various complex machine learning tasks and has achieved excellent results [12–16].
The advantage of XGBoost is that it performs a second-order Taylor expansion on the loss function to increase the accuracy, and a regularization term is added to the objective function to prevent overfitting. It also handles missing value data and supports parallel computing, improving training speed.
XGBoost is composed of a large number of regression trees. Its goal is to build K regression trees so that the predicted value of the tree ensemble is as close to the actual value as possible, which gives it strong generalization ability. A greedy algorithm is used to generate these K regression trees, adding trees one by one until K trees have been built.
XGBoost itself is a boosting algorithm that follows forward stepwise addition. The goal of each stage is to learn the residual left by the previous stage, thereby achieving the gradient boosting effect.
Its objective function is as follows:

Obj = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k}\Omega(f_k),

where l is the loss function, \hat{y}_i is the predicted value, and y_i is the actual value. For different tasks, the loss function is chosen accordingly. \Omega is a regularization term representing the model's complexity.
\Omega is defined as follows:

\Omega(f) = \gamma T + \frac{1}{2}\lambda\lVert w\rVert^{2},

where T indicates the number of leaves, \gamma represents the shrinkage coefficient, and \lambda represents the L2 norm coefficient.
The objective function is optimized by a second-order Taylor expansion as follows:

Obj^{(t)} \approx \sum_{i}\left[l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i f_t(x_i) + \frac{1}{2}h_i f_t^{2}(x_i)\right] + \Omega(f_t),

where g_i and h_i are the first and second derivatives of the loss with respect to \hat{y}_i^{(t-1)}, respectively.
According to the theory of quadratic optimization, the above formula can be simplified as follows:

Obj^{(t)} = \sum_{j=1}^{T}\left[G_j w_j + \frac{1}{2}(H_j+\lambda)w_j^{2}\right] + \gamma T,

where G_j = \sum_{i\in I_j} g_i and H_j = \sum_{i\in I_j} h_i.
The optimal leaf weight can be obtained as follows:

w_j^{*} = -\frac{G_j}{H_j+\lambda}.
The final simplified objective function is as follows:

Obj^{*} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda} + \gamma T.
Different tasks use different loss functions; their first and second derivatives differ, so the final objective function changes accordingly.
Here, d represents the maximum depth of the trees, K represents the total number of subtrees, and ‖x‖ represents the number of nonmissing entries in the training data. The algorithm's time complexity is O(Kd‖x‖ log n). Since K, d, and ‖x‖ are much smaller than n, the overall time complexity is approximately O(log n).
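As a minimal illustration of the boosting framework described above, the sketch below trains an XGBoost classifier with the xgboost Python package on synthetic stand-in data; the hyperparameter values are placeholders rather than the settings used in this study.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# synthetic stand-in for the (samples x attributes) matrix and fault labels
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 14))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 1.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=100,   # number K of regression trees
    max_depth=4,        # maximum tree depth d
    learning_rate=0.1,  # shrinkage applied to each new tree
    reg_lambda=1.0,     # L2 regularization on leaf weights
    gamma=0.0,          # minimum loss reduction required to make a split
)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```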
2.4. Bayesian Optimization Theory
Bayesian optimization is a global optimization algorithm that assumes a prior distribution, obtains the posterior distribution from observations, and thereby revises the reliability of the original distribution. Using the information obtained from the black-box objective function f, it finds the next evaluation position and iteratively approximates the optimal solution [17, 18].
The Bayesian optimization framework mainly includes two core parts: the probabilistic proxy model and the acquisition function.
The probabilistic proxy model starts with a hypothetical prior, and the beta distribution is a commonly used prior. Through continuous iteration, new information is generated, and the initial prior probability model is continuously revised to obtain a more accurate probabilistic proxy model.
Probabilistic proxy models can be divided into parametric and nonparametric models according to whether the number of model parameters is fixed.
The nonparametric model has higher scalability than the parametric model. The Gaussian process is a commonly used nonparametric model, and it is also the model used in this article.
A Gaussian process is specified by a mean function m: \mathcal{X}\to\mathbb{R} and a covariance function k: \mathcal{X}\times\mathcal{X}\to\mathbb{R}; the Gaussian generative model is then

f(x) \sim \mathcal{GP}\bigl(m(x),\, k(x, x')\bigr).
The mean and variance of the posterior distribution at a point x, given the observed data, are as follows:

\mu_n(x) = \mathbf{k}(x)^{T}(\mathbf{K}+\sigma^{2}\mathbf{I})^{-1}\mathbf{y},\qquad
\sigma_n^{2}(x) = k(x,x) - \mathbf{k}(x)^{T}(\mathbf{K}+\sigma^{2}\mathbf{I})^{-1}\mathbf{k}(x),

where k(\cdot,\cdot) is the positive-definite kernel (covariance) function, \mathbf{k}(x) is the vector of covariance terms between x and the observed points x_{1:n}, \mathbf{K} is the covariance matrix of the observed points, and \sigma^{2} is the observation noise variance.
The acquisition function is a function that maps from the input space, the observation space, and the hyperparameter space to the real number space. It is constructed from the posterior distribution obtained from the observed data set D, and maximizing it guides the selection of the next evaluation point.
Standard acquisition functions can be divided into improvement-based, information-gain-based, and confidence-bound-based strategies. The acquisition functions commonly used in practice are TS, PI, EI, UCB, etc. The acquisition function used in this article is based on PI.
PI (probability of improvement) is an improvement-based strategy that quantifies the probability that the observation at x improves on the current optimal objective value. The PI acquisition function is as follows:

PI(x) = \Phi\!\left(\frac{\mu(x) - f(x^{+}) - \varepsilon}{\sigma(x)}\right),

where f(x^{+}) represents the current optimal function value, \Phi is the cumulative density function of the standard normal distribution, and \varepsilon is the balance parameter that trades off local and global search.
Mockus et al. proposed the improvement-based EI (expected improvement) strategy, whose acquisition function is as follows:

EI(x) = \bigl(\mu(x) - f(x^{+})\bigr)\,\Phi(Z) + \sigma(x)\,\phi(Z),\qquad Z = \frac{\mu(x) - f(x^{+})}{\sigma(x)},

where \phi is the standard normal probability density function.
EI integrates the promotion probability and reflects the different promotion amounts.
UCB is based on the confidence bound and optimistically takes the upper bound of the confidence interval at each step. The UCB acquisition function is as follows:

UCB(x) = \mu(x) + \beta\,\sigma(x).
Adjusting \beta changes the size of the confidence interval, and actively varying this parameter can effectively improve the optimization process.
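The three acquisition functions above can be written compactly in terms of the Gaussian-process posterior mean μ(x) and standard deviation σ(x). The following is a generic sketch for a maximization setting; ε and β are user-chosen balance parameters, and the small floor on σ is only a numerical safeguard.

```python
import numpy as np
from scipy.stats import norm

def pi_acq(mu, sigma, f_best, eps=0.01):
    """Probability of improvement over the current best observed value f_best."""
    z = (mu - f_best - eps) / np.maximum(sigma, 1e-12)
    return norm.cdf(z)

def ei_acq(mu, sigma, f_best):
    """Expected improvement: accounts for both the probability and the amount of improvement."""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

def ucb_acq(mu, sigma, beta=2.0):
    """Upper confidence bound: optimistic estimate; beta sets the interval width."""
    return mu + beta * sigma
```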
When the number of samples is n, the time complexity of the approximate Cholesky decomposition is O(n²).
3. Materials and Method
3.1. Seismic Attribute Data
Firstly, a conceptual and symmetrical model is constructed to simulate the real geological characteristics of the coalbed. For the synthetic seismic recording, the receiver spacing is set to 1 m. The synthetic seismic section is obtained by convolving the reflectivity of the conceptual model with a 50 Hz Ricker wavelet, giving a section with 101 seismic traces (Figure 1). There are five faults with fault throw increasing from left to right: traces 11–13 correspond to a small-scale fault with a throw of 2 m, traces 20–22 to a throw of 3 m, traces 34–37 to a throw of 5 m, traces 50–52 to a throw of 10 m, and traces 68–71 to a throw of 15 m. Eighteen seismic attributes were extracted along the positive-phase event, including eight time-domain attributes, seven frequency-domain attributes, and three fractal-dimension parameters, such as the main frequency (Fmain), main frequency phase (Phm), correlation coefficient (Rxy), and two-dimensional fractal parameter (Ttd2).
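The forward modeling step can be sketched as a simple reflectivity-convolution exercise. The code below builds a toy reflectivity section with the five throws listed above and convolves each trace with a 50 Hz Ricker wavelet; the sampling interval, reflector position, and throw-to-sample conversion are illustrative assumptions, not the parameters of the conceptual model.

```python
import numpy as np

def ricker(f, dt, length=0.128):
    """Zero-phase Ricker wavelet with dominant frequency f (Hz)."""
    t = np.arange(-length / 2, length / 2, dt)
    a = (np.pi * f * t) ** 2
    return (1 - 2 * a) * np.exp(-a)

dt = 0.001                                  # 1 ms sampling interval (assumed)
n_samples, n_traces = 600, 101
reflectivity = np.zeros((n_samples, n_traces))

base = 300                                  # coalbed reflector position (toy value)
for trace in range(n_traces):
    throw = 0                               # downthrow grows from left to right
    if trace >= 11: throw = 2
    if trace >= 20: throw = 3
    if trace >= 34: throw = 5
    if trace >= 50: throw = 10
    if trace >= 68: throw = 15
    reflectivity[base + throw, trace] = 1.0  # 1 sample per metre of throw (toy scale)

wavelet = ricker(50.0, dt)                  # 50 Hz Ricker wavelet, as in the text
section = np.apply_along_axis(lambda r: np.convolve(r, wavelet, mode="same"),
                              axis=0, arr=reflectivity)
print(section.shape)                        # synthetic seismic section: samples x traces
```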

3.2. Evaluation Index
The accuracy, F1 value, PR curve, and ROC curve were selected as indicators to measure the model’s performance.
TP indicates the number of positive samples predicted as positive by the model; FP indicates the number of negative samples predicted as positive; FN indicates the number of positive samples predicted as negative; TN indicates the number of negative samples predicted as negative.
The accuracy rate is as follows:

Accuracy = \frac{TP+TN}{TP+TN+FP+FN}.
Precision represents the proportion of samples predicted as positive that are actually positive, and it reflects the model's ability to distinguish negative samples:

P = \frac{TP}{TP+FP}.
Recall represents the proportion of actual positive samples that are correctly predicted, and it reflects the model's ability to distinguish positive samples:

R = \frac{TP}{TP+FN}.
The F1 measure is the weighted harmonic mean of P and R, taking both into account:

F1 = \frac{2PR}{P+R}.
The PR curve describes the trade-off between precision and recall, and the area under the curve reflects the performance of the learning model: the larger the area under the PR curve, the better the performance.
The ROC curve [19] is also a curve for evaluating the model's predictive ability. The ROC curve of a good classifier should be as close as possible to the upper left corner of the plot, whereas a curve along the diagonal means the classifier performs no better than random guessing. AUC is the area under the ROC curve; the larger the AUC, the better the model performance.
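For reference, all of the indices above can be computed with scikit-learn once the true labels, hard predictions, and predicted scores are available; this is a generic sketch, not code from the paper.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, precision_recall_curve, auc)

def evaluate(y_true, y_pred, y_score):
    """Accuracy, precision, recall, F1, and the areas under the ROC and PR curves."""
    metrics = {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "roc_auc":   roc_auc_score(y_true, y_score),   # area under the ROC curve
    }
    prec, rec, _ = precision_recall_curve(y_true, y_score)
    metrics["pr_auc"] = auc(rec, prec)                 # area under the PR curve
    return metrics
```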
3.3. XGBoost Objective Function Improvement
The asymmetric distribution of positive and negative samples in binary classification tasks is common [20, 21]. This problem causes the classifier to concentrate its training on the categories with many samples, lowering the overall training effect. In small-scale fault prediction, the number of small-scale fault samples is much smaller than that of nonfault samples, resulting in insufficient training, so the model's accuracy in identifying small-scale faults decreases. To sufficiently train the classes with small sample sizes, the objective function of the XGBoost binary classification is improved to increase the training weight of the minority samples. The two-category label uses one-hot coding, where the minority samples are labeled as positive samples "1" and the other samples are labeled as negative samples "0."
The original objective function of XGBoost is as follows:

Obj = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k}\Omega(f_k),\qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda\lVert w\rVert^{2},

where l is the loss function, \hat{y}_i is the predicted value, y_i is the actual value, T indicates the number of leaves, \gamma represents the shrinkage coefficient, and \lambda represents the L2 norm coefficient.
In classification tasks, the cross-entropy is used as the loss function:

L = -\sum_{i}\bigl[y_i\ln p_i + (1-y_i)\ln(1-p_i)\bigr],\qquad p_i = \frac{1}{1+e^{-\hat{y}_i}}.
The improved objective function is as follows:

L = -\sum_{i}(1 + K y_i)\bigl[y_i\ln p_i + (1-y_i)\ln(1-p_i)\bigr].
The improved objective function meets the weight-adjustment requirement. When y_i is 0, the sample weight is 1, so the weight of the majority-class samples remains unchanged. When y_i is 1, the weight of the minority samples increases, with K an adjustable parameter controlling the increase in sample weight. At the same time, the weight coefficient is a linear function, which is faster to compute than more complex functions.
This paper establishes the original XGBoost model and the XGBoost model with the improved objective function. By manually adjusting the parameter K, the optimal value is found to be K = 0.4, which is equivalent to giving the positive samples 1.4 times the original weight. The original and improved XGBoost were tested on a small-scale fault dataset split 7 : 3 into training and test sets, and the comparison is shown in Figure 2. In terms of accuracy, the ability to distinguish positive from negative samples, and the comprehensive indices, the XGBoost model with the improved objective function outperforms the original XGBoost model.
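One way to realize the weighted objective (weight 1 for negative samples and 1 + K for positive samples, with K = 0.4 as tuned above) is through xgboost's custom-objective interface, which expects the gradient and Hessian of the loss with respect to the raw margin. The sketch below is an illustration of that idea under these assumptions, not the authors' exact implementation.

```python
import numpy as np
import xgboost as xgb

K = 0.4   # extra weight on the minority (fault) class, as tuned above

def weighted_logistic(preds, dtrain):
    """Custom XGBoost objective: cross-entropy with per-sample weight 1 + K*y."""
    y = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))   # sigmoid of the raw margin
    w = 1.0 + K * y                    # weight 1 for negatives, 1 + K for positives
    grad = w * (p - y)                 # first derivative of the weighted loss
    hess = w * p * (1.0 - p)           # second derivative of the weighted loss
    return grad, hess

# usage sketch (X_train and y_train stand for the prepared attribute data):
# dtrain = xgb.DMatrix(X_train, label=y_train)
# booster = xgb.train({"max_depth": 4, "eta": 0.1}, dtrain,
#                     num_boost_round=100, obj=weighted_logistic)
```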

3.4. Bayesian Optimization Acquisition Function Improvement
The acquisition function is an important part of Bayesian optimization: it determines how the next evaluation point is selected from the posterior distribution. When the assumed prior distribution does not differ much from the actual data, Bayesian optimization works very well [17]. The acquisition function must balance exploration and exploitation: exploitation seeks the optimal value within the currently promising region, whereas exploration prevents the search from falling into a local optimum and expands the searched region. PI is a standard acquisition function in Bayesian optimization; its idea is to use the probability that the posterior estimate exceeds the current best observation to choose the location of the next point. However, the PI algorithm is sensitive to the value of the balance parameter: if it is too small, the search quickly falls into a local optimum; if it is too large, exploration efficiency decreases. Therefore, this paper proposes an adaptive algorithm in which the balance parameter is adjusted automatically according to the observed values so that the PI algorithm escapes local optima as far as possible. The improved PI acquisition function keeps the form

PI(x) = \Phi\!\left(\frac{\mu(x) - y^{obs}_{\max} - \varepsilon}{\sigma(x)}\right),

where the balance factor \varepsilon is computed adaptively from the gap between the currently observed maximum and the theoretical maximum of the objective function.
Here, y^{obs}_{\max} represents the maximum observed value of the objective function so far, and y^{theo}_{\max} represents the theoretical maximum value that the objective function can reach. When the observed maximum approaches the theoretical maximum, \varepsilon tends to 0 and the acquisition function leans toward exploitation; when the observed maximum is far from the theoretical maximum, \varepsilon tends to 1 and the acquisition function leans toward exploration.
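A minimal sketch of the adaptive balance factor is given below. The exact mapping from the observed/theoretical gap to ε is not reproduced from the paper; the normalized gap used here is an illustrative assumption that merely satisfies the limiting behavior described above (ε → 0 when the gap vanishes, ε → 1 when the gap is large).

```python
import numpy as np
from scipy.stats import norm

def adaptive_pi(mu, sigma, y_obs_max, y_theo_max, scale=1.0):
    """PI acquisition with an adaptive balance factor (illustrative form).

    eps shrinks toward 0 as the observed maximum approaches the theoretical
    maximum (favoring exploitation) and grows toward 1 when the gap is large
    (favoring exploration). The normalized-gap mapping below is an assumption
    made for this sketch, not the paper's exact formula.
    """
    gap = abs(y_theo_max - y_obs_max)
    eps = gap / (gap + scale)                           # in [0, 1): 0 when gap == 0
    z = (mu - y_obs_max - eps) / np.maximum(sigma, 1e-12)
    return norm.cdf(z)
```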
The enhanced PI algorithm is applied to the Rosenbrock function and compared with the other standard acquisition functions, EI and UCB. Since the optimizer maximizes the objective, the negated Rosenbrock function is used:

f(x, y) = -\bigl[(1-x)^{2} + 100\,(y - x^{2})^{2}\bigr].
Theoretically, when x and y both take 1, this function will obtain the maximum value of 0.
In this test, the value range of the variables x and y in the Rosenbrock function is set to [−5, 5]. PI, EI, UCB, and the improved PI algorithm are each run 50 times, and the maximum, minimum, mean, and standard deviation of their best function values are recorded. The results are shown in Table 2.
In these 50 optimization tests, the maximum value found by the EI algorithm is −1.2309, which is much smaller than that of the other algorithms; it also has the smallest mean and the largest standard deviation, showing the worst optimization performance. The maximum values found by PI and UCB are still some distance from the theoretical maximum, and their standard deviations are 5644 and 12246, respectively, showing poor optimization performance, although better than EI. The improved PI algorithm shows a clear advantage over the other algorithms in terms of the maximum, the mean, and the standard deviation. Therefore, the enhanced PI algorithm proposed in this paper can better avoid falling into local optima and find the global optimum.
Based on the above results, Bayesian optimization was used to tune the key parameters of the improved XGBoost model, and ten experiments were carried out. The results are shown in Figures 3 and 4. The accuracy of Bay_XGBoost, optimized with the Bayesian optimization algorithm, is higher than that of the unoptimized XGBoost in eight of the ten experiments, and its F1 value is higher in all ten, showing excellent performance.


4. Model
4.1. Data Processing
Seismic attributes are closely related to faults; however, different attributes have different sensitivities to small-scale faults, so they must be evaluated before use. In this paper, the attributes extracted from the forward simulation described in Section 3.1 are used as input.
For the fault dataset, chi-square binning is performed on the 18 kinds of seismic attributes, and the best separation points for each feature are as follows: A1: (777.0, 1177.0, 1252.0, 1306.0, 1332.0), Af: (3243.708, 3535.995, 3868.301, 4048.326, 4436.038), At: (36.965, 39.241, 42.571, 50.254), Df: (−0.703, −0.587, −0.568, −0.503, −0.5, −0.358), Fa: (68.0, 85.0), Fd2: (1.273, 1.824, 1.915, 2.016, 2.123), Fmain: (39.0, 40.0, 59.0, 77.0), Pha: (−0.862, −0.803, −0.648, −0.608), Phm: (−0.121, 0.22), Qf: (2.924, 4.707), Qfl: (9.286, 10.906), Qfw: (30, 60, 63), Rflw: (0.154), Rxy: (0.936, 0.963, 0.999), Td2: (1.153, 1.163, 2.045, 2.241, 2.343), Tmax1: (15), Tmin1: (21, 28), Ttd2: (2.038, 2.287, 2.79, 2.803, 2.835).
Binning each feature according to the separation points, we count the numbers of positive and negative samples in each bin, calculate the WOE value in each bin, and compute the corresponding IV value. The IV values of all features are shown in Figure 5; they reflect the amount of information contained in each attribute and its contribution to the prediction result, and can therefore be regarded as a measure of feature importance. According to the information value correlation criteria, the influence of features with an IV value of less than 0.1 is almost negligible. Therefore, the four attributes At, Fmain, Fa, and Tmin1 are eliminated from the following research.

Next, variational mode decomposition is performed on each feature with the number of modes set to 3. After processing, the selected seismic attributes are decomposed into 42 new features. The decomposed and nondecomposed features are input to the XGBoost model for comparison. The results in Figure 6 show that the accuracy, precision, and F1 value improve considerably, while the recall decreases slightly.

Figure 7 shows the results of XGBoost with objective-function optimization, XGBoost with VMD, and XGBoost with both VMD and objective-function optimization. As shown, both objective-function optimization and VMD improve the accuracy, precision, and F1 value of the XGBoost model, and the XGBoost model with both VMD and objective-function optimization performs best overall.

4.2. Bayesian-Optimized XGBoost Model for Small-Scale Fault Recognition
The XGBoost ensemble learning framework has several hyperparameters, which affect the quality of the model. n_estimators represents the number of weak classifiers and affects the ensemble effect of the model, and the gamma parameter defines the minimum loss reduction required for splitting: the larger the value, the more conservative the algorithm. Parameters such as max_depth, min_child_weight, subsample, lambda, alpha, and max_leaves adjust the degree of model fit.
As Figure 8 shows, this paper uses the improved Bayesian optimization theory to find the best hyperparameters, since Bayesian optimization iterates faster than grid search and random search and is robust on nonconvex, multimodal, and expensive-to-evaluate problems. The workflow of Bayesian optimization of the improved XGBoost for small-scale fault identification is as follows (a simplified code sketch of this workflow is given after the list):
Step 1. Clean the small-scale fault data of the mining area: remove outliers that are seriously inconsistent with the distribution of the sample set and samples with a large proportion of missing features.
Step 2. Perform chi-square binning for each seismic attribute and select the best feature segmentation points. Calculate the WOE value in each bin, obtain the IV value of each feature, and remove features with too little information. Then construct new features using VMD.
Step 3. Divide the processed small-scale fault data into a training set, test set, and cross-validation set in a 6 : 2 : 2 proportion. Train the XGBoost model with the improved objective function on the training set, and define the black-box function for Bayesian optimization as the accuracy on the cross-validation set.
Step 4. Set the ranges of the important parameters for Bayesian optimization: learning_rate (0.01, 1), max_depth (10, 500), max_delta_step (0, 10), lambda (0, 5), alpha (0, 5), gamma (0, 1), max_leaves (0, 10), min_child_weight (0.1, 1); set the acquisition function of Bayesian optimization and the total number of iterations to 100.
Step 5. Perform model training and Bayesian optimization simultaneously. Ten hyperparameter sample points within the ranges are randomly initialized, and the posterior distribution is obtained with the Gaussian process. The optimized PI function proposed in this paper is used to determine the next sampling point, which is added to the sample set.
Step 6. Continue iterating according to the set objective function. Among all the sample points after the iterations, the sample point with the best objective value is selected as the hyperparameters of the XGBoost model, giving the final small-scale fault recognition model.
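A simplified sketch of Steps 3–6 is shown below using the third-party bayes_opt package (an assumption); note that it relies on the library's built-in acquisition function rather than the improved PI proposed here, and the data, the subset of parameter ranges, and the scoring choice are illustrative.

```python
import numpy as np
import xgboost as xgb
from bayes_opt import BayesianOptimization           # third-party package (assumed)
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the reduced, VMD-augmented attribute matrix and labels
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 14))
y = (rng.random(300) < 0.3).astype(int)

def black_box(learning_rate, max_depth, reg_lambda, reg_alpha, gamma, min_child_weight):
    """Cross-validated accuracy of XGBoost for one hyperparameter setting (Step 3)."""
    model = xgb.XGBClassifier(
        learning_rate=learning_rate,
        max_depth=int(round(max_depth)),
        reg_lambda=reg_lambda,
        reg_alpha=reg_alpha,
        gamma=gamma,
        min_child_weight=min_child_weight,
        n_estimators=100,
    )
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

bounds = {                                            # subset of the ranges in Step 4
    "learning_rate": (0.01, 1.0),
    "max_depth": (10, 500),
    "reg_lambda": (0.0, 5.0),
    "reg_alpha": (0.0, 5.0),
    "gamma": (0.0, 1.0),
    "min_child_weight": (0.1, 1.0),
}

optimizer = BayesianOptimization(f=black_box, pbounds=bounds, random_state=0)
optimizer.maximize(init_points=10, n_iter=100)        # 10 random points, then 100 BO steps
print(optimizer.max)                                  # best score and its hyperparameters
```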

Assuming that the number of samples is n and the maximum number of Bayesian optimization iterations is m, the overall time complexity of XGBoost training with Bayesian optimization is O(mn² log n).
4.3. Experimental Parameter Settings
The balance factor of the improved Bayesian optimization PI function is initialized to 0. In XGBoost, the booster parameter is set to gbtree and the objective function to the weighted_logistic loss defined above; the other parameters are obtained through Bayesian optimization. To validate XGBoost, we compare it with several traditional methods, including SVM, BP, ELM, random forest, and K-means. The parameters of SVM [22] are a Gaussian kernel function, penalty coefficient C = 1.0, and degree = 3. The parameters of BP [23] are a 14 × 20 × 20 × 1 network structure, ReLU activation function, initial learning_rate = 0.001, max_iter = 200, and momentum = 0.9. The parameter of ELM [24] is a 14 × 40 × 1 network structure. The parameters of the random forest [25] are 20 subtrees with the Gini-coefficient feature-gain criterion. The parameter of K-means [26] is K = 2.
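For reproducibility, the comparison models roughly map onto standard scikit-learn estimators as sketched below (ELM has no scikit-learn implementation and is omitted); this mapping, e.g. MLPClassifier for the BP network, is an assumption, and only the settings stated above are filled in.

```python
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

# SVM: Gaussian (RBF) kernel, penalty coefficient C = 1.0, degree = 3
svm = SVC(kernel="rbf", C=1.0, degree=3, probability=True)

# BP network: 14 x 20 x 20 x 1 structure -> two hidden layers of 20 neurons
bp = MLPClassifier(hidden_layer_sizes=(20, 20), activation="relu", solver="sgd",
                   learning_rate_init=0.001, max_iter=200, momentum=0.9)

# Random forest: 20 subtrees, Gini-coefficient feature-gain criterion
rf = RandomForestClassifier(n_estimators=20, criterion="gini")

# K-means clustering with K = 2 (fault / non-fault)
kmeans = KMeans(n_clusters=2)
```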
4.4. Comparison of Experimental Results
The dataset is randomly divided into subsets in a 6 : 2 : 2 ratio. The ROC and PR curves obtained in one experiment are shown in Figures 9 and 10. The experimental results show that the ROC curve of XGBoost is closest to the upper left corner, with the largest AUC value of 0.823, whereas the AUC values of the random forest, ELM, SVM, and BP neural networks are 0.707, 0.738, 0.417, and 0.417, respectively. The area under the PR curve of XGBoost is also the largest, with an AUC of 0.909, compared with 0.855, 0.890, 0.755, and 0.755 for the random forest, ELM, SVM, and BP neural networks, respectively. Considering these indices, XGBoost has the best performance.


Similarly, in terms of accuracy, precision, recall, and F1 value, the results of XGBoost are much better than those of K-means, as shown in Figure 11.

We ran the different algorithms ten times and compared their accuracies and F1 values, as shown in Figures 12 and 13. In eight of the ten random tests, XGBoost showed the best performance among all methods in terms of both accuracy and F1 value.


The above experiments show that the comprehensive performance of XGBoost is better than other algorithms in terms of accuracy and F1 value, and the improved Bay_XGBoost is slightly better than XGBoost. The improved XGBoost model with Bayesian optimization identifies small-scale faults with higher prediction accuracy.
5. Case Study
5.1. Settings of the Study Area
Although the XGBoost predicting model demonstrates its ability to predict small-scale faults with synthetic seismic data, this alone is not sufficient to apply it directly to actual coal mine data. Hence, we select the Liangjia coal mine as the research area, as shown in Figure 14. The Liangjia coal mine is located in Longkou, in the northwestern Jiaodong Peninsula and on the southern bank of Bohai Bay, and is the largest seaside coal mine in China. The minefield lies in the northwest part of the Huangxian coalfield in Longkou City, with a mineable area of nearly 48 km². Small-scale faults in this area are relatively developed, and roof falls and gas outbursts are prone to occur. Therefore, the Liangjia coal mine is an ideal area for studying small-scale faults.

5.2. Prediction Results and Comparison
This paper selects the seismic data of section inline 100 of the Liangjia mining area, with 18 kinds of seismic attributes and 419 samples. After removing samples with outliers and missing values, 371 samples remain. According to the forward-simulation experiment, the four attributes with low information value are removed.
An improved XGBoost model is constructed, the 14 reduced attributes are input into the model, and the small-scale faults of section inline 100 are predicted, as shown in Figure 15. The abscissa is the CDP number, with an interval of 5 m, and the ordinate is the predicted value. With a threshold of 0.8, the identified small-scale faults are shown in Figure 16.


The accuracy of the proposed method is higher than that of the BP neural network. In Figure 16, the revealed small-scale faults are marked A–K. The empty circles are the small-scale faults recognized by the BP neural network, and the triangles are those recognized by XGBoost. The BP neural network mistakenly identified locations 209 and 371 as small-scale faults and missed the small-scale fault at location 318, whereas XGBoost identified the small-scale faults accurately.
To further test the model's reliability, we compare the prediction results of RF, the BP neural network, and SVM, as shown in Table 3. The prediction results of the Bayesian-optimized XGBoost model are excellent, and it has a lower probability of misjudging negative samples.
6. Conclusions
This paper proposes a small-scale fault prediction method using a Bayesian-optimized XGBoost model. Based on the experiments with synthetic data and actual coal mine data, the following conclusions can be drawn:
(1) The combination of IV and VMD effectively reduces the features and fully extracts the information.
(2) The objective function of XGBoost is improved to address the asymmetric sample distribution, and the Bayesian optimization acquisition function is improved to keep the search from falling into local optima.
(3) The Bayesian-optimized XGBoost model predicts small-scale faults better than the BP, SVM, ELM, and RF models.
(4) The objective function of the Bayesian optimization has high dimensionality, and a single evaluation takes considerable time. Improving the recognition speed while ensuring accuracy is a problem to be solved in future research.
Data Availability
The data used to support the findings of this study have not been made available because the ownership of data belongs to the Longkou mineral group.
Conflicts of Interest
The authors declare no conflicts of interest.
Acknowledgments
This research was funded by the National Key R&D Program of China, grant number 2021YFC2902003, and the National Natural Science Foundation of China, grant numbers 41704115 and 41774128.