Abstract

Landslide susceptibility prediction is critical in open pit mines and geotechnical fields, and prediction accuracy is essential to reduce the risk of slope instability. Traditional statistical learning methods have been widely used in early warning systems, but they cannot thoroughly explore the coupling effects among related factors, which often results in low prediction accuracy. This paper establishes an ensemble learning prediction model optimized by a genetic algorithm (GA) to determine landslide susceptibility more quickly and efficiently. The model is based on 290 slope cases containing height (H), slope angle (β), unit weight (γ), cohesion (c), friction angle (φ), and pore water pressure coefficient (ru). Two common algorithms are incorporated into the ensemble learning model: Xgboost and gradient boosting decision tree (GBDT). The areas under the curve (AUC) of GA-GBDT and GA-Xgboost were found to be 0.933 and 0.928, respectively, indicating that both can predict landslide susceptibility well. Compared with multiple logistic regression and other machine learning algorithms, both the GA-GBDT and GA-Xgboost models perform better in terms of accuracy and applicability. The results demonstrate that the developed optimized machine learning model can accurately predict landslide susceptibility and that, after building a suitable machine learning model, its parameters should be optimized on a case-by-case basis to achieve more accurate results. The optimization model proposed in this paper can serve as an effective new method for the intelligent prediction of landslide susceptibility.

1. Introduction

Slope instability can develop during the construction of mines, water conservancy works, and construction projects, with potentially catastrophic consequences [1, 2]. However, there are few effective and accurate tools for prediction and prevention [3, 4]. Accurate prediction of landslide susceptibility provides a basis for the safe operation of projects and helps to determine the stability state of slopes, prevent hazards, and provide safety management solutions. Therefore, constructing a reliable and efficient landslide susceptibility prediction model is of great theoretical and practical significance.

The current research broadly classifies landslide susceptibility prediction models into two categories: traditional and machine learning. Traditional models mainly include analytical methods [5, 6], numerical simulations [7, 8], and others. These evaluate landslide susceptibility based on the corresponding mechanical theories (e.g., elastoplasticity theory and viscoelasticity theory). However, the above methods have several limitations. The limit equilibrium method is convenient and straightforward, but its theoretical basis is imperfect and it requires the assumption of a slip surface, which does not reflect the actual stress conditions on the slip surface; the simplified assumptions therefore reduce accuracy [9]. Numerical methods are usually time-consuming, and their accuracy depends heavily on the assessment of the geotechnical and physical parameters [10, 11]. As a nonlinear, dynamic, and complex open system, a slope has many risk factors that are difficult to assess. Traditional models are therefore significantly limited in achieving the desired prediction accuracy and effectiveness.

Conventional landslide susceptibility analysis methods are complex, iterative, and computationally demanding [12]. This has inspired researchers to look for alternative methods for calculating landslide susceptibility. Soft computing techniques can solve highly complex, nonlinear, multivariate problems [13]. Machine learning algorithms such as neural networks and support vector machines have also been used to solve landslide susceptibility problems [14, 15]. The increased use of machine learning algorithms has promoted the crossover, integration, and development of such tools with slope engineering problems, providing new ideas and methods for landslide susceptibility prediction [16]. Qi and Tang [17] proposed and compared six integrated artificial intelligence (AI) methods for landslide susceptibility prediction based on metaheuristics and machine learning algorithms and demonstrated that integrated AI methods have great potential in predicting landslide susceptibility. Tien Bui et al. [18] employed machine learning-based techniques to predict the safety factor of slope failures, showing that the multilayer perceptron (MLP) outperformed other machine learning-based models. Chang et al. [19] investigated the performance of eight commonly used machine learning models for predicting slope safety coefficients, combining parameter optimization and cross-validation with historical slope data to establish a machine learning-based slope safety coefficient prediction system. Lin et al. [20] developed a machine learning (ML) model for landslide susceptibility evaluation and found that the performance and reliability of nonlinear regression methods were slightly better than those of linear regression methods. Huang et al. [21] used 369 recorded landslides and 13 associated conditional factors to study landslide-prone areas.
They compared the analytic hierarchy process (AHP), general linear model (GLM), information value (IV), binary logistic regression (BLR), multilayer perceptron (MLP), back-propagation neural network (BPNN), support vector machine (SVM), and C5.0 decision tree (C5.0DT) models for prediction and found that machine learning models are more suitable for landslide susceptibility prediction than heuristic and general statistical models. Machine learning (ML) models based on remote sensing (RS) imagery and geographic information systems (GIS) have been widely and effectively implemented for landslide susceptibility prediction. Chang et al. [22] compared the landslide susceptibility prediction performance of supervised machine learning (SML) and unsupervised machine learning (USML) models to further explore their strengths and weaknesses, achieving more accurate and reliable prediction results. In summary, machine learning algorithms have become a hot research topic in data mining and classification prediction for landslide susceptibility, but each prediction algorithm has its limitations [23, 24].

Landslide susceptibility evaluation research focuses on high prediction accuracy, which requires researchers to continuously seek newer and more robust algorithms for building landslide susceptibility evaluation models with better prediction results [25]. Therefore, it is necessary to find intelligent algorithms with high accuracy and better applicability. Ensemble learning trains multiple learners whose complementary advantages yield better landslide susceptibility predictions than a single algorithm. Both gradient boosting decision tree (GBDT) and eXtreme gradient boosting (Xgboost) are ensemble learning methods; Xgboost, in particular, is an efficient engineering implementation and optimization of gradient boosting. The purpose of ensemble learning is to improve the generalization ability and robustness of a single learner by combining the prediction results of multiple base learners. Achour and Pourghasemi [26] compared random forest (RF), support vector machine (SVM), and boosted regression tree (BRT) models for assessing the susceptibility of landslides near roads and found that the RF model achieved the highest prediction accuracy. Pham et al. [27] used 16 landslide condition factors to predict landslide susceptibility and, after comparing traditional models with machine learning, found RF to have the more accurate prediction capability. Achour et al. [28] prepared an inventory map with 12 variables (including geomorphological, geological, hydrological, and environmental factors) to predict landslide susceptibility; the RF and Xgboost models had the same prediction accuracy (AUC) and better prediction performance. Drid et al. [29] selected and evaluated eleven gully erosion condition factors to identify the areas most vulnerable to this hazard, and the results showed that the Xgboost model had the best predictive performance.
Xgboost and GBDT have been widely used in various scenarios and have achieved good results, but a single ensemble learning model is sensitive to its parameters. In this paper, we explore the problem of algorithm accuracy in landslide susceptibility prediction through optimization by heuristic algorithms.

This study investigates the feasibility of the GA-GBDT and GA-Xgboost algorithms, alongside numerous machine learning algorithms, for landslide susceptibility prediction. First, the collected data are described and analyzed. Then, the principles of the GA-optimized GBDT and Xgboost algorithms and the accuracy evaluation criteria of the prediction models are introduced. Finally, the performance of different prediction models on the same data is compared and analyzed by calculating various evaluation indexes and quantitative receiver operating characteristic (ROC) curve tests to explore the feasibility of the proposed method.

2. Data Collection and Description

2.1. Dataset and Predictor Variables

The database includes 290 cases (156 stable slopes and 134 failed slopes) derived from the information in Feng et al. [30] and Zhou et al. [31]. The database contains the basic geometric slope design parameters, such as slope height (H), slope angle (β), unit weight (γ), cohesion (c), friction angle (φ), and pore water pressure coefficient (ru). The external trigger considered in this study is the pore water pressure coefficient (ru), defined as the ratio of pore water pressure to overburden pressure (Michalowski, 1995; Kim et al., 1999) [6]. The six parameters chosen are strongly correlated with the geometry and geotechnical properties of the slope and have different degrees of influence on slope stability.

Slope instability is recorded as 0, and stability is recorded as 1. Figure 1 shows violin plots of the six influencing factors for the 290 historical slope cases under the failure and stability scenarios. The advantage of the violin plot is that it intuitively visualizes the distribution of each influencing factor when the slope is stable or failed. A violin plot combines a box plot and a density plot to show the dispersion statistics of each group and the density of the data distribution; wider sections of the violin correspond to higher densities and represent areas of concentrated distribution. Here, the data are widely distributed, the distributions of the variables are asymmetric, and the data distributions of the influencing factors under both stable and unstable conditions are broad. From basic plots such as Figure 1, it is impossible to visually distinguish the essential parameters affecting landslide susceptibility.

2.2. Principal Component Analysis

Principal component analysis (PCA) was used to further determine the influence of the six basic characteristics on landslide susceptibility. PCA was used to investigate the relative contribution of each factor to landslide susceptibility, summarizing and visualizing the collected landslide susceptibility data and interpreting its variance-covariance structure. PCA also assessed the database to ensure a representative dataset.

As shown in Figure 2, slope height has the highest contribution to PC1 at 41.983%, and pore water pressure has the lowest at 5.992%. PCA allows the classification structure of the slope dataset to be visualized in a two-dimensional plane. In this two-dimensional space, the domains of the two landslide susceptibility classes overlap. In addition, some indicators with significant skewness can impact the prediction model. According to the PCA results (Figure 2(a)), the components of the first and second dimensions are visualized in Figure 2(b). The data distribution areas of the two slope states on the first two components are relatively close, with overlapping areas.
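As a hedged illustration of the PCA procedure described above, the sketch below standardizes a feature matrix, eigen-decomposes its covariance matrix, and projects the cases onto the first two components. The data here are synthetic stand-ins for the six slope features; they are not the study's database.

```python
import numpy as np

# Toy stand-in for the slope feature matrix: rows are cases, columns are
# H, beta, gamma, c, phi, ru (synthetic values, not the real database).
rng = np.random.default_rng(0)
X = rng.normal(size=(290, 6))

# Standardize, then eigen-decompose the covariance matrix.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Xs, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort components by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()        # contribution of each PC
scores = Xs @ eigvecs[:, :2]               # projection onto PC1 and PC2
print(explained[:2], scores.shape)
```

Plotting `scores` colored by the stability label would reproduce the kind of two-dimensional visualization shown in Figure 2(b).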

A comprehensive analysis of the statistical characteristics of the six influencing factors of the slopes shows that each characteristic presents a different distribution, and the span and density of the values are relatively large. This indicates that the database contains slopes of different heights, slope angles, lithologies, and types. Table 1 shows the variability of the effect of different influencing factors on slope stability. By using completely different types of slopes and using their common characteristics for slope instability prediction, the powerful ability of machine learning algorithms to handle nonlinear data can be better reflected.

3. Methodology

3.1. Ensemble Learning Algorithm

Ensemble boosting algorithms such as GBDT and Xgboost work by changing the data distribution: the weight of each sample is determined based on whether it was correctly classified in each training round and on the accuracy of the previous overall classification. The dataset with modified weights is then passed to the next classifier for training, and the classifiers obtained from each round are finally combined into the final decision classifier.

3.2. GBDT Model

GBDT is one of the boosting ensemble learning algorithms, often used for classification and regression problems. Rather than simply adjusting the weights of weak learners, GBDT reduces the residuals after each computation by building a new model in the direction of the gradient descent of the residuals [32]. The GBDT model inherits the advantages of statistical models and artificial intelligence methods, calculating the relative importance of the variables while identifying complex nonlinear relationships [33]. Because of the complex nonlinear relationships of a slope system, this paper selected GBDT as the core model to study the landslide susceptibility judgment problem. The steps of the GBDT training model are as follows:

Step 1: Initialize the model function with the slope training set and loss function. Each record of the training slope dataset is of the form $(x_i, y_i)$, $i = 1, \ldots, N$, where the features $x_i$ are known and refer to the slope attributes (e.g., unit weight, slope height, etc.). The aim is to predict the value of $y$ (landslide susceptibility) based on $x$. Hence, a mapping $F^*(x)$ needs to be identified such that the expected value of the loss function $L(y, F(x))$ is minimized, as given in the following equation:

$$F^*(x) = \arg\min_{F} E_{y,x}\, L(y, F(x)).$$

The $F(x)$ is expanded as $F(x) = \sum_{m=1}^{M} \beta_m h(x; a_m)$, where $h(x; a_m)$ is a base learner with its parameters given by $a_m$. Every iteration attempts to find a better fit for the expansion coefficients $\beta_m$ and the parameters $a_m$ of the base function to achieve a better prediction. In the beginning, the training is initialized by guessing $F_0(x)$.

Step 2: For $m = 1, 2, \ldots, M$, regression trees are generated iteratively. The loss function is assumed to be differentiable, and the function is fit by minimizing the k-class multinomial negative log-likelihood. The predicted value of the output for $x$ at the mth iteration is given in the following equation:

$$F_m(x) = F_{m-1}(x) + \beta_m h(x; a_m).$$

The optimal fit $a_m$ is used to find the optimal value of the coefficient $\beta_m$. The base learner is a decision tree where, in each iteration $m$, the tree segments the input slope feature space into $Z$ disjoint regions $\{R_{zm}\}_{z=1}^{Z}$ and predicts a separate constant value in each one, as described in the following equation:

$$h\big(x; \{R_{zm}\}_{z=1}^{Z}\big) = \sum_{z=1}^{Z} \bar{y}_{zm}\, \mathbf{1}(x \in R_{zm}),$$

where $\bar{y}_{zm}$ is the majority class predicted in region $R_{zm}$, i.e., the class to which the majority of the points in the region are predicted to belong; it can also be considered the class with the highest probability of being predicted in that region. In decision trees, the parameters $a_m$ are the variables being split at each node and the specific values at which the chosen variables are split. These parameters define the regions of the partition at the mth iteration.

Step 3: Generate the GBDT. Because the decision tree produces a constant value within each region $R_{zm}$, the expansion coefficient and the base learner's value can be reduced to a single optimal multiplier $\gamma_{zm}$ per region, as given in the following equation:

$$\gamma_{zm} = \arg\min_{\gamma} \sum_{x_i \in R_{zm}} L\big(y_i, F_{m-1}(x_i) + \gamma\big).$$

The current approximation $F_{m-1}(x)$ is then updated in each region as depicted in the following equation:

$$F_m(x) = F_{m-1}(x) + \nu\, \gamma_{zm}\, \mathbf{1}(x \in R_{zm}), \quad 0 < \nu \leq 1.$$

The shrinkage parameter $\nu$ controls the learning rate of the procedure [32, 34].
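The training steps above can be sketched with scikit-learn's `GradientBoostingClassifier`, where `learning_rate` plays the role of the shrinkage parameter $\nu$ and `n_estimators` the number of iterations $M$. The data below are synthetic stand-ins for the six slope features and the binary stability label, not the study's database.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 290 slope cases with six features each.
X, y = make_classification(n_samples=290, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# learning_rate corresponds to the shrinkage parameter nu in Step 3.
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                   max_depth=3, random_state=0)
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)  # test-set accuracy
```

The 80/20 split mirrors the partition used later in Section 4.1.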

3.3. Xgboost Model

Xgboost is widely used in major machine learning tasks and is fitting for landslide susceptibility prediction. Chen and Guestrin [35] added a regularization term to the loss function of Xgboost to control the complexity of the trees and prevent overfitting. The model prediction is represented as follows:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},$$

where $x_i$ is the ith sample of the input data; $\hat{y}_i$ is the model prediction value of the ith sample; $K$ is the number of trees; $\mathcal{F}$ is the set space of trees; and $f_k$ is a function in the set space $\mathcal{F}$.

The objective function in the formula consists of two parts. The first calculates the error between the predicted value $\hat{y}_i$ and the true value $y_i$. The second is the regularization term, which represents the sum of the complexity of each tree:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k).$$

For the tth round loss function $L^{(t)}$, a second-order Taylor expansion is taken at the point $\hat{y}_i^{(t-1)}$, and the loss function can be calculated as an accumulation over every leaf node as follows:

$$L^{(t)} \simeq \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T,$$

where $I_j$ represents the samples in leaf node $j$; $w_j$ is the weight of leaf $j$; $T$ is the number of leaves; and $g_i$ and $h_i$ are the first- and second-order gradients of the loss at $\hat{y}_i^{(t-1)}$.
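Minimizing the leaf-wise objective above gives the closed-form optimal leaf weight $w_j^* = -G_j/(H_j + \lambda)$, with $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$. The sketch below computes this for one leaf under the logistic loss; the sample values are illustrative, not taken from the slope database.

```python
# Second-order statistics for the samples in one leaf I_j under logistic loss:
# g_i = p_i - y_i, h_i = p_i * (1 - p_i), where p_i is the current prediction.
preds = [0.6, 0.3, 0.8, 0.5]   # illustrative current probabilities
labels = [1, 0, 1, 1]          # illustrative stability labels
lam = 1.0                      # L2 regularization weight lambda

g = [p - y for p, y in zip(preds, labels)]
h = [p * (1 - p) for p in preds]
G, H = sum(g), sum(h)

# Optimal leaf weight and its contribution to the objective (Chen & Guestrin):
w_star = -G / (H + lam)
obj_contrib = -0.5 * G * G / (H + lam)
print(w_star, obj_contrib)
```

This objective contribution is what Xgboost compares across candidate splits when growing a tree.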

3.4. Hyperparameter Tuning

Classical machine learning prediction algorithms are sensitive to hyperparameters, which are crucial for building high-accuracy prediction models for landslide susceptibility. To find the optimal hyperparameters, particle swarm optimization [36], the genetic algorithm (GA) [37], artificial bee colony (ABC) [24], grid search [38], and the firefly algorithm (FA) [39] have been adopted, amongst others. GBDT and Xgboost have many parameters that are tedious to adjust; furthermore, each parameter can significantly impact the algorithm's prediction performance and needs to be tuned. For GBDT and Xgboost, which integrate multiple decision trees, the global search ability and flexibility of GA are used to compensate for their defects, including the large number of parameters to tune, slow convergence, and a tendency to fall into local optima.

3.5. GA Feature Selection

GA is a class of randomized search algorithms that draws on natural selection and natural genetic mechanisms in the biological world [40]. After each iteration, the termination criterion is checked to see if convergence has been reached or if the maximum number of iterations allowed has been completed. Convergence occurs when all chromosomes in the population have reached the same fitness level. This indicates that an optimal set of characteristics has been determined, and the process can be terminated. However, because this condition may not always be satisfied, a limit is placed on the number of iterations (the maximum number of iterations = 50). Thus, the algorithm is ended if convergence is achieved before the completion of 50 iterations. Otherwise, the maximum number of generations is executed. If neither of these termination criteria is met, the next chromosome population is generated from the previous population by applying tournament selection, mutation, and crossover operations. Figure 3 shows the principle and optimization flowchart of the GA.
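The GA loop described above (tournament selection, crossover, mutation, and the two termination criteria) can be sketched as follows. The fitness here is the toy "onemax" objective rather than a real hyperparameter score, so the chromosome encoding and rates are illustrative assumptions.

```python
import random

random.seed(0)

# Minimal GA sketch: tournament selection, one-point crossover, bit-flip
# mutation; stop on convergence (equal fitness) or after 50 generations.
N_BITS, POP, MAX_GEN = 20, 30, 50

def fitness(chrom):
    return sum(chrom)  # toy "onemax" objective, stands in for a CV score

def tournament(pop, k=3):
    return max(random.sample(pop, k), key=fitness)

pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP)]
for gen in range(MAX_GEN):
    if len({fitness(c) for c in pop}) == 1:    # convergence criterion
        break
    nxt = []
    while len(nxt) < POP:
        p1, p2 = tournament(pop), tournament(pop)
        cut = random.randint(1, N_BITS - 1)     # one-point crossover
        child = p1[:cut] + p2[cut:]
        for i in range(N_BITS):                 # bit-flip mutation
            if random.random() < 0.01:
                child[i] = 1 - child[i]
        nxt.append(child)
    pop = nxt

best = max(pop, key=fitness)
print(gen, fitness(best))
```

For hyperparameter tuning, `fitness` would instead decode a chromosome into (n_estimators, max_depth, learning_rate) values and return a cross-validated score.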

3.6. Evaluation Criterion

This work evaluated the predictive performance of the classification algorithms on the landslide data using the area under the receiver operating characteristic curve (AUC). The receiver operating characteristic (ROC) curve plots the relationship between sensitivity and specificity and evaluates the performance of different classifiers. The ROC [31] is a two-dimensional plot of the false positive rate (FPR, 1 − specificity) versus the true positive rate (TPR, or sensitivity) on the horizontal and vertical axes, and it is a quantitative metric for assessing a model's overall performance. The AUC is the area under the ROC curve and is mainly used to measure the model's generalization performance, i.e., how good or bad the classification effect is, allowing models to be compared quantitatively. Different AUC values reflect different classification effects [41, 42]: 0.900–1.00 represents outstanding performance, 0.800–0.900 good performance, and 0.700–0.800 average performance. Because the destructive consequences of slope instability are serious, recall is also introduced to evaluate model performance: an optimal classifier should rarely misclassify an actual slope failure, so its recall must be as high as possible while maintaining an acceptable accuracy.
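The AUC can be computed directly from its probabilistic definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties counted as half). The scores and labels below are illustrative, not outputs of the models in this study.

```python
# Illustrative classifier scores and binary stability labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,    1,   0,   0]

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

# AUC = P(score of a positive > score of a negative), ties count as 0.5.
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(auc)  # 0.8125
```

This pairwise form is equivalent to integrating the ROC curve and is convenient for quick sanity checks of a library's `roc_auc_score`.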

4. Results

4.1. Comparison of Model Performance after GA Optimization

Height (H), cohesion (c), slope angle (β), unit weight (γ), friction angle (φ), and pore water pressure (ru) [10] are the input parameters of the GBDT model, and the slope stability state (failure/stability) is the output. In this paper, 80% of the 290-case database was used as training data to train the model, and the remaining 20% was used as test data to verify the model's accuracy. AUC and recall were used to evaluate the accuracy of the prediction model. The computation time was 95.5 s for GA-GBDT and 94.6 s for GA-Xgboost, with a CPU utilization of 3.1% and 55.3% of memory used (44.7% available). Different parameters lead to differences in model complexity, training effects, and results during the training process. Figures 4 and 5 illustrate the significant differences in the performance of Xgboost and GBDT, respectively, under different hyperparameters. The nodes in the figures represent the model prediction accuracy under the four parameters; the parameters have different degrees of influence and together determine the prediction performance of the model. Table 2 shows the hyperparameter search space for GBDT and Xgboost, and Table 3 shows the optimal parameter values after optimization by the GA. n_estimators is a numerical parameter with a default value of 100 that specifies the number of weak classifiers. max_depth is numerical with a default value of 3 and is related to pruning. learning_rate is numerical with a default value of 0.1 and specifies the learning rate. random_state is a random seed that controls the random mode of any randomized class or function.
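As a hedged sketch of the tuning loop, the snippet below scores a tiny grid over the same three effective hyperparameters with cross-validation; exhaustive scoring here stands in for the GA search, and the candidate values and data are illustrative, not the search space of Table 2 or the study's database.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 290 slope cases with six features.
X, y = make_classification(n_samples=290, n_features=6, random_state=0)

# Illustrative candidate values (the real search space is given in Table 2).
space = {"n_estimators": [50, 100],
         "max_depth": [2, 3],
         "learning_rate": [0.05, 0.1]}

best_score, best_params = -1.0, None
for n, d, lr in product(*space.values()):
    model = GradientBoostingClassifier(n_estimators=n, max_depth=d,
                                       learning_rate=lr, random_state=0)
    score = cross_val_score(model, X, y, cv=3).mean()  # CV accuracy as fitness
    if score > best_score:
        best_score, best_params = score, (n, d, lr)
print(best_params, best_score)
```

A GA replaces this exhaustive loop with selection, crossover, and mutation over encoded parameter vectors, which scales better when the space is large.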

Table 4 shows the number of accurate predictions and the recall of the different models for predicting landslide susceptibility and instability conditions. Figure 6 shows the AUC line graphs of the different models before and after optimization. Before optimization by the GA, the GBDT and Xgboost algorithms both had an accuracy of 93% and a recall of 85.2%; after optimization, both model performance and accuracy improved. Overall, GA-GBDT has the better accuracy and model performance.

4.2. Multivariate Statistical Logistic Regression Prediction Model

Multiple logistic regression is simpler and more convenient for analyzing multifactor models, and it can accurately measure the degree of correlation and fit between factors [43]. The landslide data are dichotomous, so binary logistic regression was used to further investigate the effect of the six factors on the slope. First, the overall validity of the model was analyzed. Table 5 tests the null hypothesis that the model quality is the same whether or not the independent variables (slope height, slope angle, unit weight, cohesion, friction angle, and pore water pressure) are included. The p value is less than 0.05, so the null hypothesis is rejected; i.e., the independent variables are valid for the model construction, and the model is meaningful.

Slope height, slope angle, unit weight, cohesion, friction angle, and pore water pressure were used as independent variables, while stability was the dependent variable for the binary logit regression analysis. Table 6 shows that slope height, slope angle, unit weight, cohesion, friction angle, and pore water pressure can explain 0.22 of the variation in stability. With the coefficients reported in Table 7, the fitted model equation takes the form

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + 0.002H - 0.045\beta + 0.202\gamma + 0.002c + 0.086\varphi + 0.370\, r_u,$$

where $p$ represents the probability that stability is 1, $1-p$ represents the probability that stability is 0, and $\beta_0$ is the fitted intercept. Based on the results of the multiple logistic regression, the AUC and recall are 0.824 and 77.8%, respectively, which indicate lower accuracy and performance compared with GA-GBDT and GA-Xgboost.
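Plugging the reported coefficients into the logit gives a predicted stability probability for any slope case. The intercept was not reported in the text, so the value below is a hypothetical placeholder, as is the example slope; the sketch only illustrates how the fitted equation is evaluated.

```python
import math

# Reported regression coefficients (see Table 7).
coef = {"H": 0.002, "beta": -0.045, "gamma": 0.202,
        "c": 0.002, "phi": 0.086, "ru": 0.370}
b0 = -3.0  # hypothetical intercept, for illustration only (not reported)

# Hypothetical slope case: H = 50 m, beta = 35 deg, gamma = 20 kN/m3,
# c = 25 kPa, phi = 30 deg, ru = 0.25.
x = {"H": 50, "beta": 35, "gamma": 20, "c": 25, "phi": 30, "ru": 0.25}

z = b0 + sum(coef[k] * x[k] for k in coef)   # linear predictor (logit)
p_stable = 1.0 / (1.0 + math.exp(-z))        # P(stability = 1)
print(round(p_stable, 3))
```

With a different intercept the probability shifts, but the signs of the coefficients (e.g., the negative slope-angle term) act as described below.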

Furthermore, the regression coefficient of slope height is 0.002 but does not show significance, implying that slope height has less effect on stability. The regression coefficient of slope angle is −0.045 and is significant at the 0.01 level, indicating that slope angle has a significant negative influence on stability. The regression coefficient of unit weight is 0.202 and is significant at the 0.01 level, indicating that unit weight significantly and positively influences stability; furthermore, the odds ratio (OR) is 1.224, indicating that the odds of stability increase by a factor of 1.224 when the unit weight increases by one unit. The regression coefficient of cohesion is 0.002 but does not show significance, implying that cohesion has little influence on stability. The regression coefficient of friction angle is 0.086 and is significant at the 0.01 level, implying that friction angle significantly influences stability; the odds ratio (OR) is 1.090, implying that the odds of stability increase by a factor of 1.090 when the friction angle increases by one unit. The regression coefficient of pore water pressure is 0.370 but does not show significance, implying that the pore water pressure ratio does not have a significant influence on stability.

Overall, according to the results in Table 7, unit weight and friction angle significantly positively affect stability. The slope angle has a negative effect on stability. However, slope height, cohesion, and pore water ratio have less influence on slope stability.

4.3. Comparison of Statistical-Based Models with Multiple Machine Learning Algorithm Models

To compare the performance of different classification algorithms, SVM, logistic regression (LR), GBDT, K-nearest neighbor (KNN), RF, Xgboost, naive Bayes (GaussianNB), GA-Xgboost, and GA-GBDT were used for landslide susceptibility prediction. The prediction results of the models are shown in Figure 7. The AUC of SVM is 0.746, LR is 0.824, GBDT is 0.894, KNN is 0.817, RF is 0.894, Xgboost is 0.910, GaussianNB is 0.824, GA-Xgboost is 0.928, and GA-GBDT is 0.933. Table 8 shows the recall of SVM, LR, KNN, RF, and GaussianNB. It is worth noting that the AUC determines classifier performance, with 1.0 representing ideal performance. The ROC curves of the GA-optimized models are closer to the upper left corner than those of the other models. The AUC value of GA-GBDT is the highest at 0.933, slightly higher than Xgboost and GA-Xgboost and significantly higher than SVM, LR, KNN, RF, GaussianNB, and GBDT.

Figure 8 is a radar chart of the AUC and recall distributions, which visually shows the prediction performance of the different models. The ensemble learning algorithms are more suitable for landslide susceptibility prediction than the other algorithms, while the prediction performance of SVM is the lowest. Using GA to optimize Xgboost and GBDT significantly improves the AUC compared with the original models, indicating that algorithm accuracy is strongly correlated with the parameters. Overall, the GA-GBDT model has the best accuracy and performance, and the test results show that it is the most suitable for predicting landslide susceptibility.

Different modeling approaches can lead to different results. An evaluation of the predictive performance of numerous machine learning models shows that most are more accurate than the traditional statistical models used for landslide susceptibility modeling. Machine learning methods can automatically identify hidden complex relationships among the relevant variables. Compared with recent studies [27–29], the results of this study confirm that ensemble learning algorithms are well suited to landslide susceptibility prediction and that model accuracy improves after optimization by a heuristic algorithm. Although results differ among studies from different regions of the world, ensemble models widely show excellent landslide susceptibility prediction (AUC > 0.8). It is worth noting that different geological conditions and regional influences have important effects on the accuracy of model predictions; therefore, selecting a rich and high-quality landslide dataset can increase the actual predictive power of the model.

4.4. Analysis of the Importance of Influencing Factors

It is crucial to determine the sensitivity of the factors affecting landslide susceptibility for evaluating landslide susceptibility and designing support structures. The GA-GBDT algorithm has a good feature identification function and can output the influence strength of the different parameters on landslide susceptibility. This study therefore used the relative importance scores of the best-performing model on the test set (GA-GBDT) to investigate sensitivity. Figure 9 shows the normalized importance scores of the six influencing factors ranked by the GA-GBDT model. Unit weight (score = 0.4593), friction angle (score = 0.1245), and cohesion (score = 0.1237) are the most sensitive factors for landslide susceptibility, which indicates the importance of the slope material variables. Therefore, the values of material unit weight, friction angle, and cohesion for artificial slopes must be selected reasonably and accurately based on specific laboratory and field tests, and particular attention should be paid to the cohesion and friction angle of the geological material when assessing a slope for landslide potential. The importance scores of slope angle and height are 0.1161 and 0.1026, respectively, indicating that the geometric variables also affect landslide susceptibility; optimizing these two variables in the actual design is a feasible approach to ensuring stability. Finally, pore water pressure (ru) (score = 0.0737) has the lowest sensitivity.
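The ranking above can be reproduced directly from the reported scores; the short check below confirms they behave as normalized importances (summing to approximately 1) and orders them as in Figure 9. The dictionary keys are descriptive labels introduced here, not identifiers from the study.

```python
# Reported GA-GBDT importance scores for the six influencing factors.
scores = {"unit_weight": 0.4593, "friction_angle": 0.1245, "cohesion": 0.1237,
          "slope_angle": 0.1161, "height": 0.1026, "pore_pressure": 0.0737}

ranking = sorted(scores, key=scores.get, reverse=True)  # most to least sensitive
total = sum(scores.values())                            # should be ~1 (normalized)
print(ranking, round(total, 4))
```

In scikit-learn, the analogous values come from a fitted model's `feature_importances_` attribute, which is likewise normalized to sum to 1.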

5. Discussion

5.1. Variability in Model Performance

Comparing landslide susceptibility prediction results across models, the ensemble learning models GBDT, RF, and Xgboost have higher accuracy than statistical models and some machine learning models such as SVM and GaussianNB. The models optimized by the heuristic GA showed further increased prediction accuracy. Pham et al. [27] strongly recommend implementing a chosen heuristic along with the machine learning model rather than directly applying only the machine learning model or its ensemble form.

It is also essential to discuss the strengths and weaknesses of the applied models, as each has its own. Usually, model performance depends on the research area and the related influencing factors [44]. In this study, GA-Xgboost and GA-GBDT have higher accuracy than the unoptimized machine learning models (e.g., Xgboost and GaussianNB). GA has the advantages of fast random search capability, scalability, and independence from the specific problem domain, and it is easy to combine with other algorithms; however, the potential power of its parallel mechanism is not fully exploited [45]. There are several other drawbacks as well, such as the dependence on the fitness function; addressing these is an active area of current GA research [46].

RF is highly tolerant of outliers and noise and can handle multidimensional data without overfitting, yet a large number of trees may slow down the algorithm and prevent real-time prediction (Arabameri et al., 2019). KNN only needs to store the training samples and labels, without parameter estimation or training; however, when the samples are unbalanced, the predicted category of a new sample is biased toward the dominant category in the training set, which can easily lead to prediction errors [47]. The GaussianNB model can achieve high prediction accuracy when learning from data with missing values [30], but it assumes independence among the data attributes, which causes difficulties in practical applications and low classification efficiency when there are many attributes or strong correlations among them [48]. The Xgboost method supports both linear and CART classifiers, which helps prevent overfitting and reduces model complexity, despite lacking smoothness [49]. Although SVM is a classical small-sample learning algorithm that can reach a high level of learning accuracy with a small amount of data, both its accuracy and computational speed decrease when facing multidimensional variables and large datasets [50]. The performance of GBDT is a step up from RF, so its advantages are also obvious: it flexibly handles various data types and achieves high prediction accuracy with relatively little tuning time. However, because it is a boosting method, there is a serial dependence among the base learners, which makes parallel training challenging [51]. Therefore, it is necessary to apply the selected models to different datasets and scenarios of the landslide susceptibility problem for training and comparative analysis of the merits and performance of each model.

5.2. Challenges in Landslide Susceptibility Assessment

With the deepening of research, many researchers have proposed optimization models with high prediction accuracy and strong generalization ability [52]. These techniques have a strong self-learning ability, can process large amounts of data efficiently and accurately, and are mainly used in landslide susceptibility evaluation to support decisions on risk reduction [53]. However, some problems and challenges remain to be solved. (1) The influencing factors considered in machine learning models of landslide susceptibility are often incomplete. Most current research focuses on geotechnical properties, slope angle, height, and so on, while uncovered factors, such as climate and environmental changes, can also have significant impacts. (2) Machine learning is data-driven: it searches for inherent hidden relationships in order to predict landslide geohazards. Even a well-designed model cannot deliver better predictions if higher-quality data cannot be obtained. (3) Data-driven machine learning models struggle to explain the mechanism of landslide occurrence, which limits their applicability and affects prediction accuracy; improving the interpretability of machine learning algorithms is therefore an important direction for future research.

6. Conclusions

This work successfully applied the GA-GBDT and GA-Xgboost methods to landslide susceptibility prediction using 290 historical case records of slope conditions. The variability of model performance with different parameters was examined, and GA was used to optimize four parameters of Xgboost and GBDT: n_estimators, learning_rate, max_depth, and random_state.
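The GA-over-hyperparameters workflow can be sketched as follows, coupling a tiny hand-written GA to a GBDT whose fitness is cross-validated AUC. This is a minimal sketch on synthetic data, not the paper's implementation: the candidate value grids, population size, and generation count are all illustrative assumptions, and the same loop would apply to Xgboost's equivalent parameters.

```python
import random
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the slope database; the real study used 290 cases.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Candidate values for the four tuned parameters named in the text.
SPACE = {
    "n_estimators": [50, 100, 150],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
    "random_state": [0, 1, 2],
}
KEYS = list(SPACE)

def fitness(ind):
    # Cross-validated AUC serves as the GA fitness, as in the paper.
    model = GradientBoostingClassifier(**dict(zip(KEYS, ind)))
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

rng = random.Random(1)
pop = [[rng.choice(SPACE[k]) for k in KEYS] for _ in range(6)]
for _ in range(3):  # a few generations are enough for a sketch
    elite = sorted(pop, key=fitness, reverse=True)[:2]  # selection
    children = []
    while len(children) < 4:
        cut = rng.randrange(1, len(KEYS))               # one-point crossover
        child = elite[0][:cut] + elite[1][cut:]
        if rng.random() < 0.3:                          # mutation: resample a gene
            g = rng.randrange(len(KEYS))
            child[g] = rng.choice(SPACE[KEYS[g]])
        children.append(child)
    pop = elite + children

best = max(pop, key=fitness)
print(dict(zip(KEYS, best)))
```

In practice the population and generation counts would be much larger, and the search space continuous rather than a small grid; the structure of the loop is unchanged.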

The GA-GBDT and GA-Xgboost models were compared with SVM, LR, KNN, RF, GaussianNB, Xgboost, and GBDT to evaluate model fit. GBDT and Xgboost optimized using GA yielded the classification models with the highest AUCs, 0.928 and 0.933, respectively. Relative variable importance analysis showed that the geotechnical parameters (unit weight, friction angle, and cohesion) significantly affected landslide susceptibility. The results suggest that GA-GBDT and GA-Xgboost can explore the nonlinear relationship between landslide susceptibility and its influencing factors.

Based on the database of this paper, logistic regression on the influence of different variables shows that unit weight and friction angle have significant positive influences on stability, while slope angle has a significant negative influence. Slope height, cohesion, and pore water pressure have less influence on stability. The intrinsic effects were explored further.
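The sign-and-magnitude reading of logistic regression coefficients can be sketched as follows. This runs on synthetic data standing in for the paper's 290-case database, so the coefficient values it prints are not the study's results; the factor names are the six variables from the text, and standardization is assumed so that coefficients are comparable across factors.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# The six influencing factors from the text (illustrative ordering).
FACTORS = ["H", "alpha", "gamma", "c", "phi", "ru"]

# Synthetic stand-in; the real analysis used the paper's slope database.
X, y = make_classification(n_samples=290, n_features=6, n_informative=6,
                           n_redundant=0, random_state=0)

# Standardizing puts coefficients on a comparable scale, so their signs
# and magnitudes can be read as direction and strength of influence.
Xs = StandardScaler().fit_transform(X)
lr = LogisticRegression(max_iter=1000).fit(Xs, y)

for name, coef in zip(FACTORS, lr.coef_[0]):
    direction = "positive" if coef > 0 else "negative"
    print(f"{name}: {coef:+.3f} ({direction} influence)")
```

On the real database, a positive standardized coefficient corresponds to a factor that increases the predicted probability of the stable class, which is how statements like "friction angle positively influences stability" are obtained.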

However, although the analysis results are encouraging, some issues remain outstanding. The impact of data imbalance on landslide susceptibility prediction will be addressed in future research. The developed model can be improved by analyzing more extensive datasets, and it can be extended to other mining and geotechnical hazard problems when data become available.

Data Availability

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Authors’ Contributions

Binbin Zheng and Jiahe Wang worked together to complete the entire article. Wensong Wang and Tingting Feng conceptualized the study. Guangjin Wang and Yonghao Yang contributed to data curation. Yufei Wang, Wensong Wang, and Tingting Feng performed formal analysis. Wensong Wang, Binbin Zheng, and Guangjin Wang provided funding acquisition. Jiahe Wang, Yufei Wang, and Yonghao Yang performed investigation. Jiahe Wang contributed to the methodology. Wensong Wang provided project administration. Binbin Zheng performed supervision. Tingting Feng and Yufei Wang contributed to validation. Tingting Feng and Binbin Zheng performed visualization. Wensong Wang and Jiahe Wang wrote the original draft. Wensong Wang wrote, reviewed, and edited the manuscript.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (51804178 and 51804051), the Chongqing Natural Science Foundation (No. CSTB2022NSCQ-MSX0497), the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant no. KJQN202100717), and the Yangtze River Joint Research Phase II Program (No. 2022-LHYJ-02-0201).