Abstract

For the safe and economical construction of embankment dams, the mechanical behaviour of the rockfill materials used in the dam’s shell must be analyzed. Characterizing the shear strength of rockfill materials is difficult and expensive because they contain particles greater than 500 mm in diameter. This work investigates the feasibility of using the extreme gradient boosting (XGBoost) algorithm to estimate the shear strength of rockfill materials. A database of 165 samples obtained from the literature is used to train and validate the proposed XGBoost model. The XGBoost model was compared against support vector machine (SVM), adaptive boosting (AdaBoost), random forest (RF), and K-nearest neighbor (KNN) models described in the literature. XGBoost outperforms the SVM, RF, AdaBoost, and KNN models in terms of three performance metrics: the coefficient of determination (R2), the Nash–Sutcliffe coefficient (NSE), and the ratio of the root mean square error (RMSE) to the standard deviation of the measured data (RSR). In the training phase, the XGBoost model has the highest prediction performance (R2 = 0.9707, NSE = 0.9701, and RSR = 0.1729), followed by the SVM model (R2 = 0.9655, NSE = 0.9639, and RSR = 0.1899), the RF model (R2 = 0.9545, NSE = 0.9542, and RSR = 0.2140), the AdaBoost model (R2 = 0.9390, NSE = 0.9388, and RSR = 0.2474), and the KNN model (R2 = 0.6233, NSE = 0.6180, and RSR = 0.6181). A sensitivity analysis was conducted to ascertain the impact of each input parameter. This study demonstrates that the established XGBoost model for estimating the shear strength of rockfill materials is reliable.

1. Introduction

Rockfill materials (RFM) are commonly used in the construction of high embankment dams in order to harness natural water resources. RFM consists of gravels, cobbles, and boulders obtained by blasting rock quarries or collected from natural riverbeds. Material from riverbeds is rounded to subrounded, and material from quarries is angular to subangular. Mineral composition, particle size, shape, gradation, individual particle strength, void content, relative density (RD), and particle surface roughness all influence the behaviour of the RFMs used in the construction of rockfill dams. Therefore, it is essential to understand and characterise the behaviour of these materials for the study and safe construction of rockfill dams.

In engineering practice, the particle size of rockfill materials ranges from 400 to 600 mm and can exceed 1000 mm. Due to the constraints of laboratory testing equipment, rockfill materials that exceed the maximum permissible particle size must be scaled down. To determine the mechanical properties of on-site rockfill materials, laboratory testing uses analog simulation: test specimens are built with the same internal structure as the prototype rockfill materials, and the engineering characteristics of the prototype are inferred from them. Several studies, including Abbas et al. [1], Gupta [2], Venkatachalam [3], Marsal [4], Mirachi [5], and Honkanadavar and Sharma [6], carried out laboratory experiments on different RFMs and revealed that their stress-strain behaviour is nonlinear, inelastic, and dependent on the stress level. They also reported that the angle of internal friction increases as the maximum particle size of riverbed RFM increases, while the opposite trend holds for quarry RFM. Frossard et al. [7] proposed a rational approach for estimating RFM shear strength based on size effects. Because large-scale strength testing is difficult and the mechanical behaviour of RFMs is hard to define, Honkanadavar and Gupta [8] developed a power law relating the shear strength parameter to various index properties of riverbed RFM. Numerous methodologies have been developed to anticipate the behaviour of such materials. RFM with large particles (e.g., a particle size of 1200 mm) cannot be tested under laboratory conditions, large-scale shear tests are time-consuming and complicated, and the nonlinear shear strength function is hard to predict without an analytical method [8].

Over the last ten years, approaches based on machine learning (ML) algorithms have been widely applied to solve real-world problems, particularly in civil engineering. Numerous practical problems have been effectively addressed using ML techniques, paving the way for many promising opportunities in civil engineering and related fields such as environmental [9] and geotechnical engineering [10–15], including the prediction of RFM shear strength [16–18]. In this context, Kaunda [16] used the artificial neural network (ANN) approach to estimate RFM shear strength. Zhou et al. [17] used cubist and random forest regression techniques and found that both models are more accurate for RFM shear strength estimation than ANN and traditional regression models. Ahmad et al. [18] used support vector machine (SVM), random forest (RF), AdaBoost, and K-nearest neighbor (KNN) algorithms to estimate the shear strength of RFM and concluded that the SVM model achieved better prediction performance than the RF, AdaBoost, and KNN models. This field, however, is still being investigated. The article aims to make the following contributions:
(i) To evaluate the predictive capacity of the XGBoost algorithm for the shear strength of RFM
(ii) To compare the proposed model with the reference models used in the published literature
(iii) To conduct a sensitivity analysis to assess the influence of each input parameter on the RFM’s shear strength

The structure of the paper is as follows: The theory of extreme gradient boosting is explained in Section 2. Data collection and correlation analysis are presented in Section 3. Section 4 describes the compared ML methods and the performance measures employed. Section 5 outlines the methodology. Section 6 presents the obtained results and discusses them. Finally, conclusions based on the achieved results are provided in Section 7.

2. Extreme Gradient Boosting (XGBoost)

Chen and Guestrin [19] proposed extreme gradient boosting (XGBoost), a supervised learning technique within the gradient boosting framework that has received widespread recognition in Kaggle machine learning contests owing to its high efficiency and considerable flexibility. XGBoost adds a regularization term to the objective function, which helps to smooth the final learnt weights and avoid overfitting [19]. It also optimizes the loss function using first- and second-order gradient statistics. In addition to the regularization term, XGBoost supports row and column sampling to further reduce overfitting. Its parallel and distributed computation also makes model exploration faster.

The XGBoost algorithm can be described as follows [20]. Given a dataset with n examples and m features, K additive functions are used to predict the output of a tree ensemble model:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F,$$

where F is the space of regression trees, defined as

$$F = \left\{ f(x) = w_{q(x)} \right\}, \quad q: \mathbb{R}^m \rightarrow T, \; w \in \mathbb{R}^T,$$

where q represents the structure of each tree, T is the number of leaves in the tree, and each $f_k$ corresponds to an independent tree structure q and leaf weights w. To reduce the error of the tree ensemble, the XGBoost model minimizes the following objective function:

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t),$$

where l is a differentiable convex loss function that measures the error between the predicted and measured values; $y_i$ and $\hat{y}_i$ are the measured and predicted values, respectively; t denotes the iteration used to minimize the error; and $\Omega$ penalizes the complexity of the regression tree functions:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2.$$

Here w is the vector of leaf scores, $\gamma$ is the minimum loss reduction required to split a leaf node further, and $\lambda$ is the regularization parameter. Together, $\gamma$ and $\lambda$ control the complexity of the tree, and the regularization term helps to avoid overfitting by smoothing the final learnt weights. A second-order Taylor expansion is applied to the objective function to simplify it further:

$$L^{(t)} \simeq \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),$$

where $g_i$ and $h_i$ are the first and second derivatives of the loss function with respect to $\hat{y}_i^{(t-1)}$, respectively. More detailed explanations of the XGBoost algorithm can be found in Chen and Guestrin’s [19] research paper.
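As a small illustration of how the regularization parameters act on a single leaf, the following sketch evaluates the closed-form leaf weight that follows from the second-order approximation in Chen and Guestrin's derivation, w* = −G/(H + λ), where G and H are the sums of the first and second derivatives over the samples in the leaf. It is not taken from this study: the squared-error loss, the sample values, and all function names are assumptions chosen for illustration, and the code does not call the XGBoost library.

```python
# Minimal sketch (not the XGBoost library): optimal weight of one leaf under the
# second-order approximation, ASSUMING the loss l = 0.5 * (y - y_hat)^2.

def gradients(y_true, y_pred):
    """First (g) and second (h) derivatives of 0.5 * (y - y_hat)^2 w.r.t. y_hat."""
    g = [p - t for t, p in zip(y_true, y_pred)]
    h = [1.0 for _ in y_true]
    return g, h

def optimal_leaf_weight(g, h, lam):
    """w* = -G / (H + lambda), the minimiser of G*w + 0.5*(H + lambda)*w^2."""
    return -sum(g) / (sum(h) + lam)

def leaf_objective(g, h, lam, gamma):
    """Objective contribution of one leaf evaluated at w*, plus the gamma penalty."""
    G, H = sum(g), sum(h)
    return -0.5 * G * G / (H + lam) + gamma

# Hypothetical values, for illustration only.
y_true = [1.0, 1.2, 0.8]   # measured values falling into the leaf
y_pred = [0.0, 0.0, 0.0]   # prediction from the previous iteration
g, h = gradients(y_true, y_pred)

w_no_reg = optimal_leaf_weight(g, h, lam=0.0)  # equals the mean residual
w_reg = optimal_leaf_weight(g, h, lam=1.0)     # shrunk toward zero by lambda
```

With λ = 0 the leaf weight is simply the mean residual of the samples in the leaf; increasing λ shrinks it toward zero, which is the smoothing effect described above.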

3. Dataset Collection and Correlation Analysis

In this study, a database of 165 RFM shear strength records was collected from Kaunda [16]; it is presented in Appendix A, Table A1 of the supplementary file. All input parameters that might influence the shear strength of RFM were considered. The included parameters are D10, D30, D60, and D90, the sieve sizes corresponding to 10%, 30%, 60%, and 90% passing, respectively. Cc and Cu are the coefficients of curvature and uniformity, respectively; FM and GM are the fineness modulus and gradation modulus, respectively; R is the International Society of Rock Mechanics (ISRM) hardness rating; UCSmin and UCSmax are the lower and upper bounds of the uniaxial compressive strength (MPa); γ is the dry unit weight (kN/m3); and σn is the normal stress (MPa). The output is the shear strength of RFM, denoted as τ (MPa). A statistical summary of the database is presented in Table 1, which includes the boundary values and standard deviation of all parameters used in this study.

The correlation coefficient (ρ) was used to verify the strength of the correlation between different parameters (see Figure 1). For a given pair of random variables (m, n), ρ is given by

$$\rho(m, n) = \frac{\operatorname{cov}(m, n)}{\sigma_m \sigma_n},$$

where cov denotes covariance, $\sigma_m$ denotes the standard deviation of m, and $\sigma_n$ denotes the standard deviation of n. $|\rho| > 0.8$ represents a strong correlation between m and n, values between 0.3 and 0.8 represent a moderate relationship, and $|\rho| < 0.3$ represents a weak relationship [21]. As per Song et al. [22], a correlation is considered “strong” if $|\rho| > 0.8$. The relationships between input and output parameters are shown in Figure 1 in order from strong to weak. No input factors were removed from the estimation model for τ. The correlation coefficient has a maximum absolute value of 0.97, as shown in Figure 1.
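The correlation screening described above can be sketched in a few lines of plain Python. The function names and the sample values below are hypothetical and are not taken from Table A1; the strength labels use 0.3 and 0.8 as cut-offs, which is an assumption based on the ranges quoted in the text.

```python
import math

def pearson(m, n):
    """Correlation coefficient: cov(m, n) / (sigma_m * sigma_n)."""
    mean_m = sum(m) / len(m)
    mean_n = sum(n) / len(n)
    cov = sum((a - mean_m) * (b - mean_n) for a, b in zip(m, n)) / len(m)
    sd_m = math.sqrt(sum((a - mean_m) ** 2 for a in m) / len(m))
    sd_n = math.sqrt(sum((b - mean_n) ** 2 for b in n) / len(n))
    return cov / (sd_m * sd_n)

def strength(rho):
    """Label a coefficient using the (assumed) 0.3 and 0.8 cut-offs."""
    magnitude = abs(rho)
    if magnitude > 0.8:
        return "strong"
    if magnitude >= 0.3:
        return "moderate"
    return "weak"

# Hypothetical normal stress and shear strength values (MPa), for illustration.
sigma_n = [0.1, 0.2, 0.4, 0.8]
tau = [0.12, 0.21, 0.37, 0.70]
rho = pearson(sigma_n, tau)
label = strength(rho)
```

Applying `pearson` to every input column against τ would yield the kind of ranking that Figure 1 summarises.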

4. Evaluation and Prediction

To evaluate the predictive capacity of the XGBoost algorithm, we compared it with other machine learning methods developed in the literature, using a common set of performance measures.

4.1. Compared Machine Learning (ML) Methods

The XGBoost model was compared with other prediction methods such as support vector machine, adaptive boosting, random forest, and K-nearest neighbor proposed in literature. A brief description of each technique is presented. For a more in-depth discussion, the reader is referred to the relevant references.

4.1.1. Support Vector Machine (SVM)

The support vector machine (SVM) regression technique builds on feature classification: it generates an interclass hyperplane and minimizes the vector lengths and the variance between the features and the plane. The SVM is compatible with the majority of kernel types, including Euclidean, Gaussian, Exponential, and Dirichlet kernels [23]. The objective function for SVM regression contains a cost coefficient that helps determine the flatness of the resulting hyperplane [24]. This allows the user to adapt the SVM technique to specific datasets.

4.1.2. Adaptive Boosting (AdaBoost)

Adaptive boosting (AdaBoost) is a boosting machine learning technique in which weak learners are combined to form a strong learner. The number of base learners (n) must be defined as a parameter [25]. During the training phase, AdaBoost builds low-accuracy learners, each of which improves on its predecessors [26]. In this way, AdaBoost dynamically adjusts the training weights based on the performance of the base learning algorithms [27].

4.1.3. Random Forest (RF)

Random forests are ensemble models that use many decision trees as base learners to obtain more precise outcomes. Individual trees are grown from bootstrap samples of the training data, with randomly selected parameters at their roots and nodes [28]. A forest of decision trees is more stable than a single tree because averaging the outcomes reduces overfitting [26]. The three primary parameters of a random forest are the number of trees in the forest, the number of randomly selected predictors at each binary node, and the minimum number of observations at the nodes of the trees [29].

4.1.4. K-nearest Neighbor (KNN)

KNN is a supervised machine learning algorithm that can be used for both classification and regression problems. In regression problems, the prediction for a data point is based on the k training examples that are most similar to it. The outcome of KNN regression is the object’s property value, computed as the mean of the values of its k nearest neighbors. A distance metric such as the Euclidean or Mahalanobis distance can be used to locate the k nearest neighbors of a data point [30].
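To make the KNN regression rule concrete, the following is a minimal plain-Python sketch using the Euclidean distance. It is an illustrative assumption rather than the implementation used in the compared studies, and the function name, the toy data, and the choice k = 3 are hypothetical.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict a value as the mean target of the k nearest training points."""
    # Pair each training target with its Euclidean distance to the query point.
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical one-feature training set, for illustration only.
train_X = [[0.0], [1.0], [2.0], [10.0]]
train_y = [0.0, 1.0, 2.0, 10.0]

pred_k3 = knn_predict(train_X, train_y, [1.1], k=3)  # mean of targets 1.0, 2.0, 0.0
pred_k1 = knn_predict(train_X, train_y, [1.1], k=1)  # target of the closest point
```

Because the prediction is an unweighted mean over neighbors, it is sensitive to the scale of the input features, so inputs with very different units are usually normalised before distances are computed.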

4.2. Evaluation Measures

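Three quantitative statistical indices, i.e., the coefficient of determination (R2), the ratio of the root mean square error to the standard deviation of the measured data (RSR), and the Nash–Sutcliffe coefficient (NSE), were employed to validate and compare the XGBoost model. The indices are defined as follows:

$$R^2 = \frac{\sum_{i=1}^{n} (\tau_i - \bar{\tau})^2 - \sum_{i=1}^{n} (\tau_i - \hat{\tau}_i)^2}{\sum_{i=1}^{n} (\tau_i - \bar{\tau})^2},$$

$$\mathrm{RSR} = \frac{\mathrm{RMSE}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (\tau_i - \bar{\tau})^2}}, \quad \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\tau_i - \hat{\tau}_i)^2},$$

$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n} (\tau_i - \hat{\tau}_i)^2}{\sum_{i=1}^{n} (\tau_i - \bar{\tau})^2},$$

where $n$ is the total number of data; $\tau_i$ and $\hat{\tau}_i$ are the actual shear strength and the predicted shear strength, respectively; and $\bar{\tau}$ is the mean of the actual shear strength.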

Values of the coefficient of determination (R2) closer to 1 imply that the model fits the data better. When R2 is greater than 0.8 and close to 1, the model is deemed robust [31]. The NSE is a normalized statistic that determines the relative magnitude of the residual variance compared to the variance of the measured data [32]. The NSE ranges from −∞ to 1, with 1 denoting an ideal match. If the NSE value is greater than 0.65, a strong correlation exists [32, 33]. The RMSE-standard deviation ratio (RSR) is computed by dividing the RMSE by the standard deviation of the observed data. The RSR ranges from the optimal value of 0 to a large positive number. Model performance is classified as very good, good, acceptable, or unacceptable for 0.000 ≤ RSR ≤ 0.500, 0.500 < RSR ≤ 0.600, 0.600 < RSR ≤ 0.700, and RSR > 0.700, respectively [34].
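The three indices can be computed as in the sketch below. It assumes the sum-of-squares forms of R2 and NSE and a population standard deviation in RSR, which may differ from the exact expressions used in the study; the function names and sample values are hypothetical. Under these assumptions R2 coincides with NSE, and RSR equals the square root of 1 − NSE, a relation that the reported NSE and RSR values also satisfy (for example, √(1 − 0.9701) ≈ 0.1729).

```python
import math

def _sums(actual, predicted):
    """Return (sum of squared errors, total sum of squares about the mean)."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return sse, sst

def nse(actual, predicted):
    """Nash-Sutcliffe coefficient: 1 - SSE / SST."""
    sse, sst = _sums(actual, predicted)
    return 1.0 - sse / sst

def r2(actual, predicted):
    """Coefficient of determination in the (SST - SSE) / SST form (assumed)."""
    sse, sst = _sums(actual, predicted)
    return (sst - sse) / sst

def rsr(actual, predicted):
    """RMSE divided by the standard deviation of the measured data."""
    sse, sst = _sums(actual, predicted)
    n = len(actual)
    return math.sqrt(sse / n) / math.sqrt(sst / n)

def rsr_class(value):
    """Map an RSR value to the classification ranges quoted in the text."""
    if value <= 0.5:
        return "very good"
    if value <= 0.6:
        return "good"
    if value <= 0.7:
        return "acceptable"
    return "unacceptable"

# Hypothetical actual and predicted shear strengths (MPa), for illustration.
actual = [0.2, 0.5, 0.9, 1.4]
predicted = [0.25, 0.45, 1.0, 1.3]
scores = (r2(actual, predicted), nse(actual, predicted), rsr(actual, predicted))
```

Some studies report R2 as the squared Pearson correlation instead, in which case R2 and NSE need not be equal.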

5. Methodology

The present study follows a framework that involves four main steps: (1) data preparation and correlation analysis, (2) development of the model, (3) validation of the proposed model, and (4) sensitivity analysis (Figure 2):
(1) Data preparation and correlation analysis: In this first step, the laboratory sample data were used to build the training and testing datasets. The training dataset was constructed using 80% of the total data, while the testing dataset was built from the remaining 20%.
(2) Development of the model: In this second step, the training dataset was used to train the model based on the XGBoost algorithm. The user-defined parameters were optimized by carrying out multiple runs with these parameters on the training data and analyzing the performance of the resulting models on the testing data. All training and testing operations were carried out in Orange software.
(3) Validation of the proposed model: In this third step, the testing dataset was used to validate the proposed model. Statistical indices including R2, NSE, and RSR were applied for validation. The proposed model is compared with the reference models used in the published literature. Furthermore, a Taylor diagram is used to illustrate how close the models (including the proposed XGBoost) are to the reference/observed point.
(4) Sensitivity analysis: In the last step, sensitivity analysis is used to evaluate the influence of the input factors on the shear strength of rockfill material.
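The 80/20 partition in step (1) can be illustrated with the short sketch below. The study performed this step in Orange software, so the code is only a hypothetical plain-Python stand-in; the seeded random shuffle, the function name, and the use of row indices in place of the actual records are assumptions.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle row indices reproducibly and split them into two subsets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_train = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test

# Stand-in for the 165 records of the database (row indices only).
rows = list(range(165))
train_rows, test_rows = train_test_split(rows)
```

For 165 records this yields 132 training cases and 33 testing cases, matching the subset sizes used in the study.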

6. Results and Discussion

The proposed model for estimating the RFM shear strength was developed using Orange software. The predictor variables were provided via an input set (x) defined by x = [D10, D30, D60, D90, Cc, Cu, GM, FM, R, UCSmin, UCSmax, γ, σn], while the target variable (y) is the shear strength (τ) of the rockfill material. Every modelling stage requires the selection of suitable sizes for the training and testing datasets. In this study, 80% (132 cases) of the total data were used to build the models, while the remaining 20% (33 cases) were used to test them. The XGBoost model was tuned through trial and error to obtain hyperparameter values that give an accurate estimate of the shear strength of rockfill materials. This study optimizes some essential XGBoost parameters and clarifies the definitions of these hyperparameters. The tuning parameters for the model were selected and then changed during the trials until the best metrics were obtained (Table 2).

The predictive performance on the training and testing datasets is shown in regression form in Figure 3. In the training phase, the XGBoost model produced the best prediction results (i.e., R2 = 0.9707, NSE = 0.9701, and RSR = 0.1729) compared to SVM (i.e., R2 = 0.9655, NSE = 0.9639, and RSR = 0.1899), RF (i.e., R2 = 0.9545, NSE = 0.9542, and RSR = 0.2140), AdaBoost (i.e., R2 = 0.9390, NSE = 0.9388, and RSR = 0.2474), and KNN (i.e., R2 = 0.6233, NSE = 0.6180, and RSR = 0.6181). This is confirmed by the R2, NSE, and RSR values in Figure 4, where XGBoost shows a lower RSR and higher R2 and NSE values than the SVM, RF, AdaBoost, and KNN models developed in the literature by Ahmad et al. [18]. The parameter optimization is presented in Table 2.

As depicted in Figure 4, the XGBoost model also performed best in the testing phase in terms of R2, NSE, and RSR (i.e., R2 = 0.9676, NSE = 0.9672, and RSR = 0.1812) compared to SVM (i.e., R2 = 0.9656, NSE = 0.9654, and RSR = 0.1861), RF (i.e., R2 = 0.9656, NSE = 0.9164, and RSR = 0.2891), AdaBoost (i.e., R2 = 0.9181, NSE = 0.8835, and RSR = 0.3414), and KNN (i.e., R2 = 0.6304, NSE = 0.6076, and RSR = 0.6264). The outcomes of this study and of the prior study by Ahmad et al. [18] (see Figure 4) demonstrate that ML methods can accurately predict the shear strength of RFMs. Comparing the outcomes of the two studies is meaningful because the datasets and inputs are the same. The XGBoost model outperforms the other models in predictive performance and offers a balanced prediction across the training and testing datasets. However, because the dataset used in this study is small, additional research on other datasets is necessary to establish the most general model for predicting the shear strength of RFM.

The difference between the actual and predicted shear strength of RFM is represented in Figure 5 by comparing the results of the training and testing sets. The proposed XGBoost model is satisfactory for predicting the RFM shear strength, barring a few noise points.

A Taylor diagram (see Figure 6) is used to illustrate how close the models (including the proposed XGBoost) are to the reference/observed point based on their correlation, root-mean-square error difference, and the amplitude of their variations (represented by their standard deviations). The closer a model point is to the reference/observed point, the better its performance. In terms of predictive ability, the proposed XGBoost model outperforms the SVM, RF, AdaBoost, and KNN models developed in the literature by Ahmad et al. [18].

The sensitivity of the XGBoost model was evaluated using Yang and Zang’s [35] approach for assessing the influence of input factors on the shear strength of rockfill material. This approach, which has been used in numerous studies [36–41], is as follows:

$$r_{ij} = \frac{\sum_{k=1}^{n} x_{ik}\, y_{jk}}{\sqrt{\sum_{k=1}^{n} x_{ik}^{2} \sum_{k=1}^{n} y_{jk}^{2}}},$$

where n represents the number of values (i.e., 132), and $x_i$ and $y_j$ denote the input and output variables, respectively. For each input parameter, the $r_{ij}$ value ranges from zero to one, with larger values indicating a stronger influence on the output variable (i.e., τ). Figure 7 shows the $r_{ij}$ scores for all input variables and demonstrates that σn ($r_{ij}$ = 0.99) has the greatest effect on the shear strength of rockfill material. Furthermore, Figure 1 shows that the normal stress σn has the highest ρ (0.97) among all parameters, which supports the sensitivity analysis results.
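The sensitivity score can be sketched as follows, assuming the cosine-amplitude form commonly attributed to Yang and Zang's approach, i.e., the sum of input-output products divided by the square root of the product of the sums of squares. The function name and the sample values are hypothetical and do not reproduce the scores in Figure 7.

```python
import math

def cosine_amplitude(x, y):
    """Strength of relation between one input series x and the output series y."""
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return numerator / denominator

# Hypothetical input columns and output values, for illustration only.
inputs = {
    "sigma_n": [0.1, 0.2, 0.4, 0.8],   # normal stress (MPa)
    "D90": [40.0, 35.0, 50.0, 45.0],   # 90% passing sieve size (mm)
}
tau = [0.12, 0.21, 0.37, 0.70]         # shear strength (MPa)

scores = {name: cosine_amplitude(values, tau) for name, values in inputs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)  # most influential first
```

For non-negative data the score lies between zero and one, and it reaches one only when the input is exactly proportional to the output, which is why values close to one are read as a strong influence.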

7. Conclusions

Using the XGBoost algorithm, a new prediction model for RFM shear strength is proposed in the current study. Comparisons reveal that the proposed XGBoost model provides the most accurate prediction of the RFM’s shear strength when compared with the SVM, RF, AdaBoost, and KNN models. The main findings of this study are as follows:
(1) In the test phase, XGBoost had the highest predictive performance (R2 = 0.9676, NSE = 0.9672, and RSR = 0.1812) compared to the other machine learning models. Furthermore, based on the scatter plots of actual and predicted values, the XGBoost model exhibited a better fit to the observed data, indicating that it has potential for broader applications in predicting RFM material properties.
(2) Compared to the SVM, RF, AdaBoost, and KNN models in the literature, the proposed XGBoost model has a superior predictive capability. In addition, the proposed model is amenable to further modification, so the accumulation of further data should considerably enhance its predictive potential.
(3) The sensitivity analysis indicates that the normal stress, the 90% passing sieve diameter (D90), the dry unit weight, and the ISRM hardness rating are the most sensitive and important factors for estimating the shear strength of rockfill materials.
(4) The developed XGBoost model gives predictions with the same level of accuracy as existing soft computing methods.

Since the proposed XGBoost model produces predictions based on the input values, interpolation between the input variables is more accurate and reliable than extrapolation. Therefore, the model should not be used for input parameter values beyond the defined range of the study.

Data Availability

The data presented in this study are available in Appendix A, Table A1 (see supplementary file).

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

The research was partially funded by the Ministry of Science and Higher Education of the Russian Federation under the strategic academic leadership program “Priority 2030” (Agreement 075-15-2021-1333 dated 30.09.2021).

Supplementary Materials

Table A1. Dataset used in the development and validation of the model. (Supplementary Materials)