Abstract
Soil cohesion (C) is one of the critical soil properties and is closely related to basic soil properties such as particle size distribution, pore size, and shear strength. It is mainly determined by experimental methods, which are often time-consuming and costly. Therefore, developing an alternative approach based on machine learning (ML) techniques is highly desirable. In this study, three ML models, namely, support vector machine (SVM), Gaussian process regression (GPR), and random forest (RF), were built based on a data set of 145 soil samples collected from the Da Nang-Quang Ngai expressway project, Vietnam. The database includes six input parameters, that is, clay content, moisture content, liquid limit, plastic limit, specific gravity, and void ratio. The performance of the models was assessed by three statistical criteria, namely, the correlation coefficient (R), mean absolute error (MAE), and root mean square error (RMSE). The results demonstrated that the proposed RF model could predict soil cohesion with high accuracy (R = 0.891) and low error (RMSE = 3.323 and MAE = 2.511), and that its predictive capability is better than that of SVM and GPR. Therefore, the RF model can be used as a cost-effective approach for predicting the soil cohesion used in the design and inspection of constructions.
1. Introduction
The cohesion (C) of the soil is created by the bonds between the compounds, the particles, and the viscosity of the water film that surrounds them. Along with the internal friction angle, the cohesion is part of the shear resistance (slip resistance) of cohesive soil and is used to calculate the bearing capacity of the ground. Cohesion is usually determined based on the Mohr–Coulomb theory: in the plane of shear stress versus normal stress, the soil cohesion is the intercept of the Mohr–Coulomb shear resistance line on the shear axis [1–3]. Soil cohesion greatly depends on the particle composition of the soil, soil texture, and moisture [4]. In the design of geotechnical constructions such as foundations, slopes, or open-pit mines, the precise determination of the soil cohesion is of great concern [5]. This important parameter can be determined in the field or in the laboratory [3]. Tests for soil cohesion determination are usually carried out as a direct shear test (slow shear, quick shear, and consolidated quick shear) or as an indirect shear test with a triaxial compression apparatus [6]. However, the experiments to determine this parameter are often cumbersome, expensive, and time-consuming [7], and field estimation requires a team of skilled and experienced engineers [8–10]. To overcome these difficulties, empirical models have been proposed based on correlations between index properties obtained from field tests. Several studies have employed such models to predict different soil properties and characteristics, for example, Masada [11] for clay and silt embankments, Mofiz and Rahman [12] for Barind soils, Cola and Cortellazo [13] for peaty soils, and Hajarwish and Shakor [14] for mudrock. However, soil is an extremely complex material, and the geological conditions in each region are different, so these models cannot be applied universally across regions [15].
This confirms the need for a general method that can predict soil cohesion under different conditions.
More recently, machine learning (ML) and artificial intelligence (AI) techniques have gradually become popular and have been applied in many different fields [16–18]. ML has been widely applied in the construction industry, for example, in determining the critical force of steel [19]. Many variables affect the critical force of steel [20] and the mechanical properties of the soil [21]. Therefore, the application of artificial intelligence to determine soil cohesion is feasible. Kovačević et al. [22] used a support vector machine (SVM) to estimate the chemical and physical properties of soil and classify soil types. Guo et al. [19] used Artificial Neural Network (ANN) and Generalized Linear Model (GLM) to predict soil aggregate stability. Mofiz and Rahman [12] used and compared different ML models, including Linear Regression (LR), ANN, SVM, random forest (RF), and M5 Tree (M5P), for prediction of the Standard Penetration Test (SPT) based N-value of soil in the state of Haryana, India. In general, ML models have proved to be promising and highly accurate tools for the prediction of soil properties [23, 24].
The main aim of this study is to apply one of the most popular ML models, namely, random forest (RF) [25–27], to predict the cohesion of the soil quickly, avoiding costly and time-consuming experiments. A database of soil properties was constructed from the experimental results of the Da Nang-Quang Ngai expressway project, Vietnam. Two other ML models, namely, support vector machine (SVM) and Gaussian process regression (GPR), were used for comparison.
2. Database Collection and Preparation
In this study, the test results of 145 soil samples collected from the Da Nang-Quang Ngai expressway project, located in the Central South part of Vietnam (Figure 1), were used to construct the database for modeling soil cohesion. In the modeling, we considered six input parameters, namely, clay content, moisture content, liquid limit, plastic limit, specific gravity, and void ratio, and one output parameter, the soil cohesion. The input and output parameters were determined according to the formulas in the published works [28, 29].

The data in this study are randomly divided into two subsets using a uniform distribution, in which 70% of the data is used as the training set and 30% is used to test the performance of the model. All data are scaled to the range [0; 1] to reduce numerical error while processing with ML algorithms, as recommended by Witten et al. [30]. This process helps the training phase of the ML models achieve good generalization capability. The scaling is given by
$$x_n = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$
where $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of the considered variable and $x_n$ is the normalized value of the variable $x$.
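The scaling and the 70/30 random split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `min_max_scale` and `train_test_split` and the sample void ratio values are hypothetical, and the study itself was carried out in MATLAB rather than with this code.

```python
import math
import random

def min_max_scale(values):
    """Scale a list of numbers to [0, 1] with x_n = (x - x_min) / (x_max - x_min)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # A constant variable carries no information; map it to 0 by convention.
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]

def train_test_split(n_samples, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) from a uniformly random shuffle of the indices."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Rounding up is an assumption; it gives 102/43 for 145 samples.
    n_train = math.ceil(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]

# Illustrative values only (not taken from the database).
void_ratio = [0.58, 1.20, 2.10, 3.25]
scaled = min_max_scale(void_ratio)
train_idx, test_idx = train_test_split(145)
print(scaled[0], scaled[-1], len(train_idx), len(test_idx))
```

In practice the minimum and maximum would usually be computed on the training subset and reused for the testing subset, so that no information from the test data leaks into training; the text does not state which variant was used.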
3. Modeling Approaches
3.1. Random Forest
Random forest (RF) is one of the most commonly used ML algorithms because of its simplicity and versatility. It is a supervised learning model used for classification and regression problems, proposed by Breiman in 2001 [30]. RF is an ensemble learning method that aggregates the results of individual decision trees, thereby improving predictive performance through majority voting or averaging, depending on the specific problem.
Suppose that there is an input vector $X = (x_1, x_2, x_3, \ldots, x_n)$, where $n$ is the number of data dimensions, that is, the number of predictive variables. An RF model is a set of $T$ trees $T_1(X), T_2(X), T_3(X), \ldots, T_T(X)$. The prediction results of these decision trees are $\hat{Y}_1 = T_1(X), \hat{Y}_2 = T_2(X), \ldots, \hat{Y}_T = T_T(X)$. For the regression problem, the final result of the RF model is the average of the prediction results of all trees:
$$\hat{Y} = \frac{1}{T}\sum_{t=1}^{T} T_t(X).$$
Each tree is grown by recursively dividing the initial training set into smaller subsets, and at each split only a few randomly selected predictive variables are considered. The decision trees are grown without pruning until a stopping criterion predetermined by the programmer is met. Commonly used criteria for growing the trees are the RMSE, the Gini Diversity Index, or the Mean Square Error. Trees with low predictive performance are then discarded, and only trees with sufficient predictive value are retained in the final RF model. The random selection of predictor variables and the aggregation of the results of many decision trees reduce the overfitting problem of the single decision tree model [30, 31]. The structure of the random forest is depicted in Figure 2. In this study, the RF model was trained and validated using the tools in MATLAB.
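To make the averaging idea concrete, the following is a deliberately simplified sketch of a regression forest in plain Python. It is not the MATLAB implementation used in this study: each "tree" is reduced to a single split (a stump), and all names (`fit_stump`, `fit_forest`, `predict_forest`) and the synthetic data are hypothetical. It only illustrates bootstrap sampling, random selection of predictive variables, and averaging of the tree outputs.

```python
import random

def fit_stump(X, y, feature_ids):
    """Fit a one-split regression tree using the split with the lowest squared error."""
    best = None
    for j in feature_ids:
        # Candidate thresholds: every distinct value except the largest,
        # so that both sides of the split are non-empty.
        for threshold in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= threshold]
            right = [yi for row, yi in zip(X, y) if row[j] > threshold]
            mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - mean_l) ** 2 for v in left)
                   + sum((v - mean_r) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, mean_l, mean_r)
    if best is None:
        # No valid split (selected features are constant): predict the mean.
        mean_y = sum(y) / len(y)
        return (0, float("inf"), mean_y, mean_y)
    return best[1:]

def predict_stump(stump, row):
    j, threshold, mean_l, mean_r = stump
    return mean_l if row[j] <= threshold else mean_r

def fit_forest(X, y, n_trees=50, n_features_per_split=1, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]          # bootstrap sample
        feats = rng.sample(range(d), n_features_per_split)  # random predictors
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Regression output: the average of the individual tree predictions."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)

# Synthetic data: the target depends on feature 0 only.
X = [[i / 19, (i * 7 % 20) / 19] for i in range(20)]
y = [10 * row[0] for row in X]
forest = fit_forest(X, y)
low = predict_forest(forest, [0.0, 0.5])
high = predict_forest(forest, [1.0, 0.5])
print(low, high)
```

A full implementation would grow deep trees and redraw the random feature subset at every split; with a single split per tree the two are equivalent, which keeps the sketch short.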

3.2. Support Vector Machine
Support vector machine (SVM), proposed by Vapnik in 1995 [32], is an effective and popular learning model for classification and for linear and nonlinear regression problems. The SVM model gives accurate and stable prediction results, tolerates noise well, and is practical for high-dimensional feature spaces [33, 34]. Many successful SVM applications to classification and regression problems have been published in different fields [35–37]. The basic theory of SVM is summarized as follows.
A training dataset $\{(x_i, y_i)\}_{i=1}^{N}$ is selected for an SVM model as shown in Figure 3, where $x_i$ is the input data, $y_i$ is the output data corresponding to $x_i$, and $N$ is the number of training samples. The SVM aims to find an optimal hyperplane function $f(x)$ (determined by the weight vector $w$ and the offset $b$) that fits all the data elements within the insensitive loss coefficient $\varepsilon$, that is, between the two supporting hyperplanes $w \cdot x - b = \varepsilon$ and $w \cdot x - b = -\varepsilon$.

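In the case of nonlinear regression, the function $f(x)$ is determined, in the standard dual form of support vector regression, as follows:
$$f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^{*}) K(x_i, x) - b,$$
with
$$0 \le \alpha_i, \alpha_i^{*} \le C, \quad i = 1, \ldots, N,$$
where $C$ is the penalty constant used to control the penalty error, $\alpha_i$ and $\alpha_i^{*}$ are the Lagrange multipliers, and $K(x_i, x_j)$ is the kernel function defined as follows:
$$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j),$$
with $\Phi$ being a nonlinear mapping function. Linear, polynomial, sigmoid, and Gaussian functions are the most commonly used kernel functions:
$$K(x_i, x_j) = x_i \cdot x_j \quad \text{(linear)},$$
$$K(x_i, x_j) = (\gamma\, x_i \cdot x_j + r)^{d} \quad \text{(polynomial)},$$
$$K(x_i, x_j) = \tanh(\gamma\, x_i \cdot x_j + r) \quad \text{(sigmoid)},$$
$$K(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}}\right) \quad \text{(Gaussian)},$$
where $\gamma$, $r$, $d$, and $\sigma$ are kernel parameters.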
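The four kernel types named above, and the way a trained SVM regression combines them into a prediction, can be illustrated with a short Python sketch. The parameterization (`gamma`, `coef0`, `degree`, `sigma`) follows common textbook forms and is an assumption, as are the function names; the offset is subtracted to match the hyperplane convention w.x – b used in this section.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def polynomial_kernel(u, v, gamma=1.0, coef0=1.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree

def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)

def gaussian_kernel(u, v, sigma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2 * sigma ** 2))

def svr_predict(x, support_vectors, dual_coefs, b, kernel):
    """Evaluate f(x) = sum_i coef_i * K(x_i, x) - b.

    dual_coefs[i] stands for the difference of the two Lagrange
    multipliers of support vector i; the values are assumed to come
    from a solver that is not shown here.
    """
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs)) - b

# With a linear kernel the prediction reduces to w.x - b with w = sum_i coef_i * x_i.
support_vectors = [[1.0, 0.0], [0.0, 1.0]]
dual_coefs = [2.0, 3.0]
prediction = svr_predict([1.0, 1.0], support_vectors, dual_coefs, 1.0, linear_kernel)
print(prediction)
```

Only the kernel evaluations and the final sum are shown; finding the multipliers requires solving a constrained quadratic program, which is what SVM libraries do internally.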
3.3. Gaussian Process Regression
Gaussian process regression (GPR) is a nonparametric, Bayesian approach applied to regression problems. GPR has several advantages, working well on small datasets and having the ability to provide uncertainty measurements on the prediction values.
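Given the training data set $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the size of the training set, $x_i$ represents the input data, and $y_i$ is the corresponding output value. In data set $D$, the random variables corresponding to the input data compose the set $f(x) = \{f(x_1), \ldots, f(x_N)\}$ and are subject to a joint Gaussian distribution. For the simplest case, the relation between the latent function $f(x)$ and the observed target $y$ is
$$f(x) = x^{T} w, \quad y = f(x) + \varepsilon, \quad \varepsilon \sim N(0, \sigma_n^{2}), \quad w \sim N(0, \Sigma_P),$$
where $w$ denotes the weight, $\varepsilon$ is the independent noise, $\sigma_n^{2}$ is the variance of the noise, and $\Sigma_P$ is the covariance of the weights. The distribution in the Gaussian process is represented by a mean function, denoted as $m(x)$, and a covariance kernel function, denoted as $K(x, x')$ [38]:
$$f(x) \sim GP\big(m(x), K(x, x')\big), \quad m(x) = E[f(x)], \quad K(x, x') = E\big[(f(x) - m(x))(f(x') - m(x'))\big], \tag{1}$$
where $x$ and $x'$ are any two input points. For the basic GPR, $m(x)$ is set to zero, and formula (1) can be rewritten as
$$f(x) \sim GP\big(0, K(x, x')\big),$$
where $x$ is the learning sample, whose measure in the GP is the finite-dimensional distribution of the GP. As defined by the GP, the finite-dimensional distribution is a joint normal distribution:
$$\big[f(x_1), \ldots, f(x_N)\big]^{T} \sim N\big(0, K(x, x)\big).$$
The noise $\varepsilon$ is independent of $f(x)$ and follows a Gaussian distribution. Since $f(x)$ follows a Gaussian distribution, $y$ also follows a Gaussian distribution. The prior distribution of the observed target value $y$ is then
$$y \sim N\big(0, K(x, x) + \sigma_n^{2} I_N\big).$$
Given test sample points $(x^{*}, y^{*})$, the joint probability distribution of the observed target value $y$ and the prediction value $y^{*}$ at the test points is expressed as
$$\begin{bmatrix} y \\ y^{*} \end{bmatrix} \sim N\left(0, \begin{bmatrix} K(x, x) + \sigma_n^{2} I_N & K(x, x^{*}) \\ K(x^{*}, x) & K(x^{*}, x^{*}) \end{bmatrix}\right),$$
where $K(x, x) = (K_{ij})$ is a positive definite symmetric matrix of size $N \times N$; $K_{ij} = K(x_i, x_j)$ are the elements of the matrix, which measure the correlation between $x_i$ and $x_j$; and $K(x, x^{*})$ is the covariance matrix between the training set and the testing set.
Applying the conditional distribution properties of the Gaussian distribution, the predictive distribution is obtained:
$$y^{*} \mid x, y, x^{*} \sim N\big(\bar{y}^{*}, \operatorname{cov}(y^{*})\big),$$
where
$$\bar{y}^{*} = K(x^{*}, x)\big[K(x, x) + \sigma_n^{2} I_N\big]^{-1} y, \quad \operatorname{cov}(y^{*}) = K(x^{*}, x^{*}) - K(x^{*}, x)\big[K(x, x) + \sigma_n^{2} I_N\big]^{-1} K(x, x^{*}).$$
The mean value $\bar{y}^{*}$ is the estimate of $y^{*}$; $\operatorname{cov}(y^{*})$ is the variance matrix of the test samples, which reflects the reliability of the estimate.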
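As a worked illustration of the conditional Gaussian formulas, the sketch below computes the predictive mean and variance at a single test point in plain Python. It assumes a zero mean function and a squared-exponential kernel with unit variance, which is one common choice and not necessarily the kernel used in this study; the helper names (`rbf`, `solve`, `gpr_predict`) and the toy data are hypothetical.

```python
import math

def rbf(u, v, length_scale=1.0):
    """Squared-exponential covariance K(u, v) with unit signal variance (assumed)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / (2 * length_scale ** 2))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def gpr_predict(X_train, y_train, x_star, noise_var=1e-6):
    """Return the predictive mean and variance at x_star for a zero-mean GP."""
    n = len(X_train)
    # K(x, x) + noise variance on the diagonal
    K = [[rbf(X_train[i], X_train[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(xi, x_star) for xi in X_train]   # K(x, x*)
    alpha = solve(K, y_train)                      # [K + noise I]^-1 y
    v = solve(K, k_star)                           # [K + noise I]^-1 K(x, x*)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_star, x_star) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, var

X_train = [[0.0], [1.0], [2.0]]
y_train = [0.0, 1.0, 0.5]
mean_near, var_near = gpr_predict(X_train, y_train, [1.0])   # at a training point
mean_far, var_far = gpr_predict(X_train, y_train, [10.0])    # far from the data
print(mean_near, var_near, mean_far, var_far)
```

With almost no noise, the prediction at a training point reproduces the observed target with near-zero variance, while far from the data the mean falls back to the prior mean (zero) and the variance to the prior variance. This is the uncertainty measure mentioned at the start of this section.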
3.4. Model Evaluation
The application of modeling tools in the field of geotechnical engineering is increasingly popular and effective. However, the ability of these models to make accurate predictions still needs to be tested with appropriate model evaluation indicators. In this study, three indicators are used to evaluate the quality of the model against the experimental results: mean absolute error (MAE), root mean square error (RMSE), and correlation coefficient (R) [39, 40].
MAE is calculated by Equation (2) and evaluates the average absolute difference between the actual values and the values calculated by the model [28]. However, it does not indicate whether the model tends to overpredict or underpredict. When MAE = 0, the predicted values coincide completely with the actual values, and the model is considered “ideal.” The MAE value lies in the range [0, +∞).
RMSE is one of the basic quantities commonly used for evaluating the results of predictive models [41]. RMSE is often used to denote the mean magnitude of the error. In particular, the RMSE is extremely sensitive to large error values. Therefore, the closer the RMSE is to the MAE, the more stable the model error is. Just like MAE, RMSE does not indicate the direction of the deviation between the forecast value and the actual value. RMSE is determined by formula (3), and its value lies in the range [0, +∞).
R is the correlation coefficient, which measures the agreement between the predicted and actual values and is commonly used to evaluate ML algorithms [42]. The equation for calculating the value of R is presented in equation (4). The R values range from -1 to 1. An absolute value of R equal to 1 represents a perfect linear correlation between the simulated and real values, while a value of 0 indicates no correlation.
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_{0,i} - y_{t,i} \right|, \tag{2}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_{0,i} - y_{t,i} \right)^{2}}, \tag{3}$$
$$R = \frac{\sum_{i=1}^{n} (y_{0,i} - \bar{y}_0)(y_{t,i} - \bar{y}_t)}{\sqrt{\sum_{i=1}^{n} (y_{0,i} - \bar{y}_0)^{2} \sum_{i=1}^{n} (y_{t,i} - \bar{y}_t)^{2}}}, \tag{4}$$
where $n$ is the number of samples in the database, $y_0$ and $\bar{y}_0$ are the actual experimental value and the average actual experimental value, and $y_t$ and $\bar{y}_t$ are the predicted value and the average predicted value, calculated according to the model forecast.
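The three criteria are straightforward to compute. The sketch below, with hypothetical function names and made-up numbers, implements them in plain Python and shows how a single large error raises the RMSE well above the MAE, which is why the gap between the two is informative.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def pearson_r(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    mean_a = sum(actual) / len(actual)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

# Made-up values: three perfect predictions and one error of 4 units.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 8.0]
print(mae(actual, predicted))        # 1.0
print(rmse(actual, predicted))       # 2.0
print(pearson_r(actual, predicted))
```

Here the errors are 0, 0, 0, and 4, so MAE = 4/4 = 1 while RMSE = sqrt(16/4) = 2. Spreading the same total absolute error evenly (an error of 1 on every sample) would give MAE = RMSE = 1.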
3.5. Methodological Flowchart
The process of implementing the methodology is depicted in Figure 4, including the following basic steps:
(i) Data acquisition: in this step, soil sample data collected from the Da Nang-Quang Ngai expressway project are used to build the model. On the basis of the data set collected, the input and output parameters are defined.
(ii) Database preprocessing: this is one of the most critical steps in ML and helps build a more accurate ML model. Some techniques are used to process data, such as transforming data, ignoring missing values, and filling in missing values. After that, the data set is randomly divided into two parts: the training part and the testing part.
(iii) Select the model best suited to the data type: in this study, a random forest (RF) algorithm is used to estimate soil cohesion. The results of the RF model are also compared with those of the support vector machine (SVM) [32] and Gaussian process regression (GPR) [43].
(iv) Train and test the model on data: in this step, the model is trained and its parameters are tuned using the “training database,” and its performance is then tested on the unseen “testing database.” An important point to note is that the test dataset is not used in the training process.
(v) Model evaluation: model evaluation is an indispensable part of the model development process, helping to find the model that gives the best predictions.

4. Results and Discussion
4.1. Descriptive Statistics Analysis
The statistical analysis of the data was performed (Table 1 and Figure 5). In the database, the clay content varies in the range of 4.09–47.96%, the natural moisture content is in the range of 15.53–115.41%, the liquid limit varies from 20.8 to 154.12%, the plastic limit ranges between 13.42 and 63.96%, the specific gravity varies from 2.59 to 2.75 g/cm³, and the void ratio ranges from 0.58 to 3.25. The soil cohesion values are in the range of 0.29 to 30.39 kPa. The histograms of the corresponding variables are presented in Figure 5, and the quantitative analysis of input and output parameters is detailed in Table 1.

Figure 5: Histograms of the input and output variables, panels (a)–(g).
4.2. Prediction Performance of RF
In this section, the effectiveness of the RF model is evaluated. The hyperparameters of the RF model were selected by trial and error and are presented in Table 2. The comparison between the experimental values of soil cohesion and those obtained from the RF model for the training and testing datasets is shown in Figure 6. The line representing the predicted cohesion values is quite close to the line representing the experimental values. This good correlation was confirmed by the error diagram between the predicted and experimental soil cohesion for the training set (Figure 7(a)) and the testing dataset (Figure 7(b)). For the 102 data samples of the training dataset and the 43 data samples of the testing dataset, the errors lie in the range of [-7; 11] kPa, and only very few samples show large errors. These errors show that prediction with the RF algorithm is feasible with small errors.


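Figure 7: Error between the predicted and experimental soil cohesion for (a) the training dataset and (b) the testing dataset.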
Finally, the relationship between the actual and predicted values is given as a regression graph in Figure 8. The quantitative values of the criteria evaluating model performance are shown in Table 3. As shown in Table 3, the RF model provides R = 0.90, RMSE = 3.56, MAE = 0.90, and SD = 3.58 for the training dataset. For the testing dataset, these values are R = 0.84, RMSE = 2.68, MAE = 2.11, and SD = 2.71, respectively. When considering all the data, the model provides R = 0.89, RMSE = 3.32, MAE = 2.51, and SD = 3.33. It can be seen that the predictive capability of the model is relatively high. Therefore, applying the RF model to predict soil cohesion is feasible, with high accuracy and low error.

Figure 8: Regression graphs of the actual versus predicted soil cohesion, panels (a)–(c).
4.3. Analysis of Simulation Convergence of RF and Other ML Models
In this work, the performance of the proposed model is assessed over a number of simulation runs. Several studies [44, 45] have shown that the predictive performance of an algorithm depends on how the data set is randomly divided into training and test sets. Therefore, the model's performance should be analyzed with a sufficient number of simulations to demonstrate the generality of the obtained results. In this study, a total of 200 simulations were conducted to study the performance of the proposed RF model. The hyperparameters of the other models were selected by trial and error and are presented in Table 2.
Figures 9(a), 9(c), and 9(e) represent the normalized convergence values of RMSE, MAE, and R, respectively, whereas Figures 9(b), 9(d), and 9(f) represent the convergence values of the three respective criteria. As observed, after about 50 simulations, the oscillation of RMSE and MAE was within 1% for the training set (solid green line). For the testing set (red dashed line), the RMSE and MAE values fluctuate within the 1% error range after about 70 simulations. Meanwhile, the correlation coefficient R for the training set converges immediately after the first simulations, while the testing set takes about 75 simulations for the errors to converge within a small range. When the number of simulations reaches 200, all RMSE, MAE, and R values have converged. Therefore, 200 simulations are sufficient to obtain converged results for all R, RMSE, and MAE values.
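One way to quantify such convergence is to track the running mean of a criterion over the simulations and normalize it by the mean over all runs. The sketch below follows that idea; the exact definition of "normalized convergence" used for Figure 9 is not given in the text, so this definition, the 1% tolerance check, the function names, and the example values are all assumptions made for illustration.

```python
def running_mean(values):
    """Mean of the first k values, for k = 1..len(values)."""
    means, total = [], 0.0
    for k, v in enumerate(values, start=1):
        total += v
        means.append(total / k)
    return means

def normalized_convergence(values):
    """Running mean divided by the mean over all runs (assumed to be non-zero)."""
    means = running_mean(values)
    final = means[-1]
    return [m / final for m in means]

def runs_to_converge(values, tol=0.01):
    """First run number (1-based) from which the normalized running mean stays within tol of 1."""
    norm = normalized_convergence(values)
    k = len(norm)
    # Walk backwards while the curve is still inside the tolerance band.
    while k > 0 and abs(norm[k - 1] - 1.0) <= tol:
        k -= 1
    return k + 1

# Made-up RMSE values from four hypothetical simulation runs.
rmse_per_run = [4.0, 6.0, 5.0, 5.0]
print(normalized_convergence(rmse_per_run))  # [0.8, 1.0, 1.0, 1.0]
print(runs_to_converge(rmse_per_run))        # 2
```

In a full Monte Carlo study, `rmse_per_run` would hold one value per random 70/30 split, obtained by retraining the model on each split.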

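Figure 9: Convergence of the criteria over 200 simulations: (a), (c), (e) normalized convergence values of RMSE, MAE, and R; (b), (d), (f) convergence values of RMSE, MAE, and R.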
Figure 10 shows box plots of the RMSE, MAE, and R values after 200 runs for the training and testing sets simulated by the RF algorithm. The mean and standard deviation of R are 0.90 and 0.01 for the training dataset and 0.71 and 0.08 for the testing dataset. For the RMSE criterion, the mean and standard deviation are 3.25 and 0.16 for the training dataset and 4.73 and 0.65 for the testing set. For MAE, these values are 2.37 and 0.13 for the training set and 3.54 and 0.48 for the testing set. The minimum and maximum values of R, RMSE, and MAE for the two data sets are shown in Table 4. In addition, 200 simulations with the SVM and GPR algorithms were performed and are presented in Figure 10. The RF model outperforms the other algorithms on both the training and testing datasets. For the training parts, the average R value of RF is significantly higher than those of SVM (R = 0.27) and GPR (R = 0.69), whereas the average RMSE and MAE values of RF are lower than those of SVM (RMSE = 8.14, MAE = 7.18) and GPR (RMSE = 4.83, MAE = 3.60). Similar observations hold for the testing parts (RMSE = 5.46, MAE = 4.00 for SVM, and RMSE = 5.00, MAE = 3.72 for GPR), which reflect the prediction capability of the models.

Figure 10: Box plots of RMSE, MAE, and R after 200 runs for the training and testing sets, panels (a)–(f).
Overall, the proposed RF algorithm is a better ML model than the other ML models (SVM, GPR) for predicting soil cohesion. This is reasonable because RF has many advantages: (i) it can be effectively applied to large-scale datasets as it provides the facility for size reduction without deleting variables from the training dataset; (ii) it can handle thousands of input features and variables at a time; (iii) it has an efficient embedded technique for estimating missing or null values, so it can maintain a level of accuracy (i.e., consistent performance) even when a large portion of the data is missing; (iv) it is well suited to parallel computation because the trees are generated and computed completely independently of each other; and (v) it can minimize errors because the results are synthesized from different “learners” (random forest trees) [46]. The results of this study are also comparable with those of previously published works [46–48].
4.4. Sensitivity Analysis
In this section, the feature importance of the input variables is estimated. For each simulation, the importance value of a predictor is calculated by summing the changes in error due to the splits on that predictor and dividing the sum by the number of branch nodes in the RF. Figure 11 shows the mean out-of-bag feature importance over 200 simulations along with the standard deviation values. It can be seen that the void ratio is the most important variable in predicting soil cohesion. The moisture content is the second most important input, followed by the plastic limit, liquid limit, specific gravity, and clay content. These sensitivity results are reasonable and comparable with other published works [28, 49, 50].
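A related, model-agnostic way to rank inputs is permutation importance: shuffle one input column and measure how much the prediction error increases. The sketch below uses this technique as a stand-in for illustration only. It is not the split-based calculation described above, it is applied to a given dataset rather than to out-of-bag samples, and the function names, the stand-in model, and the data are hypothetical.

```python
import random

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each input column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with column j permuted and the others untouched.
            X_perm = [row[:j] + [column[i]] + row[j + 1:] for i, row in enumerate(X)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Stand-in model and data: the target depends on input 0 only.
X = [[i / 19, (i * 7 % 20) / 19] for i in range(20)]
y = [10 * row[0] for row in X]

def model(row):
    return 10 * row[0]

importances = permutation_importance(model, X, y)
print(importances)
```

An input the model ignores gets an importance of exactly zero, while shuffling an input the model relies on raises the error. With a trained RF, `model` would be replaced by the forest's prediction function.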

5. Conclusion
In this study, a data set of 145 soil samples collected from the Da Nang-Quang Ngai expressway project was used to construct an RF model for soil cohesion prediction. The input data for model training include clay content, moisture content, liquid limit, plastic limit, specific gravity, and void ratio. Three statistical criteria, namely, correlation coefficient (R), mean absolute error (MAE), and root mean square error (RMSE), are used to evaluate the agreement between the values predicted by the RF model and the actual experimental values. The analysis results show that the built model can predict soil cohesion accurately and quickly, avoiding costly and difficult experiments that require complicated equipment.
However, in ML problems, data is the key factor in creating a reliable predictive tool. Therefore, the next research direction is to collect additional data to further improve the model, making the prediction more accurate and avoiding costly field experiments.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was funded by the Ministry of Transport, project titled “Building Big Data and Development of ML Models Integrated with Optimization Techniques for Prediction of Soil Shear Strength Parameters for Construction of Transportation Projects” under grant number DT 203029. We thank the ones who have supported us with the additional data for carrying out this research.