Abstract
Constructing tunnels in urban spaces usually uses shield tunneling. Because of numerous uncertainties related to underground construction, appropriate monitoring systems are required to prevent disasters from happening. This study collected the settlement monitoring data for Tender CG291 of the Songshan Line of the Taipei Mass Rapid Transit (MRT) system and considered that influential factors were examined to identify the correlations between predictor variables and settlement outcomes. An inference model based on symbiotic organisms search-least squares support vector machine (SOS-LSSVM) was proposed and trained on the collected data. Moreover, because the dataset used for this study contained far less data at the alert level than at the safe level, the class of the dataset was imbalanced, which could compromise the classification accuracy. This study also employed the probability distribution data balance sampling methods to enhance the forecast accuracy. The results showed that the SOS-LSSVM exhibited the most favorable accuracy compared to four other artificial intelligence-based inference models. Therefore, the proposed model can serve as an early warning reference in tunnel design and construction work.
1. Introduction
The rapid development of urban areas in recent decades has led municipal authorities around the world to relocate urban transportation infrastructure underground [1–4]. In Taiwan, underground public transport systems have been developed and expanded in major cities such as Taipei, Taoyuan, and Kaohsiung. In Taipei alone, all heavy-load transportation routes through the city have been relocated underground, and five major metro lines with a total operating mileage of 131.2 km have been constructed and put into operation since 1996. Underground construction work is mostly conducted in small, confined spaces subject to numerous uncertainties, making it much more difficult than aboveground work [5–7]. Moreover, with the exception of subway stations, most below-ground subway infrastructure is built using the shield tunneling method, in which a tunnel boring machine (TBM) simultaneously excavates the soil ahead, removes the excavated material, and installs a supporting shield structure to stabilize the newly excavated tunnel section. However, underground construction is not only challenging but also risky. During shield tunneling, factors such as changes in stress, tail void closures, disturbed soil compaction, and lining segment deformation can displace lateral soil layers, causing the ground to settle, bulge, or shift laterally [8]. Therefore, while the TBM is in operation, a safety monitoring system must be active. This system collects site data and supervises TBM maneuvers to prevent excessive ground settlement, which can damage existing urban infrastructure and buildings and trigger disastrous accidents [3, 5, 9–11].
However, the settlement data generated by the safety monitoring system alert users to settling that has already occurred and are thus useful only for developing and implementing postdeformation remedies that prevent a situation from worsening [2, 11]. Shield tunneling safety would benefit greatly from a database, built from limited monitoring data and soil layer parameters, that could be used to predict settlement conditions, provide early warnings of deformation, and increase reaction times [1, 12–14]. With this aim in mind, the settlement monitoring data for Tender CG291 of the Songshan Line of the Taipei MRT system were collected in this study. In these monitoring data, safe-level entries far outnumber alert-level entries, creating an imbalanced dataset. Classification models based on ordinary classification techniques can exhibit serious bias in class forecasting when processing imbalanced data [15], which renders inference models based on artificial intelligence (AI) unable to classify the scarce class accurately. For this reason, effectively processing imbalanced data to prevent forecasting bias is critical for AI-based inference models.
Few researchers have developed AI models for use as autonomous integrated systems to predict ground settlement in tunnel construction. Thus, in this study, a novel combination of the symbiotic organisms search-least squares support vector machine (SOS-LSSVM) and data balancing methods is proposed to help predict settlement and help project decision-makers prevent geotechnical disasters. The developed model is at the forefront of efforts to integrate metaheuristics, AI techniques, and data balancing methods to automatically and accurately predict shield-tunnel settlement. For this purpose, factors influencing settlement were investigated, and historical monitoring data were gathered for AI training. This prediction model is expected to be useful for design and construction agencies in predicting settlement, thereby helping them adopt preventive measures against settlement. Thus, the objectives of this study are as follows:
(1) Identifying influential factors for settlement in shield tunneling: the literature on settlement estimation was reviewed for possible influential factors, which were further tested using statistical methods in SPSS.
(2) Conducting resampling for imbalanced data: two methods were applied to the imbalanced data, namely probability distribution data balance sampling (PDDBS) and the synthetic minority oversampling technique (SMOTE).
(3) Establishing a model for settlement prediction: the proposed settlement prediction model for shield tunneling was developed using SOS-LSSVM.
(4) Verifying the effectiveness of the proposed model: the prediction results of SOS-LSSVM and four other AI-based models were compared to determine the best performer based on prediction accuracy. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) were also used to evaluate the classification accuracy of the data balanced by PDDBS and SMOTE.
Thus, the proposed model has been verified to solve the data imbalance problem effectively.
In this study, SOS-LSSVM was integrated with the data balance sampling method to create a shield-tunnel settlement prediction system optimized to help prevent ground-settlement-related disasters during tunnel construction. The system, based in the construction control center, utilizes automatically collected and wirelessly transmitted monitoring data to forecast tunnel settlement status in real-time. When predicted settlement levels exceed the warning value, engineers may take appropriate actions to prevent disaster.
2. Literature Review
2.1. Causes of Settlement in Shield Tunneling
In Taiwan, shield tunneling has been in use for over 31 years since its debut in 1976; through the years, TBMs have seen considerable improvements, from the most primitive open-face manual types to the later mechanical, slurry pressure balanced, and earth pressure balanced types. Because of the lack of slurry deposit yards and facilities, the Rapid Transit System in Taipei mostly employs earth pressure balanced TBMs, except for the Xindian Line (CH22), which uses two slurry pressure balanced machines. Shield tunneling can cause ground settlement that negatively affects adjacent structures [5, 16]. The soil layer and surface displacements caused by shield tunneling are related to the type and diameter of the TBM, excavation depth, site conditions, soil properties, and groundwater level. When a TBM is advancing, if the thrust force against the tunnel face is lower than the static earth pressure of the soil layers, the soil releases its stresses along the tunnel face and rushes toward it because the soil layers are under active earth pressure; this leads to ground loss and results in settlement. If the thrust force equals the static earth pressure of the soil layers, the tunnel face remains static. If the thrust force exceeds the static earth pressure, the soil along the tunnel face is pressed forward, causing the ground to bulge. Ground settlement during shield TBM tunneling develops in the following stages: (1) before and during tunnel face excavation, (2) during the passage of the shield skin plate, and (3) after installation of the segmental lining and backfill grouting [17].
According to previous studies, various factors contribute to ground settlement, including geometrical, geological (e.g., the strength characteristics and the overconsolidation ratio of the soil), and shield operational parameters [4, 7, 8, 11–14, 18]. Fargnoli et al. concluded that face support pressure, grouting pressure, machine stoppage time, and the installation time for one ring of tunnel lining were essential parameters for predicting surface settlement [2]. Luo et al. likewise indicated that the groundwater condition is an important factor because shield tunneling causes pore water pressure variation [18]. The grouting fill factor and grouting pressure were identified as the most influential parameters when an AI-based algorithm was applied to predict settlements [14].
2.2. AI-Based Algorithm Applications for Predicting Settlements
Establishing a settlement prediction model is necessary for underground construction safety. Analytical, empirical, and numerical methods have been proposed to predict settlement and other tunnel deformations. Their main weakness is that they fail to consider all parameters contributing to settlement (e.g., ground conditions, operational parameters, and tunnel geometry) [14]. Moreover, because the processes surrounding shield TBM tunneling are complicated, most studies could not provide statistically meaningful relationships between volume loss and operational parameters [17].
Recently, researchers have successfully used AI-based algorithms, such as artificial neural networks (ANNs), fuzzy logic (FL), support vector machines (SVMs), and gene expression programming (GEP), to establish models for predicting the settlement induced by shield tunneling [7, 14]. Wang et al. successfully applied an adaptive relevance vector machine (aRVM) to predict settlement development in real time [9]. Bouayad and Emeriault proposed a methodology combining principal component analysis (PCA) with an adaptive neuro-fuzzy inference system (ANFIS) to model the nonlinear behavior of ground surface settlements induced by an earth pressure balanced TBM [7].
The symbiotic organisms search-least squares SVM (SOS-LSSVM) was developed by Cheng and Prayogo [19] and has proved reliable in prediction tasks [20–22]. SOS-LSSVM uses an advanced metaheuristic to search for optimal parameters and identifies the correlations between input and output variables from historical case data to establish inference models. Previous studies have also shown that the SOS method exhibits excellent performance [19, 23, 24]. In addition to SOS-LSSVM, this study applied the backpropagation neural network (BPNN), least squares support vector machine (LSSVM), evolutionary least squares support vector machine inference model (ELSIM) [25, 26], and SVM to estimate settlements for comparison.
2.3. Strategies against Data Imbalance
Data imbalance refers to one class of samples in a dataset overwhelming another class; this has serious consequences in classification. Generally, the term “minority” (MI) is used to refer to the class of scarce samples in the dataset and “majority” (MA) for the dominant class [27]. For example, when a dataset contains 95% majority class samples and 5% minority class samples, an inference model will tend to classify all of the samples as the majority class and achieve 95% accuracy; however, its accuracy for the minority class will be 0%. This bias is caused by the characteristics and limitations of AI, which requires a large amount of evenly distributed data for training and testing to achieve satisfactory forecasting results.
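The bias described above can be reproduced in a few lines of code. The sketch below uses hypothetical labels (95 "safe", 5 "alert") and a naive predictor that always outputs the majority class:

```python
# Illustration with made-up data: a classifier that always predicts the
# majority class scores 95% overall accuracy but 0% on the minority class.
labels = ["safe"] * 95 + ["alert"] * 5    # 95% majority, 5% minority
predictions = ["safe"] * 100              # naive majority-class predictor

overall_acc = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
minority_acc = sum(p == y for p, y in zip(predictions, labels) if y == "alert") / 5

print(overall_acc)   # 0.95
print(minority_acc)  # 0.0
```

The 95% headline accuracy conceals a model that never detects the alert class, which is exactly the failure mode the balancing methods below address.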
Because the distribution of imbalanced data is skewed, an AI-based inference model trained on such data will produce correspondingly skewed results. The principal measures for solving the data imbalance problem are undersampling and oversampling. In addition, this study introduces a sampling method that utilizes probability distributions to balance data and improve classification accuracy.
2.3.1. Undersampling
Undersampling decreases the number of MA samples to balance a training dataset, reducing the MA class until it is the same size as the MI class. Undersampling can outperform oversampling in training on imbalanced data; however, it can also eliminate potentially useful training samples and hence lower the performance of the classifier.
Excessive MA samples could be eliminated through random selection to balance out the two classes. To avoid uncertainty pertaining to random undersampling, Kubat and Matwin proposed an alternative undersampling approach that they considered more appropriate. To mitigate data imbalance, they removed the redundant data in the MA class, followed by removing the borderline samples close to the boundary of the MA and MI classes as well as the noisy data [28].
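As an illustration of the random variant, the sketch below (with hypothetical class labels, not the paper's data pipeline) draws a majority subset equal in size to the minority class:

```python
import random

def random_undersample(majority, minority, seed=42):
    """Randomly discard majority samples until both classes are equal in size.

    `majority` and `minority` are plain lists of samples (hypothetical format).
    """
    rng = random.Random(seed)
    kept = rng.sample(majority, k=len(minority))  # draw without replacement
    return kept + minority

majority = ["safe"] * 924   # e.g., 924 safe-level records
minority = ["alert"] * 75   # e.g., 75 alert-level records
balanced = random_undersample(majority, minority)
print(len(balanced))  # 150
```

Kubat and Matwin's refinement differs in that it removes redundant, borderline, and noisy majority samples deliberately rather than at random.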
2.3.2. Oversampling
Oversampling increases the number of MI samples to balance a training dataset, expanding the MI class until it is the same size as the MA class. It is a highly popular and effective approach for training on imbalanced data. However, because oversampling adds near-duplicate samples to the dataset, it often lengthens training time and can even cause overfitting.
In addition to random oversampling, the synthetic minority oversampling technique (SMOTE) was used in this study. Unlike random oversampling, which duplicates MI samples to expand the class, SMOTE generates synthetic samples by linear interpolation between two nearby samples. Specifically, SMOTE calculates the difference between an MI sample and one of its nearest neighbors, multiplies that difference by a random value between 0 and 1, and adds the result to the original sample to generate a new synthetic MI sample.
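The interpolation step can be sketched as follows. The helper below is a minimal illustration of the SMOTE idea on made-up 2-D points, not a reference implementation:

```python
import random

def smote_sample(minority, k=5, seed=0):
    """Generate one synthetic minority sample by linear interpolation
    between a minority point and one of its k nearest minority neighbors.
    A minimal sketch of the SMOTE idea, not the reference implementation.
    """
    rng = random.Random(seed)
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    # k nearest neighbors of `base` by squared Euclidean distance
    neighbors = sorted(
        others,
        key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
    )[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # random value in [0, 1)
    # new point = base + gap * (neighbor - base)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbor))

minority = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8), (1.1, 2.3)]  # made-up points
print(smote_sample(minority))
```

Because the synthetic point lies on the line segment between two existing minority points, it always falls inside the minority class's convex hull.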
3. Establishing Settlement Inference Model for Shield Tunneling
This section describes how the critical influential factors for settlement in shield tunneling were identified. These factors serve as the input variables for the proposed model, which uses SOS-LSSVM and relies on historical case data for training and testing to determine the optimal mapping between input and output variables, thereby predicting tunnel settlement. The flowchart is illustrated in Figure 1.

Step 1. Identify influential preliminary factors
Review studies on shield tunneling and list the reasons cited as causes of settlement. Those mentioned most frequently are identified as preliminary influential factors. Then, apply statistical tests in SPSS to the preliminary influential factors to determine which to include.
Step 2. Collect and establish the case dataset
Collect case data according to the required input and output variables and thus establish a complete case dataset that provides the input data.
Step 3. Balance the dataset
A total of 999 data records were collected for the present study, of which 75 were at the alert level; the data were therefore imbalanced. To overcome this problem, this study proposed a new data balancing method: probability distribution data balance sampling (PDDBS). There are two types of PDDBS, PDDBS oversampling and PDDBS median sampling, as shown in Figure 2. PDDBS oversampling balances a dataset by increasing the MI samples to the same number as the MA samples. By contrast, PDDBS median sampling simultaneously increases MI samples and decreases MA samples toward the median total sample size to achieve balance [29].
(1) PDDBS oversampling procedure (Figure 2(a))
Step a: select one attribute from the dataset and calculate its sample sizes and R(MI), the number of samples that must be added to the MI class: R(MI) = n(MA) − n(MI).
Step b: divide the MI class n(MI) into k intervals.
Step c: calculate the probability of each interval, as shown in Figure 3. Each sample value x is converted to the standard normal variable z = (x − x̄)/s, and the probability of interval i is P_i = Φ(z_i) − Φ(z_{i−1}).
Step d: calculate the number of samples S_i that must be added in each interval (Figure 3): S_i = R(MI) × P_i.
Step e: generate the S_i values within each interval from Step d and add them to the MI class.
Step f: examine whether the sample sizes are balanced. If the classes in the dataset are not equal in size, balance them again; otherwise, the dataset is considered balanced.
(2) PDDBS median sampling procedure (Figure 2(b))
Step a: select one attribute from the dataset and calculate its sample sizes, R(MI), and R(MA). With the target class size T = [n(MI) + n(MA)]/2, the number of samples that must be added to the MI class is R(MI) = T − n(MI), and the number that must be removed from the MA class is R(MA) = n(MA) − T.
Step b: divide the MI class n(MI) into k1 intervals and the MA class n(MA) into k2 intervals, where k1 and k2 are calculated from the respective class sizes.
Step c: calculate the probability of each interval, as shown in Figures 4 and 5, using the same standard normal conversion z = (x − x̄)/s; the probability of interval i is P_i = Φ(z_i) − Φ(z_{i−1}).
Step d: calculate the number of MI samples S1_i to be added and the number of MA samples S2_i to be removed in each interval: S1_i = R(MI) × P_i (Figure 4) and S2_i = R(MA) × P_i.
Step e: generate the S1_i values from Step d and add them to the MI class; remove the S2_i samples from Step d directly from the MA class.
Step f: confirm that the classes in the dataset are equal in size. If not, they must be balanced again.
In the preceding equations, n(MI) and n(MA) denote the minority- and majority-class sample sizes, x̄ and s the sample mean and standard deviation of the selected attribute, and Φ the standard normal cumulative distribution function.
Four methods, namely PDDBS oversampling, PDDBS median sampling, SMOTE oversampling, and SMOTE median sampling, were implemented to thoroughly examine their performance in dealing with imbalanced classification in light of their respective advantages and disadvantages [30–32]. PDDBS provides larger numbers of replicated minority samples but increases the likelihood of overfitting, while SMOTE reduces the risk of overfitting but tends to exclude helpful information. Also, while oversampling minimizes information loss and generates equal numbers of minority- and majority-class samples, the process may overfit the classifier. Finally, although median sampling's use of the median provides an effective theoretical basis for removing noisy and redundant samples, its sampling performance on some datasets may be poor.
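The PDDBS oversampling steps above can be sketched in code. The following is a simplified, hypothetical illustration for a single numeric attribute; it assumes a normal fit to the minority attribute and uniform draws within each interval, which may differ from the paper's exact generation rule:

```python
import random
import statistics
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def pddbs_oversample(minority_attr, n_majority, k=5, seed=1):
    """Sketch of PDDBS oversampling for one attribute: fit a normal
    distribution to the minority values, split their range into k intervals,
    and allocate new samples to each interval by its normal probability.
    """
    rng = random.Random(seed)
    mean = statistics.mean(minority_attr)
    std = statistics.stdev(minority_attr)
    r_mi = n_majority - len(minority_attr)       # R(MI): samples to add
    lo, hi = min(minority_attr), max(minority_attr)
    width = (hi - lo) / k
    new_samples = []
    for i in range(k):
        a, b = lo + i * width, lo + (i + 1) * width
        # interval probability from the fitted normal distribution
        p = norm_cdf((b - mean) / std) - norm_cdf((a - mean) / std)
        s = round(r_mi * p)                      # S_i = R(MI) * P_i
        new_samples += [rng.uniform(a, b) for _ in range(s)]
    return minority_attr + new_samples

minority_attr = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]  # made-up alert readings
balanced = pddbs_oversample(minority_attr, n_majority=50)
print(len(balanced))
```

Because samples are allocated per interval in proportion to the fitted distribution, the synthetic values preserve the shape of the original minority attribute rather than merely duplicating points.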




Step 4. Establish the inference model and compare the forecast results
Feed the case data to SOS-LSSVM to establish the inference model.
Step 5. Results
The prediction accuracy of the proposed model was compared with that of other AI-based inference models to determine its forecasting ability. Furthermore, the best model's ROC and AUC under the various dataset balancing methods were compared to examine classification performance.
Step 6. System development and implementation
In this step, the proposed model is developed and implemented into an integrated system that engineers may use in smart decision-making related to preventing and resolving ground-settlement problems.
3.1. Identifying Influential Factors
As listed in Table 1, the dataset had ten possible influential factors. Based on the findings of previous shield tunneling studies [33, 34], soil shear strength is the primary factor affecting tunnel settlement, with lower strength values associated with a higher risk of settlement. Soil shear strength may be derived from the cohesion c and the internal friction angle φ. Groundwater level variation during shield tunnel construction [35] is another factor affecting tunnel settlement, with Liu et al. (2023) finding a positive correlation between the groundwater level and downward movement in the tunnel [36]. Tunnel geometry (e.g., depth of the tunnel center line and tunneling distance) [34], chamber pressure, total thrust force, tunneling speed, backfill quantity, excavated soil quantity, and water pressure [32] are also significant factors in tunnel settlement.
Although SOS-LSSVM can process large quantities of data, factors that are not correlated with the output can interfere with training and result in excessive errors. Therefore, an objective method was required to analyze the correlation between the factors and settlement and thus select significantly correlated parameters for the inference model. In statistics, the term "correlation" refers to the strength and direction of the linear relationship between two variables; hence, it also indicates the degree to which the two variables are mutually independent. In this study, SPSS 22.0 was used, employing Pearson's correlation coefficient, Kendall's tau-b, and Spearman's rho, to analyze the ten factors and determine the correlations between the input and output variables.
The results of the correlation analyses of the ten influential factors are presented in Table 2; factors exhibiting a significant correlation (two-tailed, 0.01 level) in at least two of the three tests were selected for the models. On this basis, eight of the ten factors were accepted and two were rejected as input variables. The depth of the tunnel center line met the requirements of the Pearson test only, with Kendall's tau-b and Spearman's rho showing no significant correlation. For the quantity of excavated soil, none of the correlation tests identified a significant correlation with soil settlement at the 0.01 level of significance. Based on these findings, the depth of the tunnel center line and the quantity of excavated soil were excluded as input variables. The other eight factors showed strong correlations with soil settlement and were thus included as input variables.
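As an illustration of such correlation screening, a stdlib-only Pearson coefficient on made-up values might look as follows (SPSS additionally reports Kendall's tau-b, Spearman's rho, and significance levels, which are omitted here):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical screening: a candidate factor is retained only if it is
# significantly correlated with settlement in at least two of three tests.
settlement = [1.0, 2.1, 2.9, 4.2, 5.1]        # made-up settlement readings
chamber_pressure = [0.9, 2.0, 3.1, 3.9, 5.0]  # made-up factor values
print(round(pearson_r(chamber_pressure, settlement), 3))
```

In practice the p-values, not just the coefficients, decide significance; the paper delegates that computation to SPSS 22.0.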
3.2. Case Study and Data Collection
The datasets used in this study included data from construction projects implemented by the Taipei Mass Rapid Transit (MRT) system in Taiwan. A total of 999 settlement monitoring records were collected, with each covering ten input variables and one output variable. Tender CG291 of the Taipei Mass Rapid Transit system was used as a case study covering G16 Zhongshan Station to G14 Beimen Station, as shown in Figure 6. The shield tunnel employed a two-section configuration with a total length of 1861 meters. The first section extends from Beimen Station (G14) to Tianshui Road Extension Station (G15) and the second extends from Tianshui Road Extension Station (G15) to Zhongshan Station (G16). In terms of geological properties, the route is in what is classified as Zone T2 (Tamsui River Zone 2), and the tunnel is primarily in the strata between Songshan Formation 3 and Songshan Formation 4. The strata are evenly layered, and the groundwater level is approximately 2.7–3.5 m underground (EL 99.2–100 m). The monitoring system used was the settlement reference point-shallow subsurface type (SSI). The monitoring setups used are summarized in Table 3.

3.3. Building a Balanced Dataset
As shown in Table 4, 75 of the 999 samples in the dataset were at the alert level, while the remaining 924 were at the safe level, making the dataset imbalanced. The dataset was thus balanced separately using PDDBS and SMOTE by modifying the number of samples in the majority and minority classes. Table 5 shows the number of modified samples by method. Balancing changed the majority-to-minority ratio from 12.32:1 to approximately 1:1 by increasing the number of minority samples and reducing the number of majority samples. PDDBS oversampling and PDDBS median sampling generated 924 and 499 minority-class samples, respectively, while SMOTE oversampling and SMOTE median sampling generated 918 and 495, respectively. SOS-LSSVM was then applied to the balanced datasets to predict settlement.
3.4. Data Testing Using SOS-LSSVM
After PDDBS and SMOTE were applied to balance the dataset, the SOS-LSSVM was used to test the balanced dataset. Then, the two data balance methods were compared with each other based on the same algorithm. Furthermore, in addition to SOS-LSSVM, this study also provides the estimation results produced by BPNN, LSSVM, ELSIM, and SVM for comparison.
3.4.1. Data Preprocessing
Data preprocessing involves scaling the entire dataset, which significantly affects the model outcomes. Preprocessing is required before training to scale the data to an equivalent range. Although SOS-LSSVM is able to process large quantities of data and identify the nonlinear mapping between input and output values, learning speed and accuracy are seriously compromised when variable ranges are very large. Therefore, the input and output values must be scaled before training to prevent the model from becoming unstable or failing to converge. In this study, a normalization method was used to transform the data into the 0-1 range by linear scaling, applying the normalization function x_norm = (x − x_min)/(x_max − x_min), where x_norm is the normalized value, x is the actual value, and x_max and x_min are the maximum and minimum values, respectively.
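The scaling rule can be expressed directly in code; this is a minimal sketch of the min-max normalization described above:

```python
def min_max_normalize(values):
    """Linearly scale raw values into the 0-1 range, as described above."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10.0, 15.0, 20.0]))  # [0.0, 0.5, 1.0]
```

For prediction, the model's outputs are denormalized by inverting the same mapping with the stored minimum and maximum.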
3.4.2. Tenfold Cross-Validation
Tenfold cross-validation, recommended to reduce bias and obtain reliable accuracy in statistical analysis [37], was implemented in this study to divide the datasets randomly into ten folds of approximately equal size to evaluate the learning model’s performance. Ninety percent of the dataset was used for training and 10% was used as validation data for testing. The process was repeated ten times, and the final result was calculated using the average of the tenfold results.
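A sketch of such a tenfold split (index-level only, with a hypothetical random seed) is shown below:

```python
import random

def ten_fold_indices(n, seed=7):
    """Randomly split n sample indices into ten folds of roughly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

folds = ten_fold_indices(999)   # 999 records, as in this study
for test_fold in folds:
    # nine folds train, one fold tests; the split rotates ten times
    train = [i for f in folds if f is not test_fold for i in f]
    assert len(train) + len(test_fold) == 999
```

The final reported metric is then the average of the ten per-fold test results.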
3.4.3. Inference Settlement Evaluation and Error Indices
This study proposes SOS-LSSVM as the inference model, which requires setting only the number of iterations for SOS and the ranges of the LSSVM parameters. The LSSVM serves as a supervised-learning-based predictor that accurately models the relationship between the input and output variables, while the SOS algorithm serves as a metaheuristic search for the optimal LSSVM parameters. This hybrid system enhances the learning process through the mutualism, commensalism, and parasitism phases. The search stops when the stopping condition is fulfilled; otherwise, the model proceeds to the next iteration. The training and testing data were fed to SOS-LSSVM, and the output values were then denormalized to compute the mean performance index, which indicated the accuracy of the forecasts. The error indices are detailed in this section.
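The SOS search loop can be sketched as follows. This is a heavily simplified illustration of the metaheuristic's three phases applied to a toy objective rather than to LSSVM hyperparameter tuning; the population size, iteration count, and benefit factors are assumptions of this sketch:

```python
import random

def sos_minimize(objective, bounds, pop_size=10, iters=50, seed=3):
    """Minimal sketch of symbiotic organisms search (SOS) with its
    mutualism, commensalism, and parasitism phases. In the paper, the
    objective would score LSSVM hyperparameters under cross-validation.
    """
    rng = random.Random(seed)
    dim = len(bounds)

    def clip(x):
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fit = [objective(x) for x in pop]
    for _ in range(iters):
        for i in range(pop_size):
            best = pop[fit.index(min(fit))]
            j = rng.randrange(pop_size)
            # mutualism phase: organisms i and j both move toward the best
            mutual = [(a + b) / 2 for a, b in zip(pop[i], pop[j])]
            for k, bf in ((i, rng.choice([1, 2])), (j, rng.choice([1, 2]))):
                cand = clip([a + rng.random() * (g - m * bf)
                             for a, g, m in zip(pop[k], best, mutual)])
                f = objective(cand)
                if f < fit[k]:              # greedy acceptance
                    pop[k], fit[k] = cand, f
            # commensalism phase: i benefits from a random partner j
            cand = clip([a + rng.uniform(-1, 1) * (g - b)
                         for a, g, b in zip(pop[i], best, pop[j])])
            f = objective(cand)
            if f < fit[i]:
                pop[i], fit[i] = cand, f
            # parasitism phase: a mutated copy of i tries to displace j
            parasite = pop[i][:]
            d = rng.randrange(dim)
            parasite[d] = rng.uniform(*bounds[d])
            f = objective(parasite)
            if f < fit[j]:
                pop[j], fit[j] = parasite, f
    k = fit.index(min(fit))
    return pop[k], fit[k]

# toy usage: minimize the 2-D sphere function over [-5, 5]^2
solution, value = sos_minimize(lambda x: sum(v * v for v in x), [(-5, 5)] * 2)
```

Unlike many metaheuristics, SOS needs no algorithm-specific tuning parameters beyond the population size and iteration count, which is part of its appeal here.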
Errors are inevitable in forecasts; therefore, effective indices are required to appraise them, both to determine the accuracy of an inference model and to compare it with other inference models. In this study, five such indices were used (as listed in Table 6): mean absolute percentage error (MAPE), correlation coefficient (R), mean absolute error (MAE), root mean squared error (RMSE), and reference index (RI). RI served as the index for comprehensive evaluation. Such performance measures allow for more accurate results and a fairer test [24].
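The first four indices can be computed as below (RI, the paper's composite index, is defined in its Table 6 and is therefore not reproduced here):

```python
import math

def error_indices(actual, predicted):
    """Return MAPE, MAE, RMSE, and the correlation coefficient R for one
    forecast set; the composite RI combines these per the paper's Table 6.
    """
    n = len(actual)
    mape = sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    ma, mp = sum(actual) / n, sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    return mape, mae, rmse, cov / (sa * sp)
```

Note that MAPE is undefined when an actual value is zero, which is one reason multiple complementary indices are reported.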
This study also used ROC curves and AUC values to evaluate the accuracy of classification. The ROC curve is a coordinate-based diagram for the sensitivity of a classifier. It has gained increasing popularity in machine learning and data mining. The basic concepts of the ROC curve are the following four scenarios: (1) true positive (TP), (2) false positive (FP), (3) true negative (TN), and (4) false negative (FN).
Only the true positive rate (TPR) and false positive rate (FPR) are required to plot the ROC curve. TPR is the rate at which actual positive samples are accurately identified as positive, whereas FPR is the rate at which actual negative samples are erroneously identified as positive. The ROC space uses FPR as the x-axis and TPR as the y-axis, and the ROC curve is made up of (FPR, TPR) coordinate points. The perfect classification point is (0, 1), and the AUC value is the area under the ROC curve.
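These rates can be computed directly from paired label lists; the example values below are hypothetical:

```python
def tpr_fpr(actual, predicted, positive="alert"):
    """True positive rate and false positive rate for one set of predictions."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp / (tp + fn), fp / (fp + tn)

actual = ["alert", "alert", "safe", "safe", "safe"]
predicted = ["alert", "safe", "safe", "alert", "safe"]
tpr, fpr = tpr_fpr(actual, predicted)
print(tpr, round(fpr, 3))  # 0.5 0.333
```

Sweeping a classifier's decision threshold produces a set of such (FPR, TPR) points, which together trace the ROC curve whose area is the AUC.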
4. Results, Discussion, and System Applications
4.1. Training and Testing Results
The SOS-LSSVM training and testing results on the dataset, both before and after balancing, were compared to determine the accuracy of SOS-LSSVM. The results of BPNN, LSSVM, ELSIM, and SVM are also presented as references to illustrate the relative accuracy of SOS-LSSVM. The training and testing results of SOS-LSSVM under the various indices are shown in Figure 7. Both the PDDBS and SMOTE methods increased the accuracy of settlement prediction: compared with the original data, the MAPE, MAE, RMSE, and R values were improved by both methods. Overall, this study provides an alternative approach to data balancing that enhances the accuracy of SOS-LSSVM. The average values of the performance evaluation are shown in Table 7. In terms of MAPE and MAE, SMOTE median sampling was superior, achieving the smallest error values in both training and testing. SMOTE oversampling achieved the highest correlation (R) value for testing (0.988), while PDDBS oversampling achieved the best training performance, with an R value of 0.996 and an RMSE of 0.7852. It should be noted that although the original data exhibited acceptable settlement estimation performance, the accuracy of classifying the settlement condition as safe or alert remained unknown.

4.2. Comparison of Inference Models and Resampling Methods
To determine whether SOS-LSSVM could be superior to other AI-based inference models, the dataset was balanced separately with PDDBS and SMOTE and then used to train BPNN, LSSVM, ELSIM, and SVM. Subsequently, the corresponding RI values were calculated for comparison, and the performance of each algorithm was then ranked accordingly. Based on Tables 8 and 9, the RI values indicated that SOS-LSSVM was superior to other models. The comparison revealed that the RI value of SOS-LSSVM was superior to the other AI-based inference models, regardless of whether the data were left unbalanced or balanced by either PDDBS or SMOTE.
Unexpectedly, the RI values also indicated that data balanced by SMOTE outperformed data balanced by PDDBS. This is because RI is an index of general performance rather than a specific index of classification accuracy, which highlights the need for the ROC curve and AUC to evaluate the performance of SOS-LSSVM in classifying the settlement status (i.e., safe or alert).
The ROC curves are shown in Figures 8 and 9, and the average FPR, TPR, and AUC values are shown in Table 10. Judging from the average AUC values, using the original imbalanced settlement data yielded the lowest classification accuracy. Both PDDBS oversampling and PDDBS median sampling achieved slightly higher classification accuracy than SMOTE in training and testing. Also, for both methods, oversampling outperformed median sampling. These results support a positive relationship between the number of samples added to the minority class and the classification accuracy. Thus, the proposed resampling method is a competitive alternative to currently popular approaches, both solving the problem of data imbalance and enhancing the accuracy of AI-based forecasting.


4.3. System Development and Implementation
The application of the developed shield-tunnel settlement prediction system is shown in Figure 10. In this system, data loggers with built-in sensors record settlement data at regular intervals and transfer them to a computer storage system in the control center via wireless Internet. The stored data are then preprocessed and transformed for further analysis. The SOS-LSSVM system features a graphical user interface that allows users to interact easily with the algorithm: the system automatically trains a model on the input data and performs the prediction analysis. The results give engineers a decision-making tool for creating project-specific, real-time monitoring solutions. The system may be integrated into routine tunneling management, providing a centralized and convenient platform that incorporates state-of-the-art technologies, including data mining and artificial intelligence.

5. Conclusions
Settlement monitoring and estimation are essential for underground construction. This study applied SOS-LSSVM as the basis of an inference model for settlement during shield tunneling on the Taipei Mass Rapid Transit system, using historical case data for training and influential settlement factors as input variables. The study contributes an AI system that integrates metaheuristic optimization and data-balancing methods to predict ground settlement and to facilitate appropriate response measures against urban geotechnical hazards. The proposed model markedly outperforms the other AI models considered (BPNN, LSSVM, ELSIM, and SVM) and predicts settlement accurately enough to help engineers anticipate settlement status over the course of tunnel construction projects.
The PDDBS and SMOTE methods were applied to solve the data-imbalance problem. Tenfold cross-validation was used to evaluate the performance of the developed model, showing that SOS-LSSVM trained on data balanced by PDDBS and SMOTE achieved the highest RI values in both training and testing. In addition, the ROC curve and AUC were used to assess combinations of SOS-LSSVM with the two data-balancing methods (PDDBS and SMOTE) in terms of their ability to classify settlement status correctly as either safe or alert. A comparison of average AUC values demonstrated that the classification accuracy of PDDBS was higher than that of SMOTE, and that PDDBS oversampling was superior to PDDBS median sampling. These results demonstrate that the proposed method can effectively balance an imbalanced dataset and enhance AI-based forecast accuracy.
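Tenfold cross-validation partitions the dataset into ten folds and holds each fold out once for testing while training on the remaining nine, so every sample is tested exactly once. An index-level sketch of this procedure (illustrative only; function names are assumptions):

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle n sample indices and split them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Striding the shuffled list yields folds whose sizes differ by at most 1
    return [idx[i::k] for i in range(k)]

def cross_validate(n, k=10):
    """Yield (train, test) index lists; each fold is held out once."""
    folds = kfold_indices(n, k)
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

Averaging a performance index over the ten held-out folds gives an estimate that is less sensitive to any single train/test split than a one-off holdout evaluation.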
This study is a pioneering effort to develop an autonomous system that integrates monitoring sensors for data collection, wireless data transmission, and settlement prediction for disaster early warning. The system was tested on data from a real MRT construction project in Taiwan to demonstrate its novelty and practicality in real-world applications. Moreover, because of the limited time available for structural analyses and shortfalls of technical expertise in the field, most project-site engineers struggle to make the timely, correct decisions needed to prevent soil-settlement disasters. The system developed in this study can be used easily and quickly by engineers to take appropriate preemptive actions and thus avoid disasters, construction failures, and the associated losses of property and life.
The findings of this study suggest two directions for future research. First, differences in soil characteristics, as well as the quantity and completeness of the data, may directly influence the accuracy and reliability of the estimation results; future researchers are advised to collect other types of settlement monitoring data for training and testing to determine whether this makes a difference. Second, the PDDBS method was compared only with SMOTE. Future researchers are advised to include more resampling methods in the comparison to establish their relative effectiveness. Furthermore, because only one imbalanced dataset was used in the present study, future work should collect a wider variety of imbalanced data for training and testing additional resampling methods, compare their classification accuracy, and thereby improve the practicality and accuracy of the inference models.
Data Availability
The data used to support the findings of the study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.