Abstract
Landslides occur in most countries and are among the most serious geological hazards, with direct consequences for infrastructure construction. Reliable landslide susceptibility maps are therefore vital for steering construction projects away from landslide-prone areas. In recent years, supervised machine learning algorithms have been widely used in landslide susceptibility evaluation, but many flaws remain in the selection of nonlandslide point samples for comparative analysis. It is important to improve the authenticity of the sample data and to reduce the impact of noise. This paper uses China’s Funing County as a case study. First, 122 landslide incidents were identified from historical data, fieldwork, and remote sensing images to create a landslide inventory of the research area. In addition, 12 causal factors of landslides were determined: elevation, slope, aspect, plan curvature, profile curvature, distance to roads, distance to rivers, distance to faults, rainfall, normalized difference vegetation index (NDVI), lithology, and land cover. K-means clustering was used to purify the nonlandslide sample data before two data-driven models, certainty factor (CF) and frequency ratio (FR), and two machine learning models, random forest (RF) and artificial neural network (ANN), were applied in a comparative study of landslide susceptibility evaluation in Funing County. The results show that the method used to select nonlandslide sample data affects the accuracy of the evaluation models. The purified sample data improved the prediction accuracy of all four models, with the largest improvement observed in the ANN model. Purifying nonlandslide sample data with the K-means method is therefore of great significance for preparing landslide susceptibility maps.
1. Introduction
Landslides are not only natural geological hazards but also secondary disasters that threaten the safety of people and damage property. The rapid economic development of China in recent years has accelerated infrastructure construction such as highways, railways, and hydropower projects in the southwestern mountainous area. However, human activities and climate change have led to an increasing number of landslides that caused severe damage. Therefore, there is an urgent need to develop an accurate landslide susceptibility map of a specific region for planning, construction, operation, and maintenance of various projects.
In landslide susceptibility evaluation, it is usually assumed that there is a causal relationship between the environment where landslides have occurred in the past and the environment where landslides will occur [1]. The probability of landslide occurrence can be predicted based on factors in the geological environment [2] and historical landslide sample data [3]. The evaluation methods can be classified as qualitative or quantitative and further divided into knowledge-driven methods, data-driven methods, and machine learning algorithms [4–6]. In the knowledge-driven method, a region is mainly evaluated based on the knowledge and experience of cartographers or experts. For example, Degraff [7] drew a fuzzy map of landslide susceptibility in the study area based on lithostratigraphic units, slope orientation, and historical landslide data. However, this conventional qualitative evaluation method is affected by subjective factors and has been gradually improved or replaced. On the other hand, data-driven methods usually use mathematical and statistical methods to analyze and evaluate the research area. For example, Chen et al. [8] combined the statistical index method (SI) and kernel logistic regression (KLR) to evaluate the susceptibility of landslide disasters in Chongren County, which improved the model’s prediction accuracy. Weight of evidence (WoE) is also a commonly used data-driven method. For example, Batar and Watanabe, Li and Wang, and Pamela and Arifiantiet [9–11] used this method to conduct landslide susceptibility mapping in their research areas. Certainty factor (CF) and frequency ratio (FR) models are generally used together for comparative analysis. For example, Oztuk and Uzel-Gunini [12] combined both models with the analytic hierarchy process (AHP) to form a new hybrid model, which greatly improved the prediction accuracy.
With the development of computer science and interdisciplinary fields, smart computing technologies such as machine learning and data mining have been widely used in the classification and regression of big data, and their algorithms have gradually been adopted in landslide susceptibility evaluation [13, 14]. Compared with data- and knowledge-driven methods, machine learning has increasingly been recognized as a more accurate prediction method that does not depend heavily on data quality. Machine learning models commonly used in landslide susceptibility mapping include decision trees [15–19], random forests [20–23], boosted regression trees [24–27], and artificial neural networks (ANN). Many studies have validated the use of these models. For example, Lee et al. (2020) and Jennifer and Saravanan (2022) [28, 29] used ANN algorithms to evaluate landslide susceptibility. Nguyen et al. [28] also combined ANN algorithms with artificial bee colony (ABC) and particle swarm optimization (PSO) algorithms to optimize the model. The results showed that the optimized model improved prediction compared with conventional methods.
In summary, various algorithms have been widely used in landslide susceptibility evaluation, and the evaluation accuracy depends on the model used and the purity of the sample data. There is currently a lack of research on nonlandslide samples, and many flaws remain in how the authenticity of nonlandslide samples is ensured. Thus, this study used K-means clustering to evaluate the factor accuracy of nonlandslide samples in the research area, and the prediction accuracy of the CF, FR, RF, and ANN models for landslide susceptibility in the research area was compared and analyzed to develop a more reliable method for the planning, construction, operation, and maintenance of various projects. The research ideas of this study are shown in Figure 1.

2. Research Area
Funing County is located in the southeast of China’s Yunnan province. Its geographic coordinates are between 105°13′-106°12′E and 23°11′-24°09′N. It is situated at the intersection of the Sichuan-Yunnan-Guizhou meridional tectonic zone and Qinghai-Tibet-Yunnan-Myanmar tectonic system. It has 360 peaks as natural barriers across the southern and northern areas. With plateaus, mountains, hills, and basins, the mountainous area accounts for 96% of the total area. The plateaus are located in the southwest, the mountains in the central north and south, the hills in the northeast, while the basins lie in Xinhua, Puyang, and Mugang. The landform and landslide distribution in the research area are shown in Figure 2.

The highest altitude in the territory is 1,851 m. It has a subtropical monsoon climate with distinct dry and wet seasons, as well as rain and heat in the same season. The average annual rainfall is 1,103.5 mm. There are 5 major rivers and 29 tributaries, with the main river spanning 555.8 km. The main stratigraphic units belong to the Cambrian, Ordovician, Devonian, Carboniferous, Permian, Triassic, and Neoproterozoic eras. Meanwhile, the main lithological units in the research area are limestone, dolomite, mudstone, shale, siltstone, and diabase. Its land cover includes forests, grasslands, and shrubs, but the development of mountainous areas in recent years has led to land cover changes.
3. Data Sources
The preparation of preliminary data is a prerequisite for the study of a region. The data used in this paper include the location and relevant information of existing landslides and parameters related to causal factors. Table 1 shows the data sources.
3.1. Establishment of Landslide Inventory
A landslide inventory of Funing County was created based on historical geological disaster data, field surveys, aerial photos, and remote sensing image data. In addition, nonlandslide points equal in number to the investigated landslides were selected for comparative analysis. The sample size can be increased on a reasonable scale and the use of a small number of landslide samples can still ensure the model’s accuracy. In this paper, 122 landslide points and 122 nonlandslide points were selected and divided into a training set and a test set at a ratio of 7 : 3.
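The 7 : 3 division described above can be illustrated with a minimal sketch. It assumes the split is made separately within each class so that both sets keep the 1 : 1 balance; the paper does not state this, and the function name, the integer point identifiers, and the random seed are all hypothetical.

```python
import random

def stratified_split(landslide_ids, nonlandslide_ids, train_fraction=0.7, seed=0):
    """Split each class separately so both sets keep the same class balance.

    Assumption: per-class (stratified) splitting; the paper only gives the 7:3 ratio.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label, ids in ((1, landslide_ids), (0, nonlandslide_ids)):
        shuffled = list(ids)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train += [(point_id, label) for point_id in shuffled[:cut]]
        test += [(point_id, label) for point_id in shuffled[cut:]]
    return train, test

# Hypothetical identifiers: 0-121 are landslide points, 122-243 nonlandslide points.
train, test = stratified_split(range(122), range(122, 244))
print(len(train), len(test))
```

With 122 points per class, this yields 85 training and 37 test points from each class.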
3.2. Causal Factors of Landslides
Landslides are caused by natural and human factors, such as landform, lithology, and climate. Proper selection of causal factors is a prerequisite for landslide susceptibility evaluation. In this study, 12 causal factors were selected based on previous research experience, availability of data, geological environment conditions, and landslide occurrence mechanism. The factors include elevation, slope, aspect, plan curvature, profile curvature, lithology, rainfall, NDVI, distance to faults, distance to roads, distance to water systems, and land cover. Collinearity analysis was then conducted on the factors. No obvious collinearity was found between the factors, so they could all provide data support for the subsequent prediction models. Among them, elevation, aspect, distance to faults, distance to roads, and distance to water systems were classified manually based on field investigation and previous experience. Plan curvature, profile curvature, rainfall, and NDVI were graded using the traditional natural breakpoint method.
3.2.1. Elevation
Elevation is a basic indicator in characterizing landforms. Many studies show that there is a clear relationship between geological hazards and altitude [30]. The altitude of Funing County is between 168.7 m and 1,804.2 m. In this paper, elevation is divided into six areas, including <300 m, 300−600 m, 600−900 m, 900−1,200 m, 1,200−1,500 m, and >1,500 m. The elevation distribution is shown in Figure 3(a).

[Figure 3, panels (a)–(l): maps of the causal factors in the research area]
3.2.2. Slope
Slope is one of the main parameters affecting the stability of slopes [30, 31]. The steeper the slope, the more obvious the stress concentration at the foot of the slope and the greater the probability of landslides. In this paper, DEM data were used to extract the slope map of the research area and divide it into six categories, as shown in Figure 3(b).
3.2.3. Aspect
Aspect has been identified as one of the factors that contribute to the occurrence of landslides [32]. It is defined as the projected direction of the slope perpendicular to the horizontal plane. There are significant differences in the distribution of landslides in different aspects due to solar radiation. The degree of sun exposure in different directions results in different vegetation cover, evapotranspiration, land degradation, and weathering processes. These factors will subsequently affect the slope runoff, rainfall infiltration, as well as physical and mechanical properties of the slope, and ultimately the stability of the slope. Aspect is divided into 9 categories in this paper, as shown in Figure 3(c).
3.2.4. Lithology
Lithology is one of the main factors affecting slope stability. It determines the physical, chemical, mechanical, and hydraulic properties of the rock mass. For example, landslides are common in the Loess Plateau. Moreover, the form, occurrence, and scale of geological structures, such as faults, folds, joints, and cleavage, have a major impact on slope stability. After the 1:50,000 geological map of the Yunnan Geological Survey Bureau was vectorized, the lithostratigraphic map of the research area was drawn. Mudstone, siltstone, shale, marlstone, and gabbro-diabase of unknown age were mainly distributed in the area (Table 2).
3.2.5. Average Annual Rainfall
Rainfall is the main factor causing landslides, thus it is the key parameter for landslide susceptibility evaluation modeling [33]. The rainfall infiltration process is mainly manifested as the effect on the mechanical parameters of the slope’s rock mass, which changes the stability of the slope and leads to geological disasters such as landslides or debris flows. The research area is located in the dry-hot valleys close to the South China Sea. There are no large temperature variations during the four seasons, with plenty of rainfall and distinct dry and wet seasons, as the climate is affected by the southeast monsoon, topography, ocean, and other factors. With an annual rainfall of 1,258 mm–1,717 mm, it has a low-latitude plateau climate with significant vertical climate differences. The Kriging interpolation method is used to generate the annual mean rainfall map of the study area and the natural breakpoint method is used to divide it into six categories in ArcGIS 10.2 (Figure 3(e)).
3.2.6. Plan Curvature
Curvatures are widely used in landslide susceptibility evaluation [34]. They reflect the extent of surface drainage, erosion, and sedimentation. In this paper, ArcGIS 10.2 was used to extract the plan curvature map of the research area from DEM data. The positive values represent convex surfaces, while the negative values denote concave surfaces. When the value is closer to 0, the surface is flatter, while the opposite is true for profile curvature.
3.2.7. Profile Curvature
The values of profile curvature are the opposite of those indicated by plan curvature, with positive values representing concave surfaces and negative values indicating convex surfaces. By comparing and analyzing the plan curvature with the profile curvature, the surface relief of the study area can be reflected more accurately.
3.2.8. Normalized Difference Vegetation Index
Normalized difference vegetation index (NDVI) reflects information on surface vegetation, such as vegetation health, density, biomass, and root system. The plant root system can reinforce the soil and reduce the infiltration of surface water during rainfall. Vegetation cover, especially forest cover, reflects the degree of protection of surface vegetation. Under the same conditions, areas with dense vegetation cover are less prone to geological disasters.
Its equation can be written as follows:
$$NDVI = \frac{NIR - R}{NIR + R}$$
where NIR is the reflectance in the near-infrared band and R is the reflectance in the red band. The NDVI value is between −1 and 1. A negative value indicates that the ground is covered by water or snow, which are highly reflective to visible light. The closer the value is to 0, the lower the surface vegetation coverage. This means that the area is close to exposed bedrock or arable land. Meanwhile, a positive value indicates vegetation cover, and the coverage increases with the increase in NDVI value. Figure 3(h) shows the map of different vegetation indexes in the research area.
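As a small illustration of the index described above, the following sketch computes NDVI for a single pixel from its near-infrared and red reflectance. The function name and the sample reflectance values are hypothetical, and returning None when both bands are zero is a choice made for this sketch only.

```python
def ndvi(nir, red):
    """Return (NIR - R) / (NIR + R) for one pixel.

    Returns None when both reflectances are zero, where the ratio is undefined
    (an assumption of this sketch, not something stated in the paper).
    """
    total = nir + red
    if total == 0:
        return None
    return (nir - red) / total

# Hypothetical reflectance values for illustration only.
vegetated = ndvi(0.50, 0.10)   # strong near-infrared reflectance gives a positive value
water = ndvi(0.02, 0.06)       # stronger red than near-infrared gives a negative value
print(vegetated > 0, water < 0)
```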
3.2.9. Land Cover
The type of land cover refers to the natural or man-made surface cover on the ground, and does not only include vegetation but also artificial land cover or artificial modifications on the surface. In other words, land cover refers to the combination of vegetation cover and artificial cover on the earth’s surface, which involves the natural attributes of land and effects of human activities. With the increase of human demand for land, forest cover and grassland have decreased dramatically in many countries, while the reduction of vegetation cover has a significant impact on water and soil conservation. Areas with low vegetation coverage are more susceptible to landslides and other natural disasters due to heavy rainfall. See Figure 3(i) for the map of land cover in the research area.
3.2.10. Distance to Faults
Active faults increase the risk of landslides. Faults control the development of geological joints. The Funing and Yanshan fault zones were formed in the research area, and many small-scale fault zones exist. A fault distribution map was acquired after vectorization of the 1:50,000 geological map of the area. The map was classified into six categories using the Euclidean distance tool in ArcGIS, as shown in Figure 3(j).
3.2.11. Distance to Roads
In various engineering projects such as road construction, the slope will be excavated, which will destroy the stability of the slope and trigger landslides [34]. More construction activities have destroyed the natural slope amid rapid economic development. In particular, large-scale blasting and excavation during the construction of railways and highways have destroyed the stability of slopes and increased the chances of landslides occurring. The buffer zones of the roads were set as 0–150 m, 150–300 m, 300–450 m, 450–600 m, 600–750 m, and >750 m. Figure 3(k) shows the distribution map of buffer zones of roads in the research area.
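The road buffer zones listed above amount to a simple lookup from distance to class. The sketch below shows one way to express it; the names are hypothetical, and assigning a distance that falls exactly on a boundary (for example 150 m) to the farther zone is an assumption, since the paper does not say how boundaries are handled.

```python
from bisect import bisect_right

# Upper bounds (in metres) of the first five road buffer zones given in the text.
ROAD_BREAKS_M = [150, 300, 450, 600, 750]
ROAD_LABELS = ["0-150 m", "150-300 m", "300-450 m", "450-600 m", "600-750 m", ">750 m"]

def road_buffer_class(distance_m):
    """Map a distance to the nearest road onto one of the six buffer zones.

    Assumption: a distance exactly on a break value belongs to the farther zone.
    """
    return ROAD_LABELS[bisect_right(ROAD_BREAKS_M, distance_m)]

print(road_buffer_class(100), "|", road_buffer_class(480), "|", road_buffer_class(900))
```

The same pattern would apply to the other manually classified factors once their break values are known.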
3.2.12. Distance to Rivers
Water runoff on slopes is an important factor affecting the occurrence of landslides. This means that the distance to rivers is a key factor of landslides in mountainous areas [35]. The research area’s river system is well-developed, and the rainy season affects the river flow. Six buffer zones were set to study the impact of rivers on the stability of landslides. Figure 3(l) shows the distribution map of buffer zones of rivers in the research area.
4. Research Methods
This study consists of four steps. First, databases, geological surveys, and aerial technologies were used to determine the positions of landslides, which were imported into ArcGIS 10.2 to create a landslide inventory. Second, K-means clustering was performed in IBM SPSS Statistics 22 for a preliminary susceptibility evaluation, from which purified nonlandslide point sample data were extracted for comparative analysis. Third, the CF, FR, RF, and ANN models were used to create landslide susceptibility maps in Spyder 5.15. In the final step, the evaluation accuracy of each model was compared and validated.
4.1. K-Means Clustering
Cluster analysis is a data mining function and an important part of unsupervised learning. It classifies unlabeled sample data into different sets based on certain rules, so that the samples within each set are as similar as possible and the sets are as dissimilar as possible. K-means is the most commonly used clustering technique. It selects K points as the initial centroids based on a specific strategy and assigns each remaining data point to the cluster of its closest centroid. In practical applications, a maximum number of iterations is often set as a flexible stopping rule: when the maximum number of iterations is reached, the calculation is terminated. The main steps are described below:
(1) Input the original dataset X = {x1, x2, x3,…, xn} and randomly select K centroids, i.e., a1, a2, a3,…, ak.
(2) Calculate the Euclidean distance from every sample xi in X to each centroid al, and assign the sample to the nearest centroid. The equation is expressed as follows:
$$d(x_i, a_l) = \sqrt{\sum_{j=1}^{m} (x_{ij} - a_{lj})^2}$$
where d(xi, al) is the Euclidean distance between the sample and the centroid, and m is the number of factors describing each sample.
(3) Calculate the mean of the samples in each cluster and update the centroid. The equation is written as follows:
$$a_l = \frac{1}{|C_l|} \sum_{x_i \in C_l} x_i$$
where Cl is the set of samples currently assigned to centroid al.
(4) Repeat steps 2 and 3 until the position of the centroids does not change. The equation is defined as follows:
$$a_l^{(t+1)} = a_l^{(t)}, \quad l = 1, 2, \ldots, K$$
where t is the iteration number.
(5) Output the clustering results.
The results are imported into ArcGIS, and a preliminary landslide susceptibility zoning map is prepared using the Kriging interpolation method.
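The clustering steps of this section can be sketched in a few lines of plain Python. This is an illustrative implementation only, not the IBM SPSS procedure used in the study; the function name, the seed, and the toy two-dimensional points are hypothetical, whereas the actual samples carry 12 factor values each.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: random initial centroids
    for _ in range(max_iter):          # flexible stop: cap on iterations
        # Step 2: assign each sample to its nearest centroid (Euclidean distance).
        labels = [min(range(k), key=lambda l: math.dist(p, centroids[l])) for p in points]
        # Step 3: move each centroid to the mean of its assigned samples.
        new_centroids = []
        for l in range(k):
            members = [p for p, lab in zip(points, labels) if lab == l]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[l])  # keep the centroid of an empty cluster
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

In practice the factor values would normally be scaled before clustering so that no single factor dominates the distance; whether the study did so is not stated.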
4.2. Certainty Factor
The certainty factor (CF) method is a probability function that was first proposed in 1990 [35]. It is one of the commonly used functions to handle the combination of different data layers as well as heterogeneity and uncertainty of the input data [34].
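The equation is expressed as follows:
$$CF = \begin{cases} \dfrac{PP_a - PP_s}{PP_a (1 - PP_s)}, & PP_a \geq PP_s \\ \dfrac{PP_a - PP_s}{PP_s (1 - PP_a)}, & PP_a < PP_s \end{cases}$$
where PPa represents the probability of landslides under conditional classification a, while PPs represents the total probability of landslides in the research area.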
The CF value ranges from −1 to 1. A positive value indicates an increase in the certainty of landslide occurrence, while a negative value suggests a decrease in the certainty of landslide occurrence [36]. The occurrence of landslides is uncertain when the value is close to 0.
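A minimal sketch of the certainty factor calculation follows, using the piecewise form commonly given for this method. Here pp_a and pp_s stand for the class-level and area-wide landslide probabilities defined above; the function name and the sample probabilities are hypothetical.

```python
def certainty_factor(pp_a, pp_s):
    """Certainty factor of one factor class.

    pp_a: probability of landslides within the class.
    pp_s: overall probability of landslides in the research area.
    Assumes 0 <= pp_a < 1 and 0 < pp_s < 1.
    """
    if pp_a == pp_s:
        return 0.0  # the class is no more and no less landslide-prone than average
    if pp_a > pp_s:
        return (pp_a - pp_s) / (pp_a * (1 - pp_s))
    return (pp_a - pp_s) / (pp_s * (1 - pp_a))

# Hypothetical probabilities for illustration only.
print(certainty_factor(0.02, 0.01) > 0)   # class more landslide-prone than average
print(certainty_factor(0.005, 0.01) < 0)  # class less landslide-prone than average
```

The result stays within the −1 to 1 range described above, reaching −1 when a class contains no landslides at all.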
4.3. Frequency Ratio Method
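The frequency ratio method determines the correlation between the location of landslides and their causal factors. One of the main assumptions of this data-driven method is that future landslides will happen under the same external conditions as previous landslides. Thus, it is crucial to analyze the relationship between past landslides and environmental conditions. The equation is written as follows:
$$FR = \frac{N_i / N}{A_i / A}$$
where Ni is the number of landslides in class i of a causal factor, N is the total number of landslides, Ai is the area of class i, and A is the total area of the research area.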
The value of FR usually reflects the density index of the landslide distribution in a certain area. An FR value greater than 1 indicates that this condition will likely trigger a landslide. On the contrary, an FR value less than 1 indicates that this condition will unlikely lead to a landslide.
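The frequency ratio can be sketched as the share of landslides falling in a factor class divided by the share of the research area that the class occupies, which is the usual form of this index. The function and argument names and the counts in the example are hypothetical.

```python
def frequency_ratio(landslides_in_class, landslides_total, area_in_class, area_total):
    """Ratio of the landslide share of a class to its area share.

    Values above 1 mean the class holds more landslides than its area alone would suggest.
    """
    landslide_share = landslides_in_class / landslides_total
    area_share = area_in_class / area_total
    return landslide_share / area_share

# Hypothetical counts: 30 of 122 landslides fall in a class covering 10% of the area.
fr = frequency_ratio(30, 122, 100.0, 1000.0)
print(fr > 1)
```

Areas may be given in any unit, or as raster cell counts, as long as both area arguments use the same one.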
4.4. Random Forest Model
RF is a popular ensemble learning algorithm widely used for classification, regression, clustering, and prediction of data. This model can process high-dimensional data without feature selection and has strong adaptability to datasets. It can process both discrete and continuous data. The datasets are not required to be normalized and it is relatively easy to operate and implement the algorithms.
The process of the random forest algorithm is as follows: (i) obtain the original training data and resample it several times; (ii) select a random set of features in each resampling; (iii) fit a decision tree to each resample and its random feature set; and (iv) aggregate the predictions of the fitted decision trees to get a single prediction.
During the establishment of the random forest model, it is necessary to define the n_estimators, min_samples_leaf, and min_samples_split parameters. n_estimators represents the number of trees in the forest. Usually, the larger the number, the longer the calculation time and the better the outcome. However, beyond a critical value the results no longer improve significantly and may even deteriorate. Meanwhile, min_samples_leaf and min_samples_split represent the minimum number of samples required at leaf nodes and the minimum number of samples required to split internal nodes, respectively. They can make the model smoother. Overfitting may occur if these parameters are too small. In contrast, setting them too large will prevent the model from learning the data. Hence, these parameters need to be tuned repeatedly. After repeated tuning and cross-validation, the final n_estimators value is 400 and the values of min_samples_leaf and min_samples_split are 10 and 50, respectively.
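The parameter names above match those of scikit-learn's RandomForestClassifier, so a sketch with that library is given here on the assumption that it, or a compatible implementation, was used. The data are random stand-ins with the same shape as the study's samples (244 points, 12 factors), and the labelling rule and random_state are hypothetical.

```python
import random

from sklearn.ensemble import RandomForestClassifier

rng = random.Random(0)

# Hypothetical stand-in data: 244 samples with 12 factor values each.
X = [[rng.random() for _ in range(12)] for _ in range(244)]
# Hypothetical labels that depend on two of the columns, so there is something to learn.
y = [1 if row[1] + row[7] > 1.0 else 0 for row in X]

# Hyperparameter values reported in the text; random_state is added for repeatability.
model = RandomForestClassifier(
    n_estimators=400,
    min_samples_leaf=10,
    min_samples_split=50,
    random_state=0,
)
model.fit(X[:170], y[:170])  # roughly 70% of the samples for training

# Predicted probability of the landslide class for the held-out samples.
susceptibility = model.predict_proba(X[170:])[:, 1]
importances = model.feature_importances_
print(len(susceptibility), round(float(sum(importances)), 6))
```

The predicted probabilities are the kind of value that would then be mapped and classified into susceptibility zones.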
4.5. Artificial Neural Network Model
Computational intelligence, especially artificial neural networks, has developed rapidly to solve various engineering issues in the past two decades [27]. It is an abstraction, simplification, and simulation of the human brain to perform complex information processing. The performance of neural networks is determined by the mathematical model of simulated neurons, connection method of the neural network, and learning method of neurons [37]. This method uses small training sets for training and processing of large datasets. The neural network method can be selected if the tasks are not clearly defined, the observation process is impossible or the complex relationship between functions cannot be determined.
This paper used the multilayer perceptron (MLP) network, a classic artificial neural network. It consists of an input layer, a hidden layer, and an output layer. Multiple hidden layers can be used, but various studies have shown that a single hidden layer is enough for modeling most complex problems. In particular, it can learn and build nonlinear data models to realize complex logical operations and nonlinear relationships [38–40].
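To make the layer structure concrete, the sketch below runs a forward pass through a perceptron with one hidden layer. The sigmoid activation, the layer sizes, and all weights are hypothetical choices for illustration; the paper does not report the network's configuration, and training is not shown.

```python
import math

def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: input layer -> one hidden layer -> single output neuron."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# Hypothetical network: 3 inputs, 2 hidden neurons, 1 output.
x = [0.2, 0.7, 0.1]
w_hidden = [[0.5, -0.3, 0.8], [-0.6, 0.9, 0.1]]
b_hidden = [0.0, 0.1]
w_out = [1.2, -0.7]
b_out = 0.05
score = mlp_forward(x, w_hidden, b_hidden, w_out, b_out)
print(0.0 < score < 1.0)
```

The output can be read as a susceptibility score between 0 and 1 for the input sample.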
5. Results and Discussion
5.1. Analysis on the Importance of Causal Factors
The importance of each of the 12 causal factors is shown in Figure 4. Three factors had the most significant impact on landslide susceptibility: slope (importance measurement, IM = 0.138), NDVI (IM = 0.114), and distance to roads (IM = 0.103). In comparison, six factors had a relatively small impact, including profile curvature (IM = 0.074), aspect (IM = 0.071), lithology (IM = 0.07), plan curvature (IM = 0.069), rainfall (IM = 0.065), and land cover (IM = 0.026). Factors with an IM value of 0.08–0.1 have a moderate impact on landslide susceptibility, such as distance to faults (IM = 0.089), distance to rivers (IM = 0.088), and elevation (IM = 0.074).

5.2. K-Means Model
In ArcGIS 10.2, the research area was divided into 523,002 raster cells, and the values of all causal factors were extracted for each cell. After the centroids were selected for calculation, the distance from each factor to the centroid was obtained, as shown in Figure 5.

The preliminary map of the landslide susceptibility evaluation obtained through K-means clustering is shown in Figure 6. The map is divided into five regions: very low, low, medium, high, and very high. The proportions of each area are very low (13.2%), low (13.6%), medium (22.6%), high (37.1%), and very high (13.5%). The landslide points in the very low risk area only account for 6.6% of all landslides, while landslide samples in the high and very high risk areas make up 59% of the total samples. According to previous studies, landslide susceptibility zoning is characterized by the following two principles: (a) identified landslides should be distributed in areas with high and very high susceptibility; and (b) the proportion of areas with very high susceptibility should be as small as possible.

Based on the preliminary landslide susceptibility map prepared using K-means (Figure 6), nonlandslide point data with the same number of landslides in the research area were randomly selected from very low and low risk areas. Compared with random point selection in the entire area, this method theoretically improves the authenticity of nonlandslide samples, reduces noisy data and enhances the model’s prediction accuracy. Random selection points and K-means selection points are shown in Figure 7.
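The selection rule described above can be sketched as a random draw restricted to cells in the very low and low zones. The function name, the cell identifiers, and the toy zone map are hypothetical, and excluding cells that already hold a recorded landslide is an assumption of this sketch.

```python
import random

def sample_nonlandslide_cells(zone_by_cell, landslide_cells, n_samples,
                              allowed_zones=("very low", "low"), seed=0):
    """Randomly pick nonlandslide cells from the allowed susceptibility zones.

    Assumption: cells that contain a recorded landslide are never picked.
    """
    candidates = sorted(
        cell for cell, zone in zone_by_cell.items()
        if zone in allowed_zones and cell not in landslide_cells
    )
    if len(candidates) < n_samples:
        raise ValueError("not enough candidate cells in the allowed zones")
    return random.Random(seed).sample(candidates, n_samples)

# Toy zone map keyed by a hypothetical cell identifier.
zones = {0: "very low", 1: "low", 2: "medium", 3: "high",
         4: "very high", 5: "low", 6: "very low"}
picked = sample_nonlandslide_cells(zones, landslide_cells={5}, n_samples=3)
print(sorted(picked))
```

For the study area, n_samples would be 122, matching the number of landslide points.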

5.3. FR and CF Models
The FR and CF models are data-driven. Table 3 shows their landslide susceptibility evaluation results after the nonlandslide point factor data were purified through K-means clustering. The FR and CF values reached their highest levels, 1.562 and 0.350, respectively, under the elevation classification of 600–900 m. The shear strength of rocks is lower at gentle slopes; thus, landslides usually occur at moderate and steep slopes, with the highest FR and CF values of 2.469 and 0.595, respectively, at 12–18° slopes. The aspect of a slope affects sunshine duration, rainfall, and other factors. The table shows that most landslides in the area are distributed in the north and northeast directions, with FR and CF values of 1.82 and 0.281, respectively. The curvature results show that landslides are more likely on convex surfaces and less likely on flat and concave surfaces. The NDVI results show that the probability of landslides is greatest under the 0–0.199 classification and decreases as the NDVI value increases. This suggests that the denser the vegetation, the smaller the probability of landslides. In terms of land cover, landslides mainly occur on bedrock and agricultural land. On the other hand, the FR and CF values were higher at the highest and lowest average annual rainfall because high-frequency rainfall can cause landslides, while vegetation cover in areas with less rainfall will be affected, leading to higher values. In terms of lithology, the CF and FR values of the Tangjiaba Formation of the late Cambrian period were the highest at 0.78 and 4.66, respectively. Limestone and a large amount of dolomite were found in this area. Construction of highways and other human activities is also a key factor contributing to landslide incidents. In the 800–1,000 m classification, excavation of slopes led to a large-scale landslide, causing a sharp increase in the FR value. In general, the closer a location is to roads, the greater the FR and CF values. In the analysis of distance to faults, the FR and CF values were the greatest, at 1.65 and 0.39, respectively, in the <350 m classification. The FR and CF values were the highest, at 3.16 and 0.68, respectively, when the distance to rivers was <150 m.
5.4. Landslide Susceptibility Map Analysis
Landslide susceptibility maps of Funing County (Figure 8) were drawn based on the 12 causal factors of landslides using the four evaluation models. The natural breaks method was used to classify the results into five regions: very low, low, medium, high, and very high. The overall results of the landslide susceptibility maps obtained through the four methods were consistent, but the local zoning results differed. The RF and ANN machine learning models assign relatively large areas to the very high class.

[Figure 8, panels (a)–(d): landslide susceptibility maps from the four evaluation models]
It is evident from Figure 8 that areas with high and very high susceptibility are mainly distributed in the central and southeast parts of the research area. A severe landslide occurred in Situn district, Bo’ai town in the northeast of Funing County in 2018, threatening more than 300 residents in the surrounding area and destroying over 200 houses and property worth RMB20 million. Heavy rainfall in November and high water levels at the Baise dam had caused extensive deformation of the slope at the time of the landslide. Some houses were damaged, with ground fissures appearing on the surface. Hence, instantaneous heavy rain and human activities are the main causal factors of landslides in this area. According to historical data and field investigation, areas with heavy rainfall or human activities are more susceptible to landslides. Moreover, inappropriate land use, such as deforestation, excessive road construction, and slope excavation, also increases the risk of landslides.
5.5. Validation of Landslide Susceptibility Evaluation Model
A standard way to compare the prediction performance of classifiers is to plot the Receiver Operating Characteristic (ROC) curve and calculate the Area Under the Curve (AUC) [41]. The AUC value ranges from 0.5 to 1.0, with larger values indicating better model performance.
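The AUC can be computed directly from its rank interpretation: it equals the fraction of landslide and nonlandslide pairs in which the landslide point receives the higher predicted score, with ties counted as half. The sketch below uses that equivalence rather than tracing the ROC curve; the function name and the toy scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute the AUC")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Toy example: one landslide point (score 0.3) is ranked below a nonlandslide point (0.8).
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))  # prints 0.75
```

A model that ranks every landslide point above every nonlandslide point scores 1.0, while random scoring gives about 0.5.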
The evaluation results of ROC-AUC are shown in Figure 9. After the sample data were purified through K-means clustering, the prediction accuracy of the CF model was 85.3%, while the prediction accuracy with random points was only 74.6%. The FR model was used to predict and analyze the purified data and the randomly selected data, with prediction accuracies of 81.2% and 73%, respectively. Meanwhile, the prediction accuracy of the RF model with random points only reached 67.5%, but improved significantly to 82% after K-means clustering and purification. For the ANN model, the prediction accuracy with random points stood at 60.6% and improved to 85.82% following K-means clustering and purification. Consequently, it can be concluded that the purified sample data greatly improved the prediction accuracy of the four evaluation models. The prediction accuracy of the ANN model showed the biggest improvement, at 25.2 percentage points.

[Figure 9, panels (a)–(d): ROC curves and AUC values of the evaluation models]
6. Conclusion
Funing County in China’s Yunnan province was used as a case study in this paper. With a small data sample, K-means clustering was used to purify the selected nonlandslide sample data. On this basis, CF, FR, RF, and ANN models were used to evaluate landslide susceptibility. The improvement of each evaluation model’s accuracy was compared and analyzed after sample purification. The following conclusions can be drawn:
(1) There were significant differences in the importance of each causal factor. The slope topography, NDVI value, and distance to roads had the most significant impact on landslide occurrence in the research area, while plan curvature, average rainfall, and land cover had little impact. Meanwhile, elevation, aspect, profile curvature, distance to rivers, distance to faults, and lithology had a moderate impact on landslide susceptibility in the research area. All 12 factors played a role in causing the occurrence of landslides.
(2) There were major improvements in the accuracy of the evaluation models after K-means clustering was used to purify the sample data of nonlandslide points. The accuracy of the CF, FR, and RF models improved by 10.7, 8.2, and 14.5 percentage points, respectively. Meanwhile, the ANN model reported the biggest improvement of 25.2 percentage points.
(3) The evaluation accuracy of the landslide susceptibility evaluation models differed, but all were above 80% after data purification and prediction validation. The prediction accuracy of the CF, FR, RF, and ANN evaluation models stood at 85.3%, 81.2%, 82%, and 85.82%, respectively.
Data Availability
All data included in this study are available upon request from the corresponding author.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
The authors acknowledge the support of the National Natural Science Foundation of China (Grant no. 42267020) and the Key Research and Development Program of Yunnan Province in 2022 (Grant no. 202203AC100003).