Abstract

Aiming at the problems of high-dimensional features and data imbalance that affect prediction results in credit risk assessment, a credit risk prediction method based on ensemble learning is proposed at three levels: features, data, and algorithm. First, a hybrid filter and Random Forest feature selection method is used to select features: an improved Relief algorithm performs the initial selection, the maximal information coefficient is then combined to eliminate redundant features, and the Random Forest algorithm further reduces the feature dimension. Second, on the basis of the Borderline-SMOTE method, an adaptive idea is introduced to generate a different number of new samples for each minority sample at the boundary, and a new interpolation method makes the distribution of new samples more reasonable, thereby reducing sample imbalance. Finally, Focal Loss is used to improve the loss function of LightGBM; the sample weights are adjusted through the parameters α and γ of the Focal Loss function so that the model pays more attention to minority samples and difficult-to-classify samples, improving classification accuracy. The improved algorithm is then used as the base classifier and integrated with AdaBoost and the random subspace method to establish a credit risk prediction model. Comparative experiments with other methods show that this method effectively improves the G-mean value and the AUC value and has a better default prediction effect.

1. Introduction

The rapid development of Internet technology has had a huge impact on the traditional financial industry, and online lending is an important innovation. Because of its flexible and convenient financing, online lending has become a financing channel for more and more people. On the other hand, credit risk issues have been restricting the development of online lending platforms, and the high default rate has had a strongly negative effect [1]. To solve this problem, many experts are committed to building credit risk prediction models to improve the classification of user defaults. These studies have also confirmed that, compared with traditional credit scoring models, credit risk prediction models built with machine learning methods not only classify better but also save considerable economic costs [26]. Therefore, establishing an effective credit risk prediction model is of great significance to the risk control of borrowers and the continuous development of online lending platforms [7].

In recent years, machine learning algorithms have been widely used in classification models. Zhang et al. [8] established a classification model based on the LightGBM algorithm to evaluate credit risk, and experiments showed that this method has higher classification accuracy. Wang and Sun [9] used new weighted voting parameters to improve the AdaBoost algorithm and proved the superiority of the improved algorithm in classification prediction. Zhao et al. [10] used a weighted hybrid ensemble method to deal with the data classification problem and verified the effectiveness of the method through experiments. Shen et al. [11] first used the SMOTE oversampling method to balance the dataset and then combined it with an AdaBoost-integrated BP-PSO classification model to evaluate credit risk; the results showed that this method improved classification performance. Although the above methods have improved classification to a certain extent, in practical credit risk assessment the high dimensionality of datasets and the imbalance of classes cannot be ignored. High-dimensional features increase the complexity of the model, and an imbalanced ratio of positive and negative samples leads to overfitting, reduces the generalization ability of the model, and seriously affects its classification performance [12]. In response to these problems, many experts and scholars have carried out extensive research and proposed corresponding solutions, mainly at the feature level, the data level, and the algorithm level [13].

1.1. Feature Level

A large number of studies have confirmed that a single feature selection method has many limitations and cannot obtain the best feature subset, whereas hybrid feature selection methods combine the advantages of their subalgorithms and are more stable than any single method. Qiu and Gao [14] used a hybrid filter and an improved adaptive GA feature selection method to process data and used multiple classification algorithms for classification; the results proved that, compared with a single feature selection method, this method effectively improved classification accuracy. Thejas et al. [15] proposed a hybrid filter and wrapper feature selection method that obtains the optimal feature subset through K-means clustering and a normalized mutual information method, combined with the Random Forest algorithm. Pei et al. [16] first used the chi-square filtering algorithm and the LightGBM algorithm to filter the features and then combined them with an adaptive genetic algorithm to obtain the final feature subset. Although the above feature selection algorithms have achieved certain results, for imbalanced datasets the imbalance between positive and negative samples also affects the result of feature selection.

1.2. Data Level

The most common methods to deal with imbalanced datasets are undersampling and oversampling [17]. Chen et al. [18] used the undersampling method to balance the dataset and combined it with an ensemble learning method that introduces parameter perturbations to establish a credit scoring model. Although this method reduces the information loss caused by random undersampling, the effect still needs improvement for datasets with a small sample size. The SMOTE method proposed by Chawla et al. [19] alleviated the overfitting problem to a certain extent. Niu et al. [20] used the SMOTE method to deal with imbalanced datasets and verified its effectiveness in a credit risk assessment model. Khemakhem et al. [21] used random oversampling and synthetic minority oversampling methods to solve the problem of dataset imbalance, and the results showed that oversampling methods can improve classification accuracy. However, the SMOTE method does not differentiate among the minority samples when generating new samples, and it is prone to sample overlap [22]. In this regard, Han et al. [23] proposed the Borderline-SMOTE algorithm to reduce sample overlap. This method only oversamples the minority class samples at the boundary, which easily blurs the boundary between positive and negative classes. Nakamura et al. [24] proposed an improved density-based SMOTE algorithm, which forms clusters according to the classification density of positive samples to control the synthesis of new samples. Tian et al. [25, 26] proposed a minority-class oversampling method weighted by the majority class, combined with a Random Forest credit evaluation model, which obtained a better prediction effect than traditional Random Forest and naive Bayes. In addition, the ADASYN method [27] improves on the SMOTE method by generating a different number of new samples for each minority sample according to the data distribution. Although the distribution of new samples is improved, sample overlap still occurs.

1.3. Algorithm Level

Traditional classification algorithms have limitations in classifying imbalanced data; for this reason, improvements can be made at the algorithm level. The main methods are cost-sensitive learning and ensemble learning [28, 29]. Cost-sensitive learning addresses data imbalance by increasing the penalty cost of misclassifying minority samples; by optimizing the objective function, the classification model pays more attention to the classification accuracy of minority samples. Ensemble learning integrates multiple base classifiers in a certain way to reduce the error a single classifier makes on imbalanced data, thereby improving the overall prediction effect. At present, most approaches combine ensemble learning with sampling methods or cost-sensitive learning. Chen et al. [30] used a misclassification loss function in ensemble classification algorithms, which greatly improved the classification performance on minority classes. Wang and Yan [31] proposed a classification algorithm that combines undersampling and cost-sensitive methods to improve classification performance on imbalanced data.

Based on the above analysis, this paper takes into account the impact of imbalanced data on the screening results at the feature level and proposes a hybrid filter and Random Forest feature selection method to obtain the optimal feature subset; it then mitigates the impact of imbalanced data on model classification results at both the data and the algorithm level, as follows. At the data level, an improved oversampling method, the Borderline Adaptive Synthetic Minority Oversampling Technique (BA-SMOTE), is proposed to generate new samples to balance the dataset. At the algorithm level, Focal Loss is first used to improve the loss function in LightGBM [32], and the improved algorithm is used as the base classifier, combined with the AdaBoost and random subspace algorithms, to obtain an integrated classification model. Through staged comparison experiments and comparisons with other oversampling methods and classification models that deal with imbalanced data (RUSBoost [33], CUSBoost [34], and KSMOTE-AdaBoost [35]), the results show that the improved model proposed in this paper has a better classification effect in credit risk prediction.

The second section introduces the hybrid feature selection method, which combines a filter with Random Forest to obtain a more efficient feature selection method. The third section introduces the specific steps of the improved oversampling algorithm, which improves the sample classification rate by adapting the number of synthesized samples and the interpolation locations for different samples. The fourth section introduces the proposed integrated classification model in detail, in which the improved LightGBM algorithm is integrated by combining the random subspace and AdaBoost methods to obtain the final integrated classification model. The fifth section describes the data source, the data preprocessing process, and the experimental results; the effectiveness of the proposed method is verified by comparative analysis of the experiments. The sixth section gives a summary and analysis.

2. Feature Selection Method of Hybrid Filter and Random Forest

2.1. Hybrid Filter Method
2.1.1. Relief Algorithm

The Relief algorithm [36] is a filter feature selection algorithm that assigns different weights to features according to how differently each feature behaves between same-class and different-class samples, thereby measuring the ability of a feature to distinguish the two classes of samples. Features whose weight falls below a threshold are deleted to achieve dimensionality reduction. The main procedure is to randomly select a sample X from the training set, find its nearest neighbor H among samples of the same class as X, and find its nearest neighbor M among samples of a different class from X. If the distance between X and H on a certain feature is less than the distance between X and M, the feature distinguishes nearby samples well, so its weight is increased. Conversely, if the distance between X and H on a certain feature is greater than the distance between X and M, the weight of the feature is reduced. This process is repeated m times to obtain the average weight of each feature. The weight update formula is

W(j) = W(j) − diff(j, Xi, H)/m + diff(j, Xi, M)/m, (1)

where Xi(j) represents the value of the j-th feature of the sample Xi and m is the number of random extractions. When the feature is a categorical (character-type) feature, the distance function is

diff(j, X1, X2) = 0 if X1(j) = X2(j), and diff(j, X1, X2) = 1 if X1(j) ≠ X2(j). (2)

When the feature is a numerical feature, the distance function is shown in equation (3), where max(j) and min(j) represent the maximum and minimum values of the j-th feature:

diff(j, X1, X2) = |X1(j) − X2(j)|/(max(j) − min(j)). (3)
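As a concrete illustration of equations (1)–(3), the following Python sketch computes Relief weights for a numeric feature matrix; the function name and default values are illustrative rather than part of the original description, and the categorical case of equation (2) is omitted for brevity.

import numpy as np

def relief_weights(X, y, m=100, seed=0):
    """Standard Relief weight update of equation (1) for numeric features;
    a categorical feature would use the 0/1 indicator of equation (2)."""
    rng = np.random.default_rng(seed)
    n, T = X.shape
    span = (X.max(axis=0) - X.min(axis=0)).astype(float)
    span[span == 0] = 1.0                      # guard against constant features
    W = np.zeros(T)
    for _ in range(m):
        i = rng.integers(n)
        xi, yi = X[i], y[i]
        same = np.where(y == yi)[0]
        same = same[same != i]                 # exclude the sample itself
        other = np.where(y != yi)[0]
        H = X[same[np.argmin(np.linalg.norm(X[same] - xi, axis=1))]]    # nearest hit
        M = X[other[np.argmin(np.linalg.norm(X[other] - xi, axis=1))]]  # nearest miss
        # equation (1): penalize separation from the hit, reward separation from the miss
        W += (-np.abs(xi - H) + np.abs(xi - M)) / (span * m)
    return W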

2.1.2. bmRelief Algorithm

Although the Relief algorithm is efficient and widely used on two-class problems, two problems arise when dealing with unbalanced datasets. First, features with strong discriminating ability for minority samples may be ignored. Second, samples are selected at random, without considering the probability that samples at the classification boundary are selected.

Aiming at the problems of the Relief algorithm when dealing with unbalanced data, an improved Relief algorithm, namely, the bmRelief algorithm, is proposed. The random sampling over the whole training set in the original Relief algorithm is replaced by drawing N samples from the minority class and N samples from the majority class to calculate the feature weights, increasing the probability that boundary samples are selected during sampling and, finally, taking the mean value of the feature weights. By balancing the number of positive and negative samples in the random sampling process, both classes have the same influence on the feature importance evaluation, making the final evaluation more accurate. Compared with samples far from the boundary, samples close to the classification boundary contain more classification information and are more useful for measuring the importance of features. Therefore, δ is defined in the bmRelief algorithm to measure the distance of a sample from the boundary:

δi = mi/K. (4)

Here, mi represents the number of different-class samples among the K nearest neighbors of the sample. The larger the value of δ, the closer the sample is to the classification boundary, and the higher the probability with which it should be drawn. Therefore, the value of δ and the probability that the sample is selected to update the feature weights are positively correlated, and the probability Pi that a sample is selected to update the feature weights is defined as an increasing function of δ in equation (5).

When selecting samples to update the feature weights, the bmRelief algorithm fully accounts for the probability of minority-class samples and boundary samples being selected: it draws the same number of samples from the minority class and the majority class, respectively, and dynamically calculates, for each randomly drawn sample, the probability that the sample is used to update the feature weights, so that the selected samples lie closer to the classification boundary. The specific algorithm steps are shown in Algorithm 1.

Input: dataset D = {(xi, yi)}, i = 1, 2, …, n, xi ∈ R^T; number of random samplings N; feature weight threshold β
Output: the filtered feature set S
(1) Set all feature weights W(j) to 0, set F = True, and set S to the empty set
(2)  for i = 1 to N do
(3)   if F == True
    Randomly draw a sample xi from the minority class, F = False
(4)   else
    Randomly draw a sample xi from the majority class, F = True
(5)   Use formulas (4) and (5) to calculate the probability Pi that xi is used to update the feature weights
(6)   Generate a random number μ in [0, 1]
(7)   if μ < Pi
    Find the nearest neighbor sample H among samples of the same class as xi and the nearest neighbor sample M among samples of a different class
    for j = 1 to T do
     Update the feature weight W(j) according to formula (1)
(8) for j = 1 to T do
  if W(j) ≥ β
  Add the j-th feature to the set S
(9) return S
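The sampling logic of Algorithm 1 can be summarized in the following Python sketch. Since the exact form of equation (5) is not reproduced here, the sketch uses the boundary score δ of equation (4) directly as the acceptance probability; the function names, the K value, and the convention that label 1 marks the minority class are illustrative assumptions.

import numpy as np

def boundary_score(X, y, i, K=5):
    """delta_i of equation (4): fraction of different-class samples among the
    K nearest neighbors of sample i (reconstructed from the text)."""
    d = np.linalg.norm(X - X[i], axis=1)
    nn = np.argsort(d)[1:K + 1]                 # skip the sample itself
    return np.mean(y[nn] != y[i])

def bm_relief_draws(X, y, N=100, K=5, seed=0):
    """Alternately draw from the minority and majority class; keep a draw for the
    weight update with a probability that grows with its boundary score."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)           # assumes label 1 is the minority class
    majority = np.flatnonzero(y == 0)
    kept = []
    use_minority = True                         # the flag F of Algorithm 1
    for _ in range(N):
        pool = minority if use_minority else majority
        use_minority = not use_minority
        i = rng.choice(pool)
        if rng.random() < boundary_score(X, y, i, K):
            kept.append(i)                      # this draw updates the feature weights
    return kept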
2.1.3. Maximum Information Coefficient

Although the improved Relief algorithm can avoid the impact of data imbalance on feature selection, it still cannot evaluate the degree of redundancy between features and cannot eliminate redundant features. Therefore, this paper uses the maximal information coefficient (MIC) [37] to filter redundant features. The maximal information coefficient is calculated with mutual information and a meshing method. Mutual information measures the reduction in the uncertainty of one variable given another variable and describes the amount of information shared by two random variables. Suppose random variable X = {xi, i = 1, 2, 3, …, n} and random variable Y = {yi, i = 1, 2, 3, …, n}; p(x) and p(y) represent the marginal probabilities of X and Y, respectively, and p(x, y) represents the joint probability distribution of X and Y. Then the mutual information between X and Y is defined as

I(X; Y) = Σx∈X Σy∈Y p(x, y) log[p(x, y)/(p(x) p(y))].

Suppose a finite set of ordered pairs S = {(xi, yi), i = 1, 2, 3, …, n}. The scatter plot formed by the xi and yi in the set is partitioned into grids, and the mutual information value is calculated for each partition. The maximum value of I(X; Y) over the different partition schemes is taken as the mutual information of the grid, denoted max(I(X; Y)); it is normalized by dividing by log(min(|X|, |Y|)), where |X| and |Y| are the numbers of bins along the two axes, and the maximum information coefficient MIC(X; Y) is obtained as

MIC(X; Y) = max over |X||Y| < B of I(X; Y)/log(min(|X|, |Y|)).

In the formula, B is the upper limit on the grid partition. The larger the value of MIC(x, y), the stronger the redundancy between features x and y. A threshold on the maximum information coefficient is therefore set to eliminate redundant features, as sketched below.
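A minimal sketch of this redundancy filter is shown below, assuming the minepy package is available for computing MIC; the greedy keep-the-earlier-feature rule, the MINE parameters, and the function name are illustrative choices, not specified in the paper.

from minepy import MINE   # one publicly available MIC implementation

def drop_redundant_features(X, feature_names, threshold=0.8):
    """Greedy pairwise filter: a candidate feature is dropped when its MIC with an
    already kept feature exceeds the threshold (0.8 is the value used in Section 5.3.1)."""
    mine = MINE(alpha=0.6, c=15)               # default MINE parameters
    kept = []
    for j in range(X.shape[1]):
        redundant = False
        for k in kept:
            mine.compute_score(X[:, k], X[:, j])
            if mine.mic() > threshold:
                redundant = True
                break
        if not redundant:
            kept.append(j)
    return [feature_names[j] for j in kept]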

2.2. Random Forest Algorithm

When the Random Forest algorithm [38] constructs a decision tree, M samples are drawn from the dataset with replacement each time, for a total of n draws; the samples not drawn in each round are called out-of-bag (OOB) samples. When the Random Forest algorithm is used for feature selection, the importance of each feature is measured by the error on the out-of-bag data: after adding noise to a feature, the prediction accuracy should decrease, and the size of this change determines the importance of the feature, by which the features are ranked. The feature selection procedure is as follows (a sketch follows the list):
(1) Calculate the error value of each decision tree on its out-of-bag data, denoted Error1i (i = 1, 2, 3, …, n).
(2) Keeping the distributions of the remaining features unchanged, add noise interference to the j-th feature and calculate the error value Error2ji (i = 1, 2, 3, …, n) of each decision tree again.
(3) The importance of a feature is the average change between the two errors before and after the perturbation, so the importance of the j-th feature is

Im(j) = (1/n) Σi (Error2ji − Error1i).
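In practice, this criterion is a permutation importance; the sketch below approximates it with scikit-learn on a held-out split rather than the per-tree OOB samples, and the split ratio, the tree settings (50 trees, depth 7, threshold 0.01 from Section 5.3.1), and the function name are illustrative.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def rf_selected_features(X, y, threshold=0.01, seed=0):
    """Permutation importance in the spirit of the OOB criterion above: shuffle one
    feature, measure the drop in accuracy, keep features above the threshold."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    rf = RandomForestClassifier(n_estimators=50, max_depth=7, random_state=seed)
    rf.fit(X_tr, y_tr)
    result = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=seed)
    return np.flatnonzero(result.importances_mean > threshold)   # indices of kept features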

2.3. Feature Selection Method of Hybrid Filter and Random Forest

At present, because of the many limitations of any single feature selection method, many scholars study hybrid feature selection methods, which can both shorten the feature selection time and effectively improve the classification effect of the model. Aiming at the problems that the Relief algorithm is susceptible to unbalanced data and cannot handle redundant features, this paper combines an improved Relief algorithm with the maximum information coefficient method. First, the improved Relief algorithm, namely, the bmRelief algorithm, is used for feature selection to reduce the impact of unbalanced data, and then the maximum information coefficient method is used to measure the dependence between features and eliminate redundant features. To further reduce the data dimension and better fit the classification performance of the subsequent learner, after the initial screening of the variable features by the hybrid filter method, the Random Forest algorithm is used to perform feature selection again, and the variable features with high importance for model training are selected according to a threshold. The flow chart of the feature selection method of hybrid filter and Random Forest is shown in Figure 1, and the specific steps are as follows:
(1) Input the dataset D = {(xi, yi)}, i = 1, 2, 3, …, n, where xi is a sample and yi is its category label.
(2) Divide the dataset into minority class samples and majority class samples, and extract N samples from the minority class and N samples from the majority class according to the bmRelief algorithm.
(3) Calculate the probability that each extracted sample is used to update the feature weights according to formulas (4) and (5).
(4) Set the initial feature weights to 0 and generate a random number μ; if Pi > μ, update the feature weights according to formula (1).
(5) The feature set S is initially empty; set a threshold β, and if the feature weight W(i) > β, add the i-th feature to the set S.
(6) Calculate the maximum information coefficient (MIC) of every pair of variable features in the set S and eliminate redundant features according to the threshold to obtain the set S1.
(7) Use the Random Forest algorithm to calculate feature importance and further screen the features according to the threshold to obtain the final feature subset S2.

3. Improved Oversampling Method

3.1. Borderline-SMOTE Algorithm

The SMOTE [19] algorithm interpolates between all minority samples when oversampling, which can easily cause sample overlap and affect the classification effect of the model. The Borderline-SMOTE algorithm is an improved oversampling method based on the SMOTE algorithm [23]. It generates new samples by random linear interpolation only for the minority samples at the border, thereby reducing sample overlap. The specific algorithm steps are as follows (a usage example follows the list):
(1) Calculate the K-nearest neighbors of each minority sample.
(2) Classify the minority samples according to the distribution of majority samples among the K nearest neighbors. If all K neighbors are majority samples, the minority sample is considered a noise sample; if all K neighbors are minority samples, it is considered a safe sample; if the number of majority samples among the K neighbors is larger than the number of minority samples, the minority sample is considered a boundary sample.
(3) For each minority sample Xi (i = 1, 2, 3, …, n) among the boundary samples, find the M nearest minority samples (Y1, Y2, Y3, …, Ym) according to the Euclidean distance.
(4) Randomly select several samples from these M nearest neighbors, and perform random linear interpolation between each selected sample Yj and the original sample Xi to generate a new sample Snew:

Snew = Xi + rand(0, 1) × (Yj − Xi),

where rand(0, 1) is a random number in the interval (0, 1).
(5) Add the newly generated samples to the original dataset.
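For reference, the Borderline-SMOTE procedure above is implemented in the imbalanced-learn package; the snippet below shows it on a synthetic dataset that stands in for the credit data, and the chosen parameters are illustrative.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE   # implementation of the method in [23]

# Toy imbalanced dataset standing in for the Lending Club data (roughly 17:1).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.94, 0.06], random_state=0)

# Oversample only the borderline minority samples ("borderline-1" variant, K = 5).
sampler = BorderlineSMOTE(kind='borderline-1', k_neighbors=5, random_state=0)
X_res, y_res = sampler.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))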

3.2. Improved Borderline-SMOTE Algorithm

Compared with the SMOTE algorithm, the Borderline-SMOTE algorithm reduces sample overlap, but its way of generating new samples is the same as in SMOTE: the number of new samples synthesized for each minority sample is the same, and the differences between samples are not considered. Moreover, when oversampling the minority samples at the border, the newly generated samples also lie at the border, which easily makes the boundary between the majority and minority classes increasingly blurred and difficult to distinguish.

Therefore, this paper proposes an improved oversampling method that introduces the idea of adaptive density distribution into the Borderline-SMOTE algorithm and uses a new interpolation method to generate new samples, solving the above problems. The algorithm steps are as follows:
(1) Calculate the K-nearest neighbors of each minority sample.
(2) If the number of majority samples among the K nearest neighbors is larger than the number of minority samples, add the minority sample to the boundary sample set.
(3) Calculate the total number of samples to be synthesized, G = (Lmaj − Lmin) × b, where Lmaj is the number of majority samples in the original dataset, Lmin is the number of minority samples, and b is a number in the interval [0, 1].
(4) For each minority sample in the boundary sample set (X1, X2, X3, …, Xn), denoted Xi, count the number of majority samples among the K neighbors of Xi, denoted Ni. The proportion of majority samples among the K nearest neighbors is then Ri = Ni/K, and the sum of these proportions over the boundary set is denoted Z.
(5) Calculate the number of new samples to be synthesized for each minority sample Xi in the boundary sample set, gi = G × ri, where ri = Ri/Z represents the normalized proportion of majority samples around the minority sample Xi.
(6) For each minority sample Xi at the boundary, use the new interpolation method to generate new minority samples.

The new interpolation method is as follows (a code sketch is given after the list):
(1) Randomly select two samples from the K-nearest neighbors of the minority sample Xi, denoted k1 and k2. If both k1 and k2 belong to the majority class, first perform linear interpolation between k1 and k2 to generate a temporary sample Xt, and then perform random interpolation between Xi and Xt; the generated new sample Xnew is added to the minority sample set. Here ε is used to limit the size of the synthesis area, 0 < ε < 1. The interpolation area is shown in Figure 2, where the white circles represent minority samples and the black circles represent majority samples; it can be seen from the figure that the interpolation area is close to the original minority sample.
(2) If k1 belongs to the minority class and k2 to the majority class, the temporary sample Xt and the new sample Xnew are generated by the corresponding formulas, where ε again limits the size of the synthesis area, 0 < ε < 1. The interpolation area is shown in Figure 3; it can be seen that the interpolation area is still close to the minority samples.
(3) If both k1 and k2 are minority samples, a new sample is generated according to formulas (13) and (16).
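The following Python sketch illustrates BA-SMOTE as described above, under stated assumptions: the exact interpolation formulas (13)–(16) are not reproduced here, so all three neighbor-type cases are collapsed into a single ε-shrunk interpolation toward the original minority sample; label 1 is taken as the minority class; and minority samples whose neighbors are all from the majority class are treated as noise and skipped, as in Borderline-SMOTE. Names and default values are illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def ba_smote(X, y, b=1.0, eps=0.5, K=5, seed=0):
    """Sketch of BA-SMOTE: ADASYN-style adaptive synthesis counts for borderline
    minority samples plus interpolation pulled toward the original sample by eps."""
    rng = np.random.default_rng(seed)
    minority, majority = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    G = int((len(majority) - len(minority)) * b)        # total samples to synthesize
    nn = NearestNeighbors(n_neighbors=K + 1).fit(X)
    _, idx = nn.kneighbors(X[minority])                 # neighbors of each minority sample
    maj_ratio = np.array([np.mean(y[nbrs[1:]] == 0) for nbrs in idx])
    border = np.flatnonzero((maj_ratio > 0.5) & (maj_ratio < 1.0))   # borderline, not noise
    weights = maj_ratio[border] / maj_ratio[border].sum()
    counts = np.round(weights * G).astype(int)          # g_i: more synthesis near the boundary
    new = []
    for pos, g in zip(border, counts):
        xi, nbrs = X[minority[pos]], idx[pos][1:]
        for _ in range(g):
            k1, k2 = X[rng.choice(nbrs, size=2, replace=False)]
            xt = k1 + rng.random() * (k2 - k1)                   # temporary sample X_t
            new.append(xi + eps * rng.random() * (xt - xi))      # pull X_new toward x_i
    if not new:
        return X, y
    X_new = np.vstack([X, np.array(new)])
    y_new = np.concatenate([y, np.ones(len(new), dtype=y.dtype)])
    return X_new, y_new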

4. Integrated Classification Algorithm Based on RAdaBoost-FLLGBM

4.1. Improved LightGBM Algorithm Based on Focal Loss
4.1.1. LightGBM Algorithm

LightGBM (Light Gradient Boosting Machine) [39] is a gradient boosting framework based on the decision tree algorithm. Compared with the XGBoost algorithm, it is faster and has lower memory usage. One optimization in LightGBM is the histogram-based decision tree algorithm, which discretizes continuous feature values into K integer values and builds a histogram of width K. When traversing the samples, the discretized value is used as an index to accumulate statistics in the histogram, and the optimal split point is then found by traversing the discrete values in the histogram.

Another optimization in LightGBM is leaf-wise growth with a depth limit. Unlike the level-wise decision tree growth strategy, the leaf-wise strategy finds the leaf with the largest split gain among all current leaves and splits it, which can effectively improve accuracy; a maximum depth limit is added to prevent overfitting.

The principle of the LightGBM algorithm is to use the steepest descent method, taking the negative gradient of the loss function under the current model as an approximation of the residual and fitting a regression tree to it; after multiple rounds of iteration, the results of all regression trees are accumulated to obtain the final result. Unlike the node splitting methods of GBDT and XGBoost, the features are divided into buckets to construct histograms, and the node splitting calculation is then performed on the histograms. For each leaf node of the current model, all features must be traversed to find the feature with the largest gain and its split value. The steps of node splitting are as follows (a sketch of the split search follows the list):
(1) Discretize the feature values, assigning the value of every sample on the feature to a bin.
(2) Construct a histogram for each feature; the histogram stores, for each bin, the sum of the gradients of the samples in the bin and the number of samples.
(3) Traverse all bins, taking the current bin as the split point: accumulate the gradient sum SL and the sample count nL from the leftmost bin up to the current bin, and obtain the gradient sum SR and sample count nR of all bins on the right by subtracting from the totals of the parent node (the histogram difference trick). Calculate the gain according to formula (17), take the maximum gain found during the traversal, and use the corresponding feature and bin value as the node's split feature and split value.
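A minimal sketch of this histogram split search is given below. The paper's gain formula (17) is not reproduced here, so the sketch scores each split with the common variance-style gain SL²/nL + SR²/nR − SP²/nP; bin assignment is assumed to have been done beforehand, and the names are illustrative.

import numpy as np

def best_histogram_split(bin_ids, grad, n_bins):
    """Sweep bins left to right, obtaining right-hand statistics by subtraction from
    the parent totals, and return the bin index and gain of the best split."""
    hist_sum = np.bincount(bin_ids, weights=grad, minlength=n_bins)  # gradient sum per bin
    hist_cnt = np.bincount(bin_ids, minlength=n_bins)                # sample count per bin
    S_P, n_P = hist_sum.sum(), hist_cnt.sum()
    best_gain, best_bin = -np.inf, None
    S_L = n_L = 0.0
    for b in range(n_bins - 1):                                      # split after bin b
        S_L += hist_sum[b]
        n_L += hist_cnt[b]
        S_R, n_R = S_P - S_L, n_P - n_L                              # histogram difference trick
        if n_L == 0 or n_R == 0:
            continue
        gain = S_L**2 / n_L + S_R**2 / n_R - S_P**2 / n_P            # variance-style gain
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain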

4.1.2. FLLGBM Algorithm

Although the LightGBM algorithm is fast and efficient in classification problems, it is susceptible to the impact of data imbalance. In response to this problem, this paper uses the Focal Loss function to modify the LightGBM loss function to reduce the impact of imbalanced data on the classification results.

Focal Loss was proposed to solve the problem of sample imbalance affecting the classification effect in object detection [40, 41]. It is a modification of the standard cross-entropy loss in which the category weights and the weights of easy-to-classify and difficult-to-classify samples are adjusted to improve the classification accuracy of the model. For a label y ∈ {0, 1} and a predicted probability p of the positive class, the cross-entropy loss function is

CE(p, y) = −y log(p) − (1 − y) log(1 − p).

The Focal Loss function introduces the category weight factor α to adjust the weights of the different sample categories and balances the importance of positive and negative samples by increasing the weight of minority samples. After introducing the weight factor α, the loss function becomes

CEα(p, y) = −α y log(p) − (1 − α)(1 − y) log(1 − p).

Although α can balance the importance of positive and negative samples, it cannot distinguish between easy-to-classify and difficult-to-classify samples. To reduce the weight of easily separable samples, Focal Loss adds a coefficient γ > 0 to the loss function. For positive samples (y = 1), the closer the prediction is to 1, the easier the sample is to distinguish and the smaller the adjustment factor (1 − p)^γ becomes, which reduces its loss value and makes the algorithm pay more attention to samples that are difficult to distinguish. The loss function after adding the adjustment factor is

FL(p, y) = −α y (1 − p)^γ log(p) − (1 − α)(1 − y) p^γ log(1 − p).

The FLLGBM algorithm replaces the original loss function of LightGBM with Focal Loss, addressing the problem of sample category imbalance at the algorithm level and further improving the accuracy of the classification model.
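A sketch of how Focal Loss can be plugged into LightGBM as a custom objective is shown below, assuming the α-balanced form given above; the gradient and Hessian are obtained by central finite differences on the raw score, which is a simple but non-optimized choice, and the training call is left as a comment because the way a custom objective is passed differs between LightGBM versions (fobj in older releases, the objective parameter in newer ones).

import numpy as np

def focal_loss_objective(alpha=0.25, gamma=2.0):
    """Focal Loss of Section 4.1.2 as a LightGBM-style custom objective
    returning (gradient, hessian) with respect to the raw scores."""
    def loss(raw_score, y):
        p = 1.0 / (1.0 + np.exp(-raw_score))                  # sigmoid of the raw score
        p = np.clip(p, 1e-9, 1 - 1e-9)                        # numerical safety
        return -(alpha * y * (1 - p) ** gamma * np.log(p)
                 + (1 - alpha) * (1 - y) * p ** gamma * np.log(1 - p))

    def objective(preds, train_data):
        y = train_data.get_label()
        eps = 1e-3                                            # finite-difference step
        grad = (loss(preds + eps, y) - loss(preds - eps, y)) / (2 * eps)
        hess = (loss(preds + eps, y) - 2 * loss(preds, y)
                + loss(preds - eps, y)) / eps ** 2
        return grad, hess
    return objective

# Usage sketch (version-dependent, hence commented out):
# import lightgbm as lgb
# train_set = lgb.Dataset(X_train, label=y_train)
# booster = lgb.train({'learning_rate': 0.1, 'num_leaves': 31},
#                     train_set, num_boost_round=200,
#                     fobj=focal_loss_objective(alpha=0.25, gamma=2.0))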

4.2. RAdaBoost-FLLGBM Algorithm
4.2.1. AdaBoost Algorithm

The AdaBoost algorithm, or adaptive boosting algorithm, is an improved application of boosting ensemble learning. Its principle is to increase the weights of samples misclassified by the previous base classifier and decrease the weights of correctly classified samples, to train the next base classifier with the updated sample weights, and to iterate this process until the error rate is small enough or the maximum number of iterations is reached. Finally, the base classifiers are combined according to their updated weights, and the final integrated classification model is obtained through the sign function.

4.2.2. Random Subspace

The random subspace method (RSM) [42] is an ensemble learning method that trains each classifier on a randomly selected subset of features, thereby increasing the differences between the classifiers and improving the generalization ability of the ensemble model. The specific steps of the random subspace method are shown in Algorithm 2.

Input: dataset X = {xi, i = 1, 2, …, n}
Output: q generated random subspaces S1, S2, …, Sq
(1) Initialization: set the number q of random subspaces and the dimension p of each random subspace.
(2) For the variable features in the dataset, use a random function to extract p of them to obtain one random subspace, and repeat this extraction q times to obtain q random subspaces.
(3) Return the generated random subspaces S1, S2, …, Sq.
4.2.3. RAdaBoost-FLLGBM Algorithm

In order to further improve the classification effect of the model, a credit risk prediction model based on ensemble learning is proposed. This paper uses the AdaBoost algorithm to change the spatial distribution of the training dataset, thereby obtaining different training datasets to train multiple FLLGBM models. However, one problem that needs to be paid attention to in the ensemble learning method is the difference of the base classifiers. If the difference of each base classifier is small, it is difficult to improve the overall classification effect of the ensemble model. In order to increase the difference of the base classifiers, this paper uses the random subspace method to perform feature perturbation on the dataset, so that the input feature attributes of each base classifier are different. The algorithm steps of the final integrated classification model are shown in Algorithm 3. The flow chart of the modeling process is shown in Figure 4. The algorithm flow is shown in Figure 5.

Input: dataset D = {(x1, y1), (x2, y2), (x3, y3), …, (xn, yn)} with labels yi ∈ {−1, +1}, number of iterations T, feature dimension p of the random subspace;
Output: final integrated classifier H(x)
(1) Initialize the sample weights D1(i) = 1/n, i = 1, 2, …, n;
(2)  For t = 1 to T do
(3)   Use the random subspace method to generate a feature subspace St of dimension p on dataset D
(4)   Train the base classifier FLLGBM according to the sample weights Dt and the feature subspace St to obtain Ht
(5)   Calculate the training error of the base classifier Ht: εt = Σi Dt(i) I(Ht(xi) ≠ yi), that is, εt equals the sum of the weights of the misclassified samples
(6)   if εt > 0.5
     then break;
(7)   Calculate the base classifier coefficient αt = (1/2) ln((1 − εt)/εt) and update the training sample weights Dt+1(i) = Dt(i) exp(−αt yi Ht(xi))/Zt, where Zt is the normalization coefficient.
(8) Fuse the output results of the base classifiers and output H(x) = sign(Σt αt Ht(x))
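The following Python sketch mirrors Algorithm 3 with scikit-learn-style LightGBM base learners; it is a simplified rendering in which the base learner is a plain LGBMClassifier (the Focal Loss objective of Section 4.1.2 would be supplied through base_params), labels are assumed to be in {0, 1}, and the names and default values are illustrative.

import numpy as np
import lightgbm as lgb

def radaboost_fllgbm(X, y, T=50, p=10, seed=0, base_params=None):
    """AdaBoost weight updates over base learners trained on random feature subspaces."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    y_pm = np.where(y == 1, 1, -1)                     # AdaBoost uses +/-1 labels
    D = np.full(n, 1.0 / n)                            # initial sample weights D_1(i)
    learners, alphas, subspaces = [], [], []
    for _ in range(T):
        feats = rng.choice(d, size=p, replace=False)   # random subspace S_t
        clf = lgb.LGBMClassifier(**(base_params or {}))
        clf.fit(X[:, feats], y, sample_weight=D)
        pred = np.where(clf.predict(X[:, feats]) == 1, 1, -1)
        eps = np.sum(D[pred != y_pm])                  # weighted training error eps_t
        if eps >= 0.5:
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        D = D * np.exp(-alpha * y_pm * pred)           # up-weight misclassified samples
        D = D / D.sum()                                # normalization constant Z_t
        learners.append(clf)
        alphas.append(alpha)
        subspaces.append(feats)

    def predict(X_new):
        """Sign of the alpha-weighted vote of the base classifiers."""
        score = sum(a * np.where(c.predict(X_new[:, f]) == 1, 1, -1)
                    for a, c, f in zip(alphas, learners, subspaces))
        return (score > 0).astype(int)
    return predict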

5. Empirical Analysis

5.1. Data Source and Processing

The data used in this article are the borrower data for the first quarter of 2018 available on the official website of the P2P online lending platform Lending Club. The first 10,000 user samples are selected, involving 145 fields of information; the main variable features are shown in Table 1. Each user sample contains personal attribute variables and a target variable. The target variable has 7 states, namely, Current, Fully Paid, In Grace Period, Late (16–30 days), Late (31–120 days), Charged Off, and Default. Current and Fully Paid are defined as "good" users and the remaining statuses as "bad" users. The target variable is digitized, with 0 for "good" users and 1 for "bad" users. The loan status distribution is shown in Figure 6. It can be seen from the figure that the dataset is imbalanced, with a ratio of about 17 : 1, which seriously affects the classification effect of the model, so the imbalance of the dataset must be addressed.

Statistical analysis of the dataset shows that the attribute variables comprise 107 numerical variables and 37 character variables. For various reasons, such as the P2P online lending platform not collecting certain information or users not filling it in, 61 variable features in the original dataset contain missing data, and some continuous features are recorded as discrete characters. Therefore, the data must be preprocessed before training the model. This article deletes the features with a missing ratio of more than 60% and converts some character-type features into numerical features. The special-value filling method is adopted for the categorical variables: an empty value is treated as a special attribute value, and all empty values are filled with "Unknown." Missing values of numerical variables are filled with the mean. The categorical data are then one-hot encoded. After missing-value processing, 101 variable features remain. Since there are redundant and meaningless features in the dataset, 25 variables are deleted through data filtering, leaving 76 variable features.
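A condensed pandas sketch of this preprocessing is shown below; the column name loan_status and the helper name are illustrative, and the status values follow the definition of "good" users given in Section 5.1.

import pandas as pd

def preprocess(df, target_col='loan_status'):
    """Binarize the target, drop features with >60% missing values, fill categorical
    gaps with 'Unknown', fill numeric gaps with the column mean, then one-hot encode."""
    good = ['Current', 'Fully Paid']
    y = (~df[target_col].isin(good)).astype(int)          # 0 = "good" user, 1 = "bad" user
    X = df.drop(columns=[target_col]).copy()
    X = X.loc[:, X.isna().mean() <= 0.6]                  # drop features >60% missing
    cat_cols = X.select_dtypes(include='object').columns
    num_cols = X.columns.difference(cat_cols)
    X[cat_cols] = X[cat_cols].fillna('Unknown')           # special-value filling
    X[num_cols] = X[num_cols].fillna(X[num_cols].mean())  # mean filling
    X = pd.get_dummies(X, columns=list(cat_cols))         # one-hot encoding
    return X, y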

5.2. Model Evaluation Indicators

The confusion matrix, also called the error matrix, is mainly used to compare classification results with actual measured values [43]. The confusion matrix of the two categories is shown in Table 2.

We use 0 to indicate the positive class (repayment on time) and 1 to indicate the negative class (default). TP (True Positive) is the number of samples with a true value of 0 and a predicted value of 0, FN (False Negative) the number with a true value of 0 and a predicted value of 1, FP (False Positive) the number with a true value of 1 and a predicted value of 0, and TN (True Negative) the number with a true value of 1 and a predicted value of 1. The evaluation indicators are as follows (a computation sketch follows the list):
(1) Precision and recall. Precision = TP/(TP + FP) is the proportion of correct predictions among all samples predicted as positive; recall = TP/(TP + FN) is the proportion of actual positive samples that are predicted correctly.
(2) Specificity. Specificity = TN/(TN + FP) is the proportion of negative samples that are predicted correctly.
(3) F1-score, the harmonic mean of precision and recall, F1 = 2 × Precision × Recall/(Precision + Recall); its maximum value is 1 and its minimum value is 0.
(4) G-mean, the geometric mean G-mean = sqrt(Recall × Specificity), measures the average performance of the model on the two classes.
(5) ROC curve and AUC value [44]. The abscissa of the ROC curve is the false positive rate, FPR = FP/(FP + TN), the ratio of negative instances mistaken for positive instances to the total number of negative instances; the ordinate is the true positive rate, TPR = TP/(TP + FN), the ratio of correctly predicted positive instances to the total number of positive instances. Because the ROC curve alone does not give an intuitive evaluation of the classification model, the AUC value is introduced: it is the area under the ROC curve and above the x-axis and lies between 0.5 and 1; the closer the AUC value is to 1, the better the prediction effect of the model.
(6) KS value [45]. The KS value, KS = max(TPR − FPR), mainly measures the model's ability to distinguish defaulting users and lies between 0 and 1. If the KS value is less than 0.2, the model is not usable; a KS value greater than 0.3 indicates that the model has good discriminating ability.
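The following sketch computes these indicators with scikit-learn; for simplicity it scores the default class (label 1) as the positive class, which leaves G-mean, AUC, and KS unchanged because they are symmetric in the choice of positive class, and the function name is illustrative.

import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score, roc_curve

def evaluate(y_true, y_pred, y_score):
    """F1, G-mean, AUC, and KS; y_score is the predicted probability of class 1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    recall_pos = tp / (tp + fn)                       # recall of the scored class
    recall_neg = tn / (tn + fp)                       # recall of the other class
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return {
        'F1': f1_score(y_true, y_pred),
        'G-mean': np.sqrt(recall_pos * recall_neg),   # geometric mean of the class recalls
        'AUC': roc_auc_score(y_true, y_score),
        'KS': np.max(tpr - fpr),                      # largest gap between TPR and FPR
    }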

5.3. Experimental Results and Analysis
5.3.1. Phased Experimental Comparison of Feature Selection Methods

To verify the effectiveness of the hybrid feature selection method proposed in this paper, the LightGBM algorithm is used as the classifier to compare the effect of four feature selection methods, Relief, bmRelief, bmRelief + MIC, and bmRelief + MIC + RF, on the classification performance of the model. The Relief and bmRelief algorithms use the same number of sample extractions (100) and the same weight threshold (0.0002); the threshold of the maximum information coefficient method is set to 0.8. In the Random Forest method, the number of trees is 50, the maximum depth is 7, and the feature importance threshold is 0.01. The experiment uses five-fold cross-validation, dividing the dataset into 5 parts; each time, 4 parts are used as the training set and 1 part as the test set, and the final results are averaged. The experimental results are shown in Table 3.

It can be clearly seen from Table 3 that, with the same classifier, the classification effect of the model using the bmRelief algorithm for feature selection is better than that using the Relief algorithm, and the G-mean, AUC, and KS values are significantly improved. This stems from the fact that the bmRelief algorithm pays more attention to the extraction probability of minority samples and boundary samples, so the selected features are more suitable for model classification. After redundant features are removed by the maximum information coefficient method, each index improves further, showing that removing redundant features has a positive impact on the classification effect of the model. The experimental results also show that the feature selection method of hybrid filter and Random Forest gives the best classification effect, with each evaluation index reaching a higher value, which indicates that combining the Random Forest algorithm yields the best feature subset and thus better classification performance, confirming the effectiveness of the proposed hybrid feature selection method. Therefore, the hybrid feature selection method is used to screen the variable features in the subsequent experiments. With the thresholds set above, the finally selected features are shown in Table 4.

5.3.2. Parameter Sensitivity Analysis

The BA-SMOTE oversampling method proposed at the data level requires setting the value of b, which controls the number of new samples to be generated, and the value of ε, which adjusts the size of the interpolation area. To evaluate the influence of b and ε on the results of the algorithm, five classifiers, LightGBM, XGBoost, GBDT, Random Forest (RF), and logistic regression (LR), are selected and tested on the historical borrower data from the Lending Club platform, and the evaluation indicators F1-score, G-mean, AUC, and KS are used to evaluate the influence of the parameters. The experiments are implemented on the PyCharm 2018 platform using five-fold cross-validation: the dataset is divided into 5 parts, each time 4 parts are used as the training set and 1 as the test set, and the final results are averaged.

The value of b controls the sampling magnification; this article sets b = 0.5 and b = 1. The value of ε controls the interpolation area. The larger the value of ε, the more easily new samples are generated close to majority samples, resulting in blurred boundaries; the smaller the value of ε, the closer the new samples are to the minority samples, which effectively reduces boundary blurring but makes sample overlap more likely. Therefore, ε is set to 0.3 and 0.5. The values of b and ε are combined into four groups (0.5, 0.3), (0.5, 0.5), (1, 0.3), and (1, 0.5) for the experiments, with the number of nearest neighbors K set to 5; the results are shown in Table 5. The evaluation indicators in Table 5 show that when (b, ε) is (1, 0.5) the prediction results of the classifiers are better; that is, when the ratio of positive and negative samples is balanced and the interpolation area is limited to a moderate range, the algorithm distinguishes positive and negative samples more easily.

In the FLLGBM algorithm proposed at the algorithm level, the coefficients α and γ are introduced to adjust the sample category weight and the sample difficulty weight, respectively, to improve classification accuracy. To evaluate the influence of α and γ on the results, the values of (α, γ) are set to several parameter combinations, (0.75, 0.2), (0.5, 0.5), (0.25, 1), (0.25, 2), and (0.25, 5); the same five-fold cross-validation method is adopted, and the G-mean and AUC values are used as evaluation indicators. The horizontal axis represents the value of (α, γ), and the results are shown in Figure 7. It can be seen from the figure that when (α, γ) is (0.25, 2), the G-mean and AUC values are higher than those of the other parameter combinations, indicating that this parameter setting gives the FLLGBM algorithm a better classification effect.

5.3.3. Phased Experimental Comparison of Improved Methods

To verify the effectiveness of the improved methods proposed at the data level and the algorithm level, the experiment compares the original LightGBM model, the BA-SMOTE-LightGBM model with oversampling, the BA-SMOTE-FLLGBM model with the modified loss function, and the fully improved BA-SMOTE-RAdaBoost-FLLGBM model. The hybrid feature selection method is first used to screen the features, and the subsequent experiments are conducted on the resulting feature subset. The experimental parameters follow the parameter analysis above: (b, ε) is (1, 0.5) and (α, γ) is (0.25, 2); in the ensemble method, the number of iterations is set to 50 and the dimension of the random subspace to 10. The F1-score, G-mean, KS, and AUC values of each model are shown in Table 6.

The results in the table show that, compared with the original LightGBM algorithm, all indicators of the algorithm after BA-SMOTE oversampling are improved, which confirms the effectiveness of the BA-SMOTE algorithm. With the improved loss function, the evaluation indicators of the FLLGBM algorithm improve further, indicating that the modified loss function makes the model pay more attention to the classification of minority samples and difficult-to-classify samples, thereby improving the classification effect. It can also be seen from Table 6 that the evaluation index values of the improved BA-SMOTE-RAdaBoost-FLLGBM algorithm are all greater than those of the other algorithms, which shows that the integration of AdaBoost and the random subspace method can effectively improve the classification performance of the model.

5.3.4. Comparative Analysis with Other Imbalanced Classification Algorithms

To further prove the effectiveness of the integrated classification model combined with the proposed oversampling method, the method in this paper is compared with the improved algorithms for imbalanced data classification RUSBoost, CUSBoost, and KSMOTE-AdaBoost. After data preprocessing and feature selection, the above algorithms are used for model training, respectively; before training, the model in this paper uses the BA-SMOTE oversampling method to balance the processed dataset. The experiments adopt five-fold cross-validation. The F1-score, G-mean, AUC, and KS values of each algorithm are shown in Table 7, and the ROC curves are shown in Figure 8.

The results in Table 7 and Figure 8 show that, compared with the other classification algorithms for imbalanced data, the integrated classification model proposed in this paper combined with the BA-SMOTE oversampling method has higher accuracy and better classification performance. Compared with the RUSBoost algorithm, the advantage of the proposed algorithm is more obvious, because the uncertainty of random undersampling affects the classification performance of RUSBoost. Compared with the CUSBoost and KSMOTE-AdaBoost algorithms, although the F1-score is not significantly improved, the G-mean, AUC, and KS values are all greatly improved, which confirms that the model in this paper has a better classification effect for credit risk prediction on imbalanced data.

5.3.5. Comparative Analysis with Other Ensemble Classification Algorithms

With the same data balancing method, that is, using the BA-SMOTE method proposed in this paper to balance the dataset, the proposed RAdaBoost-FLLGBM model is compared with other ensemble classification models. The other ensemble classification models include the Bagging-FLLGBM and Bagging-LightGBM models, obtained by integrating the FLLGBM and LightGBM algorithms with the Bagging method, and the RBagging-FLLGBM and RBagging-LightGBM models, which additionally combine the random subspace method. After the same preprocessing, feature selection, and balancing, the data are fed into the above models for training. The experiments again use five-fold cross-validation, and the classification effect is evaluated according to the F1-score, G-mean, AUC, and KS values. The experimental results are shown in Table 8, and the ROC curves are shown in Figure 9.

From the results in Table 8 and Figure 9, it can be seen that the classification effect of the ensemble models combined with the random subspace method is better than that of the ensemble models without it. Compared with the ensemble models without the random subspace method, the G-mean and AUC values of the RBagging-FLLGBM and RBagging-LightGBM models are greatly improved, indicating that increasing the diversity of the base classifiers can effectively improve the classification performance of the ensemble model. Compared with the other algorithms, the RAdaBoost-FLLGBM model proposed in this paper has the highest index values, with the G-mean value increasing by up to 9.41% and the AUC value by up to 6.41%. This shows that on the imbalanced dataset, the model based on the AdaBoost method classifies better than the models integrated by the Bagging method, and it also confirms the effectiveness of the RAdaBoost-FLLGBM-based integrated classification model in credit risk assessment.

6. Conclusion

Credit risk issues have always restricted the development of online lending platforms, and an effective credit risk prediction model is the focus of this research. In practice, high-dimensional features and dataset imbalance seriously affect the classification effect of the model. For this reason, this paper proposes a credit risk prediction method based on ensemble learning at three levels: features, data, and algorithm. The method first uses the hybrid filter and Random Forest feature selection method to select features at the feature level. Second, at the data level, the BA-SMOTE oversampling method is used to balance the dataset, taking into account the differences in sample distribution and alleviating the problem of blurred sample boundaries. Finally, at the algorithm level, Focal Loss is used to improve the loss function of LightGBM to obtain the FLLGBM algorithm, and the AdaBoost and random subspace methods are used to integrate the improved FLLGBM classifiers, yielding an integrated classification algorithm based on RAdaBoost-FLLGBM. This paper verifies the effectiveness of the hybrid feature selection method through staged comparison experiments. After data preprocessing and feature screening with the hybrid feature selection method, the proposed method is compared with other oversampling methods and integrated classification algorithms. The evaluation indicators of the proposed method reach the highest values, among which the G-mean and AUC values increase significantly, which proves that the model in this paper has a better predictive effect in credit risk prediction. However, the proposed model still needs further improvement; in the future, more attention should be paid to the influence of the model parameters on the results.

Although the method proposed in this paper alleviates the impact of high-dimensional features and data imbalance on credit risk assessment, some deficiencies remain. When the amount of data increases, the time cost of the bmRelief algorithm grows, making the feature selection stage more time-consuming. In addition, using the random subspace method for feature perturbation in the integrated classification model can reduce the classification accuracy of some base classifiers and affect the overall prediction effect of the model. Therefore, future research will focus on these two issues, with the expectation of further improving the classification effect of the model.

Data Availability

In this paper, we used the Lending Club dataset, which can be obtained from https://www.lendingclub.com/investing/peer-to-peer.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This study was funded by the National Natural Science Foundation of China (NSFC) (Grant no. 61772160) and the Special Foundation of Scientific and Technological Innovation for Young Scientists of Harbin, China (Grant no. 2017RAQXJ045).