Abstract

The advent of the era of big data has opened a new path for the development of Internet financial credit collection. The traditional methods of credit risk identification for Internet financial enterprises cannot capture the regional division (zoning) characteristics of credit risk, which leads to large errors in the identification results. Therefore, this paper proposes a new method of credit risk identification based on big data for Internet financial enterprises. From the big data perspective, the credit risk assessment steps of Internet financial enterprises are analyzed, the weights of the assessment indicators are calculated using the improved analytic hierarchy process (AHP), and the linear weighted synthesis method is applied to comprehensively assess the credit of clients. Using the regional division characteristics of big data credit risk, the big data credit risk is detected by a rule-based matching method. The eXtreme Gradient Boosting (XGBoost) machine learning algorithm is used to establish a credit risk identification model of Internet financial enterprises. The kappa coefficient and ROC curve are used to evaluate the performance of the proposed method. Experimental results show that the proposed method can accurately assess the credit risk of Internet financial enterprises.

1. Introduction

Internet finance refers to a new financial form that, based on traditional finance, realizes functions such as payment, financing, and credit intermediation by means of Internet technologies such as big data and cloud computing [1]. The forms of Internet finance generally include third-party payment, financial e-commerce, credit evaluation, Internet money funds, big data index funds, and other models. Internet finance merges the Internet field and the financial field. However, it is not a simple combination of the Internet and finance but an innovative transformation of financial business that uses a safe and effective Internet network under certain market conditions. It is the fusion of traditional financial business and the Internet spirit. Internet finance is new, its speed and scale of development far exceed those of other industries, and its prospects are strong; the Internet finance industry is therefore regarded as a sunrise industry. Risk, however, is a major obstacle to the development of the industry, and research on the risk management of Internet finance has become a hot topic in society and academia [2].

Jiang and Yuan [3] proposed a credit evaluation method for e-commerce transactions based on decision analysis. The credit risk influencing factors of business transactions are divided into several evaluation factors by introducing the Kanno doff integral evaluation method, and a credit risk F-integral evaluation model is constructed to divide the transaction credit grade. The interval distribution of commodity prices is used to add points for successful transactions and deduct points for failed transactions at different levels, which addresses the problem of cycle deception in transactions. Based on the credit scoring results, a risk calculation method for e-commerce transactions is designed to evaluate the credit degree of the current transaction. Yang et al. [4] took the enterprise transaction data of an Internet financial platform as the object of study, analyzed the propagation behavior of overdue loan default, and proposed a model that identifies the high-risk enterprises of the platform through the propagation characteristics. After SIS and SIR models are constructed based on threshold propagation and random propagation, the model is transformed into an algorithm that can evaluate the enterprise value at risk, and it is further verified and compared with the actual default data. However, these two traditional methods cannot obtain the regional division characteristics of credit risk, resulting in large errors in the results of credit risk identification. Gao and Xiao [5] studied the management of credit risk for consumer finance using big data. Their risk management model exhibited good prediction ability, could discriminate between normal loan customers and default loan customers, and was appropriate for practical personal credit risk control business. Wang [6] preprocessed the Internet financial credit data and selected the variables for an active credit tracking model based on a BP neural network using an adaptive genetic algorithm.
Liu [7] studied the importance of machine learning and big data as an efficient data exploration approach for insurance risk management using the random forest algorithm. Fatao et al. [8] employed supply chain financial credit risk indicators and built an online evaluation index model for supply chain financial credit risk in commercial banks. Shen [9] established financial risk early warning systems and techniques and took actions to avoid risks and to guarantee the normal operation of Internet banking; financial risk early warning systems based on big data are expected to expand quickly in the era of Internet finance. Lyu and Zhao [10] investigated the use of compressed sensing in the risk evaluation of Internet finance based on big data. Yang et al. [11] developed an Internet supply chain financial risk management model through data science. Zhang [12] constructed a financial investment risk method with the help of an intelligent fuzzy neural network. Similarly, Teles et al. [13] proposed a credit risk prediction model applying artificial neural network (ANN) and Bayesian network models. Xu et al. [14] employed a backpropagation neural network (BPNN) and information entropy to recognize and classify the risk of bank branches. A neural network can fit nonlinear problems without depending on a preset functional form, so it can measure the effect of a risk early warning model more precisely. In this study, we propose a new credit risk identification method for Internet financial enterprises based on big data. The XGBoost algorithm is employed to develop a credit risk identification model of Internet financial enterprises. The performance of the model is evaluated using the kappa coefficient and ROC curve. Results show that the proposed risk identification model can accurately measure the credit risk of Internet financial enterprises.

The rest of the paper is arranged as follows. In Section 2, the index weight calculation method is discussed. Section 3 illustrates the proposed XGBoost algorithm for the identification of credit risk. The results are given in Section 4, and Section 5 concludes the paper.

2. Credit Risk Detection of Internet Financial Enterprises Based on Big Data

When designing the enterprise credit risk assessment index system, it is essential to consider the defects of the traditional enterprise credit risk assessment index systems [15]. In the development of a credit risk assessment index system, dynamic data and static data are integrated to complete data mining, analysis, and modeling from big data. Moreover, the enterprise identity data, behavior data, and external data are used to construct an enterprise credit risk assessment index system [16]. The enterprise credit risk evaluation index system should be designed mainly from the perspectives of income information, loan information, account information, repayment and overdue information, and third-party information, such as multi-end loan information, black and gray lists, and credit information. The enterprise credit risk evaluation index system based on big data is characterized by rich evaluation data items, a combination of static and dynamic data, wide data sources, and timeliness. By contrast, the traditional enterprise credit risk assessment index system has poor population coverage, its data are mostly static, and the authenticity of its data cannot be verified. The data of the big data platform for Internet enterprises are drawn from multiple channels, such as APP product data of companies, credit information data of the PBC, credit information data of Internet credit information companies, online shopping data of e-commerce companies, data of third-party cooperative enterprises, public data collected by crawlers, and data published by the Public Security and Inspection Law. The data used in this study are desensitized in many respects to ensure the safety of the data.

2.1. Credit Evaluation Index Weight Calculation

The improved analytic hierarchy process (AHP) method [17] is used to compute the weights of the evaluation indicators. This method uses the concept of the optimal transfer matrix to improve AHP so that it naturally meets the consistency requirements and yields the weight values directly. The major steps are as follows:

(i) A credit risk evaluation index system is established based on market transactions, and the evaluation indexes are set according to the customer credit evaluation index system.

(ii) A judgment matrix is built. After the matrix is established according to the credit risk evaluation index system, the weight of each level index in the customer credit risk evaluation index system is determined by AHP. By comparing the elements two at a time, the relative importance of each element in a hierarchy with respect to a certain factor in the upper hierarchy is determined, and a judgment matrix is created. The pairwise comparison matrix of the factors for a given criterion is written as

$$A = (a_{ij})_{n \times n},$$

where $a_{ij}$ is the scale of the importance of factor $i$ relative to factor $j$ with respect to the criterion index.

(iii) The weights of the evaluation indexes at all levels are calculated based on the improved AHP method. There is no need to do a consistency check after calculating the weights of the indexes by the improved AHP method. Firstly, the judgment matrix is modified to get the quasi-optimal matrix $A^{*} = (a^{*}_{ij})_{n \times n}$, and the square root method is then used to solve for the eigenvector. Next, the elements of each row of the matrix are multiplied together to get the following expression:

$$M_i = \prod_{j=1}^{n} a^{*}_{ij}, \quad i = 1, 2, \ldots, n.$$

Taking the $n$th root of each product gives $\bar{w}_i = \sqrt[n]{M_i}$, and normalizing the root vector gives the sorting weight vector $P$:

$$P_i = \frac{\bar{w}_i}{\sum_{k=1}^{n} \bar{w}_k}, \quad P = (P_1, P_2, \ldots, P_n)^{T}.$$

The improved AHP method is represented in Figure 1.
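The weight calculation can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes the common optimal-transfer-matrix construction, in which $b_{ij} = \lg a_{ij}$, $c_{ij} = \frac{1}{n}\sum_{k}(b_{ik} - b_{jk})$, and $a^{*}_{ij} = 10^{c_{ij}}$, and then applies the square root method to the resulting matrix. The function name `improved_ahp_weights` and the example matrix are hypothetical.

```python
import math


def improved_ahp_weights(judgment):
    """Return normalized index weights from a pairwise judgment matrix.

    Assumption: the quasi-optimal matrix is built with the optimal
    transfer matrix construction (log10, row-mean differences, 10**c).
    """
    n = len(judgment)
    # b_ij = log10(a_ij)
    log_matrix = [[math.log10(a) for a in row] for row in judgment]
    row_mean = [sum(row) / n for row in log_matrix]
    # c_ij = (1/n) * sum_k (b_ik - b_jk) = row_mean[i] - row_mean[j]
    quasi_optimal = [
        [10 ** (row_mean[i] - row_mean[j]) for j in range(n)]
        for i in range(n)
    ]
    # Square root method: n-th root of each row product, then normalize.
    roots = [math.prod(row) ** (1.0 / n) for row in quasi_optimal]
    total = sum(roots)
    return [r / total for r in roots]


# Hypothetical 3-index judgment matrix (already consistent).
example = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights = improved_ahp_weights(example)
```

Because the quasi-optimal matrix is consistent by construction, the weights need no separate consistency check, which matches the claim made for the improved method.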

2.2. Credit Evaluation Based on Linear Weighted Synthesis Method

The linear weighted synthesis method for customer credit evaluation is a kind of comprehensive method to obtain the comprehensive evaluation value by weighted summation of each index value.

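The evaluation value of the credit risk of the $j$th customer can be calculated as

$$y_j = \sum_{i=1}^{n} w_i x_{ij},$$

where $x_{ij}$ is the index value of the $i$th index of the $j$th customer and $w_i$ is the weight of the $i$th index, $i = 1, 2, \ldots, n$.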

The static client credit risk rating is classified into four levels, with grade 1 denoting 89–100 points, grade 2 denoting 75–88 points, grade 3 denoting 59–74 points, and grade 4 denoting ≤58 points. The weighted credit scoring method is used in static client credit risk rating. The first step is to score the credit survey indicators individually, the second step is to take their weighted average, and the last step is to obtain the credit risk score. This can be computed using the following equation:

$$X = \sum_{i=1}^{n} w_i x_i,$$

where the credit score of a customer is represented by $X$, the weight of the $i$th credit investigation index is represented by $w_i$, with the weights normalized so that $\sum_{i=1}^{n} w_i = 1$, and the evaluation score of the $i$th evaluation index is represented by $x_i$. This method is applied to the quantitative analysis of customer credit. Financial enterprise decision making rests mainly on obtaining the credit status of an enterprise from objective data, and the static client credit rating method is simple to understand and operate.
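The scoring and grading steps above can be illustrated with a short Python sketch. The function names and the example weights are hypothetical, and the handling of fractional scores that fall between two published bands (for example 88.5) is an assumption: each band is treated as starting at its lower bound.

```python
def credit_score(weights, scores):
    """Linear weighted synthesis: sum of weight * index score."""
    if len(weights) != len(scores):
        raise ValueError("weights and scores must have the same length")
    return sum(w * x for w, x in zip(weights, scores))


def credit_grade(score):
    """Map a 0-100 credit score to the four static rating grades.

    Assumption: a band starts at its lower bound, so 88.5 is grade 2.
    """
    if score >= 89:
        return 1
    if score >= 75:
        return 2
    if score >= 59:
        return 3
    return 4


# Hypothetical customer with three indexes.
score = credit_score([0.5, 0.3, 0.2], [90, 80, 70])
grade = credit_grade(score)
```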

2.3. Credit Risk Detection Based on Big Data

According to the regional division characteristics of big data credit risk [18], the credit risk is detected by rule-based matching method. The centralized detection process is shown in Figure 2.

The data are divided into different packets and matched. If the matching process is successful, the output is generated. The credit risk of big data is divided into five areas, and the centralized detection problem is converted into the maximization of a target value. Before the processing step, the data are read and a fitness function is evaluated; the fitness function depends on the actual output of the data, the expected output of the data, and the total amount of data obtained.

The matching process of packets given in Figure 2 is as follows.

2.3.1. High Credit Risk Data Matching Rules

During matching from right to left, if a character of the data does not match the corresponding character of the string, it is handled as follows.

If the mismatched data character occurs in the string, the string is moved so that this character is aligned, as shown in Figure 3.

The move is described by four quantities: the distance by which the string is moved, the length of the string, the position closest to the right end of the string at which the mismatched data character appears, and the position at which the mismatch occurs (measured as a length from the far left).

2.3.2. Matching Rules for Low Credit Risk Data

When different strings are aligned, the moving distance is determined by matching rules of low credit risk data during the matching process from right to left. The specific matching process is as follows.

The string is entered and the moving distance is initialized. Next, the string is traversed from right to left, and the traversal position is analyzed as given in the following equation:

All the strings are aligned and matched one by one from right to left. If the comparison reaches the leftmost end of the string, the matching is successful. This matching process ensures that every moving distance is safe, so no match is omitted. Furthermore, it achieves centralized and precise detection of big data credit risk areas at a high matching speed.
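The right-to-left comparison with a safe moving distance resembles the bad-character heuristic used by Boyer-Moore-style string matchers. The following Python sketch uses the Horspool simplification of that idea; it is a swapped-in, well-known technique chosen for illustration, not the paper's exact matching rules, and the function name `horspool_find` is hypothetical.

```python
def horspool_find(data, rule):
    """Return the first index at which `rule` occurs in `data`, or -1.

    The rule string is compared with the data from right to left.
    On a mismatch, the rule is moved by a precomputed safe distance.
    """
    m, n = len(rule), len(data)
    if m == 0:
        return 0
    # Safe move for each character: distance from its rightmost
    # occurrence in rule[:-1] to the end of the rule.
    shift = {ch: m - 1 - i for i, ch in enumerate(rule[:-1])}
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and data[pos + j] == rule[j]:
            j -= 1
        if j < 0:
            # Comparison reached the leftmost end: match found.
            return pos
        # Characters that do not occur in the rule allow a full move.
        pos += shift.get(data[pos + m - 1], m)
    return -1


index = horspool_find("low risk, high risk", "high")
```

The move table guarantees that no alignment containing a possible match is skipped, which is the sense in which every moving distance is safe.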

3. Credit Risk Identification Model of Internet Financial Enterprises

The credit risk identification model of Internet financial enterprises is constructed based on the XGBoost algorithm [19]. XGBoost is a common and effective open-source implementation of the gradient boosted trees algorithm. It provides good performance because of its robust handling of different data types, distributions, and relationships and the variety of hyperparameters that can be fine-tuned [20]. The XGBoost algorithm can be used for regression, binary and multiclass classification, and ranking problems [21]. The basic element of the XGBoost model is a set of trees, each of which has the binary structure of a classification and regression tree and reflects the actual result of the decision tree [22]. There are two branches in the structure of the decision tree, namely, “Yes” and “No,” which correspond to the right and left branches. Each feature variable is partitioned by the binary tree, and the feature space is partitioned to obtain several leaf nodes.

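Suppose a data set $D = \{(x_i, y_i)\}$, $i = 1, 2, \ldots, n$, in which there are $m$ variables and $n$ samples. Through $K$ additive functions, the output of the prediction model can be obtained based on the regression tree integration model:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F,$$

$$F = \left\{ f(x) = w_{C(x)} \right\}, \quad C: \mathbb{R}^{m} \to \{1, 2, \ldots, T\}, \quad w \in \mathbb{R}^{T},$$

where $F$ represents the regression tree space; $w$ shows the score corresponding to the leaf; $T$ is the number of leaf nodes present in the tree structure; $C$ indicates the tree structure; $f_k$ represents the $k$th tree; and $x_i$ is the independent variable corresponding to the $i$th sample.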

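The tree model is trained with the objective function given in the following equation:

$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k),$$

where $l$ is the convex loss function that measures the difference between the real value $y_i$ and the predicted value $\hat{y}_i$, and $\Omega$ stands for the penalty term, whose expression is as follows:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2},$$

where $\lambda$ represents the regular term, $\gamma$ stands for the leaf node penalty, $T$ is the number of leaf nodes, and $w$ is the vector of leaf scores. The penalty term is mainly used to avoid overfitting problems.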

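In the process of credit risk identification of Internet financial enterprises, the objective function contains functions as parameters and therefore cannot be optimized directly with traditional methods in Euclidean space. Therefore, the credit risk identification model is trained through a boosting (additive) learning strategy, and the specific process is as follows:

$$\hat{y}_i^{(0)} = 0,$$

$$\hat{y}_i^{(1)} = f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i),$$

$$\hat{y}_i^{(2)} = f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i),$$

$$\cdots$$

$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i),$$

where $f_t(x_i)$ represents the function newly added in Round $t$.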

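Based on the above process, the objective function in Round $t$ can be converted as

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{const},$$

where $\mathrm{const}$ is a constant.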

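The fit between the model and the training data in the identification process can be measured by the loss function $l$; the logistic loss function and the square loss function are widely used in the identification process. Substituting the square loss function into the objective function of the credit risk identification model of Internet financial enterprises based on the RB-XGBoost algorithm gives the following equation:

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} \left( y_i - \left( \hat{y}_i^{(t-1)} + f_t(x_i) \right) \right)^{2} + \Omega(f_t) + \mathrm{const} = \sum_{i=1}^{n} \left[ 2 \left( \hat{y}_i^{(t-1)} - y_i \right) f_t(x_i) + f_t(x_i)^{2} \right] + \Omega(f_t) + \mathrm{const},$$

where $\hat{y}_i^{(t-1)} - y_i$ stands for the residual of the previous round, and the terms that do not depend on $f_t$ are absorbed into the constant.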

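The loss function can be approximated by a second-order Taylor expansion to obtain the following expression:

$$l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) \approx l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^{2},$$

where $g_i = \partial_{\hat{y}_i^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right)$ and $h_i = \partial^{2}_{\hat{y}_i^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right)$ are the first and second derivatives of the loss with respect to the previous prediction.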

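When the loss function is a square loss during training, the first and second derivatives $g_i$ and $h_i$ of the loss with respect to the previous prediction $\hat{y}_i^{(t-1)}$ can be computed as

$$g_i = \partial_{\hat{y}_i^{(t-1)}} \left( \hat{y}_i^{(t-1)} - y_i \right)^{2} = 2 \left( \hat{y}_i^{(t-1)} - y_i \right), \quad h_i = \partial^{2}_{\hat{y}_i^{(t-1)}} \left( \hat{y}_i^{(t-1)} - y_i \right)^{2} = 2.$$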

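Substitution of the parameters $g_i$ and $h_i$ (the first and second derivatives of the loss with respect to the previous prediction) in the objective function yields the following expression:

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^{2} \right] + \Omega(f_t) + \mathrm{const},$$

where $\hat{y}_i^{(t-1)}$ represents the output of the model in Round $t-1$ of training and $f_t(x_i)$ represents the dependent variable in the objective function. If $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ is known, the above objective function [20] can be simplified to obtain the following expression:

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^{2} \right] + \Omega(f_t),$$

where $g_i$ and $h_i$ are the parameters in the loss function. In different loss functions, the values of these parameters are different; therefore, the values of $g_i$ and $h_i$ are determined by the loss function. Hence, each tree is redefined by the following equation:

$$f_t(x) = w_{C(x)}, \quad w \in \mathbb{R}^{T}, \quad C: \mathbb{R}^{m} \to \{1, 2, \ldots, T\},$$

where $w$ represents the weights of the leaf nodes in the tree structure, $f_t(x)$ denotes the predicted values obtained by the tree model, and $C$ shows the tree structure.

Model complexity includes two parts: the L2 regularization of the leaf node scores and the total number of leaf nodes. Model complexity can be obtained from the tree definition:

$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}.$$

The smoothness of the leaf node scores can be improved by L2 regularization to reduce overfitting. When the complexity of the model is added, there are two different types of accumulation, one over the samples and one over the leaf nodes. They are unified by grouping the samples by leaf, $I_j = \{ i \mid C(x_i) = j \}$, where $I_j$ represents the set of samples in leaf node $j$. After adding complexity to the objective function, the final objective function, that is, the credit risk identification model of Internet financial enterprises, is obtained:

$$\mathrm{Obj}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^{2} \right] + \gamma T.$$

The credit risk identification model of Internet financial enterprises constructed above is used to complete risk identification.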
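As an illustration of the final objective, the Python sketch below evaluates it for a fixed tree structure under the square loss. It is a minimal sketch rather than the paper's implementation: the function names, the toy data, and the assignment of samples to leaves are hypothetical. The leaf weight $w_j = -G_j / (H_j + \lambda)$, with $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, is the standard minimizer of the per-leaf quadratic in gradient boosted trees and is not stated explicitly in the text.

```python
def squared_loss_grad_hess(y_true, y_prev):
    """First and second derivatives of (y_prev - y_true)**2 w.r.t. y_prev."""
    g = [2.0 * (p - y) for y, p in zip(y_true, y_prev)]
    h = [2.0] * len(g)
    return g, h


def leaf_weights_and_objective(g, h, leaf_of_sample, n_leaves,
                               lam=1.0, gamma=0.0):
    """Evaluate the structured objective for a fixed tree.

    leaf_of_sample[i] is the leaf index assigned to sample i (hypothetical
    stand-in for the tree structure).
    """
    G = [0.0] * n_leaves
    H = [0.0] * n_leaves
    for gi, hi, j in zip(g, h, leaf_of_sample):
        G[j] += gi  # sum of g_i over the samples in leaf j
        H[j] += hi  # sum of h_i over the samples in leaf j
    # Minimizer of G_j * w + 0.5 * (H_j + lam) * w**2 for each leaf.
    w = [-Gj / (Hj + lam) for Gj, Hj in zip(G, H)]
    obj = sum(Gj * wj + 0.5 * (Hj + lam) * wj ** 2
              for Gj, Hj, wj in zip(G, H, w)) + gamma * n_leaves
    return w, obj


# Toy example: four samples, previous prediction 0, two leaves.
g, h = squared_loss_grad_hess([1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0])
weights, objective = leaf_weights_and_objective(
    g, h, leaf_of_sample=[0, 0, 1, 1], n_leaves=2, lam=1.0, gamma=0.5)
```

In this toy case the first leaf holds the two samples with residual $-1$, so its weight is pulled toward $1$ but shrunk by $\lambda$; the second leaf already fits its samples and keeps a weight of zero.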

4. Experimental Results and Analysis

The experiments examine the credit risk evaluation ability of the proposed credit risk identification model of Internet financial enterprises, with the aim of helping the enterprises avoid the risk of electricity arrears and providing a basis for urging the payment of electricity fees. Five Internet financial enterprises in a certain city are selected as experimental objects, and the annual reports of these five companies for the most recent three years are selected as experimental data samples for the experimental analysis.

4.1. Accuracy Test of Credit Risk Assessment with Different Methods

To evaluate the credit risk of five selected Internet financial enterprises, the proposed method is used along with the methods given in [3, 4], and the results of these methods are compared with the actual credit risk of the 5 companies. The experimental results are shown in Figure 4.

It can be seen that the credit risk grade scores of the five enterprises evaluated by the proposed method are closer to the actual credit risk grade scores of the five enterprises. This is because the proposed method effectively combines the actual, objective market transaction data of the five companies to obtain the credit status of the enterprises. In addition, the proposed method obtains detailed information on the credit risk of the enterprises and establishes an accurate credit risk evaluation index system, which makes the results of credit risk evaluation more accurate. The comparison results for the credit risk assessment accuracy of the different methods are shown in Figure 5.

By analyzing Figures 4 and 5, we can see that the proposed method can effectively measure and evaluate all credit risk indicators and obtain the credit risk of enterprises. When the number of samples reaches 20000, the credit risk evaluation accuracy of the methods in references [3, 4] is 0.50 and 0.30, respectively, whereas the accuracy of credit risk evaluation of the proposed method reaches 0.92.

4.2. Kappa Coefficient and ROC Curve Test of Different Methods

To verify the recognition accuracies of all three methods, the kappa coefficient and ROC curve are used. The kappa coefficient measures the difference between the predicted result and the real result. The kappa coefficient can be computed as

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

where $p_o$ represents the proportion of correctly identified samples in the total number of samples and $p_e$ is the randomness ratio, that is, the agreement expected by chance. The higher the kappa coefficient is, the more accurate the recognition result of the method is. The kappa coefficients of the proposed method, the method in reference [3], and the method in reference [4] are shown in Table 1.
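For concreteness, the kappa coefficient $\kappa = (p_o - p_e)/(1 - p_e)$ can be computed from a list of real labels and a list of predicted labels as in the Python sketch below. The function name and the toy labels are hypothetical, and the chance agreement $p_e$ is computed in the usual way from the per-class label frequencies, which is an assumption about how the randomness ratio is obtained here.

```python
from collections import Counter


def kappa_coefficient(y_true, y_pred):
    """Cohen's kappa for two equally long label sequences."""
    n = len(y_true)
    # Observed agreement: share of correctly identified samples.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal class frequencies.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # Note: undefined (division by zero) when p_e == 1.
    return (p_o - p_e) / (1 - p_e)


# Toy example: 1 = high risk, 0 = low risk.
kappa = kappa_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```

A value of 1 means perfect agreement and a value of 0 means agreement no better than chance.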

It is evident that the kappa coefficients of the proposed method are higher than those of reference [3] and reference [4] in multiple iterations, which indicates that the proposed method can accurately identify the credit risks of Internet financial enterprises. This is because the method constructs a risk identification index system based on the data with high balance and completes the identification of credit risks of Internet financial enterprises based on the high-precision risk identification indexes.

In the ROC curve, the abscissa is the false-positive rate and the ordinate is the true-positive rate. The larger the area enclosed by the ROC curve and the abscissa is, the higher the recognition accuracy of the method is. The proposed method and the methods given in reference [3] and reference [4] are used to identify the credit risks of different Internet financial enterprises, and the obtained ROC curves are shown in Figure 6.

Figure 6 shows that the area enclosed by the ROC curve and the abscissa for the proposed method is larger than the corresponding areas for the methods proposed in [3, 4], indicating that the accuracy of the proposed method is higher and that the credit risks of Internet financial enterprises can be identified accurately.
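The area enclosed by the ROC curve and the abscissa can be computed from risk scores as in the following Python sketch, which sweeps the decision threshold, records the false-positive rate on the abscissa and the true-positive rate on the ordinate, and integrates with the trapezoidal rule. The function names and the toy scores are hypothetical, and the sketch assumes that both classes are present and that label 1 marks a high-risk enterprise.

```python
def roc_points(y_true, scores):
    """Return (false-positive rate, true-positive rate) pairs."""
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Process all samples that share the same score together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area between the curve and the abscissa."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example with four enterprises.
points = roc_points([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
auc = area_under_curve(points)
```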

5. Conclusion

The rapid development of Internet finance provides new financing channels for the development of small and microenterprises and individual entrepreneurship. The conventional methods of credit risk prediction of Internet financial enterprises cannot get the characteristics of credit risk zoning, leading to large errors in the results of credit risk identification. In this study, a new method of credit risk identification based on big data for Internet financial enterprises is proposed. The risk evaluation steps of Internet financial enterprises are studied, and the importance of assessment indicators is measured using the improved AHP method. The linear weighted synthesis method is employed to systematically assess the credit of clients. Based on the unique characteristics of big data credit risk region division, the big data credit risk is predicted with the help of rule-based matching method. The XGBoost supervised machine learning algorithm is used to develop a credit risk prediction model of Internet financial enterprises. The performance of the model is evaluated with the kappa coefficient and ROC curve. Experimental results show that the proposed method can correctly assess the credit risk of Internet financial enterprises.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest.