Abstract
Banks face growing business pressure in loan allocation because more and more enterprises are applying for loans and financial risk is becoming harder to assess. It is therefore worthwhile to investigate effective autonomous loan allocation decision schemes that can provide guidance for banks. However, in many real-world scenarios, the credit status of enterprises is unknown and needs to be inferred from their business status. To handle this issue, this paper proposes a two-stage loan allocation decision framework for enterprises with unknown credit status, named TLAD-UC for short. In the first stage, machine learning is introduced to train a prediction model that generates credit status predictions for enterprises with unknown credit status. In the second stage, a dynamic planning model with both an optimization objective and constraint conditions is established, through which both the profit and the risk of banks can be described. By solving this dynamic planning model via computer simulation programs, optimal allocation schemes can be suggested.
1. Introduction
Since the rise of banks, loans have become the most common way of financing enterprise development. Two main parties are involved in general loan activities [1]. One is the party that provides funds, such as financial companies and banks, while the other is the party that applies to borrow funds, such as enterprises [2, 3]. With the continuous growth of urbanization and modernization, the business volume of loans shows a gradually increasing trend [4]. This not only brings greater capital pressure to banks but also increases uncertain financial risk for them [5, 6]. Because the operation status and qualifications of enterprises are diverse, a great challenge is posed to the subsequent management of banks [7, 8]. Therefore, how to calculate optimal loan schemes that maximize profit or minimize risk for banks under limited capital is an important problem [9, 10].
It is never an easy task to determine optimal loan allocation schemes for banks [11]. Although some research works have focused on this issue, most of them did not consider the limitation of the total capital amount [12, 13]. They focused more on deciding whether to provide loans to specific enterprises [14–16] and did not consider more substantive issues such as amount setting or global risk [17–19]. In addition, in the process of loan review, the most important consideration regarding an enterprise is its credit status. Existing research works basically assume that the credit status of enterprises is known [20, 21]. However, in many actual business scenarios, the credit status of enterprises is unknown, which brings many challenges to the formulation of loan allocation plans. Therefore, how to generate the optimal loan allocation scheme for enterprises with unknown credit status is a more realistic problem.
To deal with this issue, this paper proposes a two-stage loan allocation decision framework for enterprises with unknown credit status, named TLAD-UC for short. The first stage tackles the issue that the credit status of enterprises is unknown. For this purpose, a typical machine learning algorithm named K-nearest neighbors (KNN) is used to predict the credit status of enterprises. Specifically, a historical dataset that records information on 123 enterprises with known credit status is selected to train the machine learning-based prediction model. The trained model is then used to generate prediction results for enterprises with unknown credit status. In the second stage, a dynamic planning model is formulated to fit the decision process of banks, in which profit and risk are both expressed in quantified form. The dynamic planning model is composed of both an optimization objective and constraint conditions. By solving the dynamic planning model, the optimal loan allocation decision schemes can be obtained. The main contributions of this paper can be summarized in three aspects:
(1) It is recognized that loan allocation for enterprises with unknown credit status is challenging.
(2) We propose TLAD-UC, a two-stage loan allocation decision framework for enterprises with unknown credit status.
(3) A simulation is conducted on a real-world dataset to demonstrate the workflow of the proposed TLAD-UC.
2. Preliminaries
Two datasets are involved in this work. Dataset A records information on 123 enterprises, including their credit status. Dataset B contains only basic information on 302 enterprises and has no credit status information. In both datasets, each enterprise has business records of input invoices and sale invoices as its basic features. Let $i$ denote the index of an enterprise, ranging from 1 to $N$, where $N$ equals the number of enterprises in the corresponding dataset. Taking the 123 enterprises with credit information as references, the main goal is to determine loan allocation schemes for the 302 enterprises without credit information. To handle this problem, TLAD-UC is implemented in two stages. As shown in Figure 1, the two stages in the architecture of TLAD-UC are the machine learning stage and the optimization decision stage.

In the first stage, the initial business data is preprocessed into a format that is suitable for data analysis models. The KNN model is then trained on dataset A. After training, it can directly predict the unknown credit information of the 302 enterprises in dataset B. In the second stage, the profit and risk associated with the 302 enterprises are quantified via mathematical expressions. On this basis, a dynamic planning model with both an optimization objective and constraint conditions is established from the perspective of banks. The dynamic planning model is then solved by computer simulation programs that search for its optimal solutions. The optimal loan allocation schemes are obtained once the optimization objective has been solved.
3. The Proposed Approach
3.1. Data Preprocessing
In the beginning, the initial datasets go through some basic procedures to extract features. The feature engineering process consists of the following steps:
(1) For each enterprise, the total amount of its input invoices and the total amount of its sale invoices are obtained by aggregating all related records that are labeled as “valid”. For the $i$-th enterprise, the amount of input invoices and the amount of sale invoices are denoted as $a_i^{\mathrm{in}}$ and $a_i^{\mathrm{out}}$, respectively.
(2) Some of the record amounts in the input invoices and sale invoices are less than 0, which means that the corresponding business record is a chargeback record. Similarly, the total amount of chargeback for each enterprise is aggregated. For the $i$-th enterprise, the chargeback amounts in input invoices and sale invoices are denoted as $c_i^{\mathrm{in}}$ and $c_i^{\mathrm{out}}$, respectively.
(3) The ratio of chargeback data is computed for both the input invoices and the sale invoices. For the $i$-th enterprise, the chargeback ratios in input invoices and sale invoices are denoted as $r_i^{\mathrm{in}}$ and $r_i^{\mathrm{out}}$, respectively.
(4) The invoice amounts do not equal the turnover amount because of tax. The real turnover amount of a business record equals the sum of the invoice amount and the tax amount. Thus, the turnover amounts of input business and sale business can be calculated and are denoted as $t_i^{\mathrm{in}}$ and $t_i^{\mathrm{out}}$, respectively. For these two indicators, the average values per day are then computed. For the $i$-th enterprise, the average daily turnover amounts of input business and sale business are denoted as $\bar{t}_i^{\mathrm{in}}$ and $\bar{t}_i^{\mathrm{out}}$, respectively.
(5) For each enterprise, the number of chargeback records is counted. For the $i$-th enterprise, this feature is denoted as $n_i$.
(6) For each enterprise, its turnover amount is processed with a logarithmic operation. For the $i$-th enterprise, the two resulting features are denoted as $u_i^{\mathrm{in}}$ and $u_i^{\mathrm{out}}$, respectively.
(7) For each enterprise, a logarithmic operation is also applied to its turnover amount to obtain two further features. For the $i$-th enterprise, these two features are denoted as $v_i^{\mathrm{in}}$ and $v_i^{\mathrm{out}}$, respectively.
(8) Each enterprise has four possible label options, which correspond to four credit ratings. The label of the $i$-th enterprise is denoted as $y_i$.
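As a minimal sketch of the preprocessing steps above, the following Python snippet aggregates the invoice records of one enterprise on one side (input or sale). The record layout (keys `date`, `amount`, `tax`, `status`), the function name `invoice_features`, the definition of the chargeback ratio as absolute chargeback value over gross non-negative amount, the averaging window, and the natural logarithm are all assumptions made for illustration; they are not taken from the datasets used in this work.

```python
import math
from datetime import date


def invoice_features(records):
    """Aggregate one side (input or sale) of an enterprise's invoice records.

    Each record is a dict with the hypothetical keys "date", "amount",
    "tax", and "status"; this layout is an assumption, not the paper's schema.
    """
    valid = [r for r in records if r["status"] == "valid"]
    if not valid:
        raise ValueError("no valid records for this enterprise")

    total_amount = sum(r["amount"] for r in valid)
    chargebacks = [r for r in valid if r["amount"] < 0]  # negative amount = chargeback
    chargeback_amount = sum(r["amount"] for r in chargebacks)

    # Assumed ratio: absolute chargeback value over the gross non-negative amount.
    gross = sum(r["amount"] for r in valid if r["amount"] >= 0)
    chargeback_ratio = abs(chargeback_amount) / gross if gross > 0 else 0.0

    # Turnover of a record = invoice amount + tax amount.
    turnover = sum(r["amount"] + r["tax"] for r in valid)
    # Assumed averaging window: calendar days from first to last valid record.
    days = (max(r["date"] for r in valid) - min(r["date"] for r in valid)).days + 1
    daily_turnover = turnover / days

    return {
        "total_amount": total_amount,
        "chargeback_amount": chargeback_amount,
        "chargeback_ratio": chargeback_ratio,
        "chargeback_count": len(chargebacks),
        "daily_turnover": daily_turnover,
        # Natural logarithm is an assumption; non-positive turnover maps to 0.0.
        "log_turnover": math.log(turnover) if turnover > 0 else 0.0,
    }


# Made-up records for a single enterprise; the "void" record is ignored.
records = [
    {"date": date(2020, 1, 1), "amount": 100.0, "tax": 10.0, "status": "valid"},
    {"date": date(2020, 1, 2), "amount": -20.0, "tax": -2.0, "status": "valid"},
    {"date": date(2020, 1, 4), "amount": 50.0, "tax": 5.0, "status": "valid"},
    {"date": date(2020, 1, 3), "amount": 999.0, "tax": 99.0, "status": "void"},
]
features = invoice_features(records)
```

Calling the function once on the input invoices and once on the sale invoices of an enterprise would give the paired features described in the steps above.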
As shown in Figure 2, the main workflow of machine learning algorithms is composed of four procedures: data preprocessing, model selection, model training, and prediction. Once data preprocessing is finished, model selection and model training are carried out. For the $i$-th enterprise, its thirteen features can be denoted as $x_{i,j}$, where $j$ ranges from 1 to 13. Given $x_{i,1}, \ldots, x_{i,13}$, it is expected to generate a prediction result for the enterprise. This process can be represented as the following formula:
$$\hat{y}_i = f(x_{i,1}, x_{i,2}, \ldots, x_{i,13})$$
where $f$ denotes the prediction model and $\hat{y}_i$ denotes the predicted credit rating of the $i$-th enterprise.
To realize such a goal, the idea of machine learning is then introduced.
3.2. Prediction of Unknown Credit Information
As mentioned above, dataset A has credit information and dataset B has none. Thus, dataset A is viewed as a golden dataset from which unknown patterns can be discovered. Viewing dataset A as the training set and dataset B as the set to be predicted, a typical machine learning model named KNN is selected for this purpose.
The full name of KNN is K-nearest neighbors, and it can be used for both classification problems and regression problems. KNN performs classification or regression by measuring the distance between the feature values of different samples. Naturally, the selection of the K nearest neighbors is based on distance in the sample space. KNN is a simple but distinctive machine learning algorithm, as it lacks an explicit learning process. Its working principle is to partition the feature vector space by using the training data and to take the partition as the final model. When unlabeled data is entered, each of its features is compared with the corresponding feature of the data in the sample set. Then, the classification labels of the samples with the closest features (the nearest neighbors) are extracted.
We take Figure 3 as an example to illustrate the basic principles of KNN. In the figure, red points and blue points refer to samples that have been labeled, and they belong to two different classes. The task is to generate a classification result for the green point. When $K$ equals 3, the selected neighbors of the green point include two red points and one blue point. According to the majority voting rule, the green point is assigned to the class of the red points. When $K$ equals 5, the selected neighbors of the green point include two red points and three blue points. According to the majority voting rule, the green point is assigned to the class of the blue points. From this example, it can be deduced that the setting of $K$ is quite important in KNN because the composition of the neighboring samples may differ under different settings of $K$. This raises an essential question in KNN: how should distance be measured in the sample space?

In this work, the most prevalent distance measure, the Euclidean distance, is used. Suppose that there are two sample points in the sample space, denoted as $p = (p_1, p_2, p_3, p_4)$ and $q = (q_1, q_2, q_3, q_4)$, both of which are four-dimensional. The Euclidean distance between $p$ and $q$ is calculated as follows:
$$d(p, q) = \sqrt{\sum_{j=1}^{4} (p_j - q_j)^2}$$
It can be seen from the formula that the value of $d(p, q)$ is sensitive to differences in value range among features. For example, if the value range of $p_1$ and $q_1$ is larger than that of the other features, the final value of $d(p, q)$ will be more strongly influenced by this feature. To reduce this effect, all the feature values should be normalized. Commonly, the value range after normalization is fixed as $[0, 1]$. Taking $p_1$ as an example, the normalization is calculated as follows:
$$p_1' = \frac{p_1 - p_{\min}}{p_{\max} - p_{\min}}$$
where $p_{\min}$ denotes the minimum of all the values of this feature in the sample space, and $p_{\max}$ denotes the maximum of all the values of this feature in the sample space. Naturally, all the features need to be normalized before being substituted into models.
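The effect of normalization on the Euclidean distance can be illustrated with a short Python sketch. The function names `min_max_normalize` and `euclidean` and the sample values are hypothetical and chosen only for illustration; mapping a constant feature to 0.0 is likewise an assumption made to avoid division by zero.

```python
import math


def min_max_normalize(column):
    """Rescale one feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature carries no information, map it to 0.0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def euclidean(p, q):
    """Euclidean distance between two samples of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Made-up samples: the second feature has a much larger value range.
samples = [[1.0, 200.0], [2.0, 800.0], [3.0, 500.0]]

# Normalize feature by feature (column-wise), then restore the row layout.
columns = [min_max_normalize(list(col)) for col in zip(*samples)]
normalized = [list(row) for row in zip(*columns)]

raw_distance = euclidean(samples[0], samples[1])           # dominated by feature 2
scaled_distance = euclidean(normalized[0], normalized[1])  # both features contribute
```

Before normalization, the distance between the first two samples is almost entirely determined by the second feature; after normalization, both features contribute on the same scale.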
Therefore, the major procedures of the KNN algorithm can be described as follows:
(1) The distance between the test sample and each training sample is calculated.
(2) All the training samples are sorted in increasing order of distance.
(3) The K samples with the smallest distances are selected as the neighbors.
(4) The occurrence frequency of each category among these K samples is counted.
(5) The category with the highest frequency among the K samples is returned as the predicted class of the test sample.
The above process is summarized in Figure 4.
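The five procedures can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used in this work; the function name `knn_predict` and the coordinates of the toy points are hypothetical, with the points arranged so that they reproduce the situation of Figure 3 (two red and one blue neighbor for K = 3, two red and three blue neighbors for K = 5).

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training samples."""
    # (1) Distance between the query and every training sample.
    distances = [(math.dist(x, query), label) for x, label in zip(train_x, train_y)]
    # (2) Sort by increasing distance.
    distances.sort(key=lambda pair: pair[0])
    # (3) Keep the k nearest samples.
    neighbors = distances[:k]
    # (4) Count how often each category occurs among the neighbors.
    votes = Counter(label for _, label in neighbors)
    # (5) Return the most frequent category; on a tie, the category whose
    #     member was encountered first (i.e. nearest) wins.
    return votes.most_common(1)[0][0]


# Toy data with made-up coordinates; the query plays the role of the green point.
train_x = [(1.0, 0.0), (0.0, 1.5), (2.0, 0.0), (0.0, 2.5), (3.0, 0.0)]
train_y = ["red", "red", "blue", "blue", "blue"]
green = (0.0, 0.0)

label_k3 = knn_predict(train_x, train_y, green, k=3)
label_k5 = knn_predict(train_x, train_y, green, k=5)
```

With these toy points, the predicted class changes from red to blue when K is increased from 3 to 5, which illustrates why the setting of K matters.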

3.3. Model Evaluation and Prediction
After training a KNN model, the credit status information in dataset B can be predicted accordingly. Before that, the performance of the KNN model is evaluated. Dataset A is further divided into two parts: a training part and an evaluation part. Of the 123 samples, the training part has 93 samples and the evaluation part has 30 samples. The 93 samples are used to train a KNN model, and the 30 samples are used to evaluate its performance because they are labeled: their labels are first withheld and then compared with the predicted labels.
The KNN model outputs prediction results for the 30 evaluation samples, of which 18 are correct and the other 12 are incorrect. Thus, the prediction accuracy on the evaluation data is 18/30 = 0.6. Although this accuracy is not ideal, it can still offer some guidance for enterprises with unknown credit status, because the model predicts their credit status with some reliability. After evaluation, the KNN model is applied to dataset B to predict the unknown credit status of its enterprises. The KNN model is implemented in Python, and the running result of the program is shown in Figure 5.
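The evaluation protocol can be sketched as follows. The helper names `holdout_split` and `accuracy` are hypothetical, and the random shuffling with a fixed seed is an assumption, since the way the 93/30 split was drawn is not specified here. The label lists are placeholders that only reproduce the reported counts (18 correct, 12 incorrect), not the actual predictions.

```python
import random


def holdout_split(samples, n_eval, seed=0):
    """Shuffle the samples and hold out n_eval of them for evaluation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # assumption: the split is random
    return shuffled[n_eval:], shuffled[:n_eval]


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the withheld labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# 123 labeled enterprises (represented by their indices) -> 93 training, 30 evaluation.
train_part, eval_part = holdout_split(range(123), n_eval=30)

# Placeholder labels reproducing the reported counts: 18 correct, 12 incorrect.
true_labels = ["A"] * 30
predicted_labels = ["A"] * 18 + ["B"] * 12
eval_accuracy = accuracy(true_labels, predicted_labels)
```

With so few evaluation samples, the accuracy estimate is coarse: each evaluation sample changes it by about 0.033.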

In summary, among the 302 enterprises in dataset B, 27 are labeled as credit rating A, 149 as credit rating B, 74 as credit rating C, and 52 as credit rating D. In the next stage, the optimization decision model is formulated on the basis of these prediction results.
3.4. Optimization Decision
To generate optimal allocation decisions for enterprises, a dynamic planning model is formulated in this section.
From the side of banks, the total income from loan activities can be represented by a formula over the following quantities. $n_A$, $n_B$, $n_C$, and $n_D$ are the numbers of enterprises with the four different credit ratings. $m_A$, $m_B$, $m_C$, and $m_D$ are the loan amounts for enterprises with the four different credit ratings. $r_A$, $r_B$, $r_C$, and $r_D$ are the interest rates for enterprises with the four different credit ratings. $p_A$, $p_B$, $p_C$, and $p_D$ denote the proportions of no default for enterprises with the four different credit ratings.
On the bank side, the risk in loan activities can likewise be represented by a formula, in which $w_A$, $w_B$, $w_C$, and $w_D$ denote the ratios of enterprises with the four different credit ratings.
Besides, there are also some constraint conditions to be satisfied as follows:(1);(2);(3);(4).Here, denotes the total amount that can be used for loan activities in banks.
Further, the total profit on the bank side can be represented by a formula in which $q_A$, $q_B$, $q_C$, and $q_D$ denote the customer loss ratios of enterprises with the four different credit ratings. Then, the optimization objective can be formulated from two aspects: risk minimization and profit maximization.
For risk minimization, the optimization model can be formulated as follows:
Substituting , , , , , , , , , , and into the model, the total profit and total risk can be written as follows:
Assuming that the total amount for loan activities is set as 1, the optimal allocation scheme is computed as follows:
Note that enterprises with credit rating D are not approved for loans here, and the interest rate is set at 0.08.
For profit maximization, the optimization model can be formulated as follows:
Substituting , , , , , , , , , , and into the model, the optimal decision for allocation schemes can be represented as follows:
Note that enterprises with credit rating D are not approved for loans here, and the interest rate is set at 0.15.
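To make the idea of searching for an allocation under a total-amount constraint concrete, the following Python sketch enumerates allocations to ratings A, B, and C on a grid, with the total amount normalized to 1 and nothing allocated to rating D. It is an illustration only and not the model of this paper: the brute-force grid search, the no-default proportions in `NO_DEFAULT`, the per-rating cap `CAP`, and the simplified objectives `expected_risk` and `expected_income` are all hypothetical choices made so that the example is self-contained.

```python
from itertools import product


def grid_search(objective, feasible, steps=20, maximize=True):
    """Enumerate (x_A, x_B, x_C) with x_A + x_B + x_C = 1 on a grid of 1/steps."""
    best_x, best_value = None, None
    for a, b in product(range(steps + 1), repeat=2):
        c = steps - a - b  # the remainder goes to rating C
        if c < 0:
            continue
        x = (a / steps, b / steps, c / steps)
        if not feasible(x):
            continue
        value = objective(x)
        better = (best_value is None
                  or (value > best_value if maximize else value < best_value))
        if better:
            best_x, best_value = x, value
    return best_x, best_value


# Hypothetical illustration values, not taken from the paper.
NO_DEFAULT = (0.98, 0.95, 0.90)  # assumed no-default proportions for A, B, C
RATE = 0.08                      # interest rate reported for risk minimization
CAP = 0.5                        # assumed cap on the share given to one rating


def feasible(x):
    return all(share <= CAP for share in x)


def expected_risk(x):
    # Simplified risk: share of the total amount expected to default.
    return sum(share * (1 - p) for share, p in zip(x, NO_DEFAULT))


def expected_income(x):
    # Simplified income: interest earned on the non-defaulting share.
    return sum(share * RATE * p for share, p in zip(x, NO_DEFAULT))


min_risk_x, min_risk = grid_search(expected_risk, feasible, maximize=False)
max_income_x, max_income = grid_search(expected_income, feasible, maximize=True)
```

Because both simplified objectives are linear in the allocation shares, a linear programming solver would be the more natural tool; the exhaustive grid is used here only because it makes the roles of the objective and the constraints explicit.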
To visualize the allocation results more clearly, Figure 6 shows the allocation results for the three kinds of enterprises as a stacked bar chart. In the figure, the blue bar corresponds to the allocation for enterprises with credit rating A, the green bar to the allocation for enterprises with credit rating B, and the yellow bar to the allocation for enterprises with credit rating C, while no allocation is provided for enterprises with credit rating D. The results under the two situations are also illustrated separately in Figure 7, in which the two subfigures correspond to risk minimization and profit maximization. The interest rates under the two situations are visualized in Figure 8, a bar chart with two bars, in which the blue bar corresponds to the interest rate under risk minimization and the red bar corresponds to the interest rate under profit maximization.



4. Discussion about Machine Learning Application
This work deals with loan allocation decisions in situations where the credit status of enterprises is unknown. For this reason, machine learning is introduced to predict the unknown credit status. Machine learning models rest on simple principles and are more flexible than general mathematical modeling approaches. Besides, machine learning models are well supported, as many available interfaces can be imported directly. They can act as an alternative to time-consuming manual decision tasks and can even be comparable to expert experience in some situations.
However, machine learning models also have some limitations. The most common issue is that they are highly reliant on labels and sample size, because they need to be trained on gold labels in the training set and are quite sensitive to the number of samples. In other words, training an effective machine learning model incurs a cost. In addition, the selection of features may also affect how well the models fit, which is related to the explainability problem of general machine learning models. Due to this weak explainability, building the models may involve much redundant effort. On the whole, however, machine learning models can still work as a feasible solution in many business scenarios.
5. Conclusion
This paper focuses on a smart finance task using machine learning methods. To fill in the unknown credit status of enterprises, this work uses the KNN model. After that, a dynamic planning model is used to realize the decision-making process. The whole technical framework, named TLAD-UC for short, is composed of these two stages. A real-world dataset is selected to evaluate the proposed TLAD-UC, and a case study is presented to show the workflow of the proposal. The current technique is still an initial exploration of this area, and its efficiency needs to be further improved in future work. Future work is therefore expected to improve the technical methods and the quality of the decisions, and the idea of fully autonomous decision-making may also be considered.
Data Availability
The research data can be requested from the first author via e-mail.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported in part by National Natural Science Foundation of China (No. 71902007), in part by the General project of Social Science of Beijing Municipal Education Commission (SM202210011004), and in part by Open Project Program of National Research Centre for Agri-Product Quality Traceability (AQT-2022-YB4).