Abstract

Financial data fraud by listed companies has an extremely bad impact on the market and society, and predicting such fraud in advance can reduce losses. The key to solving this problem is therefore to build a financial fraud prediction model. This paper analyzes prediction and identification models of financial fraud at home and abroad in detail and identifies the problems existing in these prediction models. In view of these shortcomings, this paper proposes building a financial fraud prediction model based on a machine learning fusion model. The first step is the unbalanced processing of data samples: an oversampling method with a reasonable sampling ratio is used to improve the prediction effect of the model. Then, four machine learning models suitable for financial data (GBDT, random forest, support vector machine, and decision tree) are selected, and the training set is used to optimize the hyperparameters of each model separately. This paper proposes integrating the random search and grid search mechanisms to tune the parameters to their optimum. Finally, a financial fraud prediction model is constructed through multimodel fusion based on a stacking ensemble learning framework. The first-layer base learners integrate the predicted results of the four models using five-fold cross-validation on the training set, and the second-layer meta-learner uses the GBDT model to train on the integrated data from the first layer, resulting in a fusion model. The experimental results show that the AUC value of the fusion model is significantly higher than that of any single model. Therefore, the fusion model proposed in this paper can effectively improve the prediction effect.

1. Introduction

Today, with the rapid development and growth of the securities market, reports of financial fraud by listed companies emerge one after another. From the falsification of IPO financial data at the time of a company's listing to falsified financial reports issued after listing, investors understandably have serious doubts about the authenticity of the financial reports of listed companies. Because the methods of financial fraud are complicated and well hidden, there is information asymmetry between enterprises and investors. How to effectively predict the financial fraud behavior of enterprises is therefore considered a scientific research problem that urgently needs to be solved, both at home and abroad.

This paper makes an in-depth analysis of the research status at home and abroad. Goel and Uzuner use natural language processing techniques to study whether emotion is expressed in financial reports. Their research shows that fraudulent financial reports use more positive and negative emotion and contain more adjectives and adverbs than genuine financial reports [1]. The authors of [2] compare and analyze four methods (logistic regression, decision tree, ANN, and expert systems) to identify 129 fraudulent companies and 447 non-fraudulent companies and to examine the difference between model recognition and expert judgment; the study found that the ANN has the best recognition effect. The authors of [3] use decision tree, logistic regression, and neural network algorithms to predict the financial fraud of listed companies in Turkey. Glancy et al. [4] use singular value decomposition and clustering technology to mine the text information in sample financial reports and build a financial fraud identification model; their results show that the model has a good identification effect.

Qian and Luo took the financially fraudulent listed companies in the Shanghai and Shenzhen stock markets from 1994 to 2011 as observation samples and, for the first time, comprehensively examined the characteristic indicators of financial fraud and earnings quality problems studied at home and abroad, establishing a more reliable and practical financial fraud prediction model suitable for the Chinese market [5]. However, that model mainly focuses on the selection of indicators. The authors of [6] used logistic regression, Bayesian discriminant analysis, and decision trees to establish corresponding recognition models and then analyzed the recognition effects of the various models. However, that study selects only 104 companies with proven financial fraud as the fraud sample and 208 non-fraud companies as the normal sample. Because the proportion of financial fraud in the sample is seriously inconsistent with the actual situation and the number of samples is too small, the model may overfit.

The authors of [7] use the logistic regression method, the support vector machine method, and the decision tree method in machine learning to classify the training set. All three models have high accuracy, and their prediction ability is ranked from best to worst as decision tree, support vector machine, and logistic regression, so the decision tree model is considered the better choice for a company's financial prediction [7]. The authors of [8] used decision tree, XGBoost, and random forest models to identify enterprise financial fraud. Their empirical study found that the random forest model had the highest recall among the three models, so they considered the random forest model the optimal model [8].

After analyzing the prediction effect of current financial fraud prediction models, it is found that these models have problems. Mainly, a single model may have defects such as excessive bias or weak generalization ability, and hierarchical ensemble learning can improve the prediction effect of the model. Therefore, this paper proposes building a financial fraud prediction model based on a machine learning fusion model.

The process of building the prediction model is divided into three steps. The first step is data preprocessing, mainly unbalanced data processing. Because the sample categories are unbalanced, the prediction of the model may be biased towards the normal samples; setting a reasonable sampling ratio improves the prediction effect, and the oversampling method is used to deal with the unbalanced data. The second step is model parameter tuning. This paper selects four machine learning models suitable for financial fraud prediction, namely support vector machine, decision tree, GBDT, and random forest, and proposes integrating the random search and grid search mechanisms, using the training set to optimize the hyperparameters of each machine learning model separately so that each model achieves its best prediction effect. The third step is model fusion. A stacking ensemble learning framework with two layers of learners is adopted to fuse the four machine learning models. The base learners in the first layer integrate the predictions of the four models, and the meta-learner in the second layer uses the GBDT model to train on the hybrid data obtained from the first layer, finally producing a fusion model. The experimental results show that the AUC value of the fusion model is significantly higher than that of any single model, so the fusion model can effectively improve the prediction of financial fraud.

2. Machine Learning Model Analysis

2.1. GBDT Model

GBDT stands for gradient boosting decision tree, an iterative algorithm composed of multiple decision trees. The decision conclusions of all the trees are combined as the final prediction of the model, and the algorithm can flexibly handle various types of data, including continuous and discrete values. The core of GBDT is the descent of the loss function in the direction of the gradient: at each iteration, GBDT fits the negative gradient of the loss function under the current model [9].

GBDT first builds a set of weak learners and then accumulates the results of multiple trees as the final prediction output. Given the input training set, GBDT is first initialized to obtain the weak learner shown in the following equation:

$$f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c).$$

In each iteration $m$, for each sample $i = 1, 2, \ldots, N$, the negative gradient of the loss function, that is, the residual, is calculated as shown in the following formula:

$$r_{mi} = -\left[\frac{\partial L\bigl(y_i, f(x_i)\bigr)}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}.$$

After computing the residuals, they are used as the new target values, and the data $(x_i, r_{mi})$ are used as the training data for the next tree to obtain a new regression tree. The corresponding leaf node regions are $R_{mj}$, $j = 1, 2, \ldots, J$, where $J$ denotes the number of leaf nodes. We calculate the best-fit value of each leaf region as shown in the following equation:

$$c_{mj} = \arg\min_{c} \sum_{x_i \in R_{mj}} L\bigl(y_i, f_{m-1}(x_i) + c\bigr).$$

Then we update the strong learner as shown in the following equation:

$$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} c_{mj}\, I\bigl(x \in R_{mj}\bigr).$$

Thus, the final learner is obtained, as shown in the following equation:

$$\hat{f}(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} c_{mj}\, I\bigl(x \in R_{mj}\bigr).$$

The GBDT algorithm is divided into two parts: the GBDT regression algorithm and the GBDT classification algorithm. The fundamental difference between the two is the choice of the loss function: the GBDT regression algorithm uses the squared-error loss, while the GBDT classification algorithm chooses the logarithmic loss function. The algorithm steps are as follows:

Step 1. Initialize the learner with the constant $f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)$ and set the iteration counter $m = 1$.

Step 2. Calculate the negative gradient value of the loss function for the $i$th sample at the $m$th iteration, $r_{mi} = -\left[\partial L(y_i, f(x_i)) / \partial f(x_i)\right]_{f(x) = f_{m-1}(x)}$.

Step 3. Fit a regression tree to the residuals $r_{mi}$: choose the $k$th feature and the split value that minimize the fitting error as the optimal splitting variable and split point, partition the samples into subregions accordingly, and recursively apply this step to each subregion until the stopping condition on the number of samples in a region is met, giving the leaf regions $R_{mj}$, $j = 1, 2, \ldots, J$; then go to Step 4.

Step 4. Calculate the optimal output value of each leaf region for the $m$th iteration, $c_{mj} = \arg\min_{c} \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + c)$.

Step 5. Update the learner, $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} c_{mj}\, I(x \in R_{mj})$. If $m = M$, the iteration ends and the final result is output; otherwise, set $m = m + 1$ and return to Step 2.
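For reference, the sketch below shows how a GBDT classifier of the kind described above could be trained and evaluated with scikit-learn. The synthetic dataset, the class-imbalance ratio, and the parameter values are illustrative assumptions, not the data or the tuned settings reported later in this paper.

```python
# Minimal GBDT classification sketch; dataset and hyperparameters are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced financial feature matrix and fraud labels.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# The default log-loss corresponds to the logarithmic loss of the GBDT classification algorithm.
gbdt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0)
gbdt.fit(X_train, y_train)

print("test AUC:", roc_auc_score(y_test, gbdt.predict_proba(X_test)[:, 1]))
```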

2.2. Analysis of Random Forest (RF) Algorithm

Random forest is a supervised machine learning algorithm and an ensemble learning algorithm. A random forest is a classifier that contains multiple decision trees and whose output classes are determined by the plurality of the classes output by the individual trees. A random forest is a fusion of multiple decision trees to obtain a more accurate and stable model. The construction of a random forest is divided into the following three steps: extracting the dataset and generating a training set for each tree; constructing each decision tree; and forming a random forest model and evaluating the model.

A random forest uses the idea of a bagging algorithm to construct multiple decision trees. When a prediction is needed for a sample, the prediction results of each tree in the forest are counted for that sample, and then the final result is selected from these predictions by the voting method. After each base learner has finished predicting the samples, the predictions of all base learners need to be combined to output the final result [10].

2.2.1. Averaging Method

The averaging method can be divided into the simple averaging method and the weighted averaging method. Let us assume that the model contains $T$ base learners $\{h_1, h_2, \ldots, h_T\}$, where the output of $h_i$ on example $x$ is denoted as $h_i(x)$. The specific formulas are as follows:

Simple averaging: $H(x) = \frac{1}{T} \sum_{i=1}^{T} h_i(x)$.

Weighted averaging: $H(x) = \sum_{i=1}^{T} w_i h_i(x)$, where $w_i$ is the weight of the individual learner $h_i$, with $w_i \ge 0$ and $\sum_{i=1}^{T} w_i = 1$.

2.2.2. Voting Method

The voting method can be divided into the absolute majority voting, relative majority voting, and weighted voting methods. Let us assume that the model contains $T$ base learners $\{h_1, h_2, \ldots, h_T\}$ and that the set of category tags is $\{c_1, c_2, \ldots, c_N\}$, where $h_i^j(x)$ denotes the output of $h_i$ on category tag $c_j$. The specific formulas are as follows:

Absolute majority voting: predict $c_j$ if $\sum_{i=1}^{T} h_i^j(x) > \frac{1}{2} \sum_{k=1}^{N} \sum_{i=1}^{T} h_i^k(x)$; otherwise, refuse to predict.

Relative majority voting: $H(x) = c_{\arg\max_j \sum_{i=1}^{T} h_i^j(x)}$.

Weighted voting: $H(x) = c_{\arg\max_j \sum_{i=1}^{T} w_i h_i^j(x)}$, where $w_i$ is the weight of the individual learner $h_i$.
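To make the combination rules concrete, the following sketch applies simple averaging, weighted averaging, and relative majority (plurality) voting to the outputs of T base learners. The number of learners, the random class-probability outputs, and the weights are arbitrary assumptions chosen only for illustration.

```python
# Illustrative combination of T base learners' class-probability outputs.
import numpy as np

T, n_samples, n_classes = 4, 5, 2
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(n_classes), size=(T, n_samples))  # h_i(x) as class probabilities

# Simple averaging: H(x) = (1/T) * sum_i h_i(x)
simple_avg = probs.mean(axis=0)

# Weighted averaging with weights w_i summing to 1 (values arbitrary here).
w = np.array([0.4, 0.3, 0.2, 0.1])
weighted_avg = np.tensordot(w, probs, axes=1)

# Relative majority (plurality) voting on the hard class predictions.
hard_votes = probs.argmax(axis=2)                                   # shape (T, n_samples)
vote_counts = np.apply_along_axis(np.bincount, 0, hard_votes, minlength=n_classes)
plurality = vote_counts.argmax(axis=0)

print(simple_avg.argmax(axis=1), weighted_avg.argmax(axis=1), plurality)
```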

The random forest adopts the voting results of multiple decision trees. Since the decision is made by a vote over multiple trees, the margin (edge) function is defined as the difference between the average number of votes for the correct class of a sample and the maximum average number of votes received by any incorrect class, denoted as $mg(X, Y)$. The equation is as follows:

$$mg(X, Y) = \mathrm{av}_k\, I\bigl(h_k(X) = Y\bigr) - \max_{j \ne Y} \mathrm{av}_k\, I\bigl(h_k(X) = j\bigr),$$

where $h_k$ is the $k$th decision tree, $I(\cdot)$ is the indicator function, and $\mathrm{av}_k(\cdot)$ denotes the average over the trees.

From the definition of the margin function, it can be seen that the larger the value of the margin function, the higher the confidence of the random forest classification result. Meanwhile, the generalization error of the random forest model can be obtained from the margin function as

$$PE^{*} = P_{X, Y}\bigl(mg(X, Y) < 0\bigr).$$

By the law of large numbers, as the number of decision trees $k$ tends to infinity, the generalization error converges and never exceeds a fixed upper bound. Let the average correlation between the decision trees in the random forest be $\bar{\rho}$ and the average strength of the trees be $s$; the bound is then expressed as

$$PE^{*} \le \frac{\bar{\rho}\,(1 - s^{2})}{s^{2}}.$$

From the above equation, it is clear that random sampling of samples and attributes can reduce the correlation between decision trees. When the average correlation between the decision trees decreases, the generalization error decreases, and the classification generalization performance of the random forest is enhanced.
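In line with this convergence behavior, the short sketch below shows how the test AUC of a random forest typically stabilizes as trees are added. The synthetic dataset and parameter values are assumptions made only for illustration.

```python
# Sketch: effect of the number of trees on random forest test AUC (illustrative data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for n_trees in (10, 50, 200):
    # max_features="sqrt" is the usual random attribute sampling that lowers tree correlation.
    rf = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt", random_state=0)
    rf.fit(X_tr, y_tr)
    print(n_trees, roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
```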

2.3. Description of Support Vector Machine (SVM) Model

The support vector machine (SVM) is a method used to train binary classification models. The core idea is to find a suitable set of support vectors in the training set and then to separate the samples by a hyperplane [11]. The selection principle of the hyperplane is to maximize the margin between the samples, which amounts to solving a convex quadratic programming problem with a unique, globally optimal solution. When training a classification model using an SVM, the training set is required to be

$$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\},$$

where $x_i \in \mathbb{R}^{n}$ is the vector of the $n$ features of the $i$th sample and $y_i \in \{-1, +1\}$ is the label of the sample.

If a hyperplane $w \cdot x + b = 0$ can be found such that

$$y_i\,(w \cdot x_i + b) > 0, \quad i = 1, 2, \ldots, N,$$

then the training set is linearly separable and the hyperplane can separate the positive and negative samples, where $w$ is the normal vector of the hyperplane (the weights of the features) and $b$ is a constant bias term.

We define the kernel function as follows: let the original (input) space $X$ be a subset of $\mathbb{R}^{n}$ and $H$ be the feature space. If there exists a map $\phi: X \to H$, $x \mapsto \phi(x)$, from $X$ to some Hilbert space $H$ such that for all $x, z \in X$, $K(x, z) = \langle \phi(x), \phi(z) \rangle$, then the function $K(x, z)$ is said to be a kernel function, where $\langle \phi(x), \phi(z) \rangle$ is the inner product of $\phi(x)$ and $\phi(z)$.

Commonly used kernel functions include the linear kernel function, the polynomial kernel function, and the Gaussian kernel function (RBF kernel/radial basis kernel). The kernel used in the parameter tuning of this paper is the Gaussian kernel function, which is given by

$$K(x, z) = \exp\!\left(-\frac{\lVert x - z\rVert^{2}}{2\sigma^{2}}\right).$$
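The sketch below relates the Gaussian kernel formula to scikit-learn's SVC. The sample vectors and the sigma value are illustrative assumptions; the tuned SVM parameters used in this paper appear later in Table 8.

```python
# Relating the Gaussian (RBF) kernel formula to scikit-learn's SVC.
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(gaussian_kernel(x, z))

# SVC's gamma parameter corresponds to 1 / (2 * sigma^2) in the formula above.
clf = SVC(kernel="rbf", gamma=1 / (2 * 1.0 ** 2), probability=True)
```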

2.4. Analysis and Application of Decision Tree (DT) Model

The decision tree is a common algorithmic model for regression and classification, with a tree-like structure and conditional branching for decision discrimination. The decision process is as follows: first, each feature in the sample data is considered a potential splitting node of the decision tree, and by traversing all features, the gain obtained when splitting on each feature is calculated and used as the basis for further splitting of the decision tree; next, the feature with the greatest splitting gain is used as the node to continue the division, and this process is repeated until the splitting gain falls below a threshold; finally, the complete decision tree is obtained. To improve the efficiency of model training, feature selection is performed before training the model, and irrelevant features or features with weak effects are screened out in advance. This selection is generally based on the information gain or the information gain ratio [12].

Before introducing the information gain, we give the definitions of entropy and conditional entropy. Entropy, also known as impurity, is a measure of the uncertainty of a random variable.

The entropy of a random variable $X$ is defined as

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i,$$

where $p_i = P(X = x_i)$ is the probability of $X$ taking the value $x_i$. $H(X)$ is also the mathematical expectation of the amount of information of each event in $X$. It is important to note that when $p_i = 0$, we define $0 \log 0 = 0$. It can be seen that the entropy is not related to the specific values of the random variable $X$ but only to the distribution of $X$. The higher the entropy, the stronger the uncertainty.

The conditional entropy $H(Y \mid X)$ represents the uncertainty of $Y$ under the condition that $X$ is known, defined as

$$H(Y \mid X) = \sum_{i=1}^{n} p_i\, H(Y \mid X = x_i), \quad p_i = P(X = x_i).$$

The information gain is the difference between the entropy and the conditional entropy, that is, the reduction in the uncertainty of the class after learning the information of a feature. For a training set $D$ and feature $A$,

$$g(D, A) = H(D) - H(D \mid A).$$

The larger the information gain, the stronger the classification capability of the feature.

Information gain ratio: the ratio of the information gain of feature $A$ with respect to the training set $D$ to the entropy of the training set $D$ with respect to the values of feature $A$, i.e.,

$$g_R(D, A) = \frac{g(D, A)}{H_A(D)},$$

where $H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$ and $n$ is the number of values taken by feature $A$.

Gini value: this value reflects the probability that two samples drawn at random from the dataset belong to different classes. The formula is as follows:

$$\mathrm{Gini}(D) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^{2}.$$

Gini index: the CART algorithm uses the Gini index as the basis for splitting on features. For a feature $A$ that divides $D$ into $D_1$ and $D_2$, the formula is as follows:

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2).$$
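The small sketch below computes the entropy, conditional entropy, information gain, and Gini quantities defined above for a toy label column split by a binary feature; the toy arrays are assumptions used purely for illustration.

```python
# Entropy, conditional entropy, information gain, and Gini on a toy split.
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))            # 0*log0 handled by excluding zero counts

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

y = np.array([1, 1, 0, 0, 0, 1, 0, 0])        # class labels
a = np.array([1, 1, 1, 0, 0, 0, 0, 0])        # candidate splitting feature

h_d = entropy(y)
h_d_a = sum((a == v).mean() * entropy(y[a == v]) for v in np.unique(a))   # H(D|A)
print("information gain g(D, A):", h_d - h_d_a)
print("Gini(D):", gini(y),
      "Gini(D, A):", sum((a == v).mean() * gini(y[a == v]) for v in np.unique(a)))
```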

The pruning of decision trees is to solve the overfitting problem. Often, the generated decision trees have a better classification effect for training data and poorer classification performance for new data, which is because the decision trees are divided too carefully and generate too complex tree models, leading to the emergence of overfitting. The solution to this problem is to consider the complexity of the model and thus simplify the decision tree by removing some of the finer branches to improve its generalization ability.

3. Unbalanced Data Processing

The experimental sample in this paper is the financial data of listed manufacturing companies over the past five years, containing 8598 non-fraudulent records and only 91 fraudulent records. This means the data are unbalanced, as shown in Figure 1.

The minority class in an unbalanced dataset usually contains the more important information. For example, in the dataset of a financial fraud prediction model, the fraud samples belong to the minority class, but they are of great significance to the judgment of the model. Domestic and international studies on unbalanced data processing include the following: in 2014, Lunardon et al. proposed another approach to oversampling, the ROSE algorithm [13]. López et al. proposed a one-sided sampling method for unbalanced data, also known as the OSS algorithm [14]. The authors of [15] point out the shortcomings of random undersampling-based data processing and propose two new sampling methods, EasyEnsemble and BalanceCascade, which do not apply to conventional datasets and are only applicable to sample sets with unbalanced categories. The authors of [16] propose the NearMiss method, an undersampling method based on k-nearest neighbors.

The unbalanced data problem mainly arises in supervised machine learning tasks. When unbalanced data are encountered, traditional classification algorithms whose learning goal is overall classification accuracy focus too much on the majority class, thus degrading the classification performance on minority class samples. The vast majority of common machine learning algorithms do not work well on unbalanced datasets: classification quality suffers because the minority class instances are predicted inaccurately. Category imbalance may cause the model's predictions to be biased towards the normal samples, so the data imbalance needs to be dealt with.

The main idea for solving this kind of problem is resampling, which mainly includes undersampling and oversampling [17]. Undersampling achieves sample balance by reducing the size of the majority class, that is, randomly removing some data from the majority class; however, using this method on the data in this paper would sharply reduce the number of normal samples and might lose important discriminant information. Therefore, this paper uses the oversampling method. The basic oversampling method is random oversampling, i.e., the minority class individuals are directly sampled at random and added back to the original dataset, similar to generating new individuals, thus increasing the proportion of the minority class. However, this method only increases the number of individuals without providing new information, and the duplicated individuals may lead to low variance in the dataset, which affects the results of data analysis. Therefore, the commonly used improved oversampling method is the SMOTE algorithm [18].

The idea of the SMOTE algorithm is to synthesize new minority class samples. The synthesis strategy is, for each minority class sample a, to select a sample b randomly from its nearest neighbors and then to select a random point on the line between a and b as the newly synthesized minority class sample. The imbalance treatment is applied only to the training set; no treatment is applied to the test set.

The SMOTE algorithm generates new samples based on the principle of K-nearest neighbors, where the nearest neighbors are determined by calculating distances, using the Euclidean distance as the criterion. The most commonly used value of K is 5, and the value of K affects the classification performance. The process of generating a new sample is as follows: select any minority class sample $x$ as the main sample, find its K-nearest neighbor minority samples, select one nearest neighbor $\tilde{x}$ among them, and generate a new sample at a random point on the line between the main sample $x$ and the selected nearest neighbor $\tilde{x}$. The formula for the new sample is as follows:

$$x_{\mathrm{new}} = x + \mathrm{rand}(0, 1) \times (\tilde{x} - x),$$

where $\mathrm{rand}(0, 1)$ denotes a random number uniformly distributed in $(0, 1)$.

In this paper, the specific operation of the unbalanced processing is as follows: first, we obtain the number of zeros in the label column of the training set and set the number of ones through the sampling strategy (sampling_strategy) of the SMOTE method applied to the training set, so as to achieve the purpose of oversampling. To prevent the newly generated samples from exhibiting a certain regularity, the feature and label columns of the training set are combined and shuffled, where the FLAG column is used as the label column of the training set after data balancing and the remaining columns are used as the new feature columns.
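A minimal sketch of this oversampling step using imbalanced-learn's SMOTE is shown below. The FLAG label column comes from the paper, while the DataFrame handling, the helper function name, and the 1:1 sampling_strategy are my own reconstruction of the described procedure, not the authors' code.

```python
# Sketch of the SMOTE oversampling step described above (reconstruction, not the original code).
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.utils import shuffle

def balance_training_set(train_df: pd.DataFrame, label_col: str = "FLAG") -> pd.DataFrame:
    X = train_df.drop(columns=[label_col])
    y = train_df[label_col]
    # sampling_strategy=1.0 makes the minority (fraud) class as large as the majority class.
    X_res, y_res = SMOTE(sampling_strategy=1.0, k_neighbors=5, random_state=0).fit_resample(X, y)
    balanced = pd.DataFrame(X_res, columns=X.columns)
    balanced[label_col] = list(y_res)
    # Shuffle so the synthetic samples are not grouped together at the end of the set.
    return shuffle(balanced, random_state=0).reset_index(drop=True)
```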

An unreasonable sampling ratio may cause problems such as blurred discriminative boundaries and reduced model generalization ability. To obtain a more suitable sampling strategy, four machine learning models (SVM, DT, RF, and GBDT) are selected in this paper to test the AUC values under different sampling ratios, as shown in Table 1.

After experimental analysis, this paper finally selects a 1:1 oversampling ratio to achieve the desired processing effect for the unbalanced data.

4. Hyperparameter Optimization of Machine Learning Models

4.1. Evaluation Indicators

To increase the accuracy of the model's predictions for the minority class and for the overall data, Yong Li et al. proposed building a confusion matrix [19], as shown in Table 2. For the binary classification problem, all data can be classified into four cases: positive samples predicted as the positive class (TP), negative samples predicted as the positive class (FP), positive samples predicted as the negative class (FN), and negative samples predicted as the negative class (TN).

According to Table 2, the four quantities in the confusion matrix can be combined by calculation to obtain four secondary metrics: accuracy, precision, recall, and F1-score, which are the core metrics for evaluating a classification model.

For unbalanced datasets, another important metric is the ROC (receiver operating characteristic) curve [20]. The axes of the ROC curve are no longer recall and precision: the curve is obtained by ranking the samples by their predicted scores from largest to smallest and taking each score in turn as the decision threshold. The vertical coordinate of the ROC curve is the true positive rate (TPR) and the horizontal coordinate is the false positive rate (FPR). The FPR is defined as the proportion of all true negative samples that are predicted to be positive, and the formulas are as follows:

$$\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{TPR} = \frac{TP}{TP + FN}.$$

Another commonly used evaluation criterion is the area under the ROC curve, also known as the AUC (area under the ROC curve). The AUC value can be approximated by summing the trapezoids under the ROC curve, as follows [21]:

$$\mathrm{AUC} = \frac{1}{2}\sum_{i=1}^{m-1}(x_{i+1} - x_i)(y_i + y_{i+1}),$$

where $(x_i, y_i)$, $i = 1, 2, \ldots, m$, are the points of the ROC curve ordered by increasing FPR.
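The sketch below computes these evaluation indicators for a fitted binary classifier with scikit-learn. The `model`, `X_test`, and `y_test` names are assumed to be an already-trained probabilistic classifier and held-out data; the helper is illustrative, not the authors' evaluation script.

```python
# Confusion-matrix metrics, FPR, and AUC for a fitted probabilistic classifier.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score, roc_curve)

def evaluate(model, X_test, y_test):
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    fpr_value = fp / (fp + tn)                       # FPR = FP / (FP + TN)
    fpr, tpr, _ = roc_curve(y_test, y_score)         # points of the ROC curve
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "fpr": fpr_value,
        "auc": roc_auc_score(y_test, y_score),       # area under the ROC curve
    }
```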

4.2. Analysis of Parameter Optimization Experiments

Tuning is the process of adjusting the parameters of a model to find the parameters that make the model perform optimally; its main purpose is to achieve the best possible trade-off between bias and variance. There are generally two types of parameters in machine learning models: one type can be estimated from the data, and the other type cannot be estimated from the data and can only be specified from human experience; the latter are called hyperparameters. There are two general hyperparameter optimization methods: grid search cross-validation (GridSearchCV) and randomized search cross-validation (RandomizedSearchCV) [22].

Grid search cross-validation is a powerful hyperparameter optimization technique whose process is to try every possible parameter combination among all candidate choices by iterating through them in a loop; the best-performing combination is the final result. However, when the original dataset is divided into only a training set and a test set, the test set is used both to adjust the parameters and to measure how good the model is, which leads to a final score higher than the true one. The problem with plain grid search is therefore that the final performance depends heavily on the initial data partitioning, and cross-validation is used to reduce this chance effect. Another drawback of grid search is that it is time-consuming: the more parameters there are, the more candidate combinations there are and the longer it takes.

In practice, the grid search method generally first uses a wider search range and a larger step size to find the possible location of the global optimum and then narrows the search range and step size to find a more precise optimum. This reduces the required time and computational effort, but since the objective function is generally nonconvex, the global optimum is likely to be missed [23].

Randomized search cross-validation implements a random search over the parameter space, where the values of the parameters are drawn from some probability distribution that describes the probability of each possible parameter value. The search mechanism of random search differs from that of grid search: if the set of sampled points is large enough, then the global optimum, or a good approximation of it, can be found with high probability. Thus, random search has two major advantages: (1) a relatively small number of parameter combinations can be chosen compared with the overall parameter space, and (2) adding parameter nodes does not affect performance or reduce efficiency [24].

RandomizedSearchCV implements a random search that requires two components to build: an estimator and the set of possible values of the hyperparameters, called the parameter grid or parameter space. In addition to the estimator and the parameter grid, there is a parameter (n_iter) that controls the number of randomly chosen hyperparameter combinations tried in the search.

In conclusion, tuning the model parameters can both increase the predictive ability of the model and make the model easier to train. After analyzing these two tuning methods, this paper proposes to integrate them: first, the parameter range of each parameter is defined for a random search so that the range can be narrowed, and, to improve operational efficiency, the step size of each parameter is increased as much as possible within a reasonable range; RandomizedSearchCV is then used to obtain the optimal parameters and the corresponding score within that range. Based on this result, we set the range and step size of the grid search tuning parameters and use GridSearchCV to obtain the final parameters and scores.

The specific modeling procedure first fixes the initial value of each parameter, then sets its tuning range, and finally conducts random search, grid search, and cross-validation to find the optimal results, as sketched below. The initial value, range, and tuning reference results for each algorithm framework are shown in the corresponding tables.
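A possible sketch of this two-stage tuning is given below, using GBDT as the estimator: RandomizedSearchCV first narrows the range with wide, coarse steps, and GridSearchCV then refines around the coarse optimum. The parameter names are real scikit-learn parameters, but the ranges, step sizes, and the helper function are illustrative assumptions rather than the settings reported in Tables 4, 6, 8, and 10.

```python
# Two-stage tuning sketch: coarse RandomizedSearchCV followed by fine GridSearchCV.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

def two_stage_tune(X_train, y_train):
    # Stage 1: coarse random search over wide ranges with large step sizes.
    coarse_space = {
        "n_estimators": list(range(50, 501, 50)),
        "max_depth": list(range(2, 11, 2)),
        "learning_rate": [0.01, 0.05, 0.1, 0.2],
    }
    rand = RandomizedSearchCV(GradientBoostingClassifier(random_state=0),
                              coarse_space, n_iter=30, scoring="roc_auc",
                              cv=5, random_state=0)
    rand.fit(X_train, y_train)
    best = rand.best_params_

    # Stage 2: fine grid search in a small neighbourhood of the coarse optimum.
    fine_grid = {
        "n_estimators": [max(10, best["n_estimators"] - 25),
                         best["n_estimators"], best["n_estimators"] + 25],
        "max_depth": [max(2, best["max_depth"] - 1),
                      best["max_depth"], best["max_depth"] + 1],
        "learning_rate": [best["learning_rate"]],
    }
    grid = GridSearchCV(GradientBoostingClassifier(random_state=0),
                        fine_grid, scoring="roc_auc", cv=5)
    grid.fit(X_train, y_train)
    return grid.best_params_, grid.best_score_
```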

4.2.1. Parameter Optimization of GBDT Model

The GBDT model has many parameters, so to improve the efficiency and speed of parameter tuning, only the important parameters of GBDT are selected for adjustment. Table 3 shows the names and meanings of the important parameters of the GBDT model.

The specific tuning process and results on GBDT are shown in Table 4.

For the tuned GBDT model, the learning curve and the ROC curve are used to evaluate the effectiveness of the model's predictions. Five-fold cross-validation is performed on the training set to observe the effect of the size of the dataset on the model's performance. Figure 2 shows the learning curve of the GBDT model, and Figure 3 shows the ROC curve of GBDT on the test set.
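A learning curve of this kind could be produced as sketched below with scikit-learn's learning_curve utility; the `gbdt`, `X_train`, and `y_train` names are assumed to be the tuned model and the balanced training data, and this is an illustration rather than the authors' plotting code.

```python
# Sketch: data for a 5-fold cross-validated learning curve (as in Figure 2).
import numpy as np
from sklearn.model_selection import learning_curve

def learning_curve_points(model, X_train, y_train):
    sizes, train_scores, val_scores = learning_curve(
        model, X_train, y_train, cv=5, scoring="roc_auc",
        train_sizes=np.linspace(0.1, 1.0, 5))
    # Mean AUC over the five folds at each training-set size.
    return sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)
```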

The learning curve of GBDT shows that the generalization ability of the model increases as the number of training samples increases. On the whole, the prediction of the model is good and always at a high level, and combined with the results on the test set, this shows that there are no overfitting or underfitting problems.

4.2.2. Parameter Optimization of RF Model

The random forest model also has many parameters; here only the parameters most relevant to the financial prediction model are selected, and Table 5 describes the meanings of the important parameters of RF.

Parameter tuning is performed on the RF model by setting the ranges of the tuning parameters and selecting the values that give the best AUC; the results are shown in Table 6.

After tuning the parameters of the RF model, the learning curve and ROC curve are used to assess the accuracy of the model's predictions. Five-fold cross-validation is performed on the training set to analyze the effect of the size of the dataset on model performance. Figure 4 shows the learning curve of the RF model, and Figure 5 shows the ROC curve of RF on the test set.

The learning curve of RF shows that the overall performance of the model is good. The ROC curve shows that the model has a good and consistently high level of prediction, and there is no overfitting or underfitting problem.

4.2.3. Parameter Optimization of SVM Model

Three important parameters are selected for adjustment in the SVM model, and Table 7 shows the names and meanings of the important parameters of the SVM model.

The hyperparameter tuning method proposed in this paper is applied to the SVM model; the specific tuning process and results are shown in Table 8.

For the tuned SVM model, the learning curve and the ROC curve are analyzed to evaluate the effectiveness of the model's predictions. Five-fold cross-validation is performed on the training set to observe the effect of the size of the dataset on prediction performance. Figure 6 shows the learning curve of the SVM model, and Figure 7 shows the ROC curve of the SVM on the test set.

The learning curve of the SVM shows that the increase in samples also leads to an enhancement of the model performance. The ROC curve shows that the model performs well in terms of prediction, and there is no overfitting or underfitting problem.

4.2.4. Parameter Optimization of DT Model

Based on the characteristics of the financial data, four important parameters of DT are selected for tuning, and Table 9 explains the meanings of the important parameters of the DT model.

The specific parameter tuning process and results on the decision tree model are shown in Table 10.

For the tuned DT model, the learning curve and the ROC curve are used to evaluate the quality of the model. Five-fold cross-validation is performed on the training set to observe the effect of the size of the dataset on model performance. The learning curve of the DT model is shown in Figure 8, and Figure 9 shows the ROC curve of DT on the test set.

The learning curve of the DT model shows that the overall performance of the model gradually improves after parameter tuning. The ROC curve shows that the model's predictions perform well and remain at a high level, and there is no overfitting or underfitting problem.

5. Construction of Financial Fraud Prediction Model Based on Fusion Model

According to the characteristics of the manufacturing financial data, four suitable machine learning models, namely GBDT, RF, SVM, and DT, are selected to predict the samples of the manufacturing test set, and the AUC value on the test set is selected as the main indicator for evaluating each classification model. The AUC results of the four models on the test set after hyperparameter tuning are shown in Figure 10.

Figure 10 shows that the GBDT model has an AUC value of 0.898, which is significantly higher than the AUC values of RF, DT, and SVM. Thus, among the single models, GBDT has the best prediction performance.

Since a single model may have defects such as excessive bias or poor generalization, the ensemble learning idea is used here to combine several homogeneous or heterogeneous models according to some strategy to train a fusion model with better and more powerful performance.

To make the model more generalizable, the model fusion steps proposed in this paper are as follows: first, the parameters of the four models are tuned separately to find their best values. Then, five-fold cross-validation is performed on the training set: the training data is randomly divided into five equal parts, four of which are used for training each time, while the remaining part is used for validation. The predictions obtained from the four models are integrated in turn, with the four models serving as the first-layer base learners of the stacking fusion model and GBDT serving as the second-layer meta-learner. In the second layer, GBDT is trained on the hybrid data obtained from the first layer, and its performance is optimized through parameter adjustment, thereby constructing the fusion model. The framework of this fusion model is shown in Figure 11.
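A minimal sketch of this two-layer stacking scheme with scikit-learn's StackingClassifier is shown below: the four models act as first-layer base learners, a GBDT is the second-layer meta-learner, and cv=5 reproduces the five-fold splitting of the training set. The hyperparameter values are placeholders, and using StackingClassifier is an assumption about one way to realize the described framework, not the authors' implementation.

```python
# Sketch of the two-layer stacking fusion model (base learners + GBDT meta-learner).
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

base_learners = [
    ("gbdt", GradientBoostingClassifier(random_state=0)),
    ("rf", RandomForestClassifier(random_state=0)),
    ("svm", SVC(kernel="rbf", probability=True, random_state=0)),
    ("dt", DecisionTreeClassifier(random_state=0)),
]

fusion_model = StackingClassifier(
    estimators=base_learners,
    final_estimator=GradientBoostingClassifier(random_state=0),  # second-layer meta-learner
    cv=5,                              # five-fold cross-validation for the first layer
    stack_method="predict_proba",      # base learners pass probabilities to the meta-learner
)
# fusion_model.fit(X_train, y_train)
# roc_auc_score(y_test, fusion_model.predict_proba(X_test)[:, 1])
```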

Finally, the fusion model is compared with the four single models in terms of the area under the ROC curve (AUC) on the test set, as shown in Figure 12. The experimental results show that the AUC value of the fusion model is 0.914, which is significantly higher than that of all the single models. Therefore, the fusion model proposed in this paper significantly enhances the prediction of financial fraud.

6. Conclusion

Based on an analysis of existing financial data fraud prediction models, this paper identifies their problems: a single model may have defects such as excessive bias or poor generalization. Therefore, a fusion of multiple machine learning models is proposed to build a financial fraud prediction model. The fusion model can combine the advantages of the various models and is trained through a multilevel ensemble learning framework. The experimental samples of this paper come from the financial data of listed manufacturing companies over nearly five years.

The construction of the fusion model is divided into three steps. The first step is unbalanced data processing based on the oversampling method, where the purpose of oversampling is achieved by setting a reasonable sampling ratio. The second step is to select, after analyzing the financial data, four machine learning models suitable for financial fraud prediction, namely random forest, support vector machine, GBDT, and decision tree; this paper proposes integrating the random search and grid search mechanisms to optimize the parameters so that each model achieves its best prediction effect. The third step is to build the layered fusion model based on the ensemble learning framework: the first-layer base learners are composed of the four machine learning models, and the GBDT model serves as the meta-learner in the second layer, which trains on the hybrid data obtained from the first layer and is optimized through parameter adjustment to build the fusion model. The experimental results show that the fusion model proposed in this paper can effectively improve the prediction effect. Since the experimental sample data come mainly from listed manufacturing companies, the prediction effect of the selected fusion model may not be optimal for other industries. Future research will focus on applying the fusion model to the financial data of different industries and achieving a good prediction effect through parameter optimization.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61272015) and the 2022 Henan Province Science and Technology Research Project "Construction and Application of Intelligent Ontology in Internet of Things Based on Semantic Concept Model" (222102210316).