Abstract

Credit card fraud is a major problem in today’s financial world, inflicting severe damage on financial institutions and individuals. Losses due to fraud have increased exponentially in recent years. Hence, effectively detecting fraudulent behavior is of vital importance for both financial institutions and individuals. Since credit fraud events account for a small proportion of all transaction events in real life, credit fraud datasets are usually imbalanced. Some common classifiers, such as decision tree and naïve Bayes, are unable to detect fraud on such data. Furthermore, in some cases, traditional strategies for dealing with the imbalance problem, such as the synthetic minority oversampling technique (SMOTE), are not effective on fraud detection datasets. To accurately detect fraudulent behavior, this study applies anomaly detection to the imbalanced data, using Isolation Forest (IForest) with kernel principal component analysis (KPCA) and a one-class support vector machine (OCSVM) with AdaBoost as two models to detect outliers, which significantly improves detection accuracy and efficiency. The model achieved 96% accuracy, 100% precision, 96% recall, and a 98% F1-score. The proposed model is expected to become a helpful tool for credit card fraud detection, and the analysis presented in this study provides useful insights into credit card fraud detection mechanisms.

1. Introduction

The unauthorized use of credit cards to obtain money, products, or services is known as credit card fraud. Credit cards are now used extensively anywhere and at any time, especially since the introduction of E-commerce and contactless payment [1]. Every year, millions of people become victims of credit card theft, and credit card fraud is pervasive worldwide as a result of the widespread use of credit cards. Due to the large volume of data, monitoring credit card transactions is challenging. As a result, credit card theft often goes undetected, resulting in significant losses for both consumers and issuers [2].

Over the years, the volume of card transactions has increased, with a corresponding increase in stolen card numbers. Credit card fraud losses follow a similar pattern, with greater losses recorded each year [3]. The fraud detection system has therefore evolved into an important tool for ensuring the integrity of E-payments. To allow authentic consumers to conduct their transactions, a safe and reliable banking system for E-commerce needs real-time scrutiny of transactions [4].

Many strategies for detecting credit card fraud have been recommended in the literature [5, 6]. Lucas and Johannes [7] identified three challenging problems in credit card fraud detection. The first is the data imbalance created by the large disparity in size between the positive and negative classes: since normal transactions greatly outnumber illegal ones, fraud detection systems are prone to overfit normal transactions. The second issue is dataset shift, which means that fraud patterns may change; new client behaviors and new attacks on credit card transactions make it difficult for fraud detection methods to maintain their effectiveness. The final issue is the omission of sequential information between consecutive operations. Srivastava et al. [8] proposed a credit card fraud detection system based on a hidden Markov model and explained the underlying stochastic process. Kunlin [9] developed a memory-enhanced framework for detecting credit card fraud, in which a sequential memory-enhanced neural network is used. The authors employed transaction logs in their model, bringing it closer to a real-world application. It also has a low false positive rate, which aids system stability and reduces false complaints. The disadvantage of this model is that it requires a full description of each consumer. The authors in [10] proposed a new fraud detection system called CoDetect, which detects several types of fraud using graph similarity and feature matrices. A confidence level is also included to flag potentially suspicious transactions. The system can be used both to observe a fraudulent transaction and to see which characteristic caused the flag to be raised. Moreover, tests conducted on real-world datasets yielded positive results, and feature patterns were developed that can aid in the detection of emerging fraud methods. Randhawa et al. [11] employed various algorithms for credit card fraud detection and compared their performance, including Bayesian methods, neural networks, decision trees, logistic regression, support vector machines, and AdaBoost, using the Matthews correlation coefficient as the evaluation statistic. Using AdaBoost and majority voting, they achieved a perfect score of 1.0, and satisfactory results were still obtained when up to 30% noise was added to the data to test the robustness of the system. Jiang et al. [12] used users’ behavioral tendencies to create a profile of each user. The classifier rating is updated based on the data stream, and the system provides a feedback option, so users can be confident in the results. Gómez et al. [13] used artificial neural networks to detect and prevent fraudulent transactions and analyzed the model using several indicators, including the value detection rate and the false positive rate. The authors in [14] employed a bagging ensemble classifier based on decision trees for credit card fraud detection. They tested the models with varying proportions of fraud cases, ranging from 3% to 20%, and the model produced consistent results across all test cases.

Taking credit fraud data as the dataset, this study proposes a new model that effectively detects minority-class and abnormal samples while maintaining accuracy. Instead of conventional strategies such as SMOTE, anomaly detection algorithms are applied to the imbalanced dataset, and two new models are built: IForest with KPCA and OCSVM with AdaBoost. The models are evaluated in terms of accuracy, precision, recall, and F1-score. The proposed model can detect more fraud samples accurately and provides high performance compared with other fraud detection models.

The rest of the manuscript is organized as follows. Section 2 presents a detailed description of the data preparation; four different classifiers and SMOTE are applied to the dataset to detect fraud samples, and IForest with KPCA and OCSVM with AdaBoost are built as two models to detect fraud samples. Section 3 evaluates the proposed model and compares it with other state-of-the-art models. Finally, Section 4 concludes the paper.

2. Data Preparation

2.1. Dataset

In this study, a credit card fraud dataset is used. The dataset is imbalanced and comprises 307,510 rows and 122 columns, including the target column. The features include the gender of the client, the income of the client, and the loan annuity. Because the dataset is large, it is first divided into three parts (object, integer, and float) based on data type. Second, missing-value detection is performed on the dataset. Next, histograms are used to check the distribution of the variables, and numerical values are used to represent the relationships between variables and between variables and the class.
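A minimal sketch of these preparation steps, assuming the dataset is a CSV file with a TARGET column (the file name and column name are hypothetical):

import pandas as pd

df = pd.read_csv("credit_fraud.csv")  # 307,510 rows x 122 columns (assumed file name)

# Split the columns into three groups by data type (object, integer, float).
obj_cols = df.select_dtypes(include="object")
int_cols = df.select_dtypes(include="int64")
float_cols = df.select_dtypes(include="float64")

# Missing-value detection: count of missing entries per column.
missing = df.isnull().sum().sort_values(ascending=False)

# Histograms to inspect the distribution of the numeric variables.
df.hist(figsize=(20, 20))

# Numerical (correlation) view of variable-variable and variable-class relations.
corr = df.corr(numeric_only=True)["TARGET"].sort_values()

# Feature matrix and labels used in the later steps.
X = df.drop(columns=["TARGET"])
y = df["TARGET"]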

2.2. K-Best Feature Selection

Feature selection is a critical issue in machine learning; it can remove irrelevant features and improve classification performance, making it a good way to analyze high-dimensional datasets [15]. Through exploratory data analysis (EDA), we found a large amount of missing data in this dataset, and many features play only a minor role in anomaly detection. The K-best feature selection algorithm is therefore used to screen out the 10 features that most critically influence anomaly detection.
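A sketch of this selection step using scikit-learn’s SelectKBest; the scoring function is not specified in the paper, so the ANOVA F-score (f_classif) is an assumption:

from sklearn.feature_selection import SelectKBest, f_classif

# Keep the 10 features that score highest against the target (f_classif assumed).
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
chosen = selector.get_support(indices=True)  # column indices of the kept features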

2.3. SMOTE

The SMOTE algorithm is a typical oversampling method. An imbalanced dataset is one in which at least one of the classes represents a very small minority, which is one of the most common issues in classification datasets. When a model is tested on such a dataset, the results can be substantially distorted. SMOTE is an oversampling method that can be applied to imbalanced datasets [16]: the minority class is oversampled until it equals the number of majority-class members. In this study, SMOTE is used to balance the credit card detection dataset [17].
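A minimal SMOTE example using the imbalanced-learn library; the random seed is illustrative:

from imblearn.over_sampling import SMOTE

# Oversample the minority (fraud) class until both classes are the same size.
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_selected, y)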

2.4. IForest

IForest is an ensemble technique that detects anomalies by isolating abnormal data points. It offers high precision and a high computation rate, which makes it suitable for big-data problems. The IForest algorithm is similar in spirit to random forest; however, it does not require computing the distance between two points. IForest is an unsupervised counterpart of random forest and is widely used in anomaly detection. It randomly partitions the data space to construct isolation trees, also called ITrees [18]. It also follows the idea of quality evaluation, where quality is defined as the depth of a node in the ITree: the smaller the depth, the more likely the point is an outlier.

IForest solves two problems of anomaly detection in high-dimensional datasets. First, IForest does not depend on distances, and its time cost does not grow with the data dimension, giving linear time complexity. Second, IForest is an ensemble method that can deal with large datasets: the more ITrees there are, the more stable the IForest is [19]. Although IForest is suitable for anomaly detection in high-dimensional datasets, its detection efficiency decreases as the data distribution becomes more complex, and the algorithm is highly volatile on very high-dimensional data. Therefore, this paper uses KPCA to reduce the dimensionality of the dataset and thereby improve accuracy and efficiency. In short, isolating points inside a high-density cluster requires many cuts, whereas low-density points can be isolated easily, as depicted in Figure 1.

The process of detecting abnormal data with IForest is shown in Figure 2, which depicts each step and marks the training stage and the testing stage.

First, the dataset is subsampled. The definition of an anomaly in the IForest algorithm is “few but different.” If the amount of data is too large, the anomalous points may be densely distributed, making it difficult to separate them with only a few splits. Likewise, if the average path length of anomalous points differs little from that of normal points, distinguishing the anomalies takes a long time. Therefore, subsampling is carried out: building a local model on a subsample reduces these effects and yields a better anomaly detection result. The details are given in Algorithm 1.

Inputs: X: input data, t: number of trees, ψ: subsampling size
Output: a set of t iTrees
1: initialize Forest ← ∅
2: set height limit l = ⌈log2(ψ)⌉
3: for i = 1 to t do
4:  X′ ← sample(X, ψ)
5:  Forest ← Forest ∪ iTree(X′, 0, l)
6: end for
7: return Forest
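A sketch of this training/testing procedure with scikit-learn’s IsolationForest; the hyperparameter values are illustrative, not the paper’s tuned settings:

from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(X_selected, test_size=0.3, random_state=42)

iforest = IsolationForest(
    n_estimators=100,  # t: number of iTrees
    max_samples=256,   # ψ: subsampling size per tree
    random_state=42,
)
iforest.fit(X_train)                    # training stage
pred = iforest.predict(X_test)          # testing stage: +1 normal, -1 anomalous
scores = iforest.score_samples(X_test)  # the lower the score, the more anomalous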

An ITree is built by a binary method: ITrees are constructed recursively to form the isolation forest. For each ITree, an attribute is selected at random, and a partition value between the maximum and minimum values of that attribute is selected at random to divide the data into left and right subtrees. The process is illustrated in Algorithm 2.

Inputs: X: input data, e: current tree height, l: height limit
Output: an iTree
1: if e ≥ l or |X| ≤ 1 then
2:  return exNode{Size ← |X|}
3: else
4:  let Q be a list of attributes in X
5:  randomly select an attribute q ∈ Q
6:  randomly select a split point p between the max and min values of attribute q in X
7:  X_l ← filter(X, q < p)
8:  X_r ← filter(X, q ≥ p)
9:  return inNode{Left ← iTree(X_l, e + 1, l),
10:   Right ← iTree(X_r, e + 1, l),
11:   SplitAtt ← q,
12:   SplitValue ← p}
13: end if

The path length h(x) of a sample point x is the number of edges traversed from the root node of an ITree to the leaf node containing x. The computation of h(x) is listed in Algorithm 3.

Input: x: an instance, T: an iTree, e: current path length;
 e is initialized to zero when first called
Output: path length of x
1: if T is an external node then
2:  return e + c(T.size) {c(·) is defined in Equation (1)}
3: end if
4: a ← T.splitAtt
5: if x_a < T.splitValue then
6:  return PathLength(x, T.left, e + 1)
7: else
8:  return PathLength(x, T.right, e + 1)
9: end if

The anomaly score s(x, n) is calculated by Equation (1):

s(x, n) = 2^(−E(h(x)) / c(n)), with c(n) = 2H(n − 1) − 2(n − 1)/n, (1)

where E(h(x)) is the average path length of x over the ITrees, n is the subsampling size, and H(i) = ln(i) + 0.5772156649 (Euler’s constant). If s approaches 1, the sample is judged as abnormal, and if s approaches 0, the sample is judged as normal.
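A direct transcription of Equation (1) as a worked example; avg_path_length stands for E(h(x)):

import math

EULER_GAMMA = 0.5772156649  # Euler–Mascheroni constant

def c(n: int) -> float:
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length: float, n: int) -> float:
    """s(x, n) = 2 ** (-E(h(x)) / c(n)): near 1 => anomaly, near 0 => normal."""
    return 2.0 ** (-avg_path_length / c(n))

# For example, with a subsample of n = 256, a short average path of 5 edges
# gives a score of about 0.71, while a longer path of 15 edges gives about 0.36.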

2.5. KPCA

PCA is a linear transformation approach for compressing high-dimensional data with minimal information loss. PCA operates in the original sample space, whereas kernel PCA operates in an extended feature space. Kernel PCA is a kernel-based learning method. PCA is not good at extracting nonlinear features from data, but KPCA can solve such problems [8]. Generally, KPCA uses a kernel function to map the data samples through a nonlinear function, transforming nonlinear relations in the original space into linear relations in a high-dimensional space, and then calculates the principal components of that high-dimensional feature space.

KPCA can thus project the input data into a high-dimensional feature space through a nonlinear mapping in order to process nonlinear data.

Suppose Φ is a nonlinear map into a possibly high-dimensional feature space F. To calculate the dot products Φ(x) · Φ(y), a kernel representation k(x, y) = Φ(x) · Φ(y) can be used, which yields the value of the dot product without explicitly computing Φ.

Figure 3 is an example of KPCA: after a nonlinear map, nonlinear data can become linearly separable.

The RBF kernel is used to realize the division of the nonlinear data. The process is shown in Figure 4.
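A sketch of the RBF kernel PCA step with scikit-learn, feeding the reduced representation to the Isolation Forest as described above; the number of components and gamma are illustrative:

from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components=10, kernel="rbf", gamma=0.1)
X_kpca = kpca.fit_transform(X_selected)  # nonlinear structure made (near-)linear

# The KPCA-reduced data is then used to train the IForest model.
iforest.fit(X_kpca)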

2.6. OCSVM

An OCSVM, based on the traditional SVM, can be used to train an anomaly detection model when only one class of samples is available for training. Using negative samples during training can further improve the performance of the OCSVM, for example, by obtaining a tighter boundary around the target data that leaves the outliers outside.

The basic idea of OCSVM is to use a kernel function to map the input space into a high-dimensional feature space. The coordinate origin is regarded as the abnormal sample, and OCSVM then tries to separate the training samples from the origin with maximum margin in the high-dimensional space. After training on the data x_1, …, x_n, a decision function f(x) is obtained. As shown in Figure 5, a positive value of f(x) indicates the normal class, and a negative value indicates the abnormal class.

To separate the dataset from the origin, this study solves the following quadratic programming problem:

min_{w, ξ, ρ} (1/2)‖w‖² + (1/(νn)) Σ_{i=1}^{n} ξ_i − ρ
subject to (w · Φ(x_i)) ≥ ρ − ξ_i, ξ_i ≥ 0, i = 1, …, n.

Here, the input space is denoted by 𝒳, the training data points are x_1, …, x_n ∈ 𝒳, and Φ denotes the map from the input space to the feature space. w and ρ are the normal vector and the offset parameter of a hyperplane in the feature space, respectively. The adjustable parameter ν is an upper bound on the fraction of outliers and is varied to obtain the best performance. Moreover, the slack variables ξ_i allow some training points to be misclassified. If w and ρ are the solutions of the above quadratic programming problem, the decision function can be obtained as follows:

f(x) = sgn((w · Φ(x)) − ρ).

Since most of the points in the dataset belong to the normal class, the value of ν should be relatively small. In this study, we used the Gaussian radial basis function (RBF) kernel, which is commonly used for the OCSVM:

k(x, y) = exp(−γ‖x − y‖²).

Therefore, the parameters ν and γ are the keys to solving the problem, and (ν, γ) is the parameter combination to be optimized.

In this study, the process of tuning the OCSVM by grid search is depicted in Figure 6.

The major steps are described as follows:
(i) Step 1: extract the training samples and test samples from the dataset.
(ii) Step 2: apply normalization to reduce the training time and guarantee the convergence of the model.
(iii) Step 3: select the RBF kernel function and use a cross-validated grid search to find the best parameter combination (ν, γ), which was obtained through the experiment; a sketch of this search is given below.
(iv) Step 4: complete the algorithm and build the OCSVM model.
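A manual cross-validated grid search over (ν, γ) as a sketch of Step 3. Because OCSVM training uses no labels, a small labeled validation set (X_val, y_val with labels in {+1, −1}) is assumed for scoring, and the grid values are illustrative:

import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import f1_score

best_score, best_params = -np.inf, None
for nu in [0.01, 0.05, 0.1, 0.2]:
    for gamma in [0.001, 0.01, 0.1, 1.0]:
        model = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma)
        model.fit(X_train_normal)      # train on (mostly) normal samples
        pred = model.predict(X_val)    # +1 normal, -1 outlier
        score = f1_score(y_val, pred)  # score against the held-out labels
        if score > best_score:
            best_score, best_params = score, (nu, gamma)
print("best (nu, gamma):", best_params)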

2.7. Boosting-AdaBoost

AdaBoost combines a group of weak learners into a strong learner in order to minimize the training error. The algorithm iteratively combines weak learners to form a powerful learner that predicts outcomes more accurately. There are different types of boosting methods; in this study, we used the AdaBoost classifier.
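For reference, standard AdaBoost with scikit-learn (decision stumps as the default weak learners); the OCSVM-specific modification is described below:

from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
ada.fit(X_res, y_res)  # e.g., on the SMOTE-balanced data from Section 2.3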

As shown in Figure 7, the AdaBoost algorithm obtains better results by repeatedly applying the given base classifier. In each iteration, AdaBoost identifies the misclassified data points and adjusts their weights so as to minimize the training error. The model continues to optimize sequentially until it yields the strongest predictor. One advantage of the algorithm is that it uses the classification error rate to adjust both the distribution of the data and the coefficients of the base classifiers.

To optimize the OCSVM and compare it with IForest on the same level, we used OCSVM as the base classifier in boosting-integrated training. From the principle of the AdaBoost algorithm, the training error drops exponentially as long as each weak base classifier is slightly better than random guessing, so better results can be obtained.

We modified the AdaBoost algorithm to make it fit the OCSVM, as given in Algorithm 4. To keep the weights of misclassified data appropriately large and the coefficients of high-accuracy classifiers significant, we selected a constant as the exponent coefficient of the weight update across the varying base classifiers.

Input
 Training set: S = {(x_1, y_1), …, (x_n, y_n)}, y_i ∈ {+1, −1}
 The probability distribution of each data point: D_1(i) = 1/n
 Maximum number of iterations: T
Output
H(x), which is the ensemble of the base classifiers.
Initialize: D_1(i) = 1/n, i = 1, …, n
For t = 1, …, T
 Step 1: train base learner h_t by distribution D_t.
 Step 2: get base classifier h_t; get the rate of the error-classified samples ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i].
 Step 3: choose α_t = c, where c is a constant.
 Step 4: update
   D_{t+1}(i) = D_t(i) · exp(−α_t · y_i · h_t(x_i)) / Z_t,
  where Z_t is a normalization factor.
End for
Output the final hypothesis:
   H(x) = sign(Σ_{t=1}^{T} α_t · h_t(x))
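A sketch of Algorithm 4 in Python, wrapping scikit-learn’s OneClassSVM as the base learner; the constant coefficient alpha and the hyperparameters are illustrative, and y holds labels in {+1, −1}:

import numpy as np
from sklearn.svm import OneClassSVM

def adaboost_ocsvm(X, y, T=10, alpha=0.5, nu=0.1, gamma=0.1):
    """Return a list of (alpha_t, fitted OCSVM) pairs."""
    n = len(X)
    D = np.full(n, 1.0 / n)              # initialize D_1(i) = 1/n
    ensemble = []
    for _ in range(T):
        clf = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma)
        clf.fit(X, sample_weight=D * n)  # Step 1: train by distribution D_t
        pred = clf.predict(X)            # Step 2: base classifier h_t in {+1, -1}
        # Step 3: alpha_t = c is held constant (the paper's modification).
        # Step 4: increase weights of misclassified points, then normalize by Z_t.
        D = D * np.exp(-alpha * y * pred)
        D = D / D.sum()
        ensemble.append((alpha, clf))
    return ensemble

def adaboost_predict(ensemble, X):
    """Final hypothesis H(x) = sign(sum_t alpha_t * h_t(x))."""
    agg = sum(a * clf.predict(X) for a, clf in ensemble)
    return np.where(agg >= 0, 1, -1)     # break ties toward the normal class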

3. Results

3.1. IForest and IForest with KPCA

It can be seen from the confusion matrix that IForest achieves the goal of identifying anomalies, with performance higher than that of the traditional classifiers without SMOTE processing. However, many outliers remain unidentified.

The dataset has more than 100 dimensions, and because it is taken from a real-world source, it contains many missing values and many attributes that contribute little to anomaly detection. Since the IForest algorithm selects attributes randomly, too many attributes reduce the probability of selecting informative ones and harm the training result.

According to Figure 8 and Table 1, after adding KPCA, the overall performance of the algorithm is significantly improved, reaching 87% accuracy compared with 80% for the standalone IForest. This shows that the proposed model is significantly improved and more outliers can be detected.

3.2. OCSVM and AdaBoost-OCSVM

From the confusion matrix shown in Figure 9, although the performance of OCSVM seems strong in handling one-class classification problems, its classification ability and outlier detection ability still need to be enhanced. Therefore, in this study, we modified AdaBoost to adapt it to OCSVM for better performance. The results of the modified AdaBoost are shown in Table 2 and Figure 9, respectively. It can easily be seen that the precision of the modified AdaBoost with OCSVM is superior to that of the traditional OCSVM: AdaBoost-OCSVM achieved 96% accuracy, compared with 91% for OCSVM alone. Through the experiment, the modified AdaBoost with OCSVM also has a tighter boundary than the traditional one when ν has the same value.

3.3. Performance Comparison

To validate the performance of the proposed model, we applied some basic machine learning algorithms and compared their accuracy and performance with the proposed model. Table 3 shows that the four classifiers (logistic regression, decision tree, K-nearest neighbors, and naïve Bayes) achieve high accuracy, and their other performance measures are also relatively high. However, all four classifiers are unable to detect outliers: since the dataset is highly imbalanced, they cannot precisely separate the fraud samples from the non-fraud ones. Their performance is therefore significantly lower than that of the proposed model (AdaBoost+OCSVM), which reaches 96% accuracy (Table 2).

3.3.1. SMOTE and IForest

As shown in Figure 10, although the classifiers with SMOTE could detect outliers, their accuracy is relatively low. We believe this is because the dataset is very high dimensional, so even after oversampling with a technique such as SMOTE, it remains too complex for the classifiers to process efficiently. IForest, in contrast, and consistent with its theory, not only detects outliers with high efficiency but also maintains a very high level of accuracy and performance. Moreover, IForest processes the dataset within one second, far faster than SMOTE, which required ten seconds.

4. Conclusion

Although traditional models with SMOTE have achieved excellent performance in solving the imbalanced-class issue, their classification ability still needs to be enhanced in many real-world applications. This study utilized anomaly detection on imbalanced data to accurately detect fraudulent activity, using two models: IForest with kernel principal component analysis (KPCA) and a one-class support vector machine (OCSVM) with AdaBoost, which considerably improved detection accuracy and efficiency. The proposed fraud detection model obtained 96% accuracy, 100% precision, 96% recall, and a 98% F1-score. Furthermore, the proposed method was compared with other algorithms such as logistic regression, K-nearest neighbors, decision tree, and naïve Bayes. The results show that the improved IForest and AdaBoost with OCSVM perform better than the traditional methods. In the future, the method will be extended by incorporating additional datasets and applying the proposed anomaly detection approach to other application problems.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no competing interests.

Authors’ Contributions

Yan-Feng Zhang, Hong-Liang Lu, Hong-Fan Lin, Xue-Chen Qiao, and Hao Zheng contributed equally to this work and should be considered co-first authors.

Acknowledgments

We are grateful to Prof. David Woodruff and Dr. Hudson Li for their contributions to this article.