Analysis of Resampling Techniques on Predictive Performance of Credit Card Classification

Credit card fraud detection is a demanding research area because of its large financial implications and the widespread use of credit cards in almost every area of life. Credit card fraud datasets are naturally imbalanced: they contain far more legitimate transactions than fraudulent ones. The literature contains numerous studies that aim to balance such skewed datasets. The two major resampling techniques for balancing these sets are under-sampling and over-sampling. However, both techniques suffer from their own problems, which can seriously affect the performance of the classifiers that have been used in past credit card studies. Thus, to accelerate the detection of credit card fraud, it is important to implement the strategy that provides the best predictive performance. This paper attempts to find out which resampling technique works best under different skewed distributions in the domain of credit card fraud detection.


Introduction
Over recent years, the widespread use of credit cards has caused many losses to financial institutions and other recipient organizations. This has made detecting credit card fraud a hard challenge for the concerned authorities. The credit card is considered an easy fraud target because fraudsters can gain a lot of money in a very short period of time and with little risk, as the fraud is detected only after many days (Zareapoor & Shamsolmoali, 2015). Credit card fraud detection has been an arduous research area because of the losses generated through these cards. In 2015, according to "The Nilson Report" (Neilson, 2012), credit card fraud in the United States of America (USA) alone increased to 12.75 cents for every $100 annually, contributing 21.4% of the total fraud losses across the world. Another study (Stolfo et al., 1997) reported that 40% of the total financial losses worldwide are generated by credit cards alone. To minimize the losses from stolen or misused cards, it is necessary to block these cards as quickly as possible. Fraudsters use many techniques in attempting fraud and always look for sensitive information related to the stolen card. In response, financial institutions adopt a number of solutions to combat fraud. These solutions usually involve classifying each transaction as either fraud or non-fraud.
Credit card fraud datasets have been found to be naturally skewed (He & Garcia, 2009), which means that these datasets have more legitimate transactions than fraudulent ones. This imbalance between the majority and minority classes biases classifiers towards the majority class, so that instances of the class with less representation in the data are misclassified. Usually in the classification process, the class with less representation is more important than the other classes (Rahman & Davis, 2013). In credit card fraud detection, the instances belonging to the minority class, i.e. the fraudulent transactions, are of prime interest. Classification algorithms used to detect credit card fraud are often overwhelmed by the majority class (Tremblay et al., 2007, Anis & Ali, 2017 and Shen et al., 2007), leading to poor predictive performance for the minority class. To increase the prediction rate of the minority class, many studies in the literature have proposed resampling techniques. However, two basic techniques are widely followed: Over-Sampling and Under-Sampling. A combination of both is called Hybrid Sampling.

• Under-Sampling: removes majority samples until the desired level of imbalance is reached.
• Over-Sampling: generates new minority samples until the desired level of imbalance is reached.
• Hybrid Sampling: applies both over-sampling and under-sampling until the desired level of imbalance is reached.
A wide range of resampling techniques has been implemented for credit card fraud, and from these we have selected the three most widely used. The resampling techniques utilized and compared in this study are Random Over-Sampling (ROS), Random Under-Sampling (RUS) and SMOTE, which we have implemented to balance the datasets. These resampling techniques provide varied predictive performance for different classifiers. Thus we aim to explore and analyze them for classification algorithms that have been widely implemented for credit card fraud detection.
A long list of classification algorithms has been applied in credit card studies, but some algorithms have been used far more extensively than others. For example, Shen et al. applied Decision Tree, Logistic Regression and Neural Network to analyze their performance for credit card fraud detection (Shen et al., 2007). Anis et al. implemented the family of DTs for different levels of imbalance in credit card fraud (Anis et al., 2015). Brown and Mues analyzed different classification models over a set of skew levels to check predictive performance for the minority class in credit scoring (Brown & Mues, 2012). Peng et al. ranked the most implemented classification algorithms for credit card fraud (Peng et al., 2011). West and Bhattacharya presented a comprehensive literature review of financial fraud detection studies and found that Logistic Regression, Bayesian Belief Network, Support Vector Machines and Neural Network are the algorithms that perform optimally for credit card fraud (West & Bhattacharya, 2016 and Zhang & Zhou, 2004). Considering the studies that were specifically formulated to find the best classification algorithms, we utilized two well-known and widely implemented algorithms: Decision Trees and Support Vector Machine.

Methods
As explained in section 1, the objective of this paper is to explore resampling strategies for balancing imbalanced datasets before classification. We therefore give a brief overview of the classification algorithms, the notation for the problem statement, the resampling strategies and the evaluation metrics that have been used in this study.

Classification Algorithms
Decision Trees: A Decision Tree (DT) is a technique for classifying data by generating a tree-like structure. The tree has internal nodes that represent binary choices on each attribute, whereas the branches of the tree symbolize the outcomes of those choices (Breiman, 2001). The nodes are created in such a way that samples can be traversed through them. Decision Trees come in many types, e.g. Classification and Regression Trees (CART), J48 and Random Forest. Among the family of DTs, Random Forest (RF), or decision forest, is the most widely used classification tree (Breiman, 2001). An RF is a collection of trees created to minimize the risk of over-training on the samples and to avoid the instability within a single tree (Bhattacharyya et al., 2011). Another technique in DTs is pruning, which is used to reduce overfitting. Pruning removes nodes of a DT without affecting the overall performance of the tree. Pruning also makes RF robust to noise and to over-training on the samples. In RF each tree is created independently with little complexity, so only two parameters need tuning: the number of trees and the number of attributes considered at each node. This makes the generation of an RF very simple (Bhattacharyya et al., 2011).
Support Vector Machine: The Support Vector Machine (SVM) was developed by Vapnik (1995). It is a classification technique that maps the data to a higher dimensional space in which linear functions are fitted. This enables a complex nonlinear classification problem to be solved linearly with minimum computational complexity. SVM uses a kernel function for the transformation of the data to the high dimensional space. The kernel function is defined as an inner product of the mapped data in that space. Mathematically it is given by:
$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$
where $\phi : X \to H$ represents a mapping from the input space $X$ to the higher dimensional space $H$.
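After the kernel function has been applied, a hyperplane is generated to classify the data points into their respective classes. In its standard form, with weight vector $w$, bias $b$ and the kernel-induced mapping $\phi$, it is defined as:
$w^\top \phi(x) + b = 0$
This hyperplane is constructed to have maximum separation between the instances of the two classes. Thus the final classification of SVM can be defined as:
$f(x) = \operatorname{sign}\left(w^\top \phi(x) + b\right)$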
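To make the kernel idea concrete, the following minimal Python sketch checks numerically that a kernel value equals an inner product in a higher dimensional space. The degree-2 polynomial kernel, its explicit feature map `phi` and the `gamma` value are illustrative assumptions that do not come from this study; the radial kernel is included only because it is the kernel type the experiments mention for SVM.

```python
import math

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map of the degree-2 homogeneous polynomial kernel (2-D input)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, evaluated without building phi(x) or phi(z)."""
    return dot(x, z) ** 2

def rbf_kernel(x, z, gamma=0.5):
    """Radial (Gaussian) kernel; gamma = 0.5 is an arbitrary illustrative value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly2_kernel(x, z)      # computed in the 2-D input space
inner_product = dot(phi(x), phi(z))    # computed in the 3-D feature space
```

Both quantities agree up to rounding, which is why an SVM can work in the high dimensional space while only ever evaluating the kernel on the original inputs.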

Problem Statement and Notations
This section defines the problem statement and the notation used in resampling and classifying credit card fraud datasets. As a credit card transaction has to be classified as either legitimate or fraudulent, we consider a binary classification problem. Consider a dataset $D$ having $m$ elements, $D = \{(x_i, y_i)\}_{i=1}^{m}$, where each $x_i$ is a $d$-dimensional transaction and $y_i \in \{0, 1\}$ is its label. Here 0 and 1 represent the majority and minority classes respectively. Let $D_{\text{maj}}$ and $D_{\text{min}}$ denote the subsets of majority and minority samples in $D$. As credit card fraud datasets are imbalanced, this imbalance can be described by defining an imbalance ratio, i.e.

$IR(D) = \dfrac{|D_{\text{maj}}|}{|D_{\text{min}}|}$
It is worth noting that a higher $IR(D)$ means a more skewed dataset. Thus our aim is to resample the dataset so as to lower $IR(D)$. For $IR(D) = 1$ we obtain a fully balanced dataset. Every standard classification problem is modeled on some training data, whereas the model is verified using the test data. Here we assume that $D$ is the training data that needs to be resampled. Learning a classifier from an imbalanced training set $D$ can be done in two stages. In the first stage the dataset $D$ is resampled such that a desired imbalance ratio $IR(r(D))$ is achieved, where $IR(r(D)) < IR(D)$. This is performed by dropping majority transactions or by adding new minority samples that are generated synthetically. After the resampling, a standard classification function $C$ is learned on the resampled dataset $r(D)$ to generate a model $C_{r(D)}$ that maps all instances in the $d$-dimensional space to the target set $\{0,1\}$, i.e. $C_{r(D)} : \mathbb{R}^{d} \to \{0,1\}$.
The next step is to validate the model $C_{r(D)}$ by checking its performance on the test set $D_{\text{test}}$. For this purpose the performance of the classifier is determined using performance metrics $P$, whose inputs are the model trained on the resampled dataset, $C_{r(D)}$, and the test set $D_{\text{test}}$. Higher values of these metrics indicate a better predictive model. In order to assess the effect of the resampling method $r$ on the classifier $C$, k-fold cross validation is implemented during the training phase.
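The imbalance ratio can be computed directly from the label vector. The sketch below takes $IR(D)$ to be the number of majority samples divided by the number of minority samples, which is consistent with the statement that $IR(D) = 1$ corresponds to a fully balanced dataset. The function name `imbalance_ratio` and the toy label counts are illustrative assumptions.

```python
from collections import Counter

def imbalance_ratio(labels, minority=1):
    """Return (number of majority samples) / (number of minority samples)."""
    counts = Counter(labels)
    n_min = counts[minority]
    n_maj = len(labels) - n_min
    if n_min == 0:
        raise ValueError("the dataset contains no minority samples")
    return n_maj / n_min

# Toy label vector: 950 legitimate (0) and 50 fraudulent (1) transactions.
labels = [0] * 950 + [1] * 50
ir = imbalance_ratio(labels)
```

For this toy vector the ratio is 19, meaning nineteen legitimate transactions for every fraudulent one; any resampling method should return a dataset with a smaller value.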

Resampling Techniques
Each resampling method $r$ considered in this study follows the scheme given below.
It takes the training dataset as input. A resampling multiplier $l$, with $l > 1$, regulates the amount of resampling and is adjusted so that $IR(r(D)) = IR(D)/l$.
The training dataset is modified by adding new minority samples (oversampling) or by removing majority samples (undersampling), according to the resampling method being applied.
Finally we obtain a resampled dataset $r(D)$, with $IR(r(D)) \le IR(D)$, on which the classifier $C$ can be trained. We now explain the resampling techniques used in this paper.

Random Undersampling
Random Under-Sampling (RUS) is a technique that eliminates majority samples from the training data. A number of studies point towards its effectiveness. Liu found that removing a large number of majority samples can bring significant savings in the training time and memory required to build a model (Liu, 2004). However, randomly eliminating a large number of majority instances can discard useful information that is necessary for building a model able to detect more minority samples. Ganganwar suggested that RUS should ideally be performed on larger datasets, since larger datasets contain redundant majority samples and most of the discarded data is therefore redundant in nature (Ganganwar, 2012). Another study described RUS as the most naive and most frequently used resampling technique, whose major drawback is that the amount of information withdrawn from the training data cannot be controlled (Krishnaveni & Rani, 2011). Although this technique can affect a classifier's performance, RUS has been considered the most effective technique and can outperform certain more sophisticated techniques (Wang et al., 2008).
RUS is performed until both classes (i.e. majority and minority) have the same number of samples. RUS does not take into account any additional parameters. For this study, a random subset of the majority samples in $D$ is withdrawn. All the majority samples in $D$ have equal probability of being selected for removal.
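A minimal sketch of this procedure is given below, assuming the data are held in plain Python lists and that label 0 marks the majority class. The function name `random_undersample` and the fixed seed are illustrative choices, not details taken from the study.

```python
import random

def random_undersample(samples, labels, majority=0, seed=0):
    """Drop randomly chosen majority samples until both classes are equal in size."""
    rng = random.Random(seed)
    maj_idx = [i for i, y in enumerate(labels) if y == majority]
    min_idx = [i for i, y in enumerate(labels) if y != majority]
    # Every majority sample has the same probability of being kept.
    n_keep = min(len(maj_idx), len(min_idx))
    kept_maj = rng.sample(maj_idx, n_keep)
    keep = sorted(kept_maj + min_idx)
    return [samples[i] for i in keep], [labels[i] for i in keep]

# Toy data: 90 majority and 10 minority samples, identified by their index.
X = list(range(100))
y = [0] * 90 + [1] * 10
X_rus, y_rus = random_undersample(X, y)
```

All minority samples survive, while 80 of the 90 majority samples are discarded; this illustrates both the saving in training size and the information loss discussed above.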

Random Oversampling
Random Over-Sampling (ROS) increases the number of samples of the minority class. ROS replicates minority samples until they are balanced in number with the majority class samples in the training data. A detailed analysis of over-sampling is given by Chawla et al., in which the importance of over-sampling is emphasized (Chawla et al., 2002). In contrast to RUS, ROS retains the existing information of the dataset. However, ROS has notable shortcomings: it needs more memory and a longer time to train the model because of the greater number of samples (Wang et al., 2008). Another study further pointed out that ROS creates an issue of overfitting due to the replication of minority instances, so that the model cannot generalize to new data (Ganganwar, 2012).
Although this technique has certain limitations, Liu recommended ROS as a very effective resampling procedure (Liu, 2004). That study suggested generating new minority instances from the existing training data rather than from a new training set, since the latter could bias the random selection of instances.
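In this study, minority samples are randomly generated until they become equal in number to the majority samples. For this purpose, $|D_{\text{min}}|(l - 1)$ minority samples are added to the training set, where $|D_{\text{min}}|$ is the number of minority samples in $D$ and $l$ is the resampling multiplier.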
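The count of added samples can be turned into a short sketch. It assumes that new minority samples are drawn with replacement from the existing minority samples and that label 1 marks the minority class; the function name `random_oversample` and the seed are illustrative.

```python
import random

def random_oversample(samples, labels, multiplier, minority=1, seed=0):
    """Append |D_min| * (multiplier - 1) replicated minority samples."""
    rng = random.Random(seed)
    min_idx = [i for i, y in enumerate(labels) if y == minority]
    n_new = round(len(min_idx) * (multiplier - 1))
    # Draw with replacement, so a minority sample may be copied several times.
    extra = [rng.choice(min_idx) for _ in range(n_new)]
    new_samples = list(samples) + [samples[i] for i in extra]
    new_labels = list(labels) + [labels[i] for i in extra]
    return new_samples, new_labels

# Toy data: 90 majority and 10 minority samples, so the imbalance ratio is 9.
X = list(range(100))
y = [0] * 90 + [1] * 10
# A multiplier equal to the imbalance ratio yields a fully balanced set.
X_ros, y_ros = random_oversample(X, y, multiplier=9)
```

With a multiplier of 9 the sketch adds 10 × (9 − 1) = 80 copies, giving 90 samples in each class. No original sample is lost, but the training set nearly doubles in size.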

SMOTE
SMOTE stands for Synthetic Minority Over-Sampling Technique. It was presented by Chawla et al. (2002). SMOTE creates a new sample by interpolating between existing minority samples that lie close together. For any original sample $x_i$, it randomly selects one or more of the $k$ nearest neighbors of $x_i$, interpolates between the existing sample and its neighbor, and thereby creates a new sample. More specifically, SMOTE takes the difference between $x_i$ and its nearest neighbor and multiplies this difference by a random number between 0 and 1. The result is added to the original sample $x_i$ to get a new sample $x_{\text{new}}$. This technique forces the decision region of the minority class towards the majority space, which can effectively reduce the problem of overfitting. Although SMOTE significantly improves performance on the minority class, it can hinder classifiers by assigning the same sampling rate to each neighboring instance of $x_i$. To overcome this problem, certain SMOTE-based studies have proposed assigning different weightings to the neighboring minority class instances of $x_i$, e.g. (Lu & Ju, 2011 and Ngai et al., 2011). SMOTE uses an additional parameter $k$, the number of nearest neighbors considered. For this study we have implemented SMOTE with k = 5.
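The following procedure is adopted in synthetically generating new minority samples.

1. Initialize a new set $S = \emptyset$.
2. For each minority sample:
i) Select the minority sample $x_i$.
ii) Find the $k$ nearest neighbors of $x_i$. Randomly choose any nearest neighbor and call it $\hat{x}_i$.
iii) Interpolate these two samples in the following way to find a new sample for $x_i$: $x_{\text{new}} = x_i + \delta\,(\hat{x}_i - x_i)$, where $\delta$ is a random number between 0 and 1.
iv) Label all new samples as minority class samples and add them to $S$.
3. Add the newly generated samples to the set, i.e. $r(D) = D_{\text{maj}} \cup D_{\text{min}} \cup S$, where $D_{\text{maj}}$ and $D_{\text{min}}$ are the majority and minority samples of $D$.

mas.ccsenet.org Vol. 14, No. 7; 2020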
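The interpolation procedure described for SMOTE can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: neighbors are found by Euclidean distance with a brute-force search, one synthetic sample is produced per minority sample by default, and the names `smote` and `n_per_sample` are hypothetical. Only k = 5 is taken from the text.

```python
import math
import random

def smote(minority, n_per_sample=1, k=5, seed=0):
    """Create synthetic minority samples by interpolating towards near neighbours."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are required")
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        # k nearest minority neighbours of x (Euclidean), excluding x itself.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: math.dist(x, p))
        neighbours = others[:k]
        for _ in range(n_per_sample):
            nn = rng.choice(neighbours)
            delta = rng.random()  # random number in [0, 1)
            # New sample lies on the segment between x and its neighbour.
            synthetic.append(tuple(a + delta * (b - a) for a, b in zip(x, nn)))
    return synthetic

# Toy minority class in the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (0.2, 0.8)]
new_points = smote(minority, n_per_sample=2)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already occupied by the minority class rather than duplicating existing points, which is the main difference from random over-sampling.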

Performance Measures
The most common measure in classification is accuracy. However, high accuracy does not imply that all the fraudulent transactions have been classified correctly. In credit card fraud detection, the cost of misclassifying a fraudulent transaction is far greater than that of misclassifying a legitimate one. As the accuracy of a classifier cannot characterize its performance on the minority class, it is considered a biased metric. Besides accuracy, other measures have been developed by the data mining community. Many of these metrics are based on the confusion matrix illustrated in Table 1. The confusion matrix is a 2x2 matrix with 4 elements described under:
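In the usual convention, the four cells of a binary confusion matrix are the true positives, false positives, true negatives and false negatives, with the fraudulent (minority) class treated as positive. The sketch below uses that convention to show why accuracy alone is misleading on skewed data. Recall is used only as one illustrative alternative measure, and the function names and toy predictions are assumptions rather than details from the study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) with `positive` as the fraud class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Share of all transactions that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def recall(tp, fp, tn, fn):
    """Share of fraudulent transactions that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# A classifier that labels every transaction as legitimate.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
rec = recall(*counts)
```

This trivial classifier reaches 95% accuracy while detecting none of the five fraudulent transactions, which is the bias that motivates measures built on the individual cells of the confusion matrix.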

Experimentation
Credit card fraud detection is an area of fraud detection that has been explored more intensively in recent years. The methods adopted to detect credit card fraud support automatic detection of fraudulent behavior among the given transactions. However, this domain is subject to some constraints that arise either naturally or from restrictions imposed by the financial institutions. Firstly, credit card datasets are heavily skewed (Juszczak et al., 2008 and He et al., 2008), and real datasets are mostly not provided by the financial institutions (Lu & Ju, 2011 and Ngai et al., 2011) because of customer privacy concerns. Also, the available datasets have a very low number of samples, which prevents the classifier from learning all the rules. In this paper we use 3 datasets. Two of them, the German Credit Card and Australian Credit Card datasets, have been taken from the UCI repository (Asuncion & Newman, 2010). These datasets have been used in most of the studies (Peng et al., 2011, West & Bhattacharya, 2016 and Li et al., 2013). The third dataset, Give Me Some Credit (GMSC), has been obtained from the Kaggle repository (www.kaggle.com), where it was used for a competition. All the datasets utilized in this paper contain different ratios of fraud and non-fraud transactions. For all datasets, 70% of the data is kept for training and validation while 30% is used for testing. As credit card datasets are extremely skewed, each training dataset $D$ is altered to produce datasets $D_i$, where $i = 1,2,3,4$. These datasets contain different ratios of fraud transactions, i.e. 5%, 10%, 20% and 30%. Each $D_i$ is further split into its corresponding majority and minority class instances, i.e. $D_{i,\text{maj}}$ and $D_{i,\text{min}}$, for application of the corresponding resampling method $r$.
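The text does not state how the training sets with 5%, 10%, 20% and 30% fraud are constructed. One plausible way, shown here purely as an assumption, is to keep every majority sample and draw a random subset of the minority samples so that the fraud share approximates the target. The function name `indices_for_fraud_rate` and the toy class counts are hypothetical.

```python
import random

def indices_for_fraud_rate(labels, fraud_rate, minority=1, seed=0):
    """Keep all majority samples plus enough minority samples to reach fraud_rate."""
    rng = random.Random(seed)
    maj_idx = [i for i, y in enumerate(labels) if y != minority]
    min_idx = [i for i, y in enumerate(labels) if y == minority]
    # Solve n_min / (n_min + n_maj) = fraud_rate for n_min.
    target = round(fraud_rate * len(maj_idx) / (1.0 - fraud_rate))
    target = min(target, len(min_idx))  # cannot keep more than are available
    return sorted(maj_idx + rng.sample(min_idx, target))

# Toy training set: 700 legitimate and 300 fraudulent samples.
labels = [0] * 700 + [1] * 300
rates = (0.05, 0.10, 0.20, 0.30)
subsets = {p: indices_for_fraud_rate(labels, p) for p in rates}
shares = {p: sum(labels[i] for i in idx) / len(idx) for p, idx in subsets.items()}
```

Each entry of `shares` lies within one percentage point of its target rate; the resulting index lists can then be split into majority and minority parts before a resampling method is applied.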
For the classification of the three datasets, we selected the two classifiers that have been explained in section 1. The classifiers SVM and DT are executed using default parameters. For SVM, a radial kernel was used. Classification is implemented using 10-fold cross-validation. This means that during the training phase the dataset is divided into 10 equal parts. Among these 10 parts, 9 are used to build the model while 1 part of the training data is used for validation.
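The 10-fold scheme can be sketched as an index generator. This is a minimal illustration that assumes a simple random partition without stratification; the function name `k_fold_indices` and the seed are illustrative choices rather than details from the study.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, validation

splits = list(k_fold_indices(100, k=10))
```

Each sample appears in exactly one validation fold, so every part of the training data is used once for validation and nine times for model building.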