
Cyber Data Analytics - Assignment 1

Mateusz Garbacz
Student number: 4571681

Bartosz Czaszyński

Student number: 4571894


Data Visualization

The dataset used for this task comprises 290,382 17-dimensional labelled data points belonging to three classes:
‘Chargeback’ (indicating that the transaction is fraudulent), ‘Refused’ (cancelled for some reason other than illegal
activity) and ‘Settled’ (a legal, accepted transaction). The class representation is greatly imbalanced, with only 0.1456%
of the data points belonging to the ‘Chargeback’ class. Transactions labelled ‘Refused’ are disregarded in the analysis,
as they indicate neither a legal nor an illegal transaction, and the precise reason for their cancellation is not known.
The main focus was put on the amount of money transferred, the currency in which the transfer was made and
the type of card used. These fields are suspected to be of high significance due to their direct impact on the
transaction. The initial steps taken to design a classification model that performs well on a significantly
imbalanced data set involved a preliminary analysis of possible relationships embedded in the data.
For this purpose, different visualization methods were used, the results of which are presented in Figures 1-3.

Fig 1. - Frequency histogram presenting the percentage contribution per currency to the number of fraudulent and settled transactions
The above graph clearly shows that two main currencies are used for fraudulent credit card transactions.
Illegal transactions are mostly made in Mexican pesos and Australian dollars, despite the fact that most of the
accounts present in the data set were registered in Britain. Furthermore, most of the legal, accepted payments
are made in pounds and much less often in other currencies. Therefore, it is safe to say that the use of a
currency other than the British pound can be a strong indicator of possible fraud and can constitute a
valuable descriptor.
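The percentage-contribution histogram of Fig 1 can be sketched as a per-class normalized frequency count. The snippet below uses a toy frame; the column names `currencycode` and `simple_journal` are assumptions about the data set, not confirmed by the report.

```python
import pandas as pd

# Toy stand-in for the transaction table; column names are assumed.
df = pd.DataFrame({
    "currencycode": ["GBP", "GBP", "MXN", "AUD", "GBP", "MXN"],
    "simple_journal": ["Settled", "Settled", "Chargeback", "Chargeback",
                       "Settled", "Settled"],
})

# Percentage contribution of each currency within each class.
pct = (df.groupby("simple_journal")["currencycode"]
         .value_counts(normalize=True)
         .mul(100)
         .unstack(fill_value=0))
print(pct)
```

Normalizing within each class (rather than over the whole data set) is what makes the rare ‘Chargeback’ class visible next to the dominant ‘Settled’ class.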

Fig 2. - Frequency histogram presenting the types of credit card used in both legal and illegal transactions.

Figure 2 indicates another significant relationship found in the data. There is a clear preference towards
four particular card types, of which three are used substantially more often for fraudulent activities.
Therefore, the “visadebit” card type is an indicator of a highly probable legal transaction, while “mccredit”
or “visaclassic” may indicate a fraudulent transaction.

Fig 3. - Weekly mean and standard deviation of the amount spent in euros on fraudulent (red) and settled (blue) transactions.

Firstly, it is visible that, in general, the fraudulent transactions have higher values than the legal ones. The
settled transactions have a stable mean over the whole measurement period and a low standard deviation compared
to the fraudulent ones. The non-legal transactions are characterized by high fluctuations in both the mean and the
standard deviation. A high standard deviation indicates a wide range of possible transaction values, most
probably caused by a few high-value transfers. In general, fraud values are slightly higher than legal ones; moreover, in
some weeks there are a few very high-valued fraudulent transactions, which makes them easily distinguishable from
allowed ones. Therefore, high-value transfers might indicate fraudulent activity.
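The weekly statistics of Fig 3 amount to a per-class resample of the amount column. A minimal sketch on synthetic data follows; the column names (`creationdate`, `amount_eur`, `label`) and the 5% fraud rate are illustrative assumptions, not the data set's actual schema or class balance.

```python
import numpy as np
import pandas as pd

# Synthetic transactions; names and fraud rate are assumed for the sketch.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "creationdate": pd.to_datetime("2015-01-01")
                    + pd.to_timedelta(rng.integers(0, 90, 500), unit="D"),
    "amount_eur": rng.exponential(50.0, 500),
    "label": rng.choice(["Settled", "Chargeback"], 500, p=[0.95, 0.05]),
})

# Weekly mean and standard deviation of the amount, separately per class.
weekly = (df.set_index("creationdate")
            .sort_index()
            .groupby("label")["amount_eur"]
            .resample("W")
            .agg(["mean", "std"]))
```

Plotting the two `mean` series with a `std` band, per label, reproduces the shape of Fig 3.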
It is important to point out that each of the aforementioned features carries information valuable for
discriminating between fraudulent and legal transactions. Nevertheless, taking into consideration the ratio of
available fraudulent data points to available legal ones, it is necessary to utilize the informative power of all
these descriptors together to obtain a satisfactory fraud classification success rate.

Imbalance Task

To test the impact of SMOTE, three different classifiers were chosen: linear SVM, logistic
regression and AdaBoost. Each of these was run without sampling as well as with a varying level of sampling (10%,
20% and 30%). In our analysis we focus mainly on the area under the curve (AUC), as well as on which ROC curve has the
highest True Positive Rate (TPR) while still keeping the False Positive Rate (FPR) below 0.1. A classifier with such
characteristics can be successfully used for credit card fraud detection while minimizing the number of falsely
accused clients. The ROC curve has been chosen for performance evaluation because it shows
how well the algorithms work in general (AUC), but also indicates a performance characteristic important
for the fraud detection problem, namely a low number of false positives.
The ROC curves for the linear SVM [Fig 4.] show a general performance improvement when SMOTE is applied. The value of the
area under the curve is slightly higher whenever the method is used, and the curve where the SMOTE ratio is set to
20% has the highest TPR for very low FPR values. Therefore, SMOTE would be beneficial for some threshold in
this case.
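The experiment above can be sketched as follows. Since the report does not include code, this is a hand-rolled minimal SMOTE (interpolation between minority nearest neighbours) evaluated at the same ratios on synthetic data; the real experiment used the assignment's features and a linear SVM as well.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE: interpolate between a minority sample and one of
    its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    rows = rng.integers(len(X_min), size=n_new)
    nbrs = idx[rows, rng.integers(1, k + 1, size=n_new)]  # col 0 is the point itself
    gaps = rng.random((n_new, 1))
    return X_min[rows] + gaps * (X_min[nbrs] - X_min[rows])

# Imbalanced synthetic data standing in for the transaction features.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for ratio in (0.0, 0.1, 0.2, 0.3):   # minority/majority ratio after sampling
    X_min = X_tr[y_tr == 1]
    n_new = max(0, int(ratio * (y_tr == 0).sum()) - len(X_min))
    X_res = np.vstack([X_tr, smote(X_min, n_new)]) if n_new else X_tr
    y_res = np.concatenate([y_tr, np.ones(n_new, int)]) if n_new else y_tr
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    fpr, tpr, _ = roc_curve(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"SMOTE ratio {ratio:.1f}: AUC = {auc(fpr, tpr):.3f}")
```

Crucially, the synthetic samples are added to the training split only; the ROC curve is always computed on untouched test data.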

Fig 4. - ROC curve for the Linear SVM classifier with different SMOTE ratios.

As far as logistic regression is concerned, the obtained AUC values are again higher for the classifiers utilizing SMOTE.
Moreover, the ROC curve with the highest TPR for the lowest FPR is the one for the 30% ratio. Thus, applying the
over-sampling method slightly improves the classification performance for some threshold in the fraud
detection task.
Fig 5. - ROC curve for the Logistic Regression classifier with different SMOTE ratios.

Fig 6. - ROC curve for the AdaBoost classifier (using 50 predictors) with different SMOTE ratios.

Lastly, the ROC curves for AdaBoost are shown in Figure 6. Clearly, the curve for the non-sampled
classifier has the highest AUC as well as the highest TPR for the lowest FPR. This indicates that for this classifier the SMOTE method
may deteriorate the results. Therefore, this classifier should not be used with this kind of sampling.

Based on the AUC results, the best classifiers to use are AdaBoost without sampling and Logistic Regression with
20% SMOTE sampling. However, before deciding on a specific classifier, one has to take into consideration that a
high TPR at a low FPR is the most important performance indicator for this kind of problem, which would require a
superimposed plot of the best ROC curves. Moreover, the AUC integrates over the entire curve, while the curve's shape
at high FPR values does not matter in this case. Finally, it is shown that SMOTE may improve the classification
results but requires careful tuning. We suppose that the results of the linear classifiers are boosted due to their
generality and low sensitivity to noise in the vicinity of the decision boundary, while AdaBoost starts overfitting on
the synthetic samples, which deteriorates its overall classification performance. Therefore, even if noisy
data points are oversampled, this does not influence the performance of the resulting linear classifier as much as it would
for nonlinear ones.
Classification Task
This section focuses on the training and evaluation of two classifiers for fraud detection. Our goal was to
train two classifiers: a white-box one and a black-box one. For both classifiers, the preprocessing and feature extraction
applied are the same; they are therefore described together in Section 1. Then, in Section 2, we focus on the training process
for each algorithm separately, and in Section 3 we evaluate the algorithms together. The tuning of the
training parameters and the evaluation have been done based on 10-fold cross-validation.
1. Preprocessing and Feature Extraction:
The first phase of data preparation aims at obtaining a more descriptive representation than the raw data. First,
the data is loaded and cases with the ‘Refused’ label are removed, as they do not indicate either of the classes. Cases
with a booking date later than the transaction date are also removed. Then, for each object a number of additional
features is extracted, namely:
- Accumulative numerical features: two sets of features, computed within a weekly and a monthly timeframe
from the current data point, are added. These include the number of transactions made, as well as the cumulative
and average transaction amounts, calculated for each unique MailID, IpID and CardID.
- The day of the week the transaction was made.
- The amount of money converted to the same currency.
The categorical features are then converted into a one-hot encoding, where each categorical value gets a separate binary
variable. This results in 318 numerical features, which are normalized to zero mean and unit standard deviation
to make sure that the classifiers and techniques applied (e.g. SMOTE or PCA) are not biased towards any of
the features.
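The steps above can be sketched on a toy frame. The column names (`card_id`, `amount_eur`, `currencycode`, `creationdate`) are assumptions, and only a 7-day window per card is shown; the report computes weekly and monthly windows per MailID, IpID and CardID.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw transactions; column names are assumed.
df = pd.DataFrame({
    "card_id": ["c1", "c1", "c2", "c1", "c2"],
    "amount_eur": [10.0, 25.0, 300.0, 5.0, 40.0],
    "currencycode": ["GBP", "MXN", "AUD", "GBP", "GBP"],
    "creationdate": pd.to_datetime(
        ["2015-01-01", "2015-01-03", "2015-01-04", "2015-01-20", "2015-02-01"]),
}).sort_values(["card_id", "creationdate"])

df["weekday"] = df["creationdate"].dt.dayofweek   # day-of-week feature

# Accumulative features over a trailing 7-day window, per card.
roll = (df.set_index("creationdate")
          .groupby("card_id")["amount_eur"]
          .rolling("7D"))
df["cnt_7d"] = roll.count().to_numpy()   # number of transactions
df["sum_7d"] = roll.sum().to_numpy()     # cumulative amount
df["avg_7d"] = roll.mean().to_numpy()    # average amount

# One-hot encode the categoricals, then normalize every column to
# zero mean and unit standard deviation.
X = pd.get_dummies(df.drop(columns=["card_id", "creationdate"]),
                   columns=["currencycode"]).astype(float)
X = (X - X.mean()) / X.std(ddof=0)
```

The `.to_numpy()` assignments rely on the frame having been sorted by `card_id` and `creationdate` first, so the grouped rolling output aligns row-for-row with the frame.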
2. Training​:
This section describes the different data handling and training processes applied in the white-box and black-box
classification. The aim of the white-box algorithm is to train a classifier which effectively detects fraud while
making it possible to explain to a customer why their transaction has been denied or accepted. The black-box
algorithm, on the other hand, does not have this requirement; it is therefore possible to apply feature extraction
methods like PCA for a more complex and efficient representation, where a lower number of features maintains
most of the variability of the data. In both the white-box and the black-box case we have selected the Logistic
Regression classifier due to its wide applicability in the field [1] and its training time efficiency. Moreover, the
classifier was shown to work well together with the SMOTE sampling method.
2.1 ​White-box algorithm:
Firstly, we apply the SMOTE algorithm to create synthetic samples. Then, given the large number of features
handled, we apply a linear classifier to prevent overfitting. We used a Logistic Regression model, which had the best results
out of the tested linear classifiers. We tested different oversampling parameters and ended up with 10%
oversampling of the minority class.
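What makes logistic regression white-box is that each feature's contribution w_i * x_i to the log-odds can be reported as the reason a transaction was flagged. The toy below illustrates this; the three features, their values and the training set are hypothetical, not the report's 318 features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: high amounts in MXN are fraud, visadebit is legal.
X = np.array([[20., 0, 1], [15., 0, 1], [900., 1, 0], [30., 0, 1],
              [700., 1, 0], [25., 0, 1], [850., 1, 0], [10., 0, 1]])
y = np.array([0, 0, 1, 0, 1, 0, 1, 0])
names = ["amount_eur", "currency_MXN", "cardtype_visadebit"]

clf = LogisticRegression(max_iter=1000).fit(X, y)

x_new = np.array([800., 1, 0])
contrib = clf.coef_[0] * x_new            # per-feature log-odds contribution
for name, c in sorted(zip(names, contrib), key=lambda t: -abs(t[1])):
    print(f"{name}: {c:+.3f}")
print("flagged as fraud:", bool(clf.predict(x_new.reshape(1, -1))[0]))
```

Sorting the contributions by magnitude gives a ranked, human-readable explanation of a single decision, which is exactly the customer-facing requirement stated above.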
2.2 ​Black-box algorithm:
For the black-box classifier we wanted to apply a more complex approach. Therefore, to prevent overfitting, we
extract a number of PCA components (50), which can be done without additional preprocessing thanks to the data
normalization. Then, we apply SMOTE oversampling with a ratio of 0.1 to make sure that the fraud class is not
underrepresented.
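The black-box feature path can be sketched as below. The data is synthetic (a stand-in for the 318 extracted features), and the SMOTE step is omitted for brevity; only the standardize-then-PCA stage from the report is shown.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 318))   # stand-in for the 318 features
y = np.zeros(1000, dtype=int)
y[:10] = 1                         # a small artificial "fraud" minority

# Standardize, reduce to 50 principal components, then classify.
pipe = make_pipeline(StandardScaler(), PCA(n_components=50),
                     LogisticRegression(max_iter=1000)).fit(X, y)
X50 = pipe[:-1].transform(X)       # the 50-dimensional representation
print(X50.shape)                   # (1000, 50)
```

Keeping PCA inside the pipeline ensures the projection is fitted on training folds only during cross-validation, avoiding leakage into the evaluation.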
3. Evaluation:
The classifier evaluation has been done using 10-fold cross-validation. We tested several parameter settings and
selected those yielding the highest TPR averaged over the folds for a maximum averaged FPR of 2.5%
(around 600 false positives per fold). The selected performance criteria are based on paper [1], which
mentions their use as standard metrics for a problem such as credit card fraud detection, due to the imbalanced class
representation within the data set.
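The selection criterion can be sketched as follows: per fold, take the best TPR whose FPR stays within the 2.5% budget, then average over the 10 folds. The data here is synthetic and the class balance illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data standing in for the transaction features.
X, y = make_classification(n_samples=4000, weights=[0.97], random_state=0)

tprs = []
for tr, te in StratifiedKFold(n_splits=10, shuffle=True,
                              random_state=0).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    fpr, tpr, _ = roc_curve(y[te], clf.predict_proba(X[te])[:, 1])
    tprs.append(tpr[fpr <= 0.025].max())  # best TPR within the FPR budget
print(f"mean TPR at FPR <= 2.5%: {np.mean(tprs):.3f}")
```

Stratified folds keep the rare fraud class represented in every test split, which an unstratified split of such imbalanced data would not guarantee.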

Model      Accuracy   Sensitivity   Specificity   Precision   F score
White-box  0.9744     0.4382        0.9751        0.026       0.050
Black-box  0.9745     0.4410        0.9753        0.027       0.051


Table 1.​ - Evaluation of the two produced classifiers

Based on Table 1, we can analyze the performance of the produced classifiers. First of all, the black-box
classifier performs slightly better on every metric, which might be a result of applying a more efficient feature
representation. Moreover, the classifiers have a very high accuracy; however, this metric is less representative in
the fraud detection problem due to its indifference to class weights. Nevertheless, both classifiers detect around
44% of fraudulent cases while having 2.5% of non-fraud cases classified as fraud (specificity 97.5%). Obviously, the
evaluation is highly dependent on the cost of misclassifying each class, which has not been given in the
assignment and could only be approximated. We therefore suppose that the system detects a large number of
fraud cases with a low false positive ratio. However, the precision is very low, meaning that out of the predicted
fraudulent cases only 2.6% are actually fraud. Finally, the F score is low as well, at the level of 5%, due to the low
precision. All in all, the systems detect a large number of fraudulent cases but have a rather low precision,
which is standard for such a class distribution. Again, it is hard to say how well our system works
without knowing the exact misclassification costs; however, it has been tuned to work well for the baseline of a 2.5%
false positive rate.
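The table's metrics follow directly from a confusion matrix. The counts below are illustrative, chosen only to roughly mirror Table 1's shape (high accuracy, ~44% sensitivity, low precision); they are not the assignment's actual numbers.

```python
def metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    sens = tp / (tp + fn)             # sensitivity = recall = TPR
    spec = tn / (tn + fp)             # specificity = 1 - FPR
    prec = tp / (tp + fp)             # precision
    f1 = 2 * prec * sens / (prec + sens)
    return acc, sens, spec, prec, f1

# e.g. 44 of 100 frauds caught, 1600 false alarms among 63900 legal cases:
acc, sens, spec, prec, f1 = metrics(tp=44, fp=1600, tn=62300, fn=56)
print(f"acc={acc:.4f} sens={sens:.4f} spec={spec:.4f} "
      f"prec={prec:.3f} f1={f1:.3f}")
```

Note how precision collapses even at 97.5% specificity: the majority class is so large that 2.5% of it still dwarfs the true positives, which is exactly the effect discussed above.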

References:
1. S. Bhattacharyya, S. Jha, K. Tharakunnel, J. C. Westland, ‘Data mining for credit card fraud: a comparative study’, ScienceDirect, 2010.
