
Final Project Report

By Xiaogang Dong
Stanford ID # 05638478

Abstract
The main goal of this project is to detect anomalous online credit card transactions. I try six
different classifiers: Naïve Bayes, Decision Tree, K-Nearest Neighbor, Support Vector
Machine, Random Forest, and AdaBoosting with Decision Tree. Random Forest performs
best in my experiments, so it is then fine-tuned to check whether a better performance can
be achieved.

1. Introduction
The goal of this project is to identify anomalous transactions among a large number of online
credit card transactions. A training set of 19 attributes and 94682 observations is provided.
The class labels are given in a separate file: 1 for anomalous and 0 for normal. In addition, a
test data set with 36019 observations is provided for the final evaluation of the project.

2. Data Observation
Both the training and test data files have been examined, and no missing data are found.
About 2.2% of the transactions in the training set are anomalous. There are a total of 19
attributes for each transaction. amount may be the amount of the transaction, and its
distribution is included in the following figures. Note that total is highly correlated with
amount: the correlation coefficient is about 0.9994217. Either hour1 or hour2 may be the
transaction time, and these two have a strong correlation, 0.994708. The histogram of hour1
is plotted below. state1 is likely the shipping (or billing) state, and zip1 is the corresponding
zip code with the last two digits masked. domain1 is likely the web domain from which the
transaction was made. The exact meanings of the remaining variables are hard to determine.
Some are binary: field2, flag1, flag2, flag3, flag4, indicator1, and indicator2. field1
has four levels, field4 has 38 levels, field5 has 26 levels, and flag5 has 36 levels. field3
has so many levels that it can be treated as a continuous variable.
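The correlations quoted above come from the standard Pearson formula (in R, cor() computes
this directly). As a language-neutral illustration, here is a minimal Python sketch; the sample
values for amount and total are made up for demonstration:

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# total tracks amount almost perfectly, so r should be close to 1
amount = [10.0, 25.0, 40.0, 5.0, 60.0]
total = [10.5, 25.0, 41.0, 5.0, 61.0]
```

A correlation this close to 1 suggests the two attributes carry nearly the same information,
which is why dropping one of them is tried later in Section 5.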
Ideally, we should check the correlations between variables to determine which variables to
use in classification. Pearson's chi-square test can be used to check the independence of
categorical variables. Given the limited time, these steps are skipped here; however, I do
trim some variables when I fine-tune the Random Forest classifier.
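To make the skipped step concrete: Pearson's chi-square statistic compares observed cell
counts of a contingency table with the counts expected under independence (in R, chisq.test()
does this). A minimal Python sketch, with made-up tables for illustration:

```python
def chi_square_stat(table):
    """Pearson chi-square statistic for a two-way contingency table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected count under independence of rows and columns
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# hypothetical 2x2 tables: a binary attribute (rows) vs. the anomaly label (columns)
independent = [[10, 20], [20, 40]]   # rows proportional -> statistic 0
dependent = [[30, 10], [10, 30]]     # rows differ -> large statistic
```

For a 2x2 table (one degree of freedom), a statistic above 3.841 rejects independence at the
5% level; here chi_square_stat(dependent) is 20.0, well above that threshold.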

STATS202

Xiaogang Dong

Stanford ID # 05638478

Final Report

Page 1 of 7


3. Solution Evaluations
I try six different classifiers in this project. To evaluate them, the training data are divided
into five folds and cross validation is performed, as follows. First, the training data are split
into two groups: those with positive labels (label 1) and those with negative labels (label 0).
The positive observations are randomly divided into five folds of approximately equal size,
and so are the negative observations. The first fold of positive data and the first fold of
negative data are combined into Group 1; similarly, the rest of the data form Groups 2 to 5.
Group 1 is then held out as the test data, with the other four groups as the training data.
After testing, each transaction in Group 1 is assigned a probability of being positive. The
transactions in the top 20% by this probability are taken out, and the percentage of true
positives among them is calculated from the ground-truth labels. The same procedure is
repeated for Groups 2 to 5. The average percentage of true positives among the top 20% of
transactions is used as the final metric for performance evaluation.
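The procedure above can be sketched in a few lines. This is a hypothetical Python rendition
(the actual experiments use R) of the stratified five-fold split and the top-20% metric;
function names are made up:

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Assign each index to one of k folds, splitting positives and negatives
    separately so every fold keeps roughly the same class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (1, 0):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def top20_true_positive_rate(probs, labels):
    """Percentage of true positives among the 20% of cases ranked most positive."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = order[: max(1, len(order) // 5)]
    return 100.0 * sum(labels[i] for i in top) / len(top)
```

Each fold in turn plays the held-out group; the final score is the mean of
top20_true_positive_rate over the five folds.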

4. Candidate Solutions and Data Selection


In this report, I always feed as many attributes as possible to the candidate classifiers.
Skipping some attributes might improve the performance, but due to the limited time I only
try that on the Random Forest classifier; ideally it should be tried on all methodologies to
fully explore their potential. The following table lists the percentage of true positives among
the top 20% most positive-likely transactions for each classifier; the implementation details
of each classifier are then described below.
Classifier                Setting                     Percentage (%)
Naïve Bayes               w/o Laplace smoothing        6.991599
                          with Laplace smoothing       7.757317
Decision Tree             all data                     4.789717
                          all data except domain1      4.610169
K-Nearest Neighbor        K=1                          4.995667
                          K=3                          1.753231
                          K=5                          1.446945
Support Vector Machine    radial kernel                6.379245
                          linear kernel                3.216024
Random Forest             number of trees = 500       10.176170
AdaBoosting               with Decision Trees          6.199699

a) Naïve Bayes Classifier

The Naïve Bayes classifier is probably the most straightforward of the six, and it can deal
with both numerical and categorical attributes. In this project, I use the function naiveBayes()
from the package e1071. An important aspect of the Naïve Bayes classifier is Laplace
smoothing, which helps deal with unseen and rarely occurring attribute values. The table
above lists the results both without and with Laplace smoothing. In conclusion, Laplace
smoothing significantly improves the classification result, by about 0.77 percentage points
(from 6.99% to 7.76%).
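To illustrate what the smoothing does (a generic sketch of additive smoothing, not e1071's
internals): without it, an attribute level never seen with a given class gets probability zero
and vetoes the whole product of likelihoods; adding a pseudo-count alpha avoids that. The
example counts below are made up:

```python
def laplace_prob(counts, level, n_levels, alpha=1.0):
    """Smoothed estimate of P(level | class): (count + alpha) / (n + alpha * n_levels)."""
    n = sum(counts.values())
    return (counts.get(level, 0) + alpha) / (n + alpha * n_levels)

# hypothetical counts of a 3-level attribute among anomalous transactions
counts = {"com": 7, "edu": 3}  # "org" never observed in this class
```

laplace_prob(counts, "org", n_levels=3) is 1/13 ≈ 0.077 rather than 0, so a single unseen
level no longer zeroes out the class posterior.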
b) Decision Tree Classifier

A single Decision Tree generally performs less well than ensemble methods. I still include
its results here as a benchmark for the ensemble methods Random Forest and AdaBoosting.
One advantage of Decision Tree is that it can handle both numerical and categorical
variables. I use rpart() from the package rpart to perform the classification. The default
setting of rpart() gives a percentage of 4.79%. Examining the generated trees, I find that
they use the variable domain1 heavily. To investigate further, I exclude domain1 from the
data; this gives a slightly worse result, about 4.61%.
c) K-Nearest Neighbor Classifier

K-Nearest Neighbor is another classifier popular in practice. I use knn() from the package
class to implement it. The key issues here are defining the distance and scaling the
variables. Defining appropriate distances for the categorical variables might improve the
classification performance; for simplicity, however, the two categorical variables state1 and
domain1 are discarded. Strictly speaking, zip1 is also categorical, but it is treated as
numerical here since zip codes roughly reflect the distances between different areas. The
variables are standardized, i.e. each variable has zero mean and unit variance after scaling.
The percentage of true positives is about 5.00% when k=1. Because true positives make up
only about 2.2% of the data, any given positive transaction is likely to be surrounded by
negatives and thus labeled negative. This is probably why the K-Nearest Neighbor classifier
does not work well here, as further evidenced by the dramatic performance drops when k=3
and k=5.
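The scaling and neighbor-vote steps look roughly like this in a stdlib-only Python sketch
(knn() in R handles the neighbor search itself; the function names here are made up):

```python
import math
from statistics import mean, pstdev

def standardize_columns(rows):
    """Rescale each column of a numeric matrix to zero mean and unit variance."""
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c) or 1.0) for c in cols]  # guard against zero variance
    return [[(v - mu) / sd for v, (mu, sd) in zip(row, stats)] for row in rows]

def knn_predict(train_rows, train_labels, query, k=1):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(range(len(train_rows)),
                     key=lambda i: math.dist(train_rows[i], query))[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```

With only about 2.2% positives, the k=3 and k=5 votes are usually dominated by negatives,
which matches the performance drop observed in the table.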
d) Support Vector Machine Classifier

The Support Vector Machine classifier is also very popular. It faces the same issues as
K-Nearest Neighbor, namely defining the distance and scaling the variables. For exactly the
same reasons as before, state1 and domain1 are discarded and zip1 is treated as a
numerical attribute. svm() in the package e1071 is used for the actual classification; it
handles variable scaling internally. Two kernels, radial and linear, are tried. The radial
kernel gives the better performance, a percentage of about 6.38%.
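The two kernels differ in the similarity measure they induce. The radial kernel, sketched
below purely for illustration (svm() computes this internally), decays with squared Euclidean
distance and can therefore carve out non-linear decision boundaries, while the linear kernel
is just a dot product:

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Radial (Gaussian) kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def linear_kernel(x, z):
    """Linear kernel: plain dot product."""
    return sum(a * b for a, b in zip(x, z))
```

Identical points have radial similarity 1, and the similarity falls off smoothly as points move
apart, which is what lets the radial kernel fit the non-linear structure here better than the
linear one.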


e) Random Forest Classifier

Random Forest is a very useful ensemble method built on the Decision Tree classifier. I
choose randomForest() in the package randomForest for the actual classification. Being
tree-based, Random Forest should in principle work with both categorical and numerical
variables. However, randomForest() can only handle categorical variables with at most 32
levels, so state1 and domain1 are both discarded for that reason, and zip1 is again
treated as a numerical variable. An alternative way of dealing with this restriction is to
group the less frequent levels into a single level so that the total number of levels stays
within 32; this might improve the performance, but I do not try it due to the limited time.
To work around a memory issue caused by randomForest(), all decision trees generated
during the classification process are discarded by setting keep.forest=FALSE. The Random
Forest classifier obtains the best result among all six classifiers, a percentage of about
10.18%.
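The level-grouping workaround mentioned above is simple to sketch (a hypothetical helper,
not part of the randomForest package):

```python
from collections import Counter

def cap_levels(values, max_levels=32, other="OTHER"):
    """Merge all but the most frequent levels into a single catch-all level,
    so a factor fits within randomForest()'s 32-level limit."""
    keep = {v for v, _ in Counter(values).most_common(max_levels - 1)}
    return [v if v in keep else other for v in values]
```

Applied to domain1, this would keep the 31 most common domains and lump the long tail
into "OTHER", trading some information for the ability to use the variable at all.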
f) AdaBoosting with Decision Trees

AdaBoosting is another popular ensemble method; here Decision Tree is used as the weak
classifier, and I use the R code from the lecture notes for the implementation. An interesting
problem is how to evaluate the classification error during the weight re-adjustment process.
The rule used in the lecture notes (label as positive when the probability exceeds 0.5) does
not work well here. Since the total number of true positives in the training set is known
when performing cross validation, I instead flag that same number of transactions, namely
those with the highest probabilities of being positive, as positive. The classification error
then counts both false positives and false negatives. There may be better ways to evaluate
the classification error, but I do not explore further due to the limited time.
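The reweighting step at the heart of AdaBoost is worth spelling out. This is a generic
sketch of the standard update, not the lecture-note code itself:

```python
import math

def adaboost_reweight(weights, correct):
    """One AdaBoost round: compute the classifier weight alpha from the weighted
    error, up-weight misclassified points, down-weight correct ones, renormalize."""
    err = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
    alpha = 0.5 * math.log((1 - err) / err)
    updated = [w * math.exp(-alpha if c else alpha)
               for w, c in zip(weights, correct)]
    norm = sum(updated)
    return [w / norm for w in updated], alpha

# four equally weighted points, one misclassified by the current weak learner
new_w, alpha = adaboost_reweight([0.25] * 4, [True, True, True, False])
```

After one round the single misclassified point carries half the total weight
(new_w[3] is 0.5), forcing the next tree to focus on it; this is what makes the choice of
error evaluation discussed above so consequential.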

5. Fine Tuning
After identifying Random Forest as the best classifier, I tweak both its parameters and the
input data to see whether the performance improves.

a) the number of trees generated

Increasing the number of trees slightly improves performance, but as a trade-off the running
time grows approximately linearly with the number of trees. For the further discussion, I
keep the default of 500 trees in randomForest().


# of Trees    Percentage (%)
100           10.03359
300           10.14977
500           10.17617
700           10.21314

b) perform bias correction

randomForest() has an option corr.bias that performs a bias correction. Setting it to TRUE,
I obtain a slightly better result than the default setting: a percentage of 10.21842%.

c) omit one variable

I also try omitting one variable at a time from the input data to see whether the performance
improves. In cross validation, discarding either amount or total, which are highly correlated,
improves the performance. However, the improvement is not large enough to guarantee a
better result on the test set: I actually get a slightly lower score (4.227) on the test set
when dropping total. Due to the limited time, I do not pursue variable selection further.
Omitted Variable    Percentage (%)
none                10.17617
amount              10.19201
hour1               10.19202
zip1                10.03887
field1               9.93853
field2              10.17617
hour2               10.17617
flag1               10.15505
total               10.20258
field3               9.83820
field4              10.04943
field5              10.18145
indicator1          10.16033
indicator2          10.19729
flag2               10.17617
flag3               10.10224
flag4               10.20258
flag5               10.21842
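The omit-one-variable sweep in the table amounts to scoring each column-dropped variant
of the data with the cross validation of Section 3. A minimal sketch of the data manipulation
(attribute list abbreviated and values made up; the scoring step itself is omitted):

```python
def omit_column(rows, j):
    """Return a copy of the data matrix with column j removed from every row."""
    return [row[:j] + row[j + 1:] for row in rows]

names = ["amount", "hour1", "zip1"]          # abbreviated attribute list
rows = [[10.0, 5, 943], [20.0, 6, 100]]      # made-up observations

# one reduced data set per omitted attribute, each to be scored by cross validation
variants = {name: omit_column(rows, j) for j, name in enumerate(names)}
```

Each variant would then be fed through the same five-fold evaluation, producing one row of
the table above.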


6. Submission to UCSD website

My test results are submitted under the team name dongxiaogang and campus Sony
Electronics. A total of four test results are submitted. The first one, under the all-default
settings of randomForest(), gets a score of 4.247. The second, with corr.bias=T, gets a
score of 4.267. The third, with corr.bias=T and total dropped, gets a score of 4.227. The
last one, with corr.bias=T, total dropped, and an added binary variable indicating whether
total equals amount, gets a score of 4.273. Further tweaking the variables and increasing
the number of trees might improve the performance slightly, but the improvement won't be
significant.

