
Credit Card Transaction Fraud Detection

Niall Adams
Department of Mathematics, Imperial College London

January 2009

Obligatory contents slide


My objective is to report some of our recent work on fraud detection, mostly sponsored under the EPSRC ThinkCrime initiative.
1. Transaction fraud problem, process, challenge
2. Supervised and unsupervised approaches (and data manipulation)
3. Combining methods
4. Streaming approach to handle change
Collaborators: David Hand, Dave Weston, Chris Whitrow, Piotr Juszczak, Dimitris Tasoulis, Christoforos Anagnostopoulos.
Collaborating banks: Abbey National, Alliance and Leicester, Capital One, Lloyds TSB

Another plastic card fraud anecdote...

Fraud is a piece of cake? Two very hungry German couriers ate a fruit cake destined for a German newspaper and in its place mailed a box of credit card data. The data, including names, addresses and card transactions, ended up at the Frankfurter Rundschau daily. The mix-up triggered an alarm, and police advised credit card customers with Landesbank Berlin to check their accounts for inconsistencies. Fruitcake must be different in Germany for people to want to use it as something other than a paperweight. (from slashdot)

Transaction fraud
Plastic-card fraud is a serious problem: losses to fraud in the UK in the first half of 2008 amounted to 307 million pounds. ...and once upon a time, this sounded like a lot of money... The problem is getting worse: (Source: www.cardwatch.org.uk)

There seems to be a perception that there is a fundamental background level of fraud. Losses to fraud are absorbed by customers, merchants, lenders. Thus the industry is very concerned with determining chargeback responsibility.

Types of fraud

Transaction fraud is often grouped into a number of categories, including:
Counterfeit. Creating duplicate cards. Skimming refers to reading a card's magnetic strip to create a duplicate card.
Mail non-receipt fraud. Cards are intercepted in the post and used without permission. Extra effort may be required by the fraudster to obtain other information, perhaps by phishing.
Card-not-present fraud. Includes phone, internet and mail order fraud. Chip & PIN simply shifted the problem.

The nature of fraud attacks is changing

Online banking fraud is dramatically increasing: first half of 2008, 21.4 million pounds, up 185%! (source: APACS). Mostly phishing incidents, but increasingly also money mule adverts.

(Source: www.cardwatch.org.uk) Fraudsters change tactics to adapt to bank security measures (e.g. HSBC now checking all transactions): an arms race. The fraud population's behaviour is changing, but is the legitimate population's?

Transaction stream

Plastic card transaction processing uses a very complicated IS infrastructure (e.g. Visa Europe processes 6000 transactions a second) to connect banks and merchants. Processing requirements include speed: fraud filtering while minimizing false positives.

Schematic processing path

Challenges I

In this talk we will explore methods that could stand in for the fraud filter, or operate immediately after it. We will have to use a modified performance metric, and note that our data is subject to selection bias due to the fraud filter.
Temporal aspects:
each account consists of an irregularly spaced sequence of (complicated) transaction records
need for rapid processing
shifting fraud tactics
fraud identification is delayed

Challenges II

Population and system factors:
Imbalanced data sets (P(fraud) << 1%)
Fraud behaviour can look legitimate
Definition of fraud (bad debt book)
Legacy systems
How to measure performance of a fraud detector (2 imbalanced classes plus time and cost aspects)

Approaches
Most existing fraud filters are relatively simple supervised predictive models, based on very carefully selected variables. For example, FALCON is (essentially) a logistic regression on a large set of variables. Fundamentally, we can consider two approaches:
supervised learning - using the transaction fraud labels. A population approach.
unsupervised learning - has a customer departed from normal behavior? An account-level approach.

We will consider these approaches, and hybrids. Taking a tool-based approach means that different approaches need different features.

Superficially: Supervised
use known fraud labels, so possibly resistant to unusual non-fraud transactions
implemented on a window, so using older fraud transactions, and not immediately responsive to each account
decision threshold specification straightforward (in principle)

Unsupervised
respond to every transaction on an account
capacity to respond to new types of fraud, since not modeling known frauds
risk of higher false positive rate
setting some parameters less straightforward (in principle)

Data
A typical transaction record has more than 70 fields, including:
transaction value
transaction time and date
transaction category (payment, refund, ATM, mobile top-up, etc.)
ATM/POS indicator
merchant category code - a large set, ranging from specific airlines to massage parlours
card reader response codes
The fundamental problem is to select which data to extract. Moreover, different (supervised/unsupervised) tools will handle transactions differently.

Performance assessment
Superficially, fraud detection looks like a two-class classification problem (fraud versus non-fraud) for which a suitable measure is AUC (area under the ROC curve). However, AUC integrates over allocation costs. Moreover, there is a temporal aspect related to timeliness of detection. Suppose that the cost of investigating a case is 1 unit; both TP and FP incur this. Estimates from a collaborating bank suggest a missed fraud (FN) costs 100 such units. We construct a measure, TC, that accounts for the number of fraud and non-fraud transactions on an account, while deploying this cost information. Subtle arguments in Hand et al. (2008) show that this summary can be derived from an operating characteristic curve modified to account for the temporal ordering of transactions.
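The 1-versus-100 unit cost structure above can be made concrete with a small sketch (the costs are from the slide; the function name and interface are my own, and this is the raw cost, not the account-weighted TC measure itself):

```python
def investigation_cost(n_tp, n_fp, n_fn, c_investigate=1.0, c_miss=100.0):
    """Total cost of a detector's decisions under the stated cost model.

    Every flagged case (true or false positive) costs one investigation
    unit; every missed fraud (false negative) costs c_miss units.
    """
    return (n_tp + n_fp) * c_investigate + n_fn * c_miss

# A detector that flags 50 cases (40 of them genuine frauds) and misses 5 frauds:
print(investigation_cost(n_tp=40, n_fp=10, n_fn=5))  # 550.0
```

With these costs, a single missed fraud outweighs a hundred wasted investigations, which is why AUC alone (which integrates over all cost ratios) is a poor summary here.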

Supervised methods
Perhaps the most natural approach - transactions are ultimately labeled as fraud or non-fraud - is two-class classification. There are many possible methods for this, ranging from logistic regression to support vector machines. The question is how to pre-process (e.g. alignment issues) the transaction database for presentation to the supervised learner. We explored the approach of transaction aggregation: transforming transaction-level data to account-level data.

x_i^{(j)} - fixed-length vector extracted from account i's jth transaction.

y_i = psi(x_i^{(1)}, ..., x_i^{(n)})

This is the activity record for account i, based on n sequential transactions. psi is the transformation, which we restrict to be insensitive to the order of its arguments.

We selected variables for x using expert advice and extensive exploratory data analysis, to explore the relationship between variables and the fraud label. Variables included:
number of POS transactions
value of POS transactions
transactions identified by magnetic strip
simplified merchant category codes
The function psi was tailored to compute various counts and averages (again, using extensive exploratory analysis - which seems hard to escape).
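As a sketch of an order-insensitive aggregation psi of the kind described above (the field names and the particular counts and averages are illustrative, not the 67 variables actually used):

```python
import statistics

def aggregate(transactions):
    """Order-insensitive activity-record features from a list of
    transaction dicts with keys 'amount', 'is_pos', 'is_magstripe'."""
    pos = [t for t in transactions if t["is_pos"]]
    return {
        "n_pos": len(pos),                       # number of POS transactions
        "value_pos": sum(t["amount"] for t in pos),
        "n_magstripe": sum(t["is_magstripe"] for t in transactions),
        "mean_amount": statistics.mean(t["amount"] for t in transactions),
    }

txns = [
    {"amount": 25.0, "is_pos": True, "is_magstripe": False},
    {"amount": 60.0, "is_pos": True, "is_magstripe": True},
    {"amount": 15.0, "is_pos": False, "is_magstripe": False},
]
print(aggregate(txns))
```

Because every feature is a count, sum or mean, permuting the transactions leaves the activity record unchanged, which is exactly the order-insensitivity restriction placed on psi.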

To illustrate, using 5 transactions

Various tricks were used to handle time-of-day data. Note that these types of approaches are reasonably standard in the industry. We end up with a maximum of 67 variables.

If any transaction in an activity record is labeled as fraud, then we deem all transactions in the record fraud. We fix the number of days in the activity record across the population, thereby inducing variable numbers of transactions per account. We experimented with the following classifiers to explore the impact of this length, considering activity records of 7 days, 3 days, 1 day, and 1 transaction:
Logistic regression
Naive Bayes (all variables binned)
QDA (with some covariance regularization)
SVM with Gaussian RBF kernels, kernel width and regularization parameter set by experimentation
Random forests, using 200 bootstrap samples, and 10 variables tried at each split
CART, K-NN (both with some further tinkering)

To recap, we occupy a feature space with activity records, of length 1, 3 and 7 days, built using consecutive windows. Each object in this space has a fraud label, and we use a variety of classifiers, of various expressive power, to make predictions. These methods are deployed on real data samples from commercial collaborators, consisting of tens or hundreds of millions of transactions. We try to use the data fairly, so we quote out-of-sample predictions respecting the temporal ordering of the data.

Bank A

TC = 0.09 corresponds to guessing. Standard error approx. 0.001 (bootstrap). In general, longer records are better. Best performance from the random forest.

Bank B

Mostly, longer activity records are better. Random forests are again the best method. Note the different performance on the two banks: different customer bases. Mixing different length records would be of interest, and remains for future work.

Unsupervised Methods: Account level anomaly detection


Attempt to detect departure from normal behavior. Construct a density estimate over suitably defined transaction features, then flag new transactions with low estimated density. Three accounts, difference in time of consecutive transactions:
[Figure: density estimates (pdf vs. inter-transaction time in seconds) for three accounts, based on 30, 63 and 53 transactions respectively.]

All these accounts are (?) legitimate. The account-level approach avoids the need to handle this source of heterogeneity.

We consider a two-stage approach.
1. Estimation stage - accumulate enough transactions to construct a model of normal behaviour. We use a fixed number, but this is a free parameter.
2. Operational stage - use the model of behaviour to flag transactions as normal or abnormal. Treat abnormal as fraud.
Generic issues to handle: choice of model, choice of threshold, method of handling the temporal nature of the data.

For account i, we have the transaction sequence X_i = {x_t | x_t in R^N, t = 1, 2, ...}. Here, we have chosen to represent the transaction record as a collection of continuous variables. Of course, other options are possible. Some trickery is required to handle categorical variables like merchant category codes.

For a specific account, suppose we have legitimate transaction data X; then our detector for a new transaction x is

h(x | X, theta) = I(p(x | X, theta) > tau) = { 1 : x is classified as legitimate; 0 : x is classified as fraud }

Here p() is a density estimate (the model), and theta refers to control parameters for the model. tau is the alert threshold. This is difficult to set without context, but one possibility relates it to the maximum proportion of flagged cases that we can afford to investigate.
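A minimal one-dimensional version of the detector h, using a Gaussian kernel density estimate as the model p (the bandwidth and alert threshold here are illustrative choices, not the ones tuned in the study):

```python
import math

def kde(x, data, bw=1.0):
    """1-D Gaussian kernel density estimate p(x | X) with bandwidth bw."""
    return sum(
        math.exp(-0.5 * ((x - xi) / bw) ** 2) / (bw * math.sqrt(2 * math.pi))
        for xi in data
    ) / len(data)

def detect(x, data, tau=0.01, bw=1.0):
    """h(x | X): 1 = legitimate (density above threshold tau), 0 = flag as fraud."""
    return 1 if kde(x, data, bw) > tau else 0

history = [20.0, 22.0, 25.0, 21.0, 24.0]   # typical amounts for this account
print(detect(23.0, history))   # near the bulk of past behaviour -> 1
print(detect(500.0, history))  # far from anything seen before -> 0
```

In practice tau would be set so that the expected proportion of flagged transactions matches the investigation capacity, as suggested on the slide.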

[Figure: two accounts, transaction amount (pence, up to 20000) vs. time of day (seconds).]

Models
We explored many possibilities, including:
Kernel density estimate (Parzen)
Naive Parzen (NParzen)
Mixture of Gaussians (MoG)
Gaussian (Gauss)
nearest neighbour (1-NN)
etc., etc.
Control parameters are difficult: various procedures, or arbitrarily fixed. ATM and merchant type are represented as distances and modelled with a linear programming data description. Essentially, find the distance of each point from a representative plane, and transform it to have the character of a probability.

Features
We represent the jth transaction as:
amount
amount difference
time
time difference (a crude method of incorporating some temporal structure)
merchant type *
ATM-id *
* - categorical variables. Of course, the selection of variables could be optimised, but this might be impractical in the streaming context.

Some results

Same data sets as before, but different features, so we avoid direct comparison with the supervised results. Performance order, two banks, two measures (TC and AUC):
[Table: one-class methods ranked by performance curve and ROC for banks D1 and D2. SVDD and MST rank first and second on both banks and both measures; NParzen, 1-NN, SOM and Gauss occupy the middle ranks; MoG, Parzen and MPM rank worst.]

Supervised classifiers built on this data exhibit similar performance.

It is of interest to examine performance into the future. Build a supervised classifier on the same data as the unsupervised one, then examine performance into the future (fixed costs, false positive rates):
[Figure: three panels (Parzen, 1-NN, SVM), false positive rate vs. month (months 2-6), comparing one-class and two-class classifiers.]
Evidence that the account-level approach degrades more gracefully over time.

Unsupervised Methods: Peer group analysis

Peer group analysis (PGA) is a new method, attempting to use more than just a single account's data for anomaly detection. Premise: some accounts exhibit similar behaviour (i.e. follow similar trajectories through some feature space). Use the anomaly concept, but incorporating the behaviour of similar accounts. Two-stage process: (1) learn the identity of similar accounts (temporal clustering); (2) anomaly detection over similar accounts. Lots of implementation issues! One instantiation was not competitive with the previous approaches, but it does identify objects that are not simply population outliers.

Combination

We cannot practically run different detectors in parallel. Combination is essential, and may perhaps yield improved performance. Again, different approaches to combination are possible. Perhaps the most elegant is to incorporate unsupervised scores into the supervised method, but this is technically and practically difficult. Instead, we consider the output of each detector, and consider how to combine them. For each transaction we have a score from each of a random forest, an SVM-based anomaly detector, and an instantiation of PGA. Normalize all scores to have the character of P(fraud).

With each transaction represented by three scores (three variables), one from each detection sub-system, we can consider different sorts of combiner:
Ad hoc: max
Supervised: logistic regression, naive Bayes, K-NN

Build all sub-systems on the first part of the data, and predict on the second. Note: no model updating.

Method            Loss % (±0.1)   AUC % (±0.1)
Random forest     8.63            68.5
SVM anomaly       9.08            54.8
PGA               9.08            32.8
logistic          8.14            67.9
NB                7.23            87.9
133-NN (2 var*)   7.22            88.2
123-NN (3 var)    7.04            88.5

* - not using PGA; K selected by CV study. Strikingly, PGA, which has no standalone merit, may add a little to the performance of a suitably constructed combiner.
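The supervised-combiner idea can be sketched as follows: treat the three sub-system scores as a three-variable input and fit a simple combiner on labeled data. Here a hand-rolled logistic regression trained by gradient descent stands in for the combiners in the table; the scores and labels are toy data, not the banks':

```python
import math

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic-regression weights on score vectors X and 0/1 labels y."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def combine(scores, w, b):
    """Combined P(fraud)-like score from (forest, anomaly, PGA) scores."""
    return 1.0 / (1.0 + math.exp(-(sum(wj * s for wj, s in zip(w, scores)) + b)))

# Toy scores: (random forest, SVM anomaly, PGA); label 1 = fraud
X = [[0.9, 0.8, 0.6], [0.8, 0.7, 0.7], [0.2, 0.3, 0.4], [0.1, 0.2, 0.3]]
y = [1, 1, 0, 0]
w, b = fit_logistic(X, y)
print(combine([0.85, 0.75, 0.65], w, b) > 0.5)  # True: resembles the frauds
```

Even a weak third score (like PGA's) can shift the fitted weights enough to improve the combined decision, which is the effect observed in the table.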

Combination strategies can certainly provide improved performance. We are still working out why: one point is that PGA works on histories with frequent transactions, while account-level detection is better for infrequent transactions. So, tools can be put together, but we have ignored the issue of change over time. All these methods have been built on static windows of data. This is consistent with the industry norm: build the detector, monitor performance, rebuild when performance is deemed to have degraded too far. Clearly, a static window gives some capacity to handle changing populations (old data is not relevant). But there may be a better way to do it...

Temporal adaptation - current work


Consider the problem of computing the mean vector and covariance matrix of a sequence of n multivariate vectors. Standard results say this computation can be implemented as a recursion:

m_t = m_{t-1} + x_t,   mu_t = m_t / t,   m_0 = 0    (1)
S_t = S_{t-1} + (x_t - mu_t)(x_t - mu_t)^T,   Sigma_t = S_t / t,   S_0 = 0    (2)

After n steps, this would give the equivalent offline result. If we are monitoring vectors coming from a non-stationary system, simple averaging of this type is biased. If we knew the precise dynamics of the system, we would have a chance to construct an optimal filter. However, we do not.

One approach to tracking the mean value would be to run with a window. Alternatively, we can use ideas from adaptive filter theory, and incorporate a forgetting factor, lambda in (0, 1], in the previous recursion:

n_t = lambda * n_{t-1} + 1,   n_0 = 0    (3)
m_t = lambda * m_{t-1} + x_t,   mu_t = m_t / n_t    (4)
S_t = lambda * S_{t-1} + (x_t - mu_t)(x_t - mu_t)^T,   Sigma_t = S_t / n_t    (5)

lambda down-weights old information more smoothly than a window. n_t is the effective sample size, or memory. lambda = 1 gives the offline solutions, and n_t = n. For fixed lambda < 1, the memory size tends to 1/(1 - lambda) from below.
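A sketch of the fixed-forgetting recursion (3)-(5) for a univariate stream (the vector case replaces the squared deviation with an outer product); lambda = 1 recovers the offline mean:

```python
def forgetful_stats(stream, lam=0.95):
    """Track mean and variance with exponential forgetting factor lam.

    n is the effective sample size; for lam < 1 it tends to 1/(1 - lam).
    """
    n = m = s = 0.0
    for x in stream:
        n = lam * n + 1.0            # (3) effective sample size
        m = lam * m + x              # (4) forgetful sum
        mu = m / n
        s = lam * s + (x - mu) ** 2  # (5) forgetful squared deviations
        yield mu, s / n

# With lam = 1 the running mean equals the offline mean:
means = [mu for mu, _ in forgetful_stats([1.0, 2.0, 3.0, 4.0], lam=1.0)]
print(means[-1])  # 2.5
```

With lam < 1 the estimate tracks level shifts: after a jump in the stream, the old observations are geometrically discounted rather than dropped all at once as with a window.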

Setting lambda
Two choices for lambda: a fixed value, or variable forgetting, lambda_t.
Fixed forgetting: set by trial and error.
Variable forgetting: a result from Haykin (1997) (from adaptive filter theory) says tune lambda_t according to a local gradient descent rule:

lambda_t = lambda_{t-1} - alpha * grad_lambda(epsilon_t^2),   epsilon_t: residual error at time t, alpha small    (6)

Amazingly, using results from numerical linear algebra, this framework can still yield efficient updating rules. Performance is very sensitive to alpha. Very careful implementation is required. We are exploring extending this idea, to construct a framework for sequential likelihood estimation with forgetting.

Illustration
Tracking mean and covariance in 2d

change detection properties, two fixed values of lambda, 5D, abrupt change


[Figure: reaction of the gradient at an abrupt change at t = 1000, for lambda fixed at 0.99 and at 0.95.]

Streaming classier
Since we have a method for adaptively and incrementally estimating mean vectors and covariance matrices, we can now consider an adaptive version of Gaussian-based classification (since these methods only require means and covariances). Recall

P(c | x) = f(x | c) P(c) / f(x)

Change can happen in various ways, but population drift usually refers to the class prior P(c) and/or the class-conditional densities f(x | c). Linear/quadratic discriminant analysis (LDA/QDA) is motivated by reasoning that f(x | c) ~ N(mu, Sigma). LDA: assume the covariance matrix is common across classes. QDA: different covariances.

Now, instead of using static estimates of mu and Sigma, use the adaptive estimates. We can use the same adaptive forgetting factor to handle changing prior probabilities also. This leads to a number of ways of constructing a stream classifier. It requires a certain amount of hackery, because the theory requires regularly-spaced data in time; to handle this, we simply update every time an observation arrives. Also, to test the idea, we provide the fraud flag immediately after classification (unrealistic, but we have some ideas for the real problem).
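A sketch of the resulting streaming classifier, simplified to one-dimensional features and two classes: per-class forgetful mean and variance, plus forgetful class counts standing in for the prior (the paper's version is multivariate; the class structure and parameter values here are illustrative):

```python
import math

class StreamingGaussianClassifier:
    """Adaptive two-class Gaussian (QDA-style) classifier with fixed forgetting."""

    def __init__(self, lam=0.98):
        self.lam = lam
        # per class: [effective count n, forgetful sum m, forgetful sq. deviations s]
        self.stats = {c: [0.0, 0.0, 1e-6] for c in (0, 1)}

    def update(self, x, c):
        n, m, s = self.stats[c]
        n = self.lam * n + 1.0
        m = self.lam * m + x
        mu = m / n
        s = self.lam * s + (x - mu) ** 2
        self.stats[c] = [n, m, s]

    def predict(self, x):
        scores = {}
        for c, (n, m, s) in self.stats.items():
            if n == 0.0:
                continue
            mu, var = m / n, s / n + 1e-6   # a little regularisation, as on the slide
            # log P(c) + log N(x; mu, var), up to a constant shared across classes
            scores[c] = math.log(n) - 0.5 * math.log(var) - (x - mu) ** 2 / (2 * var)
        return max(scores, key=scores.get)

clf = StreamingGaussianClassifier()
for x, c in [(1.0, 0), (1.2, 0), (0.9, 0), (5.0, 1), (5.2, 1), (4.8, 1)]:
    clf.update(x, c)
print(clf.predict(1.1), clf.predict(5.1))  # 0 1
```

Because every quantity is a forgetful recursion, both the class-conditional Gaussians and the (unnormalized) priors drift with the stream, with no retraining step.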

Using 18 variables, based on numerical elements of the transaction record and a means of coding merchant category codes, we have the following performance over 5 years (AUC measured once a month). (Note, a little regularisation is required.)

Performance of static versions (adaptive windows) is very poor. This model can be implemented very efficiently.

What does the adaptive forgetting factor show?

While rebuilding the detector will always be needed, this approach might help mitigate some losses due to population drift.

In the context of the whole problem, we have the opportunity to place the time-adaptation in different places. This suggests the following intriguing idea for fraud detection between system rebuilds:

Conclusions

Transaction fraud detection is an important, but hard, real-world problem. A significant amount of engineering is required to produce effective solutions. Different modeling approaches and tools can have merit - and it appears they can be effectively combined. This suggests that the different tools are capturing the fraud signal in non-overlapping ways. We have the problem of handling changing populations (arms race, economic drift, etc.). Preliminary results suggest that temporally adaptive methods may have some utility in this context.

Future Work

Explore continuous updating and adaptation of sub-systems and the combiner. Extend the adaptive classifier to finite mixture models (more flexible), approximate logistic regression and RBF networks. More realistically handle the delayed fraud label.

References

Anagnostopoulos, C., Tasoulis, D.K., Adams, N.M. and Hand, D.J., Streaming Gaussian classification using recursive maximum likelihood with adaptive forgetting, Technical report.
Hand, D.J., Whitrow, C., Adams, N.M., Juszczak, P. and Weston, D.J., Performance criteria for plastic card fraud detection tools, J. Oper. Res. Soc., 58, (2008), 956-962.
Haykin, S., Adaptive Filter Theory, third edition, Prentice-Hall.
Juszczak, P., Adams, N.M., Hand, D.J., Whitrow, C. and Weston, D.J., Off-the-peg and bespoke classifiers for fraud detection, Comput. Stat. Data An., 52, (2008), 4521-4532.
Tasoulis, D.K., Adams, N.M., Weston, D.J. and Hand, D.J., Mining information from plastic card transaction streams, in COMPSTAT 2008: Proceedings in Computational Statistics, 18th Symposium, P. Brito (ed), 2008, 315-322.
Weston, D.J., Hand, D.J., Adams, N.M., Whitrow, C. and Juszczak, P., Plastic card fraud detection using peer group analysis, Adv. Data An. Classif., 2(1), (2008), 45-62.
Whitrow, C., Hand, D.J., Juszczak, P., Weston, D.J. and Adams, N.M., Transaction aggregation as a strategy for credit card fraud detection, Data Min. Knowl. Disc., (2008), in press.
Whitrow, C., Hand, D.J., Adams, N.M., Weston, D.J. and Juszczak, P., Combining transaction fraud detection methods, Technical report (2008).
