
A COMPARISON OF PREDICTIVE MODELS IN CLASSIFICATION OF CREDIT CARD APPLICANTS

Irma Rohaiza Ibrahim and Yap Bee Wah
Faculty of Computer and Mathematical Sciences,
Universiti Teknologi MARA, 40450 Shah Alam, Selangor, Malaysia

Abstract

The process of credit scoring is very important for banks and financial institutions as they need to separate good credit risks from bad credit risks in terms of creditworthiness. With the advancement of computer technology and statistical software such as SAS Enterprise Miner, banks can use credit scoring to classify credit card applicants. Credit scoring involves building predictive models. The objective of this study is to build a predictive model to classify credit card applications as accepted or rejected. The sample size of this study is 4305. The data was first partitioned into training (70%) and validation (30%) samples. Three different credit scoring models were compared: Logistic Regression, Decision Tree and Neural Network. The performance of the three credit scoring models was compared based on misclassification rate. Results show that the Decision Tree using the chi-square splitting criterion has the lowest misclassification rate (LR = 28.07%, NN = 24.28%, DT = 20.18%). Results also show that female and older applicants are more likely to be accepted. Applicants with more years of employment, shorter loan durations and those who own a house and property are more likely to be accepted.

Keywords: credit scoring, logistic regression, decision tree, classification, predictive modeling

1. Introduction

In Malaysia, credit cards and charge cards were introduced in the mid-1970s. In the early days, holding such payment cards was reserved for the rich. Prior to the introduction of payment cards, cheques were the dominant form of non-cash payment instrument. However, they were accessible only to a small number of consumers and were used predominantly for large-value transactions. By the early 1990s, credit cards were more accessible to the general public, but strict income requirements were imposed on credit cardholders. In 1998, the entry of non-bank credit card issuers raised competition further in an already competitive credit card market in Malaysia.

Nowadays, millions of people rely on credit cards for purchasing goods and services. They enjoy the advantage of 30-day credit, an organized bill at the end of the month and, with the more prestigious "gold cards" and "platinum cards", additional services such as insurance plans and free air miles proportionate to their purchases via credit cards. Due to the intense competition among card-issuing banks, more and more people can easily apply for a credit card without the bank carefully examining their creditworthiness. Presently, banks do have some screening procedures, such as using credit scorecards, to make decisions.

A credit grantor who evaluates new applications must use the profiles of past applicants, known to be either good or bad credit risks, as a yardstick against which to evaluate new applicants. Credit scoring development is based on historical information from the databank on existing clients, in order to assess whether a prospective client is more likely to be a good or a bad payer. The main idea in credit risk modeling is to build classification rules that predict whether bank customers are good or bad credit risks. With the availability of data mining software, predictive models can easily be deployed by banks to analyze a large number of applications quickly and efficiently. Moreover, a good credit risk scoring model enables management to make better and more accurate decisions while processing credit card applications. Hence, the objective of this paper is to compare the predictive ability of three predictive models: Logistic Regression (LR), Decision Tree and Neural Network (NN), in the classification of credit card applicants.

This paper is organized as follows. In Section 2, we briefly review the applications of predictive models
and the selection of variables. Section 3 presents the methodology for constructing the predictive models.
The results are discussed in Section 4. Finally, some concluding remarks are given in Section 5.

2. Predictive Models
Credit scoring was first introduced in the 1940s and over the years has evolved and developed significantly. In the 1960s, with the creation of credit cards, banks and other credit card issuers realized the advantages of credit scoring in the credit granting process. In the 1980s, banks started to use credit scoring for other purposes such as personal loan applications. In recent years, credit scoring has been used for home loans, small business loans and insurance applications and renewals (Thomas, 2000; Koh et al., 2004). Credit scoring is based on statistical or operational research methods. Historically, discriminant analysis and linear regression have been the most widely used techniques for building scorecards. Other techniques include logistic regression, probit analysis, nonparametric smoothing methods (especially k-nearest neighbours), mathematical programming, Markov chain models, recursive partitioning, expert systems, genetic algorithms and neural networks (Hand and Henley, 1997). Multivariate adaptive regression splines (MARS), classification and regression trees (CART), case-based reasoning (CBR) and support vector machines (SVM) are some recently developed techniques for building credit scoring models (Huang et al., 2004; Lee et al., 2006; Huang et al., 2007).

Recently, with the development of data mining software, the process of building a credit scoring model has been made much easier for credit analysts. However, the popular techniques for banking and business enterprises are credit scorecards, logistic regression and decision trees, as it is relatively easy to identify the important input variables, interpret the results and deploy the models.

In building a scoring model, historical data on the performance of previously made loans and borrower characteristics are required. Vojtek and Kočenda (2006) provided a table of indicators that are typically important in retail credit scoring models. They classify the indicators as demographic, financial, employment and behavioral indicators. Mavri et al. (2008) used variables such as gender, age, education, marital status and monthly income to estimate the risk level of credit card applicants.

The data in this study used only the demographic, behavioral and employment indicators. The variables
and their categories for this study are shown in Table 1.

Table 1 Categories of variables

DEMOGRAPHIC INDICATORS: 1. Age; 2. Gender; 3. Marital status; 4. Years at current address; 5. Home phone; 6. Housing; 7. Property; 8. Job
FINANCIAL INDICATORS: 1. Amount of loan; 2. Duration
BEHAVIORAL INDICATORS: 1. Loan history; 2. Number of loans
EMPLOYMENT INDICATORS: 1. Employment

3. Methodology

This section explains the process of constructing predictive models.

3.1 Variable Identification


In building credit scoring models using SAS Enterprise Miner, the variable role and measurement level
must be stated. Table 2 lists the variables used in constructing the predictive models.

3.2 Data Preparation


The data, originally in Excel format, was imported into SAS using SAS Enterprise Guide. Next, we proceeded with the data cleaning stage. In general, data cleaning involves removing outliers and redundant data and imputing missing values.
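As an illustration, these cleaning steps could be sketched in Python with pandas (this is outside the SAS Enterprise Guide workflow actually used; the column names and the 1.5 × IQR outlier rule are assumptions for the sketch):

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleaning sketch: drop duplicate rows, impute missing values,
    and trim extreme outliers in numeric columns (1.5 * IQR rule)."""
    df = df.drop_duplicates()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())   # impute numeric columns with the median
            q1, q3 = df[col].quantile([0.25, 0.75])
            iqr = q3 - q1
            df = df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])  # impute categoricals with the mode
    return df
```

The thresholds and imputation choices here are illustrative; in practice the rules would be set to match the scoring data at hand.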
Table 2 Description of Variables

Variable Name | Role | Type | Description
APPLICATION STATUS | Target | Binary | Credit card application: 0 = Rejected, 1 = Accepted
AGE | Input | Interval | Age in years
EMPLOYMENT | Input | Ordinal | Present employment since: 0 = unemployed; 1 = < 1 year; 2 = 1 <= ... < 4 years; 3 = 4 <= ... < 7 years; 4 = >= 7 years
CURRENT_ADD | Input | Ordinal | Years living at current address: 0 = <= 1 year; 1 = 1 < ... <= 2 years; 2 = 2 < ... <= 3 years; 3 = > 4 years
HOUSING | Input | Binary | Applicant: 0 = Rent, 1 = Own
NUM_LOAN | Input | Interval | Number of existing loans (1, 2, 3, 4)
JOB | Input | Ordinal | Nature of job: 0 = unemployed; 1 = unskilled; 2 = skilled employee/official; 3 = management/self-employed/officer
HOME PHONE | Input | Binary | Applicant has home phone in his or her name: 0 = No, 1 = Yes
GENDER | Input | Binary | Applicant is: 0 = Male, 1 = Female
MARITAL_STATUS | Input | Nominal | Applicant is: 0 = Single, 1 = Married, 2 = Divorced/widowed
PROPERTY | Input | Binary | Applicant owns property: 0 = No, 1 = Yes
AMOUNT | Input | Interval | Amount of existing loan
DURATION | Input | Interval | Duration of loan in months
LOAN_HISTORY | Input | Nominal | Loan history: 0 = no loans taken; 1 = all loans at this bank paid back duly; 2 = existing loans paid back duly till now; 3 = delay in paying off in the past; 4 = critical account/unpaid

3.3 Data Modeling


This section discusses the construction of the predictive models. The data consists of 2,975 (69%) accepted and 1,330 (31%) rejected applicants. The dependent variable, APPLICATION STATUS, is coded as accepted = 1 and rejected = 0. SAS Enterprise Miner 5.3 software was used for building the predictive models. This data mining software provides a Graphical User Interface (GUI) workspace whereby nodes (tool icons) can be easily selected from a tools palette and placed into the diagram workspace. Nodes are then connected to form a process flow diagram that structures and documents the flow of analytical activities. Figure 1 shows the process flow diagram; the process begins with the sample data node. The sample data node is connected to the Data Partition node, which splits the data into training and validation samples. The sample data was partitioned at 70:30, that is, 70% for training (used for model building) and 30% for validation (used for validating the model). The modeling nodes are connected to the Data Partition node. The predictive models were then assessed and compared using the Model Comparison node. Meanwhile, the Multiplot and StatExplore nodes are multipurpose tools used to examine variable distributions and statistics of the data.
Figure 1 Data Mining Process Flow
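Outside SAS Enterprise Miner, the same 70:30 stratified partition could be sketched with scikit-learn (the placeholder predictors below are hypothetical; only the class counts come from the data described above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((4305, 5))                  # placeholder predictor matrix (hypothetical)
y = np.array([1] * 2975 + [0] * 1330)    # 2975 accepted (1), 1330 rejected (0)

# 70% training, 30% validation; stratify=y keeps the 69:31 class ratio in both samples
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.70, stratify=y, random_state=1)
```

Stratifying on the target mirrors the Data Partition node's default of preserving the event rate in both samples.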

The modeling stages are:


1. Compare logistic regression models using ENTER, STEPWISE, BACKWARD and
FORWARD selection methods. Select the best model based on validation sample predictive
accuracy.
2. Compare decision tree models using CHI-SQUARE, GINI, and ENTROPY splitting criteria.
Select the best model based on validation sample predictive accuracy.
3. Compare the logistic regression model selected in (1) and the decision tree model selected in (2) with the neural network model.

3.3.1 Logistic Regression Model


Logistic regression is a widely used statistical modeling technique in which the probability of a
dichotomous outcome (Y=0 or Y=1) is related to a set of potential predictor variables. The logistic
regression model (Roiger & Geatz, 2003) is written as

log[ P(Y = 1) / (1 − P(Y = 1)) ] = α + β1X1 + β2X2 + ... + βkXk        (1)
where P(Y=1) is the probability of the outcome of interest.

Equation (1) can be solved to obtain


P(Y = 1) = 1 / (1 + e^(−z)), where z = α + β1x1 + β2x2 + ... + βkxk        (2)
Thus, the objective of a logistic regression model is to determine the conditional probability of a specific
applicant belonging to a class (accepted or rejected), given the values of the independent variables of
that applicant. For this study, the logistic regression was used to model the event Y=1 (accepted).
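A minimal numerical sketch of equation (2) in plain Python (illustrative only; the coefficients passed in are arbitrary, not the fitted model):

```python
import math

def logit_prob(alpha, betas, xs):
    """Equation (2): P(Y = 1) = 1 / (1 + e^(-z)), z = alpha + sum(beta_i * x_i)."""
    z = alpha + sum(b * x for b, x in zip(betas, xs))
    return 1.0 / (1.0 + math.exp(-z))

# z = 0 gives even odds, P(Y = 1) = 0.5; larger z pushes the probability toward 1
p = logit_prob(0.0, [0.5], [0.0])
```

This is the conditional probability that an applicant with predictor values xs belongs to the accepted class.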

3.3.2 Decision Tree Model


A decision tree model consists of a set of rules for dividing a large collection of observations into smaller groups that are homogeneous with respect to a particular target variable. The target variable is usually categorical, and the decision tree model is used either to calculate the probability that a given record belongs to each of the target categories, or to classify the record by assigning it to the most likely category. Decision trees can also be used for continuous target variables, although other techniques, such as multiple regression, are more suitable for such variables (Berry and Linoff, 2004). Given a target variable and a set of explanatory variables, decision tree algorithms automatically determine which variables are most important and subsequently sort the observations into the correct output category (Olson and Yong, 2007). The common decision tree algorithms in data mining software are CHAID (Chi-Square Automatic Interaction Detector), CART (Classification and Regression Tree) and C5. CART uses gini as the splitting criterion for a categorical dependent variable, while C5 uses entropy. Meanwhile, CHAID uses the chi-square test as its splitting criterion (Berry and Linoff, 2004). CART has been widely used in business, weather forecasting and medical research (Davis et al., 1999; Berry and Linoff, 2004; Fu, 2004; Kurt et al., 2008). CART has also been used by Lee et al. (2006) in building credit scoring models.
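As a hedged sketch, the gini and entropy criteria (CART- and C5-style splits) are available in scikit-learn's DecisionTreeClassifier; CHAID's chi-square criterion, used in SAS Enterprise Miner, is not available there. The data below is synthetic, purely for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))          # toy predictors
y = (X[:, 0] > 0).astype(int)          # toy target recoverable by a single split

for criterion in ("gini", "entropy"):  # CART- and C5-style splitting criteria
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=0)
    tree.fit(X, y)
```

Either criterion quickly finds the split on the first variable here; on real credit data the two criteria can select different splits and yield different trees, which is why the study compares them.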

3.3.3 Neural Network Model


According to Hian and Chan (2004), neural networks are useful for recognizing patterns in the data,
especially when the form of relationships between the dependent and independent variables is unknown
and/or complex. They are modeled after the human brain, which can be perceived as a highly connected
network of neurons (called nodes in neural networks terminology). Each node (in a layer of nodes)
receives inputs from at least one node in a previous layer and combines the inputs and generates an
output to at least one node in the next layer. Generally, the independent variables comprise the input
layer and the dependent variable comprises the output layer. Between the input and output layers there may exist one or more hidden layers of nodes.

The multilayer perceptron (MLP) is the most widely used neural network model in data analysis. An illustration of the MLP is given in Figure 2. It is a feed-forward network composed of an input layer (units corresponding to the input variables), hidden layers (neurons that each output a nonlinear function of a linear combination of their inputs) and an output layer (neurons corresponding to the target). If the target variable has multiple classes (> 2), there are multiple neurons in the output layer (SAS Institute Inc, 2005).
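A minimal forward pass through such an MLP can be sketched in numpy (the tanh hidden activation, the logistic output and all of the weights below are illustrative assumptions, not the trained network):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One hidden layer: each hidden neuron outputs a nonlinear (tanh)
    function of a linear combination of its inputs; the single output
    neuron applies a logistic function, giving P(accepted) for a binary target."""
    h = np.tanh(W1 @ x + b1)           # hidden layer
    z = W2 @ h + b2                    # output layer (one neuron for a binary target)
    return 1.0 / (1.0 + np.exp(-z))

# 4 inputs -> 3 hidden neurons -> 1 output probability (arbitrary weights)
x = np.array([0.5, -1.0, 0.2, 1.5])
W1 = np.full((3, 4), 0.1); b1 = np.zeros(3)
W2 = np.full((1, 3), 0.2); b2 = np.zeros(1)
p = mlp_forward(x, W1, b1, W2, b2)     # a probability between 0 and 1
```

Training consists of adjusting W1, b1, W2 and b2 so that these output probabilities match the observed accepted/rejected labels.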
Figure 2 An Illustration of the Multilayer Perceptron NN model
(Source: SAS Institute Inc, 2005)

4. Results
This section presents and discusses the results obtained.

4.1 Profile of sample applicants


Profile analysis results in Table 3 show that the sample consists of 69% accepted applicants. About 55% are female, more than half (55%) are single and 40% are married. The majority (63%) of the applicants are skilled employees or officials. Only 6% are unemployed and about a quarter (25%) have been employed for more than seven years. About 64% have one existing loan and about half have delayed paying off a loan in the past. The average age of applicants is 36 years, while the average amount and duration of loan are RM16,420 and 21 months respectively.

Table 3(a) Frequency Distribution of variables

Variable (Description) | Category | Frequency | Percentage

APPLICATION STATUS | Rejected | 1330 | 30.9
 | Accepted | 2975 | 69.1

Demographic Indicators

GENDER | Male | 1926 | 44.7
 | Female | 2379 | 55.3

STATUS | Single | 2365 | 54.9
 | Married | 1721 | 40.0
 | Divorced/Widowed | 219 | 5.1

JOB (Nature of job) | Unemployed | 88 | 2.0
 | Unskilled | 879 | 20.4
 | Skilled employee/official | 2717 | 63.1
 | Management/self-employed/officer | 631 | 14.5

CURRENT_ADD (Years living at current address) | < 1 year | 550 | 12.8
 | 1 < ... <= 2 years | 1336 | 31.0
 | 2 < ... <= 3 years | 655 | 15.2
 | > 4 years | 1764 | 41.0

PROPERTY (Applicant owns property) | No | 3068 | 71.3
 | Yes | 1237 | 28.7

HOME PHONE (Applicant has home phone in his or her name) | No | 2616 | 60.8
 | Yes | 1689 | 39.2

HOUSING | Rent | 1239 | 28.8
 | Own | 3066 | 71.2

EMPLOYMENT (Present employment since) | unemployed | 257 | 5.9
 | < 1 year | 719 | 16.7
 | 1 <= ... < 4 years | 1502 | 34.9
 | 4 <= ... < 7 years | 741 | 17.2
 | >= 7 years | 1086 | 25.3

Financial Indicators

NUM_LOAN (Number of existing loans) | 1 | 2784 | 63.8
 | 2 | 1411 | 32.8
 | 3 | 122 | 2.8
 | 4 | 24 | 0.6

HISTORY (Loan history) | no loan taken | 174 | 4.0
 | all loans at this bank paid back duly | 214 | 5.0
 | existing loans paid back duly till now | 365 | 8.5
 | delay in paying off in the past | 2278 | 52.9
 | critical account/unpaid | 1274 | 29.6

Table 3(b) Descriptive Statistics

Variable (Description) | Mean | SD

Demographic Indicator
AGE (Age in years) | 35.74 | 11.41

Financial Indicators
AMOUNT (Amount of existing loan) | 16420 | 1417
DURATION (Duration of loan in months) | 21.32 | 12.321


4.2 Logistic Regression results
Based on the summary in Table 4, the misclassification rates for the Stepwise, Backward and Forward selection methods are the same. The significant predictors for all four models were similar, namely gender, age, years at current address, employment, housing, home phone, number of loans, property, duration, and loan history. Since the comparison results in Table 4 show no difference among the three selection methods, the stepwise logistic regression model was selected.

Table 4 Comparison of Logistic Regression Models
(Fit statistics; model selection based on _VMISC_)

Selected Model | Model Node | Valid: Misclassification Rate | Train: Misclassification Rate
Y | Reg2 | 0.28074 | 0.27158
  | Reg3 | 0.28074 | 0.27158
  | Reg4 | 0.28074 | 0.27158
  | Reg  | 0.28770 | 0.27424
Note: Reg = ENTER; Reg2 = STEPWISE; Reg3 = FORWARD; Reg4 = BACKWARD; Y = selected model

Table 5 presents the STEPWISE model results. Results show that female and older applicants are more
likely to be accepted. Those who do not own homes or property are more likely to be rejected. Applicants
with many existing loans and longer duration loans are also more likely to be rejected.

Table 5 Analysis of Maximum Likelihood Estimates (STEPWISE MODEL)

Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq | Standardized Estimate | Exp(Est)
Intercept | 1 | 1.8708 | 0.2490 | 56.43 | <.0001 |  | 6.494
AGE | 1 | 0.00924 | 0.00439 | 4.43 | 0.0354 | 0.0576 | 1.009
CURRENT 0 | 1 | 0.3620 | 0.1055 | 11.77 | 0.0006 |  | 1.436
CURRENT 1 | 1 | -0.0784 | 0.0757 | 1.07 | 0.3009 |  | 0.925
CURRENT 2 | 1 | -0.3046 | 0.0905 | 11.32 | 0.0008 |  | 0.737
DURATION | 1 | -0.0431 | 0.00372 | 134.10 | <.0001 | -0.2923 | 0.958
EMPLOYME 0 | 1 | -0.3045 | 0.1410 | 4.66 | 0.0308 |  | 0.737
EMPLOYME 1 | 1 | -0.3531 | 0.0972 | 13.20 | 0.0003 |  | 0.702
EMPLOYME 2 | 1 | 0.00550 | 0.0775 | 0.01 | 0.9435 |  | 1.006
EMPLOYME 3 | 1 | 0.6046 | 0.1047 | 33.37 | <.0001 |  | 1.831
GENDER 0 | 1 | -0.2008 | 0.0463 | 18.83 | <.0001 |  | 0.818
HOME_PHO 0 | 1 | -0.1518 | 0.0453 | 11.21 | 0.0008 |  | 0.859
HOUSING 0 | 1 | -0.1585 | 0.0488 | 10.54 | 0.0012 |  | 0.853
LOAN_HIS 0 | 1 | -0.4859 | 0.1616 | 9.04 | 0.0026 |  | 0.615
LOAN_HIS 1 | 1 | -0.7647 | 0.1569 | 23.76 | <.0001 |  | 0.465
LOAN_HIS 2 | 1 | -0.00582 | 0.0867 | 0.00 | 0.9465 |  | 0.994
LOAN_HIS 3 | 1 | 0.4112 | 0.1288 | 10.19 | 0.0014 |  | 1.509
NUM_LOAN | 1 | -0.3877 | 0.0988 | 15.39 | <.0001 | -0.1206 | 0.679
PROPERTY 0 | 1 | -0.2424 | 0.0530 | 20.91 | <.0001 |  | 0.785

Note: -2 log likelihood = 3301.169; Chi-Square = 48.24, p-value < 0.05
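The Exp(Est) column in Table 5 is simply the exponentiated coefficient, i.e. the odds ratio for a one-unit change in the predictor; a quick check of two rows:

```python
import math

# Exp(Est) = exp(Estimate): the multiplicative change in the odds of acceptance
age_or = math.exp(0.00924)      # AGE: each extra year multiplies the odds by about 1.009
male_or = math.exp(-0.2008)     # GENDER = 0 (male): odds about 18% lower than for females
```

Values below 1 (e.g. DURATION, NUM_LOAN) indicate that larger values of the predictor reduce the odds of acceptance, consistent with the interpretation in the text.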

4.3 Decision Tree model


Results in Table 6 show that the decision tree using the chi-square splitting criterion is selected as the best decision tree model based on the validation misclassification rate. The importance of the variables is shown in Table 7. For the decision tree, gender, number of loans and marital status are not important variables and were not used in building the tree. The decision tree is shown in Figure 3.

Table 6 Misclassification rate for decision tree models

Selected Model | Model Node | Valid: Misclassification Rate | Train: Misclassification Rate
Y | Tree  | 0.20186 | 0.19389
  | Tree2 | 0.22583 | 0.22112
  | Tree3 | 0.22892 | 0.22112
Note: Tree = Chi-Square; Tree2 = ENTROPY; Tree3 = GINI; Y = selected model

Table 7 Variable Importance

Obs | NAME | NRULES | IMPORTANCE | VIMPORTANCE | RATIO
1 | AMOUNT | 7 | 1.00000 | 0.77431 | 0.77431
2 | DURATION | 4 | 0.92039 | 1.00000 | 1.08650
3 | LOAN_HIS | 3 | 0.70094 | 0.88471 | 1.26217
4 | AGE | 3 | 0.61361 | 0.61924 | 1.00917
5 | HOME_PHO | 2 | 0.42610 | 0.58092 | 1.36335
6 | HOUSING | 2 | 0.41863 | 0.52163 | 1.24604
7 | PROPERTY | 2 | 0.39482 | 0.31537 | 0.79877
8 | EMPLOYME | 1 | 0.36534 | 0.36741 | 1.00566
9 | CURRENT_ | 1 | 0.25524 | 0.35490 | 1.39045
10 | JOB | 1 | 0.24187 | 0.33680 | 1.39250

Figure 3 The decision tree (Chi-Square splitting criteria)

4.4 Model Comparisons


In this section, the comparison of the performance of the models is discussed. The models were compared based on the validation misclassification rate, that is, the percentage of observations misclassified by the model. Results in Table 8 show that the selected best predictive model is the decision tree, as it has the lowest misclassification rate. Table 9 presents the event classification table for the three predictive models for both the training and validation samples.
Table 8 Misclassification rate for DT, LR and NN

(Fit statistics; model selection based on _VMISC_)

Selected Model | Model Node | Valid: Misclassification Rate | Train: Misclassification Rate
Y | Tree   | 0.20186 | 0.19389
  | Neural | 0.24285 | 0.19821
  | Reg2   | 0.28074 | 0.27158

Table 9 Event Classification Table


(Model selection based on _VMISC_)

MODEL | MODEL DESCRIPTION | Data Role | False Negative | True Negative | False Positive | True Positive
Tree | CHI-SQUARE | TRAIN | 83 | 429 | 501 | 1999
Tree | CHI-SQUARE | VALIDATE | 34 | 173 | 227 | 859
Reg2 | STEPWISE | TRAIN | 152 | 264 | 666 | 1930
Reg2 | STEPWISE | VALIDATE | 53 | 90 | 310 | 840
Neural | Neural Network | TRAIN | 202 | 535 | 395 | 1880
Neural | Neural Network | VALIDATE | 112 | 198 | 202 | 781

Table 10 displays the sensitivity, specificity and the misclassification rate for each model. The sensitivity
rate is the true positive rate (the percentage of accepted applicants predicted correctly as accepted) while
specificity is the true negative rate (percentage of rejected applicants predicted as rejected). Decision tree
is the best predictive model as it has the lowest misclassification rate and the highest sensitivity.

Table 10 Sensitivity, specificity and misclassification rate


Model | Sample | Sensitivity | Specificity | Misclassification rate
DT | Training | 0.9601 | 0.4613 | 0.1938
DT | Validation | 0.9619 | 0.4325 | 0.2018
LR | Training | 0.9270 | 0.2839 | 0.2716
LR | Validation | 0.9406 | 0.2250 | 0.2807
NN | Training | 0.9030 | 0.5753 | 0.1982
NN | Validation | 0.8746 | 0.4950 | 0.2428
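The rates in Table 10 follow directly from the event classification counts in Table 9; for example, for the decision tree on the training sample:

```python
def rates(fn, tn, fp, tp):
    """Sensitivity, specificity and misclassification rate from the
    event classification (confusion) counts."""
    sensitivity = tp / (tp + fn)                     # accepted correctly predicted as accepted
    specificity = tn / (tn + fp)                     # rejected correctly predicted as rejected
    misclassification = (fn + fp) / (fn + tn + fp + tp)
    return sensitivity, specificity, misclassification

# Decision tree, training sample (counts from Table 9): FN=83, TN=429, FP=501, TP=1999
sens, spec, misc = rates(83, 429, 501, 1999)
```

Applying the same formulas to the other rows of Table 9 reproduces the remaining entries of Table 10.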

5. Conclusion
This study focused on the construction and evaluation of three predictive models, namely logistic regression, decision tree and neural network, to classify credit card applicants. Results revealed that the decision tree has the lowest misclassification rate. The performance of predictive models
depends on the data structure, data quality and the objective of the classification. In practical applications,
classification methods such as decision trees and logistic regression which are relatively easy to
understand and deploy are more appealing to users. With the availability of data mining software, more
banks are finding data mining techniques useful in gaining competitive advantage.

References
Berry, M.J.A. and G.S. Linoff (2004). Mastering Data Mining: The Art and Science of Customer
Relationship Management, New York: John Wiley & Sons, Inc.

Davis, R.E., K. Elder, D. Howlett and E. Bouzaglou (1999). Relating storm and weather factors to dry slab avalanche activity at Alta, Utah and Mammoth Mountain, California, using classification and regression trees. Cold Regions Science and Technology, Vol. 30, pp. 79.

Fu, L. (2004). Efficient evaluation of sparse data cubes. In Advances in Web-Age Information
Management: Fifth International Conference (WAIM’04), Dalian, China, July 15-17, (pp. 336-345).

Hand, D. J. and Henley, W. E. (1997). Statistical Classification Methods in Consumer Credit Scoring: a Review. Journal of the Royal Statistical Society, Series A, Vol. 160(3), pp. 523-541.

Hian, C. K. and Chan, K.L. (2004). Going concern prediction using data mining techniques, Managerial
Auditing Journal, Vol 19, No 3, 462-476.

Huang, Z., Chen, H., Hsu, C. J, Chen W.H., Wu, S., (2004). Credit rating analysis with support vector
machines and neural networks: A market comparative study. Decision Support Systems, Vol.37, pp.
543– 558.

Huang, C.L, Chen, M. C and Wang, C. J. (2007). Credit scoring with a data mining approach based on
support vector machines. Expert Systems with Applications, Vol. 33(4), pp. 847–856.

Koh, H.C., Tan, W.C. and Goh, C.P. (2004). Credit scoring using data mining techniques. Singapore
Management Review. Vol. 26, No. 2, pp. 25-47.
Kurt, I., Ture, M., & Kurum, A. T. (2008). Comparing performances of logistic regression, classification
and regression tree, and neural networks for predicting coronary artery disease. Expert Systems with
Applications, Vol. 34, pp. 366–374.

Lee, T. S., Chiu, C. C., Chou, Y. C., & Lu, C. J. (2006). Mining the customer credit using classification and
regression tree and multivariate adaptive regression splines. Computational Statistics & Data Analysis,
Vol. 50, pp. 1113–1130.

Mavri, M., Angelis, V. and Ioannou, G. (2008). A two-stage dynamic credit scoring model based on customers' profiles and time horizon. Journal of Financial Services Marketing, Vol. 13, No. 1, pp. 17-27.
Olson, D. and Yong, S. (2007). Introduction to Business Data Mining. McGraw Hill International Edition.
Roiger, R. J. and Geatz, M. W. (2003). Data Mining: A Tutorial-Based Primer. Pearson Education, Inc.
SAS Institute Inc. (2005). SAS Training Course Notes: Applying Data Mining Techniques Using SAS® Enterprise Miner™, SAS Institute Inc., Cary, NC 27513, USA.

Thomas, L. C. (2000). A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers. International Journal of Forecasting, Vol. 16, pp. 149-172.
Vojtek, M. and Kočenda, E. (2006). Credit-Scoring Methods. Czech Journal of Economics and Finance (Finance a úvěr), Vol. 56(3-4), pp. 152-167.
