European Journal of Operational Research 95 (1996) 24-37

Theory and Methodology

A comparison of neural networks and linear scoring models in the credit union environment

Vijay S. Desai a,*, Jonathan N. Crook b, George A. Overstreet, Jr. a

a McIntire School of Commerce, University of Virginia, Charlottesville, VA 22903, USA
b Department of Business Studies, University of Edinburgh, 50 George Square, Edinburgh EH8 9JY, UK

* Corresponding author.

Received January 1995; revised August 1995

Abstract

The purpose of the present paper is to explore the ability of neural networks such as multilayer perceptrons and modular
neural networks, and traditional techniques such as linear discriminant analysis and logistic regression, in building credit
scoring models in the credit union environment. Also, since funding and small sample size often preclude the use of
customized credit scoring models at small credit unions, we investigate the performance of generic models and compare
them with customized models. Our results indicate that customized neural networks offer a very promising avenue if the
measure of performance is percentage of bad loans correctly classified. However, if the measure of performance is
percentage of good and bad loans correctly classified, logistic regression models are comparable to the neural networks
approach. The performance of generic models was not as good as the customized models, particularly when it came to
correctly classifying bad loans. Although we found significant differences in the results for the three credit unions, our
modular neural network could not accommodate these differences, indicating that more innovative architectures might be
necessary for building effective generic models.

Keywords: Neural networks; Banking; Credit scoring

1. Introduction

Recent issues of trade publications in the credit and banking area have published a number of articles heralding the role of artificial intelligence (AI) techniques in helping bankers make loans, develop markets, assess creditworthiness and detect fraud. For example, HNC Inc., considered a leader in neural network technology, offers, among other things, products for credit card fraud detection (Falcon), automated mortgage underwriting (Colleague), and automated property valuation. Clients for HNC's Falcon software include AT&T Universal Card, Household Credit Services, Colonial National Bank, First USA Bank, First Data Resources, First Chicago Corp., Wells Fargo & Co., and Visa International (American Banker, 1994a,b; 1993a,b). According to Allen Jost, the director of Decision Systems for HNC Inc., "Traditional techniques cannot match the fine resolution across the entire range of account profiles that a neural network produces. Fine resolution is essential when only one in ten thousand transactions are frauds" (Jost, 1993, p. 32). Other software companies marketing AI products in this area include Cybertek-Cogensys and Nestor Inc. Cybertek-Cogensys markets an expert system software called Judgment Processor, which is used in evaluating potential borrowers for various consumer loan products and includes customers such as Wells Fargo Bank, San Francisco, and Commonwealth Mortgage Assurance Co., Philadelphia (American Banker, 1993c,d); it plans to introduce a neural net software product for under $1000 (Brennan, 1993a, p. 52). Nestor Inc.'s customers for a neural network-based credit card fraud detection software include Mellon Bank Corp. (American Banker, 1993e).

While acknowledging the success of expert systems and neural networks in mortgage lending and credit card fraud detection, reports in trade journals claim that artificial intelligence and neural networks have yet to make a breakthrough in evaluating customer credit applications (Brennan, 1993a). According to Mary A. Hopper, senior vice president of the Portfolio Products Group at Fair, Isaac and Co., a major provider of credit scoring systems, "The problem is a quality known as robustness. The model has to be valid over time and a wide range of conditions. When we tried a neural network, which looked great on paper, it collapsed - it was not predictive. We could have done better sorting the list on a single field" (Brennan, 1993b, p. 62). The techniques of choice for software marketed by the big-three credit information companies (Gold Report, developed by Management Decision Systems (MDS) and marketed by TRW; Delphi, developed by MDS and marketed by Trans Union; Delinquency Alert System, developed by MDS and marketed by Equifax; Empirica, developed by Fair Isaac for Trans Union; and Beacon, developed by Fair Isaac for Equifax) include multivariate discriminant analysis (Gothe, 1990, p. 28) and regression (Jost, 1993, p. 27).

In spite of the reports in the trade journals indicated above, papers in academic journals investigating and reporting on the claims appearing in the trade journals are not common. Perhaps this is due to the lack of data available to the academic community. Exceptions include Overstreet et al. (1992) and Overstreet and Bradley (1994), who compare custom and generic credit scoring models for consumer loans in a credit union environment using conventional statistical methods such as regression and discriminant analysis. Unlike large US banks, credit unions' loan files were not kept in readily available computerized databases. As a result, samples were laboriously collected by analyzing individual loan files. The purpose of the present paper is to use the rich database of three credit unions in the Southeastern United States assembled by Overstreet to investigate whether the predictive power of the variables employed in the above studies can be enhanced if the statistical methods of regression and discriminant analysis are replaced by neural network models. In particular, we explore two types of neural networks, namely feedforward neural networks with backpropagation of error, commonly referred to as multilayer perceptrons (MLP), and modular neural networks (MNN). A neural network can be viewed as a nonlinear regression technique. While there exist a number of nonlinear regression techniques, many of them require one to specify the nonlinear model before proceeding with the estimation of parameters; hence these techniques can be classified as model-driven approaches. In comparison, use of a neural network is a data-driven approach, i.e., a prespecification of the model is not required. The neural network 'learns' the relationships inherent in the data presented to it. This approach seems particularly attractive for the problem at hand because, as Allen Jost (1993, p. 30) says, "Traditional statistical model development includes time-consuming manual data review activities such as searching for non-linear relationships and detecting interactions among predictor variables".

In the credit union environment, funding and small sample size often preclude the use of credit scoring models that are custom tailored for the individual credit union. A recent Filene Research Institute survey found that only 20% of credit unions with over $25 million in assets have a credit scoring system. Among the vast number of smaller credit unions, credit scoring usage can be expected to be even lower. In light of this, generic models appear to offer a potential avenue from which credit scoring could be feasible for even the smallest institution within this industry. To this end, we train and compare the performance of customized and generic models so that the costs and benefits of the two approaches can be evaluated.

Proponents of customized models argue that credit unions enjoy very narrow fields of membership, normally employment related, and hence individual credit unions differ markedly from each other. For example, one credit union included in the present study consists of teachers whereas another consists of telephone company employees. Thus, generic models would miss important differences among individual credit unions. If indeed there are differences in the data representing the different credit unions, one way to detect these differences and take advantage of them would be to use modular neural networks, a network architecture consisting of a multiplicity of networks competing with each other to learn various segments of the input data space. Thus, we train and compare the performance of modular neural network models with the generic and customized neural networks, as well as with models based upon linear discriminant analysis and logistic regression.

A comparison of the neural network models with linear discriminant analysis and logistic regression suggests that, in terms of correctly classifying good and bad loans, the neural network models outperform linear discriminant analysis but are only marginally better than logistic regression models. However, in terms of correctly classifying bad loans, the neural network models outperform both conventional techniques. Since bad loans are only a small proportion of the total loans made, this result resonates with the claim made by Allen Jost of HNC Inc. that traditional techniques cannot match the fine resolution produced by neural nets. In comparing generic and customized models we found that the customized models perform significantly better than generic models. However, the performance of modular neural networks was not significantly better than the generic neural network model, suggesting that perhaps the differences between the individual credit unions are not that important.

In Section 2 and Section 3 we review conventional statistical techniques and neural network models respectively; Section 4 describes the data, the sources of the data, and the specifics of the neural networks used; Section 5 presents the results of our experiments; and Section 6 presents the conclusions and suggestions for future research.

2. Conventional statistical techniques

Let the vector $x = (x_1, x_2, \ldots, x_p)$ represent the $p$ predictor variables, and let $y$ represent the binary (categorical) dependent variable. The predictor variables may be metric or nonmetric. Given that the dependent variable is binary, the conventional methods typically used are as follows.

Linear Discriminant Analysis (LDA)

The objective of linear discriminant analysis is to deliver a function

$$Z = w^T x = w_1 x_1 + w_2 x_2 + \cdots + w_p x_p, \tag{1}$$

where the weight vector $w = (w_1, w_2, \ldots, w_p)$ is chosen to maximize the ratio

$$\frac{\left[ w^T (\mu_1 - \mu_2) \right]^2}{w^T \Sigma w}. \tag{2}$$

Here $\mu_1$ and $\mu_2$ are the population mean vectors for the two categories, and $\Sigma$ is the common covariance matrix for the two populations. The intuition behind this is that, if the difference between the weighted mean vectors is maximized relative to their common covariance, the risk of misclassification would be relatively small.

The linear discriminant model assumes that 1) the predictor variables are measured on an interval scale; 2) the covariance matrices of the predictor variables are equal for the two groups; and 3) the predictor variables follow a multivariate normal distribution. As will be clear from Section 5, the predictor variables used in our credit union application are not all metric, and hence the first and third assumptions are clearly violated. It is well known that when predictor variables are a mixture of discrete and continuous variables, the linear discriminant function may not be optimal, and special procedures for binary variables are available (e.g., Dillon and Goldstein, 1984). However, in the case of binary variables, most evidence suggests that the linear discriminant function performs reasonably well (e.g., Gilbert, 1968; Moore, 1973).
Logistic Regression (LR)

In the case of logistic regression it is assumed that the following model holds:

$$Z = \frac{1}{1 + e^{-z}}, \tag{3}$$

where $Z$ is the probability of the class outcome and

$$z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_p x_p.$$

The logistic regression model does not require the assumptions necessary for the linear discriminant problem. In fact, Harrell and Lee (1985) found that even when the assumptions of LDA are satisfied, LR is almost as efficient as LDA. One advantage of LDA is that ordinary least-squares procedures can be used to estimate the coefficients of the linear discriminant function, whereas maximum-likelihood methods are required for the estimation of a logistic regression model. However, given the availability of high speed computers today, computational simplicity is no longer considered to be an adequate criterion for choosing a method. A second advantage of discriminant analysis over logistic regression is that prior probabilities and misclassification costs can be easily incorporated into the LDA model. Misclassification costs and prior probabilities can also be incorporated into neural networks (e.g., Tam and Kiang, 1992); however, since we want the results of the three methods to be comparable, the present study does not incorporate misclassification costs into the models studied. The LDA model did use the proportion of good and bad loans in the training samples as prior probabilities.
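As an illustration of the maximum-likelihood estimation mentioned above, the sketch below fits the logistic model (3) by gradient ascent on the Bernoulli log-likelihood. It is a minimal sketch under our own assumptions (plain NumPy, a fixed step size and iteration count, synthetic data); the paper itself estimated LR with SPSS.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Maximum-likelihood fit of Z = 1 / (1 + exp(-z)), z = w0 + w'x (Eq. 3),
    by gradient ascent on the Bernoulli log-likelihood."""
    Xb = np.hstack([np.ones((len(X), 1)), X])  # prepend an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))      # predicted P(good) per case
        w += lr * Xb.T @ (y - p) / len(y)      # gradient of the log-likelihood
    return w

# Hypothetical usage with 0/1 loan outcomes (1 = good, 0 = bad).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + rng.normal(size=100) > 0).astype(float)
w = fit_logistic(X, y)
prob_good = 1.0 / (1.0 + np.exp(-(w[0] + X @ w[1:])))
```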
3. Neural networks

Research on multilayer feedforward networks dates back to the pioneering work by Rosenblatt (1962) and Widrow (1962). A computationally efficient method for training multilayer feedforward networks came in the form of the backpropagation algorithm. Credit for developing backpropagation into a usable technique, as well as popularizing it, is usually given to Rumelhart et al. (1986) and the members of the Parallel Distributed Processing group, although a number of independent origins, including Parker (1982), Werbos (1974), Bryson and Ho (1969), and Robbins and Munro (1951), have been cited. A neural network model takes an input vector X and produces an output vector O. The relationship between X and O is determined by the network architecture. There are many forms of network architectures (inspired by the neural architecture of the brain). The network generally consists of at least three layers, one input layer, one output layer, and one or more hidden layers. Fig. 1 illustrates a network with one hidden layer.

[Fig. 1. Multilayer perceptron: input layer, one hidden layer, and output layer.]

"If a complex function is naturally decomposable into a set of simpler functions, then a modular network has the built-in ability to discover the decomposition" (Haykin, 1994). While there exist a number of architectures consisting of a multiplicity of networks (e.g., Nilsson, 1965), the modular neural network used in the present study is based upon the architecture presented in Jacobs et al. (1991), and consists of a group of feedforward neural networks (referred to as 'local experts') competing to learn different aspects of the problem. A gating network controls the competition and learns to assign different regions of the data space to different local expert networks. Fig. 2 illustrates a network with three local experts and a gating network.

[Fig. 2. Modular neural network: three local experts combined through a gating network.]

3.1. Network architecture

3.1.1. Multilayer perceptron (MLP)

Each layer in an MLP consists of one or more processing elements ('neurons'). In the network we will be using, the input layer has $p$ processing elements, i.e., one for each predictor variable. Each processing element in the input layer sends signals $x_i$ ($i = 1, \ldots, p$) to each of the processing elements in the hidden layer. Each processing element in the hidden layer (indexed by $j = 1, \ldots, q$) produces an 'activation'

$$a_j = G\left( \sum_i w_{ij} x_i \right),$$

where $w_{ij}$ are the weights associated with the connections between the $p$ processing elements of the input layer and the $j$-th processing element of the hidden layer. The processing element(s) in the output layer behave in a manner similar to the processing elements of the hidden layer to produce the output of the network:

$$o_k = F\left( \sum_j G\left( \sum_i w_{ij} x_i \right) w_{jk} \right), \quad k = 1, \ldots, r. \tag{4}$$

The main requirements to be satisfied by the activation functions $F(\cdot)$ and $G(\cdot)$ are that they be nonlinear and differentiable. Typical functions used are the sigmoid, hyperbolic tangent, and sine functions, i.e.,

$$F(x) = 1/(1 + e^{-x}), \quad \text{or} \quad (e^x - e^{-x})/(e^x + e^{-x}), \quad \text{or} \quad \sin x. \tag{5}$$

The weights in the neural network can be adjusted to minimize some criterion such as the sum of squared errors (SSE) function

$$E_1 = \frac{1}{2} \sum_{l=1}^{n} \sum_{k=1}^{r} (y_{lk} - o_{lk})^2. \tag{6}$$
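As a concrete illustration of (4)-(6), the sketch below implements the forward pass of a one-hidden-layer MLP with sigmoid activations for both $F$ and $G$. The array names and shapes are our own assumptions, not the authors' notation, and bias terms are omitted to mirror the equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, W_ih, W_ho):
    """Forward pass of Eq. (4): o_k = F( sum_j G( sum_i w_ij x_i ) w_jk ).

    x    : (p,)   input vector of predictor variables
    W_ih : (p, q) input-to-hidden weights w_ij
    W_ho : (q, r) hidden-to-output weights w_jk
    """
    a = sigmoid(x @ W_ih)   # hidden activations a_j = G(sum_i w_ij x_i)
    o = sigmoid(a @ W_ho)   # network outputs o_k
    return a, o

def sse(Y, O):
    """Sum of squared errors E1 over all patterns and outputs, cf. Eq. (6)."""
    return 0.5 * np.sum((Y - O) ** 2)
```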

3.1.2. Modular Neural Network (MNN)

An MNN consists of a multiplicity of networks, referred to as local experts, competing to learn different aspects of a problem. A gating network controls the competition and learns to assign different regions of the data space to different networks. Both the local experts and the gating network have full connections from the input layer. The gating network has as many output nodes as there are local experts, and the output values of the gating network are normalized to sum to one.

Let $t$ represent the number of local experts, $g_m$ the output of the $m$-th neuron of the gating network, $u_m$ the weighted sum of the inputs applied to the $m$-th output neuron of the gating network, $o_m$ the output of the $m$-th local expert, and $o$ the output of the whole MNN. Then

$$o = \sum_{m=1}^{t} g_m o_m. \tag{7}$$

Thus, the final output vector of the MNN is a weighted sum of the output vectors of the local experts. The outputs of the gating network, $g_1, g_2, \ldots, g_t$, are interpreted as the conditional a priori probabilities that the respective local experts generate the current training pattern. Thus, it is required that

$$0 \le g_m \le 1 \quad \text{and} \quad \sum_{m=1}^{t} g_m = 1.$$

This is achieved in the MNN using the following softmax activation function for the output neurons of the gating network:

$$g_m = \frac{e^{u_m}}{\sum_{k=1}^{t} e^{u_k}}. \tag{8}$$

The goal of the MNN and the learning algorithm is to model the probability distribution of the set of training patterns $\{X, Y\}$. This gives us the maximum likelihood objective function given below (Jacobs et al., 1991):

$$E_2 = \ln\left( \sum_{m=1}^{t} g_m \exp\left( -0.5 \, \| Y - o_m \|^2 \right) \right), \tag{9}$$

where $\|\cdot\|$ represents the Euclidean norm.
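The sketch below shows how (7)-(9) fit together: softmax gating outputs weight the local experts' output vectors and enter the objective. The expert and gating networks are represented only by their outputs here; the function names are our own illustration, not the authors' implementation.

```python
import numpy as np

def softmax(u):
    """Gating outputs g_m = exp(u_m) / sum_k exp(u_k), cf. Eq. (8)."""
    e = np.exp(u - u.max())  # shifted for numerical stability
    return e / e.sum()

def mnn_output(expert_outputs, gating_sums):
    """Mixture output o = sum_m g_m o_m, cf. Eq. (7).

    expert_outputs : (t, r) array, one output vector per local expert
    gating_sums    : (t,)   pre-softmax weighted sums u_m of the gating network
    """
    g = softmax(gating_sums)
    return g, g @ expert_outputs

def mnn_objective(y, expert_outputs, g):
    """Maximum-likelihood objective E2 of Eq. (9)."""
    sq = np.sum((y - expert_outputs) ** 2, axis=1)
    return np.log(np.sum(g * np.exp(-0.5 * sq)))
```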
3.2. Network training

3.2.1. Multilayer perceptron

The most popular algorithm for training multilayer perceptrons is the backpropagation algorithm. As the name suggests, the error computed from the output layer is backpropagated through the network, and the weights are modified according to their contribution to the error function. Essentially, backpropagation performs a local gradient search, and hence its implementation, although not computationally demanding, does not guarantee reaching a global minimum. A number of heuristics are available to alleviate this problem, some of which are presented below.

Let $x_j^{[s]}$ be the current output of the $j$-th neuron in layer $s$, $I_i^{[s]}$ the weighted summation of the inputs to the $i$-th neuron in layer $s$, $f'(I_i^{[s]})$ the derivative of the activation function of the $i$-th neuron in layer $s$, and $e_k^{[s]}$ the local error of the $k$-th neuron in layer $s$. Then, after some mathematical simplification (e.g., see Haykin, 1994, pp. 144-152), the weight change equation suggested by backpropagation and (6) can be expressed as follows:

$$\Delta w_{ij}^{[s]} = \eta f'(I_i^{[s]}) \left( \sum_k e_k^{[s+1]} w_{ki} \right) x_j^{[s-1]} + \theta \Delta w_{ij} \tag{10}$$

for non-output layers, and

$$\Delta w_{ij}^{[s]} = \eta (y_i - o_i) f'(I_i) x_j^{[s-1]} + \theta \Delta w_{ij} \tag{11}$$

for the output layer. Here $\eta$ is the learning coefficient, and $\theta$ is the momentum term. One heuristic that is used to prevent the network from getting stuck at a local minimum is random presentation of the training data (e.g., Haykin, 1994, pp. 149-150). In the absence of the second term in (10) and (11), setting a low learning coefficient results in slow learning, whereas a high learning coefficient can produce divergent behavior. The second term in (10) and (11) reinforces general trends whereas oscillatory behavior is canceled out, thus allowing a low learning coefficient but faster learning. Last, it is suggested that starting the training with a large learning coefficient and letting its value decay as training progresses speeds up convergence. All these heuristics have been used in the models studied in the current paper.

There are three criteria that are commonly used to stop network training, namely, after a fixed number of iterations, after the error reaches a certain prespecified minimum, or after the network reaches a fairly stable state and learning effectively ceases. In the current paper we allowed the network to run for a maximum of 100000 iterations. Training was stopped before 100000 iterations if the error criterion ($E_1$ defined in (6)) fell below 0.1. Also, the percentage of training patterns correctly classified was checked after every cycle of 1000 iterations, and the network was saved if there was an improvement over the previous best saved network. Thus, the network used on the test data set was the one that had shown the best performance on the training data set during a training of up to 100000 iterations. The same criteria were also used for the MNN described below.
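A minimal sketch of one stochastic update with momentum for a one-hidden-layer sigmoid network, following (10)-(11). The helper names and the in-place update scheme are our own assumptions; the authors used NeuralWorks Professional II/Plus rather than hand-written code.

```python
import numpy as np

def backprop_step(x, y, W_ih, W_ho, dW_ih, dW_ho, eta=0.3, theta=0.4):
    """One weight update with momentum, cf. Eqs. (10)-(11).

    For the sigmoid f(I) = 1/(1+e^{-I}) the derivative is f'(I) = f(I)(1-f(I)),
    so it can be expressed through the activations themselves.
    """
    a = 1.0 / (1.0 + np.exp(-(x @ W_ih)))      # hidden activations
    o = 1.0 / (1.0 + np.exp(-(a @ W_ho)))      # network outputs
    delta_o = (y - o) * o * (1 - o)            # (y_i - o_i) f'(I_i), Eq. (11)
    delta_h = (W_ho @ delta_o) * a * (1 - a)   # back-propagated error, Eq. (10)
    dW_ho = eta * np.outer(a, delta_o) + theta * dW_ho  # new changes plus
    dW_ih = eta * np.outer(x, delta_h) + theta * dW_ih  # momentum on old ones
    return W_ih + dW_ih, W_ho + dW_ho, dW_ih, dW_ho
```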
3.2.2. Modular neural network

Like the MLP, the MNN is also trained using the backpropagation of error. Training of the local experts and the gating network can occur simultaneously, or each local expert can be trained in a hierarchical fashion. The learning rule is designed to encourage competition among the local experts so that, for a given input vector, the gating network will tend to choose a single local expert rather than a mixture of them. In this way the input space is automatically partitioned into regions so that each local expert takes responsibility for a different region.

Given (9), after some mathematical simplification (e.g., Haykin, 1994, pp. 482-487), the modified weight change equations for the output layers of the local experts and the gating network can be expressed as follows:

$$\Delta w_{ij}^{[s]} = \eta h_i (y_i - o_i) f'(I_i) x_j^{[s-1]} + \theta \Delta w_{ij} \tag{12}$$

for the local experts, and

$$\Delta w_{ij}^{[s]} = \eta (h_i - g_i) x_j^{[s-1]} + \theta \Delta w_{ij} \tag{13}$$

for the gating network, where

$$h_i = \frac{g_i \exp\left( -0.5 \, \| Y - o_i \|^2 \right)}{\sum_{m=1}^{t} g_m \exp\left( -0.5 \, \| Y - o_m \|^2 \right)}. \tag{14}$$

$h_i$ can be interpreted as the posterior probability that the $i$-th local expert is responsible for the current output vector.
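The following sketch computes the posterior responsibilities (14) that drive the updates (12) and (13); it is an illustration under our own naming, not the authors' code.

```python
import numpy as np

def expert_posteriors(y, expert_outputs, g):
    """Posterior responsibilities h_i of Eq. (14)."""
    lik = g * np.exp(-0.5 * np.sum((y - expert_outputs) ** 2, axis=1))
    return lik / lik.sum()

# The error signals entering the weight changes are then:
#   expert i, output layer: h[i] * (y - o_i) * f'(I_i)   (Eq. 12)
#   gating network, i-th output: (h[i] - g[i])           (Eq. 13)
```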
3.3. Network size

Important questions about network size that need to be answered are as follows: first, how does one determine the number of processing elements in the hidden layer, and second, how many hidden layers are adequate? As yet there are no firm answers to these questions; however, certain heuristics are available. A higher number of processing elements gives the network greater flexibility to fit the data. However, this need not necessarily be good, because the network will then capture not only the information content of the data but also the error in the data, and hence will not be very useful for the purpose of generalization or prediction. As for the second question, theoretically a single hidden layer should be sufficient, but the training time required with a single hidden layer may be very high. Addition of a second hidden layer may reduce the training time.

In stepwise regression, one can start with a small number of variables and add the most significant variable (forward selection) one by one, or one can start with a large number of variables and eliminate the least significant variable (backward elimination) one by one. Criteria to select best subsets of regressors are also available. Equivalent strategies in neural network modeling have been explored. For example, the cascade correlation network (Fahlman and Lebierre, 1990) starts with zero neurons in the hidden layer and adds one neuron at a time until the error criterion is satisfied or the performance stops improving. Alternatively, one can start with a network with a large number of neurons and eliminate neurons. It has been suggested that "starting with oversized networks rather than with tight networks seems to make it easier to find a good solution" (Weigend et al., 1990). One then tackles the problem of overfitting by somehow eliminating the excess neurons. While a number of methods for eliminating the excess neurons have been explored, a method that seems to be readily applicable in our case is given below.

3.3.1. Network pruning

The hypothesis that "...the simplest most robust network which accounts for a data set will, on average, lead to the best generalization to the population from which the training set has been drawn" was made by Rumelhart (reported in Hanson and Pratt, 1988). One of the simplest implementations of this hypothesis is to check the network at periodic intervals and eliminate a certain maximum number of nodes in the hidden layers if the elimination does not lead to a significant deterioration in performance. We implemented this hypothesis in our current paper as follows. As mentioned above, each network was trained for up to 100000 iterations. The following procedure was used after every cycle of 1000 iterations to check whether one or more of the processing elements could be permanently disabled (a code sketch follows the list):

(a) Find the percentage of the training data set correctly classified to determine a 'reference' level of performance.
(b) Disable the processing element by setting its output to 1.0.
(c) If the percentage of the training data set correctly classified did not drop below 75% of the 'reference' level, the processing element was added to the 'candidate' prune list.
(d) Disable the processing element by setting its output to 0.0.
(e) If the percentage of the training data set correctly classified did not drop below 75% of the 'reference' level, the processing element was added to the 'candidate' prune list.
(f) Re-enable the processing element and continue with the next one.
(g) Sort the 'candidate' prune list in the order of performance and disable the top two candidates by setting their output to zero.
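The sketch below mirrors steps (a)-(g) for a single hidden layer, using an output-clamping callback. The network representation (a dictionary of clamped outputs passed to an accuracy function) is our own assumption for illustration.

```python
def prune_candidates(accuracy_fn, n_hidden, max_prune=2):
    """Steps (a)-(g): try clamping each hidden neuron's output to 1.0 and 0.0,
    keep those whose removal preserves at least 75% of the reference accuracy,
    and permanently disable the best few candidates.

    accuracy_fn(clamp): training accuracy given {neuron_index: clamped_output}.
    """
    reference = accuracy_fn({})                  # (a) reference accuracy level
    candidates = []
    for j in range(n_hidden):
        for value in (1.0, 0.0):                 # (b) clamp to 1.0, (d) to 0.0
            acc = accuracy_fn({j: value})
            if acc >= 0.75 * reference:          # (c), (e) keep as candidate
                candidates.append((acc, j))
        # (f) re-enabling is implicit: each trial clamps only neuron j
    candidates.sort(reverse=True)                # (g) best performance first
    pruned = []
    for _, j in candidates:
        if j not in pruned:
            pruned.append(j)
        if len(pruned) == max_prune:
            break
    return pruned                                # indices to disable permanently
```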
4. Data sample and model construction

4.1. Data sample

As mentioned earlier, unlike large banks, credit unions' loan files were not kept in readily available computerized databases. Consequently, samples had to be laboriously collected by analyzing individual loan files. Data were collected from the loan files of three credit unions in the Southeastern United States for the period 1988 through 1991. Credit union L is predominantly made up of teachers and credit union M is predominantly made up of telephone company employees, whereas credit union N represents a more diverse state-wide sample. The narrowness of membership is somewhat mitigated by the inclusion of family members in all three credit unions. Only credit union M had added select employee groups to actively diversify its field of membership.
Predictor variables commonly used in credit scoring studies include various debt ratio and other cash flow-oriented surrogates, employment time, home ownership, major credit card ownership, and representations of past payment history (e.g., Overstreet et al., 1992). Additional variables that can be added to the model include detailed credit bureau reports (e.g., Overstreet and Bradley, 1994). In selecting predictor variables care must be taken to comply with the terms of Regulation B of the Equal Credit Opportunity Act, so as to avoid non-intuitive variables. Based on all the considerations mentioned above, eighteen variables were selected for the present study; they are given in Table 1.

Table 1
List of predictor variables

MAJCARD     Number of major credit cards
OWNBUY      Owns home
INCOME      Salary plus other income
GOODCR      'Good' depending upon derogatory information and number of 01-09 ratings on credit bureau reports
JOBTIME     Number of years in current job
DEPENDS     Number of dependents
NUMINQ      Number of inquiries in past 7 months
TRADEAGE    Number of months since trade line opened
TRADE75     Number of trade lines 75% full
PAYRATE1    Monthly payments as a proportion of income
DELNQ       Delinquent accounts in past 12 months
OLDDEBT     Total debt as a proportion of income
AGE         Age of borrower
ADDRTIME    Number of years at the current address
ACCTOPEN    Number of open accounts on credit bureau reports
ACTLOANS    Number of active accounts on credit bureau reports
PREVLOANS   Number of previous loans with credit union
DEROGINF    Information based upon 01-09 ratings on credit bureau reports

The data collected had 962 observations for credit union L, 918 observations for credit union M, and 853 observations for credit union N. A binary dependent variable was defined in the following manner: a case was allocated to the 'bad' category if at any time in the last 48 months the customer's most recent loan was either charged off or the customer went bankrupt. All other cases were allocated to the 'good' category, provided that the most recent loan was between 48 months and 18 months old. After eliminating observations with corrupted or missing elements we were left with 505 observations for credit union L with 81.58% good loans, 762 observations for credit union M with 74.02% good loans, and 695 observations for credit union N with 78.85% good loans.

There exist several approaches for validating statistical models (e.g., see Dillon and Goldstein, 1984, or Hair et al., 1992). The simplest approach, referred to as the cross-validation method, involves dividing the data into two subsets, one for training (analysis sample) and a second one for testing (holdout sample). More sophisticated approaches include the U-method and the jackknife method. Both these methods are based on the 'leave-one-out' principle, where the statistical model is fitted to repeatedly drawn samples of the original sample. Dillon and Goldstein (1984, p. 393) suggest that in the case of discriminant analysis a large standard deviation in the estimator for misclassification probabilities can overwhelm the bias reduction achieved by the U-method, and if multivariate normality is violated, it is questionable whether jackknifed coefficients actually represent an improvement in general. Also, these methods can be computationally expensive. An intermediate approach, and perhaps the most frequently used approach, is to randomly divide the original sample into analysis and holdout samples several times. Given the substantial size of the data, and the fact that we investigate seven models, we decided to use this intermediate approach. The data sample was divided into two parts, with two thirds of the observations being used for training and the remaining one third for testing. Observations were randomly assigned to the training or testing data set, and ten such pairs of data sets were created. A popular approach is to use stratified sampling in order to keep the proportion of good loans and bad loans identical across all data sets. Since the percentage of bad loans is different for the three credit unions, and since claims by practitioners imply that the performance of neural networks in comparison to the conventional methods would depend upon the proportion of bad loans in the data set, we decided not to use stratified sampling and let the percentage of bad loans vary across the ten data sets, so that our results would not be dependent upon the particular composition of the data sample at hand. As Section 5 indicates, when the results were compared we accounted for this variation by performing paired t-tests. The data sets for the generic models were created by merging the data sets for the individual credit unions.
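A minimal sketch of the sampling scheme just described: ten independent, unstratified random splits with two thirds of the observations for training and one third for testing. The NumPy framing and the seed are our own choices for illustration.

```python
import numpy as np

def make_splits(n_obs, n_splits=10, train_frac=2 / 3, seed=42):
    """Ten random (unstratified) 2/3 training / 1/3 testing partitions."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_splits):
        perm = rng.permutation(n_obs)            # random assignment of cases
        cut = int(round(train_frac * n_obs))
        splits.append((perm[:cut], perm[cut:]))  # (train indices, test indices)
    return splits

# Hypothetical usage for credit union L (505 usable observations):
splits_L = make_splits(505)
# Generic-model data sets would merge the corresponding splits of L, M and N.
```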
4.2. Model construction

As the preceding paragraphs indicate, we want to use eighteen predictor variables to predict whether a loan will be a 'good' loan or a 'bad' loan. This fixes the number of neurons in the input and output layers at eighteen and one respectively. We have used a sigmoid activation function in the hidden layers and the output layer, i.e.,

$$F(z) = 1/(1 + e^{-z}).$$

After some preliminary testing we decided to use a single hidden layer starting with nine neurons for the customized models, and two hidden layers starting with nine neurons in the first layer and three neurons in the second layer for the MLP generic models. For the MNN generic models we started with nine neurons in the hidden layer of each local expert and six neurons in the hidden layer of the gating network. All models were trained separately for each of the 10 samples for each credit union. Also, note that because of pruning the models did not all have the same number of neurons at the end of training. The number of neurons deleted ranged from 0 to 7 of the 9 hidden-layer neurons for the MLP models, and from 0 to 4 of the 9 neurons in the hidden layer of the local experts of the MNN models. Also, up to 1 of the 3 local experts in the MNN models was eliminated during training.

For the MLP models the initial values of the learning coefficient (η) were set at 0.3 for the hidden layer and 0.15 for the output layer, and the momentum term (θ) was set at 0.4 for all layers. These coefficients were allowed to decay by reducing their values by half after 10000 iterations, and again by half after 30000 iterations. For the local experts in the MNN models the initial values of the learning coefficient (η) were set at 0.9 for the hidden layer and 0.6 for the output layer, and the momentum term (θ) was set at 0.4 for all layers. Also, for the gating network in the MNN models, the initial values of the learning coefficient (η) were set at 0.3 for the hidden layer and 0.075 for the output layer, and the momentum term (θ) was set at 0.4. All these coefficients were allowed to decay by reducing their values by half after 10000 iterations, and again by half after 30000 iterations.

In all, the three customized models and the four generic models tested were as follows:

Model lda_c: Customized model using linear discriminant analysis.
Model lr_c: Customized model using logistic regression.
Model mlp_c: Customized model using multilayer perceptron.
Model lda_g: Generic model using linear discriminant analysis.
Model lr_g: Generic model using logistic regression.
Model mlp_g: Generic model using multilayer perceptron.
Model mnn_g: Generic model using modular neural network.

In the LDA models a case was classified into the group for which the posterior probability $P(G_j \mid Z)$ was greatest, where

$$P(G_j \mid Z) = \frac{P(Z \mid G_j) P(G_j)}{\sum_{i=1}^{2} P(Z \mid G_i) P(G_i)},$$

with
$P(G_j)$ = prior probability that a case is a member of group $j$;
$P(Z \mid G_j)$ = conditional probability of a score of $Z$, given membership of group $j$.

Here $Z$ is the discriminant score generated using (1), and $P(G_j)$ was set equal to the proportion of cases in the training sample which were members of group $j$. In the LR models a case was allocated to group $j$ if the probability (equal to $Z$ in (3)) that it was a member of group $j$ was greater than 0.5. Following Tam and Kiang (1992), in the neural network models a case was allocated as a 'bad' loan if the output of the neural network was less than 0.5, and as a 'good' loan otherwise, the rationale for this scheme being that in the training data set the 'good' loans were coded as ones and the 'bad' loans were coded as zeros.
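The three allocation rules above reduce to simple comparisons. The sketch below restates them in Python; for the LDA posterior we assume univariate normal score densities $P(Z \mid G_j)$, which is our own illustrative assumption rather than a detail stated by the authors.

```python
import numpy as np
from math import exp, sqrt, pi

def lda_classify(z, means, sds, priors):
    """Assign to the group j maximizing P(G_j | Z) from the Bayes rule above,
    assuming univariate normal score densities P(Z | G_j)."""
    dens = [exp(-0.5 * ((z - m) / s) ** 2) / (s * sqrt(2 * pi))
            for m, s in zip(means, sds)]
    post = np.array(dens) * np.array(priors)   # priors = training proportions
    return int(np.argmax(post / post.sum()))

def lr_classify(prob_good):
    """LR rule: allocate to 'good' if the modeled probability Z exceeds 0.5."""
    return 'good' if prob_good > 0.5 else 'bad'

def nn_classify(output):
    """Neural network rule (Tam and Kiang, 1992): 'bad' if output < 0.5."""
    return 'bad' if output < 0.5 else 'good'
```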
4.3. Computational experience

The LR and LDA models were implemented on a Sequent S2000/750 machine with eighteen 386 and 486 processors, using SPSS version 6.0. The program was probably run on a single 20 MHz 386 processor. The CPU time required for training each sample ranged from 4 to 25 seconds for the LDA models and 7 to 65 seconds for the LR models. The MLP and MNN models were run on an IBM PS/2 machine with a 33 MHz 486 processor, using NeuralWorks Professional II/Plus. The real time required for training each sample ranged from 300 to 900 seconds.

5. Comparison of results

Table 2 gives the results of our experiments for the customized models, and Table 3 does the same for the generic models.

Table 2
Customized models

Data set   Sample    mlp_c              lda_c              lr_c
           % bad     % total   % bad    % total   % bad    % total   % bad

Credit union L:
1          14.88     83.33     32.00    83.30     16.00    82.70     24.00
2          17.26     87.50     31.03    82.70     20.69    81.00     34.48
3          17.26     86.90     41.38    83.90     31.03    82.10     37.93
4          22.02     81.55     37.84    84.52     32.43    82.74     37.84
5          18.45     82.14     29.03    77.40     16.13    78.00     38.71
6          17.26     83.33     41.38    81.55     27.59    82.74     41.38
7          21.43     80.36     16.67    78.50     08.33    80.95     30.56
8          17.86     79.76     56.67    82.14     10.00    81.55     23.33
9          16.67     85.71     32.14    86.31     32.14    86.90     39.29
10         18.45     83.33     29.03    79.76     19.35    80.36     29.03

Credit union M:
1          26.77     85.43     79.71    86.60     70.97    87.40     73.53
2          26.77     88.58     76.47    85.40     18.64    88.20     79.41
3          20.47     85.43     80.77    87.40     31.58    85.80     75.00
4          23.63     90.94     75.00    90.20     35.14    89.00     75.00
5          24.02     89.37     88.52    89.80     22.22    88.60     81.97
6          26.38     88.58     74.63    88.19     12.96    86.22     71.64
7          26.77     85.43     74.63    85.04     28.30    86.22     77.61
8          25.59     88.19     75.38    86.61     28.89    87.40     67.69
9          27.59     87.40     72.86    85.43     31.37    86.61     67.14
10         24.41     90.55     82.26    89.37     21.05    89.37     80.65

Credit union N:
1          25.43     75.86     49.15    75.40     18.64    78.50     27.11
2          24.57     77.15     40.35    78.50     31.58    78.50     24.56
3          15.95     81.90     35.14    83.20     35.14    84.10     35.14
4          19.40     77.15     44.44    78.90     22.22    80.60     28.89
5          23.28     76.72     24.07    75.40     12.96    75.90     18.52
6          22.84     79.31     35.85    75.00     28.30    76.72     26.42
7          19.40     81.90     31.11    80.60     28.89    81.03     28.89
8          21.98     74.13     37.25    74.57     31.37    75.43     31.37
9          24.57     80.17     47.37    77.59     21.05    78.88     21.05
10         21.98     77.59     19.61    77.16     21.57    76.72     19.61

Average              83.19     49.72    82.35     38.49    82.67     44.93
p-value a                               0.018     5.7E-07  0.109     0.007

a The p-values are for a one-tailed paired t-test comparing the mlp_c results with the other two methods.
For each model the first column gives the total (i.e., the good plus the bad) percentage of loans correctly classified, and the second column gives the percentage of bad loans correctly classified. Since the cost of giving a loan to a defaulter is far greater than that of rejecting a good applicant, the percentage of bad loans correctly identified is an important measure of a model's effectiveness. While interpreting the results given in Tables 2 and 3 one must keep in mind that, in the experiments reported in the present study, we did not explicitly include misclassification costs, because one of the methods, namely logistic regression, does not allow that feature. Note that, given the information in Section 4.1, the typical classification matrix can easily be obtained from the data given in these tables.

In comparing the three credit unions one can see that the models for credit union M are the best, followed by credit unions N and L respectively. Given the limited amount of information one can only speculate as to the reasons for these differences. First, credit union M has the largest sample size, followed by credit unions N and L, which is the same order as their performance. Second, credit union M is the only credit union that added select employee groups to actively diversify its field of membership, and credit union N represents a more diverse statewide sample, whereas credit union L has perhaps the narrowest field of membership. Thus, both these factors indicate that, perhaps as one would expect, the diversity of examples in the data set in terms of size and variety is important in creating good models.

Customized models

As Table 2 indicates, the multilayer perceptron identifies 83.19% of the loans correctly, in comparison to 82.35% for the linear discriminant analysis and 82.67% for the logistic regression. This indicates that, using this measure,

mlp_c > lr_c > lda_c,

albeit marginally. This is further confirmed using a more formal comparison, namely a paired t-test. Since the data sets differ in that the proportion of bad loans is different for the ten data sets, we accounted for this difference by using the paired t-test. As the p-values indicate, the multilayer perceptron is clearly better than linear discriminant analysis, whereas the difference is not as significant when compared to logistic regression.

As Table 2 indicates, the multilayer perceptron correctly identifies 49.72% of the bad loans, in comparison to 38.49% for the discriminant analysis and 44.93% for the logistic regression. This indicates that, using this measure, once again,

mlp_c > lr_c > lda_c.

Of course this measure indicates a much greater difference between the three methods, in comparison to the first measure. This is further confirmed using the paired t-tests.
Table 3
Generic models

Data set   mlp_g              mnn_g              lda_g              lr_g
           % total   % bad    % total   % bad    % total   % bad    % total   % bad
1          79.97     28.29    80.12     28.95    79.80     27.63    80.30     33.55
2          82.26     35.06    81.80     38.31    80.60     31.82    81.40     35.06
3          81.04     36.44    79.20     47.46    82.40     37.29    82.70     40.68
4          82.88     33.80    83.33     38.03    83.49     40.85    83.49     44.37
5          79.21     32.88    78.59     43.15    80.60     37.67    80.01     39.04
6          81.04     53.69    80.73     35.57    80.58     38.93    81.35     42.95
7          80.73     44.30    79.82     44.97    80.73     36.24    82.57     41.61
8          78.89     50.00    79.82     41.96    80.58     32.88    81.35     39.73
9          81.92     43.87    80.43     32.90    81.04     30.97    81.35     33.55
10         79.51     62.50    80.73     32.64    81.35     33.33    82.42     40.28

Average    80.75     42.08    80.46     38.39    81.12     34.76    81.70     39.08
p-value a                     0.191     0.197    0.98      0.035    0.83      0.19

a The p-values are for a one-tailed paired t-test comparing the mlp_g results with the other three methods.
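The paired t-tests reported in Tables 2 and 3 compare two methods' accuracies over the same ten data sets. A minimal sketch using SciPy (our tooling choice; the paper predates it), shown here with the mlp_c and lda_c total-accuracy columns for credit union L from Table 2:

```python
import numpy as np
from scipy import stats

# Per-data-set total accuracies on the same ten splits (credit union L, Table 2).
acc_mlp = np.array([83.33, 87.50, 86.90, 81.55, 82.14,
                    83.33, 80.36, 79.76, 85.71, 83.33])
acc_lda = np.array([83.30, 82.70, 83.90, 84.52, 77.40,
                    81.55, 78.50, 82.14, 86.31, 79.76])

# One-tailed paired t-test: is mlp better than lda on average across splits?
t_stat, p_two_sided = stats.ttest_rel(acc_mlp, acc_lda)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(f"t = {t_stat:.3f}, one-tailed p = {p_one_sided:.4f}")
```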
Generic models

As Table 3 indicates, the multilayer perceptron identifies 80.75% of the loans correctly, in comparison to 80.46% for the modular neural network, 81.12% for the linear discriminant analysis, and 81.70% for the logistic regression. This indicates that, using this measure,

lr_g > lda_g > mlp_g > mnn_g,

albeit marginally. The paired t-tests further confirm that we cannot claim that the multilayer perceptron is significantly better than the conventional methods with any degree of confidence. Thus, logistic regression still performs better than linear discriminant analysis, but now the multilayer perceptron goes from being the best of the three methods to being the worst. Also, the performance of the modular neural network is worse than that of the multilayer perceptron.

As Table 3 indicates, the multilayer perceptron correctly identifies 42.08% of the bad loans, in comparison to 38.39% for the modular neural network, 34.76% for the discriminant analysis, and 39.08% for the logistic regression, indicating that, using this measure,

mlp_g > lr_g > mnn_g > lda_g.

Once again, this measure indicates a much greater difference between the four methods, in comparison to the first measure. This is further confirmed using the paired t-tests.

Discussion

Note that none of the methods for customized or generic models are as good at correctly identifying bad loans as they are at correctly identifying good loans, seeming to confirm the claims by practitioners reported in Section 1. The fact that the logistic regression models do better than the linear discriminant analysis models, for customized as well as generic models, is consistent with results reported elsewhere (e.g., Harrell and Lee, 1985), and is perhaps due to the presence of categorical independent variables, which violates the multivariate normality assumption required for linear discriminant analysis.

In comparing the results for customized models versus generic models one can see that in going from customized to generic models the maximum difference is 2.44% for identifying good and bad loans (mlp_c versus mlp_g), and 7.64% for identifying bad loans (mlp_c versus mlp_g). This seems to suggest that if identifying bad loans is the primary criterion, then customized models offer the most promising avenue, whereas generic models would be almost as promising as customized models if the primary criterion were the total percentage correctly identified.

In comparing the results of the best neural network method with the best conventional method one can see that the maximum difference is 0.52% for identifying good and bad loans (mlp_c versus lr_c), and 4.79% for bad loans (mlp_c versus lr_c). This seems to suggest that if identifying bad loans is the primary criterion, then neural networks offer the most promising avenue, whereas conventional methods would be almost as promising if the primary criterion were the total percentage correctly identified.

The fact that the performance of models for one credit union is different from another, and the fact that the performance of all three methods deteriorates in going from customized to generic models, seem to suggest that there are significant differences between the three credit unions. This would suggest that a method such as a modular neural network should help improve the predictive ability that can be generated from the data sample. The fact that we could not design an effective modular neural network could perhaps be due to a failure on the part of the modelers, or perhaps other modular architectures need to be explored.
6. Conclusions and future research

An attempt to investigate the predictive power of feedforward neural networks in comparison to traditional techniques such as linear discriminant analysis and logistic regression was made. In particular we used feedforward neural networks with backpropagation of error. A particular advantage offered by neural networks is that a prespecification of the model is not required. Since funding and small sample size often preclude the use of customized credit scoring models at small credit unions, we investigated the performance of generic models and compared them with customized models.

Our results indicate that customized neural networks offer a promising avenue if the measure of performance is the percentage of bad loans correctly classified. However, if the measure of performance is the percentage of good and bad loans correctly classified, logistic regression models are comparable to the neural networks approach. The performance of generic models was not as good as that of the customized models, particularly when it came to correctly classifying bad loans. Also, there were significant differences in the results for the three credit unions, indicating that more innovative architectures might be necessary for building effective generic models, which, we believe, could be a fruitful area for future research.

Acknowledgement

Financial support from the McIntire Associates Program is gratefully acknowledged by the first author.

References

American Banker (1994a), March 2, 15:1.
American Banker (1994b), April 22, 17:1.
American Banker (1993a), August 27, 14:4.
American Banker (1993b), July 14, 3:1.
American Banker (1993c), June 25, 3:1.
American Banker (1993d), March 29, 15A:1.
American Banker (1993e), October 5, 14:1.
Brennan, P.J. (1993a), "Promise of Artificial Intelligence remains elusive in banking today", Bank Management, July, 49-53.
Brennan, P.J. (1993b), "Profitability scoring comes of age", Bank Management, September, 58-62.
Bryson, A.E., and Ho, Y.C. (1969), Applied Optimal Control, Hemisphere, New York.
Dillon, W.R., and Goldstein, M. (1984), Multivariate Analysis: Methods and Applications, Wiley, New York.
Fahlman, S.E., and Lebierre, C. (1990), "The Cascade-Correlation learning architecture", School of Computer Science Report CMU-CS-90-100, Carnegie Mellon University.
Gilbert, E.S. (1968), "On discrimination using qualitative variables", Journal of the American Statistical Association 63, 1399-1412.
Gothe, P. (1990), "Credit Bureau point scoring sheds light on shades of gray", The Credit World, May-June, 25-29.
Hair, J.F., Anderson, R.E., Tatham, R.L., and Black, W.C. (1992), Multivariate Data Analysis with Readings, Macmillan, New York.
Hanson, S.J., and Pratt, L. (1988), "A comparison of different biases for minimal network construction with back-propagation", in: D.S. Touretzky (ed.), Advances in Neural Information Processing Systems 1, Morgan Kaufmann, San Mateo, CA, 177-185.
Harrell, F.E., and Lee, K.L. (1985), "A comparison of the discrimination of discriminant analysis and logistic regression under multivariate normality", in: P.K. Sen (ed.), Biostatistics: Statistics in Biomedical, Public Health, and Environmental Sciences, North-Holland, Amsterdam.
Haykin, S. (1994), Neural Networks: A Comprehensive Foundation, Macmillan, New York.
Jacobs, R.A., Jordan, M.I., Nowlan, S.J., and Hinton, G.E. (1991), "Adaptive mixtures of local experts", Neural Computation 3, 79-87.
Jost, A. (1993), "Neural networks: A logical progression in credit and marketing decision systems", Credit World, March/April, 26-33.
Lapedes, A., and Farber, R. (1987), "Non-linear signal processing using neural networks: Prediction and system modeling", Los Alamos National Laboratory Report LA-UR-87-2662.
Moore, D.H. (1973), "Evaluation of five discrimination procedures for binary variables", Journal of the American Statistical Association 68, 399.
Nilsson, N.J. (1965), Learning Machines: Foundations of Trainable Pattern-Classifying Systems, McGraw-Hill, New York.
Overstreet, Jr., G.A., and Bradley, Jr., E.L. (1994), "Applicability of generic linear scoring models in the USA Credit Union environment: Further analysis", Working Paper, University of Virginia.
Overstreet, Jr., G.A., Bradley, Jr., E.L., and Kemp, R.S. (1992), "The flat-maximum effect and generic linear scoring model: A test", IMA Journal of Mathematics Applied in Business and Industry 4, 97-109.
Parker, D.B. (1982), "Learning logic", Invention Report 581-64 (File 1), Office of Technology Licensing, Stanford University.
Robbins, H., and Munro, S. (1951), "A stochastic approximation method", Annals of Mathematical Statistics 22, 400-407.
Rosenblatt, F. (1962), Principles of Neurodynamics, Spartan Books, Washington, DC.
Rumelhart, D.E. (1988), "Learning and generalization", in: Proceedings of the IEEE International Conference on Neural Networks, plenary address, San Diego, CA.
Rumelhart, D.E., Hinton, G.E., and Williams, R.J. (1986), "Learning internal representations by error propagation", in: D.E. Rumelhart and J.L. McClelland (eds.), Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Vol. 1, MIT Press, Cambridge, MA, 318-362.
Tam, K.Y., and Kiang, M.Y. (1992), "Managerial applications of neural networks: The case of bank failure predictions", Management Science 38, 926-947.
Weigend, A.S., Huberman, B.A., and Rumelhart, D.E. (1990), "Predicting the future: A connectionist approach", Working Paper, Stanford University.
Werbos, P. (1974), "Beyond regression: New tools for prediction and analysis in the behavioral sciences", unpublished Ph.D. dissertation, Dept. of Applied Mathematics, Harvard University.
Widrow, B. (1962), "Generalization and information storage in networks of adaline 'neurons'", in: M.C. Yovitz, G.T. Jacobi, and G.D. Goldstein (eds.), Self-Organizing Systems, Spartan Books, Washington, DC, 435-461.