
Expert Systems with Applications 30 (2006) 437–447

Evaluation of rating systems

Andreas Oelerich a,*, Thorsten Poddig b
a HSH Nordbank AG, Gerhart-Hauptmann Platz 50, 20095 Hamburg, Germany
b University of Bremen, Chair of Finance, Hochschulring 4, 28359 Bremen, Germany


Since the publication of the initial consultative proposal for a new Basel capital accord in June 1999 (and the latest proposal from summer 2004), the influence of the proposed changes on bank management has been discussed intensively. In particular, the possibility to forecast insolvencies is one of the most relevant questions in many empirical studies. In this paper, we present an evaluation methodology for quantitative rating systems. As an example, we use the well-known logistic regression model in order to demonstrate the proposed evaluation methodology, and we discuss the results obtained in detail. Any other method (statistical or artificial intelligence methods, e.g. neural networks, fuzzy logic) can be evaluated in the same manner. As a side effect, the approach proposed might lead to improved forecasting results.
© 2005 Elsevier Ltd. All rights reserved.

* Corresponding author: A. Oelerich.
doi:10.1016/j.eswa.2005.10.004

1. Introduction

Rating grades are used to individually symbolise and quantify economic risks of a company. Most popular are the symbols of the leading rating agencies like Moody's and Standard & Poor's (e.g. AAA and Aaa, respectively). As of the beginning of 2007 the broad use of rating grades is recommended by the BIS (Bank for International Settlement, 2003). According to the recommendation of the BIS, national supervisory authorities will reflect these proposals in national supervisory procedures or standards.

Rating grades are an important input parameter to portfolio credit risk models because credit ratings are mapped to probabilities of default (Carey and Hrycay, 2001, p. 197). In order to obtain quantitatively based rating grades, statistical models are applied. They provide a simple score of a company, distinguishing between solvent and insolvent companies (e.g. Foreman, 2003). In the next step, these scores are further mapped to more refined and discrete rating grades. Unlike the historical information about the state of solvency of a company, its true rating grade is always unknown and unobservable. Therefore, it is not possible to compare forecasted ratings with true ratings.

In order to evaluate the accuracy of forecasted rating grades, a Monte-Carlo simulation approach is presented in this paper. We briefly discuss the theoretical background in Section 2. In Section 3, we shortly describe the logistic regression and the bootstrap method. Section 4 introduces our evaluation system based on Monte-Carlo simulation and artificial data sets. The results are presented in Section 5 with subsequent concluding remarks.

2. Theoretical background

The proposal of a new Basel capital agreement contains three fundamental innovations (pillars). The first pillar includes the integration of operational risk in the calculation of the capital ratio and the use of rating systems to individually quantify the economic risk of a borrower. The minimum capital requirement retains both the current definition of the total capital and the minimum requirement of at least 8% of the bank's equity to risk weighted assets. Concerning the second pillar, a supervisory review process has to validate the bank's capital allocation techniques and its compliance with relevant standards. The third pillar's aim is to strengthen market discipline through enhanced disclosure requirements for banks.

This paper deals with the first pillar, especially with the generation and use of internal or external rating grades. Banks have to calculate the regulatory capital necessary for each borrower individually. The capital requirement depends on the individual economic risk of the borrower. This economic risk is represented by an internal or external credit rating grade. Banks could or should use an internal rating system based on

historical data in order to optimise their risk management. In principle, there are three techniques available to forecast a rating grade: a qualitative, a quantitative and a combined approach. Moody's and Standard and Poor's mainly use qualitative approaches, which means that they use the know-how of credit risk experts instead of statistical methods. Yet the qualitative approach has crucial disadvantages. It is very time consuming and expensive. Therefore such an approach cannot be applied to rate the large number of small and medium-sized enterprises. Banks are interested in an automated process to forecast rating grades. Some banks use a statistical approach to calculate a rating grade from historical accounting data only. Other banks apply a combined approach. They use statistical methods to get a quantitatively based rating grade which is finally refined by experts, resulting in an improved rating grade.

The major disadvantage of a pure quantitative approach is the lack of an additional judgement by credit risk experts which might improve the forecast. Therefore, a quantitative rating grade forecast must be of high quality. Banks must be able to rely on the rating grade forecasting system, because the resulting probability of default is a crucial input parameter to credit risk portfolio models (see Carey and Hrycay, 2001, p. 197).

Many statistically based methods used to generate a rating grade are only asymptotically consistent (for logistic regression see, e.g. Fahrmeir and Kaufmann, 1985; Fahrmeir and Kaufmann, 1986). Therefore, analysts need a large database of information about insolvencies and solvencies to achieve nearly consistent forecasts of rating grades. Very often, a large database is not available. So it is questionable whether such statistical methods are unbiased, and it is unclear which effects may result. Furthermore, the whole portfolio credit risk management system of a bank is based on these probabilities of default. The influence of a bias in the forecasting process of rating grades might be huge. Because of this small sample problem, the quality of statistical methods is a crucial question concerning modern portfolio credit risk management systems.

The major objective of many empirical studies is to compare different statistical methods to classify the default risk of borrowers (see, e.g. Altman, 1968; Desai, Crook, & Overstreet, 1996; Frerichs & Wahrenburg, 2003; Huang, Che, Hsu, Chen, & Wu, 2004; Srinivasan & Kim, 1987). By using accounting data and other historical information about solvent and insolvent companies, a statistically based model enables experts to forecast whether a company will likely become insolvent or not. A possible qualitative assessment of such a model is e.g. the proportion of correctly classified borrowers with respect to solvent vs. insolvent firms in an out-of-sample or out-of-time sample test (see e.g. Altman, 1968; Anders, 1998; Desai et al., 1996; Frerichs & Wahrenburg, 2003; Huang et al., 2004; Leker & Schewe, 1998; Srinivasan & Kim, 1987). Borrowers are frequently classified with high accuracy in the cited empirical studies. Nevertheless, some borrowers must be analysed in detail in order to get an adequate classification. For example, the Deutsche Bundesbank (see Deutsche Bundesbank, 1999) decides based on the resulting scores whether a borrower should be viewed in detail or not. To minimise this effort, banks need a high quality statistical approach. This implies that a rating system has to provide an accurate dichotomous classification, so that it minimises additional analysis of borrowers.

Beside the dichotomous classification, another characteristic of rating systems is also of significant importance: the forecasted rating grade. It is quite a difference whether a borrower gets the rating grade AAA or BBB. The individual rating grade (in conjunction with the mapping process from a rating grade to a probability of default) is the important input parameter to portfolio credit risk models. A classification system with an excellent polytomous classification rate is therefore also essential. Overestimating or underestimating correct grades is expensive and introduces a bias into risk judgements and the whole risk management process.

Therefore, in a first step banks need a powerful approach to forecast the dichotomous classification, because this decides about an acceptance, a rejection or a further detailed analysis of the borrowers through credit experts. In a second step banks require a powerful system to forecast detailed rating grades for correct capital allocation and risk management. Unfortunately, there is no possibility to observe true rating grades in a historical data sample. Only the event of insolvency is observable. If a solvent borrower gets a rating grade, e.g. BBB, it is impossible to find out whether this (forecasted) grade will prove to be correct or not (as long as the borrower stays solvent). A validation of the forecasted rating grades by using a real data sample is impossible. Nevertheless, forecasted rating grades will affect credit conditions, the credit portfolio optimisation and the bank's success. In the following, we present an evaluation system to assess the quality of statistical methods based on a given (artificial) data sample. Based on this system a bank is able to decide which method should be used to forecast rating grades.

3. Logistic regression and bootstrapping

In this section, we introduce the classical logistic regression and the integration of the bootstrap method. The logistic regression serves as an example in order to demonstrate the Monte-Carlo based evaluation system later on. The integration of the bootstrap method in the logistic regression is an example for a possible improvement of classical methods. The evaluation system can be applied to modern methods like neural networks or support vector machines. But the evaluation of such modern methods is beyond the scope of this paper.

In logistic regression, a single outcome variable Y_k (wherein k = 1,…,N indicates the N objects) follows a Bernoulli distribution. It becomes one with probability p_k and zero with probability 1 − p_k. In the case of modelling insolvencies, the event one may describe the default of a company, and the probability p_k is then called the probability of default (shortly: PD) of the kth company. p_k varies over the different objects as described by an inverse logistic function (also called link function) of a linear predictor (x_k^T β). Here, x_k contains all explanatory variables with respect to the kth company and β all model parameters (including the cutpoint).

p_k = exp(x_k^T β) / (1 + exp(x_k^T β))   (1)

The model parameters are estimated by the maximum-likelihood approach (see, e.g. Agresti, 1996). The maximum-likelihood analysis works by finding the value of β that returns the maximum value of the log-likelihood function. This value is labelled as β̂. The estimated value β̂ has important characteristics: it is asymptotically consistent and asymptotically normally distributed (see, e.g. Fahrmeir & Kaufmann, 1985, 1986; for more details and discussions see, e.g. Hosmer & Lemeshow, 2000).

As discussed above, banks are limited by small data samples in insolvency analysis regarding different sectors of borrowers. The asymptotic properties of the logistic regression might cause a bias in forecasting models. In some studies, the logistic regression showed a bias even in very simple models (one or two explanatory variables) when the sample size was small (see, e.g. Matin, 1994). Banks often use very complex models which have various numbers of explanatory variables (see, e.g. Hayden, 2002). The number of variables is usually reduced by applying a selection algorithm that eliminates insignificant variables. Nevertheless, the sample size usually remains small in relation to model complexity. As a result, a bias in predictions may result. This influences the entire risk management and decision process.

In order to eliminate such an inaccuracy in logistic regression facing such data problems, we propose to integrate the bootstrap method in the forecasting process to stabilise the predictions. The bootstrap method is developed to generate an empirical distribution of a variable in those cases in which proof of an analytical distribution is impossible (see in particular Efron & Tibshirani, 1993). We use this method to reduce the variability of predictions with the logistic regression. In many studies, the bootstrap is used to quantify, e.g. a standard error or confidence interval. The bootstrap methodology is based on the following idea: a variation of a given data set allows an estimation of an empirical distribution of a random variable. Fig. 1 shows the bootstrap process.

[Fig. 1. Bootstrap methodology. Diagram: phase 1, data management (original dataset, dataset for prediction); phase 2, random selection of subsets and prediction of PD over 200 runs; phase 3, empirical distribution and empirical expected value.]

The first phase includes all steps necessary to prepare the data set for the analysis. These steps include the reduction of multicollinearity and standardisation. The second phase includes the generation of the empirical distribution function. We use the bootstrap method to generate an empirical distribution of the probabilities of default (p̂_k) as follows. The design matrix X includes all design vectors x_k (with k = 1,…,N). These design vectors x_k might represent real data (e.g. accounting information in an observed sample of real data) or artificial data. To generate an empirical distribution of the forecasted probabilities of default, a number of subsets of X is needed. We take a random selection out of all N design vectors. In every selection, we choose a sample size of N. It is possible that a company appears more than once in a subset or might not be selected at all. We generate 200 subsets and estimate for any subset the probability of default for all companies. Finally, an empirical distribution of probabilities of default for every company results. We use the mean of this empirical distribution for each company as prediction of the probability of default (see Fig. 1) instead of the classical PD estimated by the simple logistic regression model (see Eq. (1)). The variation of the data set might help to reduce the errors due to the small sample size problems in classical logistic regression. In order to investigate whether bootstrapping improves the classical logistic regression, we compare the bootstrap method of forecasting insolvencies and rating grades with the classical logistic regression model (without bootstrapping). We run

the Monte-Carlo simulation to evaluate this method as explained in Section 4.

[Fig. 2. Quality of rating methods in special and general cases. Diagram: quality of quantitative rating systems splits into quality of a method in special cases (assessment based on real data) and quality of a method in general cases (assessment based on Monte-Carlo simulation).]

4. Evaluation of quantitative rating systems

As outlined in the second section, the quality of quantitative rating systems is very important because the whole risk management depends on it. We distinguish two major questions to assess such systems: the meaning of quality in empirical studies (special cases) and the meaning of quality in simulation studies (general cases) (see Fig. 2).

Empirical studies mainly use quantities like hit rates (proportions of correctly classified companies) to compare different statistical methods for predicting bankruptcies (see e.g. Altman, 1968; Desai et al., 1996; Huang et al., 2004). We regard this as quality in a special case. This assessment of different methods is based on real data.

In the case of real data, the assessment of methods used to forecast rating grades is impossible, because the true rating grade is unknown and unobservable. Different methods cannot be compared in order to decide which method is the best concerning rating grades. It can be shown with an artificial data set that the best method to forecast bankruptcies is not necessarily the best method to forecast rating grades (see, e.g. Poddig & Oelerich, 2004). Banks cannot generalise the quality of bankruptcy predictions to the quality of rating grade forecasts.

Artificial data allow us to develop an evaluation system. Using such data we are able to integrate the structure of a given data set and to simulate the rating process. Such an evaluation system allows us to assess the methods of bankruptcy prediction and also the quality of rating grade predictions. Moreover, based on artificial data we are able to quantify the quality of different statistical methods. Here, all necessary information is known: information about companies, the model and the true rating grades. Based on this evaluation system, an assessment of different methods for forecasting rating grades is possible.

In order to assess the quality of a predictive method, one distinguishes between two errors. First, any statistical method includes a typical statistical error, which is called unsystematic error. The judgement of any method should not be based on this error, because a quality judgement should describe a structural weakness, which is caused by systematic errors. Concerning systematic errors, we discuss the popular logistic regression as an example for the influence of such errors in the following. Systematic errors result from the assumptions and properties of the specific method. The logistic regression describes the relation between explanatory (or independent) variables and probabilities of default (or, in the more general case, the probability of success). This model-based relation (described by a link function) might be incorrect in reality. Additionally, the maximum-likelihood estimation is (only) asymptotically consistent and asymptotically normally distributed. For finite sample sizes, this estimation could be biased. Also, the estimated vector of model parameters and the resulting estimation of probabilities of default are (only) asymptotically consistent and asymptotically normally distributed. Small sample sizes make it difficult (especially for banks) to get accurate forecasts. Tiny modifications of the data sets might change the estimated probabilities of default significantly.

Monte-Carlo simulations are a tool to eliminate unsystematic errors. The remaining systematic error can be used to quantify the quality of a method. The evaluation process of the logistic regression and the bootstrap method is shown in Fig. 3. It should be noted that the evaluation of the bootstrap method described above has to estimate 500,000 logistic regression models, due to the Monte-Carlo simulation to assess the method in general and the specific bootstrapping method.

[Fig. 3. Evaluation process of statistical methods. Diagram: phase 1, data management (generate artificial data); phase 2, Monte-Carlo simulation over 2500 runs (Bernoulli experiments; classical logistic regression and bootstrap method); phase 3, empirical distributions and empirical parameters for the evaluation of predictions.]

The evaluation process is based on artificial data. The data set is generated with the logit link known as the link function in the logistic regression model. The data set (table, matrix) includes the known qualitative and quantitative information about all companies. In a first step, we generate the explanatory variables. The sample size is denoted by N. A bold X describes the matrix which contains quantitative and qualitative information of all N companies. A row of X, x_k, represents all information of the kth company, where k is an index with k = 1,…,N. In empirical studies, these vectors x_k include information like financial ratios or qualitative information such as sector, legal form of a company or quality of management. This vector can be written as x_k = (x_k^Q, x_k^M), whereas x_k^Q contains codings for qualitative information and x_k^M includes quantitative information about the kth company.

Sometimes the information of x_k^M will be standardised before it is used in a model (see, e.g. Moody's Investors Service, 2001). The advantage of standardisation is to compare different information measured on different scales. A standardised variable has an expected value of zero and a variance of one. For this reason, we generate artificial data by using standard normal distributed random variables (x_k^M ~ N(0, V)). Here, V := cov(x_k^M) = I is assumed as the identity matrix. For reasons of simplification, we used uncorrelated data. In reality the data will not be perfectly uncorrelated. Therefore, orthogonality methods might be used for data preprocessing to minimise or remove the correlation. Despite these problems concerning real data, our framework fulfils the theoretical assumption of many statistical methods, e.g. the logistic regression model or the discriminant analysis. To generate data more similar to real world data, an empirical covariance matrix should be used instead of the identity matrix. In this case, the covariance matrix V has to be replaced by the empirical matrix V̂. Our framework works with both a diagonal and a non-diagonal covariance matrix. To simplify the simulation system, we focus on a diagonal covariance matrix, assuming that multicollinearity is removed by data preprocessing.

Qualitative information like legal form, sector, management quality and so on allows grouping companies. To simulate this kind of information, we define the number of qualitative factors, the corresponding number of levels and all resulting combinations. Then, we choose the number of companies for each combination. The data generating process including qualitative information is described in Oelerich and Poddig (2004). An artificial design vector might look like Eq. (2)

x_k = (1, 2, 0.06, 0.98, −0.876)   (2)

Here, the first element 1 describes the first level of the first qualitative factor noted as A. The second element 2 corresponds to the second level of the second factor (noted as B). This information has to be coded with dummy variables by applying reference or effect coding preprocessing (see, e.g. Hosmer & Lemeshow, 2000). In our simulation system, we use reference coding for qualitative variables, because this coding is easier to interpret than the effect coding (see Hosmer & Lemeshow, 2000). The vector further includes three additional variables representing quantitative information. To summarise, we generate quantitative information as standard normal distributed random variables and use qualitative information as grouping variables.

Let Y_k ~ B(p_k) be the dependent variable, where B(p_k) describes the Bernoulli distribution with p_k as probability for the event of bankruptcy. The logistic regression model is based on the model Eq. (3) (see, e.g. Hosmer & Lemeshow, 2000)

ln(p_k / (1 − p_k)) = α + x_k^T β.   (3)

Given vector x_k and a vector β (and the cutpoint α), we calculate the probability p_k = (1 + exp(−α − x_k^T β))^(−1) for each company. A realisation of Y_k can be simulated by a Bernoulli experiment based on these probabilities (p_k). Therefore, we only need a given cutpoint parameter (α) and a vector (β) of model parameters to define an artificial data generating process. We calculate for each company its probability of default using the parameters discussed above (β) (see also Table A1) and the design vectors x_k. These probabilities are the inputs to generate Bernoulli random variables. Thereby, an artificial data sample corresponding to a given equation results. Based on this artificial data, we know all information about every company (the table X), the model equation (α and vector β), and the true probabilities of default p_k. In order to investigate the small sample size properties of different statistical methods with respect to forecasting rating grades, different in-sample and out-of-sample sizes are generated.

To generate the true rating grade for any company, we orientate on the scale in Rolfes and Emse (2000). We divide the interval from 0 to 1 into nine classes (see Table 1). This rating scale is based on nine rating grades, whereas the grades 1–8 symbolise solvent grades and the grade 9 describes insolvent ratings. Banks might use other definitions of rating grades and/or add more rating classes. Our focus is not to look for a perfect definition of rating grades, but a system to evaluate statistical methods. Typically, rating agencies use up to 21 rating grades. Such complex splittings can also be integrated into our framework, but they are not regarded here in detail.

The evaluation system is based on four indexes. First, we define an index to quantify the quality of the probability estimation. We call this index upsilon (Υ). The upsilon index is based on the idea of mean squared errors defined by:

MSE_p^N = (1/N) Σ_{k=1}^{N} (p̂_k − p_k)²   (4)

Here, p_k describes the true probability of default of the kth company and p̂_k the corresponding prediction (by any kind of model). The index is equal to zero if all probabilities of default are forecasted correctly. The upper limit is 1 (0 ≤ MSE_p^N ≤ 1). The well known hit rate (proportion of correct classification) is interpreted in a similar, but reversed manner. So we use the

Table 1
Probabilities of default and corresponding rating grades

Grades  Lower limit  Upper limit
1       0.00000      0.00025
2       0.00025      0.00055
3       0.00055      0.00115
4       0.00115      0.00405
5       0.00405      0.01335
6       0.01335      0.07705
7       0.07705      0.16995
8       0.16995      0.20000
9       0.20000      1.00000

A similar definition can be found in Rolfes and Emse (2000).
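The artificial data generating process described above (draw quantitative factors x_k ~ N(0, I), compute the true PD from the logit link of Eq. (3), run a Bernoulli experiment for the solvency state, and map the true PD to a grade via Table 1) can be sketched as follows. This is a minimal illustration in Python; the cutpoint α and parameter vector β used here are made-up example values, not the Table A1 parameters of the paper.

```python
import math
import random

random.seed(42)

# Upper interval limits from Table 1: PD interval -> rating grade 1..9
# (grades 1-8 are solvent grades, grade 9 is the insolvent grade)
GRADE_UPPER = [0.00025, 0.00055, 0.00115, 0.00405, 0.01335,
               0.07705, 0.16995, 0.20000, 1.00000]

def grade(pd: float) -> int:
    """Map a true probability of default to a rating grade via Table 1."""
    for g, upper in enumerate(GRADE_UPPER, start=1):
        if pd <= upper:
            return g
    return 9

def simulate_company(alpha, beta):
    """Draw one artificial company: quantitative factors x ~ N(0, I),
    true PD from the inverse logit link, solvency state from a
    Bernoulli experiment based on that PD."""
    x = [random.gauss(0.0, 1.0) for _ in beta]
    eta = alpha + sum(b * xi for b, xi in zip(beta, x))
    pd = 1.0 / (1.0 + math.exp(-eta))           # Eq. (1): inverse logit
    default = 1 if random.random() < pd else 0  # Bernoulli experiment
    return x, pd, default

# Illustrative parameters (hypothetical, not taken from Table A1)
alpha, beta = -2.0, [0.8, -0.5, 0.3]
sample = [simulate_company(alpha, beta) for _ in range(500)]
rate = sum(d for _, _, d in sample) / len(sample)
grades = [grade(pd) for _, pd, _ in sample]
```

Because the true PD and the true grade are known for every simulated company, forecasts of any statistical method can be scored directly against them, which is exactly what the evaluation indexes of the next section do.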

Table 2
Overview of evaluation indexes for rating systems

Index  Description
Υ_p    Quantifies the quality of probability forecasts
Υ_D    Quantifies the quality of binary classification (hit rate)
Υ_P    Quantifies the quality of rating grade predictions (hit rate)
Υ_R    Quantifies the quality of rating grade predictions (distances)

normed index

Υ_p := 1 − MSE_p^N,   (5)

to get the same interpretation: the Υ_p-index in Eq. (5) becomes one if all forecasts are correct.

The logistic regression has been applied in many empirical studies to forecast insolvencies (e.g. Huang et al., 2004). The estimated probabilities are used to divide all companies into two groups/classes. Defining a cutpoint parameter (e.g. c = 0.50), every company is assigned to one class (insolvent or solvent). This binary classification is less sensitive to the variability of the probability estimation by the statistical model than the estimation of the probability itself.

The Bernoulli experiment gives the true class (insolvent, solvent) in every run of the simulation. Thus, the evaluation of binary classifications is possible. In the binary case, this index is equal to the proportion of correctly classified companies. We denote the class of solvent companies by zero and the class of insolvent companies by one. The probability p_k of the Bernoulli distribution for the event one is the true probability of default. The symbol D_k expresses the true class of the kth company (that means D_k ∈ {0, 1}) and D̂_k the corresponding estimation. The index

Υ_D = 1 − (1/N) Σ_{k=1}^{N} (D̂_k − D_k)²   (6)

describes the quality of forecasts with respect to a binary classification. The squared difference (D̂_k − D_k)² in Eq. (6) of the distance between the true and the estimated class is similar to the case of probabilities discussed above. In the binary case, this index is identical to the classical hit rate (proportion of correctly classified companies), because the distance equals one if the classification is wrong and zero if the classification is correct.

Rating grades are more difficult to evaluate than a binary classification in real world applications because the true rating grades are unknown and unobservable. To evaluate a statistical rating system using artificial data, we define a similar index to quantify the predictive power of a method as usually used for binary classifications. We calculate the proportion of correctly classified companies as discussed for the binary case. Due to artificial data, the true rating grade is known. Table 1 shows how the probability of default is mapped to rating grades in our simulations. The Υ_P-index is defined as

Υ_P = 1 − (1/N) Σ_{k=1}^{N} P_k,   (7)

with P_k being zero if the estimated rating grade is equal to the true rating grade and one otherwise. The sum counts the number of incorrect classifications, so that Υ_P corresponds to the proportion of correctly rated companies.

The distance between the true and the estimated rating grade is also important. For example, if a company had a true grade of BB, a forecasted rating grade of AA or CCC would be a severe misclassification. It influences the entire risk management process, including especially the price and credit conditions. To include the distance of the forecasted rating grade to the true rating grade, we define an index based on distances. The true rating of the kth company is denoted by R_k and the corresponding estimate by R̂_k. The normed sum of squared differences is defined as

Υ_R = 1 − (1/N) Σ_{k=1}^{N} ((R̂_k − R_k)/r)²,   (8)

with r + 1 equal to the number of rating grades. In our framework r equals eight. The index Υ_R is one if all ratings are correct. It becomes zero if the absolute value of the difference R̂_k − R_k is for every company equal to r. The division by r is important to formulate a normed index (Table 2).

5. Results

Before presenting the results obtained, we describe three data generating models used in our simulation studies. Table A1 shows the parameters of the different data generating models used in the simulation. A qualitative (main) factor is denoted by A, B, C, a quantitative factor by M1 to M10. The parameters β1 to β5 correspond to the five levels of factor A in model0. Sectors or branches, legal forms or ratings for management quality are typical examples for such factors. Interaction terms are denoted by the corresponding main factors, e.g. AB denotes the interaction between the main factors A and B. For simplification, no interaction between qualitative factors is assumed, so all parameters η_ij^AB are set to zero. In model0, the quantitative factors M1 to M5 have different effect sizes, M6 to M10 are assumed to have no effect. These artificial parameter settings have the same important characteristics as real data, e.g. some factors (e.g. AB, M4 and M5) have no effect and others have different effect sizes. The symbol '***' means that this factor (e.g. C, M5) is not used in the specific data generating model, e.g. model0 does not contain a qualitative factor C.

In order to investigate the small sample size properties of different statistical methods to forecast rating grades, different sample sizes are generated. Table 3 shows these sample sizes for all models. The proportion of solvent and insolvent companies in an artificial data set depends on the specific

Table 3
Sample sizes for all data generating models used

Data generating model   In sample   Out of sample    Sum
Model0                        125             125     250
                              250             250     500
                              500             500    1000
Model1                        225             225     450
                              450             450     900
                              900             900    1800
Model2                        270             270     540
                              675             675    1350
                             1350            1350    2700

model and the Bernoulli experiments. The observed proportion in our simulation studies is approximately 0.2.

A preliminary simulation study considers only two predictive models. We denote the logistic regression using the reference coding for qualitative factors and no selection algorithm as VM. The model resulting from a forward selection algorithm based on the Wald test and the reference coding for qualitative factors is denoted by WM. In this preliminary study, we do not discuss the bootstrapping method, which will be applied later on.

As an example, the distributions of two indexes (given data generating model1 with the smallest sample size) are shown in Fig. 4. The logistic regression model exhibits some undesirable properties for small sample sizes. The figure also shows that the rating grade predictions have a lower average and a higher variance in this example.

Tables A2 and A3 show important characteristics of the empirical distribution of the indexes, simulating all data generating models with different sample sizes and forecasting with the VM and the WM prediction models.

The Yp-index illustrates the general quality of logistic regression forecasts of probabilities. The predictive models (VM and WM) seem to be powerful. The average index is often approximately 0.99. The reason this index lies so close to one is the use of squared differences of probabilities. More important is the evaluation of the YD index for binary classification based on these probabilities (see Table A2). Fig. 4 shows the empirical histogram of the YD-index for data generating model1. In the literature, a hit rate of 75% is often regarded as a good result (see, e.g. Leker & Schewe, 1998). Our simulation studies show that such rates are not likely to be observed on average. The mean of correct classifications in our simulation studies is lower than 75% in most cases. In real-world studies, an analyst also faces further problems, such as multicollinearity or outliers. Additionally, the assumed link function might be incorrect (see the discussion about systematic errors). The decision whether a method is good or not depends on the sample size and the model complexity. We used quite simple models. In some empirical studies, the complexity is much higher, with occasionally more than thirty types of information for one company (see, e.g. Hayden, 2002). In these cases, it is really doubtful that a hit rate of 75% results on average in repeated applications of the same model development procedure.

The YP index shows the hit rates for rating grades. These hit rates are smaller than for binary classifications and the variance is higher. For example, using data generating model0 (with 125 companies), the average binary classification hit rate is nearly 75%, whereas the hit rate for rating grades is nearly 36%. This is only half the predictive power of the binary case. The variance is more than three times higher. With increasing

Fig. 4. Empirical distribution of the YD-index (left side) and the YP-index (right side) for data generating model1 with the smallest sample size of 225 companies. The forecasting model is a logistic regression using a forward selection algorithm based on the Wald test and the reference coding for qualitative factors. This figure shows the hit rates (x-axis) and their corresponding relative frequency in percent (y-axis).
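The Wald statistic that drives the WM's forward selection can be obtained from a standard Newton-Raphson (IRLS) fit of the logistic model. The following is a generic numpy sketch under our own toy setup, not the authors' implementation:

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Newton-Raphson (IRLS) fit of a logistic regression.
    Returns the coefficients and their estimated covariance matrix."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = p * (1.0 - p)                       # IRLS weights
        H = X.T @ (W[:, None] * X)              # information matrix
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    return beta, np.linalg.inv(H)

def wald_statistics(beta, cov):
    """Wald chi-square statistic per coefficient: (beta_j / se_j)^2."""
    return beta ** 2 / np.diag(cov)

# toy data: only the first (non-intercept) regressor has an effect
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X[:, 1]))).astype(float)

beta, cov = fit_logistic(X, y)
w = wald_statistics(beta, cov)  # a forward step adds the candidate factor
                                # with the largest significant statistic
```

A forward selection loop refits the model after each accepted candidate and stops when no remaining Wald statistic exceeds the chosen threshold; for a reference-coded qualitative factor, the single-coefficient statistic generalises to a quadratic form over all of that factor's coefficients.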

sample size, the quality of rating grade forecasts becomes better. If a bank uses 21 rating classes, the quality will likely be lower than the results presented in this paper, because we use only nine classes. In many empirical studies the sample sizes correspond to the sample sizes used in our simulation studies. We do not regard extremely small or extremely large sample sizes (see Table A3).

To quantify the difference between predicted rating grades and true rating grades, we propose the YR index (see Table A3). Here, we observe results similar to those for the YP index. For the smallest sample sizes, the variability of the VM is extremely large. Using model1 and model2 as data generating processes, the variance is nearly ten times higher than that observed when applying the WM.

These results show that the variability of forecasts differs for different techniques within the same statistical model. Our simulation studies show the advantage of forecasting rating grades using a selection algorithm. Nevertheless, for small sample sizes the WM is biased. All in all, even the WM seems to provide unreliable forecasts when the sample size is small. The bootstrap method might result in improved forecasted probabilities, because its integration into the WM should reduce the variability of forecasts. This should lead to a more reliable predictive model. We discuss this for data generating model1 in detail.

In the second simulation study, we apply the data generating model1 with the smallest (225 companies) and the largest (900 companies) sample size due to the extremely time consuming simulation runs. We find two important results in this second study. First, the bootstrap method has the highest average in all predictive models (in our studies) in general. The forecast using the simple WM model is based on asymptotically consistent estimators (see also Hocke, 1974; Matin, 1994), but it seems to be less accurate than the bootstrap method when the sample size is small. The bootstrap method seemingly stabilises the predictions of probabilities. Note that the sample sizes applied in our studies are similar to those used in empirical studies (e.g. Anders, 1998; Huang et al., 2004; Leker & Schewe, 1998).

Second, the variance is an important property of forecasting models in real-world applications. The bootstrap method has the smallest variances in all cases (see Table A4). For ratings generated by statistical methods, this robustness is very important, because ratings should be conservative and stable with respect to time (see Bank for International Settlement, 2003). Unreliable information in the data base might have a significant influence on the forecast. For example, changes in the accounting data of one company could influence the model equation in such a way that another company gets a different estimate of its probability of default and is thereby assigned another rating grade. The bootstrap method might reduce this variation in rating grade forecasts. Note that the integration of the bootstrap is easy for most statistical methods (e.g. logistic regression) and can be realised with many statistical software packages (e.g. SAS or SPSS).

To summarise the major results, the bootstrap method shows the highest average of all four indexes in our simulations in most cases. The variability of the forecasts differs extremely: the variance of the VM is up to fifteen times higher than the variance obtained by the bootstrap method. The main advantage of the bootstrap method is its robustness; the results seem to be more reliable. However, with increasing sample size the classical methods become more powerful. For the largest sample size used, the variance of the bootstrap method is only approximately two times lower than that of the classical methods (for the YP and YR indexes).

Such simulation studies show the power of different methods or the improvements of modifications to traditional statistical methods. For example, the comparison of classical logistic regression with the integration of bootstrapping suggests the use of resampling approaches.

6. Conclusion

In this paper we discuss the possibility of assessing the quality of statistical methods to forecast insolvencies and rating grades. Whereas an evaluation of statistical methods predicting insolvencies does not pose difficult problems, it is impossible for rating grades, because they are unknown and unobservable. To solve this problem, we propose a simulation system using artificial data, wherein all necessary information is known: the information about companies, the probability of default, the insolvency coding and the rating grade. Based on such data we are able to evaluate rating processes which are based on statistical methods. We apply this simulation system to several logistic regression models and the bootstrap method for one specific logistic regression model. There are two major results. First, a prototype of an evaluation system to quantify the quality of rating grade forecasting models could be studied. Second, we find that the integration of the bootstrap results in more confident forecasts. We call such a rating process 'robust', because it reduces the variability of the predictions of insolvencies and rating grades.

In order to reflect real-world data more realistically, it is recommended to use an empirical covariance matrix in the data generating process when building up the design vectors (the matrix X). This empirical covariance matrix could reflect interdependencies between real companies in the artificial data. The simulation system allows one to assess different rating models and to identify optimal rating models with regard to a given real data sample.

Appendix A

See Tables A1–A4.

Table A1
Parameters in three different data generating models

Factor     Parameter   model0   model1   model2
A          b1           1.50     0.75     1.00
           b2           0.75     0.00     0.00
           b3           0.00    -0.75    -1.00
           b4          -0.75     ***      ***
           b5          -1.50     ***      ***
B          g1          -0.50    -0.25    -0.75
           g2          -0.25     0.00     0.00
           g3           0.00     0.25     0.75
           g4           0.25     ***      ***
           g5           0.50     ***      ***
C          d1           ***      ***      0.00
           d2           ***      ***      0.00
           d3           ***      ***      0.00
           d4           ***      ***      0.00
           d5           ***      ***      0.00
A x B      h_ij         0.00     0.00     0.00
A x C      h_ik         ***      ***      0.00
B x C      h_jk         ***      ***      0.00
A x B x C  h_ijk        ***      ***      0.00
M1         m1          -0.40    -0.75    -1.00
M2         m2          -0.75    -0.40    -0.75
M3         m3          -0.50     0.25    -0.25
M4         m4           0.25     0.00     0.00
M5         m5           0.75     0.00     0.00
M6         m6           0.00     ***      ***
M7         m7           0.00     ***      ***
M8         m8           0.00     ***      ***
M9         m9           0.00     ***      ***
M10        m10          0.00     ***      ***
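Read this way, an artificial company under model0 can be generated as in the following sketch. The logit link, zero intercept, uniformly drawn factor levels and standard normal quantitative factors are our assumptions, not stated in the table; the paper's reported default proportion of about 0.2 indicates that the authors' full specification differs (e.g. through an intercept):

```python
import numpy as np

# model0 parameters as read from Table A1
B = np.array([1.50, 0.75, 0.00, -0.75, -1.50])   # levels of qualitative factor A
G = np.array([-0.50, -0.25, 0.00, 0.25, 0.50])   # levels of qualitative factor B
M = np.array([-0.40, -0.75, -0.50, 0.25, 0.75,   # M1..M5: differing effect sizes
              0.00, 0.00, 0.00, 0.00, 0.00])     # M6..M10: no effect

def simulate_model0(n, rng):
    """Generate n artificial companies under model0 (logit link assumed)."""
    a = rng.integers(0, 5, size=n)               # level of factor A (uniform, assumed)
    b = rng.integers(0, 5, size=n)               # level of factor B
    m = rng.normal(size=(n, 10))                 # quantitative factors M1..M10
    eta = B[a] + G[b] + m @ M                    # no interaction terms (h_ij = 0)
    pd_true = 1.0 / (1.0 + np.exp(-eta))         # true probability of default
    default = (rng.random(n) < pd_true).astype(int)  # Bernoulli experiment
    return pd_true, default

pd_true, default = simulate_model0(250, np.random.default_rng(42))
```

The true probability of default, and hence the true rating grade, is known by construction; this is exactly what makes the evaluation of grade forecasts possible.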

Table A2
Empirical distribution of the Yp- and YD-index for the VM and the WM models. The YD-index is given in percent. All indexes are based on the data generating models shown in Table A1 in the out-of-sample test

Index   Data generating model   Sample size   Prediction model   Minimum   Average   Maximum   Variance
Yp Model0 125 VM 0.7037797 0.8821377 0.9597116 0.0017986
WM 0.7745611 0.9349743 0.9857128 0.0006322
250 VM 0.8805836 0.9312435 0.9665835 0.0001741
WM 0.9020020 0.9463119 0.9846160 0.0001388
500 VM 0.9710683 0.9865392 0.9955058 0.0000126
WM 0.9694741 0.9921865 0.9983600 0.0000117
Model1 225 VM 0.9634257 0.9895332 0.9979800 0.0000190
WM 0.9708070 0.9908616 0.9983320 0.0000178
450 VM 0.9846521 0.9952083 0.9992547 0.0000037
WM 0.9814356 0.9959866 0.9991158 0.0000040
900 VM 0.9941839 0.9976368 0.9996633 0.0000008
WM 0.9929668 0.9980054 0.9997606 0.0000009
Model2 270 VM 0.9459336 0.9793925 0.9932119 0.0000338
WM 0.9702192 0.9917139 0.9990288 0.0000221
675 VM 0.9807539 0.9913713 0.9972507 0.0000052
WM 0.9862482 0.9966762 0.9996188 0.0000029
1350 VM 0.9902074 0.9956640 0.9983464 0.0000012
WM 0.9936707 0.9984178 0.9998965 0.0000008
YD Model0 125 VM 57.60 73.49 85.60 18.02
WM 56.80 75.26 87.20 22.43

250 VM 60.80 70.68 82.00 8.52
WM 57.20 69.23 79.20 8.87
500 VM 65.60 73.31 79.80 3.82
WM 67.00 73.65 79.40 3.56
Model1 225 VM 53.78 71.74 83.11 17.55
WM 43.56 71.00 83.56 20.01
450 VM 55.78 69.81 78.89 10.14
WM 54.44 69.63 80.00 12.49
900 VM 56.22 66.80 75.33 7.42
WM 56.78 66.70 75.89 8.56
Model2 270 VM 58.52 72.14 82.22 11.86
WM 58.15 71.71 81.85 12.61
675 VM 67.56 74.39 80.59 3.56
WM 67.41 74.83 80.44 3.25
1350 VM 69.11 73.54 77.11 1.56
WM 69.70 73.70 77.19 1.54

Table A3
Empirical distribution of the YP- and YR-index for the VM and the WM models

Index   Data generating model   Sample size   Prediction model   Minimum   Average   Maximum   Variance
YP Model0 125 VM 7.20 21.89 40.80 29.16
WM 8.80 36.77 64.00 70.33
250 VM 18.00 30.43 45.20 18.98
WM 21.60 39.91 62.40 26.95
500 VM 30.40 54.65 75.40 31.12
WM 39.80 66.16 84.20 43.06
Model1 225 VM 20.44 52.63 80.44 74.10
WM 22.67 55.89 82.67 91.61
450 VM 39.11 64.41 84.00 45.90
WM 42.67 68.47 82.44 50.64
900 VM 56.33 74.31 90.00 25.17
WM 57.33 76.94 93.11 27.43
Model2 270 VM 21.11 44.46 64.44 42.10
WM 28.15 62.96 85.96 94.83
675 VM 42.37 61.77 79.41 24.91
WM 45.78 75.59 91.56 39.44
1350 VM 54.00 72.17 84.07 16.08
WM 62.44 83.00 95.93 23.16
YR Model0 125 VM 0.4851250 0.7215956 0.9056250 0.0058781
WM 0.5031250 0.9222988 0.9871250 0.0035467
250 VM 0.6919375 0.8400099 0.9497500 0.0013861
WM 0.7370000 0.9532424 0.9868125 0.0008482
500 VM 0.8221563 0.9390746 0.9918750 0.0006739
WM 0.8231250 0.9877972 0.9967813 0.0002389
Model1 225 VM 0.7623611 0.9570286 0.9961111 0.0013208
WM 0.8185417 0.9841445 0.9963194 0.0001420
450 VM 0.8825000 0.9870136 0.9969792 0.0001924
WM 0.8839931 0.9916225 0.9968056 0.0000192
900 VM 0.9448090 0.9941055 0.9983333 0.0000044
WM 0.9456076 0.9948134 0.9989236 0.0000037
Model2 270 VM 0.7423032 0.8929976 0.9832755 0.0010667
WM 0.8255787 0.9870792 0.9972222 0.0001299
675 VM 0.8986806 0.9722261 0.9953704 0.0002371
WM 0.9413426 0.9943625 0.9986574 0.0000098
1350 VM 0.9328935 0.9892359 0.9965509 0.0000614
WM 0.9501042 0.9965378 0.9993634 0.0000032

The YP-index is given in percent. All indexes are based on the data generating models shown in Table A1 in the out-of-sample test.

Table A4
Empirical distribution of all indexes for three different statistical models

Index   Sample size   Statistical model   Minimum   Mean   Maximum   Variance
Yp 225 VM 0.9634257 0.9895332 0.9979800 0.0000190
WM 0.9708070 0.9908616 0.9983320 0.0000178
BM 0.9757352 0.9932666 0.9992551 0.0000103
900 VM 0.9941839 0.9976368 0.9996633 0.0000008
WM 0.9929668 0.9980054 0.9997606 0.0000009
BM 0.9945386 0.9982839 0.9998238 0.0000006
YD 225 VM 53.78 71.75 83.11 17.55
WM 43.56 71.00 83.56 20.01
BM 49.33 71.07 84.44 18.50
900 VM 56.22 66.80 75.33 7.42
WM 56.78 66.70 75.89 8.56
BM 56.00 66.65 75.56 8.42
YP 225 VM 20.44 52.63 80.44 74.10
WM 22.67 55.89 82.67 91.61
BM 28.00 60.73 86.22 77.27
900 VM 56.33 74.31 90.00 25.17
WM 57.33 76.94 93.11 27.43
BM 60.78 78.28 93.89 24.17
YR 225 VM 0.7623611 0.9570286 0.9961111 0.0013208
WM 0.8185417 0.9841445 0.9963194 0.0001420
BM 0.8271528 0.9877336 0.9974306 0.0000878
900 VM 0.9448090 0.9941055 0.9983333 0.0000044
WM 0.9456076 0.9948134 0.9989236 0.0000037
BM 0.9895486 0.9952950 0.9990451 0.0000020

VM is the logistic regression with all factors (without a selection algorithm). WM means the logistic regression using a forward selection algorithm based on the
Wald-test and the reference coding for qualitative factors. BM is the WM with bootstrapping. All indexes are based on data generating model1 in Table A1.

References

Agresti, A. (1996). An introduction to categorical data analysis. New York: Wiley.
Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23(4), 589–609.
Anders, U. (1998). Prognose von Insolvenzwahrscheinlichkeiten mit Hilfe logistischer neuronaler Netzwerke. Zeitschrift für betriebswirtschaftliche Forschung, 50, 892–915.
Bank for International Settlement. (2003). Consultative document: The new Basel capital accord. Zürich.
Carey, M., & Hrycay, M. (2001). Parameterizing credit risk models with rating data. Journal of Banking & Finance, 25, 197–270.
Desai, V. S., Crook, J. N., & Overstreet, A. (1996). A comparison of neural networks and linear scoring models in the credit union environment. European Journal of Operational Research, 95, 24–37.
Deutsche Bundesbank. (1999). Zur Bonitätsbeurteilung von Wirtschaftsunternehmen durch die Deutsche Bundesbank. Deutsche Bundesbank Monatsbericht 1999 (pp. 51–64).
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. London: Chapman & Hall.
Fahrmeir, L., & Kaufmann, H. (1985). Consistency and asymptotic normality of the maximum likelihood estimators in generalized linear models. The Annals of Statistics, 13(1), 342–368.
Fahrmeir, L., & Kaufmann, H. (1986). Correction: Consistency and asymptotic normality of the maximum likelihood estimators in generalized linear models. The Annals of Statistics, 14(4), 1643.
Foreman, R. D. (2003). A logistic analysis of bankruptcy within the US local telecommunication industry. Journal of Economics and Business, 55, 135–166.
Frerichs, H., & Wahrenburg, W. (2003). Evaluating internal rating systems depending on bank size. Working Paper 115. Frankfurt: Universität Frankfurt.
Hayden, E. (2002). Modelling an accounting-based rating system for Austrian firms. Dissertation. Fakultät für Wirtschaftswissenschaften und Informatik der Universität Wien.
Hocke, J. (1974). Der Einfluss der Multikollinearität auf die Kleinstichprobeneigenschaften diverser ökonometrischer Schätzmethoden - Eine Monte-Carlo-Studie. Dissertation. Universität München.
Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression (2nd ed.). New York: Wiley.
Huang, Z., Chen, H., Hsu, C., Chen, W., & Wu, S. (2004). Credit rating analysis with support vector machines and neural networks: A market comparative study. Decision Support Systems, 37, 543–558.
Leker, J., & Schewe, G. (1998). Beurteilung des Kreditausfallrisikos im Firmenkundengeschäft der Banken. Zeitschrift für betriebswirtschaftliche Forschung, 50, 877–891.
Matin, M. A. (1994). Small-sample properties of different tests and estimators of the parameters in the logistic regression model. Research Report 4. Uppsala, Sweden: Uppsala Universitet.
Moody's Investors Service. (2001). Moody's RiskCalc(TM) für nicht börsennotierte Unternehmen: Das deutsche Modell.
Oelerich, A., & Poddig, T. (2004). Modified Wald statistics for generalized linear models. Allgemeines Statistisches Archiv, 1, 23–34.
Poddig, T., & Oelerich, A. (2004). Evaluierung quantitativer Ratingverfahren. In D. Bayer & C. Ortseifen (Eds.), SAS in Hochschule und Wirtschaft (pp. 195–212). Aachen: Shaker Verlag.
Rolfes, B., & Emse, C. (2000). Rating basierte Ansätze zur Bemessung der Eigenkapitalunterlegung von Kreditrisiken. ecfs-Forschungsbericht, Vol. 3.
Srinivasan, V., & Kim, H. (1987). Credit granting: A comparative analysis of classification procedures. Journal of Banking and Finance, XLII(3), 665–683.