
Calculating Probability of Default (PD) Using Logistic Regression and Python

Logistic Regression (LOGIT) is used to model the relationship between a binary (qualitative) response
variable y and one or more weighted explanatory variables x. We denote the weight coefficients
as b. If we consider default as our response variable, then it has only two values: Default or Not
Default. We use a scoring relationship that combines the different pieces of information into an
assessment of default probability:

Score_i = b_1 x_i1 + b_2 x_i2 + ... + b_k x_ik    (the score of observation i)

Consider N companies, each having one of the two possible response values: Default or Not
Default. This can be written as a 1 (one) or a 0 (zero) for each company. Assuming the
companies are independent of each other, meaning that one company's default does not
influence whether another will default, we have N independent observations. For each firm
(company), a bank would collect information on several explanatory variables x for default
prediction (their number denoted here by k). Five common explanatory variables x are (k = 5):
1. Working Capital (WC)
2. Retained Earnings (RE)
3. Earnings before interest and taxes (EBIT)
4. Sales (S)
5. Market Value of Equity (ME).

Except for Market Value (ME), all these items are found in the balance sheet and income
statement of the firm. The Market Value (ME) is given by the number of shares outstanding
multiplied by the stock price. It is customary to divide the first four variables, WC, RE, EBIT and
S, by Total Assets (TA), and the fifth variable, ME, by Total Liabilities (TL). This scales the five
explanatory variables x to ratios of comparable magnitude (typically of order one).
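As a small illustration, the scaling step can be sketched in Python; the balance-sheet numbers below are hypothetical and chosen only for the example:

```python
# Hypothetical balance-sheet and income-statement items for one firm
firm = {"WC": 20.0, "RE": 35.0, "EBIT": 12.0, "S": 80.0,
        "ME": 150.0, "TA": 200.0, "TL": 120.0}

# The five scaled explanatory variables x for this firm
x = [firm["WC"] / firm["TA"],    # WC / TA
     firm["RE"] / firm["TA"],    # RE / TA
     firm["EBIT"] / firm["TA"],  # EBIT / TA
     firm["S"] / firm["TA"],     # S / TA
     firm["ME"] / firm["TL"]]    # ME / TL
print(x)
```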
Based on annual data from companies that have either defaulted or not defaulted, using in
total N = 1000 observations, we find the combination of weight coefficients b for the
explanatory variables x that best explains the observed default behavior. A company can appear
more than once if information on it is available for more than one year. However, once a
company has defaulted it tends to stay in default for several years, so no observations are used
after the year in which the company defaulted. The table shows some of the company data used
in this model. The data set is hypothetical, generated with Python, and is used only to illustrate
the model.
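Since the text mentions that the data set was generated with Python, a minimal sketch of how such a hypothetical data set might be produced is shown below. All parameter values (seed, weights, distributions) are assumptions for illustration, not the ones used for the article's data:

```python
import numpy as np

rng = np.random.default_rng(0)                    # fixed seed: reproducible toy data
N, k = 1000, 5                                    # observations and explanatory variables
X = rng.uniform(0.0, 1.0, size=(N, k))            # hypothetical scaled ratios
b_true = np.array([-1.0, -2.0, -3.0, 0.5, -0.5])  # assumed weights, illustration only
score = X @ b_true                                # Score_i = b_1 x_i1 + ... + b_k x_ik
pd_true = 1.0 / (1.0 + np.exp(-score))            # logistic function of the score
y = rng.binomial(1, pd_true)                      # 1 = default, 0 = no default
print(X.shape, y.mean())                          # sample size and share of defaults
```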
The scoring relationship predicts a high default probability for those observations that defaulted
and a low default probability for those that did not. As such, scores are linked to default
probabilities through:

Prob(Default_i) = Λ(Score_i) = 1 / (1 + exp(-b^T x_i)), where Λ is the logistic distribution function.
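In Python, this mapping from score to default probability can be sketched as follows; the score values below are hypothetical:

```python
import math

def pd_from_score(score):
    # Logistic distribution function: maps any real-valued score into (0, 1)
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical score values, for illustration only
print(round(pd_from_score(0.0), 4))   # → 0.5
print(round(pd_from_score(2.0), 4))   # → 0.8808
print(round(pd_from_score(-2.0), 4))  # → 0.1192
```

Note the symmetry of the logistic function: scores of +s and -s map to probabilities that sum to one.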

One way of estimating the weights b is the maximum likelihood (ML) method. For this purpose,
we set up the likelihood function L_i for observation i:

L_i = Λ(b^T x_i)^(y_i) * (1 - Λ(b^T x_i))^(1 - y_i)

Assuming defaults are independent, the likelihood of a set of observations is the product of
the individual likelihoods, L = Π_i L_i. According to the ML estimation principle, the weights b are
chosen such that the likelihood of observing the given default behavior is maximized. For the
purpose of maximization, it is more convenient to work with the logarithm:

ln L = Σ_i [ y_i ln Λ(b^T x_i) + (1 - y_i) ln(1 - Λ(b^T x_i)) ]

Using the first and second derivatives of ln L and Newton's method, we find the weights b for
which ln L is maximized. All of this is implemented in Python, and an excerpt of the code is
shown below.

go = True                                           # iterate until convergence
while go:
    bsaved = b
    H = np.zeros([c, c])                            # Hessian of ln L
    g = np.zeros([c, 1])                            # gradient of ln L (a column vector)
    for i in range(0, r):
        xT = A[i, :].reshape(1, c)                  # i-th observation as a row vector
        x = xT.T                                    # ... and as a column vector
        Lambda = 1/(1 + math.exp(-float(b.T @ x)))  # logistic function value, a scalar
        yi = float(y[i])                            # default indicator of observation i
        g = (yi - Lambda)*x + g                     # accumulate the gradient
        H = -Lambda*(1 - Lambda)*(x @ xT) + H       # accumulate the Hessian
    b = b - LA.inv(H) @ g                           # Newton-Raphson update of b
    go = LA.norm(b - bsaved) > 1e-8                 # stop once b no longer changes
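The excerpt relies on variables (A, y, b, r, c) defined elsewhere in the program. A self-contained sketch of the same Newton iteration, written in vectorized form and run on a small synthetic data set (all data and parameter values are assumptions), might look like this:

```python
import numpy as np

rng = np.random.default_rng(1)
r, c = 500, 3                                    # observations and explanatory variables
A = rng.uniform(0.0, 1.0, size=(r, c))           # hypothetical scaled ratios
b_true = np.array([[2.0], [-3.0], [1.0]])        # assumed "true" weights
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(A @ b_true))))  # simulated defaults

b = np.zeros((c, 1))                             # starting values for Newton's method
for _ in range(25):
    Lam = 1.0 / (1.0 + np.exp(-(A @ b)))         # fitted PDs, one per observation
    g = A.T @ (y - Lam)                          # gradient of ln L
    H = -(A * (Lam * (1.0 - Lam))).T @ A         # Hessian of ln L
    step = np.linalg.solve(H, g)                 # Newton direction inv(H) g
    b = b - step
    if np.linalg.norm(step) < 1e-10:             # stop when b has converged
        break

print(b.ravel())                                 # estimates, close to b_true
```

Because the log-likelihood of the logit model is concave, Newton's method typically converges in a handful of iterations from a zero starting vector.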

The method yields for the five weights b:

For this particular data set, the mean y_mean for the default variable y is:

To check that the estimation leads to the desired result, we examine the default prediction of a
logit model with just a constant:

Prob(y=1) = Λ(b1) = y_mean

In our case, we indeed get a Λ(b1) comparable to y_mean:

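This check can be verified directly: for a constant-only logit, the ML estimate of the constant is b1 = ln(y_mean / (1 - y_mean)), so Λ(b1) reproduces y_mean exactly. A small sketch, where the default rate below is hypothetical:

```python
import math

y_mean = 0.1                              # hypothetical sample default rate
b1 = math.log(y_mean / (1.0 - y_mean))    # ML estimate of the constant (log-odds)
Lambda_b1 = 1.0 / (1.0 + math.exp(-b1))   # plug it back into the logistic function
print(round(b1, 4), round(Lambda_b1, 4))  # → -2.1972 0.1
```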
The model can be further refined, for example by accounting for variables that jointly explain
defaults even though they are insignificant individually, or by including a treatment of outliers.

To further test the validity of the model, we can perform a one-sample t-test on the estimated b
coefficients. We would like to know whether the estimated coefficients are statistically different
from zero; the null hypothesis is that the coefficients b of the variables x are zero. This can
easily be done in Python:

import scipy.stats

T = scipy.stats.ttest_1samp(b, 0.0)   # H0: the b coefficients are zero
print('The t-test for the b coefficients gives....')
print(T)

The PD model shown here is just one of several credit risk model classes that can be
implemented into Python.
