Business Statistics Project

Question 01.
(a) Describe the distribution of the PRSM scores using both graphics and descriptive
statistics.
(b) Does it appear reasonable to use the Empirical Rule to describe the variation of PRSM?
Briefly explain your reasoning.
(c) If you were to remedy any evident anomalies with this variable, how well would the
Empirical Rule work now?
Solution 01
Part a)
The PRSM score has been calculated using
PRSM Score = (2 * Amount paid in 6 months) / Total amount to be paid
Next, we check whether the scores are normally distributed. For this check we have used
the first 628 data records.
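As a minimal sketch of the formula above, the score can be computed in Python as follows. The function and argument names are illustrative and do not come from the original analysis, and the loan amounts are hypothetical. A loan that has repaid exactly half of the total after six months scores 1.

```python
def prsm_score(amount_paid_6_months: float, total_amount_due: float) -> float:
    """PRSM score = (2 * amount paid in 6 months) / total amount to be paid."""
    if total_amount_due <= 0:
        raise ValueError("total_amount_due must be positive")
    return 2 * amount_paid_6_months / total_amount_due


# Hypothetical loan: 5,000 repaid in six months out of 10,000 owed in total.
print(prsm_score(5_000, 10_000))  # 1.0
```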
Descriptive Method:
Mean of PRSM scores=1.932137
Standard deviation of PRSM scores=28.04
To check whether the data follow a normal distribution, we apply the Empirical Rule and
check whether about 68% of the values fall within one standard deviation of the mean, 95%
within two and 99.7% within three. Through R, we found that 99.84% of the values lie within
one standard deviation of the mean, far more than the 68% expected, which means that the
data are not normally distributed.
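The Empirical Rule check can be sketched in Python as below. The original analysis was done in R and its code is not shown, so the function name and the data here are assumptions: 627 made-up scores near 0.8 plus one extreme value. Note that 627/628 is 99.84%, so the figure reported above is what one would see if every record except one lay within one standard deviation of the mean.

```python
import statistics


def empirical_rule_coverage(values, max_k=3):
    """Return {k: share of values within k sample standard deviations of the mean}."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return {
        k: sum(abs(v - mean) <= k * sd for v in values) / len(values)
        for k in range(1, max_k + 1)
    }


# Hypothetical data: 627 scores near 0.8 plus one extreme value of 703.535.
scores = [0.6, 0.8, 1.0] * 209 + [703.535]
coverage = empirical_rule_coverage(scores)
for k, share in coverage.items():
    print(f"within {k} sd: {share:.2%}")
```

A single extreme value inflates the standard deviation so much that all the ordinary records sit inside the one-standard-deviation band.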
Graphical Method:
We plotted a density histogram with a normal curve overlaid as a check. Fig. 1 shows that
the data are highly skewed to the right and do not follow a normal distribution.

Figure 1: Normal plot


We also checked with a Q-Q plot, which likewise indicated an outlier (Fig. 2).

Figure 2: Quantile plot


(b) Does it appear reasonable to use the Empirical Rule to describe the variation of PRSM?
Briefly explain your reasoning.
No. As already indicated in part (a), 99.84% of the values fall within one standard
deviation of the mean rather than about 68%, so it is not advisable to use the Empirical Rule.
(c) If you were to remedy any evident anomalies with this variable, how well would the
Empirical Rule work now?

We noticed that one of the PRSM scores had an abnormally high value of 703.535, occurring at
row no. 527. We assumed that this was an error in recording the data and therefore removed
this record from the dataset.
After removing the particular record, we found that
Mean of PRSM scores=0.81
Standard deviation of PRSM scores=0.22
Descriptive method:
For the Empirical Rule, the following percentages of values were obtained:
Within one standard deviation: 69.69%
Within two standard deviations: 94.89%
Within three standard deviations: 99.68%
These are close to the expected 68%, 95% and 99.7%, so as per the Empirical Rule we can say
that the data approximately follow a normal distribution.
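The remedy described above (locate the anomalous record, drop it, recompute) can be sketched as follows. The data are synthetic, drawn from a normal distribution with the reported mean and standard deviation, so the printed numbers illustrate the effect and are not the actual results. Mapping row 527 to list index 526 assumes rows are numbered from 1.

```python
import random
import statistics

random.seed(1)  # fixed seed so the synthetic example is repeatable

# Synthetic stand-in for the loan data: 627 roughly normal scores plus one anomaly.
scores = [random.gauss(0.81, 0.22) for _ in range(627)]
scores.insert(526, 703.535)  # list index 526 corresponds to row 527

# Locate the largest score and drop that record.
outlier_index = max(range(len(scores)), key=scores.__getitem__)
cleaned = scores[:outlier_index] + scores[outlier_index + 1:]

for label, data in (("with anomaly", scores), ("anomaly removed", cleaned)):
    mean, sd = statistics.fmean(data), statistics.stdev(data)
    within_1_sd = sum(abs(v - mean) <= sd for v in data) / len(data)
    print(f"{label}: mean={mean:.2f} sd={sd:.2f} within 1 sd={within_1_sd:.2%}")
```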
Graphical Method:
After re-plotting the histogram and Q-Q plot, we found that the data nearly follow a normal
distribution (Fig. 3 and 4).

Fig. 3 Normal curve after removing anomaly

Fig. 4 Quantile plot after removing anomaly


Question 02
Management is concerned that current lending procedures produce loans that, on average,
have PRSM scores below the target 1. After correcting the anomaly noted in Question 1,
does a confidence interval indicate that management should indeed be concerned that average
PRSM scores are lower than desired (i.e., lower than 1)? Be sure to justify the use of a
confidence interval in this context.
Solution 02
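This is evaluated with a one-sample t-test and the corresponding 99% confidence interval for
the mean (99% is chosen because, in industry practice, credit risk modelling is done with a
99% CI).

The t-test result shows that the null hypothesis is rejected: the mean is less than the
target of 1. With a sample mean of 0.81 and a standard deviation of 0.22, the whole interval
lies well below 1, so management should indeed be concerned.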
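The interval can be approximated from the summary statistics reported in Solution 01. This is a sketch only: it uses the normal quantile in place of the t quantile (the two are nearly identical for a sample of several hundred), and n = 627 is an assumption based on 628 records minus the one removed.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(mean, sd, n, confidence=0.99):
    """Normal-approximation confidence interval for a mean from summary statistics."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin


# Summary values reported after removing the anomaly; n = 627 is an assumption
# (628 records minus the one removed).
lower, upper = mean_confidence_interval(mean=0.81, sd=0.22, n=627)
print(f"99% CI: ({lower:.3f}, {upper:.3f})")
print("entire interval below target 1:", upper < 1)
```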
Question 03:
Control charts can be used to measure the stability of many types of data,
including the performance of loans. Assume that the loans in your data table are
arranged in chronological order, starting from the first row through row 628.
Generate a control chart of the PRSM scores for your sample, completing the JMP
dialog as shown below. These choices set the process mean = 0.9, the
standard deviation = 0.24, and group the loans into batches of size 40. Be sure
to resolve the anomaly noted in the first question.
Part a)
Do the resulting x-bar and s-charts indicate that the lending process has been in
control over the sampled period? What are the implications, especially with
regard to the confidence interval in Question 2?
Part b)
Why are the control limits for the mean in the x-bar chart so much wider than the
confidence interval for the mean used in Question 2? Two reasons, please!
Solution 03
Part a)
On drawing the x-bar and s-charts, it can be noted that the process is in control (Fig. 5 and 6).

Fig 5. X-bar chart

Fig 6. S-chart
Part b)

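The control limits span 6 standard errors (the mean plus or minus 3), while the confidence
interval in Question 2 spans only about 4 standard errors (the mean plus or minus 2) for a
95% confidence interval.
In addition, the standard error in the x-bar chart is that of the mean of a batch of 40
loans (0.24/sqrt(40)), whereas the confidence interval uses the whole sample, so its
standard error is much smaller.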
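The difference in width can be made concrete with a short calculation. The process settings come from the question. The standard deviation of 0.22 is the value reported in Solution 01, and n = 627 is an assumption (628 records minus the one removed).

```python
import math

process_mean, process_sd, batch_size = 0.9, 0.24, 40  # settings from the question

# x-bar chart: limits are 3 standard errors of a batch mean on either side.
se_batch = process_sd / math.sqrt(batch_size)
lcl, ucl = process_mean - 3 * se_batch, process_mean + 3 * se_batch

# Confidence interval half-width: about 2 standard errors of the full-sample mean.
# sd = 0.22 is the reported value; n = 627 is an assumption (628 records minus one).
se_full = 0.22 / math.sqrt(627)
ci_half_width = 2 * se_full

print(f"x-bar control limits: ({lcl:.3f}, {ucl:.3f})")
print(f"control-limit half-width: {3 * se_batch:.3f}")
print(f"95% CI half-width:        {ci_half_width:.3f}")
```

Both effects push in the same direction: a larger multiplier (3 instead of 2) and a much smaller sample per point (40 instead of the full sample).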

4. (a) The variable Years in Business may ultimately be useful in predicting
the PRSM score. Describe the distribution of this variable.

The plot shows that the data follow a negative exponential distribution, not a normal
distribution.
(b) Describe the distribution of the variable defined by log(1+Years in Business). In
particular, is the variation in the transformed variable nearly normal?

The plot shows that the transformed data approximately follow a normal distribution.
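The effect of the log(1 + x) transformation on a right-skewed variable can be illustrated with synthetic data. The sketch below assumes an exponential distribution with a mean of 5 years as a stand-in for Years in Business. It shows the transformation pulling in the long right tail, but it does not establish how close to normal the real transformed variable is.

```python
import math
import random
import statistics


def skewness(values):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


random.seed(0)
# Synthetic stand-in for Years in Business: exponential with a mean of 5 years.
years = [random.expovariate(1 / 5) for _ in range(2000)]
log_years = [math.log1p(y) for y in years]  # log(1 + years)

print(f"skewness of years:          {skewness(years):.2f}")
print(f"skewness of log(1 + years): {skewness(log_years):.2f}")
```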

5. The distribution of Average House Value in Zip Code reveals an unusual
feature that appears to be an artificial consequence of the data
processing. If this artificial feature could be corrected so that the data
showed the actual average house values, then:
(a) How would the mean and standard deviation of this variable change?
Would x-bar and s remain unchanged, increase or decrease?
(b) Would the median and IQR of this variable remain unchanged, increase
or decrease?

Ahv is the variable that holds the average house values.
Ahv2 is the same variable after removing the outlying data points above 1,000,000.
Comparing the summary statistics and standard deviation of ahv and ahv2 shows that the
mean, standard deviation, median and IQR have all decreased after removing the outliers.
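A comparison of this kind can be sketched as follows. The house values are hypothetical, since the real ahv column is not reproduced here, and `summarize` is an illustrative helper. The direction and size of each change depend on the data, so this only demonstrates the mechanics of the comparison.

```python
import statistics


def summarize(values):
    """Mean, sample sd, median and IQR of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.fmean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }


# Hypothetical house values; the real ahv column is not reproduced here.
ahv = [150_000, 200_000, 250_000, 300_000, 400_000, 1_200_000, 1_500_000]
ahv2 = [v for v in ahv if v <= 1_000_000]  # drop points above 1,000,000

before, after = summarize(ahv), summarize(ahv2)
for name in before:
    print(f"{name}: {before[name]:,.0f} -> {after[name]:,.0f}")
```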
6. An analyst extracted a sample of 25 loans from the same population as
your sample of loans. Estimate the probability that the average PRSM in a
sample of 25 from this population is greater than 0.9. Identify any
relevant assumptions and state why you believe them to be plausible.
Mean        0.8094
Std Dev     0.2067
Kurtosis    -0.0216

Z = (x - mu) / ((sigma/sqrt(n)) * sqrt((N-n)/(N-1)))
  = (0.9 - 0.8132) / ((0.2067/5) * sqrt((1678-25)/1677))
  = 0.0868 / (0.04134 * 0.99282)
  = 0.0868 / 0.04104
  = 2.115
So the probability that the sample average exceeds 0.9 is P(Z > 2.115) = 0.017
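The probability asked for in the question, P(sample mean > 0.9), can be checked with a short script. This is a sketch under a normal model for the sample mean with a finite population correction. It takes mu = 0.8132, sigma = 0.2067 and N = 1678 from the calculation above, and treating 0.8132 as the population mean is an assumption. The function name is illustrative.

```python
import math
from statistics import NormalDist


def prob_sample_mean_exceeds(threshold, mu, sigma, n, population_size=None):
    """P(sample mean > threshold) under a normal model for the sample mean."""
    se = sigma / math.sqrt(n)
    if population_size is not None:
        # Finite population correction: sqrt((N - n) / (N - 1)).
        se *= math.sqrt((population_size - n) / (population_size - 1))
    z = (threshold - mu) / se
    return z, 1 - NormalDist().cdf(z)


# mu, sigma and N are the values used in the calculation above; treating
# 0.8132 as the population mean is an assumption.
z, p = prob_sample_mean_exceeds(0.9, mu=0.8132, sigma=0.2067, n=25, population_size=1678)
print(f"z = {z:.3f}, P(mean > 0.9) = {p:.4f}")
```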

7. Would you be surprised if the sample of 25 obtained by the analyst
described in the prior question contains fewer than 5 loans that were
originated from ISO named Credit Divas? Explain your answer and justify
any assumptions you have made.
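Yes, it would be a surprise. This sample contains 685 Credit Divas loans out of 1687, which
equals 41%, so a random sample of 25 should contain about 10 of them. Fewer than 5 means
less than 20% of the sample, and under a binomial model with n = 25 and p = 0.41 (loans
drawn independently from a large population) the probability of this is below 1%.
Weighted sampling can be used to avoid such unrepresentative samples.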
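One way to quantify how surprising such a sample would be is a binomial model. The sketch below assumes the 25 loans are independent draws with success probability p = 685/1687, the share of Credit Divas loans reported above. The helper name is illustrative.

```python
import math


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


p_divas = 685 / 1687  # share of Credit Divas loans reported in the text
prob_fewer_than_5 = binomial_cdf(4, n=25, p=p_divas)
print(f"expected count in 25 loans: {25 * p_divas:.1f}")
print(f"P(fewer than 5 Credit Divas loans) = {prob_fewer_than_5:.4f}")
```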

8. For this question, recode the PRSM score into a two-level categorical
variable, labeled as Above or Below depending on whether the PRSM
score is above or below the average PRSM score in your dataset.
It has been suggested that loans given to repeat customers (Loan Type
identifies this variable) perform worse than those given to first time
borrowers. To what extent is this assertion supported by the data? Provide
a brief discussion using ideas from the first two lectures to support your
answer.

In the cross table, Type 1 refers to original customers and Type 2 refers to repeat
customers. Below and high are categorized as per the target PRSM score of 1.
As the table shows, 85.5% of original customers are not meeting the target, whereas 75.5%
of repeat customers are not meeting the target PRSM. So the assertion that repeat customers
perform worse is not supported by this table.

9. Does your answer to the previous question imply that Original/Repeat loan
status is the cause of an improved PRSM score or might there be another
explanation? If so, suggest a possible lurking variable that could
influence the comparison (the lurking variable can be a hypothetical one,
it doesn't have to exist in the data set). Otherwise, explain briefly why it is
not possible. (Reading Section 5.2 of SF could be helpful here.)

From the correlation matrix, we can see that there is no significant correlation between the
other variables and PRSM; hence a new variable obtained by variable transformation is required.

10. If you wanted to construct an approximate 95% confidence interval for the
proportion of future loans that originate from ISO Loan Masters, and this CI
was to have a margin of error of +/-2%, then what sample size would you
recommend?
Zalpha = 1.96
E = 0.02
Standard Deviation = 0.211651
N = Zalpha^2 * Standard Deviation^2 / E^2
N = 430.2, rounded up to 431

So the required sample size for a margin of error of 2% is 431.
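The sample size formula can be checked with a few lines of Python. This is a sketch using the standard deviation quoted above. The function name is illustrative, and the result is rounded up to the next whole loan so that the margin of error is not exceeded.

```python
import math
from statistics import NormalDist


def sample_size_for_margin(sd, margin, confidence=0.95):
    """Smallest n with z * sd / sqrt(n) <= margin, rounded up to a whole unit."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sd / margin) ** 2)


# sd = 0.211651 is the value used above; for a proportion p it equals sqrt(p * (1 - p)).
print(sample_size_for_margin(sd=0.211651, margin=0.02))
```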
