
DATA ANALYSIS ASSIGNMENT - PROJECT 1 (COURSERA)

Interest Rate Prediction for Social Loans


Erik J.F. van Kempen, Student Member, IEEE, <erikvankempen@ieee.org> Fontys Hogeschool Financieel Management

Abstract - In social lending, people lend semi-directly to each other. Interest rates are still set on the basis of risk assessments. By analyzing a sample set from a social lending platform, I found that the interest rate is strongly correlated with the applicant's FICO score and the amount requested. Two estimators for the interest rate were modelled with a multivariate linear regression involving the two covariates. Every FICO point is worth a discount of 8.17 basis points for 36 month loans and a discount of 9.86 basis points for 60 month loans.

Index Terms - Personal loans, peer-to-peer, social lending, FICO, basis points, multivariate linear regression

I. INTRODUCTION

Since the start of the financial and economic downturn, social lending has thrived in the UK and the United States. Over USD 1.3 billion was funded via the Lending Club, a US-based online financial community which brings together borrowers and investors [1]. The concept of social lending entails that consumers lend semi-directly to other consumers, without the costly involvement of a bank. The margin the bank normally takes can be divided between the borrower and the investor, resulting in a better deal for both parties.

Interest rates are calculated for each individual borrower based on his or her personal situation. The criteria used are the FICO score, the purpose of the loan, the total amount the person applies for and other risk indicators. By analyzing the relation between the interest rates offered and specific criteria, an estimator can be defined to predict the interest rate. Criteria like length of time employed at the current job, monthly income, length of the loan, total amount of outstanding loans and the debt-to-income ratio are considered as possible influential factors.

II. METHODS

A. Data Collection

For my analysis I used the data on peer-to-peer loans as issued by Lending Club via the Coursera website [2]. This set contains 2,500 samples, including additional personal properties of the borrower. Only two loan terms are provided: 60 month and 36 month loans. Of these 2,500 samples, only 548 are 60 month loans, while the other 1,952 are 36 month loans.

B. Exploratory Analysis

Exploratory analysis was performed on the raw data by using data cleansing techniques and plotting subsets of the data. Before any analysis could be started, some variables were quantified. For example, the FICO range was transformed to a score, taking the lower bound of the range as the score. To find the upper bound of the range, add 5 to the resulting score.

The exploratory analysis was used to determine which personal properties affect the interest rate. The properties which most clearly affect the interest rate were selected for further analysis in order to find the necessary variables for the regression model.

C. Statistical Modeling

To relate the interest rate to the selected variables I used a standard multivariate linear regression model [3], [4]. The estimated coefficients in the final model were calculated using the least squares method. Errors were calculated using standard asymptotic approximations.

III. RESULTS

In the exploratory phase of the analysis, I found no clear correlation between the interest rate and the state of residence of the applicant. The total length of employment at the current job and home ownership also appear to be uncorrelated with the interest rate.

Some potential confounders were found. For example, the debt-to-income ratio seems to be correlated with both the outcome, i.e. the interest rate, and the FICO score. This may affect the estimation of the coefficients. The amount requested and the number of open credit lines were also found to be correlated with the interest rate and the FICO score.

The FICO score and length of the loan in months were selected as the necessary variables for the regression model due to their clear correlation with the interest rate and because other variables were eliminated for being potential confounders.

A. Regression model

A multivariate linear regression model was used to define an estimator for the interest rate, based on the FICO score, length of the loan in months and debt-to-income ratio. The final estimator is defined as follows:

$$\beta = \{\beta_0, \beta_1 \mid \beta_n \in \mathbb{R}\} \qquad (1)$$

$$i_l = \beta_0 + \beta_1 F + \epsilon \qquad (2)$$


With the terms defined as:

$i_{36}$ or $i_{60}$: interest rate (36 or 60 months)
$F$: FICO score (lower bound of range)
$\epsilon$: error term
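The FICO range transformation and the least squares fit described in the Methods section can be sketched as follows. This is a minimal sketch in Python, not the code used for the analysis: the function names `fico_lower_bound`, `fit_line` and `fit_by_term` are assumptions made for illustration, and the sample rows are invented and do not come from the Lending Club data. One line is fitted per loan length, with the FICO score as the only covariate.

```python
from statistics import mean


def fico_lower_bound(fico_range: str) -> int:
    """Turn a range such as '735-739' into its lower bound, 735."""
    return int(fico_range.split("-")[0])


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def fit_by_term(rows):
    """Fit one line per loan length; rows are (FICO range, months, rate in %)."""
    fits = {}
    for term in sorted({months for _, months, _ in rows}):
        subset = [row for row in rows if row[1] == term]
        scores = [fico_lower_bound(row[0]) for row in subset]
        rates = [row[2] for row in subset]
        fits[term] = fit_line(scores, rates)
    return fits


# Hypothetical sample rows, invented for this sketch (not real loan data).
rows = [
    ("670-674", 36, 16.4), ("700-704", 36, 14.0),
    ("730-734", 36, 11.6), ("760-764", 36, 9.2),
    ("670-674", 60, 19.0), ("700-704", 60, 16.0),
    ("730-734", 60, 13.0), ("760-764", 60, 10.0),
]

fits = fit_by_term(rows)
for term, (b0, b1) in fits.items():
    print(f"{term} months: intercept={b0:.2f}, slope={b1:.4f}")
```

Fitting each loan length separately mirrors the two estimators $i_{36}$ and $i_{60}$; a single model with the loan length as an extra covariate would be an alternative design.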

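The multivariate linear regression resulted in the following linear model:

$$i_{36} = 69.77 - 0.0817\,F + \epsilon \qquad (3)$$

$$i_{60} = 86.10 - 0.0986\,F + \epsilon \qquad (4)$$

| Coefficient | Mean | 95% Confidence Interval |
|---|---|---|
| $i_{36}$: $\beta_0$ | 69.77 | $67.74 \le \beta_0 \le 71.82$ |
| $i_{36}$: $\beta_1$ | -0.0817 | $-0.0845 \le \beta_1 \le -0.0788$ |
| $i_{60}$: $\beta_0$ | 86.10 | $81.92 \le \beta_0 \le 90.27$ |
| $i_{60}$: $\beta_1$ | -0.0986 | $-0.1045 \le \beta_1 \le -0.0927$ |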
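As a worked illustration of how the fitted models can be read, the sketch below evaluates both estimators for a given FICO score and converts each slope into basis points. The function names and the example score of 700 are assumptions made for illustration; the coefficients are the reported means, and the error term is omitted, so the result is a point estimate only.

```python
# Reported mean coefficients per loan length: (intercept, slope per FICO point).
COEFFICIENTS = {36: (69.77, -0.0817), 60: (86.10, -0.0986)}


def estimate_rate(fico_score: float, term_months: int) -> float:
    """Point estimate of the interest rate in percent (error term omitted)."""
    b0, b1 = COEFFICIENTS[term_months]
    return b0 + b1 * fico_score


def discount_in_basis_points(term_months: int) -> float:
    """Rate reduction per FICO point; 1 percentage point = 100 basis points."""
    return -COEFFICIENTS[term_months][1] * 100


for term in (36, 60):
    rate = estimate_rate(700, term)
    discount = discount_in_basis_points(term)
    print(f"{term} months: {rate:.2f}% at FICO 700, {discount:.2f} bp per point")
```

For a FICO score of 700 this gives about 12.58% for a 36 month loan and about 17.08% for a 60 month loan.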

These formulas define an estimator for the interest rate given the FICO score of the applicant. With $P_{95}[\beta] < 0.00001$, the correlations found are statistically significant. Furthermore, with $R^2_{36} = 0.61$ and $R^2_{60} = 0.66$, the models fit the supplied sample set fairly well. The resulting estimator is shown in Fig. 1.

Fig. 1. (A) Interest rate vs. FICO score for 36 month loans and (B) interest rate vs. FICO score for 60 month loans. The estimator for the interest rate is shown as the red line, while the sample set is presented as blue open circles. The higher the FICO score, the lower the resulting interest rate. The applicant is offered a discount of 8.17 basis points for every FICO point on 36 month loans, while the discount is 9.86 basis points for every FICO point on 60 month loans.

IV. CONCLUSION

According to the results of the analysis and the defined estimator functions, the interest rate offered is strongly correlated with the FICO score of the applicant and the amount requested. Ceteris paribus, the higher the FICO score, the lower the interest rate: given that all other properties are the same, a lower interest rate is offered to people with a higher FICO score. Every FICO point is worth a discount of 8.17 basis points for 36 month loans and a discount of 9.86 basis points for 60 month loans.

These results are based on a fairly small number of samples, i.e. 1,952 and 548 loans, so a better multivariate linear model could be found by collecting and analyzing more samples.

ACKNOWLEDGMENT

I would like to thank Jeff Leek, PhD, and Roger D. Peng, PhD, the instructors of the course Data Analysis on Coursera. Leek is an Assistant Professor of Biostatistics at the Johns Hopkins Bloomberg School of Public Health. Peng is an Associate Professor of Biostatistics at the Johns Hopkins Bloomberg School of Public Health.

REFERENCES

[1] Lending Club, "Lending Club Statistics," accessed on 09/02/2013, available at https://www.lendingclub.com/info/statistics.action.
[2] Lending Club, "Loans data sample set," accessed on 09/02/2013, available at https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv.
[3] N. R. Draper, H. Smith, and E. Pownell, Applied Regression Analysis, vol. 3. New York: Wiley, 1966.
[4] G. A. Seber and A. J. Lee, Linear Regression Analysis, vol. 936. Wiley, 2012.
[5] L. Wasserman, All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.


Erik J.F. van Kempen (born April 25, 1987) is a Dutch student at Fontys Hogeschool Financieel Management in Eindhoven, The Netherlands. He is a final year student pursuing a Bachelor's degree in Accountancy. His research interests are in the areas of continuous monitoring and auditing.
