Pristine www.edupristine.com
4. Correlation and Regression
I. Covariance and Correlation coefficient
II. Regression
4a. Correlation
I. Covariance and Correlation coefficient
i. Definition
ii. Sample and population correlation
iii. Illustrative example
iv. Statistical significance test for sample correlation coefficient
4a. Covariance and Correlation Coefficient
Covariance is a statistical measure of the degree to which the two variables move together.
The sample covariance is calculated as:
cov_xy = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)
Correlation coefficient
It is a measure of the strength of the linear relationship between two variables
The correlation coefficient is given by:
ρ_xy = cov_xy / (σ_x σ_y)
Population correlation is denoted by ρ (rho)
Sample correlation is denoted by r. It is an estimate of ρ in the same way as s² (sample variance) is an estimate of σ² (population variance) and X̄ (sample mean) is an estimate of μ (population mean)
Features of ρ and r
Unit free and ranges between -1 and 1
The closer to -1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship
4a. Example: Covariance and Correlation of the S&P 500 and
NASDAQ Returns given a sample
4a. Solution: Covariance and Correlation of the S&P 500 and
NASDAQ Returns given a sample
Date         S&P 500    NASDAQ     Xi       Yi       Xi − X̄    Yi − Ȳ    (Xi − X̄)(Yi − Ȳ)
12/2/2011    1,244.28   2,626.93
12/5/2011    1,257.08   2,655.76   1.03%    1.10%    1.14%     1.20%     0.0137%
12/7/2011    1,261.01   2,649.21   0.31%   -0.25%    0.43%    -0.15%    -0.0006%
12/8/2011    1,234.35   2,596.38  -2.11%   -1.99%   -2.00%    -1.89%     0.0378%
12/9/2011    1,255.19   2,646.85   1.69%    1.94%    1.80%     2.05%     0.0369%
12/12/2011   1,236.47   2,612.26  -1.49%   -1.31%   -1.38%    -1.21%     0.0166%
Here Xi and Yi are the daily S&P 500 and NASDAQ returns, and X̄ and Ȳ are their sample means.
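The calculation can be reproduced in Python. This sketch uses the rounded returns from the table above, so the result differs slightly from the slide's r = 0.979811, which is computed from unrounded returns:

```python
# Sample covariance and correlation for the S&P 500 (X) and NASDAQ (Y)
# daily returns from the slide, entered in percent.
from math import sqrt

X = [1.03, 0.31, -2.11, 1.69, -1.49]   # S&P 500 returns (%)
Y = [1.10, -0.25, -1.99, 1.94, -1.31]  # NASDAQ returns (%)

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# cov_xy = sum((Xi - X_bar)(Yi - Y_bar)) / (n - 1)
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / (n - 1)

# sample standard deviations
s_x = sqrt(sum((x - x_bar) ** 2 for x in X) / (n - 1))
s_y = sqrt(sum((y - y_bar) ** 2 for y in Y) / (n - 1))

r = cov_xy / (s_x * s_y)  # sample correlation coefficient
print(round(r, 4))        # close to the slide's 0.9798
```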
4a. Examples of Approximate r Values
[Five scatter plots of y against x, illustrating r = -1, r = -0.6, r = 0, r = +0.3, and r = +1]
4.b.Case- Multivariate Linear Regression (Revisited)
Adam, an Analytics consultant, works with First Auto Insurance company. His manager gave him data
having "Loss" amount and policy related information and asked him to "identify" and "quantify" the
factors responsible for losses in a multivariate fashion. Adam has no knowledge of running a
multivariate regression.
Now suppose he approaches you and requests your help to complete the assignment. Let's help
Adam in carrying out the multivariate regression.
4a. Testing the significance of the correlation coefficient
Test whether the correlation between the two variables in the population is equal to zero
Null hypothesis, H0: ρ = 0
Assuming that the two populations are normally distributed, we can use a t-test to determine
whether the null hypothesis should be rejected.
The test statistic is computed using the sample correlation, r, with n − 2 degrees of freedom (df):
t = r √(n − 2) / √(1 − r²)
Calculated test statistic is compared with the critical t-value for the appropriate degrees of
freedom and level of significance
Reject H0 if t > tcritical or t <-tcritical
Example: Correlation of the S&P 500 and NASDAQ Returns given a sample
n = 5, r = 0.979811179, v = 5-2 = 3
Calculate, t = 8.4885
tcritical (df = 3, 5% significance, one-tailed) = 2.3534; the two-tailed 5% critical value for df = 3 is 3.182
Since t = 8.4885 exceeds the critical value, reject H0 at the 95% confidence level
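The significance test can be sketched in Python, with r, n, and the critical value taken from the example above:

```python
# t-test for the significance of a sample correlation coefficient:
# t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom.
from math import sqrt

r, n = 0.979811, 5           # values from the S&P 500 / NASDAQ example
df = n - 2
t = r * sqrt(df) / sqrt(1 - r ** 2)
print(round(t, 4))           # about 8.49, matching the slide

t_critical = 2.3534          # slide's critical value for df = 3
reject_h0 = t > t_critical or t < -t_critical
print(reject_h0)             # True: reject H0 (rho = 0)
```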
4b. Regression
4.b.The Million Dollar Question
Hours Mumbai Delhi Chennai Kolkata Bangalore Pune Hyderabad Online Singapore Middle East
10 20 7 5 13 10 11 14 9 7 12
20 8 24 34 24 16 19 20 20 25 12
30 19 8 16 37 62 29 33 25 36 30
40 67 31 44 43 32 19 38 27 49 35
50 36 46 78 57 36 82 55 33 53 41
60 67 54 90 58 23 45 62 67 58 78
70 56 68 93 71 76 72 68 81 70 57
80 81 89 78 86 45 68 83 58 90 98
4.b.The Population
[Scatter plot of the population: Marks in Test (0 to 100) against Hours of Study (0 to 90)]
4.b.Introduction to Regression Analysis
Dependent variable: the variable whose variation we are trying to explain. Denoted by Y
Independent variable: the variable used to explain the dependent variable. Denoted by X
4.b.Simple Linear Regression Model
4.b.Assumptions
1. A linear relationship exists between the dependent and the independent variable.
2. The independent variable is uncorrelated with the residual term.
3. The expected value of the residual term is zero: E(εᵢ) = 0
4. The variance of the residual term is constant for all observations: E(εᵢ²) = σε²
5. The residual term is independently distributed; that is, the residual for one observation is not
correlated with that of another observation
[E(εᵢ εⱼ) = 0 for j ≠ i]
6. The residual term is normally distributed.
4.b.Types of Regression Models
4.b.Population Linear Regression
(continued)
Y = β₀ + β₁X + u
[Figure: regression line with intercept β₀ and slope β₁; an individual person's marks yᵢ at xᵢ deviate from the line by the error uᵢ]
4.b.Population Regression Function
Y = β₀ + β₁X + u
where Y = dependent variable, β₀ = population intercept, β₁ = population slope coefficient, X = independent variable, and u = random error term (residual)
β₀ + β₁X is the linear component; u is the random error component
4.b.Information that we actually have
Hours Mumbai
10 20
20 8
30 19
40 67
50 36
60 67
70 56
80 81
4.b.Sample Regression Function
(continued)
ŷ = b₀ + b₁x + e
[Figure: sample regression line with intercept b₀ and slope b₁; the observed value of y for xᵢ deviates from the line by the residual eᵢ]
4.b.Sample Regression Function
4.b.The error term (residual)
Represents the influence of all the variables which we have not accounted for in the equation
It represents the difference between the actual "y" values and the predicted "y" values
from the Sample Regression Line
Wouldn't it be good if we were able to reduce this error term?
What are we trying to achieve by Sample Regression?
4.b.Our Objective
Population: Y = β₀ + β₁X + u
Sample: ŷᵢ = b₀ + b₁xᵢ
We want b₀ and b₁ to be good estimates of β₀ and β₁
4.b.One method to find b0 and b1
Choose b₀ and b₁ to minimize the sum of squared residuals:
Σe² = Σ(y − ŷ)² = Σ(y − (b₀ + b₁x))²
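For simple linear regression this minimization has the closed-form solution b₁ = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b₀ = ȳ − b₁x̄. A minimal sketch, fitted to the Hours vs. Mumbai marks sample shown earlier:

```python
# OLS estimates for simple linear regression, fitted to the
# Hours-of-study vs Mumbai-marks sample from the earlier slide.
hours = [10, 20, 30, 40, 50, 60, 70, 80]
marks = [20, 8, 19, 67, 36, 67, 56, 81]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(marks) / n

# b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2), b0 = y_bar - b1 * x_bar
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, marks))
s_xx = sum((x - x_bar) ** 2 for x in hours)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
print(round(b0, 4), round(b1, 4))  # intercept and slope

# Property: residuals from the least-squares line sum to (numerically) zero
residuals = [y - (b0 + b1 * x) for x, y in zip(hours, marks)]
print(abs(sum(residuals)) < 1e-9)  # True
```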
4.b.OLS Regression Properties
The sum of the residuals from the least squares regression line is 0.
Σ(y − ŷ) = 0
The sum of the squared residuals is a minimum.
Minimize Σ(y − ŷ)²
The simple regression line always passes through the mean of the y variable and the mean of
the x variable
4.b.Interpretation of the Slope and the Intercept
b0 is the estimated average value of y when the value of x is zero. More often than not it does
not have a physical interpretation
b1 is the estimated change in the average value of y as a result of a one-unit change in x
[Figure: line Ŷ = b₀ + b₁X with intercept b₀ and slope b₁]
4.b.Hypothesis Testing: Two Variable Model
How do we know whether the values of b0 and b1 that we have found are actually meaningful?
Is it actually possible that our sample was a random sample and it has given us a totally wrong
regression line?
We do know a lot about the sample error term "e" but what do we know about the error terms
"u" of the Population Regression Function?
How do we proceed from here?
4.b.Assumptions about "u"
The underlying relationship between the X variable and the Y variable is linear
The error term is uncorrelated with the explanatory variable X
For a given value of Xᵢ, the expected value of the error term is equal to 0
Error values are normally distributed for any given value of X: the probability distribution of the errors for a given Xᵢ is normal
The probability distribution of the errors for different Xᵢ has constant variance (homoscedasticity)
Error values u for given Xᵢ are statistically independent; their covariance is zero [Cov(e_x1, e_x2) = 0]
[Figure: normal error distributions with equal variance centred on the regression line at x₁ and x₂]
Once we make these assumptions about "u" we are able to estimate the variance and
standard errors of b₀ and b₁; this is possible because of the properties
of the OLS method (beyond the scope of this lecture)
4.b.Standard Error of Estimate (SEE)
The standard deviation of the variation of observations around the regression line is estimated
by:
s_u = √( RSS / (n − k − 1) )
Where
RSS= Residual Sum of Squares (summation of e2)
n = Sample size
k = number of independent variables in the model
Note: when k = 1, s_u = √( RSS / (n − 2) ) is the sample standard error of the estimate
4.b.Comparing Standard Errors
Variation of observed y values from the regression line (small s_u) vs. variation in the slope of regression lines from different possible samples (small s_b1)
[Figures omitted]
4.b.Inference about the Slope: t-Test
t-test for a population slope
Is there a linear relationship between x and y?
Null and alternative hypotheses
H0: β₁ = 0 (no linear relationship)
H1: β₁ ≠ 0 (linear relationship does exist)
Test statistic:
t = (b₁ − β₁) / s_b1, with d.f. = n − 2
The null hypothesis can be rejected if either of the following is true:
t > tc or t < −tc
where:
b₁ = sample regression slope coefficient
β₁ = hypothesized slope
s_b1 = estimator of the standard error of the slope
tc = the critical t value
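A sketch of this slope t-test, applied to the Hours vs. Mumbai marks sample used earlier; the value 2.447 is the standard two-tailed 5% critical t for 6 degrees of freedom:

```python
# t-test for the regression slope, t = (b1 - beta1) / s_b1 with n - 2 df,
# illustrated on the Hours vs Mumbai-marks sample fitted earlier.
from math import sqrt

hours = [10, 20, 30, 40, 50, 60, 70, 80]
marks = [20, 8, 19, 67, 36, 67, 56, 81]

n = len(hours)
x_bar, y_bar = sum(hours) / n, sum(marks) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, marks))
s_xx = sum((x - x_bar) ** 2 for x in hours)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# standard error of estimate and standard error of the slope
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, marks))
s_u = sqrt(sse / (n - 2))           # standard error of estimate
s_b1 = s_u / sqrt(s_xx)             # standard error of the slope

t = (b1 - 0) / s_b1                 # H0: beta1 = 0
t_critical = 2.447                  # two-tailed 5% critical t for df = 6
print(round(t, 3), t > t_critical)  # slope is statistically significant
```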
4.b.Confidence Interval for 'y'
The confidence interval for the predicted value of Y is given by:
Ŷ ± (tc × s_f)
where:
Ŷ = predicted 'Y' value (dependent variable)
n − 2 = degrees of freedom
tc = the critical t value
s_f = the standard error of the forecast
4.b.The Confidence Interval for a Regression Coefficient
The confidence interval for the regression coefficient b₁ is given by:
b₁ ± (tc × s_b1)
where:
b₁ = estimated slope coefficient
n − 2 = degrees of freedom
tc = the critical t value
4.b.Explained and Unexplained Variation
[Figure: for an observation (xᵢ, yᵢ), the deviation from the mean ȳ splits into an explained part (ŷᵢ − ȳ) and an unexplained part (yᵢ − ŷᵢ)]
SST = Total Sum of Squares: SST = Σ(yᵢ − ȳ)²
SSE = Sum of Squared Errors: SSE = Σ(yᵢ − ŷᵢ)²
RSS = Regression Sum of Squares: RSS = Σ(ŷᵢ − ȳ)²
4.b.Explained and Unexplained Variation (Cont)
SST = Σ(y − ȳ)²
SSE = Σ(y − ŷ)²    SSR = Σ(ŷ − ȳ)²
Where:
ȳ = average value of the dependent variable
y = observed values of the dependent variable
ŷ = estimated value of y for the given x value
4.b.Coefficient of Determination, R2
The coefficient of determination is the portion of the total variation in the dependent variable
that is explained by variation in the independent variable
The coefficient of determination is also called R-squared and is denoted as R2
R² = SSR / SST, where 0 ≤ R² ≤ 1
4.b.Coefficient of Determination, R2 (Cont)
R² = SSR / SST = (sum of squares explained by regression) / (total sum of squares)
For simple linear regression, R² = r²
Where:
R² = coefficient of determination
r = simple correlation coefficient
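The decomposition SST = SSR + SSE and R² = SSR/SST can be checked numerically on the same Hours vs. Mumbai marks sample:

```python
# SST = SSR + SSE decomposition and R^2 = SSR / SST, on the
# Hours vs Mumbai-marks sample; also checks R^2 = r^2 for simple regression.
from math import sqrt

hours = [10, 20, 30, 40, 50, 60, 70, 80]
marks = [20, 8, 19, 67, 36, 67, 56, 81]

n = len(hours)
x_bar, y_bar = sum(hours) / n, sum(marks) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, marks))
s_xx = sum((x - x_bar) ** 2 for x in hours)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in hours]

sst = sum((y - y_bar) ** 2 for y in marks)               # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained
sse = sum((y - yh) ** 2 for y, yh in zip(marks, y_hat))  # unexplained

r2 = ssr / sst
r = s_xy / sqrt(s_xx * sst)  # simple correlation coefficient
print(round(r2, 4))          # about 0.71 of the variation is explained
```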
4.b.Examples of Approximate R2 Values
[Scatter plot: points lie exactly on a line; perfect linear relationship, R² = 1]
4.b.Examples of Approximate R2 Values (Cont)
[Scatter plot: weaker linear relationship between x and y; 0 < R² < 1]
4.b.Examples of Approximate R2 Values (Cont)
[Scatter plot: no linear relationship between x and y; R² = 0]
4.b.Limitations of Regression Analysis
Parameter Instability - This happens in situations where correlations change over a period of
time. This is very common in financial markets where economic, tax, regulatory, and political
factors change frequently.
Public knowledge of a specific regression relation may cause a large number of people to react in
a similar fashion towards the variables, negating its future usefulness.
If any regression assumptions are violated, predicted values of the dependent variable and hypothesis tests
will not be valid.
4.b.General Multiple Linear Regression Model
In simple linear regression, the dependent variable was assumed to be dependent on only one
variable (independent variable)
In the General Multiple Linear Regression model, the dependent variable derives its value from two or
more independent variables.
The General Multiple Linear Regression model takes the following form:
Yᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ + … + b_k X_kᵢ + εᵢ
where:
Yi = ith observation of dependent variable Y
Xki = ith observation of kth independent variable X
b0 = intercept term
bk = slope coefficient of kth independent variable
εᵢ = error term of ith observation
n = number of observations
k = total number of independent variables
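A minimal multiple-regression sketch using NumPy's least-squares solver; the data below are made up so that the known coefficients are recovered exactly:

```python
# General multiple linear regression Yi = b0 + b1*X1i + b2*X2i + e_i,
# estimated by least squares. The data are fabricated for illustration.
import numpy as np

# dependent variable built as y = 3 + 2*x1 - 1*x2 with no noise,
# so the fit recovers the coefficients exactly
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 3 + 2 * x1 - 1 * x2

# design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef
print(np.round(coef, 6))  # recovers intercept 3, slopes 2 and -1
```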
4.b.Estimated Regression Equation
As we calculated the intercept and the slope coefficient in case of simple linear regression by
minimizing the sum of squared errors, similarly we estimate the intercept and slope coefficient in
multiple linear regression.
The sum of squared errors, Σᵢ₌₁ⁿ εᵢ², is minimized and the slope coefficients are estimated.
εᵢ = Yᵢ − Ŷᵢ, where Ŷᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ + … + b_k X_kᵢ
4.b.Interpreting the Estimated Regression Equation
Intercept Term (b0): It's the value of dependent variable when the value of all independent
variables become zero.
b₀ = value of Y when X₁ = X₂ = … = X_k = 0
Slope coefficient (bk): It's the change in the dependent variable from a unit change in the
corresponding independent (Xk) variable keeping all other independent variables constant.
In reality when the value of the independent variable changes by one unit, the change in the
dependent variable is not equal to the slope coefficient but depends on the correlation among
the independent variables as well.
Therefore, the slope coefficients are also called partial slope coefficients
4.b.Assumptions of Multiple Regression Model
There exists a linear relationship between the dependent and independent variables.
The expected value of the error term, conditional on the independent variables is zero.
The error terms are homoskedastic, i.e. the variance of the error terms is constant for all the
observations.
The expected value of the product of error terms is always zero, which implies that the error
terms are uncorrelated with each other.
The independent variables do not have exact linear relationships with each other.
4.b.Hypothesis Testing of Coefficients
The values of the slope coefficients do not by themselves tell us anything about their significance in explaining the
dependent variable.
Even an unrelated variable, when regressed, would give some value of the slope coefficient.
To exclude cases where the independent variables do not significantly explain the dependent
variable, we perform hypothesis tests on the coefficients to check whether they contribute
significantly to explaining the dependent variable.
The t-statistic is used to check the significance of the coefficients.
The t-statistic used for the hypothesis testing is the same as that used in the hypothesis testing of the
coefficients of simple linear regression.
Following are the null and alternative hypotheses to check the statistical significance of b_k:
Null hypothesis (H0): b_k = 0
Alternative hypothesis (Ha): b_k ≠ 0
The t-statistic, with (n − k − 1) degrees of freedom, for the hypothesis test of the coefficient b_k is:
t = (b̂_k − b_k) / s_bk
If the value of t-statistic lies within the confidence interval, H0 can't be rejected
4.b.Confidence Interval for the Population Value
The confidence interval for a regression coefficient is given by:
b_j ± (tc × s_bj)
Where,
tc is the critical t-value, and
s_bj is the standard error of b_j
4.b.Predicted Dependent Variable
The regression equation can be used for making predictions about the dependent variable by
using forecasted values of the independent variables.
Ŷᵢ = b₀ + b₁X̂₁ᵢ + b₂X̂₂ᵢ + … + b_k X̂_kᵢ
Where,
Ŷᵢ is the predicted value of the dependent variable
b_k is the estimated partial slope coefficient of the kth independent variable
X̂_kᵢ is the forecasted value of the kth independent variable
4.b.Analysis of Variance (ANOVA)
Analysis of variance is a statistical method for analyzing the variability of the data by breaking the
variability into its constituents.
A typical ANOVA table looks like:
Source of Variability     DoF      Sum of Squares    Mean Sum of Squares
Regression (Explained)    k        RSS               MSR = RSS / k
Error (Unexplained)       n-k-1    SSE               MSE = SSE / (n-k-1)
Total                     n-1      SST = RSS + SSE
Standard Error of Estimate (SEE) = √MSE = √( SSE / (n − k − 1) )
Coefficient of determination (R²) = [Total Variation (SST) − Unexplained Variation (SSE)] / Total Variation (SST)
4.b.F-Statistic
An F-test explains how well the dependent variable is explained by the independent variables
collectively.
With multiple independent variables, the F-test tells us whether the independent variables, taken
together, explain a significant part of the variation in the dependent variable.
F = MSR / MSE = (RSS / k) / (SSE / (n − k − 1)), with k and (n − k − 1) degrees of freedom
Adjusted R²: Ra² = 1 − [(n − 1) / (n − k − 1)] × (1 − R²)
where:
n = Number of Observations
k = Number of Independent Variables
Ra² = Adjusted R²
4.b.Representing Qualitative Factors
How can we represent Qualitative factors in a regression equation?
By using 'dummy variables': variables that take values of either 1 or 0, depending on whether a
condition is true or false.
If we wanted to consider the spike in soft drink sales in the summer, we may have a regression
equation:
Rev(t) = 10,000 + 2,000t + 50,000S
Here,
S = 1 if it's summer, 0 if it's not summer
If there are n mutually exclusive and exhaustive classes, they can be represented by n-1 dummy
variables. This is derived from the concept of degrees of freedom.
For example, to represent the 4 stages of the business cycle, we can use 3 dummy variables.
The fourth variable would be represented by zeros for all three dummy variables.
We do not use 4 variables as that would indicate a linear relationship between all 4 variables.
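A small sketch of the dummy-variable ideas above; the revenue function uses the slide's equation, and the business-cycle stage names are illustrative:

```python
# Dummy-variable regression equation from the slide:
# Rev(t) = 10,000 + 2,000*t + 50,000*S, with S = 1 in summer, else 0.
def revenue(t, is_summer):
    S = 1 if is_summer else 0
    return 10_000 + 2_000 * t + 50_000 * S

print(revenue(5, False))  # 20000
print(revenue(5, True))   # 70000: the summer dummy shifts revenue by 50,000

# n mutually exclusive classes -> n - 1 dummies, e.g. 4 business-cycle stages
# (stage names are illustrative; the last one is the all-zeros base case)
stages = ["expansion", "peak", "contraction", "trough"]
def stage_dummies(stage):
    return [1 if stage == s else 0 for s in stages[:-1]]

print(stage_dummies("peak"))    # [0, 1, 0]
print(stage_dummies("trough"))  # [0, 0, 0]
```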
4.b.Heteroskedasticity
When the requirement of a constant variance is violated, we have a condition of
heteroskedasticity.
[Figure: residual plot in which the spread of the errors u increases with predicted y]
4.b.Unconditional and Conditional Heteroskedasticity
Presence of heteroskedasticity in the data is the violation of the assumption about the constant
variance of the residual term.
Heteroskedasticity takes the following two forms, unconditional and conditional.
Unconditional heteroskedasticity is present when the variance of the residual terms is not
related to the values of the independent variable.
Unconditional heteroskedasticity does not pose a problem in regression analysis, as the
variance does not change systematically.
Conditional heteroskedasticity poses problems in regression analysis, as the residuals are
systematically related to the independent variables.
[Figure: regression line Ŷ = b₀ + b₁X with low variance of residual terms at low X and high variance of residual terms at high X]
4.b.Detecting Heteroskedasticity
Heteroskedasticity can be detected either by viewing the scatter plots as discussed in the previous
case or by Breusch-Pagan chi-square test.
In the Breusch-Pagan chi-square test, the squared residuals are regressed on the independent
variables to check whether the independent variables explain a significant proportion of the
squared residuals.
If they do, we conclude that conditional heteroskedasticity is present; otherwise it is not.
Breusch-Pagan test statistic follows a chi-square distribution with k degrees of freedom, where k is
the number of independent variables.
BP chi-square test statistic = n × R²_resid
where:
n = number of observations
R²_resid = coefficient of determination when the squared residuals are regressed on the independent variables
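The mechanics of the statistic can be sketched as follows; the "residuals" here are fabricated so that their squares are exactly linear in x, which makes the arithmetic easy to verify (a real test would use residuals from a fitted regression):

```python
# Sketch of the Breusch-Pagan statistic: regress the squared residuals on the
# independent variable, then BP = n * R^2_resid (chi-square with k df).
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
resid = [sqrt(v) for v in x]        # fabricated so resid^2 is exactly linear in x
resid_sq = [r * r for r in resid]

# simple regression of resid_sq on x
n = len(x)
x_bar = sum(x) / n
z_bar = sum(resid_sq) / n
s_xz = sum((a - x_bar) * (b - z_bar) for a, b in zip(x, resid_sq))
s_xx = sum((a - x_bar) ** 2 for a in x)
b1 = s_xz / s_xx
b0 = z_bar - b1 * x_bar

sst = sum((b - z_bar) ** 2 for b in resid_sq)
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, resid_sq))
r2_resid = 1 - sse / sst

bp = n * r2_resid
print(round(bp, 6))  # 5.0 here, since the fit is perfect (R^2_resid = 1)
# compare bp with the chi-square critical value for k = 1 df (3.841 at 5%)
```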
There are two methods for correcting the effects of conditional heteroskedasticity
Robust Standard Errors
Correct the standard errors of the linear regression model's estimated coefficients to account
for conditional heteroskedasticity
Generalized Least Squares
Modifies the original equation in an attempt to eliminate heteroskedasticity.
Statistical packages are available for computing robust standard errors.
4.b.Multicollinearity
Another significant problem faced in the Regression Analysis is when the independent variables or
the linear combinations of the independent variables are correlated with each other.
This correlation among the independent variables is called multicollinearity, which creates
problems for t-tests of statistical significance.
Multicollinearity is evident when the t-tests conclude that the coefficients are not statistically
different from zero, but the F-test is significant and the coefficient of determination (R²) is high.
High correlation among the independent variables suggests the presence of multicollinearity, but
low correlations do not rule it out.
The most common method of correcting multicollinearity is to remove independent variables
systematically until multicollinearity is minimized.
4.b.Model Misspecifications
Apart from checking the previously discussed problems in the regression, we should check for the
correct form of the regression as well.
The following three misspecifications can be present in the regression model:
Functional form of regression is misspecified:
The important variables could have been omitted from the regression model
Some regression variables may need the transformation (like conversion to the logarithmic scale)
Pooling of data from incorrect pools
The variables can be correlated with the error term in time-series models:
Lagged dependent variables are used as independent variables with serially correlated errors
A function of dependent variables is used as an independent variable because of incorrect dating
of the variables
Independent variables are sometimes measured with error
Other Time-Series Misspecification which leads to the nonstationarity of the variables:
Existence of relationships in time-series that results in patterns
Random walk relationships among the time series
These misspecifications in the regression model result in biased and inconsistent regression
coefficients, which leads to incorrect confidence intervals and hence to Type I or Type II errors.
Nonstationarity means that properties of the variables (like mean and variance) are not constant over time
4.b.The Economic meaning of a Regression Model
Consider the equation:
Rev_Growth = 4% + 0.75 × GDP_Growth + 0.5 × WPI_Infl
The economic meaning for this equation is given by the partial slopes or coefficients of the
variables.
If the GDP Growth rate was 1% higher, it translates into a 0.75% higher Revenue growth.
Similarly, if the WPI Inflation figures were 1% higher, it translates into a 0.5% higher revenue
growth.
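The interpretation can be verified directly from the equation (all figures in percentage points, as on the slide; the 6% and 5% inputs are illustrative):

```python
# Economic interpretation of Rev_Growth = 4% + 0.75*GDP_Growth + 0.5*WPI_Infl
def rev_growth(gdp_growth, wpi_infl):
    return 4.0 + 0.75 * gdp_growth + 0.5 * wpi_infl

base = rev_growth(6.0, 5.0)
print(base)  # 11.0

# a 1-point higher GDP growth adds exactly the GDP partial slope, 0.75
print(rev_growth(7.0, 5.0) - base)  # 0.75
```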
4.b.Case- Multivariate Linear Regression (Revisited)
Adam, an Analytics consultant, works with First Auto Insurance company. His manager gave him
data having Loss amount and policy related information and asked him to identify and
quantify the factors responsible for losses in a multivariate fashion. Adam has no knowledge
of running a multivariate regression.
Now suppose he approaches you and requests your help to complete the assignment. Let's
help Adam in carrying out the multivariate regression.
Case- Multivariate Linear Regression (Rules of Thumb)
In due course of helping Adam to complete his task, we will walk him through following steps:
Variable identification
Identifying the dependent (response) variable.
Identifying the independent (explanatory) variables.
Variable categorization (e.g. Numeric, Categorical, Discrete, Continuous etc.)
Creation of Data Dictionary
Response variable exploration
Distribution analysis
Percentiles
Variance
Frequency distribution
Outlier treatment
Identify the outliers/threshold limit
Cap/floor the values at the thresholds
Independent variables analyses
Identify the prospective independent variables (that can explain response variable)
Bivariate analysis of response variable against independent variables
Variable treatment /transformation
Grouping of distinct values/levels
Mathematical transformation e.g. log, splines etc.
Case- Multivariate Linear Regression (Rules of Thumb)
Heteroskedasticity
Check in a univariate manner by individual variables
Easy for univariate linear regression. Can be done manually.
Too cumbersome to do manually for multivariate case
The tools (R, SAS etc.) have in-built features to tackle it.
Fitting the regression
Check for correlation between independent variables
This is to take care of Multicollinearity
Fix Heteroskedasticty
By suitable transformation of the response variable (a bit tricky).
Using inbuilt features of statistical packages like R
Variable selection
Check for the most suitable transformed variable
Select the transformation giving the best fit
Reject the statistically insignificant variables
Fitting the regression
Analysis of results
Model comparison
Model performance check
R2
Lift/Gains chart and Gini coefficient
Actual vs Predicted comparison
Multivariate Linear Regression- Data
Snapshot of the data
Data description (known facts):
Auto insurance policy data
Contains policy holders and loss amount
information (variables)
Policy Number
Age
Years of Driving Experience
Number of Vehicles
Gender
Married
Vehicle Age
Fuel Type
Losses (Dependent/Response Variable)
Next step
Create the Data Dictionary
Multivariate Linear Regression- Data Dictionary
Multivariate Linear Regression- Data Dictionary
6 Married Marital status of the Policy holder Married, Single Categorical (binary)
Multivariate Linear Regression- Response Variable (Losses) SAS
Multivariate Linear Regression- Response Variable (Capped
Losses) SAS
Code to generate bivariate profiling
Multivariate Linear Regression- Bivariate Profiling SAS
Multivariate Linear Regression- Bivariate Profiling SAS
Multivariate Linear Regression- Bivariate Profiling SAS
Multivariate Linear Regression- Bivariate Profiling SAS
Multivariate Linear Regression- Bivariate Profiling SAS
Code to check heteroskedasticity
Multivariate Linear Regression- Heteroskedasticity (Age)
Multivariate Linear Regression- Heteroskedasticity (Gender)
Multivariate Linear Regression- Heteroskedasticity (Married)
Multivariate Linear Regression- Heteroskedasticity (Vehicle Age)
Multivariate Linear Regression- Heteroskedasticity (Fuel Type)
Multivariate Linear Regression- Variable Selection
Variable selection to be done on the basis of
Multicollinearity (correlation between independent variables)
Banding of variables e.g. whether to use Age or Age Band (also called custom bands)
Statistical significance of variables tested after performing above two steps
List of independent variables:
1. Age
2. Age Band
3. Years of Driving Experience
4. Number of Vehicles
5. Gender
6. Married
7. Vehicle Age
8. Vehicle Age Band
9. Fuel Type
Covariance and Correlation
Choosing b/w age and years of experience
Age and Years of Driving Experience are highly correlated (correlation coefficient = 0.9972). We can
use either of the variables in the regression.
Q: Which one to use and which one to reject?
Sol: Fit two separate models using one variable at a time. Check the goodness of fit (R² in this
case). The variable producing the higher R² gets accepted.
Code to make bands and choose
Age vs age band
Investigate whether to use Age or Age band
Fit regression independently using Age and Age Band
Before fitting the regression, Age Band needs to be converted from categorical to numerical form. Replace
Age Band values with the average age for the particular band.
2. Number of Vehicles
3. Gender
4. Married
5. Vehicle Age Band in the form of Average Vehicle Age of the band (selected out of Vehicle Age and Vehicle
Age Band).
6. Fuel Type
We will run regression in multivariate fashion and then select final list of variables by taking into
consideration statistical significance.
Multivariate Linear Regression- Categorical variable conversion
Categorical variables in binary form need to be converted to their numerical equivalents (0, 1)
1. Gender (F = 0 and M = 1)
3. Fuel Type (P = 0, D = 1)
Snapshot of the final data on which we will run the multivariate regression
Code for running the full regression
Insignificant variable: Number of Vehicles
Result after removing number of vehicles
Multivariate Linear Regression- Regression Equation
Predicted Losses = 625.0241 − 5.56069 × Avg Age + 50.88366 × Gender Dummy +
78.40224 × Married Dummy − 15.14453 × Avg Vehicle Age + 267.93268 × Fuel Type Dummy
Interpretation:
Coefficient        Value     Sign   Inference
Intercept          625.005
Avg Age            -5.561    -ve    The higher the age, the lower the loss
Gender Dummy       50.883    +ve    Average loss for males is higher than for females
Married Dummy      78.402    +ve    Average loss for singles is higher than for married
Avg Vehicle Age    -15.144   -ve    The older the vehicle, the lower the losses
Fuel Type Dummy    267.932   +ve    Losses are higher for the Diesel fuel type
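The fitted equation can be applied in code. The policyholder below is hypothetical, and the dummy encodings are assumptions consistent with the case (Gender: M = 1; Fuel: Diesel = 1; the Married Dummy encoding is assumed):

```python
# Scoring with the regression equation fitted in the case study.
def predicted_loss(avg_age, gender_d, married_d, avg_vehicle_age, fuel_d):
    return (625.0241
            - 5.56069 * avg_age
            + 50.88366 * gender_d
            + 78.40224 * married_d
            - 15.14453 * avg_vehicle_age
            + 267.93268 * fuel_d)

# hypothetical policyholder: age 30, male, married dummy 0, 5-year-old diesel car
loss = predicted_loss(30, 1, 0, 5, 1)
print(round(loss, 2))  # about 701.30
```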
Multivariate Linear Regression- Residual Plot
Residual plot:
Residuals calculated as Actual Capped Losses − Predicted Capped Losses
Residuals should have a uniform distribution, else there's some bias in the model
Except for a few observations (circled in red), the residuals are uniformly distributed
Code to generate scorecard
Scorecard Performance Checks- Rank Ordering
Thank you!
Pristine
702, Raaj Chambers, Old Nagardas Road, Andheri (E), Mumbai-400 069. INDIA
www.edupristine.com
Ph. +91 22 3215 6191