
Business Analytics

Predictive Modeling using Linear Regression

Pristine www.edupristine.com
Pristine
4. Correlation and Regression
I. Covariance and Correlation coefficient

II. Regression

Pristine 1
4a. Correlation
I. Covariance and Correlation coefficient
i. Definition
ii. Sample and population correlation
iii. Illustrative example
iv. Statistical significance test for sample correlation coefficient

Pristine 2
4a. Covariance and Correlation Coefficient
Covariance is a statistical measure of the degree to which two variables move together.
The sample covariance is calculated as:

$\text{cov}_{xy} = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$
Correlation coefficient
It is a measure of the strength of the linear relationship between two variables.
The correlation coefficient is given by:

$\rho_{xy} = \dfrac{\text{cov}_{xy}}{\sigma_x \sigma_y}$

Population correlation is denoted by $\rho$ (rho).
Sample correlation is denoted by r. It is an estimate of $\rho$ in the same way as
$s^2$ (sample variance) is an estimate of $\sigma^2$ (population variance) and
$\bar{X}$ (sample mean) is an estimate of $\mu$ (population mean)
Features of $\rho$ and r
Unit free and ranges between -1 and 1
The closer to -1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship
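
As an illustrative sketch (not from the original slides), the sample covariance and correlation can be computed directly in R; the vectors x and y below are made-up placeholder values.

```r
# Hypothetical example: sample covariance and correlation in R
x <- c(1.2, 0.8, -0.5, 2.1, 1.4)   # made-up returns of variable X
y <- c(0.9, 1.1, -0.7, 1.8, 1.0)   # made-up returns of variable Y

# Manual computation using the formulas above (n - 1 in the denominator)
n      <- length(x)
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
rho_xy <- cov_xy / (sd(x) * sd(y))

# Built-in equivalents
cov(x, y)   # should match cov_xy
cor(x, y)   # should match rho_xy
```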

Pristine 3
4a. Example: Covariance and Correlation of the S&P 500 and NASDAQ Returns given a sample

Closing Index Value
Date       | S&P 500  | NASDAQ
12/2/2011  | 1,244.28 | 2,626.93
12/5/2011  | 1,257.08 | 2,655.76
12/7/2011  | 1,261.01 | 2,649.21
12/8/2011  | 1,234.35 | 2,596.38
12/9/2011  | 1,255.19 | 2,646.85
12/12/2011 | 1,236.47 | 2,612.26

Pristine 4
4a. Solution: Covariance and Correlation of the S&P 500 and NASDAQ Returns given a sample

Date       | S&P 500 close | NASDAQ close | Return Xi | Return Yi | Deviation S&P 500 | Deviation NASDAQ | Product of deviations
12/2/2011  | 1,244.28      | 2,626.93     |     —     |     —     |        —          |        —         |        —
12/5/2011  | 1,257.08      | 2,655.76     |   1.03%   |   1.10%   |      1.14%        |      1.20%       |     0.0137%
12/7/2011  | 1,261.01      | 2,649.21     |   0.31%   |  -0.25%   |      0.43%        |     -0.15%       |    -0.0006%
12/8/2011  | 1,234.35      | 2,596.38     |  -2.11%   |  -1.99%   |     -2.00%        |     -1.89%       |     0.0378%
12/9/2011  | 1,255.19      | 2,646.85     |   1.69%   |   1.94%   |      1.80%        |      2.05%       |     0.0369%
12/12/2011 | 1,236.47      | 2,612.26     |  -1.49%   |  -1.31%   |     -1.38%        |     -1.21%       |     0.0166%

Mean of returns:    -0.12% (S&P 500), -0.10% (NASDAQ)    Total of products = 0.1044%
Standard deviation:  sx = 0.01630504, sy = 0.01633798
Covariance  = 0.000261013
Correlation = 0.979811179
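
A minimal sketch in R that reproduces the table's calculations from the closing index values; up to rounding, the results should match the covariance (about 0.000261) and correlation (about 0.98) shown above.

```r
# Closing index values from the table above
sp500  <- c(1244.28, 1257.08, 1261.01, 1234.35, 1255.19, 1236.47)
nasdaq <- c(2626.93, 2655.76, 2649.21, 2596.38, 2646.85, 2612.26)

# Simple (arithmetic) daily returns: 5 observations from 6 closing values
x <- diff(sp500)  / head(sp500,  -1)
y <- diff(nasdaq) / head(nasdaq, -1)

cov(x, y)   # sample covariance  (~0.000261)
cor(x, y)   # sample correlation (~0.9798)
```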

Pristine 5
4a. Examples of Approximate r Values

[Scatter plots illustrating samples with r = -1, r = -0.6, r = 0, r = +0.3, and r = +1]
Pristine 6
4.b.Case- Multivariate Linear Regression (Revisited)
Adam, an analytics consultant, works with First Auto Insurance Company. His manager gave him data
containing the "Loss" amount and policy-related information and asked him to "identify" and "quantify" the
factors responsible for losses in a multivariate fashion. Adam has no knowledge of running a
multivariate regression.

Now suppose he approaches you and requests your help to complete the assignment. Let's help
Adam carry out the multivariate regression.

Pristine 7
4a. Testing the significance of the correlation coefficient

Test whether the correlation between the two variables in the population is equal to zero
Null hypothesis, H0: $\rho = 0$
Assuming that the two populations are normally distributed, we can use a t-test to determine
whether the null hypothesis should be rejected.
The test statistic is computed from the sample correlation, r, with n − 2 degrees of freedom (df):

$t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$

The calculated test statistic is compared with the critical t-value for the appropriate degrees of
freedom and level of significance
Reject H0 if t > +t_critical or t < −t_critical

Example: Correlation of the S&P 500 and NASDAQ Returns given a sample
n = 5, r = 0.979811179, df = 5 − 2 = 3
Calculated t = 8.4885
Two-tailed t_critical at the 5% significance level (df = 3) = 3.1824
Hence, reject H0 at the 95% confidence level
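
The same test can be run in R, either by plugging into the t formula or with cor.test(); this sketch assumes x and y are the return series from the previous example.

```r
r      <- cor(x, y)
n      <- length(x)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)   # ~8.49
t_crit <- qt(0.975, df = n - 2)             # two-tailed 5% critical value
t_stat > t_crit                             # TRUE -> reject H0: rho = 0

# Built-in equivalent (reports the same t statistic plus a p-value)
cor.test(x, y)
```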

Pristine 8
4b. Regression

I. Explain what is meant by response and explanatory variables.


II. State the usual simple regression model (with a single explanatory variable).
III. State and explain the least squares estimates of the slope and intercept parameters in a simple
linear regression model.
IV. Calculate R2 (coefficient of determination) and describe its use to measure the goodness of fit of
a linear regression model.
V. Use a fitted linear relationship to predict a mean response or an individual response with
confidence limits.
VI. Use residuals to check the suitability and validity of a linear regression model.
VII. State the usual multiple linear regression model
VIII. Discuss issues in linear regression
i. Heteroskedasticity
ii. Multicollinearity
IX. Detailed case study on multivariate regression by using
I. MS Excel
II. R software

Pristine 9
4.b.The Million Dollar Question

Hours Mumbai Delhi Chennai Kolkata Bangalore Pune Hyderabad Online Singapore Middle East
10 20 7 5 13 10 11 14 9 7 12
20 8 24 34 24 16 19 20 20 25 12
30 19 8 16 37 62 29 33 25 36 30
40 67 31 44 43 32 19 38 27 49 35
50 36 46 78 57 36 82 55 33 53 41
60 67 54 90 58 23 45 62 67 58 78
70 56 68 93 71 76 72 68 81 70 57
80 81 89 78 86 45 68 83 58 90 98

If I study for five more hours will it actually increase my marks?

Pristine 10
4.b.The Population

Hours of Study Versus Marks

[Scatter plot of the population: Hours of Study (0–90) on the x-axis versus Marks in Test (0–120) on the y-axis]

Can we draw a trend line to predict this relationship?

Pristine 11
4.b.Introduction to Regression Analysis

Regression analysis is used to:


Predict the value of a dependent variable based on the value of at least one independent
variable
Explain the impact of changes in an independent variable on the dependent variable
Dependent variable: the variable we wish to explain, usually denoted by Y

Independent variable: the variable used to explain the dependent variable. Denoted by X

Pristine 12
4.b.Simple Linear Regression Model

Only one independent variable, x


Relationship between x and y is described by a linear function
Changes in y are assumed to be caused by changes in x

Pristine 13
4.b.Assumptions
1. A linear relationship exists between the dependent and the independent variable.

2. The independent variable is uncorrelated with the residuals.

3. The expected value of the residual term is zero:

$E(\varepsilon) = 0$

4. The variance of the residual term is constant for all observations (Homoskedasticity):

$E(\varepsilon_i^2) = \sigma^2$

5. The residual term is independently distributed; that is, the residual for one observation is not
correlated with that of another observation:

$E(\varepsilon_i \varepsilon_j) = 0,\ j \neq i$

6. The residual term is normally distributed.

Pristine 14
4.b.Types of Regression Models

[Scatter plots illustrating four cases: Positive Linear Relationship, Negative Linear Relationship, Relationship NOT Linear, and No Relationship]

Pristine 15
4.b.Population Linear Regression

$Y = \beta_0 + \beta_1 X + u$

[Diagram: the population regression line with intercept $\beta_0$ and slope $\beta_1$; for a given $x_i$, an individual observation (a person's marks) differs from the predicted value of Y by the random error $u_i$]
Pristine 16
4.b.Population Regression Function

$Y = \beta_0 + \beta_1 X + u$

where Y is the dependent variable, $\beta_0$ the population intercept, $\beta_1$ the population slope coefficient, X the independent variable, and u the random error term (residual). $\beta_0 + \beta_1 X$ is the linear component and u is the random error component.

But can we actually get this equation?

If yes, what information will we need?

Pristine 17
4.b.Information that we actually have

Hours Mumbai
10 20
20 8
30 19
40 67
50 36
60 67
70 56
80 81

Pristine 18
4.b.Sample Regression Function

$y = b_0 + b_1 x + e$

[Diagram: the sample regression line with intercept $b_0$ and slope $b_1$; for a given $x_i$, the observed value of y differs from the predicted value by the residual $e_i$]
Pristine 19
4.b.Sample Regression Function

$y_i = b_0 + b_1 x_i + e_i$

where $b_0$ is the estimate of the regression intercept, $b_1$ is the estimate of the regression slope, $x_i$ is the independent variable, and $e_i$ is the error term.

Notice the similarity with the Population Regression Function.

Can we do something about the error term?

Pristine 20
4.b.The error term (residual)
Represents the influence of all the variables that we have not accounted for in the equation
It is the difference between the actual y values and the y values predicted by the Sample Regression Line
Wouldn't it be good if we were able to reduce this error term?
What are we trying to achieve by Sample Regression?

Pristine 21
4.b.Our Objective

Population Regression Line (PRL): $Y = \beta_0 + \beta_1 X + u$

To predict the PRL from the Sample Regression Line (SRL):

$\hat{y}_i = b_0 + b_1 x_i$

Pristine 22
4.b.One method to find b0 and b1

Method of Ordinary Least Squares (OLS)


b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared
residuals

$\sum e^2 = \sum (y - \hat{y})^2 = \sum \big(y - (b_0 + b_1 x)\big)^2$

Are there any advantages of minimizing the squared errors?


Why don't we take the sum?
Why don't we take absolute values instead?
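
A minimal sketch of the OLS estimates in R, using the hours-versus-marks numbers for Mumbai from the earlier table; lm() performs the same minimization internally.

```r
hours <- c(10, 20, 30, 40, 50, 60, 70, 80)
marks <- c(20,  8, 19, 67, 36, 67, 56, 81)   # Mumbai column from the earlier slide

# Closed-form OLS estimates that minimize the sum of squared residuals
b1 <- sum((hours - mean(hours)) * (marks - mean(marks))) / sum((hours - mean(hours))^2)
b0 <- mean(marks) - b1 * mean(hours)

fit <- lm(marks ~ hours)   # the same estimates via R's built-in OLS
coef(fit)
```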

Pristine 23
4.b.OLS Regression Properties

The sum of the residuals from the least squares regression line is 0:

$\sum (y - \hat{y}) = 0$

The sum of the squared residuals is a minimum:

$\min \sum (y - \hat{y})^2$

The simple regression line always passes through the mean of the y variable and the mean of
the x variable

The least squares coefficients are unbiased estimates of $\beta_0$ and $\beta_1$

Pristine 24
4.b.Interpretation of the Slope and the Intercept

b0 is the estimated average value of y when the value of x is zero. More often than not it does
not have a physical interpretation
b1 is the estimated change in the average value of y as a result of a one-unit change in x

$\hat{Y} = b_0 + b_1 X$

[Diagram: fitted regression line with intercept $b_0$ and slope $b_1$]

Pristine 25
4.b.Hypothesis Testing: Two Variable Model
How do we know whether the values of b0 and b1 that we have found are actually meaningful?
Is it actually possible that our sample was a random sample and it has given us a totally wrong
regression line?
We do know a lot about the sample error term "e" but what do we know about the error terms
"u" of the Population Regression Function?
How do we proceed from here?

Pristine 26
4.b.Assumptions about "u"
The underlying relationship between the X variable and the Y variable is linear
For a given value of Xi, the sum of the error terms is equal to 0
The error term is uncorrelated with the explanatory variable X
Error values are normally distributed for any given value of X: the probability distribution of the errors for a given Xi is normal
The probability distribution of the errors for different Xi has constant variance (homoscedasticity)
Error values u for given Xi are statistically independent; their covariance is zero: $Cov(u_{x_1}, u_{x_2}) = 0$

[Diagram: normal error distributions of equal variance centred on the regression line at $x_1$ and $x_2$]

Once we make these assumptions about u, we are able to estimate the variance and
standard errors of b0 and b1; this is possible because of the properties of the OLS method
(beyond the scope of this lecture)

Pristine 27
4.b.Standard Error of Estimate (SEE)
The standard deviation of the variation of observations around the regression line is estimated
by:

$s_u = \sqrt{\dfrac{RSS}{n-k-1}}$

Where:
RSS = Residual Sum of Squares (summation of $e^2$)
n = sample size
k = number of independent variables in the model

Note: When k = 1 (simple regression), the sample standard error of the estimate is

$s_u = \sqrt{\dfrac{RSS}{n-2}}$

Pristine 28
4.b.Comparing Standard Errors

Variation of observed y values around the regression line ($s_u$) versus variation in the slope of regression lines from different possible samples ($s_{b_1}$)

[Panels: small $s_u$ vs. large $s_u$; small $s_{b_1}$ vs. large $s_{b_1}$]

Pristine 29
4.b.Inference about the Slope: t-Test
t-test for a population slope
Is there a linear relationship between x and y?
Null and alternative hypotheses
H0: $\beta_1 = 0$ (no linear relationship)
H1: $\beta_1 \neq 0$ (a linear relationship does exist)
Test statistic:

$t = \dfrac{b_1 - \beta_1}{s_{b_1}}$,  d.f. = n − 2

The null hypothesis can be rejected if either of the following is true:
t > t_c or t < −t_c
where:
$b_1$ = sample regression slope coefficient
$\beta_1$ = hypothesized slope
$s_{b_1}$ = estimator of the standard error of the slope
$t_c$ = the critical t value
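
summary() of a fitted lm object reports this t statistic and its p-value for each coefficient; a manual version for the slope is sketched below, continuing the hours/marks example.

```r
fit  <- lm(marks ~ hours)                      # simple regression from the earlier sketch
smry <- summary(fit)
smry$coefficients                              # estimates, std. errors, t values, p-values

b1   <- coef(fit)["hours"]
sb1  <- smry$coefficients["hours", "Std. Error"]
t_b1 <- (b1 - 0) / sb1                         # test H0: beta1 = 0
t_crit <- qt(0.975, df = length(hours) - 2)
abs(t_b1) > t_crit                             # TRUE -> reject H0
```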

Pristine 30
4.b.Confidence Interval for 'y'
The confidence interval for the predicted value of Y is given by:

$\hat{Y} \pm (t_c \times s_f)$

where:
$\hat{Y}$ = predicted Y value (dependent variable)
n − 2 = degrees of freedom
$t_c$ = the critical t value
$s_f$ = the standard error of the forecast

* SE of forecast is NOT Standard Error of Coefficient Estimate or Standard Error of Estimate
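
In R, predict() returns these limits directly; interval = "prediction" uses the standard error of the forecast for an individual response, while interval = "confidence" gives the narrower band for the mean response. Sketch continues the hours/marks fit from the earlier example.

```r
new_x <- data.frame(hours = 55)

# Confidence limits for the mean response at hours = 55
predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

# Wider limits for an individual (forecast) response at hours = 55
predict(fit, newdata = new_x, interval = "prediction", level = 0.95)
```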

Pristine 31
4.b.The Confidence Interval for a Regression Coefficient
The confidence interval for the regression coefficient $b_1$ is given by:

$b_1 \pm (t_c \times s_{b_1})$

where:
$b_1$ = estimated regression slope coefficient
n − 2 = degrees of freedom
$t_c$ = the critical t value
$s_{b_1}$ = the standard error of the regression coefficient

Pristine 32
4.b.Explained and Unexplained Variation

[Diagram: for each observation, the deviation of $y_i$ from $\bar{y}$ splits into an explained part ($\hat{y}_i - \bar{y}$) and an error part ($y_i - \hat{y}_i$)]

SST = Total Sum of Squares = $\sum (y_i - \bar{y})^2$
SSE = Sum of Squared Errors = $\sum (y_i - \hat{y}_i)^2$
RSS = Regression Sum of Squares = $\sum (\hat{y}_i - \bar{y})^2$

Pristine 33
4.b.Explained and Unexplained Variation (Cont)

SST = Total sum of squares


Measures the variation of the $y_i$ values around their mean $\bar{y}$
SSE = Sum of squared errors
Variation attributable to factors other than the relationship between x and y
SSR = Regression sum of squares
Explained variation attributable to the relationship between x and y

Pristine 34
4.b.Explained and Unexplained Variation (Cont)

Total variation is made up of two parts:

SST = SSE + RSS

Total Sum of Squares = Sum of Squared Errors + Regression Sum of Squares (also known as SSR, the Sum of Squares due to Regression)

$SST = \sum (y - \bar{y})^2$    $SSE = \sum (y - \hat{y})^2$    $SSR = \sum (\hat{y} - \bar{y})^2$

Where:
$\bar{y}$ = average value of the dependent variable
$y$ = observed values of the dependent variable
$\hat{y}$ = estimated value of y for the given x value

Pristine 35
4.b.Coefficient of Determination, R2

The coefficient of determination is the portion of the total variation in the dependent variable
that is explained by variation in the independent variable
The coefficient of determination is also called R-squared and is denoted as R2

$R^2 = \dfrac{SSR}{SST}$,  where $0 \leq R^2 \leq 1$

Pristine 36
4.b.Coefficient of Determination, R2 (Cont)

Coefficient of determination:

$R^2 = \dfrac{SSR}{SST} = \dfrac{\text{sum of squares explained by regression}}{\text{total sum of squares}}$

Note: In the single independent variable case, the coefficient of determination is

$R^2 = r^2$

Where:
$R^2$ = coefficient of determination
r = simple correlation coefficient
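
A quick check of the identities above in R, continuing the hours/marks fit: R² computed from the sums of squares equals the squared simple correlation.

```r
y_hat <- fitted(fit)
SST <- sum((marks - mean(marks))^2)
SSE <- sum((marks - y_hat)^2)
SSR <- sum((y_hat - mean(marks))^2)

SSR / SST                 # R-squared from the sums of squares
cor(hours, marks)^2       # equals r^2 in the single-variable case
summary(fit)$r.squared    # same value reported by lm
```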

Pristine 37
4.b.Examples of Approximate R2 Values

$R^2 = 1$

Perfect linear relationship between x and y: 100% of the variation in y is explained by variation in x

[Scatter plots with all points lying exactly on an upward- or downward-sloping line]

Pristine 38
4.b.Examples of Approximate R2 Values (Cont)

$0 < R^2 < 1$

Weaker linear relationship between x and y: some but not all of the variation in y is explained by variation in x

[Scatter plots with points loosely clustered around a line]

Pristine 39
4.b.Examples of Approximate R2 Values (Cont)

$R^2 = 0$

No linear relationship between x and y: the value of y does not depend on x (none of the variation in y is explained by variation in x)

[Scatter plot with points showing no pattern]

Pristine 40
4.b.Limitations of Regression Analysis
Parameter Instability - This happens in situations where correlations change over a period of
time. This is very common in financial markets where economic, tax, regulatory, and political
factors change frequently.
Public knowledge of a specific regression relation may cause a large number of people to react in
a similar fashion towards the variables, negating its future usefulness.
If any regression assumptions are violated, predictions of the dependent variable and hypothesis tests
will not be valid.

Pristine 41
4.b.General Multiple Linear Regression Model
In simple linear regression, the dependent variable was assumed to depend on only one
(independent) variable
In the general multiple linear regression model, the dependent variable derives its value from two or
more variables.
The general multiple linear regression model takes the following form:

$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki} + \varepsilon_i$

where:
$Y_i$ = ith observation of dependent variable Y
$X_{ki}$ = ith observation of kth independent variable X
$b_0$ = intercept term
$b_k$ = slope coefficient of kth independent variable
$\varepsilon_i$ = error term of ith observation
n = number of observations
k = total number of independent variables

Pristine 42
4.b.Estimated Regression Equation
As we calculated the intercept and the slope coefficient in case of simple linear regression by
minimizing the sum of squared errors, similarly we estimate the intercept and slope coefficient in
multiple linear regression.
$\text{Sum of Squared Errors} = \sum_{i=1}^{n} \varepsilon_i^2$

is minimized and the slope coefficients are estimated.

The resultant estimated equation becomes:

$\hat{Y}_i = \hat{b}_0 + \hat{b}_1 X_{1i} + \hat{b}_2 X_{2i} + \cdots + \hat{b}_k X_{ki}$

Now the error in the ith observation can be written as:

$\varepsilon_i = Y_i - \hat{Y}_i = Y_i - (\hat{b}_0 + \hat{b}_1 X_{1i} + \hat{b}_2 X_{2i} + \cdots + \hat{b}_k X_{ki})$
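
A sketch of fitting and inspecting a multiple regression in R; the data frame df and the column names (losses, age, vehicle_age, gender) are placeholders for whatever dataset is at hand, not names from the original slides.

```r
# Hypothetical data frame with one response and several predictors
# df <- read.csv("policies.csv")
fit_multi <- lm(losses ~ age + vehicle_age + gender, data = df)

coef(fit_multi)        # b0 and the partial slope coefficients b1, ..., bk
fitted(fit_multi)      # predicted values Y_hat_i
residuals(fit_multi)   # errors e_i = Y_i - Y_hat_i
```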

Pristine 43
4.b.Interpreting the Estimated Regression Equation
Intercept Term ($b_0$): the value of the dependent variable when the value of all independent
variables is zero:

$b_0$ = value of Y when $X_1 = X_2 = \cdots = X_k = 0$

Slope coefficient ($b_k$): the change in the dependent variable from a unit change in the
corresponding independent variable ($X_k$), keeping all other independent variables constant.
In reality, when the value of an independent variable changes by one unit, the change in the
dependent variable is not equal to the slope coefficient but also depends on the correlation among
the independent variables.
Therefore, the slope coefficients are also called partial slope coefficients.

Pristine 44
4.b.Assumptions of Multiple Regression Model
There exists a linear relationship between the dependent and independent variables.

The expected value of the error term, conditional on the independent variables is zero.

The error terms are homoskedastic, i.e. the variance of the error terms is constant for all the
observations.

The expected value of the product of error terms is always zero, which implies that the error
terms are uncorrelated with each other.

The error term is normally distributed.

The independent variables do not have any linear relationship with each other.

Pristine 45
4.b.Hypothesis Testing of Coefficients
The values of the slope coefficients do not, by themselves, tell us anything about their significance in
explaining the dependent variable.
Even an unrelated variable, when regressed, would give some value of the slope coefficient.
To exclude the cases where the independent variables do not significantly explain the dependent
variable, we need hypothesis tests of the coefficients to check whether they contribute
to explaining the dependent variable significantly or not.
The t-statistic is used to check the significance of the coefficients.
The t-statistic used for the hypothesis testing is the same as that used in the hypothesis testing of the
coefficient in simple linear regression.
The null and alternative hypotheses to check the statistical significance of $b_k$ are:
Null Hypothesis (H0): $b_k = 0$
Alternative Hypothesis (Ha): $b_k \neq 0$
The t-statistic, with (n − k − 1) degrees of freedom, for the hypothesis test of the coefficient $b_k$ is:

$t = \dfrac{\hat{b}_k - b_k}{s_{\hat{b}_k}}$

If the value of the t-statistic lies within the confidence interval, H0 cannot be rejected
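
summary() on a multiple regression reports, for every coefficient, the t statistic (with n − k − 1 df) and its p-value, so this test does not have to be done by hand; a sketch assuming the hypothetical fit_multi object from the earlier example.

```r
smry <- summary(fit_multi)
smry$coefficients        # Estimate, Std. Error, t value, Pr(>|t|) for each b_k

# Reject H0: b_k = 0 at the 5% level when the p-value is below 0.05
which(smry$coefficients[, "Pr(>|t|)"] < 0.05)
```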

Pristine 46
4.b.Confidence Interval for the Population Value
The confidence interval for a regression coefficient is given by:

$b_j \pm (t_c \times s_{b_j})$

Where:
$t_c$ is the critical t-value, and
$s_{b_j}$ is the standard error of the regression coefficient

Pristine 47
4.b.Predicted Dependent Variable
The regression equation can be used for making predictions about the dependent variable by
using forecasted values of the independent variables.

$\hat{Y}_i = \hat{b}_0 + \hat{b}_1 \hat{X}_{1i} + \hat{b}_2 \hat{X}_{2i} + \cdots + \hat{b}_k \hat{X}_{ki}$

Where,
$\hat{Y}_i$ is the predicted value of the dependent variable
$\hat{b}_j$ is the estimated partial slope coefficient of the jth independent variable
$\hat{X}_{ji}$ is the forecasted value of the jth independent variable

Pristine 48
4.b.Analysis of Variance (ANOVA)
Analysis of variance is a statistical method for analyzing the variability of the data by breaking the
variability into its constituents.
A typical ANOVA table looks like:
Source of Variability  | DoF       | Sum of Squares  | Mean Sum of Squares
Regression (Explained) | k         | RSS             | MSR = RSS / k
Error (Unexplained)    | n − k − 1 | SSE             | MSE = SSE / (n − k − 1)
Total                  | n − 1     | SST = RSS + SSE |

From the above summary (ANOVA table) we can calculate:

Standard Error of Estimate (SEE) = $\sqrt{MSE} = \sqrt{\dfrac{SSE}{n-k-1}}$

Coefficient of determination ($R^2$) = $\dfrac{\text{Total Variation (SST)} - \text{Unexplained Variation (SSE)}}{\text{Total Variation (SST)}} = \dfrac{\text{Explained Variation (RSS)}}{\text{Total Variation (SST)}}$
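
anova() in R prints a comparable decomposition, and the derived quantities can be computed directly, as sketched below for the hypothetical fit_multi object (n observations, k predictors).

```r
anova(fit_multi)                     # per-term sums of squares and mean squares

y   <- model.response(model.frame(fit_multi))
SSE <- sum(residuals(fit_multi)^2)
SST <- sum((y - mean(y))^2)
RSS <- SST - SSE                     # regression (explained) sum of squares
k   <- length(coef(fit_multi)) - 1
n   <- length(y)

SEE <- sqrt(SSE / (n - k - 1))       # standard error of estimate
R2  <- RSS / SST                     # coefficient of determination
```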

Pristine 49
4.b.F-Statistic
An F-test assesses how well the dependent variable is explained by the independent variables
collectively.
With multiple independent variables, the F-test tells us whether the independent variables jointly
explain a significant part of the variation in the dependent variable.

The F-statistic is given as:

$F = \dfrac{MSR}{MSE} = \dfrac{RSS / k}{SSE / (n - k - 1)}$

Where:
MSR: Mean regression sum of squares
MSE: Mean squared error
n: Number of observations
k: Number of independent variables
Pristine 50
4.b.F-statistic contd.
Decision rule for the F-test: Reject H0 if the F-statistic > Fc (critical value)
The numerator of the F-statistic has "k" degrees of freedom and the denominator has
"n − k − 1" degrees of freedom
If H0 is rejected, then at least one of the independent variables is significantly different from
zero.
For example, if pocket money were regressed on household income and household
expenses, rejecting H0 would imply that at least one of the two explains the variation in pocket money.

The F-test is always a one-tailed test when testing the hypothesis that the coefficients are
simultaneously equal to zero
Pristine 51
4.b.Coefficient of determination (R2) and Adjusted R2
The coefficient of determination ($R^2$) can also be used to test the significance of the coefficients
collectively, apart from using the F-test:

$R^2 = \dfrac{SST - SSE}{SST} = \dfrac{RSS}{SST} = \dfrac{\text{Sum of Squares explained by regression}}{\text{Total Sum of Squares}}$

The drawback of using the coefficient of determination is that its value always increases as the number of
independent variables is increased, even if the marginal contribution of the incoming variable is
statistically insignificant.
To take care of this drawback, the coefficient of determination is adjusted for the number of
independent variables used. This adjusted measure is called adjusted $R^2$.
Adjusted $R^2$ is given by the following formula:

$R_a^2 = 1 - \left(\dfrac{n-1}{n-k-1}\right)\left(1 - R^2\right)$

where
n = Number of Observations
k = Number of Independent Variables
$R_a^2$ = Adjusted $R^2$
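
Adjusted R² is reported by summary(); the manual formula is sketched below using the n, k, and R2 values from the previous ANOVA sketch.

```r
R2_adj <- 1 - ((n - 1) / (n - k - 1)) * (1 - R2)

summary(fit_multi)$adj.r.squared   # should match R2_adj
```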

Pristine 52
4.b.Representing Qualitative Factors
How can we represent Qualitative factors in a regression equation?
By using 'dummy variables': variables that take a value of either 1 or 0, depending on whether a condition is
true or false.

If we wanted to consider the spike in soft drink sales in the summer, we might have a regression
equation:

$Rev_t = 10{,}000 + 2{,}000\,t + 50{,}000\,S$

Here,
$S = 1$ if it's summer, and $S = 0$ if it's not summer
If there are n mutually exclusive and exhaustive classes, they can be represented by n-1 dummy
variables. This is derived from the concept of degrees of freedom.
For example, to represent the 4 stages of the business cycle, we can use 3 dummy variables.
The fourth variable would be represented by zeros for all three dummy variables.
We do not use 4 variables as that would indicate a linear relationship between all 4 variables.
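
A sketch of the soft-drink example in R; the quarterly figures are invented purely to show how a 0/1 summer dummy enters the regression (lm() would build the same dummy automatically from a factor).

```r
# Hypothetical quarterly data: t is a time trend, summer is a 0/1 dummy
t      <- 1:12
summer <- as.integer(t %% 4 == 2)                                    # flag one quarter per year as "summer"
rev    <- 10000 + 2000 * t + 50000 * summer + rnorm(12, sd = 3000)   # simulated revenue

fit_dummy <- lm(rev ~ t + summer)
coef(fit_dummy)   # intercept, trend, and the summer uplift (~50,000 here by construction)
```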

Pristine 53
4.b.Heteroskedasticity
When the requirement of a constant variance is violated, we have a condition of
heteroskedasticity.

[Residual plot: error u on the vertical axis versus predicted y on the horizontal axis]

We can diagnose heteroskedasticity by plotting the residual against the predicted y.

Pristine 54
4.b.Unconditional and Conditional Heteroskedasticity
Presence of heteroskedasticity in the data is the violation of the assumption about the constant
variance of the residual term.
Heteroskedasticity takes the following two forms, unconditional and conditional.
Unconditional heteroskedasticity is present when the variance of the residual terms is not
related to the values of the independent variables.
Unconditional heteroskedasticity doesn't pose a problem in regression analysis, as the
variance doesn't change systematically.
Conditional heteroskedasticity poses problems in regression analysis, as the residuals are
systematically related to the independent variables.

[Diagram: fitted line $\hat{Y} = b_0 + b_1 X$ with a low variance of residual terms at one end of the X range and a high variance of residual terms at the other]
Pristine 55
4.b.Detecting Heteroskedasticity
Heteroskedasticity can be detected either by viewing the scatter plots as discussed in the previous
case or by Breusch-Pagan chi-square test.
In the Breusch-Pagan chi-square test, the squared residuals are regressed on the independent variables to
check whether the independent variables explain a significant proportion of the squared residuals
or not.
If the independent variables explain a significant proportion of the squared residuals, then we
conclude that conditional heteroskedasticity is present; otherwise not.
The Breusch-Pagan test statistic follows a chi-square distribution with k degrees of freedom, where k is
the number of independent variables:

$BP\ \chi^2\ \text{test statistic} = n \times R^2_{resid}$

where:
n: number of observations
$R^2_{resid}$: coefficient of determination when the squared residuals are regressed on the independent variables

Conditional heteroskedasticity can be corrected by using White-corrected standard errors,
which are also called heteroskedasticity-consistent standard errors
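
Both steps are available in standard R packages: bptest() from lmtest runs the Breusch-Pagan test, and coeftest() with a vcovHC covariance matrix from sandwich gives White-corrected (heteroskedasticity-consistent) standard errors. Sketch assumes the hypothetical fit_multi object from earlier.

```r
library(lmtest)    # bptest(), coeftest()
library(sandwich)  # vcovHC()

bptest(fit_multi)                                    # Breusch-Pagan chi-square test

# Same coefficients, but with White-corrected (HC) standard errors
coeftest(fit_multi, vcov = vcovHC(fit_multi, type = "HC0"))
```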
Pristine 56
4.b.Correcting for Heteroskedasticity

There are two methods for correcting the effects of conditional heteroskedasticity
Robust Standard Errors
Correct the standard errors of the linear regression model's estimated coefficients to account
for conditional heteroskedasticity
Generalized Least Squares
Modifies the original equation in an attempt to eliminate heteroskedasticity.
Statistical packages are available for computing robust standard errors.

Pristine 57
4.b.Multicollinearity
Another significant problem faced in the Regression Analysis is when the independent variables or
the linear combinations of the independent variables are correlated with each other.
This correlation among the independent variables is called Multicollinearity which creates
problems in conducting t-statistic for statistical significance.
Multicollinearity is evident when the t-test concludes that the coefficients are not statistically
different from zero but the F-test is significant and the coefficient of determination (R2) is high.
High correlation among the independent variables suggests the presence of multicollinearity, but
low values of correlation do not rule out the possibility that multicollinearity is present.
The most common method of correcting multicollinearity is by systematically removing the
independent variable until multicollinearity is minimized.

Presence of Multicollinearity leads to TYPE-II errors
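
One common diagnostic, not mentioned on the slide, is the variance inflation factor: vif() from the car package flags predictors whose VIF is high (rules of thumb: above roughly 5-10). Sketch assumes the hypothetical fit_multi object and placeholder column names.

```r
library(car)   # vif()

cor(df[, c("age", "vehicle_age")])   # pairwise correlations among predictors (placeholder columns)
vif(fit_multi)                       # large VIF values suggest problematic multicollinearity
```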

Pristine 58
4.b.Model Misspecifications
Apart from checking the previously discussed problems in the regression, we should check for the
correct form of the regression as well.
The following three types of misspecification can be present in the regression model:
Functional form of regression is misspecified:
The important variables could have been omitted from the regression model
Some regression variables may need the transformation (like conversion to the logarithmic scale)
Pooling of data from incorrect pools
The variables can be correlated with the error term in time-series models:
Lagged dependent variables are used as independent variables with serially correlated errors
A function of dependent variables is used as an independent variable because of incorrect dating
of the variables
Independent variables are sometimes measured with error
Other Time-Series Misspecification which leads to the nonstationarity of the variables:
Existence of relationships in time-series that results in patterns
Random walk relationships among the time series
These misspecifications in the regression model result in biased and inconsistent regression
coefficients, which further lead to incorrect confidence intervals, leading to TYPE-I or TYPE-II errors.

Nonstationarity means that the properties (like mean and variance) of the variables are not constant over time

Pristine 59
4.b.The Economic meaning of a Regression Model
Consider the equation:
Rev_Growth = 4% + 0.75 × GDP_Growth + 0.5 × WPI_Infl

The economic meaning for this equation is given by the partial slopes or coefficients of the
variables.
If the GDP Growth rate was 1% higher, it translates into a 0.75% higher Revenue growth.
Similarly, if the WPI Inflation figures were 1% higher, it translates into a 0.5% higher revenue
growth.

Pristine 60
4.b.Case- Multivariate Linear Regression (Revisited)

Adam, an analytics consultant, works with First Auto Insurance Company. His manager gave him
data containing the Loss amount and policy-related information and asked him to identify and
quantify the factors responsible for losses in a multivariate fashion. Adam has no knowledge
of running a multivariate regression.
Now suppose he approaches you and requests your help to complete the assignment. Let's
help Adam carry out the multivariate regression.

Pristine 61
Case- Multivariate Linear Regression (Rules of Thumb)
In due course of helping Adam to complete his task, we will walk him through following steps:
Variable identification
Identifying the dependent (response) variable.
Identifying the independent (explanatory) variables.
Variable categorization (e.g. Numeric, Categorical, Discrete, Continuous etc.)
Creation of Data Dictionary
Response variable exploration
Distribution analysis
Percentiles
Variance
Frequency distribution
Outlier treatment
Identify the outliers/threshold limit
Cap/floor the values at the thresholds
Independent variables analyses
Identify the prospective independent variables (that can explain response variable)
Bivariate analysis of response variable against independent variables
Variable treatment /transformation
Grouping of distinct values/levels
Mathematical transformation e.g. log, splines etc.

Pristine 62
Case- Multivariate Linear Regression (Rules of Thumb)
Heteroskedasticity
Check in a univariate manner by individual variables
Easy for univariate linear regression. Can be done manually.
Too cumbersome to do manually for multivariate case
The tools (R, SAS etc.) have in-built features to tackle it.
Fitting the regression
Check for correlation between independent variables
This is to take care of Multicollinearity
Fix Heteroskedasticty
By suitable transformation of the response variable (a bit tricky).
Using inbuilt features of statistical packages like R
Variable selection
Check for the most suitable transformed variable
Select the transformation giving the best fit
Reject the statistically insignificant variables
Fitting the regression
Analysis of results
Model comparison
Model performance check
R2
Lift/Gains chart and Gini coefficient
Actual vs Predicted comparison
Pristine 63
Multivariate Linear Regression- Data
Snapshot of the data
Data description (known facts):
Auto insurance policy data
Contains policy holders and loss amount
information (variables)
Policy Number
Age
Years of Driving Experience
Number of Vehicles
Gender
Married
Vehicle Age
Fuel Type
Losses (Dependent/Response Variable)
Next step
Create the Data Dictionary

Pristine 64
Multivariate Linear Regression- Data Dictionary

Sl # | Variable Name               | Variable Description                             | Values Stored | Variable Type
1    | Policy Number               | Unique Policy Number                             | ?             | ?
2    | Age                         | Age of Policy holder                             | ?             | ?
3    | Years of Driving Experience | Years of Driving Experience of the Policy holder | ?             | ?
4    | Number of Vehicles          | Number of Vehicles insured under the policy      | ?             | ?
5    | Gender                      | Gender of the Policy holder                      | ?             | ?
6    | Married                     | Marital status of the Policy holder              | ?             | ?
7    | Vehicle Age                 | Age of vehicle insured under the policy          | ?             | ?
8    | Fuel Type                   | Fuel type of the vehicle insured                 | ?             | ?
9    | Losses                      | Insurance amount claimed under the policy        | ?             | ?

Pristine 65
Multivariate Linear Regression- Data Dictionary

Sl # | Variable Name               | Variable Description                             | Values Stored                       | Variable Type
1    | Policy Number               | Unique Policy Number                             | Unique value identifying the policy | Identifier
2    | Age                         | Age of Policy holder                             | 16, 17, ..., 70                     | Numerical (Discrete)
3    | Years of Driving Experience | Years of Driving Experience of the Policy holder | 0, 1, ..., 53                       | Numerical (Discrete)
4    | Number of Vehicles          | Number of Vehicles insured under the policy      | 1, 2, 3, 4                          | Numerical (Discrete)
5    | Gender                      | Gender of the Policy holder                      | F, M                                | Categorical (Binary)
6    | Married                     | Marital status of the Policy holder              | Married, Single                     | Categorical (Binary)
7    | Vehicle Age                 | Age of vehicle insured under the policy          | 0, 1, ..., 15                       | Numerical (Discrete)
8    | Fuel Type                   | Fuel type of the vehicle insured                 | D, P                                | Categorical (Binary)
9    | Losses                      | Loss amount claimed under the policy             | Range: 13–3500                      | Numerical (Continuous)
Pristine 66
Multivariate Linear Regression- Response Variable (Losses) SAS

Pristine 67
Multivariate Linear Regression- Response Variable (Capped
Losses) SAS

Pristine 68
Code to generate bivariate profiling

Pristine 69
Multivariate Linear Regression- Bivariate Profiling SAS

Pristine 70
Multivariate Linear Regression- Bivariate Profiling SAS

Pristine 71
Multivariate Linear Regression- Bivariate Profiling SAS

Pristine 72
Multivariate Linear Regression- Bivariate Profiling SAS

Pristine 73
Multivariate Linear Regression- Bivariate Profiling SAS

Pristine 74
Code to check heteroskedasticity

Pristine 75
Multivariate Linear Regression- Heteroskedasticity (Age)

Pristine 76
Multivariate Linear Regression- Heteroskedasticity (Gender)

[Plots for Gender = Female and Gender = Male]

Pristine 77
Multivariate Linear Regression- Heteroskedasticity (Married)

[Plots for Married and Unmarried policy holders]

Pristine 78
Multivariate Linear Regression- Heteroskedasticity (Vehicle Age)
[Plots by Vehicle Age bands (in years)]

Pristine 79
Multivariate Linear Regression- Heteroskedasticity (Fuel Type)

[Plots for Fuel Type = Petrol and Fuel Type = Diesel]

Pristine 80
Multivariate Linear Regression- Variable Selection
Variable selection to be done on the basis of
Multicollinearity (correlation between independent variables)
Banding of variables e.g. whether to use Age or Age Band (also called custom bands)
Statistical significance of variables tested after performing above two steps
List of independent variables:
1. Age
2. Age Band
3. Years of Driving Experience
4. Number of Vehicles
5. Gender
6. Married
7. Vehicle Age
8. Vehicle Age Band
9. Fuel Type

Pristine 81
Covariance and Correlation

Pristine 82
Choosing b/w age and years of experience
Age and Years of Driving Experience are highly correlated (correlation coefficient = 0.9972). We can
use either of the variables in the regression.
Q: Which one to use and which one to reject?
Sol: Fit two separate models using each of the variables, one at a time. Check for goodness of fit ($R^2$ in this
case). The variable producing the higher $R^2$ gets accepted.

R2 for Age > R2 for Years of Driving Experience


Reject Years of Driving Experience

Pristine 83
Code to make bands and choose

Pristine 84
Age vs age band
Investigate whether to use Age or Age band
Fit regression independently using Age and Age Band
Before fitting the regression, Age Band needs to be converted from categorical to numerical form. Replace
Age Band values with the Average Age for the particular band.

Regressions results using Age and Average Age

R2 for Average Age > R2 for Age


Select Average Age
Pristine 85
Multivariate Linear Regression- Custom Bands
Investigate whether to use Vehicle Age or Vehicle Age band
Fit regression independently using Vehicle Age and Vehicle Age Band
Before fitting the regression, Vehicle Age Band needs to be converted from categorical to numerical form.
Replace Vehicle Age Band values with the Average Vehicle Age for the particular band.

Regressions results using Vehicle Age and Average Vehicle Age

R2 for Average Vehicle Age > R2 for Vehicle Age


Select Average Vehicle Age
Pristine 86
Multivariate Linear Regression- Variable Selection
List of shortlisted variables:
1. Age Band in the form of Average Age of the band (selected out of Age and Age Band). Also got selected over
Years of Driving Experience.

2. Number of Vehicles
3. Gender

4. Married

5. Vehicle Age Band in the form of Average Vehicle Age of the band (selected out of Vehicle Age and Vehicle
Age Band).
6. Fuel Type

We will run regression in multivariate fashion and then select final list of variables by taking into
consideration statistical significance.

Pristine 87
Multivariate Linear Regression- Categorical variable conversion
Categorical variables in Binary form need to be converted to their numerical equivalent (0, 1)
1. Gender (F = 0 and M = 1)

2. Married (Married = 0 and Single = 1)

3. Fuel Type (P = 0, D = 1)

Snapshot of the final data on which we will run the multivariate regression
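
A sketch of this conversion in R using ifelse(); the data frame name (policy) and column names are assumed to follow the data dictionary above rather than taken from the original slides.

```r
# Assumed data frame 'policy' with columns Gender, Married, Fuel.Type as in the data dictionary
policy$gender_dummy  <- ifelse(policy$Gender    == "M",      1, 0)  # F = 0, M = 1
policy$married_dummy <- ifelse(policy$Married   == "Single", 1, 0)  # Married = 0, Single = 1
policy$fuel_dummy    <- ifelse(policy$Fuel.Type == "D",      1, 0)  # P = 0, D = 1
```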

Pristine 88
Code for running the full regression

[Regression output: the coefficient for Number of Vehicles is statistically insignificant]

Pristine 89
Result after removing number of vehicles

Pristine 90
Multivariate Linear Regression- Regression Equation
Predicted Losses = 625.0241 − 5.56069 × Avg Age + 50.88366 × Gender Dummy +
78.40224 × Married Dummy − 15.14453 × Avg Vehicle Age + 267.93268 × Fuel Type Dummy

Interpretation:

Coefficient     | Value   | Sign | Inference
Intercept       | 625.005 |      |
Avg Age         | -5.561  | -ve  | The higher the age, the lower the loss
Gender Dummy    | 50.883  | +ve  | Average loss for males is higher than for females
Married Dummy   | 78.402  | +ve  | Average loss for singles is higher than for married policy holders
Avg Vehicle Age | -15.144 | -ve  | The older the vehicle, the lower the losses
Fuel Type Dummy | 267.932 | +ve  | Losses are higher for fuel type Diesel

Pristine 91
Multivariate Linear Regression- Residual Plot
Residual plot:
Residuals are calculated as Actual Capped Losses − Predicted Capped Losses
Residuals should have a uniform distribution; otherwise there is some bias in the model
Except for a few observations (circled in red), the residuals are uniformly distributed
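
A sketch of this residual check in R, assuming a fitted model object fit_final on the capped losses (a placeholder name, not from the original slides).

```r
resid_final <- residuals(fit_final)              # actual capped losses - predicted capped losses
plot(fitted(fit_final), resid_final,
     xlab = "Predicted capped losses", ylab = "Residual")
abline(h = 0, lty = 2)                           # residuals should scatter evenly around zero
```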

Pristine 92
Code to generate scorecard

Pristine 93
Scorecard Performance Checks- Rank Ordering

Pristine 94
Thank you!

Pristine
702, Raaj Chambers, Old Nagardas Road, Andheri (E), Mumbai-400 069. INDIA
www.edupristine.com
Ph. +91 22 3215 6191
