You are on page 1of 23

Business Analytics

Multivariate Linear Regression


(Using Ms-Excel & R)

Pristine

Pristine www.edupristine.com

Multivariate Linear Regression- Fixing Heteroskedasticity


Univariate scenario:
Find the Standard Deviation of response variable for the different levels of independent variable
Divide the independent values of the response variable by the respective standard deviation
The scaled values become the new response variable

E.g. if variable is Fuel Type


If Fuel Type is D, divide Capped Losses by SquareRoot(33862) = 184
If Fuel Type is P, divide Capped Losses by SquareRoot(16400) = 128

Multivariate scenario:
Create all possible unique combinations of independent variables
For each of the combinations, find Standard Deviations
Divide the independent values of the response variable by the respective standard deviation
Too cumbersome to do manually using MS Excel. Also the process is iterative.
More convenient to do using Statistical packages like R.

Course approach
First fit a multivariate regression without fixing heteroskedasticity to get a final set of significant variables
Then do manual adjustment and re-fit regression using MS Excel. This will be just for demonstration. As manual
adjustment is always questionable.
Demonstrate linear regression using R
Pristine

Linear Regression- Preparing MS Excel


1

Pristine

Linear Regression- Using MS Excel (Demo.)


1

Pristine

Multivariate Linear Regression- Variable Selection


Variable selection to be done on the basis of
Multicollinearity (correlation between independent variables)
Banding of variables e.g. whether to use Age or Age Band (also called custom bands)
Statistical significance of variables tested after performing above two steps

List of independent variables:


1.

Age

2.

Age Band

3.

Years of Driving Experience

4.

Number of Vehicles

5.

Gender

6.

Married

7.

Vehicle Age

8.

Vehicle Age Band

9.

Fuel Type

Pristine

Multivariate Linear Regression- Variable Selection (Multicollinearity)


Age and Years of Driving Experience are highly correlated (Correlation Coefficient = 0.9972). We can
use either of the variables in regression
Q: Which one to use and which one to reject?
Sol: Fit two separate models using either of the variable one at a time. Check for goodness of fit (R 2 in this
case). The variable producing higher R2 gets accepted.

Regression Statistics (Age)


Multiple R
0.475766
R Square
0.226354
Adjusted R Square
0.226303
Standard Error
201.2306

Multiple R

0.475273

R Square

0.225885

Adjusted R Square

0.225834

Observations

Standard Error

201.2916

15290

Regression Statistics(Yrs Driving Experience)

Observations

15290

R2 for Age > R2 for Years of Driving Experience


Reject Years of Driving Experience
Pristine

Multivariate Linear Regression- Custom Bands


Investigate whether to use Age or Age band
Fit regression independently using Age and Age Band
Before fitting regression, Age Band needs to be converted to numerical form from categorical. Replace
Age Band values with Average Age for the particular band.
Age Band

Sum of Age

# Policies

Average Age

16-25

93,770.0

4,563.0

20.6

26-59

270,793.0

6,384.0

42.4

60+

282,636.0

4,343.0

65.1

Regressions results using Age and Average Age


Regression Statistics (Age)
Multiple R
0.475766
R Square
0.226354
Adjusted R Square
0.226303
Standard Error
201.2306
Observations
15290

Regression Statistics (Average Age)


Multiple R
0.509969
R Square
0.260068
Adjusted R Square
0.26002
Standard Error
196.7971
Observations
15290

R2 for Average Age > R2 for Age


Select Average Age
Pristine

Multivariate Linear Regression- Custom Bands


Investigate whether to use Vehicle Age or Vehicle Age band
Fit regression independently using Vehicle Age and Vehicle Age Band
Before fitting regression, Vehicle Age Band needs to be converted to numerical form from categorical.
Replace Vehicle Age Band values with Vehicle Average Age for the particular band.
Vehicle Age Band
0-5
6-10
11+

Sum of Vehicle Age


9,229
44,298
78,819

# Policies
Average Vehicle Age
3,688
2.50
5,523
8.02
6,079
12.97

Regressions results using Vehicle Age and Average Vehicle Age


Regression Statistics (Vehicle Age)
Multiple R
0.289431325
R Square
0.083770492
Adjusted R Square
0.083710561
Standard Error
218.9903277
Observations
15290

Regression Statistics (Average Vehicle Age)


Multiple R
0.303099405
R Square
0.09186925
Adjusted R Square
0.091809848
Standard Error
218.0203272
Observations
15290

R2 for Average Vehicle Age > R2 for Vehicle Age


Select Average Vehicle Age
Pristine

Multivariate Linear Regression- Variable Selection


List of shortlisted variables:

1.

Age Band in the form of Average Age of the band (selected out of Age and Age Band). Also got selected over
Years of Driving Experience.

2.

Number of Vehicles

3.

Gender

4.

Married

5.

Vehicle Age Band in the form of Average Vehicle Age of the band (selected out of Vehicle Age and Vehicle
Age Band).

6.

Fuel Type

We will run regression in multivariate fashion and then select final list of variables by taking into
consideration statistical significance.

Pristine

Multivariate Linear Regression- Categorical variable conversion


Categorical variables in Binary form need to be converted to their numerical equivalent (0, 1)

1.

Gender (F = 0 and M = 1)

2.

Married (Married = 0 and Single = 1)

3.

Fuel Type (P = 0, D = 1)

Snapshot of the final data on which we will run the multivariate regression

Pristine

Multivariate Linear Regression- Output


SUMMARY OUTPUT
Regression Statistics
Multiple R

0.865972274

R Square

0.749907979

Adjusted R Square

0.749809794

Standard Error

114.4310136

Observations

15290

ANOVA
df

SS

Regression

Residual

15283

Total

15289
Coefficients

Intercept

MS

Significance F

600073213.5 100012202.3 7637.751088

200122584.4 13094.45688

800195798
Standard Error

t Stat

P-value

Lower 95%

Upper 95%

Lower 95.0%

Upper 95.0%

624.56529

5.29192

118.02233

0.00000

614.19249

634.93809

614.19249

634.93809

-5.55974

0.06546

-84.93889

0.00000

-5.68804

-5.43144

-5.68804

-5.43144

0.17875

0.97039

0.18420

0.85386

-1.72333

2.08082

-1.72333

2.08082

Gender Dummy

50.88326

1.89081

26.91084

0.00000

47.17705

54.58947

47.17705

54.58947

Married Dummy

78.39837

1.92148

40.80106

0.00000

74.63204

82.16469

74.63204

82.16469

Avg Vehicle Age

-15.14220

0.26734

-56.63987

0.00000

-15.66623

-14.61818

-15.66623

-14.61818

Fuel Type Dummy

267.93559

2.74845

97.48614

0.00000

262.54830

273.32287

262.54830

273.32287

Avg Age
Number of Vehicles

Pristine

10

Multivariate Linear Regression- Output


1

Independent Vars

Coefficients(b)

t Stat
(b/)

P-value
(t-dist table)

Lower 95%
(b-1.96*)

Upper 95%
(b+1.96*)

Lower 95%
(b-1.96*)

Upper 95%
(b+1.96*)

Intercept

624.565

5.292

118.022

0.000

614.192

634.938

614.192

634.938

X1

Avg Age

b1

-5.560

0.065

-84.939

0.000

-5.688

-5.431

-5.688

-5.431

X2

Number of Vehicles

b2

0.179

0.970

0.184

0.854

-1.723

2.081

-1.723

2.081

X3

Gender Dummy

b3

50.883

1.891

26.911

0.000

47.177

54.589

47.177

54.589

X4

Married Dummy

b4

78.398

1.921

40.801

0.000

74.632

82.165

74.632

82.165

X5

Avg Vehicle Age

b5

-15.142

0.267

-56.640

0.000

-15.666

-14.618

-15.666

-14.618

X6

Fuel Type Dummy

b6

267.936

2.748

97.486

0.000

262.548

273.323

262.548

273.323

Insignificant

ANOVA

Standard Error
()

df

SS

MS (SS/df)

F (MSReg/MSRes)

Significance F (from
F dist table)

7637.75

Regression { (ypredictedl- ymean)2}

600073213.5

100012202.3

Residual {(yactual - ypredicted)2}

n-p-1

15283

200122584.4

13094.457

Total {(yactual - ymean)2}

n-1

15289

800195798

Regression Statistics

Pristine

Multiple R

SquareRoot(R Square)

0.8659723

R Square

SS Regression/SS Total

0.7499080

Adjusted R Square

R2 - (1 - R2)*{p/(n-p-1)}

0.7498098

Standard Error

SquareRoot{SS Residual/(n-p-1)}

114.4310136

Observations

15290

11

Multivariate Linear Regression- Output (Significance Test)


1

Independent Vars

Coefficients(b)

Standard Error
()

t Stat
(b/)

P-value
(t-dist table)

Lower 95%
(b-1.96*)

Upper 95%
(b+1.96*)

Lower 95%
(b-1.96*)

Upper 95%
(b+1.96*)

Intercept

624.565

5.292

118.022

0.000

614.192

634.938

614.192

634.938

X1

Avg Age

b1

-5.560

0.065

-84.939

0.000

-5.688

-5.431

-5.688

-5.431

X2

Number of Vehicles

b2

0.179

0.970

0.184

0.854

-1.723

2.081

-1.723

2.081

X3

Gender Dummy

b3

50.883

1.891

26.911

0.000

47.177

54.589

47.177

54.589

X4

Married Dummy

b4

78.398

1.921

40.801

0.000

74.632

82.165

74.632

82.165

X5

Avg Vehicle Age

b5

-15.142

0.267

-56.640

0.000

-15.666

-14.618

-15.666

-14.618

X6

Fuel Type Dummy

b6

267.936

2.748

97.486

0.000

262.548

273.323

262.548

273.323

Significance test of coefficients based on Normal distribution


H0: b is no different that 0 (i.e. 0 is the coefficient when the variable is not included in regression)
H1: b is different than 0
Test statistic, Z = (b-0)/ (at 95% two tailed confidence interval, Z = 1.96)
Confidence interval = (b 1.96 * , b + 1.96 * )
For variable to be significant, the interval must not contain 0.
Example1: Avg Age.
Confidence interval = (-5.560-1.96*0.065, -5.560+1.96*0.065) = (-5.688, -5.431)
No zero in the interval. Hence significant.
Example2: Number of Vehicles
Confidence interval = (0.179-1.96*0.970, 0.179+1.96*0.970) = (-1.723, 2.080)
Zero is present in the interval. Hence insignificant.
Pristine

12

Multivariate Linear Regression- Output (Significance Test)


1

Independent Vars

Coefficients(b)

Standard Error
()

t Stat
(b/)

P-value
(t-dist table)

Lower 95%
(b-1.96*)

Upper 95%
(b+1.96*)

Lower 95%
(b-1.96*)

Upper 95%
(b+1.96*)

Intercept

624.565

5.292

118.022

0.000

614.192

634.938

614.192

634.938

X1

Avg Age

b1

-5.560

0.065

-84.939

0.000

-5.688

-5.431

-5.688

-5.431

X2

Number of Vehicles

b2

0.179

0.970

0.184

0.854

-1.723

2.081

-1.723

2.081

X3

Gender Dummy

b3

50.883

1.891

26.911

0.000

47.177

54.589

47.177

54.589

X4

Married Dummy

b4

78.398

1.921

40.801

0.000

74.632

82.165

74.632

82.165

X5

Avg Vehicle Age

b5

-15.142

0.267

-56.640

0.000

-15.666

-14.618

-15.666

-14.618

X6

Fuel Type Dummy

b6

267.936

2.748

97.486

0.000

262.548

273.323

262.548

273.323

Significance test of coefficients based on t distribution.


b/StdErr(b) ~ tn-2
H0: b is no different that 0 (i.e. 0 is the coefficient when the variable is not included in regression)
H1: b is different than 0
At 95% two tailed confidence interval and df greater that 120, t = 1.96)
Confidence interval = (b 1.96 * , b + 1.96 * )
For variable to be significant, the interval must not contain 0.
Example1: Avg Age.
Confidence interval = (-5.560-1.96*0.065, -5.560+1.96*0.065) = (-5.688, -5.431)
No zero in the interval. Hence significant.
Example2: Number of Vehicles
Confidence interval = (0.179-1.96*0.970, 0.179+1.96*0.970) = (-1.723, 2.080)
Zero is present in the interval. Hence insignificant.
Pristine

13

Multivariate Linear Regression- Output at 95% Confidence Interval


SUMMARY OUTPUT
Regression Statistics
Multiple R
R Square
Adjusted R Square
Standard Error
Observations

Excluding "Num Including "Num


Vehicles"
Vehicles"
0.865971953
0.749907424
0.749825608
114.4273971
15290

0.865972274
0.749907979
0.749809794
114.4310136
15290

Adjusted R-square improved

ANOVA
df
Regression
Residual
Total

SS

MS

600072769.2 120014553.8

Significance F

9165.874

15284
15289

200123028.7
800195798

13093.6292

Coefficients

Standard Error

t Stat

P-value

Lower 95%

Upper 95%

Intercept
Avg Age
Gender Dummy
Married Dummy
Avg Vehicle Age

625.005
-5.560
50.883
78.402
-15.142

4.723
0.065
1.891
1.921
0.267

132.333
-84.942
26.912
40.806
-56.641

0.00
0.00
0.00
0.00
0.00

615.7474
-5.6879
47.1768
74.6356
-15.6660

634.2625
-5.4314
54.5890
82.1677
-14.6180

615.7474
-5.6879
47.1768
74.6356
-15.6660

634.2625
-5.4314
54.5890
82.1677
-14.6180

Fuel Type Dummy

267.935

2.748

97.489

0.00

262.5480

273.3223

262.5480

273.3223

Pristine

Lower 95.0% Upper 95.0%

14

Multivariate Linear Regression- Regression Equation


Predicted Losses = 625.004932715948
5.5596551344537 * Avg Age + 50.8828923910091 * Gender Dummy +
78.4016899779131 * Married Dummy -15.1420259903571 * Avg Vehicle Age +
267.935139741526 * Fuel Type Dummy
Interpretation:
Coefficients
Intercept
Avg Age
Gender Dummy
Married Dummy
Avg Vehicle Age
Fuel Type Dummy

625.005
-5.560
50.883
78.402
-15.142
267.935

Sign of
Coefficient

-ve
+ve
+ve
-ve
+ve

Inference
Higher is the age, lower is the loss
Average Loss for Males is higher than Females
Average Loss for Single is higher than Married
Older is the vehicle, lower are the losses
Losses are higher for Fuel type D

Illustration of using the equation given in MS Excel

Pristine

15

Multivariate Linear Regression- Residual Plot


Residual plot:
Residuals calculated as Actual Capped Losses Predicted Capped Losses
Residuals should have a uniform distribution else theres some bias in the model
Except for a few observations (circled in red), residuals are uniformly distributed

Capped Losses- Residual


1200
1000
800

600
400
200
0
0

2000

4000

6000

8000

10000

12000

14000

-200
-400

Pristine

16

Multivariate Linear Regression- Gains Chart and Gini


Gains chart is used to represent the effectiveness of a model prediction which is quantified by means of
Gini Coefficient
Methodology illustrated using MS Excel
Equal Obs
Bin
0
1
2
3
4
5
6
7
8
9
10

# Policies

Predicted Loss

0
1528
1529
1529
1529
1529
1529
1529
1529
1529
1530

0
1,167,070
1,046,034
757,330
589,366
531,160
485,428
432,934
385,595
308,050
193,465

Actual Loss
0
1,230,474
991,944
746,854
552,534
553,919
477,284
385,411
423,814
310,846
223,351

Cumulative
Actual Loss
0
1,230,474
2,222,418
2,969,272
3,521,806
4,075,725
4,553,009
4,938,420
5,362,234
5,673,081
5,896,432

Random
0
10%
10%
10%
10%
10%
10%
10%
10%
10%
10%

Cumulative %
Obs
0
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%

% Cumulative Actual
Loss
0
20.87%
37.69%
50.36%
59.73%
69.12%
77.22%
83.75%
90.94%
96.21%
100.00%

Actual vs Predicted Losses


3500

2500

Actual Loss

2000

600,000

1500

400,000

1000

200,000

500

0
0

Pristine

6
8
Bins of Equal # Policies

10

12

# Policies

Losses

3000

Predicted Loss

800,000

0.27177

100%
%Cumulative Actual Loss

# Policies

1,000,000

Gini Coeff

Gains Chart

1,400,000
1,200,000

Area Under
Gains Curve
0
0.0104
0.0293
0.0440
0.0550
0.0644
0.0732
0.0805
0.0873
0.0936
0.0981

80%
60%

Cumulative % Obs
40%
% Cumulative Actual Loss
20%
0%
0

10

Bins of Equal # Policies

17

Multivariate Linear Regression- Fixing Heteroskedasticity (Demo.)


1

Create unique combinations of the


variables - Avg Age, Gender Dummy,
Married Dummy, Avg Vehicle Age and
Fuel Type Dummy

Find Standard
Deviation of capped
Losses for the segments.
Detailed methodology
explained in MS Excel.

Calculate Standardized
Capped Losses as
Capped Losses /
Segment Std Dev.
This becomes the new
response variable.

Manually doing this kind of exercise can be flawed as some the segments could be sparsely populated.
This demo. Is just to explain the underlying technique/methodology.
Statistical packages like SAS, R have in-built capability to take care of this.
Pristine

18

Multivariate Linear Regression- Fixing Heteroskedasticity (Demo.)


SUMMARY OUTPUT
Regression Statistics
Multiple R
0.359167467
R Square
Adjusted R Square
Standard Error
Observations

0.129001269
0.128716331
4.77078689
15290

Insignificant which is questionable


as D and P have significantly
differe mean losses

ANOVA

df
Regression
Residual
Total

Intercept
Avg Age
Gender Dummy
Married Dummy
Avg Vehicle Age
Fuel Type Dummy
Pristine

5
15284
15289

SS
MS
F
Significance F
51522.10 10304.42 452.73
0
347870.07
22.76
399392.17

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
12.476
0.197 63.374
0.000
12.091
12.862
12.091
12.862
-0.086
0.003 -31.554
0.000
-0.091
-0.081
-0.091
-0.081
0.213
0.079
2.702
0.007
0.058
0.368
0.058
0.368
-0.204
0.080 -2.552
0.011
-0.361
-0.047
-0.361
-0.047
-0.376
0.011 -33.770
0.000
-0.398
-0.354
-0.398
-0.354
0.136
0.115
1.188
0.235
-0.088
0.361
-0.088
0.361
19

Multivariate Linear Regression- Using R


Step1: Download and install R software from http://www.r-project.org/
Step2: Convert the data to R readable format e.g. *.csv.
D:\Linear Reg using R\Linear_Reg_Sample_Data.csv

Writing R code for


Reading the data
Fitting the Linear Regression

> LinRegData <- read.csv(file = "D:\\Linear Reg using R\\Linear_Reg_Sample_Data.csv")


>FitLinReg <- lm(Capped_Losses ~ Number_of_Vehicles + Avg_Age + Gender_Dummy +
Married_Dummy + Avg_Vehicle_Age + Fuel_Type_Dummy, LinRegData)
>

Pristine

20

Multivariate Linear Regression- Using R


Output

Pristine

21

Thank you!

Pristine
702, Raaj Chambers, Old Nagardas Road, Andheri (E), Mumbai-400 069. INDIA
www.edupristine.com
Ph. +91 22 3215 6191

Pristine

Pristine www.edupristine.com

You might also like