
# Introducing Multiple Regression

Simple Regression

Cause → Effect
Independent variable → Dependent variable

Example: Oil Prices → Government Bond Yields

## One cause, one effect

Multiple Regression

Causes → Effect
Independent variables → Dependent variable

Example: Oil Prices, S&P 500 Share Index → Government Bond Yields

## Many causes, one effect

Simple Regression

Scatter of n points (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) in the X-Y plane.

Regression Line:
y = A + Bx

## Represent all n points as (xi, yi), where i = 1 to n
Multiple Regression

Scatter of n points (x1, y1, z1), (x2, y2, z2), …, (xn, yn, zn) in X-Y-Z space.

Regression Plane:
y = A + Bx + Cz

## Represent all n points as (xi, yi, zi), where i = 1 to n
Multiple Regression

Causes: Dow Jones index, price of oil → Effect: Exxon stock

Regression Equation:
EXXONt = A + B DOWt + C OILt

In vector form, one equation per day:

[E1]     [1]     [D1]     [O1]   [e1]
[E2]     [1]     [D2]     [O2]   [e2]
[E3] = A [1] + B [D3] + C [O3] + [e3]
[…]      […]     […]      […]    […]
[En]     [1]     [Dn]     [On]   [en]

Ei = % return on Exxon stock on day i
Di = % return of Dow Jones index on day i
Oi = % change in price of oil on day i
Multiple Regression

Regression Equation:
y = A + Bx + Cz

[y1]     [1]     [x1]     [z1]   [e1]
[y2]     [1]     [x2]     [z2]   [e2]
[y3] = A [1] + B [x3] + C [z3] + [e3]
[…]      […]     […]      […]    […]
[yn]     [1]     [xn]     [zn]   [en]
Multiple Regression

Regression Equation:
y = A + Bx + Cz

The same system written as one matrix product:

[y1]   [1  x1  z1]         [e1]
[y2]   [1  x2  z2]   [A]   [e2]
[y3] = [1  x3  z3] * [B] + [e3]
[…]    […  …   … ]   [C]   […]
[yn]   [1  xn  zn]         [en]

n Rows,    n Rows,     3 Rows,    n Rows,
1 Column   3 Columns   1 Column   1 Column
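The matrix setup above maps directly onto code. A minimal numpy sketch with synthetic data (the numbers are illustrative, not from the slides); the data is noiseless, so least squares recovers A = 2, B = 3, C = -1 exactly:

```python
import numpy as np

# Hypothetical data generated from y = 2 + 3x - 1z
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
z = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 2.0 + 3.0 * x - 1.0 * z

# Design matrix: column of ones (intercept), then x and z -- n rows, 3 columns
X = np.column_stack([np.ones_like(x), x, z])

# Solve min ||y - X @ coeffs||^2; coeffs = [A, B, C]
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
A, B, C = coeffs
```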
Multiple Regression

2 Causes → 1 Effect
Dow Jones index, price of oil → Exxon stock

k Causes → 1 Effect
Dow Jones index, price of oil, bond yields… → Exxon stock
Multiple Regression

Regression Equation:
y = C1 + C2x1 + … + Ck+1xk

[y1]        [1]        [x11]              [x1k]   [e1]
[y2]        [1]        [x21]              [x2k]   [e2]
[y3] = C1   [1] + C2   [x31] + … + Ck+1   [x3k] + [e3]
[…]         […]        [ … ]              [ … ]   […]
[yn]        [1]        [xn1]              [xnk]   [en]
Multiple Regression

Regression Equation:
y = C1 + C2x1 + … + Ck+1xk

[y1]   [1  x11 … x1k]   [C1  ]   [e1]
[y2]   [1  x21 … x2k]   [C2  ]   [e2]
[y3] = [1  x31 … x3k] * [ …  ] + [e3]
[…]    […   …  …  … ]   [Ck+1]   […]
[yn]   [1  xn1 … xnk]            [en]

n Rows,    n Rows,         k+1 Rows,   n Rows,
1 Column   k+1 Columns     1 Column    1 Column
Multiple Regression

Regression Equation:
y = C1 + C2x1 + … + Ck+1xk

## Multiple regression involves finding k+1 coefficients: k for the explanatory variables, and 1 for the intercept
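In code, the k+1 coefficients fall out of one least-squares solve once the design matrix gets its leading column of ones. A sketch on synthetic, noiseless data (all names and numbers are illustrative):

```python
import numpy as np

# With k explanatory variables the regression estimates k + 1 coefficients
rng = np.random.default_rng(0)
n, k = 50, 3
X_vars = rng.normal(size=(n, k))          # hypothetical x1 .. xk
true = np.array([1.0, 0.5, -2.0, 3.0])    # [intercept, C2, C3, C4]
y = true[0] + X_vars @ true[1:]           # noiseless, so recovery is exact

# Leading column of ones makes the design matrix n x (k + 1)
X = np.column_stack([np.ones(n), X_vars])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
```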
Estimation Methods in Multiple Regression

Method of moments | Maximum likelihood estimation | Method of least squares

## The method of least squares works for multiple regression too

Regression Equation:
y = C1 + C2x1 + … + Ck+1xk

## The “best fit” line is the one where the sum of the squares of the lengths of the errors is minimised
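That minimisation can be sketched via the normal equations (X'X) b = X'y, one standard way of computing the least-squares solution. Synthetic data; all numbers are illustrative. The check at the end confirms that nudging any coefficient away from the solution can only increase the sum of squared residuals:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 1.0 + 2.0 * x - 0.5 * z + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), x, z])
b = np.linalg.solve(X.T @ X, X.T @ y)   # least-squares coefficients

def ssr(coeffs):
    """Sum of squared residuals for a candidate coefficient vector."""
    e = y - X @ coeffs
    return float(e @ e)

# At the least-squares solution, every small nudge makes the fit worse
worse = all(ssr(b + d) >= ssr(b) for d in np.eye(3) * 0.01)
```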
Risks in Multiple Regression

Simple and Multiple Regression

Simple Regression: data in 2 dimensions
Multiple Regression: data in > 2 dimensions

Simple Regression: risks exist, but can usually be mitigated by analysing R2 and residuals
Multiple Regression: risks are more complicated, and require interpreting regression statistics
Risks in Simple Regression

No cause-effect relationship: regression on completely unrelated data series
Mis-specified relationship: non-linear (exponential or polynomial) fit
Incomplete relationship: multiple causes exist, we have captured just one

Diagnosing Risks in Simple Regression

No cause-effect relationship: low R2, plot of X ~ Y has no pattern
Mis-specified relationship: low R2, residuals are not independent of x
Incomplete relationship: high R2, residuals are not independent of each other

Mitigating Risks in Simple Regression

No cause-effect relationship: wrong choice of X and Y - back to drawing board
Mis-specified relationship: transform X and Y - convert to logs or returns
Incomplete relationship: add X variables (move to multiple regression)

The big new risk with multiple regression is multicollinearity: X variables containing the same information
Multiple Regression

Regression Equation:
y = C1 + C2x1 + … + Ckxk-1

[y1]   [1  x11 … x1k-1]   [C1]   [e1]
[y2]   [1  x21 … x2k-1]   [C2]   [e2]
[y3] = [1  x31 … x3k-1] * [ …] + [e3]
[…]    […   …  …   …  ]   [Ck]   […]
[yn]   [1  xn1 … xnk-1]          [en]

 n x 1       n x k        k x 1   n x 1

n Rows,    n Rows,     k Rows,    n Rows,
1 Column   k Columns   1 Column   1 Column
Detecting Multicollinearity

Plot (or regress) the explanatory variables X1 … Xk against each other.

Plot of X1 against Xk with High R2:

## Highly correlated explanatory variables

Good News: No Multicollinearity Detected

Plot of X1 against Xk with Low R2:

## Uncorrelated explanatory variables
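This check can be sketched in a few lines of numpy: regress one explanatory variable on another and look at the R2. The data is synthetic and the variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)                    # unrelated to x1

def r_squared(target, predictor):
    """R2 from regressing `target` on `predictor` (with intercept)."""
    X = np.column_stack([np.ones(n), predictor])
    b, *_ = np.linalg.lstsq(X, target, rcond=None)
    e = target - X @ b
    tss = ((target - target.mean()) ** 2).sum()
    return 1.0 - float(e @ e) / float(tss)

high = r_squared(x2, x1)   # near 1: x2 duplicates x1's information
low = r_squared(x3, x1)    # near 0: x3 carries independent information
```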

Multiple Regression

Regression Equation:
EXXONt = A + B DOWt + C OILt

[E1]     [1]     [D1]     [O1]
[E2]     [1]     [D2]     [O2]
[E3] = A [1] + B [D3] + C [O3]
[…]      […]     […]      […]
[En]     [1]     [Dn]     [On]

Ei = % return on Exxon stock on day i
Di = % return of Dow Jones index on day i
Oi = % change in price of oil on day i
Good News: No Multicollinearity Detected

Plot of DOW Returns against OIL with Low R2:

## Uncorrelated explanatory variables

Multiple Regression

Regression Equation:
EXXONt = A + B DOWt + C NASDAQt

[E1]     [1]     [D1]     [N1]
[E2]     [1]     [D2]     [N2]
[E3] = A [1] + B [D3] + C [N3]
[…]      […]     […]      […]
[En]     [1]     [Dn]     [Nn]

Ei = % return on Exxon stock on day i
Di = % return of Dow Jones index on day i
Ni = % return of NASDAQ index on day i

Plot of DOW against NASDAQ with High R2:

## Highly correlated explanatory variables

Multicollinearity Kills Regression’s Usefulness

Explaining Variance: the R2 as well as the regression coefficients are not very reliable
Making Predictions: the regression model will perform poorly with out-of-sample data
Multicollinearity: Prevention and Cure

Common Sense: big-picture understanding of the data
Nuts and Bolts: setting up data right
Heavy Lifting: factor analysis, Principal Components Analysis (PCA)

Common Sense
Think deeply about each x variable
Eliminate closely related ones
Dig down to underlying causes
Multiple Regression

## Proposed Regression Equation:

EXXONt = A + B DOWt + C NASDAQt + D OILt

EXXONt = % return on Exxon stock on day t
DOWt = % return of Dow Jones index on day t
NASDAQt = % return of NASDAQ index on day t
OILt = % change in price of oil on day t

Common Sense

Dow Jones Industrial Average: 30 large-cap US stocks
NASDAQ 100 Index: 100 large tech stocks
Oil Prices: price of a barrel of oil

## Do we really need both Dow and NASDAQ returns as

explanatory variables?
Common Sense
Proposed Regression Equation:
EXXONt = A + B DOWt + C NASDAQt + D OILt

Dow Jones
NASDAQ 100 Index Oil Prices
Industrial Average
100 large tech stocks Price of barrel of oil
30 Large-cap US stocks

## If yes - consider keeping one, and constructing a

new explanatory variable of their difference
Common Sense

Proposed Regression Equation:
EXXONt = A + B DOWt + C OILt

Dow Jones Industrial Average: 30 large-cap US stocks
Oil Prices: price of a barrel of oil

## What underlying factors drive both US large-cap stocks and the price of oil?
Multiple Regression

## Original Regression Equation:

EXXONt = A + B DOWt + C NASDAQt + D OILt

## Revised Regression Equation:

EXXONt = A + B DOWt + C INTERESTt + D GDPt

Nuts and Bolts
‘Standardise’ the variables
Rely on adjusted-R2, not plain R2
Set up dummy variables right
Distribute lags

Heavy Lifting
Find underlying factors that drive the correlated x variables
Principal Component Analysis (PCA) is a great tool
Multiple Regression

## Proposed Regression Equation:

HOMEt = A + B 5-yeart + C 10-yeart + D 2-yeart + E 1-yeart + F 3-montht + G 1-dayt + …

HOMEt = % change in home prices in month t
5-yeart = yield on 5-year bond in month t
10-yeart = yield on 10-year bond in month t
…

Factor Analysis

1-day (overnight) money market, 3-month government bonds, 1-year government bonds, 5-year government bonds, 30-year government bonds, 5-year swap rate (inter-bank)…

## Interest rates on a wide variety of fixed-income instruments
Factor Analysis on Interest Rates

Level: How high are interest rates?
Slope: How steep is the yield curve?
Twist: How convex is the yield curve?

## Three uncorrelated factors explain most variation in all interest rates

Factor Analysis
The factors identified are guaranteed to be uncorrelated

## However, they may not have an intuitive interpretation

Principal Component Analysis is one procedure for factor analysis
Dimensionality Reduction via Factor Analysis

20-dimensional data → 3-dimensional data

[x11 … x1k-1]                     [f11 f12 f13]
[x21 … x2k-1]   Factor Analysis   [f21 f22 f23]
[x31 … x3k-1]   ─────────────→    [f31 f32 f33]
[ …  …   …  ]                     [ …   …   … ]
[xn1 … xnk-1]                     [fn1 fn2 fn3]

1 column for each interest rate out there → 3 columns, 1 for each factor

Factor Analysis is a dimensionality-reduction technique to identify a few underlying causes in data
Multiple Regression

Proposed Regression Equation:
HOMEt = A + B 5-yeart + C 10-yeart + D 2-yeart + E 1-yeart + F 3-montht + G 1-dayt + …

Principal Component Analysis

## Revised Regression Equation:

HOMEt = A + B LEVELt + C SLOPEt + D TWISTt
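A minimal PCA sketch of the dimensionality reduction above, using an SVD of the centred data matrix. The "rates" here are synthetic: 8 hypothetical series driven by just 2 underlying factors, which the first two principal components then recover:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
level = rng.normal(size=n)                  # hypothetical "level" factor
slope = rng.normal(size=n)                  # hypothetical "slope" factor
maturities = np.linspace(0.0, 1.0, 8)
rates = (level[:, None]
         + slope[:, None] * maturities[None, :]
         + rng.normal(scale=0.01, size=(n, 8)))  # small idiosyncratic noise

# PCA via SVD of the centred data matrix
centred = rates - rates.mean(axis=0)
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
explained = s ** 2 / (s ** 2).sum()

top2 = explained[:2].sum()      # share of variance in the first 2 components
factors = centred @ Vt[:2].T    # the reduced n x 2 data
```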
Benefits of Multiple Regression

Simple Regression Is a Great Tool

Powerful: perfectly suited to two common use-cases
Versatile: easily extended to non-linear relationships
Deep: the first “crossover hit” from Machine Learning

Multiple Regression Is Even Better

Powerful: also controls for effects of different causes
Versatile: also works with categorical data
Deep: especially if combined with factor analysis

Two Common Applications of Regression

Explaining Variance: how much variation in one data series is caused by another?
Making Predictions: how much does a move in one series impact another?
Controlling For Different Causes

## Proposed Regression Equation:

EXXONt = A + B DOWt + C OILt

EXXONt = % return on Exxon stock on day t
DOWt = % return of Dow Jones index on day t
OILt = % change in price of oil on day t

## All else being equal, how much will Exxon stock move by if oil prices increase by 1%?

Subtract the original equation from the shocked one:

  EXXON? = A + B DOWt + C (OILt + 1%)
- EXXONt = A + B DOWt + C OILt

Change in EXXON = C

Regression coefficients tell how much y changes for a unit change in each predictor, all others being held constant
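The "all else being equal" subtraction can be verified in a couple of lines. The fitted coefficients below are hypothetical, chosen only to show that shocking OIL by one unit while holding DOW fixed moves the prediction by exactly C:

```python
# Hypothetical fitted coefficients: EXXON = A + B*DOW + C*OIL
A, B, C = 0.1, 0.9, 0.25

def predict(dow, oil):
    return A + B * dow + C * oil

# Same DOW, oil shocked up by one unit: the difference is the coefficient C
change = predict(dow=0.5, oil=3.0) - predict(dow=0.5, oil=2.0)
```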
Interpreting the Results of a Regression Analysis

Interpreting Results of a Simple Regression

R2: measures overall quality of fit - the higher the better (up to a point)
Residuals: check if regression assumptions are violated

## Standard errors of individual coefficients are usually of little significance
Interpreting Results of a Multiple Regression

R2 | Standard Errors of coefficients

e = y - y’
=> y = y’ + e
=> Variance(y) = Variance(y’ + e)
=> Variance(y) = Variance(y’) + Variance(e) + 2 Covariance(y’, e)

## A Not-Very-Important Intermediate Step

A Leap of Faith: for a least-squares fit, Covariance(y’, e) is always = 0. This is important - more on why in a bit.

Variance(y) = Variance(y’) + Variance(e)

Variance Explained

Variance of the dependent variable can be decomposed into variance of the regression fitted values, and that of the residuals:

Variance(y) = Variance(y’) + Variance(e)

## Total Variance (TSS)

TSS = Variance(y): a measure of how volatile the dependent variable is, and of how much it moves around

## Explained Variance (ESS)

ESS = Variance(y’): a measure of how volatile the fitted values are - these come from the regression line

## Residual Variance (RSS)

RSS = Variance(e): the variance in the dependent variable that can not be explained by the regression

TSS = ESS + RSS

R2 = ESS / TSS

The percentage of total variance explained by the regression. Usually, the higher the R2, the better the quality of the regression (upper bound is 100%)
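The decomposition can be checked numerically. A sketch on synthetic data (illustrative numbers): with an intercept in the regression, TSS = ESS + RSS holds exactly, up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ b                  # y'
e = y - fitted                  # residuals

tss = float(((y - y.mean()) ** 2).sum())            # total
ess = float(((fitted - fitted.mean()) ** 2).sum())  # explained
rss = float((e ** 2).sum())                         # residual
r2 = ess / tss
```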
R2 = ESS / TSS

In multiple regression, adding explanatory variables always increases R2, even if those variables are irrelevant and increase the danger of multicollinearity

Adjusted R2, by contrast, increases if irrelevant* variables are deleted

## (*irrelevant variables = any group whose F-ratio < 1)
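A sketch of the standard adjusted-R2 formula with hypothetical numbers: adding an irrelevant variable nudges plain R2 up slightly, yet the adjusted score can still fall:

```python
def adjusted_r2(r2, n, k):
    """n observations, k explanatory variables (excluding the intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

before = adjusted_r2(0.800, n=50, k=3)   # original model
after = adjusted_r2(0.801, n=50, k=4)    # plus one irrelevant variable
```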

Extending Multiple Regression to Categorical Variables

A Simple Regression

## Proposed Regression Equation:

y = A + Bx

y = height of individual
x = average height of parents

A Simple Regression

The scatter plot shows two clusters of points: Male and Female.

Regression Line:
y = A + Bx

## Not a great fit - regression line is far from all points!

A Simple Regression

For the male cluster alone:
y = A1 + Bx

## We can easily plot a great fit for males…

For the female cluster alone:
y = A2 + Bx

## …and another great fit for females
A Simple Regression

Regression Line For Males:   y = A1 + Bx
Regression Line For Females: y = A2 + Bx

## Two lines - same slope, different intercepts

## Combined Regression Line:

y = A1 + (A2 - A1)D + Bx
D = 0 for males
  = 1 for females

For males, D = 0:
y = A1 + (A2 - A1)(0) + Bx
  = A1 + Bx

For females, D = 1:
y = A1 + (A2 - A1)(1) + Bx
  = A2 + Bx
Original Regression Equation:
y = A + Bx

y = height of individual
x = average height of parents

## Combined Regression Line:

y = A1 + (A2 - A1)D + Bx
D = 0 for males
  = 1 for females

## The data contains 2 groups, so we set up 1 dummy variable

Given data with k groups, set up k-1 dummy variables, else multicollinearity occurs
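A minimal encoding sketch: a 3-group categorical variable turned into 3 - 1 = 2 dummy columns. The group names are purely illustrative, and the first level serves as the baseline (it gets no dummy):

```python
groups = ["energy", "tech", "energy", "finance", "tech"]
levels = ["energy", "tech", "finance"]   # "energy" = baseline, no dummy

# One column per non-baseline level: k = 3 groups -> k - 1 = 2 dummies
dummies = [[1 if g == level else 0 for level in levels[1:]] for g in groups]
```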
Dummy and Other Categorical Variables

Dummy Variables: binary - 0 or 1
Categorical Variables: finite set of values - e.g. days of week, months of year…

## To include non-binary categorical variables, encode them as a set of dummy variables
Testing for Seasonality

## Proposed Regression Equation:

y = A + BQ1 + CQ2 + DQ3

y = average stock returns
Q1, Q2, Q3 = quarter-of-the-year dummies

## The data contains 4 groups, so we set up 3 dummy variables

Q1 = 1 for Jan, Feb, Mar
   = 0 for other quarters
Q2 = 1 for Apr, May, Jun
   = 0 for other quarters
Q3 = 1 for Jul, Aug, Sep
   = 0 for other quarters
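The quarter dummies above can be built from a month number in a few lines (Q4 is the baseline, so months 10-12 get all zeros):

```python
def quarter_dummies(month):
    """Q1..Q3 dummies for a month number 1-12; Q4 is the baseline."""
    q1 = 1 if month in (1, 2, 3) else 0
    q2 = 1 if month in (4, 5, 6) else 0
    q3 = 1 if month in (7, 8, 9) else 0
    return (q1, q2, q3)
```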
Different Groups, Different Slopes

Males:   y = A1 + B1 x
Females: y = A2 + B2 x

## Dummy variables can also be extended for use where groups have different slopes

Regression Line For Males:   y = A1 + B1 x
Regression Line For Females: y = A2 + B2 x

## Combined Regression Line:

y = A1 + (A2 - A1)D1 + B1x + (B2 - B1)D2
D1 = 0 for males     D2 = 0 for males
   = 1 for females      = x for females
For males: D1 = 0, D2 = 0
y = A1 + (A2 - A1)(0) + B1x + (B2 - B1)(0)
  = A1 + B1 x

For females: D1 = 1, D2 = x
y = A1 + (A2 - A1)(1) + B1x + (B2 - B1)x
  = A2 + B2 x
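The same derivation, executed with hypothetical coefficients: D2 = D1 · x, so the female group picks up both the intercept shift and the slope shift:

```python
# Hypothetical group-specific coefficients
A1, A2, B1, B2 = 1.0, 2.0, 0.5, 0.8

def predict(x, female):
    d1 = 1 if female else 0      # intercept dummy
    d2 = x if female else 0.0    # slope dummy: D2 = D1 * x
    return A1 + (A2 - A1) * d1 + B1 * x + (B2 - B1) * d2

male_y = predict(3.0, female=False)    # should equal A1 + B1 * 3
female_y = predict(3.0, female=True)   # should equal A2 + B2 * 3
```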
Dummy Variables

Dummy as an X variable: linear regression
Dummy as the Y variable: logistic regression

Normal Distribution

N(μ, σ): average (mean) is μ, standard deviation is σ
Standard Errors

Sampling Distribution of A: E(A) = α
Sampling Distribution of B: E(B) = β

α is the population parameter, A is the sample estimate; β is the population parameter, B is the sample estimate

## Standard error of a regression parameter is the standard deviation of the sampling distribution

Strong Cause-effect Relationship

Points cluster tightly around the regression line: High R2

## Residuals are small, standard errors are small

Weak Cause-effect Relationship

Points scatter widely around the regression line: Low R2

## Residuals are large, standard errors are large

Standard Errors and Residuals

Low Standard Error: high confidence that the parameter coefficient is well estimated
High Standard Error: low confidence that the parameter coefficient is well estimated

## The smaller the residuals, the smaller the standard errors and the better the quality of the regression
Sample Regression Line

Regression Equation:
y = A + Bx

y1 = A + Bx1 + e1
y2 = A + Bx2 + e2
y3 = A + Bx3 + e3
…
yn = A + Bxn + en

The ei are the residuals - easily calculated from the regression

## Estimate Standard Errors from RSS

SE(α), SE(β) can be found from RSS; the exact formulae are not important - reported by Excel, R…

The smaller the residuals, the smaller the standard errors and the better the quality of the regression
Standard Errors

SE(α) and SE(β): the standard deviations of the sampling distributions of the estimates A and B
Null Hypotheses

## What if the population parameter α were actually zero?

Call this the null hypothesis H0

Null Hypothesis: α = 0

Under H0, the sampling distribution is centred at 0, and our estimate A lies t x SE(α) away from that centre.

## If this were actually true, how likely is it that our sample regression would yield the estimate α = A?
Why Zero?

Sample Regression Line: y = A + Bx
Population Regression Line: y = α + βx

## If α = 0, it is adding no value in the regression line and should just be excluded

## The farther the estimate lies from the mean of 0, the more unlikely it is that α = 0
t-Statistics

t-stat(α) = A / SE(α)     t-stat(β) = B / SE(β)

Example: A lies 0.85 standard errors from 0, and B lies 9.01 standard errors from 0:
t-stat(α) = 0.85          t-stat(β) = 9.01

## We are now testing a hypothesis, that the population parameter is actually zero

## Is an individual estimate of α or β ‘adding value’ at all? High t-statistic => Yes

The higher the t-statistic of a coefficient, the higher our confidence in our estimate of that coefficient
p-Values

Example: t-stat(α) = 0.85 gives p-value(α) = 0.39; t-stat(β) = 9.01 gives p-value(β) = 2 x 10^-15 ~ 0

Low t-stat => high p-value     High t-stat => low p-value

## Is an individual estimate of α or β ‘adding value’ at all? Low p-value => Yes

The lower the p-value of a coefficient, the higher our confidence in our estimate of that coefficient
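A sketch of the t-stat-to-p-value step, using the normal approximation from the standard library (good for large samples; the exact t-distribution has slightly fatter tails, so the numbers differ a little from the slide's):

```python
from statistics import NormalDist

def p_value(t_stat):
    """Two-sided p-value under the normal approximation."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))

p_low = p_value(0.85)    # large p-value: weak evidence against the null
p_high = p_value(9.01)   # tiny p-value: strong evidence against the null
```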
## Standard Error of Regression (SER)

SER = sqrt(RSS / (n - 2))

n is the number of points in the regression.

## SER squared provides an unbiased estimator of the error variance σ2

RSS / σ2 ~ χ2

χ2 Distribution
Never mind the fine print about degrees of freedom for now
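The SER formula in code, on a handful of hypothetical residuals. The n - 2 divisor reflects the two parameters (A and B) estimated in a simple regression:

```python
import math

residuals = [0.5, -0.3, 0.2, -0.4, 0.1, -0.1]   # hypothetical residuals
n = len(residuals)
rss = sum(e * e for e in residuals)             # residual sum of squares
ser = math.sqrt(rss / (n - 2))                  # standard error of regression
```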
Null Hypotheses

## What if all population parameters were zero? i.e. β = α = 0

Call this the null hypothesis H0

Null Hypothesis: β = α = 0

## If this were actually true, how likely is it that our sample regression would yield the estimates β = B, α = A?

Why Zero?

Sample Regression Line: y = A + Bx
Population Regression Line: y = α + βx

## If α = β = 0, our regression line is not adding any value at all

F-Statistic

Is the regression as a whole ‘adding value’? High F-statistic => Yes

p-values and t-statistics tell us whether individual parameter coefficients are ‘good’
The F-statistic tells us whether an entire regression line is ‘good’
Regression Line For Males:   y = A1 + Bx
Regression Line For Females: y = A2 + Bx

## Combined Regression Line (no intercept):

y = A1D1 + A2D2 + Bx

D1 = 1 for males     D2 = 1 for females
   = 0 for females      = 0 for males

For males, D1 = 1, D2 = 0:
y = A1(1) + A2(0) + Bx
  = A1 + Bx

For females, D1 = 0, D2 = 1:
y = A1(0) + A2(1) + Bx
  = A2 + Bx

Given data with k groups, set up k-1 dummy variables and an intercept, or k dummy variables with no intercept
Regression Without Intercept

## Regression R2 can go negative; Excel, Python and R all adjust the R2 formula in this case

Excel and R usually agree; the Python statsmodels R2 sometimes differs
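A sketch of why the plain centred R2 can go negative without an intercept: force the line through the origin on data with a large mean, then apply the usual formula. The numbers are illustrative:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 10.5, 11.0, 11.5])   # large mean, shallow slope

b = float(x @ y) / float(x @ x)          # slope of y ~ b*x, no intercept
rss = float(((y - b * x) ** 2).sum())
tss = float(((y - y.mean()) ** 2).sum())
r2_centred = 1.0 - rss / tss             # plain formula, now negative
```

This is why statistics packages switch to an uncentred R2 (dividing by the raw sum of squares of y) when the intercept is dropped.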