Multiple Regression

Introducing Multiple Regression
Simple Regression
Cause (independent variable) -> Effect (dependent variable)
Example: Oil Prices -> Government Bond Yields
One cause, one effect

Multiple Regression
Causes (independent variables) -> Effect (dependent variable)
Example: Oil Prices, Government Bond Yields -> S&P 500 Share Index
Many causes, one effect

Simple Regression
[Figure: scatter plot of the points (x1, y1), (x2, y2), (x3, y3), … (xn, yn) with a fitted line]
Regression Line: y = A + Bx
Represent all n points as (xi, yi), where i = 1 to n

Multiple Regression
[Figure: 3-D scatter plot of the points (x1, y1, z1), (x2, y2, z2), (x3, y3, z3), … (xn, yn, zn) with a fitted plane]
Regression Plane: y = A + Bx + Cz
Represent all n points as (xi, yi, zi), where i = 1 to n
Multiple Regression
Causes: Dow Jones index, price of oil
Effect: Exxon stock
Multiple Regression
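Regression Equation:
EXXONt = A + B DOWt + C OILt

$$
\begin{bmatrix} E_1 \\ E_2 \\ E_3 \\ \vdots \\ E_n \end{bmatrix}
= A \begin{bmatrix} 1 \\ 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}
+ B \begin{bmatrix} D_1 \\ D_2 \\ D_3 \\ \vdots \\ D_n \end{bmatrix}
+ C \begin{bmatrix} O_1 \\ O_2 \\ O_3 \\ \vdots \\ O_n \end{bmatrix}
+ \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ \vdots \\ e_n \end{bmatrix}
$$

Ei = % return on Exxon stock on day i
Di = % return of Dow Jones index on day i
Oi = % change in price of oil on day i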
Multiple Regression
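y = A + Bx + Cz

$$
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}
= A \begin{bmatrix} 1 \\ 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}
+ B \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix}
+ C \begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ \vdots \\ z_n \end{bmatrix}
+ \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ \vdots \\ e_n \end{bmatrix}
$$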
Multiple Regression
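y = A + Bx + Cz

$$
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}
=
\begin{bmatrix} 1 & x_1 & z_1 \\ 1 & x_2 & z_2 \\ 1 & x_3 & z_3 \\ \vdots & \vdots & \vdots \\ 1 & x_n & z_n \end{bmatrix}
\begin{bmatrix} A \\ B \\ C \end{bmatrix}
+
\begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ \vdots \\ e_n \end{bmatrix}
$$

Dimensions, left to right: n rows x 1 column; n rows x 3 columns; 3 rows x 1 column; n rows x 1 column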
Multiple Regression
2 Causes (including the price of oil), 1 Effect
Multiple Regression
k Causes (price of oil, bond yields…), 1 Effect
Multiple Regression
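y = C1 + C2x1 + … + Ck+1xk

$$
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}
= C_1 \begin{bmatrix} 1 \\ 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}
+ C_2 \begin{bmatrix} x_{11} \\ x_{21} \\ x_{31} \\ \vdots \\ x_{n1} \end{bmatrix}
+ \dots
+ C_{k+1} \begin{bmatrix} x_{1k} \\ x_{2k} \\ x_{3k} \\ \vdots \\ x_{nk} \end{bmatrix}
+ \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ \vdots \\ e_n \end{bmatrix}
$$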
Multiple Regression
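y = C1 + C2x1 + … + Ck+1xk

$$
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}
=
\begin{bmatrix} 1 & x_{11} & \dots & x_{1k} \\ 1 & x_{21} & \dots & x_{2k} \\ 1 & x_{31} & \dots & x_{3k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \dots & x_{nk} \end{bmatrix}
\begin{bmatrix} C_1 \\ C_2 \\ \vdots \\ C_{k+1} \end{bmatrix}
+
\begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ \vdots \\ e_n \end{bmatrix}
$$

Dimensions, left to right: n rows x 1 column; n rows x (k+1) columns (a column of ones plus k explanatory variables); (k+1) rows x 1 column; n rows x 1 column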
Multiple Regression
y = C1 + C2x1 + … + Ck+1xk
Multiple regression involves finding

k+1 coefficients, k for the explanatory
variables, and 1 for the intercept
Estimation Methods in Multiple Regression
Method of moments
Method of least squares
Maximum likelihood estimation
The method of least squares works for

multiple regression too
y = C1 + C2x1 + … + Ck+1xk
The “best fit” line is the one where

the sum of the squares of the
lengths of the errors is minimised
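
The least-squares idea above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the function name fit_least_squares and the synthetic data are hypothetical and not part of the original slides.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return [C1, ..., Ck+1] minimising the sum of squared errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(y)), X])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs

# Synthetic data (an assumption for illustration): y = 1 + 2x - 3z + small noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
z = rng.normal(size=200)
y = 1.0 + 2.0 * x - 3.0 * z + 0.1 * rng.normal(size=200)

coeffs = fit_least_squares(np.column_stack([x, z]), y)
print(coeffs)  # estimates of A, B, C, close to the values used to build y
```

With two explanatory variables the fitted object is a plane rather than a line, but the criterion being minimised is the same.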
Risks in Multiple Regression
Simple and Multiple Regression
Simple Regression: data in 2 dimensions
Multiple Regression: data in more than 2 dimensions

Simple and Multiple Regression
Simple Regression: risks exist, but can usually be mitigated by analysing R2 and residuals
Multiple Regression: risks are more complicated, and require interpreting regression statistics
Risks in Simple Regression
No cause-effect relationship: regression on completely unrelated data series
Mis-specified relationship: non-linear (exponential or polynomial) fit
Incomplete relationship: multiple causes exist, we have captured just one

Diagnosing Risks in Simple Regression
No cause-effect relationship: low R2, plot of X ~ Y has no pattern
Mis-specified relationship: low R2, residuals are not independent of x
Incomplete relationship: high R2, residuals are not independent of each other

Mitigating Risks in Simple Regression
No cause-effect relationship: wrong choice of X and Y - back to drawing board
Mis-specified relationship: transform X and Y - convert to logs or returns
Incomplete relationship: add X variables (move to multiple regression)
The big new risk with multiple
regression is multicollinearity: X
variables containing the same
information
Multiple Regression
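y = C1 + C2x1 + … + Ckxk-1

$$
\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}}_{n \times 1}
=
\underbrace{\begin{bmatrix} 1 & x_{11} & \dots & x_{1,k-1} \\ 1 & x_{21} & \dots & x_{2,k-1} \\ 1 & x_{31} & \dots & x_{3,k-1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \dots & x_{n,k-1} \end{bmatrix}}_{n \times k}
\underbrace{\begin{bmatrix} C_1 \\ C_2 \\ \vdots \\ C_k \end{bmatrix}}_{k \times 1}
$$

Dimensions, left to right: n rows x 1 column; n rows x k columns; k rows x 1 column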
Multiple Regression
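y = C1 + C2x1 + … + Ckxk-1

$$
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}
=
\begin{bmatrix} 1 & x_{11} & \dots & x_{1,k-1} \\ 1 & x_{21} & \dots & x_{2,k-1} \\ 1 & x_{31} & \dots & x_{3,k-1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \dots & x_{n,k-1} \end{bmatrix}
\begin{bmatrix} C_1 \\ C_2 \\ \vdots \\ C_k \end{bmatrix}
$$

The columns of the data matrix are labelled X1 … Xk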
Bad News: Multicollinearity Detected
[Figure: X1 plotted against Xk, High R2]
Highly correlated explanatory variables

Good News: No Multicollinearity Detected
[Figure: X1 plotted against Xk, Low R2]
Uncorrelated explanatory variables
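
One way to run the check above is to regress one explanatory variable on another and look at the R2 of that auxiliary regression. The sketch below assumes NumPy; the helper r_squared and the synthetic series named dow, nasdaq and oil are hypothetical stand-ins, not real market data.

```python
import numpy as np

def r_squared(X, y):
    """R2 of a least-squares regression of y on X (with an intercept)."""
    design = np.column_stack([np.ones(len(y)), X])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coeffs
    return 1.0 - residuals.var() / y.var()

# Synthetic returns (assumed for illustration only).
rng = np.random.default_rng(1)
dow = rng.normal(size=500)
nasdaq = 0.9 * dow + 0.1 * rng.normal(size=500)  # built to track dow closely
oil = rng.normal(size=500)                       # built to be unrelated to dow

r2_dow_nasdaq = r_squared(nasdaq, dow)  # high: bad news
r2_dow_oil = r_squared(oil, dow)        # low: good news
print(r2_dow_nasdaq, r2_dow_oil)
```

A high R2 between two candidate X variables suggests they carry much the same information.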

Multiple Regression
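EXXONt = A + B DOWt + C OILt

$$
\begin{bmatrix} E_1 \\ E_2 \\ E_3 \\ \vdots \\ E_n \end{bmatrix}
= A \begin{bmatrix} 1 \\ 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}
+ B \begin{bmatrix} D_1 \\ D_2 \\ D_3 \\ \vdots \\ D_n \end{bmatrix}
+ C \begin{bmatrix} O_1 \\ O_2 \\ O_3 \\ \vdots \\ O_n \end{bmatrix}
$$

Ei = % return on Exxon stock
Di = % return of Dow Jones
Oi = % change in price of oil

Good News: No Multicollinearity Detected
[Figure: DOW returns plotted against OIL, Low R2]
Uncorrelated explanatory variables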

Multiple Regression
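EXXONt = A + B DOWt + C NASDAQt

$$
\begin{bmatrix} E_1 \\ E_2 \\ E_3 \\ \vdots \\ E_n \end{bmatrix}
= A \begin{bmatrix} 1 \\ 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}
+ B \begin{bmatrix} D_1 \\ D_2 \\ D_3 \\ \vdots \\ D_n \end{bmatrix}
+ C \begin{bmatrix} N_1 \\ N_2 \\ N_3 \\ \vdots \\ N_n \end{bmatrix}
$$

Ei = % return on Exxon stock
Di = % return of Dow Jones
Ni = % return of NASDAQ index

[Figure: DOW plotted against NASDAQ, High R2]
Highly correlated explanatory variables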

Multicollinearity Kills Regression’s Usefulness
Explaining Variance: the R2 as well as the regression coefficients are not very reliable
Making Predictions: the regression model will perform poorly with out-of-sample data
Multicollinearity: Prevention and Cure
Common Sense: big-picture understanding of the data
Nuts and Bolts: setting up data right
Heavy Lifting: factor analysis, principal components analysis (PCA)

Common Sense
Think deeply about each x variable
Eliminate closely related ones
Dig down to underlying causes
Multiple Regression
Proposed Regression Equation:

EXXONt = A + B DOWt + C NASDAQt + D OILt
EXXONt = % return on Exxon stock on day t
DOWt = % return of Dow Jones index on day t
NASDAQt = % return of NASDAQ index on day t
OILt = % change in price of oil on day t
Common Sense
Dow Jones Industrial Average: 30 large-cap US stocks
NASDAQ 100 Index: 100 large tech stocks
Oil Prices: price of barrel of oil
Common Sense
Do we really need both Dow and NASDAQ returns as explanatory variables?
If yes - consider keeping one, and constructing a new explanatory variable of their difference
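
A small sketch of the difference-variable idea, assuming NumPy and synthetic returns. The series dow and nasdaq and the name spread are hypothetical; with real data the difference may still be somewhat correlated with the variable that is kept, so the check is worth repeating on the actual series.

```python
import numpy as np

# Synthetic returns (assumed): nasdaq moves with dow plus its own component.
rng = np.random.default_rng(2)
dow = rng.normal(size=1000)
nasdaq = dow + 0.5 * rng.normal(size=1000)

# Keep dow, and replace nasdaq with the difference between the two.
spread = nasdaq - dow

corr_before = np.corrcoef(dow, nasdaq)[0, 1]  # dow vs nasdaq
corr_after = np.corrcoef(dow, spread)[0, 1]   # dow vs the new variable
print(corr_before, corr_after)
```

In this construction the new variable keeps what is specific to NASDAQ and drops what it shares with the Dow.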
Common Sense
Dow Jones Industrial Average and Oil Prices (price of barrel of oil)
What underlying factors drive both US large-cap stocks and the price of oil?
GDP growth
Interest rates
US dollar strength
Seasonality
Multiple Regression
Original Regression Equation:
EXXONt = A + B DOWt + C NASDAQt + D OILt

Revised Regression Equation:
EXXONt = A + B DOWt + C INTERESTt + D GDPt
Nuts and Bolts
‘Standardise’ the variables
Rely on adjusted-R2, not plain R2
Set up dummy variables right
Distribute lags
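
As an illustration of the first item, standardising is commonly taken to mean subtracting each variable's mean and dividing by its standard deviation. The sketch below assumes that reading and assumes NumPy; the function name standardise and the sample numbers are hypothetical.

```python
import numpy as np

def standardise(X):
    """Rescale each column to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Two made-up columns on very different scales.
raw = np.array([[1.0, 200.0],
                [2.0, 180.0],
                [3.0, 260.0],
                [4.0, 240.0]])
scaled = standardise(raw)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After rescaling, the coefficients are expressed per standard deviation of each variable, which makes them easier to compare.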
Heavy Lifting
Find underlying factors that drive the correlated x variables
Principal Component Analysis (PCA) is a great tool
Multiple Regression

HOMEt = A + B 5-yrt + C 10-yeart + D 2-yeart + E 1-yeart + F 3-montht + G 1-dayt + …
HOMEt = % change in home prices in month t
5-yrt = yield on 5-year bond in month t
10-yeart = yield on 10-year bond in month t
…
[Figure: the explanatory variables Xi plotted against time]

Factor Analysis
3-month government bonds
1-year government bonds
5-year government bonds
1-day (overnight) money market
30-year government bonds
5-year swap rate (inter-bank)
Interest rates on a wide variety of fixed-income instruments
Factor Analysis on Interest Rates
Level: how high are interest rates?
Slope: how steep is the yield curve?
Twist: how convex is the yield curve?
Three uncorrelated factors explain most variation in all interest rates
The factors identified are guaranteed to be uncorrelated
However, they may not have an intuitive interpretation

Factor Analysis
Principal Component Analysis is one procedure for factor analysis
Dimensionality Reduction via Factor Analysis
20-dimensional data -> Factor Analysis -> 3-dimensional data
1 column for each interest rate out there -> 3 columns, 1 for each factor

$$
\begin{bmatrix} x_{11} & \dots & x_{1,k-1} \\ x_{21} & \dots & x_{2,k-1} \\ x_{31} & \dots & x_{3,k-1} \\ \vdots & & \vdots \\ x_{n1} & \dots & x_{n,k-1} \end{bmatrix}
\;\xrightarrow{\text{Factor Analysis}}\;
\begin{bmatrix} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \\ f_{31} & f_{32} & f_{33} \\ \vdots & \vdots & \vdots \\ f_{n1} & f_{n2} & f_{n3} \end{bmatrix}
$$
Factor Analysis is a dimensionality-
reduction technique to identify a
few underlying causes in data
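
The reduction from 20 columns to 3 can be sketched with a plain singular value decomposition, which is one standard way to compute principal components. This is an illustration under assumptions: NumPy is available, the function name principal_components is hypothetical, and the 20 "rates" are synthetic series built from a level component and a slope component rather than real yields.

```python
import numpy as np

def principal_components(X, n_factors):
    """Return factor scores and the share of variance each factor explains."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    # Rows of vt are the principal directions of the centred data.
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = centred @ vt[:n_factors].T
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_factors]

# Synthetic data (assumed): 20 rates driven by a level and a slope component.
rng = np.random.default_rng(3)
n = 400
level = rng.normal(size=n)
slope = rng.normal(size=n)
maturities = np.linspace(0.0, 1.0, 20)
rates = (level[:, None]
         + slope[:, None] * maturities[None, :]
         + 0.05 * rng.normal(size=(n, 20)))

factors, explained = principal_components(rates, 3)
factor_corr = np.corrcoef(factors, rowvar=False)
print(factors.shape)   # one column per factor
print(explained.sum()) # share of total variance kept by the 3 factors
```

The factor columns are uncorrelated by construction, so they can replace the original correlated columns in the regression.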
Multiple Regression
HOMEt = A + B 5-yrt + C 10-yeart + D 2-yeart +
E 1-yeart + F 3-montht + G 1-dayt + …
Principal Component Analysis
Revised Regression Equation:

HOMEt = A + B LEVELt + C SLOPEt + D TWISTt
Benefits of Multiple Regression
Simple Regression Is a Great Tool
Powerful: perfectly suited to two common use-cases
Versatile: easily extended to non-linear relationships
Deep: the first “crossover hit” from Machine Learning

Multiple Regression Is Even Better
Powerful: also controls for the effects of different causes
Versatile: also works with categorical data
Deep: especially if combined with factor analysis

Two Common Applications of Regression
Explaining Variance: how much variation in one data series is caused by another?
Making Predictions: how much does a move in one series impact another?
Controlling For Different Causes

EXXONt = A + B DOWt + C OILt
EXXONt = % return on Exxon stock on day t
DOWt = % return of Dow Jones index on day t
OILt = % change in price of oil on day t

All else being equal, how much will Exxon stock move by if oil prices increase by 1%?

EXXON? = A + B DOWt + C (OILt + 1%)
EXXONt = A + B DOWt + C OILt
Change in EXXON = EXXON? - EXXONt
Change in EXXON = C x 1%

Regression coefficients tell how
much y changes for a unit change
in each predictor, all others being
held constant
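
The same subtraction can be checked numerically. The coefficient values and inputs below are made up for illustration, and predict_exxon is a hypothetical helper that simply evaluates the regression equation.

```python
def predict_exxon(a, b, c, dow, oil):
    """Evaluate EXXON = A + B * DOW + C * OIL for given coefficients."""
    return a + b * dow + c * oil

# Hypothetical coefficients and inputs (not estimated from any data).
a, b, c = 0.1, 0.8, 0.3

base = predict_exxon(a, b, c, dow=1.0, oil=2.0)
bumped = predict_exxon(a, b, c, dow=1.0, oil=3.0)  # oil up by 1, dow unchanged

change = bumped - base
print(change)  # equals c, up to floating-point rounding
```

Because DOW is held fixed, the terms A and B x DOW cancel and only the oil coefficient remains.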



Interpreting the Results of a
Regression Analysis
Interpreting Results of a Simple Regression
R2: measures overall quality of fit - the higher the better (up to a point)
Residuals: check if regression assumptions are violated
Standard errors of individual coefficients are usually of little significance
Interpreting Results of a Multiple Regression
R2
Adjusted R2
Residuals
F-statistic
Standard Errors of coefficients
A Not-Very-Important Intermediate Step
Variance of the dependent variable can be decomposed into variance of the regression fitted values, and that of the residuals
e = y - y’
=> y = y’ + e
=> Variance(y) = Variance(y’ + e)
=> Variance(y) = Variance(y’) + Variance(e) + 2 x Covariance(y’,e)

A Leap of Faith
Covariance(y’,e) is always = 0
This is important - more on why in a bit
Variance(y) = Variance(y’) + Variance(e)
Variance Explained
Variance(y) = Variance(y’) + Variance(e)

Total Variance (TSS)
TSS = Variance(y)
A measure of how volatile the dependent variable is, and of how much it moves around

Explained Variance (ESS)
ESS = Variance(y’)
A measure of how volatile the fitted values are - these come from the regression line

Residual Variance (RSS)
RSS = Variance(e)
This is the variance in the dependent variable that cannot be explained by the regression

TSS = ESS + RSS

R2
R2 = ESS / TSS
The percentage of total variance explained by the regression. Usually, the higher the R2, the better the quality of the regression (upper bound is 100%)
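
The decomposition can be verified numerically for a least-squares fit that includes an intercept. The sketch assumes NumPy and uses synthetic data; all variable names are illustrative.

```python
import numpy as np

# Synthetic data (assumed): y depends on x and z plus noise.
rng = np.random.default_rng(4)
x = rng.normal(size=300)
z = rng.normal(size=300)
y = 0.5 + 1.5 * x - 2.0 * z + rng.normal(size=300)

# Least-squares fit with an intercept column.
design = np.column_stack([np.ones(300), x, z])
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ coeffs
residuals = y - fitted

tss = y.var()          # total variance
ess = fitted.var()     # explained variance
rss = residuals.var()  # residual variance
covariance = np.cov(fitted, residuals)[0, 1]
r_squared = ess / tss

print(tss, ess + rss)  # the two agree up to rounding
print(covariance)      # zero up to rounding
print(r_squared)
```

The covariance term vanishes because least-squares residuals are orthogonal to the fitted values, which is why TSS splits cleanly into ESS and RSS.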
R2 = ESS / TSS
In multiple regression, adding explanatory variables never decreases R2 (and almost always increases it), even if those variables are irrelevant and increase the danger of multicollinearity

Adjusted-R2
Adjusted-R2 = R2 adjusted by a penalty for adding irrelevant variables
Adjusted-R2 = 1 - (1 - R2) x (n - 1) / (n - k - 1), for n data points and k explanatory variables
Increases if irrelevant* variables are deleted
(*irrelevant variables = any group whose F-ratio < 1)
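
A short sketch of the penalty at work, using the standard adjusted-R2 formula. The function name adjusted_r2 and the R2 values are hypothetical numbers chosen for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for n observations and k explanatory variables
    (the intercept is not counted in k)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Suppose adding a third variable nudges plain R2 from 0.50 to 0.51.
before = adjusted_r2(0.50, n=11, k=2)
after = adjusted_r2(0.51, n=11, k=3)
print(before, after)  # adjusted R2 falls even though plain R2 rose
```

Here the small gain in R2 does not pay for the extra degree of freedom, so the adjusted figure drops.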

Extending Multiple Regression to
Categorical Variables
A Simple Regression
y = A + Bx
y = height of individual
x = average height of parents

[Figure: scatter plot of y against x, with male and female points marked]
Regression Line: y = A + Bx
Not a great fit - regression line is far from all points!

Regression Line For Males: y = A1 + Bx
We can easily plot a great fit for males…

Regression Line For Females: y = A2 + Bx
…and another great fit for females

Two lines - same slope, different intercepts (A1 and A2)
Adding A Dummy Variable
Regression Line For Males: y = A1 + Bx
Regression Line For Females: y = A2 + Bx
Combined Regression Line:
y = A1 + (A2 - A1)D + Bx
D = 0 for males
  = 1 for females

For males (D = 0):
y = A1 + (A2 - A1)(0) + Bx
  = A1 + Bx

For females (D = 1):
y = A1 + (A2 - A1)(1) + Bx
  = A2 + Bx
The data contained 2

groups, so we added 1
dummy variable
Given data with k groups, set up k-1
dummy variables, else
multicollinearity occurs
Dummy and Other Categorical Variables
Dummy Variables: binary - 0 or 1
Categorical Variables: finite set of values - e.g. days of week, months of year…
To include non-binary categorical variables, simply add more dummies
Testing for Seasonality

y = A + BQ1 + CQ2 + DQ3
y = average stock returns
Q1, Q2, Q3 = dummy variables for the quarter of the year
The data contains 4 groups, so we added 3 dummy variables
Q1 = 1 for Jan, Feb, Mar; = 0 for other quarters
Q2 = 1 for Apr, May, Jun; = 0 for other quarters
Q3 = 1 for Jul, Aug, Sep; = 0 for other quarters
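
A minimal sketch of how the three quarter dummies might be built from a month number, with the fourth quarter as the baseline group. The function name quarter_dummies is hypothetical.

```python
def quarter_dummies(month):
    """Return (Q1, Q2, Q3) for a month number 1-12.

    The fourth quarter is the baseline, so Oct-Dec map to (0, 0, 0).
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    quarter = (month - 1) // 3 + 1
    return tuple(1 if quarter == q else 0 for q in (1, 2, 3))

for month in (2, 5, 9, 12):
    print(month, quarter_dummies(month))
```

With this setup the intercept A is the average return in the fourth quarter, and B, C, D measure how the other quarters differ from it.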
Different Groups, Different Slopes
[Figure: scatter plot of y against x, with separate fitted lines for males and females]
Regression Line For Males: y = A1 + B1x
Regression Line For Females: y = A2 + B2x
Dummy variables can also be extended for use where groups have different slopes

Combined Regression Line:
y = A1 + (A2 - A1)D1 + B1x + (B2 - B1)D2
D1 = 0 for males, = 1 for females
D2 = 0 for males, = x for females

For males: D1 = 0, D2 = 0
y = A1 + (A2 - A1)(0) + B1x + (B2 - B1)(0)
  = A1 + B1x

For females: D1 = 1, D2 = x
y = A1 + (A2 - A1)(1) + B1x + (B2 - B1)x
  = A2 + B2x
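
The two-dummy setup can be tried on synthetic data. The sketch assumes NumPy; the intercepts and slopes a1, b1, a2, b2 are made-up values used to generate noise-free data, so the fit should recover them almost exactly.

```python
import numpy as np

# Synthetic heights (assumed): each group follows its own exact line.
rng = np.random.default_rng(5)
n = 200
x = rng.uniform(150.0, 190.0, size=n)
female = rng.integers(0, 2, size=n)  # 0 = male, 1 = female
a1, b1, a2, b2 = 40.0, 0.8, 50.0, 0.7
y = np.where(female == 1, a2 + b2 * x, a1 + b1 * x)

d1 = female      # shifts the intercept for females
d2 = female * x  # shifts the slope for females

design = np.column_stack([np.ones(n), d1, x, d2])
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, intercept_shift, slope, slope_shift = coeffs

print(intercept, slope)                                  # male line
print(intercept + intercept_shift, slope + slope_shift)  # female line
```

The fitted coefficients on D1 and D2 correspond to (A2 - A1) and (B2 - B1) in the combined equation.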
Dummy Variables
Dummy variable as X: linear regression
Dummy variable as Y: logistic regression

Normal Distribution
N(μ,σ)
Average (mean) is μ
Standard deviation is σ
Standard Errors
Sampling Distribution of α: centred at E(α) = A, with A - SE(α) and A + SE(α) marked
Sampling Distribution of β: centred at E(β) = B, with B - SE(β) and B + SE(β) marked
α is the population parameter, A is the sample parameter
β is the population parameter, B is the sample parameter
Standard error of a regression parameter is the standard deviation of the sampling distribution
Strong Cause-effect Relationship
High R2
Residuals are small, standard errors are small

Weak Cause-effect Relationship
Low R2
Residuals are large, standard errors are large

Standard Errors and Residuals
Low Standard Error: high confidence that the coefficient is well estimated
High Standard Error: low confidence that the coefficient is well estimated
The smaller the residuals, the smaller the standard errors and the better the quality of the regression
Sample Regression Line
y = A + Bx
y1 = A + Bx1 + e1
y2 = A + Bx2 + e2
y3 = A + Bx3 + e3
…
yn = A + Bxn + en
The terms e1 … en are the residuals

Residual Variance (RSS)
RSS = Variance(e)
Easily calculated from regression residuals

Estimate Standard Errors from RSS
SE(α), SE(β) can be found from RSS
Exact formulae are not important - reported by Excel, R…
The smaller the residuals, the smaller the standard errors and the better the quality of the regression
Null Hypothesis
What if the population parameter α were actually zero?
Call this the null hypothesis H0

Null Hypothesis: α = 0
[Figure: sampling distribution centred at E(α) = 0, with the estimate α = A lying t x SE(α) from the centre]
If this were actually true, how likely is it that our sample regression would yield the estimate α = A?

Why Zero?
Sample Regression Line: y = A + Bx
Population Regression Line: y = α + βx
If α = 0, it is adding no value in the regression line and should just be excluded
The farther the estimate lies from the mean, the more unlikely it is that α = 0
t-Statistics
Under the null hypothesis: E(α) = 0, E(β) = 0
The estimate A lies 0.85 x SE(α) from zero; the estimate B lies 9.01 x SE(β) from zero
t-stat(α) = A/SE(α) = 0.85
t-stat(β) = B/SE(β) = 9.01
We are now testing a hypothesis, that the population parameter is actually zero
Is an individual estimate of A or B ‘adding value’ at all?
High t-statistic => Yes

The higher the t-statistic of a coefficient (in absolute value), the higher our confidence in our estimate of that coefficient
p-Values
Under the null hypothesis: E(α) = 0, E(β) = 0
p-value(α) = 0.39: low t-stat (0.85), high p-value
p-value(β) = 2 x 10^-15 ~ 0: high t-stat (9.01), low p-value
Is an individual estimate of α or β ‘adding value’ at all?
Low p-value => Yes
The lower the p-value of a coefficient,
the higher our confidence in our
estimate of that coefficient
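
Standard errors, t-statistics and p-values are normally read off the output of Excel, R or a Python library, but a hand-rolled sketch shows where they come from. It assumes NumPy, uses the textbook ordinary-least-squares formulae, and approximates the p-value with a normal distribution instead of the exact t distribution; the function name coefficient_stats and the data are hypothetical.

```python
import math
import numpy as np

def coefficient_stats(X, y):
    """Return coefficients, standard errors, t-statistics and p-values."""
    design = np.column_stack([np.ones(len(y)), X])
    n, p = design.shape
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coeffs
    # Estimated error variance: sum of squared residuals / degrees of freedom.
    sigma2 = residuals @ residuals / (n - p)
    cov = sigma2 * np.linalg.inv(design.T @ design)
    std_errors = np.sqrt(np.diag(cov))
    t_stats = coeffs / std_errors
    # Two-sided p-value, normal approximation (an assumption; fine for large n).
    p_values = np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats])
    return coeffs, std_errors, t_stats, p_values

# Synthetic data (assumed): true intercept 0, true slope 2.
rng = np.random.default_rng(6)
x = rng.normal(size=500)
y = 0.0 + 2.0 * x + rng.normal(size=500)

coeffs, std_errors, t_stats, p_values = coefficient_stats(x, y)
print(t_stats)   # the slope's t-statistic is large
print(p_values)  # the slope's p-value is close to zero
```

The first entry of each array refers to the intercept and the second to the slope.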
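Standard Error of Regression (SER)

$$
\mathrm{SER} = \sqrt{\frac{\mathrm{RSS}}{n-2}}
$$

n is the number of points in the regression; RSS here is the sum of squared residuals.
The square of SER provides an unbiased estimator of error variance σ2

χ2 Distribution

$$
\frac{\mathrm{RSS}}{\sigma^2} \sim \chi^2
$$

Never mind the fine print about degrees of freedom for now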
Null Hypothesis
What if all population parameters were zero? i.e. β = α = 0
Call this the null hypothesis H0

Null Hypothesis: β = α = 0
[Figure: distribution of the F-statistic, peaked at β = 0, α = 0, with the estimate β = B, α = A marked]
If this were actually true, how likely is it that our sample regression would yield the estimate β = B, α = A?

Why Zero?
Sample Regression Line: y = A + Bx
Population Regression Line: y = α + βx
If α = β = 0, our regression line is not adding any value at all
The farther from the peak, the more unlikely that α = β = 0

F-Statistic
Does our regression as a whole ‘add value’ at all?
High F-statistic => Yes

p-values and t-statistics tell us whether individual parameter coefficients are ‘good’
The F-statistic tells us whether an entire regression line is ‘good’
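
A sketch of how an F-statistic might be computed by hand, assuming NumPy. It follows the conventional overall F-test, which asks whether all slope coefficients are jointly zero and leaves the intercept out of the hypothesis; that convention, the function name f_statistic and the synthetic data are assumptions rather than something taken from the slides.

```python
import numpy as np

def f_statistic(X, y):
    """Overall F-statistic for a regression of y on X with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    n, p = design.shape
    k = p - 1  # number of explanatory variables
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coeffs
    explained = np.sum((fitted - y.mean()) ** 2)  # explained sum of squares
    residual = np.sum((y - fitted) ** 2)          # residual sum of squares
    return (explained / k) / (residual / (n - k - 1))

# Synthetic data (assumed): one y that depends on X, one that does not.
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 2))
y_related = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=300)
y_unrelated = rng.normal(size=300)

f_related = f_statistic(X, y_related)
f_unrelated = f_statistic(X, y_unrelated)
print(f_related, f_unrelated)
```

The regression that really does explain y produces a far larger F-statistic than the one fitted to unrelated noise.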
Regression Line For Males: y = A1 + Bx
Regression Line For Females: y = A2 + Bx
Combined Regression Line (no intercept):
y = A1D1 + A2D2 + Bx
D1 = 1 for males, = 0 for females
D2 = 1 for females, = 0 for males

For males (D1 = 1, D2 = 0):
y = A1 x 1 + A2 x 0 + Bx
  = A1 + Bx

For females (D1 = 0, D2 = 1):
y = A1 x 0 + A2 x 1 + Bx
  = A2 + Bx
Given data with k groups, set up k-1
dummy variables and an intercept, or
k dummy variables with no intercept
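
Why k dummies plus an intercept cause trouble can be seen from the rank of the data matrix: the dummy columns add up to the column of ones, so the columns are no longer independent. The sketch assumes NumPy and uses six made-up observations.

```python
import numpy as np

# Made-up data: 0 = male, 1 = female, x = average height of parents.
groups = np.array([0, 1, 0, 1, 1, 0])
x = np.array([170.0, 160.0, 175.0, 158.0, 165.0, 180.0])

d1 = (groups == 0).astype(float)  # 1 for males
d2 = (groups == 1).astype(float)  # 1 for females
ones = np.ones(len(x))

trap = np.column_stack([ones, d1, d2, x])       # intercept + 2 dummies
with_intercept = np.column_stack([ones, d2, x]) # intercept + 1 dummy
no_intercept = np.column_stack([d1, d2, x])     # 2 dummies, no intercept

ranks = [np.linalg.matrix_rank(m) for m in (trap, with_intercept, no_intercept)]
print(ranks)  # the first matrix has 4 columns but does not reach rank 4
```

Both valid setups have as many independent columns as coefficients, while the first one does not, which is the multicollinearity the slides warn about.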
Regression Without Intercept
Regression R2 can go negative
Excel, Python and R all adjust the R2 formula in this case
Excel and R usually agree
Python statsmodels R2 sometimes differs
