
PREDICTIVE ANALYTICS USING REGRESSION
Sumeet Gupta
Associate Professor
Indian Institute of Management Raipur

Outline
Basic Concepts
Applications of Predictive Modeling
Linear Regression in One Variable using OLS
Multiple Linear Regression
Assumptions in Regression
Explanatory vs. Predictive Modeling
Performance Evaluation of Predictive Models
Practical Exercises
Case: Nils Baker
Case: Pedigree vs. Grit

BASIC CONCEPTS

Predictive Modeling: Applications


Predicting customer activity on credit cards from demographic and historical activity patterns
Predicting the time to failure of equipment based on utilization and environmental conditions
Predicting expenditures on vacation travel based on
historical frequent flyer data
Predicting staffing requirements at help desks based on
historical data and product and sales information
Predicting sales from cross-selling of products based on historical information
Predicting the impact of discounts on sales in retail outlets

Basic Concept: Relationships


Examples of relationships:
Sales and earnings
Cost and number produced
Microsoft and the stock market
Effort and results

Scatterplot
A picture to explore the relationship in bivariate data
Correlation r
Measures strength of the relationship (from −1 to 1)
Regression
Predicting one variable from the other

Basic Concept: Correlation


r = 1
A perfect straight line
tilting up to the right
r = 0
No overall tilt
No relationship?
r = −1
A perfect straight line
tilting down to the right

[Scatterplots of Y vs. X illustrating r = 1, r = 0, and r = −1]

Basic Concepts: Simple Linear Model


Linear Model for the Population
The foundation for statistical inference in regression
Observed Y is a straight line, plus randomness

Y = (α + β X) + ε
(α + β X): the population relationship, on average
ε: the randomness of individuals
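As a rough illustration of this population model, here is a minimal sketch (Python with NumPy assumed; the values of α, β, and σ are made up for illustration) that simulates data from Y = α + βX + ε and recovers the line by ordinary least squares:

    import numpy as np

    rng = np.random.default_rng(0)

    alpha, beta, sigma = 10.0, 2.5, 3.0     # hypothetical population parameters
    x = rng.uniform(0, 20, size=200)        # predictor values
    eps = rng.normal(0, sigma, size=200)    # randomness of individuals, mean 0
    y = alpha + beta * x + eps              # observed Y = population line + noise

    # Least-squares fit: slope and intercept estimated from the sample
    b, a = np.polyfit(x, y, deg=1)
    print(f"estimated intercept a = {a:.2f}, estimated slope b = {b:.2f}")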

Basic Concepts: Simple Linear Model


Time Spent vs. Internet Pages Viewed
Two measures of the abilities of 25 Internet sites
At the top right are eBay, Yahoo!, and MSN

Correlation is r = 0.964
Linear relationship: straight line with scatter
Increasing relationship: tilts up and to the right
Very strong positive association (since r is close to 1)
[Scatterplot: Minutes per person vs. Pages per person; eBay, Yahoo!, and MSN at the top right]

Basic Concepts: Simple Linear Model


Dollars vs. Deals
For mergers and acquisitions by investment bankers
244 deals worth $756 billion by Goldman Sachs

Correlation is r = 0.419
Positive association
Linear relationship: straight line with scatter
Increasing relationship: tilts up and to the right
[Scatterplot: Dollars (billions) vs. Deals]

Basic Concepts: Simple Linear Model


Interest Rate vs. Loan Fee
For mortgages
If the interest rate is lower, does the bank make it up with a higher loan fee?
Correlation is r = −0.890
Linear relationship: straight line with scatter
Decreasing relationship: tilts down and to the right
Strong negative association
[Scatterplot: Interest rate vs. Loan fee]

Basic Concepts: Simple Linear Model


Today's vs. Yesterday's Percent Change
Is there momentum?
If the market was up yesterday, is it more likely to be up today? Or is each day's performance independent?
Correlation is r = 0.11
No relationship? The tilt is neither up nor down
A weak relationship?
[Scatterplot: Today's change vs. Yesterday's change]

Basic Concepts: Simple Linear Model


Call Price vs. Strike Price
For stock options
Call Price is the price of the option contract to buy stock at the Strike Price
The right to buy at a lower strike price has more value
A nonlinear relationship: not a straight line, but a curved relationship
Correlation r = −0.895
A negative relationship: a higher strike price goes with a lower call price
[Scatterplot: Call Price vs. Strike Price]

Basic Concepts: Simple Linear Model


Output Yield vs. Temperature
For an industrial process
With an optimal temperature setting
A nonlinear relationship: not a straight line, but a curved relationship
Correlation r = 0.0155
r suggests no relationship, but the relationship is strong
It tilts neither up nor down
[Scatterplot: Yield of process vs. Temperature]

Basic Concepts: Simple Linear Model


Circuit Miles vs. Investment (lower left)
For telecommunications firms
A relationship with unequal variability
More vertical variation at the right than at the left
Variability is stabilized by taking logarithms (lower right)

Correlation r = 0.820 for the raw data; r = 0.957 after taking logarithms
[Two scatterplots: Circuit miles (millions) vs. Investment ($millions), and Log of miles vs. Log of investment]

Basic Concepts: Simple Linear Model


Price vs. Coupon Payment
For trading in the bond market
Bonds paying a higher coupon generally cost more

Two clusters are visible


Ordinary bonds (value is from coupon)
Inflation-indexed bonds (payout rises with inflation)
Correlation r = 0.950 for all bonds; r = 0.994 for ordinary bonds only
[Scatterplot: Bid price vs. Coupon rate]

Basic Concepts: Simple Linear Model


Cost vs. Number Produced
For a production facility
It usually costs more to produce more

An outlier is visible
A disaster (a fire at the factory)

The outlier: high cost, but few produced
With the outlier included: r = 0.623
With the outlier removed (more detail visible): r = 0.869
[Two scatterplots: Cost vs. Number produced, with and without the outlier]

Basic Concepts: OLS Modeling


Salary vs. Years Experience
For n = 6 employees
Linear (straight line) relationship
Increasing relationship
higher salary generally goes with higher experience

Experience (years):   15, 10, 20, 5, 15, 5
Salary ($thousand):   30, 35, 55, 22, 40, 27

Correlation r = 0.8667
[Scatterplot: Salary ($thousand) vs. Experience (years)]
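A minimal sketch (Python with NumPy assumed; the data are copied from this slide) that reproduces the correlation, and the least-squares intercept and slope discussed on the next slides:

    import numpy as np

    experience = np.array([15, 10, 20, 5, 15, 5])     # X, years
    salary     = np.array([30, 35, 55, 22, 40, 27])   # Y, $thousand

    r = np.corrcoef(experience, salary)[0, 1]
    slope, intercept = np.polyfit(experience, salary, deg=1)

    print(f"r = {r:.4f}")                  # about 0.8667
    print(f"intercept = {intercept:.2f}")  # about 15.32
    print(f"slope = {slope:.3f}")          # about 1.673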

Basic Concepts: OLS Modeling


The least-squares line summarizes bivariate data: it predicts Y from X with the smallest errors (in the vertical direction, for the Y axis)
Intercept is 15.32 (predicted salary, in $thousand, at 0 years of experience)
Slope is 1.673 (additional salary, in $thousand, for each additional year of experience, on average)

[Scatterplot with the least-squares line: Salary (Y) vs. Experience (X)]

Basic Concepts: OLS Modeling


Predicted Value comes from Least-Squares Line
For example, Mary (with 20 years of experience)
has a predicted salary of 15.32 + 1.673(20) = 48.8
So does anyone with 20 years of experience

Residual is actual Y minus predicted Y
Mary's residual is 55 − 48.8 = 6.2
She earns about $6,200 more than the predicted salary for a person with 20 years of experience
A person who earns less than predicted will have a negative residual

Basic Concepts: OLS Modeling


[Scatterplot: Salary vs. Experience with the least-squares line. Mary earns 55 thousand; her predicted value is 48.8; her residual is 6.2]

Basic Concepts: OLS Modeling


Standard Error of Estimate
S_e = S_Y × sqrt[ (1 − r²)(n − 1) / (n − 2) ]
Approximate size of prediction errors (residuals)
Actual Y minus predicted Y: Y − [a + bX]
Example (Salary vs. Experience):
S_e = 11.686 × sqrt[ (1 − 0.8667²)(6 − 1) / (6 − 2) ] = 6.52

Predicted salaries are about 6.52 (i.e., $6,520) away from actual
salaries
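To make the formula concrete, a small sketch (Python with NumPy assumed; the salary data from the earlier slide) computes S_e both from the formula and directly from the residuals:

    import numpy as np

    x = np.array([15, 10, 20, 5, 15, 5])
    y = np.array([30, 35, 55, 22, 40, 27])
    n = len(y)

    r = np.corrcoef(x, y)[0, 1]
    s_y = y.std(ddof=1)                                    # S_Y = 11.686
    se_formula = s_y * np.sqrt((1 - r**2) * (n - 1) / (n - 2))

    b, a = np.polyfit(x, y, deg=1)
    residuals = y - (a + b * x)
    se_direct = np.sqrt((residuals**2).sum() / (n - 2))

    print(f"Se (formula)   = {se_formula:.2f}")            # about 6.52
    print(f"Se (residuals) = {se_direct:.2f}")             # same value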


Basic Concepts: OLS Modeling


Interpretation: similar to standard deviation
Can move Least-Squares Line up and down by Se
About 68% of the data are within one standard error of estimate of the least-squares line (for a bivariate normal distribution)
[Scatterplot: Salary vs. Experience, with the least-squares line shifted up and down by one S_e]

Multiple Linear Regression


Linear Model for the Population
Y = (α + β1 X1 + β2 X2 + … + βk Xk) + ε
= (Population relationship) + Randomness
where ε has a normal distribution with mean 0 and constant standard deviation σ, and this randomness is independent from one case to another
An assumption needed for statistical inference


Multiple Linear Regression: Results


Intercept: a
Predicted value for Y when every X is 0
Regression Coefficients: b1, b2, …, bk
The effect of each X on Y, holding all other X variables constant
Prediction Equation or Regression Equation
(Predicted Y) = a + b1 X1 + b2 X2 + … + bk Xk
The predicted Y, given the values for all X variables
Prediction Errors or Residuals
(Actual Y) − (Predicted Y)
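These quantities (and the tests on the following slides) are what standard regression software reports. A minimal sketch with statsmodels and pandas (assuming those libraries; the file name and column names are hypothetical stand-ins for the magazine data used later):

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical file standing in for the magazine dataset
    df = pd.read_csv("magazines.csv")   # assumed columns: PageCosts, Audience, PercentMale, MedianIncome

    X = sm.add_constant(df[["Audience", "PercentMale", "MedianIncome"]])  # adds the intercept term a
    y = df["PageCosts"]

    model = sm.OLS(y, X).fit()

    print(model.params)        # a, b1, b2, b3
    print(model.summary())     # R2, F test, t tests, standard errors of the coefficients
    pred = model.predict(X)    # predicted Y for every case
    residuals = y - pred       # prediction errors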


Multiple Linear Regression: Results


t Tests for Individual Regression Coefficients
Significant or not significant, for each X variable
Tests whether a particular X variable has an effect on Y, holding the
other X variables constant
Should be performed only if the F test is significant
Standard Errors of the Regression Coefficients
S_b1, S_b2, …, S_bk (with n − k − 1 degrees of freedom)
Indicate the estimated sampling standard deviation of each regression coefficient
Used in the usual way to find confidence intervals and hypothesis tests for individual regression coefficients


Multiple Linear Regression: Results


Predicted Page Costs for Audubon
= a + b1 X1 + b2 X2 + b3 X3
= $4,043 + 3.79(Audience) − 124(Percent Male) + 0.903(Median Income)
= $4,043 + 3.79(1,645) − 124(51.1) + 0.903(38,787)
= $38,966
Actual Page Costs are $25,315
Residual is $25,315 − $38,966 = −$13,651
Audubon has Page Costs $13,651 lower than you would expect for a magazine with its characteristics (Audience, Percent Male, and Median Income)
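The same arithmetic as a small Python sketch (coefficients and Audubon's values copied from this slide):

    a, b1, b2, b3 = 4043, 3.79, -124, 0.903                  # intercept and coefficients
    audience, pct_male, median_income = 1645, 51.1, 38787    # Audubon's characteristics

    predicted = a + b1 * audience + b2 * pct_male + b3 * median_income
    actual = 25315
    residual = actual - predicted

    print(f"predicted page costs = ${predicted:,.0f}")   # about $38,966
    print(f"residual = ${residual:,.0f}")                # about -$13,651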


Standard Error
Standard Error of Estimate Se
Indicates the approximate size of the prediction errors
About how far are the Y values from their predictions?
For the magazine data
Se = $21,578
Actual Page Costs are about $21,578 from their predictions for this group of magazines (using regression)
Compare to SY = $45,446: Actual Page Costs are about $45,446 from their average (not using regression)
Using the regression equation to predict Page Costs (instead of simply using the mean of Y), the typical error is reduced from $45,446 to $21,578


Coeff. of Determination
The strength of association is measured by the square of the multiple
correlation coefficient, R2, which is also called the coefficient of
multiple determination.

R² = SS_reg / SS_y
R² is adjusted for the number of independent variables and the sample size by using the following formula:
Adjusted R² = R² − k(1 − R²) / (n − k − 1)
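A small sketch (plain Python; R², n, and k are taken from the magazine example on the surrounding slides) showing the adjustment:

    def adjusted_r2(r2: float, n: int, k: int) -> float:
        """Adjusted R2 = R2 - k(1 - R2)/(n - k - 1)."""
        return r2 - k * (1 - r2) / (n - k - 1)

    # Magazine data: R2 = 0.787 with n = 55 magazines and k = 3 predictors
    print(round(adjusted_r2(0.787, 55, 3), 3))   # about 0.774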


Coeff. of Determination
Coefficient of Determination R2
Indicates the percentage of the variation in Y that is explained by
(or attributed to) all of the X variables
How well do the X variables explain Y?
For the magazine data
R2 = 0.787 = 78.7%
The X variables (Audience, Percent Male, and Median Income) taken together explain 78.7% of the variance of Page Costs
This leaves 100% − 78.7% = 21.3% of the variation in Page Costs unexplained


The F test
Is the regression significant?
Do the X variables, taken together, explain a significant amount of
the variation in Y?
The null hypothesis claims that, in the population, the X variables
do not help explain Y; all coefficients are 0
H0: β1 = β2 = … = βk = 0
The research hypothesis claims that, in the population, at least one of the X variables does help explain Y
H1: At least one of β1, β2, …, βk ≠ 0


The F test
H0: R²_pop = 0
This is equivalent to the following null hypothesis:
H0: β1 = β2 = β3 = … = βk = 0
The overall test can be conducted by using an F statistic:
F = [SS_reg / k] / [SS_res / (n − k − 1)] = [R² / k] / [(1 − R²) / (n − k − 1)]
which has an F distribution with k and (n − k − 1) degrees of freedom.
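A minimal sketch (Python with SciPy assumed) that computes the F statistic from R² and its p-value, using the magazine figures discussed on the following slides:

    from scipy import stats

    def f_statistic(r2: float, n: int, k: int) -> float:
        """F = (R2/k) / ((1 - R2)/(n - k - 1))."""
        return (r2 / k) / ((1 - r2) / (n - k - 1))

    r2, n, k = 0.787, 55, 3
    F = f_statistic(r2, n, k)
    p = stats.f.sf(F, k, n - k - 1)      # right-tail probability of the F distribution

    print(f"F = {F:.2f}, p = {p:.6f}")   # F is about 62.8, p well below 0.001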


Performing the F test


Three equivalent methods for performing the F test; they always give the same result
Use the p-value
If p < 0.05, then the test is significant
Same interpretation as p-values in Chapter 10
Use the R² value
If R² is larger than the value in the R² table, then the result is significant
Do the X variables explain more than just randomness?
Use the F statistic
If the F statistic is larger than the value in the F table, then the result is significant


Example: F test
For the magazine data, the X variables (Audience, Percent Male, and Median Income) explain a very highly significant percentage of the variation in Page Costs
The p-value, listed as 0.000, is less than 0.0005, and is therefore very highly significant (since it is less than 0.001)
The R² value, 78.7%, is greater than 27.1% (from the R² table at level 0.1% with n = 55 and k = 3), and is therefore very highly significant
The F statistic, 62.84, is greater than the value (between 7.054 and 6.171) from the F table at level 0.1%, and is therefore very highly significant


t Tests
A t test for each regression coefficient
To be used only if the F test is significant
If F is not significant, you should not look at the t tests

Does the jth X variable have a significant effect on Y, holding the other X variables constant?
Hypotheses are
H0: βj = 0
H1: βj ≠ 0
Test using the confidence interval
bj ± t × S_bj (use the t table with n − k − 1 degrees of freedom)
Or use the t statistic
t statistic = bj / S_bj (compare to the t table value with n − k − 1 degrees of freedom)
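A small sketch (Python with SciPy assumed; bj and S_bj are placeholders for a coefficient and its standard error taken from the regression output) for the t statistic and the confidence interval:

    from scipy import stats

    def t_test(b_j: float, s_bj: float, n: int, k: int, conf: float = 0.95):
        """Return the t statistic, two-sided p-value, and confidence interval for one coefficient."""
        df = n - k - 1
        t_stat = b_j / s_bj
        p_value = 2 * stats.t.sf(abs(t_stat), df)
        t_crit = stats.t.ppf(1 - (1 - conf) / 2, df)     # t table value
        ci = (b_j - t_crit * s_bj, b_j + t_crit * s_bj)
        return t_stat, p_value, ci

    # Illustrative numbers only (Audience coefficient from the magazine example: b1 = 3.79, t = 13.5)
    print(t_test(b_j=3.79, s_bj=3.79 / 13.5, n=55, k=3))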


Example: t Tests
Testing b1, the coefficient for Audience
b1 = 3.79, t = 13.5, p = 0.000
Audience has a very highly significant effect on Page Costs, after adjusting for Percent Male and Median Income
Testing b2, the coefficient for Percent Male
b2 = −124, t = −0.90, p = 0.374
Percent Male does not have a significant effect on Page Costs, after adjusting for Audience and Median Income
Testing b3, the coefficient for Median Income
b3 = 0.903, t = 2.44, p = 0.018
Median Income has a significant effect on Page Costs, after adjusting for Audience and Percent Male


Assumptions in Regression
Assumptions underlying the statistical techniques should be tested twice:
First for the separate variables
Second for the multivariate model variate, which acts collectively for the variables in the analysis and thus must meet the same assumptions as the individual variables; this differs across multivariate techniques

Assumptions in Regression
Linearity
The independent variable has a linear relationship with the dependent
variable
Normality
The residuals or the dependent variable follow a normal distribution
Multicollinearity
When some X variables are too similar to one another
Homoskedasticity
The variability in Y values for a given set of predictors is the same
regardless of the values of the predictors
Independence among cases (Absence of correlated errors)
The cases are independent of each other


Assumptions in Regression
Normality
The residuals or the dependent variable follow a normal distribution
If the deviation from normality is significant, then all statistical tests are invalid
Graphical Analysis
Histogram and normal probability plot
Peaked and skewed distributions result in non-normality
Statistical Analysis
If the Z value exceeds the critical value, then the distribution is non-normal
Kolmogorov-Smirnov test; Shapiro-Wilk test (a sketch follows below)
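A minimal sketch (Python with NumPy and SciPy assumed; the residuals array is a placeholder for the residuals from your fitted regression) of the two statistical tests named above:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    residuals = rng.normal(0, 1, size=200)   # placeholder: use your regression residuals here

    # Shapiro-Wilk test: a small p-value casts doubt on the normality assumption
    w_stat, p_shapiro = stats.shapiro(residuals)

    # Kolmogorov-Smirnov test against a normal with the sample mean and SD
    z = (residuals - residuals.mean()) / residuals.std(ddof=1)
    ks_stat, p_ks = stats.kstest(z, "norm")

    print(f"Shapiro-Wilk p = {p_shapiro:.3f}, K-S p = {p_ks:.3f}")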


Assumptions in Regression
Normality
[Figure]

Assumptions in Regression
Homoskedasticity
Assumption related primarily to dependence relationships between variables
The dependent variable(s) should exhibit equal levels of variance across the range of predictor variable(s)
The variance of the dependent variable should not be concentrated in only a limited range of the independent values
Sources of heteroskedasticity: the type of variable; a skewed distribution

Assumptions in Regression
Homoskedasticity
Graphical Analysis
Analysis of residuals in the case of regression
Statistical Analysis
Variances within groups formed by non-metric variables
Levene test
Box's M test (a sketch of the Levene test follows below)
Remedy
Data Transformation


Assumptions in Regression
Homoskedasticity
Graphical Analysis
[Figure: residual plots]

Assumptions in Regression
Linearity
Assumption for all multivariate techniques based on correlational measures, such as
multiple regression,
logistic regression,
factor analysis, and
structural equation modeling
Correlation represents only the linear association between variables
Identification
Scatterplots or examination of residuals from a regression
Remedy
Data Transformations

Assumptions in Regression
Linearity
[Figure]

Assumptions in Regression
Absence of Correlated Errors
Prediction errors should not be correlated with each other
Identification
The most likely cause is the data collection process, such as two separate groups in the data collection process
Remedy
Include the omitted causal factor in the multivariate analysis

Assumptions in Regression
Multicollinearity
Multicollinearity arises when intercorrelations among the predictors are very high.
Multicollinearity can result in several problems, including:
The partial regression coefficients may not be estimated precisely.
The standard errors are likely to be high.
The magnitudes as well as the signs of the partial regression
coefficients may change from sample to sample.
It becomes difficult to assess the relative importance of the
independent variables in explaining the variation in the dependent
variable.
Predictor variables may be incorrectly included or removed in
stepwise regression.


Assumptions in Regression
Multicollinearity
The ability of an independent variable to improve the prediction of the dependent variable is related not only to its correlation with the dependent variable, but also to the correlation(s) of the additional independent variable with the independent variable(s) already in the regression equation
Collinearity is the association, measured as the correlation, between two independent variables
Multicollinearity refers to the correlation among three or more independent variables
Impact
Reduces any single IV's predictive power by the extent to which it is associated with the other independent variables

Assumptions in Regression
Multicollinearity
Measuring Multicollinearity
Tolerance
Amount of variability of the selected independent variable not explained by the other independent variables
Tolerance values should be high
The cut-off is 0.1, but values greater than 0.5 give better results
VIF
The inverse of Tolerance (VIF = 1 / Tolerance)
Should be low (below the common cut-off of 10; values below 2.0 are better); a sketch of computing VIF follows below
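A minimal sketch (Python with pandas and statsmodels assumed; the file name and columns are hypothetical stand-ins for your set of independent variables) computing VIF and Tolerance for each predictor:

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # X is assumed to be a DataFrame holding only the independent variables
    X = pd.read_csv("magazines.csv")[["Audience", "PercentMale", "MedianIncome"]]

    exog = sm.add_constant(X)                       # constant included so VIFs are computed correctly
    for i, name in enumerate(X.columns, start=1):   # skip the constant at position 0
        vif = variance_inflation_factor(exog.values, i)
        tolerance = 1.0 / vif                       # Tolerance is the inverse of VIF
        print(f"{name}: VIF = {vif:.2f}, Tolerance = {tolerance:.2f}")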


Assumptions in Regression
Multicollinearity
Remedy for Multicollinearity
A simple procedure for adjusting for multicollinearity consists of
using only one of the variables in a highly correlated set of
variables.
Omit highly correlated independent variables and identify other
independent variables to help the prediction
Alternatively, the set of independent variables can be transformed into
a new set of predictors that are mutually independent by using
techniques such as principal components analysis.
More specialized techniques, such as ridge regression and latent root
regression, can also be used.


Assumptions in Regression
Data Transformations
To correct violations of the statistical assumptions underlying the multivariate techniques
To improve the relationship between variables
Transformations to achieve Normality and Homoscedasticity
Flat distribution: inverse transformation
Negatively skewed distribution: square root transformation
Positively skewed distribution: logarithmic transformation
If the residuals in regression are cone shaped, then
Cone opens to the right: inverse transformation
Cone opens to the left: square root transformation
(a sketch of these transformations follows below)
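A small sketch (Python with NumPy and SciPy assumed; x is a hypothetical, strictly positive variable) of the three transformations named above:

    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(3)
    x = rng.lognormal(mean=0.0, sigma=0.8, size=500)   # positively skewed, strictly positive

    log_x     = np.log(x)       # logarithmic transformation (positively skewed data)
    sqrt_x    = np.sqrt(x)      # square root transformation (negatively skewed data, per the slide)
    inverse_x = 1.0 / x         # inverse transformation (flat distributions / cone opening right)

    print(f"skewness before: {skew(x):.2f}, after log: {skew(log_x):.2f}")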


Assumptions in Regression
Data Transformations
Transformation to achieve Linearity
[Figure]

Assumptions in Regression
Data Transformations
[Figure]

Assumptions in Regression
General guidelines for transformation
For a noticeable effect of transformation, the ratio of a variable's mean to its standard deviation should be less than 4.0
When the transformation can be performed on either of two variables, select the one with the smaller mean/SD ratio
Transformations should be applied to the independent variables, except in the case of heteroscedasticity
Heteroscedasticity can only be remedied by transformation of the dependent variable in a dependence relationship
If the heteroscedastic relationship is also nonlinear, the dependent variable, and perhaps the independent variables, must be transformed
Transformations may change the interpretation of the variables


Issues in Regression
Variable Selection
How to choose from a long list of X variables?
Too many: waste the information in the data
Too few: risk ignoring useful predictive information

Model Misspecification
Perhaps the multiple regression linear model is wrong
Unequal variability? Nonlinearity? Interaction?

EXPLANATORY VS. PREDICTIVE MODELING

Explanatory Vs Predictive Modeling


An explanatory model fits the data closely, whereas a good predictive model predicts new cases accurately
Explanatory models use the entire dataset to estimate the best-fit model and to maximize explained variance (R²)
Predictive models estimate the model on a training set and assess it on new, unobserved data (a sketch follows below)
Performance measures for explanatory models measure how closely the data fit the model, whereas for predictive models performance is measured by predictive accuracy
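A minimal sketch (Python with scikit-learn and NumPy assumed; the feature matrix X and target y are placeholders) of the predictive-modeling workflow: estimate on a training set, then assess on held-out data:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Placeholder data: replace with your own X (predictors) and y (outcome)
    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = LinearRegression().fit(X_train, y_train)                   # estimated on the training set only
    print("R2 on training data:", model.score(X_train, y_train))      # explanatory fit
    print("R2 on new, unobserved data:", model.score(X_test, y_test)) # predictive accuracy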

Performance Evaluation
Prediction error for observation i = actual y value − predicted y value
Popular numerical measures of predictive accuracy:
MAE or MAD (Mean Absolute Error / Deviation)
Average Error
MAPE (Mean Absolute Percentage Error)

Performance Evaluation
RMSE (Root Mean Squared Error)
Total SSE (Total Sum of Squared Errors)
(a sketch computing these measures follows below)
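A minimal sketch (Python with NumPy assumed; the actual and predicted arrays are hypothetical values for a validation set) computing the measures listed above:

    import numpy as np

    actual    = np.array([120, 150, 90, 200, 170], dtype=float)   # hypothetical actual y values
    predicted = np.array([110, 160, 95, 210, 150], dtype=float)   # hypothetical predictions

    errors = actual - predicted                      # prediction error for each observation

    mae  = np.mean(np.abs(errors))                   # MAE / MAD
    avg_error = np.mean(errors)                      # Average Error (signed)
    mape = np.mean(np.abs(errors / actual)) * 100    # MAPE, in percent
    rmse = np.sqrt(np.mean(errors ** 2))             # RMSE
    sse  = np.sum(errors ** 2)                       # Total SSE

    print(mae, avg_error, mape, rmse, sse)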

CASE

Case: Pedigree vs. Grit


Why does a low R² not make the regression useless?
Describe a situation in which a useless regression has a high R².
Check the validity of the linear regression model assumptions.
Estimate the excess returns of Bob's and Putney's funds. Between them, who is expected to obtain higher returns at their current funds, and by how much?
If hired by the firm, who is expected to obtain higher returns, and by how much?
Can you prove at the 5% level of significance that Bob would get higher expected returns if he had attended Princeton instead of Ohio State?
Can you prove at the 10% level of significance that Bob would get at least 1% higher expected returns by managing a growth fund?
Is there strong evidence that fund managers with an MBA perform worse than fund managers without an MBA? What is held constant in this comparison?
Based on your analysis of the case, which candidate do you support for AMBTPM's job opening: Bob or Putney? Discuss.

Case: Nils Baker


Is the presence of a physical bank branch creating demand for checking accounts?

Thank You
