
Predictive Analytics : QM901.1x
Prof U Dinesh Kumar, IIMB

All Rights Reserved, Indian Institute of Management Bangalore


If you torture the data long enough, it will confess!


- Ronald Coase


Interesting Hypotheses

Good-looking couples are more likely to have girl child(ren)!
Vegetarians miss fewer flights.
Women use camera phones more than men.
Left-handed men earn more money!
Smokers are better salespeople.
Those who whistle at the workplace are more efficient.


What is Regression?
Regression is a tool for finding the existence of an association
relationship between a dependent variable (Y) and one or
more independent variables ($X_1, X_2, \dots, X_n$) in a study.
The relationship can be linear or non-linear.


Mathematical Vs Statistical Relationship


Mathematical relationship is an exact relationship.

$Y = \beta_0 + \beta_1 X$

Statistical relationship is not an exact relationship.

$Y = \beta_0 + \beta_1 X + \varepsilon$


Nomenclature in Regression
A dependent variable (response variable) measures an outcome
of a study (also called outcome variable).
An independent variable (explanatory variable) explains
changes in a response variable.
Regression studies often set values of the explanatory variable to see how it
affects the response variable (that is, to predict the response variable).


A regression model establishes the existence of an association between two variables, but not causation.


Regression Nomenclature

Names used for the dependent variable: Dependent Variable, Explained Variable, Regressand, Predictand, Endogenous Variable, Controlled Variable, Target Variable, Response Variable.

Names used for the independent variable: Independent Variable, Explanatory Variable, Regressor, Predictor, Exogenous Variable, Control Variable, Stimulus Variable.

Dependent and Independent Variables

The terms dependent and independent do not necessarily imply a causal relationship between two variables.
Regression is not designed to capture causality.

The purpose of regression is to predict the value of the dependent variable given the value(s) of the independent variable(s).


Why do we need Regression?

Companies would like to know about factors that have a significant impact on their Key Performance Indicators (KPIs).

Regression helps to create new hypotheses that may assist companies in improving their performance.

Where is it used?

Finance: CAPM, non-performing assets, probability of default, chance of bankruptcy, credit risk.
Marketing: Sales, market share, customer satisfaction, customer churn, customer retention, customer lifetime value.
Operations: Inventory, productivity, efficiency.
HR: Job satisfaction, attrition.

Business Problems in Marketing/Retail

How to improve the success probability of a new product?


What is the impact of food label on purchase decision?
Which promotion is more effective?

Business Problems in Banking and Finance

What is the risk associated with a customer?


Which customer is likely to default?
What percentage of loans are likely to result in a loss?
How to identify the most profitable customer?


Types of Regression

Regression Models
  Simple Regression (one independent variable)
    Linear
    Non-linear
  Multiple Regression (more than one independent variable)
    Linear
    Non-linear

Types of Regression

Simple linear regression

$Y = \beta_0 + \beta_1 X_1 + \varepsilon$

Multiple linear regression

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$

Nonlinear regression

$Y = \beta_0 + \frac{1}{\beta_1 + \beta_2 X_1} + X_2^{\beta_3} + \varepsilon$

Multiple Linear Regression

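Multiple linear regression means linear in regression parameters (beta values). Both of the following models are linear in the beta values:

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_2^2 + \dots + \beta_k x_k + \varepsilon$

An important task in multiple regression is to estimate the beta values ($\beta_1, \beta_2, \beta_3$, etc.).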
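
A small sketch of what "linear in regression parameters" means. The helper names `features` and `predict` and every number below are made up for illustration; the regressors include an interaction term and a squared term, yet the prediction stays a plain weighted sum of the beta values.

```python
# Illustrative sketch: a model with x1*x2 and x2**2 terms is still
# linear in its parameters (the beta values). All numbers are made up.

def features(x1, x2):
    """Return the regressors [1, x1, x2, x1*x2, x2**2] for one observation."""
    return [1.0, x1, x2, x1 * x2, x2 ** 2]

def predict(betas, x1, x2):
    """Prediction is a weighted sum of the features, i.e. linear in betas."""
    return sum(b * f for b, f in zip(betas, features(x1, x2)))

betas_a = [1.0, 2.0, -1.0, 0.5, 0.25]   # hypothetical beta values
betas_b = [0.5, -1.0, 3.0, 1.5, -0.75]  # a second hypothetical set
betas_sum = [a + b for a, b in zip(betas_a, betas_b)]

# Linearity in parameters: predicting with (a + b) equals the sum of predictions.
lhs = predict(betas_sum, 2.0, 3.0)
rhs = predict(betas_a, 2.0, 3.0) + predict(betas_b, 2.0, 3.0)
print(lhs, rhs)
```

The same check would fail for a model such as $X^{\beta}$, where the parameter sits in an exponent.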

Regression Model Development


Regression Model Development


1. Explore the Data
2. Derive and Analyze Descriptive Statistics
3. Pre-process the Data
4. Define Functional Form of the Relationship
5. Estimate Regression Parameters
6. Perform Diagnostic Tests
7. Model Satisfies Diagnostic Test? NO: repeat the earlier steps. YES: STOP.
Functional Form

Identify the explanatory variable.


Specify the nature of relationship between dependent
variable and explanatory variables.

Linear Regression Model

Relationship between variables is a linear function.

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

$Y_i$: dependent (response) variable (e.g., treatment cost)
$X_i$: independent (explanatory) variable (e.g., body weight)
$\beta_0$: population Y-intercept
$\beta_1$: population slope
$\varepsilon_i$: random error

Model Assumptions

Linear Regression Model Assumptions

The error term, $\varepsilon_i$, follows a normal distribution.
For different values of X, the variance of $\varepsilon_i$ is constant (homoscedasticity).
There is no multicollinearity (no perfect linear relationship among explanatory variables).
There is no autocorrelation between two $\varepsilon_i$ values.

Estimation of Parameters
[Figure: a random sample is drawn from the population. The population relationship $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ is unknown and is estimated from the sample.]


Population Linear Regression Model

Observed value: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

$\varepsilon_i$ = random error

$E(Y \mid X) = \beta_0 + \beta_1 X_i$


Method of Ordinary Least Squares (OLS)

Least Squares Graphically

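LS minimizes $\sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \hat{\varepsilon}_1^2 + \hat{\varepsilon}_2^2 + \hat{\varepsilon}_3^2 + \dots + \hat{\varepsilon}_n^2$

[Figure: scatter plot with the fitted line $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$. The residuals $\hat{\varepsilon}_1, \dots, \hat{\varepsilon}_4$ are the vertical distances from the observed points to the line, for example $Y_2 = \hat{\beta}_0 + \hat{\beta}_1 X_2 + \hat{\varepsilon}_2$. Horizontal axis: X.]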
Estimation of Parameters in Regression

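The least squares function is given by:

$SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \right)^2$

The least squares estimates must satisfy:

$\frac{\partial SSE}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \sum_{j=1}^{k} \hat{\beta}_j x_{ij} \right) = 0$

and

$\frac{\partial SSE}{\partial \beta_j} = -2 \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \sum_{l=1}^{k} \hat{\beta}_l x_{il} \right) x_{ij} = 0, \quad j = 1, 2, \dots, k$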


Coefficient Equations
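Prediction Equation:

$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$

Sample Slope:

$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{\sum_i x_i y_i - n \bar{x} \bar{y}}{\sum_i x_i^2 - n \bar{x}^2}$

Sample Y-intercept:

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$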
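
A minimal sketch of the sample slope and Y-intercept formulas for simple linear regression. The function name `ols_simple` and the data values are hypothetical. The last lines check numerically that the least squares conditions hold: the residuals sum to zero and are orthogonal to x.

```python
# Sketch of the simple-regression coefficient formulas on made-up data.

def ols_simple(x, y):
    """Return (b0, b1) from the closed-form least squares formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: y_bar - b1 * x_bar
    b0 = y_bar - b1 * x_bar
    return b0, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical explanatory values
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical responses
b0, b1 = ols_simple(x, y)

# Equivalent "computational" form of the slope.
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1_alt = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / (
    sum(xi ** 2 for xi in x) - n * x_bar ** 2
)

# Least squares conditions: residuals sum to zero and are orthogonal to x.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sum_resid = sum(residuals)
sum_resid_x = sum(r * xi for r, xi in zip(residuals, x))
print(b0, b1, b1_alt, sum_resid, sum_resid_x)
```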

Why Least Squares Estimate?

OLS beta estimates are the Best Linear Unbiased Estimates (BLUE), provided the error terms are uncorrelated (no autocorrelation) and have equal variance (homoscedasticity). That is, the estimates are unbiased:

$E(\hat{\beta} - \beta) = 0$

Advantages of OLS Estimates

They are unbiased estimates.
They (estimates) have minimum variance.
They are consistent: as the sample size increases, the estimate $\hat{\beta}_i$ converges to the true population parameter value $\beta_i$.
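
Unbiasedness and consistency can be illustrated with a small simulation. This is a sketch under assumed values: the population parameters (`beta0=2.0`, `beta1=0.5`, `sigma=1.0`), the uniform range for X, and the sample sizes are all made up. The average estimated slope should land near the true slope, and the spread of the estimates should shrink as the sample size grows.

```python
import random
import statistics

def ols_slope(x, y):
    """Closed-form least squares slope for simple linear regression."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

def simulated_slopes(n, reps, rng, beta0=2.0, beta1=0.5, sigma=1.0):
    """Estimate the slope on `reps` random samples of size n (assumed population)."""
    slopes = []
    for _ in range(reps):
        x = [rng.uniform(0.0, 10.0) for _ in range(n)]
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        slopes.append(ols_slope(x, y))
    return slopes

rng = random.Random(42)  # fixed seed so the run is repeatable
small = simulated_slopes(n=20, reps=500, rng=rng)
large = simulated_slopes(n=500, reps=500, rng=rng)

mean_small = statistics.fmean(small)     # close to the true slope 0.5
spread_small = statistics.pstdev(small)  # spread of estimates for n = 20
spread_large = statistics.pstdev(large)  # smaller spread for n = 500
print(mean_small, spread_small, spread_large)
```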


Interpretation of Regression Coefficients

The interpretation depends on the functional form of the relationship between the response and the explanatory variables.

Coefficients Interpretation

The intercept, $\beta_0$, is the mean value of the dependent variable Y when the independent variable X = 0.

The slope, $\beta_1$, is the change in the value of the dependent variable, Y, for a unit change in the independent variable X.


Interpretation of the intercept $\beta_0$

The intercept, $\beta_0$, is the mean value of the dependent variable Y when the independent variable X = 0.

Y = 129110.79 + 1807.591 × Body Weight


Interpretation of the slope $\beta_1$

The slope, $\beta_1$, is the change in the value of the dependent variable, Y, for a unit change in the independent variable X.

Y = 129110.79 + 1807.591 × Body Weight
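
A short sketch of how the two coefficients are read off the fitted equation above. The coefficients come from the slide; the function name `predicted_cost` and the body weights 60 and 61 are hypothetical, and the slide does not state the units.

```python
# Coefficients taken from the fitted equation on the slide.
B0 = 129110.79   # intercept
B1 = 1807.591    # slope per unit of body weight

def predicted_cost(body_weight):
    """Predicted Y (treatment cost) for a given body weight."""
    return B0 + B1 * body_weight

# Intercept: predicted value when X = 0.
at_zero = predicted_cost(0)

# Slope: change in predicted Y for a one-unit change in X
# (the weights 60 and 61 are arbitrary illustration values).
diff = predicted_cost(61) - predicted_cost(60)
print(at_zero, diff)
```

A body weight of zero is not a realistic observation, so the intercept here is best read as the anchor of the fitted line rather than as a meaningful cost.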


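Interpretation of $\beta_0$ and $\beta_1$ in $\ln(Y) = \beta_0 + \beta_1 \ln(X)$

Differentiating the equation with respect to X, we get:

$\frac{1}{Y} \frac{dY}{dX} = \beta_1 \frac{1}{X}$, so that $\beta_1 = \frac{dY / Y}{dX / X}$

$\beta_1$ is the percentage change in Y for a one percent change in X.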
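
A numerical sketch of the log-log interpretation. The coefficient values and the starting point x = 100 are hypothetical. Raising X by 1 percent changes Y by approximately $\beta_1$ percent.

```python
import math

# Hypothetical coefficients of ln(Y) = b0 + b1 * ln(X).
b0, b1 = 1.0, 0.8

def y_of(x):
    """Y implied by the log-log model for a given X > 0."""
    return math.exp(b0 + b1 * math.log(x))

x = 100.0                      # arbitrary starting value
y_before = y_of(x)
y_after = y_of(x * 1.01)       # increase X by 1 percent

# Percentage change in Y; approximately b1 (here about 0.8 percent).
pct_change_y = (y_after / y_before - 1.0) * 100.0
print(pct_change_y)
```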



Simple Linear Regression

Assumption of the world: variables x and y have a linear relationship, $y = \beta_0 + \beta_1 x + \varepsilon$.
Fitting a model: minimize SSE.
Validating the model: Is x really related to y? Is $\beta_1$ statistically significant?
Using a model: predict y for a given x.
Objective Model Validation

Predicted value of Y without the model (without the knowledge of explanatory variables)
Observed value of Y
Predicted value of Y with the model (with the knowledge of explanatory variables)

Model Validation

Use of the coefficient of determination to check the goodness of fit of the regression.
Analysis of Variance (ANOVA) and F-test to check the overall fitness of the regression model.
t-test to validate the relationship between the dependent variable and each individual independent variable.
Residual analysis to check model adequacy.

What is coefficient of determination?

The coefficient of determination ($R^2$, or R-square) is a measure of how well the regression line fits the data.
The value of $R^2$ lies between 0 and 1 and is the proportion of variation in Y explained by the regression model (often quoted as a percentage).
$R^2$ is a rough indicator of the worth of the regression model.
In simple linear regression, $R^2$ is the square of the correlation coefficient r ($R^2 = r^2$).

Variation in Y

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

Variation in $Y_i$ = Systematic Variation + Random Variation

or

Variation in $Y_i$ = Explained Variation + Unexplained Variation


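Variation in Y

Total variation: $Y_i - \bar{Y}$
Explained variation: $\hat{Y}_i - \bar{Y}$
Unexplained variation: $Y_i - \hat{Y}_i$

$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$

SST = SSR + SSE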


TOTAL SUM OF SQUARES (SST):

$SST = \sum_i (Y_i - \bar{Y})^2$
How much error is there in predicting Y without the knowledge of X?

SUM OF SQUARES ERROR (SSE):

$SSE = \sum_i (Y_i - \hat{Y}_i)^2$
How much error is there in predicting Y with the knowledge of X?


SUM OF SQUARES REGRESSION (SSR):

$SSR = \sum_i (\hat{Y}_i - \bar{Y})^2$ (amount of variation explained by the model).

Mathematically,

SST = SSR + SSE

Coefficient of determination

Coefficient of determination is the ratio of the sum of squares due to regression to the total sum of squares.

$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
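
A minimal sketch that checks the decomposition SST = SSR + SSE and the two forms of $R^2$ on a small made-up data set (the x and y values are hypothetical). For a simple linear regression it also confirms that $R^2$ equals the squared correlation coefficient.

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical explanatory values
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical responses

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares fit and fitted values.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# Sums of squares.
sst = sum((yi - y_bar) ** 2 for yi in y)                # total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained

r2 = ssr / sst
r2_alt = 1.0 - sse / sst
r = sxy / math.sqrt(sxx * sst)   # correlation coefficient between x and y
print(sst, ssr, sse, r2, r ** 2)
```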

Spurious Regression

One of the major problems with the coefficient of determination is that two sets of data without any relationship can have a very high coefficient of determination value.


The data show the number of Facebook users (in millions) and the number of people who died of helium poisoning in the UK between 2004 and 2012.

Year    Facebook users (millions)    Deaths from helium poisoning
2004    (missing)                    (missing)
2005    (missing)                    (missing)
2006    12                           (missing)
2007    58                           (missing)
2008    145                          11
2009    360                          21
2010    608                          31
2011    845                          40
2012    1056                         51
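
A sketch of how high the coefficient of determination is for these two unrelated series. It uses only the 2008 to 2012 rows, the years for which both values are legible in the table above; the variable names are made up for illustration.

```python
# Rows for 2008 to 2012, the only years with both values present in the table.
facebook_users = [145.0, 360.0, 608.0, 845.0, 1056.0]   # millions
helium_deaths = [11.0, 21.0, 31.0, 40.0, 51.0]

n = len(facebook_users)
x_bar = sum(facebook_users) / n
y_bar = sum(helium_deaths) / n

sxx = sum((xi - x_bar) ** 2 for xi in facebook_users)
syy = sum((yi - y_bar) ** 2 for yi in helium_deaths)
sxy = sum((xi - x_bar) * (yi - y_bar)
          for xi, yi in zip(facebook_users, helium_deaths))

# For a simple linear regression, R^2 equals the squared correlation.
r_squared = sxy ** 2 / (sxx * syy)
print(r_squared)
```

Both series simply trend upward over the same years, which is enough to produce a near-perfect fit without any real relationship between them.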

Standard Error of Estimate

Standard error is the estimate of the standard deviation of the regression errors.
Standard error of estimate, $S_e$, measures the variability or scatter of the observed values around the regression line.

$S_e = \sqrt{\frac{\sum_i (Y_i - \hat{Y}_i)^2}{n - 2}}$
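
A minimal sketch of the standard error of estimate for a simple linear regression, using the usual n - 2 degrees of freedom. The function name and the data are hypothetical.

```python
import math

def standard_error_of_estimate(x, y):
    """S_e = sqrt(SSE / (n - 2)) for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    # Sum of squared residuals around the fitted line.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2))

x = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical explanatory values
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical responses
s_e = standard_error_of_estimate(x, y)
print(s_e)
```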

Interpreting the Standard Error of Estimate

A smaller standard error of estimate indicates better fit.


The larger the standard error of estimate, the greater the
scattering of points around the regression line.

If Se = 0, then we can expect a perfect fit.


Standard Error of Estimate for Regression Coefficients

Standard error of estimate for a regression coefficient measures the amount of sampling error in that regression coefficient.


Standard error of $\hat{\beta}_0$ and $\hat{\beta}_1$

Standard error of $\hat{\beta}_0$ and $\hat{\beta}_1$ is given by:

$S(\hat{\beta}_0) = \frac{S_e \sqrt{\sum_i X_i^2}}{\sqrt{n \, SS_x}}$

$S(\hat{\beta}_1) = \frac{S_e}{\sqrt{SS_x}}$

where $SS_x = \sum_i (X_i - \bar{X})^2$
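
A sketch of the standard errors of the intercept and slope estimates on made-up data (the x and y values are hypothetical). The last line forms the ratio of the slope estimate to its standard error, which is the statistic a t-test on the slope would use.

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical explanatory values
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical responses

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)   # SS_x

# Least squares estimates.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
b0 = y_bar - b1 * x_bar

# Standard error of estimate, with n - 2 degrees of freedom.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))

# Standard errors of the intercept and slope estimates.
s_b0 = s_e * math.sqrt(sum(xi ** 2 for xi in x)) / math.sqrt(n * ss_x)
s_b1 = s_e / math.sqrt(ss_x)

# Ratio of the slope estimate to its standard error (t statistic).
t_b1 = b1 / s_b1
print(s_b0, s_b1, t_b1)
```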
