
A Brief Overview of the Linear Regression Model
By
Dr. N. NAGESHA
Professor
Dept. of Industrial & Production Engineering
University BDT College of Engineering
(A Constituent College of VTU, Belagavi)
DAVANAGERE – 577 004
Introduction

• Univariate Data
  – Descriptive statistics
  – Types of questions addressed

• Bivariate / Multivariate Data
  – Descriptive statistics
  – Types of questions addressed

• General Linear Model
  – Regression
Regression is different from Correlation
• Regression
  – Causal relation?
    • The underlying theory has to say something about possible causation, and regression is then used to validate that causal theory.
    • For that matter, no statistical technique can establish causation without an underlying theory.

  – Treating the variables:
    • If we say y and x are correlated, it means that we are treating y and x in a completely symmetrical way.
    • In regression, we treat the dependent variable (y) and the independent variable(s) (x’s) very differently. The y variable is assumed to be random or “stochastic” in some way, i.e. to have a probability distribution. The x variables are, however, assumed to have fixed (“non-stochastic”) values in repeated samples.
The General Linear Model: Regression

• The General Linear Model is a phrase used to indicate a class of statistical models which includes simple linear regression analysis.
• Regression is the predominant statistical tool in use due to its simplicity and versatility.

But what is regression analysis?


It is concerned with describing and evaluating the
relationship between a given variable (usually called the
dependent variable) and one or more other variables
(usually known as the independent variable(s)).
Some Notation
Some alternative names for the y and x variables:

  y                     x
  dependent variable    independent variables
  regressand            regressors
  effect variable       causal variables
  explained variable    explanatory variable
  endogenous variable   exogenous variable

• Dependent variable: must be a quantitative variable; treated as a random variable
• Independent variable: can be quantitative or qualitative; treated as non-random (fixed)
• Univariate model: only one dependent variable
• Multivariate model: more than one dependent variable
• Multiple regression: more than one independent variable
Simple Linear Regression:
The Basic Mathematical Model

• Regression is based on the concept of the simple proportional relationship -


also known as the straight line.

• Alternate notations:
• Theory says: Y = f(x)
  – Mathematical form: y = mx + b
  – Statistics literature: Y_i = a + b X_i + e_i
  – e_i is the error (disturbance) term
Why do we include a Disturbance term?
• Predicting complete system behavior in research is almost impossible (unlike the models in maths/physics laboratory experiments).

• So we must add a component to adjust or compensate for the errors in prediction.
• The disturbance term can capture a number of features:
  – We always leave out some determinants of y_t.
  – There may be errors in the measurement of y_t that cannot be modelled.
  – Random outside influences on y_t which we cannot model.
Linear Regression:
the Linguistic Interpretation

• In general terms, the linear model states that the dependent


variable is proportional to the value of the independent
variable.

• Thus, if we state that some variable Y increases in direct


proportion to some increase in X, we are stating a specific
mathematical model of behavior - the linear model.

• Hence, if we say that the crime rate goes up as


unemployment goes up, we are stating a simple linear
model.
The Mathematical Interpretation:
Meaning of the Regression Parameters

• a = the intercept
– the point where the line crosses the Y-axis.
– (the value of the dependent variable when all of the
independent variables = 0)
• b = the slope
– the increase in the dependent variable per unit change
in the independent variable (also known as the 'rise
over the run')
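• A hypothetical illustration (numbers invented purely for exposition): for a fitted line ŷ = 2 + 0.5x, the intercept a = 2 is the predicted value of y when x = 0, and the slope b = 0.5 means that each one-unit increase in x raises the predicted y by 0.5 units ('rise over run' = 0.5/1).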
Simple Regression

• For simplicity, say there is only one independent variable.


This is the situation where y depends on only one x variable.

• Examples of the kind of relationship that may be of interest include:
  – How surface finish varies with depth of cut in machining
  – Measuring the long-term relationship between stock prices and dividends
  – The number of cars sold as a function solely of their price
Generalising the Simple Model to Multiple Linear Regression

• Before, we have used the model

  y_t = α + β x_t + u_t ,   t = 1, 2, ..., T
• But, what if our dependent (y) variable depends on more than one
independent variable?
For example the number of cars sold might conceivably depend on
1. the price of cars
2. the price of public transport
3. the price of petrol
4. the extent of the public’s concern about global warming
• Similarly, surface finish might depend on several factors.
• Having just one independent variable is no good in this case - we want
to have more than one x variable. It is very easy to generalise the
simple model to one with k-1 regressors (independent variables).
Multiple Linear Regression :
Interpretation
• Multiple linear regression: several independent variables

  y_i = α_0 + α_1 x_{1,i} + α_2 x_{2,i} + ... + α_k x_{k,i} + u_i

• Interpretation of α_i:
  – For a one-unit change in x_i, the mean change in y after allowing for all other factors.
Determining the Regression Coefficients
• So how do we determine what α and β are?
• Choose α and β so that the (vertical) distances from the data points to the fitted line are minimised (so that the line fits the data as closely as possible).

(Diagram: scatter of data points with a fitted straight line; x on the horizontal axis)
Ordinary Least Squares
• The most common method used to fit a line to the data is known as
OLS (ordinary least squares).

• What we actually do is take each distance and square it (i.e. take the
area of each of the squares in the diagram) and minimise the total sum
of the squares (hence least squares).

• Tightening up the notation, let
  y_t denote the actual data point t
  ŷ_t denote the fitted value from the regression line
  û_t denote the residual, y_t − ŷ_t
Actual and Fitted Value
• Graphical illustration
(Diagram: for a given x_i, the actual value y_i, the fitted value ŷ_i on the regression line, and the residual û_i between them; y on the vertical axis, x on the horizontal axis)
How OLS Works

• So we minimise û_1² + û_2² + û_3² + û_4² + û_5², or more generally minimise Σ_t û_t². This is known as the residual sum of squares (RSS).

• But what was û_t? It was the difference between the actual point and the line, y_t − ŷ_t.

• So minimising Σ_t (y_t − ŷ_t)² is equivalent to minimising Σ_t û_t² with respect to α̂ and β̂.

• Why does one have to square the error terms? Can't we minimise the sum of the errors itself?
Why squared error?
• Because:
  – (1) the sum of the errors, taken as signed deviations, would be zero (just as deviations about a mean sum to zero), and
  – (2) some feel that big errors should be more influential than small errors.
• Therefore, we wish to find the values of a and b that
produce the smallest sum of squared errors.
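• As a concrete illustration, the sketch below (in Python, assuming NumPy is installed; the data values are purely hypothetical) computes the intercept and slope that minimise the residual sum of squares using the standard closed-form formulas β̂ = Σ(x_t − x̄)(y_t − ȳ) / Σ(x_t − x̄)² and α̂ = ȳ − β̂ x̄:

  import numpy as np

  # Hypothetical data points (purely illustrative)
  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

  x_bar, y_bar = x.mean(), y.mean()

  # Closed-form OLS estimates that minimise the residual sum of squares
  beta_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
  alpha_hat = y_bar - beta_hat * x_bar

  y_fit = alpha_hat + beta_hat * x       # fitted values (y-hat)
  residuals = y - y_fit                  # u-hat = actual minus fitted
  rss = np.sum(residuals ** 2)           # residual sum of squares

  print(alpha_hat, beta_hat, rss)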
Linearity

• A linear model means one which is linear in the parameters (α and β). It does not necessarily have to be linear in the variables (y and x).

• Linear in the parameters means that the parameters are not multiplied
together, divided, squared or cubed, etc.

• Some models can be transformed to linear ones by a suitable substitution or manipulation, e.g. the exponential regression model

  Y_t = e^α X_t^β e^{u_t}   ⇒   ln Y_t = α + β ln X_t + u_t

• Then let y_t = ln Y_t and x_t = ln X_t:

  y_t = α + β x_t + u_t
The Assumptions Underlying the
Classical Linear Regression Model (CLRM)

• The model which we have used is known as the classical linear regression model.
• We observe data for x_t, but since y_t also depends on u_t, we must be specific about how the u_t are generated.
• We usually make the following set of assumptions about the u_t's (the unobservable error terms):

  Technical notation      Interpretation
  1. E(u_t) = 0           The errors have zero mean
  2. Var(u_t) = σ²        The variance of the errors is constant and finite over all values of x_t
  3. Cov(u_i, u_j) = 0    The errors are statistically independent of one another
  4. Cov(u_t, x_t) = 0    No relationship between the error and the corresponding x variate
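• A small simulation sketch (Python with NumPy; all parameter values are invented, and the normal distribution is used purely for convenience since the assumptions above only concern the mean, variance and covariances of u_t) showing data generated in line with the CLRM: x_t is held fixed while u_t is drawn independently with zero mean and constant variance σ²:

  import numpy as np

  rng = np.random.default_rng(0)

  T = 100
  alpha, beta, sigma = 1.0, 0.5, 2.0      # illustrative parameter values

  x = np.linspace(0.0, 10.0, T)           # fixed ("non-stochastic") regressor
  u = rng.normal(0.0, sigma, size=T)      # independent draws with E(u_t)=0, Var(u_t)=sigma^2

  y = alpha + beta * x + u                # the classical linear regression model

  print(u.mean(), u.var())                # close to 0 and sigma^2 in large samples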
Expressing Multiple Linear Regression Model

• We could write out a separate equation for every value of t:


  y_1 = β_1 + β_2 x_{2,1} + β_3 x_{3,1} + ... + β_k x_{k,1} + u_1
  y_2 = β_1 + β_2 x_{2,2} + β_3 x_{3,2} + ... + β_k x_{k,2} + u_2
    ⋮
  y_T = β_1 + β_2 x_{2,T} + β_3 x_{3,T} + ... + β_k x_{k,T} + u_T
Testing Multiple Hypotheses: The F-test

• We used the t-test to test single hypotheses, i.e. hypotheses involving only
one coefficient. But what if we want to test more than one coefficient
simultaneously?

• We do this using the F-test. The F-test involves estimating 2 regressions.


Calculating the F-Test Statistic

• The test statistic is given by

  test statistic = ((RRSS − URSS) / URSS) × ((T − k) / m)

  where URSS = RSS from the unrestricted regression
        RRSS = RSS from the restricted regression
        m    = number of restrictions
        T    = number of observations
        k    = number of regressors in the unrestricted regression, including the constant (i.e. the total number of parameters to be estimated)
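• A minimal sketch (Python with SciPy; the RSS values and dimensions are purely illustrative) of computing the test statistic and comparing it with the F(m, T − k) distribution:

  from scipy import stats

  # Purely illustrative numbers
  URSS = 420.0            # RSS from the unrestricted regression
  RRSS = 460.0            # RSS from the restricted regression
  T, k, m = 100, 5, 2     # observations, parameters in the unrestricted model, restrictions

  F_stat = ((RRSS - URSS) / URSS) * ((T - k) / m)
  p_value = stats.f.sf(F_stat, m, T - k)     # upper-tail probability under F(m, T - k)

  print(F_stat, p_value)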
Goodness of Fit Statistics

• We would like some measure of how well our regression model actually fits
the data.
• We have goodness of fit statistics to test this: i.e. how well the sample
regression function fits the data.
• The most common goodness of fit statistic is known as R². One way to define R² is to say that it is the square of the correlation coefficient between y and ŷ.
• For another explanation, recall that what we are interested in doing is explaining the variability of y about its mean value, ȳ, i.e. the total sum of squares, TSS:

  TSS = Σ_t (y_t − ȳ)²
• We can split the TSS into two parts, the part which we have explained (known
as the explained sum of squares, ESS) and the part which we did not explain
using the model (the RSS).
Defining R2

• That is, TSS = ESS + RSS:

  Σ_t (y_t − ȳ)² = Σ_t (ŷ_t − ȳ)² + Σ_t û_t²

• Our goodness of fit statistic is

  R² = ESS / TSS

• But since TSS = ESS + RSS, we can also write

  R² = ESS / TSS = (TSS − RSS) / TSS = 1 − RSS / TSS

• R² must always lie between zero and one. To understand this, consider two extremes:
  – RSS = TSS, i.e. ESS = 0, so R² = ESS/TSS = 0
  – ESS = TSS, i.e. RSS = 0, so R² = ESS/TSS = 1
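• A minimal sketch (Python with NumPy; data values are hypothetical) of the decomposition TSS = ESS + RSS and the resulting R² for a simple OLS fit with a constant (the identity holds exactly in that case):

  import numpy as np

  # Hypothetical data; fit a simple OLS line with an intercept
  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([3.0, 5.2, 6.9, 9.1, 10.8])

  beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
  alpha_hat = y.mean() - beta_hat * x.mean()
  y_hat = alpha_hat + beta_hat * x

  TSS = np.sum((y - y.mean()) ** 2)       # total sum of squares
  ESS = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
  RSS = np.sum((y - y_hat) ** 2)          # residual sum of squares

  print(TSS, ESS + RSS)                   # equal (up to rounding) for an OLS fit with a constant
  print(ESS / TSS, 1.0 - RSS / TSS)       # two equivalent expressions for R-squared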
The Limit Cases: R2 = 0 and R2 = 1

(Two diagrams of y_t plotted against x_t, illustrating the R² = 0 and R² = 1 cases)
Problems with R2 as a Goodness of Fit Measure
• There are a number of them:

1. R² is defined in terms of variation about the mean of y, so that if a model is reparameterised (rearranged) and the dependent variable changes, R² will change.

2. R² never falls if more regressors are added to the regression, e.g. consider:
   Regression 1: y_t = β_1 + β_2 x_{2t} + β_3 x_{3t} + u_t
   Regression 2: y_t = β_1 + β_2 x_{2t} + β_3 x_{3t} + β_4 x_{4t} + u_t
   R² will always be at least as high for Regression 2 relative to Regression 1.

3. R² quite often takes on values of 0.9 or higher for time series regressions.
Adjusted R2

• In order to get around these problems, a modification is often made which takes into account the loss of degrees of freedom associated with adding extra variables. This is known as R̄², or adjusted R²:

  R̄² = 1 − [(T − 1) / (T − k)] (1 − R²)

• So, if we add an extra regressor, k increases and, unless R² increases by a more-than-offsetting amount, R̄² will actually fall.
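• A minimal sketch (Python; the R² values, T and k are purely illustrative) of the adjustment, showing how adding a regressor that barely raises R² can lower R̄²:

  def adjusted_r2(r2: float, T: int, k: int) -> float:
      """Adjusted R-squared: penalises R-squared for the degrees of freedom used up."""
      return 1.0 - ((T - 1) / (T - k)) * (1.0 - r2)

  # Adding a regressor (k: 3 -> 4) with only a tiny rise in R-squared
  print(adjusted_r2(0.850, T=100, k=3))
  print(adjusted_r2(0.851, T=100, k=4))   # adjusted R-squared falls despite the higher R-squared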
Regression Examples
Production Function Analysis

Energy Efficiency Modeling

Assumptions in Regressions
