
Regression

Outline:
Assumptions of regression
Violations of regression
Multicollinearity
Heteroscedasticity
How to remove these violations
Concept of dummy variables

Introduction
Regression is the study of the dependence of one variable
(the dependent variable) on one or more other variables
(the explanatory variables).
In regression, we deal with random (or stochastic)
variables.
Dependent variable: also called the explained, predicted, regressand,
response, endogenous, outcome, or controlled variable.
Explanatory variable: also called the independent, predictor,
regressor, stimulus, exogenous, covariate, or control
variable.
The dependent variable is plotted on the vertical axis and
the independent variable on the horizontal axis.

1.6 TERMINOLOGY AND NOTATION

In the literature the terms dependent variable and
explanatory variable are described in various ways; a
representative list of the synonyms is given above.

REGRESSION VERSUS CORRELATION

In correlation analysis, the primary objective is to
measure the strength or degree of linear
association between two variables. The correlation
coefficient measures this strength of (linear)
association.
The value of the correlation coefficient varies
between -1 and +1.
In regression analysis, we are not primarily
interested in such a measure. Instead, we try to
estimate or predict the average value of one
variable on the basis of the fixed values of the other
variables.

[Figure: Scatter plots of data with various correlation coefficients
(r = -1, r = -0.6, r = 0, r = +0.3, r = +1).
Source: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall]

In regression, we are dealing with random
variables.
The term random is a synonym for the term
stochastic. A random or stochastic variable is a
variable that can take on any set of values,
positive or negative, with a given probability.
The dependent variable is assumed to be
statistical, random, or stochastic, that is, to
have a probability distribution. The explanatory
variables, on the other hand, are assumed to
have fixed values (in repeated sampling).

Types of Data
Time Series
Cross-section
Pooled (Panel)

Time Series
A set of observations on the values that a
variable takes at different times, collected
at regular time intervals (daily, weekly,
monthly, quarterly, annually,
quinquennially (every 5 years), or
decennially (every 10 years)).
Time series analysis is based on the assumption of
stationarity, which means that the series' mean
and variance do not vary systematically
over time.
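As a rough formal sketch, weak stationarity of a series $Y_t$ requires a constant mean and variance:
$E(Y_t) = \mu$ and $\mathrm{Var}(Y_t) = \sigma^2$ for all $t$.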

Cross Sectional
Data on one or more variables collected at the same point in
time, such as the Census every 10 years (last conducted in 1998),
or data on cotton production and cotton prices for the 4
provinces of Pakistan for 1990 and 1991. For each year the
data on the 4 provinces are cross-sectional data.
Cross-sectional data have the problem of heterogeneity
(a combination of very large and very small values),
for example, when collecting data on Punjab and Balochistan:
Punjab is the most populous province and Balochistan is
the largest province by area.
For example, Punjab produces huge amounts of eggs and
Balochistan produces very little. When we include such
heterogeneous units in a statistical analysis, the size or scale
effect must be taken into account so as not to mix apples
with oranges.

Pooled Data
The combination of cross-sectional and
time series data.
It may be in the form of panel, longitudinal or
micropanel data.
Example: data on cotton production and cotton
prices for the 4 provinces of Pakistan for
1990 and 1991. For each year the data on
the 4 provinces are cross-sectional data, and
taken together for both years they become pooled data.

Introduction
Panel data (also known as
longitudinal or cross-sectional time-series data)
is a dataset in which the behavior of
entities is observed across time.
These entities could be states,
companies, individuals, countries,
etc.

How to Organize Panel Data
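A minimal Stata sketch of organizing panel data in long format, one row per entity and time period (the file name and variable names here are illustrative, not from the slides):

* declare the panel structure: entity identifier and time variable
use paneldata, clear
xtset company year
* describe the pattern of the panel (balanced or unbalanced, gaps)
xtdescribe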

Two-Variable Regression
Regression analysis is largely concerned
with estimating and/or predicting the
(population) mean value of the dependent
variable on the basis of the known or fixed
values of the explanatory variable(s).
Bivariate or Two-Variable Regression:
regression in which the dependent variable
(the regressand) is related to a single
explanatory variable (the regressor).

The simple linear regression model is given as

$y_i = \beta_1 + \beta_2 x_i + u_i$

where $y_i$ is the i-th value of the dependent variable,
$x_i$ is the i-th value of the independent variable,
$\beta_1$ is the intercept parameter,
$\beta_2$ is the slope parameter and represents the change in $y$
for a unit change in $x$, and
$u_i$ is the i-th error term.

Linearity
Linearity in the Variables
The first meaning of linearity is that Y is a linear function of
$X_i$; the regression curve in this case is a straight line.
$Y = \beta_1 + \beta_2 X_i^2$ is not linear in the variable, but
$Y = \beta_1 + \beta_2 X_i$ is.
Linearity in the Parameters
The second interpretation of linearity is that Y is a linear function of
the parameters, the $\beta$'s; it may or may not be linear in the
variable X.
$Y = \beta_1 + \beta_2 X_i^2$ is linear in the parameters, and so is
$Y = \beta_1 + \beta_2 X_i$; each
is a linear (in the parameters) regression model.

Linearity
From now on the term linear
regression will always mean a regression
that is linear in the parameters, the $\beta$'s
(that is, the parameters are raised to the
first power only). It may or may not be
linear in the explanatory variables (the Xs).
All the models shown in the slides below are
thus linear regression models, that is,
models linear in the parameters.

Error Term
We can express the deviation of an individual $Y_i$
around its expected value as $u_i = Y_i - E(Y \mid X_i)$.
Technically, $u_i$ is known as the stochastic
disturbance or stochastic error term.
The stochastic disturbance term is a proxy for all
the omitted or neglected variables that may affect
Y but are not included in the regression model.
The stochastic specification has the advantage
that it clearly shows that there are other variables,
beyond those included in the regression model, that affect Y.
The sample counterpart of $u_i$ is the residual term, $\hat{u}_i$.

Why Error Term?

The disturbance term $u_i$ stands for all the variables omitted from
the model that collectively affect Y. Why don't we
introduce them into the model explicitly? The reasons are
many:
1. Vagueness of theory: The theory, if any, determining the
behavior of Y may be, and often is, incomplete. We might be
ignorant or unsure about the other variables affecting Y.
2. Unavailability of data: Lack of quantitative information
about these variables, e.g., information on family wealth
generally is not available.
3. Core variables versus peripheral variables: Assume that
besides income $X_1$, the number of children per family $X_2$, sex
$X_3$, religion $X_4$, education $X_5$, and geographical region $X_6$ also
affect consumption expenditure. But the joint influence of all
or some of these variables may be so small that it does not
pay to introduce them into the model explicitly. One hopes
that their combined effect can be treated as a random variable, $u_i$.

Why Error Term?

4. Intrinsic randomness in human behavior: Even if we succeed in
introducing all the relevant variables into the model, there is
bound to be some intrinsic randomness in individual Y's that
cannot be explained no matter how hard we try. The disturbances,
the u's, may very well reflect this intrinsic randomness.
5. Poor proxy variables: Since data on some variables are not
directly observable, in practice we use proxy variables, which may
not be truly representative.
6. Principle of parsimony: We would like to keep our regression
model as simple as possible. If we can explain the behavior of Y
substantially with two or three explanatory variables and if our
theory is not strong enough to suggest what other variables might
be included, why introduce more variables? Let $u_i$ represent all
other variables.

The Population Linear Regression Model

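For the two-variable case introduced above, this model can be sketched as
$E(Y \mid X_i) = \beta_1 + \beta_2 X_i$,
so that each individual observation is
$Y_i = E(Y \mid X_i) + u_i = \beta_1 + \beta_2 X_i + u_i$.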

Assumptions
1. $u_i$ is a random variable; it is normally
distributed with mean zero and constant
variance $\sigma^2$, i.e. $u_i \sim N(0, \sigma^2)$.

The constant variance assumption is
known as homoscedasticity.

Assumptions
2. The disturbance terms are
independent of each other, i.e.
$\mathrm{Cov}(u_i, u_j) = 0$ for $i \neq j$ (no autocorrelation).
3. The explanatory variable is
non-stochastic and assumed to be measured without
error.

Assumptions
4. The explanatory variables are not
perfectly linearly correlated (no perfect multicollinearity).

Properties of least squares estimates
The OLS estimators are linear
functions of the actual observations.
The least squares estimates are
unbiased estimates of the true parameters $\beta_1$ and $\beta_2$.

Properties of least squares estimates
The estimated variance of the error term is

$\hat{\sigma}^2 = \dfrac{\sum_i \hat{u}_i^2}{n - K}$

where K is the total number of
parameters estimated in the
regression line and n is the number of observations.

Econometric problems
Violations of these assumptions give rise to
econometric problems such as autocorrelation,
heteroscedasticity, and multicollinearity.

Session 3: Regression Diagnostics

What shall we learn?

At the end of this session, we shall be
able to:
Find and remove influential observations
Check for homogeneity of variance (homoscedasticity),
multicollinearity, and model specification

Influential data
Why can a single influential observation
be a concern for a researcher?
Unusual observations include:
Outliers: an observation with a large
residual
Leverage: an observation with an extreme value
on a predictor variable, which can exert a strong
influence on the estimated regression line

Outliers
Scatter plot
Summary statistics: if the gap between the
minimum and maximum is unusually
large
Cook's D is used to deal with outliers
and influential observations at the
same time.

How to find unusual data

We might start examining the data with:
Summary statistics
Graphs
Numerical tests

Summary Statistics

Do you see any problem with any
variable?

Diagnostic checks:
Look at the standard deviation.
Look at the max and min values.
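A minimal Stata sketch (the variable names, taken from the graph matrix example below, are illustrative):

* detailed summary statistics: mean, sd, min, max, percentiles
summarize debt tax profit tang varincome, detail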

Finding Unusual Data: Graphs

graph matrix debt tax profit tang varincome

Estimate the regression equation
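For example (a sketch; the specification is assumed from the graph matrix variables above, with debt as the dependent variable):

regress debt tax profit tang varincome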

Statistical Tests
We can use studentized residuals as a first means for
identifying outliers
After estimating OLS, residuals can be predicted with

predict r, rstudent
We should pay attention to studentized residuals that
exceed +2 or -2, be more concerned about
residuals that exceed +2.5 or -2.5, and be even more
concerned about residuals that exceed +3 or -3.

How to identify r greater than 2

We can use the list command with the if option:
list [variables] if abs(r) > 2
abs() returns the absolute value.
We can drop the flagged observations with the drop
command:
drop if abs(r) > 2
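Putting the steps together (a sketch; variable names are illustrative):

regress debt tax profit tang varincome
predict r, rstudent
* inspect, then drop, observations with |studentized residual| > 2
list debt tax profit r if abs(r) > 2
drop if abs(r) > 2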

Influential observation
To identify observations that have a greater influence on
the regression estimates, we can use the leverage statistic
after OLS:
predict lev, leverage
Generally, a point with leverage greater than
(2k+2)/n should be carefully examined. Here k is the
number of predictors and n is the number of
observations.
For example, (2k+2)/n = (2*3 + 2)/51 = 0.16; if the
leverage value is greater than 0.16, examine (and possibly
delete) those observations.

How to identify influential observations

We can use the list command with the if
option:
list [variables] if lev > value
We can drop influential observations
with the drop command:
drop if lev > value
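A sketch using the slide's own example threshold (0.16, from k = 3 and n = 51); variable names are illustrative:

predict lev, leverage
list debt tax profit lev if lev > 0.16
drop if lev > 0.16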

Exercise: ..
Load the file and estimate the regression equation
regress .
Now check:
For outliers
For influential data

Using both graphical and numerical tests

Can we check for residuals and
influence at the same time?

Cook's D combines information on
the residual and leverage.
The lowest value that Cook's D
can assume is zero, and the
higher the Cook's D is, the more
influential the point.
The conventional cut-off point is
4/n.

Cook's D test
predict d, cooksd
list [variables] d if d>4/n
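A concrete sketch (variable names illustrative); in Stata the 4/n cut-off can be written with the system variable _N, the number of observations:

list debt tax profit d if d > 4/_N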

More Graphical options

After OLS, we can use avplots.
An avplot (added-variable plot) is an attractive graphical
method to present multiple influential
points on a predictor.
What we are looking for in an avplot are
those points that can exert substantial
change on the regression line.
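For example (a sketch; the predictor name is illustrative):

avplots        // one added-variable plot for each predictor
avplot tax     // added-variable plot for a single predictor of interest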

2. Checking homoscedasticity or
HETEROSCEDASTICITY
When the variance of the residuals is not constant,
there is heteroscedasticity; when the
variance is constant, it is known as
homoscedasticity.
Heteroscedasticity mostly occurs in cross-sectional
data. It can be detected by several graphical or
non-graphical methods.
When we detect heteroscedasticity, the hypothesis
tests are invalid because the standard errors are
biased, and so are the values of the t and F statistics; hence
we cannot interpret them correctly in the presence of
heteroscedasticity. (Feng Li, Department of Statistics,
Stockholm University)

2. Checking homoscedasticity
or HETEROSCEDASTICITY
One of the main assumptions of
ordinary least squares regression is the
homogeneity of variance of the residuals.
If the model is well-fitted, there should be
no pattern in the residuals plotted against
the fitted values.
Heteroscedasticity often arises in
cross-sectional data.

2. Checking homoscedasticity or
HETEROSCEDASTICITY
We can use graphical commands or statistical commands.
Graphical: rvfplot
Statistical: the Breusch-Pagan test and White's test
estat hettest is the Breusch-Pagan test.
estat imtest is White's test.

Both test the null hypothesis that the variance of the residuals is
homogeneous.
If the p-value is < 0.05 (at the 95% level) --- reject the
null (heteroscedasticity).
If the p-value is > 0.05 (at the 95% level) --- do not reject the null (homoscedasticity).
If the p-value is very small, we would have to reject the null
hypothesis and accept the alternative hypothesis that the
variance is not homogeneous.
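A minimal sketch of running both tests after a regression (variable names illustrative):

regress debt tax profit tang varincome
estat hettest         // Breusch-Pagan / Cook-Weisberg test
estat imtest, white   // White's test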

2. Checking homoscedasticity or
HETEROSCEDASTICITY
Heteroscedasticity can also occur when the model
is not specified correctly;
other reasons may be a limited number of
dependent variables, or when the reliability of an
independent variable is somehow linked with two or
more dependent variables (Hayes and Cai, 2007).
It may also occur when there is an unequal spread of
errors around the regression line, and when
observations are of varying sizes so that they
contribute unevenly to the error term, hence
causing heteroscedasticity.

2. Checking homoscedasticity or
HETEROSCEDASTICITY
Graphically, deviations from the central
line indicate a problem of heteroscedasticity.
There are several tests for detecting heteroscedasticity;
the one used in this research is the Breusch-Pagan test.
This test checks the null hypothesis
that the variance is constant.
The estat hettest command is used in Stata to check for
heteroscedasticity.
If the p-value is very small, we reject the null hypothesis
and accept the alternative hypothesis, which means the
variance is not constant and there is heteroscedasticity.

2. Checking homoscedasticity or
HETEROSCEDASTICITY
The robust option is then used to control for
heteroscedasticity, outliers and other influential
observations:
xtreg depvar indepvars, fe robust
fe specifies that a fixed-effects model has been
selected, and the robust option is used to obtain
standard errors that are robust to heteroscedasticity,
outliers and other influential observations
present in the data.
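A sketch with illustrative panel identifiers and variable names (not from the slides):

xtset company year
xtreg debt tax profit tang varincome, fe robust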

MULTICOLLINEARITY
The term multicollinearity was first used by
Pawel Ciompa in 1910, but the term was
invented by Frisch during his work Confluence
Analysis in 1934. Multicollinearity is the
problem in regression analysis where, due to
insufficient information in the sample, there is
difficulty in estimating the parameters
(Saika B. and Singh R., 2013).
John Reimer and Rothrock defined
multicollinearity as:

MULTICOLLINEARITY

"The name given to the general problem which arises
when some or all of the explanatory variables in a
relation are so highly correlated one with another
that it becomes very difficult, if not impossible, to
disentangle their separate influences and obtain a
reasonably precise estimate of their relative effects."
As multicollinearity increases, the key concern is a
sudden boost in the standard errors of the coefficients,
due to which the reliability of the model decreases. The
values of the t-statistics become smaller in case of higher
multicollinearity, due to which it is difficult to accept the
alternative hypothesis.

Multicollinearity
What happens when two or more variables are
highly correlated?
When there is a perfect linear relationship
among the predictors, the estimates for a
regression model cannot be uniquely computed.
The primary concern is that as the degree of
multicollinearity increases the standard errors
for the coefficients can get wildly inflated.

MULTICOLLINEARITY
It commonly results in misleading
and confusing conclusions. One of
the reasons for multicollinearity
might be the use of an inappropriate
dummy variable.

Tests for
MULTICOLLINEARITY
Calculate the correlation coefficient
The easiest way to detect multicollinearity is by calculating the
correlation between pairs of independent variables. If the correlation
is close to 1 or -1, the researcher should remove one of the two
correlated variables from the model.
A scatter diagram between independent variables will also give some
indication of a multicollinearity issue.
Variance Inflation Factor (VIF)
In Stata, the variance inflation factor (VIF) is used to
compute the degree of multicollinearity among the
variables. One can simply enter the command vif in Stata
after running the regression analysis on the data. If the value of VIF is
greater than or equal to 10, there is a problem of
multicollinearity in the data.
VIF can also be computed as:
$VIF_j = \dfrac{1}{1 - R_j^2}$
where $R_j^2$ is the coefficient of determination obtained by regressing
the j-th explanatory variable on the remaining explanatory variables.
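A sketch of both checks in Stata (variable names illustrative):

* pairwise correlations among the regressors
pwcorr tax profit tang varincome, sig
* variance inflation factors after the regression
regress debt tax profit tang varincome
estat vif    // the older command vif also works after regress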

How to Deal with Multicollinearity

Drop one of the two variables which
are linearly correlated with one
another.
Which variable should be dropped?
(The decision is based on ...)
Checking Normality of
Residuals
After we run a regression analysis, we can use
the predict command to create residuals and
then use commands to check the normality
both graphically and numerically.
Graphical: such as kdensity, qnorm and
pnorm to check the normality of the residuals.
Numerical: iqr and swilk
After regression, we then use the predict
command to generate residuals.
predict (residual variable), resid

Checking Normality of
Residuals
Below we use the kdensity command
to produce a kernel density plot with
the normal option requesting that a
normal density be overlaid on the plot.
kdensity stands for kernel density
estimate. It can be thought of as a
histogram with narrow bins and
moving average.
kdensity (residual variable), normal

Checking Normality of
Residuals
The pnorm command graphs a standardized
normal probability (P-P) plot, while qnorm plots
the quantiles of a variable against the quantiles
of a normal distribution.
pnorm is sensitive to non-normality in the middle
range of the data, and qnorm is sensitive to non-normality near the tails.
If the plotted points lie close to the line, we can accept that the
residuals are close to a normal distribution.
pnorm (residual variable)
qnorm (residual variable)

Checking Normality of
Residuals

There are also numerical tests for testing
normality.
iqr stands for inter-quartile range and assumes
the symmetry of the distribution.
Severe outliers consist of those points that are
either 3 inter-quartile ranges below the first
quartile or 3 inter-quartile ranges above the third
quartile. The presence of any severe outliers
should be sufficient evidence to reject normality
at a 5% significance level.
Mild outliers are common in samples of any size.
In our case, we don't have any severe outliers
and the distribution seems fairly symmetric. The
residuals have an approximately normal distribution.

Checking Normality of
Residuals
Another test available is the swilk test,
which performs the Shapiro-Wilk W test
for normality.
The p-value is based on the
assumption that the distribution is
normal (the null hypothesis).
If the p-value is greater than 0.05, we do not reject
normality; otherwise we reject it.
swilk (residual variable)
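Putting the normality checks together (a sketch; variable names are illustrative):

regress debt tax profit tang varincome
predict e, resid
kdensity e, normal    // kernel density with normal overlay
pnorm e               // P-P plot
qnorm e               // Q-Q plot
swilk e               // Shapiro-Wilk W test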
