Regression
Assumptions of regression
Violations of regression:
Multicollinearity
Heteroscedasticity
Introduction
Regression is the study of the dependence of one variable (the dependent variable) on one or more other variables (the explanatory variables).
In regression, we deal with random (or stochastic)
variables.
Dependent variable: also called the explained, predicted, response, endogenous, outcome, or controlled variable, or the regressand.
Explanatory variable: also called the independent, predictor, stimulus, exogenous, covariate, or control variable, or the regressor.
The dependent variable is plotted on the vertical axis and the independent variable on the horizontal axis.
[Scatter plots of Y against X illustrating different correlation strengths: r = -1, r = -0.6, r = 0, r = +0.3, r = +1. From: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall.]
Types of Data
Time Series
Cross-section
Pooled (Panel)
Time Series
A set of observations on the values that a variable takes at different times, collected at regular intervals: daily, weekly, monthly, quarterly, annually, quinquennially (every 5 years), or decennially (every 10 years).
Time-series analysis is based on the assumption of stationarity, which means that the mean and variance of the series do not vary systematically over time.
Cross-Sectional
Data on one or more variables collected at the same point in time, such as the Census conducted every 10 years (most recently in 1998), or data on cotton production and cotton prices for the 4 provinces for 1990 and 1991. For each year, the data on the 4 provinces are cross-sectional data.
Cross-sectional data have the problem of heterogeneity (a combination of very large and very small values), for example when data for Punjab and Balochistan are collected together: Punjab is the most populous province, while Balochistan is the largest province by area.
For example, Punjab produces huge amounts of eggs and Balochistan produces very few. When we include such heterogeneous units in a statistical analysis, the size or scale effect must be taken into account so as not to mix apples with oranges.
Pooled Data
The combination of cross-sectional and time-series data.
May take the form of panel, longitudinal, or micropanel data.
Consider data on cotton production and cotton prices for the 4 provinces of Pakistan for 1990 and 1991. For each year, the data on the 4 provinces are cross-sectional; combined across both years, they become pooled data.
Introduction
Panel data (also known as longitudinal or cross-sectional time-series data) is a dataset in which the behavior of entities is observed across time.
These entities could be states, companies, individuals, countries, etc.
Two-Variable Regression
Regression analysis is largely concerned
with estimating and/or predicting the
(population) mean value of the dependent
variable on the basis of the known or fixed
values of the explanatory variable(s).
Bivariate or Two-Variable
Regression in which the dependent variable
(the regressand) is related to a single
explanatory variable (the regressor).
Linearity
Linearity in the Variables
The first meaning of linearity is that Y is a linear function of Xi; the regression curve in this case is a straight line. Thus
Yi = β1 + β2Xi² is not a linear function of the variable, while
Yi = β1 + β2Xi is a linear function.
Linearity in the Parameters
The second interpretation of linearity is that Y is a linear function of the parameters, the β's; it may or may not be linear in the variable X. Under this interpretation,
Yi = β1 + β2Xi² and
Yi = β1 + β2Xi
are both linear (in the parameters) regression models. By contrast, a model such as Yi = β1 + β2²Xi is not linear in the parameters.
Linearity
From now on, the term linear regression will always mean a regression that is linear in the parameters, the β's (that is, the parameters are raised to the first power only). It may or may not be linear in the explanatory variables (the Xs). All the models shown in the slides below are thus linear regression models, that is, models linear in the parameters.
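For illustration, a few standard textbook functional forms; these particular examples are supplied here as assumptions, since the original slides listing the models were not preserved. Each is linear in the parameters β1 and β2, even where it is nonlinear in X or Y:
Yi = β1 + β2Xi + ui (straight line)
Yi = β1 + β2Xi² + ui (quadratic in X)
Yi = β1 + β2(1/Xi) + ui (reciprocal)
ln(Yi) = β1 + β2 ln(Xi) + ui (log-log)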
Error Term
We can express the deviation of an individual Yi around its expected value as
ui = Yi - E(Y | Xi), or equivalently, Yi = E(Y | Xi) + ui.
Technically, ui is known as the stochastic disturbance or stochastic error term.
The stochastic disturbance term is a proxy for all the omitted or neglected variables that may affect Y but are not included in the regression model.
The stochastic specification has the advantage that it clearly shows that there are variables besides those included in the regression model that affect Y.
Residual Term
The residual ûi is the sample counterpart of the stochastic error term: ûi = Yi - Ŷi, the difference between an observed value and the corresponding fitted value.
Assumptions
1. ui is a random variable, normally distributed with mean zero and constant variance σ², i.e., ui ~ N(0, σ²).
2. The disturbance terms are independent of each other, i.e., Cov(ui, uj) = 0 for i ≠ j.
3. The explanatory variable is non-stochastic and assumed to be measured without error.
4. The explanatory variables are not perfectly linearly correlated.
Properties of Least Squares Estimates
The OLS estimators are linear functions of the actual observations Yi.
The least squares estimates are unbiased estimates of the population parameters, i.e., E(β̂1) = β1 and E(β̂2) = β2.
The variance of the slope estimator in the two-variable model is Var(β̂2) = σ²/Σ(Xi - X̄)².
Econometric Problems
Violations of these assumptions lead to econometric problems such as autocorrelation, heteroscedasticity, and multicollinearity.
Session 3: Regression Diagnostics
Influential Data
Why can a single influential observation be a concern for a researcher?
Unusual observations include:
Outliers: observations with large residuals.
Leverage points: observations with extreme values of the explanatory variable(s), which can exert a strong influence on the fitted regression.
Outliers
Scatter plot.
Summary statistics, if the gap between the minimum and the maximum is unusually large.
Cook's D, which detects outlying and influential observations at the same time.
Summary Statistics
Diagnostic checks:
Examine the standard deviation.
Examine the maximum and minimum values.
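A minimal sketch in Stata, using the built-in auto dataset (the variable price is an assumption chosen for illustration):
sysuse auto, clear
summarize price, detail
The detail option reports the minimum, maximum, standard deviation, and percentiles, making unusually large gaps easy to spot.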
Statistical Tests
We can use studentized residuals as a first means of identifying outliers.
After estimating the OLS regression, studentized residuals can be obtained with
predict r, rstudent
We should pay attention to studentized residuals that exceed +2 or -2, be more concerned about residuals that exceed +2.5 or -2.5, and more concerned still about residuals that exceed +3 or -3.
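A minimal sketch of this workflow, again on the built-in auto dataset (the model variables are assumptions, not taken from the original exercise):
sysuse auto, clear
regress price mpg weight
predict r, rstudent
* list observations whose studentized residuals exceed the +/-2 threshold
list make price r if abs(r) > 2 & !missing(r)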
Influential Observations
To identify observations that exert a large influence on the fitted values of the dependent variable, we can use the leverage statistic after OLS:
predict lev, leverage
Generally, a point with leverage greater than (2k + 2)/n should be carefully examined. Here k is the number of predictors and n is the number of observations.
For example, with k = 3 and n = 51, (2k + 2)/n = (2*3 + 2)/51 ≈ 0.16; if an observation's leverage is greater than 0.16, examine it and, if necessary, delete it.
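A minimal sketch (the dataset and model variables are assumptions for illustration; here k = 3 predictors and auto.dta has n = 74 observations):
sysuse auto, clear
regress price mpg weight length
predict lev, leverage
* cutoff is (2*3 + 2)/74, approximately 0.11
list make lev if lev > (2*3 + 2)/74 & !missing(lev)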
Exercise:
Load the data file and estimate the regression equation:
regress ...
Now check:
for outliers
for influential data
Cook's D test:
predict d, cooksd
list [variables] d if d > 4/n
where n is the number of observations; observations with Cook's D above 4/n deserve scrutiny.
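A worked sketch of this exercise on the built-in auto dataset (the exercise's own data file is not named, so the dataset and model here are assumptions):
sysuse auto, clear
regress price mpg weight length
predict d, cooksd
* auto.dta has n = 74 observations, so the cutoff 4/n is 4/74
list make price d if d > 4/74 & !missing(d)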
2. Checking Homoscedasticity or HETEROSCEDASTICITY
When the variance of the residuals is not constant, there is heteroscedasticity; when the variance is constant, it is known as homoscedasticity.
Heteroscedasticity mostly occurs in cross-sectional data. It can be detected by several graphical or non-graphical methods.
When we detect heteroscedasticity, the hypothesis tests are invalid: the standard errors are biased, and so are the values of the t and F statistics, hence we cannot interpret them correctly in the presence of heteroscedasticity. (Feng Li, Department of Statistics, Stockholm University)
2. Checking Homoscedasticity or HETEROSCEDASTICITY
One of the main assumptions of ordinary least squares regression is homogeneity of the variance of the residuals.
If the model is well fitted, there should be no pattern in the residuals plotted against the fitted values.
Heteroscedasticity often arises in cross-sectional data.
2. Checking Homoscedasticity or HETEROSCEDASTICITY
We can use a graphical command or statistical commands.
Graphical: rvfplot
Statistical: the Breusch-Pagan test and White's test
estat hettest runs the Breusch-Pagan test.
estat imtest runs White's test.
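A minimal sketch (the model variables are illustrative assumptions):
sysuse auto, clear
regress price mpg weight
* residuals versus fitted values; look for a fan or funnel pattern
rvfplot, yline(0)
* Breusch-Pagan test; H0: constant variance
estat hettest
* White's test, reported as part of the information-matrix tests
estat imtest, white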
2. Checking Homoscedasticity or HETEROSCEDASTICITY
Heteroscedasticity can also occur when the model is not specified correctly.
Another reason may be a limited number of dependent variables, or the reliability of an independent variable being somehow linked with two or more dependent variables (Hayes and Cai, 2007).
It may also occur when there is an unequal spread of errors around the regression line, or when observations are of varying sizes, so that they contribute unevenly to the error term, thereby causing heteroscedasticity.
2. Checking Homoscedasticity or HETEROSCEDASTICITY
Graphically, deviations from the central line in the residual plot indicate a problem of heteroscedasticity.
There are several tests for detecting heteroscedasticity; the one used in this research is the Breusch-Pagan test. Its null hypothesis is that the variance of the residuals is constant.
The estat hettest command is used in Stata to check for heteroscedasticity.
If the p-value is very small, we reject the null hypothesis in favor of the alternative, which means the variance is not constant and there is heteroscedasticity.
2. Checking Homoscedasticity or HETEROSCEDASTICITY
The robust option is then used to control for heteroscedasticity, outliers, and other influential observations:
xtreg depvar indepvars, fe robust
fe specifies that the fixed-effects model has been selected, and the robust option is used to control for heteroscedasticity, outliers, and other influential observations present in the data.
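A minimal sketch for panel data (the panel identifier province, the time variable year, and the model variables are assumptions for illustration):
* declare the panel structure before using xtreg
xtset province year
* fixed-effects regression with heteroscedasticity-robust standard errors
xtreg y x1 x2, fe robust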
MULTICOLLINEARITY
The term multicollinearity was first used by Pawel Ciompa in 1910, but the concept was developed by Frisch in his work on confluence analysis in 1934. Multicollinearity is the problem in regression analysis where, due to insufficient information in the sample, there is difficulty in estimating the parameters. (Saika B. and Singh R., 2013)
John Reimer and Rothrock offered a further definition of multicollinearity.
MULTICOLLINEARITY
What happens when two or more variables are
highly correlated?
When there is a perfect linear relationship
among the predictors, the estimates for a
regression model cannot be uniquely computed.
The primary concern is that, as the degree of multicollinearity increases, the standard errors of the coefficients can become wildly inflated.
MULTICOLLINEARITY
It commonly results in misleading and confusing conclusions. One of the reasons for multicollinearity may be the use of an inappropriate dummy variable.
Tests for MULTICOLLINEARITY
Calculate the correlation coefficients among the predictors.
Examine the variance inflation factors (VIF), as in the sketch below.
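A minimal sketch of both checks in Stata (the truncated bullet above is completed with the VIF, the standard Stata diagnostic, as an assumption; the model variables are illustrative):
sysuse auto, clear
correlate mpg weight length
regress price mpg weight length
estat vif
As a common rule of thumb, VIF values above 10 suggest problematic multicollinearity.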
Checking Normality of Residuals
After we run a regression analysis, we can use the predict command to create residuals and then use further commands to check their normality, both graphically and numerically.
Graphical: kdensity, qnorm, and pnorm.
Numerical: iqr and swilk.
After the regression, we use the predict command to generate the residuals:
predict (residual variable), resid
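For instance, a minimal instantiation with assumed placeholder names:
regress y x1 x2
predict r, resid
This creates a new variable r holding the residual for each observation.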
Checking Normality of Residuals
Below we use the kdensity command to produce a kernel density plot, with the normal option requesting that a normal density be overlaid on the plot.
kdensity stands for kernel density estimate; it can be thought of as a histogram with narrow bins and a moving average.
kdensity (residual variable), normal
Checking Normality of Residuals
The pnorm command graphs a standardized normal probability (P-P) plot, while qnorm plots the quantiles of a variable against the quantiles of a normal distribution.
pnorm is sensitive to non-normality in the middle range of the data, and qnorm is sensitive to non-normality near the tails.
If these plots show no marked deviations, we can accept that the residuals are close to a normal distribution.
pnorm (residual variable)
qnorm (residual variable)
Checking Normality of Residuals
Another available test is swilk, which performs the Shapiro-Wilk W test for normality.
The p-value is based on the null hypothesis that the distribution is normal.
If the p-value is greater than 0.05, we do not reject normality; if it is smaller, we reject it.
swilk (residual variable)
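Putting the normality checks together, a minimal sketch (the dataset and model variables are illustrative assumptions):
sysuse auto, clear
regress price mpg weight
predict r, resid
* kernel density with normal overlay
kdensity r, normal
* P-P plot: middle of the distribution
pnorm r
* Q-Q plot: tails of the distribution
qnorm r
* Shapiro-Wilk W test; H0: normality
swilk r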