Regression Assumptions

Best Linear Unbiased Estimate (BLUE)
If the following assumptions are met:

The Model is
M1 Complete
M2 Linear
M3 Additive
Variables are
V1 measured at an interval or ratio scale
V2 without error
The regression error term is
E1 normally distributed
E2 has an expected value of 0
E3 errors are independent
E4 homoscedasticity
E5 predictors are unrelated to error
E6 In a system of interrelated equations the errors are unrelated to each other
Characteristics of OLS if sample is probability sample
Unbiased
Efficient
Consistent
The Three Desirable Characteristics

Lack of bias
$E(b) = \beta$, where $b$ is the sample coefficient and $\beta$ is the true, population coefficient
On the average we are on target
Efficiency
Standard error will be minimum
Remember: $\mathrm{var}(b) = \sigma_e^2 / \sum x_i^2$, where $x_i$ is the deviation of $X_i$ from its mean
OLS will minimize $\sigma_e^2$ (the error variance)
Consistency
As N increases the standard error decreases
Notice: as N increases so does $\sum x_i^2$
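A minimal simulation sketch of lack of bias and consistency. The population values (intercept 2.0, slope 0.5, standard normal X and error) are hypothetical and not taken from the slides. It draws repeated random samples at two sample sizes and compares the mean and the spread of the estimated slope b.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_A, TRUE_B = 2.0, 0.5  # hypothetical population intercept and slope


def ols_slope(x, y):
    """Bivariate OLS slope: cross-deviations over squared x deviations."""
    xd = x - x.mean()
    return float(np.sum(xd * (y - y.mean())) / np.sum(xd ** 2))


def simulate_slopes(n, reps=2000):
    """Draw `reps` random samples of size n and return the slope of each."""
    slopes = np.empty(reps)
    for r in range(reps):
        x = rng.normal(0.0, 1.0, n)
        e = rng.normal(0.0, 1.0, n)  # error with mean 0 and constant variance
        y = TRUE_A + TRUE_B * x + e
        slopes[r] = ols_slope(x, y)
    return slopes


small = simulate_slopes(25)
large = simulate_slopes(400)

print("mean of b, N=25: ", small.mean())
print("mean of b, N=400:", large.mean())
print("sd of b,   N=25: ", small.std())
print("sd of b,   N=400:", large.std())
```

Under these assumptions the mean of b should sit close to 0.5 at both sample sizes (on target on average), while its spread should be clearly smaller at N = 400 than at N = 25.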
M1 Completeness
[Figure: example with the variables Meals and Parents education]
Diagnosis and Remedy

Complete model means no relevant independent variable is omitted
Diagnosis:
Theoretical
Remedy:
Including new variables
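A small sketch of what an incomplete model does to a slope. The data are simulated and every number is a hypothetical choice: z stands for a relevant variable that is correlated with x and is first omitted, then included.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

z = rng.normal(size=n)                # relevant variable that may be omitted
x = 0.7 * z + rng.normal(size=n)      # x is correlated with z
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)  # true slope of x is 2.0


def ols(y, *cols):
    """OLS coefficients for y on a constant plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_incomplete = ols(y, x)[1]     # z omitted: its effect is absorbed by x
b_complete = ols(y, x, z)[1]    # z included

print("slope of x, z omitted: ", b_incomplete)
print("slope of x, z included:", b_complete)
```

In this setup the slope from the incomplete model should be well above 2.0, and adding z should bring it back close to 2.0.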
M2 Linearity
Violation of linearity
An almost perfect relationship will appear as a weak one
Almost all linear relations stop being linear at a certain point
[Figure: three scatter plots of Y against X (X axis from 200 to 1200) with fitted lines; Rsq = 0.1174, Rsq = 0.9313, Rsq = 0.6211]
Diagnosis & Remedy

Diagnosis:
Visual: scatter plots
Comparing regression with continuous and dummied independent variable
$Y = a + bX + e$ becomes $Y = a + b_1 D_1 + \dots + b_{k-1} D_{k-1} + e$, where X is broken up into k dummies ($D_i$) and k-1 of them are included. If the R-square of this equation is significantly higher than the R-square of the original, that is a sign of non-linearity. The pattern of the slopes ($b_i$) will indicate the shape of the non-linearity.
Remedy:
Use dummies
Transform the variables through a non-linear transformation, so that $Y = a + bX + e$ becomes
$Y = a + b_1 X + b_2 X^2 + e$ or
$Y = a + b_1 X + b_2 X^2 + b_3 X^3 + e$ or
$Y = a + b_1 X + \dots + b_k X^k + e$ (Rule: k = # of turns (vertices) plus 1) or
$Y = a + b \log(X) + e$ or
$\log(Y) = a + bX + e$ or
$Y = \exp(a + bX + e)$ or
$Y = a + b/X + e$ etc.
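The dummy-variable check and the polynomial remedy can be sketched on simulated data. The curve, the noise level and the choice of k = 5 equal-width groups are all assumptions made for illustration. The relationship has one turn, so by the rule above a polynomial of degree 2 is tried.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.uniform(0.0, 10.0, n)
# Hypothetical curved relationship with a single turn at x = 5
y = 5.0 + 4.0 * x - 0.4 * x ** 2 + rng.normal(0.0, 1.0, n)


def r_squared(y, predictors):
    """R-square of an OLS regression of y on a constant plus predictors."""
    X = np.column_stack([np.ones(len(y)), predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(1.0 - resid.var() / y.var())


# Break X into k = 5 groups and include k - 1 dummies (group 0 is the baseline)
k = 5
group = np.minimum((x // 2).astype(int), k - 1)
dummies = np.column_stack([(group == g).astype(float) for g in range(1, k)])

r2_linear = r_squared(y, x)
r2_dummies = r_squared(y, dummies)
r2_quadratic = r_squared(y, np.column_stack([x, x ** 2]))

print("R-square, linear in X: ", r2_linear)
print("R-square, k-1 dummies: ", r2_dummies)
print("R-square, X and X^2:   ", r2_quadratic)
```

Here the linear R-square should be near zero although the relationship is strong, and both the dummied and the quadratic versions should fit far better.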
M3 Additivity
$Y = a + b_1 X_1 + b_2 X_2 + e$
The assumption is that $X_1$ and $X_2$ each separately add to Y, regardless of the value of the other.
Imagine that the effect of $X_1$ depends on $X_2$.
E.g. $Inc = a + b_1 Education + b_2 Citizenship + e$
If Citizen: $Inc = a^* + b^*_1 Education + e^*$
If Not Citizen: $Inc = a^{**} + b^{**}_1 Education + e^{**}$
where $b^*_1 > b^{**}_1$
You cannot simply add the two. There are many examples of the violation of additivity:
E.g., the effect of previous knowledge ($X_1$) and effort ($X_2$) on grades (Y)
The effect of race and skills on income (discrimination)
The effect of paternal and maternal education on academic achievement
Diagnosis & Remedy

Diagnosis:
Theoretical
Try other functional forms and compare R-squares
Run the regression separately for different groups
Remedy:
Introducing the multiplicative term as a new variable (statistical interaction)
$Y_i = a + b_1 X_1 + b_2 X_2 + e$ becomes $Y_i = a + b_1 X_1 + b_2 X_2 + b_3 Z + e$, where $Z = X_1 X_2$
If $X_1 = 0$ then $Y_i = a + b_1 \cdot 0 + b_2 X_2 + b_3 \cdot 0 \cdot X_2 + e = a + b_2 X_2 + e$
If $X_1 = 1$ then $Y_i = a + b_1 \cdot 1 + b_2 X_2 + b_3 \cdot 1 \cdot X_2 + e = a + b_1 + b_2 X_2 + b_3 X_2 + e = (a + b_1) + (b_2 + b_3) X_2 + e$
$b_1$ = difference of two intercepts
$b_3$ = difference of two slopes
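A sketch of the interaction remedy with a dummy X1, on simulated data with hypothetical coefficients. It fits the pooled model with the multiplicative term and then the two separate group regressions, to check that b1 and b3 are the differences between the two intercepts and the two slopes.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
x1 = rng.integers(0, 2, n).astype(float)  # dummy: 0 or 1
x2 = rng.normal(12.0, 3.0, n)             # continuous predictor
# Hypothetical truth: the slope of x2 is 1.0 when x1 = 0 and 2.5 when x1 = 1
y = 10.0 + 4.0 * x1 + 1.0 * x2 + 1.5 * x1 * x2 + rng.normal(0.0, 2.0, n)


def ols(y, *cols):
    """OLS coefficients for y on a constant plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Pooled model with the multiplicative term Z = X1 * X2
a, b1, b2, b3 = ols(y, x1, x2, x1 * x2)

# Separate regressions for each group
a0, slope0 = ols(y[x1 == 0], x2[x1 == 0])
a1, slope1 = ols(y[x1 == 1], x2[x1 == 1])

print("b1:", b1, " difference of intercepts:", a1 - a0)
print("b3:", b3, " difference of slopes:    ", slope1 - slope0)
```

Because X1 is a dummy, the pooled model with the interaction reproduces the two separate regressions, so each printed pair should agree up to rounding.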
V1 Proper Level of Measurement

Rows: independent variable. Columns: dependent variable.

| Independent \ Dependent | Nominal: Dichotomous | Nominal: Polytomous | Ordinal | Interval/Ratio |
| --- | --- | --- | --- | --- |
| Nominal: Dichotomous | 2x2 table; Dummy variables with logit/probit | Kx2 table; Dummy variables with multinomial logit/probit | Nx2 table; Dummy variables with ordered logit/probit | Difference of means test; Regression with dummy variables |
| Nominal: Polytomous | 2xK table; Dummy variables with logit/probit | KxK table; Dummy variables with multinomial logit/probit | NxK table; Dummy variables with ordered logit/probit | ANOVA; Regression with dummy variables |
| Ordinal | 2xN table; Dummy variables with logit/probit or just logit/probit | KxN table; Dummy variables with multinomial logit/probit or just multinomial logit/probit | NxN table; Dummy variables with ordered logit/probit or just ordered logit/probit | Regression with dummy variables or just Regression |
| Interval/Ratio | Logit/probit | Multinomial logit/probit | Ordered logit/probit | Regression |

K = # of unordered categories (nominal variables)
N = # of ordered categories (ordinal variables)
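V2 Measurement Error
Suppose $X^* = X + e$

Take $Y = a + bX + \varepsilon$
where X is the real value and e is a random measurement error
Then, writing $b^*$ and $\varepsilon^*$ for the slope and error when $X^*$ is used:
$Y = a + b^* X^* + \varepsilon^*$
$Y = a + b^*(X + e) + \varepsilon^* = a + b^* X + b^* e + \varepsilon^*$
$Y = a + b^* X + E$ where $E = b^* e + \varepsilon^*$ and $b^* = b$
In this equation the slope on the true X does not change, but the error increases. As a result:
Our R-square will be smaller
Our standard errors will be larger
t-values smaller
significance smaller
In addition, because $X^*$ itself contains e, the slope estimated by regressing Y on $X^*$ is attenuated toward zero.
Suppose $X^\# = X + cW + e$

where W is a systematic measurement error and c is a weight
Then
$Y = a + b^* X^\# + \varepsilon^*$
$Y = a + b^*(X + cW + e) + \varepsilon^* = a + b^* X + b^* c W + E$
If $b_2 = b^* c$, then $Y = a + b^* X + b_2 W + E$
$b^* = b$ iff $r_{wx} = 0$ or $r_{wy} = 0$; otherwise $b^* \neq b$, which means that the slope will change together with the increase in the error. (Recall the earlier path-analysis demonstration of what happens to the original relationship once we control for another variable. W is just a control variable here; it just happens to be mixed into X.)
Apart from the problems stated above, that means that

Our slope will be wrong/biased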
Diagnosis & Remedy

Diagnosis:
Look at the correlation of the measure with other measures of the same variable
Remedy:
Include biasing variable W in regression
Use multiple indicators and structural equation models (AMOS)
Confirmatory factor analysis
Better measures
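A simulated sketch of the first remedy, including the biasing variable W in the regression. To keep the illustration clean it assumes a purely systematic measurement error (no random component), a weight c = 1.5 and a W that is correlated with the true X. All of these are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
w = rng.normal(size=n)                   # biasing variable W
x = 0.6 * w + rng.normal(size=n)         # true X, correlated with W
y = 1.0 + 2.0 * x + rng.normal(size=n)   # Y depends on the true X, slope 2.0
x_observed = x + 1.5 * w                 # the measure we have: X + cW, c = 1.5


def ols(y, *cols):
    """OLS coefficients for y on a constant plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_naive = ols(y, x_observed)[1]         # W left mixed into the measure
b_with_w = ols(y, x_observed, w)[1]     # W included as a control

print("slope using the contaminated measure:", b_naive)
print("slope after including W:             ", b_with_w)
```

Under these assumptions the naive slope should be far from 2.0, and controlling for W should recover a slope close to 2.0.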
E1 Normally Distributed Error
Non-Normal Error
Our calculations of statistical significance depend on this assumption
Statistical inference can be robust even when error is non-normal
Diagnosis:
You can look at the distribution of the error. Because of the homoscedasticity assumption (see later), the errors pooled over all predictions should also be normal. (In principle, we have multiple observations for each prediction.)
Remember! Our measured variables (Y and X) do not have to have a normal distribution! Only the error for each prediction.
Remedy:
Any non-linear transformation will change the shape of the distribution of the error
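A sketch of how a non-linear transformation changes the shape of the error distribution. It assumes hypothetical data generated as Y = exp(a + bX + e) with normal e, and uses the skewness of the residuals as a rough summary of their shape.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
x = rng.uniform(0.0, 2.0, n)
# Hypothetical data: the error is normal on the log scale, not on the raw scale
y = np.exp(0.5 + 1.0 * x + rng.normal(0.0, 0.5, n))


def residuals(y, x):
    """Residuals from an OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef


def skewness(r):
    """Sample skewness: third central moment over the variance to the 1.5."""
    d = r - r.mean()
    return float(np.mean(d ** 3) / np.mean(d ** 2) ** 1.5)


skew_raw = skewness(residuals(y, x))          # regression of Y on X
skew_log = skewness(residuals(np.log(y), x))  # regression of log(Y) on X

print("residual skewness, Y on X:     ", skew_raw)
print("residual skewness, log(Y) on X:", skew_log)
```

In this setup the residuals of the untransformed regression should be clearly right-skewed, while those of the log(Y) regression should be close to symmetric.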
E2 Error Has a Non-Zero Mean
[Figure: scatter plot with a solid and a dotted regression line; Rsq = 0.6211]
The solid line gives a negative, the dotted line a positive mean
This can happen when we have some selection problem
Diagnosis:
Visual scatter plot will not help unless we know in advance somehow the true regression line
Remedy:
If it is a selection problem try to address it.
E3 Non-independent errors
Example 1: Suppose you take a survey of 10 people but you interview everyone 10 times. Now your N=100 but your errors are not independent. For the same person you will have similar errors.
Example 2: Suppose you take 10 countries and you observe them in 10 different time periods. Now your N=100 but your errors are not independent. For the same country you will have similar errors.
Example 3: Suppose you take 100 countries and you observe them only once. Now your N=100. But countries that are next to each other are often similar (same geography and climate, similar history, cooperation etc.). If your model underpredicts Denmark, it is likely to underpredict Sweden as well.
Example 4: Suppose you take 100 people but they are all couples, so what you really have is 50 couples. Husband and wife tend to be similar. If your model underestimates one, chances are it does the same for the other. Spouses have similar errors.
Statistical inference assumes that each case is independent of the others, and in the examples above that is not the case. In fact, your effective N is less than 100. This biases your standard error because the formula is tricked into believing that you have a larger sample than you actually have, and larger samples give smaller standard errors and better statistical significance. This may also bias your estimates of the intercept and the slope. Non-linearity is a special case of correlated errors.
Diagnosis & Remedy

It is called autocorrelation because the correlation is between cases and not variables, although autocorrelations can often be traced to certain variables such as geographic distance or same country or person or family.
Diagnosis:
Visual, scatterplot
Checking groups of cases that are theoretically suspect
Certain forms of serial or spatial autocorrelation can be diagnosed by calculating certain statistics (e.g., Durbin-Watson test)
Remedy:
You can include new variables in the equation
E.g.: for serial (temporal) correlation you can include the value of Y in t-1 as an independent variable
For spatial correlation we can often model the relationships by introducing a weight matrix
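A sketch of the Durbin-Watson statistic for serial correlation, computed directly as the sum of squared successive differences of the residuals divided by their sum of squares. The data are simulated; the AR(1) coefficient of 0.8 and the other values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
x = rng.normal(size=n)


def ar1_errors(rho, n):
    """Serially correlated errors: each error carries over rho of the last."""
    e = np.empty(n)
    e[0] = rng.normal()
    for t in range(1, n):
        e[t] = rho * e[t - 1] + rng.normal()
    return e


def residuals(y, x):
    """Residuals from an OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef


def durbin_watson(resid):
    """Near 2 for independent errors, toward 0 for positive serial correlation."""
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))


y_independent = 1.0 + 2.0 * x + rng.normal(size=n)
y_serial = 1.0 + 2.0 * x + ar1_errors(0.8, n)

dw_independent = durbin_watson(residuals(y_independent, x))
dw_serial = durbin_watson(residuals(y_serial, x))

print("Durbin-Watson, independent errors:", dw_independent)
print("Durbin-Watson, AR(1) errors:      ", dw_serial)
```

The statistic only makes sense when the cases have a natural order (here, time), which is why it is listed as a diagnostic for serial rather than general non-independence.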
E4 Heteroscedasticity
Homoscedasticity means equal variance
Heteroscedasticity means unequal variance
We assume that each prediction is not just on target on average but also that we make the same amount of error
Heteroscedasticity results in biased standard errors and statistical significance
Diagnosis:
Visual, scatter plot
Remedy:
Introducing a weight (e.g. using 1/X). You can do that by dividing the entire equation by X.
$Y = a + bX + e$ will become $Y/X = a/X + b + e/X$
Notice that for large values of X the new error ($e/X$) will be smaller. If you run this regression, you have to create $Y/X$, your new dependent variable, and $1/X$, your new independent variable. Notice that in the new equation the old intercept a becomes the slope (of $1/X$) and the old slope b becomes the intercept.
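The 1/X weighting, that is, dividing the whole equation Y = a + bX + e by X, can be sketched on simulated data. The example assumes an error whose spread is proportional to X, a strictly positive X, and hypothetical values a = 3 and b = 2.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
x = rng.uniform(1.0, 10.0, n)       # strictly positive, so 1/X is defined
e = x * rng.normal(0.0, 1.0, n)     # error spread grows in proportion to X
y = 3.0 + 2.0 * x + e               # hypothetical a = 3, b = 2


def fit(y, x):
    """OLS of y on a constant and x; returns coefficients and residuals."""
    X = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef


def spread_ratio(resid, x):
    """Residual sd for above-median X divided by residual sd for the rest."""
    high = x > np.median(x)
    return float(resid[high].std() / resid[~high].std())


coef_old, resid_old = fit(y, x)            # Y = a + bX + e
coef_new, resid_new = fit(y / x, 1.0 / x)  # Y/X = b + a(1/X) + e/X

b_hat = coef_new[0]   # the new intercept estimates the old slope b
a_hat = coef_new[1]   # the new slope (of 1/X) estimates the old intercept a
ratio_old = spread_ratio(resid_old, x)
ratio_new = spread_ratio(resid_new, x)

print("a_hat:", a_hat, " b_hat:", b_hat)
print("spread ratio before:", ratio_old, " after:", ratio_new)
```

A spread ratio well above 1 in the original regression and near 1 in the transformed one indicates that the division has evened out the error variance. This works here only because the error spread was assumed proportional to X; another pattern of heteroscedasticity would call for a different weight.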

E5 Predictor Related to Error

Error represents all factors influencing Y that are not included in the regression equation
If an omitted variable is related to X the assumption is violated. This is the same as the Completeness or Omitted Variable Problem
Diagnosis:
The estimated error (the residual) will ALWAYS be uncorrelated with X, and there is no way to establish the TRUE error
Theoretical
Remedy:
Adding new variables to the model
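A sketch of why this violation cannot be detected from the regression output. In simulated data, where the true error is known, the true error is built to be correlated with X through an omitted factor, yet the OLS residuals are uncorrelated with X by construction. All numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5000
omitted = rng.normal(size=n)                      # factor left out of the model
x = 0.8 * omitted + rng.normal(size=n)            # X is related to that factor
true_error = 2.0 * omitted + rng.normal(size=n)   # so the true error is too
y = 1.0 + 1.5 * x + true_error

# Regress Y on X only
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

corr_true = float(np.corrcoef(x, true_error)[0, 1])
corr_resid = float(np.corrcoef(x, resid)[0, 1])

print("correlation of X with the true error:", corr_true)
print("correlation of X with the residuals: ", corr_resid)
print("estimated slope (true value 1.5):    ", coef[1])
```

The residual correlation is zero up to rounding even though the assumption is badly violated, so the diagnosis has to be theoretical.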
E6 Correlated errors across interrelated equations

We sometimes estimate more than one regression.
Suppose $Y_t = a + b_1 X_{t-1} + b_2 Z_{t-1} + e$ but $X_t = a' + b'_1 Y_{t-1} + b'_2 Z_{t-1} + e'$
$e$ and $e'$ will be correlated (whatever is omitted from both equations will show up in both $e$ and $e'$, making them correlated)
This is also the case in sample selection models
$S = a + b_1 X + b_2 Z + e$, where S is whether one is selected into the sample (Szelenyi: does one engage in household farming)
$Y = a' + b'_1 X + b'_2 Z + b'_3 W + b'_4 V + e'$, where Y is the outcome of interest (Szelenyi: if one does engage in household farming, how much value does he produce)
$e$ and $e'$ will be correlated (whatever is omitted from both equations will show up in both $e$ and $e'$, making them correlated)
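A sketch of how a factor omitted from two equations makes their errors correlated. The two outcomes, the shared factor u and all coefficients are hypothetical and only illustrate the mechanism; this is not a model of the Szelenyi example.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 5000
x = rng.normal(size=n)
z = rng.normal(size=n)
u = rng.normal(size=n)   # factor omitted from both equations

# Two hypothetical equations that share the omitted factor u
y1 = 1.0 + 0.5 * x + 0.3 * z + u + rng.normal(size=n)
y2 = -1.0 + 0.2 * x + 0.7 * z + u + rng.normal(size=n)


def residuals(y, *cols):
    """Residuals from an OLS regression of y on a constant plus the columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef


resid1 = residuals(y1, x, z)
resid2 = residuals(y2, x, z)
error_corr = float(np.corrcoef(resid1, resid2)[0, 1])

print("correlation between the two sets of residuals:", error_corr)
```

With these settings u accounts for half of each error variance, so the residual correlation should come out around 0.5.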
