
http://www.technion.ac.il/docs/sas/ets/index.htm
1. Determine whether the following models are linear or non-linear regression models and
give the reason why.
a) y = a + b x + c x^2 + d z + e
This is a linear regression model (a polynomial model): although it is non-linear in the
variable x, it is linear in the parameters a, b, c, and d. If we change x^2 to a new
independent variable K, the model explicitly has a linear relation.
b) y = a + b x^c + d z + e
If c = 0 or 1, this reduces to a linear model; otherwise it is a non-linear regression
model, because the parameter c enters the model non-linearly and cannot be removed by
any transformation of the variables.

2. Explain the pros and cons of using R2 as a measure of goodness-of-fit for the linear
regression models.
In statistics, the coefficient of determination R^2 is used in the context of statistical
models whose main purpose is the prediction of future outcomes on the basis of other
related information. It is the proportion of variability in a data set that is accounted
for by the statistical model, and so it provides a measure of how well future outcomes
are likely to be predicted by the model.
The main drawback is that R^2 is non-decreasing in the number of regressors: if a new
independent variable is added, R^2 rises (or at least does not fall) even when that
variable has nothing to do with the dependent variable, so a high R^2 by itself does not
justify the model.
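This mechanical property of R^2 can be seen in a small simulation — a minimal numpy sketch (data and coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)    # true model uses only x
junk = rng.normal(size=n)                 # irrelevant regressor

def r_squared(regressors, y):
    """R^2 from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

r2_small = r_squared([x], y)
r2_big = r_squared([x, junk], y)   # never smaller, despite junk being irrelevant
```

Adding the irrelevant regressor cannot lower R^2, which is exactly why R^2 alone is a poor model-selection criterion.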

3. In the following regression model, if price increases by 1%, what would be the change
in y?

log(y) = -6.63 + 0.3274 log(price) + ehat

y increases by 0.3274%. In a log-log model the coefficient 0.3274 is the price elasticity
of demand: a 1% increase in price is associated with a 0.3274% change in y.

4. In the following regression model, if age increases from 35 to 36, what would be the
change in wage, which is measured in thousands of dollars?
wage = 3258 + 3.246 age - 0.275 age^2 + ehat
wage decreases by 16.279: the change is 3.246(36 - 35) - 0.275(36^2 - 35^2)
= 3.246 - 0.275(71) = -16.279 thousand dollars.
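The arithmetic can be checked directly — a quick sketch of the marginal-effect calculation from the fitted quadratic:

```python
def wage_hat(age):
    # fitted equation from question 4 (wage measured in thousands of dollars)
    return 3258 + 3.246 * age - 0.275 * age ** 2

# change in predicted wage when age goes from 35 to 36
change = wage_hat(36) - wage_hat(35)
```

Because the model is quadratic, the effect of one more year of age depends on the starting age; here the negative age^2 term dominates.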

5. What is the problem with the following regression model?


log(wage) = b0 + b1 log(age) + b2 log(age^2) + e
Since log(age^2) = 2 log(age), the regressors are perfectly collinear. This model
violates the OLS assumption that there are no exact linear relationships among the
independent variables (no perfect collinearity).
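The perfect collinearity can be verified numerically — a small numpy sketch (the age values are arbitrary) showing that the design matrix loses rank:

```python
import numpy as np

age = np.array([25.0, 32.0, 41.0, 53.0, 60.0])
# columns: intercept, log(age), log(age^2) = 2*log(age)
X = np.column_stack([np.ones_like(age), np.log(age), np.log(age ** 2)])

# the third column is an exact multiple of the second, so the
# 3-column matrix has rank 2 and X'X is singular
rank = np.linalg.matrix_rank(X)
```

OLS cannot be computed because X'X is not invertible; any software will drop one of the collinear columns.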

6. In the following regression model, is age statistically significant? Is it economically


significant? Be sure to give your reason. Note that the number in parentheses is the
standard error.
log(wage) = ... + 0.000584 age + ... + ehat
(0.000211)
t = 0.000584 / 0.000211 ≈ 2.77 > 1.96 (the 5% critical value), so age is statistically
significant.
Economically, it is insignificant: a one-year change in age is associated with only a
0.0584% change in wage, which is negligible.
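The t-statistic calculation, sketched in Python from the reported coefficient and standard error:

```python
beta_hat = 0.000584   # coefficient on age
se = 0.000211         # standard error (in parentheses)

t_stat = beta_hat / se            # about 2.77
economic_effect = 100 * beta_hat  # percent change in wage per year of age
```

Statistical significance (t above 1.96) and economic significance (a 0.0584% effect) are separate questions, which is the point of question 6.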

7. What is de-seasonalization and why is de-seasonalized data useful?

De-seasonalization is a process which removes the seasonal effects from time-series
data. One way to determine whether a de-seasonalization transformation of the data is
necessary is to examine the autocorrelations: if, for monthly data, the twelfth
autocorrelation is abnormally high, or, for quarterly data, the fourth autocorrelation
is abnormally high, then the data are seasonal in nature and require de-seasonalization
before attempting to fit a model to their behavior. This is more frequently referred to
as seasonal adjustment (S.A.).
De-seasonalized data are useful because we can explain the dependent variable more
precisely after excluding the seasonal factors that recur period after period.
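The autocorrelation check described above can be sketched as follows (synthetic monthly data with an invented seasonal pattern, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
months = np.arange(240)                          # 20 years of monthly data
seasonal = 5.0 * np.sin(2 * np.pi * months / 12) # period-12 seasonal component
x = seasonal + rng.normal(scale=0.5, size=months.size)

def autocorr(x, lag):
    """Sample autocorrelation at a given lag."""
    xc = x - x.mean()
    return (xc[lag:] @ xc[:-lag]) / (xc @ xc)

rho12 = autocorr(x, 12)   # abnormally high -> monthly seasonality
rho5 = autocorr(x, 5)     # a non-seasonal lag, for comparison
```

A twelfth autocorrelation near 1, far above the other lags, is the signature that monthly data need seasonal adjustment before modeling.
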
8. The following result is obtained from regressing consumption measured in thousands of
dollars on income measured in dollars and other variables. For ease of presentation, it
is better to have the coefficient of income without so many zeros after the decimal
point. What kind of transformation should we make to consumption and income, then? What
would the standard error then become?
consumption = ... + 0.000746 income + ... + ehat
(0.000248)
We transform income so that it is measured in thousands of dollars, i.e. divide income
by 1000; the coefficient of income then becomes 0.746 and the standard error becomes
0.248 (both are multiplied by 1000). There is no change to consumption, and the t-ratio
is unaffected.
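The rescaling is a pure change of units, as this small sketch confirms:

```python
# coefficient and standard error when income is measured in dollars
beta_dollars, se_dollars = 0.000746, 0.000248

# measuring income in thousands of dollars divides the regressor by 1000,
# which multiplies the coefficient and its standard error by 1000
scale = 1000
beta_thousands = beta_dollars * scale
se_thousands = se_dollars * scale

# the t-ratio is invariant to the change of units
t_before = beta_dollars / se_dollars
t_after = beta_thousands / se_thousands
```

Nothing statistical changes; only the presentation of the table improves.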

9. Why do researchers prefer over-parameterization to under-parameterization?

Over-parameterization does not affect the unbiasedness of the OLS estimators, whereas
under-parameterization generally causes the OLS estimators to be biased; deriving the
bias caused by omitting an important variable is a standard exercise in misspecification
analysis. The cost of including irrelevant variables is only inefficiency: larger
variances of the OLS estimators, imprecision (larger standard errors), insignificance in
statistical tests, and wider confidence intervals. Researchers prefer
over-parameterization because bias is usually the more serious problem.

10. What are the so-called heteroskedasticity-robust standard errors and why are they
useful? How to get them in SAS?

The heteroskedasticity-robust variance estimate is

Var(bhat_j) = [sum_i rhat_ij^2 uhat_i^2] / SSR_j^2

where rhat_ij denotes the ith residual from regressing x_j on all other independent
variables, uhat_i is the ith OLS residual, and SSR_j is the sum of squared residuals
from that auxiliary regression. The square root of this quantity is called the
heteroskedasticity-robust standard error for bhat_j.
These standard errors are useful because with them OLS inference remains valid
regardless of the kind of heteroskedasticity present in the population.
In SAS:
proc reg;
model cigs = lincome lcigpric educ age agesq restaurn / acov spec;
output out = sample residual = uhat;
run;
The acov and spec keywords in the model statement are called options in SAS statements.
SAS has many such options that control the printed output in different procedures;
usually you have to consult the manuals for a full understanding.
In this specific case, acov instructs SAS to print the heteroskedasticity-robust
variance-covariance matrix of the estimators. If you invoke the acov option, SAS gives
two results for each test: one based on the conventional standard errors and the other
based on the heteroskedasticity-robust standard errors.
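Outside SAS, the same quantity can be computed from the equivalent "sandwich" matrix form, Var(bhat) = (X'X)^{-1} X' diag(uhat^2) X (X'X)^{-1} — a minimal numpy sketch with simulated heteroskedastic data (the data-generating process is made up):

```python
import numpy as np

def hc0_standard_errors(X, y):
    """White (HC0) heteroskedasticity-robust standard errors for OLS."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    u = y - X @ beta                          # OLS residuals
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = X.T @ (X * u[:, None] ** 2)        # X' diag(u^2) X
    cov = XtX_inv @ meat @ XtX_inv            # sandwich estimator
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# error variance grows with |x|: heteroskedastic by construction
y = 1.0 + 0.5 * x + rng.normal(size=n) * (1 + np.abs(x))

beta, robust_se = hc0_standard_errors(X, y)
```

This is the same estimator that SAS reports via the acov option, just written out by hand.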

11. What can we say about the linear regression model if the calculated p-value for the
White test is 0.0856?
The White test checks for the existence of heteroskedasticity; the null hypothesis is
homoskedasticity. Because the p-value 0.0856 is larger than 0.05, the null hypothesis is
not rejected at the 5% level: there is no significant evidence of heteroskedasticity.

12. What can we say about the linear regression model if the calculated Durbin-Watson
test statistic is 0.8746?
Because the statistic is well below 2 and close to 0, the residuals of this linear
regression exhibit positive autocorrelation (DW ≈ 2(1 - ρ)).
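The statistic can be illustrated on simulated AR(1) residuals (a sketch; ρ = 0.7 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
# AR(1) residuals with strong positive autocorrelation (rho = 0.7)
e = np.zeros(300)
for t in range(1, 300):
    e[t] = 0.7 * e[t - 1] + rng.normal()

# Durbin-Watson statistic: sum of squared successive differences
# over the sum of squared residuals; approximately 2(1 - rho)
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```

With ρ = 0.7 the statistic lands near 0.6, well below 2, matching the interpretation above.
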
13. What can we say about the linear regression model if the calculated p-value of the
normality test is 0.0248?
We reject the null hypothesis of normality: the residuals do not follow a normal
distribution.

14. What can we say about the linear regression model if the calculated p-value of the
Ramsey's RESET test is 0.0748?
We cannot reject the null hypothesis at the 5% level: non-linear combinations of the
fitted values do not help explain the dependent variable, so there is no significant
evidence of functional-form misspecification.
Ramsey RESET test

The Ramsey Regression Equation Specification Error Test (RESET) (Ramsey, 1969) is a
general specification test for the linear regression model. More specifically, it tests
whether non-linear combinations of the fitted values help explain the endogenous
variable. The intuition behind the test is that if non-linear combinations of the
explanatory variables have any power in explaining the endogenous variable, then the
model is mis-specified.

15. What is the common feature of the linear probability, probit, and logit models? What
is the major deficiency of the linear probability model?
Their common feature is that the dependent variable is a dummy (binary) variable.
The linear probability model has natural heteroskedasticity: the error has a
Bernoulli-type distribution, so its variance, Var(ui) = Pr(yi=1)[1 - Pr(yi=1)], changes
with the probability of success.
Its major deficiency is that the fitted probability is not bounded: the model can
produce predictions less than zero or greater than one, whereas probabilities must lie
between zero and one.
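The boundedness problem can be shown with a tiny made-up data set: OLS on a binary outcome happily produces fitted "probabilities" outside [0, 1]:

```python
import numpy as np

# binary outcome regressed on x by OLS (the linear probability model)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # fitted line: -0.1 + 0.4 x

p_in = beta[0] + beta[1] * 0.0   # fitted "probability" at x = 0: below zero
p_out = beta[0] + beta[1] * 3.0  # fitted "probability" at x = 3: above one
```

Probit and logit avoid this by passing the linear index through a CDF, which maps any index into (0, 1).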

16. What is the biggest difference between the probit and logit models?
The probit model is based on the cumulative normal distribution, while the logit model
is based on the cumulative logistic distribution.
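The two link functions can be compared directly — a sketch using the standard normal and logistic CDFs; they coincide at the center but the logistic has heavier tails:

```python
import math

def probit_cdf(z):
    """Standard normal CDF (probit link)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def logit_cdf(z):
    """Logistic CDF (logit link); heavier tails than the normal."""
    return 1 / (1 + math.exp(-z))

center_gap = abs(probit_cdf(0) - logit_cdf(0))  # both are 0.5 at zero
tail_gap = abs(probit_cdf(3) - logit_cdf(3))    # they differ in the tails
```

In practice the two models give very similar fitted probabilities except for extreme values of the index.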

17. What do p, d, and q mean in the ARIMA(p,d,q) and why is this model useful?
A nonseasonal ARIMA model is classified as an "ARIMA(p,d,q)" model, where:

p is the number of autoregressive terms,

d is the number of nonseasonal differences, and

q is the number of lagged forecast errors in the prediction equation.

The ARIMA(p,d,q) Model


a(L)(1 - L)^d X_t = b(L) e_t

where a(L) is a p-th order polynomial in the lag operator L, b(L) is a q-th order
polynomial in L, e_t is white noise, and d denotes the number of differences taken in
order to achieve stationarity.
The ARIMA model is useful for analyzing non-stationary time-series data containing
ordinary or seasonal trends.
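The role of d can be illustrated with a sketch: one difference, (1 - L)X_t, turns a linear trend into a stationary (here constant) series:

```python
import numpy as np

t = np.arange(50, dtype=float)
x = 2.0 + 0.5 * t        # deterministic linear trend: non-stationary in mean

dx = np.diff(x)          # d = 1 differencing, (1 - L)x_t

# after one difference every value equals the slope 0.5,
# so the differenced series is trivially stationary
```

Real series need d chosen from the data (e.g. unit-root tests), but the mechanism is the same.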

18. What is the difference between the fixed-effects and random effects models?
A fixed-effects model is a statistical model that represents the observed quantities in
terms of explanatory variables that are all treated as if they were non-random. This is
in contrast to random-effects models, in which all or some of the explanatory variables
are treated as if they arise from random causes.
Fixed-effects model: an unobserved-effects panel data model where the unobserved effect
is allowed to be arbitrarily correlated with the explanatory variables in each time
period.
Random-effects model: an unobserved-effects panel data model where the unobserved
effect is assumed to be uncorrelated with the explanatory variables in each time period.

19. What is the difference between the reduced form and structural form?
A reduced-form equation writes an endogenous variable in terms of exogenous variables
only. A structural equation is meant to measure a causal relationship: we are interested
in the structural parameters b_j themselves. The interpretations differ accordingly; in
a supply-and-demand system, for example, the structural demand equation describes the
change in quantity demanded, while the reduced form describes the change in the
equilibrium quantity.

20. What is simultaneity bias and why can the use of IV eliminate it?
Simultaneity bias is the bias that arises from using OLS to estimate one equation of a
simultaneous-equations model. Consider estimating the demand function by OLS:

Q_i^d = b1 P_i + u_1i

Because price P_i is endogenous (determined jointly with quantity), it is correlated
with the error u_1i; since cov(P_i, u_1i) is nonzero, the OLS estimator of b1 is biased.
An instrumental variable is correlated with P_i but not with the error term, so IV
estimation eliminates the simultaneity bias.

21. What can we do if an equation in the simultaneous equation models is not identified?
If an equation is not identified, we cannot obtain consistent estimates of its
structural parameters; additional restrictions or instruments are needed before the
equation can be estimated.

22. What can we conclude if the p-value of the Hausman test for random effect is 0.0148?
In the Hausman test the null hypothesis is that the random effect is uncorrelated with
the regressors. Since the p-value 0.0148 is below 0.05, we reject the null: we cannot
maintain that the random effect is uncorrelated with the regressors, so the
fixed-effects model should be preferred.
23. What can we conclude if the p-value of the Basmann test for overidentification
restrictions is 0.2485?
The test for overidentifying restrictions proposed by Basmann can be used, when an
equation is over-identified, to test the null that the IVs are uncorrelated with the
error terms. Since the p-value 0.2485 exceeds 0.05, the null is not rejected: the IVs
appear to be uncorrelated with the error terms.

24. Let female be a dummy variable whose value is 1 for females and 0 for males. What
is the effect of education on wage in the following regression model? What is the
difference of impacts of education on wage for males and that for females?
wage = b0 + b1 female + b2 education + b3 (female*education) + e
Effect of education on wage: b2 for males, b2 + b3 for females.
Difference between the two impacts: b3.
25. Let female be a dummy variable whose value is 1 for females and 0 for males. What
is the gender difference?

wage = b0 + b1 female + b2 education + b3 (female*education) + e

female: b0 + b1 + (b2 + b3) education
male: b0 + b2 education
Gender difference: b1 + b3 education.
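Plugging illustrative (made-up) coefficients into the equation confirms that the gender difference is b1 + b3*education:

```python
def wage(female, edu, b0=10.0, b1=-2.0, b2=1.5, b3=-0.3):
    """Wage equation with a female dummy and its interaction with
    education; the coefficient values are invented for illustration."""
    return b0 + b1 * female + b2 * edu + b3 * female * edu

edu = 12
gender_gap = wage(1, edu) - wage(0, edu)   # equals b1 + b3 * edu
```

Because of the interaction term, the gender gap is not a single number: it varies with the level of education.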
