1. Regress y on all x.
2. Select x₁*, the predictor whose removal results in the smallest decrease in R².
3. Test the significance of x₁* given all the other predictors.
Stop if it is significant – no predictor should be removed.
Remove this predictor if it is insignificant and go to the next step.
4. Choose the second predictor x₂*, whose removal gives the smallest decrease in R².
5. Test the significance of x₂* given the other predictors.
Stop if it is significant – no more variables should be removed.
Remove this variable if it is insignificant, and so on, until no more predictors can be
removed from the regression.
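The steps above can be sketched in code. This is a minimal illustration on simulated data, with F-out = 4 as suggested below; the helpers `ols_sse` and `backward_eliminate` and all variable names are my own, not from the notes:

```python
import random

def ols_sse(X, y):
    """Residual sum of squares from OLS, via the normal equations (X'X)b = X'y."""
    n, p = len(X), len(X[0])
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)] for j in range(p)]
    c = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    for j in range(p):                                  # Gaussian elimination with pivoting
        piv = max(range(j, p), key=lambda r: abs(A[r][j]))
        A[j], A[piv], c[j], c[piv] = A[piv], A[j], c[piv], c[j]
        for r in range(j + 1, p):
            f = A[r][j] / A[j][j]
            for k in range(j, p):
                A[r][k] -= f * A[j][k]
            c[r] -= f * c[j]
    b = [0.0] * p
    for j in reversed(range(p)):
        b[j] = (c[j] - sum(A[j][k] * b[k] for k in range(j + 1, p))) / A[j][j]
    return sum((y[i] - sum(X[i][j] * b[j] for j in range(p))) ** 2 for i in range(n))

def backward_eliminate(cols, y, f_out=4.0):
    """Repeatedly drop the regressor with the smallest partial F until all exceed f_out."""
    names, n = list(cols), len(y)
    while names:
        full = [[1.0] + [cols[m][i] for m in names] for i in range(n)]
        sse_full = ols_sse(full, y)
        worst, worst_f = None, None
        for m in names:                                 # partial F for removing each regressor
            red = [[1.0] + [cols[k][i] for k in names if k != m] for i in range(n)]
            f = (ols_sse(red, y) - sse_full) / (sse_full / (n - len(names) - 1))
            if worst_f is None or f < worst_f:
                worst, worst_f = m, f
        if worst_f >= f_out:        # even the weakest regressor is significant: stop
            break
        names.remove(worst)
    return names

random.seed(1)
x1 = [random.gauss(0, 1) for _ in range(40)]
x2 = [random.gauss(0, 1) for _ in range(40)]
x3 = [random.gauss(0, 1) for _ in range(40)]        # unrelated to y; typically eliminated
y = [3 + 2 * a - b + random.gauss(0, 0.3) for a, b in zip(x1, x2)]
print(backward_eliminate({"x1": x1, "x2": x2, "x3": x3}, y))
```

The genuine predictors x1 and x2 have huge partial F values and survive, while the noise regressor x3 is usually removed in the first pass.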
Remarks
For practical convenience we may choose an F-in value and an F-out value for selecting
the regressors in the equation. For example, we may set F-in = 4 and F-out = 4. (Why
4? Because the 5% critical value of F with 1 numerator degree of freedom is roughly 4.)
Sometimes we may choose a smaller F-out than F-in so that it is harder to remove a
regressor that has already been included.
Alternatively, we may set α-in and α-out. For example, α-in = 0.05 and α-out = 0.10.
Example 1
The regression of y on x1, x2, x3 and x4 using the data contained in ‘stepwise.xls’
produces the following results:
Total SS = 743.18, n = 25
Stepwise Regression
3.2 Multicollinearity
This is a situation where the regressor variables are highly correlated. When some
columns of X are linear combinations of other columns, multicollinearity occurs. In
practice we seldom have exact collinearity.
Standardize each regressor column:
x_ij → x_ij* = (x_ij − x̄_i) / √( Σ (x_ij − x̄_i)² ),   so that X → X*.
If one or some of the r’s are close to 1, then X*′X* may be near singular and the
inverse of X*′X* will be very sensitive to the r’s. In this situation X*′X* is said to be
ill-conditioned.
Examples
Since var(β̂) = σ²(X*′X*)⁻¹, we have
[Table: var(β̂₁) under Cases A, B, C and D]
We say that the variance is inflated from 1.0 (the ideal case) to 63.94 in Case A. We
define the variance inflation factor (VIF) to be the increase in the variance of an
estimated coefficient as compared with the ideal case.
From the regression of y on x₁ alone,
var(β̂₁) = σ² / S₁₁.
With the other regressors present,
var(β̂₁) = var(β̂₁ from y on x₁ alone) × 1/(1 − R₁²).
With the presence of other variables, the variance of β̂₁ will be inflated by the factor
1/(1 − R₁²), where R₁² is the R² of the regression of x₁ on the other regressors. We define
VIF = 1/(1 − R₁²) ≡ variance inflation factor of x₁.
Since 0 ≤ R₁² ≤ 1, VIF ≥ 1.
If R₁² = 0, x₁ is uncorrelated with the other regressors, VIF = 1 and var(β̂₁) is not
inflated.
If R₁² = 1, x₁ is an exact linear combination of the other regressors, VIF = ∞ and
var(β̂₁) is inflated without bound.
The average VIF over the k regressors is
VIF̄ = (1/k) Σᵢ₌₁ᵏ VIFᵢ.
A mean VIF much greater than 1, or any individual VIFᵢ > 10, suggests that
multicollinearity may severely affect the stability of the estimated coefficients.
Another diagnostic is the condition number of X*′X*:
Condition number = λ_max / λ_min,
where λ_max and λ_min are the largest and smallest eigenvalues of X*′X*.
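With only two regressors these diagnostics have closed forms, since R₁² = r², the squared correlation between the two columns, and the eigenvalues of the 2×2 correlation matrix are 1 ± |r|. A small check on made-up, nearly collinear data (all numbers below are illustrative):

```python
import math

def corr(u, v):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return suv / (su * sv)

x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1]   # x1 plus a little noise
r = corr(x1, x2)
vif = 1 / (1 - r ** 2)                 # VIF of x1 (and of x2, by symmetry)
cond = (1 + abs(r)) / (1 - abs(r))     # lambda_max / lambda_min of the 2x2 correlation matrix
print(round(r, 4), round(vif, 1), round(cond, 1))
```

Both the VIF and the condition number blow up as |r| approaches 1, which is exactly the ill-conditioning described above.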
Example 2
Student consumption data in a certain month in 2010:
Con = Student consumption other than tuition and boarding fees (in $)
Yd = Student Disposable income (in $)
LiA = Student liquid assets (in $)
Analysis of Variance
SOURCE DF SS MS F p
Regression 2 6739470 3369735 22.75 0.003
Error 5 740530 148106
Total 7 7480000
Regression of CON on LiA only
s = 367.2
Analysis of Variance
SOURCE DF SS MS F p
Regression 2 99151096 49575548 367.73 0.000
Error 6 808901 134817
Total 8 99960000
Correlation matrix
3.3 Autocorrelation
The standard regression model assumes a constant variance for the errors or
disturbances. However, this may not always be satisfied in the real world. For example,
in a regression of expenditure on income, we may expect expenditures to vary less at
lower incomes but more at higher income levels, because higher-income groups have
more room to exhibit different and varied expenditure behaviours. The assumption of
independent errors may also be violated when time series data are involved, as today's
values tend to influence tomorrow's values. In this case we say we have
autocorrelated or serially correlated errors.
ε_i = ρ ε_{i−1} + ν_i,   ν_i ~ IN(0, σ_ν²),   |ρ| < 1.
Note that
E(ε_i) = 0,
var(ε_i) = ρ² var(ε_{i−1}) + 2ρ cov(ν_i, ε_{i−1}) + var(ν_i)
         = ρ² var(ε_i) + σ_ν²   (ν_i is independent of ε_{i−1}, and by stationarity var(ε_i) = var(ε_{i−1})),
so that
var(ε_i) = σ_ν² / (1 − ρ²).
Also
E(ε_i ε_{i−1}) = ρ E(ε_{i−1}²) + E(ν_i ε_{i−1}) = ρ σ_ε² = cov(ε_i, ε_{i−1}).
Thus the disturbances have the same variance but are not independent. The covariance
matrix of the disturbances is

cov(ε) = σ_ε² ×
⎡ 1      ρ      ρ²     ⋯  ρⁿ⁻¹ ⎤
⎢ ρ      1      ρ      ⋯  ρⁿ⁻² ⎥
⎢ ρ²     ρ      1      ⋯  ρⁿ⁻³ ⎥
⎢ ⋮      ⋮      ⋮      ⋱  ⋮    ⎥
⎣ ρⁿ⁻¹   ρⁿ⁻²   ρⁿ⁻³   ⋯  1    ⎦ .
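A quick simulation can check the variance result var(ε_i) = σ_ν²/(1 − ρ²). The values ρ = 0.6 and σ_ν = 1 below are my own choices, so the target is 1/0.64 = 1.5625:

```python
import random

random.seed(0)
rho, n = 0.6, 20000
e, series = 0.0, []
for _ in range(n):
    e = rho * e + random.gauss(0, 1)    # nu_i ~ N(0, 1)
    series.append(e)
m = sum(series) / n
var = sum((v - m) ** 2 for v in series) / n
print(round(var, 3))    # should be near 1 / (1 - 0.6**2) = 1.5625
```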
Let e_i be the least squares residuals from the regression of y on x. The Durbin-Watson
statistic is defined by

DW = d = Σᵢ₌₂ⁿ (e_i − e_{i−1})² / Σᵢ₌₁ⁿ e_i².

d ranges from 0 to 4.
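The statistic is a one-liner. On two made-up residual series (my own toy numbers) it behaves as the bounds suggest: values well below 2 for positively autocorrelated residuals, well above 2 for alternating ones:

```python
def durbin_watson(e):
    """Durbin-Watson statistic of a residual series."""
    return sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e))) / sum(v * v for v in e)

smooth = [1.0, 0.8, 0.9, 0.7, 0.6, 0.8, 0.5, -0.2, -0.4, -0.5, -0.3, -0.6]       # wanders slowly
alternating = [1.0, -1.0, 0.9, -0.9, 1.1, -1.1, 0.8, -0.8, 1.0, -1.0, 0.9, -0.9]
print(round(durbin_watson(smooth), 2))        # well below 2: positive autocorrelation
print(round(durbin_watson(alternating), 2))   # well above 2: negative autocorrelation
```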
y_i = β₀ + β₁ x_i + ε_i,
ε_i = ρ ε_{i−1} + ν_i,   ν_i ~ IN(0, σ_ν²),   |ρ| < 1.
The unknown autocorrelation coefficient ρ is usually estimated by the first order
autocorrelation of the residuals from the regression of y on x.
Cochrane-Orcutt Procedure
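A minimal one-iteration sketch of the procedure on simulated data, assuming the usual transform y*_t = y_t − ρ̂·y_{t−1} and x*_t = x_t − ρ̂·x_{t−1}, with ρ̂ estimated from the lag-1 autocorrelation of the OLS residuals (the helper `slr` and all data below are illustrative):

```python
import random

def slr(x, y):
    """Simple least squares regression: return (intercept, slope)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = sum((a - xb) * (c - yb) for a, c in zip(x, y)) / sum((a - xb) ** 2 for a in x)
    return yb - b * xb, b

def cochrane_orcutt_step(x, y):
    """One transform-and-refit iteration; returns (rho_hat, intercept, slope)."""
    a, b = slr(x, y)
    e = [c - a - b * v for v, c in zip(x, y)]
    rho = sum(e[i] * e[i - 1] for i in range(1, len(e))) / sum(v * v for v in e)
    xs = [x[i] - rho * x[i - 1] for i in range(1, len(x))]   # x* = x_t - rho*x_{t-1}
    ys = [y[i] - rho * y[i - 1] for i in range(1, len(y))]
    a2, b2 = slr(xs, ys)
    return rho, a2 / (1 - rho), b2      # transformed intercept estimates b0*(1 - rho)

random.seed(3)
x = list(range(60))
err, e = [], 0.0
for _ in range(60):
    e = 0.7 * e + random.gauss(0, 1)    # AR(1) disturbances with rho = 0.7
    err.append(e)
y = [10 + 2 * t + u for t, u in zip(x, err)]    # true model: y = 10 + 2x + error
rho_hat, b0, b1 = cochrane_orcutt_step(x, y)
print(round(rho_hat, 2), round(b0, 1), round(b1, 3))
```

The refitted slope and intercept stay close to the true values while the transformed disturbances are approximately uncorrelated; in practice the step is repeated until ρ̂ stabilises.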
Example 3
Filename: House price
House price and household income
Y= average house price in $
X=average household income in $
Analysis of Variance
Source DF SS MS F P
Regression 1 10104102863 10104102863 1465.31 0.000
Residual Error 20 137910319 6895516
Total 21 10242013182
d = DW = 0.383
Since d < 2 we test for positive autocorrelation.
Analysis of Variance
Source DF SS MS F P
Regression 1 823501802 823501802 320.70 0.000
Residual Error 19 48788555 2567819
Total 20 872290357
y = 7988 + 19.4 x.
If the DW statistic is still significant, we may repeat the Cochrane-Orcutt procedure on
y* and x*: estimate the autocorrelation (ρ̂₁, say) from the residuals of the regression
of y* on x* and transform y* and x* accordingly.
Analysis of Variance
Source DF SS MS F P
Regression 1 303543183 303543183 130.97 0.000
Residual Error 18 41718181 2317677
Total 19 345261364
y = 8480 + 19.2 x
after taking into account the autocorrelation of the errors. Had independent errors
been assumed, the estimated regression equation would be
y = 5989 + 20.4 x.
3.4 Indicator variables
A regression model may involve both quantitative and qualitative regressor variables.
For example, to predict the weight (y) of a person from his/her height (x) it might be
more reasonable to use two models, one for males and the other for females because
of the differences in body profile. By defining a variable (called indicator, categorical
or dummy variable) D for gender
D = 0 for female
D = 1 for male
the regression model(s) may be formulated as
y = β 0 + γ D + β1 x + ε .
Note that we have two different equations with the same slope coefficient but different
intercepts:
For females, D=0, thus the regression equation is y = β 0 + β1 x + ε .
For males, D=1, the regression equation becomes y = ( β 0 + γ ) + β1 x + ε .
[Figure: two parallel lines, y = β₀ + β₁x (females, intercept β₀) and
y = (β₀ + γ) + β₁x (males, intercept β₀ + γ)]
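This model has a simple closed-form fit: with a common slope, the OLS slope is the pooled within-group slope, and the two intercepts follow from the group means. A sketch on made-up height-weight numbers (the helper name and data are illustrative):

```python
def fit_dummy_model(x0, y0, x1, y1):
    """OLS for y = b0 + g*D + b1*x, where group 0 has D = 0 and group 1 has D = 1."""
    n0, n1 = len(x0), len(x1)
    xb0, yb0 = sum(x0) / n0, sum(y0) / n0
    xb1, yb1 = sum(x1) / n1, sum(y1) / n1
    sxy = sum((a - xb0) * (b - yb0) for a, b in zip(x0, y0)) \
        + sum((a - xb1) * (b - yb1) for a, b in zip(x1, y1))
    sxx = sum((a - xb0) ** 2 for a in x0) + sum((a - xb1) ** 2 for a in x1)
    b1_ = sxy / sxx                   # common slope: pooled within-group slope
    b0 = yb0 - b1_ * xb0              # intercept of the D = 0 (female) line
    g = (yb1 - b1_ * xb1) - b0        # intercept shift for the D = 1 (male) line
    return b0, g, b1_

# heights (cm) and weights (kg): females then males
b0, g, b1 = fit_dummy_model([150, 160, 170], [50, 55, 60], [160, 170, 180], [65, 70, 75])
print(b0, g, b1)   # two parallel lines, slope 0.5; male line sits 10 kg above the female line
```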
Example 4
The mileage per gallon of private cars is recorded to study the relationship of
mileage with vehicle weight and transmission type. (filename: City MPG)
Regression Analysis: citympg versus weight, auto
Analysis of Variance
Source DF SS MS F P
Regression 2 692.19 346.10 77.34 0.000
Residual Error 29 129.78 4.48
Total 31 821.97
To test whether manual and auto cars have different MPG we can test the significance
of the coefficient γ. If γ is significantly different from 0 it means that the two
regressions have different intercepts, hence two separate regression lines.
y = β 0 + γ 1 D1 + γ 2 D2 + γ 3 D3 + β1 x1 + ... + ε .
3.5 Assessment of assumptions
The ideal conditions for regression are:
a. The relationship is linear
b. The disturbances have the same variance
c. The disturbances are independent
d. The disturbances are normally distributed
e. The disturbances are not correlated with the regressor variables
The violation of any of the above conditions will lead to very undesirable results,
e.g. The estimates are no longer unbiased,
The estimates are not stable (large variances), etc.
Examination of the residuals usually gives us some idea of whether these conditions
are satisfied.
Some plots of residuals commonly used to identify violations of the above conditions
are:
This is the ideal case in which the residuals are scattered randomly around the 0-line.
This is an example where the error variance is not constant: the residuals have smaller
variation at lower levels but larger variation at higher levels.
Example 5
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.972318
R Square 0.945403
Adjusted R Square 0.940439
Standard Error 11.95915
Observations 13
            Coefficients   Standard Error   t Stat     P-value
Intercept   -2235.88       166.6896         -13.4134   3.67E-08
Year        1.223445       0.088647         13.80129   2.73E-08
[Year Residual Plot: residuals vs Year, 1800–1950]
[Plot: residuals vs predicted values]
Though the R-square is quite large, indicating a satisfactory fit, the residuals show a
systematic pattern when plotted against time (the regressor variable) or against the
predicted y. This indicates a non-linear relationship between y and x. For example, the
fit may be improved by including the square of the time term.
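The residual pattern is easy to reproduce: fit a straight line to exactly quadratic data (made-up numbers below) and the residuals come out positive at both ends and negative in the middle:

```python
def slr(x, y):
    """Simple least squares regression: return (intercept, slope)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = sum((a - xb) * (c - yb) for a, c in zip(x, y)) / sum((a - xb) ** 2 for a in x)
    return yb - b * xb, b

t = list(range(13))
y = [2 + 0.5 * v + 0.3 * v * v for v in t]    # smooth quadratic trend, no noise
a, b = slr(t, y)
e = [yi - a - b * v for v, yi in zip(t, y)]
print("".join("+" if v > 0 else "-" for v in e))   # + at the ends, - in the middle
```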
The regression of y on x and x2 gives the following results:
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.99639837
R Square 0.9928097
Adjusted R Square 0.99137165
Standard Error 4.55181799
Observations 13
            Coefficients   Standard Error   t Stat    P-value
Intercept   26948.0616     3594.7123        7.496584  2.07E-05
Year        -29.835602     3.8252303        -7.79969  1.47E-05
Year2       0.00826038     0.0010173        8.119839  1.03E-05
[Year Residual Plot: residuals vs Year, 1800–1950]
The R-square has increased to 99.3%, indicating a near-perfect fit. Though the
residuals no longer show any systematic pattern, the constant variance condition
seems to be violated: the residuals vary very little in the early years but substantially
in later years. This means that the variances of the disturbances increase with time. The
problem of non-constant variances will not be handled here.
3.6 Non-linear effects
In some applications, though the dependent variable does not seem to be linearly related
to the predictor variable, it may be possible to linearize the relation so that the linear
regression procedure can still be applied to estimate the non-linear relation.
The table below gives some common nonlinear relations that can be linearized by a
simple transformation.
Model          Equation            Linearized form               Regress
Power          y = a·x^b           ln y = ln a + b·ln x          ln(y) on ln(x)
Exponential    y = a·e^(bx)        ln y = ln a + b·x             ln(y) on x
Reciprocal     y = a·x/(b + x)     1/y = 1/a + (b/a)·(1/x),      1/y on 1/x
(growth rate                       with y′ = 1/y, x′ = 1/x
model)
Logarithmic    y = a + b·ln x      y = a + b·x′, x′ = ln x       y on ln(x)
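As an illustration of the first row, a power relation y = a·x^b (made-up data with a = 2 and b = 1.5) is recovered exactly by regressing ln(y) on ln(x) and exponentiating the intercept:

```python
import math

def slr(x, y):
    """Simple least squares regression: return (intercept, slope)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = sum((u - xb) * (v - yb) for u, v in zip(x, y)) / sum((u - xb) ** 2 for u in x)
    return yb - b * xb, b

xs = [1, 2, 3, 4, 5, 8, 10]
ys = [2 * v ** 1.5 for v in xs]          # exact power-law data, y = 2 x^1.5
la, b = slr([math.log(v) for v in xs], [math.log(v) for v in ys])
print(round(math.exp(la), 4), round(b, 4))   # recovers a = 2 and b = 1.5
```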
Example 6
The damage susceptibility of peaches in relation to the height from which they are
dropped (drop height, measured in mm) and the density of the peaches (measured in
g/cm3) is given in the following table.
ln y = ln α + x ln β
Sxx = 1716771 – (4199.7)2/11 = 113363.72
y = (1.4087)(0.9967)x
R2 = 1.2086/2.9404 = 0.411
3.7 Regression without intercept
yi = β xi + ε i
Let the fitted line be ŷ_i = bx_i. The least squares principle is to choose b to minimize
φ = Σᵢ₌₁ⁿ (y_i − bx_i)².
∂φ/∂b = −2 Σᵢ₌₁ⁿ x_i(y_i − bx_i).
Setting this partial derivative equal to zero, we shall obtain the normal equation
b Σx_i² = Σx_i y_i,
which, upon solving, gives
b = Σxy / Σx².
Note that
b = Σxy / Σx² = (1/Σx²) Σ x(βx + ε) = β + (1/Σx²) Σ xε.
Obviously E(b) = β, so b is unbiased. In fact, b is BLUE (best linear unbiased).
var(b) = E(b − E(b))² = E( Σxε / Σx² )² = σ² / Σx².
Residual SS = Σ(y − bx)² = Σy² − b²Σx².
Here, R² = b²Σx² / Σy². Note that a different definition of R² than in the with-intercept
case is used here.
3.7.1 Sum of residuals
yi = α + β xi + ε i .
e = y − ŷ = y − a − bx
  = y − (ȳ − bx̄) − bx
  = (y − ȳ) − b(x − x̄).
Thus
Σe = Σ(y − ȳ) − b Σ(x − x̄) = 0.
For the case without intercept,
e = y − ŷ = y − bx = y − x · (Σxy / Σx²).
Σe = Σy − (Σx)(Σxy) / Σx²
   = (1/Σx²) Σᵢ (Σx² − x_i Σx) y_i.
This sum is not fixed; it may take any value. Thus the residuals sum to 0 when there is
an intercept but could differ from 0 when there is no intercept.
Σe² = Σ(y − bx)² = Σy² − b²Σx² = Σy² − (Σxy)² / Σx².
s² = Residual SS / (n − 1).
Estimated var(b) = s² / Σx².
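The contrast in residual sums is easy to verify numerically on made-up data: residuals from the fit with an intercept sum to zero, while those from the fit through the origin generally do not:

```python
def slr(x, y):
    """Simple least squares regression with intercept: return (intercept, slope)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = sum((u - xb) * (v - yb) for u, v in zip(x, y)) / sum((u - xb) ** 2 for u in x)
    return yb - b * xb, b

x = [1, 2, 3, 4, 5]
y = [3.2, 4.1, 6.3, 7.0, 9.4]
a, b = slr(x, y)
e_with = [v - a - b * u for u, v in zip(x, y)]
b0 = sum(u * v for u, v in zip(x, y)) / sum(u * u for u in x)   # b = sum(xy) / sum(x^2)
e_origin = [v - b0 * u for u, v in zip(x, y)]
print(round(sum(e_with), 12), round(sum(e_origin), 4))   # first is 0 (up to rounding)
```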
Exercises
1. The data set (Nerlove data, filename: nerlove) has been used by many for
benchmarking software performance. You are required to find a regression
model to predict the kilowatt output from other factors. Use the three selection
procedures in turn to obtain the best set of predictor variables. Show your
work step by step. You may use computer packages to help with your
computation and check your results with those obtained from a statistical
package with a variable selection procedure.
25 7.5492 2969 8183.34 80.657 9.0000 0.2397 0.3972 0.3631
75 22.5612 3571 7297.71 78.255 41.5951 0.1142 0.1833 0.7025
167 21.5587 3886 9538.68 63.569 30.8894 0.1252 0.2033 0.6715
62 20.8671 3965 8403.59 74.480 33.1992 0.1162 0.2151 0.6687
80 21.5454 3981 8186.05 75.082 35.2049 0.1052 0.2299 0.6650
89 17.4802 4148 7536.89 74.025 24.5837 0.1176 0.2035 0.6789
82 29.8011 4187 7996.44 74.120 47.4257 0.1052 0.1824 0.7125
181 19.4391 4560 8558.37 76.464 23.7777 0.1531 0.2577 0.5893
96 30.2067 5286 7084.10 73.325 38.3384 0.0884 0.1969 0.7147
103 24.2903 5316 9759.83 74.025 27.8380 0.1894 0.1740 0.6366
85 30.8773 5643 10182.50 61.040 27.8498 0.1722 0.2204 0.6074
95 22.4421 5648 8954.12 78.440 25.9160 0.0834 0.2111 0.7055
179 33.9733 5708 10024.20 78.102 42.1660 0.0986 0.1826 0.7188
91 19.9008 5785 7969.55 71.910 22.2448 0.1093 0.2180 0.6727
111 37.0666 6754 10177.90 77.197 25.6208 0.2070 0.2363 0.5566
81 35.5303 6770 7798.26 67.570 29.8250 0.1108 0.2814 0.6078
112 25.1686 6779 7826.93 74.200 20.2790 0.1427 0.2662 0.3909
87 24.3565 6793 6336.88 70.295 18.5909 0.1266 0.3253 0.5481
76 33.0175 6837 7310.15 69.795 28.4405 0.1187 0.2515 0.6298
110 40.5281 6891 6769.55 74.120 35.9651 0.0895 0.2393 0.6711
71 42.2514 7320 5879.51 92.063 39.2104 0.0864 0.2064 0.7072
177 33.8814 7382 7512.72 72.362 25.9001 0.1393 0.2486 0.6140
104 31.2922 7484 8063.73 67.680 23.5267 0.1713 0.2535 0.5752
94 27.0832 7896 7119.96 74.513 20.1100 0.1196 0.2484 0.6320
100 32.5840 7930 7119.01 48.997 22.8380 0.1209 0.2772 0.6018
120 52.7634 9145 10373.50 81.750 35.8083 0.2027 0.1997 0.5976
115 41.1798 9275 8657.53 76.140 24.5804 0.1047 0.3284 0.5666
102 47.3864 9530 7624.57 83.880 31.5825 0.1266 0.2106 0.6628
97 30.1678 9602 7054.18 59.977 20.2010 0.0928 0.2164 0.6908
92 28.7861 9660 6686.73 79.542 20.2630 0.0697 0.2391 0.6913
132 57.7267 10004 6472.86 76.300 28.0959 0.1806 0.2362 0.5832
123 38.8472 10057 6035.95 81.578 25.8240 0.0844 0.2178 0.6978
105 31.9884 10149 6437.92 73.140 18.5343 0.1169 0.2367 0.6464
166 51.7415 10361 9578.63 68.016 28.1423 0.1913 0.2407 0.5680
114 55.1764 10855 8061.96 71.490 31.7601 0.1192 0.2362 0.6445
125 48.1125 11114 8413.86 69.975 22.5536 0.1301 0.2969 0.5730
139 76.2528 11667 10436.30 80.660 46.0701 0.1120 0.1708 0.7172
169 66.1032 11837 8709.43 75.379 31.3321 0.1627 0.2103 0.6296
118 68.4800 12542 8142.84 80.385 35.7882 0.1336 0.1688 0.6976
126 79.0705 12706 9282.51 70.853 37.2477 0.1108 0.2011 0.6880
113 45.1827 12936 8320.06 65.760 22.0330 0.1027 0.1992 0.6981
106 41.9016 12954 6460.64 62.330 21.7550 0.0865 0.2194 0.6941
129 77.8849 13702 7113.79 70.850 34.9616 0.1212 0.2121 0.6667
119 97.3859 13846 7786.37 88.540 44.1571 0.1003 0.2066 0.6931
117 80.3593 16311 7282.61 81.550 40.9692 0.0527 0.1337 0.8136
176 79.6207 16508 9404.97 78.044 42.2086 0.1501 0.1556 0.6943
135 90.7168 17280 9191.47 72.967 36.8816 0.0918 0.1795 0.7287
109 58.1154 17875 6288.41 73.395 20.6191 0.0658 0.2781 0.6561
174 107.9780 18455 6690.23 76.300 32.9654 0.1513 0.2101 0.6386
140 134.2280 19445 9829.32 67.580 38.8027 0.1756 0.1834 0.6410
171 90.3718 21956 7954.47 83.338 22.9115 0.1169 0.2984 0.5847
170 113.2560 22522 9500.78 76.732 25.0289 0.1961 0.2604 0.5435
127 111.8680 23217 6873.73 83.880 33.3944 0.0849 0.2007 0.7144
142 125.3360 24001 8047.35 74.372 33.0932 0.0998 0.2457 0.6544
137 183.2320 27118 9914.36 78.480 41.7578 0.1280 0.2265 0.6455
130 87.1015 27708 6378.23 63.600 20.3000 0.1060 0.2257 0.6683
144 240.5140 29613 9312.93 81.750 41.8872 0.1561 0.2017 0.6422
143 191.5630 30958 9810.10 69.541 36.3076 0.1636 0.1524 0.6840
141 168.3780 34212 5683.83 80.385 40.5286 0.0651 0.1361 0.7988
138 169.2350 38343 9117.16 65.992 31.5897 0.0663 0.2192 0.7144
175 269.7730 46870 9761.38 69.541 33.1999 0.1594 0.2194 0.6212
172 240.4860 53918 6068.87 78.380 31.1954 0.0966 0.1846 0.7188
1564.25 601.46 277.44 32.00 404.44 MIDWEST
1634.75 585.10 312.35 36.00 283.11 MIDWEST
1159.25 524.56 292.87 34.00 222.44 SOUTH
1202.75 535.17 268.27 31.00 283.11 WEST
1294.25 486.03 309.85 32.00 242.66 WEST
1467.50 540.17 291.03 28.00 333.66 MIDWEST
1583.75 583.85 289.29 27.00 313.44 MIDWEST
1124.75 499.15 272.55 26.00 374.11 WEST
3. The table (filename: passenger miles) below gives the cost in $ and passenger
miles of an airline for 22 consecutive years. Run a regression of C on Q in the
form:
C = β 0 + β1Q + ε
to see how passenger miles affect its operating cost. Calculate the Durbin-Watson
statistic and test for its significance; state whether the errors are possibly positively
or negatively correlated. If the D-W statistic is significant, apply the Cochrane-Orcutt
procedure to remove the error autocorrelation. State your final regression equation.
T = year;  Q = output, revenue;  C = total cost, in $
T   Q         C
1   1140640   952.757
2   1215690   986.757
3 1309570 1091.98
4 1511530 1175.78
5 1676730 1160.17
6 1823740 1173.76
7 2022890 1290.51
8 2314760 1390.67
9 2639160 1612.73
10 3247620 1825.44
11 3787750 1546.04
12 3867750 1527.9
13 3996020 1660.2
14 4282880 1822.31
15 4748320 1936.46
16 569292 520.635
17 640614 534.627
18 777655 655.192
19 999294 791.575
20 1203970 842.945
21 1358100 852.892
22 1501350 922.843
The rate of return of a factor is defined as the percentage change in output with
respect to the percentage change in input of the factor; e.g., the rate of return of
capital K is given by
(∂Q/Q) / (∂K/K) = (K/Q)·(∂Q/∂K) = ∂ln Q / ∂ln K.
(a) Show that the rates of return of capital and labour are α and β respectively.
8 4257.46 714.2 5585.01
9 1625.19 320.54 1618.75
10 1272.05 253.17 1562.08
11 1004.45 236.44 662.04
12 598.87 140.73 875.37
13 853.1 145.04 1696.98
14 1165.63 240.27 1078.79
15 1917.55 536.73 2109.34
16 9849.17 1564.83 13989.55
17 1088.27 214.62 884.24
18 8095.63 1083.1 9119.7
19 3175.39 521.74 5686.99
20 1653.38 304.85 1701.06
21 5159.31 835.69 5206.36
22 3378.4 284 3288.72
23 592.85 150.77 357.32
24 1601.98 259.91 2031.93
25 2065.85 497.6 2492.98
26 2293.87 275.2 1711.74
27 745.67 137 768.59
Production data
Primary metals, 27 Statewide observations, data are per establishment
ValueAdd : value added
Labor : labor input
Capital : capital stock - gross value of plant and equipment
(c) Find the returns of the inputs and total return to scale.
(d) If the return to a factor is smaller than that of another, the former is said to
be more intensive than the latter in production. Which factor is more
intensive?