DS - Tute 2

Question 1 (14 points)
Using data consisting of variables Y, X1, X2 and X3, a few regression models, with Y being
the response variable, were generated whose output is provided below. Also provided is the
pair-wise correlation matrix.
Correlation Matrix
      Y      X1     X2     X3
Y     1.00
X1    0.77   1.00
X2    0.38   0.00   1.00
X3    0.21   0.00   0.00   1.00
Model 1: Yi = β0 + β1 X1i + εi
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.77
R Square
Adjusted R Square    0.56
Standard Error       3.54
Observations         20.00
ANOVA
             df      SS       MS       F       Significance F
Regression   1.00    320.00   320.00   25.51   0.00
Residual     18.00   225.80   12.54
Total        19.00   545.80
             Coefficients   Standard Error   t Stat   P-value
Intercept    18.90          0.79             23.86    0.00
X1           4.00           0.79             5.05     0.00
Model 2: Yi = β0 + β1 X2i + εi
SUMMARY OUTPUT
Multiple R           0.38
R Square
Adjusted R Square    0.10
Standard Error       5.09
Observations         20.00
ANOVA
             df      SS       MS      F      Significance F
Regression   1.00    80.00    80.00   3.09   0.10
Residual     18.00   465.80   25.88
Total        19.00   545.80
             Coefficients   Standard Error   t Stat   P-value
Intercept    18.90          1.14             16.62    0.00
X2           2.00           1.14             1.76     0.10
Model 3: Yi = β0 + β1 X3i + εi
SUMMARY OUTPUT
Multiple R           0.21
R Square
Adjusted R Square    -0.01
Standard Error       5.38
Observations         20.00
ANOVA
             df      SS       MS      F      Significance F
Regression   1.00    24.20    24.20   0.84   0.37
Residual     18.00   521.60   28.98
Total        19.00   545.80
             Coefficients   Standard Error   t Stat   P-value
Intercept    18.90          1.20             15.70    0.00
X3           1.10           1.20             0.91     0.37
a) If variable X2 is added to Model 1, what will happen to the coefficient of X1? Will it
increase, decrease, or remain the same? Explain your answer.
[3 points]
b) If variable X2 is added to Model 1, provide the closest possible estimate of the
coefficient of determination for the resulting model. Explain your estimate.
[3 points]
c) If a regression model was obtained by including all three explanatory variables X1, X2
and X3, estimate the standard error of estimate for the resulting model. Provide the
closest possible estimate and explain your answer.
[4 points]
d) If a forward stepwise regression to select variables was performed with the maximum
allowable p value for keeping a variable being 0.05, what will be the most likely
model that will result? Explain your answer.
[4 points]
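The correlation matrix reports pairwise correlations of 0.00 among X1, X2 and X3, so one way to approach parts b) and c) is to treat the regression sums of squares of the three simple models as additive. The Python sketch below works through that arithmetic. It assumes the correlations are exactly zero (they are only shown to two decimals), and the names `SSR`, `SST` and `combined_fit` are illustrative, not part of the original output.

```python
import math

# Values read from the three simple-regression ANOVA tables above.
SST = 545.80                                    # total sum of squares (same in every model)
SSR = {"X1": 320.00, "X2": 80.00, "X3": 24.20}  # regression SS of each one-variable model
N = 20                                          # number of observations


def combined_fit(variables):
    """Return (R^2, standard error) for a model with the given regressors.

    Assumption: the regressors are exactly uncorrelated with one another,
    so their regression sums of squares simply add up.
    """
    ssr = sum(SSR[v] for v in variables)
    sse = SST - ssr
    df_resid = N - len(variables) - 1
    r_squared = ssr / SST
    std_error = math.sqrt(sse / df_resid)
    return r_squared, std_error


r2_12, se_12 = combined_fit(["X1", "X2"])
r2_all, se_all = combined_fit(["X1", "X2", "X3"])
print(f"X1 + X2:      R^2 = {r2_12:.3f}, s = {se_12:.2f}")
print(f"X1 + X2 + X3: R^2 = {r2_all:.3f}, s = {se_all:.2f}")
```

Under the same zero-correlation assumption, the estimated coefficient of X1 would not be expected to change when X2 is added, because uncorrelated regressors do not share explained variation.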
Question 2 (25 points)

The following table contains sample data on millions of gallons of gasoline consumed (Gallons),
the average retail price in cents (RetailPrice), Consumer Price Index (CPI), CPI for public
transportation (CPITrans), number of registered cars in thousands (RegCars), average
mileage of cars in Miles/Gallon (MPG) and disposable income in dollars (DispInc). This data
relates to a state in the US. What factors can be used to explain gasoline consumption?
Some regression outputs (with a few missing values) are provided in subsequent pages. The
pair-wise correlation matrix is also provided.
Year   Gallons   RetailPrice   CPI     CPITrans   RegCars   MPG     DispInc
1962   43771     30.64         90.6    87.4       66638     14.37   6271
1963   45246     30.42         91.7    88.5       69842     14.26   6378
1964   47567     30.35         92.9    90.1       72969     14.25   6727
1965   50275     31.15         94.5    91.9       76634     14.15   7027
1966   53312     32.08         97.2    95.2       80106     14.1    7280
1967   55110     33.16         100     100        82367     14.05   7513
1968   58524     33.71         104.2   104.6      85793     13.91   7728
1969   62448     34.84         109.8   112.7      89156     13.75   7891
1970   65784     35.69         116.3   128.5      92095     13.7    8134
1971   69514     36.43         121.3   137.7      96144     13.73   8322
1972   73463     36.13         125.3   143.4      100658    13.67   8562
1973   78011     38.82         133.1   144.8      106119    13.29   9042
1974   74217     52.41         147.7   148        109823    13.65   8867
1975   76457     57.22         161.2   158.6      116679    13.74   8944
1976   78447     59.47         170.5   174.2      115170    13.93   9175
1977   80677     63.07         181.5   182.4      118711    14.15   9381
1978   83233     65.71         195.4   187.8      121717    14.26   9735
1979   80233     87.79         217.4   200.3      125750    14.49   9829
1980   73375     119.1         246.8   251.6      127448    15.32   9722
1981   71718     131.1         272.4   312        129123    15.68   9769
1982   72848     122.2         289.1   346        129500    16.36   9725
1983   73156     115.7         298.4   362.6      131723    16.81   9930
1984   71180     112.9         311.1   385.2      133751    17.8    10419
1985   69450     111.5         322.2   402.8      137308    18.28   10622
1986   71404     85.7          328.4   426.4      140693    18.35   10947
1987   70984     89.7          340.4   441.4      142209    19.26   10976
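As a way to check the data against the regression output that follows, the simple regression of Gallons on RetailPrice can be refitted directly from the two columns. The sketch below uses only the Python standard library; the function name `simple_ols` is illustrative, and the fit should land close to the Model 1 coefficients reported in the output (small differences from rounding are to be expected).

```python
# Gallons (millions) and RetailPrice (cents) for 1962-1987, copied from the data table.
gallons = [43771, 45246, 47567, 50275, 53312, 55110, 58524, 62448, 65784,
           69514, 73463, 78011, 74217, 76457, 78447, 80677, 83233, 80233,
           73375, 71718, 72848, 73156, 71180, 69450, 71404, 70984]
price = [30.64, 30.42, 30.35, 31.15, 32.08, 33.16, 33.71, 34.84, 35.69,
         36.43, 36.13, 38.82, 52.41, 57.22, 59.47, 63.07, 65.71, 87.79,
         119.1, 131.1, 122.2, 115.7, 112.9, 111.5, 85.7, 89.7]


def simple_ols(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b1, b0, R^2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sst = sum((yi - y_bar) ** 2 for yi in y)
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, 1 - sse / sst


slope, intercept, r_squared = simple_ols(price, gallons)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r_squared:.3f}")
```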
Correlation Matrix
              Gallons   RetailPrice   CPI     CPITrans   RegCars   MPG     DispInc
Gallons       1.000
RetailPrice   0.525     1.000
CPI           0.543     0.917         1.000
CPITrans      0.469     0.867         0.987   1.000
RegCars       0.807     0.862         0.930   0.887      1.000
MPG           0.162     0.725         0.894   0.935      0.694     1.000
DispInc       0.813     0.808         0.912   0.880      0.990     0.695   1.000
Model 1: Gallonsi = β0 + β1 RetailPricei + εi
SUMMARY OUTPUT
Multiple R           0.53
R Square             0.28
Adjusted R Square    0.25
Standard Error       10055.36
Observations         26.00
ANOVA
             df   SS              MS             F      Significance F
Regression                                       9.15   0.01
Residual                          101110215.62
Total             3351556320.62
              Coefficients   Standard Error   t Stat   P-value
Intercept     56236.32       4162.48          13.51    0.00
RetailPrice   171.89         56.83            3.02     0.01
Model 2: Gallonsi = β0 + β1 RetailPricei + β2 RegCarsi + εi
SUMMARY OUTPUT
Multiple R           0.87
R Square             0.76
Adjusted R Square    0.74
Standard Error       5876.89
Observations         26.00
ANOVA
             df   SS   MS              F       Significance F
Regression             1278593598.83   37.02   0.00
Residual
Total
              Coefficients   Standard Error   t Stat   P-value
Intercept     10007.44       7151.09          1.40     0.18
RetailPrice   -215.88        65.46            -3.30    0.00
RegCars       0.66           0.10             6.87     0.00
Model 3: Gallonsi = β0 + β1 RetailPricei + β2 RegCarsi + β3 MPGi + β4 DispInci + εi
SUMMARY OUTPUT
Multiple R           0.99
R Square             0.98
Adjusted R Square    0.98
Standard Error       1640.38
Observations         26.00
ANOVA
             df   SS   MS             F        Significance F
Regression             823762178.78   306.14   0.00
Residual
Total
              Coefficients   Standard Error   t Stat   P-value
Intercept     54733.17       5188.74          10.55    0.00
RetailPrice   -65.34         26.66            -2.45    0.02
RegCars       0.43           0.15             2.92     0.01
MPG           -4883.99       306.39           -15.94   0.00
DispInc       4.91           2.27             2.17     0.04
a) In Model 1, what amount of the total variation in the response variable is explained
by the variable, RetailPrice? Show your calculation.
[3 points]
b) If Model 1 were to be used to predict the gasoline consumption for the year 1988,
what would be the 95% prediction interval for gasoline consumption if the retail price
for that year is projected to be $0.66 per gallon? Assume that the mean value of retail
price in the data set is $0.645 and the sample standard deviation is $0.353. Show
your work. [5 points]
c) Given the presence of RetailPrice and RegCars in the model, does the addition of
explanatory variables MPG and DispInc add significant additional information to the
model? Perform a partial F test to answer this question. Use a significance level of
α = 0.05. [4 points]
d) From a macroeconomic standpoint, disposable income is a factor worth considering
only if for each dollar increase in the average disposable income, the consumption of
gasoline increases by at least 2 million gallons. Given the presence of factors retail
price, number of registered cars and the average mileage of cars, is there sufficient
evidence to suggest that gasoline consumption does increase by at least 2 million
gallons for each dollar increase in average disposable income? Use a significance
level of α = 0.05 to answer this question.
[5 points]
e) Would it be worthwhile adding variable CPITrans to Model 3? Why or why not?
[3 points]
f) As per Model 1, if the price of gasoline increases then the consumption also increases.
However, as per Model 2, the effect is the opposite. Which one is really true? How do
you explain this dichotomy? Explain your answer.
[5 points]
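The calculations behind parts b), c) and d) follow standard formulas, sketched here in Python as a working aid rather than a model answer. The function names are illustrative. Two assumptions are made: the projected price and its summary statistics are converted to cents to match the units of the RetailPrice column, and the critical value t(0.025, 24) = 2.064 is taken from a t table. Residual degrees of freedom are inferred as n - k - 1 from n = 26.

```python
import math


def prediction_interval(b0, b1, s, n, x0, x_mean, x_sd, t_crit):
    """Prediction interval for a single new observation in simple regression."""
    y_hat = b0 + b1 * x0
    sxx = (n - 1) * x_sd ** 2  # sum of squared deviations of x
    se_pred = s * math.sqrt(1 + 1 / n + (x0 - x_mean) ** 2 / sxx)
    return y_hat - t_crit * se_pred, y_hat + t_crit * se_pred


def partial_f(s_reduced, df_reduced, s_full, df_full):
    """Partial F statistic computed from the standard errors of two nested models."""
    sse_reduced = s_reduced ** 2 * df_reduced
    sse_full = s_full ** 2 * df_full
    extra_terms = df_reduced - df_full
    return ((sse_reduced - sse_full) / extra_terms) / (sse_full / df_full)


# Part b: Model 1, price in cents (assumed unit conversion), t_crit from a t table.
low, high = prediction_interval(56236.32, 171.89, 10055.36, 26,
                                66.0, 64.5, 35.3, 2.064)

# Part c: Model 2 (reduced, 23 residual df) against Model 3 (full, 21 residual df).
f_stat = partial_f(5876.89, 23, 1640.38, 21)

# Part d: test statistic for H0: beta_DispInc <= 2 against H1: beta_DispInc > 2 (Model 3).
t_stat = (4.91 - 2.0) / 2.27

print(f"95% prediction interval: ({low:.0f}, {high:.0f})")
print(f"partial F = {f_stat:.1f} on (2, 21) df")
print(f"t = {t_stat:.2f} on 21 df")
```

The statistics for parts c) and d) still need to be compared with the F(2, 21) and one-sided t(21) critical values at α = 0.05.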

Question 3 (25 points)

Below are data on a sample of antique items sold. Each data item lists the price for which it
was sold (in $), the age of the antique piece (in years) and the number of bidders. What are
the factors that can be used to explain the price of an antique item?
auct_pr   age   num_bid
$946      113
$1,336    126   10
$744      115
$1,979    182   11
$1,522    150
$1,235    127   13
$1,483    159
$1,152    117   13
$1,545    175
$1,262    168
$845      127
$1,055    108   14
$1,253    132   10
$1,297    137
$1,147    137
$1,080    115   12
$1,550    182
$1,047    156
$1,792    179
$729      108
$854      143
$1,593    187
$1,175    111   15
$1,713    137   15
$1,356    194
$1,822    156   12
$1,884    162   11
$1,024    117   11
$2,131    170   14
$785      111
$1,092    153
$2,041    184   10
Provided below are the pairwise correlation matrix and some regression outputs.
Exhibit I: Correlation Matrix

Correlations
                                AUCT_PR    AGE        NUM_BID
AUCT_PR   Pearson Correlation   1          .730(**)   .395(*)
          Sig. (2-tailed)                  .000       .025
          N                     32         32         32
AGE       Pearson Correlation   .730(**)   1          -.254
          Sig. (2-tailed)       .000                  .161
          N                     32         32         32
NUM_BID   Pearson Correlation   .395(*)    -.254      1
          Sig. (2-tailed)       .025       .161
          N                     32         32         32
** Correlation is significant at the 0.01 level (2-tailed).
* Correlation is significant at the 0.05 level (2-tailed).
Exhibit II: Model I: auct_pr = β0 + β1 age + β2 num_bid + εi

Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .945(a)   .893       .885                133.13650
a Predictors: (Constant), NUM_BID, AGE
b Dependent Variable: AUCT_PR

ANOVA(b)
Model            Sum of Squares   df   Mean Square   F         Sig.
1   Regression   4277159.703      2    2138579.852   120.651   .000(a)
    Residual     514034.515       29   17725.328
    Total        4791194.219      31
a Predictors: (Constant), NUM_BID, AGE

Coefficients(a)
                 Unstandardized Coefficients   Standardized Coefficients
Model            B           Std. Error        Beta                        t        Sig.
1   (Constant)   -1336.722   173.356                                       -7.711   .000
    AGE          12.736      .902              .888                        14.114   .000
    NUM_BID      85.815      8.706             .620                        9.857    .000
a Dependent Variable: AUCT_PR
Scatterplot (Dependent Variable: AUCT_PR): Regression Standardized Residual plotted against Regression Standardized Predicted Value [figure not reproduced]

Exhibit III: Model II: auct_pr = β0 + β1 age + β2 num_bid + β3 age_bid + εi,
where age_bid = age*num_bid

Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
2       .977(a)   .954       .949                88.36738
a Predictors: (Constant), AGE_BID, AGE, NUM_BID

ANOVA(b)
Model            Sum of Squares   df   Mean Square   F         Sig.
1   Regression   4572547.987      3    1524182.662   195.188   .000(a)
    Residual     218646.232       28   7808.794
    Total        4791194.219      31
a Predictors: (Constant), AGE_BID, AGE, NUM_BID

Coefficients(a)
                 Unstandardized Coefficients   Standardized Coefficients
Model            B          Std. Error         Beta                        t        Sig.
1   (Constant)   322.754    293.325                                        1.100    .281
    AGE          .873       2.020              .061                        .432     .669
    NUM_BID      -93.410    29.708             -.675                       -3.144   .004
    AGE_BID      1.298      .211               1.370                       6.150    .000
a Dependent Variable: AUCT_PR
Scatterplot (Dependent Variable: AUCT_PR): Regression Standardized Residual plotted against Regression Standardized Predicted Value [figure not reproduced]

a. Comparing Model I to Model II, which one, in your opinion, is preferable? Give the
pros and cons of each in stating your answer.
[6 points]
b. In Model II, which of the three variables (age, num_bid and age_bid) has the
greatest impact on auct_pr? State clearly why.
[2 points]
c. If an antique is 120 years old, what would be the precise impact of the number of
bidders on the price as described in Model II? If the antique were only 30 years
old, would this model apply? Why or why not?
[4 points]
d. If you were to develop a simple linear regression model with auct_pr as the
dependent variable and age as the independent variable, would the model turn
out to be significant, using the F test, at a significance level of α = 0.01? Why or
why not?
[3 points]
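Parts c and d rest on two short calculations: the marginal effect of num_bid in a model with an interaction term, and the F statistic of a one-predictor regression recovered from a correlation coefficient. A minimal Python sketch follows; the function names are illustrative, and the inputs are the Model II coefficients from Exhibit III and the AUCT_PR/AGE correlation from Exhibit I.

```python
# Model II coefficients taken from Exhibit III.
B_NUM_BID = -93.410
B_AGE_BID = 1.298


def bidder_effect(age):
    """Change in predicted price per additional bidder at a given age.

    With the interaction term age_bid = age * num_bid, the marginal
    effect of num_bid is b_num_bid + b_age_bid * age.
    """
    return B_NUM_BID + B_AGE_BID * age


def simple_regression_f(r, n):
    """F statistic of a one-predictor regression, derived from correlation r."""
    return r ** 2 * (n - 2) / (1 - r ** 2)


effect_120 = bidder_effect(120)          # dollars per extra bidder at age 120
f_age = simple_regression_f(0.730, 32)   # auct_pr on age alone, (1, 30) df

print(f"effect of one more bidder at age 120: {effect_120:.2f}")
print(f"F for auct_pr on age: {f_age:.2f}")
```

The ages in the sample run from 108 to 194 years, so evaluating `bidder_effect(30)` would be an extrapolation well outside the range the model was fitted on.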

Question 4 (23 points)

The pairwise correlation matrix and the regression outputs provided below pertain to data
collected on a sample of 80 countries. The variables on which the data were collected are:
GNPCapita: GNP per capita
PopGrowth: average annual change in population, 1980-1990 {(Pt+1 - Pt)/Pt}
Calorie: daily per capita calorie content of food used for domestic consumption
LifeExp: average life expectancy of a newborn given current mortality conditions
Fertility: average births per woman given current fertility rates.
The sample means and sample standard deviations for each variable are provided below. Also
provided are the pairwise correlation matrix and some regression outputs as exhibits. Note
that some values have been deleted by design.
GNPCapita:   x̄ = 4119.86,    s = 6908.5
PopGrowth:   x̄ = 0.0197,     s = 0.0119
Calorie:     x̄ = 2654.075,   s = 534.19
LifeExp:     x̄ = 63.45,      s = 10.807
Fertility:   x̄ = 4.21,       s = 1.964
Exhibit I: Correlation Matrix

            GNPCapita   PopGrowth   Calorie   LifeExp   Fertility
GNPCapita   1.000
PopGrowth   -0.562      1.000
Calorie     0.668       -0.667      1.000
LifeExp     0.574       -0.662      0.724     1.000
Fertility   -0.600      0.829       -0.752    -0.899    1.000
Exhibit II:
Model I: LifeExpi = β0 + β1 Fertilityi + εi
SUMMARY OUTPUT
Multiple R
R Square
Adjusted R Square
Standard Error       4.762
Observations         80.000
ANOVA
             df   SS         MS   F   Significance F
Regression
Residual
Total             9227.800
            Coefficients   Standard Error   t Stat    P-value
Intercept   84.274         1.266            66.592    0.000
Fertility   -4.946         0.273            -18.137   0.000
Exhibit III
Model II: LifeExpi = β0 + β1 Fertilityi + β2 PopGrowthi + β3 Caloriei + εi
SUMMARY OUTPUT
Multiple R           0.916
R Square             0.839
Adjusted R Square    0.832
Standard Error
Observations         80.000
ANOVA
             df   SS   MS         F         Significance F
Regression             2579.686   131.692   0.000
Residual
Total
            Coefficients   Standard Error   t Stat    P-value
Intercept   74.829         5.129            14.589    0.000
PopGrowth   259.221        75.398           3.438
Calorie     0.003          0.001            1.962     0.053
Fertility   -5.677         0.516            -10.995   0.000
Exhibit IV
Model III: GNPCapitai = β0 + β1 Fertilityi + β2 PopGrowthi + β3 Caloriei + β4 LifeExpi + εi
SUMMARY OUTPUT
Multiple R           0.691
R Square             0.478
Adjusted R Square    0.450
Standard Error       5125.104
Observations         80.000
ANOVA
             df       SS               MS              F        Significance F
Regression   4.000    1800470510.893   450117627.723   17.136
Residual     75.000   1970001976.594   26266693.021
Total        79.000   3770472487.488
            Coefficients   Standard Error   t Stat   P-value
Intercept   -16075.810     11578.790        -1.388   0.169
PopGrowth   -105278.350    93853.829        -1.122   0.266
Calorie     6.011          1.690            3.557    0.001
Fertility   105.470        962.337          0.110    0.913
LifeExp     92.550         132.829          0.697    0.488
a. In Model I, what amount of the total variation in Life Expectancy is explained by
the variation in Fertility? Show your calculations.
[5 points]
b. Determine the missing F value and its corresponding Significance F for Model I.
Show your calculations.
[3 points]
c. Consider the country of Costa Rica. Its average Fertility is 3.0. Using Model I, can
we conclude that Costa Rica's average life expectancy is above 50, based on a
significance level of α = 0.05? Reason out your answer carefully by showing all
calculations.
[6 points]
d. Is Model II's predictive power significantly better than that of Model I? Perform an
appropriate statistical procedure to answer this question. Show your steps.
[5 points]
e. Using the information provided in the outputs above, determine the VIF (Variance
Inflation Factor) associated with the variable LifeExp in Model III. Show your
computations.
[5 points]
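Several of the deleted values can be recovered from the ones that remain. The Python sketch below shows one possible route for parts a, b, d and e. It assumes residual degrees of freedom of n - k - 1, and it relies on the observation that the auxiliary regression needed for the VIF of LifeExp in Model III (LifeExp on Fertility, PopGrowth and Calorie) has the same specification as Model II. Variable names are illustrative.

```python
N = 80
SST_LIFEEXP = 9227.800   # total sum of squares of LifeExp (Exhibit II)
S_MODEL_I = 4.762        # standard error of Model I (Exhibit II)
R2_MODEL_II = 0.839      # R Square of Model II (Exhibit III)

# Part a: R^2 of Model I from its standard error and the total sum of squares.
sse_i = S_MODEL_I ** 2 * (N - 2)
r2_i = 1 - sse_i / SST_LIFEEXP

# Part b: F = MSR / MSE for a one-predictor model (equals the squared t statistic).
f_i = (SST_LIFEEXP - sse_i) / (sse_i / (N - 2))

# Part d: partial F for the two regressors added in Model II, on (2, 76) df.
partial_f = ((R2_MODEL_II - r2_i) / 2) / ((1 - R2_MODEL_II) / (N - 4))

# Part e: VIF = 1 / (1 - R^2 of LifeExp regressed on the other predictors).
vif_lifeexp = 1 / (1 - R2_MODEL_II)

print(f"Model I R^2 = {r2_i:.3f}, F = {f_i:.1f}")
print(f"partial F (Model II vs Model I) = {partial_f:.2f}")
print(f"VIF(LifeExp) = {vif_lifeexp:.2f}")
```

As a cross-check, the recovered R^2 of Model I should agree with the squared LifeExp/Fertility correlation from Exhibit I (0.899 squared, about 0.808).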
Highly publicized salaries of corporate chief executive officers (CEOs) in the United States
have generated sustained interest in understanding factors related to CEO compensation in
general. Data on the annual compensation of the CEOs of 167 financial companies is culled
out from a larger dataset that appeared in Forbes magazine. Each year, Forbes magazine
publishes data giving the compensation package of 800 top CEOs in the US including those
of financial companies.
The variable of interest is total compensation, defined as the sum of salary plus any
bonuses, including stock options. How does one explain the wide variation in total
compensation of CEOs in the same industry? Some of the variables considered that might
help explain the variation in salary are:
MBA?: is equal to 1 if the CEO has an MBA, 0 otherwise
MasterPhD?: is equal to 1 if the CEO has a master's or PhD degree, 0 otherwise.
YearsFirm(Yrs): Total number of years the CEO worked for the company for which he/she is
currently CEO
YearsCEO(Yrs): Total number of years the CEO has served the company as a CEO
StockOwned(%): % of company stock owned by the CEO
Sales(millions of $): Annual sales of the company in millions of dollars
Profits(millions of $): Annual profits of the company in millions of dollars
A regression model is built using SPSS with the Forward method giving it a choice of
choosing any of the above listed variables. The output obtained at the first step is as follows:
Table 1: Model Summary(b)
Model   R   R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1                      .115                4406976.255                  2.031
a. Predictors: (Constant), Profits(in 000000 of $)
b. Dependent Variable: TotalComp($)
Table 2: Coefficients(a)
                              Unstandardized Coefficients   Standardized Coefficients
Model                         B             Std. Error      Beta                        t       Sig.   Correlation   Partial Correlation
1   (Constant)                1182515.538   404938.707                                  2.920   .004
    Profits(in 000000 of $)   4090.826      859.264         .347                        4.761   .000   .347          .347
a. Dependent Variable: TotalComp($)
Table 3: Excluded Variables(b)

                                                                 Collinearity Statistics
Model                  Beta In    t   Sig.   Partial Correlation   Tolerance
MBA?                   -.016(a)               -.018                 .999
MasterPhD?             -.065                  -.069                 .992
YearsFirm(Yrs)         -.087(a)               -.092                 .989
YearsCEO(Yrs)          .024                   .026                  .982
StockOwned(%)          .021(a)                .022                  .981
Sales(millions of $)   -.157                  -.070                 .174
a. Predictors in the Model: (Constant), Profits(in 000000 of $)
b. Dependent Variable: TotalComp($)
Using the information given above, answer the following questions:

a) The correlation and partial correlation given for Profits in Table 2 are the same.
Explain why. (1 point)
b) What proportion of the total variation is explained by the model above? (2 points)
c) Which variable should be chosen as the next candidate to enter the regression
equation? Why? (1 point)
d) What do the collinearity statistics for Sales(millions of $) given in Table 3 of the
output mean? State clearly the steps needed to calculate them. (2 points)
e) Are successive observations in the data used for the regression independent or not?
Explain. (1 point)
f) What is the impact on the compensation of CEOs who do not have an MBA? (1 point)
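The arithmetic behind parts b), c) and d) can be sketched in a few lines of Python. This is a working aid under stated assumptions, not the SPSS procedure itself: it uses the fact that R equals the absolute standardized Beta in a one-predictor model, ranks excluded variables by the absolute value of their partial correlation (whether the top candidate actually enters also depends on its significance, which is not shown in Table 3), and treats VIF as the reciprocal of tolerance. Variable names are illustrative.

```python
# Part b: with a single predictor, R equals |Beta|, so R^2 = Beta^2.
beta_profits = 0.347
r_squared = beta_profits ** 2

# Part c: partial correlations of the excluded variables (Table 3).
partial_corr = {
    "MBA?": -0.018,
    "MasterPhD?": -0.069,
    "YearsFirm(Yrs)": -0.092,
    "YearsCEO(Yrs)": 0.026,
    "StockOwned(%)": 0.022,
    "Sales(millions of $)": -0.070,
}
# Forward selection considers the variable with the largest |partial correlation| first.
next_candidate = max(partial_corr, key=lambda name: abs(partial_corr[name]))

# Part d: tolerance = 1 - R^2 of Sales regressed on the predictors already
# in the model (here only Profits); VIF is its reciprocal.
tolerance_sales = 0.174
r2_sales_on_profits = 1 - tolerance_sales
vif_sales = 1 / tolerance_sales

print(f"R^2 = {r_squared:.3f}")
print(f"next candidate: {next_candidate}")
print(f"Sales: auxiliary R^2 = {r2_sales_on_profits:.3f}, VIF = {vif_sales:.2f}")
```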
