
UNIVERSITY OF OSLO

DEPARTMENT OF ECONOMICS
Exam: ECON 301B - APPLIED STATISTICS AND ECONOMETRICS
Date of exam: Monday, 9 December 2002
Time for exam: 9 a.m. - 3 p.m.

The problem set covers 7 pages including computer output.

Resources allowed: All printed books and private notes as well as calculators.
All questions should be answered.
The grade scale is A, B, C, D, Fail (with A as best grade).
Comments are given in arial font after each question.
Some of the views expressed in the comments could have been different. It is
the coherence and quality in the argument that matters.
Scientific journals constitute the medium of communication between scientists, and
also the memory (storage) of science. The economics of (scientific) journals is
interesting. Bergstrom [1] argues that journals owned by private publishers are grossly
overpriced, and he recommends several actions to reduce the large profits made by
these publishers. Bergstrom provides data to substantiate his case. There are 180
economic journals in his database, of which 16 are published by scholarly societies
such as the American Economic Association. These 16 journals are published on a
non-profit basis, as opposed to the remaining journals that have private publishers. We
shall concentrate on the following variables:
P : Library subscription price for the journal per year.
Y : Number of libraries subscribing to the journal.
C : Total number of times papers in the journal were cited in 1998.
A : Age of the journal.
N : Number of pages in the journal in 1998.
S : Binary variable (dummy); 1 if non-profit (scholarly society), 0 otherwise.

[1] Bergstrom, T.C. 2000. Free Labor for Costly Journals? Journal of Economic Perspectives. 15: 183-198.

It is rare that an article in an economics journal is as explicit as Bergstrom in its
policy recommendations aimed at reducing the profits of economic agents, but
Bergstrom clearly has a dual role: as disinterested analyst, and as an academic
economist with an economic interest. In his section "What can we do", Bergstrom
suggests: (i) To expand the much cheaper and also generally better non-profit journals
owned by professional societies. (ii) To support new electronic journals. And (iii) to
punish overpriced journals by cancelling library subscriptions, defecting from editorial
boards, not sending good papers to these journals, and refusing to referee papers for
them.
(a)

In Figure 1 in the appendix, price P is plotted against number of pages N.
The circles represent non-profit journals. Comment on the graph.

Price is clearly rising with number of pages for both private and non-profit
journals, but apparently more so for private journals. The graph shows that
variability in price increases with number of pages. Whether price is
linearly related to number of pages is hard to tell. Too much noise in the
graph!
(b)

Figure 1 does not show a relationship between P and N that agrees well with
the classical assumptions behind OLS. Why? Explain from the figure why
$LP = \ln(P)$ might be close to linear in $LN = \ln(N)$, and that the classical
assumptions might be better satisfied on this log-log scale. Use L as a prefix to
denote logged variables throughout.

The OLS assumption of homoscedasticity is clearly not met for price versus
number of pages, according to Figure 1. The log-transformation will stretch
the lower end and contract the upper end of variation. It is concave! To take
log of price will counteract the obvious higher variation at higher number of
pages. To also take log of number of pages will hopefully preserve, and
perhaps improve on the possible linear pattern in the graph.
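The variance-stabilising effect of the log can be illustrated by a small simulation. This is only a sketch on made-up data: the coefficients `a`, `b` and `sigma` and the uniform page counts are assumptions chosen for illustration, not estimates from Bergstrom's data.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical data-generating process (not estimated from the exam data):
# ln(P) = a + b * ln(N) + u, with u homoscedastic on the log scale.
a, b, sigma = 0.4, 0.8, 0.5
pages = [random.uniform(170, 2600) for _ in range(2000)]
log_price = [a + b * math.log(n) + random.gauss(0, sigma) for n in pages]
price = [math.exp(lp) for lp in log_price]

# Deviations from the underlying curve on the raw scale and on the log scale.
raw_dev = [p - math.exp(a + b * math.log(n)) for p, n in zip(price, pages)]
log_dev = [lp - (a + b * math.log(n)) for lp, n in zip(log_price, pages)]

# Split at the median number of pages and compare the spread in each half.
cut = statistics.median(pages)
low = [i for i, n in enumerate(pages) if n <= cut]
high = [i for i, n in enumerate(pages) if n > cut]

def spread(values, index):
    """Standard deviation of the selected observations."""
    return statistics.stdev([values[i] for i in index])

raw_low, raw_high = spread(raw_dev, low), spread(raw_dev, high)
log_low, log_high = spread(log_dev, low), spread(log_dev, high)
print(f"raw scale: {raw_low:.1f} vs {raw_high:.1f}")
print(f"log scale: {log_low:.2f} vs {log_high:.2f}")
```

On the raw scale the spread grows with the number of pages, while on the log scale it is about the same in both halves, which is the pattern hoped for in Figures 2 and 3.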
(c)

A matrix of pair-wise scatter plots for logged variables is given in Figure 2 for
non-profit journals, and in Figure 3 for privately published journals. Regarding
LP as the response variable, how does this variable seem to respond to the
other variables? You might comment further on the plots, but be brief.

The upper right corners in Figures 2 and 3 show the scatter of LP versus LN.
The scatter in Figure 2 has a small number of points, and one must therefore
guard against over-interpretation. The scatter in Figure 3 shows a fairly nice
linear and homoscedastic scatter for the 164 private journals, perhaps with
slightly more variability in the lower end. This pattern is not contradicted by
the scatter in Figure 2, other than that the variability there seems larger at the
upper end. Both these possible violations of homoscedasticity might be
small-sample illusions.
Upper rows in both graphs, from left to right: LP seems uncorrelated with LA
and nearly so with LC. The latter is surprising, since many citations add
value to the journal. But this effect might be masked by other variables, as we
will see from the regressions. LC is positively correlated with both LA and
LN. There is thus potential for masking the effect of LC. Figures 2 and 3
show no reasons not to analyse these data by linear methods (regression) on
the log scale for the variables price, age, citations and number of pages.
(d)

Consider the regression

$LP = \beta_1 + \beta_2 S + \beta_3 LN + u$,

where u is a stochastic error term, and S is the dummy variable defined on
page 1. The OLS results for this regression are given in Table 1 in the appendix.
Explain what is meant by R-squared and Adj R-squared. What is the
interpretation of $\beta_2$ and $\beta_3$ respectively?

R-squared is the fraction of the squared variation in the response variable around
its mean that is explained by the linear regression (that is, recovered as variation
in the predicted response due to variation in the explanatory variables) within
the sample: $R^2 = \mathrm{regressionSS} / \mathrm{totalSS} = 1 - \mathrm{residualSS} / \mathrm{totalSS}$.
Adj R-squared is adjusted for the number of covariates in the regression. When
covariates that have no theoretical correlation with the response are added,
Adj R-squared will tend to decrease (but not necessarily), while R-squared will
increase (or remain unchanged, in the unlikely event that the empirical
correlation between the current residual and the new covariate is precisely zero).
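As a check, both measures can be recomputed from the sums of squares in Table 1. The variable names below are my own; the numbers are copied from the Stata output.

```python
# Sums of squares and sample size copied from Table 1 (model (d)).
model_ss = 36.8357611
residual_ss = 119.232662
total_ss = 156.068423
n_obs = 180
n_covariates = 2  # S and LN; the constant is not counted

# R-squared: share of the total variation recovered by the regression.
r_squared = model_ss / total_ss

# Adjusted R-squared replaces sums of squares by mean squares (SS / df),
# so an extra covariate must pay for the degree of freedom it uses.
residual_ms = residual_ss / (n_obs - n_covariates - 1)
total_ms = total_ss / (n_obs - 1)
adj_r_squared = 1 - residual_ms / total_ms

print(round(r_squared, 4), round(adj_r_squared, 4))
```

The two values agree with the R-squared and Adj R-squared lines reported in Table 1.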

$\beta_2$ measures the effect on the expected value of LP of a journal being
non-profit versus private, given the same number of pages. The empirical result is
that a non-profit journal is priced at about exp(-1.18) = .31 of a private journal
with the same number of pages. $\beta_3$ measures the price elasticity with respect
to number of pages. The elasticity in this model is assumed the same for both
private and non-profit journals.
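The back-transformation from the log scale to a price ratio is shown below, using the estimate and interval for S in Table 1. Transforming the interval end points in the same way is my own addition, and it is a sketch that takes the model in (d) at face value.

```python
import math

# Coefficient on S and its 95% interval, copied from Table 1.
beta2 = -1.183172
ci_low, ci_high = -1.618767, -0.7475775

# A difference of beta2 in ln(price) is a ratio of exp(beta2) in price.
price_ratio = math.exp(beta2)
ratio_low, ratio_high = math.exp(ci_low), math.exp(ci_high)

print(f"{price_ratio:.2f} ({ratio_low:.2f}, {ratio_high:.2f})")
```

Since exp is monotone, the transformed end points give an interval for the price ratio with the same confidence level.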
(e)

A more general model to consider is

$LP = \beta_1 + \beta_2 S + \beta_3 LN + \beta_4 LA + \beta_5 LC + \beta_6 (S \cdot LN) + \beta_7 (S \cdot LA) + \beta_8 (S \cdot LC) + \beta_9 LN^2 + u$.

Would you interpret $\beta_2$ differently for this model than for the model in (d)? The
OLS results for this model are given in Table 2, where $SLN = S \cdot LN$ etc., and
$LN2 = LN^2$. Calculate a 95% confidence interval for $\beta_3$. What is your point
estimate of the elasticity $e = \partial \ln(P) / \partial \ln(N)$ for private journals of median
number of pages, $LN = 6.54$? What is the estimated elasticity for a non-profit
journal of the same size?

$\beta_2$ does not have quite the same basic interpretation. The expected
difference in log price for two otherwise identical journals, one being
non-profit and the other private, is now $\beta_2 + \beta_6 LN + \beta_7 LA + \beta_8 LC$,
and not only $\beta_2$. It is surprising that the estimate of $\beta_2$ is now positive,
as if non-profit journals were more expensive than private ones. But note the
large standard error. The effect is not statistically significant!
The 95% confidence interval is

$\hat\beta_3 \pm t \cdot SE = -4.14 \pm 1.97 \times 2.47 = -4.14 \pm 4.87 = (-9.01, 0.73)$,

where t is the 0.975 quantile in the t-distribution with 171 degrees of freedom.
The estimated elasticity $e = \partial \ln(P) / \partial \ln(N)$ is
$-4.14 + 2 \times .39 \times 6.54 = .96$ for private journals, and
$.96 - .49 = .47$ for non-profit journals of median size.
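The same arithmetic with the unrounded coefficients from Table 2 is sketched below. The function name `elasticity` is my own, and the t quantile 1.97 is the rounded value used in the answer rather than one computed from the t-distribution.

```python
# Estimates copied from Table 2 (model (e)).
b_ln, se_ln = -4.143584, 2.472981   # coefficient and std. error for LN
b_sln = -0.4852809                  # coefficient for S*LN
b_ln2 = 0.3910415                   # coefficient for LN^2
t_quantile = 1.97                   # 0.975 quantile, 171 df, as used in the answer
median_ln = 6.54

# 95% confidence interval for beta3.
half_width = t_quantile * se_ln
ci = (b_ln - half_width, b_ln + half_width)

def elasticity(ln_pages, society):
    """d LP / d LN = beta3 + 2 * beta9 * LN + beta6 * S."""
    return b_ln + 2 * b_ln2 * ln_pages + b_sln * society

e_private = elasticity(median_ln, society=0)
e_nonprofit = elasticity(median_ln, society=1)

print(f"CI: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"elasticity: private {e_private:.2f}, non-profit {e_nonprofit:.2f}")
```

With unrounded coefficients the elasticities come out about .01 to .02 higher than the hand calculation, which rounds the coefficients to two decimals first. The difference is immaterial given the standard errors.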
(f)

A rationale for introducing the interaction term $S \cdot LC$ is that private journals
maximize profit, and the more cited a journal is the more valuable it is.
Comment on the estimated signs of $\beta_5$ and $\beta_8$. Discuss also the estimated signs
for the other coefficients.

Both LC and $S \cdot LC$ are estimated to have a negative effect on price. Both
effects are non-significant, though. That LC seems to have a negative effect
is surprising. I would expect the private publishers to try to maximize profit,
and thus to price much cited journals higher. That $S \cdot LC$ seems to have a
negative effect makes sense. Relative to the private, the non-profit journals
would tend to be relatively cheaper the more valuable (more cited) the journal
is, everything else equal.
The sign for S is already discussed. The signs for LN, SLN and LN2 must
be discussed jointly, say by looking at the elasticity that has been
considered. What remains are the signs for LA and SLA. The first is
significantly negative. This is a bit surprising since older journals that have
survived might be more valuable than newer journals (everything else equal),
and private publishers should then be expected to price them higher. That
SLA has a positive sign, which is non-significant, might also be surprising. It is
hard to see why non-profit journals get relatively more expensive than private
ones the older they are.
(g)

A third model is obtained by reducing model (e) to

$LP = \beta_1 + \beta_3 LN + \beta_4 LA + \beta_5 LC + \beta_8 (S \cdot LC) + \beta_9 LN^2 + u$.

The results by OLS are given in Table 3. Which of the three models considered
so far would you prefer? Discuss and test!
As measured by Adj R-squared, model (g) is the superior of the three. It fits
the data much better than model (d), and only slightly worse than model (e)
as measured by R-squared (its Root MSE is in fact marginally smaller).
Model (g) is simpler and slightly easier to interpret than model (e). In model
(g), S affects price only through the differential effect of citations. It is
nice to isolate the effect of the most interesting covariate, and it is nice to
get strong significance (and expected sign) for this covariate. This is the case
in model (g). This discussion leads to model (g) as the preferred one.
The three models are only partially nested: (g) is obtained from (e) by setting
three coefficients to zero, and (d) is obtained from (e) by setting 6 coefficients
to zero. But model (d) cannot be obtained from (g) by setting coefficients to
zero. Since model (d) obviously is inferior, we only test model (g) versus
model (e). The test problem is $H_0: \beta_2 = \beta_6 = \beta_7 = 0$ versus at least one of the
coefficients being non-zero. The F-statistic is
$F = [(57.591 - 56.758) / (8 - 5)] / .576 = .482$. This number is compared to the
F-distribution with 3 and 171 degrees of freedom, and the conclusion is that
there is hardly evidence for claiming one or more of the three coefficients
being non-zero ($H_0$ is not rejected at any meaningful level).
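The F-statistic can equally be written in terms of residual sums of squares, since both models share the same total sum of squares. A small sketch with the numbers from Tables 2 and 3 (variable names are mine):

```python
# Residual sums of squares copied from the Stata output.
rss_restricted = 99.3105799   # model (g), Table 3
rss_full = 98.4771338         # model (e), Table 2
df_full = 171                 # residual degrees of freedom in model (e)
n_restrictions = 3            # beta2 = beta6 = beta7 = 0

# F = [(RSS_restricted - RSS_full) / q] / [RSS_full / df_full]
numerator = (rss_restricted - rss_full) / n_restrictions
denominator = rss_full / df_full
f_stat = numerator / denominator

print(f"F = {f_stat:.3f}")
```

The 5% critical value for 3 and 171 degrees of freedom is, from standard tables, somewhere near 2.7, so a statistic below 1 is far from the rejection region.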
(h)

Table 4 gives the variance inflation factors for model (g). What do these
numbers tell you? Suggest a change of variables that will reduce the unwanted
effects of large inflation factors, but without changing the essence of model (g).

The first two VIFs are terribly high. They tell us that LN2 and LN are strongly
linearly related, as they certainly must be. That the p-value for LN (testing its
coefficient being non-zero) is as low as 12% is really impressive in the
presence of this strong collinearity. The collinearity could be reduced by
replacing LN2 by $LN2^* = LN2 - \widehat{LN2 \mid LN}$, which is the residual obtained by
regressing LN2 on LN by OLS. This would alter the coefficient for LN and
also its SE, but the coefficient for LN2 would not be changed. The fit
(R-squared, sums of squares) would remain unchanged.
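The effect of the residualisation on the collinearity can be seen in a toy example. The grid of LN values below is hypothetical (chosen to span roughly the range in Figure 3), and the VIF formula used holds for the case of only two regressors, so this is an illustration rather than a recomputation of Table 4.

```python
import statistics

# Hypothetical grid of log page counts, not the actual journal data.
ln_pages = [5.0 + 0.05 * i for i in range(61)]   # 5.00, 5.05, ..., 8.00
ln_pages_sq = [x * x for x in ln_pages]

def simple_ols(y, x):
    """Return intercept and slope from regressing y on x with a constant."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def correlation(x, y):
    """Pearson correlation coefficient."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

# LN2* is the residual from regressing LN2 on LN.
intercept, slope = simple_ols(ln_pages_sq, ln_pages)
ln_pages_sq_star = [q - (intercept + slope * x)
                    for q, x in zip(ln_pages_sq, ln_pages)]

r_before = correlation(ln_pages, ln_pages_sq)
r_after = correlation(ln_pages, ln_pages_sq_star)

# With only these two regressors, VIF = 1 / (1 - r^2).
vif_before = 1 / (1 - r_before ** 2)
vif_after = 1 / (1 - r_after ** 2)
print(f"VIF before: {vif_before:.0f}, after: {vif_after:.2f}")
```

By construction the residual is uncorrelated with LN, so the VIF drops to 1 in this two-regressor setting. In model (g) the other covariates would keep it slightly above 1.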
(i)

Returning to Bergstrom's paper. Do you agree that private journals are
over-priced? Based on your preferred model, describe the pricing policy and profit
generation in private journals.

I base my discussion on model (g). The strongly significant effect of SLC
indicates that private publishers price journals relatively higher than society
publishers the more valuable the journal is, as measured by citations. With
non-profit journals as a standard, private journals are overpriced, and more
so the more they are cited.
(j)

Are economists in academia loyal to their non-profit journals in the sense that
University libraries are more prone to subscribe to a journal published by a
scholarly society when everything else is equal? To address this question, the
following model is considered.

$LY = \beta_1 + \beta_2 S + \beta_3 LP + \beta_4 LN + \beta_5 LA + \beta_6 LC + u$

The OLS results for this model are given in Table 5. Discuss the issue raised.
Note that the supplier side in the journal market is a mixed bag. Non-profit
journals are generally priced according to real production cost, with the hard
work of editing and refereeing done on a no-pay basis. These journals are thus
priced with little regard to what could have been their market price.
The signs for LP, LN, LA, and LC do make sense. However, libraries do
seem to subscribe less to non-profit journals than to privately published ones,
everything else equal. This effect is non-significant. It could be due to heavier
marketing by private publishers. For private publishers, we do also have a
problem with simultaneity. Private publishers are likely to fix their price in
response to what they perceive as the demand function. A privately published
journal of the same quality as a society journal is thus likely to be priced
higher, see point (i). The number of subscribers would then be reduced, and it
is thus likely that subscription is higher for non-profit journals than private
journals of the same quality.
(k)

Inspecting the empirical residuals from model (j), a pattern is noted. The pattern
seems to be

$E \ln(u^2) = -2.3 + 0.82 LA - 0.42 LC$.

This formula is obtained by regression. Several regression models were
attempted to find a reasonable model. Explain why this finding indicates
heteroscedasticity. How can the formula be used to construct weights for a
weighted regression? The results from such a weighted regression are given in
Table 6. Discuss the pros and cons of using this particular weighted regression
rather than the OLS. Which of the 95% confidence intervals for $\beta_2$ given in
Table 5 and Table 6 respectively will you prefer?
Under homoscedasticity, we should have no relation between any of the
covariates and $\sigma^2 = E u^2$, or $E \ln(u^2)$. The given relation indicates that
$\sigma^2$ is increasing with LA and decreasing with LC. Disregarding the effect of
non-linearity, one might replace $u^2$ by $\sigma^2$ in the relation, and obtain
$\sigma^2 = \exp(-2.3 + 0.82 LA - 0.42 LC)$, and use the inverse of this as the weight
in weighted regression.
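A minimal sketch of how the weights could be computed follows. The slope signs follow the discussion above (variance increasing with LA, decreasing with LC); the negative intercept is my reading of the formula and should be taken as an assumption. It only rescales all weights by a common factor and therefore does not affect the weighted estimates. The two example journals are hypothetical.

```python
import math

def fitted_variance(la, lc, intercept=-2.3, b_la=0.82, b_lc=-0.42):
    """Variance implied by the fitted relation for ln(u^2).

    The intercept sign is an assumption; the slope signs follow the text.
    """
    return math.exp(intercept + b_la * la + b_lc * lc)

def weight(la, lc):
    """Weight for weighted least squares: inverse of the fitted variance."""
    return 1.0 / fitted_variance(la, lc)

# Two hypothetical journals: an old, rarely cited one and a young, much cited one.
w_old_rarely_cited = weight(la=4.5, lc=4.0)
w_young_much_cited = weight(la=2.5, lc=8.0)

# The ratio of two weights does not depend on the intercept.
ratio = w_young_much_cited / w_old_rarely_cited
ratio_other_intercept = (fitted_variance(4.5, 4.0, intercept=0.0)
                         / fitted_variance(2.5, 8.0, intercept=0.0))
print(f"weight ratio: {ratio:.1f}")
```

Observations with small fitted variance (young, much cited journals) thus get the most weight in the regression of Table 6.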
The pros: The weighted regression provides a better fit, as measured by
R-squared and F. It also reduces standard errors. The cons: the particular
weights have been found by running several regressions. Such fishing trips
might bring home a catch, but not necessarily bring out a pattern in the
variance that prevails in repeated sampling, whatever that could be in the
particular case. The chosen weights might therefore represent some degree
of over-fitting to the data. On balance, however, we have a relatively large
sample and a distinct improvement in the fit. I will therefore vote for the
weighted regression, and my confidence interval of choice for $\beta_2$ is (-.46, .21).
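How much is gained in precision for the coefficient of S can be read off the two tables. The comparison below is a simple sketch using the interval end points and standard errors as printed in Tables 5 and 6.

```python
# 95% intervals and standard errors for the coefficient of S.
ols_ci = (-0.6259291, 0.1878823)   # Table 5
wls_ci = (-0.4551505, 0.2147186)   # Table 6
ols_se, wls_se = 0.2061648, 0.1696996

ols_width = ols_ci[1] - ols_ci[0]
wls_width = wls_ci[1] - wls_ci[0]
width_ratio = wls_width / ols_width
se_ratio = wls_se / ols_se

print(f"widths: OLS {ols_width:.3f}, weighted {wls_width:.3f}")
print(f"ratio: {width_ratio:.3f}")
```

The weighted interval is a little less than a fifth shorter than the OLS interval, and both intervals contain zero.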
(l)

Our data consists of 180 journals in economics. This is pretty much the
collection of academic journals in this field that use the English language. This
collection is thus not a random sample from some existing population. Explain
the statistical meaning of a confidence interval, say that in point (e), and discuss
the difficulties involved in this interpretation since we do not sample in a
simplistic sense.

Confidence intervals make precise sense when hypothetical repetitions of the
experiment are meaningful: In repetitions, the method produces intervals that
cover the true value in a fraction of cases that agrees with the degree of
confidence. This interpretation might be extended slightly. The particular
study might not be reproducible by drawing an independent sample. But, in
studies where the assumed model indeed represents the uncertainties
involved, the fraction of studies leading to their produced confidence intervals
covering the true values (which might vary from study to study) will match the
degree of confidence (which is kept fixed) in the long run.
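The long-run interpretation can be illustrated by simulation. The sketch below is deliberately simple and makes assumptions not in the exam: each "study" estimates a mean from 30 normal observations with known standard deviation, and the true mean is allowed to differ between studies.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

def covers(true_mean, sigma, n, z=1.96):
    """Simulate one study; report whether its 95% interval covers the truth."""
    sample = [random.gauss(true_mean, sigma) for _ in range(n)]
    centre = statistics.fmean(sample)
    half_width = z * sigma / n ** 0.5   # sigma treated as known to keep this short
    return centre - half_width <= true_mean <= centre + half_width

n_studies = 5000
hits = 0
for _ in range(n_studies):
    true_mean = random.uniform(-5, 5)   # the true value may differ between studies
    hits += covers(true_mean, sigma=2.0, n=30)

coverage = hits / n_studies
print(f"fraction of intervals covering the true value: {coverage:.3f}")
```

The fraction comes out close to 0.95 because the assumed model is exactly the one generating the data. The doubts raised for the journal study concern precisely whether that premise holds.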
Repeated sampling makes no sense in the present study. The question is
whether the assumed model, say (e), correctly represents the uncertainties
involved. I have my doubts. There are, for example, reasons why some
journals are privately published and others are society journals. The society
journals tend to be more general in scope than the private ones. It is easier to
carve out a market segment in a limited area, say labour economics or
fisheries economics, and there establish a private journal with a dominating
position, than in the more general areas. Perhaps such conditions are more
important for the pricing than the covariates in (e).
The big question is rather: to what extent are our results externally valid? That
is, to what extent do our results have validity for extrapolation in time,
language area and field of science? It is hard to give a good answer here. The
best approach might be to study the journals in a few other fields of science
and see if the same pattern emerges.

APPENDIX
(Output based on Stata)
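[Figure 1: scatter plot of price against number of pages. Legend: price_NonProfit,
price_Private. Price axis from 20 to 2120; pages axis from 167 to 2632.]

Figure 1. Price by number of pages for non-profit and private journals.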

[Figure 2: matrix of pair-wise scatter plots for LP, LA, LC and LN.]

Figure 2. Scatter plots for logged variables. The plot for LC (on the y-axis)
versus LN is, for example, found in row 3 and column 4. Non-profit journals.

[Figure 3: matrix of pair-wise scatter plots for LP, LA, LC and LN.]

Figure 3. Scatter plots for logged variables. Private journals.

       Source |       SS       df       MS              Number of obs =     180
--------------+------------------------------           F(  2,   177) =   27.34
        Model |  36.8357611     2  18.4178806           Prob > F      =  0.0000
     Residual |  119.232662   177  .673630857           R-squared     =  0.2360
--------------+------------------------------           Adj R-squared =  0.2274
        Total |  156.068423   179  .871890631           Root MSE      =  .82075

-------------------------------------------------------------------------------
    price  LP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
  society   S |  -1.183172   .2207267    -5.36   0.000    -1.618767   -.7475775
    pages  LN |    .812689   .1315476     6.18   0.000     .5530854    1.072293
        _cons |   .3748265   .8661977     0.43   0.666    -1.334578    2.084231
-------------------------------------------------------------------------------

Table 1. Regression results for model (d). OLS. Stata output with variable text added
(price as a reminder that LP is ln(price) etc.).


       Source |       SS       df       MS              Number of obs =     180
--------------+------------------------------           F(  8,   171) =   12.50
        Model |  57.5912891     8  7.19891114           Prob > F      =  0.0000
     Residual |  98.4771338   171  .575889671           R-squared     =  0.3690
--------------+------------------------------           Adj R-squared =  0.3395
        Total |  156.068423   179  .871890631           Root MSE      =  .75887

-------------------------------------------------------------------------------
    price  LP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
  society   S |   2.866088   2.946723     0.97   0.332    -2.950549    8.682725
    pages  LN |  -4.143584   2.472981    -1.68   0.096
      age  LA |  -.4891446   .1040427    -4.70   0.000     -.694518   -.2837712
citations  LC |  -.0161615    .064683    -0.25   0.803    -.1438415    .1115185
          SLN |  -.4852809   .4685923    -1.04   0.302    -1.410251    .4396893
          SLC |  -.1073762   .2245599    -0.48   0.633    -.5506426    .3358902
          SLA |   .0218138   .3993713     0.05   0.957    -.7665187    .8101464
          LN2 |   .3910415   .1870774     2.09   0.038     .0217631    .7603198
        _cons |   17.69035   8.153315     2.17   0.031     1.596248    33.78446
-------------------------------------------------------------------------------

Table 2. Regression results for model (e). OLS.

       Source |       SS       df       MS              Number of obs =     180
--------------+------------------------------           F(  5,   174) =   19.89
        Model |  56.7578429     5  11.3515686           Prob > F      =  0.0000
     Residual |  99.3105799   174  .570750459           R-squared     =  0.3637
--------------+------------------------------           Adj R-squared =  0.3454
        Total |  156.068423   179  .871890631           Root MSE      =  .75548

-------------------------------------------------------------------------------
    price  LP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
    pages  LN |  -3.750736   2.416129    -1.55   0.122     -8.51943    1.017957
      age  LA |  -.4836628   .0991712    -4.88   0.000    -.6793961   -.2879295
citations  LC |  -.0071069   .0617511    -0.12   0.909    -.1289846    .1147708
          SLC |  -.1690801   .0313662    -5.39   0.000    -.2309874   -.1071729
          LN2 |   .3557514   .1820416     1.95   0.052    -.0035425    .7150454
        _cons |   16.57367   7.994926     2.07   0.040     .7941495    32.35318
-------------------------------------------------------------------------------

Table 3. Regression results for model (g). OLS.

    Variable |       VIF       1/VIF
-------------+----------------------
         LN2 |    423.92    0.002359
          LN |    419.79    0.002382
          LC |      2.11    0.474096
          LA |      1.24    0.803837
         SLC |      1.21    0.823833
-------------+----------------------
    Mean VIF |    169.65

Table 4. Variance inflation factors for model (g).

           Source |       SS       df       MS              Number of obs =     180
------------------+------------------------------           F(  5,   174) =   56.40
            Model |  139.755649     5  27.9511297           Prob > F      =  0.0000
         Residual |   86.234402   174  .495600012           R-squared     =  0.6184
------------------+------------------------------           Adj R-squared =  0.6074
            Total |  225.990051   179  1.26251425           Root MSE      =  .70399

-----------------------------------------------------------------------------------
subscriptions  LY |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
      society   S |  -.2190234   .2061648    -1.06   0.290    -.6259291    .1878823
        price  LP |  -.4394106   .0695569    -6.32   0.000    -.5766944   -.3021268
        pages  LN |   .3482928   .1591566     2.19   0.030     .0341669    .6624188
          age  LA |   .4272673   .0983372     4.34   0.000     .2331801    .6213546
    citations  LC |   .4110117   .0571062     7.20   0.000     .2983018    .5237217
            _cons |   1.209037   .8558679     1.41   0.160    -.4801824    2.898256
-----------------------------------------------------------------------------------

Table 5. OLS results for model (j).


. regress LY S LP LN LA LC [weight=w]
(analytic weights assumed)
(sum of wgt is 3.1952e+03)

           Source |       SS       df       MS              Number of obs =     180
------------------+------------------------------           F(  5,   174) =   84.20
            Model |  157.285638     5  31.4571277           Prob > F      =  0.0000
         Residual |  65.0093556   174  .373616986           R-squared     =  0.7076
------------------+------------------------------           Adj R-squared =  0.6992
            Total |  222.294994   179  1.24187147           Root MSE      =  .61124

-----------------------------------------------------------------------------------
subscriptions  LY |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
      society   S |  -.1202159   .1696996    -0.71   0.480    -.4551505    .2147186
        price  LP |  -.4362878   .0587879    -7.42   0.000     -.552317   -.3202587
        pages  LN |   .3057825   .1365795     2.24   0.026     .0362167    .5753484
          age  LA |   .5066978    .095116     5.33   0.000     .3189682    .6944274
    citations  LC |   .4090865   .0528861     7.74   0.000     .3047057    .5134673
            _cons |   1.212265   .7035282     1.72   0.087    -.1762821    2.600813
-----------------------------------------------------------------------------------

Table 6. Weighted regression results for model (j).
