
Econometrics Project

Prepared for Module CB9016


“Applied Econometrics”
by
Carlos Ferreira

Submitted on the 16th March, 2009


1a) Looking at the dataset first, we realise there's a variable that accounts for total production
(out) and a host of variables giving quantities of inputs used. We also note that the total capital
expenditure is not included as one variable, but as several (fert, fodd, mach and cap). Finally, the
variables age, soilc and soils are not continuous, suggesting their usage as dummies.
To examine the likely direction and magnitude of the impact of each regressor on the
regressand, we plotted each of the variable pairs. The plots revealed two cases that are consistently
outliers. Although roughly in line with the expected regression curves, the two largest farms
present very large outputs and very large usage of inputs when compared to the rest of the sample,
lying over four standard deviations beyond the mean. As a result, and at the risk of
over-reacting to a potentially small problem, we chose to eliminate these two cases from the analysis.
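The screening rule described above (flagging cases that lie far from the mean in standard-deviation units) can be sketched as follows. This is a minimal illustration, not the procedure actually run for the project: the function name `flag_outliers` and the sample values are hypothetical.

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=4.0):
    """Return the indices of observations lying more than `threshold`
    sample standard deviations away from the sample mean."""
    m = mean(values)
    s = stdev(values)
    return [i for i, v in enumerate(values) if abs(v - m) > threshold * s]

# Hypothetical outputs: 30 ordinary farms and one very large farm.
out = [100.0 + i for i in range(30)] + [5000.0]
print(flag_outliers(out))
```

Because an extreme case inflates the standard deviation it is measured against, a rule like this is conservative in small samples.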
For the variable land, we expect a strong, positive and linear link with the output. The same
applies to the variable labour, but in this case we expect the coefficient to be higher than the one for
land. We also expect a large, positive relation between fertilizer and output. In the case of fodder,
the plot shows that some farmers use it, while others don't. This might result in
pronounced heteroscedasticity if fodder were included as a stand-alone variable in the model. The
best use of this variable is as part of a composite total capital variable. In the case of
machinery, we expect a strong positive relation with output as well.
We created a total capital variable, tc:

$$\mathit{tc}_i = \mathit{fert}_i + \mathit{fodd}_i + \mathit{mach}_i + \mathit{cap}_i$$

The variable tc accounts for the total capital expenditure on the farm. Plotting tc against
output suggests a strong, positive relation.
As for the age of the farmer, the resulting plot suggests older farmers
may obtain a larger output. The plots for clay soil and sandy soil don't show much difference
between the two conditions of each variable. The charts, however, can't provide information
concerning the interaction between them.
As a consequence of the above, we suggest an economic model in which there is a
linear relation between the revenue obtained from the output and the quantity of the various inputs.
We would also expect a diminishing marginal product of the various inputs: increasing any of them
will increase revenue, but at a diminishing rate. The total revenue per farm (out) is the dependent
variable (regressand), and the various inputs (land, labour, tc, age, soilc, soils) are the independent
variables (regressors). The expected relations are all positive: as any of the independent variables
increases, so will the revenue. We believe the resulting relation will be linear in the parameters,
which means that the coefficient corresponding to each variable will be constant (a βi coefficient).

1b.i) The coefficients β1 and β5 represent the partial elasticity of output with respect to
(respectively) the amount of land used and the amount of capital (cap) used.
We expect coefficient β1 to be positive, reflecting the fact that, the larger the amount of land,
ceteris paribus, the larger the amount produced and, consequently, the larger the revenue. Likewise,
we expect a positive sign for β5, reflecting the fact that more capital will probably lead to more
production and larger revenue, ceteris paribus.
Concerning the respective magnitudes, the predictions are not so clear-cut. Both inputs will
theoretically have diminishing returns, but our model fails to account for that, by calculating a
constant elasticity whatever the amount of land or capital used. In a setting of modern agriculture
production, it's probably easier to increase capital than to increase land, when the objective is
obtaining a larger revenue. Because of this, we expect β1 to be smaller than β5.

1b.ii) Most of the problems we can expect come from the possibility that the regression violates some of
the assumptions of the Classical Linear Regression Model. One potential problem in
this model is heteroscedasticity: the conditional variances of the error term are not all equal. Some
possible causes of heteroscedasticity in this case include differences in the precision of measurement
methods (it is not clear that all farms record their activities with the same care and precision),
outliers (as mentioned before, a small number of cases can be considered
outliers), incorrect specification of the regression model and a wrong choice of functional form.
Another potential problem arising in this case is multicollinearity. Some of the variables
might be functionally linked: for instance, capital and labour have the amount of land built into
them, because the larger the amount of land, the more capital and labour employed, ceteris
paribus.
We also consider there may be an omitted variable bias: the model excludes potentially
important variables regarding the quality of the soil.
Finally, it is not at all clear that the model is correctly specified, since we may be ignoring
important variables, the functional form may not be the most adequate and some of the
probabilistic assumptions about the variables may not be correct.

1c) We have estimated model A, and obtained the following results:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -244238.83 79844.4 -3.06 0.00245 **
land -1179.65 545.93 -2.16 0.03159 *
lab 89.37 31.78 2.81 0.00528 **
fert 183.39 31.9 5.75 2.44e-08 ***
mach 164.9 20.69 7.97 4.48e-14 ***
cap 465.59 29.49 15.79 < 2e-16 ***
Residual standard error: 644700 on 269 degrees of freedom
Multiple R-squared: 0.8827, Adjusted R-squared: 0.8805
F-statistic: 404.8 on 5 and 269 DF, p-value: < 2.2e-16

All the coefficients in the model are significant, so the problem of a high R-squared and low t
values doesn't apply. However, the problem did apply to an alternative dataset, where the two
outliers were not eliminated. The remainder of question 1c refers to results and tests conducted
for that alternative model A.
One of the possible reasons for a high R-squared but low t-values is a
high degree of multicollinearity. The literature suggests that an R-squared
of more than 0.8 combined with slope coefficients not statistically different from 0 could
indicate a high degree of multicollinearity. Here only one coefficient is in that situation (for
land), but we decided to test further for a high degree of multicollinearity.
For that, we tested for high pairwise correlations among regressors. The literature
suggests multicollinearity could be an important issue if the zero-order correlations are higher than
0.8. Running these correlations yielded two values of R-squared larger than 0.8:
Y = land and X = fert, R2 = 0.81
Y = lab and X = fert, R2 = 0.81
Since the first test is too strong and the second is a sufficient but not a necessary condition, we
decided to apply a third test and run the auxiliary regressions, regressing each Xi on the
remaining X variables. To simplify the analysis, we follow Klein's rule of thumb,
which states that multicollinearity might be a problem if the adjusted R-squared of any of these
auxiliary regressions is larger than the adjusted R-squared of the overall regression. In this case, we
obtained values of adjusted R-squared between 0.65 and 0.89, all smaller than the adjusted
R-squared of the overall regression, so this test points to no meaningful multicollinearity.
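Klein's rule of thumb, as stated above, amounts to a simple comparison. The sketch below is illustrative only: the per-variable auxiliary values and the overall adjusted R-squared of 0.90 are hypothetical numbers chosen to fall in line with the reported 0.65 to 0.89 range, since the individual figures are not listed in the text.

```python
def klein_flags(aux_adj_r2, overall_adj_r2):
    """Return the regressors whose auxiliary adjusted R-squared exceeds
    the adjusted R-squared of the main regression (Klein's rule of thumb)."""
    return [name for name, r2 in aux_adj_r2.items() if r2 > overall_adj_r2]

# Hypothetical auxiliary adjusted R-squared values, one per regressor.
aux = {"land": 0.89, "lab": 0.80, "fert": 0.85, "mach": 0.65, "cap": 0.70}

# Hypothetical overall adjusted R-squared of 0.90: no regressor is flagged.
print(klein_flags(aux, 0.90))
```

An empty list corresponds to the conclusion reached in the text: no auxiliary regression fits better than the overall regression.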
Overall, the first two tests point to potential multicollinearity (the first, arguably, not
so much, since there is only one slope not significantly different from 0), while the third does not.
The general impression is that multicollinearity could be an issue, but we chose not to act on it,
instead investigating the possibility of another functional form that includes all capital-related
variables, potentially curing the issue and better describing the data.

1d) The first possible solution is the one we adopted for model A: excluding the outliers from
the analysis.
Another potential remedy for a significant multicollinearity is to drop one variable from the
analysis – in the case of model A, the variable dropped would be the amount of fertilizer, part of the
potentially multicollinear capital block, and highly correlated to both land and labour. However,
that could induce a problem of omitted variable bias: the model could be incorrectly specified.
Fertilizer is a theoretically important determinant of the total quantity produced and, consequently,
of the total revenue obtained.
Besides, dropping the variable is likely to result in an overestimation of the absolute value of the
coefficients associated with labour, machinery and capital. Since the OLS estimators remain BLUE even
under high multicollinearity, omitting this variable would introduce a bias in the values of the
parameters, and consequently in the values predicted by the regression. Another variable whose
exclusion from the analysis could result in specification bias is fodder, which is capital-related.
We first estimated a model where land, labour and all capital-related variables are included
(model A1):

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -281415.32 61360.28 -4.59 6.93e-06 ***
land 549.86 437.67 1.26 0.21
lab 128.2 24.56 5.22 3.59e-07 ***
fert 192.12 24.5 7.84 1.06e-13 ***
mach 142.24 15.97 8.91 < 2e-16 ***
cap 128.23 33.42 3.84 0.000155 ***
fodd 85.91 6.26 13.73 < 2e-16 ***
Residual standard error: 495000 on 268 degrees of freedom
Multiple R-squared: 0.9311, Adjusted R-squared: 0.9296
F-statistic: 603.6 on 6 and 268 DF, p-value: < 2.2e-16

Model A1 provides a good fit, besides being in accordance with theory. However, the fact that four
different variables account for capital expenditure could lead to unwanted complications and,
potentially, multicollinearity (note the coefficient for land is not significant). Since the units of all
capital-related variables are the same, we can sum them to produce a total capital variable (tc) and
test the model (A2):

Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.49E+005 6.36E+004 -5.49 9.14e-08 ***
land 2.24E+003 3.27E+002 6.86 4.75e-11 ***
lab 1.91E+002 2.22E+001 8.58 7.36e-16 ***
tc 9.67E+001 3.46E+000 27.91 < 2e-16 ***
Residual standard error: 523200 on 271 degrees of freedom
Multiple R-squared: 0.9222, Adjusted R-squared: 0.9213
F-statistic: 1070 on 3 and 271 DF, p-value: < 2.2e-16

An alternative way to estimate the model is to consider it to be a short-run
production model, taking revenue as a proxy for the quantity produced, and logging it to create the
regressand. The regressors would be the reciprocals of land, labour and total capital,
giving a Logarithmic Reciprocal Model (model A3):

Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.57E+001 5.11E-002 306.04 < 2e-16 ***
I(1/land) -3.61E+001 7.70E+000 -4.68 4.48e-06 ***
I(1/lab) -1.65E+003 2.47E+002 -6.67 1.46e-10 ***
I(1/tc) -4.49E+003 3.19E+002 -14.07 < 2e-16 ***
Residual standard error: 0.3421 on 271 degrees of freedom
Multiple R-squared: 0.8228, Adjusted R-squared: 0.8208
F-statistic: 419.4 on 3 and 271 DF, p-value: < 2.2e-16
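Written out, the specification behind the output above and its implied marginal effect look as follows. This is our reading of the regression output, with the labels β1 to β4 assumed for illustration rather than taken from the assignment.

```latex
\[
\ln(\mathit{out}_i) = \beta_1 + \beta_2 \frac{1}{\mathit{land}_i}
  + \beta_3 \frac{1}{\mathit{lab}_i} + \beta_4 \frac{1}{\mathit{tc}_i} + u_i
\]
\[
\frac{\partial \ln(\mathit{out}_i)}{\partial \mathit{land}_i}
  = -\frac{\beta_2}{\mathit{land}_i^{2}}
\]
```

With a negative estimate for β2, the effect of land on log revenue is positive and shrinks as land grows, and the same holds for labour and total capital. As all three inputs become very large, ln(out) approaches the intercept of about 15.7, so the fitted revenue levels off near exp(15.7), roughly 6.6 million in the units of out.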

From a purely statistical point of view, the better model is the one with the highest goodness
of fit (adjusted R-squared). We note we can't compare models where the dependent variable is
different: in models A1 and A2, we used out as the dependent variable, while in model A3 we used
log(out). As a result, we can only compare the adjusted R-squared of the first two models. The
adjusted R-squared of A1 is marginally higher, so A1 would be the statistical choice.
From an economics point of view, and as long as we accept that revenue is a good
representation of quantity produced and that the data pertain to the short run, we believe our
short-run production function (A3) is the best suited. Although it has a smaller R-squared than the
linear functions, it makes sense to believe all three inputs have a diminishing marginal product. All the
signs are in the expected direction, all the t values are statistically significant and the overall F
statistic is also sufficiently high, so we consider the model significant overall.
From an econometrics point of view, the chosen model should be the one that best
describes reality and makes economic sense. Given the discussion above, we believe that is
model A3.

1e) A Cobb-Douglas production function, in stochastic form, is expressed as:

$$Q_i = \beta_1 L_i^{\beta_2} K_i^{\beta_3} e^{u_i}$$

where Q = quantity produced, L = amount of labour used and K = amount of capital used. In the
present case, we do not have data on the quantity produced, only the revenue. With the
market prices, we could transform it into the quantity produced; this means the quantity produced is
proportional to the revenue, so we can use revenue in place of quantity produced in our model.
Taking logs of both sides of the equation, we obtain:

$$\ln Q_i = \ln \beta_1 + \beta_2 \ln L_i + \beta_3 \ln K_i + u_i$$

Since fertilizer, fodder, machinery and capital are all capital-related variables, and in the
same standardized unit, we shall add them, using the total capital variable described before (tc). The
result of the linear regression, model B, is:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.78 0.29 13.2 < 2e-16 ***
I(log(lab)) 0.48 0.06 8.56 8.57e-16 ***
I(log(tc)) 0.72 0.04 20.25 < 2e-16 ***
Residual standard error: 0.2808 on 272 degrees of freedom
Multiple R-squared: 0.8802, Adjusted R-squared: 0.8793
F-statistic: 998.9 on 2 and 272 DF, p-value: < 2.2e-16

From a statistical point of view, Model B has a high adjusted R-squared, making it a
potentially good model. All coefficients have the expected signs and high t values, making them
significant. Output presents a constant elasticity of 0.48 with respect to labour, keeping total capital
constant (a 1% increase in the quantity of labour results in a 0.48% increase in revenue), and a
constant elasticity of 0.72 with respect to total capital, keeping the amount of labour constant.
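The elasticity reading of the two slope estimates can be checked numerically. The sketch below assumes the log-log form of model B and uses a hypothetical helper, `pct_change_in_output`, to compute the exact percentage change implied by a constant elasticity.

```python
def pct_change_in_output(elasticity, pct_change_in_input):
    """Exact percentage change in output under a constant-elasticity
    (log-log) model when one input changes and the other is held fixed."""
    factor = (1 + pct_change_in_input / 100) ** elasticity
    return (factor - 1) * 100

print(round(pct_change_in_output(0.48, 1), 2))   # labour up by 1%
print(round(pct_change_in_output(0.72, 1), 2))   # total capital up by 1%
print(round(pct_change_in_output(0.48, 50), 1))  # labour up by 50%
```

For a 1% change the exact figure is practically equal to the elasticity itself. For large changes the shortcut overstates the effect: a 50% increase in labour raises predicted revenue by about 21.5%, not 24%.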
In a Cobb-Douglas production function, the sum of the coefficients β2 and β3 tells us
whether production has constant, increasing or decreasing returns to scale, according to whether it
equals, is larger than or is smaller than 1, respectively. In our estimation, the sum of the estimated
coefficients is 1.20 (0.48+0.72). We test the hypothesis that the sum of the true coefficients equals 1:

H0: β2 + β3 = 1
H1: β2 + β3 ≠ 1

Res.Df RSS Df Sum of Sq F Pr(>F)
1 272 21.45
2 273 23.83 -1 -2.37 30.1 9.404e-08 ***

A highly significant F value suggests we should reject H0 – the production does not enjoy
constant returns to scale. From the sum of the estimated coefficients, we believe the production
enjoys increasing returns to scale. To demonstrate this, we test a new hypothesis:

H0: β2 + β3 = 1.2
H1: β2 + β3 ≠ 1.2

Res.Df RSS Df Sum of Sq F Pr(>F)
1 272 21.45
2 273 21.45 -1 0 0.02 0.9

An F value that is not statistically significant means we can't reject H0. The data are therefore
consistent with a sum of β2 and β3 of about 1.2 which, together with the rejection of a sum of 1,
points to increasing returns to scale.
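The F statistic of the first restriction test can be reproduced from the residual sums of squares in the ANOVA table. This is a minimal sketch using the rounded RSS values printed above; the function name `restricted_f` is ours.

```python
def restricted_f(rss_restricted, rss_unrestricted, n_restrictions, df_unrestricted):
    """F statistic for linear restrictions, computed from residual sums of squares."""
    numerator = (rss_restricted - rss_unrestricted) / n_restrictions
    denominator = rss_unrestricted / df_unrestricted
    return numerator / denominator

# Values read from the first ANOVA table: RSS 21.45 (272 df) vs 23.83 (273 df).
f_constant_returns = restricted_f(23.83, 21.45, 1, 272)
print(round(f_constant_returns, 1))

# Scale factor implied by the point estimates: doubling both inputs
# multiplies predicted output by 2 ** (0.48 + 0.72).
print(round(2 ** (0.48 + 0.72), 2))
```

The small gap between this value and the 30.1 in the table comes from the rounding of the printed RSS figures. The second line illustrates what increasing returns mean here: under the point estimates, doubling labour and total capital more than doubles predicted revenue.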

We then create model C, by dividing all the variables in model B by the variable land, and
logging the result. We decided to do this because, in a competitive setting, all producers are
optimizing their level of input usage; as a result, all of them will be using a similar level of capital
and labour per unit of land.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.05 0.15 39.29 <2e-16 ***
I(log(lab/land)) 0.04 0.05 0.71 0.48
I(log(tc/land)) 0.74 0.04 21.07 <2e-16 ***
Residual standard error: 0.2844 on 272 degrees of freedom
Multiple R-squared: 0.6976, Adjusted R-squared: 0.6954
F-statistic: 313.7 on 2 and 272 DF, p-value: < 2.2e-16

Models B and C are quite similar in terms of functional form. The first question that must be
asked is which makes more sense in economic theory. Model B only includes capital and labour as
determinants of the quantity produced, whereas Model C accounts for the land as well. In an
agricultural production setting, it makes sense to account for land, and model C does it.
We first test the models for heteroscedasticity. Since the visual method is not conclusive, we
tried some more formal methods, starting with the Breusch-Pagan test. The results were the
following:

data: B
BP = 21.4954, df = 2, p-value = 2.149e-05
data: C
BP = 21.3869, df = 2, p-value = 2.269e-05

We reject the null hypothesis of homoscedasticity. We proceeded to apply the Goldfeld-Quandt test:

data: B
GQ = 2.0398, df1 = 135, df2 = 134, p-value = 2.237e-05
data: C
GQ = 2.4099, df1 = 135, df2 = 134, p-value = 2.728e-07

Again, in both cases we reject the null hypothesis of homoscedasticity. We thus conclude
that both regressions suffer from heteroscedasticity.
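The Breusch-Pagan p-values reported above can be checked by hand. Assuming the BP statistic is compared with a chi-squared distribution with 2 degrees of freedom (as the df = 2 in the output indicates), the upper-tail probability has a closed form, exp(-x/2). The helper name `chi2_sf_df2` is ours.

```python
import math

def chi2_sf_df2(x):
    """Upper-tail probability of a chi-squared variable with 2 degrees
    of freedom; for df = 2 this has the closed form exp(-x / 2)."""
    return math.exp(-x / 2)

p_b = chi2_sf_df2(21.4954)  # model B
p_c = chi2_sf_df2(21.3869)  # model C
print(p_b, p_c)
```

Both values agree with the p-values in the test output, well below any conventional significance level.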
Next, we test for autocorrelation, using the Durbin-Watson test:

data: B
DW = 1.4271, p-value = 6.783e-07
data: C
DW = 1.1591, p-value = 1.041e-12

In both cases, the regressions seem to exhibit significant first-order autocorrelation. To
understand how robust these results are, we then test for autocorrelation using the Breusch-Godfrey
test for serial correlation of order 1:

data: B
LM test = 23.0077, df = 1, p-value = 1.614e-06
data: C
LM test = 51.0209, df = 1, p-value = 9.139e-13

Again, both models suffer from significant autocorrelation.
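To give a sense of magnitude, the Durbin-Watson statistics reported earlier can be translated into an approximate first-order autocorrelation coefficient using the standard approximation d ≈ 2(1 - ρ). This is a rough sketch, not part of the original analysis; the function name `rho_from_dw` is ours.

```python
def rho_from_dw(dw):
    """Approximate first-order autocorrelation implied by the
    Durbin-Watson statistic, using d ≈ 2 * (1 - rho)."""
    return 1 - dw / 2

rho_b = rho_from_dw(1.4271)  # model B
rho_c = rho_from_dw(1.1591)  # model C
print(round(rho_b, 2), round(rho_c, 2))
```

Under this approximation the implied autocorrelation is roughly 0.29 for model B and 0.42 for model C, so the problem appears more pronounced in model C.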


Finally, we tested both models for multicollinearity, using pairwise correlations and
auxiliary regressions. We concluded, by both methods, that neither B nor C suffers from
multicollinearity.
Overall, we believe model B should be chosen. It is simpler and accounts for reality in a
satisfactory way. Model C accounts for the influence of land, but it is not clear that this gives any
advantage in the analysis, and it brings about a reduced predictive capacity.
Besides, model B gives us a useful measure of returns to scale in this setting.
In any case, model B gives an estimate of the percentage change in output for a
percentage change in capital or labour, while model C gives the same estimate per unit of land.
Both can be useful.

1f) In creating dummy variables, we must make sure that, for any m conditions of the
benchmark category, we create m-1 dummy variables. Looking at the dataset first, we predict that the
variables soilc and soils will be dummy variables. Each soil characteristic may have a different
impact on the quantity produced (and, consequently, on the total farm revenue, ceteris paribus).
However, there are also situations where the soil is both “clay” and “sandy”, and other situations
where it is neither of those characteristics. As a result, the model will need one more dummy
variable, to account for the situation where the soil is both; this means all
We begin by analysing the age of the farmer. The benchmark category has two different
conditions: up to (and including) 40 years old; and over 40. So, we will create one dummy variable,
age2, with two conditions:
0 if age <=40
1 if age > 40
As for the type of soil, we identify 4 different conditions: clay soil; sandy soil; clay and
sandy soil; and soil neither clay nor sandy. So, we will define three dummy variables: clay (1=yes,
0=no), sandy (1=yes, 0=no) and both (clay*sandy). This last variable is intended to give us a
measure of the interaction between the types of soil.
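The dummy coding described above can be sketched as follows. This is an illustration under the assumption that both soil indicators are coded 0/1; the function names `age2` and `soil_dummies` are ours and do not come from the dataset.

```python
def age2(age):
    """Age dummy: 0 if the farmer is 40 or younger, 1 if over 40."""
    return 1 if age > 40 else 0

def soil_dummies(is_clay, is_sandy):
    """Return the clay and sandy dummies and their interaction term."""
    clay = 1 if is_clay else 0
    sandy = 1 if is_sandy else 0
    return {"clay": clay, "sandy": sandy, "both": clay * sandy}

# The four soil conditions map to four distinct dummy patterns.
for is_clay in (False, True):
    for is_sandy in (False, True):
        print(is_clay, is_sandy, soil_dummies(is_clay, is_sandy))
```

Soil that is neither clay nor sandy has all three dummies equal to 0, so it acts as the reference condition absorbed by the intercept.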
With these variables defined, we estimate model D:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.84 0.32 18.11 <2e-16 ***
I(log(lab/land)) 0.02 0.05 0.47 0.64
I(log(tc/land)) 0.76 0.03 22.32 <2e-16 ***
age 0 0 -0.97 0.33
clay 0.36 0.27 1.34 0.18
sandy 0.16 0.27 0.59 0.56
I(clay * sandy) -0.31 0.28 -1.14 0.26
Residual standard error: 0.2702 on 268 degrees of freedom
Multiple R-squared: 0.7311, Adjusted R-squared: 0.7251
F-statistic: 121.4 on 6 and 268 DF, p-value: < 2.2e-16

Model D, although significant overall and with a high adjusted R-squared, only has two
significant coefficients: the intercept and total capital per hectare.
To finish, we compare models C and D using an anova. This is possible because both
models have the same regressand and use the same sample. The models are significantly different
(F = 8.3429, p = 2.345e-06).
To conclude, we believe model B is the one that better describes the reality. Being a Cobb-
Douglas production function, it accounts for both capital and labour as inputs for agricultural
production. As we have shown, there are increasing returns to scale – if there weren't, models B and
C would be equivalent, with C accounting for the increase in production per hectare.

2a) We create an economic model to predict a farmer's decision to join a water community
or not. Given the levels of a number of variables, the model
will predict whether or not a specific farmer will join the water community.
The economic model includes various factors to account for the probability of the farmer
joining the water community. These factors can be socio-economic (education, age, gender), related
to the cost of irrigation (total area farmed, percentage of total irrigated area and share of crops in
total revenue), or related to the kind of irrigation used (furrow,
sprinkler, flood, irrha, irrpc).
This is a model where, if the sum of these factors is over a certain threshold, the
probability of the farmer joining the water community equals 1 (he will join it); and, if it's below
that threshold, the probability equals 0 (he won't join it).

2b) This model cannot be linear, because our dependent variable is qualitative, so OLS
estimation would be inappropriate. We therefore resort to a qualitative response model.
The probability of belonging to the water community depends on an implicit utility index
(which we call Ui), calculating the utility one farmer obtains from belonging to the water
community. This index is a linear function of the several variables discussed above: socio-
economic, cost of irrigation and type of irrigation, such that:

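$$
\begin{aligned}
U_i ={} & \beta_1 + \beta_2\,\mathit{totalha} + \beta_3\,\mathit{crops} + \beta_4\,\mathit{furrow} + \beta_5\,\mathit{sprinkler} + \beta_6\,\mathit{flood} \\
& + \beta_7\,\mathit{furrow} \cdot \mathit{sprinkler} + \beta_8\,\mathit{furrow} \cdot \mathit{flood} + \beta_9\,\mathit{sprinkler} \cdot \mathit{flood} \\
& + \beta_{10}\,\mathit{furrow} \cdot \mathit{sprinkler} \cdot \mathit{flood} + \beta_{11}\,\mathit{irrha} + \beta_{12}\,\mathit{irrpc} \\
& + \beta_{13}\,\mathit{gender} + \beta_{14}\,\mathit{age} + \beta_{15}\,\mathit{education} + u_i
\end{aligned}
$$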

Note that the irrigation variables (furrow, sprinkler and flood) are dummy variables; and that
farmers can use just one of them, combine two different kinds of irrigation, combine all three kinds
or use no irrigation at all. As a result, our benchmark variable (type of irrigation used) has eight
categories and consequently we use seven different variables to account for all possibilities.
The probability of joining the water community (Pi) is a function of the implicit utility Ui. If
Ui exceeds a certain threshold (we call it Ui*) the farmer will join the water community; otherwise,
he will not:

$$P_i = P(\mathit{member}_i = 1 \mid U_i) = P(U_i^* \le U_i) = P(Z_i \le U_i) = F(U_i)$$

where Zi is the standard normal variable and F is the standard normal cumulative
distribution function.
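The mapping from the utility index to a probability can be sketched with the standard library alone. The function names below are ours; the logistic function is included only for comparison with the logit alternative discussed in this section.

```python
import math

def probit_probability(u):
    """Standard normal CDF evaluated at the utility index u."""
    return 0.5 * (1 + math.erf(u / math.sqrt(2)))

def logit_probability(u):
    """Logistic CDF evaluated at the utility index u."""
    return 1 / (1 + math.exp(-u))

# Both links map an index of 0 to a probability of one half
# and approach 0 or 1 as the index becomes very negative or very positive.
for u in (-2.0, 0.0, 2.0):
    print(u, round(probit_probability(u), 3), round(logit_probability(u), 3))
```

The two curves have the same S shape, but they are on different scales, which is why probit and logit coefficients are not directly comparable even when the fitted probabilities are similar.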
We can calculate Pi by three different methods: a Linear Probability Model (LPM), a logit
model or a probit model. Since the LPM has well-known problems (for instance, fitted probabilities
outside the 0 to 1 range), we chose to estimate only the logit
and the probit models.
Choosing between the logit and the probit model is a difficult task. Both estimates are quite
similar, and produce similar predictions. In this case, we used the Akaike Information Criterion: the
model with the smallest AIC was chosen. Again, the models were closely matched: the AIC for
logit was 125.03, while the AIC for probit was 124.42. We chose the probit model as the best
match.

2c.i) We obtained the following estimates for the probit model:

Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.63E+000 2.10E+000 -0.77 0.44
totalha 6.50E-002 8.65E-002 0.75 0.45
crops -8.57E-004 1.81E-002 -0.05 0.96
furrow 2.60E+000 7.18E-001 3.62 0.000293 ***
sprinkler 1.79E+000 6.91E-001 2.59 0.009557 **
flood 2.74E+000 8.17E-001 3.35 0.000806 ***
irrha 2.88E-001 2.21E-001 1.3 0.19
irrpc 2.99E-001 6.84E-001 0.44 0.66
gender 4.22E-002 4.81E-001 0.09 0.93
age -9.27E-002 1.56E-001 -0.59 0.55
education 4.25E-003 1.48E-001 0.03 0.98
furrow:sprinkler -1.72E+000 8.62E-001 -2 0.045607 *
furrow:flood 2.34E+000 2.61E+002 0.01 0.99
sprinkler:flood 2.93E+000 1.76E+003 0 1
furrow:sprinkler:flood -3.02E+000 1.86E+003 0 1
Null deviance: 166.67 on 248 degrees of freedom
Residual deviance: 94.42 on 234 degrees of freedom
AIC: 124.42
Number of Fisher Scoring iterations: 18
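The AIC used to choose between logit and probit can be recovered from the figures above. For a binary response the residual deviance equals minus twice the log-likelihood, so the AIC is the residual deviance plus twice the number of estimated parameters. The sketch below assumes that relation; the function name is ours.

```python
def aic_from_deviance(residual_deviance, n_params):
    """AIC for a binary-response model: deviance plus 2 per estimated parameter."""
    return residual_deviance + 2 * n_params

# The probit model estimates 15 parameters (intercept plus 14 slopes).
n_params = 15
aic_probit = aic_from_deviance(94.42, n_params)
print(round(aic_probit, 2))

# Residual deviance implied for the logit model by its reported AIC of 125.03,
# assuming it has the same 15 parameters.
implied_logit_deviance = 125.03 - 2 * n_params
print(round(implied_logit_deviance, 2))
```

Since both models have the same number of parameters, comparing their AIC values is equivalent to comparing their residual deviances, and the difference of about 0.6 is small.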

As the estimates show, only four variables have a significant impact on the
probability of becoming a member of the water community: using furrow irrigation, using
sprinkler irrigation, using flood irrigation, and using a combination of furrow and sprinkler irrigation.
Using furrow, sprinkler or flood irrigation each has a positive impact on the probability of
joining the water community. In contrast, the negative coefficient on the combination of furrow and
sprinkler irrigation suggests farmers using both methods are less likely to become members.

2c.ii) Analysis of the data suggests two possible paths for the water community to
increase its number of associates. The first course of action is to contact more farmers who use only
one irrigation method. The analysis shows that they are very likely to decide to join.
The other course of action is to investigate why farmers who use both furrow and sprinkler
irrigation methods are less likely to join the water community. Perhaps there might be some
underlying economic (or otherwise) explanation for this fact, and the water association could in
some way devise strategies to counteract the reduced probability of joining by these farmers.

2c.iii) Generally, there are criticisms to make on the dataset. The first one concerns the size of
some of the groups involved: there are only 14 women owners for a total sample size of 149
farmers; there are only two farmers who use both flood and sprinkler, but a larger number of
farmers who use only furrow irrigation. The quality of our analysis would gain from a more
balanced group size.
The data also ignore potential personal differences between the farmers at an attitudinal
level. There is no data concerning farmers' personal preferences for belonging to a water
community. Failure to account for these potential differences results in omitted variable bias:
these variables could be underpinning the differences (or lack of differences) observed.
