
Answer all questions.

Question 1 (4 points): Provide a brief definition of significance level.

Question 2 (4 points): Explain briefly what unbiasedness means.

Question 3 (4 points): Write down the formula for the Two Stage Least Squares (TSLS) estimator.

Question 4 (4 points): Write down the empirical equation for the differences-in-differences estimator. State the condition that needs to be satisfied for the differences-in-differences estimator to capture the causal effect of treatment.

Question 5 (4 points): We have obtained the following OLS regression results:

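$$\widehat{\text{wage}}_i = \underset{(1)}{2} - \underset{(1)}{3}\,\text{female}_i + \underset{(.5)}{2}\,\text{age}_i - \underset{(1)}{2}\,(\text{male}_i \times \text{age}_i)$$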

where $\text{male}_i = 1$ when the individual is male, $\text{female}_i = 1$ when the individual is female, and $\text{age}_i$ is in years. Robust standard errors are in parentheses. Draw a line in the (age, wage) space to represent how earnings change with age for men. Draw an equivalent line for women. Mark explicitly the intercepts and the slopes.

Question 6 (4 points): I have obtained the following regression results:

$$\widehat{\text{lwage}} = \underset{(.51)}{2.3} + \underset{(.035)}{.07}\,\text{educ} + \underset{(.005)}{.01}\,\text{exp}$$

where lwage is the log of wages, educ is the number of years in education and exp is the number of years of experience (robust standard errors in parentheses). How do we interpret $\hat\beta_{educ}$? Construct a 95% confidence interval for $\beta_{educ}$. Can we reject the hypothesis that $\beta_{educ} = 0$?

Question 7 (4 points): We want to understand whether higher wages are associated with lower employee turnover. We hypothesise the existence of a negative relation. We obtained the following OLS regression line (robust standard errors in parentheses):

$$\widehat{\text{turnover}} = \underset{(2.3)}{3} - \underset{(.005)}{.02}\,\text{wages}$$

We have reasons to suspect that both turnover and wages are measured with a lot of error. Is $-.02$ a lower bound of the true parameter?

Question 8 (4 points): We have the following population model:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

We have to decide between two alternative estimators. The density functions of the two estimators are displayed below. Discuss briefly which estimator you prefer.

Question 9 (4 points): We want to study whether government advertising affects newspapers' coverage of corruption scandals. Our empirical model is:

$$\text{frontpage}_{mj} = \alpha + \beta\,\text{advertising}_{mj} + \gamma_j + \delta_m + \varepsilon_{mj}$$

where $\text{frontpage}_{mj}$ captures the percentage of newspaper $j$'s front page that was devoted to covering corruption scandals in month $m$, $\text{advertising}_{mj}$ captures the total government advertising spending in newspaper $j$ in month $m$, and $\gamma_j$ and $\delta_m$ are newspaper and month fixed effects. Can you suggest a placebo experiment for this test? Discuss briefly why such a placebo experiment will be useful.

Question 10 (4 points): The twenty-four-hour Fox News Channel was introduced by Rupert Murdoch in the US in October 1996. Fox News expanded rapidly through local cable markets, and reached twenty percent of US cities by June 2000. Since Fox News is a right-wing channel, left-wing commentators feared at the time that the introduction of the channel would significantly affect the outcome of the November 2000 US Congress elections. A study on the effect of NTV (Independent TV) on the 1999 Russian Duma elections found substantial media effects on voting. Discuss briefly whether the findings of this Russian study can be validly extrapolated to a US context.

Question 11 (4 points): We want to understand the effect of a company's profits on its CEO's salary. Our population model is:

$$\text{lsalary} = \beta_0 + \beta_1\,\text{profits} + \varepsilon$$

We are going to run an OLS regression on the model above. We are concerned about the existence of the variable envir, which captures how good the business environment is. If the economy is growing a lot, the business environment is good and envir will take a high value. Will envir create an omitted variable bias? If so, could you give a sign to the bias?

Question 12 (4 points): Prove that the F-statistic can also be expressed as:

$$F = \frac{(R^2_{ur} - R^2_{r})/q}{(1 - R^2_{ur})/(n - k - 1)}$$

Question 13 (6 points): Consider the population model:

$$y_i = \beta x_i + \varepsilon_i$$

where $x_i$ and $\varepsilon_i$ satisfy the Gauss-Markov assumptions (i.e. assumptions 1-5 in the course). Let $\tilde\beta = \bar{y}/\bar{x}$, where $\bar{y}$ and $\bar{x}$ are the sample means of $y_i$ and $x_i$ respectively. Show that $\tilde\beta$ is an unbiased estimator of $\beta$ (remember, treat $x_i$ as fixed values).

Question 14 (6 points): We want to study whether managers are on average happier than rank-and-file workers. We are not interested in causality, but only in whether there are statistical differences across the two groups. We obtained the following regression results (robust standard errors are in parentheses):

$$\widehat{\text{happiness}}_i = \underset{(3.7)}{20.7} + \underset{(8.1)}{10.3}\,\text{manager}_i + \underset{(.6)}{.7}\,\text{earnings}_i$$

What can we conclude from the above regression? Discuss briefly.

Question 15 (6 points): In our dataset we have four groups of workers: white males, white females, black males and black females. For each worker we have information on their earnings. We have run two comparison of means tests as described in the STATA output below:

We are going to run a regression on the following population model:

$$\text{earnings}_i = \beta_0 + \beta_1\,\text{bf}_i + \beta_2\,\text{wf}_i + \beta_3\,\text{wm}_i + \varepsilon_i$$

where $\text{bf}_i$ is a dummy variable taking value 1 if individual $i$ is both black and female, and taking value 0 otherwise, and $\text{wf}_i$ and $\text{wm}_i$ are defined similarly for white females and white males. By looking at the STATA output above, provide the point estimates of our regression (note: you can round up to make your calculations easier).

Question 16 (6 points): We want to understand whether the increase in police has an effect in reducing drug usage. To do this, we gathered information on drug usage for a panel dataset of municipalities, all of which increased the size of their police force on the same date. We have plotted below the evolution of drug usage over time for our sample of municipalities. The vertical line captures the day in which the increase in police force took effect:

We are considering two alternative empirical models to estimate the effect of police on drug usage. The first one is:

$$\text{drugusage}_{it} = \beta\,\text{post}_t + \alpha_i + \gamma\,\text{time}_t + \varepsilon_{it}$$

where $\alpha_i$ is a municipality fixed effect, $\text{post}_t = 1$ if the municipality was sampled after the increase in police and zero otherwise, and $\text{time}_t = 1, 2, \ldots$ captures the time (day) in which the municipality was sampled. The second model is:

$$\text{drugusage}_{it} = \beta\,\text{post}_t + \alpha_i + \delta_t + \varepsilon_{it}$$

where $\delta_t$ is a time (day) fixed effect. One way to interpret the day fixed effects is as the coefficients associated with day-specific dummy variables.

a. In the first model, how would you interpret $\gamma$?

b. Which of the two models do you prefer and why?

Question 17 (6 points): We want to use a panel dataset of workers to study how earnings are affected by age. We do not know what functional form the relation should take in our empirical model. The youngest individual in the sample is 16 years old and the oldest is 85 years old.

Discuss briefly how we could use dummy variables to explore the functional form of the relation between earnings and age.

Most workers in our dataset are between 25 and 64 years old, and very few are younger or older than that. Discuss briefly the implications of this fact for the variance of the estimated age effects outside the 25-64 interval.

Question 18 (6 points): I wanted to study the effects of receiving extra tuition on class grades. To do this, I ran a lottery among all students in the school. Extra tuition was offered at random to some students (but not to others) following the outcome of the lottery. I constructed a variable called tuition to take value 1 when the student was both offered tuition and accepted it. I then ran a regression on the following empirical model:

$$\text{grades} = \beta_0 + \beta_1\,\text{tuition} + \varepsilon$$

Will my estimate of $\beta_1$ capture the causal effect of tuition on grades? If not, discuss briefly the sign of the bias.

Question 19 (6 points): The government program STOMACH was launched in 2013 to provide school meals to disadvantaged students. Due to limited funds, access to the program was limited to students with even enrollment numbers (whether enrollment numbers are odd or even is regarded as random). We want to study whether access to school meals improves academic performance. We have access to two separate cross sections of students, one for 2012 (before the program started) and another for 2013 (after the program started; note that the students are different across the two cross sections). For every student we know their enrollment number and academic performance. What empirical model can we use to test our hypothesis? If we had only the 2013 cross section, could we still test our hypothesis? If we had only the 2012 cross section, could we still test our hypothesis?

Question 20 (6 points): We want to study whether richer individuals bequeath a higher proportion of their wealth to charitable causes. We ran two regressions to study this relation. In the first regression Bill Gates was included in the sample, and we obtained:

$$\widehat{\%\text{charity}}_i = \underset{(.71)}{2.2} + \underset{(.7)}{3.2}\,\text{wealth}_i$$

In the second regression Bill Gates was dropped from the sample, and we obtained:

$$\widehat{\%\text{charity}}_i = \underset{(.72)}{2.2} + \underset{(2.1)}{3.2}\,\text{wealth}_i$$

where %charity captures the percentage of the wealth donated to charity and wealth is measured in dollars. Discuss briefly whether Bill Gates is an outlying observation. Draw a scatter plot capturing how the data from our sample (including the data point Bill Gates) looks in the (wealth, %charity) space.

Question 21 (6 points): In my high school, desks were allocated randomly on the first day. These seating positions were unchanged throughout the seven years that we stayed at this school. I want to study the causal effect of desk number on future earnings. The hypothesis is that lower numbers (i.e. closer to the teacher) may have led to more learning and higher future performance. To study this question, I gathered a sample among those ex-fellow students who attended the Annual Alumni Meeting. These Alumni meetings are unpleasant occasions, since ex-students who attend do so mostly to boast about their present success. I am planning to run a regression on the following model:

$$\text{earnings} = \beta_0 + \beta_1\,\text{desknumber} + \varepsilon$$

Discuss briefly whether the above model can provide unbiased estimates of the causal effect of desk number on future earnings. If not, discuss the sign of the bias.

Question 22 (6 points): In Jutlandia a child inherits his father's (but not his mother's) surname. For cultural reasons, almost all families wish to have (at least) one male child, since this way the family name is not lost. Access to prenatal sex determination and sex-selective abortion is completely forbidden, so the ratio of male to female babies is exactly one. We wish to study whether the number of children in a family has a causal effect on family earnings. The hypothesis is that parents may feel compelled to work harder and earn more income when they need to support a larger family. We have a cross-sectional dataset which includes information on household earnings, parents' educational levels, parents' height, parents' religion, number of children and gender of each child. We are thinking of using the following empirical model:

$$\text{earnings} = \beta_0 + \beta_1\,\text{numberchildren} + \varepsilon$$

Discuss briefly whether the above model can provide unbiased estimates of the causal effect of family size on family earnings. Can you think of any potential instrumental variable?

Solutions:

Question 1 (4 points): The probability of rejecting a true null hypothesis.

Question 2 (4 points): Unbiasedness of an estimator means that this estimator will get the true population value right on average (over a very large number of samples).

Question 3 (4 points):

$$\hat\beta_1^{TSLS} = \frac{\sum_{i=1}^{n} (z_i - \bar{z})(y_i - \bar{y})}{\sum_{i=1}^{n} (z_i - \bar{z})(x_i - \bar{x})}$$

Question 4 (4 points):

$$y = \beta_0 + \delta_0\,d2 + \beta_1\,dT + \delta_1\,(d2 \times dT) + \varepsilon$$

where $d2 = 1$ for the second period and $dT = 1$ for the treatment group. The condition is that the treatment and the control group would have evolved in a similar way in the absence of the treatment.

Question 5 (4 points):

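Question 6 (4 points): A one-year increase in educ is associated with an approximately 7% increase in wage, holding exp constant. The 95% confidence interval for $\beta_{educ}$ is:

$$(.07 - .035 \times 1.96,\ .07 + .035 \times 1.96) = (.0014,\ .1386)$$

We can reject the hypothesis that $\beta_{educ} = 0$ as zero is not included in the confidence interval.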
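The interval arithmetic in Question 6 can be checked with a few lines of Python. This is only a minimal sketch: the function name `confidence_interval` is hypothetical, and 1.96 is the usual normal critical value for a two-sided 5% test.

```python
# Check of the 95% confidence interval for the coefficient on educ.
Z_975 = 1.96  # normal critical value for a two-sided test at the 5% level


def confidence_interval(estimate, std_error, critical=Z_975):
    """Return (lower, upper) for estimate +/- critical * std_error."""
    half_width = critical * std_error
    return estimate - half_width, estimate + half_width


# Point estimate and robust standard error taken from the regression output.
lower, upper = confidence_interval(0.07, 0.035)

# The null beta_educ = 0 is rejected when zero lies outside the interval.
rejects_zero = not (lower <= 0.0 <= upper)

print(f"95% CI: ({lower:.4f}, {upper:.4f}); reject beta_educ = 0: {rejects_zero}")
```

The lower bound is only just above zero, so the rejection is marginal at the 5% level.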
Question 7 (4 points): We have an upper bound, not a lower bound, of the true parameter. Under classical measurement error, error in wages (the explanatory variable) biases the OLS estimate towards zero, while error in turnover (the dependent variable) only makes the estimate less precise. Our best guess is that $\beta_{wages} < -.02$. So the true parameter is likely to be further away from zero than $-.02$ is.

Question 8 (4 points): We would prefer Estimator 2, since it is unbiased. Unbiasedness is more important than a small variance.

Question 9 (4 points): We could run the same regression using coverage of other types of scandals as dependent variables. If there is some other confounding factor which is correlated with government advertising and is a predictor of front page coverage of corruption stories, we will find that even coverage of corruption by non-governmental organisations is correlated with government advertising.

Question 10 (4 points): The US is an established democracy and therefore (a) parties are already well established and their platforms are stable and well understood by voters, (b) the media is fiercely competitive, so most voters can find a media outlet catering to their own pre-established views. This implies that, in the US, most Fox News viewers had probably already made up their minds before they started watching Fox News.

Question 11 (4 points): Yes, it will. First, companies make more profits when the business environment is good. Second, when the business environment is good CEOs have better alternative employment offers, and companies have to offer them higher salaries in order to keep them. The bias is the product of the effect of envir on lsalary (in a multiple regression model) and the effect of profits on envir. Both are positive, so the bias is positive.

Question 12 (4 points):
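$$\begin{aligned}
\frac{(R^2_{ur} - R^2_{r})/q}{(1 - R^2_{ur})/(n - k - 1)}
&= \frac{\left[\left(1 - \frac{SSR_{ur}}{SST}\right) - \left(1 - \frac{SSR_{r}}{SST}\right)\right]/q}{\left[1 - \left(1 - \frac{SSR_{ur}}{SST}\right)\right]/(n - k - 1)} \\
&= \frac{\left(\frac{SSR_{r} - SSR_{ur}}{SST}\right)/q}{\left(\frac{SSR_{ur}}{SST}\right)/(n - k - 1)} \\
&= \frac{(SSR_{r} - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)}
\end{aligned}$$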
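The equivalence of the two forms of the F-statistic can also be checked numerically. The sketch below uses made-up values for SST, the two SSRs, q, n and k (they come from no dataset), and the function names are hypothetical.

```python
def f_stat_from_ssr(ssr_r, ssr_ur, q, n, k):
    """F-statistic written with restricted and unrestricted SSR."""
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))


def f_stat_from_r2(r2_r, r2_ur, q, n, k):
    """F-statistic written with restricted and unrestricted R-squared."""
    return ((r2_ur - r2_r) / q) / ((1 - r2_ur) / (n - k - 1))


# Hypothetical numbers, chosen only for illustration.
sst, ssr_r, ssr_ur = 200.0, 120.0, 90.0
q, n, k = 2, 50, 4

# R-squared = 1 - SSR/SST; both models share the same dependent variable,
# hence the same SST.
r2_r = 1 - ssr_r / sst
r2_ur = 1 - ssr_ur / sst

f_ssr = f_stat_from_ssr(ssr_r, ssr_ur, q, n, k)
f_r2 = f_stat_from_r2(r2_r, r2_ur, q, n, k)

print(f_ssr, f_r2)  # the two values agree up to floating-point rounding
```

The check relies on both regressions having the same SST, which is exactly what lets SST cancel in the proof.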

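Question 13 (6 points): We are going to prove it now. With $\tilde\beta = \bar{y}/\bar{x}$ and the $x_i$ treated as fixed values:

$$\begin{aligned}
E(\tilde\beta) &= E\left(\frac{\bar{y}}{\bar{x}}\right) = \frac{1}{\bar{x}}E(\bar{y}) = \frac{1}{\bar{x}}E(\beta\bar{x} + \bar{\varepsilon}) = \frac{1}{\bar{x}}E(\beta\bar{x}) + \frac{1}{\bar{x}}E(\bar{\varepsilon}) \\
&= \beta + \frac{1}{\bar{x}}E\left(\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\right) = \beta + \frac{1}{n\bar{x}}\sum_{i=1}^{n}E(\varepsilon_i) = \beta
\end{aligned}$$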
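A small Monte Carlo exercise can illustrate the unbiasedness result of Question 13. It is a sketch under stated assumptions rather than part of the proof: the regressors are held fixed across samples, the errors are drawn as independent standard normals, and the true slope, sample size and number of replications are arbitrary illustrative choices.

```python
import random
import statistics


def ratio_of_means(xs, ys):
    """The estimator from Question 13: sample mean of y over sample mean of x."""
    return statistics.fmean(ys) / statistics.fmean(xs)


def simulate(beta=2.0, n=30, reps=20000, seed=0):
    """Average the estimator over many samples drawn with the same fixed x."""
    rng = random.Random(seed)
    xs = [1.0 + i for i in range(n)]  # fixed regressors, reused in every sample
    estimates = []
    for _ in range(reps):
        # y_i = beta * x_i + error_i, with mean-zero errors (assumed normal here)
        ys = [beta * x + rng.gauss(0.0, 1.0) for x in xs]
        estimates.append(ratio_of_means(xs, ys))
    return statistics.fmean(estimates)


mean_estimate = simulate()
print(f"average estimate over replications: {mean_estimate:.4f} (true beta = 2.0)")
```

The average of the estimates should sit very close to the true slope, as the proof predicts.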

Question 14 (6 points): From this regression it is difficult to tell. We should have run a regression of happiness on manager only. In the regression that we have run, the two independent variables are very strongly correlated with each other, which is likely to be the reason why the standard errors are so large. It is also possible that managers are happier because they are richer.

Question 15 (6 points): $\hat\beta_0 = 1.5$, $\hat\beta_1 = 0.5$, $\hat\beta_2 = .5$ and $\hat\beta_3 = 1$.


Question 16 (6 points): a. The coefficient on $\text{time}_t$ should be interpreted as the estimated constant change in drug usage over time, that is, the extra (average) drug usage resulting from the passing of an extra unit of time.

b. The second model is more flexible, as it does not impose a constant effect of time. Instead it allows each day to have its own intercept. However, it cannot be estimated: all municipalities increased their police force on the same date, so $\text{post}_t$ is perfectly collinear with the set of day dummies. We therefore have to use the first model.

Question 17 (6 points):

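$$\log(\text{earnings})_{it} = \alpha_i + \delta_t + \sum_{j=17}^{85} \beta_j\,\text{age}^{j}_{it} + \varepsilon_{it}$$

where $\alpha_i$ and $\delta_t$ are individual and time fixed effects, $\text{age}^{j}_{it} = 1$ if individual $i$ is $j$ years old at time $t$, and $\beta_j$ is the difference between being $j$ years old and being 16 years old.

For those years the estimated age effects will be very imprecise. We can see this in the formula for the variance:

$$\text{var}(\hat\beta_j) = \frac{\sigma^2}{\sum \left(\text{age}^{j} - \overline{\text{age}^{j}}\right)^2 \left(1 - R^2_{\text{age}^{j}}\right)}$$

When very few observations have $\text{age}^{j} = 1$, the sum in the denominator is small and the variance is large.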
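The variance formula for an age dummy can be made concrete with a short calculation. For a dummy equal to 1 for m out of n observations, the sum of squared deviations from its mean is m(1 - m/n), so small age cells give large variances. The sketch below is illustrative only: the sample size, cell counts and error variance are hypothetical, and the R-squared of the dummy on the other regressors is set to zero for simplicity.

```python
def dummy_coef_variance(sigma2, n_total, n_in_cell, r2_j=0.0):
    """Variance of the coefficient on a dummy with n_in_cell ones out of n_total.

    Uses var = sigma2 / (SST_j * (1 - R2_j)), where for a dummy variable
    SST_j = m * (1 - m / n), with m = n_in_cell and n = n_total.
    """
    if not 0 < n_in_cell < n_total:
        raise ValueError("the dummy must vary in the sample")
    sst_j = n_in_cell * (1 - n_in_cell / n_total)
    return sigma2 / (sst_j * (1 - r2_j))


# Hypothetical sample: 10,000 worker-year observations, error variance 1.
var_age_40 = dummy_coef_variance(sigma2=1.0, n_total=10_000, n_in_cell=300)
var_age_80 = dummy_coef_variance(sigma2=1.0, n_total=10_000, n_in_cell=5)

print(f"variance, well-populated age cell: {var_age_40:.4f}")
print(f"variance, sparse age cell:         {var_age_80:.4f}")
```

With these made-up counts the sparse cell has a variance many times larger than the well-populated one, which is the imprecision described for ages outside the 25-64 interval.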

Question 18 (6 points): No. More motivated students are more likely to accept the tuition. As a result, the group of students that eventually receive tuition is likely to have high grades for reasons other than the tuition. The group of students without the tuition includes the unmotivated students that rejected the tuition (as well as the group not offered tuition). Therefore, there is omitted variable bias. Our discussion suggests that the bias will be positive.

Question 19 (6 points): We should write a diff-in-diff model:

$$\text{perf}_{it} = \beta_0 + \delta_0\,2013_t + \beta_1\,\text{even}_i + \delta_1\,(2013_t \times \text{even}_i) + \varepsilon_{it}$$

where $2013_t = 1$ if the student belongs to the 2013 cross section and $\text{even}_i = 1$ if the individual has an even enrollment number. If we had only the 2013 cross section, we should run the following model:

$$\text{perf}_i = \beta_0 + \beta_1\,\text{even}_i + \varepsilon_i$$

With the 2012 cross section, it isn't possible to test the hypothesis.

Question 20 (6 points): Strictly speaking Bill Gates is not an outlying observation. We can see that dropping him from the sample has not changed the point estimate at all. However, the standard error has increased dramatically. The data probably looks something like this:

Question 21 (6 points): Desk number is allocated randomly, so there is no reason to suspect omitted variable bias or reverse causality. However, the sample is not random. In particular, there is sample selection along the dependent variable. If $\beta_1 < 0$ (as posited), the existence of sample selection will create a positive bias in the estimation.

Question 22 (6 points): We will not be able to obtain unbiased estimates. Richer families can afford more children, so there is a clear reverse causality effect. Differential preferences can affect both family size and family earnings, so there is also scope for omitted variable bias. A good instrument is the gender of the first child, which is a completely random variable. If the first child is male many families will stop there, and family size will be smaller than if the first child is female.
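The instrumental variable idea in Question 22 can be sketched with the single-instrument TSLS slope: the sum of cross-deviations of instrument and outcome divided by that of instrument and regressor. The data below are made up purely to show the mechanics and say nothing about any real population; the function and variable names are hypothetical.

```python
import statistics


def iv_slope(z, x, y):
    """Single-instrument IV (TSLS) slope: sum (z - zbar)(y - ybar) / sum (z - zbar)(x - xbar)."""
    z_bar, x_bar, y_bar = statistics.fmean(z), statistics.fmean(x), statistics.fmean(y)
    numerator = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    denominator = sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z, x))
    if denominator == 0:
        raise ValueError("instrument is uncorrelated with the regressor in this sample")
    return numerator / denominator


# Made-up illustrative data for six households.
first_child_female = [0, 0, 0, 1, 1, 1]               # instrument
number_children = [1, 2, 1, 2, 3, 3]                  # endogenous regressor
earnings = [30.0, 34.0, 32.0, 35.0, 38.0, 41.0]       # outcome

slope = iv_slope(first_child_female, number_children, earnings)
print(f"IV estimate of the effect of one more child on earnings: {slope}")
```

The estimate is only meaningful if the instrument moves family size (a non-zero denominator) and affects earnings through family size alone.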
