
Cornell University                                      Professor Doug Miller
Department of Policy Analysis and Management            Spring Quarter 2018

PAM 3100
Multiple Regression Analysis
PROBLEM SET 5
Due 11:55pm, Thursday March 15 via Blackboard

• Keep answers as brief as possible, and include key Stata output (charts and descriptive statistics) with
your answers. Be sure to label your charts and output clearly, and to indicate which question each
chart is intended to answer.
• Turn in the main problem set as one file only, not several documents. Clearly label the problem set
file.
o However, also turn in related .log and .do files (as relevant) as separate uploads.
• Include relevant Stata commands and output (such as tables or “summarize” output) in your answers,
so that we know what commands led to what results.
• When cutting and pasting Stata results into your Word document, use “Courier” or “Courier New” or
other fonts that preserve the neat formatting in Stata.

• Build a .do file to answer the problems. In the .do file, create a log file. This will make it
easy to re-run if needed, and you will build on this in subsequent problem sets. Also, you
can cut-and-paste from the log file into your problem set write-up.
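
As a rough starting point, a .do file that opens and closes its own log might look like the sketch below. The file names ps5.do and ps5.log are placeholders chosen for illustration, and the dataset is assumed to sit in Stata's current working directory.

```stata
* ps5.do: placeholder name for this problem set's .do file
capture log close                  // close any log left open by an earlier run
log using ps5.log, replace text    // plain-text log; overwrites the previous copy

use "CPS-ASEC-2017.dta", clear     // dataset named in Problem 1

* ... commands for each problem go here ...

log close
```

The “capture” prefix keeps the first line from stopping the .do file when no log is open, which is what makes the file safe to re-run.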

1. Taking a sample of data and examining how estimates compare with the “population”. In this
problem we will treat the dataset CPS-ASEC-2017.dta as if it were the population of interest, and
imagine what we might estimate if we only had access to small samples from that dataset.

a. First, run a regression of WAGE on EDUCATION, and report the results of the
regression line.
b. Next, create a dataset which has only the following two variables: WAGE and
EDUCATION. (Use Stata’s “keep” command.) Then save this dataset with its own
name. For example, I might “save ps5prob1.dta , replace”

c. Next load up the data and take a random sample of 50 observations. You can do
this with the commands:
i. “use ps5prob1.dta , clear”
ii. “bsample 50”
Summarize the new dataset to confirm you have 50 observations. Run a regression
of WAGE on EDUCATION. How do the results compare to the regression from the
main sample in question a?

d. Next, do step 1.c three more times. Each time you draw a different sample of data,
and get different regression output. Use this output to fill in the rows
corresponding to “Sample 1” through “Sample 4” of the Table below. (Fill in the
“population” slope row with results from problem 1.a):
              N          Population Slope
Population    61,305

Sample    N      Estimated Slope    Estimated Std.     95% CI for slope
                                    Error of Slope
1         50
2         50
3         50
4         50
5         200
6         200
7         200
8         200

e. Next, take 4 samples of size 200. In each sample, run the same regression (using
EDUCATION to predict WAGE), and fill in the remaining rows of the table.
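
The repeated draws in parts c through e can be sketched as a loop. The example below is only an illustration on made-up data, not the CPS file: the variables x and y, the population size of 10,000, and the coefficients are all invented for this sketch. Note that “bsample” draws observations with replacement.

```stata
* Toy illustration with made-up data (not the CPS file).
clear
set seed 20180315                  // arbitrary; set once so all draws can be reproduced
set obs 10000                      // invented "population" size
generate x = rnormal(12, 3)
generate y = 5 + 2*x + rnormal(0, 10)
regress y x                        // slope in the full toy "population"
tempfile population
save `population'

foreach n in 50 200 {
    forvalues s = 1/4 {
        use `population', clear    // reload the full data before each draw
        bsample `n'                // n observations, drawn with replacement
        quietly regress y x
        display "N = `n', draw `s': slope = " %6.3f _b[x] "   SE = " %6.3f _se[x]
    }
}
```

The seed is set once, before the loop. Setting it again inside the loop would make every draw identical.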

f. For the N=50 samples, how do the “estimated” slopes compare to the “true”
population slope? How much variability do they have? (give a statistical measure of
variability, such as the standard deviation)

g. Are the results similar or different for the N = 200 samples, compared to the N=50
samples?

h. If you were to draw a sample with size 800, what is your best guess for what the
estimated slope would be? Why? What is your best guess for what the estimated
standard error would be? Why? (Hint: use the formulas relating to the distribution
of the least squares estimates.)
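
For reference, the textbook results that the hint most likely has in mind are sketched below. This assumes the standard simple-regression assumptions with constant error variance; the symbols b_2 for the estimated slope, sigma for the error standard deviation, and s_x for the sample standard deviation of EDUCATION are notation chosen here, not taken from the problem set.

```latex
E(b_2) = \beta_2,
\qquad
\operatorname{Var}(b_2) = \frac{\sigma^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2}
                        = \frac{\sigma^2}{(N-1)\, s_x^2},
\qquad
\operatorname{se}(b_2) = \frac{\hat{\sigma}}{\sqrt{N-1}\; s_x}
```

Because sigma and s_x do not depend on the sample size, the standard error shrinks roughly in proportion to 1/sqrt(N).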

2. Dummy variables in bivariate regressions. For this problem, return to the main dataset
CPS-ASEC-2017.dta.

a. Summarize the dummy variable for FEMALE. Describe what the variable means
and what the average tells us.

b. Regress wage on this dummy. What do the regression results tell us?

c. Write down the “population line” model that corresponds to this regression. What
must be true of beta_2 (the “slope” on FEMALE) in this model if men and women
are paid the same? Test this hypothesis. Clearly state your hypothesis, show your
steps, and clearly state and interpret your conclusion. (You can implement the test
however you would like.)

d. Use the regression output to determine the mean wages for men. Check this against
a direct measure of the average.

e. Use the regression output to determine the mean wages for women. Check this
against a direct measure of the average.
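
The link between a dummy-variable regression and group means, which parts d and e rely on, can be checked on any dataset. The sketch below uses the “auto” example data shipped with Stata rather than the CPS file, with “foreign” standing in for the 0/1 dummy and “price” for the outcome.

```stata
sysuse auto, clear                  // example dataset shipped with Stata
regress price foreign               // foreign is a 0/1 dummy
display _b[_cons]                   // mean price for the foreign == 0 group
lincom _cons + foreign              // mean price for the foreign == 1 group, with SE and CI
tabstat price, by(foreign) statistics(mean n)   // direct group means, for comparison
```

The intercept matches the mean of the group coded 0, and the intercept plus the slope matches the mean of the group coded 1.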

f. What does the regression let us do, that we could not easily do with simple
summarize commands?

g. Some researchers suppose that the observed differences are due to differences in
education, age, etc. Select a subsample of those with exactly a BA/BS degree, and
who are aged 25-35. (Use Stata’s “keep if … ” command.) How does the wage gap
in this sample compare with the overall wage gap?
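
As a sketch of how “keep if” combines two conditions, the example below again uses Stata's “auto” example data rather than the CPS file. The variables and cutoffs are stand-ins; the actual education and age variable names and codes have to be looked up in the CPS dataset itself (for example with “describe” and “tabulate”).

```stata
sysuse auto, clear
count                                          // observations before the restriction
keep if foreign == 0 & inrange(mpg, 20, 30)    // both conditions must hold; inrange() includes the endpoints
count                                          // observations after the restriction
summarize mpg foreign                          // confirm the restriction worked
```

“keep if” permanently drops the other observations from memory, so reload the full dataset before moving on to later problems.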

h. In part (g) we were able to “hold constant” education and age by restricting our
sample based on these two variables. What other variables would you like to “hold
constant” when examining the differences in wages for men and women?

3. This next problem is inspired by the paper “Children and their parents’ labor supply:
evidence from exogenous variation in Family Size”, by Joshua Angrist and William Evans,
American Economic Review 1998, which you can download from
http://www.jstor.org/stable/116844 . You do not need to read the article – this link is just
in case you are curious. (You may need to be on campus, or to use the library’s proxy server,
to access the article.) In this problem we will examine whether we can detect a preference
for child sex composition, and in particular if we can detect whether parents prefer “at least
one of each” to only boys or only girls. This problem will also walk us through the steps of
“data cleaning” – going from raw micro data that we can access off of the web to a working
dataset we can analyze.

3.1 Download the data set “CA_ACS_2010_2013.dta” from the class resources tab. This is a
large data set, containing data on about 1,469,000 individuals in California from the years 2010-2013.
You should also download the codebook (“usa_00026.cbk.txt”) (you can open the codebook in
Word or other word processor).

3.2 Next, we need to create a clean dataset. For this problem set exercise, we will only use
households with one adult married man and one adult married woman and two or more
children in them.

We can identify these households using Stata’s egen command. We will use the fact that each
household is given a unique identifying number in the census, in the variable “serial”. This
number doesn’t have any specific meaning, but all individuals who share that number live in
the same household.

The following will generate a variable which says how many married men there are in the
household.
“egen num_married_men = sum( (sex==1)&(marst==1)&(age >=18) ) , by(serial)”

(Modify this command to additionally create the number of married women in the
household.)

The next line will create a variable which says how many kids there are in the household:
“egen num_kids = sum((age < 18)) , by(serial)”

Next we will keep only those individuals with one married man, one married woman, and
two or more kids:

“keep if num_married_men == 1 & num_married_women == 1 & num_kids >= 2”
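
To see what the by-household counts do, it can help to run them on a tiny made-up dataset first. The six rows below are invented for illustration and follow the coding used in the commands above (sex==1 for men, marst==1 for married). The sketch uses egen's total() function, which is the currently documented name; the sum() spelling in the commands above is understood to be an older name for the same function. The count of married women is left out on purpose.

```stata
* Toy households, invented for illustration (not the ACS file).
clear
input serial sex marst age
1 1 1 40
1 2 1 38
1 1 6 10
1 2 6 7
2 2 4 35
2 1 6 12
end

egen num_married_men = total((sex==1) & (marst==1) & (age>=18)), by(serial)
egen num_kids        = total(age < 18), by(serial)
list, sepby(serial)          // every row in a household carries the household's counts

keep if num_married_men == 1 & num_kids >= 2
list, sepby(serial)          // only household 1 remains
```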

Next, we want to identify, for each household, the sex of the oldest child and the sex of the
next oldest child. We are going to need a few tricks to do this. The following block of code
should help you out. Try to figure out what each line in the code is doing, and check your
intuition by looking at the results it creates in the data set using the data editor:

gen kid = age < 18
sort serial kid age
qui by serial kid: generate i_am_oldest_kid = (kid == 1) & (_n==_N)
qui by serial kid: generate i_am_2nd_oldest_kid = (kid == 1) & (_n==_N - 1)

egen oldest_is_boy = max(i_am_oldest_kid * sex == 1) , by(serial)
egen second_is_boy = max(i_am_2nd_oldest_kid * sex == 1) , by(serial)

generate two_boys = oldest_is_boy == 1 & second_is_boy == 1
generate two_girls = oldest_is_boy == 0 & second_is_boy == 0
generate boy_and_girl = (oldest_is_boy == 0 & second_is_boy == 1) | ///
    (oldest_is_boy == 1 & second_is_boy == 0)

(Note that the last command is a single Stata command. In a .do file, the “///” at the end of its
first line continues it onto the next line; if you type it in the Command window instead, enter it
all on one line.)
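
The _n and _N expressions in the block above are easier to follow on a small example. The sketch below uses five invented rows (not the ACS file) and shows only what _n and _N evaluate to within each “by” group.

```stata
* Toy rows, invented for illustration.
clear
input serial age
1 40
1 10
1 7
2 35
2 12
end

sort serial age
by serial: generate position   = _n           // 1, 2, ... within each household
by serial: generate group_size = _N           // number of rows in the household
by serial: generate is_last    = (_n == _N)   // 1 only for the last row of each household
list, sepby(serial)
```

Because the data are sorted by age within household, the last row of each group is the oldest person in that group.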

Next create a variable “third_child” that = 1 if there is a third (or higher) child in the family,
and =0 otherwise. This variable will be our main outcome variable.

Finally, we are ready to limit our dataset to one observation per family. To do this, let’s just
keep the observation for adult women:

“keep if sex == 2 & age >= 18”

3.3 Use “Summarize” to give means and standard deviations for the important variables
in your dataset.

3.4 Now, use the data (and relevant Stata Commands) to fill in the following Table:

                             Two Boys     Two Girls     A boy and a girl
Mean(3 or more kids)
 = Prob(3 or more kids)
Std. error of mean
95% Confidence Interval
Num. Obs
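
One possible way to get group means, standard errors, confidence intervals, and counts for a 0/1 outcome is sketched below on Stata's “auto” example data rather than the ACS file. The outcome high_mpg is an invented dummy and “foreign” stands in for the grouping variable; other commands would work equally well.

```stata
sysuse auto, clear
generate byte high_mpg = mpg > 25     // invented 0/1 outcome for illustration
mean high_mpg, over(foreign)          // group means with std. errors and 95% CIs
tabstat high_mpg, by(foreign) statistics(mean semean n)   // means, std. errors, counts
```

The mean of a 0/1 variable is the share of observations equal to 1, which is why the table treats the mean and the probability as the same quantity.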

3.5 From this Table, what can you conclude about parents’ preferences for the gender of
their children? Why?

3.6 Thinking about differences in the results across the columns, do you think they could be
due to chance? Why or why not?

4. [OPTIONAL] Replicate Problem 3 for another sample. Go to “IPUMS USA”, get an
account, and learn how to download US Data from another period or state, or instead
download data from another country.
