Data Science
Bias and Sampling
Hanspeter Pfister & Joe Blitzstein
pfister@seas.harvard.edu / blitzstein@stat.harvard.edu
This Week
HW1 due tonight at 11:59 pm
professors 66.6
clocksmiths 55.3
locksmiths 47.2
students 20.2
[Figure: time series by search platform, weekly ticks from 01jun2012 to 14sep2012. Series: MSN Paid, MSN Natural, Goog Paid, Goog Natural.]
result from
To quantify this substitution, Table 1 shows estimates from a simple pre-post comparison as well as a simple difference-in-differences across search platforms. In the pre-post analysis we regress the log of total daily clicks from MSN to eBay on an indicator for whether days
Blake-Nosko-Tadelis (2013)
http://conference.nber.org/confer/2013/EoDs13/Tadelis.pdf
(Challenger Disaster) Wainer (2000), Visual Revelations
Why sample from a population?
often the only feasible option
but it's useful to think about the question:
What would you do if you had all the data?
also often important for computational reasons
There are many sampling schemes...
simple random sampling
stratified sampling
cluster sampling
snowball sampling
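The first three schemes can be sketched in a few lines of Python. This is a minimal illustration, not a survey-grade implementation: the population of 100 labelled units, the even/odd strata, the blocks-of-ten clusters, and all function names are hypothetical choices made for the example.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n units without replacement; every subset of size n is equally likely."""
    return rng.sample(population, n)

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Split the population into strata, then draw an SRS inside each stratum."""
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for key in sorted(strata):
        sample.extend(rng.sample(strata[key], n_per_stratum))
    return sample

def cluster_sample(population, cluster_of, n_clusters, rng):
    """Pick whole clusters at random and keep every unit in the chosen clusters."""
    clusters = {}
    for unit in population:
        clusters.setdefault(cluster_of(unit), []).append(unit)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for key in chosen for unit in clusters[key]]

# Hypothetical population: 100 units labelled 0..99.
rng = random.Random(0)
population = list(range(100))
srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(population, lambda u: u % 2, 5, rng)  # strata: even / odd
clus = cluster_sample(population, lambda u: u // 10, 2, rng)    # clusters: blocks of 10
```

Stratified sampling guarantees representation from every stratum, while cluster sampling trades some precision for convenience, since only the chosen clusters need to be visited.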
Absolute vs. relative
In simple random sampling, which matters more: the relative
sample size, or the absolute sample size?
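One way to explore this question numerically is the standard finite-population formula for the standard error of a sample proportion under simple random sampling without replacement. The sketch below assumes a true proportion of 0.5 and hypothetical population sizes; the function name is made up for the example.

```python
import math

def srs_standard_error(p, n, N):
    """Standard error of a sample proportion under SRS without replacement.

    p: true population proportion, n: sample size, N: population size.
    The factor (N - n) / (N - 1) is the finite population correction.
    """
    fpc = (N - n) / (N - 1)
    return math.sqrt(p * (1 - p) / n * fpc)

p = 0.5

# Same absolute sample size (n = 1000), very different population sizes.
se_abs_small_pop = srs_standard_error(p, 1000, 10_000)
se_abs_large_pop = srs_standard_error(p, 1000, 1_000_000)

# Same relative sample size (1% of the population), very different n.
se_rel_small_pop = srs_standard_error(p, 100, 10_000)
se_rel_large_pop = srs_standard_error(p, 10_000, 1_000_000)

print(se_abs_small_pop, se_abs_large_pop)
print(se_rel_small_pop, se_rel_large_pop)
```

With the absolute sample size fixed, the two standard errors are nearly identical. With the sampling fraction fixed, they differ by roughly a factor of ten. Unless the sample is a large share of the population, the absolute sample size is what drives precision.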
Figure 2: Two successive stages of k = 3 snowball sampling. Nodes have been labelled with the stage number of when they first appeared in the sample. Node 0 was the original node, acquired via Bernoulli sampling.
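The staged procedure in the figure can be sketched as follows. This is one plausible reading of k-snowball sampling, not necessarily the exact variant pictured: each newly sampled node names up to k of its neighbours at random, and any neighbour not yet in the sample joins at the next stage. The ring-shaped graph and the fixed seed node are hypothetical; in the figure the seed was itself chosen by Bernoulli sampling.

```python
import random

def snowball_sample(graph, seeds, k, stages, rng):
    """Return {node: stage at which the node first entered the sample}."""
    stage_of = {s: 0 for s in seeds}
    frontier = list(seeds)
    for stage in range(1, stages + 1):
        next_frontier = []
        for node in frontier:
            neighbours = sorted(graph[node])
            # Each node names at most k of its neighbours, chosen at random.
            named = rng.sample(neighbours, min(k, len(neighbours)))
            for other in named:
                if other not in stage_of:
                    stage_of[other] = stage
                    next_frontier.append(other)
        frontier = next_frontier
    return stage_of

# Hypothetical graph: 12 nodes on a ring, each linked to the 2 nearest nodes on both sides.
graph = {i: {(i + d) % 12 for d in (-2, -1, 1, 2)} for i in range(12)}
rng = random.Random(0)
stage_of = snowball_sample(graph, seeds=[0], k=3, stages=2, rng=rng)
```

Because inclusion depends on who is connected to whom, well-connected nodes are more likely to be reached, which is one source of bias in snowball samples.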
Bias of an Estimator
The bias of an estimator is how far off it is on average:
\[ \mathrm{bias}(\hat{\theta}) = E(\hat{\theta}) - \theta \]
http://scott.fortmann-roe.com/docs/BiasVariance.html
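Bias can be approximated by simulation: draw many samples, apply the estimator to each, and compare the average estimate with the true value. The sketch below uses an illustrative case that is not from the slides, namely estimating the variance of a standard Normal from samples of size 5 with divisors n and n - 1. All names and settings are assumptions made for the example.

```python
import random

def estimate_bias(estimator, true_value, draw_sample, reps, rng):
    """Monte Carlo approximation of E(estimator) - true_value."""
    total = 0.0
    for _ in range(reps):
        total += estimator(draw_sample(rng))
    return total / reps - true_value

def var_divide_by_n(xs):
    """Sample variance with divisor n (biased downward)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def var_divide_by_n_minus_1(xs):
    """Sample variance with divisor n - 1 (unbiased)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def draw(rng):
    """Hypothetical data source: 5 independent standard Normal draws."""
    return [rng.gauss(0.0, 1.0) for _ in range(5)]

rng = random.Random(1)
bias_n = estimate_bias(var_divide_by_n, 1.0, draw, 20000, rng)
bias_n1 = estimate_bias(var_divide_by_n_minus_1, 1.0, draw, 20000, rng)
print(bias_n, bias_n1)
```

In theory the divisor-n estimator has bias -sigma^2 / n, which is -0.2 here, while the divisor-(n - 1) estimator has bias 0. The simulated values should land close to those targets, up to Monte Carlo noise.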
Unbiased Estimation: Poisson Example
\( X \sim \mathrm{Pois}(\lambda) \)
Goal: estimate \( e^{-2\lambda} \)
sensible?
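The estimator shown on the slide did not survive in this text. The standard textbook choice for estimating \( e^{-2\lambda} \) from a single \( X \sim \mathrm{Pois}(\lambda) \) is \( (-1)^X \), and the derivation below assumes that is what was meant. It checks unbiasedness directly from the Poisson probability mass function.

```latex
\begin{aligned}
E\left[(-1)^X\right]
  &= \sum_{k=0}^{\infty} (-1)^k \, \frac{e^{-\lambda}\lambda^k}{k!} \\
  &= e^{-\lambda} \sum_{k=0}^{\infty} \frac{(-\lambda)^k}{k!} \\
  &= e^{-\lambda}\, e^{-\lambda} = e^{-2\lambda}.
\end{aligned}
```

So \( (-1)^X \) is unbiased, yet it only ever takes the values 1 and -1, while the target \( e^{-2\lambda} \) lies strictly between 0 and 1. Unbiasedness alone does not make an estimator sensible.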
Basu's Elephant
\[ \hat{\theta} = \sum_{i=1}^{k} w_i \hat{\theta}_i, \qquad w_i \propto \frac{1}{\mathrm{Var}(\hat{\theta}_i)} \]
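The weighting rule above, with weights proportional to inverse variances and normalised to sum to 1, can be sketched as follows. The example assumes the individual estimators are independent and unbiased with known variances; the function name and the two input estimates are hypothetical.

```python
def inverse_variance_combine(estimates, variances):
    """Combine estimates with weights proportional to 1 / variance.

    Returns the combined estimate, the normalised weights, and the variance
    of the combined estimate (valid only if the estimators are independent).
    """
    raw = [1.0 / v for v in variances]
    total = sum(raw)
    weights = [r / total for r in raw]
    combined = sum(w * e for w, e in zip(weights, estimates))
    combined_var = 1.0 / total
    return combined, weights, combined_var

# Hypothetical inputs: two estimates of the same quantity with variances 1 and 4.
combined, weights, combined_var = inverse_variance_combine([10.0, 12.0], [1.0, 4.0])
print(combined, weights, combined_var)
```

The more precise estimate receives the larger weight, and under the independence assumption the combined variance is smaller than either individual variance.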