CS109/Stat121/AC209/E-109

Data Science
Bias and Sampling
Hanspeter Pfister & Joe Blitzstein
pfister@seas.harvard.edu / blitzstein@stat.harvard.edu
This Week
HW1 due tonight at 11:59 pm

HW2 will be posted by tonight; start soon!

Friday lab 10-11:30 am in MD G115

Pandas with Rahul, Brandon, and Steffen
Some Forms of Bias
selection bias
publication bias (file drawer problem)
censoring bias
length bias
sampling bias
Longevity Study

Profession        Average Longevity (years)
chocolate maker   73.6
professors        66.6
clocksmiths       55.3
locksmiths        47.2
students          20.2

Sources: Lombard (1835), Wainer (1999), Stigler (2002)

[Figure 2: Brand Keyword Click Substitution. Panels: (a) MSN Test, (b) Google Test. Horizontal axis: date, weekly ticks from 01jan2012 to 24jun2012 and from 01jun2012 to 14sep2012. Legend: MSN Paid, MSN Natural, Goog Natural, Goog Paid, Goog Natural.]

MSN and Google click traffic is shown for two events where paid search was suspended (Left)
and suspended and resumed (Right).

Result from Blake-Nosko-Tadelis (2013)
http://conference.nber.org/confer/2013/EoDs13/Tadelis.pdf

To quantify this substitution, Table 1 shows estimates from a simple pre-post comparison
as well as a simple difference-in-differences across search platforms. In the pre-post analysis
we regress the log of total daily clicks from MSN to eBay on an indicator for whether days
(Challenger Disaster) Wainer (2000), Visual Revelations
Why sample from a population?
often the only feasible option
but it's useful to think about the question:
What would you do if you had all the data?
also often important for computational reasons
There are many sampling schemes...
simple random sampling
stratified sampling
cluster sampling
snowball sampling
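
The first three schemes in the list can be sketched in a few lines of Python. This is a minimal illustration rather than a survey-grade implementation: the toy population of 100 unit ids, the grouping rule `group_of` (id divided by 10), and all function names are assumptions made for the example.

```python
import random
from collections import defaultdict

def simple_random_sample(units, n, rng):
    """Draw n units without replacement; every subset of size n is equally likely."""
    return rng.sample(units, n)

def stratified_sample(units, stratum_of, n_per_stratum, rng):
    """Draw a simple random sample of fixed size inside every stratum."""
    strata = defaultdict(list)
    for u in units:
        strata[stratum_of(u)].append(u)
    sample = []
    for key in sorted(strata):
        sample.extend(rng.sample(strata[key], n_per_stratum))
    return sample

def cluster_sample(units, cluster_of, n_clusters, rng):
    """Pick whole clusters at random and keep every unit inside them."""
    clusters = defaultdict(list)
    for u in units:
        clusters[cluster_of(u)].append(u)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [u for key in chosen for u in clusters[key]]

def group_of(u):
    """Hypothetical grouping: ten groups of ten consecutive ids."""
    return u // 10

rng = random.Random(0)
population = list(range(100))

srs = simple_random_sample(population, 20, rng)
strat = stratified_sample(population, group_of, 2, rng)
clus = cluster_sample(population, group_of, 2, rng)
```

All three samples have 20 units, but they differ in how those units are spread: the stratified sample takes exactly 2 from every group, while the cluster sample takes all 10 from each of 2 groups.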
Absolute vs. relative
In simple random sampling, which matters more: the relative
sample size, or the absolute sample size?

For example, how much bigger a sample should you collect

in China vs. in the US, to get the same standard error?
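
One way to explore the question numerically is the standard formula for the standard error of a sample mean under simple random sampling without replacement, $\mathrm{SE} = \sigma \sqrt{(1 - n/N)/n}$, where $1 - n/N$ is the finite population correction. The sketch below assumes the same population standard deviation in both countries and uses rough, illustrative population sizes; neither assumption comes from the slides.

```python
import math

def srs_standard_error(sigma, n, N):
    """SE of the sample mean under simple random sampling without replacement.

    sigma is the population SD and (1 - n/N) is the finite population correction.
    """
    return sigma * math.sqrt((1 - n / N) / n)

def sample_size_for_se(sigma, target_se, N):
    """Smallest n whose SE is at most target_se (the SE formula solved for n)."""
    n0 = (sigma / target_se) ** 2          # answer for an infinite population
    return math.ceil(n0 / (1 + n0 / N))

# Rough, illustrative population sizes (assumptions, not figures from the lecture)
N_US = 330_000_000
N_CHINA = 1_400_000_000

n_us = sample_size_for_se(sigma=1.0, target_se=0.01, N=N_US)
n_china = sample_size_for_se(sigma=1.0, target_se=0.01, N=N_CHINA)
```

Under these assumptions both sample sizes come out at about 10,000: when n is a tiny fraction of N, the correction factor is essentially 1 and only the absolute sample size matters.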
Snowball Sampling (Link-Tracing)

[Figure 2: Two successive stages of k = 3 snowball sampling; panels (a) Stage 1 and (b) Stage 2. Nodes have been labelled with the stage number of when they first appeared in the sample. Node 0 was the original node, acquired via Bernoulli sampling.]
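
A minimal sketch of link-tracing follows, assuming that each newly sampled node names up to k of its neighbours at random and that a node is labelled with the stage at which it first appears. The toy graph and the function name `snowball_sample` are hypothetical; real designs differ in how contacts are named and followed.

```python
import random

def snowball_sample(graph, seeds, k, stages, rng):
    """Link-tracing sample: returns {node: stage at which it first appeared}.

    graph  : dict mapping node -> list of neighbouring nodes
    seeds  : initial nodes (stage 0)
    k      : maximum number of contacts each newly sampled node names
    stages : number of waves to run after the seeds
    """
    stage_of = {s: 0 for s in seeds}
    frontier = list(seeds)
    for stage in range(1, stages + 1):
        next_frontier = []
        for node in frontier:
            neighbours = graph[node]
            named = rng.sample(neighbours, min(k, len(neighbours)))
            for other in named:
                if other not in stage_of:      # label only on first appearance
                    stage_of[other] = stage
                    next_frontier.append(other)
        frontier = next_frontier
    return stage_of

# Hypothetical toy graph: a ring of 12 nodes, each also linked to the opposite node
graph = {i: [(i - 1) % 12, (i + 1) % 12, (i + 6) % 12] for i in range(12)}
labels = snowball_sample(graph, seeds=[0], k=3, stages=2, rng=random.Random(1))
```

Because nodes enter the sample through their links, well-connected nodes are more likely to be reached, which is one reason snowball samples need careful handling before being treated as representative.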
Bias of an Estimator
The bias of an estimator $\hat{\theta}$ is how far off it is on average:

$$\mathrm{bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$$

So why not just subtract off the bias?
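
As a concrete case, the sketch below approximates the bias of the divide-by-n variance estimator by simulation. The choice of estimator, the normal data, and the sample size n = 5 are illustrative assumptions, not taken from the slides. In this example the bias is $-\sigma^2/n$, which depends on the unknown $\sigma^2$, so it cannot literally be subtracted; it can only be removed here by rescaling.

```python
import random

def var_divide_by_n(xs):
    """Plug-in variance estimator: average squared deviation from the sample mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def var_rescaled(xs):
    """Same estimator multiplied by n / (n - 1)."""
    return var_divide_by_n(xs) * len(xs) / (len(xs) - 1)

def estimate_bias(estimator, true_value, draw_sample, reps, rng):
    """Monte Carlo approximation of E(estimator) - true_value."""
    total = 0.0
    for _ in range(reps):
        total += estimator(draw_sample(rng))
    return total / reps - true_value

n, sigma2 = 5, 1.0

def draw(rng):
    """One sample of n independent standard normal values."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

bias_n = estimate_bias(var_divide_by_n, sigma2, draw, 20000, random.Random(0))
bias_rescaled = estimate_bias(var_rescaled, sigma2, draw, 20000, random.Random(0))
```

The simulated bias of the first estimator should land near the theoretical value of -0.2, and that of the rescaled version near 0, up to Monte Carlo noise.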

Bias-Variance Tradeoff
one form: $\mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + \mathrm{bias}^2(\hat{\theta})$
often a little bit of bias can make it
possible to have much lower MSE

http://scott.fortmann-roe.com/docs/BiasVariance.html
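
A small worked example of the decomposition, under assumed numbers: estimate a mean $\mu$ by $c\bar{X}$, a shrunken sample mean. Its variance is $c^2\sigma^2/n$ and its bias is $(c-1)\mu$. The values $\mu = 1$, $\sigma^2 = 4$, $n = 4$ and $c = 0.5$ are chosen purely for illustration.

```python
def mse_of_scaled_mean(c, mu, sigma2, n):
    """Return (MSE, variance, bias) of the estimator c * sample_mean for mu."""
    variance = c ** 2 * sigma2 / n
    bias = (c - 1) * mu
    return variance + bias ** 2, variance, bias

mu, sigma2, n = 1.0, 4.0, 4      # illustrative values only

mse_unbiased, var_unbiased, bias_unbiased = mse_of_scaled_mean(1.0, mu, sigma2, n)
mse_shrunk, var_shrunk, bias_shrunk = mse_of_scaled_mean(0.5, mu, sigma2, n)
```

Here the unbiased choice c = 1 has MSE 1.0, while the biased choice c = 0.5 has variance 0.25 and squared bias 0.25, for an MSE of 0.5. The best c depends on the unknown $\mu$, so this shows that the tradeoff exists rather than giving a recipe.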
Unbiased Estimation: Poisson Example
$X \sim \mathrm{Pois}(\lambda)$

Goal: estimate $e^{-2\lambda}$

$(-1)^X$ is the best (and only) unbiased estimator of $e^{-2\lambda}$

sensible?
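
The unbiasedness claim can be checked numerically by summing the series $E[(-1)^X] = \sum_k (-1)^k e^{-\lambda}\lambda^k/k!$. The sketch below does this for one assumed value, $\lambda = 1.5$; the function name and the truncation at 200 terms are choices made for the example.

```python
import math

def expected_value_of_sign_estimator(lam, terms=200):
    """E[(-1)^X] for X ~ Pois(lam), by summing the pmf series directly."""
    total = 0.0
    pmf = math.exp(-lam)              # P(X = 0)
    for k in range(terms):
        total += (-1) ** k * pmf
        pmf *= lam / (k + 1)          # P(X = k + 1) from P(X = k)
    return total

lam = 1.5                             # illustrative value
lhs = expected_value_of_sign_estimator(lam)
rhs = math.exp(-2 * lam)
```

The two sides agree to floating-point accuracy, yet the estimator only ever returns +1 or -1 while the target lies strictly between 0 and 1, which is the point of the slide's question.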
Basu's Elephant

Estimate the total weight of 50 elephants.

Horvitz-Thompson Estimator
Estimate the total of some variable for a finite population:
$$\hat{T}_y = \sum_{i \in S} \frac{y_i}{\pi_i}$$

where $S$ is the sample and $\pi_i > 0$ is the probability of $i$ being in the sample

Unbiased! But what about the variance?
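
Both halves of that remark can be checked exactly on a tiny population. The sketch assumes a design in which each unit enters the sample independently with probability $\pi_i$ (an assumption made so that every possible sample can be enumerated); the population values and inclusion probabilities are hypothetical.

```python
from itertools import product

def horvitz_thompson_total(sample, y, pi):
    """Sum of y_i / pi_i over the sampled units i."""
    return sum(y[i] / pi[i] for i in sample)

def exact_mean_and_variance(y, pi):
    """Enumerate every possible sample under independent inclusion and
    return the exact mean and variance of the estimator."""
    units = range(len(y))
    mean = 0.0
    second_moment = 0.0
    for included in product([0, 1], repeat=len(y)):
        prob = 1.0
        for i in units:
            prob *= pi[i] if included[i] else 1 - pi[i]
        sample = [i for i in units if included[i]]
        estimate = horvitz_thompson_total(sample, y, pi)
        mean += prob * estimate
        second_moment += prob * estimate ** 2
    return mean, second_moment - mean ** 2

y = [10.0, 20.0, 30.0, 40.0]          # hypothetical population, true total 100
pi_equal = [0.5, 0.5, 0.5, 0.5]
pi_prop = [0.2, 0.4, 0.6, 0.8]        # proportional to y

mean_eq, var_eq = exact_mean_and_variance(y, pi_equal)
mean_pr, var_pr = exact_mean_and_variance(y, pi_prop)
```

Both designs give an expected value equal to the true total of 100, but the variances differ (3000 with equal probabilities, 2000 with probabilities proportional to $y_i$). Unbiasedness holds for any positive $\pi_i$, while the variance depends heavily on how the $\pi_i$ relate to the $y_i$.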

Fisher Weighting
How should we combine independent, unbiased
estimators for a parameter into one estimator?

$$\hat{\theta} = \sum_{i=1}^{k} w_i \hat{\theta}_i$$

The weights should sum to 1, but how should they be chosen?

$$w_i \propto \frac{1}{\mathrm{Var}(\hat{\theta}_i)}$$

(Inversely proportional to variance; why not SD?)
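
For independent estimators the variance of the combination is $\sum_i w_i^2 \mathrm{Var}(\hat{\theta}_i)$, which is quadratic in the weights; minimizing it subject to $\sum_i w_i = 1$ gives $w_i \propto 1/\mathrm{Var}(\hat{\theta}_i)$. The sketch below compares three weighting rules on assumed variances of 1, 4 and 16, which are illustrative numbers only.

```python
def combined_variance(weights, variances):
    """Variance of sum_i w_i * theta_hat_i for independent estimators."""
    return sum(w ** 2 * v for w, v in zip(weights, variances))

def normalise(raw):
    """Rescale raw weights so that they sum to 1."""
    total = sum(raw)
    return [r / total for r in raw]

variances = [1.0, 4.0, 16.0]          # hypothetical variances

w_equal = normalise([1.0] * len(variances))
w_inv_sd = normalise([v ** -0.5 for v in variances])
w_inv_var = normalise([1.0 / v for v in variances])

v_equal = combined_variance(w_equal, variances)
v_inv_sd = combined_variance(w_inv_sd, variances)
v_inv_var = combined_variance(w_inv_var, variances)
```

With these numbers, inverse-variance weights give the smallest combined variance, inverse-SD weights come second, and equal weights come last. The minimum equals $1 / \sum_i \left(1/\mathrm{Var}(\hat{\theta}_i)\right)$.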
