Professional Documents
Culture Documents
/
AC209
/
E-109
CS109
Data
Science
Rafael
A.
Irizarry
rafa@jimmy.harvard.edu
Verena
Kaynig-Fi;kau
vkaynig@seas.harvard.edu
Today
What
is
Data
Science?
Why
learn
Data
Science?
How
do
we
learn
Data
Science?
Who
is
helping
you
learn
Data
Science?
th
20
Century
Innova<on
Data
Does
ferLlizer
increase
crop
yields?
Answer:
Collect
and
analyze
agricultural
experimental
data
Does
Streptomycin
cure
Tuberculosis?
Collect
and
analyze
randomized
trials
data
Does
smoking
cause
lung-cancer?
Collect
and
analyze
observaLonal
studies
data
Analyzing
these
was
the
job
of:
boring
ol
sta<s<cians
st
21
Century
st
21
century
Hollywood
Money Ball
Observed dierence
2012 results
NeSlix Challenge
NeSlix
NeSlix
Prize
October,
2006:
Oers
$1,000,000
for
an
improved
recommender
algorithm
Training
data
100
million
raLngs
480,000
users
17,770
movies
6
years
of
data:
2000-2005
Test
data
Last
few
raLngs
of
each
user
(2.8
million)
EvaluaLon
via
RMSE:
root
mean
squared
error
Neqlix
Cinematch
RMSE:
0.9514
CompeLLon
$1
million
grand
prize
for
10%
improvement
If
10%
not
met,
$50,000
annual
Progress
Prize
for
best
improvement
user
movie
score
date
21
2002-01-03
213
2002-04-04
345
2002-05-05
123
2002-05-05
768
2003-05-03
76
2003-10-10
45
2004-10-11
568
2004-10-11
342
2004-10-11
234
2004-12-12
76
2005-01-02
56
2005-01-31
NeSlix
Prize
October,
2006:
Oers
$1,000,000
for
an
improved
recommender
algorithm
Training
data
100
million
raLngs
480,000
users
17,770
movies
6
years
of
data:
2000-2005
user
movie
score
date
21
2002-01-03
5
score
2002-04-04
date
2002-05-05
2003-01-03
2002-05-05
2002-05-04
2003-05-03
2002-07-05
2003-10-10
2002-09-05
2004-10-11
2004-05-03
2004-10-11
2003-10-10
2004-10-11
2004-10-11
2004-12-12
2004-10-11
2005-01-02
2004-10-11
2005-01-31
2004-12-12
user
1
1
2
2
2
Test
data
Last
few
raLngs
of
each
user
(2.8
million)
EvaluaLon
via
RMSE:
root
mean
squared
error
Neqlix
Cinematch
RMSE:
0.9514
CompeLLon
$1
million
grand
prize
for
10%
improvement
If
10%
not
met,
$50,000
annual
Progress
Prize
for
best
improvement
3
4
5
5
5
213
movie
345
212
123
1123
768
25
76
8773
45
98
568
16
342
2450
234
2032
76
9098
56
11012
2
2
3
4
5
5
5
6
6
?
?
?
?
?
?
?
?
?
?
4
3
5
4
1
2
2
5
4
664
2005-01-02
1526
2005-01-31
A
latent
factors
model
idenLes
factors
with
maximum
discriminaLon
between
movies
A
latent
factors
model
idenLes
factors
with
maximum
discriminaLon
between
movies
Ad-targe<ng
Biology
st
21
century
version
st
21
century
version
Genes
CpG density
brain:normal
liver:normal
spleen:normal
Island
Shore
+
PRTFDC1
Location
st
21
century
version
2.5
brain:normal
liver:normal
spleen:normal
M
1.5
0.5
0.5
Genes
CpG density
0.00 0.10
PRTFDC1
25280400
chr10
Island
Shore
25280800
25281200
25281600
Location
Spellcheckers
Speech
recogniLon
Language
translators
DigiLzing
books
Social
sciences
Medical
diagnosLcs
Personalized
medicine
Basic
Biology
Video
Video
Video
Video
video
h;p://cs109.joeong.com/
h;ps://www.youtube.com/watch?v=azD8i4nJgQc
Video
Class Details
hfp://cs109.org
Rafael Irizarry
Verena
Kaynig-Fifkau
Computer
Science
Diploma
from
University
of
Hamburg,
Germany
PhD
in
Machine
Learning
from
ETH
Zurich,
Switzerland
Postdoc
at
Harvard
with
Hanspeter
Pster
Now
lecturer
at
IACS
Python
Convert
Working
on
Connectomics:
Image
Processing
Deep
Learning
Big
Data
NW
235.10
Email:
vkaynig@seas.harvard.edu
Important
details
Class
web
site:
h;p://cs109.org
Syllabus:
h;p://cs109.github.io/2014/pages/syllabus.html
Head
TF:
Stephanie
Hicks
shicks@jimmy.harvard.edu
Sta
Tudor
Giurgica-Tiron
Zhijie
Zhou
Nural
Zaman
Brian
Feeny
Joy
Ming
Rick
Lee
Felix
Gonda
Korey
Tucker
Lane
Erickson
Diana
Miao
Logan
Kerr
Stephen
Klosterman
Grades
Homework:
65%,
assessed
on
your
individual
submission
Final
Project
Part
I:
10%,
assessed
on
your
individual
submission
Final
Project
Part
II:
25%,
assessed
on
meeLng
the
project
criteria
You
are
welcome
to
discuss
the
courses
ideas,
material,
and
homework
with
others
in
order
to
be;er
understand
it,
but
the
work
you
turn
in
must
be
your
own
Homework
Real-World
focus
Scrape
and
wrangle
messy
data
Explore
data
Apply
staLsLcal
analysis
Visualize
and
communicate
results
Final
Project
Teams
of
up
to
4
students
Pick
a
project
of
your
choosing
Part
I:
describe
quesLon
and
plane
for
answering
Process
books,
web
sites,
screencasts
IPython
(excepLons
possible)
Part
II:
present
results
and
conclusions
Best
project
prizes!
Ques<ons