You are on page 1of 60

STAT121

/ AC209 / E-109
CS109 Data Science
Rafael A. Irizarry
rafa@jimmy.harvard.edu

Verena Kaynig-Fi;kau
vkaynig@seas.harvard.edu

Today
What is Data Science?
Why learn Data Science?
How do we learn Data Science?
Who is helping you learn Data Science?

th
20 Century Innova<on

Engineering and Computer Science played key role


Cars
Airplanes
Power grid
Television
Air condiLoning and central heaLng
Nuclear power
Digital computers
The internet
For more:
h;p://camdp.com/blogs/21st-century-problems

But how about these 20th Century


ques<ons?
Does ferLlizer increase crop yields?

Does Streptomycin cure Tuberculosis?
Does smoking cause lung-cancer?

What is the dierence?

What is the dierence?


DeterminisLc versus random
DeducLve versus empirical
SoluLons deduced mostly from theory versus
soluLons deduced from mostly from data

Data
Does ferLlizer increase crop yields? Answer: Collect
and analyze agricultural experimental data
Does Streptomycin cure Tuberculosis? Collect and
analyze randomized trials data
Does smoking cause lung-cancer? Collect and analyze
observaLonal studies data
Analyzing these was the job of: boring ol sta<s<cians

st
21 Century

st
21 century

I keep saying the sexy job in the next ten years


will be staLsLcians. People think I'm joking, but
who would've guessed that computer engineers
would've been the sexy job of the 1990s?
- Hal Varian, Googles Chief Economist

Hal Varian Explains...


The ability to take data to be able to understand
it, to process it, to extract value from it, to
visualize it, to communicate it's going to be a
hugely important skill in the next decades, not only
at the professional level but even at the
educaLonal level for elementary school kids, for
high school kids, for college kids. Because now we
really do have essenLally free and ubiquitous data.
Hal Varian

Data Science Success Stories

The Data Scien<st


Actual

Hollywood

Money Ball

StarLng around 2001, the Oakland As picked players that scouts


thought were no good but data said otherwise

Nate Silver won the elec<on


Harvard Business Review

PredicLon: 349 to 189, 6.1% dierence.


Actual: 365 to 173, 7.2% dierence

Observed dierence

2012 results

Silvers predicted dierence

NeSlix Challenge

In Sept 2009 a team lead by Chris


Volinsky from StaLsLcs Research
AT&T Research was announced as
winner!

NeSlix

A US-based DVD rental-by mail company


>10M customers, 100K Ltles, ships 1.9M DVDs per day

Good recommendaLons = happy customers

Courtesy of Chris Volinsky

NeSlix Prize
October, 2006:
Oers $1,000,000 for an improved recommender algorithm


Training data
100 million raLngs
480,000 users
17,770 movies
6 years of data: 2000-2005
Test data
Last few raLngs of each user (2.8 million)
EvaluaLon via RMSE: root mean squared error
Neqlix Cinematch RMSE: 0.9514

CompeLLon
$1 million grand prize for 10% improvement
If 10% not met, $50,000 annual Progress Prize for
best improvement

user

movie

score

date

21

2002-01-03

213

2002-04-04

345

2002-05-05

123

2002-05-05

768

2003-05-03

76

2003-10-10

45

2004-10-11

568

2004-10-11

342

2004-10-11

234

2004-12-12

76

2005-01-02

56

2005-01-31

Courtesy of Chris Volinsky

NeSlix Prize
October, 2006:
Oers $1,000,000 for an improved recommender algorithm


Training data
100 million raLngs
480,000 users
17,770 movies
6 years of data: 2000-2005

user

movie

score

date

21

2002-01-03

5
score

2002-04-04
date

2002-05-05
2003-01-03
2002-05-05
2002-05-04
2003-05-03
2002-07-05
2003-10-10
2002-09-05
2004-10-11
2004-05-03
2004-10-11
2003-10-10
2004-10-11
2004-10-11
2004-12-12
2004-10-11
2005-01-02
2004-10-11
2005-01-31
2004-12-12

user
1
1
2
2
2

Test data
Last few raLngs of each user (2.8 million)
EvaluaLon via RMSE: root mean squared error
Neqlix Cinematch RMSE: 0.9514

CompeLLon
$1 million grand prize for 10% improvement
If 10% not met, $50,000 annual Progress Prize for
best improvement

3
4
5
5
5

213
movie

345
212
123
1123
768
25
76
8773
45
98
568
16
342
2450
234
2032
76
9098
56
11012

2
2
3
4
5
5
5
6
6

?
?
?
?
?
?
?
?
?
?

4
3
5
4
1
2
2
5
4

664

2005-01-02

1526

2005-01-31

Courtesy of Chris Volinsky

Latent Factors Model


A latent
factors model
idenLes
factors with
maximum
discriminaLon
between
movies

Courtesy of Chris Volinsky

Latent Factors Model


Frat-house, gross-out comedies

CriLcally acclaimed/ Strong female leads

A latent
factors model
idenLes
factors with
maximum
discriminaLon
between
movies

Courtesy of Chris Volinsky

Latent Factors Model


Frat-house, gross-out comedies
Artsy, edgy,
movies

CriLcally acclaimed/ Strong female leads

A latent
factors model
idenLes
factors with
maximum
discriminaLon
between
movies

Hollywood Big budget

Courtesy of Chris Volinsky

Ad-targe<ng

Biology

Anton Van Leeuwenhoek (1623-1723)

The father of microbiology

Some of his discoveries

By improving the microscope

he saw what others could not

st
21 century version

Modern high-throughput technology

Produces complex data, not images

st
21 century version

Genes

CpG density

brain:normal
liver:normal
spleen:normal

Island

Shore
+
PRTFDC1

Location

Modern high-throughput technology

My work has helped bring data into focus

st
21 century version

2.5

brain:normal
liver:normal
spleen:normal

M
1.5

0.5

0.5

Genes

CpG density
0.00 0.10

PRTFDC1

25280400

chr10

Modern high-throughput technology

Island

Shore

25280800

25281200

25281600

Location

My work has helped bring data into focus

Many other examples

Spellcheckers
Speech recogniLon
Language translators
DigiLzing books
Social sciences
Medical diagnosLcs
Personalized medicine
Basic Biology

Last years projects


Some examples

Video

Video

Video

Video

video

h;p://cs109.joeong.com/
h;ps://www.youtube.com/watch?v=azD8i4nJgQc

Video

How do we do Data Science?

Skills we will learn


Science: determining what quesLons can be answered
with data and what are the best datasets for answering
them
Computer programming: using computers to analyze
data
Data wrangling: gevng data into analyzable form on
our computers
Sta<s<cs: separaLng signal from noise
Machine learning: making predicLons from data
Communica<on: sharing ndings through visualizaLon,
stories and interpretable summaries

Specic concepts and principles


Science: gain experience asking quesLons.
Computer programming: python, GitHub, cloud
compuLng (on Amazon)
Data wrangling: python libraries for reading data
tables and scrapping web pages.
Sta<s<cs: exploratory data analysis, inference,
esLmaLon, condiLonal probabiliLes, regression,
modeling, Bayesian staLsLcs, and more.
Machine learning: support vector machines, k-nearest
neighbors, regression trees, random forests, boosLng,
Communica<on: python graphing packages and in-
class pracLce

Class will be centered around data


Homework and lectures will be based on:
Baseball data
World income and life expectancy
Gene expression data
ElecLon data (predict 2014 midterm
compeLLon)
Data for building predictors

Choose your own for nal project


Examples of freely available data:
Genomics
Astronomy
Financial
Social media: Twi;er, Reddit, Stack Overow
Fitbit (get your own)
Baseball and other sports
Movie raLngs
Baby names
To name a few

Lets collect data for Thursday



h;p://bit.ly/1plnpos

Class Details

hfp://cs109.org

Rafael Irizarry

Math undergrad University of Puerto Rico


PhD in StaLsLcs UC Berkeley
15 years at Hopkins
At Harvard since 2013
DFCI is my groups home
5 postdocs, 2 grad students
Taught EdX on Genomics
Webpage: h;p://rafalab.org
Blog: simplystaLsLcs.org
Twi;er: @rafalab
Email: rafa@jimmy.harvard.edu

Verena Kaynig-Fifkau
Computer Science Diploma from
University of Hamburg, Germany
PhD in Machine Learning from
ETH Zurich, Switzerland
Postdoc at Harvard with
Hanspeter Pster
Now lecturer at IACS
Python Convert
Working on Connectomics:
Image Processing
Deep Learning
Big Data

NW 235.10
Email: vkaynig@seas.harvard.edu

Important details

Class web site: h;p://cs109.org
Syllabus:
h;p://cs109.github.io/2014/pages/syllabus.html
Head TF: Stephanie Hicks
shicks@jimmy.harvard.edu

Sta

Stephanie Hicks (Head TF)


Marc Streit (Guest lecturer)
Mingxiang Teng
Michael Packer
Marcus Way
Michael Lackner
Amy Mir
Tarik Adnan Moon
Olivia Angiuli
Yang Li
Huihui Fan
Antonia Oprescu
Claudio Rosenberg

Tudor Giurgica-Tiron
Zhijie Zhou
Nural Zaman
Brian Feeny
Joy Ming
Rick Lee
Felix Gonda
Korey Tucker
Lane Erickson
Diana Miao
Logan Kerr
Stephen Klosterman

Grades
Homework: 65%, assessed on your individual
submission
Final Project Part I: 10%, assessed on your
individual submission
Final Project Part II: 25%, assessed on meeLng
the project criteria
You are welcome to discuss the courses ideas,
material, and homework with others in order to
be;er understand it, but the work you turn in must
be your own

Homework
Real-World focus
Scrape and wrangle messy data
Explore data
Apply staLsLcal analysis
Visualize and communicate results

Final Project
Teams of up to 4 students
Pick a project of your choosing
Part I: describe quesLon and plane for
answering
Process books, web sites, screencasts
IPython (excepLons possible)
Part II: present results and conclusions
Best project prizes!

Details for Homework 0

Ques<ons

You might also like