Data Science Course Overview and Success Stories

STAT121
/ AC209 / E-109
CS109 Data Science
Rafael A. Irizarry
rafa@jimmy.harvard.edu

Verena Kaynig-Fi;kau
vkaynig@seas.harvard.edu

Today
What is Data Science?
Why learn Data Science?
How do we learn Data Science?
Who is helping you learn Data Science?
th
20 Century Innova<on
Engineering and Computer Science played key role

Cars
Airplanes
Power grid
Television
Air condiLoning and central heaLng
Nuclear power
Digital computers
The internet
For more:
h;p://camdp.com/blogs/21st-century-problems

But how about these 20th Century

ques<ons?
Does ferLlizer increase crop yields?

Does Streptomycin cure Tuberculosis?
Does smoking cause lung-cancer?

What is the dierence?
What is the dierence?

DeterminisLc versus random
DeducLve versus empirical
SoluLons deduced mostly from theory versus
soluLons deduced from mostly from data
Data
Does ferLlizer increase crop yields? Answer: Collect
and analyze agricultural experimental data
Does Streptomycin cure Tuberculosis? Collect and
analyze randomized trials data
Does smoking cause lung-cancer? Collect and analyze
observaLonal studies data
Analyzing these was the job of: boring ol sta<s<cians
st
21 Century
st
21 century
I keep saying the sexy job in the next ten years

will be staLsLcians. People think I'm joking, but
who would've guessed that computer engineers
would've been the sexy job of the 1990s?
- Hal Varian, Googles Chief Economist
Hal Varian Explains...

The ability to take data to be able to understand
it, to process it, to extract value from it, to
visualize it, to communicate it's going to be a
hugely important skill in the next decades, not only
at the professional level but even at the
educaLonal level for elementary school kids, for
high school kids, for college kids. Because now we
really do have essenLally free and ubiquitous data.
Hal Varian
Data Science Success Stories
The Data Scien<st

Actual
Hollywood
Money Ball
StarLng around 2001, the Oakland As picked players that scouts

thought were no good but data said otherwise
Nate Silver won the elec<on

Harvard Business Review
PredicLon: 349 to 189, 6.1% dierence.

Actual: 365 to 173, 7.2% dierence
Observed dierence
2012 results
Silvers predicted dierence
NeSlix Challenge
In Sept 2009 a team lead by Chris

Volinsky from StaLsLcs Research
AT&T Research was announced as
winner!
NeSlix
A US-based DVD rental-by mail company

>10M customers, 100K Ltles, ships 1.9M DVDs per day
Good recommendaLons = happy customers
Courtesy of Chris Volinsky
NeSlix Prize
October, 2006:
Oers $1,000,000 for an improved recommender algorithm

Training data
100 million raLngs
480,000 users
17,770 movies
6 years of data: 2000-2005
Test data
Last few raLngs of each user (2.8 million)
EvaluaLon via RMSE: root mean squared error
Neqlix Cinematch RMSE: 0.9514

CompeLLon
$1 million grand prize for 10% improvement
If 10% not met, $50,000 annual Progress Prize for
best improvement
user
movie
score
date
21
2002-01-03
213
2002-04-04
345
2002-05-05
123
2002-05-05
768
2003-05-03
76
2003-10-10
45
2004-10-11
568
2004-10-11
342
2004-10-11
234
2004-12-12
76
2005-01-02
56
2005-01-31
NeSlix Prize
October, 2006:
Oers $1,000,000 for an improved recommender algorithm

Training data
100 million raLngs
480,000 users
17,770 movies
6 years of data: 2000-2005
user
movie
score
date
21
2002-01-03
5
score
2002-04-04
date
2002-05-05
2003-01-03
2002-05-05
2002-05-04
2003-05-03
2002-07-05
2003-10-10
2002-09-05
2004-10-11
2004-05-03
2004-10-11
2003-10-10
2004-10-11
2004-10-11
2004-12-12
2004-10-11
2005-01-02
2004-10-11
2005-01-31
2004-12-12
user
1
1
2
2
2
Test data
Last few raLngs of each user (2.8 million)
EvaluaLon via RMSE: root mean squared error
Neqlix Cinematch RMSE: 0.9514

CompeLLon
$1 million grand prize for 10% improvement
If 10% not met, $50,000 annual Progress Prize for
best improvement
3
4
5
5
5
213
movie
345
212
123
1123
768
25
76
8773
45
98
568
16
342
2450
234
2032
76
9098
56
11012
2
2
3
4
5
5
5
6
6
?
?
?
?
?
?
?
?
?
?
4
3
5
4
1
2
2
5
4
664
2005-01-02
1526
2005-01-31
Latent Factors Model

A latent
factors model
idenLes
factors with
maximum
discriminaLon
between
movies

Frat-house, gross-out comedies
CriLcally acclaimed/ Strong female leads
A latent
factors model
idenLes
factors with
maximum
discriminaLon
between
movies

Frat-house, gross-out comedies
Artsy, edgy,
movies
CriLcally acclaimed/ Strong female leads
A latent
factors model
idenLes
factors with
maximum
discriminaLon
between
movies
Hollywood Big budget
Ad-targe<ng
Biology
Anton Van Leeuwenhoek (1623-1723)
The father of microbiology
Some of his discoveries
By improving the microscope
he saw what others could not
st
21 century version
Modern high-throughput technology
Produces complex data, not images
st
21 century version
Genes
CpG density
brain:normal
liver:normal
spleen:normal
Island
Shore
+
PRTFDC1
Location
My work has helped bring data into focus
st
21 century version
2.5
brain:normal
liver:normal
spleen:normal
M
1.5
0.5
0.5
Genes
CpG density
0.00 0.10
PRTFDC1
25280400
chr10
Island
Shore
25280800
25281200
25281600
Location
My work has helped bring data into focus
Many other examples
Spellcheckers
Speech recogniLon
Language translators
DigiLzing books
Social sciences
Medical diagnosLcs
Personalized medicine
Basic Biology
Last years projects

Some examples
Video
Video
Video
Video
video
h;p://cs109.joeong.com/
h;ps://www.youtube.com/watch?v=azD8i4nJgQc
Video
How do we do Data Science?
Skills we will learn

Science: determining what quesLons can be answered
with data and what are the best datasets for answering
them
Computer programming: using computers to analyze
data
Data wrangling: gevng data into analyzable form on
our computers
Sta<s<cs: separaLng signal from noise
Machine learning: making predicLons from data
Communica<on: sharing ndings through visualizaLon,
stories and interpretable summaries
Specic concepts and principles

Science: gain experience asking quesLons.
Computer programming: python, GitHub, cloud
compuLng (on Amazon)
Data wrangling: python libraries for reading data
tables and scrapping web pages.
Sta<s<cs: exploratory data analysis, inference,
esLmaLon, condiLonal probabiliLes, regression,
modeling, Bayesian staLsLcs, and more.
Machine learning: support vector machines, k-nearest
neighbors, regression trees, random forests, boosLng,
Communica<on: python graphing packages and in-
class pracLce
Class will be centered around data

Homework and lectures will be based on:
Baseball data
World income and life expectancy
Gene expression data
ElecLon data (predict 2014 midterm
compeLLon)
Data for building predictors

Choose your own for nal project

Examples of freely available data:
Genomics
Astronomy
Financial
Social media: Twi;er, Reddit, Stack Overow
Fitbit (get your own)
Baseball and other sports
Movie raLngs
Baby names
To name a few
Lets collect data for Thursday

h;p://bit.ly/1plnpos

Class Details
hfp://cs109.org
Rafael Irizarry
Math undergrad University of Puerto Rico

PhD in StaLsLcs UC Berkeley
15 years at Hopkins
At Harvard since 2013
DFCI is my groups home
5 postdocs, 2 grad students
Taught EdX on Genomics
Webpage: h;p://rafalab.org
Blog: simplystaLsLcs.org
Twi;er: @rafalab
Email: rafa@jimmy.harvard.edu
Verena Kaynig-Fifkau
Computer Science Diploma from
University of Hamburg, Germany
PhD in Machine Learning from
ETH Zurich, Switzerland
Postdoc at Harvard with
Hanspeter Pster
Now lecturer at IACS
Python Convert
Working on Connectomics:
Image Processing
Deep Learning
Big Data
NW 235.10
Email: vkaynig@seas.harvard.edu

Important details

Class web site: h;p://cs109.org
Syllabus:
h;p://cs109.github.io/2014/pages/syllabus.html
Head TF: Stephanie Hicks
shicks@jimmy.harvard.edu
Sta
Stephanie Hicks (Head TF)

Marc Streit (Guest lecturer)
Mingxiang Teng
Michael Packer
Marcus Way
Michael Lackner
Amy Mir
Tarik Adnan Moon
Olivia Angiuli
Yang Li
Huihui Fan
Antonia Oprescu
Claudio Rosenberg
Tudor Giurgica-Tiron
Zhijie Zhou
Nural Zaman
Brian Feeny
Joy Ming
Rick Lee
Felix Gonda
Korey Tucker
Lane Erickson
Diana Miao
Logan Kerr
Stephen Klosterman
Grades
Homework: 65%, assessed on your individual
submission
Final Project Part I: 10%, assessed on your
individual submission
Final Project Part II: 25%, assessed on meeLng
the project criteria
You are welcome to discuss the courses ideas,
material, and homework with others in order to
be;er understand it, but the work you turn in must
be your own
Homework
Real-World focus
Scrape and wrangle messy data
Explore data
Apply staLsLcal analysis
Visualize and communicate results
Final Project
Teams of up to 4 students
Pick a project of your choosing
Part I: describe quesLon and plane for
answering
Process books, web sites, screencasts
IPython (excepLons possible)
Part II: present results and conclusions
Best project prizes!
Details for Homework 0
Ques<ons

Data Science Course Overview and Success Stories

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Science Course Overview and Success Stories

Uploaded by

Copyright:

Available Formats

STAT121

Engineering and Computer Science played key role

But how about these 20th Century

What is the dierence?

What is the dierence?

I keep saying the sexy job in the next ten years

Hal Varian Explains...

Data Science Success Stories

The Data Scien<st

StarLng around 2001, the Oakland As picked players that scouts

Nate Silver won the elec<on

PredicLon: 349 to 189, 6.1% dierence.

Silvers predicted dierence

In Sept 2009 a team lead by Chris

A US-based DVD rental-by mail company

Good recommendaLons = happy customers

Courtesy of Chris Volinsky

Courtesy of Chris Volinsky

Courtesy of Chris Volinsky

Latent Factors Model

Courtesy of Chris Volinsky

Latent Factors Model

CriLcally acclaimed/ Strong female leads

Courtesy of Chris Volinsky

Latent Factors Model

CriLcally acclaimed/ Strong female leads

Hollywood Big budget

Courtesy of Chris Volinsky

Anton Van Leeuwenhoek (1623-1723)

The father of microbiology

Some of his discoveries

By improving the microscope

he saw what others could not

Modern high-throughput technology

Produces complex data, not images

Modern high-throughput technology

My work has helped bring data into focus

Modern high-throughput technology

My work has helped bring data into focus

Many other examples

Last years projects

How do we do Data Science?

Skills we will learn

Specic concepts and principles

Class will be centered around data

Choose your own for nal project

Lets collect data for Thursday

Math undergrad University of Puerto Rico

Stephanie Hicks (Head TF)

Details for Homework 0

You might also like