
Introduction to Data Science

Lecture 1
CS 194 Fall 2014
John Canny
Including notes from Michael Franklin
Dan Bruckner, Evan Sparks,
Shivaram Venkataraman
Outline
Data Science: Why all the excitement?
examples
Where does data come from?
So what is Data Science?
Doing Data Science
About the course
what we'll cover
data science first, big data later
requirements, workload etc.
Data Analysis Has Been Around for a While

R.A. Fisher, W.E. Deming, Peter Luhn, Howard Dresner

Abridged version of Jeff Hammerbacher's timeline for CS 194, 2012
Data makes everything clearer
Seven Countries Study (Ancel Keys, UCB 1925,28)
13,000 subjects total, 5-40 years follow-up.
Data Science: Why all the Excitement?
e.g.,
Google Flu Trends: detecting outbreaks two weeks ahead of CDC data

New models are estimating which cities are most at risk for spread of the Ebola virus.
Why all the Excitement?
Data and Election 2012 (cont.)
that was just one of several ways that Mr. Obama's
campaign operations, some unnoticed by Mr. Romney's
aides in Boston, helped save the president's candidacy. In
Chicago, the campaign recruited a team of behavioral
scientists to build an extraordinarily sophisticated database
that allowed the Obama campaign not only to alter the
very nature of the electorate, making it younger and less
white, but also to create a portrait of shifting voter
allegiances. The power of this operation stunned Mr.
Romney's aides on election night, as they saw voters they
never even knew existed turn out in places like Osceola
County, Fla.

New York Times, Wed Nov 7, 2012
A history of the (Business) Internet: 1997
PageRank: The web as a behavioral dataset
DB size = 50 billion sites

Google server farms: 2 million machines (est)
1998 sponsored search

Overture

2002
Sponsored search
Google revenue around $50 bn/year from marketing, 97% of the company's revenue.

Sponsored search uses an auction: a pure competition for marketers trying to win access to consumers.

In other words, a competition for models of consumers: their likelihood of responding to the ad, and of determining the right bid for the item.

There are around 30 billion search requests a month. Perhaps a trillion events of history between search providers.
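
The auction idea above can be sketched in a few lines. This is only an illustration under stated assumptions: a single ad slot, ranking by bid times predicted click probability, and a second-price style payment. None of these details come from the slides, and the advertiser names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    advertiser: str   # hypothetical advertiser name
    bid: float        # maximum price per click the advertiser offers
    p_click: float    # model's estimate that the user clicks this ad


def run_auction(ads):
    """Pick one winner by expected value per impression (bid * p_click).

    Assumed pricing rule (second-price style): the winner pays, per click,
    the smallest amount that would still have tied the runner-up's score.
    """
    ranked = sorted(ads, key=lambda a: a.bid * a.p_click, reverse=True)
    winner = ranked[0]
    if len(ranked) == 1:
        return winner, 0.0
    runner_up = ranked[1]
    price = runner_up.bid * runner_up.p_click / winner.p_click
    return winner, price


ads = [
    Ad("shoes_a", bid=2.00, p_click=0.05),  # score 0.10
    Ad("shoes_b", bid=4.00, p_click=0.02),  # score 0.08
    Ad("shoes_c", bid=1.00, p_click=0.03),  # score 0.03
]
winner, price = run_auction(ads)
print(winner.advertiser, round(price, 2))
```

Note that the highest bidder (shoes_b) loses here: a better model of the consumer's click probability matters as much as the bid itself.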
Data Makes Everything Clearer?
Data Makes Everything Clearer

Searches for MySpace

Searches for Facebook
Data Makes Everything Clearer

and based on Princeton search trends:

This trend suggests that Princeton will have only half its current enrollment by 2018, and by 2021 it will have no students at all,

http://techcrunch.com/2014/01/23/facebook-losing-users-princeton-losing-credibility/
Big Data Sources
It's All Happening On-line
User Generated (Web & Mobile)
Every:
Click
Ad impression
Billing event
...
Fast Forward, pause, ...
Server request
Transaction
Network message
Fault

Internet of Things / M2M
Health/Scientific Computing


Graph Data
Lots of interesting data
has a graph structure:
Social networks
Communication networks
Computer Networks
Road networks
Citations
Collaborations/Relationships

Some of these graphs can get quite large (e.g., Facebook* user graph)
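
As a small illustration of graph-structured data, here is a sketch of an adjacency-list representation with a breadth-first search over it. The names and edges are made up for the example, and a dictionary of sets is just one convenient in-memory layout; very large graphs need distributed tools.

```python
from collections import deque

# Hypothetical friendship edges; each pair is an undirected link.
edges = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("ann", "eve")]

# Adjacency-list representation: node -> set of neighbours.
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)


def hops(graph, start, goal):
    """Breadth-first search: number of edges on a shortest path, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Degree = number of neighbours of each node.
degrees = {node: len(nbrs) for node, nbrs in graph.items()}
print(degrees["bob"], hops(graph, "eve", "dan"))
```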

What can you do with the data?

Crowdsourcing + physical modeling + sensing + data assimilation

to produce:

From Alex Bayen, UCB


Big Data is so 2012
"the sexy job in the next 10 years will be statisticians", Hal Varian, Google Chief Economist
The U.S. will need 140,000-190,000 predictive analysts and 1.5 million managers/analysts by 2018. McKinsey Global Institute, June 2011
New Data Science institutes being created or repurposed: NYU, Columbia, Washington, UCB,...
New degree programs, courses, boot-camps:
e.g., at Berkeley: Stats, I-School, CS, Astronomy
One proposal (elsewhere) for an MS in Big Data Science
DATA SCIENCE: WHAT IS IT?
Data Science: an Emerging Field

O'Reilly Radar report


Some recent ML Competitions
Data Science: One Definition
Contrast: Databases
              Databases                        Data Science
Data Value    Precious                         Cheap
Data Volume   Modest                           Massive
Examples      Bank records, Personnel records, Online clicks, GPS logs, Tweets,
              Census, Medical records          Building sensor readings
Priorities    Consistency, Error recovery,     Speed, Availability,
              Auditability                     Query richness
Structured    Strongly (Schema)                Weakly or none (Text)
Properties    Transactions, ACID*              CAP* theorem (2/3),
                                               eventual consistency
Realizations  SQL                              NoSQL: Riak, Memcached,
                                               Apache River, MongoDB, CouchDB,
                                               HBase, Cassandra

* ACID = Atomicity, Consistency, Isolation and Durability
* CAP = Consistency, Availability, Partition Tolerance
Contrast: Databases
Databases                Data Science
Querying the past        Querying the future

Business intelligence (BI) is the transformation of raw data into meaningful and
useful information for business analysis purposes. BI can handle enormous
amounts of unstructured data to help identify, develop and otherwise create new
strategic business opportunities - Wikipedia
Contrast: Scientific Computing
[Figure: Image -> General purpose classifier -> Supernova / Not. Nugent group / C3 LBL]

Scientific Modeling                 Data-Driven Approach
Physics-based models                General inference engine replaces model
Problem-Structured                  Structure not related to problem
Mostly deterministic, precise       Statistical models handle true randomness
                                    and unmodeled complexity
Run on Supercomputer or             Run on cheaper computer Clusters (EC2)
High-end Computing Cluster
Contrast: Computational Science
CASP: A Worldwide, Biennial Protein Folding Contest
Quark: Rich, Complex Energy Models; Faithful, Physical Simulation
Raptor-X: Data-intensive, general ML models; Feature-based inference; Conditional Neural Fields

Brain Mapping: Allen Institute, White House, Berkeley
Techniques (Massive ML): Principal Component Analysis, Independent Component Analysis, Sparse Coding, Spatial (Image) Filtering
Contrast: Machine Learning
Machine Learning                          Data Science
Develop new (individual) models           Explore many models, build and tune hybrids
Prove mathematical properties of models   Understand empirical properties of models
Improve/validate on a few, relatively     Develop/use tools that can handle
clean, small datasets                     massive datasets
Publish a paper                           Take action!
5-min break
Analyzing the Analysts

From Kandel, Paepcke, Hellerstein and Heer, "Enterprise Data Analysis and Visualization: An Interview Study", IEEE VAST 2012
SOME DATA SCIENCE AT BERKELEY
Moore/Sloan Data Science Initiative

3 chosen Nationwide
5 Yrs, $38MM
The Berkeley Center
is called BIDS
BIDS Space
Doe Memorial Library at
the heart of the UC
Berkeley campus will be
the new home of the
Berkeley Institute for Data
Science (BIDS).

The campus has set aside 5,000 sq ft on the ground
floor, directly accessible from the building's north
entrance and opposite the historical Morrison
Reading Room.
AMPLab

Other Berkeley Projects (used in the course)
IPython:
Created by Fernando Perez
(Brain Science). Probably the
most widely used Data
Science Environment.

BIDMach: A hardware-optimized
(rooflined) toolkit. The fastest
tool for general machine learning.

Caffe: Most popular toolkit for DNN (Deep Neural Network)
modeling.
DOING DATA SCIENCE
Ben Fry's Model
1. Acquire
2. Parse
3. Filter
4. Mine
5. Represent
6. Refine
7. Interact
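
The first five stages above can be sketched as a tiny pipeline. This is a toy illustration, not Ben Fry's own code: the click log, column names, and function names are all hypothetical, and stages 6 and 7 (refine, interact) are iterative and left out.

```python
import csv
import io
from collections import Counter

# 1. Acquire: a hypothetical click log, inlined so the sketch is self-contained.
RAW = """user,page,seconds
u1,home,12
u2,home,
u1,search,30
u3,search,45
u2,cart,8
"""


def parse(raw):
    # 2. Parse: turn text into records (dicts keyed by column name).
    return list(csv.DictReader(io.StringIO(raw)))


def filter_rows(rows):
    # 3. Filter: drop records with a missing duration.
    return [r for r in rows if r["seconds"]]


def mine(rows):
    # 4. Mine: total seconds spent per page.
    totals = Counter()
    for r in rows:
        totals[r["page"]] += int(r["seconds"])
    return totals


def represent(totals):
    # 5. Represent: a plain-text bar chart, one '#' per 5 seconds.
    return [f"{page:<8}{'#' * (secs // 5)} {secs}"
            for page, secs in totals.most_common()]


totals = mine(filter_rows(parse(RAW)))
for line in represent(totals):
    print(line)
```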

Jeff Hammerbacher's Model
1. Identify problem

2. Instrument data sources

3. Collect data

4. Prepare data (integrate, transform, clean, filter, aggregate)

5. Build model

6. Evaluate model

7. Communicate results
From the Trenches
Yahoo [KDD 2009, best app. paper]

Ebay [SIGIR 2011, hon. mention]

Quantcast [2012]

Microsoft [CIKM 2014]


Data Scientist's Practice

[Diagram stages: Digging Around in Data; Clean, prep; Hypothesize, Model; Evaluate, Interpret; Large Scale Exploitation]
What's Hard about Data Science
Overcoming assumptions
Making ad-hoc explanations of data patterns
Overgeneralizing
Communication
Not checking enough (validate models, data pipeline integrity, etc.)
Using statistical tests correctly
Prototype -> Production transitions
Data pipeline complexity (who do you ask?)
About the Course
Grading
Class Participation (M) and in-class labs (W) 20%
Midterm 20%
Final Project (in groups) 25%
Homeworks 30%
Bunnies 5%
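
The weights above sum to 100%, so a final score is a weighted average of the components. A quick sketch, assuming each component is scored on a 0-100 scale (the slides do not say how components are scaled, and the example scores are hypothetical):

```python
# Weights from the grading slide, as fractions of the final grade.
WEIGHTS = {
    "participation_and_labs": 0.20,
    "midterm": 0.20,
    "final_project": 0.25,
    "homeworks": 0.30,
    "bunnies": 0.05,
}


def course_score(scores):
    """Weighted average of component scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student scores, for illustration only.
example = {
    "participation_and_labs": 90,
    "midterm": 80,
    "final_project": 85,
    "homeworks": 95,
    "bunnies": 100,
}
print(round(course_score(example), 2))
```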

Lab model: hands-on weekly labs here (145 Moffitt)

A bunny is a cuter, cuddlier species of quiz.
Normally due Monday before class (Friday this week).
Projects
Project teams should form by week 3.

Project proposals will be due 10/2.

You can choose a project topic, but we will also provide
a list of suggested projects from around campus (from
BIDS researchers). You need:
A clear problem statement
An accessible dataset
Modeling plan + appropriate tools
About the Course
Staff Contact:
Instructor: John Canny, lastname@berkeley.edu
Office hours: MW 2-3pm.
GSIs:
Charles Reiss: Tu/Th 5-6 at 283E Soda
firstname.lastname@berkeley.edu
Biye Jiang: Wed 10-11 at 283E Soda
firstletteroffirstnamelastname@berkeley.edu

Use Piazza for questions


Course Site, Readings
The main course site is in bCourses, titled
Introduction to Data Science

Most work will be submitted there; some homeworks
will be submitted on instructional machines with
glookup.

Readings will be linked from the course site. Some are
campus-only; configure proxy.lib.berkeley.edu in your
browser to read at home.
If time permits
A data analysis exercise:
English Premier League Soccer: Everton vs. W. Bromwich Albion
Predict the outcome:
                     Everton Goals
W. Bromwich Goals    0      1      2+
0
1
2+
Analysis
What kinds of data will you use?
Almost anything is OK, except other predictions.
History: individual or pair-wise?
Team or players?
Numerical or text?
What kind of model will you build?
What assumptions are safe to make?
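
One possible model for this exercise, offered only as a sketch: treat each team's goal count as an independent Poisson variable and fill in the 3x3 grid of 0 / 1 / 2+ outcomes. The Poisson and independence assumptions are ours, not the slides', and the scoring rates below are hypothetical placeholders rather than real team statistics.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals when goals follow Poisson(rate)."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def bucket_probs(rate):
    """Probabilities of scoring 0, 1, or 2+ goals."""
    p0, p1 = poisson_pmf(0, rate), poisson_pmf(1, rate)
    return [p0, p1, 1.0 - p0 - p1]


def outcome_grid(home_rate, away_rate):
    """3x3 grid: rows = away-team buckets, columns = home-team buckets.

    Assumes the two teams' goal counts are independent.
    """
    home, away = bucket_probs(home_rate), bucket_probs(away_rate)
    return [[a * h for h in home] for a in away]


# Hypothetical scoring rates (goals per match), not real team statistics.
grid = outcome_grid(home_rate=1.5, away_rate=1.0)
labels = ["0", "1", "2+"]
print("W. Bromwich (rows) vs Everton (cols)")
print(" " * 4 + "".join(f"{c:>7}" for c in labels))
for label, row in zip(labels, grid):
    print(f"{label:<4}" + "".join(f"{p:7.3f}" for p in row))
```

Estimating the two rates from match history (per team, or per pairing) is where the data questions above come in.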
Predictions

PL Followers

                     Everton Goals
W. Bromwich Goals    0      1      2+
0
1
2+

Non-Followers

                     Everton Goals
W. Bromwich Goals    0      1      2+
0
1
2+

Answers on Monday!
Readings
This week's reading bunny by Friday.

Read next week's readings, and complete bunny before
class on Monday.
