Lecture 1 Data Science

Introduction to Data Science
Lecture 1
CS 194 Fall 2014
John Canny
Including notes from Michael Franklin, Dan Bruckner, Evan Sparks,
Shivaram Venkataraman
Outline
Data Science: Why all the excitement?
examples
Where does data come from?
So what is Data Science?
Doing Data Science
About the course
what we'll cover
data science first, big data later
requirements, workload etc.
Data Analysis Has Been Around for a While
R.A. Fisher
W.E. Deming
Peter Luhn
Howard Dresner
Abridged version of Jeff Hammerbacher's timeline for CS 194, 2012
Data makes everything clearer
Seven Countries Study (Ancel Keys, UCB 1925,28)
13,000 subjects total, 5-40 years follow-up.
Data Science: Why all the Excitement?
e.g.,
Google Flu Trends: detecting outbreaks two weeks ahead of CDC data
New models are estimating which cities are most at risk for spread of the Ebola virus.
Why all the Excitement?
Data and Election 2012 (cont.)
"...that was just one of several ways that Mr. Obama's
campaign operations, some unnoticed by Mr. Romney's
aides in Boston, helped save the president's candidacy. In
Chicago, the campaign recruited a team of behavioral
scientists to build an extraordinarily sophisticated database
...that allowed the Obama campaign not only to alter the
very nature of the electorate, making it younger and less
white, but also to create a portrait of shifting voter
allegiances. The power of this operation stunned Mr.
Romney's aides on election night, as they saw voters they
never even knew existed turn out in places like Osceola
County, Fla."
New York Times, Wed Nov 7, 2012
A history of the (Business) Internet: 1997
PageRank: The web as a behavioral dataset
DB size = 50 billion sites
Google server farms: 2 million machines (est)
1998 sponsored search
Overture
2002
Sponsored search
Google revenue around $50 bn/year from marketing,
97% of the company's revenue.
Sponsored search uses an auction: a pure competition for marketers trying to win access to consumers.
In other words, a competition for models of consumers: their likelihood of responding to the ad, and of determining the right bid for the item.
There are around 30 billion search requests a month. Perhaps a trillion events of history between search providers.
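As a rough illustration of the auction idea above, the sketch below ranks a few hypothetical ads by bid times a modeled click probability, and charges each winner just enough per click to keep its position (a generalized second-price rule). The slides do not say which pricing rule any search provider uses; the rule, the advertiser names, the bids, and the click probabilities are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    advertiser: str   # hypothetical advertiser name
    bid: float        # maximum price per click the advertiser offers
    p_click: float    # modeled probability that the consumer clicks


def run_auction(ads, slots):
    """Rank ads by expected revenue per impression (bid * p_click).

    Each winner pays, per click, the smallest amount that would still
    outrank the next ad: next ad's score / own p_click (0 if none below).
    This pricing rule is an assumption, not something the slides specify.
    """
    ranked = sorted(ads, key=lambda a: a.bid * a.p_click, reverse=True)
    results = []
    for i, ad in enumerate(ranked[:slots]):
        if i + 1 < len(ranked):
            nxt = ranked[i + 1]
            price = nxt.bid * nxt.p_click / ad.p_click
        else:
            price = 0.0
        results.append((ad.advertiser, round(price, 2)))
    return results


ads = [
    Ad("A", bid=2.00, p_click=0.05),   # score 0.10
    Ad("B", bid=1.00, p_click=0.15),   # score 0.15
    Ad("C", bid=3.00, p_click=0.02),   # score 0.06
]
print(run_auction(ads, slots=2))
```

Note that advertiser B wins the top slot with the lowest bid: a better model of the consumer (higher click probability) can beat a bigger budget, which is the sense in which the auction is a competition between models.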
Data Makes Everything Clearer?
Searches for MySpace
Searches for Facebook
...and based on Princeton search trends:
"This trend suggests that Princeton will have only half its current enrollment by 2018, and by 2021 it will have no students at all, ..."
http://techcrunch.com/2014/01/23/facebook-losing-users-princeton-losing-credibility/
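To see why this kind of forecast deserves a question mark, here is a minimal sketch that fits a straight line through the two quoted points (half enrollment in 2018, none in 2021) and extrapolates it. The straight-line shape is an assumption made for illustration; the slide does not say what curve the original forecast used.

```python
def line_through(p1, p2):
    """Return slope and intercept of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1


# The two points quoted above: (year, fraction of current enrollment).
slope, intercept = line_through((2018, 0.5), (2021, 0.0))


def enrollment_fraction(year):
    """Naive extrapolation of the fitted line to any year."""
    return slope * year + intercept


for year in (2015, 2018, 2021, 2024):
    print(year, round(enrollment_fraction(year), 2))
```

Pushed three years past 2021, the same line predicts a negative number of students, which is a quick sanity check that the trend cannot be taken at face value.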
Big Data Sources
It's All Happening On-line
Every:
Click
Ad impression
Billing event
Fast forward, pause, ...
Server request
Transaction
Network message
Fault
...
User Generated (Web & Mobile)
Internet of Things / M2M
Health/Scientific Computing

Graph Data
Lots of interesting data has a graph structure:
Social networks
Communication networks
Computer Networks
Road networks
Citations
Collaborations/Relationships
Some of these graphs can get quite large (e.g., Facebook* user graph)
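A minimal sketch of what "graph structure" means in code: a tiny, made-up social network stored as an adjacency list, with a breadth-first search that counts the fewest hops between people. The names and friendships are invented for illustration, and a graph the size of a real social network would need distributed tools rather than a Python dictionary.

```python
from collections import deque

# A tiny, made-up social network stored as an adjacency list.
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}


def hops_from(graph, start):
    """Breadth-first search: fewest edges from start to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


print(hops_from(friends, "ann"))
# {'ann': 0, 'bob': 1, 'cat': 1, 'dan': 2, 'eve': 3}
```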
What can you do with the data?
Crowdsourcing + physical modeling + sensing + data assimilation to produce:
From Alex Bayen, UCB
Big Data is so 2012
"...the sexy job in the next 10 years will be statisticians." (Hal Varian, Google Chief Economist)
"...the U.S. will need 140,000-190,000 predictive analysts and 1.5 million managers/analysts by 2018." (McKinsey Global Institute, June 2011)
New Data Science institutes being created or repurposed: NYU, Columbia, Washington, UCB, ...
New degree programs, courses, boot-camps:
e.g., at Berkeley: Stats, I-School, CS, Astronomy
One proposal (elsewhere) for an MS in Big Data Science
DATA SCIENCE: WHAT IS IT?
Data Science: an Emerging Field
O'Reilly Radar report
Some recent ML Competitions
Data Science: One Definition
Contrast: Databases
                 Databases                   Data Science
Data Value       Precious                    Cheap
Data Volume      Modest                      Massive
Examples         Bank records,               Online clicks,
                 Personnel records,          GPS logs,
                 Census,                     Tweets,
                 Medical records             Building sensor readings
Priorities       Consistency,                Speed,
                 Error recovery,             Availability,
                 Auditability                Query richness
Structured       Strongly (Schema)           Weakly or none (Text)
Properties       Transactions, ACID*         CAP* theorem (2/3),
                                             eventual consistency
Realizations     SQL                         NoSQL: Riak, Memcached,
                                             Apache River, MongoDB, CouchDB,
                                             HBase, Cassandra
* ACID = Atomicity, Consistency, Isolation and Durability
* CAP = Consistency, Availability, Partition Tolerance
Contrast: Databases
Databases: Querying the past
Data Science: Querying the future
"Business intelligence (BI) is the transformation of raw data into meaningful and
useful information for business analysis purposes. BI can handle enormous
amounts of unstructured data to help identify, develop and otherwise create new
strategic business opportunities" - Wikipedia
Contrast: Scientific Computing
[Figure: Image -> General purpose classifier -> "Supernova" / "Not". Nugent group / C3 LBL]
Scientific Modeling                  Data-Driven Approach
Physics-based models                 General inference engine replaces model
Problem-Structured                   Structure not related to problem
Mostly deterministic, precise        Statistical models handle true randomness
                                     and unmodeled complexity
Run on Supercomputer or              Run on cheaper computer Clusters (EC2)
High-end Computing Cluster
Contrast: Computational Science
CASP: A Worldwide, Biennial Protein Folding Contest
  Quark: Rich, Complex Energy Models; Faithful, Physical Simulation
  Raptor-X: Data-intensive, general ML models; Feature-based inference; Conditional Neural Fields
Brain Mapping: Allen Institute, White House, Berkeley
  Techniques (Massive ML): Principal Component Analysis, Independent Component Analysis, Sparse Coding, Spatial (Image) Filtering
Contrast: Machine Learning
Machine Learning                         Data Science
Develop new (individual) models          Explore many models, build and tune hybrids
Prove mathematical properties of models  Understand empirical properties of models
Improve/validate on a few, relatively    Develop/use tools that can handle
clean, small datasets                    massive datasets
Publish a paper                          Take action!
5-min break
Analyzing the Analysts
From Kandel, Paepcke, Hellerstein and Heer, "Enterprise Data Analysis and
Visualization: An Interview Study," IEEE VAST 2012
SOME DATA SCIENCE AT BERKELEY
Moore/Sloan Data Science Initiative
3 chosen Nationwide
5 Yrs, $38MM
The Berkeley Center is called BIDS
BIDS Space
Doe Memorial Library at the heart of the UC Berkeley campus will be the new home of the Berkeley Institute for Data Science (BIDS).
The campus has set aside 5,000 sq ft on the ground floor directly accessible from the building's north entrance and opposite to the historical Morrison Reading Room.
AMPLab
Other Berkeley Projects (used in the course)
IPython: Created by Fernando Perez (Brain Science). Probably the most widely used Data Science Environment.
BIDMach: A hardware-optimized (rooflined) toolkit. The fastest tool for general machine learning.
Caffe: Most popular toolkit for DNN (Deep Neural Network) modeling.
DOING DATA SCIENCE
Ben Fry's Model
1. Acquire
2. Parse
3. Filter
4. Mine
5. Represent
6. Refine
7. Interact
Jeff Hammerbacher's Model
1. Identify problem
2. Instrument data sources
3. Collect data
4. Prepare data (integrate, transform, clean, filter, aggregate)
5. Build model
6. Evaluate model
7. Communicate results
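The later steps of this model can be sketched in a few lines. The example below is only an illustration under stated assumptions: the (spend, sales) records are made up, the "model" is a plain least-squares line, and steps 1 and 2 (identifying the problem and instrumenting sources) happen outside the code.

```python
import statistics

# Step 3, collect: made-up (ad_spend, sales) records; None marks a missing value.
raw = [(1.0, 2.1), (2.0, 3.9), (None, 5.0), (3.0, 6.2),
       (4.0, 7.8), (5.0, 10.1), (6.0, None), (7.0, 14.2)]


def prepare(records):
    """Step 4, prepare: filter out records with missing fields."""
    return [(x, y) for x, y in records if x is not None and y is not None]


def build_model(train):
    """Step 5, build: fit a least-squares line y = a*x + b."""
    xs, ys = zip(*train)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    a = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx


def evaluate(model, test):
    """Step 6, evaluate: mean absolute error on held-out records."""
    a, b = model
    return statistics.mean([abs(y - (a * x + b)) for x, y in test])


clean = prepare(raw)
train, test = clean[:4], clean[4:]   # hold out the last records for evaluation
model = build_model(train)

# Step 7, communicate: report the fitted line and its held-out error.
print(f"sales ~ {model[0]:.2f} * spend + {model[1]:.2f}; "
      f"held-out MAE = {evaluate(model, test):.2f}")
```

Evaluating on records the model never saw during fitting is the design choice that matters here; scoring the line on its own training data would overstate how well it works.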
From the Trenches
Yahoo [KDD 2009, best app. paper]
eBay [SIGIR 2011, hon. mention]
Quantcast [2012]
Microsoft [CIKM 2014]
Data Scientist's Practice
[Diagram: Digging Around in Data; Clean, prep; Hypothesize; Model; Evaluate; Interpret; Large Scale Exploitation]
What's Hard about Data Science
Overcoming assumptions
Making ad-hoc explanations of data patterns
Overgeneralizing
Communication
Not checking enough (validate models, data pipeline
integrity, etc.)
Using statistical tests correctly
Prototype -> Production transitions
Data pipeline complexity (who do you ask?)
About the Course
Grading
Class Participation (M) and in-class labs (W) 20%
Midterm 20%
Final Project (in groups) 25%
Homeworks 30%
Bunnies 5%
Lab model: hands-on weekly labs here (145 Moffitt)
A bunny is a cuter, cuddlier species of quiz.
Normally due Monday before class (Friday this week).
Projects
Project teams should form by week 3.
Project proposals will be due 10/2.
You can choose a project topic, but we will also provide a list of suggested projects from around campus (from BIDS researchers). You need:
A clear problem statement
An accessible dataset
Modeling plan + appropriate tools
About the Course
Staff Contact:
Instructor: John Canny, lastname@berkeley.edu
Office hours: MW 2-3pm.
GSIs:
Charles Reiss: Tu/Th 5-6 at 283E Soda
firstname.lastname@berkeley.edu
Biye Jiang: Wed 10-11 at 283E Soda
firstletteroffirstnamelastname@berkeley.edu
Use Piazza for questions

Course Site, Readings
The main course site is in bCourses, titled "Introduction to Data Science".
Most work will be submitted there; some homeworks will be submitted on instructional machines with glookup.
Readings will be linked from the course site. Some are campus-only; configure proxy.lib.berkeley.edu in your browser to read at home.
If time permits
A data analysis exercise:
English Premier League Soccer: Everton vs. W. Bromwich Albion
Predict the outcome:
                     Everton Goals
W. Bromwich Goals    0     1     2+
0
1
2+
Analysis
What kinds of data will you use?
Almost anything is OK, except other predictions.
History: individual or pair-wise?
Team or players?
Numerical or text?
What kind of model will you build?
What assumptions are safe to make?
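One possible answer to "what kind of model," offered as a sketch rather than a recommended solution: treat each team's goal count as an independent Poisson variable and fill in the 0 / 1 / 2+ grid with probabilities. The two scoring rates below are hypothetical placeholders; in practice they would be estimated from match history, and the independence assumption is exactly the kind of thing the last question above asks you to examine.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals when goals follow a Poisson(rate) law."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def bucket_probs(rate):
    """Probabilities of scoring 0, 1, or 2+ goals."""
    p0, p1 = poisson_pmf(0, rate), poisson_pmf(1, rate)
    return [p0, p1, 1.0 - p0 - p1]


def outcome_grid(rate_everton, rate_wba):
    """3x3 grid: rows are W. Bromwich buckets, columns are Everton buckets.

    Assumes the two teams' goal counts are independent.
    """
    everton, wba = bucket_probs(rate_everton), bucket_probs(rate_wba)
    return [[w * e for e in everton] for w in wba]


# Hypothetical average goals per match, not estimated from real data.
grid = outcome_grid(rate_everton=1.5, rate_wba=1.0)

labels = ["0", "1", "2+"]
print("rows: W. Bromwich goals, columns: Everton goals")
print("      " + "  ".join(f"{label:>5}" for label in labels))
for label, row in zip(labels, grid):
    print(f"{label:>4}  " + "  ".join(f"{p:5.2f}" for p in row))
```

The nine cells sum to 1, so the grid is a full probability distribution over the outcomes in the exercise, and the most likely cell is the model's prediction.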
Predictions
PL Followers                            Non-Followers
                     Everton Goals                           Everton Goals
W. Bromwich Goals    0     1     2+     W. Bromwich Goals    0     1     2+
0                                       0
1                                       1
2+                                      2+
Answers on Monday!
Readings
This week's reading bunny is due by Friday.
Read next week's readings, and complete the bunny before class on Monday.
