Lecture 1
CS 194 Fall 2014
John Canny
Including notes from Michael Franklin
Dan Bruckner, Evan Sparks,
Shivaram Venkataraman
Outline
Data Science: Why all the excitement?
examples
Where does data come from
So what is Data Science
Doing Data Science
About the course
what we'll cover
data science first, big data later
requirements, workload etc.
Data Analysis Has Been Around for a While
Howard
Dresner
Detecting outbreaks
two weeks ahead
of CDC data
Why All the Excitement?
Data and Election 2012 (cont.)
that was just one of several ways that Mr. Obama's
campaign operations, some unnoticed by Mr. Romney's
aides in Boston, helped save the president's candidacy. In
Chicago, the campaign recruited a team of behavioral
scientists to build an extraordinarily sophisticated database
that allowed the Obama campaign not only to alter the
very nature of the electorate, making it younger and less
white, but also to create a portrait of shifting voter
allegiances. The power of this operation stunned Mr.
Romney's aides on election night, as they saw voters they
never even knew existed turn out in places like Osceola
County, Fla.
A history of the (Business) Internet: 1997
Pagerank: The web as a behavioral dataset
DB size = 50 billion sites
Overture
2002
Sponsored search
Google revenue around $50 bn/year from marketing,
97% of the company's revenue.
Searches for
MySpace
Searches for
Facebook
Data Makes Everything Clearer
http://techcrunch.com/2014/01/23/facebook-losing-users-princeton-losing-credibility/
Big Data Sources
It's All Happening On-line
User Generated (Web & Mobile)
Every:
Click
Ad impression
Billing event
Fast Forward, pause, ...
Server request
Transaction
Network message
Fault
What can you do with the data?
to produce:
Business intelligence (BI) is the transformation of raw data into meaningful and
useful information for business analysis purposes. BI can handle enormous
amounts of unstructured data to help identify, develop and otherwise create new
strategic business opportunities - Wikipedia
Contrast: Scientific Computing
Image -> General purpose classifier -> Supernova / Not
Quark: Rich, Complex Energy Models; Faithful, Physical Simulation
Raptor-X: Data-intensive, general ML models; Feature-based inference; Conditional Neural Fields
Techniques (Massive ML):
Principal Component Analysis
Independent Component Analysis
Sparse Coding
Spatial (Image) Filtering
Contrast: Machine Learning
Machine Learning | Data Science
Develop new (individual) models | Explore many models, build and tune hybrids
Prove mathematical properties of models | Understand empirical properties of models
Improve/validate on a few, relatively clean, small datasets | Develop/use tools that can handle massive datasets
Publish a paper | Take action!
5-min break
Analyzing the Analysts
From Kandel, Paepcke, Hellerstein and Heer, "Enterprise Data Analysis and
Visualization: An Interview Study", IEEE VAST 2012
SOME DATA SCIENCE AT BERKELEY
Moore/Sloan Data Science Initiative
3 chosen Nationwide
5 Yrs, $38MM
The Berkeley Center
is called BIDS
BIDS Space
Doe Memorial Library at
the heart of the UC
Berkeley campus will be
the new home of the
Berkeley Institute for Data
Science (BIDS).
Other Berkeley Projects (used in the course)
IPython:
Created by Fernando Perez
(Brain Science). Probably the
most widely used Data
Science Environment.
BIDMach: A hardware-optimized
(rooflined) toolkit. The fastest
tool for general machine learning.
DOING DATA SCIENCE
Ben Fry's Model
1. Acquire
2. Parse
3. Filter
4. Mine
5. Represent
6. Refine
7. Interact
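The stages above can be read as a small pipeline. The sketch below is an illustration only: the inline CSV sample, the function names and the text bar chart are all assumptions made for this example, not part of Fry's description. The final Interact stage is left out because it needs a live user interface.

```python
import csv
import io
from collections import Counter

# Hypothetical raw data standing in for a downloaded log file.
RAW = """user,event
alice,click
bob,click
alice,ad_impression
carol,
bob,click
"""

def acquire():
    """1. Acquire: obtain the raw data (here, a hard-coded string)."""
    return RAW

def parse(text):
    """2. Parse: give the raw text structure (a list of dicts)."""
    return list(csv.DictReader(io.StringIO(text)))

def filter_rows(rows):
    """3. Filter: drop records with a missing event type."""
    return [r for r in rows if r["event"]]

def mine(rows):
    """4. Mine: compute a simple summary, the number of events per type."""
    return Counter(r["event"] for r in rows)

def represent(counts):
    """5. Represent: choose a basic visual form, a text bar chart."""
    return [f"{name:<14}{'#' * n}" for name, n in counts.most_common()]

def refine(lines):
    """6. Refine: improve the representation, here by adding a title."""
    return ["Events by type"] + lines

chart = refine(represent(mine(filter_rows(parse(acquire())))))
print("\n".join(chart))
```

Keeping each stage as its own function makes it easy to swap one out, for example replacing the text chart with a plot, without touching the others.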
Jeff Hammerbacher's Model
1. Identify problem
2. Instrument data sources
3. Collect data
4. Prepare data (integrate, transform, clean, filter, aggregate)
5. Build model
6. Evaluate model
7. Communicate results
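As a rough illustration of the collect, build, evaluate and communicate steps above, the sketch below fits a straight line to synthetic data and scores it on held-out points. Everything in it is an assumption chosen for the example: the data is generated rather than collected, the model is a plain least-squares line, and the function names are hypothetical.

```python
import random

def collect_data(n=200, seed=0):
    """Collect data: synthetic (x, y) pairs, roughly y = 3x + 2 plus noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 3 * x + 2 + rng.gauss(0, 1)) for x in xs]

def build_model(train):
    """Build model: ordinary least-squares fit of a line y = a*x + b."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def evaluate_model(model, test):
    """Evaluate model: mean squared error on data not used for fitting."""
    a, b = model
    return sum((y - (a * x + b)) ** 2 for x, y in test) / len(test)

data = collect_data()
train, test = data[:150], data[150:]   # hold out the last 50 points
model = build_model(train)
mse = evaluate_model(model, test)

# Communicate results: a one-line summary a reader can act on.
print(f"slope={model[0]:.2f} intercept={model[1]:.2f} held-out MSE={mse:.2f}")
```

Evaluating on points the model never saw is the key design choice here; scoring on the training data would make almost any model look better than it is.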
From the Trenches
Yahoo [KDD 2009, best app. paper]
Quantcast [2012]
Clean, prep
Evaluate
Interpret
What's Hard about Data Science
Overcoming assumptions
Making ad-hoc explanations of data patterns
Overgeneralizing
Communication
Not checking enough (validate models, data pipeline integrity, etc.)
Using statistical tests correctly
Prototype -> Production transitions
Data pipeline complexity (who do you ask?)
About the Course
Grading
Class Participation (M) and in-class labs (W) 20%
Midterm 20%
Final Project (in groups) 25%
Homeworks 30%
Bunnies 5%
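To see how the components combine, here is a minimal sketch of the arithmetic. It assumes a plain weighted average with every component scored from 0 to 100, which the breakdown above does not confirm; the component keys and the example scores are hypothetical.

```python
# Weights taken from the grading breakdown above.
WEIGHTS = {
    "participation_and_labs": 0.20,
    "midterm": 0.20,
    "final_project": 0.25,
    "homeworks": 0.30,
    "bunnies": 0.05,
}

def course_score(scores):
    """Weighted average of component scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Sanity check: the weights should cover the whole grade.
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9

# Hypothetical student scores, for illustration only.
example = {
    "participation_and_labs": 90,
    "midterm": 80,
    "final_project": 85,
    "homeworks": 95,
    "bunnies": 100,
}
print(round(course_score(example), 2))
```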
Analysis
What kinds of data will you use?
Almost anything is OK, except other predictions.
History: individual or pair-wise?
Team or players?
Numerical or text?
What kind of model will you build?
What assumptions are safe to make?
Predictions
[Two prediction grids, one for PL Followers and one for Non-Followers. Columns: Everton Goals (0, 1, 2+). Rows: W. Bromwich Goals (0, 1, 2+). Cells left blank to be filled in.]
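One possible way to fill in a grid like this is to assume each team's goal count follows an independent Poisson distribution. This is only a sketch of one modeling choice, not the course's answer: the independence assumption and the scoring rates below (`EVERTON_RATE`, `WBA_RATE`) are hypothetical placeholders, not estimates from real match data.

```python
import math

# Hypothetical average goals per match; replace with rates estimated from data.
EVERTON_RATE = 1.4
WBA_RATE = 1.1

def poisson_pmf(k, rate):
    """Probability of exactly k goals under a Poisson(rate) model."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def bucket_probs(rate):
    """Probabilities of scoring 0, 1, or 2+ goals."""
    p0, p1 = poisson_pmf(0, rate), poisson_pmf(1, rate)
    return [p0, p1, 1.0 - p0 - p1]

def score_grid(everton_rate, wba_rate):
    """3x3 grid: rows are W. Bromwich goals, columns are Everton goals."""
    everton = bucket_probs(everton_rate)
    wba = bucket_probs(wba_rate)
    return [[w * e for e in everton] for w in wba]

labels = ["0", "1", "2+"]
grid = score_grid(EVERTON_RATE, WBA_RATE)
print("W.Brom \\ Everton  " + "  ".join(f"{c:>5}" for c in labels))
for label, row in zip(labels, grid):
    print(f"{label:>17}  " + "  ".join(f"{p:5.3f}" for p in row))
```

The nine cells sum to 1, so the grid is a full probability distribution over the bucketed scorelines. Whether independence between the two teams' scores is a safe assumption is exactly the kind of question the Analysis slide raises.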
Answers on Monday!
Readings
This week's reading bunny by Friday.