
COM 578
Empirical Methods in Machine Learning and Data Mining

Rich Caruana

http://www.cs.cornell.edu/Courses/cs578/2007fa


Today

• Dull organizational stuff
  – Course Summary
  – Grading
  – Office hours
  – Homework
  – Final Project
• Fun stuff
  – Historical Perspective on Statistics, Machine Learning, and Data Mining

Staff, Office Hours, …

• Rich Caruana: Upson Hall 4157, Tue 4:30-5:00pm and Wed 10:30-11:00am, caruana@cs.cornell.edu
• TA: Daria Sorokina: Upson Hall 5156, hours TBA, daria@cs.cornell.edu
• TA: Ainur Yessenalina: Upson Hall 4156, hours TBA, ainur@cs.cornell.edu
• TA: Alex Niculescu-Mizil: Upson Hall 5154, hours TBA, alexn@cs.cornell.edu
• Admin: Melissa Totman: Upson Hall 4147, M-F 9:00am-4:00pm


Topics

• Decision Trees
• K-Nearest Neighbor
• Artificial Neural Nets
• Support Vector Machines
• Association Rules
• Clustering
• Boosting/Bagging
• Cross Validation
• Performance Metrics
• Data Transformation
• Feature Selection
• Missing Values
• Case Studies:
  – Medical prediction
  – Protein folding
  – Autonomous vehicle navigation

~30% overlap with CS478

Grading

• 4 credit course
• 25% take-home mid-term (late October)
• 25% open-book final (????)
• 30% homework assignments (3 assignments)
• 20% course project (teams of 1-4 people)
• late penalty: one letter grade per day
• 90-100 = A-, A, A+
• 80-90 = B-, B, B+
• 70-80 = C-, C, C+


Homeworks

• short programming and experiment assignments
  – e.g., implement backprop and test on a dataset
  – goal: get familiar with a variety of learning methods
• two or more weeks to complete each assignment
• C, C++, Java, Perl, shell scripts, or Matlab
• must be done individually
• hand in code with summary and analysis of results
• emphasis on understanding and analysis of results, not generating a pretty report
• short course in Unix and writing shell scripts

Project

• Data Mining Mini Competition
• Train best model on problem(s) we give you
  – decision trees
  – k-nearest neighbor
  – artificial neural nets
  – SVMs
  – bagging, boosting, model averaging, ...
• Given train and test sets
  – Have target values on train set
  – No target values on test set
  – Send us predictions and we calculate performance
  – Performance on test sets is part of project grade
• Due before exams & study period


Text Books

• Required Text:
  – Machine Learning by Tom Mitchell
• Optional Texts:
  – Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie, Tibshirani, and Friedman
  – Pattern Classification, 2nd ed., by Richard Duda, Peter Hart, & David Stork
  – Pattern Recognition and Machine Learning by Chris Bishop
  – Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber
• Selected papers

Fun Stuff


Statistics, Machine Learning, and Data Mining

Past, Present, and Future

Once upon a time...

before statistics


Pre-Statistics: Ptolemy-1850

• First "Data Sets" created
  – Positions of Mars in orbit: Tycho Brahe (1546-1601)
  – Star catalogs
    • Tycho catalog had 777 stars with 1-2 arcmin precision
    • Messier catalog (100+ "dim fuzzies" that look like comets)
  – Triangulation of meridian in France
• Not just raw data - processing is part of data
  – Tychonic System: anti-Copernican, many epicycles
• No theory of errors - human judgment
  – Kepler knew Tycho's data was never in error by 8 arcmin
• Few models of data - just learning about modeling
  – Kepler's Breakthrough: Copernican model and 3 laws of orbits

Pre-Statistics: 1790-1850

• The Metric System:
  – uniform system of weights and measures
• Meridian from Dunkirk to Barcelona through Paris
  – triangulation
• Meter = Distance (pole to equator)/10,000,000
• Most accurate survey made at that time
• 1000's of measurements spanning 10-20 years!
• Data is available in a 3-volume book that analyses it
• No theory of error:
  – surveyors use judgment to "correct data" for better consistency and accuracy!


Statistics: 1850-1950

• Data collection starts to separate from analysis
• Hand-collected data sets
  – Physics, Astronomy, Agriculture, ...
  – Quality control in manufacturing
  – Many hours to collect/process each data point
• Usually Small: 1 to 1000 data points
• Low dimension: 1 to 10 variables
• Exist only on paper (sometimes in text books)
• Experts get to know data inside out
• Data is clean: human has looked at each point

Statistics: 1850-1950

• Calculations done manually
  – manual decision making during analysis
  – Mendel's genetics
  – human calculator pools for "larger" problems
• Simplified models of data to ease computation
  – Gaussian, Poisson, …
  – Keep computations tractable
• Get the most out of precious data
  – careful examination of assumptions
  – outliers examined individually


Statistics: 1850-1950

• Analysis of errors in measurements
• What is the most efficient estimator of some value?
• How much error in that estimate?
• Hypothesis testing:
  – is this mean larger than that mean?
  – are these two populations different?
• Regression:
  – what is the value of y when x=xi or x=xj?
• How often does some event occur?
  – p(fail(part1)) = p1; p(fail(part2)) = p2; p(crash(plane)) = ? (see the worked example after this slide)
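
Purely as an illustration of that last question (not an answer given in the slides): if one assumes the two parts fail independently and that the plane crashes whenever either part fails, the combination is a one-liner; the numbers below are invented.

# Hedged illustration, not from the slides: assumes independent part failures
# and that the plane crashes if either part fails.
def p_crash(p1: float, p2: float) -> float:
    """P(crash) = 1 - P(neither part fails), under the stated assumptions."""
    return 1.0 - (1.0 - p1) * (1.0 - p2)

print(p_crash(0.001, 0.002))  # ~0.003 for small, independent failure probabilities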

Statistics would look very different if it had been born after the computer instead of 100 years before the computer


Statistics meets Computers

Machine Learning: 1950-2000...

• Medium size data sets become available
  – 100 to 100,000 records
  – Higher dimension: 5 to 250 dimensions (more if vision)
  – Fit in memory
• Exist in computer, usually not on paper
• Too large for humans to read and fully understand
• Data not clean
  – Missing values, errors, outliers
  – Many attribute types: boolean, continuous, nominal, discrete, ordinal
  – Humans can't afford to understand/fix each point


Machine Learning: 1950-2000...

• Computers can do very complex calculations on medium size data sets
• Models can be much more complex than before
• Empirical evaluation methods instead of theory
  – don't calculate expected error, measure it from sample
  – cross validation (see the sketch after this slide)
  – e.g., 95% confidence interval from data, not Gaussian model
• Fewer statistical assumptions about data
• Make machine learning as automatic as possible
• Don't know right model => OK to have multiple models (vote them)
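
Not from the original slides: a minimal sketch of "measure the error from the sample" via k-fold cross validation. The toy dataset, the 1-nearest-neighbor model, the fold count, and the crude fold-based interval are all arbitrary illustrative choices.

# Illustrative k-fold cross validation: estimate error from the sample itself.
import random
import statistics

def one_nn_predict(train, x):
    """Predict the label of x as the label of its nearest training point."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def k_fold_error(data, k=5, seed=0):
    """Return per-fold error rates measured on held-out folds."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        test = folds[i]
        train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        wrong = sum(one_nn_predict(train, x) != y for x, y in test)
        errors.append(wrong / len(test))
    return errors

if __name__ == "__main__":
    # Toy 1-D dataset: label is 1 when x > 0.5.
    data = [(x / 100.0, int(x / 100.0 > 0.5)) for x in range(100)]
    errs = k_fold_error(data, k=5)
    mean = statistics.mean(errs)
    sd = statistics.stdev(errs)
    # Crude 95% interval from the fold-to-fold spread, not from a Gaussian model of the data.
    print(f"CV error: {mean:.3f} +/- {1.96 * sd / len(errs) ** 0.5:.3f}")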

Machine Learning: 1950-2000...

• Regression
• Multivariate Adaptive Regression Splines (MARS)
• Linear perceptron
• Artificial neural nets
• Decision trees
• K-nearest neighbor
• Support Vector Machines (SVMs)
• Ensemble Methods: Bagging and Boosting
• Clustering


ML: Pneumonia Risk Prediction

[Figure: neural net predicting pneumonia risk from patient attributes (chest X-ray, RBC count, blood pressure, albumin, blood pO2, white count, age, gender), grouped into pre-hospital and in-hospital attributes]

ML: Autonomous Vehicle Navigation

[Figure: neural net predicting steering direction]

Can't yet buy cars that drive themselves, and few hospitals use artificial neural nets yet to make critical decisions about patients.


Machine Learning: 1950-2000...

• New Problems:
  – Can't understand many of the models
  – Less opportunity for human expertise in the process
  – Good performance in lab doesn't necessarily mean good performance in practice
  – Brittle systems: work well on typical cases but often break on rare cases
  – Can't handle heterogeneous data sources


Machine Learning Leaves the Lab

Computers get Bigger/Faster

Data gets Bigger/Faster, too

Data Mining: 1995-20??

• Huge data sets collected fully automatically
  – large scale science: genomics, space probes, satellites
  – Cornell's Arecibo Radio Telescope Project:
    • terabytes per day
    • petabytes over life of project
    • too much data to move over internet -- they use FedEx!


Protein Folding

Data Mining: 1995-20??

• Huge data sets collected fully automatically
  – large scale science: genomics, space probes, satellites
  – consumer purchase data
  – web: > 500,000,000 pages of text
  – clickstream data (Yahoo!: terabytes per day!)
  – many heterogeneous data sources
• High dimensional data
  – "low" of 45 attributes in astronomy
  – 100's to 1000's of attributes common
  – linkage makes many 1000's of attributes possible


Data Mining: 1995-20??

• Data exists only on disk (can't fit in memory)
• Experts can't see even modest samples of data
• Calculations done completely automatically
  – large computers
  – efficient (often simplified) algorithms
  – human intervention difficult
• Models of data
  – complex models possible
  – but complex models may not be affordable (Google)
• Get something useful out of massive, opaque data
  – data "tombs"

Data Mining: 1990-20??

• What customers will respond best to this coupon?
• Who is it safe to give a loan to?
• What products do consumers purchase in sets?
• What is the best pricing strategy for products?
• Are there unusual stars/galaxies in this data?
• Do patients with gene X respond to treatment Y?
• What job posting best matches this employee?
• How do proteins fold?


Data Mining: 1995-20??

• New Problems:
  – Data too big
  – Algorithms must be simplified and very efficient (linear in size of data if possible, one scan is best! see the sketch after this slide)
  – Reams of output too large for humans to comprehend
  – Very messy, uncleaned data
  – Garbage in, garbage out
  – Heterogeneous data sources
  – Ill-posed questions
  – Privacy
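
Not from the original slides: a minimal sketch of what "one scan" can buy, assuming a simple streaming setting. A single pass keeps running statistics in constant memory using Welford's online update; the data stream here is made up.

# Illustrative single-scan (streaming) statistics: one pass, constant memory.
def streaming_mean_variance(stream):
    """Consume a stream of numbers once, returning (count, mean, variance)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count          # running mean
        m2 += delta * (x - mean)       # running sum of squared deviations
    variance = m2 / count if count else 0.0
    return count, mean, variance

if __name__ == "__main__":
    # Pretend this generator is a huge file we can only afford to read once.
    stream = (i % 7 for i in range(1_000_000))
    n, mean, var = streaming_mean_variance(stream)
    print(f"n={n}, mean={mean:.3f}, variance={var:.3f}")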

Statistics, Machine Learning, and Data Mining

• Historic revolution and refocusing of statistics
• Statistics, Machine Learning, and Data Mining merging into a new multi-faceted field
• Old lessons and methods still apply, but are used in new ways to do new things
• Those who don't learn the past will be forced to reinvent it
• => Computational Statistics, ML, DM, …


Change in Scientific Methodology

Traditional:
• Formulate hypothesis
• Design experiment
• Collect data
• Analyze results
• Review hypothesis
• Repeat/Publish

New:
• Design large experiment
• Collect large data
• Put data in large database
• Formulate hypothesis
• Evaluate hypothesis on database
• Run limited experiments to drive nail in coffin
• Review hypothesis
• Repeat/Publish

ML/DM Here to Stay

• Will infiltrate all areas of science, engineering, public policy, marketing, economics, …
• Adaptive methods as part of engineering process
  – Engineering from simulation
  – Wright brothers on steroids!
• But we can't manually verify models are right!
• Can we trust results of automatic learning/mining?

