36-350, Data Mining - Introduction

4/6/13
36-350 Data Mining (Fall 2009)
Introduction to the Course

(as prepared and roughly as given)
What Is "Data Mining"?

- Extracting useful predictive patterns from large collections of data
- a.k.a. "Knowledge discovery in data bases"
- Examples:
  - Commercial banking: FICO, automated mortgage underwriting; fraud detection; how much is that house worth?
  - Information retrieval: search engines most prominently
  - Recommendation systems: Firefly (of happy memory), Amazon, LibraryThing
  - Financial speculation: statistical arbitrage, LTCM, algorithmic trading
  - Marketing: identifying demographic sub-groups, targeted advertising and promotions; rewards programs
  - Logistics: getting the things you'll want to sell in place when you want to sell them
  - Engineering: what do customers actually use your products for? how do they fail? can you predict failures before they happen?
  - Biology: gene identification, disease identification
  - Insurance/HMOs: how much to charge whom, how much to pay
  - Policing: when and where will certain kinds of crimes happen?

Why now?

- Precursors/impulses go back a long time
  - "We have always been an information society": control revolution of the 19th century
  - Industrial revolution: all this stuff, and people, to keep track of
  - Technologies of keeping-track: forms, standards, job descriptions/requirements, schedules, exams, inspections, categories, reports, files, "your permanent record"
  - machine-readable and -processable data: Hollerith machines (from automatic looms), leading to IBM and the rest of the pre-computer information-processing industry
  - statistics: knowing/finding resources, finding patterns, making plans (originally, a "statistician" was someone who advised a state about its resources and those of its enemies)
- Limited by cost: collecting, storing, examining data all expensive
  - especially when it must be done by hand
    - people are slow
    - people are expensive (time, training)
    - people don't scale (can't just copy programs)
    - people can't explain themselves
  - and when data have to be specially made rather than a by-product of normal activity
- Computers drastically lower the cost of collecting, storing, accessing and examining data
  - think of drawing plots if nothing else!
  - plus you record transactions on the computer anyway
- Data-mining is about automating parts of the analysis process
  - look for patterns (what kind of pattern? look how?)
  - preferably interesting ones (interesting to who? how do you tell?)
  - and check that they're not just flukes (for example...)
- Clinical vs. actuarial judgment as proof-of-concept
  - psychiatrists are worse at predicting patient outcomes than simple decision rules
  - ... but it turns out no profession is better than simple rules (though some are as good)
  - what to do when there are no good professionals?

www.stat.cmu.edu/~cshalizi/350/lectures/00/00.html

Sources and Methods

- Exploratory data analysis, descriptive statistics, visualization
- Inferential statistics, especially non-parametric methods
  - Expensive analyses meant it was worth thinking very hard about your models first
  - but also encouraged totally unrealistic simplifying assumptions, especially linear dependence and Gaussian distributions
  - we don't have to make those assumptions (so much) any more
- Machine learning: blurs into inferential statistics
- Optimization
- Databases
- We are going to skimp on the last two
  - Extremely important
  - Huge issues arise with really big data
    - with 2 million customers, there are 2 trillion customer pairs; finding the closest match takes 23 days at 1 microsecond per pair
  - but we can't cover everything, and this is a statistics class, not computer science
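
The customer-pair figure above can be checked with a few lines of arithmetic. This is only a sketch: the function name `closest_match_cost` and its parameter names are made up for illustration, and it assumes a brute-force search that looks at every unordered pair exactly once.

```python
from math import comb

def closest_match_cost(n_customers, seconds_per_pair=1e-6):
    """Return (number of unordered pairs, days needed to examine them all).

    Hypothetical helper: assumes brute force, one comparison per pair.
    """
    n_pairs = comb(n_customers, 2)            # n * (n - 1) / 2
    total_seconds = n_pairs * seconds_per_pair
    return n_pairs, total_seconds / 86_400    # 86,400 seconds in a day

pairs, days = closest_match_cost(2_000_000)
print(f"{pairs:,} pairs, about {days:.1f} days")
# prints: 1,999,999,000,000 pairs, about 23.1 days
```

The cost grows with the square of the number of customers, so doubling the customer base roughly quadruples the running time. That is the sort of issue the database and optimization material would address.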
Some Themes

- Choice of representation/abstraction is important
- Choices within method are important
- Methods and representations are interdependent
- Choices have to be justified as helping you meet specific goals; beware of optimality criteria!
- The importance of not fooling yourself and/or programming the machine to fool you: using predictions and perturbations
- Technical theme: bias/variance or accuracy/precision trade-off
- Technical theme: adaptability is a partial substitute for knowledge
- Technical theme: successive approximation/iterative algorithms

Waste, Fraud and Abuse

- Any new technology produces con-artists, quacks, and excess ambition
- Will try to point out
  - some ways data mining can go wrong
  - situations where it won't work
  - situations where people make impossible claims for it
  - things it shouldn't be used for, period
- Institutional context in which you mine data
  - Serious data collection happens within big organizations, and data rarely leaves them
    - logistics
    - privacy
    - competitive advantage
  - Keeping track of what the organization is trying to do (e.g., "make arrests" vs. "reduce crime")
  - Deciding whether you want any part of what is being attempted (e.g., many businesses would like to identify gullible customers)
