
DAMA-NCR

Tuesday, November 13, 2001


Laura Squier
Technical Consultant
lsquier@spss.com
What is Data Mining?
Agenda
What Data Mining IS and IS NOT
Steps in the Data Mining Process
CRISP-DM
Explanation of Models
Examples of Data Mining Applications
Questions
The Evolution of Data Analysis

Evolutionary Step: Data Collection (1960s)
Business Question: "What was my total revenue in the last five years?"
Enabling Technologies: Computers, tapes, disks
Product Providers: IBM, CDC
Characteristics: Retrospective, static data delivery

Evolutionary Step: Data Access (1980s)
Business Question: "What were unit sales in New England last March?"
Enabling Technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
Product Providers: Oracle, Sybase, Informix, IBM, Microsoft
Characteristics: Retrospective, dynamic data delivery at record level

Evolutionary Step: Data Warehousing & Decision Support (1990s)
Business Question: "What were unit sales in New England last March? Drill down to Boston."
Enabling Technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
Product Providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
Characteristics: Retrospective, dynamic data delivery at multiple levels

Evolutionary Step: Data Mining (Emerging Today)
Business Question: "What's likely to happen to Boston unit sales next month? Why?"
Enabling Technologies: Advanced algorithms, multiprocessor computers, massive databases
Product Providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
Characteristics: Prospective, proactive information delivery
Results of Data Mining Include:
Forecasting what may happen in the future
Classifying people or things into groups by recognizing patterns
Clustering people or things into groups based on their attributes
Associating what events are likely to occur together
Sequencing what events are likely to lead to later events
Data Mining Is Not
Brute-force crunching of bulk data
"Blind" application of algorithms
Going to find relationships where none exist
Presenting data in different ways
A database intensive task
A difficult to understand technology requiring an advanced degree in computer science
Data Mining Is
A hot buzzword for a class of techniques that find patterns in data
A user-centric, interactive process which leverages analysis technologies and computing power
A group of techniques that find relationships that have not previously been discovered
Not reliant on an existing database
A relatively easy task that requires knowledge of the business problem/subject matter expertise
Data Mining versus OLAP
OLAP - On-line Analytical Processing
Provides you with a very good view of what is happening, but cannot predict what will happen in the future or why it is happening
Data Mining Versus Statistical Analysis
Data Analysis
Tests for statistical correctness of models
Are statistical assumptions of models correct?
E.g., is the R-squared good?
Hypothesis testing
Is the relationship significant?
Use a t-test to validate significance
Tends to rely on sampling
Techniques are not optimised for large amounts of data
Requires strong statistical skills
Data Mining
Originally developed to act as expert systems to solve problems
Less interested in the mechanics of the technique
If it makes sense then let's use it
Does not require assumptions to be made about data
Can find patterns in very large amounts of data
Requires understanding of data and business problem
Examples of What People are Doing with Data Mining:
Fraud/Non-Compliance
Anomaly detection
Isolate the factors that lead to fraud, waste and abuse
Target auditing and investigative efforts more effectively
Credit/Risk Scoring
Intrusion detection
Parts failure prediction
Recruiting/Attracting customers
Maximizing profitability (cross selling, identifying profitable customers)
Service Delivery and Customer Retention
Build profiles of customers likely to use which services
Web Mining
How Can We Do Data Mining?
By Utilizing the CRISP-DM Methodology
a standard process
existing data
software technologies
situational expertise
Why Should There be a Standard Process?
Framework for recording experience
Allows projects to be replicated
Aid to project planning and management
"Comfort factor" for new adopters
Demonstrates maturity of Data Mining
Reduces dependency on "stars"
The data mining process must be reliable and repeatable by people with little data mining background.
Process
Standardization
CRISP-DM:
CRoss Industry Standard Process for Data Mining
Initiative launched Sept. 1996
SPSS/ISL, NCR, Daimler-Benz, OHRA
Funding from European Commission
Over 200 members of the CRISP-DM SIG worldwide
DM Vendors - SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, Magnify, ...
System Suppliers / consultants - Cap Gemini, ICL Retail, Deloitte & Touche, ...
End Users - BT, ABB, Lloyds Bank, AirTouch, Experian, ...
CRISP-DM
Non-proprietary
Application/Industry neutral
Tool neutral
Focus on business issues
As well as technical analysis
Framework for guidance
Experience base
Templates for Analysis
The CRISP-DM Process Model
Why CRISP-DM?
The data mining process must be reliable and repeatable by people with little data mining skills
CRISP-DM provides a uniform framework for
guidelines
experience documentation
CRISP-DM is flexible to account for differences
Different business/agency problems
Different data
[Figure: CRISP-DM process model linking Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment]

Phases and Tasks

Business Understanding
Determine Business Objectives: Background; Business Objectives; Business Success Criteria
Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Data Understanding
Collect Initial Data: Initial Data Collection Report
Describe Data: Data Description Report
Explore Data: Data Exploration Report
Verify Data Quality: Data Quality Report

Data Preparation
Phase outputs: Data Set; Data Set Description
Select Data: Rationale for Inclusion / Exclusion
Clean Data: Data Cleaning Report
Construct Data: Derived Attributes; Generated Records
Integrate Data: Merged Data
Format Data: Reformatted Data

Modeling
Select Modeling Technique: Modeling Technique; Modeling Assumptions
Generate Test Design: Test Design
Build Model: Parameter Settings; Models; Model Description
Assess Model: Model Assessment; Revised Parameter Settings

Evaluation
Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
Review Process: Review of Process
Determine Next Steps: List of Possible Actions; Decision

Deployment
Plan Deployment: Deployment Plan
Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
Produce Final Report: Final Report; Final Presentation
Review Project: Experience Documentation

Phases in the DM Process: CRISP-DM
Phases in the DM Process (1 & 2)
Business Understanding:
Statement of Business Objective
Statement of Data Mining objective
Statement of Success Criteria
Data Understanding:
Explore the data and verify the quality
Find outliers
Phases in the DM Process (3)
Data preparation:
Usually takes over 90% of our time
Collection
Assessment
Consolidation and Cleaning: table links, aggregation level, missing values, etc.
Data selection: active role in ignoring non-contributory data? outliers?
Use of samples
Visualization tools
Transformations - create new variables
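
The cleaning and transformation steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the field names (income, debt), the sample records, and the choice to fill a missing value with the mean are all hypothetical and do not come from the presentation.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"income": 38000, "debt": 9500},
    {"income": None, "debt": 2000},
    {"income": 52000, "debt": 26000},
]

def prepare(rows):
    """Fill missing incomes with the mean and derive a new variable."""
    known = [r["income"] for r in rows if r["income"] is not None]
    fill_value = mean(known)  # assumption: mean imputation is acceptable here
    prepared_rows = []
    for r in rows:
        income = r["income"] if r["income"] is not None else fill_value
        # Transformation: create a new variable from two existing ones.
        prepared_rows.append({**r, "income": income, "debt_ratio": r["debt"] / income})
    return prepared_rows

prepared = prepare(records)
```

Whether to impute, drop, or flag missing values depends on the business problem; the mean is used here only to keep the sketch short.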
Phases in the DM Process (4)
Model building
Selection of the modeling techniques is based upon the data mining objective
Modeling is an iterative process - different for supervised and unsupervised learning
May model for either description or prediction
Types of Models
Prediction Models for Predicting and Classifying
Regression algorithms (predict numeric outcome): neural networks, rule induction, CART (OLS regression, GLM)
Classification algorithms (predict symbolic outcome): CHAID, C5.0 (discriminant analysis, logistic regression)
Descriptive Models for Grouping and Finding Associations
Clustering/Grouping algorithms: K-means, Kohonen
Association algorithms: apriori, GRI
Neural Network
[Figure: neural network with an input layer, a hidden layer, and an output]
Neural Networks
Description
Difficult interpretation
Tends to 'overfit' the data
Extensive amount of training time
A lot of data preparation
Works with all data types
Rule Induction
Description
Produces decision trees:
income < $40K
  job > 5 yrs then good risk
  job < 5 yrs then bad risk
income > $40K
  high debt then bad risk
  low debt then good risk
Or Rule Sets:
Rule #1 for good risk:
  if income > $40K
  if low debt
Rule #2 for good risk:
  if income < $40K
  if job > 5 years
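
The decision tree above translates directly into nested conditions. A minimal sketch: the function and parameter names are hypothetical, and the slide does not say what happens at exactly $40K or exactly 5 years, so the boundary handling below is an assumption.

```python
def credit_risk(income, years_in_job, high_debt):
    """Apply the slide's example tree; returns "good" or "bad"."""
    if income < 40_000:
        # Lower-income branch: job tenure decides.
        # Assumption: exactly 5 years is treated as "bad".
        return "good" if years_in_job > 5 else "bad"
    # Higher-income branch: debt level decides.
    # Assumption: an income of exactly $40K falls in this branch.
    return "bad" if high_debt else "good"
```

For example, `credit_risk(30_000, 6, high_debt=True)` follows the lower-income branch and ignores debt, which matches Rule #2 for good risk.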
[Figure: decision tree for Credit ranking (1=default); each node lists category, percent, and n]
Root: Bad 52.01% (168), Good 47.99% (155), Total 100.00% (323)
Split on Paid Weekly/Monthly (P-value=0.0000, Chi-square=179.6665, df=1)
  Weekly pay: Bad 86.67% (143), Good 13.33% (22), Total 51.08% (165)
  Split on Age Categorical (P-value=0.0000, Chi-square=30.1113, df=1)
    Young (< 25); Middle (25-35): Bad 90.51% (143), Good 9.49% (15), Total 48.92% (158)
    Old (> 35): Bad 0.00% (0), Good 100.00% (7), Total 2.17% (7)
  Monthly salary: Bad 15.82% (25), Good 84.18% (133), Total 48.92% (158)
  Split on Age Categorical (P-value=0.0000, Chi-square=58.7255, df=1)
    Young (< 25): Bad 48.98% (24), Good 51.02% (25), Total 15.17% (49)
    Split on Social Class (P-value=0.0016, Chi-square=12.0388, df=1)
      Management; Clerical: Bad 0.00% (0), Good 100.00% (8), Total 2.48% (8)
      Professional: Bad 58.54% (24), Good 41.46% (17), Total 12.69% (41)
    Middle (25-35); Old (> 35): Bad 0.92% (1), Good 99.08% (108), Total 33.75% (109)
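
The split statistic reported for the first split can be checked by hand from the node counts. The sketch below assumes the reported chi-square is a likelihood-ratio statistic; the slides do not say which form the tool used, and the function name is hypothetical. With the weekly/monthly counts it gives a value close to the 179.67 shown in the tree.

```python
import math

def likelihood_ratio_chi2(table):
    """Likelihood-ratio (G) statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed == 0:
                continue  # a zero cell contributes nothing to the sum
            expected = row_totals[i] * col_totals[j] / n
            g += observed * math.log(observed / expected)
    return 2.0 * g

# Rows: weekly pay, monthly salary. Columns: Bad, Good (counts from the tree).
first_split = [[143, 22], [25, 133]]
g = likelihood_ratio_chi2(first_split)
```

Nodes with an empty cell may be handled differently by the original tool, so only the first split is checked here.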
Rule Induction
Description
Intuitive output
Handles all forms of numeric data, as well as non-numeric (symbolic) data
C5 Algorithm a special case of rule induction
Target variable must be symbolic
Apriori
Description
Seeks association rules in dataset
'Market basket' analysis
Sequence discovery
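
Association rules are usually judged by support (how often the items occur together) and confidence (how often the right-hand item appears when the left-hand item does). The sketch below is a simplified pair-level version of market basket analysis, not the full Apriori candidate-generation procedure; the baskets, thresholds, and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def pair_rules(baskets, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) for item pairs above both thresholds."""
    items = sorted(set().union(*baskets))
    found = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, baskets)
        if pair_support < min_support:
            continue  # infrequent pairs are pruned before rules are formed
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs}, baskets)
            if confidence >= min_confidence:
                found.append((lhs, rhs, pair_support, confidence))
    return found

rules = pair_rules(baskets)
```

Apriori proper extends this idea to larger itemsets, building candidates of size k only from frequent itemsets of size k-1.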
Kohonen Network
Description
Unsupervised: seeks to describe dataset in terms of natural clusters of cases
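
A Kohonen network is not reproduced here. As a simpler illustration of the same unsupervised idea, the sketch below uses K-means, the other clustering algorithm named under Types of Models, on one-dimensional data. The values, starting centers, and function name are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain K-means on numbers: assign to nearest center, then re-average."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centers=[0, 5])
```

No target variable is supplied; the two groups emerge from the data alone, which is what distinguishes descriptive models from the prediction models above.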
Phases in the DM Process (5)
Model Evaluation
Evaluation of model: how well it performed on test data
Methods and criteria depend on model type:
e.g., coincidence matrix with classification models, mean error rate with regression models
Interpretation of model: important or not, easy or hard depends on algorithm
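
The two evaluation criteria named above can be computed directly from test-set predictions. A minimal sketch: the function names are hypothetical, and mean absolute error is used as one possible reading of "mean error rate" for regression models.

```python
from collections import Counter

def coincidence_matrix(actual, predicted):
    """Count each (actual, predicted) pair; also known as a confusion matrix."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Share of test cases where the predicted class equals the actual class."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)

def mean_absolute_error(actual, predicted):
    """Average size of the prediction error for a numeric outcome."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical test data for a classification model.
actual = ["good", "good", "bad", "bad"]
predicted = ["good", "bad", "bad", "bad"]
matrix = coincidence_matrix(actual, predicted)
```

Here `matrix[("good", "bad")]` counts good risks that the model wrongly labeled bad; which kind of error matters more depends on the business success criteria.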
Phases in the DM Process (6)
Deployment
Determine how the results need to be utilized
Who needs to use them?
How often do they need to be used?
Deploy Data Mining results by:
Scoring a database
Utilizing results as business rules
Interactive scoring on-line
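
Scoring a database means applying the model to every stored record and writing the result back. The sketch below uses an in-memory SQLite database; the table layout, the sample customers, and the `risk_score` rule are hypothetical stand-ins for a real trained model.

```python
import sqlite3

def risk_score(income, years_in_job):
    # Hypothetical stand-in for a trained model.
    return "good" if income >= 40000 or years_in_job > 5 else "bad"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, income REAL, years_in_job REAL, risk TEXT)"
)
conn.executemany(
    "INSERT INTO customers (id, income, years_in_job) VALUES (?, ?, ?)",
    [(1, 30000, 7), (2, 30000, 2), (3, 55000, 1)],
)

# Read every record, score it, and store the score next to the record.
rows = conn.execute("SELECT id, income, years_in_job FROM customers").fetchall()
conn.executemany(
    "UPDATE customers SET risk = ? WHERE id = ?",
    [(risk_score(income, years), cid) for cid, income, years in rows],
)
conn.commit()

scored = dict(conn.execute("SELECT id, risk FROM customers"))
```

The same scoring function could instead be called one record at a time, which corresponds to the interactive on-line option.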
Specific Data Mining Applications:
What data mining has done for...
The US Internal Revenue Service needed to improve customer service and...
Scheduled its workforce to provide faster, more accurate answers to questions.
What data mining has done for...
The US Drug Enforcement Agency needed to be more effective in their drug "busts" and...
Analyzed suspects' cell phone usage to focus investigations.
What data mining has done for...
HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher yielding investments and...
Reduced direct mail costs by 30% while garnering 95% of the campaign's revenue.
Final Comments
Data Mining can be utilized in any organization that needs to find patterns or relationships in their data.
By using the CRISP-DM methodology, analysts can have a reasonable level of assurance that their Data Mining efforts will render useful, repeatable, and valid results.
Questions?
