Tan,Steinbach, Kumar
4/18/2004
mining is
[Figure: The Data Gap. Chart covering 1995 to 1999, with a vertical axis from 0 to 3,000,000 and a series labeled "Number of analysts".]
From: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications
Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
Definitions
Look up phone number in phone directory
Query a Web search engine for information about "Amazon"
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
[Figure: diagram with the labels Statistics/AI, Machine Learning/Pattern Recognition, Database systems, and Data Mining.]
Prediction Methods
Use some variables to predict unknown or future
values of other variables.
Description Methods
Find human-interpretable patterns that describe the
data.
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
Classification: Definition
Clustering Definition
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
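As a minimal sketch of the Euclidean measure named above, the following Python function computes the distance between two records whose attributes are all continuous. The function name and the sample points are illustrative and do not come from the slides.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two hypothetical records with three continuous attributes each.
print(euclidean_distance((0.0, 0.0, 0.0), (1.0, 2.0, 2.0)))  # 3.0
```

For categorical or mixed attributes a different, problem-specific measure would be needed, as the slide notes.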
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances are minimized
Intercluster distances are maximized
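To make the two quantities concrete, here is a small sketch that measures them for a hand-made clustering of points in 3-D space. The points, cluster names, and helper functions are hypothetical. "Intracluster distance" is taken here to mean the mean distance of points to their cluster centroid, and "intercluster distance" the distance between centroids. Other definitions are possible.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def mean_intracluster_distance(points):
    """Mean Euclidean distance from each point to the cluster centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

# Two hypothetical clusters in 3-D space.
cluster_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
cluster_b = [(10.0, 10.0, 10.0), (11.0, 10.0, 10.0), (10.0, 11.0, 10.0)]

intra_a = mean_intracluster_distance(cluster_a)
intra_b = mean_intracluster_distance(cluster_b)
inter = math.dist(centroid(cluster_a), centroid(cluster_b))

print("intracluster:", intra_a, intra_b)
print("intercluster:", inter)
```

A good clustering of this data keeps the first two numbers small relative to the third.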
[Table: transactions with columns TID and Items; TIDs 1 to 5.]
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
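One common way to judge rules like these is by their support and confidence. The sketch below computes both for the two rules above. The five transactions are hypothetical stand-ins, since the item lists of the slide's table are not available here, and the helper names are illustrative.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(transactions, lhs, rhs):
    """Fraction of all transactions containing both sides of the rule."""
    return count(transactions, set(lhs) | set(rhs)) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Fraction of transactions containing lhs that also contain rhs."""
    return count(transactions, set(lhs) | set(rhs)) / count(transactions, lhs)

# Hypothetical market-basket data (assumption, not taken from the slide).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support:", support(transactions, lhs, rhs),
          "confidence:", confidence(transactions, lhs, rhs))
```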
Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
(A B) (C) (D E)
Rules are formed by first discovering patterns. Event occurrences in the patterns
are governed by timing constraints.
[Figure: the pattern (A B) (C) (D E) annotated with the timing constraints <= xg, > ng, <= ws, and <= ms.]
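The slide does not define the four symbols. The sketch below assumes the usual reading from the sequential-pattern literature: xg is a maximum gap and ng a minimum gap between consecutive elements, ws is a window size within one element, and ms is a maximum span for the whole pattern. Under that assumption, it checks whether one candidate occurrence of a pattern satisfies all four constraints. The function name and timestamps are hypothetical.

```python
def satisfies_timing(elements, xg, ng, ws, ms):
    """Check timing constraints for one occurrence of a sequential pattern.

    elements: list of lists of event timestamps, one list per pattern
    element, in pattern order, e.g. [[1, 2], [4], [6, 7]] for (A B) (C) (D E).
    """
    starts = [min(e) for e in elements]
    ends = [max(e) for e in elements]

    # Window size: events grouped in one element lie within ws of each other.
    if any(end - start > ws for start, end in zip(starts, ends)):
        return False

    for i in range(1, len(elements)):
        # Minimum gap: the next element starts more than ng after the previous ends.
        if not starts[i] - ends[i - 1] > ng:
            return False
        # Maximum gap: the next element ends within xg of the previous start.
        if not ends[i] - starts[i - 1] <= xg:
            return False

    # Maximum span: the whole occurrence fits within ms.
    return ends[-1] - starts[0] <= ms

occurrence = [[1, 2], [4], [6, 7]]  # hypothetical timestamps
print(satisfies_timing(occurrence, xg=5, ng=1, ws=1, ms=7))
print(satisfies_timing(occurrence, xg=5, ng=1, ws=1, ms=5))
```

With these numbers the first call passes every check, and the second fails only the span check, because the occurrence lasts 6 time units.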
Regression
Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications:
Network Intrusion Detection
Typical network traffic at university level may reach over 100 million connections per day.
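One simple way to flag "significant deviations from normal behavior" is a z-score test: mark any observation that lies more than a chosen number of standard deviations from the mean. The sketch below applies that idea to hypothetical per-interval connection counts. The data, the threshold of 2.0, and the function name are assumptions, and real intrusion detection typically uses far richer features.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return indices of values whose z-score magnitude exceeds threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

# Hypothetical connection counts per interval; the last one is a burst.
connections = [100, 102, 98, 101, 99, 103, 97, 250]
print(find_anomalies(connections))  # [7]
```

Because the mean and standard deviation are themselves pulled by outliers, a median-based variant may be preferable on heavily contaminated data.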
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data
Classification
Classification Example
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training Set
Refund  Marital Status  Taxable Income  Cheat
Yes     Single          125K            No
No      Married         100K            No
No      Single          70K             No
Yes     Married         120K            No
No      Divorced        95K             Yes
No      Married         60K             No
Yes     Divorced        220K            No
No      Single          85K             Yes
No      Married         75K             No
No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: flow from Training Set through Learn Classifier to Model, with the Test Set as input to the Model.]
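The learn-then-apply flow can be sketched with a tiny decision tree built by greedy Gini splitting. This is one possible classifier, not necessarily the one the slides have in mind. The rows are the Refund / Marital Status / Taxable Income / Cheat records of this example, with income in thousands. All function names are illustrative.

```python
from collections import Counter

# (Refund, Marital Status, Taxable Income in K, Cheat)
TRAIN = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
TEST = [
    ("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
    ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80),
]

def gini(rows):
    """Gini impurity of the class labels (last field) in rows."""
    n = len(rows)
    return 1.0 - sum((c / n) ** 2 for c in Counter(r[-1] for r in rows).values())

def candidate_tests(rows):
    """Yield boolean tests: equality for categorical, threshold for numeric."""
    for col in range(len(rows[0]) - 1):
        values = sorted({r[col] for r in rows})
        if isinstance(values[0], str):
            for v in values:
                yield lambda r, col=col, v=v: r[col] == v
        else:
            for lo, hi in zip(values, values[1:]):
                yield lambda r, col=col, t=(lo + hi) / 2: r[col] < t

def build(rows):
    """Return a class label (leaf) or a (test, left, right) tuple."""
    if len({r[-1] for r in rows}) == 1:
        return rows[0][-1]
    best = None
    for test in candidate_tests(rows):
        left = [r for r in rows if test(r)]
        right = [r for r in rows if not test(r)]
        if left and right:
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, test, left, right)
    if best is None:  # no usable split: fall back to the majority class
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    return (best[1], build(best[2]), build(best[3]))

def predict(tree, row):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        test, left, right = tree
        tree = left if test(row) else right
    return tree

model = build(TRAIN)                       # Learn Classifier -> Model
predictions = [predict(model, row) for row in TEST]
print(list(zip(TEST, predictions)))
```

The tree is grown until every leaf is pure, so it reproduces the training labels exactly. A practical classifier would usually prune or limit depth to avoid overfitting such a small sample.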
Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
Approach:
Use
We
Collect various demographic, lifestyle, and company-interaction related information about all such customers.
Type of business, where they stay, how much they earn, etc.
Use
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card
transactions.
Approach:
Use credit card transactions and the information on the account holder as attributes.
When does a customer buy, what do they buy, how often do they pay on time, etc.
Classification: Application 3
Customer Attrition/Churn:
Goal: To predict whether a customer is likely to be lost
to a competitor.
Approach:
Use
Classification: Application 4
Approach:
Segment the image.
Measure
Model
Success
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies
Courtesy: http://aps.umn.edu
Class: Stages of Formation (Early, Intermediate, Late)
Attributes: Image features, characteristics of light waves received, etc.
Data Size:
Clustering
Applications
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
Approach:
Collect
Clustering: Application 2
Document Clustering:
Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
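The approach above (count terms, then derive a similarity from the counts) can be sketched with raw term frequencies and cosine similarity. Cosine similarity is one common choice, not necessarily the measure the slides use. The three documents and all names below are hypothetical, and no stop-word filtering is applied, so "important terms" is approximated by all terms.

```python
import math
import re
from collections import Counter

def term_frequencies(text):
    """Count lowercase alphabetic tokens in a document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = {
    "finance_1": "stocks fell as the bank raised interest rates",
    "finance_2": "the bank cut interest rates and stocks rallied",
    "sports_1": "the team won the match with a late goal",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_same_topic = cosine_similarity(tf["finance_1"], tf["finance_2"])
sim_cross_topic = cosine_similarity(tf["finance_1"], tf["sports_1"])
print(sim_same_topic, sim_cross_topic)
```

A clustering algorithm would then group documents whose pairwise similarity is high, which here puts the two finance documents together.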
Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278
Discovered Clusters and Industry Groups
1. Industry Group: Technology1-DOWN
2. Industry Group: Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN (Industry Group: Financial-DOWN)
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP (Industry Group: Oil-UP)
Association Rule Discovery
Association Rule Discovery: Application 1
Association Rule Discovery: Application 2
So,
Association Rule Discovery: Application 3
Inventory Management:
Goal: A consumer appliance repair company wants to
anticipate the nature of repairs on its consumer
products and keep its service vehicles equipped with
the right parts, to reduce the number of visits to consumer
households.
Approach: Process the data on tools and parts
required in previous repairs at different consumer
locations and discover the co-occurrence patterns.
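A first pass at discovering co-occurrence patterns is to count how often each pair of parts appears in the same repair record. The sketch below does this for hypothetical repair data. The part names and records are invented for illustration, and a full association-rule miner would also consider larger itemsets and rule confidence.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: tools and parts used in each previous repair.
repairs = [
    {"screwdriver", "fuse", "thermostat"},
    {"fuse", "thermostat"},
    {"belt", "screwdriver"},
    {"fuse", "thermostat", "belt"},
]

def pair_counts(records):
    """Count how many records contain each unordered pair of items."""
    counts = Counter()
    for parts in records:
        # Sorting gives each pair a single canonical ordering.
        counts.update(combinations(sorted(parts), 2))
    return counts

counts = pair_counts(repairs)
print(counts.most_common(3))
```

Pairs with high counts, such as fuse and thermostat in this toy data, suggest parts that should be stocked together on a service vehicle.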
Sequential Pattern Discovery