Data Mining
Raghu Ramakrishnan
Yahoo! Research
University of Wisconsin-Madison (on leave)
[Figure: customer data partitioned into clusters Group 1-Group 4.]
4. Evaluate results:
Many uninteresting clusters
One interesting cluster! Customers with both business and personal accounts, and an unusually high percentage of likely respondents
Action: new marketing campaign
Result: acceptance rate for home equity offers more than doubled
Examples:
- Clustering
- Linear regression model
- Classification model
- Frequent itemsets and association rules
- Support Vector Machines
[Figure: decision tree with root split on Age (< 30 vs. >= 30); the < 30 branch splits on Car Type (Minivan: YES; Sports, Truck: NO) and the >= 30 branch predicts YES. Shown beside the corresponding partition of the Age axis at 30.]
TECS 2007, Data Mining R. Ramakrishnan, Yahoo! Research
Bee-Chung Chen, Raghu Ramakrishnan, Jude Shavlik, Pradeep Tamma
Decision Trees
A decision tree T encodes d (a classifier or
regression function) in the form of a tree.
A node t in T without children is called a leaf
node. Otherwise t is called an internal node.
Encoded classifier:
If (age < 30 and carType = Minivan) Then YES
If (age < 30 and (carType = Sports or carType = Truck)) Then NO
If (age >= 30) Then YES
[Figure: the corresponding tree: root split on Age (< 30 vs. >= 30), then Car Type (Minivan: YES; Sports, Truck: NO); the >= 30 branch predicts YES.]
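As a minimal sketch, the encoded classifier above translates directly into code (attribute and value names taken from the example; anything else is hypothetical):

```python
def classify(age, car_type):
    """Classifier d encoded by the decision tree above (a sketch)."""
    if age < 30:
        # Left subtree: split on Car Type
        return "YES" if car_type == "Minivan" else "NO"
    # age >= 30: right subtree is a YES leaf
    return "YES"
```

Each root-to-leaf path in the tree corresponds to one If-Then rule.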
[Figure: candidate split points on Age (e.g., 30, 35) are evaluated using (Yes:, No:) class counts over the training database; the AVC-sets needed for the split decision fit in main memory while the database resides on disk.]
RainForest Algorithms: RF-Hybrid
[Figure: after the root split Age < 30, AVC-sets for the child nodes are built in main memory from another scan of the database.]
Third scan: as we expand the tree, we run out of memory and have to spill partitions to disk, then recursively read and process them later.
[Figure: tree with splits Age < 30, Sal < 20k, Car == S; the database partitions for nodes whose AVC-sets no longer fit in memory are written to disk.]
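The AVC-sets that RainForest keeps in main memory can be sketched as follows: one scan of a node's records builds, per candidate split attribute, a table of counts indexed by (attribute value, class label). Function and field names here are hypothetical, not from the RainForest papers:

```python
from collections import defaultdict

def build_avc_sets(records, attributes, class_attr):
    """One scan over a tree node's records builds, for each candidate
    split attribute, its AVC-set: counts per (value, class label).
    The AVC-sets, not the records, are what must fit in memory."""
    avc = {a: defaultdict(int) for a in attributes}
    for rec in records:
        for a in attributes:
            avc[a][(rec[a], rec[class_attr])] += 1
    return avc

# Toy training set with two candidate attributes
records = [
    {"age": "<30",  "car": "Minivan", "buy": "YES"},
    {"age": "<30",  "car": "Sports",  "buy": "NO"},
    {"age": ">=30", "car": "Truck",   "buy": "YES"},
]
avc = build_avc_sets(records, ["age", "car"], "buy")
```

Split criteria such as information gain can then be evaluated from the AVC-sets alone, without rescanning the data.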
CLARANS
DBSCAN
BIRCH
CLIQUE
CURE
...
The above algorithms can be used to cluster documents after reducing their dimensionality using SVD.
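A minimal sketch of that preprocessing step, assuming a small document-term matrix and using plain NumPy for the truncated SVD (the toy corpus is hypothetical):

```python
import numpy as np

def svd_reduce(doc_term, k):
    """Project documents onto the top-k singular directions
    (LSA-style dimensionality reduction before clustering)."""
    U, s, Vt = np.linalg.svd(doc_term, full_matrices=False)
    return U[:, :k] * s[:k]   # one k-dimensional row per document

# Toy corpus: docs 0-1 share one vocabulary, docs 2-3 another
doc_term = np.array([[2., 1., 0., 0.],
                     [1., 2., 0., 0.],
                     [0., 0., 2., 1.],
                     [0., 0., 1., 2.]])
reduced = svd_reduce(doc_term, k=2)
# Any of the clustering algorithms above can now run on `reduced`.
```

After reduction, documents with similar vocabularies land close together in the k-dimensional space, which is what the clustering algorithms exploit.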
Imposing constraints:
- Only find rules involving the dairy department
- Only find rules involving expensive products
- Only find rules with whiskey on the right-hand side
- Only find rules with milk on the left-hand side
- Hierarchies on the items
- Calendars (every Sunday, every 1st of the month)
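Item constraints like these can at least be applied as a post-filter over mined rules (pushing them into the mining algorithm itself is more efficient); a sketch, with the rule-tuple format being a hypothetical choice:

```python
def filter_rules(rules, lhs_item=None, rhs_item=None, involves=None):
    """Keep only rules matching simple item constraints.
    Each rule is (lhs_items, rhs_items, support, confidence)."""
    kept = []
    for lhs, rhs, sup, conf in rules:
        if lhs_item is not None and lhs_item not in lhs:
            continue                    # e.g., "milk on the left-hand side"
        if rhs_item is not None and rhs_item not in rhs:
            continue                    # e.g., "whiskey on the right-hand side"
        if involves is not None and involves not in (set(lhs) | set(rhs)):
            continue                    # e.g., "involving the dairy department"
        kept.append((lhs, rhs, sup, conf))
    return kept

rules = [({"milk"}, {"bread"}, 0.10, 0.8),
         ({"beer"}, {"whiskey"}, 0.05, 0.6)]
milk_rules = filter_rules(rules, lhs_item="milk")
```

Hierarchy and calendar constraints would similarly map each item or transaction timestamp up to the required level before testing membership.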
Sample Applications
- Direct marketing
- Fraud detection for medical insurance
- Floor/shelf planning
- Web site layout
- Cross-selling
Extract
Data Consistency?
Customer table:
Cust ID | Age | Marital Status | Wealth
1 | 35 | M | 380,000
2 | 20 | S | 50,000
3 | 57 | M | 470,000
Joined with purchase data (e.g., for Cust ID 1):
Product | Qty | Category
TV | 1 | Appliance
Coke | 6 | Drink
Ham | 3 | Food
Name of algorithm
SELECT [Customers].[ID],
       MyDMM.[Age],
       PredictProbability(MyDMM.[Age])
FROM MyDMM PREDICTION JOIN [Customers]
  ON MyDMM.[Gender] = [Customers].[Gender]
 AND MyDMM.[Hair Color] = [Customers].[Hair Color]
[Figure: dimension hierarchies. Automobile: ALL -> Category (Sedan, Truck) -> Model (Civic, Camry, F150, Sierra). Location: ALL -> Region (East, West) -> State (NY, MA, TX, CA). The hierarchy levels are the DIMENSION ATTRIBUTES.]
Fact table:
FactID | Auto | Loc | Repair
p1 | F150 | NY | 100
p2 | Sierra | NY | 500
p3 | F150 | MA | 100
p4 | Sierra | MA | 200
[Figure: the same hierarchies, now with an imprecise fact p5 whose Auto attribute is known only at the Truck level.]
Fact table:
FactID | Auto | Loc | Repair
p1 | F150 | NY | 100
p2 | Sierra | NY | 500
p3 | F150 | MA | 100
p4 | Sierra | MA | 200
p5 | Truck | MA | 100
Query: Auto = F150, Loc = MA
SUM(Repair) = ??? How do we treat p5?
[Figure: in the Truck x East region of the cube, p5 covers both the (F150, MA) and (Sierra, MA) cells, while p1-p4 each fall in a single cell.]
FactID | Auto | Loc | Repair
p1 | F150 | NY | 100
p2 | Sierra | NY | 500
p3 | F150 | MA | 100
p4 | Sierra | MA | 200
p5 | Truck | MA | 100
Allocation: an imprecise fact is divided among the precise cells it may belong to, each piece carrying a weight, e.g.:
ID | FactID | Auto | Loc | Repair | Weight
6 | p5 | Sierra | MA | 100 | 0.5
[Figure: p5 shown split across the F150 and Sierra columns of the MA row.]
[Figure: cells c1 (under F150, MA) and c2 (under Sierra, MA) with Count(c1) = 2 and Count(c2) = 1; an imprecise fact such as p5 is allocated to c1 and c2 with probabilities p_{c1,p5} and p_{c2,p5} derived from these counts.]
ID | Sales
p1 | 100
p2 | 150
p3 | 300
p4 | 200
p5 | 250
p6 | 400
Truck
Q (c ) Q (c ) F150 Sierra
pc ,r
Q(c ') Qsum(r ) r
MA
c 'region ( r ) c1 c2
East
NY
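A sketch of this allocation policy (function and cell names hypothetical), using the count-based quantities from the example, Count(c1) = 2 and Count(c2) = 1:

```python
def allocation_weights(Q, region):
    """p_{c,r} = Q(c) / sum of Q(c') over c' in region(r)."""
    total = sum(Q[c] for c in region)
    return {c: Q[c] / total for c in region}

# Imprecise record r = p5 may belong to cells c1 or c2
Q = {"c1": 2, "c2": 1}                    # e.g., counts of precise facts
weights = allocation_weights(Q, ["c1", "c2"])
# p5's measure is then split 2/3 to c1 and 1/3 to c2
```

By construction the weights for one record sum to 1, so aggregates over a full slice of the cube remain consistent with aggregates over its parts.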
What is a Good Allocation Policy?
We propose desiderata that enable appropriate definition of query semantics for imprecise data.
[Figure: F150/Sierra x NY/MA grid with precise facts p1-p4 and imprecise fact p5.]
Consistency: specifies the relationship between answers to related queries on a fixed data set.
[Figure: the same F150/Sierra x NY/MA grid with p1-p4 and p5.]
[Figure: several copies of the grid: the possible worlds [Kripke63] obtained by completing the imprecise data.]
[Figure: four possible worlds w1-w4, one for each cell that the imprecise fact p5 may be assigned to; the precise facts p1-p4 are fixed in every world.]
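The possible-worlds view can be made concrete: each imprecise fact is assigned to one of its candidate cells, and every combination of assignments yields one world. This sketch (names hypothetical) enumerates the worlds for the running example:

```python
from itertools import product

def possible_worlds(precise, imprecise):
    """precise: {fact_id: cell}; imprecise: {fact_id: [candidate cells]}.
    Returns one completed cell assignment per possible world."""
    ids = list(imprecise)
    worlds = []
    for choice in product(*(imprecise[i] for i in ids)):
        world = dict(precise)          # precise facts are fixed everywhere
        world.update(zip(ids, choice)) # one candidate picked per imprecise fact
        worlds.append(world)
    return worlds

precise = {"p1": ("F150", "NY"), "p2": ("Sierra", "NY"),
           "p3": ("F150", "MA"), "p4": ("Sierra", "MA")}
imprecise = {"p5": [("F150", "NY"), ("Sierra", "NY"),
                    ("F150", "MA"), ("Sierra", "MA")]}
worlds = possible_worlds(precise, imprecise)   # w1..w4
```

A query's answer over the imprecise data can then be characterized by its answers across these worlds, which is what the allocation weights summarize.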
Query Semantics
Goal: look for patterns of unusually high numbers of applications.
Fact table D (Z: Dimensions, Y: Measure):
Location | Time | # of App.
AL, USA | Dec, 04 | 2
... | ... | ...
WY, USA | Dec, 04 | 3
[Figure: dimension hierarchies. Location: All -> State (AL ... WY); Time: All -> ...]
Fact table D (Z: Dimensions, X: Predictors, Y: Class):
Location | Time | Race | Sex | Approval
AL, USA | Dec, 04 | White | M | Y
... | ... | ... | ... | ...
WY, USA | Dec, 04 | Black | F | N
For each cube subset, e.g., the data [USA, Dec 04](D) selected by the Location and Time hierarchies, build a model h(X, [USA, Dec 04](D)), e.g., a decision tree. Measure in a cell:
- Accuracy of the model
- Predictiveness of Race measured based on that model
- Similarity between that model and a given model
[Figure: cell values by State x Month for 2004 and 2003 (e.g., CA: 0.4, 0.2, ..., 0.6, 0.5); each cell's model is compared on a test set with a given model h0(X), e.g., a decision tree predicting No for (Black, M) and Yes otherwise.]
The loan decision process in USA during Dec 04 was similar to a discriminatory decision model.
Example (6/7): Predictiveness
[Figure: cell values by region x month for 2004 and 2003 (e.g., CA: 0.4, 0.2, ..., 0.6, 0.5; USA: 0.2, 0.3, ..., 0.9), each cell holding a Yes/No decision-tree model.]
Build two models on each cell's data D, e.g., h1 trained with Race as a predictor and h2 trained without it. The cell value, the predictiveness of Race, is the fraction of D on which the two models disagree:

(1/|D|) Σ_{x ∈ D} I(h1(x) ≠ h2(x))

Cell value: Predictiveness of Race
[Figure: resulting cube, e.g., AB: 0.4, 0.2, 0.1, 0.1, 0.2 across Jan-Dec of 2004 and 2003.]
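The measure is just the empirical disagreement rate between two models; a sketch with hypothetical models h1 and h2 standing in for the trained classifiers:

```python
def disagreement(h1, h2, data):
    """(1/|D|) * sum over x in D of I(h1(x) != h2(x))."""
    return sum(h1(x) != h2(x) for x in data) / len(data)

# Toy example: h1 uses the sensitive attribute, h2 ignores it
h1 = lambda x: "N" if x["race"] == "Black" else "Y"
h2 = lambda x: "Y"
data = [{"race": "White"}, {"race": "Black"},
        {"race": "White"}, {"race": "Black"}]
pred = disagreement(h1, h2, data)   # 2 of 4 records disagree -> 0.5
```

A cell value near 0 means Race barely changes the cell's predictions; a value near 1 means the models disagree almost everywhere in that cell.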
Naïve Bayes:
Scoring function: algebraic
Kernel-density-based classifier:
Scoring function: distributive
Decision tree, random forest:
Neither distributive nor algebraic
PBE: Probability-based ensemble (new)
Makes any machine-learning model distributive
[Figure: execution time (sec) vs. # of records, comparing the exhaustive method with bottom-up score computation (an approximation).]
Bellwether Analysis:
Global Aggregates from Local Regions
...
[Table row: Hardware, Laptop with candidate regions [1-4, MD], [1-1, NY], [1-3, WI].]
Conclusions