
Introduction

Motivation: Business Intelligence


Customer information
(customer-id, gender, age,
home-address, occupation,
income, family-size, )

Product information
(Product-id, category,
manufacturer, made-in,
stock-price, )

Sales information
(customer-id, product-id, #units, unit-price,
sales-representative, )
Business queries:

Jian Pei: CMPT 741/459 Data Mining -- Introduction

Techniques: Business Intelligence


Multidimensional data analysis
Online query answering
Interactive data exploration

Jian Pei: CMPT 741/459 Data Mining -- Introduction

Motivation: Store Layout Design

http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg

Jian Pei: CMPT 741/459 Data Mining -- Introduction

Techniques: Store Layout Design


Customer purchase patterns
Business strategies

Jian Pei: CMPT 741/459 Data Mining -- Introduction

Motivation: Community Detection

http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-socialmedia-1-728.jpg?cb=1308736811

Jian Pei: CMPT 741/459 Data Mining -- Introduction

Techniques: Community Detection


Similarity between objects
Partitioning objects into groups
No guidance about what a group is

Jian Pei: CMPT 741/459 Data Mining -- Introduction

Motivation: Disease Prediction


What medical problems does this patient have?
Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat

Jian Pei: CMPT 741/459 Data Mining -- Introduction

Techniques: Disease Prediction


Features
Model

Jian Pei: CMPT 741/459 Data Mining -- Introduction

Motivation: Fraud Detection

http://i.imgur.com/ckkoAOp.gif

Jian Pei: CMPT 741/459 Data Mining -- Introduction

10

Techniques: Fraud Detection


Features
Dissimilarity
Groups and noise

http://i.stack.imgur.com/tRDGU.png

Jian Pei: CMPT 741/459 Data Mining -- Introduction

11

What Is Data Science About?


Data
Extraction of knowledge from data
Continuation of data mining and knowledge
discovery from data (KDD)

Jian Pei: CMPT 741/459 Data Mining -- Introduction

12

What Is Data?
Values of qualitative or quantitative variables
belonging to a set of items
Represented in a structure, e.g., tabular, tree
or graph structure
Typically the results of measurements
As an abstract concept can be viewed as the
lowest level of abstraction from which
information and then knowledge are derived
Jian Pei: CMPT 741/459 Data Mining -- Introduction

13

What Is Information?
Knowledge communicated or received
concerning a particular fact or circumstance
Conceptually, information is the message
(utterance or expression) being conveyed
Cannot be predicted
Can resolve uncertainty

Jian Pei: CMPT 741/459 Data Mining -- Introduction

14

What Is Knowledge?
Familiarity with someone or something,
which can include facts, information,
descriptions, or skills acquired through
experience or education
Implicit knowledge: practical skill or expertise
Explicit knowledge: theoretical
understanding of a subject

Jian Pei: CMPT 741/459 Data Mining -- Introduction

15

Data Systems
A data system answers queries based on
data acquired in the past
Base data: the rawest data, not derived from anywhere else
Knowledge: information derived from the base data

Jian Pei: CMPT 741/459 Data Mining -- Introduction

16

Dealing with Data Querying


Given a set of student records about name,
age, courses taken and grades
Simple queries
What is John Doe's age?

Aggregate queries
What is the average GPA of all students at this
school?

Queries can be arbitrarily complicated


Find the students X and Y whose grades are less
than 3% apart in as many courses as possible
Jian Pei: CMPT 741/459 Data Mining -- Introduction

17

Queries
A precise request for information
Subjects in databases and information
retrieval
Databases: structured queries on structured
(e.g., relational) data
Information retrieval: unstructured queries on
unstructured (e.g., text, image) data

Important assumptions
Information needs
Query languages
Jian Pei: CMPT 741/459 Data Mining -- Introduction

18

Data-driven Exploration
What should be the next strategy of a
company?
A lot of data: sales, human resource, production,
tax, service cost,

The question cannot be translated into a precise request for information (i.e., a query)
Developing familiarity (knowledge) and
actionable items (decisions) by interactively
analyzing data
Jian Pei: CMPT 741/459 Data Mining -- Introduction

19

Data-driven Thinking
Starting with some simple queries
New queries are raised by consuming the
results of previous queries
No ultimate query in design!
But many queries can be answered using DB/IR
techniques

Jian Pei: CMPT 741/459 Data Mining -- Introduction

20

The Art of Data-driven Thinking


The way of generating queries remains an
art!
Different people may derive different results
using the same data
If you torture the data long enough, it will confess
Ronald H. Coase

More often than not, more data may be needed: datafication
Jian Pei: CMPT 741/459 Data Mining -- Introduction

21

Queries for Data-driven Thinking


Probe queries: finding information about specific individuals
Aggregation: finding information about groups
Pattern finding: finding commonality in a population
Association and correlation: finding connections among individuals and groups
Causality analysis: finding causes and consequences
Jian Pei: CMPT 741/459 Data Mining -- Introduction

22

What Is Data Mining?


Broader sense: the art of data-driven
thinking
Technical sense: the non-trivial process of
identifying valid, novel, potentially useful,
and ultimately understandable patterns in
data [Fayyad, Piatetsky-Shapiro, Smyth, 96]
Methods and tools of answering various types of
queries in the data mining process in the
broader sense
Jian Pei: CMPT 741/459 Data Mining -- Introduction

23

Machine Learning
A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P, if its
performance at tasks in T, as measured by P,
improves with experience E
Tom M. Mitchell
Essentially, learn the distribution of data

Jian Pei: CMPT 741/459 Data Mining -- Introduction

24

Data mining vs. Machine Learning


Machine learning focuses on prediction,
based on known properties learned from the
training data
Data mining focuses on the discovery of
(previously) unknown properties on the data

Jian Pei: CMPT 741/459 Data Mining -- Introduction

25

The KDD Process


[Figure: the KDD process. Data → (selection) → target data → (preprocessing) → preprocessed data → (transformation) → transformed data → (data mining) → patterns → (interpretation/evaluation) → knowledge.]
Jian Pei: CMPT 741/459 Data Mining -- Introduction

26

Data Mining R&D

New problem identification


Data collection and transformation
Algorithm design and implementation
Evaluation
Effectiveness evaluation
Efficiency & scalability evaluation

Deployment and business solution

Jian Pei: CMPT 741/459 Data Mining -- Introduction

27

Data Mining on Big Data


Data is so widely available and so strategically important that the scarce thing is the knowledge to extract wisdom from it (Hal Varian, Google's Chief Economist)

Jian Pei: CMPT 741/459 Data Mining -- Introduction

28

What Is Big Data?


No quantitative definition!
Big data is like teenage sex
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it...
Dan Ariely

Jian Pei: CMPT 741/459 Data Mining -- Introduction

29

Data Volume vs. Storage Cost


The unit cost of disk storage decreases
dramatically
Year   Unit cost
1956   $10,000/MB
1980   $193/MB
1990   $9/MB
2000   $6.9/GB
2010   $0.08/GB
2013   $0.06/GB

http://ns1758.ca/winch/winchest.html

Jian Pei: CMPT 741/459 Data Mining -- Introduction

30

Big Data Volume


Data sets with sizes beyond the ability of
commonly-used software tools to capture,
curate, manage, and process the data within a
tolerable elapsed time
Wikipedia

Jian Pei: CMPT 741/459 Data Mining -- Introduction

31

Big Data: Volume


Every day, about 7 billion shares change hands
on US equity markets
About 2/3 is traded by computer algorithms based
on huge amounts of data to predict gains and risk

In Q2 2015
Facebook has 1.49 billion active users
WeChat has 600 million active users, 100 million outside China
LinkedIn has 380 million active users
Twitter has 304 million active users
Jian Pei: CMPT 741/459 Data Mining -- Introduction

32

Velocity
Google processes 24+ petabytes of data per
day
Facebook gets 10+ million new photos
uploaded every hour
Facebook members like or leave a comment
3+ billion times per day
YouTube users upload 1+ hour of video
every second
400+ million tweets per day
Jian Pei: CMPT 741/459 Data Mining -- Introduction

33

What Has Been Changed?


The 1880 census in the US took 8 years to complete
The 1890 census would have needed 13 years; using punch cards, it was reduced to less than 1 year
It is essential to get not only accurate but also timely data
Statisticians use sampling to estimate
Recently, with new technologies, the ways of data collection and transmission have been fundamentally changed
Jian Pei: CMPT 741/459 Data Mining -- Introduction

34

Sampling for Volume/Velocity?


Sampling idea: the marginal new information brought by a larger amount of data shrinks quickly
The sample should be truly random
On a data set of hundreds or thousands of attributes, can sampling help in
finding subcategories of attribute combinations?
finding outliers and exceptions?
Big data contains signals of different strengths
Not noise, but weaker and weaker signals that may still be interesting and important
Jian Pei: CMPT 741/459 Data Mining -- Introduction

35

Big Data and Lytro Pictures


Lytro pictures record the whole light field
Photographers can decide later which parts to
focus on

Big data tries to record as much information as possible
Analysts can decide later what to extract from
big data
Both advantages and challenges

Jian Pei: CMPT 741/459 Data Mining -- Introduction

36

Veracity
1 in 3 business leaders don't trust the
information they use to make decisions
Assuming a slowly growing total cost budget,
tradeoff between data volume and data
quality
Loss of veracity in combining different types
of information from different sources
Loss of veracity in data extraction,
transformation, and processing
Jian Pei: CMPT 741/459 Data Mining -- Introduction

37

Variety
Integrating data capturing different aspects
of a data object
Vancouver Canucks: game video, technical
statistics, social media,
Different pieces are in different format

Different views of the same data object from different sources
Did the soccer ball pass the goal line?
The views may not be consistent
Jian Pei: CMPT 741/459 Data Mining -- Introduction

38

Four V-challenges
Volume: massive scale and growth, 40% per
year in global data generated
Velocity: real time data generation and
consumption
Variety: heterogeneous data, mainly
unstructured or semi-structured, from many
sources
Veracity
Jian Pei: CMPT 741/459 Data Mining -- Introduction

39

Is Big Data Really New?


People were aware of the existence of big data a long time ago, but no one could access it until very recently
(Genesis 28:15) I am with you and will watch
over you wherever you go

Similar statements in Quran and Sutra

What has been changed?


How is data connected with people
Jian Pei: CMPT 741/459 Data Mining -- Introduction

40

Diversity in Data Usage


In the past, only very few projects can afford
to be data-intensive
Nowadays, excessive applications are
(naturally) data-intensive

Jian Pei: CMPT 741/459 Data Mining -- Introduction

41

Datafication
Extract data about an object or event in a
quantified way so that it can be analyzed
Different from digitalization

An important feature of big data


Key: new data, new applications, new
opportunities

Jian Pei: CMPT 741/459 Data Mining -- Introduction

42

New Values of Datafication


Example: Captcha and ReCaptcha (Luis von
Ahn)
How to create new values of data and
datafication?
Connecting data with new users
Connecting different pieces of data to present a
bigger picture

Important techniques
Data aggregation
Extended datafication
Jian Pei: CMPT 741/459 Data Mining -- Introduction

43

Big Data Players

Data holders
Data specialists
Big-data mindset leaders
A capable company may play 2 or 3 roles at
the same time
What is most important, big-data mindset,
skills, or data itself?

Jian Pei: CMPT 741/459 Data Mining -- Introduction

44

Privacy
big data analytics have the potential to
eclipse longstanding civil rights protections
in how personal information is used in
housing, credit, employment, health,
education, and the marketplace
Executive Office of the (US) President

Jian Pei: CMPT 741/459 Data Mining -- Introduction

45

Keep in Mind
Our industry does not respect
tradition it only respects
innovation.
Satya Nadella

Jian Pei: CMPT 741/459 Data Mining -- Introduction

46

Goals of This Course


Data-driven thinking towards being a (big)
data scientist
Principles and hands-on skills of data
mining, particularly in the context of big data
Identifying new data mining problems
Data mining algorithm design
Data mining applications

Novel problems for upcoming research


Jian Pei: CMPT 741/459 Data Mining -- Introduction

47

Format
Due to the fast progress in data mining, we
will go beyond the textbook substantially
Active classroom discussion
Open questions and brainstorming
Textbook: Data Mining Concepts and
Techniques (3rd ed)

Jian Pei: CMPT 741/459 Data Mining -- Introduction

48

Read Try Think


Reading
(required) Textbook and a small number of research
papers
You have to have the 3rd ed of the textbook!
(open end, not covered by the exam) Technical and
non-technical materials

Trying
Assignments and a project

Thinking
Examine everything from a data scientist angle from
today
Jian Pei: CMPT 741/459 Data Mining -- Introduction

49

Data Mining: History


1989 IJCAI Workshop on Knowledge
Discovery in Databases
Knowledge Discovery in Databases (G.
Piatetsky-Shapiro and W. Frawley, 1991)

1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)

Jian Pei: CMPT 741/459 Data Mining -- Introduction

50

Data Mining: History (cont'd)


95-98 International Conferences on Knowledge
Discovery in Databases and Data Mining
(KDD 95-98)
Journal of Data Mining and Knowledge Discovery (1997)

ACM SIGKDD conferences since 1998 and


SIGKDD Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining
(2001), (IEEE) ICDM (2001), etc.

ACM Transactions on KDD starting in 2007


Jian Pei: CMPT 741/459 Data Mining -- Introduction

51

Frequent Pattern Mining

How Many Words Is a Picture Worth?

E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

53

Burnt or Burned?

E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

54

Store Layout Design

http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

55

Transaction Data
Alphabet: a set of items
Example: all products sold in a store

A transaction: a set of items involved in an


activity
Example: the items purchased by a customer in
a visit

Other information is often associated


Timestamp, price, salesperson, customer-id,
store-id,
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

56

Examples of Transaction Data

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

57

How to Store Transaction Data?


Example transactions: (t123, a, b, c), (t236, b, d)
Relational storage: one (Tid, Item) row per item
  Tid   Item
  t123  a
  t123  b
  t123  c
  t236  b
  t236  d
Transaction-based storage: each transaction stored as a set of items
Item-based (vertical) storage: for each item, the list of transactions containing it
  Item a: …, t123, …
  Item b: …, t123, …, t236, …
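A minimal Python sketch of the three layouts for the two example transactions above; the variable names and dictionary/list representations are my own assumptions, not from the slides:

transactions = {'t123': {'a', 'b', 'c'}, 't236': {'b', 'd'}}      # transaction-based storage
relational = [(tid, item)                                          # relational storage: one (Tid, Item) row per item
              for tid, items in transactions.items()
              for item in sorted(items)]
vertical = {}                                                      # item-based (vertical) storage
for tid, items in transactions.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)
print(vertical['b'])   # {'t123', 't236'}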

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

58

Transaction Data Analysis


Transactions: customers' purchases of commodities
{bread, milk, cheese} if they are bought together
Frequent patterns: product combinations that are frequently purchased together by customers
Frequent patterns: patterns (sets of items, sequences, etc.) that occur frequently in a database [AIS93]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

59

Why Frequent Patterns?


What products were often purchased
together?
What are the frequent subsequent purchases after buying an iPod?
What kinds of genes are sensitive to this new drug?
What key-word combinations are frequently associated with web pages about game evaluation?
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

60

Why Frequent Pattern Mining?


Foundation for many data mining tasks
Association rules, correlation, causality,
sequential patterns, spatial and multimedia
patterns, associative classification, cluster
analysis, iceberg cube,

Broad applications
Basket data analysis, cross-marketing, catalog
design, sale campaign analysis, web log (click
stream) analysis,
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

61

Frequent Itemsets
Itemset: a set of items
  E.g., acm = {a, c, m}
Support of itemsets
  Sup(acm) = 3
Given min_sup = 3, acm is a frequent pattern
Frequent pattern mining: finding all frequent patterns in a database
Transaction database TDB
TID  Items bought
100  f, a, c, d, g, i, m, p
200  a, b, c, f, l, m, o
300  b, f, h, j, o
400  b, c, k, s, p
500  a, f, c, e, l, p, m, n
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

62

A Naïve Attempt
Generate all possible itemsets, test their supports against the database
How to hold a large number of itemsets in main memory?
  100 items → 2^100 − 1 possible itemsets
How to test the supports of a huge number of itemsets against a large database, say containing 100 million transactions?
  A transaction of length 20 needs to update the support of 2^20 − 1 = 1,048,575 itemsets
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

63

Transactions in Real Applications


A large department store often carries more
than 100 thousand different kinds of items
Amazon.com carries more than 17,000 books
relevant to data mining

Walmart has more than 20 million


transactions per day, AT&T produces more
than 275 million calls per day
Mining large transaction databases of many
items is a real demand
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

64

How to Get an Efficient Method?


Reducing the number of itemsets that need
to be checked
Checking the supports of selected itemsets
efficiently

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

65

Candidate Generation & Test


Any subset of a frequent itemset must also be frequent (an anti-monotonic property)
  A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
  {beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent
In other words, any superset of an infrequent itemset must also be infrequent
  No superset of any infrequent itemset should be generated or tested
  Many item combinations can be pruned!
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

66

Apriori-Based Mining
Generate length (k+1) candidate itemsets
from length k frequent itemsets, and
Test the candidates against DB

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

67

The Apriori Algorithm [AgSr94]


Database D (min_sup = 2)
TID  Items
10   a, c, d
20   b, c, e
30   a, b, c, e
40   b, e

Scan D → 1-candidates and supports: a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets: a:2, b:3, c:3, e:3
2-candidates: ab, ac, ae, bc, be, ce
Scan D (counting) → ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2
3-candidates: bce
Scan D → frequent 3-itemsets: bce:2
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)
68

The Apriori Algorithm


Level-wise candidate generation and test
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do
    Ck+1 = candidates generated from Lk;                 // candidate generation
    for each transaction t in the database do            // test
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
return ∪k Lk;
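Below is a small, self-contained Python sketch of this candidate-generation-and-test loop, run on the example database D (min_sup = 2); the function name and data structures are illustrative assumptions, not part of the original algorithm description:

from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        # candidate generation: join Lk with itself, keep (k+1)-sets whose k-subsets are all frequent
        candidates = set()
        for a in Lk:
            for b in Lk:
                u = a | b
                if len(u) == k + 1 and all(frozenset(s) in Lk for s in combinations(u, k)):
                    candidates.add(u)
        # test: count each surviving candidate against the database
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= min_sup}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

db = [['a', 'c', 'd'], ['b', 'c', 'e'], ['a', 'b', 'c', 'e'], ['b', 'e']]
print(apriori(db, 2))   # includes frozenset({'b', 'c', 'e'}): 2, as in the example above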

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

69

Important Steps in Apriori


How to find frequent 1- and 2-itemsets?
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning

How to count supports of candidates?

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

70

Finding Frequent 1- & 2-itemsets


Finding frequent 1-itemsets (i.e., frequent
items) using a one dimensional array
Initialize c[item]=0 for each item
For each transaction T, for each item in T,
c[item]++;
If c[item]>=min_sup, item is frequent

Finding frequent 2-itemsets using a 2-dimensional triangular matrix
  For items i, j (i < j), c[i, j] is the count of itemset ij
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

71

Counting Array
A 2-dimensional triangular matrix can be implemented using a 1-dimensional array
  There are n items
  For items i, j (i < j), c[i, j] = c[(i-1)(2n-i)/2 + j-i]
  Example (n = 5): c[3, 5] = c[(3-1)*(2*5-3)/2 + 5-3] = c[9]
[Figure: the upper-triangular 5 x 5 count matrix laid out as a 1-dimensional array of 10 cells.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)
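A tiny Python check of the index formula above; tri_index is a hypothetical helper name, not from the slides:

def tri_index(i, j, n):
    # map the pair (i, j), 1 <= i < j <= n, to its position in the 1-dimensional array
    return (i - 1) * (2 * n - i) // 2 + (j - i)

print(tri_index(3, 5, 5))   # 9, matching c[3,5] = c[9] in the example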
72

Example of Candidate-generation
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
  abcd ← abc * abd
  acde ← acd * ace
Pruning:
  acde is removed because ade is not in L3
C4 = {abcd}

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

73

How to Generate Candidates?


Suppose the items in Lk-1 are listed in an order
Step 1: self-join Lk-1
  INSERT INTO Ck
  SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1
  FROM Lk-1 p, Lk-1 q
  WHERE p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
  For each itemset c in Ck do
    For each (k-1)-subset s of c do
      if (s is not in Lk-1) then delete c from Ck
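A Python sketch of the same two steps on lexicographically sorted itemsets (tuples); the representation and function name are assumptions for illustration, and the example reuses L3 from the previous slide:

def gen_candidates(L_prev, k):
    prev = sorted(L_prev)
    Ck = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            p, q = prev[i], prev[j]
            # self-join: first k-2 items equal, last item of p smaller than last item of q
            if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]:
                cand = p + (q[k - 2],)
                # prune: every (k-1)-subset must be in L_{k-1}
                if all(cand[:m] + cand[m + 1:] in L_prev for m in range(k)):
                    Ck.append(cand)
    return Ck

L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
print(gen_candidates(L3, 4))   # [('a','b','c','d')]; acde is pruned because ade is not in L3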

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

74

How to Count Supports?


Why is counting supports of candidates a problem?
The total number of candidates can be very huge
One transaction may contain many candidates

Method
Candidate itemsets are stored in a hash-tree
A leaf node of hash-tree contains a list of itemsets and
counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a
transaction
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

75

Example: Counting Supports


Subset function: finds the candidates contained in a transaction by traversing the hash tree
[Figure: a hash tree over candidate 3-itemsets, with the hash function branching on items 1,4,7 / 2,5,8 / 3,6,9; leaves hold itemsets such as 234, 567, 145, 136, 124, 457, 125, 458, 159, 345, 356, 357, 689, 367, 368. For transaction 1 2 3 5 6, the subset function recursively expands 1+2356, 12+356, 13+56, … and visits only the leaves that can contain subsets of the transaction.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

76

Association Rules
Rule: c → am
  Support: 3 (i.e., the support of acm)
  Confidence: 75% (i.e., sup(acm) / sup(c))
Given a minimum support threshold and a minimum confidence threshold, find all association rules whose support and confidence pass the thresholds
Transaction database TDB
TID  Items bought
100  f, a, c, d, g, i, m, p
200  a, b, c, f, l, m, o
300  b, f, h, j, o
400  b, c, k, s, p
500  a, f, c, e, l, p, m, n
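A quick Python check of the support and confidence of c → am on the TDB above; the helper lambda is a hypothetical convenience, not from the slides:

db = [set('facdgimp'), set('abcflmo'), set('bfhjo'), set('bcksp'), set('afcelpmn')]
sup = lambda X: sum(1 for t in db if set(X) <= t)
print(sup('acm'))              # 3    (support of the rule)
print(sup('acm') / sup('c'))   # 0.75 (confidence of c -> am)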
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

77

Challenges of Freq Pat Mining


Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for
candidates

Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

78

Improving Apriori: Ideas


Reducing the number of transaction
database scans
Shrinking the number of candidates
Facilitating support counting of candidates

Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

79

Bottleneck of Freq Pattern Mining


Multiple database scans are costly
Mining long patterns needs many scans and
generates many candidates
To find the frequent itemset i1 i2 … i100
  # of scans: 100
  # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30

Bottleneck: candidate-generation-and-test

Can we avoid candidate generation?


Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

80

Search Space of Freq. Pat. Mining


Itemsets form a lattice
[Figure: the itemset lattice over {A, B, C, D}, from {} up to ABCD.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

81

Set Enumeration Tree


Use an order on items, enumerate itemsets in
lexicographic order
a, ab, abc, abcd, ac, acd, ad, b, bc, bcd, bd, c, cd, d
Reduce a lattice to a tree
[Figure: the set enumeration tree over {a, b, c, d}: a branches to ab, ac, ad; ab to abc, abd; abc to abcd; b branches to bc, bd; bc to bcd; c branches to cd.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

82

Borders of Frequent Itemsets


Frequent itemsets are connected
  ∅ is trivially frequent
  If X is on the border, every subset of X is frequent
[Figure: the set enumeration tree over {a, b, c, d} with the border separating the frequent itemsets from the infrequent ones.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

83

Projected Databases
To test whether Xy is frequent, we can use
the X-projected database
The sub-database of transactions containing X
Check whether item y is frequent in X-projected
database

[Figure: the set enumeration tree over {a, b, c, d}; the X-projected database is used to test the extensions of X.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

84

Compress Database by FP-tree


The 1st scan: find frequent items
  Only record frequent items in the FP-tree
  F-list: f-c-a-b-m-p
The 2nd scan: construct the tree
  Order the frequent items in each transaction w.r.t. the f-list
  Explore sharing among transactions
TID  Items bought             (ordered) frequent items
100  f, a, c, d, g, i, m, p   f, c, a, m, p
200  a, b, c, f, l, m, o      f, c, a, b, m
300  b, f, h, j, o            f, b
400  b, c, k, s, p            c, b, p
500  a, f, c, e, l, p, m, n   f, c, a, m, p
[Figure: the resulting FP-tree. The root has children f:4 and c:1; under f:4 are c:3 (then a:3, with m:2 → p:2 and b:1 → m:1) and b:1; under c:1 are b:1 → p:1. A header table over f, c, a, b, m, p links the nodes of each item.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)
85

Benefits of FP-tree
Completeness
  Never break a long pattern in any transaction
  Preserve complete information for frequent pattern mining
  No need to scan the database anymore
Compactness
  Reduce irrelevant info: infrequent items are removed
  Items in frequency-descending order (f-list): the more frequently occurring, the more likely to be shared
  Never larger than the original database (not counting node-links and the count fields)
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

86

Partitioning Frequent Patterns


Frequent patterns can be partitioned into
subsets according to f-list: f-c-a-b-m-p
Patterns containing p
Patterns having m but no p

Patterns having c but no a nor b, m, or p
Pattern f

Depth-first search of a set enumeration tree


The partitioning is complete and does not have
any overlap
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

87

Find Patterns Having Item p


Only transactions containing p are needed
Form the p-projected database
  Start at entry p of the header table
  Follow the side-links of frequent item p
  Accumulate all transformed prefix paths of p
p-projected database TDB|p
  fcam: 2
  cb: 1
Local frequent item: c:3
Frequent patterns containing p: p: 3, pc: 3
[Figure: the FP-tree and header table from the previous slide; the two p nodes (p:2 under f-c-a-m and p:1 under c-b) give the prefix paths.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)
88

Find Pat Having Item m But No p


Form the m-projected database TDB|m
  Item p is excluded (why?)
  TDB|m contains fca: 2, fcab: 1
  Local frequent items: f, c, a
Build the FP-tree for TDB|m
[Figure: the m-projected FP-tree, a single path root → f:3 → c:3 → a:3 with a header table over f, c, a, shown next to the full FP-tree.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)
89

Recursive Mining
Patterns having m but no p can be mined recursively
Optimization: enumerate patterns from a single-branch FP-tree
  Enumerate all combinations
  Support = that of the last item
    m, fm, cm, am
    fcm, fam, cam
    fcam
[Figure: the m-projected FP-tree, root → f:3 → c:3 → a:3, with its header table.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)
90

Enumerate Patterns From a Single Prefix of an FP-tree
A (projected) FP-tree may have a single prefix path
  Reduce the single prefix into one node
  Join the mining results of the two parts
[Figure: an FP-tree whose root is followed by a single prefix a1:n1 → a2:n2 → a3:n3 that then branches (b1:m1, c1:k1, c2:k2, c3:k3); it is split into the single-prefix part and the branching part r1, which are mined separately and their results joined.]

Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

91

FP-growth
Pattern-growth: recursively grow frequent patterns
by pattern and database partitioning
Algorithm
For each frequent item, construct its projected database,
and then its projected FP-tree
Repeat the process on each newly created projected
FP-tree
Until the resulting FP-tree is empty, or contains only one path (a single path generates all the combinations of its items, each of which is a frequent pattern)
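The sketch below mirrors the recursion of FP-growth but, to stay short, projects plain transaction lists instead of building FP-trees; that simplification, the names, and the toy call at the end are my own assumptions:

def pattern_growth(db, min_sup, suffix=()):
    # count local items in the (projected) database
    counts = {}
    for t in db:
        for item in set(t):
            counts[item] = counts.get(item, 0) + 1
    results = {}
    for item, c in counts.items():
        if c < min_sup:
            continue
        pattern = (item,) + suffix
        results[frozenset(pattern)] = c
        # project on this item and recurse; keeping only items before it in a fixed
        # order partitions the search space without overlap
        projected = [[x for x in t if counts.get(x, 0) >= min_sup and x < item]
                     for t in db if item in t]
        results.update(pattern_growth([t for t in projected if t], min_sup, pattern))
    return results

db = [list('facdgimp'), list('abcflmo'), list('bfhjo'), list('bcksp'), list('afcelpmn')]
print(pattern_growth(db, 3))   # includes {'f','c','a','m'} with support 3, i.e. fcam: 3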
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

92

Scaling up by DB Projection
What if an FP-tree cannot fit into memory?
Database projection
Partition a database into a set of projected
databases
Construct and mine FP-tree once the projected
database can fit into main memory
Heuristic: Projected database shrinks quickly in many
applications

Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

93

Parallel vs. Partition Projection


Parallel projection: form all projected databases at a time
Partition projection: propagate projections
Example (transaction DB: fcamp, fcabm, fb, cbp, fcamp):
  p-proj DB: fcam, cb, fcam
  m-proj DB: fcab, fca, fca
  b-proj DB: f, cb
  a-proj DB: fc
  c-proj DB: f
  f-proj DB: (empty)
  am-proj DB: fc, fc, fc
  cm-proj DB: f, f, f
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

94

Why Is FP-growth Efficient?


Divide-and-conquer strategy
Decompose both the mining task and DB
Lead to focused search of smaller databases

Other factors
No candidate generation nor candidate test
Database compression using FP-tree
No repeated scan of entire database
Basic operations counting local frequent items
and building FP-tree, no pattern search nor
pattern matching
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

95

Major Costs in FP-growth


Poor locality of FP-trees
Low hit rate of cache

Building FP-trees
A stack of FP-trees

Redundant information
Transaction abcd appears in a-, ab-, abc-, ac-,
, c- projected databases and FP-trees

Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

96

Effectiveness of Freq Pat Mining


Too many patterns!
  A pattern a1a2…an contains 2^n − 1 sub-patterns
  Understanding many patterns is difficult or even impossible for human users

Non-focused mining
A manager may be only interested in patterns
involving some items (s)he manages
A user is often interested in patterns satisfying
some constraints
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

97

Itemset Lattice
Tid  Transaction (min_sup = 2)
10   ABD
20   ABC
30   AD
40   ABCD
50   CD
Length  Frequent itemsets
1       A, B, C, D
2       AB, AC, AD, BC, BD, CD
3       ABC, ABD, ACD
[Figure: the itemset lattice over {A, B, C, D}, from {} up to ABCD.]

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

98

Max-Patterns
Tid  Transaction (min_sup = 2)
10   ABD
20   ABC
30   AD
40   ABCD
50   CD
Length  Frequent itemsets
1       A, B, C, D
2       AB, AC, AD, BC, BD, CD
3       ABC, ABD
[Figure: the itemset lattice with the max-patterns highlighted.]

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

99

Borders and Max-patterns


Max-patterns: borders of frequent patterns
  Any subset of a max-pattern is frequent
  Any superset of a max-pattern is infrequent
  Cannot generate rules (the supports of subsets are not retained)
[Figure: the itemset lattice from {} to ABCD with the border between frequent and infrequent itemsets.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

100

Patterns and Support Counts


Tid  Transaction (min_sup = 2)
10   ABD
20   ABC
30   AD
40   ABCD
50   CD
Len  Frequent itemsets
1    A:4, B:4, C:3, D:4
2    AB:3, AC:2, AD:3, BC:3, BD:2, CD:2
3    ABC:2, ABD:2
[Figure: the itemset lattice annotated with these support counts.]

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

101

Frequent Closed Patterns


For a frequent itemset X, if there exists no item y not in X such that every transaction containing X also contains y, then X is a frequent closed pattern
  acdf is a frequent closed pattern (min_sup = 2)
Concise representation of frequent patterns
  Can generate non-redundant rules
  Reduce # of patterns and rules
N. Pasquier et al., ICDT'99
TID  Items
10   a, c, d, e, f
20   a, b, e
30   c, e, f
40   a, c, d, f
50   c, e, f
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)
102

Closed and Max-patterns


Closed pattern mining algorithms can be
adapted to mine max-patterns
A max-pattern must be closed

Depth-first search methods have advantages


over breadth-first search ones
Why?

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

103

Constraint-based Data Mining


Find all the patterns in a database autonomously?
The patterns could be too many but not focused!

Data mining should be interactive


User directs what to be mined

Constraint-based mining
User flexibility: provides constraints on what to be mined
System optimization: push constraints for efficient mining

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

104

Constraints in Data Mining


Knowledge type constraint
classification, association, etc.

Data constraint using SQL-like queries


find product pairs sold together in stores in New York

Dimension/level constraint
in relevance to region, price, brand, customer category

Rule (or pattern) constraint


small sales (price < $10) triggers big sales (sum >$200)

Interestingness constraint
strong rules: support and confidence
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

105

Constrained Mining vs. Search


Constrained mining vs. constraint-based search
Both aim at reducing search space
Finding all patterns vs. some (or one) answers satisfying
constraints
Constraint-pushing vs. heuristic search
An interesting research problem on integrating both

Constrained mining vs. DBMS query processing


Database query processing requires to find all
Constrained pattern mining shares a similar philosophy
as pushing selections deeply in query processing
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

106

Optimization
Mining frequent patterns with constraint C
Sound: only find patterns satisfying the constraints C
Complete: find all patterns satisfying the constraints C

A nave solution
Constraint test as a post-processing

More efficient approaches


Analyze the properties of constraints
Push constraints as deeply as possible into frequent
pattern mining

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

107

Anti-Monotonicity
Anti-monotonicity: if an itemset S violates the constraint, so does any of its supersets
  sum(S.price) ≤ v is anti-monotone
  sum(S.price) ≥ v is not anti-monotone
Example: C: range(S.profit) ≤ 15
  Itemset ab violates C
  So does every superset of ab
TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g
Item profits: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)
108

Anti-monotonic Constraints
Constraint                              Anti-monotone
v ∈ S                                   no
S ⊇ V                                   no
S ⊆ V                                   yes
min(S) ≤ v                              no
min(S) ≥ v                              yes
max(S) ≤ v                              yes
max(S) ≥ v                              no
count(S) ≤ v                            yes
count(S) ≥ v                            no
sum(S) ≤ v (∀a ∈ S, a ≥ 0)              yes
sum(S) ≥ v (∀a ∈ S, a ≥ 0)              no
range(S) ≤ v                            yes
range(S) ≥ v                            no
avg(S) θ v, θ ∈ {=, ≤, ≥}               convertible
support(S) ≥ ξ                          yes
support(S) ≤ ξ                          no
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)
109

Monotonicity
Monotonicity: if an itemset S satisfies the constraint, so does any of its supersets
  sum(S.price) ≥ v is monotone
  min(S.price) ≤ v is monotone
Example: C: range(S.profit) ≥ 15
  Itemset ab satisfies C
  So does every superset of ab
TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g
Item profits: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)
110

Monotonic Constraints
Constraint                              Monotone
v ∈ S                                   yes
S ⊇ V                                   yes
S ⊆ V                                   no
min(S) ≤ v                              yes
min(S) ≥ v                              no
max(S) ≤ v                              no
max(S) ≥ v                              yes
count(S) ≤ v                            no
count(S) ≥ v                            yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0)              no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)              yes
range(S) ≤ v                            no
range(S) ≥ v                            yes
avg(S) θ v, θ ∈ {=, ≤, ≥}               convertible
support(S) ≥ ξ                          no
support(S) ≤ ξ                          yes
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)
111

Converting Tough Constraints


Convert tough constraints into anti-monotone or monotone by properly ordering items
Examine C: avg(S.profit) ≥ 25
  Order items in value-descending order R: <a, f, g, d, b, h, c, e>
If an itemset afb violates C
  So does afbh, afb*
  It becomes anti-monotone!
TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g
Item profits: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)
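A quick Python check of the example above; the profits are copied from the slide, while the code itself is only an illustrative sketch with made-up names:

profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10, 'e': -30, 'f': 30, 'g': 20, 'h': -10}
R = sorted(profit, key=profit.get, reverse=True)   # ['a','f','g','d','b','h','c','e']
avg = lambda S: sum(profit[x] for x in S) / len(S)
print(avg('afb'))    # 23.33 < 25: afb violates C, and so does any itemset
print(avg('afbh'))   # 15.0        having afb as a prefix w.r.t. R, e.g. afbh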
112

Convertible Constraints
Let R be an order of items
Convertible anti-monotone
If an itemset S violates a constraint C, so does every
itemset having S as a prefix w.r.t. R
Ex. avg(S) ≥ v w.r.t. item-value-descending order

Convertible monotone
If an itemset S satisfies constraint C, so does every
itemset having S as a prefix w.r.t. R
Ex. avg(S) ≤ v w.r.t. item-value-descending order

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

113

Strongly Convertible Constraints


avg(X) ≥ 25 is convertible anti-monotone w.r.t. item-value-descending order R: <a, f, g, d, b, h, c, e>
  If an itemset af violates constraint C, so does every itemset with af as a prefix, such as afd
avg(X) ≥ 25 is convertible monotone w.r.t. item-value-ascending order R^-1: <e, c, h, b, d, g, f, a>
  If an itemset d satisfies constraint C, so do itemsets df and dfa, which have d as a prefix
Item profits: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10
Thus, avg(X) ≥ 25 is strongly convertible


Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

114

Convertible Constraints
Constraint                                           Convertible      Convertible   Strongly
                                                     anti-monotone    monotone      convertible
avg(S) ≤ v, ≥ v                                      Yes              Yes           Yes
median(S) ≤ v, ≥ v                                   Yes              Yes           Yes
sum(S) ≤ v (items could be of any value, v ≥ 0)      Yes              No            No
sum(S) ≤ v (items could be of any value, v ≤ 0)      No               Yes           No
sum(S) ≥ v (items could be of any value, v ≥ 0)      No               Yes           No
sum(S) ≥ v (items could be of any value, v ≤ 0)      Yes              No            No

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

115

Can Apriori Handle Convertible Constraints?
A convertible constraint that is neither monotone nor anti-monotone nor succinct cannot be pushed deep into an Apriori mining algorithm
  Within the level-wise framework, no direct pruning based on the constraint can be made
  Itemset df violates constraint C: avg(X) ≥ 25
  Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
But it can be pushed into the frequent-pattern growth framework!
Item values: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

116

Mining With Convertible Constraints
C: avg(S.profit) ≥ 25
List the items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>
  C is convertible anti-monotone w.r.t. R
Scan the transaction DB once
  Remove infrequent items (item h in transaction 40 is dropped)
  Itemsets a and f are good
TDB (min_sup = 2), items listed in order R
TID  Transaction
10   a, f, d, b, c
20   f, g, d, b, c
30   a, f, d, c, e
40   f, g, h, c, e
Item profits (in order R): a: 40, f: 30, g: 20, d: 10, b: 0, h: -10, c: -20, e: -30

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

117

Not Every Pattern Is Interesting!


Trivial patterns
  Pregnant → Female [100% confidence]
Misleading patterns
  Play basketball → eat cereal [40%, 66.7%]
              Basketball   Not basketball   Sum (row)
Cereal        2000         1750             3750
Not cereal    1000         250              1250
Sum (col.)    3000         2000             5000

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

118

Evaluation Criteria
Objective interestingness measures
Examples: support, patterns formed by mutually
independent items
Domain independent

Subjective measures
Examples: domain knowledge, templates/
constraints

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

119

Correlation and Lift


P(B|A)/P(B) is called the lift of rule A → B
corr(A,B) = P(A ∪ B) / (P(A) P(B)) = P(AB) / (P(A) P(B))
Play basketball → eat cereal (lift: 0.89)
Play basketball → not eat cereal (lift: 1.33)
Contingency table (a 2-way contingency table for variables A and B records the cell counts f11, f10, f01, f00 with row sums f1+, f0+ and column sums f+1, f+0):
              Basketball   Not basketball   Sum (row)
Cereal        2000         1750             3750
Not cereal    1000         250              1250
Sum (col.)    3000         2000             5000
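A two-line Python check of the basketball/cereal lift; the numbers are taken from the table above, the code itself is only a sketch:

n = 5000.0
lift = (2000 / n) / ((3000 / n) * (3750 / n))
print(round(lift, 2))   # 0.89: playing basketball and eating cereal are negatively correlated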

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

120

Property of Lift

If A and B are independent, lift = 1
If A and B are positively correlated, lift > 1
If A and B are negatively correlated, lift < 1
Limitation: lift is sensitive to P(A) and P(B)
Contingency tables for the word pairs {p, q} and {r, s} (Tan et al., Table 6.9):
        q     not q   sum            s     not s   sum
p       880   50      930      r     20    50      70
not p   50    20      70       not r 50    880     930
sum     930   70      1000     sum   70    930     1000
lift(p, q) < lift(r, s)!
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)
121

From Itemsets to Sequences


Itemsets: combinations of items, no temporal order
Temporal order is important in many situations
Time-series databases and sequence databases
Frequent patterns (frequent) sequential patterns

Applications of sequential pattern mining


Customer shopping sequences:
First buy computer, then iPod, and then digital camera, within 3
months.

Medical treatment, natural disasters, science and


engineering processes, stocks and markets, telephone
calling patterns, Web log clickthrough streams, DNA
sequences and gene structures
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

122

What Is Sequential Pattern Mining?


Given a set of sequences, find the complete
set of frequent subsequences
A sequence database
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
In the sequence <(ef)(ab)(df)cb>, an element may contain a set of items; items within an element are unordered and we list them alphabetically.

<a(bc)dc> is a subsequence
of <a(abc)(ac)d(cf)>

Given support threshold min_sup =2, <(ab)c> is a


sequential pattern
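A small Python sketch of the subsequence test underlying this definition (sequences as lists of frozensets; the representation and function name are my assumptions):

def is_subsequence(s, t):
    i = 0
    for element in t:                         # scan the data sequence left to right
        if i < len(s) and s[i] <= element:    # s[i] must be contained in one element of t
            i += 1
    return i == len(s)

seq = [frozenset('a'), frozenset('abc'), frozenset('ac'), frozenset('d'), frozenset('cf')]
pat = [frozenset('a'), frozenset('bc'), frozenset('d'), frozenset('c')]
print(is_subsequence(pat, seq))   # True: <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>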
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

123

Challenges in Seq Pat Mining


A huge number of possible sequential
patterns are hidden in databases
A mining algorithm should
Find the complete set of patterns satisfying the
minimum support (frequency) threshold
Be highly efficient, scalable, involving only a
small number of database scans
Be able to incorporate various kinds of user-specific constraints
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

124

Apriori Property of Seq Patterns


Apriori property in sequential patterns
  If a sequence S is infrequent, then none of the super-sequences of S is frequent
  E.g., <hb> is infrequent, so are <hab> and <(ah)b>
Given support threshold min_sup = 2
Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)
125

GSP
GSP (Generalized Sequential Pattern) mining
Outline of the method
Initially, every item in DB is a candidate of length-1
For each level (i.e., sequences of length-k) do
Scan database to collect support count for each candidate
sequence
Generate candidate length-(k+1) sequences from length-k
frequent sequences using Apriori

Repeat until no frequent sequence or no candidate can


be found

Major strength: Candidate pruning by Apriori


Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

126

Finding Len-1 Seq Patterns


Initial candidates: <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
Scan the database once, count support for each candidate (min_sup = 2):
Cand  Sup
<a>   3
<b>   5
<c>   4
<d>   3
<e>   3
<f>   2
<g>   1
<h>   1
Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)
127

Generating Length-2 Candidates


51 length-2 candidates
  36 of the form <xy> with x, y taken from {a, b, c, d, e, f}: <aa>, <ab>, …, <ff>
  15 of the form <(xy)> with x < y: <(ab)>, <(ac)>, …, <(ef)>
Without the Apriori property, 8*8 + 8*7/2 = 92 candidates would be generated
Apriori prunes 44.57% of the candidates
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)
128

Finding Len-2 Seq Patterns


Scan database one more time, collect
support count for each length-2 candidate
There are 19 length-2 candidates which
pass the minimum support threshold
They are length-2 sequential patterns

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

129

Generating Length-3 Candidates and


Finding Length-3 Patterns
Generate Length-3 Candidates
Self-join length-2 sequential patterns
<ab>, <aa> and <ba> are all length-2 sequential patterns → <aba> is a length-3 candidate
<(bd)>, <bb> and <db> are all length-2 sequential patterns → <(bd)b> is a length-3 candidate

46 candidates are generated

Find Length-3 Sequential Patterns


Scan database once more, collect support
counts for candidates
19 out of 46 candidates pass support threshold
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

130

The GSP Mining Process


5th scan: 1 cand. 1 length-5 seq.
pat.

<(bd)cba>

Cand. cannot pass


sup. threshold

Cand. not in DB at all


4th scan: 8 cand. 6 length-4 seq. <abba> <(bd)bc>
pat.
3rd scan: 46 cand. 19 length-3 seq. <abb> <aab> <aba> <baa> <bab>
pat. 20 cand. not in DB at all
2nd scan: 51 cand. 19 length-2 seq.
<aa> <ab> <af> <ba> <bb> <ff> <(ab)> <(ef)>
pat. 10 cand. not in DB at all
1st scan: 8 cand. 6 length-1 seq.
<a> <b> <c> <d> <e> <f> <g> <h>
pat.

min_sup
=2
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

Seq-id
10
20
30
40
50

Sequence

<(bd)cb(ac)>
<(bf)(ce)b(fg)>
<(ah)(bf)abf>
<(be)(ce)d>
<a(bd)bcb(ade)>
131

The GSP Algorithm


Take sequences in form of <x> as length-1
candidates
Scan database once, find F1, the set of length-1
sequential patterns
Let k=1; while Fk is not empty do
Form Ck+1, the set of length-(k+1) candidates from Fk;
If Ck+1 is not empty, scan database once, find Fk+1, the
set of length-(k+1) sequential patterns
Let k=k+1;

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

132

Bottlenecks of GSP
A huge set of candidates
  1,000 frequent length-1 sequences generate 1000 × 1000 + 1000 × 999 / 2 = 1,499,500 length-2 candidates!
Multiple scans of the database in mining
Real challenge: mining long sequential patterns
  An exponential number of short candidates
  A length-100 sequential pattern needs about 10^30 candidate sequences: Σ_{i=1}^{100} C(100, i) = 2^100 − 1 ≈ 10^30
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

133

FreeSpan: Freq Pat-projected


Sequential Pattern Mining
The itemset of a seq pat must be frequent
Recursively project a sequence database into a
set of smaller databases based on the current
set of frequent patterns
Mine each projected database to find its patterns
Sequence Database SDB
< (bd) c b (ac) >
< (bf) (ce) b (fg) >
< (ah) (bf) a b f >
< (be) (ce) d >
< a (bd) b c b (ade) >

f_list: b:5, c:4, a:3, d:3, e:3, f:2


All seq. pat. can be divided into 6 subsets:
Seq. pat. containing item f
Those containing e but no f
Those containing d but no e nor f
Those containing a but no d, e or f
Those containing c but no a, d, e or f
Those containing only item b

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

134

From FreeSpan to PrefixSpan


Freespan:
Projection-based: no candidate sequence needs
to be generated
But, projection can be performed at any point in
the sequence, and the projected sequences may
not shrink much

PrefixSpan
Projection-based
But only prefix-based projection: less projections
and quickly shrinking sequences
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

135

Prefix and Suffix (Projection)


<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
Given sequence <a(abc)(ac)d(cf)>:
Prefix   Suffix (prefix-based projection)
<a>      <(abc)(ac)d(cf)>
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

136

Mining Sequential Patterns by Prefix Projections
Step 1: find length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
  the ones having prefix <a>;
  the ones having prefix <b>;
  …
  the ones having prefix <f>
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

137

Finding Seq. Pat. with Prefix <a>


Only need to consider projections w.r.t. <a>
  <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
Further partition into 6 subsets:
  having prefix <aa>;
  …
  having prefix <af>
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)
138

Completeness of PrefixSpan
[Figure: how PrefixSpan partitions the mining task over the sequence database SDB (10 <a(abc)(ac)d(cf)>, 20 <(ad)c(bc)(ae)>, 30 <(ef)(ab)(df)cb>, 40 <eg(af)cbc>). The length-1 sequential patterns <a>, <b>, …, <f> split all patterns by prefix. The <a>-projected database <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc> yields the length-2 patterns <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, which are mined in turn via the <aa>-, …, <af>-projected databases; likewise for prefixes <b> through <f>.]

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

139

Efficiency of PrefixSpan
No candidate sequence needs to be
generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing
projected databases
Can be improved by bi-level projections
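A simplified PrefixSpan sketch in Python, restricted to sequences of single items so the prefix projection stays visible; the simplification, toy data, and names are my own assumptions:

def prefixspan(db, min_sup, prefix=()):
    patterns = {}
    counts = {}
    for s in db:                              # count items that can extend the prefix
        for item in set(s):
            counts[item] = counts.get(item, 0) + 1
    for item, c in sorted(counts.items()):
        if c < min_sup:
            continue
        new_prefix = prefix + (item,)
        patterns[new_prefix] = c
        # project: keep only the suffix after the first occurrence of the item
        projected = [s[s.index(item) + 1:] for s in db if item in s]
        patterns.update(prefixspan([s for s in projected if s], min_sup, new_prefix))
    return patterns

db = [list('abcb'), list('abbca'), list('bac'), list('abc')]
print(prefixspan(db, 3))   # includes ('a','b'): 3 and ('a','b','c'): 3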

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

140

Effectiveness
Redundancy due to anti-monotonicity
{<abcd>} leads to 15 sequential patterns of
same support
Closed sequential patterns and sequential
generators

Constraints on sequential patterns


Gap
Length
More sophisticated, application oriented
constraints
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

141

Data Warehousing & OLAP

Motivation: Business Intelligence


Customer information
(customer-id, gender, age,
home-address, occupation,
income, family-size, )

Product information
(Product-id, category,
manufacturer, made-in,
stock-price, )

Sales information
(customer-id, product-id, #units, unit-price,
sales-representative, )

Business queries:
Which categories of products are most popular for customers
Find pairs (customer groups, most popular products)
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

143

In what aspect is he most similar to cases of coronary artery disease and, at the same time, dissimilar to adiposity?
Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

144

Don't You Ever Google Yourself?
Big data makes one know oneself better
  57% of American adults search for themselves on the Internet
  Good news: those people are better paid than those who haven't done so! (Investors.com)
Egocentric analysis becomes more and more important with big data
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

145

Egocentric Analysis
How am I different from (more often than
not, better than) others?
In what aspects am I good?

http://img03.deviantart.net/a670/i/2010/219/a/e/glee___egocentric_by_gleeondoodles.jpg

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

146

Dimensions
An aspect or feature of a situation, problem, or
thing, a measurable extent of some kind
Dictionary
Dimensions/attributes are used to model
complex objects in a divide-and-conquer
manner
Objects are compared in selected dimensions/
attributes

More often than not, objects have more dimensions/attributes than one is interested in or can handle
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

147

Multi-dimensional Analysis
Find interesting patterns in multi-dimensional
subspaces
Michael Jordan is outstanding in subspaces (total
points, total rebounds, total assists) and (number of
games played, total points, total assists)

Different patterns may be manifested in


different subspaces
Feature selection (machine learning and statistics):
select a subset of relevant features for use in model
construction a set of features for all objects
Different subspaces may manifest different patterns
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

148

OLAP
Conceptually, we may explore all possible
subspaces for interesting patterns

What patterns are interesting?


How can we explore all possible subspaces
systematically and efficiently?
Fundamental problems in analytics and data
mining

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

149

OLAP
Aggregates and group-bys are frequently used in
data analysis and summarization
SELECT time, altitude, AVG(temp)
FROM weather GROUP BY time, altitude;
In TPC, 6 standard benchmarks have 83 queries,
aggregates are used 59 times, group-bys are used 20
times

Online analytical processing (OLAP): the


techniques that answer multi-dimensional
analytical (MDA) queries efficiently
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

150

OLAP Operations
Roll up (drill-up): summarize data by
climbing up hierarchy or by dimension
reduction
(Day, Store, Product type, SUM(sales)) → (Month, City, *, SUM(sales))

Drill down (roll down): reverse of roll-up,


from higher level summary to lower level
summary or detailed data, or introducing
new dimensions
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

151

Roll Up
http://www.tutorialspoint.com/dwh/images/rollup.jpg

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

152

Drill Down

http://www.tutorialspoint.com/dwh/images/drill_down.jpg

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

153

Other Operations
Dice: pick specific values or ranges on some
dimensions
Pivot: rotate a cube changing the order of
dimensions in visual analysis

http://en.wikipedia.org/wiki/File:OLAP_pivoting.png

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

154

Dice

http://www.tutorialspoint.com/dwh/images/dice.jpg

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

155

Relational Representation
If there are n dimensions, there are 2^n possible aggregation columns
Roll up by model by year by color in a table

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

156

Difficulties
Many group-bys are needed
  6 dimensions → 2^6 = 64 group-bys
In most SQL systems, the resulting query needs 64 scans of the data, 64 sorts or hashes, and a long wait!

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

157

Dummy Value ALL

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

158

DATA CUBE
SALES table
Model  Year  Color  Sales
Chevy  1990  red      5
Chevy  1990  white   87
Chevy  1990  blue    62
Chevy  1991  red     54
Chevy  1991  white   95
Chevy  1991  blue    49
Chevy  1992  red     31
Chevy  1992  white   54
Chevy  1992  blue    71
Ford   1990  red     64
Ford   1990  white   62
Ford   1990  blue    63
Ford   1991  red     52
Ford   1991  white    9
Ford   1991  blue    55
Ford   1992  red     27
Ford   1992  white   62
Ford   1992  blue    39

The CUBE operator computes every group-by over (Model, Year, Color):

SELECT Model, Year, Color, SUM(sales) AS Sales
FROM Sales
WHERE Model in {'Ford', 'Chevy'}
AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE(Model, Year, Color);
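A small Python sketch of the same cube computation, with 'ALL' standing in for the aggregated-out dimensions; the toy rows are the 1990 slice of the SALES table above, and the helper structure is my own assumption:

from itertools import combinations

rows = [('Chevy', 1990, 'red', 5),  ('Chevy', 1990, 'white', 87), ('Chevy', 1990, 'blue', 62),
        ('Ford',  1990, 'red', 64), ('Ford',  1990, 'white', 62), ('Ford',  1990, 'blue', 63)]

cube = {}
for r in rows:
    # each row contributes to every subset of the 3 dimensions (2^3 = 8 group-bys)
    for k in range(4):
        for keep in combinations(range(3), k):
            key = tuple(r[i] if i in keep else 'ALL' for i in range(3))
            cube[key] = cube.get(key, 0) + r[3]

print(cube[('Chevy', 1990, 'ALL')])   # 154 = 5 + 87 + 62
print(cube[('ALL', 'ALL', 'ALL')])    # 343, the grand total of the six rows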
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

The result, DATA CUBE of SALES, has 48 rows: the 18 (Model, Year, Color) groups above plus every roll-up using the dummy value ALL, e.g. (Chevy, 1990, ALL, 154), (Ford, 1990, ALL, 189), (Chevy, ALL, ALL, 508), (Ford, ALL, ALL, 433), and the grand total (ALL, ALL, ALL, 941).

159

Semantics of ALL
ALL is a set
Model.ALL = ALL(Model) = {Chevy, Ford }
Year.ALL = ALL(Year) = {1990,1991,1992}
Color.ALL = ALL(Color) = {red,white,blue}

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

160

OLTP Versus OLAP


                     OLTP                                   OLAP
users                clerk, IT professional                 knowledge worker
function             day-to-day operations                  decision support
DB design            application-oriented                   subject-oriented
data                 current, up-to-date, detailed,         historical, summarized, multidimensional,
                     flat relational, isolated              integrated, consolidated
usage                repetitive                             ad-hoc
access               read/write, index/hash on prim. key    lots of scans
unit of work         short, simple transaction              complex query
# records accessed   tens                                   millions
# users              thousands                              hundreds
DB size              100MB-GB                               100GB-TB
metric               transaction throughput                 query throughput, response time

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

161

What Is a Data Warehouse?


A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. (W. H. Inmon)
Data warehousing: the process of
constructing and using data warehouses

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

162

Subject-Oriented
Organized around major subjects, such as
customer, product, sales
Focusing on the modeling and analysis of
data for decision makers, not on daily
operations or transaction processing
Providing a simple and concise view around
particular subject issues by excluding data
that are not useful in the decision support
process
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

163

Integrated
Integrating multiple, heterogeneous data sources
Relational databases, flat files, on-line transaction
records

Data cleaning and data integration


Ensuring consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
E.g., Hotel price: currency, tax, breakfast covered, etc.

When data is moved to the warehouse, it is converted

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

164

Time Variant
The time horizon for the data warehouse is
significantly longer than that of operational systems
Operational databases: current value data
Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)

Every key structure in the data warehouse contains


an element of time, explicitly or implicitly
But the key of operational data may or may not contain
time element

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

165

Nonvolatile
A physically separate store of data
transformed from the operational
environment
Operational updates of data do not occur in
the data warehouse environment
Do not require transaction processing, recovery,
and concurrency control mechanisms
Require only two operations in data accessing
Initial loading of data
Access of data
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

166

Why Separate Data Warehouse?


High performance for both
Operational DBMS: tuned for OLTP
Warehouse: tuned for OLAP

Different functions and different data


Historical data: data analysis often uses
historical data that operational databases do not
typically maintain
Data consolidation: data analysis requires
consolidation (aggregation, summarization) of
data from heterogeneous sources
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

167

Data Warehouse Schema Design


Query answering efficiency
Subject orientation
Integration

Tradeoff between time and space


Universal table versus fully normalized schema

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

168

Star Schema

Dimension table time: time_key, day, day_of_the_week, month, quarter, year
Dimension table branch: branch_key, branch_name, branch_type
Dimension table item: item_key, item_name, brand, type, supplier_type
Dimension table location: location_key, street, city, state_or_province, country

Sales Fact Table: time_key, item_key, branch_key, location_key,
units_sold, dollars_sold, avg_sales (measures)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

169

Snowflake Schema

Dimension table time: time_key, day, day_of_the_week, month, quarter, year
Dimension table branch: branch_key, branch_name, branch_type
Dimension table item: item_key, item_name, brand, type, supplier_key
  Normalized into supplier: supplier_key, supplier_type
Dimension table location: location_key, street, city_key
  Normalized into city: city_key, city, state_or_province, country

Sales Fact Table: time_key, item_key, branch_key, location_key,
units_sold, dollars_sold, avg_sales (measures)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

170

Fact Constellation

Dimension tables shared by multiple fact tables:
  time: time_key, day, day_of_the_week, month, quarter, year
  branch: branch_key, branch_name, branch_type
  item: item_key, item_name, brand, type, supplier_type
  location: location_key, street, city, province_or_state, country
  shipper: shipper_key, shipper_name, location_key, shipper_type

Sales Fact Table: time_key, item_key, branch_key, location_key,
units_sold, dollars_sold, avg_sales (measures)

Shipping Fact Table: time_key, item_key, shipper_key, from_location,
to_location, dollars_cost, units_shipped (measures)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

171

(Good) Aggregate Functions


Distributive: there is a function G() such that
  F({X_{i,j}}) = G({ F({X_{i,j} | i = 1, ..., I_j}) | j = 1, ..., n })
  Examples: COUNT(), MIN(), MAX(), SUM()
  G = SUM() for F = COUNT()

Algebraic: there is an M-tuple valued function G() and a function H() such that
  F({X_{i,j}}) = H({ G({X_{i,j} | i = 1, ..., I_j}) | j = 1, ..., n })
  Examples: AVG(), standard deviation, MaxN(), MinN()
  For AVG(), G() records the (sum, count) of each partition; H() adds up the sums and
  counts and divides to produce the global average
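A minimal sketch of the distinction above: SUM is distributive (combine partial sums with G = SUM), while AVG is algebraic (each partition contributes a (sum, count) pair and H combines them). The partition data is made up for illustration.

partitions = [[3, 5, 8], [2, 7], [4, 4, 6, 1]]

# Distributive: F = SUM, combined by summing the per-partition results.
partial_sums = [sum(part) for part in partitions]
total_sum = sum(partial_sums)

# Algebraic: F = AVG; G records the 2-tuple (sum, count) per partition,
# H adds the components and divides to get the global average.
partials = [(sum(part), len(part)) for part in partitions]
global_avg = sum(s for s, _ in partials) / sum(c for _, c in partials)

assert total_sum == sum(sum(part) for part in partitions)
print(total_sum, global_avg)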
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

172

Holistic Aggregate Functions


There is no constant bound on the size of
the storage needed to describe a subaggregate.
There is no constant M, such that an M-tuple
characterizes the computation
F({Xi,j |i=1,...,I}).

Examples: Median(), MostFrequent() (also


called the Mode()), and Rank()
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

173

Index Requirements in OLAP


Data is read only
(Almost) no insertion or deletion

Query types
Point query: looking up one specific tuple (rare)
Range query: returning the aggregate of a
(large) set of tuples, with group by
Complex queries: need specific algorithms and
index structures, will be discussed later

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

174

OLAP Query Example


In table (cust, gender, ), find the total
number of male customers
Method 1: scan the table once
Method 2: build a B+ tree index on attribute
gender, still need to access all tuples of male
customers
Can we get the count without scanning many
tuples, even not all tuples of male
customers?
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

175

Bitmap Index
For n tuples, a bitmap index has n bits, which can be packed into ⌈n/8⌉ bytes or ⌈n/32⌉ words
From a bit back to the row-id: the j-th bit of the p-th byte corresponds to row-id = p*8 + j
Example: for a table (cust, gender) with rows Jack, Cathy, Nancy, one bitmap is kept per
gender value; the bitmap 1 0 0 marks the rows holding that value (here, only the first row)
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

176

Using Bitmap to Count


shcount[] is a precomputed table: shcount[v] is the number of 1-bits in the byte value v
Example: shcount[01100101] = 4

/* B[] holds the bitmap packed into SHNUM bytes */
count = 0;
for (i = 0; i < SHNUM; i++)
    count += shcount[B[i]];
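A minimal Python sketch of the same byte-wise counting idea: pack a bitmap into bytes and count 1-bits with a precomputed table (the slides' shcount[]). The bitmap contents are made up; in the OLAP example it would mark, say, the male customers.

shcount = [bin(v).count("1") for v in range(256)]   # shcount[v] = # of 1-bits in byte v

def make_bitmap(bits):
    """Pack a list of 0/1 flags (one per tuple) into a bytearray, 8 tuples per byte."""
    packed = bytearray((len(bits) + 7) // 8)
    for row_id, bit in enumerate(bits):
        if bit:
            packed[row_id // 8] |= 1 << (row_id % 8)
    return packed

def count_ones(bitmap):
    return sum(shcount[b] for b in bitmap)

gender_is_male = make_bitmap([1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
print(count_ones(gender_is_male))   # 5, without touching the base tuples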

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

177

Advantages of Bitmap Index


Efficient in space
Ready for logic composition
C = C1 AND C2
Bitmap operations can be used

Bitmap index only works for categorical data


with low cardinality
Naively, we need 50 bits per entry to represent
the state of a customer in US
How to represent a sale in dollars?
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

178

Bit-Sliced Index
A sale amount can be written as an integer
number of pennies, and then be represented
as a binary number of N bits
24 bits is good for up to $167,772.15,
appropriate for many stores

A bit-sliced index is a set of N bitmaps B_0, ..., B_{N-1}
  Tuple j sets bit j in bitmap B_k iff the k-th bit of its binary representation is 1
  The space cost of a bit-sliced index is the same as storing the data directly
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

179

Using Indexes
SELECT SUM(sales) FROM Sales WHERE C;
The tuples satisfying C are identified by a bitmap B

Direct access to rows to calculate SUM: scan the whole table once
B+ tree: find the tuples from the tree
Projection index: scan only attribute sales
Bit-sliced index: get the sum as Σ_k COUNT(B AND B_k) * 2^k
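A minimal sketch of answering SELECT SUM(sales) WHERE C with a bit-sliced index: B marks the tuples satisfying C, B_k marks the tuples whose k-th bit is 1, and SUM = Σ_k COUNT(B AND B_k) * 2^k. Bitmaps are plain Python ints here for brevity, and the data is made up.

sales = [5, 87, 62, 54, 95, 49]          # sale amounts (e.g., in pennies)
satisfies_c = [1, 0, 1, 1, 0, 1]         # tuples selected by the condition C

N = 8                                     # number of bit slices (enough for values < 256)
B = sum(1 << j for j, f in enumerate(satisfies_c) if f)
B_k = [sum(1 << j for j, v in enumerate(sales) if (v >> k) & 1) for k in range(N)]

total = sum(bin(B & B_k[k]).count("1") * (1 << k) for k in range(N))
assert total == sum(v for v, f in zip(sales, satisfies_c) if f)
print(total)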
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

180

Cost Comparison
Traditional value-list index (B+ tree) is costly
in both I/O and CPU time
Not good for OLAP

Bit-sliced index is efficient in I/O


Other case studies in [O Neil and Quass,
SIGMOD 97]

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

181

Horizontal or Vertical Storage


A fact table for data warehousing is often fat
Tens or even hundreds of dimensions/attributes

A query is often about only a few attributes


Horizontal storage: tuples are stored one by one
Vertical storage: tuples are stored by attributes
(Illustration: a table with attributes A1, A2, ..., A100 and tuples (x1, ..., x100), ..., (z1, ..., z100);
horizontal storage keeps each tuple together, while vertical storage keeps each attribute column together.)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

182

Horizontal Versus Vertical


Find the information of tuple t
Typical in OLTP
Horizontal storage: get the whole tuple in one search
Vertical storage: search 100 lists

Find SUM(a100) GROUP BY {a22, a83}

Typical in OLAP
Horizontal storage (no index): search all tuples O(100n),
where n is the number of tuples
Vertical storage: search 3 lists O(3n), 3% of the
horizontal storage method

Projection index: vertical storage


Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

183

MOLAP
(Illustration: a 3-D data cube stored as a multidimensional array, with dimensions Date (1Qtr-4Qtr plus sum),
Product (TV, PC, VCR plus sum), and Country (U.S.A, Canada, Mexico plus sum); each cell holds the aggregated measure.)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

184

Pros and Cons

Easy to implement
Fast retrieval
Many entries may be empty if data is sparse
Costly in space

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

185

ROLAP Data Cube in Table


A multi-dimensional database
Base table
Dimensions

Measure
Dimensions

Measure

Store

Product

Season

Sales

S1

P1

Spring

Store

S1

P2

Spring

12

S1

P1

Spring

S2

P1

Fall

S1

P2

Spring

12

S2

P1

Fall

S1

Spring

Cubing

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

Product Season AVG(Sales)

186

Data Cube: A Lattice of Cuboids

0-D (apex) cuboid: all
1-D cuboids: time, item, location, supplier
2-D cuboids: (time,item), (time,location), (time,supplier), (item,location), (item,supplier), (location,supplier)
3-D cuboids: (time,item,location), (time,item,supplier), (time,location,supplier), (item,location,supplier)
4-D (base) cuboid: (time, item, location, supplier)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

187

Data Cube: A Lattice of Cuboids

0-D (apex) cuboid: all
1-D cuboids: time, item, location, supplier
2-D cuboids: (time,item), (time,location), (time,supplier), (item,location), (item,supplier), (location,supplier)
3-D cuboids: (time,item,location), (time,item,supplier), (time,location,supplier), (item,location,supplier)
4-D (base) cuboid: (time, item, location, supplier)

Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
  Base cell: (9/15, milk, Urbana, Dairy_land)
  Aggregate cells: (9/15, milk, Urbana, *), (*, milk, Urbana, *), (*, milk, Chicago, *), (*, milk, *, *)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

188

Full Cube vs. Iceberg Cube


Full cube vs. iceberg cube: an iceberg cube materializes only the cells that satisfy an
iceberg condition (the HAVING clause below)

compute cube sales_iceberg as
select month, city, customer_group, count(*)
from salesInfo
cube by month, city, customer_group
having count(*) >= min_support

Avoid explosive growth: consider a cube with 100 dimensions
  2 base cells: (a1, a2, ..., a100) and (b1, b2, ..., b100)
  How many aggregate cells if having count(*) >= 1?
  What about having count(*) >= 2?

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

189

Multi-Way Array Aggregation


Array-based bottom-up
algorithm
Using multi-dimensional chunks
No direct tuple comparisons
Simultaneous aggregation on
multiple dimensions
Intermediate aggregate values
are re-used for computing
ancestor cuboids
Cannot do Apriori pruning: No
iceberg optimization
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

(Cuboid lattice: ABC at the base; AB, AC, BC above it; All at the apex.)

Multi-way Array Aggregation for


Cube Computation (MOLAP)
Partition arrays into chunks (a small subcube which fits in memory).
Compressed sparse array addressing: (chunk_id, offset)
Compute aggregates in multiway by visiting cube cells in the order which
minimizes the # of times to visit each cell, and reduces memory access & storage
cost.
(Illustration: a 3-D array with dimensions A (a0-a3), B (b0-b3), and C (c0-c3), partitioned into 64 chunks.)

What is the best traversing order to do multi-way aggregation?
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

191

Multi-way Array Aggregation for Cube Computation (3-D to 2-D)

(Illustration: the 3-D cuboid ABC is aggregated into the 2-D cuboids AB, AC, and BC, and then toward All.)

The best order is the one that minimizes the memory requirement and reduces I/Os

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

192

Multi-way Array Aggregation for Cube Computation (2-D to 1-D)

(Illustration: the 2-D cuboids AB, AC, and BC are aggregated further toward the 1-D cuboids and the apex All.)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

193

Multi-Way Array Aggregation for


Cube Computation
Method: the planes should be sorted and
computed according to their size in ascending
order
Idea: keep the smallest plane in the main memory,
fetch and compute only one chunk at a time for the
largest plane

Limitation of the method: computing well only


for a small number of dimensions
If there are a large number of dimensions, topdown computation and iceberg cube computation
methods can be explored
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

194

Iceberg Cube
In a data cube, many aggregate cells are
trivial
Having an aggregate too small

Iceberg query

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

195

Monotonic Iceberg Condition


If COUNT(a, b, *)<100, then COUNT(a, b,
c)<100 for any c
For cells c1 and c2, c1 is called an ancestor
of c2 if in all dimensions that c1 takes a non-*
value, c2 agrees with c1
(a,b,*) is an ancestor of (a,b,c)

An iceberg condition P is monotonic if for


any aggregate cell c failing P, any
descendants of c cannot honor P
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

196

BUC
Once a base table (A, B, C) is sorted in the order A-B-C, the aggregates (*,*,*), (A,*,*),
(A,B,*), and (A,B,C) can be computed with one scan and 4 counters
To compute the other aggregates, we can sort the base table in other orders
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

197

Example

Threshold: sum() >= 300

Location    Year  Color   Amount
Vancouver   2015  Yellow  300
Victoria    2014  Red     400
Seattle     2015  Green   120
Vancouver   2014  Green   260
Seattle     2015  Red     160
Vancouver   2014  Yellow  280
Vancouver   2015  Red     160
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

198

Example: Sorting on Location

Location    Year  Color   Amount
Seattle     2015  Green   120
Seattle     2015  Red     160
Vancouver   2015  Yellow  300
Vancouver   2014  Yellow  280
Vancouver   2015  Red     160
Vancouver   2014  Green   260
Victoria    2014  Red     400

Sum(Seattle, *, *) = 280
Sum(Vancouver, *, *) = 1000
Sum(Victoria, *, *) = 400

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

199

Sorting on Year for Vancouver

Location    Year  Color   Amount
Seattle     2015  Green   120
Seattle     2015  Red     160
Vancouver   2014  Yellow  280
Vancouver   2014  Green   260
Vancouver   2015  Yellow  300
Vancouver   2015  Red     160
Victoria    2014  Red     400

Sum(Vancouver, 2014, *) = 540
Sum(Vancouver, 2015, *) = 460

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

200

Color on Vancouver & 2014/2015

Location    Year  Color   Amount
Seattle     2015  Green   120
Seattle     2015  Red     160
Vancouver   2014  Green   260
Vancouver   2014  Yellow  280
Vancouver   2015  Red     160
Vancouver   2015  Yellow  300
Victoria    2014  Red     400

Sum(Vancouver, 2014, Yellow) = 280
Sum(Vancouver, 2014, Green) = 260
Sum(Vancouver, 2015, Yellow) = 300
Sum(Vancouver, 2015, Red) = 160
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

201

Sort on Color for Vancouver

Location    Year  Color   Amount
Seattle     2015  Green   120
Seattle     2015  Red     160
Vancouver   2014  Green   260
Vancouver   2015  Red     160
Vancouver   2014  Yellow  280
Vancouver   2015  Yellow  300
Victoria    2014  Red     400

Sum(Vancouver, *, Green) = 260
Sum(Vancouver, *, Red) = 160
Sum(Vancouver, *, Yellow) = 580

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

202

How to Sort the Base Table?


General sorting in main memory: O(n log n)
Counting in main memory: O(n), linear in the number of tuples in the base table
How to sort 1 million integers in the range 0 to 100?
  Set up one counter per possible value and initialize them to 0
  Scan the integers once, counting the occurrences of each value
  Scan the integers again, putting each integer in its right place
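A minimal sketch of the counting idea above: sorting 1 million integers in the range 0..100 in linear time with one counter per value, the kind of value-counting pass BUC uses instead of a general O(n log n) sort.

import random

values = [random.randint(0, 100) for _ in range(1_000_000)]

counts = [0] * 101                 # one counter per possible value
for v in values:                   # first scan: count occurrences
    counts[v] += 1

sorted_values = []                 # second scan: emit each value count[v] times
for v, c in enumerate(counts):
    sorted_values.extend([v] * c)

assert sorted_values == sorted(values)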
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

203

Pushing Monotonic Conditions


BUC searches the
aggregates bottom-up
in depth-first manner
Only when a
monotonic condition
holds, the descendants
of the current node
should be expanded

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

204

Clustering

Community Detection

http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-socialmedia-1-728.jpg?cb=1308736811

Jian Pei: CMPT 741/459 Clustering (1)

206

Customer Relation Management


Partitioning customers into groups such that
customers within a group are similar in some
aspects
A manager can be assigned to a group
Customized products and services can be
developed

Jian Pei: CMPT 741/459 Clustering (1)

207

What Is Clustering?
Group data into clusters
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Unsupervised learning: no predefined classes
Outliers
Cluster 1
Cluster 2

Jian Pei: CMPT 741/459 Clustering (1)

208

Requirements of Clustering

Scalability
Ability to deal with various types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge
to determine input parameters

Jian Pei: CMPT 741/459 Clustering (1)

209

Data Matrix
For memory-based clustering
Also called object-by-variable structure

Represents n objects with p variables


(attributes, measures)
A relational table

Jian Pei: CMPT 741/459 Clustering (1)

    x_11  ...  x_1f  ...  x_1p
    ...        ...        ...
    x_i1  ...  x_if  ...  x_ip
    ...        ...        ...
    x_n1  ...  x_nf  ...  x_np

210

Dissimilarity Matrix
For memory-based clustering
Also called object-by-object structure
Proximities of pairs of objects
d(i, j): the dissimilarity between objects i and j
  Nonnegative; close to 0 means very similar
The matrix is lower triangular:
    0
    d(2,1)   0
    d(3,1)   d(3,2)   0
    ...
    d(n,1)   d(n,2)   ...   0

Jian Pei: CMPT 741/459 Clustering (1)

211

How Good Is Clustering?


Dissimilarity/similarity depends on distance
function
Different applications have different functions

Judgment of clustering quality is typically


highly subjective

Jian Pei: CMPT 741/459 Clustering (1)

212

Types of Data in Clustering

Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types

Jian Pei: CMPT 741/459 Clustering (1)

213

Interval-valued Variables
Continuous measurements of a roughly
linear scale
Weight, height, latitude and longitude
coordinates, temperature, etc.

Effect of measurement units in attributes


Smaller unit larger variable range larger
effect to the result
Standardization + background knowledge

Jian Pei: CMPT 741/459 Clustering (1)

214

Standardization
Calculate the mean and the mean absolute deviation
  m_f = (1/n)(x_1f + x_2f + ... + x_nf)
  s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)
Calculate the standardized measurement (z-score)
  z_if = (x_if - m_f) / s_f
Mean absolute deviation is more robust than the standard deviation
  The effect of outliers is reduced but remains detectable

Jian Pei: CMPT 741/459 Clustering (1)

215

Similarity and Dissimilarity

Distances are normally used as measures
Minkowski distance: a generalization
  d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)   (q > 0)
  If q = 2, d is the Euclidean distance
  If q = 1, d is the Manhattan distance
  If q = ∞, d is the Chebyshev distance
Weighted distance
  d(i, j) = (w_1|x_i1 - x_j1|^q + w_2|x_i2 - x_j2|^q + ... + w_p|x_ip - x_jp|^q)^(1/q)   (q > 0)
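A minimal sketch of the Minkowski distance family above: q = 1 gives the Manhattan distance, q = 2 the Euclidean distance, and the limit q -> ∞ the Chebyshev distance. The example points are made up.

def minkowski(x, y, q=2.0, weights=None):
    w = weights or [1.0] * len(x)
    if q == float("inf"):                      # Chebyshev distance
        return max(wi * abs(a - b) for wi, a, b in zip(w, x, y))
    return sum(wi * abs(a - b) ** q for wi, a, b in zip(w, x, y)) ** (1.0 / q)

i, j = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(i, j, q=1))             # 7.0  (Manhattan)
print(minkowski(i, j, q=2))             # 5.0  (Euclidean)
print(minkowski(i, j, q=float("inf")))  # 4.0  (Chebyshev)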

Jian Pei: CMPT 741/459 Clustering (1)

216

Manhattan and Chebyshev Distance

Chebyshev Distance
Manhattan Distance
When n = 2, chess-distance
Picture from Wekipedia
Jian Pei: CMPT 741/459 Clustering (1)

http://brainking.com/images/rules/chess/02.gif
217

Properties of Minkowski Distance


Nonnegative: d(i,j) 0
The distance of an object to itself is 0
d(i,i) = 0

Symmetric: d(i,j) = d(j,i)


Triangular inequality
d(i,j) d(i,k) + d(k,j)

j
k

Jian Pei: CMPT 741/459 Clustering (1)

218

Binary Variables

A contingency table for binary data:

                       Object j
                       1       0       Sum
  Object i    1        q       r       q+r
              0        s       t       s+t
              Sum      q+s     r+t     p

Symmetric variable: each state carries the same weight
  Invariant similarity: d(i, j) = (r + s) / (q + r + s + t)
Asymmetric variable: the positive value carries more weight
  Noninvariant similarity (Jaccard): d(i, j) = (r + s) / (q + r + s)
Jian Pei: CMPT 741/459 Clustering (1)

219

Nominal Variables

A generalization of the binary variable: it can take more than 2 states, e.g., red, yellow, blue, green
Method 1: simple matching
  d(i, j) = (p - m) / p, where m is the # of matches and p is the total # of variables
Method 2: use a large number of binary variables
  Create a new binary variable for each of the M nominal states
Jian Pei: CMPT 741/459 Clustering (1)

220

Ordinal Variables

An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled variables
  Replace x_if by its rank r_if ∈ {1, ..., M_f}
  Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    z_if = (r_if - 1) / (M_f - 1)
  Compute the dissimilarity using methods for interval-scaled variables
Jian Pei: CMPT 741/459 Clustering (1)

221

Ratio-scaled Variables
Ratio-scaled variable: a positive
measurement on a nonlinear scale
E.g., approximately at exponential scale, such
as AeBt

Treat them like interval-scaled variables?


Not a good choice: the scale can be distorted!

Apply logarithmic transformation, yif = log(xif)


Treat them as continuous ordinal data, treat
their rank as interval-scaled
Jian Pei: CMPT 741/459 Clustering (1)

222

Variables of Mixed Types


A database may contain all the six types of
variables
Symmetric binary, asymmetric binary, nominal,
ordinal, interval and ratio

One may use a weighted formula to combine their effects
  d(i, j) = ( Σ_{f=1..p} δ_ij^(f) d_ij^(f) ) / ( Σ_{f=1..p} δ_ij^(f) )
  where δ_ij^(f) indicates whether variable f contributes to the comparison of i and j
Jian Pei: CMPT 741/459 Clustering (1)

223

Clustering Methods

K-means and partitioning methods


Hierarchical clustering
Density-based clustering
Grid-based clustering
Pattern-based clustering
Other clustering methods

Jian Pei: CMPT 741/459 Clustering (1)

224

Partitioning Algorithms: Ideas


Partition n objects into k clusters
Optimize the chosen partitioning criterion

Global optimal: examine all possible partitions
  There are exponentially many (on the order of k^n) possible partitions, too expensive!

Heuristic methods: k-means and k-medoids


K-means: a cluster is represented by the center
K-medoids or PAM (partition around medoids): each
cluster is represented by one of the objects in the cluster

Jian Pei: CMPT 741/459 Clustering (1)

225

K-means
Arbitrarily choose k objects as the initial
cluster centers
Until no change, do
(Re)assign each object to the cluster to which
the object is the most similar, based on the
mean value of the objects in the cluster
Update the cluster means, i.e., calculate the
mean value of the objects for each cluster
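A minimal sketch of the k-means loop just described (not the course's own code): pick k arbitrary initial centers, assign each object to the most similar center, update the cluster means, and repeat until the assignment no longer changes. The sample points are made up.

import random

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]     # arbitrary initial centers
    assignment = [-1] * len(points)
    while True:
        # (Re)assign each object to the most similar (closest) center.
        new_assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_assignment == assignment:
            return centers, assignment
        assignment = new_assignment
        # Update the cluster means.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]

data = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, labels = kmeans(data, k=2)
print(centers, labels)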

Jian Pei: CMPT 741/459 Clustering (1)

226

K-Means: Example (K = 2)

(Illustration: arbitrarily choose K objects as the initial cluster centers; assign each object to the most
similar center; update the cluster means; reassign the objects to the new centers; repeat updating the means
and reassigning until the assignment no longer changes.)

Jian Pei: CMPT 741/459 Clustering (1)

227

Pros and Cons of K-means


Relatively efficient: O(tkn)
n: # objects, k: # clusters, t: # iterations; k, t <<
n.

Often terminate at a local optimum


Applicable only when mean is defined
What about categorical data?

Need to specify the number of clusters


Unable to handle noisy data and outliers
Unsuitable to discover non-convex clusters
Jian Pei: CMPT 741/459 Clustering (1)

228

Variations of the K-means


Aspects of variations
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means

Handling categorical data: k-modes


Use mode instead of mean
Mode: the most frequent item(s)

A mixture of categorical and numerical data: k-prototype


method

EM (expectation maximization): assign a


probability of an object to a cluster (will be
discussed later)
Jian Pei: CMPT 741/459 Clustering (1)

229

A Problem of K-means

Sensitive to outliers
  Outlier: objects with extremely large values
  May substantially distort the distribution of the data
K-medoids: use the most centrally located object in a cluster as its representative

(Illustration: a single outlier "+" pulls the k-means center away from the bulk of the cluster.)

Jian Pei: CMPT 741/459 Clustering (1)

230

PAM: A K-medoids Method


PAM: Partitioning Around Medoids
  Arbitrarily choose k objects as the initial medoids
  Until no change, do
    (Re)assign each object to the cluster with the nearest medoid
    Randomly select a non-medoid object o'; compute the total cost S of swapping a medoid o with o'
    If S < 0, then swap o with o' to form the new set of k medoids

Jian Pei: CMPT 741/459 Clustering (1)

231

Swapping Cost
Measure whether o' is better than o as a medoid
Use the squared-error criterion
  E = Σ_{i=1..k} Σ_{p ∈ C_i} d(p, o_i)^2
Compute E_{o'} - E_o
  Negative: swapping brings benefit

Jian Pei: CMPT 741/459 Clustering (1)

232

PAM: Example (K = 2)

(Illustration: arbitrarily choose k objects as the initial medoids and assign each remaining object to the
nearest medoid (total cost = 20); randomly select a non-medoid object O_random and compute the total cost of
swapping a medoid O with O_random (total cost = 26); if the quality is improved, perform the swap; repeat the
loop until no change.)

Jian Pei: CMPT 741/459 Clustering (1)

233

Pros and Cons of PAM


PAM is more robust than k-means in the
presence of noise and outliers
Medoids are less influenced by outliers

PAM is efficient for small data sets but does


not scale well for large data sets
O(k(n-k)2) for each iteration

Jian Pei: CMPT 741/459 Clustering (1)

234

Hierarchy
An arrangement or classification of things
according to inclusiveness
A natural way of abstraction, summarization,
compression, and simplification for
understanding
Typical setting: organize a given set of
objects to a hierarchy
No or very little supervision
Some heuristic quality guidances on the quality
of the hierarchy
Jian Pei: CMPT 459/741 Clustering (2)

235

Hierarchical Clustering
Group data objects into a tree of clusters
Top-down versus bottom-up
Step 0 -> Step 4 (agglomerative, AGNES): a and b merge into ab; d and e merge into de; c joins de to form
cde; ab and cde merge into abcde
Step 4 -> Step 0 (divisive, DIANA): the same hierarchy is produced top-down by splitting

Jian Pei: CMPT 459/741 Clustering (2)

236

AGNES (Agglomerative Nesting)


Initially, each object is a cluster
Step-by-step cluster merging, until all objects
form a cluster
Single-link approach
Each cluster is represented by all of the objects
in the cluster
The similarity between two clusters is measured
by the similarity of the closest pair of data points
belonging to different clusters
Jian Pei: CMPT 459/741 Clustering (2)

237

Dendrogram
Show how to merge clusters
hierarchically
Decompose data objects into a multilevel nested partitioning (a tree of
clusters)
A clustering of the data objects: cutting
the dendrogram at the desired level
Each connected component forms a cluster
Jian Pei: CMPT 459/741 Clustering (2)

238

DIANA (Divisive ANAlysis)


Initially, all objects are in one cluster
Step-by-step splitting clusters until each
cluster contains only one object
(Illustration: starting from a single cluster containing all objects, clusters are split step by step until
each object forms its own cluster.)

Jian Pei: CMPT 459/741 Clustering (2)

239

Distance Measures

Minimum distance: d_min(C_i, C_j) = min_{p ∈ C_i, q ∈ C_j} d(p, q)
Maximum distance: d_max(C_i, C_j) = max_{p ∈ C_i, q ∈ C_j} d(p, q)
Mean distance: d_mean(C_i, C_j) = d(m_i, m_j)
Average distance: d_avg(C_i, C_j) = (1 / (n_i n_j)) Σ_{p ∈ C_i} Σ_{q ∈ C_j} d(p, q)

C: a cluster; m: the mean of a cluster; n: the number of objects in a cluster
Jian Pei: CMPT 459/741 Clustering (2)

240

Challenges
Hard to choose merge/split points
Never undo merging/splitting
Merging/splitting decisions are critical

High complexity O(n2)


Integrating hierarchical clustering with other
techniques
BIRCH, CURE, CHAMELEON, ROCK

Jian Pei: CMPT 459/741 Clustering (2)

241

BIRCH
Balanced Iterative Reducing and Clustering
using Hierarchies
CF (Clustering Feature) tree: a hierarchical
data structure summarizing object
information
Clustering objects clustering leaf nodes of the
CF tree

Jian Pei: CMPT 459/741 Clustering (2)

242

Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)
  N: the number of data points
  LS: the linear sum of the N points, Σ_{i=1..N} o_i
  SS: the square sum of the N points, Σ_{i=1..N} o_i^2
Example: CF = (5, (16,30), (54,190))

(Illustration: five 2-D points whose clustering feature is the CF above.)

Jian Pei: CMPT 459/741 Clustering (2)

243

CF-tree in BIRCH
Clustering features
  Summarize the statistics for a cluster
  Many cluster quality measures (e.g., radius, distance) can be derived
  Additivity: CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2)
A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
  A nonleaf node in the tree has descendants or children
  The nonleaf nodes store the sums of the CFs of their children

Jian Pei: CMPT 459/741 Clustering (2)

244

CF Tree

B = 7 (branching factor), L = 6 (maximum number of entries in a leaf)

Root: entries CF_1, CF_2, ..., CF_6, each with a pointer to a child
Non-leaf nodes: entries CF_1, CF_2, CF_3, ..., each with a pointer to a child
Leaf nodes: sequences of CF entries, linked to their neighbors by prev/next pointers

Jian Pei: CMPT 459/741 Clustering (2)

245

Parameters of a CF-tree
Branching factor: the maximum number of
children
Threshold: max diameter of sub-clusters
stored at the leaf nodes

Jian Pei: CMPT 459/741 Clustering (2)

246

BIRCH Clustering
Phase 1: scan DB to build an initial inmemory CF tree (a multi-level compression
of the data that tries to preserve the inherent
clustering structure of the data)
Phase 2: use an arbitrary clustering
algorithm to cluster the leaf nodes of the CFtree

Jian Pei: CMPT 459/741 Clustering (2)

247

Pros & Cons of BIRCH


Linear scalability
Good clustering with a single scan
Quality can be further improved by a few
additional scans

Can handle only numeric data


Sensitive to the order of the data records

Jian Pei: CMPT 459/741 Clustering (2)

248

Distance-based Methods: Drawbacks


Hard to find clusters with irregular shapes
Hard to specify the number of clusters
Heuristic: a cluster must be dense

Jian Pei: CMPT 459/741 Clustering (3)

249

How to Find Irregular Clusters?


Divide the whole space into many small
areas
The density of an area can be estimated
Areas may or may not be exclusive
A dense area is likely in a cluster

Start from a dense area, traverse connected


dense areas and discover clusters in
irregular shape
Jian Pei: CMPT 459/741 Clustering (3)

250

Directly Density Reachable


Parameters (example: MinPts = 3, Eps = 1 cm)
  Eps: the maximum radius of the neighborhood
  MinPts: the minimum number of points in an Eps-neighborhood of a point
N_Eps(p) = {q | dist(p, q) <= Eps}
Core object p: |N_Eps(p)| >= MinPts
  A core object is in a dense area
Point q is directly density-reachable from p iff q ∈ N_Eps(p) and p is a core object
Jian Pei: CMPT 459/741 Clustering (3)

251

Density-Based Clustering
Density-reachable
  If p_1 -> p_2, p_2 -> p_3, ..., p_{n-1} -> p_n are each directly density-reachable steps,
  then p_n is density-reachable from p_1
Density-connected
  If points p and q are both density-reachable from some object o, then p and q are density-connected

Jian Pei: CMPT 459/741 Clustering (3)

252

DBSCAN
A cluster: a maximal set of density-connected points
Discover clusters of arbitrary shape in spatial databases with noise

(Illustration: core, border, and outlier points, with Eps = 1 cm and MinPts = 5.)

Jian Pei: CMPT 459/741 Clustering (3)

253

DBSCAN: the Algorithm


Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t. Eps and MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable from p, and DBSCAN visits the
next point of the database
Continue the process until all of the points have been processed
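A minimal sketch of the DBSCAN procedure above: grow a cluster from each unvisited core point by collecting everything density-reachable from it, and leave points reachable from no core point as noise (label -1). The example points and parameters are made up.

def dbscan(points, eps, min_pts):
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)          # None = unvisited, -1 = noise
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:           # not a core point: mark as noise for now
            labels[i] = -1
            continue
        cluster_id += 1                    # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:            # a former noise point becomes a border point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:   # j is also a core point: expand further
                queue.extend(j_neighbors)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8), (25, 80)]
print(dbscan(pts, eps=1.5, min_pts=2))     # two clusters and one noise point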
Jian Pei: CMPT 459/741 Clustering (3)

254

Challenges for DBSCAN


Different clusters may have very different
densities
Clusters may be in hierarchies

Jian Pei: CMPT 459/741 Clustering (3)

255

Biclustering
Clustering both objects and attributes
simultaneously
Four requirements
Only a small set of objects in a cluster (bicluster)
A bicluster only involves a small number of
attributes
An object may participate in multiple biclusters
or no biclusters
An attribute may be involved in multiple
biclusters, or no biclusters
Jian Pei: CMPT 459/741 Clustering (3)

256

Application Examples
Recommender systems
Objects: users
Attributes: items
Values: user ratings

(Illustration: a gene x sample/condition matrix with expression values w_11 ... w_nm.)

Microarray data
Objects: genes
Attributes: samples
Values: expression levels
Jian Pei: CMPT 459/741 Clustering (3)

257

Biclusters with Constant Values


(Figure 11.5: a gene-condition matrix, a submatrix, and a bi-cluster; e.g., genes b6, b12, b36, b99 all take
the value 60 on conditions a1, a33, a86.)

A bi-cluster with constant values on rows (Figure 11.6):
  10 10 10 10 10
  20 20 20 20 20
  50 50 50 50 50
   0  0  0  0  0

Jian Pei: CMPT 459/741 Clustering (3)

258

Biclusters with Coherent Values


Also known as pattern-based clusters

Jian Pei: CMPT 459/741 Clustering (3)

259

Biclusters with Coherent Evolutions

Only up- or down-regulated changes over rows or columns matter, not the exact values

A bi-cluster with coherent evolutions on rows (Figure 11.8):
  10    50   30    70   20
  20   100   50  1000   30
  50   100   90   120   80
   0    80   20   100   10

Jian Pei: CMPT 459/741 Clustering (3)

260

Differences from Subspace Clustering


Subspace clustering uses global distance/
similarity measure
Pattern-based clustering looks at patterns
A subspace cluster according to a globally
defined similarity measure may not follow
the same pattern

Jian Pei: CMPT 459/741 Clustering (3)

261

Objects Follow the Same Pattern?


(Illustration: two objects (blue and green) plotted on two attributes D1 and D2; the pScore measures how
differently the two objects change from D1 to D2.)

The smaller the pScore, the more consistent the two objects
Jian Pei: CMPT 459/741 Clustering (3)

262

Pattern-based Clusters
pScore: the similarity between two objects r_x, r_y on two attributes a_u, a_v
  pScore( [ r_x.a_u  r_x.a_v ; r_y.a_u  r_y.a_v ] ) = | (r_x.a_u - r_y.a_u) - (r_x.a_v - r_y.a_v) |

δ-pCluster (R, D): for any objects r_x, r_y ∈ R and any attributes a_u, a_v ∈ D,
  pScore( [ r_x.a_u  r_x.a_v ; r_y.a_u  r_y.a_v ] ) <= δ   (δ >= 0)
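A minimal sketch of checking the δ-pCluster condition above on a small object x attribute matrix: every pair of objects and every pair of attributes in (R, D) must have pScore at most δ. The matrix values are made up for illustration.

from itertools import combinations

def p_score(rx, ry, u, v):
    return abs((rx[u] - ry[u]) - (rx[v] - ry[v]))

def is_delta_pcluster(data, objects, attrs, delta):
    return all(
        p_score(data[x], data[y], u, v) <= delta
        for x, y in combinations(objects, 2)
        for u, v in combinations(attrs, 2)
    )

data = {                      # rows roughly follow the same shifting pattern on a1..a3
    "r1": {"a1": 10, "a2": 50, "a3": 30},
    "r2": {"a1": 20, "a2": 60, "a3": 41},
    "r3": {"a1": 15, "a2": 55, "a3": 35},
}
print(is_delta_pcluster(data, ["r1", "r2", "r3"], ["a1", "a2", "a3"], delta=1))   # True
print(is_delta_pcluster(data, ["r1", "r2", "r3"], ["a1", "a2", "a3"], delta=0))   # False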
263

Maximal pCluster
If (R, D) is a δ-pCluster, then every sub-cluster (R', D') with R' ⊆ R and D' ⊆ D is a δ-pCluster
  An anti-monotonic property
  A large pCluster is accompanied by many small pClusters: redundant and inefficient
Idea: mine only the maximal pClusters!
  A δ-pCluster is maximal if there exists no proper super-cluster that is also a δ-pCluster
Jian Pei: CMPT 459/741 Clustering (3)

264

Mining Maximal pClusters


Given
  A cluster threshold δ
  An attribute threshold min_a
  An object threshold min_o
Task: mine the complete set of significant maximal δ-pClusters
  A significant δ-pCluster has at least min_o objects on at least min_a attributes
Jian Pei: CMPT 459/741 Clustering (3)

265

Grid-based Clustering Methods


Ideas
Using multi-resolution grid data structures
Using dense grid cells to form clusters

Several interesting methods


CLIQUE
STING
WaveCluster

Jian Pei: CMPT 459/741 Clustering (4)

266

CLIQUE
Clustering In QUEst
Automatically identify subspaces of a high
dimensional data space
Both density-based and grid-based

Jian Pei: CMPT 459/741 Clustering (4)

267

CLIQUE: the Ideas


Partition each dimension into the same
number of equal length intervals
Partition an m-dimensional data space into nonoverlapping rectangular units

A unit is dense if the number of data points


in the unit exceeds a threshold
A cluster is a maximal set of connected
dense units within a subspace
Jian Pei: CMPT 459/741 Clustering (4)

268

CLIQUE: the Method


Partition the data space and find the number of
points in each cell of the partition
Apriori: a k-d cell cannot be dense if one of its (k-1)-d
projection is not dense

Identify clusters:
Determine dense units in all subspaces of interests and
connected dense units in all subspaces of interests

Generate minimal description for the clusters


Determine the minimal cover for each cluster

Jian Pei: CMPT 459/741 Clustering (4)

269

CLIQUE: An Example

(Illustration: the data is partitioned into grid units on the (age, salary) and (age, vacation) subspaces;
age ranges over 20-60, salary is in units of $10,000, and vacation is in weeks, each on a 0-7 grid scale.
Dense units found in each subspace are combined to suggest a cluster in the (age, salary, vacation) space.)

Jian Pei: CMPT 459/741 Clustering (4)

270

CLIQUE: Pros and Cons


Automatically find subspaces of the highest
dimensionality with high density clusters
Insensitive to the order of input
Not presume any canonical data distribution

Scale linearly with the size of input


Scale well with the number of dimensions
The clustering result may be degraded at the
expense of simplicity of the method
Jian Pei: CMPT 459/741 Clustering (4)

271

Bad Cases for CLIQUE


Parts of a cluster may be missed

A cluster from CLIQUE may


contain noise

Jian Pei: CMPT 459/741 Clustering (4)

272

Fuzzy Clustering
Each point x_i has a weight (probability) w_ij of belonging to cluster C_j
Requirements
  For each point x_i: Σ_{j=1..k} w_ij = 1
  For each cluster C_j: 0 < Σ_{i=1..m} w_ij < m
Jian Pei: CMPT 459/741 Clustering (4)

273

Fuzzy C-Means (FCM)


Select an initial fuzzy pseudo-partition, i.e., assign
values to all the wij
Repeat
Compute the centroid of each cluster using the fuzzy
pseudo-partition
Recompute the fuzzy pseudo-partition, i.e., the wij

Until the centroids do not change (or the change is


below some threshold)

Jian Pei: CMPT 459/741 Clustering (4)

274

Critical Details
Optimization on the sum of the squared error (SSE):
  SSE(C_1, ..., C_k) = Σ_{j=1..k} Σ_{i=1..m} w_ij^p dist(x_i, c_j)^2
Computing centroids:
  c_j = ( Σ_{i=1..m} w_ij^p x_i ) / ( Σ_{i=1..m} w_ij^p )
Updating the fuzzy pseudo-partition:
  w_ij = (1 / dist(x_i, c_j)^2)^(1/(p-1)) / Σ_{q=1..k} (1 / dist(x_i, c_q)^2)^(1/(p-1))
When p = 2:
  w_ij = (1 / dist(x_i, c_j)^2) / Σ_{q=1..k} (1 / dist(x_i, c_q)^2)
Jian Pei: CMPT 459/741 Clustering (4)

275

Choice of P
When p -> 1, FCM behaves like traditional k-means
When p is larger, the cluster centroids approach the global centroid of all data points

Jian Pei: CMPT 459/741 Clustering (4)

276

Effectiveness

Jian Pei: CMPT 459/741 Clustering (4)

277

Is a Clustering Good?
Feasibility
Applying any clustering methods on a uniformly
distributed data set is meaningless

Quality
Are the clustering results meeting users interest?
Clustering patients into clusters corresponding
various disease or sub-phenotypes is meaningful
Clustering patients into clusters corresponding to
male or female is not meaningful
Jian Pei: CMPT 459/741 Clustering (4)

278

Major Tasks
Assessing clustering tendency
Are there non-random structures in the data?

Determining the number of clusters or other


critical parameters
Measuring clustering quality

Jian Pei: CMPT 459/741 Clustering (4)

279

Uniformly Distributed Data

Clustering uniformly distributed data is meaningless
A uniformly distributed data set is generated by a uniform data distribution
(Figure 10.21: a data set uniformly distributed in the data space.)

Jian Pei: CMPT 459/741 Clustering (4)

280

Hopkins Statistic
Hypothesis: the data set D is generated by a uniform distribution in a space
Sample n points, p_1, ..., p_n, uniformly from the space containing D
For each point p_i, find its nearest neighbor in D; let x_i be the distance between p_i and
that nearest neighbor:
  x_i = min_{v ∈ D} dist(p_i, v)

Jian Pei: CMPT 459/741 Clustering (4)

281

Hopkins Statistic
Sample n points, q_1, ..., q_n, uniformly from D
For each q_i, find its nearest neighbor in D - {q_i}; let y_i be the distance between q_i and
that nearest neighbor:
  y_i = min_{v ∈ D, v ≠ q_i} dist(q_i, v)
Calculate the Hopkins statistic H:
  H = ( Σ_{i=1..n} y_i ) / ( Σ_{i=1..n} x_i + Σ_{i=1..n} y_i )

Jian Pei: CMPT 459/741 Clustering (4)

282

Explanation
If D is uniformly distributed, then Σ x_i and Σ y_i would be close to each other, and thus H
would be around 0.5
If D is skewed (clustered), then Σ y_i would be substantially smaller, and thus H would be
close to 0
If H > 0.5, then it is unlikely that D has statistically significant clusters
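A minimal sketch of the Hopkins statistic above for 2-D data in the unit square: x_i uses probes sampled uniformly from the space, y_i uses probes sampled from D itself, and H = Σy / (Σx + Σy). The two synthetic data sets are made up to show the contrast between a uniform and a clustered distribution.

import math
import random

def hopkins(D, n=50, seed=0):
    rng = random.Random(seed)
    def nn_dist(p, exclude=None):
        return min(math.dist(p, v) for v in D if v is not exclude)

    x = [nn_dist((rng.random(), rng.random())) for _ in range(n)]   # uniform probes
    sample = rng.sample(D, n)
    y = [nn_dist(q, exclude=q) for q in sample]                     # probes drawn from D
    return sum(y) / (sum(x) + sum(y))

random.seed(1)
uniform_data = [(random.random(), random.random()) for _ in range(500)]
clustered_data = [(random.gauss(0.3, 0.02), random.gauss(0.3, 0.02)) for _ in range(250)] + \
                 [(random.gauss(0.8, 0.02), random.gauss(0.7, 0.02)) for _ in range(250)]
print(hopkins(uniform_data))    # close to 0.5
print(hopkins(clustered_data))  # close to 0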
Jian Pei: CMPT 459/741 Clustering (4)

283

Finding the Number of Clusters


Depending on many factors
The shape and scale of the distribution in the
data set
The clustering resolution required by the user

Many methods exist
  Set k = sqrt(n/2); then each cluster has about sqrt(2n) points on average
  Plot the sum of within-cluster variances with respect to k, and find the first (or the most
  significant) turning point

Jian Pei: CMPT 459/741 Clustering (4)

284

A Cross-Validation Method
Divide the data set D into m parts
Use m 1 parts to find a clustering
Use the remaining part as the test set to test
the quality of the clustering
For each point in the test set, find the closest
centroid or cluster center
Use the squared distances between all points in the
test set and the corresponding centroids to measure
how well the clustering model fits the test set

Repeat m times for each value of k, use the


average as the quality measure
Jian Pei: CMPT 459/741 Clustering (4)

285

Measuring Clustering Quality


Ground truth: the ideal clustering determined
by human experts
Two situations
There is a known ground truth the extrinsic
(supervised) methods, comparing the clustering
against the ground truth
The ground truth is unavailable the intrinsic
(unsupervised) methods, measuring how well
the clusters are separated
Jian Pei: CMPT 459/741 Clustering (4)

286

Quality in Extrinsic Methods


Cluster homogeneity: the more pure the
clusters in a clustering, the better the clustering
Cluster completeness: objects in the same
cluster in the ground truth should be clustered
together
Rag bag: putting a heterogeneous object into a
pure cluster is worse than putting it into a rag
bag
Small cluster preservation: splitting a small
cluster in the ground truth into pieces is worse
than splitting a bigger one
Jian Pei: CMPT 459/741 Clustering (4)

287

Bcubed Precision and Recall


D = {o_1, ..., o_n}
  L(o_i) is the cluster of o_i given by the ground truth
C is a clustering on D
  C(o_i) is the cluster-id of o_i in C
For two objects o_i and o_j, the correctness is 1 if L(o_i) = L(o_j) <=> C(o_i) = C(o_j), and 0 otherwise
Jian Pei: CMPT 459/741 Clustering (4)

288

Bcubed Precision and Recall

Correctness(o_i, o_j) = 1 if L(o_i) = L(o_j) <=> C(o_i) = C(o_j), and 0 otherwise

BCubed precision:
  Precision BCubed = (1/n) Σ_{i=1..n} [ ( Σ_{o_j: i≠j, C(o_i)=C(o_j)} Correctness(o_i, o_j) ) / |{o_j | i≠j, C(o_i)=C(o_j)}| ]

BCubed recall:
  Recall BCubed = (1/n) Σ_{i=1..n} [ ( Σ_{o_j: i≠j, L(o_i)=L(o_j)} Correctness(o_i, o_j) ) / |{o_j | i≠j, L(o_i)=L(o_j)}| ]

Intrinsic Methods
When the ground truth of a data set is not available, we have to use intrinsic methods to assess the clustering quality

Jian Pei: CMPT 459/741 Clustering (4)

289

Silhouette Coefficient
No ground truth is assumed
Suppose a data set D of n objects is partitioned
into k clusters, C1, , Ck
For each object o,
Calculate a(o), the average distance between o and
every other object in the same cluster
compactness of a cluster, the smaller, the better
Calculate b(o), the minimum average distance from
o to every objects in a cluster that o does not belong
to degree of separation from other clusters, the
larger, the better
Jian Pei: CMPT 459/741 Clustering (4)

290

Silhouette Coefficient

  a(o) = ( Σ_{o' ∈ C_i, o' ≠ o} dist(o, o') ) / ( |C_i| - 1 )

  b(o) = min_{C_j: o ∉ C_j} { ( Σ_{o' ∈ C_j} dist(o, o') ) / |C_j| }

Then
  s(o) = ( b(o) - a(o) ) / max{ a(o), b(o) }

Use the average silhouette coefficient of all


objects as the overall measure
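A minimal sketch of the silhouette coefficient above: a(o) is the average distance to the other members of o's own cluster, b(o) is the smallest average distance to any other cluster, and s(o) = (b - a) / max(a, b); the overall measure averages s(o) over all objects. The points and labels are made up.

import math

def silhouette(points, labels):
    clusters = {c: [p for p, l in zip(points, labels) if l == c] for c in set(labels)}
    scores = []
    for p, c in zip(points, labels):
        own = [q for q in clusters[c] if q is not p]
        if not own:                                   # singleton cluster: skip
            continue
        a = sum(math.dist(p, q) for q in own) / len(own)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for cid, other in clusters.items() if cid != c)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
print(silhouette(pts, [0, 0, 0, 1, 1, 1]))   # close to 1: well-separated clusters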
Jian Pei: CMPT 459/741 Clustering (4)

291

Classification

Classification and Prediction


Classification: predict categorical class
labels
Build a model for a set of classes/concepts
Classify loan applications (approve/decline)

Prediction: model continuous-valued


functions
Predict the economic growth in 2015

Jian Pei: CMPT 741/459 Classification (1)

293

Classification: A 2-step Process


Model construction: describe a set of
predetermined classes
Training dataset: tuples for model construction
Each tuple/sample belongs to a predefined class

Classification rules, decision trees, or math formulae

Model application: classify unseen objects


Estimate accuracy of the model using an independent
test set
Acceptable accuracy apply the model to classify
tuples with unknown class labels
Jian Pei: CMPT 741/459 Classification (1)

294

Model Construction

Training Data:

Name  Rank        Years  Tenured
Mike  Ass. Prof   3      No
Mary  Ass. Prof   7      Yes
Bill  Prof        2      Yes
Jim   Asso. Prof  7      Yes
Dave  Ass. Prof   6      No
Anne  Asso. Prof  3      No

A classification algorithm produces the classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Jian Pei: CMPT 741/459 Classification (1)

295

Model Application

Testing Data:

Name     Rank        Years  Tenured
Tom      Ass. Prof   2      No
Merlisa  Asso. Prof  7      No
George   Prof        5      Yes
Joseph   Ass. Prof   7      Yes

Unseen data: (Jeff, Professor, 4) -> Tenured?

Jian Pei: CMPT 741/459 Classification (1)

296

Supervised/Unsupervised Learning
Supervised learning (classification)
Supervision: objects in the training data set have
labels
New data is classified based on the training set

Unsupervised learning (clustering)


The class labels of training data are unknown
Given a set of measurements, observations, etc.
with the aim of establishing the existence of
classes or clusters in the data
Jian Pei: CMPT 741/459 Classification (1)

297

Data Preparation
Data cleaning
Preprocess data in order to reduce noise and
handle missing values

Relevance analysis (feature selection)


Remove the irrelevant or redundant attributes

Data transformation
Generalize and/or normalize data

Jian Pei: CMPT 741/459 Classification (1)

298

Measurements of Quality
Prediction accuracy
Speed and scalability
Construction speed and application speed

Robustness: handle noise and missing


values
Scalability: build model for large training data
sets
Interpretability: understandability of models
Jian Pei: CMPT 741/459 Classification (1)

299

Decision Tree Induction

Decision tree representation


Construction of a decision tree
Inductive bias and overfitting
Scalable enhancements for large databases

Jian Pei: CMPT 741/459 Classification (1)

300

Decision Tree

A node in the tree: a test on some attribute
A branch: a possible value of the attribute
Classification
  Start at the root
  Test the attribute
  Move down the tree branch

Example tree for PlayTennis:
  Outlook = Sunny    -> test Humidity: High -> No, Normal -> Yes
  Outlook = Overcast -> Yes
  Outlook = Rain     -> test Wind: Strong -> No, Weak -> Yes

Jian Pei: CMPT 741/459 Classification (1)

301

Training Dataset

Outlook   Temp  Humid   Wind    PlayTennis
Sunny     Hot   High    Weak    No
Sunny     Hot   High    Strong  No
Overcast  Hot   High    Weak    Yes
Rain      Mild  High    Weak    Yes
Rain      Cool  Normal  Weak    Yes
Rain      Cool  Normal  Strong  No
Overcast  Cool  Normal  Strong  Yes
Sunny     Mild  High    Weak    No
Sunny     Cool  Normal  Weak    Yes
Rain      Mild  Normal  Weak    Yes
Sunny     Mild  Normal  Strong  Yes
Overcast  Mild  High    Strong  Yes
Overcast  Hot   Normal  Weak    Yes
Rain      Mild  High    Strong  No

Jian Pei: CMPT 741/459 Classification (1)

302

Appropriate Problems
Instances are represented by attribute-value
pairs
Extensions of decision trees can handle realvalued attributes

Disjunctive descriptions may be required


The training data may contain errors or
missing values

Jian Pei: CMPT 741/459 Classification (1)

303

Basic Algorithm ID3


Construct a tree in a top-down recursive divideand-conquer manner
Which attribute is the best at the current node?
Create a node for each possible attribute value
Partition training data into descendant nodes

Conditions for stopping recursion


All samples at a given node belong to the same class
No attribute remained for further partitioning
Majority voting is employed for classifying the leaf

There is no sample at the node


Jian Pei: CMPT 741/459 Classification (1)

304

Which Attribute Is the Best?


The attribute most useful for classifying
examples
Information gain and gini index
Statistical properties
Measure how well an attribute separates the
training examples

Jian Pei: CMPT 741/459 Classification (1)

305

Entropy
Measure the homogeneity of examples
  Entropy(S) = - Σ_{i=1..c} p_i log2(p_i)
S is the training data set, and p_i is the proportion of S belonging to class i

The smaller the entropy, the purer the data


set

Jian Pei: CMPT 741/459 Classification (1)

306

Information Gain
The expected reduction in entropy caused
by partitioning the examples according to an
attribute
  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

Value(A) is the set of all possible values for


attribute A, and Sv is the subset of S for
which attribute A has value v
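A minimal sketch of entropy and information gain as defined above, applied to the Wind attribute and the PlayTennis labels from the training data. The helper names are assumptions for illustration.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    total = entropy(labels)
    n = len(rows)
    for value in set(r[attr_index] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr_index] == value]
        total -= (len(subset) / n) * entropy(subset)
    return total

# Wind column and PlayTennis labels from the training dataset above.
wind = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
        "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
rows = [(w,) for w in wind]
print(round(entropy(play), 2))              # 0.94
print(round(info_gain(rows, play, 0), 3))   # about 0.048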
Jian Pei: CMPT 741/459 Classification (1)

307

Example

Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94

(Training data: the 14-example PlayTennis table shown earlier, with 9 Yes and 5 No examples.)

Gain(S, Wind) = Entropy(S) - Σ_{v ∈ {Weak, Strong}} (|S_v| / |S|) Entropy(S_v)
              = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
              = 0.94 - (8/14) * 0.811 - (6/14) * 1.00 = 0.048

Jian Pei: CMPT 741/459 Classification (1)

308

Hypothesis Space Search in


Decision Tree Building
Hypothesis space: the set of possible
decision trees
ID3: simple-to-complex, hill-climbing search
Evaluation function: information gain

Jian Pei: CMPT 741/459 Classification (1)

309

Capabilities and Limitations


The hypothesis space is complete
Maintains only a single current hypothesis
No backtracking
May converge to a locally optimal solution

Use all training examples at each step


Make statistics-based decisions
Not sensitive to errors in individual example

Jian Pei: CMPT 741/459 Classification (1)

310

Natural Bias
The information gain measure favors
attributes with many values
An extreme example
Attribute date may have the highest
information gain
A very broad decision tree of depth one
Inapplicable to any future data

Jian Pei: CMPT 741/459 Classification (1)

311

Alternative Measures
Gain ratio: penalize attributes like date by incorporating split information
  SplitInformation(S, A) = - Σ_{i=1..c} (|S_i| / |S|) log2(|S_i| / |S|)
  Split information is sensitive to how broadly and uniformly the attribute splits the data
  GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
Gain ratio can be undefined or very large
  Heuristic: only test attributes with above-average gain

Jian Pei: CMPT 741/459 Classification (1)

312

Measuring Inequality
Lorenz Curve
X-axis: quintiles
Y-axis: accumulative share of
income earned by the plotted
quintile
Gap between the actual lines
and the mythical line: the degree
of inequality

Gini
index

Jian Pei: CMPT 741/459 Classification (1)

Gini = 0, even distribution


Gini = 1, perfectly unequal
The greater the distance,
the more unequal the
distribution
313

Gini Index (Adjusted)


A data set S contains examples from n classes
  gini(S) = 1 - Σ_{j=1..n} p_j^2
  p_j is the relative frequency of class j in S
A data set S is split into two subsets S_1 and S_2 with sizes N_1 and N_2 respectively
  gini_split(S) = (N_1 / N) gini(S_1) + (N_2 / N) gini(S_2)
The attribute providing the smallest gini_split(S) is chosen to split the node
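A minimal sketch of the (adjusted) Gini index above: gini(S) = 1 - Σ_j p_j^2, and a binary split is scored by the size-weighted Gini of the two subsets. The label counts below reuse the 9-Yes/5-No distribution from the PlayTennis example.

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

labels = ["Yes"] * 9 + ["No"] * 5
print(round(gini(labels), 3))                                                     # 0.459
print(round(gini_split(["Yes"] * 6 + ["No"] * 2, ["Yes"] * 3 + ["No"] * 3), 3))   # 0.429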
Jian Pei: CMPT 741/459 Classification (1)

314

Extracting Classification Rules


Classification rules can be extracted from a
decision tree
Each path from the root to a leaf an IFTHEN rule
All attribute-value pair along a path form a
conjunctive condition
The leaf node holds the class prediction
IF age = <=30 AND student = no THEN
buys_computer = no

Rules are easy to understand


Jian Pei: CMPT 741/459 Classification (1)

315

Inductive Bias
The set of assumptions that, together with
the training data, deductively justifies the
classification to future instances
Preferences of the classifier construction

Shorter trees are preferred over longer trees


Trees that place high information gain
attributes close to the root are preferred

Jian Pei: CMPT 741/459 Classification (1)

316

Why Prefer Short Trees?


Occam s razor: prefer the simplest
hypothesis that fits the data
Fewer short trees than long trees
A short tree is less likely to be a statistical
coincidence
One should not increase, beyond what is necessary, the
number of entities required to explain anything Also
known as the principle of parsimony
Jian Pei: CMPT 741/459 Classification (1)

317

Overfitting
A decision tree T may overfit the training data
  if there exists an alternative tree T' such that T has a higher accuracy than T' over the
  training examples, but T' has a higher accuracy than T over the entire distribution of data
Why overfitting?
  Noisy data
  Bias in the training data

(Illustration: T' is more accurate over all data; T is more accurate over the training data.)
318

The Evaluation Issues


The accuracy of a classifier can be
evaluated using a test data set
The test set is a part of the available labeled
data set

But how can we evaluate the accuracy of a


classification method?
A classification method can generate many
classifiers

What if the available labeled data set is too


small?
Jian Pei: CMPT 741/459 Classification (2)

319

Holdout Method
Partition the available labeled data set into
two disjoint subsets: the training set and the
test set
50-50
2/3 for training and 1/3 for testing

Build a classifier using the training set


Evaluate the accuracy using the test set

Jian Pei: CMPT 741/459 Classification (2)

320

Limitations of Holdout Method


Fewer labeled examples for training
The classifier highly depends on the
composition of the training and test sets
The smaller the training set, the larger the
variance

If the test set is too small, the evaluation is


not reliable
The training and test sets are not
independent
Jian Pei: CMPT 741/459 Classification (2)

321

Cross-Validation
Each record is used the same number of times for
training and exactly once for testing
K-fold cross-validation
Partition the data into k equal-sized subsets
In each round, use one subset as the test set, and use
the rest subsets together as the training set
Repeat k times
The total error is the sum of the errors in k rounds

Leave-one-out: k = n
Utilize as much data as possible for training
Computationally expensive
Jian Pei: CMPT 741/459 Classification (2)

322

Accuracy Can Be Misleading


Consider a data set of 99% of the negative
class and 1% of the positive class
A classifier predicts everything negative has
an accuracy of 99%, though it does not work
for the positive class at all!
Imbalance class distribution is popular in
many applications
Medical applications, fraud detection,
Jian Pei: CMPT 741/459 Classification (2)

323

Performance Evaluation Matrix


Confusion matrix (contingency table, error matrix): used for
imbalance class distribution

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes    a (TP)      b (FN)
CLASS      Class=No     c (FP)      d (TN)

Accuracy = (a + d) / (a + b + c + d)
         = (TP + TN) / (TP + TN + FP + FN)
Jian Pei: CMPT 741/459 Classification (2)

324

Performance Evaluation Matrix


                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes    a (TP)      b (FN)
CLASS      Class=No     c (FP)      d (TN)

True positive rate (TPR, sensitivity) = TP / (TP + FN)
True negative rate (TNR, specificity) = TN / (TN + FP)
False positive rate (FPR) = FP / (TN + FP)
False negative rate (FNR) = FN / (TP + FN)
Jian Pei: CMPT 741/459 Classification (2)

325

Recall and Precision


Target class is more important than the other
classes
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes    a (TP)      b (FN)
CLASS      Class=No     c (FP)      d (TN)

Precision p = TP / (TP + FP)


Recall r = TP / (TP + FN)
Jian Pei: CMPT 741/459 Classification (2)

326

Fallout
Type I errors (false positives): a negative
object is classified as positive
Fallout: the type I error rate, FP / (FP + TN)
Type II errors (false negatives): a positive
object is classified as negative
Captured by recall

Jian Pei: CMPT 741/459 Classification (2)

327

F Measure
How can we summarize precision and recall into
one metric?
Using the harmonic mean between the two

F-measure (F) = 2rp / (r + p) = 2TP / (2TP + FP + FN)

F_β measure:
F_β = (β² + 1) r p / (r + β² p)
    = (β² + 1) TP / ((β² + 1) TP + β² FN + FP)

β = 0: F is the precision
β → ∞: F is the recall
0 < β < ∞: F is a tradeoff between the precision and the
recall
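
A small sketch computing accuracy, precision, recall, and the F_β measure from confusion-matrix counts; the TP/FN/FP/TN numbers below are made up for illustration.

# Sketch: metrics from confusion-matrix counts.
TP, FN, FP, TN = 40, 10, 20, 930

precision = TP / (TP + FP)
recall    = TP / (TP + FN)

def f_beta(p, r, beta=1.0):
    """F_beta = (beta^2 + 1) * p * r / (r + beta^2 * p)."""
    return (beta**2 + 1) * p * r / (r + beta**2 * p)

accuracy = (TP + TN) / (TP + TN + FP + FN)
print("accuracy=%.3f precision=%.3f recall=%.3f" % (accuracy, precision, recall))
print("F1=%.3f  F2=%.3f" % (f_beta(precision, recall), f_beta(precision, recall, beta=2)))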
Jian Pei: CMPT 741/459 Classification (2)

328

Weighted Accuracy
A more general metric
Weighted Accuracy = (w1*a + w4*d) / (w1*a + w2*b + w3*c + w4*d)

Measure      w1       w2    w3   w4
Recall       1        1     0    0
Precision    1        0     1    0
F_β          β² + 1   β²    1    0
Accuracy     1        1     1    1

Jian Pei: CMPT 741/459 Classification (2)

329

ROC Curve
Receiver Operating Characteristic (ROC)
Example: a 1-dimensional data set containing 2
classes; any point located at x > t is
classified as positive

Jian Pei: CMPT 741/459 Classification (2)

330

ROC Curve
(TPR, FPR):
(0,0): declare everything
to be negative class
(1,1): declare everything
to be positive class
(1,0): ideal
Diagonal line:
Random guessing
Below diagonal line:
prediction is opposite of
the true class
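
A sketch that traces ROC points by sweeping a threshold over classifier scores; the scores and labels below are made-up toy values.

# Sketch: ROC points (TPR, FPR) for decreasing score thresholds.
scores = [0.95, 0.9, 0.8, 0.7, 0.65, 0.6, 0.5, 0.4, 0.3, 0.1]
labels = [1,    1,   0,   1,   1,    0,   0,   1,   0,   0  ]   # 1 = positive class

P = sum(labels)
N = len(labels) - P

points = []
for t in sorted(set(scores), reverse=True):
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    points.append((tp / P, fp / N))          # (TPR, FPR) at threshold t

print([(round(tpr, 2), round(fpr, 2)) for tpr, fpr in points])
# (0, 0) corresponds to declaring everything negative,
# (1, 1) to declaring everything positive, and (1, 0) is the ideal point.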
Jian Pei: CMPT 741/459 Classification (2)

Figure from [Tan, Steinbach, Kumar]


331

Comparing Two Classifiers

Figure from [Tan, Steinbach, Kumar]


Jian Pei: CMPT 741/459 Classification (2)

332

Cost-Sensitive Learning
In some applications, misclassifying some
classes may be disastrous
Tumor detection, fraud detection

Using a cost matrix


                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes    -1          100
CLASS      Class=No

Jian Pei: CMPT 741/459 Classification (2)

333

Sampling for Imbalanced Classes


Consider a data set containing 100 positive
examples and 1,000 negative examples
Undersampling: use a random sample of 100
negative examples and all positive examples
Some useful negative examples may be lost
Run undersampling multiple times, use the ensemble of
multiple base classifiers
Focused undersampling: remove negative samples that
are not useful for classification, e.g., those far away from
the decision boundary
Jian Pei: CMPT 741/459 Classification (2)

334

Oversampling
Replicate the positive examples until the
training set has an equal number of positive
and negative examples
For noisy data, may cause overfitting
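
A sketch of random undersampling and oversampling on a made-up imbalanced data set of 100 positive and 1,000 negative examples.

import random

# Sketch: rebalancing a training set by sampling.
random.seed(2)
pos = [("pos", random.random()) for _ in range(100)]
neg = [("neg", random.random()) for _ in range(1000)]

# Undersampling: keep all positives, draw a random sample of negatives
# of the same size (some potentially useful negatives are discarded).
under = pos + random.sample(neg, len(pos))

# Oversampling: keep all negatives, replicate positives (sampling with
# replacement) until the classes are balanced; may overfit noisy data.
over = neg + [random.choice(pos) for _ in range(len(neg))]

print(len(under), len(over))   # 200 and 2000 training examples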

Jian Pei: CMPT 741/459 Classification (2)

335

Errors in Classification
Bias: the difference between the real class
boundary and the decision boundary of a
classification model
Variance: variability in the training data set
Intrinsic noise in the target class: the target
class can be non-deterministic - instances
with the same attribute values can have
different class labels
Jian Pei: CMPT 741/459 Classification (3)

336

One or More?
What if a medical doctor is not sure about a case?
Joint diagnosis: use a group of doctors with
different expertise
Wisdom of the crowd is often more accurate
All eager learning methods make predictions using a
single classifier induced from the training data
A single classifier may have low confidence in some
cases
Ensemble methods: construct a set of base
classifiers and take a vote on their predictions in
classification
Jian Pei: CMPT 741/459 Classification (3)

337

Ensemble Classifiers
(Figure: the original training data D is used to create multiple data sets
D1, D2, ..., Dt (Step 1: Create Multiple Data Sets); a base classifier Ci is
built on each Di (Step 2: Build Multiple Classifiers); the base classifiers
are combined into C*, where C*(x) = Vote(C1(x), ..., Ck(x)) (Step 3:
Combine Classifiers).)

Figure from [Tan, Steinbach, Kumar]


Jian Pei: CMPT 741/459 Classification (3)

338

Why May Ensemble Method Work?


Suppose there are two classes and each
base classifier has an error rate of 35%
What if we use 25 base classifiers?
If all base classifiers are identical, the ensemble
error rate is still 35%
If base classifiers are independent, the
ensemble makes a wrong prediction only if more
than half of the base classifiers are wrong
Ensemble error rate = Σ_{i=13}^{25} C(25, i) × 0.35^i × 0.65^(25-i) ≈ 0.06
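
The same calculation can be checked numerically; the sketch below sums the binomial terms for 25 independent base classifiers, each with error rate 0.35.

from math import comb

# Sketch: ensemble error when the ensemble errs only if 13 or more of the
# 25 independent base classifiers are wrong.
eps, n = 0.35, 25
ensemble_error = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
                     for i in range(13, n + 1))
print(round(ensemble_error, 3))   # about 0.06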

Jian Pei: CMPT 741/459 Classification (3)

339

Ensemble Error Rate

Figure from [Tan, Steinbach, Kumar]


Jian Pei: CMPT 741/459 Classification (3)

340

Ensemble Classifiers When?


The base classifiers should be independent
of each other
Each base classifier should do better than a
classifier that performs random guessing

Jian Pei: CMPT 741/459 Classification (3)

341

How to Construct Ensemble?


Manipulating the training set: derive multiple
training sets and build a base classifier on each
Manipulating the input features: use only a subset
of features in a base classifier
Manipulating the class labels: if there are many
classes, in a classifier, randomly divide the classes
into two subsets A and B; for a test case, if a base
classifier predicts its class as A, all classes in A
receive a vote
Manipulating the learning algorithm, e.g., using
different network configurations in an ANN
Jian Pei: CMPT 741/459 Classification (3)

342

Bootstrap
Given an original training set T, derive a
training set T' by repeatedly uniformly
sampling with replacement
If T has n tuples, each tuple has a probability
p = 1 - (1 - 1/n)^n of being selected in T'
When n → ∞, p → 1 - 1/e ≈ 0.632
Use the tuples not in T' as the test set

Jian Pei: CMPT 741/459 Classification (3)

343

Bootstrap
Use a bootstrap sample as the training set,
use the tuples not in the training set as the
test set
.632 bootstrap: compute the overall
accuracy by combining the accuracies of
each bootstrap sample with the accuracy
computed from a classifier using the whole
data set as the training set
acc_.632bootstrap = (1/k) Σ_{i=1}^{k} (0.632 × acc_i + 0.368 × acc_all)
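
A sketch of the .632 bootstrap estimate on made-up data, with a majority-class learner standing in for a real classifier.

import random

# Sketch: bootstrap sampling and the .632 bootstrap accuracy estimate.
random.seed(3)
data = [(random.random(), random.choice(["yes", "no", "yes"])) for _ in range(200)]

def fit(train):
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def accuracy(model, examples):
    return sum(1 for _, y in examples if y == model) / len(examples)

acc_all = accuracy(fit(data), data)      # classifier trained on the whole data set

k, total = 10, 0.0
for _ in range(k):
    # Sample n tuples uniformly with replacement; about 63.2% of the
    # distinct tuples end up in the bootstrap sample.
    sample = [random.choice(data) for _ in range(len(data))]
    out_of_sample = [t for t in data if t not in sample]   # used as the test set
    acc_i = accuracy(fit(sample), out_of_sample)
    total += 0.632 * acc_i + 0.368 * acc_all

print(".632 bootstrap accuracy:", round(total / k, 3))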

Jian Pei: CMPT 741/459 Classification (3)

344

Bagging
Run bootstrap k times to obtain k base classifiers
A test instance is assigned to the class that
receives the highest number of votes
Strength: reduces the variance of base classifiers -
good for unstable base classifiers
Unstable classifiers: sensitive to minor perturbations in
the training set, e.g., decision trees, associative
classifiers, and ANNs
For stable classifiers (e.g., linear discriminant
analysis and kNN classifiers), bagging may even
degrade the performance since the training sets
are smaller
Less overfitting on noisy data
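
A sketch of bagging with 1-D decision stumps on made-up data; each stump is trained on a bootstrap sample and the ensemble takes a majority vote.

import random

# Sketch: bagging decision stumps on 1-D data.
random.seed(4)
data = [(x, 1 if x > 0.6 else 0) for x in [random.random() for _ in range(300)]]

def fit_stump(train):
    """Pick the threshold (from a small grid) with the fewest training errors."""
    best = None
    for thr in [i / 20 for i in range(1, 20)]:
        err = sum(1 for x, y in train if (1 if x > thr else 0) != y)
        if best is None or err < best[1]:
            best = (thr, err)
    return best[0]

def bagging(train, k=11):
    stumps = []
    for _ in range(k):
        boot = [random.choice(train) for _ in range(len(train))]  # bootstrap sample
        stumps.append(fit_stump(boot))
    return stumps

def vote(stumps, x):
    votes = [1 if x > thr else 0 for thr in stumps]
    return 1 if sum(votes) > len(votes) / 2 else 0

stumps = bagging(data)
errors = sum(1 for x, y in data if vote(stumps, x) != y)
print("bagged training error rate:", errors / len(data))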
Jian Pei: CMPT 741/459 Classification (3)

345

Boosting
Assign a weight to each training example
Initially, each example is assigned a weight 1/n

Weights can be used in one of the following ways


Weights as a sampling distribution to draw a set of
bootstrap samples from the original training set
Weights used by a base classifier to learn a model
biased towards heavier examples

Adaptively change the weight at the end of each


boosting round
The weight of an example correctly classified decreases
The weight of an example incorrectly classified
increases

Each round generates a base classifier


Jian Pei: CMPT 741/459 Classification (3)

346

Critical Design Choices in Boosting


How are the weights of the training examples
updated at the end of each boosting
round?
How are the predictions made by base
classifiers combined?

Jian Pei: CMPT 741/459 Classification (3)

347

AdaBoost
Each base classifier carries an importance
score related to its error rate
Error rate: ε_i = (1/N) Σ_{j=1}^{N} w_j I(C_i(x_j) ≠ y_j)
w_j: weight of example j, I(p) = 1 if p is true (and 0 otherwise)
Importance score: α_i = (1/2) ln((1 - ε_i) / ε_i)

Jian Pei: CMPT 741/459 Classification (3)

348

How Does Importance Score Work?

Jian Pei: CMPT 741/459 Classification (3)

349

Weight Adjustment in AdaBoost


w_i^(j+1) = (w_i^(j) / Z_j) × exp(-α_j)   if C_j(x_i) = y_i
w_i^(j+1) = (w_i^(j) / Z_j) × exp(α_j)    if C_j(x_i) ≠ y_i
where Z_j is the normalization factor so that Σ_i w_i^(j+1) = 1

If any intermediate round generates an error rate of
more than 50%, the weights are reverted back to
1/n

The ensemble error rate is bounded:
e_ensemble ≤ Π_i sqrt(ε_i (1 - ε_i))
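
A simplified sketch of the AdaBoost loop above with 1-D decision stumps on made-up data; it omits details such as resetting the weights when a round's error rate exceeds 50%.

import math
import random

# Sketch: AdaBoost-style weight updates with decision stumps.
random.seed(5)
data = [(x, 1 if x > 0.4 else -1) for x in [random.random() for _ in range(100)]]
n = len(data)
w = [1.0 / n] * n                      # initial weights 1/n

classifiers = []                       # list of (threshold, alpha)
for _ in range(5):
    # Pick the stump "predict +1 if x > thr" with the smallest weighted error.
    best_thr, best_err = None, None
    for thr in [i / 20 for i in range(1, 20)]:
        err = sum(wi for wi, (x, y) in zip(w, data)
                  if (1 if x > thr else -1) != y)
        if best_err is None or err < best_err:
            best_thr, best_err = thr, err
    eps = max(best_err, 1e-10)
    alpha = 0.5 * math.log((1 - eps) / eps)       # importance score
    classifiers.append((best_thr, alpha))

    # Decrease weights of correctly classified examples, increase the
    # others, then normalize so the weights sum to 1.
    w = [wi * math.exp(-alpha if (1 if x > best_thr else -1) == y else alpha)
         for wi, (x, y) in zip(w, data)]
    z = sum(w)
    w = [wi / z for wi in w]

def ensemble_predict(x):
    score = sum(alpha * (1 if x > thr else -1) for thr, alpha in classifiers)
    return 1 if score >= 0 else -1

print("training errors:", sum(1 for x, y in data if ensemble_predict(x) != y))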

Jian Pei: CMPT 741/459 Classification (3)

350

Intuition Bayesian Classification


More hockey fans in Canada than in the US
Which country is Tom, a hockey fan, from?
Predicting Canada has a better chance of being right
Prior probability P(Canadian) = 5%: reflects
background knowledge - 5% of the total population
are Canadians
P(hockey fan | Canadian) = 30%: the probability that a
Canadian is also a hockey fan
Posterior probability P(Canadian | hockey fan): the
probability that a hockey fan is from Canada
Jian Pei: CMPT 741/459 Classification (4)

351

Bayes Theorem
P(h | D) = P(D | h) P(h) / P(D)

Find the maximum a posteriori (MAP)
hypothesis
h_MAP ≡ argmax_{h∈H} P(h | D) = argmax_{h∈H} P(D | h) P(h) / P(D)
      = argmax_{h∈H} P(D | h) P(h)

Require background knowledge


Computational cost
Jian Pei: CMPT 741/459 Classification (4)

352

Naïve Bayes Classifier

Assumption: attributes are independent
Given a tuple (a1, a2, ..., an), predict its
class as
C = argmax_{Ci} P(a1, a2, ..., an | Ci) P(Ci)
  = argmax_{Ci} P(Ci) Π_j P(aj | Ci)

argmax_x f(x): the value of x that maximizes f(x)
Example: argmax_{x∈{1,2,3}} x² = 3

Jian Pei: CMPT 741/459 Classification (4)

353

Example: Training Dataset


Data sample X =
(Outlook=sunny,
Temp=mild, Humid=high,
Wind=weak)
Will she play tennis? Yes
P(Yes|X) =
P(X|Yes) P(Yes) = 0.014
P(No|X) =
P(X|No) P(No) = 0.007
Jian Pei: CMPT 741/459 Classification (4)

Outlook   Temp  Humid   Wind    PlayTennis
Sunny     Hot   High    Weak    No
Sunny     Hot   High    Strong  No
Overcast  Hot   High    Weak    Yes
Rain      Mild  High    Weak    Yes
Rain      Cool  Normal  Weak    Yes
Rain      Cool  Normal  Strong  No
Overcast  Cool  Normal  Strong  Yes
Sunny     Mild  High    Weak    No
Sunny     Cool  Normal  Weak    Yes
Rain      Mild  Normal  Weak    Yes
Sunny     Mild  Normal  Strong  Yes
Overcast  Mild  High    Strong  Yes
Overcast  Hot   Normal  Weak    Yes
Rain      Mild  High    Strong  No

354

Probability of Infrequent Values


(outlook = Sunny,
temp = high,
humid = low,
wind = weak)?
P(humid = low) = 0

Jian Pei: CMPT 741/459 Classification (4)

(The same PlayTennis training data set as on the previous slide.)

355

Smoothing
Suppose an attribute has n different values:
a1, ..., an
Assume a small enough value ε > 0
Let Pi be the frequency of ai:
Pi = # tuples having ai / total # of tuples
Smoothed estimate: P'(ai) = (Pi + ε) / (1 + nε), so every
value gets a small nonzero probability and the estimates still sum to 1
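
A sketch of a categorical naive Bayes classifier on the PlayTennis data, using a Laplace-style smoothing term ε (one possible way to realize the smoothing idea above) so that the unseen value humid = Low still gets a small nonzero probability.

from collections import Counter, defaultdict

# Sketch: naive Bayes with smoothed conditional probabilities.
rows = [
    ("Sunny", "Hot", "High", "Weak", "No"),   ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
attrs = ["Outlook", "Temp", "Humid", "Wind"]
eps = 0.01

class_count = Counter(r[-1] for r in rows)
value_count = defaultdict(Counter)           # (attribute index, class) -> value counts
domains = [set(r[i] for r in rows) for i in range(len(attrs))]
for r in rows:
    for i, v in enumerate(r[:-1]):
        value_count[(i, r[-1])][v] += 1

def posterior(x, c):
    """P(c) * prod_j P(a_j | c) with smoothed conditional probabilities."""
    p = class_count[c] / len(rows)
    for i, v in enumerate(x):
        num = value_count[(i, c)][v] + eps
        den = class_count[c] + eps * len(domains[i] | {v})
        p *= num / den
    return p

# Humid = "Low" never occurs in the training data, but its smoothed
# probability is small and nonzero rather than 0.
x = ("Sunny", "Hot", "Low", "Weak")
print({c: round(posterior(x, c), 5) for c in class_count})
# argmax over the classes gives the prediction for x.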

Jian Pei: CMPT 741/459 Classification (4)

356

Characteristics of Naïve Bayes


Robust to isolated noise points
Such points are averaged out in probability
computation

Insensitive to missing values


Robust to irrelevant attributes
Distributions on such attributes are almost
uniform

Correlated attributes degrade the


performance
Jian Pei: CMPT 741/459 Classification (4)

357

Bayes Error Rate


The error rate of the ideal naïve Bayes
classifier

Err = ∫_0^x̂ P(Crocodile | X) dX + ∫_x̂^∞ P(Alligator | X) dX

where x̂ is the decision boundary between the two classes

Jian Pei: CMPT 741/459 Classification (4)


358

Pros and Cons


Pros
Easy to implement
Good results obtained in many cases

Cons
A (too) strong assumption: independent
attributes

How to handle dependent/correlated


attributes?
Bayesian belief networks
Jian Pei: CMPT 741/459 Classification (4)

359

Associative Classification
Mine possible association rules (PRs) in the form
of condset → c
condset: a set of attribute-value pairs
c: class label

Build classifier
Organize rules according to decreasing
precedence based on confidence and support

Classification
Use the first matching rule to classify an
unknown case
Jian Pei: CMPT 741/459 Classification (4)

360

Associative Classification Methods


CBA (Classification By Association: Liu, Hsu & Ma,
KDD'98)
Mine possible association rules in the form of
cond-set (a set of attribute-value pairs) → class label
Build classifier: organize rules according to decreasing
precedence based on confidence and then support
CMAR (Classification based on Multiple
Association Rules: Li, Han, Pei, ICDM'01)
Classification: statistical analysis on multiple rules

Jian Pei: CMPT 741/459 Classification (4)

361

Instance-based Methods
Instance-based learning
Store training examples and delay the processing until a
new instance must be classified ("lazy evaluation")

Typical approaches
K-nearest neighbor approach
Instances represented as points in a Euclidean space

Locally weighted regression


Construct local approximation

Case-based reasoning
Use symbolic representations and knowledge-based inference

Jian Pei: CMPT 741/459 Classification (4)

362

The K-Nearest Neighbor Method


Instances are points in an n-D space
The k nearest neighbors (KNN) under the
Euclidean distance
Return the most common value among the k
training examples nearest to the query point

Discrete-/real-valued target functions


(Figure: a query point xq among positive (+) and negative (-) training examples.)
Jian Pei: CMPT 741/459 Classification (4)

363

KNN Methods
For continuous-valued target functions, return the
mean value of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
Give greater weights to closer neighbors,
e.g., w = 1 / d(xq, xi)²

Robust to noisy data by averaging k-nearest


neighbors
Curse of dimensionality

Distance could be dominated by irrelevant attributes


Remedies: stretch the axes or eliminate the least relevant attributes
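
A sketch of distance-weighted k-nearest-neighbor prediction for a continuous-valued target, using the weight w = 1 / d(xq, xi)² above; the training points are made up.

# Sketch: distance-weighted kNN regression.
train = [((1.0, 1.0), 10.0), ((1.2, 0.9), 12.0), ((3.0, 3.1), 30.0),
         ((2.9, 3.3), 28.0), ((5.0, 5.2), 55.0)]

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def knn_predict(xq, k=3):
    neighbors = sorted(train, key=lambda t: dist2(xq, t[0]))[:k]
    # If the query coincides with a training point, return its value directly.
    if dist2(xq, neighbors[0][0]) == 0:
        return neighbors[0][1]
    weights = [1.0 / dist2(xq, x) for x, _ in neighbors]   # w = 1 / d^2
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

print(knn_predict((1.1, 1.0)))
print(knn_predict((3.0, 3.0)))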

Jian Pei: CMPT 741/459 Classification (4)

364

Lazy vs. Eager Learning


Efficiency: lazy learning uses less training
time but more predicting time
Accuracy
Lazy method effectively uses a richer hypothesis
space
Eager: must commit to a single hypothesis that
covers the entire instance space

Jian Pei: CMPT 741/459 Classification (4)

365

Outlier Detection

Motivation: Fraud Detection

http://i.imgur.com/ckkoAOp.gif

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

367

Techniques: Fraud Detection


Features
Dissimilarity
Groups and noise

http://i.stack.imgur.com/tRDGU.png

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

368

Outlier Analysis
"One person's noise is another person's
signal"
Outliers: the objects considerably dissimilar
from the remainder of the data
Examples: credit card fraud, Michael Jordan,
intrusions, etc.
Applications: credit card fraud detection, telecom
fraud detection, intrusion detection, customer
segmentation, medical analysis, etc.
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

369

Outliers and Noise


Different from noise
Noise is random error or variance in a measured
variable

Outliers are interesting: an outlier violates


the mechanism that generates the normal
data
Outlier detection vs. novelty detection
In an early stage, novel objects may be regarded as outliers
But they are later merged into the model
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

370

Types of Outliers
Three kinds: global, contextual and collective
outliers
A data set may have multiple types of outlier
One object may belong to more than one type of
outlier

Global outlier (or point anomaly)


An outlier object significantly deviates from the
rest of the data set

Challenge: find an appropriate measurement
of deviation
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

371

Contextual Outliers
An outlier object deviates significantly based on a
selected context
Ex. Is 10°C in Vancouver an outlier? (It depends: summer or
winter?)
Attributes of data objects should be divided into two
groups
Contextual attributes: define the context, e.g., time & location
Behavioral attributes: characteristics of the object, used in
outlier evaluation, e.g., temperature
A generalization of local outliers, whose density
significantly deviates from that of their local area
Challenge: how to define or formulate a meaningful
context?
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

372

Collective Outliers
A subset of data objects collectively deviate
significantly from the whole data set, even if the
individual data objects may not be outliers
Application example: intrusion detection, when a
number of computers keep sending denial-of-service packets to each other

Detection of collective outliers


Consider not only behavior of individual objects, but
also that of groups of objects
Need to have the background knowledge on the
relationship among data objects, such as a distance
or similarity measure on objects
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

373

Outlier Detection: Challenges


Modeling normal objects and outliers properly
Hard to enumerate all possible normal behaviors in
an application
The border between normal and outlier objects is
often a gray area

Application-specific outlier detection


Choice of distance measure among objects and the
model of relationship among objects are often
application-dependent
Example: in clinical data, a small deviation could be an
outlier, while in marketing analysis, much larger
fluctuations are normal
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

374

Outlier Detection: Challenges


Handling noise in outlier detection
Noise may distort the normal objects and blur the
distinction between normal objects and outliers
Noise may help hide outliers and reduce the
effectiveness of outlier detection

Understandability
Understand why these are outliers: Justification of
the detection
Specify the degree of an outlier: the unlikelihood of
the object being generated by a normal mechanism
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

375

Outlier Detection Methods


Whether user-labeled examples of outliers
can be obtained
Supervised, semi-supervised, and unsupervised
methods

Assumptions about normal data and outliers


Statistical, proximity-based, and clustering-based methods

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

376

Supervised Methods
Modeling outlier detection as a classification problem
Samples examined by domain experts used for training & testing

Methods for learning a classifier for outlier detection effectively:


Model normal objects & report those not matching the model as
outliers, or
Model outliers and treat those not matching the model as normal

Challenges
Imbalanced classes, i.e., outliers are rare: Boost the outlier class
and make up some artificial outliers
Catch as many outliers as possible, i.e., recall is more important
than accuracy (i.e., not mislabeling normal objects as outliers)

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

377

Unsupervised Methods
Assume the normal objects are somewhat
"clustered" into multiple groups, each having some
distinct features
An outlier is expected to be far away from any
groups of normal objects
Weakness: cannot detect collective outliers effectively
Normal objects may not share any strong patterns, but
the collective outliers may share high similarity in a small
area

Many clustering methods can be adapted for


unsupervised methods
Find clusters, then outliers: not belonging to any cluster
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

378

Unsupervised Methods: Challenges


In some intrusion or virus detection settings, normal
activities are diverse
Unsupervised methods may have a high false
positive rate but still miss many real outliers
Supervised methods can be more effective, e.g.,
identifying attacks on key resources
Challenges
Hard to distinguish noise from outliers
Costly since clustering is done first: but there are far
fewer outliers than normal objects
Newer methods: tackle outliers directly


Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

379

Semi-Supervised Methods
In many applications, the number of labeled examples is often
small
Labels could be on outliers only, normal objects only, or both

If some labeled normal objects are available


Use the labeled examples and the proximate unlabeled
objects to train a model for normal objects
Those not fitting the model of normal objects are detected as
outliers

If only some labeled outliers are available, a small
number of labeled outliers may not cover the possible
outliers well
To improve the quality of outlier detection, one can get help
from models for normal objects learned from unsupervised
methods
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

380

Pros and Cons


Effectiveness of statistical methods: highly
depends on whether the assumption of
statistical model holds in the real data
There are rich alternatives to use various
statistical models
Parametric vs. non-parametric

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

381

Proximity-based Methods
An object is an outlier if the nearest
neighbors of the object are far away, i.e., the
proximity of the object significantly
deviates from the proximity of most of the
other objects in the same data set

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

382

Pros and Cons


The effectiveness of proximity-based methods
highly relies on the proximity measure
In some applications, proximity or distance
measures cannot be obtained easily
Often have a difficulty in identifying a group of
outliers that stay close to each other
Two major types of proximity-based outlier
detection methods
Distance-based vs. density-based
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

383

Clustering-based Methods
Normal data belong to large and dense
clusters, whereas outliers belong to small or
sparse clusters, or do not belong to any
clusters

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

384

Challenges
Since there are many clustering methods,
there are many clustering-based outlier
detection methods as well
Clustering is expensive: straightforward
adaptation of a clustering method for outlier
detection can be costly and does not scale
up well for large data sets

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

385

Statistical Outlier Analysis


Assumption: the objects in a data set are
generated by a (stochastic) process (a
generative model)
Learn a generative model fitting the given
data set, and then identify the objects in low
probability regions of the model as outliers
Two categories: parametric versus non-parametric
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

386

Example
Statistical methods (also known as model-based methods) assume that the normal
data follow some statistical model
The data not following the model are outliers

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

387

Parametric Methods
Assumption: the normal data are generated by
a parametric distribution with parameter Θ
The probability density function of the
parametric distribution, f(x | Θ), gives the
probability that object x is generated by the
distribution
The smaller this value, the more likely x is an
outlier
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

388

Univariate Outliers Based on Normal Distribution

Assume the data follow a normal distribution N(μ, σ²)
Log-likelihood:
ln L(μ, σ²) = Σ_{i=1}^{n} ln f(xi | (μ, σ²))
            = -(n/2) ln(2π) - (n/2) ln σ² - (1/(2σ²)) Σ_{i=1}^{n} (xi - μ)²

Taking derivatives with respect to μ and σ²,
we derive the following maximum likelihood
estimates:
μ̂ = x̄ = (1/n) Σ_{i=1}^{n} xi
σ̂² = (1/n) Σ_{i=1}^{n} (xi - x̄)²

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)


389

Example
Daily average temperature: {24.0, 28.9, 28.9,
29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
Since n = 10, μ̂ = 28.61 and
σ̂ = sqrt(2.29) ≈ 1.51
Then (24 - 28.61) / 1.51 ≈ -3.04 < -3, so 24.0 is
an outlier, since μ ± 3σ contains 99.7% of the data
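
A sketch of the same idea on made-up temperatures: fit μ and σ by maximum likelihood and flag values more than 3 standard deviations from the mean.

from math import sqrt

# Sketch: flag values far from the mean under a fitted normal model.
temps = [28.8, 29.4] * 10 + [24.0]     # made-up daily temperatures

n = len(temps)
mu = sum(temps) / n
sigma = sqrt(sum((x - mu) ** 2 for x in temps) / n)   # MLE estimate (divide by n)

for x in temps:
    z = (x - mu) / sigma
    if abs(z) > 3:
        print(x, "deviates by", round(abs(z), 2), "standard deviations -> outlier")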

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

390

The Grubbs Test


Maximum normed residual test
For each object x in a data set, compute its
z-score: z = |x - x̄| / s
x is an outlier if

z ≥ ((N - 1) / sqrt(N)) × sqrt( t²_{α/(2N), N-2} / (N - 2 + t²_{α/(2N), N-2}) )

where t²_{α/(2N), N-2} is the value taken by a t-distribution at a
significance level of α/(2N), and N is the number
of objects in the data set

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

391

Non-parametric Method
Do not assume an a priori statistical model;
instead, determine the model from the input
data
Not completely parameter-free, but the
number and nature of the parameters are
flexible and not fixed in advance
Examples: histogram and kernel density
estimation
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

392

Histogram
A transaction in the amount of $7,500 is an
outlier, since only 0.2% of transactions have an
amount higher than $5,000

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

393

Challenges
Hard to choose an appropriate bin size for a
histogram
Too small a bin size: normal objects fall in empty or
rare bins, causing false positives
Too big a bin size: outliers fall in some frequent
bins, causing false negatives

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

394

Proximity-based Outlier Detection


Objects far away from the others are outliers
The proximity of an outlier deviates significantly
from that of most of the others in the data set
Distance-based outlier detection: An object o is
an outlier if its neighborhood does not have
enough other points
Density-based outlier detection: An object o is
an outlier if its density is relatively much lower
than that of its neighbors
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

395

Depth-based Methods
Organize data objects in layers with various
depths
The shallow layers are more likely to contain
outliers

Example: Peeling, Depth contours


Complexity O(N^(k/2)) for k-dimensional data sets
Unacceptable for k > 2

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

396

Depth-based Outliers: Example

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

397

Distance-based Outliers
A DB(p, D)-outlier is an object O in a dataset
T such that at least a fraction p of the objects
in T lie at a distance greater than distance D
from O
The larger D, the more outlying
The larger p, the more outlying
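
A brute-force sketch of DB(p, D)-outlier detection on made-up 2-D points.

from math import dist   # Python 3.8+

# Sketch: an object is a DB(p, D)-outlier if at least a fraction p of the
# other objects lie at a distance greater than D from it.
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (1.0, 0.8),
          (0.8, 1.0), (1.1, 1.2), (5.0, 5.0)]

def db_outliers(data, p=0.9, D=1.5):
    outliers = []
    for o in data:
        far = sum(1 for x in data if x is not o and dist(o, x) > D)
        if far >= p * (len(data) - 1):
            outliers.append(o)
    return outliers

print(db_outliers(points))   # only the isolated point (5.0, 5.0) qualifies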

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

398

Density-based Local Outlier


Both o1 and o2 are outliers
Distance-based methods
can detect o1, but not o2

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

399

Intuition
Compare objects to their local
neighborhoods, instead of the global data
distribution
The density around an outlier object is
significantly different from the density around
its neighbors
Use the relative density of an object against
its neighbors as the indicator of the degree
of the object being an outlier
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

400

Classification-based Outlier Detection


Train a classification model that can
distinguish normal data from outliers
A brute-force approach: Consider a training
set that contains some samples labeled as
normal and others labeled as outlier
A training set in practice is typically heavily
biased: the number of normal samples likely
far exceeds that of outlier samples
Cannot detect unseen anomalies
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

401

One-Class Model
A classifier is built to describe only the normal class
Learn the decision boundary of the normal class
using classification methods such as SVM
Any samples that do not belong to the normal class
(not within the decision boundary) are declared as
outliers
Advantage: can detect new outliers that may not
appear close to any outlier objects in the training set
Extension: Normal objects may belong to multiple
classes
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

402

One-Class Model

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

403

Semi-Supervised Learning Methods


Combine classification-based and clustering-based
methods
Method
Use a clustering-based approach to find a large cluster,
C, and a small cluster, C1
Since some objects in C carry the label normal, treat all
objects in C as normal
Use the one-class model of this cluster to identify normal
objects in outlier detection
Since some objects in cluster C1 carry the label outlier,
declare all objects in C1 as outliers
Any object that does not fall into the model for C (such
as a) is considered an outlier as well
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

404

Example

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

405

Pros and Cons


Pros: Outlier detection is fast
Cons: quality heavily depends on the availability
and quality of the training set;
it is often difficult to obtain representative and high-quality training data

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

406

Contextual Outliers
An outlier object deviates significantly based on a
selected context
Ex. Is 10°C in Vancouver an outlier? (It depends: summer or
winter?)
Attributes of data objects should be divided into two
groups
Contextual attributes: define the context, e.g., time & location
Behavioral attributes: characteristics of the object, used in
outlier evaluation, e.g., temperature
A generalization of local outliers, whose density
significantly deviates from that of their local area
Challenge: how to define or formulate a meaningful
context?
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

407

Detection of Contextual Outliers


If the contexts can be clearly identified,
transform the problem to conventional outlier detection
Identify the context of the object using the
contextual attributes
Calculate the outlier score for the object in the
context using a conventional outlier detection
method

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

408

Example
Detect outlier customers in the context of
customer groups
Contextual attributes: age group, postal code
Behavioral attributes: the number of transactions per
year, annual total transaction amount

Method
Locate c's context;
Compare c with the other customers in the same
group; and
Use a conventional outlier detection method
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

409

Modeling Normal Behavior


Model the normal behavior with respect to contexts
Use a training data set to train a model that predicts the
expected behavior attribute values with respect to the
contextual attribute values
An object is a contextual outlier if its behavior attribute
values significantly deviate from the values predicted by
the model

Use a prediction model to link the contexts and


behavior
Avoid explicit identification of specific contexts
Some possible methods: regression, Markov Models,
and Finite State Automaton
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

410

Collective Outliers
Objects as a group deviate significantly from
the entire data
Examine the structure of the data set, i.e., the
relationships between multiple data objects
The structures are often not explicitly defined,
and have to be discovered as part of the outlier
detection process.

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

411

Detecting High Dimensional Outliers


Interpretability of outliers
Which subspaces manifest the outliers or an
assessment regarding the outlying-ness of the objects

Data sparsity: data in high-D spaces are often sparse


The distance between objects becomes heavily
dominated by noise as the dimensionality increases

Data subspaces
Local behavior and patterns of data

Scalability with respect to dimensionality


The number of subspaces increases exponentially

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

412

Angle-based Outliers

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

413
