
Introduction

Motivation: Business Intelligence


Customer information
(customer-id, gender, age,
home-address, occupation,
income, family-size, )

Product information
(Product-id, category,
manufacturer, made-in,
stock-price, )

Sales information
(customer-id, product-id, #units, unit-price,
sales-representative, )
Business queries:

Jian Pei: CMPT 741/459 Data Mining -- Introduction

Techniques: Business Intelligence


Multidimensional data analysis
Online query answering
Interactive data exploration

Jian Pei: CMPT 741/459 Data Mining -- Introduction

Motivation: Store Layout Design

http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg

Jian Pei: CMPT 741/459 Data Mining -- Introduction

Techniques: Store Layout Design


Customer purchase patterns
Business strategies

Jian Pei: CMPT 741/459 Data Mining -- Introduction

Motivation: Community Detection

http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-socialmedia-1-728.jpg?cb=1308736811

Jian Pei: CMPT 741/459 Data Mining -- Introduction

Techniques: Community Detection


Similarity between objects
Partitioning objects into groups
No guidance about what a group is

Jian Pei: CMPT 741/459 Data Mining -- Introduction

Motivation: Disease Prediction


What medical problems does this patient have?
Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat

Jian Pei: CMPT 741/459 Data Mining -- Introduction

Techniques: Disease Prediction


Features
Model

Jian Pei: CMPT 741/459 Data Mining -- Introduction

Motivation: Fraud Detection

http://i.imgur.com/ckkoAOp.gif

Jian Pei: CMPT 741/459 Data Mining -- Introduction

10

Techniques: Fraud Detection


Features
Dissimilarity
Groups and noise

http://i.stack.imgur.com/tRDGU.png

Jian Pei: CMPT 741/459 Data Mining -- Introduction

11

What Is Data Science About?


Data
Extraction of knowledge from data
Continuation of data mining and knowledge
discovery from data (KDD)

Jian Pei: CMPT 741/459 Data Mining -- Introduction

12

What Is Data?
Values of qualitative or quantitative variables
belonging to a set of items
Represented in a structure, e.g., tabular, tree
or graph structure
Typically the results of measurements
As an abstract concept can be viewed as the
lowest level of abstraction from which
information and then knowledge are derived
Jian Pei: CMPT 741/459 Data Mining -- Introduction

13

What Is Information?
Knowledge communicated or received
concerning a particular fact or circumstance
Conceptually, information is the message
(utterance or expression) being conveyed
Cannot be predicted
Can resolve uncertainty

Jian Pei: CMPT 741/459 Data Mining -- Introduction

14

What Is Knowledge?
Familiarity with someone or something,
which can include facts, information,
descriptions, or skills acquired through
experience or education
Implicit knowledge: practical skill or expertise
Explicit knowledge: theoretical
understanding of a subject

Jian Pei: CMPT 741/459 Data Mining -- Introduction

15

Data Systems
A data system answers queries based on
data acquired in the past
Base data: the rawest data, not derived from anywhere else
Knowledge: information derived from the base data

Jian Pei: CMPT 741/459 Data Mining -- Introduction

16

Dealing with Data Querying


Given a set of student records about name,
age, courses taken and grades
Simple queries
What is John Doe's age?

Aggregate queries
What is the average GPA of all students at this
school?

Queries can be arbitrarily complicated


Find the students X and Y whose grades are less
than 3% apart in as many courses as possible
Jian Pei: CMPT 741/459 Data Mining -- Introduction

17

Queries
A precise request for information
Subjects in databases and information
retrieval
Databases: structured queries on structured
(e.g., relational) data
Information retrieval: unstructured queries on
unstructured (e.g., text, image) data

Important assumptions
Information needs
Query languages
Jian Pei: CMPT 741/459 Data Mining -- Introduction

18

Data-driven Exploration
What should be the next strategy of a
company?
A lot of data: sales, human resource, production,
tax, service cost,

The question cannot be translated into a precise request for information (i.e., a query)
Developing familiarity (knowledge) and
actionable items (decisions) by interactively
analyzing data
Jian Pei: CMPT 741/459 Data Mining -- Introduction

19

Data-driven Thinking
Starting with some simple queries
New queries are raised by consuming the
results of previous queries
No ultimate query in design!
But many queries can be answered using DB/IR
techniques

Jian Pei: CMPT 741/459 Data Mining -- Introduction

20

The Art of Data-driven Thinking


The way of generating queries remains an
art!
Different people may derive different results
using the same data
If you torture the data long enough, it will confess
Ronald H. Coase

More often than not, more data may be needed: datafication
Jian Pei: CMPT 741/459 Data Mining -- Introduction

21

Queries for Data-driven Thinking


Probe queries: finding information about specific individuals
Aggregation: finding information about groups
Pattern finding: finding commonality in a population
Association and correlation: finding connections among individuals and groups
Causality analysis: finding causes and consequences
Jian Pei: CMPT 741/459 Data Mining -- Introduction

22

What Is Data Mining?


Broader sense: the art of data-driven
thinking
Technical sense: the non-trivial process of
identifying valid, novel, potentially useful,
and ultimately understandable patterns in
data [Fayyad, Piatetsky-Shapiro, Smyth, 96]
Methods and tools of answering various types of
queries in the data mining process in the
broader sense
Jian Pei: CMPT 741/459 Data Mining -- Introduction

23

Machine Learning
A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P, if its
performance at tasks in T, as measured by P,
improves with experience E
Tom M. Mitchell
Essentially, learn the distribution of data

Jian Pei: CMPT 741/459 Data Mining -- Introduction

24

Data mining vs. Machine Learning


Machine learning focuses on prediction,
based on known properties learned from the
training data
Data mining focuses on the discovery of
(previously) unknown properties on the data

Jian Pei: CMPT 741/459 Data Mining -- Introduction

25

The KDD Process


[Figure: the KDD process. Data → (selection) → target data → (preprocessing) → preprocessed data → (transformation) → transformed data → (data mining) → patterns → (interpretation/evaluation) → knowledge.]
Jian Pei: CMPT 741/459 Data Mining -- Introduction

26

Data Mining R&D

New problem identification


Data collection and transformation
Algorithm design and implementation
Evaluation
Effectiveness evaluation
Efficiency & scalability evaluation

Deployment and business solution

Jian Pei: CMPT 741/459 Data Mining -- Introduction

27

Data Mining on Big Data


Data is so widely available and so strategically important that the scarce thing is the knowledge to extract wisdom from it (Hal Varian, Google's Chief Economist)

Jian Pei: CMPT 741/459 Data Mining -- Introduction

28

What Is Big Data?


No quantitative definition!
Big data is like teenage sex
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it...
Dan Ariely

Jian Pei: CMPT 741/459 Data Mining -- Introduction

29

Data Volume vs. Storage Cost


The unit cost of disk storage decreases
dramatically
Year   Unit cost
1956   $10,000/MB
1980   $193/MB
1990   $9/MB
2000   $6.9/GB
2010   $0.08/GB
2013   $0.06/GB

http://ns1758.ca/winch/winchest.html

Jian Pei: CMPT 741/459 Data Mining -- Introduction

30

Big Data Volume


Data sets with sizes beyond the ability of
commonly-used software tools to capture,
curate, manage, and process the data within a
tolerable elapsed time
Wikipedia

Jian Pei: CMPT 741/459 Data Mining -- Introduction

31

Big Data: Volume


Every day, about 7 billion shares change hands
on US equity markets
About 2/3 is traded by computer algorithms based
on huge amounts of data to predict gains and risk

In Q2 2015
Facebook has 1.49 billion active users
WeChat has 600 million active users, 100 million outside China
LinkedIn has 380 million active users
Twitter has 304 million active users
Jian Pei: CMPT 741/459 Data Mining -- Introduction

32

Velocity
Google processes 24+ petabytes of data per
day
Facebook gets 10+ million new photos
uploaded every hour
Facebook members like or leave a comment
3+ billion times per day
YouTube users upload 1+ hour of video
every second
400+ million tweets per day
Jian Pei: CMPT 741/459 Data Mining -- Introduction

33

What Has Been Changed?


The 1880 census in the US took 8 years to complete
The 1890 census would have needed 13 years; using punch cards, it was reduced to less than 1 year
It is essential to get not only accurate but also timely data
Statisticians use sampling to estimate
Recently, with new technologies, the ways of data collection and transmission have been fundamentally changed
Jian Pei: CMPT 741/459 Data Mining -- Introduction

34

Sampling for Volume/Velocity?


Sampling idea: the marginal new information brought by a larger amount of data shrinks quickly
The sample should be truly random
On a data set of hundreds or thousands of attributes, can sampling help in
finding subcategories of attribute combinations?
finding outliers and exceptions?
Big data contains signals of different strengths
Not noise, but weaker and weaker signals that may still be interesting and important
Jian Pei: CMPT 741/459 Data Mining -- Introduction

35

Big Data and Lytro Pictures


Lytro pictures record the whole light field
Photographers can decide later which parts to
focus on

Big data tries to record as much information as possible
Analysts can decide later what to extract from
big data
Both advantages and challenges

Jian Pei: CMPT 741/459 Data Mining -- Introduction

36

Veracity
1 in 3 business leaders don't trust the
information they use to make decisions
Assuming a slowly growing total cost budget,
tradeoff between data volume and data
quality
Loss of veracity in combining different types
of information from different sources
Loss of veracity in data extraction,
transformation, and processing
Jian Pei: CMPT 741/459 Data Mining -- Introduction

37

Variety
Integrating data capturing different aspects
of a data object
Vancouver Canucks: game video, technical
statistics, social media,
Different pieces are in different format

Different views of the same data object from different sources
Did the soccer ball pass the goal line?
The views may not be consistent
Jian Pei: CMPT 741/459 Data Mining -- Introduction

38

Four V-challenges
Volume: massive scale and growth, 40% per
year in global data generated
Velocity: real time data generation and
consumption
Variety: heterogeneous data, mainly
unstructured or semi-structured, from many
sources
Veracity
Jian Pei: CMPT 741/459 Data Mining -- Introduction

39

Is Big Data Really New?


People were aware of the existence of big data a long time ago, but no one could access it until very recently
(Genesis 28:15) I am with you and will watch
over you wherever you go

Similar statements in Quran and Sutra

What has been changed?


How is data connected with people
Jian Pei: CMPT 741/459 Data Mining -- Introduction

40

Diversity in Data Usage


In the past, only very few projects can afford
to be data-intensive
Nowadays, excessive applications are
(naturally) data-intensive

Jian Pei: CMPT 741/459 Data Mining -- Introduction

41

Datafication
Extract data about an object or event in a
quantified way so that it can be analyzed
Different from digitalization

An important feature of big data


Key: new data, new applications, new
opportunities

Jian Pei: CMPT 741/459 Data Mining -- Introduction

42

New Values of Datafication


Example: Captcha and ReCaptcha (Luis von
Ahn)
How to create new values of data and
datafication?
Connecting data with new users
Connecting different pieces of data to present a
bigger picture

Important techniques
Data aggregation
Extended datafication
Jian Pei: CMPT 741/459 Data Mining -- Introduction

43

Big Data Players

Data holders
Data specialists
Big-data mindset leaders
A capable company may play 2 or 3 roles at
the same time
What is most important, big-data mindset,
skills, or data itself?

Jian Pei: CMPT 741/459 Data Mining -- Introduction

44

Privacy
big data analytics have the potential to
eclipse longstanding civil rights protections
in how personal information is used in
housing, credit, employment, health,
education, and the marketplace
Executive Office of the (US) President

Jian Pei: CMPT 741/459 Data Mining -- Introduction

45

Keep in Mind
Our industry does not respect
tradition it only respects
innovation.
Satya Nadella

Jian Pei: CMPT 741/459 Data Mining -- Introduction

46

Goals of This Course


Data-driven thinking towards being a (big)
data scientist
Principles and hands-on skills of data
mining, particularly in the context of big data
Identifying new data mining problems
Data mining algorithm design
Data mining applications

Novel problems for upcoming research


Jian Pei: CMPT 741/459 Data Mining -- Introduction

47

Format
Due to the fast progress in data mining, we
will go beyond the textbook substantially
Active classroom discussion
Open questions and brainstorming
Textbook: Data Mining Concepts and
Techniques (3rd ed)

Jian Pei: CMPT 741/459 Data Mining -- Introduction

48

Read Try Think


Reading
(required) Textbook and a small number of research
papers
You have to have the 3rd ed of the textbook!
(open end, not covered by the exam) Technical and
non-technical materials

Trying
Assignments and a project

Thinking
Examine everything from a data scientist angle from
today
Jian Pei: CMPT 741/459 Data Mining -- Introduction

49

Data Mining: History


1989 IJCAI Workshop on Knowledge
Discovery in Databases
Knowledge Discovery in Databases (G.
Piatetsky-Shapiro and W. Frawley, 1991)

1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)

Jian Pei: CMPT 741/459 Data Mining -- Introduction

50

Data Mining: History (cont'd)


95-98 International Conferences on Knowledge
Discovery in Databases and Data Mining
(KDD 95-98)
Journal of Data Mining and Knowledge Discovery (1997)

ACM SIGKDD conferences since 1998 and


SIGKDD Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining
(2001), (IEEE) ICDM (2001), etc.

ACM Transactions on KDD starting in 2007


Jian Pei: CMPT 741/459 Data Mining -- Introduction

51

Frequent Pattern Mining

How Many Words Is a Picture Worth?

E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

53

Burnt or Burned?

E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

54

Store Layout Design

http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

55

Transaction Data
Alphabet: a set of items
Example: all products sold in a store

A transaction: a set of items involved in an


activity
Example: the items purchased by a customer in
a visit

Other information is often associated


Timestamp, price, salesperson, customer-id,
store-id,
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

56

Examples of Transaction Data

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

57

How to Store Transaction Data?


Example transactions: (t123, a, b, c), (t236, b, d)
Relational storage: one (Tid, Item) row per item
  Tid   Item
  t123  a
  t123  b
  t123  c
  t236  b
  t236  d
Transaction-based storage: each transaction stored as a set of items
Item-based (vertical) storage: for each item, the list of transactions containing it
  Item a: …, t123, …
  Item b: …, t123, …, t236, …
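A minimal Python sketch of the three layouts for the two example transactions above; the variable names and dictionary/list representations are my own assumptions, not from the slides:

transactions = {'t123': {'a', 'b', 'c'}, 't236': {'b', 'd'}}      # transaction-based storage
relational = [(tid, item)                                          # relational storage: one (Tid, Item) row per item
              for tid, items in transactions.items()
              for item in sorted(items)]
vertical = {}                                                      # item-based (vertical) storage
for tid, items in transactions.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)
print(vertical['b'])   # {'t123', 't236'}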

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

58

Transaction Data Analysis


Transactions: customers' purchases of commodities
{bread, milk, cheese} if they are bought together
Frequent patterns: product combinations that are frequently purchased together by customers
Frequent patterns: patterns (sets of items, sequences, etc.) that occur frequently in a database [AIS93]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

59

Why Frequent Patterns?


What products were often purchased
together?
What are the frequent subsequent purchases after buying an iPod?
What kinds of genes are sensitive to this new drug?
What key-word combinations are frequently associated with web pages about game evaluation?
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

60

Why Frequent Pattern Mining?


Foundation for many data mining tasks
Association rules, correlation, causality,
sequential patterns, spatial and multimedia
patterns, associative classification, cluster
analysis, iceberg cube,

Broad applications
Basket data analysis, cross-marketing, catalog
design, sale campaign analysis, web log (click
stream) analysis,
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

61

Frequent Itemsets
Itemset: a set of items
  E.g., acm = {a, c, m}
Support of itemsets
  Sup(acm) = 3
Given min_sup = 3, acm is a frequent pattern
Frequent pattern mining: finding all frequent patterns in a database
Transaction database TDB
TID  Items bought
100  f, a, c, d, g, i, m, p
200  a, b, c, f, l, m, o
300  b, f, h, j, o
400  b, c, k, s, p
500  a, f, c, e, l, p, m, n
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

62

A Naïve Attempt
Generate all possible itemsets, test their supports against the database
How to hold a large number of itemsets in main memory?
  100 items → 2^100 − 1 possible itemsets
How to test the supports of a huge number of itemsets against a large database, say containing 100 million transactions?
  A transaction of length 20 needs to update the support of 2^20 − 1 = 1,048,575 itemsets
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

63

Transactions in Real Applications


A large department store often carries more
than 100 thousand different kinds of items
Amazon.com carries more than 17,000 books
relevant to data mining

Walmart has more than 20 million


transactions per day, AT&T produces more
than 275 million calls per day
Mining large transaction databases of many
items is a real demand
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

64

How to Get an Efficient Method?


Reducing the number of itemsets that need
to be checked
Checking the supports of selected itemsets
efficiently

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

65

Candidate Generation & Test


Any subset of a frequent itemset must also be frequent (an anti-monotonic property)
  A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
  {beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent
In other words, any superset of an infrequent itemset must also be infrequent
  No superset of any infrequent itemset should be generated or tested
  Many item combinations can be pruned!
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

66

Apriori-Based Mining
Generate length (k+1) candidate itemsets
from length k frequent itemsets, and
Test the candidates against DB

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

67

The Apriori Algorithm [AgSr94]


Database D (min_sup = 2)
TID  Items
10   a, c, d
20   b, c, e
30   a, b, c, e
40   b, e

Scan D → 1-candidates and supports: a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets: a:2, b:3, c:3, e:3
2-candidates: ab, ac, ae, bc, be, ce
Scan D (counting) → ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2
3-candidates: bce
Scan D → frequent 3-itemsets: bce:2
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)
68

The Apriori Algorithm


Level-wise candidate generation and test
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do
    Ck+1 = candidates generated from Lk;                 // candidate generation
    for each transaction t in the database do            // test
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
return ∪k Lk;
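Below is a small, self-contained Python sketch of this candidate-generation-and-test loop, run on the example database D (min_sup = 2); the function name and data structures are illustrative assumptions, not part of the original algorithm description:

from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        # candidate generation: join Lk with itself, keep (k+1)-sets whose k-subsets are all frequent
        candidates = set()
        for a in Lk:
            for b in Lk:
                u = a | b
                if len(u) == k + 1 and all(frozenset(s) in Lk for s in combinations(u, k)):
                    candidates.add(u)
        # test: count each surviving candidate against the database
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= min_sup}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

db = [['a', 'c', 'd'], ['b', 'c', 'e'], ['a', 'b', 'c', 'e'], ['b', 'e']]
print(apriori(db, 2))   # includes frozenset({'b', 'c', 'e'}): 2, as in the example above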

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

69

Important Steps in Apriori


How to find frequent 1- and 2-itemsets?
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning

How to count supports of candidates?

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

70

Finding Frequent 1- & 2-itemsets


Finding frequent 1-itemsets (i.e., frequent
items) using a one dimensional array
Initialize c[item]=0 for each item
For each transaction T, for each item in T,
c[item]++;
If c[item]>=min_sup, item is frequent

Finding frequent 2-itemsets using a 2-dimensional triangular matrix
  For items i, j (i < j), c[i, j] is the count of itemset ij
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

71

Counting Array
A 2-dimensional triangular matrix can be implemented using a 1-dimensional array
  There are n items
  For items i, j (i < j), c[i, j] = c[(i-1)(2n-i)/2 + j-i]
  Example (n = 5): c[3, 5] = c[(3-1)*(2*5-3)/2 + 5-3] = c[9]
[Figure: the upper-triangular 5 x 5 count matrix laid out as a 1-dimensional array of 10 cells.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)
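A tiny Python check of the index formula above; tri_index is a hypothetical helper name, not from the slides:

def tri_index(i, j, n):
    # map the pair (i, j), 1 <= i < j <= n, to its position in the 1-dimensional array
    return (i - 1) * (2 * n - i) // 2 + (j - i)

print(tri_index(3, 5, 5))   # 9, matching c[3,5] = c[9] in the example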
72

Example of Candidate-generation
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
  abcd ← abc * abd
  acde ← acd * ace
Pruning:
  acde is removed because ade is not in L3
C4 = {abcd}

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

73

How to Generate Candidates?


Suppose the items in Lk-1 are listed in an order
Step 1: self-join Lk-1
  INSERT INTO Ck
  SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1
  FROM Lk-1 p, Lk-1 q
  WHERE p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
  For each itemset c in Ck do
    For each (k-1)-subset s of c do
      if (s is not in Lk-1) then delete c from Ck
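A Python sketch of the same two steps on lexicographically sorted itemsets (tuples); the representation and function name are assumptions for illustration, and the example reuses L3 from the previous slide:

def gen_candidates(L_prev, k):
    prev = sorted(L_prev)
    Ck = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            p, q = prev[i], prev[j]
            # self-join: first k-2 items equal, last item of p smaller than last item of q
            if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]:
                cand = p + (q[k - 2],)
                # prune: every (k-1)-subset must be in L_{k-1}
                if all(cand[:m] + cand[m + 1:] in L_prev for m in range(k)):
                    Ck.append(cand)
    return Ck

L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
print(gen_candidates(L3, 4))   # [('a','b','c','d')]; acde is pruned because ade is not in L3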

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

74

How to Count Supports?


Why is counting supports of candidates a problem?
The total number of candidates can be very huge
One transaction may contain many candidates

Method
Candidate itemsets are stored in a hash-tree
A leaf node of hash-tree contains a list of itemsets and
counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a
transaction
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

75

Example: Counting Supports


Subset function: finds the candidates contained in a transaction by traversing the hash tree
[Figure: a hash tree over candidate 3-itemsets, with the hash function branching on items 1,4,7 / 2,5,8 / 3,6,9; leaves hold itemsets such as 234, 567, 145, 136, 124, 457, 125, 458, 159, 345, 356, 357, 689, 367, 368. For transaction 1 2 3 5 6, the subset function recursively expands 1+2356, 12+356, 13+56, … and visits only the leaves that can contain subsets of the transaction.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

76

Association Rules
Rule: c → am
  Support: 3 (i.e., the support of acm)
  Confidence: 75% (i.e., sup(acm) / sup(c))
Given a minimum support threshold and a minimum confidence threshold, find all association rules whose support and confidence pass the thresholds
Transaction database TDB
TID  Items bought
100  f, a, c, d, g, i, m, p
200  a, b, c, f, l, m, o
300  b, f, h, j, o
400  b, c, k, s, p
500  a, f, c, e, l, p, m, n
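A quick Python check of the support and confidence of c → am on the TDB above; the helper lambda is a hypothetical convenience, not from the slides:

db = [set('facdgimp'), set('abcflmo'), set('bfhjo'), set('bcksp'), set('afcelpmn')]
sup = lambda X: sum(1 for t in db if set(X) <= t)
print(sup('acm'))              # 3    (support of the rule)
print(sup('acm') / sup('c'))   # 0.75 (confidence of c -> am)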
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

77

Challenges of Freq Pat Mining


Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for
candidates

Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

78

Improving Apriori: Ideas


Reducing the number of transaction
database scans
Shrinking the number of candidates
Facilitating support counting of candidates

Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

79

Bottleneck of Freq Pattern Mining


Multiple database scans are costly
Mining long patterns needs many scans and
generates many candidates
To find the frequent itemset i1 i2 … i100
  # of scans: 100
  # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30

Bottleneck: candidate-generation-and-test

Can we avoid candidate generation?


Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

80

Search Space of Freq. Pat. Mining


Itemsets form a lattice
[Figure: the itemset lattice over {A, B, C, D}, from {} up to ABCD.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

81

Set Enumeration Tree


Use an order on items, enumerate itemsets in
lexicographic order
a, ab, abc, abcd, ac, acd, ad, b, bc, bcd, bd, c, cd, d
Reduce a lattice to a tree
[Figure: the set enumeration tree over {a, b, c, d}: a branches to ab, ac, ad; ab to abc, abd; abc to abcd; b branches to bc, bd; bc to bcd; c branches to cd.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

82

Borders of Frequent Itemsets


Frequent itemsets are connected
  ∅ is trivially frequent
  If X is on the border, every subset of X is frequent
[Figure: the set enumeration tree over {a, b, c, d} with the border separating the frequent itemsets from the infrequent ones.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

83

Projected Databases
To test whether Xy is frequent, we can use
the X-projected database
The sub-database of transactions containing X
Check whether item y is frequent in X-projected
database

[Figure: the set enumeration tree over {a, b, c, d}; the X-projected database is used to test the extensions of X.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

84

Compress Database by FP-tree


The 1st scan: find frequent items
  Only record frequent items in the FP-tree
  F-list: f-c-a-b-m-p
The 2nd scan: construct the tree
  Order the frequent items in each transaction w.r.t. the f-list
  Explore sharing among transactions
TID  Items bought             (ordered) frequent items
100  f, a, c, d, g, i, m, p   f, c, a, m, p
200  a, b, c, f, l, m, o      f, c, a, b, m
300  b, f, h, j, o            f, b
400  b, c, k, s, p            c, b, p
500  a, f, c, e, l, p, m, n   f, c, a, m, p
[Figure: the resulting FP-tree. The root has children f:4 and c:1; under f:4 are c:3 (then a:3, with m:2 → p:2 and b:1 → m:1) and b:1; under c:1 are b:1 → p:1. A header table over f, c, a, b, m, p links the nodes of each item.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)
85

Benefits of FP-tree
Completeness
  Never break a long pattern in any transaction
  Preserve complete information for frequent pattern mining
  No need to scan the database anymore
Compactness
  Reduce irrelevant info: infrequent items are removed
  Items in frequency-descending order (f-list): the more frequently occurring, the more likely to be shared
  Never larger than the original database (not counting node-links and the count fields)
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

86

Partitioning Frequent Patterns


Frequent patterns can be partitioned into
subsets according to f-list: f-c-a-b-m-p
Patterns containing p
Patterns having m but no p

Patterns having c but no a nor b, m, or p
Pattern f

Depth-first search of a set enumeration tree


The partitioning is complete and does not have
any overlap
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

87

Find Patterns Having Item p


Only transactions containing p are needed
Form the p-projected database
  Start at entry p of the header table
  Follow the side-links of frequent item p
  Accumulate all transformed prefix paths of p
p-projected database TDB|p
  fcam: 2
  cb: 1
Local frequent item: c:3
Frequent patterns containing p: p: 3, pc: 3
[Figure: the FP-tree and header table from the previous slide; the two p nodes (p:2 under f-c-a-m and p:1 under c-b) give the prefix paths.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)
88

Find Pat Having Item m But No p


Form the m-projected database TDB|m
  Item p is excluded (why?)
  TDB|m contains fca: 2, fcab: 1
  Local frequent items: f, c, a
Build the FP-tree for TDB|m
[Figure: the m-projected FP-tree, a single path root → f:3 → c:3 → a:3 with a header table over f, c, a, shown next to the full FP-tree.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)
89

Recursive Mining
Patterns having m but no p can be mined recursively
Optimization: enumerate patterns from a single-branch FP-tree
  Enumerate all combinations
  Support = that of the last item
    m, fm, cm, am
    fcm, fam, cam
    fcam
[Figure: the m-projected FP-tree, root → f:3 → c:3 → a:3, with its header table.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)
90

Enumerate Patterns From a Single Prefix of an FP-tree
A (projected) FP-tree may have a single prefix path
  Reduce the single prefix into one node
  Join the mining results of the two parts
[Figure: an FP-tree whose root is followed by a single prefix a1:n1 → a2:n2 → a3:n3 that then branches (b1:m1, c1:k1, c2:k2, c3:k3); it is split into the single-prefix part and the branching part r1, which are mined separately and their results joined.]

Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

91

FP-growth
Pattern-growth: recursively grow frequent patterns
by pattern and database partitioning
Algorithm
For each frequent item, construct its projected database,
and then its projected FP-tree
Repeat the process on each newly created projected
FP-tree
Until the resulting FP-tree is empty, or contains only one path (a single path generates all the combinations of its items, each of which is a frequent pattern)
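The sketch below mirrors the recursion of FP-growth but, to stay short, projects plain transaction lists instead of building FP-trees; that simplification, the names, and the toy call at the end are my own assumptions:

def pattern_growth(db, min_sup, suffix=()):
    # count local items in the (projected) database
    counts = {}
    for t in db:
        for item in set(t):
            counts[item] = counts.get(item, 0) + 1
    results = {}
    for item, c in counts.items():
        if c < min_sup:
            continue
        pattern = (item,) + suffix
        results[frozenset(pattern)] = c
        # project on this item and recurse; keeping only items before it in a fixed
        # order partitions the search space without overlap
        projected = [[x for x in t if counts.get(x, 0) >= min_sup and x < item]
                     for t in db if item in t]
        results.update(pattern_growth([t for t in projected if t], min_sup, pattern))
    return results

db = [list('facdgimp'), list('abcflmo'), list('bfhjo'), list('bcksp'), list('afcelpmn')]
print(pattern_growth(db, 3))   # includes {'f','c','a','m'} with support 3, i.e. fcam: 3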
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

92

Scaling up by DB Projection
What if an FP-tree cannot fit into memory?
Database projection
Partition a database into a set of projected
databases
Construct and mine FP-tree once the projected
database can fit into main memory
Heuristic: Projected database shrinks quickly in many
applications

Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

93

Parallel vs. Partition Projection


Parallel projection: form all projected databases at a time
Partition projection: propagate projections
Example (transaction DB: fcamp, fcabm, fb, cbp, fcamp):
  p-proj DB: fcam, cb, fcam
  m-proj DB: fcab, fca, fca
  b-proj DB: f, cb
  a-proj DB: fc
  c-proj DB: f
  f-proj DB: (empty)
  am-proj DB: fc, fc, fc
  cm-proj DB: f, f, f
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

94

Why Is FP-growth Efficient?


Divide-and-conquer strategy
Decompose both the mining task and DB
Lead to focused search of smaller databases

Other factors
No candidate generation nor candidate test
Database compression using FP-tree
No repeated scan of entire database
Basic operations counting local frequent items
and building FP-tree, no pattern search nor
pattern matching
Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

95

Major Costs in FP-growth


Poor locality of FP-trees
Low hit rate of cache

Building FP-trees
A stack of FP-trees

Redundant information
Transaction abcd appears in a-, ab-, abc-, ac-,
, c- projected databases and FP-trees

Jian Pei: CMPT 741/459 Frequent Pattern Mining (2)

96

Effectiveness of Freq Pat Mining


Too many patterns!
  A pattern a1a2…an contains 2^n − 1 sub-patterns
  Understanding many patterns is difficult or even impossible for human users

Non-focused mining
A manager may be only interested in patterns
involving some items (s)he manages
A user is often interested in patterns satisfying
some constraints
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

97

Itemset Lattice
Tid  Transaction (min_sup = 2)
10   ABD
20   ABC
30   AD
40   ABCD
50   CD
Length  Frequent itemsets
1       A, B, C, D
2       AB, AC, AD, BC, BD, CD
3       ABC, ABD, ACD
[Figure: the itemset lattice over {A, B, C, D}, from {} up to ABCD.]

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

98

Max-Patterns
Tid  Transaction (min_sup = 2)
10   ABD
20   ABC
30   AD
40   ABCD
50   CD
Length  Frequent itemsets
1       A, B, C, D
2       AB, AC, AD, BC, BD, CD
3       ABC, ABD
[Figure: the itemset lattice with the max-patterns highlighted.]

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

99

Borders and Max-patterns


Max-patterns: borders of frequent patterns
  Any subset of a max-pattern is frequent
  Any superset of a max-pattern is infrequent
  Cannot generate rules (the supports of subsets are not retained)
[Figure: the itemset lattice from {} to ABCD with the border between frequent and infrequent itemsets.]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

100

Patterns and Support Counts


Tid  Transaction (min_sup = 2)
10   ABD
20   ABC
30   AD
40   ABCD
50   CD
Len  Frequent itemsets
1    A:4, B:4, C:3, D:4
2    AB:3, AC:2, AD:3, BC:3, BD:2, CD:2
3    ABC:2, ABD:2
[Figure: the itemset lattice annotated with these support counts.]

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

101

Frequent Closed Patterns


For a frequent itemset X, if there exists no item y not in X such that every transaction containing X also contains y, then X is a frequent closed pattern
  acdf is a frequent closed pattern (min_sup = 2)
Concise representation of frequent patterns
  Can generate non-redundant rules
  Reduce # of patterns and rules
N. Pasquier et al., ICDT'99
TID  Items
10   a, c, d, e, f
20   a, b, e
30   c, e, f
40   a, c, d, f
50   c, e, f
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)
102

Closed and Max-patterns


Closed pattern mining algorithms can be
adapted to mine max-patterns
A max-pattern must be closed

Depth-first search methods have advantages


over breadth-first search ones
Why?

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

103

Constraint-based Data Mining


Find all the patterns in a database autonomously?
The patterns could be too many but not focused!

Data mining should be interactive


User directs what to be mined

Constraint-based mining
User flexibility: provides constraints on what to be mined
System optimization: push constraints for efficient mining

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

104

Constraints in Data Mining


Knowledge type constraint
classification, association, etc.

Data constraint using SQL-like queries


find product pairs sold together in stores in New York

Dimension/level constraint
in relevance to region, price, brand, customer category

Rule (or pattern) constraint


small sales (price < $10) triggers big sales (sum >$200)

Interestingness constraint
strong rules: support and confidence
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

105

Constrained Mining vs. Search


Constrained mining vs. constraint-based search
Both aim at reducing search space
Finding all patterns vs. some (or one) answers satisfying
constraints
Constraint-pushing vs. heuristic search
An interesting research problem on integrating both

Constrained mining vs. DBMS query processing


Database query processing requires to find all
Constrained pattern mining shares a similar philosophy
as pushing selections deeply in query processing
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

106

Optimization
Mining frequent patterns with constraint C
Sound: only find patterns satisfying the constraints C
Complete: find all patterns satisfying the constraints C

A nave solution
Constraint test as a post-processing

More efficient approaches


Analyze the properties of constraints
Push constraints as deeply as possible into frequent
pattern mining

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

107

Anti-Monotonicity
Anti-monotonicity: if an itemset S violates the constraint, so does any of its supersets
  sum(S.price) ≤ v is anti-monotone
  sum(S.price) ≥ v is not anti-monotone
Example: C: range(S.profit) ≤ 15
  Itemset ab violates C
  So does every superset of ab
TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g
Item profits: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)
108

Anti-monotonic Constraints
Constraint                              Anti-monotone
v ∈ S                                   no
S ⊇ V                                   no
S ⊆ V                                   yes
min(S) ≤ v                              no
min(S) ≥ v                              yes
max(S) ≤ v                              yes
max(S) ≥ v                              no
count(S) ≤ v                            yes
count(S) ≥ v                            no
sum(S) ≤ v (∀a ∈ S, a ≥ 0)              yes
sum(S) ≥ v (∀a ∈ S, a ≥ 0)              no
range(S) ≤ v                            yes
range(S) ≥ v                            no
avg(S) θ v, θ ∈ {=, ≤, ≥}               convertible
support(S) ≥ ξ                          yes
support(S) ≤ ξ                          no
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)
109

Monotonicity
Monotonicity: if an itemset S satisfies the constraint, so does any of its supersets
  sum(S.price) ≥ v is monotone
  min(S.price) ≤ v is monotone
Example: C: range(S.profit) ≥ 15
  Itemset ab satisfies C
  So does every superset of ab
TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g
Item profits: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)
110

Monotonic Constraints
Constraint                              Monotone
v ∈ S                                   yes
S ⊇ V                                   yes
S ⊆ V                                   no
min(S) ≤ v                              yes
min(S) ≥ v                              no
max(S) ≤ v                              no
max(S) ≥ v                              yes
count(S) ≤ v                            no
count(S) ≥ v                            yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0)              no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)              yes
range(S) ≤ v                            no
range(S) ≥ v                            yes
avg(S) θ v, θ ∈ {=, ≤, ≥}               convertible
support(S) ≥ ξ                          no
support(S) ≤ ξ                          yes
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)
111

Converting Tough Constraints


Convert tough constraints into anti-monotone or monotone by properly ordering items
Examine C: avg(S.profit) ≥ 25
  Order items in value-descending order R: <a, f, g, d, b, h, c, e>
If an itemset afb violates C
  So does afbh, afb*
  It becomes anti-monotone!
TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g
Item profits: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10
Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)
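A quick Python check of the example above; the profits are copied from the slide, while the code itself is only an illustrative sketch with made-up names:

profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10, 'e': -30, 'f': 30, 'g': 20, 'h': -10}
R = sorted(profit, key=profit.get, reverse=True)   # ['a','f','g','d','b','h','c','e']
avg = lambda S: sum(profit[x] for x in S) / len(S)
print(avg('afb'))    # 23.33 < 25: afb violates C, and so does any itemset
print(avg('afbh'))   # 15.0        having afb as a prefix w.r.t. R, e.g. afbh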
112

Convertible Constraints
Let R be an order of items
Convertible anti-monotone
If an itemset S violates a constraint C, so does every
itemset having S as a prefix w.r.t. R
Ex. avg(S) ≥ v w.r.t. item-value-descending order

Convertible monotone
If an itemset S satisfies constraint C, so does every
itemset having S as a prefix w.r.t. R
Ex. avg(S) ≤ v w.r.t. item-value-descending order

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

113

Strongly Convertible Constraints


avg(X) ≥ 25 is convertible anti-monotone w.r.t. item-value-descending order R: <a, f, g, d, b, h, c, e>
  If an itemset af violates constraint C, so does every itemset with af as a prefix, such as afd
avg(X) ≥ 25 is convertible monotone w.r.t. item-value-ascending order R^-1: <e, c, h, b, d, g, f, a>
  If an itemset d satisfies constraint C, so do itemsets df and dfa, which have d as a prefix
Item profits: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10
Thus, avg(X) ≥ 25 is strongly convertible


Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

114

Convertible Constraints
Constraint                                           Convertible      Convertible   Strongly
                                                     anti-monotone    monotone      convertible
avg(S) ≤ v, ≥ v                                      Yes              Yes           Yes
median(S) ≤ v, ≥ v                                   Yes              Yes           Yes
sum(S) ≤ v (items could be of any value, v ≥ 0)      Yes              No            No
sum(S) ≤ v (items could be of any value, v ≤ 0)      No               Yes           No
sum(S) ≥ v (items could be of any value, v ≥ 0)      No               Yes           No
sum(S) ≥ v (items could be of any value, v ≤ 0)      Yes              No            No

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

115

Can Apriori Handle Convertible Constraints?
A convertible constraint that is neither monotone nor anti-monotone nor succinct cannot be pushed deep into an Apriori mining algorithm
  Within the level-wise framework, no direct pruning based on the constraint can be made
  Itemset df violates constraint C: avg(X) ≥ 25
  Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
But it can be pushed into the frequent-pattern growth framework!
Item values: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

116

Mining With Convertible Constraints
C: avg(S.profit) ≥ 25
List the items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>
  C is convertible anti-monotone w.r.t. R
Scan the transaction DB once
  Remove infrequent items (item h in transaction 40 is dropped)
  Itemsets a and f are good
TDB (min_sup = 2), items listed in order R
TID  Transaction
10   a, f, d, b, c
20   f, g, d, b, c
30   a, f, d, c, e
40   f, g, h, c, e
Item profits (in order R): a: 40, f: 30, g: 20, d: 10, b: 0, h: -10, c: -20, e: -30

Jian Pei: CMPT 741/459 Frequent Pattern Mining (3)

117

Not Every Pattern Is Interesting!


Trivial patterns
  Pregnant → Female [100% confidence]
Misleading patterns
  Play basketball → eat cereal [40%, 66.7%]
              Basketball   Not basketball   Sum (row)
Cereal        2000         1750             3750
Not cereal    1000         250              1250
Sum (col.)    3000         2000             5000

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

118

Evaluation Criteria
Objective interestingness measures
Examples: support, patterns formed by mutually
independent items
Domain independent

Subjective measures
Examples: domain knowledge, templates/
constraints

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

119

Correlation and Lift


P(B|A)/P(B) is called the lift of rule A → B
corr(A,B) = P(A ∪ B) / (P(A) P(B)) = P(AB) / (P(A) P(B))
Play basketball → eat cereal (lift: 0.89)
Play basketball → not eat cereal (lift: 1.33)
Contingency table (a 2-way contingency table for variables A and B records the cell counts f11, f10, f01, f00 with row sums f1+, f0+ and column sums f+1, f+0):
              Basketball   Not basketball   Sum (row)
Cereal        2000         1750             3750
Not cereal    1000         250              1250
Sum (col.)    3000         2000             5000
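A two-line Python check of the basketball/cereal lift; the numbers are taken from the table above, the code itself is only a sketch:

n = 5000.0
lift = (2000 / n) / ((3000 / n) * (3750 / n))
print(round(lift, 2))   # 0.89: playing basketball and eating cereal are negatively correlated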

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

120

Property of Lift

If A and B are independent, lift = 1
If A and B are positively correlated, lift > 1
If A and B are negatively correlated, lift < 1
Limitation: lift is sensitive to P(A) and P(B)
Contingency tables for the word pairs {p, q} and {r, s} (Tan et al., Table 6.9):
        q     not q   sum            s     not s   sum
p       880   50      930      r     20    50      70
not p   50    20      70       not r 50    880     930
sum     930   70      1000     sum   70    930     1000
lift(p, q) < lift(r, s)!
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)
121

From Itemsets to Sequences


Itemsets: combinations of items, no temporal order
Temporal order is important in many situations
Time-series databases and sequence databases
Frequent patterns (frequent) sequential patterns

Applications of sequential pattern mining


Customer shopping sequences:
First buy computer, then iPod, and then digital camera, within 3
months.

Medical treatment, natural disasters, science and


engineering processes, stocks and markets, telephone
calling patterns, Web log clickthrough streams, DNA
sequences and gene structures
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

122

What Is Sequential Pattern Mining?


Given a set of sequences, find the complete
set of frequent subsequences
A sequence database
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
In the sequence <(ef)(ab)(df)cb>, an element may contain a set of items; items within an element are unordered and we list them alphabetically.

<a(bc)dc> is a subsequence
of <a(abc)(ac)d(cf)>

Given support threshold min_sup =2, <(ab)c> is a


sequential pattern
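A small Python sketch of the subsequence test underlying this definition (sequences as lists of frozensets; the representation and function name are my assumptions):

def is_subsequence(s, t):
    i = 0
    for element in t:                         # scan the data sequence left to right
        if i < len(s) and s[i] <= element:    # s[i] must be contained in one element of t
            i += 1
    return i == len(s)

seq = [frozenset('a'), frozenset('abc'), frozenset('ac'), frozenset('d'), frozenset('cf')]
pat = [frozenset('a'), frozenset('bc'), frozenset('d'), frozenset('c')]
print(is_subsequence(pat, seq))   # True: <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>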
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

123

Challenges in Seq Pat Mining


A huge number of possible sequential
patterns are hidden in databases
A mining algorithm should
Find the complete set of patterns satisfying the
minimum support (frequency) threshold
Be highly efficient, scalable, involving only a
small number of database scans
Be able to incorporate various kinds of user-specific constraints
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

124

Apriori Property of Seq Patterns


Apriori property in sequential patterns
  If a sequence S is infrequent, then none of the super-sequences of S is frequent
  E.g., <hb> is infrequent, so are <hab> and <(ah)b>
Given support threshold min_sup = 2
Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)
125

GSP
GSP (Generalized Sequential Pattern) mining
Outline of the method
Initially, every item in DB is a candidate of length-1
For each level (i.e., sequences of length-k) do
Scan database to collect support count for each candidate
sequence
Generate candidate length-(k+1) sequences from length-k
frequent sequences using Apriori

Repeat until no frequent sequence or no candidate can


be found

Major strength: Candidate pruning by Apriori


Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

126

Finding Len-1 Seq Patterns


Initial candidates: <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
Scan the database once, count support for each candidate (min_sup = 2):
Cand  Sup
<a>   3
<b>   5
<c>   4
<d>   3
<e>   3
<f>   2
<g>   1
<h>   1
Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)
127

Generating Length-2 Candidates


51 length-2 candidates
  36 of the form <xy> with x, y taken from {a, b, c, d, e, f}: <aa>, <ab>, …, <ff>
  15 of the form <(xy)> with x < y: <(ab)>, <(ac)>, …, <(ef)>
Without the Apriori property, 8*8 + 8*7/2 = 92 candidates would be generated
Apriori prunes 44.57% of the candidates
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)
128

Finding Len-2 Seq Patterns


Scan database one more time, collect
support count for each length-2 candidate
There are 19 length-2 candidates which
pass the minimum support threshold
They are length-2 sequential patterns

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

129

Generating Length-3 Candidates and


Finding Length-3 Patterns
Generate Length-3 Candidates
Self-join length-2 sequential patterns
<ab>, <aa> and <ba> are all length-2 sequential patterns → <aba> is a length-3 candidate
<(bd)>, <bb> and <db> are all length-2 sequential patterns → <(bd)b> is a length-3 candidate

46 candidates are generated

Find Length-3 Sequential Patterns


Scan database once more, collect support
counts for candidates
19 out of 46 candidates pass support threshold
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

130

The GSP Mining Process


5th scan: 1 cand. 1 length-5 seq.
pat.

<(bd)cba>

Cand. cannot pass


sup. threshold

Cand. not in DB at all


4th scan: 8 cand. 6 length-4 seq. <abba> <(bd)bc>
pat.
3rd scan: 46 cand. 19 length-3 seq. <abb> <aab> <aba> <baa> <bab>
pat. 20 cand. not in DB at all
2nd scan: 51 cand. 19 length-2 seq.
<aa> <ab> <af> <ba> <bb> <ff> <(ab)> <(ef)>
pat. 10 cand. not in DB at all
1st scan: 8 cand. 6 length-1 seq.
<a> <b> <c> <d> <e> <f> <g> <h>
pat.

min_sup
=2
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

Seq-id
10
20
30
40
50

Sequence

<(bd)cb(ac)>
<(bf)(ce)b(fg)>
<(ah)(bf)abf>
<(be)(ce)d>
<a(bd)bcb(ade)>
131

The GSP Algorithm


Take sequences in form of <x> as length-1
candidates
Scan database once, find F1, the set of length-1
sequential patterns
Let k=1; while Fk is not empty do
Form Ck+1, the set of length-(k+1) candidates from Fk;
If Ck+1 is not empty, scan database once, find Fk+1, the
set of length-(k+1) sequential patterns
Let k=k+1;

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

132

Bottlenecks of GSP
A huge set of candidates
  1,000 frequent length-1 sequences generate 1000 × 1000 + 1000 × 999 / 2 = 1,499,500 length-2 candidates!
Multiple scans of the database in mining
Real challenge: mining long sequential patterns
  An exponential number of short candidates
  A length-100 sequential pattern needs about 10^30 candidate sequences: Σ_{i=1}^{100} C(100, i) = 2^100 − 1 ≈ 10^30
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

133

FreeSpan: Freq Pat-projected


Sequential Pattern Mining
The itemset of a seq pat must be frequent
Recursively project a sequence database into a
set of smaller databases based on the current
set of frequent patterns
Mine each projected database to find its patterns
Sequence Database SDB
< (bd) c b (ac) >
< (bf) (ce) b (fg) >
< (ah) (bf) a b f >
< (be) (ce) d >
< a (bd) b c b (ade) >

f_list: b:5, c:4, a:3, d:3, e:3, f:2


All seq. pat. can be divided into 6 subsets:
Seq. pat. containing item f
Those containing e but no f
Those containing d but no e nor f
Those containing a but no d, e or f
Those containing c but no a, d, e or f
Those containing only item b

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

134

From FreeSpan to PrefixSpan


Freespan:
Projection-based: no candidate sequence needs
to be generated
But, projection can be performed at any point in
the sequence, and the projected sequences may
not shrink much

PrefixSpan
Projection-based
But only prefix-based projection: less projections
and quickly shrinking sequences
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

135

Prefix and Suffix (Projection)


<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
Given sequence <a(abc)(ac)d(cf)>:
Prefix   Suffix (prefix-based projection)
<a>      <(abc)(ac)d(cf)>
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

136

Mining Sequential Patterns by Prefix Projections
Step 1: find length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
  the ones having prefix <a>;
  the ones having prefix <b>;
  …
  the ones having prefix <f>
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

137

Finding Seq. Pat. with Prefix <a>


Only need to consider projections w.r.t. <a>
  <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
Further partition into 6 subsets:
  having prefix <aa>;
  …
  having prefix <af>
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)
138

Completeness of PrefixSpan
[Figure: how PrefixSpan partitions the mining task over the sequence database SDB (10 <a(abc)(ac)d(cf)>, 20 <(ad)c(bc)(ae)>, 30 <(ef)(ab)(df)cb>, 40 <eg(af)cbc>). The length-1 sequential patterns <a>, <b>, …, <f> split all patterns by prefix. The <a>-projected database <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc> yields the length-2 patterns <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, which are mined in turn via the <aa>-, …, <af>-projected databases; likewise for prefixes <b> through <f>.]

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

139

Efficiency of PrefixSpan
No candidate sequence needs to be
generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing
projected databases
Can be improved by bi-level projections
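A simplified PrefixSpan sketch in Python, restricted to sequences of single items so the prefix projection stays visible; the simplification, toy data, and names are my own assumptions:

def prefixspan(db, min_sup, prefix=()):
    patterns = {}
    counts = {}
    for s in db:                              # count items that can extend the prefix
        for item in set(s):
            counts[item] = counts.get(item, 0) + 1
    for item, c in sorted(counts.items()):
        if c < min_sup:
            continue
        new_prefix = prefix + (item,)
        patterns[new_prefix] = c
        # project: keep only the suffix after the first occurrence of the item
        projected = [s[s.index(item) + 1:] for s in db if item in s]
        patterns.update(prefixspan([s for s in projected if s], min_sup, new_prefix))
    return patterns

db = [list('abcb'), list('abbca'), list('bac'), list('abc')]
print(prefixspan(db, 3))   # includes ('a','b'): 3 and ('a','b','c'): 3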

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

140

Effectiveness
Redundancy due to anti-monotonicity
{<abcd>} leads to 15 sequential patterns of
same support
Closed sequential patterns and sequential
generators

Constraints on sequential patterns


Gap
Length
More sophisticated, application oriented
constraints
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4)

141

Data Warehousing & OLAP

Motivation: Business Intelligence


Customer information
(customer-id, gender, age,
home-address, occupation,
income, family-size, )

Product information
(Product-id, category,
manufacturer, made-in,
stock-price, )

Sales information
(customer-id, product-id, #units, unit-price,
sales-representative, )

Business queries:
Which categories of products are most popular for customers
Find pairs (customer groups, most popular products)
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

143

In what aspect is he most similar to cases of coronary artery disease and, at the same time, dissimilar to adiposity?
Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

144

Don't You Ever Google Yourself?
Big data makes one know oneself better
  57% of American adults search for themselves on the Internet
  Good news: those people are better paid than those who haven't done so! (Investors.com)
Egocentric analysis becomes more and more important with big data
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

145

Egocentric Analysis
How am I different from (more often than
not, better than) others?
In what aspects am I good?

http://img03.deviantart.net/a670/i/2010/219/a/e/glee___egocentric_by_gleeondoodles.jpg

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

146

Dimensions
An aspect or feature of a situation, problem, or
thing, a measurable extent of some kind
Dictionary
Dimensions/attributes are used to model
complex objects in a divide-and-conquer
manner
Objects are compared in selected dimensions/
attributes

More often than not, objects have more dimensions/attributes than one is interested in or can handle
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

147

Multi-dimensional Analysis
Find interesting patterns in multi-dimensional
subspaces
Michael Jordan is outstanding in subspaces (total
points, total rebounds, total assists) and (number of
games played, total points, total assists)

Different patterns may be manifested in


different subspaces
Feature selection (machine learning and statistics):
select a subset of relevant features for use in model
construction a set of features for all objects
Different subspaces may manifest different patterns
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

148

OLAP
Conceptually, we may explore all possible
subspaces for interesting patterns

What patterns are interesting?


How can we explore all possible subspaces
systematically and efficiently?
Fundamental problems in analytics and data
mining

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

149

OLAP
Aggregates and group-bys are frequently used in
data analysis and summarization
SELECT time, altitude, AVG(temp)
FROM weather GROUP BY time, altitude;
In TPC, 6 standard benchmarks have 83 queries,
aggregates are used 59 times, group-bys are used 20
times

Online analytical processing (OLAP): the


techniques that answer multi-dimensional
analytical (MDA) queries efficiently
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

150

OLAP Operations
Roll up (drill-up): summarize data by
climbing up hierarchy or by dimension
reduction
(Day, Store, Product type, SUM(sales)) → (Month, City, *, SUM(sales))

Drill down (roll down): reverse of roll-up,


from higher level summary to lower level
summary or detailed data, or introducing
new dimensions
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

151

Roll Up
http://www.tutorialspoint.com/dwh/images/rollup.jpg

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

152

Drill Down

http://www.tutorialspoint.com/dwh/images/drill_down.jpg

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

153

Other Operations
Dice: pick specific values or ranges on some
dimensions
Pivot: rotate a cube changing the order of
dimensions in visual analysis

http://en.wikipedia.org/wiki/File:OLAP_pivoting.png

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

154

Dice

http://www.tutorialspoint.com/dwh/images/dice.jpg

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

155

Relational Representation
If there are n dimensions, there are 2^n possible aggregation columns
Roll up by model by year by color in a table

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

156

Difficulties
Many group-bys are needed
  6 dimensions → 2^6 = 64 group-bys
In most SQL systems, the resulting query needs 64 scans of the data, 64 sorts or hashes, and a long wait!

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

157

Dummy Value ALL

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

158

DATA CUBE
SALES table
Model  Year  Color  Sales
Chevy  1990  red      5
Chevy  1990  white   87
Chevy  1990  blue    62
Chevy  1991  red     54
Chevy  1991  white   95
Chevy  1991  blue    49
Chevy  1992  red     31
Chevy  1992  white   54
Chevy  1992  blue    71
Ford   1990  red     64
Ford   1990  white   62
Ford   1990  blue    63
Ford   1991  red     52
Ford   1991  white    9
Ford   1991  blue    55
Ford   1992  red     27
Ford   1992  white   62
Ford   1992  blue    39

The CUBE operator computes every group-by over (Model, Year, Color):

SELECT Model, Year, Color, SUM(sales) AS Sales
FROM Sales
WHERE Model in {'Ford', 'Chevy'}
AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE(Model, Year, Color);
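A small Python sketch of the same cube computation, with 'ALL' standing in for the aggregated-out dimensions; the toy rows are the 1990 slice of the SALES table above, and the helper structure is my own assumption:

from itertools import combinations

rows = [('Chevy', 1990, 'red', 5),  ('Chevy', 1990, 'white', 87), ('Chevy', 1990, 'blue', 62),
        ('Ford',  1990, 'red', 64), ('Ford',  1990, 'white', 62), ('Ford',  1990, 'blue', 63)]

cube = {}
for r in rows:
    # each row contributes to every subset of the 3 dimensions (2^3 = 8 group-bys)
    for k in range(4):
        for keep in combinations(range(3), k):
            key = tuple(r[i] if i in keep else 'ALL' for i in range(3))
            cube[key] = cube.get(key, 0) + r[3]

print(cube[('Chevy', 1990, 'ALL')])   # 154 = 5 + 87 + 62
print(cube[('ALL', 'ALL', 'ALL')])    # 343, the grand total of the six rows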
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

The result, DATA CUBE of SALES, has 48 rows: the 18 (Model, Year, Color) groups above plus every roll-up using the dummy value ALL, e.g. (Chevy, 1990, ALL, 154), (Ford, 1990, ALL, 189), (Chevy, ALL, ALL, 508), (Ford, ALL, ALL, 433), and the grand total (ALL, ALL, ALL, 941).

159

Semantics of ALL
ALL is a set
Model.ALL = ALL(Model) = {Chevy, Ford }
Year.ALL = ALL(Year) = {1990,1991,1992}
Color.ALL = ALL(Color) = {red,white,blue}

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

160

OLTP Versus OLAP


                     OLTP                                   OLAP
users                clerk, IT professional                 knowledge worker
function             day-to-day operations                  decision support
DB design            application-oriented                   subject-oriented
data                 current, up-to-date, detailed,         historical, summarized, multidimensional,
                     flat relational, isolated              integrated, consolidated
usage                repetitive                             ad-hoc
access               read/write, index/hash on prim. key    lots of scans
unit of work         short, simple transaction              complex query
# records accessed   tens                                   millions
# users              thousands                              hundreds
DB size              100MB-GB                               100GB-TB
metric               transaction throughput                 query throughput, response time

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

161

What Is a Data Warehouse?


A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. (W. H. Inmon)
Data warehousing: the process of
constructing and using data warehouses

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

162

Subject-Oriented
Organized around major subjects, such as
customer, product, sales
Focusing on the modeling and analysis of
data for decision makers, not on daily
operations or transaction processing
Providing a simple and concise view around
particular subject issues by excluding data
that are not useful in the decision support
process
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

163

Integrated
Integrating multiple, heterogeneous data sources
Relational databases, flat files, on-line transaction
records

Data cleaning and data integration


Ensuring consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
E.g., Hotel price: currency, tax, breakfast covered, etc.

When data is moved to the warehouse, it is converted

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

164

Time Variant
The time horizon for the data warehouse is
significantly longer than that of operational systems
Operational databases: current value data
Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)

Every key structure in the data warehouse contains


an element of time, explicitly or implicitly
But the key of operational data may or may not contain
time element

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

165

Nonvolatile
A physically separate store of data
transformed from the operational
environment
Operational updates of data do not occur in
the data warehouse environment
Do not require transaction processing, recovery,
and concurrency control mechanisms
Require only two operations in data accessing
Initial loading of data
Access of data
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

166

Why Separate Data Warehouse?


High performance for both
Operational DBMS: tuned for OLTP
Warehouse: tuned for OLAP

Different functions and different data


Historical data: data analysis often uses
historical data that operational databases do not
typically maintain
Data consolidation: data analysis requires
consolidation (aggregation, summarization) of
data from heterogeneous sources
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1)

167

Data Warehouse Schema Design


Query answering efficiency
Subject orientation
Integration

Tradeoff between time and space


Universal table versus fully normalized schema

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

168

Star Schema

Dimension table time: time_key, day, day_of_the_week, month, quarter, year
Dimension table branch: branch_key, branch_name, branch_type
Dimension table item: item_key, item_name, brand, type, supplier_type
Dimension table location: location_key, street, city, state_or_province, country

Sales Fact Table: time_key, item_key, branch_key, location_key,
units_sold, dollars_sold, avg_sales (measures)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

169

Snowflake Schema

Dimension table time: time_key, day, day_of_the_week, month, quarter, year
Dimension table branch: branch_key, branch_name, branch_type
Dimension table item: item_key, item_name, brand, type, supplier_key
  Normalized into supplier: supplier_key, supplier_type
Dimension table location: location_key, street, city_key
  Normalized into city: city_key, city, state_or_province, country

Sales Fact Table: time_key, item_key, branch_key, location_key,
units_sold, dollars_sold, avg_sales (measures)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

170

Fact Constellation

Dimension tables shared by multiple fact tables:
  time: time_key, day, day_of_the_week, month, quarter, year
  branch: branch_key, branch_name, branch_type
  item: item_key, item_name, brand, type, supplier_type
  location: location_key, street, city, province_or_state, country
  shipper: shipper_key, shipper_name, location_key, shipper_type

Sales Fact Table: time_key, item_key, branch_key, location_key,
units_sold, dollars_sold, avg_sales (measures)

Shipping Fact Table: time_key, item_key, shipper_key, from_location,
to_location, dollars_cost, units_shipped (measures)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

171

(Good) Aggregate Functions


Distributive: there is a function G() such that
  F({X_{i,j}}) = G({ F({X_{i,j} | i = 1, ..., I_j}) | j = 1, ..., n })
  Examples: COUNT(), MIN(), MAX(), SUM()
  G = SUM() for F = COUNT()

Algebraic: there is an M-tuple valued function G() and a function H() such that
  F({X_{i,j}}) = H({ G({X_{i,j} | i = 1, ..., I_j}) | j = 1, ..., n })
  Examples: AVG(), standard deviation, MaxN(), MinN()
  For AVG(), G() records the (sum, count) of each partition; H() adds up the sums and
  counts and divides to produce the global average
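A minimal sketch of the distinction above: SUM is distributive (combine partial sums with G = SUM), while AVG is algebraic (each partition contributes a (sum, count) pair and H combines them). The partition data is made up for illustration.

partitions = [[3, 5, 8], [2, 7], [4, 4, 6, 1]]

# Distributive: F = SUM, combined by summing the per-partition results.
partial_sums = [sum(part) for part in partitions]
total_sum = sum(partial_sums)

# Algebraic: F = AVG; G records the 2-tuple (sum, count) per partition,
# H adds the components and divides to get the global average.
partials = [(sum(part), len(part)) for part in partitions]
global_avg = sum(s for s, _ in partials) / sum(c for _, c in partials)

assert total_sum == sum(sum(part) for part in partitions)
print(total_sum, global_avg)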
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

172

Holistic Aggregate Functions


There is no constant bound on the size of
the storage needed to describe a subaggregate.
There is no constant M, such that an M-tuple
characterizes the computation
F({Xi,j |i=1,...,I}).

Examples: Median(), MostFrequent() (also


called the Mode()), and Rank()
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

173

Index Requirements in OLAP


Data is read only
(Almost) no insertion or deletion

Query types
Point query: looking up one specific tuple (rare)
Range query: returning the aggregate of a
(large) set of tuples, with group by
Complex queries: need specific algorithms and
index structures, will be discussed later

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

174

OLAP Query Example


In table (cust, gender, ), find the total
number of male customers
Method 1: scan the table once
Method 2: build a B+ tree index on attribute
gender, still need to access all tuples of male
customers
Can we get the count without scanning many
tuples, even not all tuples of male
customers?
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

175

Bitmap Index
For n tuples, a bitmap index has n bits, which can be packed into ⌈n/8⌉ bytes or ⌈n/32⌉ words
From a bit back to the row-id: the j-th bit of the p-th byte corresponds to row-id = p*8 + j
Example: for a table (cust, gender) with rows Jack, Cathy, Nancy, one bitmap is kept per
gender value; the bitmap 1 0 0 marks the rows holding that value (here, only the first row)
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

176

Using Bitmap to Count


shcount[] is a precomputed table: shcount[v] is the number of 1-bits in the byte value v
Example: shcount[01100101] = 4

/* B[] holds the bitmap packed into SHNUM bytes */
count = 0;
for (i = 0; i < SHNUM; i++)
    count += shcount[B[i]];
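A minimal Python sketch of the same byte-wise counting idea: pack a bitmap into bytes and count 1-bits with a precomputed table (the slides' shcount[]). The bitmap contents are made up; in the OLAP example it would mark, say, the male customers.

shcount = [bin(v).count("1") for v in range(256)]   # shcount[v] = # of 1-bits in byte v

def make_bitmap(bits):
    """Pack a list of 0/1 flags (one per tuple) into a bytearray, 8 tuples per byte."""
    packed = bytearray((len(bits) + 7) // 8)
    for row_id, bit in enumerate(bits):
        if bit:
            packed[row_id // 8] |= 1 << (row_id % 8)
    return packed

def count_ones(bitmap):
    return sum(shcount[b] for b in bitmap)

gender_is_male = make_bitmap([1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
print(count_ones(gender_is_male))   # 5, without touching the base tuples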

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

177

Advantages of Bitmap Index


Efficient in space
Ready for logic composition
C = C1 AND C2
Bitmap operations can be used

Bitmap index only works for categorical data


with low cardinality
Naively, we need 50 bits per entry to represent
the state of a customer in US
How to represent a sale in dollars?
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

178

Bit-Sliced Index
A sale amount can be written as an integer
number of pennies, and then be represented
as a binary number of N bits
24 bits is good for up to $167,772.15,
appropriate for many stores

A bit-sliced index is a set of N bitmaps B_0, ..., B_{N-1}
  Tuple j sets bit j in bitmap B_k iff the k-th bit of its binary representation is 1
  The space cost of a bit-sliced index is the same as storing the data directly
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

179

Using Indexes
SELECT SUM(sales) FROM Sales WHERE C;
The tuples satisfying C are identified by a bitmap B

Direct access to rows to calculate SUM: scan the whole table once
B+ tree: find the tuples from the tree
Projection index: scan only attribute sales
Bit-sliced index: get the sum as Σ_k COUNT(B AND B_k) * 2^k
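A minimal sketch of answering SELECT SUM(sales) WHERE C with a bit-sliced index: B marks the tuples satisfying C, B_k marks the tuples whose k-th bit is 1, and SUM = Σ_k COUNT(B AND B_k) * 2^k. Bitmaps are plain Python ints here for brevity, and the data is made up.

sales = [5, 87, 62, 54, 95, 49]          # sale amounts (e.g., in pennies)
satisfies_c = [1, 0, 1, 1, 0, 1]         # tuples selected by the condition C

N = 8                                     # number of bit slices (enough for values < 256)
B = sum(1 << j for j, f in enumerate(satisfies_c) if f)
B_k = [sum(1 << j for j, v in enumerate(sales) if (v >> k) & 1) for k in range(N)]

total = sum(bin(B & B_k[k]).count("1") * (1 << k) for k in range(N))
assert total == sum(v for v, f in zip(sales, satisfies_c) if f)
print(total)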
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

180

Cost Comparison
Traditional value-list index (B+ tree) is costly
in both I/O and CPU time
Not good for OLAP

Bit-sliced index is efficient in I/O


Other case studies in [O Neil and Quass,
SIGMOD 97]

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

181

Horizontal or Vertical Storage


A fact table for data warehousing is often fat
Tens or even hundreds of dimensions/attributes

A query is often about only a few attributes


Horizontal storage: tuples are stored one by one
Vertical storage: tuples are stored by attributes
(Illustration: a table with attributes A1, A2, ..., A100 and tuples (x1, ..., x100), ..., (z1, ..., z100);
horizontal storage keeps each tuple together, while vertical storage keeps each attribute column together.)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

182

Horizontal Versus Vertical


Find the information of tuple t
Typical in OLTP
Horizontal storage: get the whole tuple in one search
Vertical storage: search 100 lists

Find SUM(a100) GROUP BY {a22, a83}

Typical in OLAP
Horizontal storage (no index): search all tuples O(100n),
where n is the number of tuples
Vertical storage: search 3 lists O(3n), 3% of the
horizontal storage method

Projection index: vertical storage


Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2)

183

MOLAP
(Illustration: a 3-D data cube stored as a multidimensional array, with dimensions Date (1Qtr-4Qtr plus sum),
Product (TV, PC, VCR plus sum), and Country (U.S.A, Canada, Mexico plus sum); each cell holds the aggregated measure.)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

184

Pros and Cons

Easy to implement
Fast retrieval
Many entries may be empty if data is sparse
Costly in space

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

185

ROLAP Data Cube in Table


A multi-dimensional database
Base table
Dimensions

Measure
Dimensions

Measure

Store

Product

Season

Sales

S1

P1

Spring

Store

S1

P2

Spring

12

S1

P1

Spring

S2

P1

Fall

S1

P2

Spring

12

S2

P1

Fall

S1

Spring

Cubing

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

Product Season AVG(Sales)

186

Data Cube: A Lattice of Cuboids

0-D (apex) cuboid: all
1-D cuboids: time, item, location, supplier
2-D cuboids: (time,item), (time,location), (time,supplier), (item,location), (item,supplier), (location,supplier)
3-D cuboids: (time,item,location), (time,item,supplier), (time,location,supplier), (item,location,supplier)
4-D (base) cuboid: (time, item, location, supplier)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

187

Data Cube: A Lattice of Cuboids

0-D (apex) cuboid: all
1-D cuboids: time, item, location, supplier
2-D cuboids: (time,item), (time,location), (time,supplier), (item,location), (item,supplier), (location,supplier)
3-D cuboids: (time,item,location), (time,item,supplier), (time,location,supplier), (item,location,supplier)
4-D (base) cuboid: (time, item, location, supplier)

Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
  Base cell: (9/15, milk, Urbana, Dairy_land)
  Aggregate cells: (9/15, milk, Urbana, *), (*, milk, Urbana, *), (*, milk, Chicago, *), (*, milk, *, *)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

188

Full Cube vs. Iceberg Cube


Full cube vs. iceberg cube: an iceberg cube materializes only the cells that satisfy an
iceberg condition (the HAVING clause below)

compute cube sales_iceberg as
select month, city, customer_group, count(*)
from salesInfo
cube by month, city, customer_group
having count(*) >= min_support

Avoid explosive growth: consider a cube with 100 dimensions
  2 base cells: (a1, a2, ..., a100) and (b1, b2, ..., b100)
  How many aggregate cells if having count(*) >= 1?
  What about having count(*) >= 2?

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

189

Multi-Way Array Aggregation


Array-based bottom-up
algorithm
Using multi-dimensional chunks
No direct tuple comparisons
Simultaneous aggregation on
multiple dimensions
Intermediate aggregate values
are re-used for computing
ancestor cuboids
Cannot do Apriori pruning: No
iceberg optimization
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

(Cuboid lattice: ABC at the base; AB, AC, BC above it; All at the apex.)

Multi-way Array Aggregation for


Cube Computation (MOLAP)
Partition arrays into chunks (a small subcube which fits in memory).
Compressed sparse array addressing: (chunk_id, offset)
Compute aggregates in multiway by visiting cube cells in the order which
minimizes the # of times to visit each cell, and reduces memory access & storage
cost.
(Illustration: a 3-D array with dimensions A (a0-a3), B (b0-b3), and C (c0-c3), partitioned into 64 chunks.)

What is the best traversing order to do multi-way aggregation?
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

191

Multi-way Array Aggregation for Cube Computation (3-D to 2-D)

(Illustration: the 3-D cuboid ABC is aggregated into the 2-D cuboids AB, AC, and BC, and then toward All.)

The best order is the one that minimizes the memory requirement and reduces I/Os

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

192

Multi-way Array Aggregation for Cube Computation (2-D to 1-D)

(Illustration: the 2-D cuboids AB, AC, and BC are aggregated further toward the 1-D cuboids and the apex All.)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

193

Multi-Way Array Aggregation for


Cube Computation
Method: the planes should be sorted and
computed according to their size in ascending
order
Idea: keep the smallest plane in the main memory,
fetch and compute only one chunk at a time for the
largest plane

Limitation of the method: computing well only


for a small number of dimensions
If there are a large number of dimensions, topdown computation and iceberg cube computation
methods can be explored
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

194

Iceberg Cube
In a data cube, many aggregate cells are
trivial
Having an aggregate too small

Iceberg query

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

195

Monotonic Iceberg Condition


If COUNT(a, b, *)<100, then COUNT(a, b,
c)<100 for any c
For cells c1 and c2, c1 is called an ancestor
of c2 if in all dimensions that c1 takes a non-*
value, c2 agrees with c1
(a,b,*) is an ancestor of (a,b,c)

An iceberg condition P is monotonic if for


any aggregate cell c failing P, any
descendants of c cannot honor P
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

196

BUC
Once a base table (A, B, C) is sorted in the order A-B-C, the aggregates (*,*,*), (A,*,*),
(A,B,*), and (A,B,C) can be computed with one scan and 4 counters
To compute the other aggregates, we can sort the base table in other orders
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

197

Example

Threshold: sum() >= 300

Location    Year  Color   Amount
Vancouver   2015  Yellow  300
Victoria    2014  Red     400
Seattle     2015  Green   120
Vancouver   2014  Green   260
Seattle     2015  Red     160
Vancouver   2014  Yellow  280
Vancouver   2015  Red     160
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

198

Example: Sorting on Location

Location    Year  Color   Amount
Seattle     2015  Green   120
Seattle     2015  Red     160
Vancouver   2015  Yellow  300
Vancouver   2014  Yellow  280
Vancouver   2015  Red     160
Vancouver   2014  Green   260
Victoria    2014  Red     400

Sum(Seattle, *, *) = 280
Sum(Vancouver, *, *) = 1000
Sum(Victoria, *, *) = 400

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

199

Sorting on Year for Vancouver

Location    Year  Color   Amount
Seattle     2015  Green   120
Seattle     2015  Red     160
Vancouver   2014  Yellow  280
Vancouver   2014  Green   260
Vancouver   2015  Yellow  300
Vancouver   2015  Red     160
Victoria    2014  Red     400

Sum(Vancouver, 2014, *) = 540
Sum(Vancouver, 2015, *) = 460

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

200

Color on Vancouver & 2014/2015

Location    Year  Color   Amount
Seattle     2015  Green   120
Seattle     2015  Red     160
Vancouver   2014  Green   260
Vancouver   2014  Yellow  280
Vancouver   2015  Red     160
Vancouver   2015  Yellow  300
Victoria    2014  Red     400

Sum(Vancouver, 2014, Yellow) = 280
Sum(Vancouver, 2014, Green) = 260
Sum(Vancouver, 2015, Yellow) = 300
Sum(Vancouver, 2015, Red) = 160
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

201

Sort on Color for Vancouver

Location    Year  Color   Amount
Seattle     2015  Green   120
Seattle     2015  Red     160
Vancouver   2014  Green   260
Vancouver   2015  Red     160
Vancouver   2014  Yellow  280
Vancouver   2015  Yellow  300
Victoria    2014  Red     400

Sum(Vancouver, *, Green) = 260
Sum(Vancouver, *, Red) = 160
Sum(Vancouver, *, Yellow) = 580

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

202

How to Sort the Base Table?


General sorting in main memory: O(n log n)
Counting in main memory: O(n), linear in the number of tuples in the base table
How to sort 1 million integers in the range 0 to 100?
  Set up one counter per possible value and initialize them to 0
  Scan the integers once, counting the occurrences of each value
  Scan the integers again, putting each integer in its right place
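A minimal sketch of the counting idea above: sorting 1 million integers in the range 0..100 in linear time with one counter per value, the kind of value-counting pass BUC uses instead of a general O(n log n) sort.

import random

values = [random.randint(0, 100) for _ in range(1_000_000)]

counts = [0] * 101                 # one counter per possible value
for v in values:                   # first scan: count occurrences
    counts[v] += 1

sorted_values = []                 # second scan: emit each value count[v] times
for v, c in enumerate(counts):
    sorted_values.extend([v] * c)

assert sorted_values == sorted(values)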
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

203

Pushing Monotonic Conditions


BUC searches the
aggregates bottom-up
in depth-first manner
Only when a
monotonic condition
holds, the descendants
of the current node
should be expanded

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3)

204

Clustering

Community Detection

http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-socialmedia-1-728.jpg?cb=1308736811

Jian Pei: CMPT 741/459 Clustering (1)

206

Customer Relation Management


Partitioning customers into groups such that
customers within a group are similar in some
aspects
A manager can be assigned to a group
Customized products and services can be
developed

Jian Pei: CMPT 741/459 Clustering (1)

207

What Is Clustering?
Group data into clusters
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Unsupervised learning: no predefined classes
Outliers
Cluster 1
Cluster 2

Jian Pei: CMPT 741/459 Clustering (1)

208

Requirements of Clustering

Scalability
Ability to deal with various types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge
to determine input parameters

Jian Pei: CMPT 741/459 Clustering (1)

209

Data Matrix
For memory-based clustering
Also called object-by-variable structure

Represents n objects with p variables


(attributes, measures)
A relational table

Jian Pei: CMPT 741/459 Clustering (1)

    x_11  ...  x_1f  ...  x_1p
    ...        ...        ...
    x_i1  ...  x_if  ...  x_ip
    ...        ...        ...
    x_n1  ...  x_nf  ...  x_np

210

Dissimilarity Matrix
For memory-based clustering
Also called object-by-object structure
Proximities of pairs of objects
d(i, j): the dissimilarity between objects i and j
  Nonnegative; close to 0 means very similar
The matrix is lower triangular:
    0
    d(2,1)   0
    d(3,1)   d(3,2)   0
    ...
    d(n,1)   d(n,2)   ...   0

Jian Pei: CMPT 741/459 Clustering (1)

211

How Good Is Clustering?


Dissimilarity/similarity depends on distance
function
Different applications have different functions

Judgment of clustering quality is typically


highly subjective

Jian Pei: CMPT 741/459 Clustering (1)

212

Types of Data in Clustering

Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types

Jian Pei: CMPT 741/459 Clustering (1)

213

Interval-valued Variables
Continuous measurements of a roughly
linear scale
Weight, height, latitude and longitude
coordinates, temperature, etc.

Effect of measurement units in attributes


Smaller unit larger variable range larger
effect to the result
Standardization + background knowledge

Jian Pei: CMPT 741/459 Clustering (1)

214

Standardization
Calculate the mean and the mean absolute deviation
  m_f = (1/n)(x_1f + x_2f + ... + x_nf)
  s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)
Calculate the standardized measurement (z-score)
  z_if = (x_if - m_f) / s_f
Mean absolute deviation is more robust than the standard deviation
  The effect of outliers is reduced but remains detectable

Jian Pei: CMPT 741/459 Clustering (1)

215

Similarity and Dissimilarity

Distances are normally used as measures
Minkowski distance: a generalization
  d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)   (q > 0)
  If q = 2, d is the Euclidean distance
  If q = 1, d is the Manhattan distance
  If q = ∞, d is the Chebyshev distance
Weighted distance
  d(i, j) = (w_1|x_i1 - x_j1|^q + w_2|x_i2 - x_j2|^q + ... + w_p|x_ip - x_jp|^q)^(1/q)   (q > 0)
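A minimal sketch of the Minkowski distance family above: q = 1 gives the Manhattan distance, q = 2 the Euclidean distance, and the limit q -> ∞ the Chebyshev distance. The example points are made up.

def minkowski(x, y, q=2.0, weights=None):
    w = weights or [1.0] * len(x)
    if q == float("inf"):                      # Chebyshev distance
        return max(wi * abs(a - b) for wi, a, b in zip(w, x, y))
    return sum(wi * abs(a - b) ** q for wi, a, b in zip(w, x, y)) ** (1.0 / q)

i, j = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(i, j, q=1))             # 7.0  (Manhattan)
print(minkowski(i, j, q=2))             # 5.0  (Euclidean)
print(minkowski(i, j, q=float("inf")))  # 4.0  (Chebyshev)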

Jian Pei: CMPT 741/459 Clustering (1)

216

Manhattan and Chebyshev Distance

Chebyshev Distance
Manhattan Distance
When n = 2, chess-distance
Picture from Wekipedia
Jian Pei: CMPT 741/459 Clustering (1)

http://brainking.com/images/rules/chess/02.gif
217

Properties of Minkowski Distance


Nonnegative: d(i,j) 0
The distance of an object to itself is 0
d(i,i) = 0

Symmetric: d(i,j) = d(j,i)


Triangular inequality
d(i,j) d(i,k) + d(k,j)

j
k

Jian Pei: CMPT 741/459 Clustering (1)

218

Binary Variables

A contingency table for binary data:

                       Object j
                       1       0       Sum
  Object i    1        q       r       q+r
              0        s       t       s+t
              Sum      q+s     r+t     p

Symmetric variable: each state carries the same weight
  Invariant similarity: d(i, j) = (r + s) / (q + r + s + t)
Asymmetric variable: the positive value carries more weight
  Noninvariant similarity (Jaccard): d(i, j) = (r + s) / (q + r + s)
Jian Pei: CMPT 741/459 Clustering (1)

219

Nominal Variables

A generalization of the binary variable: it can take more than 2 states, e.g., red, yellow, blue, green
Method 1: simple matching
  d(i, j) = (p - m) / p, where m is the # of matches and p is the total # of variables
Method 2: use a large number of binary variables
  Create a new binary variable for each of the M nominal states
Jian Pei: CMPT 741/459 Clustering (1)

220

Ordinal Variables

An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled variables
  Replace x_if by its rank r_if ∈ {1, ..., M_f}
  Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    z_if = (r_if - 1) / (M_f - 1)
  Compute the dissimilarity using methods for interval-scaled variables
Jian Pei: CMPT 741/459 Clustering (1)

221

Ratio-scaled Variables
Ratio-scaled variable: a positive
measurement on a nonlinear scale
E.g., approximately at exponential scale, such
as AeBt

Treat them like interval-scaled variables?


Not a good choice: the scale can be distorted!

Apply logarithmic transformation, yif = log(xif)


Treat them as continuous ordinal data, treat
their rank as interval-scaled
Jian Pei: CMPT 741/459 Clustering (1)

222

Variables of Mixed Types


A database may contain all the six types of
variables
Symmetric binary, asymmetric binary, nominal,
ordinal, interval and ratio

One may use a weighted formula to combine their effects
  d(i, j) = ( Σ_{f=1..p} δ_ij^(f) d_ij^(f) ) / ( Σ_{f=1..p} δ_ij^(f) )
  where δ_ij^(f) indicates whether variable f contributes to the comparison of i and j
Jian Pei: CMPT 741/459 Clustering (1)

223

Clustering Methods

K-means and partitioning methods


Hierarchical clustering
Density-based clustering
Grid-based clustering
Pattern-based clustering
Other clustering methods

Jian Pei: CMPT 741/459 Clustering (1)

224

Partitioning Algorithms: Ideas


Partition n objects into k clusters
Optimize the chosen partitioning criterion

Global optimal: examine all possible partitions
  There are exponentially many (on the order of k^n) possible partitions, too expensive!

Heuristic methods: k-means and k-medoids


K-means: a cluster is represented by the center
K-medoids or PAM (partition around medoids): each
cluster is represented by one of the objects in the cluster

Jian Pei: CMPT 741/459 Clustering (1)

225

K-means
Arbitrarily choose k objects as the initial
cluster centers
Until no change, do
(Re)assign each object to the cluster to which
the object is the most similar, based on the
mean value of the objects in the cluster
Update the cluster means, i.e., calculate the
mean value of the objects for each cluster
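A minimal sketch of the k-means loop just described (not the course's own code): pick k arbitrary initial centers, assign each object to the most similar center, update the cluster means, and repeat until the assignment no longer changes. The sample points are made up.

import random

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]     # arbitrary initial centers
    assignment = [-1] * len(points)
    while True:
        # (Re)assign each object to the most similar (closest) center.
        new_assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_assignment == assignment:
            return centers, assignment
        assignment = new_assignment
        # Update the cluster means.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]

data = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, labels = kmeans(data, k=2)
print(centers, labels)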

Jian Pei: CMPT 741/459 Clustering (1)

226

K-Means: Example (K = 2)

(Illustration: arbitrarily choose K objects as the initial cluster centers; assign each object to the most
similar center; update the cluster means; reassign the objects to the new centers; repeat updating the means
and reassigning until the assignment no longer changes.)

Jian Pei: CMPT 741/459 Clustering (1)

227

Pros and Cons of K-means


Relatively efficient: O(tkn)
n: # objects, k: # clusters, t: # iterations; k, t <<
n.

Often terminate at a local optimum


Applicable only when mean is defined
What about categorical data?

Need to specify the number of clusters


Unable to handle noisy data and outliers
Unsuitable to discover non-convex clusters
Jian Pei: CMPT 741/459 Clustering (1)

228

Variations of the K-means


Aspects of variations
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means

Handling categorical data: k-modes


Use mode instead of mean
Mode: the most frequent item(s)

A mixture of categorical and numerical data: k-prototype


method

EM (expectation maximization): assign a


probability of an object to a cluster (will be
discussed later)
Jian Pei: CMPT 741/459 Clustering (1)

229

A Problem of K-means

Sensitive to outliers
  Outlier: objects with extremely large values
  May substantially distort the distribution of the data
K-medoids: use the most centrally located object in a cluster as its representative

(Illustration: a single outlier "+" pulls the k-means center away from the bulk of the cluster.)

Jian Pei: CMPT 741/459 Clustering (1)

230

PAM: A K-medoids Method


PAM: Partitioning Around Medoids
  Arbitrarily choose k objects as the initial medoids
  Until no change, do
    (Re)assign each object to the cluster with the nearest medoid
    Randomly select a non-medoid object o'; compute the total cost S of swapping a medoid o with o'
    If S < 0, then swap o with o' to form the new set of k medoids

Jian Pei: CMPT 741/459 Clustering (1)

231

Swapping Cost
Measure whether o' is better than o as a medoid
Use the squared-error criterion
  E = Σ_{i=1..k} Σ_{p ∈ C_i} d(p, o_i)^2
Compute E_{o'} - E_o
  Negative: swapping brings benefit

Jian Pei: CMPT 741/459 Clustering (1)

232

PAM: Example (K = 2)

(Illustration: arbitrarily choose k objects as the initial medoids and assign each remaining object to the
nearest medoid (total cost = 20); randomly select a non-medoid object O_random and compute the total cost of
swapping a medoid O with O_random (total cost = 26); if the quality is improved, perform the swap; repeat the
loop until no change.)

Jian Pei: CMPT 741/459 Clustering (1)

233

Pros and Cons of PAM


PAM is more robust than k-means in the
presence of noise and outliers
Medoids are less influenced by outliers

PAM is efficient for small data sets but does


not scale well for large data sets
O(k(n-k)2) for each iteration

Jian Pei: CMPT 741/459 Clustering (1)

234

Hierarchy
An arrangement or classification of things
according to inclusiveness
A natural way of abstraction, summarization,
compression, and simplification for
understanding
Typical setting: organize a given set of
objects to a hierarchy
No or very little supervision
Some heuristic quality guidances on the quality
of the hierarchy
Jian Pei: CMPT 459/741 Clustering (2)

235

Hierarchical Clustering
Group data objects into a tree of clusters
Top-down versus bottom-up
Step 0 -> Step 4 (agglomerative, AGNES): a and b merge into ab; d and e merge into de; c joins de to form
cde; ab and cde merge into abcde
Step 4 -> Step 0 (divisive, DIANA): the same hierarchy is produced top-down by splitting

Jian Pei: CMPT 459/741 Clustering (2)

236

AGNES (Agglomerative Nesting)


Initially, each object is a cluster
Step-by-step cluster merging, until all objects
form a cluster
Single-link approach
Each cluster is represented by all of the objects
in the cluster
The similarity between two clusters is measured
by the similarity of the closest pair of data points
belonging to different clusters
Jian Pei: CMPT 459/741 Clustering (2)

237

Dendrogram
Show how to merge clusters
hierarchically
Decompose data objects into a multilevel nested partitioning (a tree of
clusters)
A clustering of the data objects: cutting
the dendrogram at the desired level
Each connected component forms a cluster
Jian Pei: CMPT 459/741 Clustering (2)

238

DIANA (Divisive ANAlysis)


Initially, all objects are in one cluster
Step-by-step splitting clusters until each
cluster contains only one object
(Illustration: starting from a single cluster containing all objects, clusters are split step by step until
each object forms its own cluster.)

Jian Pei: CMPT 459/741 Clustering (2)

239

Distance Measures

Minimum distance: d_min(C_i, C_j) = min_{p ∈ C_i, q ∈ C_j} d(p, q)
Maximum distance: d_max(C_i, C_j) = max_{p ∈ C_i, q ∈ C_j} d(p, q)
Mean distance: d_mean(C_i, C_j) = d(m_i, m_j)
Average distance: d_avg(C_i, C_j) = (1 / (n_i n_j)) Σ_{p ∈ C_i} Σ_{q ∈ C_j} d(p, q)

C: a cluster; m: the mean of a cluster; n: the number of objects in a cluster
Jian Pei: CMPT 459/741 Clustering (2)

240

Challenges
Hard to choose merge/split points
Never undo merging/splitting
Merging/splitting decisions are critical

High complexity O(n2)


Integrating hierarchical clustering with other
techniques
BIRCH, CURE, CHAMELEON, ROCK

Jian Pei: CMPT 459/741 Clustering (2)

241

BIRCH
Balanced Iterative Reducing and Clustering
using Hierarchies
CF (Clustering Feature) tree: a hierarchical
data structure summarizing object
information
Clustering objects clustering leaf nodes of the
CF tree

Jian Pei: CMPT 459/741 Clustering (2)

242

Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)
  N: the number of data points
  LS: the linear sum of the N points, Σ_{i=1..N} o_i
  SS: the square sum of the N points, Σ_{i=1..N} o_i^2
Example: CF = (5, (16,30), (54,190))

(Illustration: five 2-D points whose clustering feature is the CF above.)

Jian Pei: CMPT 459/741 Clustering (2)

243

CF-tree in BIRCH
Clustering features
  Summarize the statistics for a cluster
  Many cluster quality measures (e.g., radius, distance) can be derived
  Additivity: CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2)
A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
  A nonleaf node in the tree has descendants or children
  The nonleaf nodes store the sums of the CFs of their children

Jian Pei: CMPT 459/741 Clustering (2)

244

CF Tree

B = 7 (branching factor), L = 6 (maximum number of entries in a leaf)

Root: entries CF_1, CF_2, ..., CF_6, each with a pointer to a child
Non-leaf nodes: entries CF_1, CF_2, CF_3, ..., each with a pointer to a child
Leaf nodes: sequences of CF entries, linked to their neighbors by prev/next pointers

Jian Pei: CMPT 459/741 Clustering (2)

245

Parameters of a CF-tree
Branching factor: the maximum number of
children
Threshold: max diameter of sub-clusters
stored at the leaf nodes

Jian Pei: CMPT 459/741 Clustering (2)

246

BIRCH Clustering
Phase 1: scan DB to build an initial inmemory CF tree (a multi-level compression
of the data that tries to preserve the inherent
clustering structure of the data)
Phase 2: use an arbitrary clustering
algorithm to cluster the leaf nodes of the CFtree

Jian Pei: CMPT 459/741 Clustering (2)

247

Pros & Cons of BIRCH


Linear scalability
Good clustering with a single scan
Quality can be further improved by a few
additional scans

Can handle only numeric data


Sensitive to the order of the data records

Jian Pei: CMPT 459/741 Clustering (2)

248

Distance-based Methods: Drawbacks


Hard to find clusters with irregular shapes
Hard to specify the number of clusters
Heuristic: a cluster must be dense

Jian Pei: CMPT 459/741 Clustering (3)

249

How to Find Irregular Clusters?


Divide the whole space into many small
areas
The density of an area can be estimated
Areas may or may not be exclusive
A dense area is likely in a cluster

Start from a dense area, traverse connected


dense areas and discover clusters in
irregular shape
Jian Pei: CMPT 459/741 Clustering (3)

250

Directly Density Reachable


Parameters (example: MinPts = 3, Eps = 1 cm)
  Eps: the maximum radius of the neighborhood
  MinPts: the minimum number of points in an Eps-neighborhood of a point
N_Eps(p) = {q | dist(p, q) <= Eps}
Core object p: |N_Eps(p)| >= MinPts
  A core object is in a dense area
Point q is directly density-reachable from p iff q ∈ N_Eps(p) and p is a core object
Jian Pei: CMPT 459/741 Clustering (3)

251

Density-Based Clustering
Density-reachable
  If p_1 -> p_2, p_2 -> p_3, ..., p_{n-1} -> p_n are each directly density-reachable steps,
  then p_n is density-reachable from p_1
Density-connected
  If points p and q are both density-reachable from some object o, then p and q are density-connected

Jian Pei: CMPT 459/741 Clustering (3)

252

DBSCAN
A cluster: a maximal set of density-connected points
Discover clusters of arbitrary shape in spatial databases with noise

(Illustration: core, border, and outlier points, with Eps = 1 cm and MinPts = 5.)

Jian Pei: CMPT 459/741 Clustering (3)

253

DBSCAN: the Algorithm


Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t. Eps and MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable from p, and DBSCAN visits the
next point of the database
Continue the process until all of the points have been processed
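A minimal sketch of the DBSCAN procedure above: grow a cluster from each unvisited core point by collecting everything density-reachable from it, and leave points reachable from no core point as noise (label -1). The example points and parameters are made up.

def dbscan(points, eps, min_pts):
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)          # None = unvisited, -1 = noise
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:           # not a core point: mark as noise for now
            labels[i] = -1
            continue
        cluster_id += 1                    # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:            # a former noise point becomes a border point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:   # j is also a core point: expand further
                queue.extend(j_neighbors)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8), (25, 80)]
print(dbscan(pts, eps=1.5, min_pts=2))     # two clusters and one noise point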
Jian Pei: CMPT 459/741 Clustering (3)

254

Challenges for DBSCAN


Different clusters may have very different
densities
Clusters may be in hierarchies

Jian Pei: CMPT 459/741 Clustering (3)

255

Biclustering
Clustering both objects and attributes
simultaneously
Four requirements
Only a small set of objects in a cluster (bicluster)
A bicluster only involves a small number of
attributes
An object may participate in multiple biclusters
or no biclusters
An attribute may be involved in multiple
biclusters, or no biclusters
Jian Pei: CMPT 459/741 Clustering (3)

256

Application Examples
Recommender systems
Objects: users
Attributes: items
Values: user ratings

(Illustration: a gene x sample/condition matrix with expression values w_11 ... w_nm.)

Microarray data
Objects: genes
Attributes: samples
Values: expression levels
Jian Pei: CMPT 459/741 Clustering (3)

257

Biclusters with Constant Values


(Figure 11.5: a gene-condition matrix, a submatrix, and a bi-cluster; e.g., genes b6, b12, b36, b99 all take
the value 60 on conditions a1, a33, a86.)

A bi-cluster with constant values on rows (Figure 11.6):
  10 10 10 10 10
  20 20 20 20 20
  50 50 50 50 50
   0  0  0  0  0

Jian Pei: CMPT 459/741 Clustering (3)

258

Biclusters with Coherent Values


Also known as pattern-based clusters

Jian Pei: CMPT 459/741 Clustering (3)

259

Biclusters with Coherent Evolutions

Only up- or down-regulated changes over rows or columns matter, not the exact values

A bi-cluster with coherent evolutions on rows (Figure 11.8):
  10    50   30    70   20
  20   100   50  1000   30
  50   100   90   120   80
   0    80   20   100   10

Jian Pei: CMPT 459/741 Clustering (3)

260

Differences from Subspace Clustering


Subspace clustering uses global distance/
similarity measure
Pattern-based clustering looks at patterns
A subspace cluster according to a globally
defined similarity measure may not follow
the same pattern

Jian Pei: CMPT 459/741 Clustering (3)

261

Objects Follow the Same Pattern?


(Illustration: two objects (blue and green) plotted on two attributes D1 and D2; the pScore measures how
differently the two objects change from D1 to D2.)

The smaller the pScore, the more consistent the two objects
Jian Pei: CMPT 459/741 Clustering (3)

262

Pattern-based Clusters
pScore: the similarity between two objects r_x, r_y on two attributes a_u, a_v
  pScore( [ r_x.a_u  r_x.a_v ; r_y.a_u  r_y.a_v ] ) = | (r_x.a_u - r_y.a_u) - (r_x.a_v - r_y.a_v) |

δ-pCluster (R, D): for any objects r_x, r_y ∈ R and any attributes a_u, a_v ∈ D,
  pScore( [ r_x.a_u  r_x.a_v ; r_y.a_u  r_y.a_v ] ) <= δ   (δ >= 0)
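A minimal sketch of checking the δ-pCluster condition above on a small object x attribute matrix: every pair of objects and every pair of attributes in (R, D) must have pScore at most δ. The matrix values are made up for illustration.

from itertools import combinations

def p_score(rx, ry, u, v):
    return abs((rx[u] - ry[u]) - (rx[v] - ry[v]))

def is_delta_pcluster(data, objects, attrs, delta):
    return all(
        p_score(data[x], data[y], u, v) <= delta
        for x, y in combinations(objects, 2)
        for u, v in combinations(attrs, 2)
    )

data = {                      # rows roughly follow the same shifting pattern on a1..a3
    "r1": {"a1": 10, "a2": 50, "a3": 30},
    "r2": {"a1": 20, "a2": 60, "a3": 41},
    "r3": {"a1": 15, "a2": 55, "a3": 35},
}
print(is_delta_pcluster(data, ["r1", "r2", "r3"], ["a1", "a2", "a3"], delta=1))   # True
print(is_delta_pcluster(data, ["r1", "r2", "r3"], ["a1", "a2", "a3"], delta=0))   # False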
263

Maximal pCluster
If (R, D) is a δ-pCluster, then every sub-cluster (R', D') with R' ⊆ R and D' ⊆ D is a δ-pCluster
  An anti-monotonic property
  A large pCluster is accompanied by many small pClusters: redundant and inefficient
Idea: mine only the maximal pClusters!
  A δ-pCluster is maximal if there exists no proper super-cluster that is also a δ-pCluster
Jian Pei: CMPT 459/741 Clustering (3)

264

Mining Maximal pClusters


Given
  A cluster threshold δ
  An attribute threshold min_a
  An object threshold min_o
Task: mine the complete set of significant maximal δ-pClusters
  A significant δ-pCluster has at least min_o objects on at least min_a attributes
Jian Pei: CMPT 459/741 Clustering (3)

265

Grid-based Clustering Methods


Ideas
Using multi-resolution grid data structures
Using dense grid cells to form clusters

Several interesting methods


CLIQUE
STING
WaveCluster

Jian Pei: CMPT 459/741 Clustering (4)

266

CLIQUE
Clustering In QUEst
Automatically identify subspaces of a high
dimensional data space
Both density-based and grid-based

Jian Pei: CMPT 459/741 Clustering (4)

267

CLIQUE: the Ideas


Partition each dimension into the same
number of equal length intervals
Partition an m-dimensional data space into nonoverlapping rectangular units

A unit is dense if the number of data points


in the unit exceeds a threshold
A cluster is a maximal set of connected
dense units within a subspace
Jian Pei: CMPT 459/741 Clustering (4)

268

CLIQUE: the Method


Partition the data space and find the number of
points in each cell of the partition
Apriori: a k-d cell cannot be dense if one of its (k-1)-d
projection is not dense

Identify clusters:
Determine dense units in all subspaces of interests and
connected dense units in all subspaces of interests

Generate minimal description for the clusters


Determine the minimal cover for each cluster

Jian Pei: CMPT 459/741 Clustering (4)

269

CLIQUE: An Example

(Illustration: the data is partitioned into grid units on the (age, salary) and (age, vacation) subspaces;
age ranges over 20-60, salary is in units of $10,000, and vacation is in weeks, each on a 0-7 grid scale.
Dense units found in each subspace are combined to suggest a cluster in the (age, salary, vacation) space.)

Jian Pei: CMPT 459/741 Clustering (4)

270

CLIQUE: Pros and Cons


Automatically find subspaces of the highest
dimensionality with high density clusters
Insensitive to the order of input
Not presume any canonical data distribution

Scale linearly with the size of input


Scale well with the number of dimensions
The clustering result may be degraded at the
expense of simplicity of the method
Jian Pei: CMPT 459/741 Clustering (4)

271

Bad Cases for CLIQUE


Parts of a cluster may be missed

A cluster from CLIQUE may


contain noise

Jian Pei: CMPT 459/741 Clustering (4)

272

Fuzzy Clustering
Each point x_i has a weight (probability) w_ij of belonging to cluster C_j
Requirements
  For each point x_i: Σ_{j=1..k} w_ij = 1
  For each cluster C_j: 0 < Σ_{i=1..m} w_ij < m
Jian Pei: CMPT 459/741 Clustering (4)

273

Fuzzy C-Means (FCM)


Select an initial fuzzy pseudo-partition, i.e., assign
values to all the wij
Repeat
Compute the centroid of each cluster using the fuzzy
pseudo-partition
Recompute the fuzzy pseudo-partition, i.e., the wij

Until the centroids do not change (or the change is


below some threshold)

Jian Pei: CMPT 459/741 Clustering (4)

274

Critical Details
Optimization on the sum of the squared error (SSE):
  SSE(C_1, ..., C_k) = Σ_{j=1..k} Σ_{i=1..m} w_ij^p dist(x_i, c_j)^2
Computing centroids:
  c_j = ( Σ_{i=1..m} w_ij^p x_i ) / ( Σ_{i=1..m} w_ij^p )
Updating the fuzzy pseudo-partition:
  w_ij = (1 / dist(x_i, c_j)^2)^(1/(p-1)) / Σ_{q=1..k} (1 / dist(x_i, c_q)^2)^(1/(p-1))
When p = 2:
  w_ij = (1 / dist(x_i, c_j)^2) / Σ_{q=1..k} (1 / dist(x_i, c_q)^2)
Jian Pei: CMPT 459/741 Clustering (4)

275

Choice of P
When p -> 1, FCM behaves like traditional k-means
When p is larger, the cluster centroids approach the global centroid of all data points

Jian Pei: CMPT 459/741 Clustering (4)

276

Effectiveness

Jian Pei: CMPT 459/741 Clustering (4)

277

Is a Clustering Good?
Feasibility
Applying any clustering methods on a uniformly
distributed data set is meaningless

Quality
Are the clustering results meeting users interest?
Clustering patients into clusters corresponding
various disease or sub-phenotypes is meaningful
Clustering patients into clusters corresponding to
male or female is not meaningful
Jian Pei: CMPT 459/741 Clustering (4)

278

Major Tasks
Assessing clustering tendency
Are there non-random structures in the data?

Determining the number of clusters or other


critical parameters
Measuring clustering quality

Jian Pei: CMPT 459/741 Clustering (4)

279

Uniformly Distributed Data

Clustering uniformly distributed data is meaningless
A uniformly distributed data set is generated by a uniform data distribution
(Figure 10.21: a data set uniformly distributed in the data space.)

Jian Pei: CMPT 459/741 Clustering (4)

280

Hopkins Statistic
Hypothesis: the data set D is generated by a uniform distribution in a space
Sample n points, p_1, ..., p_n, uniformly from the space containing D
For each point p_i, find its nearest neighbor in D; let x_i be the distance between p_i and
that nearest neighbor:
  x_i = min_{v ∈ D} dist(p_i, v)

Jian Pei: CMPT 459/741 Clustering (4)

281

Hopkins Statistic
Sample n points, q_1, ..., q_n, uniformly from D
For each q_i, find its nearest neighbor in D - {q_i}; let y_i be the distance between q_i and
that nearest neighbor:
  y_i = min_{v ∈ D, v ≠ q_i} dist(q_i, v)
Calculate the Hopkins statistic H:
  H = ( Σ_{i=1..n} y_i ) / ( Σ_{i=1..n} x_i + Σ_{i=1..n} y_i )

Jian Pei: CMPT 459/741 Clustering (4)

282

Explanation
If D is uniformly distributed, then Σ x_i and Σ y_i would be close to each other, and thus H
would be around 0.5
If D is skewed (clustered), then Σ y_i would be substantially smaller, and thus H would be
close to 0
If H > 0.5, then it is unlikely that D has statistically significant clusters
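A minimal sketch of the Hopkins statistic above for 2-D data in the unit square: x_i uses probes sampled uniformly from the space, y_i uses probes sampled from D itself, and H = Σy / (Σx + Σy). The two synthetic data sets are made up to show the contrast between a uniform and a clustered distribution.

import math
import random

def hopkins(D, n=50, seed=0):
    rng = random.Random(seed)
    def nn_dist(p, exclude=None):
        return min(math.dist(p, v) for v in D if v is not exclude)

    x = [nn_dist((rng.random(), rng.random())) for _ in range(n)]   # uniform probes
    sample = rng.sample(D, n)
    y = [nn_dist(q, exclude=q) for q in sample]                     # probes drawn from D
    return sum(y) / (sum(x) + sum(y))

random.seed(1)
uniform_data = [(random.random(), random.random()) for _ in range(500)]
clustered_data = [(random.gauss(0.3, 0.02), random.gauss(0.3, 0.02)) for _ in range(250)] + \
                 [(random.gauss(0.8, 0.02), random.gauss(0.7, 0.02)) for _ in range(250)]
print(hopkins(uniform_data))    # close to 0.5
print(hopkins(clustered_data))  # close to 0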
Jian Pei: CMPT 459/741 Clustering (4)

283

Finding the Number of Clusters


Depending on many factors
The shape and scale of the distribution in the
data set
The clustering resolution required by the user

Many methods exist
  Set k = sqrt(n/2); then each cluster has about sqrt(2n) points on average
  Plot the sum of within-cluster variances with respect to k, and find the first (or the most
  significant) turning point

Jian Pei: CMPT 459/741 Clustering (4)

284

A Cross-Validation Method
Divide the data set D into m parts
Use m 1 parts to find a clustering
Use the remaining part as the test set to test
the quality of the clustering
For each point in the test set, find the closest
centroid or cluster center
Use the squared distances between all points in the
test set and the corresponding centroids to measure
how well the clustering model fits the test set

Repeat m times for each value of k, use the


average as the quality measure
Jian Pei: CMPT 459/741 Clustering (4)

285

Measuring Clustering Quality


Ground truth: the ideal clustering determined
by human experts
Two situations
There is a known ground truth the extrinsic
(supervised) methods, comparing the clustering
against the ground truth
The ground truth is unavailable the intrinsic
(unsupervised) methods, measuring how well
the clusters are separated
Jian Pei: CMPT 459/741 Clustering (4)

286

Quality in Extrinsic Methods


Cluster homogeneity: the more pure the
clusters in a clustering, the better the clustering
Cluster completeness: objects in the same
cluster in the ground truth should be clustered
together
Rag bag: putting a heterogeneous object into a
pure cluster is worse than putting it into a rag
bag
Small cluster preservation: splitting a small
cluster in the ground truth into pieces is worse
than splitting a bigger one
Jian Pei: CMPT 459/741 Clustering (4)

287

Bcubed Precision and Recall


D = {o_1, ..., o_n}
  L(o_i) is the cluster of o_i given by the ground truth
C is a clustering on D
  C(o_i) is the cluster-id of o_i in C
For two objects o_i and o_j, the correctness is 1 if L(o_i) = L(o_j) <=> C(o_i) = C(o_j), and 0 otherwise
Jian Pei: CMPT 459/741 Clustering (4)

288

Bcubed Precision and Recall

Correctness(o_i, o_j) = 1 if L(o_i) = L(o_j) <=> C(o_i) = C(o_j), and 0 otherwise

BCubed precision:
  Precision BCubed = (1/n) Σ_{i=1..n} [ ( Σ_{o_j: i≠j, C(o_i)=C(o_j)} Correctness(o_i, o_j) ) / |{o_j | i≠j, C(o_i)=C(o_j)}| ]

BCubed recall:
  Recall BCubed = (1/n) Σ_{i=1..n} [ ( Σ_{o_j: i≠j, L(o_i)=L(o_j)} Correctness(o_i, o_j) ) / |{o_j | i≠j, L(o_i)=L(o_j)}| ]

Intrinsic Methods
When the ground truth of a data set is not available, we have to use intrinsic methods to assess the clustering quality

Jian Pei: CMPT 459/741 Clustering (4)

289

Silhouette Coefficient
No ground truth is assumed
Suppose a data set D of n objects is partitioned
into k clusters, C1, , Ck
For each object o,
Calculate a(o), the average distance between o and
every other object in the same cluster
compactness of a cluster, the smaller, the better
Calculate b(o), the minimum average distance from
o to every objects in a cluster that o does not belong
to degree of separation from other clusters, the
larger, the better
Jian Pei: CMPT 459/741 Clustering (4)

290

Silhouette Coefficient

  a(o) = ( Σ_{o' ∈ C_i, o' ≠ o} dist(o, o') ) / ( |C_i| - 1 )

  b(o) = min_{C_j: o ∉ C_j} { ( Σ_{o' ∈ C_j} dist(o, o') ) / |C_j| }

Then
  s(o) = ( b(o) - a(o) ) / max{ a(o), b(o) }

Use the average silhouette coefficient of all


objects as the overall measure
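A minimal sketch of the silhouette coefficient above: a(o) is the average distance to the other members of o's own cluster, b(o) is the smallest average distance to any other cluster, and s(o) = (b - a) / max(a, b); the overall measure averages s(o) over all objects. The points and labels are made up.

import math

def silhouette(points, labels):
    clusters = {c: [p for p, l in zip(points, labels) if l == c] for c in set(labels)}
    scores = []
    for p, c in zip(points, labels):
        own = [q for q in clusters[c] if q is not p]
        if not own:                                   # singleton cluster: skip
            continue
        a = sum(math.dist(p, q) for q in own) / len(own)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for cid, other in clusters.items() if cid != c)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
print(silhouette(pts, [0, 0, 0, 1, 1, 1]))   # close to 1: well-separated clusters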
Jian Pei: CMPT 459/741 Clustering (4)

291

Classification

Classification and Prediction


Classification: predict categorical class
labels
Build a model for a set of classes/concepts
Classify loan applications (approve/decline)

Prediction: model continuous-valued


functions
Predict the economic growth in 2015

Jian Pei: CMPT 741/459 Classification (1)

293

Classification: A 2-step Process


Model construction: describe a set of
predetermined classes
Training dataset: tuples for model construction
Each tuple/sample belongs to a predefined class

Classification rules, decision trees, or math formulae

Model application: classify unseen objects


Estimate accuracy of the model using an independent
test set
Acceptable accuracy apply the model to classify
tuples with unknown class labels
Jian Pei: CMPT 741/459 Classification (1)

294

Model Construction

Training Data:

Name  Rank        Years  Tenured
Mike  Ass. Prof   3      No
Mary  Ass. Prof   7      Yes
Bill  Prof        2      Yes
Jim   Asso. Prof  7      Yes
Dave  Ass. Prof   6      No
Anne  Asso. Prof  3      No

A classification algorithm produces the classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Jian Pei: CMPT 741/459 Classification (1)

295

Model Application

Testing Data:

Name     Rank        Years  Tenured
Tom      Ass. Prof   2      No
Merlisa  Asso. Prof  7      No
George   Prof        5      Yes
Joseph   Ass. Prof   7      Yes

Unseen data: (Jeff, Professor, 4) -> Tenured?

Jian Pei: CMPT 741/459 Classification (1)

296

Supervised/Unsupervised Learning
Supervised learning (classification)
Supervision: objects in the training data set have
labels
New data is classified based on the training set

Unsupervised learning (clustering)


The class labels of training data are unknown
Given a set of measurements, observations, etc.
with the aim of establishing the existence of
classes or clusters in the data
Jian Pei: CMPT 741/459 Classification (1)

297

Data Preparation
Data cleaning
Preprocess data in order to reduce noise and
handle missing values

Relevance analysis (feature selection)


Remove the irrelevant or redundant attributes

Data transformation
Generalize and/or normalize data

Jian Pei: CMPT 741/459 Classification (1)

298

Measurements of Quality
Prediction accuracy
Speed and scalability
Construction speed and application speed

Robustness: handle noise and missing


values
Scalability: build model for large training data
sets
Interpretability: understandability of models
Jian Pei: CMPT 741/459 Classification (1)

299

Decision Tree Induction

Decision tree representation


Construction of a decision tree
Inductive bias and overfitting
Scalable enhancements for large databases

Jian Pei: CMPT 741/459 Classification (1)

300

Decision Tree

A node in the tree: a test on some attribute
A branch: a possible value of the attribute
Classification
  Start at the root
  Test the attribute
  Move down the tree branch

Example tree for PlayTennis:
  Outlook = Sunny    -> test Humidity: High -> No, Normal -> Yes
  Outlook = Overcast -> Yes
  Outlook = Rain     -> test Wind: Strong -> No, Weak -> Yes

Jian Pei: CMPT 741/459 Classification (1)

301

Training Dataset

Outlook   Temp  Humid   Wind    PlayTennis
Sunny     Hot   High    Weak    No
Sunny     Hot   High    Strong  No
Overcast  Hot   High    Weak    Yes
Rain      Mild  High    Weak    Yes
Rain      Cool  Normal  Weak    Yes
Rain      Cool  Normal  Strong  No
Overcast  Cool  Normal  Strong  Yes
Sunny     Mild  High    Weak    No
Sunny     Cool  Normal  Weak    Yes
Rain      Mild  Normal  Weak    Yes
Sunny     Mild  Normal  Strong  Yes
Overcast  Mild  High    Strong  Yes
Overcast  Hot   Normal  Weak    Yes
Rain      Mild  High    Strong  No

Jian Pei: CMPT 741/459 Classification (1)

302

Appropriate Problems
Instances are represented by attribute-value
pairs
Extensions of decision trees can handle realvalued attributes

Disjunctive descriptions may be required


The training data may contain errors or
missing values

Jian Pei: CMPT 741/459 Classification (1)

303

Basic Algorithm ID3


Construct a tree in a top-down recursive divideand-conquer manner
Which attribute is the best at the current node?
Create a node for each possible attribute value
Partition training data into descendant nodes

Conditions for stopping recursion


All samples at a given node belong to the same class
No attribute remained for further partitioning
Majority voting is employed for classifying the leaf

There is no sample at the node


Jian Pei: CMPT 741/459 Classification (1)

304

Which Attribute Is the Best?


The attribute most useful for classifying
examples
Information gain and gini index
Statistical properties
Measure how well an attribute separates the
training examples

Jian Pei: CMPT 741/459 Classification (1)

305

Entropy
Measure the homogeneity of examples
  Entropy(S) = - Σ_{i=1..c} p_i log2(p_i)
S is the training data set, and p_i is the proportion of S belonging to class i

The smaller the entropy, the purer the data


set

Jian Pei: CMPT 741/459 Classification (1)

306

Information Gain
The expected reduction in entropy caused
by partitioning the examples according to an
attribute
  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

Value(A) is the set of all possible values for


attribute A, and Sv is the subset of S for
which attribute A has value v
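A minimal sketch of entropy and information gain as defined above, applied to the Wind attribute and the PlayTennis labels from the training data. The helper names are assumptions for illustration.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    total = entropy(labels)
    n = len(rows)
    for value in set(r[attr_index] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr_index] == value]
        total -= (len(subset) / n) * entropy(subset)
    return total

# Wind column and PlayTennis labels from the training dataset above.
wind = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
        "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
rows = [(w,) for w in wind]
print(round(entropy(play), 2))              # 0.94
print(round(info_gain(rows, play, 0), 3))   # about 0.048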
Jian Pei: CMPT 741/459 Classification (1)

307

Example

Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94

(Training data: the 14-example PlayTennis table shown earlier, with 9 Yes and 5 No examples.)

Gain(S, Wind) = Entropy(S) - Σ_{v ∈ {Weak, Strong}} (|S_v| / |S|) Entropy(S_v)
              = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
              = 0.94 - (8/14) * 0.811 - (6/14) * 1.00 = 0.048

Jian Pei: CMPT 741/459 Classification (1)

308

Hypothesis Space Search in


Decision Tree Building
Hypothesis space: the set of possible
decision trees
ID3: simple-to-complex, hill-climbing search
Evaluation function: information gain

Jian Pei: CMPT 741/459 Classification (1)

309

Capabilities and Limitations


The hypothesis space is complete
Maintains only a single current hypothesis
No backtracking
May converge to a locally optimal solution

Use all training examples at each step


Make statistics-based decisions
Not sensitive to errors in individual example

Jian Pei: CMPT 741/459 Classification (1)

310

Natural Bias
The information gain measure favors
attributes with many values
An extreme example
Attribute date may have the highest
information gain
A very broad decision tree of depth one
Inapplicable to any future data

Jian Pei: CMPT 741/459 Classification (1)

311

Alternative Measures
Gain ratio: penalize attributes like date by incorporating split information
  SplitInformation(S, A) = - Σ_{i=1..c} (|S_i| / |S|) log2(|S_i| / |S|)
  Split information is sensitive to how broadly and uniformly the attribute splits the data
  GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
Gain ratio can be undefined or very large
  Heuristic: only test attributes with above-average gain

Jian Pei: CMPT 741/459 Classification (1)

312

Measuring Inequality
Lorenz Curve
X-axis: quintiles
Y-axis: accumulative share of
income earned by the plotted
quintile
Gap between the actual lines
and the mythical line: the degree
of inequality

Gini
index

Jian Pei: CMPT 741/459 Classification (1)

Gini = 0, even distribution


Gini = 1, perfectly unequal
The greater the distance,
the more unequal the
distribution
313

Gini Index (Adjusted)


A data set S contains examples from n classes
  gini(S) = 1 - Σ_{j=1..n} p_j^2
  p_j is the relative frequency of class j in S
A data set S is split into two subsets S_1 and S_2 with sizes N_1 and N_2 respectively
  gini_split(S) = (N_1 / N) gini(S_1) + (N_2 / N) gini(S_2)
The attribute providing the smallest gini_split(S) is chosen to split the node
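A minimal sketch of the (adjusted) Gini index above: gini(S) = 1 - Σ_j p_j^2, and a binary split is scored by the size-weighted Gini of the two subsets. The label counts below reuse the 9-Yes/5-No distribution from the PlayTennis example.

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

labels = ["Yes"] * 9 + ["No"] * 5
print(round(gini(labels), 3))                                                     # 0.459
print(round(gini_split(["Yes"] * 6 + ["No"] * 2, ["Yes"] * 3 + ["No"] * 3), 3))   # 0.429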
Jian Pei: CMPT 741/459 Classification (1)

314

Extracting Classification Rules


Classification rules can be extracted from a
decision tree
Each path from the root to a leaf an IFTHEN rule
All attribute-value pair along a path form a
conjunctive condition
The leaf node holds the class prediction
IF age = <=30 AND student = no THEN
buys_computer = no

Rules are easy to understand


Jian Pei: CMPT 741/459 Classification (1)

315

Inductive Bias
The set of assumptions that, together with
the training data, deductively justifies the
classification to future instances
Preferences of the classifier construction

Shorter trees are preferred over longer trees


Trees that place high information gain
attributes close to the root are preferred

Jian Pei: CMPT 741/459 Classification (1)

316

Why Prefer Short Trees?


Occam s razor: prefer the simplest
hypothesis that fits the data
Fewer short trees than long trees
A short tree is less likely to be a statistical
coincidence
One should not increase, beyond what is necessary, the
number of entities required to explain anything Also
known as the principle of parsimony
Jian Pei: CMPT 741/459 Classification (1)

317

Overfitting
A decision tree T may overfit the training data
  if there exists an alternative tree T' such that T has a higher accuracy than T' over the
  training examples, but T' has a higher accuracy than T over the entire distribution of data
Why overfitting?
  Noisy data
  Bias in the training data

(Illustration: T' is more accurate over all data; T is more accurate over the training data.)
318

The Evaluation Issues


The accuracy of a classifier can be
evaluated using a test data set
The test set is a part of the available labeled
data set

But how can we evaluate the accuracy of a


classification method?
A classification method can generate many
classifiers

What if the available labeled data set is too


small?
Jian Pei: CMPT 741/459 Classification (2)

319

Holdout Method
Partition the available labeled data set into
two disjoint subsets: the training set and the
test set
50-50
2/3 for training and 1/3 for testing

Build a classifier using the training set


Evaluate the accuracy using the test set

Jian Pei: CMPT 741/459 Classification (2)

320

Limitations of Holdout Method


Fewer labeled examples for training
The classifier highly depends on the
composition of the training and test sets
The smaller the training set, the larger the
variance

If the test set is too small, the evaluation is


not reliable
The training and test sets are not
independent
Jian Pei: CMPT 741/459 Classification (2)

321

Cross-Validation
Each record is used the same number of times for
training and exactly once for testing
K-fold cross-validation
Partition the data into k equal-sized subsets
In each round, use one subset as the test set, and use
the rest subsets together as the training set
Repeat k times
The total error is the sum of the errors in k rounds

Leave-one-out: k = n
Utilize as much data as possible for training
Computationally expensive
Jian Pei: CMPT 741/459 Classification (2)

322

Accuracy Can Be Misleading


Consider a data set of 99% of the negative
class and 1% of the positive class
A classifier predicts everything negative has
an accuracy of 99%, though it does not work
for the positive class at all!
Imbalance class distribution is popular in
many applications
Medical applications, fraud detection,
Jian Pei: CMPT 741/459 Classification (2)

323

Performance Evaluation Matrix


Confusion matrix (contingency table, error matrix): used for
imbalance class distribution

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes    a (TP)      b (FN)
CLASS      Class=No     c (FP)      d (TN)

Accuracy = (a + d) / (a + b + c + d)
         = (TP + TN) / (TP + TN + FP + FN)
Jian Pei: CMPT 741/459 Classification (2)

324

Performance Evaluation Matrix


                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes    a (TP)      b (FN)
CLASS      Class=No     c (FP)      d (TN)

True positive rate (TPR, sensitivity) = TP / (TP + FN)
True negative rate (TNR, specificity) = TN / (TN + FP)
False positive rate (FPR) = FP / (TN + FP)
False negative rate (FNR) = FN / (TP + FN)
Jian Pei: CMPT 741/459 Classification (2)

325

Recall and Precision


Target class is more important than the other
classes
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes    a (TP)      b (FN)
CLASS      Class=No     c (FP)      d (TN)

Precision p = TP / (TP + FP)


Recall r = TP / (TP + FN)
Jian Pei: CMPT 741/459 Classification (2)

326

Fallout
Type I errors (false positives): a negative
object is classified as positive
Fallout: the type I error rate, FP / (FP + TN)
Type II errors (false negatives): a positive
object is classified as negative
Captured by recall

Jian Pei: CMPT 741/459 Classification (2)

327

F Measure
How can we summarize precision and recall into
one metric?
Using the harmonic mean between the two

F-measure (F) = 2rp / (r + p) = 2TP / (2TP + FP + FN)

F_β measure:
F_β = (β² + 1) r p / (r + β² p)
    = (β² + 1) TP / ((β² + 1) TP + β² FN + FP)

β = 0: F is the precision
β → ∞: F is the recall
0 < β < ∞: F is a tradeoff between the precision and the
recall
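
A small sketch computing accuracy, precision, recall, and the F_β measure from confusion-matrix counts; the TP/FN/FP/TN numbers below are made up for illustration.

# Sketch: metrics from confusion-matrix counts.
TP, FN, FP, TN = 40, 10, 20, 930

precision = TP / (TP + FP)
recall    = TP / (TP + FN)

def f_beta(p, r, beta=1.0):
    """F_beta = (beta^2 + 1) * p * r / (r + beta^2 * p)."""
    return (beta**2 + 1) * p * r / (r + beta**2 * p)

accuracy = (TP + TN) / (TP + TN + FP + FN)
print("accuracy=%.3f precision=%.3f recall=%.3f" % (accuracy, precision, recall))
print("F1=%.3f  F2=%.3f" % (f_beta(precision, recall), f_beta(precision, recall, beta=2)))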
Jian Pei: CMPT 741/459 Classification (2)

328

Weighted Accuracy
A more general metric
Weighted Accuracy = (w1*a + w4*d) / (w1*a + w2*b + w3*c + w4*d)

Measure      w1       w2    w3   w4
Recall       1        1     0    0
Precision    1        0     1    0
F_β          β² + 1   β²    1    0
Accuracy     1        1     1    1

Jian Pei: CMPT 741/459 Classification (2)

329

ROC Curve
Receiver Operating Characteristic (ROC)
Example: a 1-dimensional data set containing 2
classes; any point located at x > t is
classified as positive

Jian Pei: CMPT 741/459 Classification (2)

330

ROC Curve
(TPR, FPR):
(0,0): declare everything
to be negative class
(1,1): declare everything
to be positive class
(1,0): ideal
Diagonal line:
Random guessing
Below diagonal line:
prediction is opposite of
the true class
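
A sketch that traces ROC points by sweeping a threshold over classifier scores; the scores and labels below are made-up toy values.

# Sketch: ROC points (TPR, FPR) for decreasing score thresholds.
scores = [0.95, 0.9, 0.8, 0.7, 0.65, 0.6, 0.5, 0.4, 0.3, 0.1]
labels = [1,    1,   0,   1,   1,    0,   0,   1,   0,   0  ]   # 1 = positive class

P = sum(labels)
N = len(labels) - P

points = []
for t in sorted(set(scores), reverse=True):
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    points.append((tp / P, fp / N))          # (TPR, FPR) at threshold t

print([(round(tpr, 2), round(fpr, 2)) for tpr, fpr in points])
# (0, 0) corresponds to declaring everything negative,
# (1, 1) to declaring everything positive, and (1, 0) is the ideal point.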
Jian Pei: CMPT 741/459 Classification (2)

Figure from [Tan, Steinbach, Kumar]


331

Comparing Two Classifiers

Figure from [Tan, Steinbach, Kumar]


Jian Pei: CMPT 741/459 Classification (2)

332

Cost-Sensitive Learning
In some applications, misclassifying some
classes may be disastrous
Tumor detection, fraud detection

Using a cost matrix


                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes    -1          100
CLASS      Class=No

Jian Pei: CMPT 741/459 Classification (2)

333

Sampling for Imbalanced Classes


Consider a data set containing 100 positive
examples and 1,000 negative examples
Undersampling: use a random sample of 100
negative examples and all positive examples
Some useful negative examples may be lost
Run undersampling multiple times, use the ensemble of
multiple base classifiers
Focused undersampling: remove negative samples that
are not useful for classification, e.g., those far away from
the decision boundary
Jian Pei: CMPT 741/459 Classification (2)

334

Oversampling
Replicate the positive examples until the
training set has an equal number of positive
and negative examples
For noisy data, may cause overfitting
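
A sketch of random undersampling and oversampling on a made-up imbalanced data set of 100 positive and 1,000 negative examples.

import random

# Sketch: rebalancing a training set by sampling.
random.seed(2)
pos = [("pos", random.random()) for _ in range(100)]
neg = [("neg", random.random()) for _ in range(1000)]

# Undersampling: keep all positives, draw a random sample of negatives
# of the same size (some potentially useful negatives are discarded).
under = pos + random.sample(neg, len(pos))

# Oversampling: keep all negatives, replicate positives (sampling with
# replacement) until the classes are balanced; may overfit noisy data.
over = neg + [random.choice(pos) for _ in range(len(neg))]

print(len(under), len(over))   # 200 and 2000 training examples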

Jian Pei: CMPT 741/459 Classification (2)

335

Errors in Classification
Bias: the difference between the real class
boundary and the decision boundary of a
classification model
Variance: variability in the training data set
Intrinsic noise in the target class: the target
class can be non-deterministic - instances
with the same attribute values can have
different class labels
Jian Pei: CMPT 741/459 Classification (3)

336

One or More?
What if a medical doctor is not sure about a case?
Joint diagnosis: use a group of doctors with
different expertise
Wisdom of the crowd is often more accurate
All eager learning methods make predictions using a
single classifier induced from the training data
A single classifier may have low confidence in some
cases
Ensemble methods: construct a set of base
classifiers and take a vote on their predictions in
classification
Jian Pei: CMPT 741/459 Classification (3)

337

Ensemble Classifiers
(Figure: the original training data D is used to create multiple data sets
D1, D2, ..., Dt (Step 1: Create Multiple Data Sets); a base classifier Ci is
built on each Di (Step 2: Build Multiple Classifiers); the base classifiers
are combined into C*, where C*(x) = Vote(C1(x), ..., Ck(x)) (Step 3:
Combine Classifiers).)

Figure from [Tan, Steinbach, Kumar]


Jian Pei: CMPT 741/459 Classification (3)

338

Why May Ensemble Method Work?


Suppose there are two classes and each
base classifier has an error rate of 35%
What if we use 25 base classifiers?
If all base classifiers are identical, the ensemble
error rate is still 35%
If base classifiers are independent, the
ensemble makes a wrong prediction only if more
than half of the base classifiers are wrong
Ensemble error rate = Σ_{i=13}^{25} C(25, i) × 0.35^i × 0.65^(25-i) ≈ 0.06
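
The same calculation can be checked numerically; the sketch below sums the binomial terms for 25 independent base classifiers, each with error rate 0.35.

from math import comb

# Sketch: ensemble error when the ensemble errs only if 13 or more of the
# 25 independent base classifiers are wrong.
eps, n = 0.35, 25
ensemble_error = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
                     for i in range(13, n + 1))
print(round(ensemble_error, 3))   # about 0.06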

Jian Pei: CMPT 741/459 Classification (3)

339

Ensemble Error Rate

Figure from [Tan, Steinbach, Kumar]


Jian Pei: CMPT 741/459 Classification (3)

340

Ensemble Classifiers When?


The base classifiers should be independent
of each other
Each base classifier should do better than a
classifier that performs random guessing

Jian Pei: CMPT 741/459 Classification (3)

341

How to Construct Ensemble?


Manipulating the training set: derive multiple
training sets and build a base classifier on each
Manipulating the input features: use only a subset
of features in a base classifier
Manipulating the class labels: if there are many
classes, in a classifier, randomly divide the classes
into two subsets A and B; for a test case, if a base
classifier predicts its class as A, all classes in A
receive a vote
Manipulating the learning algorithm, e.g., using
different network configurations in an ANN
Jian Pei: CMPT 741/459 Classification (3)

342

Bootstrap
Given an original training set T, derive a
training set T' by repeatedly uniformly
sampling with replacement
If T has n tuples, each tuple has a probability
p = 1 - (1 - 1/n)^n of being selected in T'
When n → ∞, p → 1 - 1/e ≈ 0.632
Use the tuples not in T' as the test set

Jian Pei: CMPT 741/459 Classification (3)

343

Bootstrap
Use a bootstrap sample as the training set,
use the tuples not in the training set as the
test set
.632 bootstrap: compute the overall
accuracy by combining the accuracies of
each bootstrap sample with the accuracy
computed from a classifier using the whole
data set as the training set
acc_.632bootstrap = (1/k) Σ_{i=1}^{k} (0.632 × acc_i + 0.368 × acc_all)
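
A sketch of the .632 bootstrap estimate on made-up data, with a majority-class learner standing in for a real classifier.

import random

# Sketch: bootstrap sampling and the .632 bootstrap accuracy estimate.
random.seed(3)
data = [(random.random(), random.choice(["yes", "no", "yes"])) for _ in range(200)]

def fit(train):
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def accuracy(model, examples):
    return sum(1 for _, y in examples if y == model) / len(examples)

acc_all = accuracy(fit(data), data)      # classifier trained on the whole data set

k, total = 10, 0.0
for _ in range(k):
    # Sample n tuples uniformly with replacement; about 63.2% of the
    # distinct tuples end up in the bootstrap sample.
    sample = [random.choice(data) for _ in range(len(data))]
    out_of_sample = [t for t in data if t not in sample]   # used as the test set
    acc_i = accuracy(fit(sample), out_of_sample)
    total += 0.632 * acc_i + 0.368 * acc_all

print(".632 bootstrap accuracy:", round(total / k, 3))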

Jian Pei: CMPT 741/459 Classification (3)

344

Bagging
Run bootstrap k times to obtain k base classifiers
A test instance is assigned to the class that
receives the highest number of votes
Strength: reduces the variance of base classifiers -
good for unstable base classifiers
Unstable classifiers: sensitive to minor perturbations in
the training set, e.g., decision trees, associative
classifiers, and ANNs
For stable classifiers (e.g., linear discriminant
analysis and kNN classifiers), bagging may even
degrade the performance since the training sets
are smaller
Less overfitting on noisy data
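
A sketch of bagging with 1-D decision stumps on made-up data; each stump is trained on a bootstrap sample and the ensemble takes a majority vote.

import random

# Sketch: bagging decision stumps on 1-D data.
random.seed(4)
data = [(x, 1 if x > 0.6 else 0) for x in [random.random() for _ in range(300)]]

def fit_stump(train):
    """Pick the threshold (from a small grid) with the fewest training errors."""
    best = None
    for thr in [i / 20 for i in range(1, 20)]:
        err = sum(1 for x, y in train if (1 if x > thr else 0) != y)
        if best is None or err < best[1]:
            best = (thr, err)
    return best[0]

def bagging(train, k=11):
    stumps = []
    for _ in range(k):
        boot = [random.choice(train) for _ in range(len(train))]  # bootstrap sample
        stumps.append(fit_stump(boot))
    return stumps

def vote(stumps, x):
    votes = [1 if x > thr else 0 for thr in stumps]
    return 1 if sum(votes) > len(votes) / 2 else 0

stumps = bagging(data)
errors = sum(1 for x, y in data if vote(stumps, x) != y)
print("bagged training error rate:", errors / len(data))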
Jian Pei: CMPT 741/459 Classification (3)

345

Boosting
Assign a weight to each training example
Initially, each example is assigned a weight 1/n

Weights can be used in one of the following ways


Weights as a sampling distribution to draw a set of
bootstrap samples from the original training set
Weights used by a base classifier to learn a model
biased towards heavier examples

Adaptively change the weight at the end of each


boosting round
The weight of an example correctly classified decreases
The weight of an example incorrectly classified
increases

Each round generates a base classifier


Jian Pei: CMPT 741/459 Classification (3)

346

Critical Design Choices in Boosting


How are the weights of the training examples
updated at the end of each boosting
round?
How are the predictions made by base
classifiers combined?

Jian Pei: CMPT 741/459 Classification (3)

347

AdaBoost
Each base classifier carries an importance
score related to its error rate
Error rate: ε_i = (1/N) Σ_{j=1}^{N} w_j I(C_i(x_j) ≠ y_j)
w_j: weight of example j, I(p) = 1 if p is true (and 0 otherwise)
Importance score: α_i = (1/2) ln((1 - ε_i) / ε_i)

Jian Pei: CMPT 741/459 Classification (3)

348

How Does Importance Score Work?

Jian Pei: CMPT 741/459 Classification (3)

349

Weight Adjustment in AdaBoost


w_i^(j+1) = (w_i^(j) / Z_j) × exp(-α_j)   if C_j(x_i) = y_i
w_i^(j+1) = (w_i^(j) / Z_j) × exp(α_j)    if C_j(x_i) ≠ y_i
where Z_j is the normalization factor so that Σ_i w_i^(j+1) = 1

If any intermediate round generates an error rate of
more than 50%, the weights are reverted back to
1/n

The ensemble error rate is bounded:
e_ensemble ≤ Π_i sqrt(ε_i (1 - ε_i))
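
A simplified sketch of the AdaBoost loop above with 1-D decision stumps on made-up data; it omits details such as resetting the weights when a round's error rate exceeds 50%.

import math
import random

# Sketch: AdaBoost-style weight updates with decision stumps.
random.seed(5)
data = [(x, 1 if x > 0.4 else -1) for x in [random.random() for _ in range(100)]]
n = len(data)
w = [1.0 / n] * n                      # initial weights 1/n

classifiers = []                       # list of (threshold, alpha)
for _ in range(5):
    # Pick the stump "predict +1 if x > thr" with the smallest weighted error.
    best_thr, best_err = None, None
    for thr in [i / 20 for i in range(1, 20)]:
        err = sum(wi for wi, (x, y) in zip(w, data)
                  if (1 if x > thr else -1) != y)
        if best_err is None or err < best_err:
            best_thr, best_err = thr, err
    eps = max(best_err, 1e-10)
    alpha = 0.5 * math.log((1 - eps) / eps)       # importance score
    classifiers.append((best_thr, alpha))

    # Decrease weights of correctly classified examples, increase the
    # others, then normalize so the weights sum to 1.
    w = [wi * math.exp(-alpha if (1 if x > best_thr else -1) == y else alpha)
         for wi, (x, y) in zip(w, data)]
    z = sum(w)
    w = [wi / z for wi in w]

def ensemble_predict(x):
    score = sum(alpha * (1 if x > thr else -1) for thr, alpha in classifiers)
    return 1 if score >= 0 else -1

print("training errors:", sum(1 for x, y in data if ensemble_predict(x) != y))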

Jian Pei: CMPT 741/459 Classification (3)

350

Intuition Bayesian Classification


More hockey fans in Canada than in the US
Which country is Tom, a hockey fan, from?
Predicting Canada has a better chance of being right
Prior probability P(Canadian) = 5%: reflects
background knowledge - 5% of the total population
are Canadians
P(hockey fan | Canadian) = 30%: the probability that a
Canadian is also a hockey fan
Posterior probability P(Canadian | hockey fan): the
probability that a hockey fan is from Canada
Jian Pei: CMPT 741/459 Classification (4)

351

Bayes Theorem
P(h | D) = P(D | h) P(h) / P(D)

Find the maximum a posteriori (MAP)
hypothesis
h_MAP ≡ argmax_{h∈H} P(h | D) = argmax_{h∈H} P(D | h) P(h) / P(D)
      = argmax_{h∈H} P(D | h) P(h)

Require background knowledge


Computational cost
Jian Pei: CMPT 741/459 Classification (4)

352

Naïve Bayes Classifier

Assumption: attributes are independent
Given a tuple (a1, a2, ..., an), predict its
class as
C = argmax_{Ci} P(a1, a2, ..., an | Ci) P(Ci)
  = argmax_{Ci} P(Ci) Π_j P(aj | Ci)

argmax_x f(x): the value of x that maximizes f(x)
Example: argmax_{x∈{1,2,3}} x² = 3

Jian Pei: CMPT 741/459 Classification (4)

353

Example: Training Dataset


Data sample X =
(Outlook=sunny,
Temp=mild, Humid=high,
Wind=weak)
Will she play tennis? Yes
P(Yes|X) =
P(X|Yes) P(Yes) = 0.014
P(No|X) =
P(X|No) P(No) = 0.007
Jian Pei: CMPT 741/459 Classification (4)

Outlook   Temp  Humid   Wind    PlayTennis
Sunny     Hot   High    Weak    No
Sunny     Hot   High    Strong  No
Overcast  Hot   High    Weak    Yes
Rain      Mild  High    Weak    Yes
Rain      Cool  Normal  Weak    Yes
Rain      Cool  Normal  Strong  No
Overcast  Cool  Normal  Strong  Yes
Sunny     Mild  High    Weak    No
Sunny     Cool  Normal  Weak    Yes
Rain      Mild  Normal  Weak    Yes
Sunny     Mild  Normal  Strong  Yes
Overcast  Mild  High    Strong  Yes
Overcast  Hot   Normal  Weak    Yes
Rain      Mild  High    Strong  No

354

Probability of Infrequent Values


(outlook = Sunny,
temp = high,
humid = low,
wind = weak)?
P(humid = low) = 0

Jian Pei: CMPT 741/459 Classification (4)

(The same PlayTennis training data set as on the previous slide.)

355

Smoothing
Suppose an attribute has n different values:
a1, ..., an
Assume a small enough value ε > 0
Let Pi be the frequency of ai:
Pi = # tuples having ai / total # of tuples
Smoothed estimate: P'(ai) = (Pi + ε) / (1 + nε), so every
value gets a small nonzero probability and the estimates still sum to 1
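
A sketch of a categorical naive Bayes classifier on the PlayTennis data, using a Laplace-style smoothing term ε (one possible way to realize the smoothing idea above) so that the unseen value humid = Low still gets a small nonzero probability.

from collections import Counter, defaultdict

# Sketch: naive Bayes with smoothed conditional probabilities.
rows = [
    ("Sunny", "Hot", "High", "Weak", "No"),   ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
attrs = ["Outlook", "Temp", "Humid", "Wind"]
eps = 0.01

class_count = Counter(r[-1] for r in rows)
value_count = defaultdict(Counter)           # (attribute index, class) -> value counts
domains = [set(r[i] for r in rows) for i in range(len(attrs))]
for r in rows:
    for i, v in enumerate(r[:-1]):
        value_count[(i, r[-1])][v] += 1

def posterior(x, c):
    """P(c) * prod_j P(a_j | c) with smoothed conditional probabilities."""
    p = class_count[c] / len(rows)
    for i, v in enumerate(x):
        num = value_count[(i, c)][v] + eps
        den = class_count[c] + eps * len(domains[i] | {v})
        p *= num / den
    return p

# Humid = "Low" never occurs in the training data, but its smoothed
# probability is small and nonzero rather than 0.
x = ("Sunny", "Hot", "Low", "Weak")
print({c: round(posterior(x, c), 5) for c in class_count})
# argmax over the classes gives the prediction for x.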

Jian Pei: CMPT 741/459 Classification (4)

356

Characteristics of Naïve Bayes


Robust to isolated noise points
Such points are averaged out in probability
computation

Insensitive to missing values


Robust to irrelevant attributes
Distributions on such attributes are almost
uniform

Correlated attributes degrade the


performance
Jian Pei: CMPT 741/459 Classification (4)

357

Bayes Error Rate


The error rate of the ideal naïve Bayes
classifier

Err = ∫_0^x̂ P(Crocodile | X) dX + ∫_x̂^∞ P(Alligator | X) dX

where x̂ is the decision boundary between the two classes

Jian Pei: CMPT 741/459 Classification (4)


358

Pros and Cons


Pros
Easy to implement
Good results obtained in many cases

Cons
A (too) strong assumption: independent
attributes

How to handle dependent/correlated


attributes?
Bayesian belief networks
Jian Pei: CMPT 741/459 Classification (4)

359

Associative Classification
Mine possible association rules (PRs) in the form
of condset → c
condset: a set of attribute-value pairs
c: class label

Build classifier
Organize rules according to decreasing
precedence based on confidence and support

Classification
Use the first matching rule to classify an
unknown case
Jian Pei: CMPT 741/459 Classification (4)

360

Associative Classification Methods


CBA (Classification By Association: Liu, Hsu & Ma,
KDD'98)
Mine possible association rules in the form of
cond-set (a set of attribute-value pairs) → class label
Build classifier: organize rules according to decreasing
precedence based on confidence and then support
CMAR (Classification based on Multiple
Association Rules: Li, Han, Pei, ICDM'01)
Classification: statistical analysis on multiple rules

Jian Pei: CMPT 741/459 Classification (4)

361

Instance-based Methods
Instance-based learning
Store training examples and delay the processing until a
new instance must be classified ("lazy evaluation")

Typical approaches
K-nearest neighbor approach
Instances represented as points in a Euclidean space

Locally weighted regression


Construct local approximation

Case-based reasoning
Use symbolic representations and knowledge-based inference

Jian Pei: CMPT 741/459 Classification (4)

362

The K-Nearest Neighbor Method


Instances are points in an n-D space
The k nearest neighbors (KNN) under the
Euclidean distance
Return the most common value among the k
training examples nearest to the query point

Discrete-/real-valued target functions


(Figure: a query point xq among positive (+) and negative (-) training examples.)
Jian Pei: CMPT 741/459 Classification (4)

363

KNN Methods
For continuous-valued target functions, return the
mean value of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
Give greater weights to closer neighbors,
e.g., w = 1 / d(xq, xi)²

Robust to noisy data by averaging k-nearest


neighbors
Curse of dimensionality

Distance could be dominated by irrelevant attributes


Remedies: stretch the axes or eliminate the least relevant attributes
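
A sketch of distance-weighted k-nearest-neighbor prediction for a continuous-valued target, using the weight w = 1 / d(xq, xi)² above; the training points are made up.

# Sketch: distance-weighted kNN regression.
train = [((1.0, 1.0), 10.0), ((1.2, 0.9), 12.0), ((3.0, 3.1), 30.0),
         ((2.9, 3.3), 28.0), ((5.0, 5.2), 55.0)]

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def knn_predict(xq, k=3):
    neighbors = sorted(train, key=lambda t: dist2(xq, t[0]))[:k]
    # If the query coincides with a training point, return its value directly.
    if dist2(xq, neighbors[0][0]) == 0:
        return neighbors[0][1]
    weights = [1.0 / dist2(xq, x) for x, _ in neighbors]   # w = 1 / d^2
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

print(knn_predict((1.1, 1.0)))
print(knn_predict((3.0, 3.0)))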

Jian Pei: CMPT 741/459 Classification (4)

364

Lazy vs. Eager Learning


Efficiency: lazy learning uses less training
time but more predicting time
Accuracy
Lazy method effectively uses a richer hypothesis
space
Eager: must commit to a single hypothesis that
covers the entire instance space

Jian Pei: CMPT 741/459 Classification (4)

365

Outlier Detection

Motivation: Fraud Detection

http://i.imgur.com/ckkoAOp.gif

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

367

Techniques: Fraud Detection


Features
Dissimilarity
Groups and noise

http://i.stack.imgur.com/tRDGU.png

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

368

Outlier Analysis
"One person's noise is another person's
signal"
Outliers: the objects considerably dissimilar
from the remainder of the data
Examples: credit card fraud, Michael Jordan,
intrusions, etc.
Applications: credit card fraud detection, telecom
fraud detection, intrusion detection, customer
segmentation, medical analysis, etc.
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

369

Outliers and Noise


Different from noise
Noise is random error or variance in a measured
variable

Outliers are interesting: an outlier violates


the mechanism that generates the normal
data
Outlier detection vs. novelty detection
In an early stage, novel objects may be regarded as outliers
But they are later merged into the model
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

370

Types of Outliers
Three kinds: global, contextual and collective
outliers
A data set may have multiple types of outlier
One object may belong to more than one type of
outlier

Global outlier (or point anomaly)


An outlier object significantly deviates from the
rest of the data set

Challenge: find an appropriate measurement
of deviation
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

371

Contextual Outliers
An outlier object deviates significantly based on a
selected context
Ex. Is 10°C in Vancouver an outlier? (It depends: summer or
winter?)
Attributes of data objects should be divided into two
groups
Contextual attributes: define the context, e.g., time & location
Behavioral attributes: characteristics of the object, used in
outlier evaluation, e.g., temperature
A generalization of local outliers, whose density
significantly deviates from that of their local area
Challenge: how to define or formulate a meaningful
context?
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

372

Collective Outliers
A subset of data objects collectively deviate
significantly from the whole data set, even if the
individual data objects may not be outliers
Application example: intrusion detection, when a
number of computers keep sending denial-of-service packets to each other

Detection of collective outliers


Consider not only behavior of individual objects, but
also that of groups of objects
Need to have the background knowledge on the
relationship among data objects, such as a distance
or similarity measure on objects
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

373

Outlier Detection: Challenges


Modeling normal objects and outliers properly
Hard to enumerate all possible normal behaviors in
an application
The border between normal and outlier objects is
often a gray area

Application-specific outlier detection


Choice of distance measure among objects and the
model of relationship among objects are often
application-dependent
Example: in clinical data, a small deviation could be an
outlier, while in marketing analysis, much larger
fluctuations are normal
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

374

Outlier Detection: Challenges


Handling noise in outlier detection
Noise may distort the normal objects and blur the
distinction between normal objects and outliers
Noise may help hide outliers and reduce the
effectiveness of outlier detection

Understandability
Understand why these are outliers: Justification of
the detection
Specify the degree of an outlier: the unlikelihood of
the object being generated by a normal mechanism
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

375

Outlier Detection Methods


Whether user-labeled examples of outliers
can be obtained
Supervised, semi-supervised, and unsupervised
methods

Assumptions about normal data and outliers


Statistical, proximity-based, and clustering-based methods

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

376

Supervised Methods
Modeling outlier detection as a classification problem
Samples examined by domain experts used for training & testing

Methods for learning a classifier for outlier detection effectively:


Model normal objects & report those not matching the model as
outliers, or
Model outliers and treat those not matching the model as normal

Challenges
Imbalanced classes, i.e., outliers are rare: Boost the outlier class
and make up some artificial outliers
Catch as many outliers as possible, i.e., recall is more important
than accuracy (i.e., not mislabeling normal objects as outliers)

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

377

Unsupervised Methods
Assume the normal objects are somewhat
"clustered" into multiple groups, each having some
distinct features
An outlier is expected to be far away from any
groups of normal objects
Weakness: cannot detect collective outliers effectively
Normal objects may not share any strong patterns, but
the collective outliers may share high similarity in a small
area

Many clustering methods can be adapted for


unsupervised methods
Find clusters, then outliers: not belonging to any cluster
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

378

Unsupervised Methods: Challenges


In some intrusion or virus detection settings, normal
activities are diverse
Unsupervised methods may have a high false
positive rate but still miss many real outliers
Supervised methods can be more effective, e.g.,
identifying attacks on key resources
Challenges
Hard to distinguish noise from outliers
Costly since clustering is done first: but there are far
fewer outliers than normal objects
Newer methods: tackle outliers directly


Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

379

Semi-Supervised Methods
In many applications, the number of labeled examples is often
small
Labels could be on outliers only, normal objects only, or both

If some labeled normal objects are available


Use the labeled examples and the proximate unlabeled
objects to train a model for normal objects
Those not fitting the model of normal objects are detected as
outliers

If only some labeled outliers are available, a small
number of labeled outliers may not cover the possible
outliers well
To improve the quality of outlier detection, one can get help
from models for normal objects learned from unsupervised
methods
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

380

Pros and Cons


Effectiveness of statistical methods: highly
depends on whether the assumption of
statistical model holds in the real data
There are rich alternatives to use various
statistical models
Parametric vs. non-parametric

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

381

Proximity-based Methods
An object is an outlier if the nearest
neighbors of the object are far away, i.e., the
proximity of the object significantly
deviates from the proximity of most of the
other objects in the same data set

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

382

Pros and Cons


The effectiveness of proximity-based methods
highly relies on the proximity measure
In some applications, proximity or distance
measures cannot be obtained easily
Often have a difficulty in identifying a group of
outliers that stay close to each other
Two major types of proximity-based outlier
detection methods
Distance-based vs. density-based
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

383

Clustering-based Methods
Normal data belong to large and dense
clusters, whereas outliers belong to small or
sparse clusters, or do not belong to any
clusters

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

384

Challenges
Since there are many clustering methods,
there are many clustering-based outlier
detection methods as well
Clustering is expensive: straightforward
adaptation of a clustering method for outlier
detection can be costly and does not scale
up well for large data sets

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

385

Statistical Outlier Analysis


Assumption: the objects in a data set are
generated by a (stochastic) process (a
generative model)
Learn a generative model fitting the given
data set, and then identify the objects in low
probability regions of the model as outliers
Two categories: parametric versus non-parametric
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

386

Example
Statistical methods (also known as model-based methods) assume that the normal
data follow some statistical model
The data not following the model are outliers

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

387

Parametric Methods
Assumption: the normal data are generated by
a parametric distribution with parameter Θ
The probability density function of the
parametric distribution, f(x | Θ), gives the
probability that object x is generated by the
distribution
The smaller this value, the more likely x is an
outlier
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

388

Univariate Outliers Based on Normal Distribution

Assume the data follow a normal distribution N(μ, σ²)
Log-likelihood:
ln L(μ, σ²) = Σ_{i=1}^{n} ln f(xi | (μ, σ²))
            = -(n/2) ln(2π) - (n/2) ln σ² - (1/(2σ²)) Σ_{i=1}^{n} (xi - μ)²

Taking derivatives with respect to μ and σ²,
we derive the following maximum likelihood
estimates:
μ̂ = x̄ = (1/n) Σ_{i=1}^{n} xi
σ̂² = (1/n) Σ_{i=1}^{n} (xi - x̄)²

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)


389

Example
Daily average temperature: {24.0, 28.9, 28.9,
29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
Since n = 10, μ̂ = 28.61 and
σ̂ = sqrt(2.29) ≈ 1.51
Then (24 - 28.61) / 1.51 ≈ -3.04 < -3, so 24.0 is
an outlier, since μ ± 3σ contains 99.7% of the data
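
A sketch of the same idea on made-up temperatures: fit μ and σ by maximum likelihood and flag values more than 3 standard deviations from the mean.

from math import sqrt

# Sketch: flag values far from the mean under a fitted normal model.
temps = [28.8, 29.4] * 10 + [24.0]     # made-up daily temperatures

n = len(temps)
mu = sum(temps) / n
sigma = sqrt(sum((x - mu) ** 2 for x in temps) / n)   # MLE estimate (divide by n)

for x in temps:
    z = (x - mu) / sigma
    if abs(z) > 3:
        print(x, "deviates by", round(abs(z), 2), "standard deviations -> outlier")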

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

390

The Grubbs Test


Maximum normed residual test
For each object x in a data set, compute its
z-score: z = |x - x̄| / s
x is an outlier if

z ≥ ((N - 1) / sqrt(N)) × sqrt( t²_{α/(2N), N-2} / (N - 2 + t²_{α/(2N), N-2}) )

where t²_{α/(2N), N-2} is the value taken by a t-distribution at a
significance level of α/(2N), and N is the number
of objects in the data set

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

391

Non-parametric Method
Do not assume an a priori statistical model;
instead, determine the model from the input
data
Not completely parameter-free, but the
number and nature of the parameters are
flexible and not fixed in advance
Examples: histogram and kernel density
estimation
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

392

Histogram
A transaction in the amount of $7,500 is an
outlier, since only 0.2% of transactions have an
amount higher than $5,000

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

393

Challenges
Hard to choose an appropriate bin size for a
histogram
Too small a bin size: normal objects fall in empty or
rare bins, causing false positives
Too big a bin size: outliers fall in some frequent
bins, causing false negatives

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1)

394

Proximity-based Outlier Detection


Objects far away from the others are outliers
The proximity of an outlier deviates significantly
from that of most of the others in the data set
Distance-based outlier detection: An object o is
an outlier if its neighborhood does not have
enough other points
Density-based outlier detection: An object o is
an outlier if its density is relatively much lower
than that of its neighbors
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

395

Depth-based Methods
Organize data objects in layers with various
depths
The shallow layers are more likely to contain
outliers

Example: Peeling, Depth contours


Complexity O(N^(k/2)) for k-dimensional data sets
Unacceptable for k > 2

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

396

Depth-based Outliers: Example

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

397

Distance-based Outliers
A DB(p, D)-outlier is an object O in a dataset
T such that at least a fraction p of the objects
in T lie at a distance greater than distance D
from O
The larger D, the more outlying
The larger p, the more outlying
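
A brute-force sketch of DB(p, D)-outlier detection on made-up 2-D points.

from math import dist   # Python 3.8+

# Sketch: an object is a DB(p, D)-outlier if at least a fraction p of the
# other objects lie at a distance greater than D from it.
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (1.0, 0.8),
          (0.8, 1.0), (1.1, 1.2), (5.0, 5.0)]

def db_outliers(data, p=0.9, D=1.5):
    outliers = []
    for o in data:
        far = sum(1 for x in data if x is not o and dist(o, x) > D)
        if far >= p * (len(data) - 1):
            outliers.append(o)
    return outliers

print(db_outliers(points))   # only the isolated point (5.0, 5.0) qualifies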

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

398

Density-based Local Outlier


Both o1 and o2 are outliers
Distance-based methods
can detect o1, but not o2

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

399

Intuition
Compare objects to their local
neighborhoods, instead of the global data
distribution
The density around an outlier object is
significantly different from the density around
its neighbors
Use the relative density of an object against
its neighbors as the indicator of the degree
of the object being an outlier
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

400

Classification-based Outlier Detection


Train a classification model that can
distinguish normal data from outliers
A brute-force approach: Consider a training
set that contains some samples labeled as
normal and others labeled as outlier
A training set in practice is typically heavily
biased: the number of normal samples likely
far exceeds that of outlier samples
Cannot detect unseen anomalies
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

401

One-Class Model
A classifier is built to describe only the normal class
Learn the decision boundary of the normal class
using classification methods such as SVM
Any samples that do not belong to the normal class
(not within the decision boundary) are declared as
outliers
Advantage: can detect new outliers that may not
appear close to any outlier objects in the training set
Extension: Normal objects may belong to multiple
classes
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

402

One-Class Model

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

403

Semi-Supervised Learning Methods


Combine classification-based and clustering-based
methods
Method
Use a clustering-based approach to find a large cluster,
C, and a small cluster, C1
Since some objects in C carry the label normal, treat all
objects in C as normal
Use the one-class model of this cluster to identify normal
objects in outlier detection
Since some objects in cluster C1 carry the label outlier,
declare all objects in C1 as outliers
Any object that does not fall into the model for C (such
as a) is considered an outlier as well
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

404

Example

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

405

Pros and Cons


Pros: Outlier detection is fast
Cons: quality heavily depends on the availability
and quality of the training set;
it is often difficult to obtain representative and high-quality training data

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

406

Contextual Outliers
An outlier object deviates significantly based on a
selected context
Ex. Is 10°C in Vancouver an outlier? (It depends: summer or
winter?)
Attributes of data objects should be divided into two
groups
Contextual attributes: define the context, e.g., time & location
Behavioral attributes: characteristics of the object, used in
outlier evaluation, e.g., temperature
A generalization of local outliers, whose density
significantly deviates from that of their local area
Challenge: how to define or formulate a meaningful
context?
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

407

Detection of Contextual Outliers


If the contexts can be clearly identified,
transform the problem to conventional outlier detection
Identify the context of the object using the
contextual attributes
Calculate the outlier score for the object in the
context using a conventional outlier detection
method

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

408

Example
Detect outlier customers in the context of
customer groups
Contextual attributes: age group, postal code
Behavioral attributes: the number of transactions per
year, annual total transaction amount

Method
Locate c's context;
Compare c with the other customers in the same
group; and
Use a conventional outlier detection method
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

409

Modeling Normal Behavior


Model the normal behavior with respect to contexts
Use a training data set to train a model that predicts the
expected behavior attribute values with respect to the
contextual attribute values
An object is a contextual outlier if its behavior attribute
values significantly deviate from the values predicted by
the model

Use a prediction model to link the contexts and


behavior
Avoid explicit identification of specific contexts
Some possible methods: regression, Markov Models,
and Finite State Automaton
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

410

Collective Outliers
Objects as a group deviate significantly from
the entire data
Examine the structure of the data set, i.e., the
relationships between multiple data objects
The structures are often not explicitly defined,
and have to be discovered as part of the outlier
detection process.

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

411

Detecting High Dimensional Outliers


Interpretability of outliers
Which subspaces manifest the outliers or an
assessment regarding the outlying-ness of the objects

Data sparsity: data in high-D spaces are often sparse


The distance between objects becomes heavily
dominated by noise as the dimensionality increases

Data subspaces
Local behavior and patterns of data

Scalability with respect to dimensionality


The number of subspaces increases exponentially

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

412

Angle-based Outliers

Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2)

413
