
Similarity and clustering

Clustering 2
Motivation
Problem: Query word could be ambiguous:
E.g., the query 'Star' retrieves documents about
astronomy, plants, animals, etc.
Solution: Visualisation
Clustering document responses to queries along lines of
different topics.
Problem 2: Manual construction of topic
hierarchies and taxonomies
Solution:
Preliminary clustering of large samples of web
documents.
Problem 3: Speeding up similarity search
Solution:
Restrict the search for documents similar to a query to
most representative cluster(s).
Clustering 3
Example
Scatter/Gather, a text clustering system, can separate salient topics in the response to
keyword queries. (Image courtesy of Hearst)

Clustering 4
Clustering
Task: Evolve measures of similarity to cluster a collection of
documents/terms into groups, such that similarity within a
cluster is larger than across clusters.
Cluster Hypothesis: Given a 'suitable' clustering of a
collection, if the user is interested in document/term d/t, he is
likely to be interested in other members of the cluster to
which d/t belongs.
Similarity measures
Represent documents by TFIDF vectors
Distance between document vectors
Cosine of angle between document vectors
Issues
Large number of noisy dimensions
Notion of noise is application dependent
Clustering 5
Top-down clustering
k-Means: Choose k arbitrary centroids, then repeat:
Assign each document to the nearest centroid
Recompute centroids
Expectation maximization (EM):
Pick k arbitrary distributions
Repeat:
Find probability that document d is generated
from distribution f for all d and f
Estimate distribution parameters from weighted
contribution of documents
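As a concrete illustration of the hard k-means loop above, here is a minimal NumPy sketch; the random seeding, variable names, and iteration count are my own choices, not part of the slides.

```python
import numpy as np

def kmeans(docs, k, iters=20, seed=0):
    """Hard k-means on TFIDF-style document vectors (rows of `docs`)."""
    rng = np.random.default_rng(seed)
    # Choose k arbitrary centroids: here, k randomly picked documents.
    centroids = docs[rng.choice(len(docs), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each document to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned documents.
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = docs[assign == c].mean(axis=0)
    return assign, centroids
```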
Clustering 6
Choosing 'k'
Mostly problem driven
Could be data driven only when either
Data is not sparse
Measurement dimensions are not too noisy
Interactive
Data analyst interprets results of structure
discovery
Clustering 7
Choosing k : Approaches
Hypothesis testing:
Null Hypothesis (H0): Underlying density is a
mixture of k distributions
Require regularity conditions on the mixture
likelihood function (Smith85)
Bayesian Estimation
Estimate posterior distribution on k, given data
and prior on k.
Difficulty: Computational complexity of integration
Autoclass algorithm of (Cheeseman98) uses
approximations
(Diebolt94) suggests sampling techniques

Clustering 8
Choosing k : Approaches
Penalised Likelihood
To account for the fact that L_k(D) is a
non-decreasing function of k.
Penalise the number of parameters
Examples : Bayesian Information Criterion (BIC),
Minimum Description Length(MDL), MML.
Assumption: Penalised criteria are asymptotically
optimal (Titterington 1985)
Cross Validation Likelihood
Find ML estimate on part of training data
Choose k that maximises the average of the M
cross-validated likelihoods on held-out data D_test
Cross Validation techniques: Monte Carlo Cross
Validation (MCCV), v-fold cross validation (vCV)
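The penalised-likelihood idea can be illustrated with a short sketch. This assumes scikit-learn's GaussianMixture as the mixture model and its built-in BIC score; it is one possible instantiation of the criterion, not the procedure prescribed by the slides.

```python
from sklearn.mixture import GaussianMixture

def choose_k_by_bic(X, k_values=range(1, 11), seed=0):
    """Fit a k-component mixture for each candidate k and keep the
    k with the lowest BIC (a penalised-likelihood criterion)."""
    scores = {}
    for k in k_values:
        gm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        scores[k] = gm.bic(X)   # -2 log L_k(D) + (#parameters) log N
    best_k = min(scores, key=scores.get)
    return best_k, scores
```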
Similarity and clustering
Clustering 10
Motivation
Problem: Query word could be ambiguous:
E.g., the query 'Star' retrieves documents about
astronomy, plants, animals, etc.
Solution: Visualisation
Clustering document responses to queries along lines of
different topics.
Problem 2: Manual construction of topic
hierarchies and taxonomies
Solution:
Preliminary clustering of large samples of web
documents.
Problem 3: Speeding up similarity search
Solution:
Restrict the search for documents similar to a query to
most representative cluster(s).
Clustering 11
Example
Scatter/Gather, a text clustering system, can separate salient topics in the response to
keyword queries. (Image courtesy of Hearst)

Clustering 12
Clustering
Task: Evolve measures of similarity to cluster a
collection of documents/terms into groups, such that
similarity within a cluster is larger than across
clusters.
Cluster Hypothesis: Given a 'suitable' clustering of
a collection, if the user is interested in
document/term d/t, he is likely to be interested in
other members of the cluster to which d/t belongs.
Collaborative filtering: Clustering of two/more sets of
objects which have a bipartite relationship
Clustering 13
Clustering (contd)
Two important paradigms:
Bottom-up agglomerative clustering
Top-down partitioning
Visualisation techniques: Embedding of
corpus in a low-dimensional space
Characterising the entities:
Internally : Vector space model, probabilistic
models
Externally: Measure of similarity/dissimilarity
between pairs
Learning: Supplement stock algorithms with
experience with data



Clustering 14
Clustering: Parameters

Similarity measure $\rho(d_1, d_2)$ (e.g., cosine similarity)

Distance measure $\delta(d_1, d_2)$ (e.g., Euclidean distance)

Number k of clusters
Issues
Large number of noisy dimensions
Notion of noise is application dependent
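A minimal sketch of the two parameters above on TFIDF vectors; the NumPy-based helper names are illustrative.

```python
import numpy as np

def cosine_similarity(d1, d2):
    """rho(d1, d2): cosine of the angle between two TFIDF vectors."""
    return float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))

def euclidean_distance(d1, d2):
    """delta(d1, d2): Euclidean distance between two TFIDF vectors."""
    return float(np.linalg.norm(d1 - d2))
```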
Clustering 15
Clustering: Formal specification
Partitioning Approaches
Bottom-up clustering
Top-down clustering
Geometric Embedding Approaches
Self-organization map
Multidimensional scaling
Latent semantic indexing
Generative models and probabilistic
approaches
Single topic per document
Documents correspond to mixtures of multiple
topics


Clustering 16
Partitioning Approaches
Partition document collection into k clusters $\{D_1, D_2, \ldots, D_k\}$
Choices:
Minimize intra-cluster distance: $\sum_i \sum_{d_1, d_2 \in D_i} \delta(d_1, d_2)$
Maximize intra-cluster semblance: $\sum_i \sum_{d_1, d_2 \in D_i} \rho(d_1, d_2)$
If cluster representations $D_i$ are available
Minimize $\sum_i \sum_{d \in D_i} \delta(d, D_i)$
Maximize $\sum_i \sum_{d \in D_i} \rho(d, D_i)$
Soft clustering
d assigned to $D_i$ with 'confidence' $z_{d,i}$
Find $z_{d,i}$ so as to minimize $\sum_i \sum_{d} z_{d,i} \, \delta(d, D_i)$ or
maximize $\sum_i \sum_{d} z_{d,i} \, \rho(d, D_i)$
Two ways to get partitions: bottom-up
clustering and top-down clustering
Clustering 17
Bottom-up clustering (HAC)
Initially G is a collection of singleton groups,
each with one document
Repeat
Find Γ, Δ in G with max similarity measure
s(Γ ∪ Δ)
Merge group Γ with group Δ
For each Γ keep track of the best Δ
Use the above info to plot the hierarchical
merging process (dendrogram)
To get the desired number of clusters: cut across
any level of the dendrogram
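A compact sketch of group-average HAC and the dendrogram cut; it delegates the merge loop and the best-Δ bookkeeping to SciPy for brevity, which is an assumption rather than the slides' explicit algorithm.

```python
from scipy.cluster.hierarchy import linkage, fcluster

def hac(doc_vectors, num_clusters):
    """Group-average HAC on document vectors; cutting the resulting
    dendrogram at the right level yields the desired number of clusters."""
    Z = linkage(doc_vectors, method='average', metric='cosine')
    labels = fcluster(Z, t=num_clusters, criterion='maxclust')
    # Z encodes the merge history; scipy.cluster.hierarchy.dendrogram(Z)
    # plots the hierarchical merging process.
    return Z, labels
```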
Clustering 18
Dendrogram
A dendrogram presents the progressive, hierarchy-forming merging process pictorially.

Clustering 19
Similarity measure
Typically s(Γ ∪ Δ) decreases with
increasing number of merges
Self-Similarity
Average pairwise similarity between
documents in Γ
$s(\Gamma) = \frac{1}{\binom{|\Gamma|}{2}} \sum_{d_1, d_2 \in \Gamma} s(d_1, d_2)$
$s(d_1, d_2)$ = inter-document similarity measure
(say cosine of TFIDF vectors)
Other criteria: maximum/minimum pairwise
similarity between documents in the
clusters
Clustering 20
Computation
Un-normalized group profile:
$\hat{p}(\Gamma) = \sum_{d \in \Gamma} p(d)$
Can show:
$s(\Gamma) = \frac{\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle - |\Gamma|}{|\Gamma| \, (|\Gamma| - 1)}$
$s(\Gamma \cup \Delta) = \frac{\langle \hat{p}(\Gamma \cup \Delta), \hat{p}(\Gamma \cup \Delta) \rangle - (|\Gamma| + |\Delta|)}{(|\Gamma| + |\Delta|)(|\Gamma| + |\Delta| - 1)}$
$\langle \hat{p}(\Gamma \cup \Delta), \hat{p}(\Gamma \cup \Delta) \rangle = \langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle + \langle \hat{p}(\Delta), \hat{p}(\Delta) \rangle + 2 \langle \hat{p}(\Gamma), \hat{p}(\Delta) \rangle$
O(n² log n) algorithm with O(n²) space
Clustering 21
Similarity
$s(\alpha, \beta) = \frac{\langle g(c(\alpha)), g(c(\beta)) \rangle}{\| g(c(\alpha)) \| \, \| g(c(\beta)) \|}$, where $\langle \cdot, \cdot \rangle$ = inner product
Normalized document profile:
$p(\alpha) = \frac{g(c(\alpha))}{\| g(c(\alpha)) \|}$
Profile for document group Γ:
$p(\Gamma) = \frac{\sum_{\alpha \in \Gamma} p(\alpha)}{\| \sum_{\alpha \in \Gamma} p(\alpha) \|}$
Clustering 22
Switch to top-down
Bottom-up
Requires quadratic time and space
Top-down or move-to-nearest
Internal representation for documents as well as
clusters
Partition documents into 'k' clusters
2 variants
Hard (0/1) assignment of documents to clusters
Soft: documents belong to clusters, with fractional
scores
Termination
when assignment of documents to clusters ceases to
change much OR
When cluster centroids move negligibly over successive
iterations
Clustering 23
Top-down clustering
Hard k-Means: Choose k arbitrary centroids, then repeat:
Assign each document to the nearest centroid
Recompute centroids
Soft k-Means:
Don't break close ties between document assignments to
clusters
Don't make documents contribute to a single cluster which
wins narrowly
Contribution for updating cluster centroid $\mu_c$ from document d
is related to the current similarity between $\mu_c$ and d:
$\Delta\mu_c = \eta \, \frac{\exp(-\|d - \mu_c\|^2)}{\sum_{\gamma} \exp(-\|d - \mu_\gamma\|^2)} \, (d - \mu_c)$, $\quad \mu_c \leftarrow \mu_c + \Delta\mu_c$
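A sketch of one pass of the soft update above, with the learning rate η and the exponential weighting as in the reconstructed formula; looping over documents in place is an illustrative choice.

```python
import numpy as np

def soft_kmeans_step(docs, centroids, eta=0.1):
    """One soft update: every document contributes to every centroid,
    weighted by exp(-|d - mu_c|^2) normalised over all clusters."""
    for d in docs:
        sq = np.sum((centroids - d) ** 2, axis=1)
        w = np.exp(-sq)
        w /= w.sum()                       # confidence of d in each cluster
        centroids += eta * w[:, None] * (d - centroids)
    return centroids
```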
Clustering 24
Seeding 'k' clusters
Randomly sample $O(\sqrt{kn})$ documents
Run the bottom-up group-average
clustering algorithm to reduce to k
groups or clusters: O(kn log n) time
Iterate assign-to-nearest O(1) times
Move each document to the nearest cluster
Recompute cluster centroids
Total time taken is O(kn)
Non-deterministic behavior
Clustering 25
Choosing 'k'
Mostly problem driven
Could be data driven only when either
Data is not sparse
Measurement dimensions are not too noisy
Interactive
Data analyst interprets results of structure
discovery
Clustering 26
Choosing k : Approaches
Hypothesis testing:
Null Hypothesis (H0): Underlying density is a
mixture of k distributions
Require regularity conditions on the mixture
likelihood function (Smith85)
Bayesian Estimation
Estimate posterior distribution on k, given data
and prior on k.
Difficulty: Computational complexity of integration
Autoclass algorithm of (Cheeseman98) uses
approximations
(Diebolt94) suggests sampling techniques

Clustering 27
Choosing k : Approaches
Penalised Likelihood
To account for the fact that L_k(D) is a
non-decreasing function of k.
Penalise the number of parameters
Examples : Bayesian Information Criterion (BIC),
Minimum Description Length(MDL), MML.
Assumption: Penalised criteria are asymptotically
optimal (Titterington 1985)
Cross Validation Likelihood
Find ML estimate on part of training data
Choose k that maximises the average of the M
cross-validated likelihoods on held-out data D_test
Cross Validation techniques: Monte Carlo Cross
Validation (MCCV), v-fold cross validation (vCV)

Clustering 28
Visualisation techniques
Goal: Embedding of corpus in a low-
dimensional space
Hierarchical Agglomerative Clustering (HAC)
lends itself easily to visualisation
Self-Organization map (SOM)
A close cousin of k-means
Multidimensional scaling (MDS)
minimize the distortion of interpoint distances in
the low-dimensional embedding as compared to
the dissimilarity given in the input data.
Latent Semantic Indexing (LSI)
Linear transformations to reduce number of
dimensions
Clustering 29
Self-Organization Map (SOM)
Like soft k-means
Determine association between clusters and documents
Associate a representative vector with each cluster and
iteratively refine
Unlike k-means
Embed the clusters in a low-dimensional space right from
the beginning
Large number of clusters can be initialised even if eventually
many are to remain devoid of documents
Each cluster can be a slot in a square/hexagonal grid.
The grid structure defines the neighborhood N(c) for
each cluster c
Also involves a proximity function $h(\gamma, c)$ between
clusters $\gamma$ and $c$
Clustering 30
SOM: Update Rule
Like a neural network
Data item d activates the neuron $c_d$ (closest
cluster) as well as the neighborhood
neurons $N(c_d)$
E.g., Gaussian neighborhood function:
$h(\gamma, c) = \exp\!\left(-\frac{\|\mu_\gamma - \mu_c\|^2}{2\sigma^2(t)}\right)$
Update rule for node $\gamma$ under the influence
of d is:
$\mu_\gamma(t+1) = \mu_\gamma(t) + \eta(t)\, h(\gamma, c_d)\, (d - \mu_\gamma(t))$
where $\sigma^2(t)$ is the neighborhood width and $\eta(t)$ is the
learning rate parameter
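A sketch of a single SOM update on a rectangular grid of cluster vectors, following the update rule above; the grid shape and the way parameters are passed in are assumptions.

```python
import numpy as np

def som_step(doc, grid, eta, sigma2):
    """One SOM update: `grid` is a (rows, cols, dim) float array of cluster
    vectors mu; the winning cluster and its neighbourhood move towards doc."""
    rows, cols, _ = grid.shape
    # Winner c_d: the cluster whose vector is closest to the document.
    dists = np.linalg.norm(grid - doc, axis=2)
    ci, cj = np.unravel_index(dists.argmin(), (rows, cols))
    winner = grid[ci, cj].copy()
    # Gaussian neighbourhood h(gamma, c_d) = exp(-||mu_gamma - mu_c||^2 / (2 sigma^2)).
    h = np.exp(-np.sum((grid - winner) ** 2, axis=2) / (2.0 * sigma2))
    # mu_gamma <- mu_gamma + eta * h(gamma, c_d) * (d - mu_gamma)
    grid += eta * h[:, :, None] * (doc - grid)
    return grid
```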
Clustering 31
SOM : Example I
SOM computed from over a million documents taken from 80 Usenet newsgroups. Light
areas have a high density of documents.

Clustering 32
SOM: Example II
Another example of SOM at work: the sites listed in the Open Directory
have been organized within a map of Antarctica at http://antarcti.ca/.
Clustering 33
Multidimensional Scaling (MDS)
Goal
Distance-preserving low-dimensional embedding of
documents
Symmetric inter-document distances $d_{ij}$
Given a priori or computed from internal representation
Coarse-grained user feedback
User provides similarity $\hat{d}_{ij}$ between documents i and j.
With increasing feedback, prior distances are overridden
Objective: Minimize the stress of the embedding
$\text{stress} = \frac{\sum_{i,j} (\hat{d}_{ij} - d_{ij})^2}{\sum_{i,j} d_{ij}^2}$
Clustering 34
MDS: issues
Stress not easy to optimize
Iterative hill climbing
1. Points (documents) assigned random
coordinates by external heuristic
2. Points moved by small distance in
direction of locally decreasing stress
For n documents
Each takes O(n) time to be moved
Totally O(n²) time per relaxation
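A sketch of the iterative relaxation: random initial coordinates, then small moves in the direction of locally decreasing stress. The gradient form and step size are illustrative, and the stress denominator is treated as a constant scale.

```python
import numpy as np

def mds_embed(d_hat, dim=2, iters=500, step=0.01, seed=0):
    """Hill-climbing MDS sketch: d_hat is the given symmetric dissimilarity
    matrix; returns n x dim coordinates whose pairwise distances d_ij
    approximate d_hat_ij (reducing the stress)."""
    rng = np.random.default_rng(seed)
    n = d_hat.shape[0]
    X = rng.normal(size=(n, dim))                 # 1. random coordinates
    scale = np.sum(d_hat ** 2)
    for _ in range(iters):
        diff = X[:, None, :] - X[None, :, :]       # x_i - x_j
        d = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(d, 1.0)                   # avoid divide-by-zero
        # 2. move each point against the gradient of sum_ij (d_hat - d)^2
        g = ((d - d_hat) / d)[:, :, None] * diff
        X -= step * 4 * g.sum(axis=1) / scale
    return X
```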
Clustering 35
FastMap [Faloutsos 95]
No internal representation of
documents available
Goal
Find a projection from an n-dimensional space
to a space with a smaller number 'k' of
dimensions.
Iterative projection of documents
along lines of maximum spread
Each 1D projection preserves distance
information

Clustering 36
Best line
Pivots for a line: two points (a and b)
that determine it
Avoid exhaustive checking by picking
pivots that are far apart
First coordinate $x_1$ of point x on the best
line (a, b):
$x_1 = \frac{d_{a,x}^2 + d_{a,b}^2 - d_{b,x}^2}{2\, d_{a,b}}$
Clustering 37
Iterative projection
For i = 1 to k:
1. Find the next (i-th) best line
A best line is one which gives maximum
variance of the point-set in the direction of the
line
2. Project points on the line
3. Project points on the hyperspace orthogonal to
the above line

Clustering 38
Projection
Purpose
To correct inter-point distances
between points by taking into
account the components already
accounted for by the first pivot line:
$d'^{\,2}_{x', y'} = d^{\,2}_{x, y} - (x_1 - y_1)^2$
where $(x', y')$ are the projections of $(x, y)$ onto the
hyperplane orthogonal to the pivot line
Project recursively up to 1-D space
Time: O(nk)
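A sketch of the FastMap recursion combining the pivot heuristic, the coordinate formula, and the residual-distance projection from the last three slides; the simple pivot-picking loop is an assumption for brevity.

```python
import numpy as np

def fastmap(D, k):
    """FastMap sketch: D[i, j] holds inter-object distances; returns an
    n x k coordinate matrix built from k successive pivot lines."""
    n = D.shape[0]
    D2 = D.astype(float) ** 2            # work with squared distances
    X = np.zeros((n, k))
    for i in range(k):
        # Pick pivots (a, b) that are far apart (heuristic, not exhaustive).
        a = 0
        b = int(np.argmax(D2[a]))
        a = int(np.argmax(D2[b]))
        if D2[a, b] == 0:
            break                         # remaining distances are all zero
        # x_1 = (d_{a,x}^2 + d_{a,b}^2 - d_{b,x}^2) / (2 d_{a,b})
        coord = (D2[a] + D2[a, b] - D2[b]) / (2 * np.sqrt(D2[a, b]))
        X[:, i] = coord
        # Project onto the hyperplane orthogonal to the pivot line:
        # d'^2_{x',y'} = d^2_{x,y} - (x_1 - y_1)^2
        D2 = np.maximum(D2 - (coord[:, None] - coord[None, :]) ** 2, 0)
    return X
```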
Clustering 39
Issues
Detecting noise dimensions
Bottom-up dimension composition too slow
Definition of noise depends on application
Running time
Distance computation dominates
Random projections
Sublinear time w/o losing small clusters
Integrating semi-structured information
Hyperlinks, tags embed similarity clues
A link is worth a ? words
Clustering 40
Expectation maximization (EM):
Pick k arbitrary distributions
Repeat:
Find probability that document d is generated
from distribution f for all d and f
Estimate distribution parameters from weighted
contribution of documents
Clustering 41
Extended similarity
Where can I fix my scooter?
A great garage to repair your
2-wheeler is at
auto and car co-occur often
Documents having related
words are related
Useful for search and clustering
Two basic approaches
Hand-made thesaurus
(WordNet)
Co-occurrence and
associations
(Figure: documents containing 'car' and documents containing 'auto' share many co-occurring words, so car ~ auto.)
Clustering 42
Latent semantic indexing
(Figure: SVD of the terms × documents matrix A into U, D, and V; each document is mapped to a k-dimensional vector; 'car' and 'auto' appear as example term rows.)
Clustering 43
Collaborative recommendation
(Figure: two people × movie preference tables; rows Lyle, Ellen, Jason, Fred, Dean, Karen; columns Batman, Rambo, Andre, Hiver, Whispers, StarWars.)
People = records, movies = features
People and features to be clustered
Mutual reinforcement of similarity
Need advanced models
From "Clustering methods in collaborative filtering", by Ungar and Foster
Clustering 44
A model for collaboration
People and movies belong to unknown
classes
$P_k$ = probability a random person is in class k
$P_l$ = probability a random movie is in class l
$P_{kl}$ = probability of a class-k person liking a
class-l movie
Gibbs sampling: iterate
Pick a person or movie at random and assign to a
class with probability proportional to $P_k$ or $P_l$
Estimate new parameters
Clustering 45
Aspect Model
Metric data vs Dyadic data vs Proximity data vs
Ranked preference data.
Dyadic data : domain with two finite sets of
objects
Observations: dyads (x, y) with x from X and y from Y
Unsupervised learning from dyadic data
Two sets of objects
$X = \{x_1, \ldots, x_n\}, \qquad Y = \{y_1, \ldots, y_n\}$
Clustering 46
Aspect Model (contd)
Two main tasks
Probabilistic modeling:
learning a joint or conditional probability model
over $X \times Y$
structure discovery:
identifying clusters and data hierarchies.
Clustering 47
Aspect Model
Statistical models
Empirical co-occurrence frequencies
Sufficient statistics
Data sparseness:
Empirical frequencies either 0 or significantly
corrupted by sampling noise
Solution
Smoothing
Back-off method [Katz87]
Model interpolation with held-out data [JM80, Jel85]
Similarity-based smoothing techniques [ES92]
Model-based statistical approach: a principled
approach to deal with data sparseness


Clustering 48
Aspect Model
Model-based statistical approach: a principled
approach to deal with data sparseness
Finite Mixture Models [TSM85]
Latent class [And97]
Specification of a joint probability distribution for
latent and observable variables [Hoffmann98]
Unifies
statistical modeling
Probabilistic modeling by marginalization
structure detection (exploratory data analysis)
Posterior probabilities by Bayes' rule on the latent space of
structures

Clustering 49
Aspect Model

$S = (x_n, y_n)_{1 \le n \le N}$ is a realisation of an
underlying sequence of random
variables $(X_n, Y_n)_{1 \le n \le N}$
2 assumptions
All co-occurrences in sample S are iid
$X_n, Y_n$ are independent given the latent aspect $A_n$
P(a) are the mixture components
Clustering 50
Aspect Model: Latent classes
Aspects: $A = \{a_1, \ldots, a_K\}$, one latent variable $A_n$ per observation $(X_n, Y_n)$, $1 \le n \le N$
One-sided clustering: $C = \{c_1, \ldots, c_K\}$, latent class $C(X_n)$ attached to $X_n$
Two-sided clustering: $C = \{c_1, \ldots, c_K\}$, $D = \{d_1, \ldots, d_L\}$, latent classes $(C(X_n), D(Y_n))$
Increasing degree of restriction on the latent space
Clustering 51
Aspect Model
Symmetric:
$P(S, a) = \prod_{n=1}^{N} P(x_n, y_n, a_n) = \prod_{n=1}^{N} P(a_n)\, P(x_n \mid a_n)\, P(y_n \mid a_n)$
$P(S) = \prod_{x \in X} \prod_{y \in Y} \Big[ \sum_{a \in A} P(a)\, P(x \mid a)\, P(y \mid a) \Big]^{n(x,y)}$
Asymmetric:
$P(S) = \prod_{x \in X} \prod_{y \in Y} \Big[ P(x) \sum_{a \in A} P(a \mid x)\, P(y \mid a) \Big]^{n(x,y)}$
Clustering 52
Clustering vs Aspect
Clustering model = constrained aspect model
$P(a \mid x, c) = P\{A_n = a \mid X_n = x, C(x) = c\}$
For flat clustering: $P(a \mid x, c) = \delta_{ac}$
For hierarchical clustering: $P(a \mid x, c)$ is non-zero only for
aspects a above class c ($a \uparrow c$)
Group structure on object spaces, as
against partitioning the observations
Notation
P(.) : are the parameters
P{.} : are the posteriors
Clustering 53
Hierarchical Clustering model
One-sided clustering:
$P(S) = \prod_{x \in X} P(x) \sum_{c \in C} P(c) \prod_{y \in Y} \big[ P(y \mid c) \big]^{n(x,y)}$
Hierarchical clustering:
$P(S) = \prod_{x \in X} P(x) \sum_{c \in C} P(c) \prod_{y \in Y} \Big[ \sum_{a \in A} P(a \mid x, c)\, P(y \mid a) \Big]^{n(x,y)}$
Clustering 54
Comparison of E-steps

Aspect model:
$P\{A_n = a \mid X_n = x, Y_n = y; \theta\} = \frac{P(a)\, P(x \mid a)\, P(y \mid a)}{\sum_{a'} P(a')\, P(x \mid a')\, P(y \mid a')}$

One-sided aspect model:
$P\{C(x) = c \mid S; \theta\} = \frac{P(c) \prod_{y \in Y} [P(y \mid c)]^{n(x,y)}}{\sum_{c' \in C} P(c') \prod_{y \in Y} [P(y \mid c')]^{n(x,y)}}$

Hierarchical aspect model:
$P\{C(x) = c \mid S; \theta\} = \frac{P(c) \prod_{y \in Y} [\sum_{a \in A} P(y \mid a)\, P(a \mid x, c)]^{n(x,y)}}{\sum_{c' \in C} P(c') \prod_{y \in Y} [\sum_{a \in A} P(y \mid a)\, P(a \mid x, c')]^{n(x,y)}}$
$P\{A_n = a \mid X_n = x, Y_n = y, C(x) = c; \theta\} = \frac{P(a \mid x, c)\, P(y \mid a)}{\sum_{a'} P(a' \mid x, c)\, P(y \mid a')}$
Clustering 55
Tempered EM (TEM)
Additively (on the log scale) discount
the likelihood part in Bayes' formula:
$P\{A_n = a \mid X_n = x, Y_n = y; \theta\} = \frac{P(a)\, [P(x \mid a)\, P(y \mid a)]^{\beta}}{\sum_{a'} P(a')\, [P(x \mid a')\, P(y \mid a')]^{\beta}}$
1. Set $\beta = 1$ and perform EM until the performance on held-out data
deteriorates (early stopping).
2. Decrease $\beta$, e.g., by setting $\beta \leftarrow \eta\beta$ with some rate parameter $\eta < 1$.
3. As long as the performance on held-out data improves, continue TEM
iterations at this value of $\beta$.
4. Stop when decreasing $\beta$ does not yield further
improvements, otherwise go to step (2).
5. Perform some final iterations using both training and held-out data.
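A sketch of the tempered E-step for the aspect model, with the likelihood part raised to the power β as above; the array shapes and names are illustrative.

```python
import numpy as np

def tempered_e_step(P_a, P_x_a, P_y_a, beta):
    """Tempered E-step: posterior P{a | x, y} proportional to
    P(a) [P(x|a) P(y|a)]^beta.
    Shapes: P_a (K,), P_x_a (|X|, K), P_y_a (|Y|, K); beta = 1 is plain EM."""
    post = P_a[None, None, :] * (P_x_a[:, None, :] * P_y_a[None, :, :]) ** beta
    return post / post.sum(axis=2, keepdims=True)   # normalise over aspects a
```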
Clustering 56
M-Steps

1. Aspect model:
$P(x \mid a) = \frac{\sum_{y} n(x, y)\, P\{a \mid x, y; \theta'\}}{\sum_{x', y} n(x', y)\, P\{a \mid x', y; \theta'\}}, \qquad P(y \mid a) = \frac{\sum_{x} n(x, y)\, P\{a \mid x, y; \theta'\}}{\sum_{x, y'} n(x, y')\, P\{a \mid x, y'; \theta'\}}$

2. Asymmetric model:
$P(x) = \frac{n(x)}{N}, \qquad P(a \mid x) = \frac{\sum_{y} n(x, y)\, P\{a \mid x, y; \theta'\}}{n(x)}$, with $P(y \mid a)$ as in the aspect model

3. Hierarchical x-clustering:
$P(x) = \frac{n(x)}{N}, \qquad P(y \mid a) = \frac{\sum_{x} n(x, y)\, P\{a \mid x, y; \theta'\}}{\sum_{x, y'} n(x, y')\, P\{a \mid x, y'; \theta'\}}$

4. One-sided x-clustering:
$P(x) = \frac{n(x)}{N}, \qquad P(y \mid c) = \frac{\sum_{x} n(x, y)\, P\{C(x) = c \mid S; \theta'\}}{\sum_{x} n(x)\, P\{C(x) = c \mid S; \theta'\}}$
Clustering 57
Example Model [Hofmann and Popat CIKM 2001]
Hierarchy of document categories






Clustering 58
Example Application
Clustering 59
Topic Hierarchies
To overcome the sparseness problem in topic
hierarchies with a large number of classes
Sparseness Problem: Small number of
positive examples
Topic hierarchies to reduce variance in
parameter estimation
Automatically differentiate
Make use of term distributions estimated for more general,
coarser text aspects to provide better, smoothed estimates
of class conditional term distributions
Convex combination of term distributions in a Hierarchical
Mixture Model:
$P(w \mid c) = \sum_{a \uparrow c} P(a \mid c)\, P(w \mid a)$
where $a \uparrow c$ refers to all inner nodes a above the terminal class node
c.
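A sketch of the convex combination P(w|c); the containers holding P(a|c) and P(w|a) are illustrative placeholders, not structures defined in the slides.

```python
def smoothed_term_dist(c, ancestors, P_a_given_c, P_w_given_a):
    """P(w|c) = sum over inner nodes a above c of P(a|c) * P(w|a).
    `ancestors[c]` lists the inner nodes above terminal class c,
    P_a_given_c[c][a] holds the mixture weights, and P_w_given_a[a]
    is a NumPy term-distribution vector (all names are illustrative)."""
    return sum(P_a_given_c[c][a] * P_w_given_a[a] for a in ancestors[c])
```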
Clustering 60
Topic Hierarchies
(Hierarchical X-clustering)
X = document, Y = word
M-step re-estimates (n(x, y) = count of word y in document x):
$P(y \mid a) = \frac{\sum_{x} n(x, y)\, P\{a \mid x, y; \theta'\}}{\sum_{x, y'} n(x, y')\, P\{a \mid x, y'; \theta'\}}, \qquad P(x) = \frac{n(x)}{N}, \qquad P(a \mid c(x)) = \frac{\sum_{y} n(c, y)\, P\{a \mid y, c(x); \theta'\}}{\sum_{a', y} n(c, y)\, P\{a' \mid y, c(x); \theta'\}}$
E-step posteriors:
$P\{C(x) = c \mid S; \theta\} = \frac{P(c) \prod_{y \in Y} [\sum_{a \uparrow c} P(y \mid a)\, P(a \mid c(x))]^{n(x,y)}}{\sum_{c' \in C} P(c') \prod_{y \in Y} [\sum_{a \uparrow c'} P(y \mid a)\, P(a \mid c'(x))]^{n(x,y)}}$
$P\{a \mid x, y, c(x); \theta\} = \frac{P(a \mid c)\, P(y \mid a)}{\sum_{a' \uparrow c} P(a' \mid c)\, P(y \mid a')}$
Clustering 61
Document Classification Exercise

Modification of Naive Bayes:
$P(w \mid c) = \sum_{a \uparrow c} P(a \mid c)\, P(w \mid a)$
$P(c \mid x) = \frac{P(c) \prod_{y \in x} P(y \mid c)}{\sum_{c'} P(c') \prod_{y \in x} P(y \mid c')}$
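A sketch of the modified Naive Bayes scoring, assuming the smoothed P(y|c) vectors from the previous slide have already been computed; the container names are illustrative.

```python
import numpy as np

def classify(doc_term_ids, log_prior, P_w_given_class):
    """Score each class c with log P(c) + sum_{y in doc} log P(y|c),
    where P_w_given_class[c] is the precomputed smoothed term-distribution
    vector for class c (hierarchical mixture from the previous slide)."""
    scores = {c: log_prior[c] + sum(np.log(P_w_given_class[c][w])
                                    for w in doc_term_ids)
              for c in log_prior}
    # Normalising these scores over classes gives P(c|x) via Bayes' rule.
    best = max(scores, key=scores.get)
    return best, scores
```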
Clustering 62
Mixture vs Shrinkage
Shrinkage [McCallum Rosenfeld AAAI98]: Interior
nodes in the hierarchy represent
coarser views of the data which are
obtained by simple pooling scheme of
term counts
Mixture : Interior nodes represent
abstraction levels with their
corresponding specific vocabulary
Predefined hierarchy [Hofmann and Popat CIKM 2001]
Creation of hierarchical model from unlabeled data
[Hofmann IJCAI99]
Clustering 63
Mixture Density Networks (MDN)
[Bishop CM 94 Mixture Density Networks]
A broad and flexible class of distributions
capable of modeling completely general
continuous distributions
Superimpose simple component densities with
well-known properties to generate or
approximate more complex distributions
Two modules:
Mixture models: Output has a distribution given as
mixture of distributions
Neural Network: Outputs determine parameters of
the mixture model

Clustering 64
MDN: Example
A conditional mixture density network with Gaussian component densities
Clustering 65
MDN
Parameter Estimation:
Using the Generalized EM (GEM) algorithm
to speed up.
Inference
Even for a linear mixture, closed form
solution not possible
Use of Monte Carlo Simulations as a
substitute
Clustering 66
Document model
Vocabulary V, term $w_i$, document $\delta$
represented by $c(\delta) = \{ f(w_i, \delta) : w_i \in V \}$
$f(w_i, \delta)$ is the number of times $w_i$
occurs in document $\delta$
Most f's are zeroes for a single
document
Monotone component-wise damping
function g such as log or square-root:
$g(c(\delta)) = \{ g(f(w_i, \delta)) : w_i \in V \}$
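A sketch of this document model: raw term counts f(w_i, δ) followed by a component-wise damping g (square root here; log1p would do as well). The tokenisation and vocabulary mapping are assumptions.

```python
import numpy as np

def document_profile(doc_tokens, vocab_index, damp=np.sqrt):
    """Build c(delta) = {f(w_i, delta)} as a term-count vector over the
    vocabulary, then apply a monotone component-wise damping function g."""
    counts = np.zeros(len(vocab_index))
    for tok in doc_tokens:
        if tok in vocab_index:
            counts[vocab_index[tok]] += 1      # f(w_i, delta)
    return damp(counts)                         # g(c(delta))
```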
