Professional Documents
Culture Documents
Magnus Rosell
Contents
I Introduction
I Text Categorization
I Text Clustering
Introduction
Representation:
Features
Objects   F1    F2    F3    ...
obj1      .     .     .     ...
obj2      .     .     .     ...
...       ...   ...   ...   ...
[Figure: documents plotted in 2D, "football" (x axis) vs "beer" (y axis)]
Magnus Rosell 19 / 51 Unsupervised learning: (Text) Clustering
K-Means
Initial partition
I Pick k center points at random
I Pick k objects at random
I Construct a random partition
Conditions
I Repeat until (almost) no object changes cluster
I Repeat a predetermined number of times
I Repeat until a predetermined quality is reached
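The initialization and stopping conditions above can be sketched compactly. A minimal k-means sketch, assuming 2-D points and squared Euclidean distance purely for illustration (the slides leave the representation and similarity measure open):

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means: pick k objects at random as initial centers."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iters):  # repeat a predetermined number of times...
        new_assignment = [
            min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # ...or until no object changes cluster
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # recompute each centroid as the mean of its members
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids
```

The other two initializations (k random center points, or a random partition) only change the lines before the loop.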
[Figure: documents plotted in 2D, "football" (x axis) vs "beer" (y axis)]
2D Clustering Example 3: one tight, one wide
[Figure: documents plotted in 2D, "football" (x axis) vs "beer" (y axis)]
2D Clustering Example 4: random
[Figure: documents plotted in 2D, "football" (x axis) vs "beer" (y axis)]
K-Means (cont.)
K-Means
I Time complexity: O(knI) similarity calculations, where k is the
number of clusters, n the number of objects,
and I the number of iterations.
I Decide number of clusters in advance
I Try several clusterings!
I Different results depending on initial partition
I Try several initial partitions!
I What is a good partition?
I Global: revisits all texts in every iteration, so the centroids
are kept up to date. Takes word co-occurrence into account
through the centroids.
Agglomerative
1. Make one cluster for each document.
2. Join the most similar pair into one cluster.
3. Repeat 2 until some condition is fulfilled.
Conditions
I Repeat until one cluster.
I Repeat until a predetermined number of clusters
I Repeat until a predetermined quality is reached (the quality
usually increases with the number of clusters)
Similarity in Agglomerative:
1. Single Link: similarity between the most similar texts in each
cluster. Tends to give elongated clusters.
2. Complete Link: similarity between the least similar texts in
each cluster. Tends to give compact clusters.
3. Group Average Link: similarity between cluster centroids.
A middle ground.
4. Ward's Method: a combination method.
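The three numbered steps and the first two linkage criteria can be sketched naively. Distances instead of similarities, 2-D points, and a brute-force merge loop are assumptions made for illustration; a real implementation caches the pairwise values instead of recomputing them:

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def agglomerative(points, target_k, linkage="single"):
    """Naive agglomerative clustering: repeatedly merge the closest pair."""
    clusters = [[p] for p in points]          # 1. one cluster per document
    while len(clusters) > target_k:           # 3. repeat until the condition holds
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [dist2(p, q) for p in clusters[i] for q in clusters[j]]
                # single link: closest pair of texts; complete link: farthest pair
                d = min(dists) if linkage == "single" else max(dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best                        # 2. join the most similar pair
        clusters[i] += clusters.pop(j)
    return clusters
```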
Agglomerative
I Time complexity: O(n^2) similarity calculations
I Results in a hierarchy that can be browsed.
I Deterministic: same result every time.
I Local: in each iteration only the similarities at that level are
considered. Bad early decisions cannot be undone.
Divisive algorithm
1. Put all documents in one cluster.
2. Split the worst cluster.
3. Repeat 2 until some condition is fulfilled.
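The divisive loop above can be sketched as follows. Treating the largest cluster as the "worst" one, and splitting it around its two most distant members, are both simplifying assumptions for illustration; real systems often use a quality measure to pick the worst cluster and, e.g., 2-means to split it:

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def divisive(points, target_k):
    """Divisive sketch: split the 'worst' cluster until target_k clusters."""
    clusters = [list(points)]                 # 1. all documents in one cluster
    while len(clusters) < target_k:           # 3. repeat until enough clusters
        # 'worst' is assumed to be the largest cluster (illustration only)
        worst = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(worst)
        # 2. split it: seed the two halves with the two most distant members
        a, b = max(
            ((p, q) for p in members for q in members),
            key=lambda pq: dist2(*pq),
        )
        left = [p for p in members if dist2(p, a) <= dist2(p, b)]
        right = [p for p in members if dist2(p, a) > dist2(p, b)]
        clusters += [left, right]
    return clusters
```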
Agglomerative
I hierarchy
I may stop at optimal number of clusters
I same result every time (deterministic)
I local: bad early decisions cannot be changed
I slow: O(n^2)
Let
C = {c_i} a clustering with clusters c_i
K = {k^(j)} a categorization with categories k^(j)
n number of documents
n_i number of documents in cluster c_i
n^(j) number of documents in category k^(j)
n_i^(j) number of documents in cluster c_i and category k^(j)
M = {n_i^(j)} the confusion matrix
Purity
I Cluster Purity:
purity(c_i) = max_j p(i,j) = max_j n_i^(j) / n_i
I Clustering Purity:
purity(C) = Σ_i (n_i / n) purity(c_i)
Purity disregards everything except the majority category in each
cluster. But a cluster that only contains texts from two categories
is not that bad...
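Given the confusion matrix M as a list of rows n_i^(j), purity follows directly from the definitions above; the function names are illustrative:

```python
def cluster_purity(row):
    """Purity of one cluster from its confusion-matrix row n_i^(j):
    the fraction of its documents belonging to the majority category."""
    return max(row) / sum(row)

def clustering_purity(M):
    """Weighted average of the cluster purities (n = total documents)."""
    n = sum(sum(row) for row in M)
    return sum(sum(row) / n * cluster_purity(row) for row in M)
```

For example, M = [[9, 1], [2, 8]] gives cluster purities 0.9 and 0.8, and clustering purity 0.85.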
Entropy
I Entropy for a cluster: E(c_i) = - Σ_j p(i,j) log(p(i,j))
I Entropy for the Clustering: E(C) = Σ_i (n_i / n) E(c_i)
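The same confusion-matrix rows give the entropy measures. This sketch uses the standard sign convention with a leading minus, so lower entropy means a cleaner cluster, and skips empty cells since 0 log 0 is taken as 0:

```python
import math

def cluster_entropy(row):
    """Entropy of one cluster from its confusion-matrix row n_i^(j)."""
    n_i = sum(row)
    return -sum(c / n_i * math.log(c / n_i) for c in row if c > 0)

def clustering_entropy(M):
    """Weighted average of the cluster entropies."""
    n = sum(sum(row) for row in M)
    return sum(sum(row) / n * cluster_entropy(row) for row in M)
```

A cluster drawn from a single category has entropy 0; one split evenly between two categories has entropy log 2.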
(Word Clustering)
One possibility: cluster the columns of the term-document-matrix.
Other possibilities:
I Suffix Tree Clustering
I Frequent Term-based Clustering
I Similarity
I SOM - Self Organizing Maps: WEBSOM.
I Projections
I Representation
I Infomat