Overview
Definition of Clustering
Existing clustering methods
Clustering examples
Classification
Classification examples
Conclusion
Definition
Clustering can be considered the most important unsupervised learning technique; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. Clustering is the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.
Why clustering?
A few good reasons ...
Simplifications
Pattern detection
Useful in data concept construction
Unsupervised learning process
Measuring Similarity
Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function d(i, j), which is typically a metric.
There is a separate quality function that measures the goodness of a cluster.
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
Weights should be associated with different variables based on applications and data semantics.
It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.
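As a rough illustration of how the distance function changes with the variable type, the sketch below pairs a weighted distance for interval-scaled variables with a simple matching dissimilarity for categorical ones. The function names, the weights, and the sample values are assumptions made for this example only.

```python
def weighted_distance(a, b, weights=None, p=2):
    """Weighted Minkowski distance for interval-scaled variables.

    p=2 gives the (weighted) Euclidean distance, p=1 the Manhattan distance.
    """
    if weights is None:
        weights = [1.0] * len(a)
    total = sum(w * abs(x - y) ** p for w, x, y in zip(weights, a, b))
    return total ** (1.0 / p)


def simple_matching(a, b):
    """Dissimilarity for boolean/categorical variables: share of mismatches."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


# Hypothetical objects: two numeric points and two categorical records.
print(weighted_distance((0, 0), (3, 4)))                  # Euclidean
print(weighted_distance((0, 0), (3, 4), p=1))             # Manhattan
print(weighted_distance((0, 0), (3, 4), weights=(1, 0)))  # second variable ignored
print(simple_matching(("red", "S"), ("red", "M")))
```

Setting a weight to zero removes a variable from the comparison, which is one simple way to encode application-specific judgments about what "similar" means.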
Professor Lee, Sin-Min
In this case we easily identify the 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are close according to a given distance. This is called distance-based clustering.
Hierarchical clustering
Agglomerative (bottom up)
1. Start with each point as its own cluster (singleton)
2. Recursively merge two or more appropriate clusters
3. Stop when k number of clusters is achieved.
Divisive (top down)
1. Start with a big cluster
2. Recursively divide into smaller clusters
3. Stop when k number of clusters is achieved.
Partitioning clustering
1. Divide the data into proper subsets.
2. Recursively go through each subset and relocate points between clusters (as opposed to the visit-once approach of hierarchical clustering).
This recursive relocation yields higher-quality clusters.
Probabilistic clustering
1. Data are picked from a mixture of probability distributions.
2. Use the mean and variance of each distribution as the parameters for a cluster.
3. Single cluster membership
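A minimal sketch of the single-membership idea, assuming one-dimensional data and Gaussian components whose weight, mean, and variance are already known. The component parameters and the sample points are hypothetical, and estimating the parameters from data is not shown.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def assign(points, components):
    """Give each point to the single most likely component.

    components: list of (weight, mean, variance) tuples.
    Returns one component index per point.
    """
    labels = []
    for x in points:
        scores = [w * gaussian_pdf(x, m, v) for w, m, v in components]
        labels.append(scores.index(max(scores)))
    return labels


# Hypothetical mixture: two equally weighted components centred at 0 and 5.
components = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]
labels = assign([-0.3, 0.8, 4.6, 5.9], components)
print(labels)
```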
Single-Linkage Clustering(hierarchical)
The N*N proximity matrix is D = [d(i,j)]
The clusterings are assigned sequence numbers 0, 1, ..., (n-1)
L(k) is the level of the kth clustering
A cluster with sequence number m is denoted (m)
The proximity between clusters (r) and (s) is denoted d[(r),(s)]
Mu-Yu Lu, SJSU
Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s), according to d[(r),(s)] = min d[(i),(j)] where the minimum is over all pairs of clusters in the current clustering.
Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r),(s)]
Update the proximity matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined in this way: d[(k),(r,s)] = min{d[(k),(r)], d[(k),(s)]}
The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1. Then we compute the distance from this new compound object to all other objects. In single link clustering the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object. So the distance from "MI/TO" to RM is chosen to be 564, which is the distance from MI to RM, and so on.
min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM; L(NA/RM) = 219, m = 2
min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM; L(BA/NA/RM) = 255, m = 3
min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM; L(BA/FI/NA/RM) = 268, m = 4
Finally, we merge the last two clusters at level 295. The process is summarized by the following hierarchical tree:
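The merge sequence above can be reproduced with a short sketch of the single-linkage procedure. The distance table itself does not appear in this text: only the values 138, 219, 255, 268, 295 and 564 are quoted here, and the remaining entries below are assumed for illustration. The function name and data layout are likewise choices made for this example.

```python
from itertools import combinations


def single_linkage(labels, dist):
    """Merge clusters until one is left; return (cluster name, level) per merge.

    dist maps frozenset({a, b}) of two labels to their distance.
    """
    clusters = [frozenset([x]) for x in labels]
    # Proximity matrix, keyed by an unordered pair of clusters.
    prox = {frozenset([a, b]): dist[a | b] for a, b in combinations(clusters, 2)}
    merges = []
    while len(clusters) > 1:
        pair = min(prox, key=prox.get)   # least dissimilar pair (r), (s)
        r, s = pair
        level = prox[pair]
        merged = r | s
        rest = [c for c in clusters if c != r and c != s]
        # Drop the rows/columns of r and s, then add one for the merged cluster.
        new_prox = {frozenset([a, b]): prox[frozenset([a, b])]
                    for a, b in combinations(rest, 2)}
        for k in rest:
            new_prox[frozenset([k, merged])] = min(prox[frozenset([k, r])],
                                                   prox[frozenset([k, s])])
        clusters = rest + [merged]
        prox = new_prox
        merges.append(("/".join(sorted(merged)), level))
    return merges


cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
# Entries other than 138, 219, 255, 268, 295 and 564 are assumed values.
table = {
    ("BA", "FI"): 662, ("BA", "MI"): 877, ("BA", "NA"): 255,
    ("BA", "RM"): 412, ("BA", "TO"): 996, ("FI", "MI"): 295,
    ("FI", "NA"): 468, ("FI", "RM"): 268, ("FI", "TO"): 400,
    ("MI", "NA"): 754, ("MI", "RM"): 564, ("MI", "TO"): 138,
    ("NA", "RM"): 219, ("NA", "TO"): 869, ("RM", "TO"): 669,
}
dist = {frozenset(pair): d for pair, d in table.items()}
merges = single_linkage(cities, dist)
for name, level in merges:
    print(name, level)
```

Because each merge only needs the minimum of two existing rows, the proximity matrix can be updated in place of recomputing all member-to-member distances.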
K-means algorithm
1. It accepts the number of clusters to group data into, and the dataset to cluster, as input values.
2. It then creates the first K initial clusters (K = number of clusters needed) from the dataset by choosing K rows of data randomly from the dataset. For example, if there are 10,000 rows of data in the dataset and 3 clusters need to be formed, then the first K = 3 initial clusters will be created by selecting 3 records randomly from the dataset as the initial clusters. Each of the 3 initial clusters formed will have just one row of data.
3. The K-Means algorithm calculates the arithmetic mean of each cluster formed in the dataset. The arithmetic mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters, there is only one record. The arithmetic mean of a cluster with one record is the set of values that make up that record. For example, if the dataset we are discussing is a set of Height, Weight and Age measurements for students in a university, where a record P in the dataset S is represented by a Height, Weight and Age measurement, then P = {Age, Height, Weight}. A record containing the measurements of a student John would then be represented as John = {20, 170, 80}, where John's Age = 20 years, Height = 170 cm (1.70 metres) and Weight = 80 pounds. Since there is only one record in each initial cluster, the arithmetic mean of a cluster with only the record for John as a member is {20, 170, 80}.
4. Next, K-Means assigns each record in the dataset to only one of the initial clusters. Each record is assigned to the nearest cluster (the cluster which it is most similar to) using a measure of distance or similarity like the Euclidean Distance Measure or Manhattan/City-Block Distance Measure.
5. K-Means re-assigns each record in the dataset to the most similar cluster and recalculates the arithmetic mean of all the clusters in the dataset. The arithmetic mean of a cluster is the arithmetic mean of all the records in that cluster. For example, if a cluster contains two records, John = {20, 170, 80} and Henry = {30, 160, 120}, then the arithmetic mean Pmean is represented as Pmean = {Agemean, Heightmean, Weightmean}, where Agemean = (20 + 30)/2, Heightmean = (170 + 160)/2 and Weightmean = (80 + 120)/2. The arithmetic mean of this cluster is {25, 165, 100}. This new arithmetic mean becomes the center of this new cluster. Following the same procedure, new cluster centers are formed for all the existing clusters.
6. K-Means re-assigns each record in the dataset to only one of the new clusters formed. A record or data point is assigned to the nearest cluster (the cluster which it is most similar to) using a measure of distance or similarity.
7. The preceding steps are repeated until stable clusters are formed and the K-Means clustering procedure is completed. Stable clusters are formed when new iterations of the K-Means clustering algorithm do not create new clusters, because the cluster center (arithmetic mean) of each cluster formed is the same as the old cluster center. There are different techniques for determining when a stable cluster is formed or when the K-Means clustering procedure is completed.
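The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation: the function names, the fixed random seed, the iteration cap, and every student record other than John and Henry are assumptions made for the example. Stability is tested here by checking that the centers stop changing, which is only one of the possible stopping techniques.

```python
import random


def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def mean(records):
    """Arithmetic mean of a cluster, computed attribute by attribute."""
    n = len(records)
    return tuple(sum(column) / n for column in zip(*records))


def k_means(data, k, distance=euclidean, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(data, k)          # K random rows as initial clusters
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assign every record to its nearest center.
        clusters = [[] for _ in range(k)]
        for record in data:
            nearest = min(range(k), key=lambda i: distance(record, centers[i]))
            clusters[nearest].append(record)
        # Recompute each center; an empty cluster keeps its old center.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:         # stable clusters: stop
            break
        centers = new_centers
    return centers, clusters


# Records are (Age, Height, Weight). John and Henry come from the text;
# the other four students are hypothetical.
students = [
    (20, 170, 80), (21, 172, 82), (19, 168, 78),
    (30, 160, 120), (31, 158, 122), (29, 162, 118),
]
centers, clusters = k_means(students, k=2)
print(mean([(20, 170, 80), (30, 160, 120)]))
print(sorted(centers))
```

Passing `distance=manhattan` switches to the city-block measure without changing the rest of the procedure.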
Classification
Goal: Provide an overview of the classification problem and introduce some of the basic algorithms
Classification Examples
Teachers classify students' grades as A, B, C, D, or F.
Identify mushrooms as poisonous or edible.
Predict when a river will flood.
Identify individuals with credit risks.
Speech recognition
Pattern recognition
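[Figure: decision tree assigning letter grades from threshold tests on a score x; visible branch labels: <80, <70, <50, >=80, >=70, >=60; visible classes: B, C, D]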
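The grading example can be written as a chain of threshold tests. This sketch assumes the cutoffs 80, 70 and 60 visible in the figure labels; the cutoff of 90 for an A and the rule that everything below 60 is an F are assumptions, since they cannot be recovered from the figure.

```python
def grade(score):
    """Classify a numeric score into a predefined letter-grade class."""
    if score >= 90:   # assumed cutoff for A
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"        # assumed: every remaining score is an F


print([grade(s) for s in (95, 85, 75, 65, 40)])
```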
Classification Techniques
Approach:
1. Create a specific model by evaluating training data (or using domain experts' knowledge).
2. Apply the model developed to new data.
Classes must be predefined.
Most common techniques use decision trees (DTs), neural networks (NNs), or are based on distances or statistical methods.
Defining Classes
Distance Based
Partitioning Based
Division
Prediction
Algorithm: KNN
KNN
KNN Algorithm
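The slide content for KNN is not preserved in this text, so the following is only a sketch of the standard k-nearest-neighbours idea: a new item receives the majority class among the k closest training items. The function name, the choice of Euclidean distance, and the training points are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_classify(query, training, k=3):
    """training: list of (point, label) pairs. Returns the majority label
    among the k training points closest to query."""
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-class training set.
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]
print(knn_classify((2, 2), training))
print(knn_classify((6, 5), training))
```

The result depends on k and on the distance measure, so both are usually treated as choices to validate against held-out data.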
Decision Tree
Given:
D = {t1, ..., tn} where ti = <ti1, ..., tih>
Database schema contains {A1, A2, ..., Ah}
Classes C = {C1, ..., Cm}
Decision or Classification Tree is a tree associated with D such that
Each internal node is labeled with attribute, Ai
Each arc is labeled with predicate which can be applied to attribute at parent
Each leaf node is labeled with a class, Cj
DT Induction
Comparing DTs
Balanced
Deep
ID3
Creates the tree using information theory concepts and tries to reduce the expected number of comparisons. ID3 chooses the split attribute with the highest information gain, using entropy as the basis for the calculation.
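The split-selection rule can be sketched with the usual definitions of entropy and information gain. Only the choice of a single split attribute is shown, not the recursive tree construction. The function names and the small mushroom table are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    n = len(rows)
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return base - remainder


def best_split(rows, attributes, target):
    """Attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical training data: odor separates the classes, cap color does not.
rows = [
    {"odor": "foul", "cap": "brown", "class": "poisonous"},
    {"odor": "foul", "cap": "white", "class": "poisonous"},
    {"odor": "none", "cap": "brown", "class": "edible"},
    {"odor": "none", "cap": "white", "class": "edible"},
]
print(best_split(rows, ["odor", "cap"], "class"))
```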
Conclusion
Very useful in data mining
Applicable to both text and graphical data
Helps simplify data complexity
Classification
Detects hidden patterns in data
Reference
Dr. M.H. Dunham, http://engr.smu.edu/~mhd/dmbook/part2.ppt
Dr. Lee, Sin-Min, San Jose State University
Mu-Yu Lu, SJSU
Database System Concepts, Silberschatz, Korth, Sudarshan