Understanding Statistical Cliques: Cluster Analysis in R

What is Cluster Analysis?


Cluster analysis is a general term for methods that examine data in order to detect or uncover groups (clusters) of objects or individuals that are 1) internally homogeneous and 2) well separated from other clusters. In other words, clustering is the classification of objects into groups: the data are partitioned into clusters so that the objects within each cluster are as similar as possible on one or more traits, with similarity typically measured as the distance between observations.
Why would we want to use Cluster Analysis?
Cluster analysis is often used by marketers and psychologists to identify groupings or market
segments. In marketing, cluster analysis supports market segmentation, helping to determine
target segments, product positioning, and new product development. Rather than relying on
traditional demographic segmentation, cluster analysis allows marketers to segment on a variety
of psychographic characteristics. For example, marketers may want to determine groups to
target. Rather than simply targeting consumers by age (e.g. under 25, between 26 and 49, and
over 50) or by gender, they may want to cluster consumers based upon a large number of
variables. By examining age, income, gender, interests, children, marital status, and a variety of
other demographic and psychographic variables, marketers can develop clusters or segments.
For instance, they may want to target soccer moms (e.g. married women between 35 and 50
years old, with at least two children under the age of 18, who live in a suburban area with a
certain household income level) differently than they would target yuppies (young urban
professionals) (e.g. consumers of either gender between the ages of 23 and 35 who make a
certain income, live in a city, eat out often, shop at certain types of stores). By cluster analyzing
consumers based upon a variety of characteristics, marketers can segment them into
appropriate groups based on a combination of factors.
Cluster Analysis procedure
1. Select the variables that you want to analyze.
2. Select a distance measure (e.g. Euclidean distance, Manhattan distance).
3. Select a clustering procedure (e.g. k-means, hierarchical clustering).
4. Decide on the number of clusters.
5. Map and interpret the clusters.

Common Distance Measures
See ?dist in appendix
In order to perform cluster analysis, it is often important to decide which distance measure
will be used to assess similarity and determine clusters. This choice can have an important effect on
the analysis and the conclusions drawn from it. There are several key distance measures used in
cluster analysis. Typically, researchers use Euclidean distance (the default in R). In most cases
the Manhattan distance yields similar results; however, with the Manhattan distance the effect
of a single large difference (such as one created by an outlier) is lessened. The Euclidean distance
measure also requires that the distance between any two points cannot exceed the sum of the
distances between the other two pairs of points; in other words, the distances must be able to form
a triangle. When that requirement is a concern, we may need to use Manhattan distance. The possible distance
measures include:
Euclidean Distance: This tends to be the most common measure used in cluster analysis.
Euclidean distance is computed using the Pythagorean theorem and gives the shortest
distance between two points, also described as the distance "as the crow flies". For two
points P1 = (X1, Y1) and P2 = (X2, Y2), the distance is computed as:
sqrt((X1 - X2)^2 + (Y1 - Y2)^2)
Manhattan Distance: The distance between two points measured along axes at right
angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), the distance is computed as:
|x1 - x2| + |y1 - y2|
Mahalanobis Distance: This measure takes the covariance (correlation) structure of the
variables into account when measuring how far apart observations are. This is an important measure
since it is unit invariant, which means it is not dependent on the scale of the
measurements. It can be used to determine the similarity of an unknown sample set to a
known one. Formally, for a multivariate vector x = (x1, x2, x3, ..., xp)^T from a group of values with mean
μ = (μ1, μ2, μ3, ..., μp)^T and covariance matrix Σ, the Mahalanobis distance is
defined as:
D_M(x) = sqrt((x - μ)^T Σ^(-1) (x - μ))
See ?mahalanobis in appendix
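
As a quick, hedged sketch of how these measures behave in R (the two points and the small matrix below are made up purely for illustration):

# Two made-up points in the plane, one per row
p <- rbind(c(1, 2), c(4, 6))
dist(p, method = "euclidean")   # sqrt((1-4)^2 + (2-6)^2) = 5
dist(p, method = "manhattan")   # |1-4| + |2-6| = 7

# Squared Mahalanobis distance of each row of a made-up matrix from the column means
x <- matrix(rnorm(50 * 3), ncol = 3)
mahalanobis(x, center = colMeans(x), cov = cov(x))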

Below is a graphic example. In the diagram, the blue line represents the Manhattan distance and
the green line represents the Euclidean distance from point A to point B.

[Diagram: two routes from A to B, a stepped blue path (Manhattan distance) and a straight green line (Euclidean distance)]
Clustering procedures
There are several types of clustering methods:
Non-hierarchical clustering (also called k-means clustering): In this analysis, we choose
k, the number of clusters. Each of these clusters is assigned a centroid (or center). The
initial centroids can be chosen randomly, but it is important for researchers to recognize
that different starting locations of the centroids can lead to different results. Next, we determine
the distance of each object to its nearest centroid and assign it to that cluster. Then, we recalculate new
centroids from the clusters of the previous step. After we have these new
centroids, we reassign the points to their nearest centroid. We group each
object by minimum distance to a centroid and continue doing so until we reach
convergence and stability (i.e. the centroids no longer move). The goal of k-means
cluster analysis is to minimize the summed distance between all data points and their
cluster centroids. The diagram to the right illustrates the procedure. This process does not
always find the optimal configuration, and results can easily be affected by the randomly
selected starting centroids. Outliers may also have a strong effect on results.
See ?kmeans in appendix
Hierarchical clustering: In hierarchical clustering,
objects are organized into a hierarchical structure
as part of the procedure. We start with n points, each in its own cluster (n total clusters). Then we look for the closest
two clusters (using one of several distance
measures explained above). This leaves us with
n-1 total clusters, with all but one containing a
single element. We continue this process using
distance between cluster centroids. This
agglomerative process uses a bottom-up
strategy to cluster single elements into

R Code:
plot(hclust(dist(Cluster.data,main="Cluster
Dendogram of Marketing data")

4
successively larger clusters. There are several methods for agglomerative clustering,
including:
o Centroid methods: Clusters are generated that maximize the distance between
the centers of clusters (a centroid is the mean value for all the objects in the
cluster).
o Variance methods: Clusters are generated that minimize the within-cluster
variance.
o Ward's procedure: Clusters are generated that minimize the squared Euclidean
distance to the cluster mean.
o Linkage methods: Objects are clustered based on the distance between them:
   - Single linkage method: clusters objects based on the minimum distance
     between them (also called the nearest-neighbor rule).
   - Complete linkage method: clusters objects based on the maximum
     distance between them (also called the furthest-neighbor rule).
   - Average linkage method: clusters objects based on the average distance
     between all pairs of objects (one member of the pair must be from a
     different cluster).
Each successive merge occurs at a greater distance between clusters than the previous one, and the
researcher must decide when to stop clustering. This happens when the clusters are
considered too far apart to merge or when a small enough number of clusters remains
(both determined by the researcher).
See ?hclust in appendix
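
As a hedged sketch (assuming the Cluster.data data frame loaded later in this handout; the object names here are just for illustration), the agglomeration methods can be compared side by side by passing the same distance matrix to hclust with different method arguments:

d <- dist(Cluster.data)                        # Euclidean distances (the default)
hc.single   <- hclust(d, method = "single")    # nearest neighbor
hc.complete <- hclust(d, method = "complete")  # furthest neighbor (the default)
hc.ward     <- hclust(d, method = "ward")      # Ward's minimum variance method (newer R versions call this "ward.D")
par(mfrow = c(1, 3))                           # three plots side by side
plot(hc.single, labels = FALSE, main = "Single linkage")
plot(hc.complete, labels = FALSE, main = "Complete linkage")
plot(hc.ward, labels = FALSE, main = "Ward's method")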
Mapping Clusters
Dendrograms: When conducting cluster analysis, it may be helpful to use visual
diagrams to better understand the data. For example, many marketers, psychology
researchers, and statisticians rely on illustrative techniques such as dendrograms to
analyze their results.
Hierarchical clustering is often represented using a hierarchy tree, or dendrogram,
in which the leaves are the original single-element clusters and each merge in the clustering
process is represented by a horizontal line (dendrograms are explained in detail in the following
section).
For example, suppose the following data points (a, b, c, d, e, and f) are to be clustered by distance.
The diagram below shows how the clusters would form:
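
As a hedged sketch of how such a toy dendrogram can be drawn in R (the six one-dimensional values below are made up purely for illustration):

toy <- matrix(c(1, 2, 8, 9, 20, 22), ncol = 1,
              dimnames = list(c("a", "b", "c", "d", "e", "f"), NULL))
plot(hclust(dist(toy)), main = "Toy dendrogram")   # a and b merge first, then c and d, and so on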

See ?dendrogram in appendix

Let's Try an Example!

We will use data from a marketing survey. Make sure to download the attached csv file
from the Wiki. Open the file before we get started. It should look something like this:

This survey includes data from 120 subjects. The measures taken include age, occupation
(coded on a 1-9 scale, with 0 indicating current unemployment), income (measured on a 1-9
scale), perceived humor of the advertisement (PH), attitude toward the advertisement (AAD),
and intentions to buy (ITB).
First, we need to load the data (see the attached file Cluster data.csv):
setwd("/temp")
Cluster.data <- read.csv(file = "Cluster.csv", header = TRUE, sep = ",")

Next, we will conduct a k-means clustering on this data. We have to specify the number of
clusters (centers) and the maximum number of iterations (iter.max). Both of these numbers are up
to the researcher. For this example, we will use 5 centers and iterate up to 10 times.
kmeans(Cluster.data,5,iter.max = 10)

This returns:
K-means clustering with 5 clusters of sizes 7, 17, 34, 34, 28

Cluster means:
       age    occup   income        ph       aad       itb
1 31.00000 5.571429 6.000000 15.857143 15.857143 15.428571
2 29.88235 6.294118 4.588235  9.352941  9.411765 10.058824
3 20.94118 6.735294 3.588235 15.088235 15.735294 12.323529
4 20.91176 6.264706 4.647059  9.764706 13.176471 12.058824
5 21.92857 7.178571 4.250000  7.785714  6.464286  9.464286

The output above gives the center (mean) of each of the 6 variables in each of the 5
clusters. The output below tells us which cluster each data point has been grouped into:

Clustering vector:
4 4 3 4 1 2 1 5 4 5 1 5 4 5 3 2 3 1 1 5 2 2 4 4 3 5 1 1 2 5 5 4 5 3 2 5
3 1 1 1 4 4 5 4 1 1 3 3 4 4 4 4 5 5 2 4 3 1 1 3 3 4 2 1 3 4 2 2 4 5 2 2
4 2 4 4 4 5 5 3 3 2 5 5 2 4 4 2 3 2 4 5 4 2 4 4 5 5 3 4 3 4 5 5 4 5 4 2
4 5 4 5 3 5 5 4 5 5 5 4

Within cluster sum of squares by cluster:
293.1429 644.3529 1097.5294 1194.0588 1087.8571

Available components:
"cluster" "centers" "withinss" "size"

Take some time to play with this data. How do the results change if we create 4 centers? Or
6? What about if we alter the maximum number of iterations? Cluster analysis leaves many
decisions up to the researcher, so it is important to understand how small changes in the
analysis can alter the end results.
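
One hedged way to explore this (a sketch, not the only approach) is to rerun kmeans for several values of k and compare the total within-cluster sum of squares; the "elbow" where the curve flattens suggests a reasonable number of clusters:

set.seed(123)   # fix the random starting centroids so the results are reproducible
wss <- sapply(2:8, function(k) sum(kmeans(Cluster.data, centers = k, iter.max = 10)$withinss))
plot(2:8, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")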
Now, let's try hierarchical clustering. We will do this using the hclust command. This function
performs a hierarchical cluster analysis using a set of dissimilarities for the objects
being clustered. As described earlier, each object starts in its own cluster and the
algorithm then proceeds by joining the two most similar clusters until there is just one cluster. The
hclust command allows us to specify the agglomeration method to be used (e.g. ward, complete,
or centroid):
hclust(dist(Cluster.data), method = "cen", members = NULL)

Next, we will want to plot the dendrogram in order to better analyze the data:
hc <- hclust(dist(Cluster.data))
plot(hc, main = "Cluster Dendrogram of Marketing data")

This returns the following diagram:

When using dendrograms, it can also be helpful to use the identify command. This command
allows us to highlight clusters based upon where we click on the dendrogram. If we want to, we
can specify N, the maximum number of clusters to be identified, as well as MAXCLUSTER, the
maximum number of clusters that can be produced by a cut. Here is an example:
identify(hc)

We can also use the rect.hclust command to draw rectangles around the branches of the
dendrogram, highlighting the corresponding clusters:
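
For example, here is a minimal sketch that redraws the dendrogram from above and boxes off five clusters (the choice of five clusters and the border color are arbitrary):

hc <- hclust(dist(Cluster.data))
plot(hc, labels = FALSE, main = "Cluster Dendrogram of Marketing data")
rect.hclust(hc, k = 5, border = "red")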

The dendrogram above (Cluster Dendrogram of Marketing data) uses the complete agglomeration
method and Euclidean distance (the defaults). We can try several changes. We will change the
method to centroid to see what happens to the diagram. We can also remove the labels,
which (because they are ID numbers) are uninformative and simply clutter the diagram.
Finally, we will set hang to a negative number, which causes the labels to hang down from 0.
hc1 <- hclust(dist(Cluster.data), method = "cen", members = NULL)
plot(hc1, labels = FALSE, hang = -1, main = "Cluster Dendrogram of Marketing data with Centroid method")

This returns the following diagram:

As we just saw, we have the option to change the agglomeration method. We can also change the
distance measure itself, and comparing the resulting distance matrices helps show how this
choice feeds into the clustering. If we want to calculate the distances using one of several
measures (e.g. Euclidean, Manhattan), we can do so with the following code, which computes the
distance matrix under two different measures:
dist(Cluster.data, method = "euclidean")
dist(Cluster.data, method = "manhattan")
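
As a small sketch, either distance matrix can then be handed to hclust in place of the default, so we can see how the choice of distance measure changes the tree:

hc.man <- hclust(dist(Cluster.data, method = "manhattan"))
plot(hc.man, labels = FALSE, main = "Cluster Dendrogram of Marketing data (Manhattan distance)")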

Next, we can try cutree. This command cuts a tree resulting from hclust into
several groups. We do this by specifying either the desired number of groups or the cut height. We
can specify k, an integer scalar or vector with the desired number(s) of groups, and/or h, a
numeric scalar or vector with the height(s) at which the tree should be cut. We have to specify at least
one of k or h; if we specify both, k overrides h. Here is the code:
hc <- hclust(dist(Cluster.data))
cutree(hc, k = 4, h = 2)

This returns a vector with group memberships ranging from 1 to 4:


1 1 1 1 2 3 2 3 1 3 2 3 1 3 1 3 4 2 2 3 3 3 1 1 2 3 2 2 3 3 2 1 3 1 3 3
1 2 3 2 1 1 3 1 2 2 4 2 1 1 1 1 3 3 3 3 1 2 2 4 1 1 3 2 1 3 3 3 3 1 3 3
1 3 1 1 1 3 1 1 1 3 3 3 3 1 1 3 1 3 1 3 1 3 2 1 1 3 1 1 1 1 3 3 2 3 1 3
3 3 3 3 1 3 3 3 3 3 3 1

Let's try a few other versions, altering the value of k or providing h instead of k:
hc <- hclust(dist(Cluster.data))
cutree(hc, h = 3)

This returns:
[A vector of group memberships for all 120 subjects; cutting at h = 3 yields more than 100 distinct groups, so the full output is not reproduced here.]

With so many groups, the second command (with h = 3) does not seem to provide much
valuable information. Therefore, it seems more helpful to specify k, the desired number of
groups. For example, if we were doing marketing research related to segmentation, we might
want 3-5 groups in order to target our audience without necessitating 20+ different marketing
plans or advertisements.
hc <- hclust(dist(Cluster.data))
cutree(hc, k = 3)
hc <- hclust(dist(Cluster.data))
cutree(hc, k = 5)

This last code returns:


1 1 1 1 2 3 2 3 1 3 4 3 1 3 1 3 5 2 2 3 3 3 1 1 2 3 2 2 3 3 4 1 3 1 3 3
1 2 3 2 1 1 3 1 2 2 5 2 1 1 1 1 3 3 3 3 1 2 2 5 1 1 3 2 1 3 3 3 3 1 3 3
1 3 1 1 1 3 1 1 1 3 3 3 3 1 1 3 1 3 1 3 1 3 4 1 1 3 1 1 1 1 3 3 4 3 1 3
3 3 3 3 1 3 3 3 3 3 3 1

Because there are only a handful of 4s and 5s out of the total 120 subjects (four and three,
respectively), it appears that we don't gain much by adding the additional possible
groups.
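
A quick, hedged way to check these group sizes is to tabulate the cutree output directly:

hc <- hclust(dist(Cluster.data))
table(cutree(hc, k = 4))   # how many subjects fall into each of 4 groups
table(cutree(hc, k = 5))   # how many subjects fall into each of 5 groups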
If we wanted to segment participants using cluster analysis, we could settle on 3-4 clusters in
which to group people. Based upon age, occupation, income, perceived humor, attitude toward
a provided advertisement, and intentions to buy, we could sort these people into appropriate
groups. With kmeans, hclust, cutree, and the dendrogram, we can see how these groupings
occur naturally around 4 different groups.
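
As a final hedged sketch, the two approaches can be compared directly by cross-tabulating the k-means assignments against the hierarchical groups (the choice of 4 clusters here simply mirrors the discussion above, and results will vary with the random k-means starts):

set.seed(123)
km <- kmeans(Cluster.data, centers = 4, iter.max = 10)
hc <- hclust(dist(Cluster.data))
table(kmeans = km$cluster, hclust = cutree(hc, k = 4))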

Appendix
?cutree: Cut a tree into groups of data
Description:

Cuts a tree, e.g., as resulting from 'hclust', into several groups either by specifying the desired
number(s) of groups or the cut height(s).
Usage:
cutree(tree, k = NULL, h = NULL)
Arguments:
tree: a tree as produced by 'hclust'. 'cutree()' only expects a list with components 'merge',
'height', and 'labels', of appropriate content each.
k: an integer scalar or vector with the desired number of groups
h: numeric scalar or vector with heights where the tree should be cut.
At least one of 'k' or 'h' must be specified, 'k' overrides 'h' if both are given.
Value:
'cutree' returns a vector with group memberships if 'k' or 'h' are scalar, otherwise a matrix with
group memberships is returned where each column corresponds to the elements of 'k' or 'h',
respectively (which are also used as column names).
?dendrogram: Dendrogram
Description:
Class '"dendrogram"' provides general functions for handling tree-like structures. It is
intended as a replacement for similar functions in hierarchical clustering and
classification/regression trees, such that all of these can use the same engine for plotting or
cutting trees.
Usage:
as.dendrogram(object, ...)
Arguments:
object: any R object that can be made into one of class '"dendrogram."'
x: object of class '"dendrogram."'
hang: numeric scalar indicating how the _height_ of leaves should be computed from the
heights of their parents; see 'plot.hclust'.
type: type of plot.
center: logical; if 'TRUE', nodes are plotted centered with respect to the leaves in the branch.
Otherwise (default), plot them in the middle of all direct child nodes.
edge.root: logical; if true, draw an edge to the root node.
nodePar: a 'list' of plotting parameters to use for the nodes (see 'points') or 'NULL' by default
which does not draw symbols at the nodes. The list may contain components named 'pch',
'cex', 'col', and/or 'bg' each of which can have length two for specifying separate attributes for
_inner_ nodes and _leaves_.
edgePar: a 'list' of plotting parameters to use for the edge 'segments' and labels (if there's an
'edgetext'). The list may contain components named 'col', 'lty' and 'lwd' (for the segments),
'p.col', 'p.lwd', and 'p.lty' (for the 'polygon' around the text) and 't.col' for the text color. As with
'nodePar', each can have length two for differentiating leaves and inner nodes.
leaflab: a string specifying how leaves are labeled. The default '"perpendicular"' writes text
vertically, '"textlike"' writes text horizontally (in a rectangle), and '"none"'
suppresses leaf labels.
dLeaf: a number specifying the *d*istance in user coordinates between the tip of a leaf and its
label. If 'NULL' as per default, 3/4 of a letter width or height is used.

horiz: logical indicating if the dendrogram should be drawn _horizontally_ or not.
frame.plot: logical indicating if a box around the plot should be drawn, see 'plot.default'.
h: height at which the tree is cut.
digits: integer specifying the precision for printing, see 'print.default'.
max.level, digits.d, give.attr, wid, nest.lev, indent.str: arguments to 'str', see 'str.default()'. Note
that 'give.attr = FALSE' still shows 'height' and 'members' attributes for each node.
stem: a string used for 'str()' specifying the _stem_ to use for each dendrogram branch.

Details:
Each node of the tree carries some information needed for efficient plotting or cutting as
attributes, of which only 'members', 'height' and 'leaf' for leaves are compulsory:
'members' total number of leaves in the branch
'height' numeric non-negative height at which the node is plotted.
'midpoint' numeric horizontal distance of the node from the left border (the leftmost leaf) of the
branch (unit 1 between all leaves). This is used for 'plot(*, center=FALSE)'.
'label' character; the label of the node
'x.member' for 'cut()$upper', the number of _former_ members; more generally a substitute for
the 'members' component used for horizontal (when 'horiz = FALSE', else vertical)
alignment.
'edgetext' character; the label for the edge leading to the node
'nodePar' a named list (of length-1 components) specifying node-specific attributes for 'points'
plotting, see the 'nodePar' argument above.
'edgePar' a named list (of length-1 components) specifying attributes for 'segments' plotting of
the edge leading to the node, and drawing of the 'edgetext' if available, see the 'edgePar'
argument above.
'leaf' logical, if 'TRUE', the node is a leaf of the tree.

?dist: Distance Matrix Computation


Description:
This function computes and returns the distance matrix computed by using the specified
distance measure to compute the distances between the rows of a data matrix.
Usage:
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
Arguments:
x: a numeric matrix, data frame or '"dist"' object.
method: the distance measure to be used. This must be one of '"euclidean"', '"maximum"',
'"manhattan"', '"canberra"', '"binary"' or '"minkowski"'.
diag: logical value indicating whether the diagonal of the distance matrix should be printed by
'print.dist'.
upper: logical value indicating whether the upper triangle of the distance matrix should be
printed by 'print.dist'.
p: The power of the Minkowski distance.
m: An object with distance information to be converted to a '"dist"' object. For the default
method, a '"dist"' object, or a matrix (of distances) or an object which can be coerced to such a
matrix using 'as.matrix()'. (Only the lower triangle of the matrix is used, the rest is ignored).

digits, justify: passed to 'format' inside of 'print()'.
Details:
Available distance measures are (written for two vectors x and y):
'euclidean': Usual square distance between the two vectors (2 norm).
'maximum': Maximum distance between two components of x and y (supremum norm)
'manhattan': Absolute distance between the two vectors (1 norm).
'canberra': sum(|x_i - y_i| / |x_i + y_i|). Terms with zero
numerator and denominator are omitted from the sum and treated as if the values were
missing.
'binary': (aka _asymmetric binary_): The vectors are regarded as binary bits, so non-zero
elements are on and zero elements are off. The distance is the _proportion_ of bits in which
only one is on amongst those in which at least one is on.
'minkowski': The p norm, the pth root of the sum of the pth powers of the differences of the
components.
Value:

'dist' returns an object of class '"dist"'.


Size: integer, the number of observations in the dataset.
Labels: optionally, contains the labels, if any, of the observations of the dataset
Diag, Upper: logicals corresponding to the arguments 'diag' and 'upper' above, specifying how
the object should be printed.
call: optionally, the 'call' used to create the object.
method: optionally, the distance method used; resulting from 'dist()', the ('match.arg()'ed)
'method' argument.

?hclust: Hierarchical Clustering


Description:
Hierarchical cluster analysis on a set of dissimilarities and methods for analyzing it.
Usage:
hclust(d, method = "complete", members=NULL)
Arguments:
d: a dissimilarity structure as produced by 'dist'.
method: the agglomeration method to be used. This should be (an unambiguous abbreviation
of) one of '"ward"', '"single"', '"complete"', '"average"', '"mcquitty"', '"median"' or "centroid"'.
members: 'NULL' or a vector with length size of 'd'. See the Details section.
x,tree: an object of the type produced by 'hclust'.
hang: The fraction of the plot height by which labels should hang below the rest of the plot. A
negative value will cause the labels to hang down from 0.
labels: A character vector of labels for the leaves of the tree. By default the row names or row
numbers of the original data are used. If 'labels=FALSE' no labels at all are plotted.
axes, frame.plot, ann: logical flags as in 'plot.default'.
main, sub, xlab, ylab: character strings for 'title'. 'sub' and 'xlab' have a non-NULL default
when there's a 'tree$call'.
unit: logical. If true, the splits are plotted at equally-spaced heights rather than at the height in
the object.
hmin: numeric. All heights less than 'hmin' are regarded as being 'hmin': this can be used to
suppress detail at the bottom of the tree.

level, square, plot.: as yet unimplemented arguments of 'plclust' for S-PLUS compatibility.
Details:
This function performs a hierarchical cluster analysis using a set of dissimilarities for the n
objects being clustered. Initially, each object is assigned to its own cluster and then the
algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing
until there is just a single cluster.
A number of different clustering methods are provided. _Ward's_ minimum variance method
aims at finding compact, spherical clusters. The _complete linkage_ method finds similar
clusters. The _single linkage_ method (which is closely related to the minimal spanning tree)
adopts a friends of friends clustering strategy. The other methods can be regarded as aiming
for clusters with characteristics somewhere between the single and complete link methods.
Note however, that methods '"median"' and '"centroid"' are _not_ leading to a _monotone
distance_ measure, or equivalently the resulting dendrograms can have so called
_inversions_ (which are hard to interpret).
If 'members!=NULL', then 'd' is taken to be a dissimilarity matrix between clusters instead of
dissimilarities between singletons and 'members' gives the number of observations per
cluster. This way the hierarchical cluster algorithm can be started in the middle of the
dendrogram, e.g., in order to reconstruct the part of the tree above a cut (see examples).
Dissimilarities between clusters can be efficiently computed (i.e., without 'hclust' itself) only for
a limited number of distance/linkage combinations, the simplest one being squared Euclidean
distance and centroid linkage. In this case the dissimilarities between the clusters are the
squared Euclidean distances between cluster means.
Value: An object of class *hclust* which describes the tree produced by the clustering process. The
object is a list with components:
merge: an n-1 by 2 matrix. Row i of 'merge' describes the merging of clusters at step i of the
clustering. If an element j in the row is negative, then observation -j was merged at this
stage. If j is positive then the merge was with the cluster formed at the (earlier) stage j of the
algorithm. Thus negative entries in 'merge' indicate agglomerations of singletons, and
positive entries indicate agglomerations of non-singletons.
height: a set of n-1 non-decreasing real values. The clustering _height_: that is, the value of
the criterion associated with the clustering 'method' for the particular agglomeration.
order: a vector giving the permutation of the original observations suitable for plotting, in the
sense that a cluster plot using this ordering and matrix 'merge' will not have crossings of the
branches.
labels: labels for each of the objects being clustered.
call: the call which produced the result.
method: the cluster method that has been used.
dist.method: the distance that has been used to create 'd' (only returned if the distance
object has a '"method"' attribute).
?identify.hclust: Identify Clusters in a Dendrogram
Description:
'identify.hclust' reads the position of the graphics pointer when the (first) mouse button is
pressed. It then cuts the tree at the vertical position of the pointer and highlights the cluster
containing the horizontal position of the pointer. Optionally a function is applied to the index of
data points contained in the cluster.


Usage:
identify(x, FUN = NULL, N = 20, MAXCLUSTER = 20, DEV.FUN = NULL,...)
Arguments:
x: an object of the type produced by 'hclust'.
FUN: (optional) function to be applied to the index numbers of the data points in a cluster (see
Details below).
N: the maximum number of clusters to be identified.
MAXCLUSTER: the maximum number of clusters that can be produced by a cut (limits the
effective vertical range of the pointer).
DEV.FUN: (optional) integer scalar. If specified, the corresponding
graphics device is made active before 'FUN' is applied.
Details:
By default clusters can be identified using the mouse and an 'invisible' list of indices of the
respective data points is returned.
If 'FUN' is not 'NULL', then the index vector of data points is passed to this function as first
argument, see the examples below. The active graphics device for 'FUN' can be specified
using 'DEV.FUN'.
?kmeans: K-Means Clustering
Description:
Perform k-means clustering on a data matrix.
Usage:
kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
"MacQueen"))
Arguments:
x: A numeric matrix of data, or an object that can be coerced to such a matrix (such as
numeric vector or a data frame with all numeric columns)
centers: Either the number of clusters or a set of initial (distinct) cluster centers. If a number, a
random set of (distinct) rows in 'x' is chosen as the initial centers.
iter.max: The maximum number of iterations allowed.
nstart: If 'centers' is a number, how many random sets should be chosen?
algorithm: character: may be abbreviated.
Details:
The data given by 'x' is clustered by the k-means method, which aims to partition the points
into k groups such that the sum of squares from points to the assigned cluster centres is
minimized.
Value:

An object of class '"kmeans"' which is a list with components:


cluster: A vector of integers indicating the cluster to which each point is allocated.
centers: A matrix of cluster centers.
withinss: The within-cluster sum of squares for each cluster.
size: The number of points in each cluster.


?mahalanobis: Mahalanobis Distance


Description:
Returns the squared Mahalanobis distance of all rows in 'x' and the vector mu='center' with
respect to Sigma='cov'. This is (for vector 'x') defined as D^2 = (x - mu)' Sigma^{-1} (x - mu)
Usage:
mahalanobis(x, center, cov, inverted=FALSE, ...)
Arguments:
x: vector or matrix of data with, say, p columns.
center: mean vector of the distribution or second data vector of length p.
cov: covariance matrix (p x p) of the distribution.
inverted: logical. If 'TRUE', 'cov' is supposed to contain the _inverse_ of the covariance
matrix.
?rect.hclust: Draw Rectangles Around Hierarchical Clusters
Description:
Draws rectangles around the branches of a dendrogram highlighting the corresponding
clusters. First the dendrogram is cut at a certain level, then a rectangle is drawn around
selected branches.
Usage:
rect.hclust(tree, k = NULL, which = NULL, x = NULL, h = NULL, border = 2, cluster = NULL)
Arguments:
tree: an object of the type produced by 'hclust'.
k, h: Scalar. Cut the dendrogram such that either exactly 'k' clusters are produced or by cutting
at height 'h'.
which, x: A vector selecting the clusters around which a rectangle should be drawn. 'which'
selects clusters by number (from left to right in the tree), 'x' selects clusters containing the
respective horizontal coordinates. Default is 'which = 1:k'.
border: Vector with border colors for the rectangles.
cluster: Optional vector with cluster memberships as returned by 'cutree(hclust.obj, k = k)', can
be specified for efficiency if already computed.
