
Cluster analysis

In cluster analysis we search for patterns in a data set by grouping the (multivariate) observations into clusters.
The goal is to find an optimal grouping for which the observations or objects within each cluster are similar, but the clusters are dissimilar to each other.
We hope to find the natural groupings in the data, groupings that make sense to the researcher.
What is Cluster Analysis?
Cluster analysis . . . is a group of
multivariate techniques whose primary
purpose is to group objects based on the
characteristics they possess.

It has been referred to as Q analysis (a


mathematical framework to describe and
analyze structures), typology
construction, classification analysis, and
numerical taxonomy.

The essence of all clustering approaches


is the classification of data as suggested
by natural groupings of the data
themselves.
Three Cluster Diagram Showing Between-Cluster and Within-Cluster Variation
Between-Cluster Variation = Maximize
Within-Cluster Variation = Minimize
Scatter Diagram for Cluster Observations
[Figure, repeated across four slides: a scatter plot with frequency of eating out (low to high) on the vertical axis and frequency of going to fast food restaurants (low to high) on the horizontal axis.]
Criticisms of Cluster Analysis
The following must be addressed by
conceptual rather than empirical
support:

Cluster analysis is descriptive, atheoretical, and noninferential.
Cluster analysis will always create clusters, regardless of the actual existence of any structure in the data.
The cluster solution is not generalizable
because it is totally dependent upon the
variables used as the basis for the
similarity measure.
What Can We Do With Cluster Analysis?
1. Determine if statistically different clusters exist.
2. Identify the meaning of the clusters.
3. Explain how the clusters can be used.
Research Questions in
Cluster Analysis
The primary objective of cluster analysis is to
define the structure of the data by placing the
most similar observations into groups. To do
so, we must answer three questions:
How do we measure similarity?
How do we form clusters?
How many groups do we form?
Stage 1: Objectives of
Cluster Analysis
Primary Goal = to partition a set of objects
into two or more groups based on the
similarity of the objects for a set of specified
characteristics (the cluster variate).

Two key issues:
The research questions being addressed, and
The variables used to characterize objects in the clustering process.
Other Research Questions?
Three basic questions . . .
How to form the taxonomy: an empirically based classification of objects.
How to simplify the data: by grouping observations for further analysis.
Which relationships can be identified: the process reveals relationships among the observations.
Selecting Cluster
Variables
Two Issues . . .
1. Conceptual considerations include only
variables that . . .
Characterize the objects being
clustered
Relate specifically to the objectives of
the cluster analysis
2. Practical considerations.
Cluster analysis can be affected dramatically by the inclusion of one or two inappropriate or undifferentiated variables: eliminate variables that are not distinctive.
Rules of Thumb 9-1
OBJECTIVES OF CLUSTER ANALYSIS
Cluster analysis is used for:
Taxonomy description: identifying natural groups within the data.
Data simplification: the ability to analyze groups of similar observations instead of all individual observations.
Relationship identification: the simplified structure from cluster analysis portrays relationships not revealed otherwise.
Theoretical, conceptual and practical considerations must be observed when selecting clustering variables for cluster analysis:
Only variables that relate specifically to objectives of the cluster analysis are included, since irrelevant variables cannot be excluded from the analysis once it begins.
Variables are selected which characterize the objects being clustered.
Stage 2: Research
Design in Cluster
Analysis
Four Questions . . .
Is the sample size adequate?
Can outliers be detected and, if so, should they be deleted?
How should object similarity be
measured?
Should the data be standardized?
Measuring Similarity
Interobject similarity is an empirical measure
of correspondence, or resemblance, between
objects to be clustered.

It can be measured in a variety of ways, but a convenient measure of proximity is the distance between two observations.
Since distance increases as two units become further apart, distance is actually a measure of dissimilarity.
Types of Distance Measures
Euclidean distance: the most common measure of
distance.
Squared (or absolute) Euclidean distance: the sum of squared distances; the recommended measure for the centroid and Ward's methods of clustering.
City-block (Manhattan) distance: uses the sum of the absolute differences of the variables. It is the simplest procedure, but may lead to invalid clusters if the clustering variables are highly correlated.
Chebyshev distance: the maximum absolute difference across the variables.
Mahalanobis distance (D2): accounts for variable intercorrelations and weights each variable equally. When variables are highly intercorrelated, Mahalanobis distance is most appropriate.
Euclidean distance
$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

City-block (Manhattan) distance
$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
Chebyshev distance
The Chebyshev distance between two vectors or points p and q, with standard coordinates $p_i$ and $q_i$, is
$D(p, q) := \max_i |p_i - q_i|$
Example, if $p = (x_1, y_1)$ and $q = (x_2, y_2)$:
$D = \max(|x_2 - x_1|, |y_2 - y_1|)$
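To make the preceding formulas concrete, here is a minimal sketch in Python (with NumPy/SciPy, which the slides do not prescribe) computing each measure for one pair of points:

```python
import numpy as np
from scipy.spatial import distance

x = np.array([2.0, 5.0])
y = np.array([4.0, 2.0])

print(distance.euclidean(x, y))    # sqrt((2-4)^2 + (5-2)^2) = sqrt(13), about 3.606
print(distance.sqeuclidean(x, y))  # squared Euclidean distance = 13
print(distance.cityblock(x, y))    # |2-4| + |5-2| = 5
print(distance.chebyshev(x, y))    # max(|2-4|, |5-2|) = 3
```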
Mahalanobis distance
The Mahalanobis distance is a measure of the distance between a point P and a distribution D.
It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D. This distance is zero if P is at the mean of D, and grows as P moves away from the mean: it measures the number of standard deviations from P to the mean of D. If each of the principal axes of D is rescaled to have unit variance, then the Mahalanobis distance corresponds to the standard Euclidean distance in the transformed space. The Mahalanobis distance is thus unitless and scale-invariant, and takes into account the correlations of the data.
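A small, hypothetical illustration of the idea, using scipy.spatial.distance.mahalanobis, which expects the inverse covariance matrix; the sample and the query point are made up:

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
# correlated bivariate sample standing in for the distribution D
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[4.0, 3.0], [3.0, 9.0]], size=500)

mu = X.mean(axis=0)              # mean of D
VI = np.linalg.inv(np.cov(X.T))  # inverse covariance matrix, as SciPy expects
P = np.array([2.0, 6.0])         # the point P

# roughly the number of standardized units from P to the mean of D
print(distance.mahalanobis(P, mu, VI))
```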
Exercise
Three items have the following bivariate measurements (y1, y2): (2, 5), (4, 2), (7, 9).
Make a proximity matrix of Euclidean distances.
What happens if the scale of y1 is multiplied by 100 (e.g., changing from m to cm)?
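A worked sketch of this exercise in Python (assuming NumPy/SciPy):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

items = np.array([[2.0, 5.0], [4.0, 2.0], [7.0, 9.0]])
print(squareform(pdist(items)))
# off-diagonals: d(1,2) = sqrt(13) ~ 3.61, d(1,3) = sqrt(41) ~ 6.40, d(2,3) = sqrt(58) ~ 7.62

rescaled = items.copy()
rescaled[:, 0] *= 100  # y1 now on a scale 100 times larger
print(squareform(pdist(rescaled)))
# differences in y1 now dominate every distance; this is the motivation for standardization
```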
Exercise
Determine the Euclidean distance between Atlanta and Boston.
Rules of Thumb 9-2
Research Design in Cluster Analysis
The sample size required is not based on
statistical considerations for inference testing,
but rather:
Sufficient size is needed to ensure
representativeness of the population and its
underlying structure, particularly small
groups within the population.
Minimum group sizes are based on the
relevance of each group to the research
question and the confidence needed in
characterizing that group.
Rules of Thumb 9-2 continued . . .
Research Design in Cluster Analysis
Similarity measures calculated across the entire set of
clustering variables allow for the grouping of observations and
their comparison to each other.
Distance measures are most often used as a measure of
similarity, with higher values representing greater
dissimilarity (distance between cases) not similarity.
There are many different distance measures,
including:
Euclidean (straight line) distance is the most common
measure of distance.
Squared Euclidean distance is the sum of squared distances and is the recommended measure for the centroid and Ward's methods of clustering.
Mahalanobis distance accounts for variable
intercorrelations and weights each variable equally.
When variables are highly intercorrelated,
Mahalanobis distance is most appropriate.
Less frequently used are correlational measures, where
large values do indicate similarity.
Rules of Thumb 9-2 Continued . . .
Research Design in Cluster Analysis
Given the sensitivity of some procedures to the similarity
measure used, the researcher should employ several distance
measures and compare the results from each with other
results or theoretical/known patterns.
Outliers can severely distort the representativeness of the results if they appear as structure (clusters) that are inconsistent with the research objectives.
They should be removed if the outlier represents:
Aberrant observations not representative of the
population
Observations of small or insignificant segments
within the population which are of no interest to the
research objectives
They should be retained if they represent an under-sampling/poor representation of relevant groups in the population. In this case, the sample should be augmented to ensure representation of these groups.
Rules of Thumb 9-2 Continued . . .

Research Design in Cluster Analysis


Outliers can be identified based on the similarity measure by:
Finding observations with large distances from all other
observations
Graphic profile diagrams highlighting outlying cases
Their appearance in cluster solutions as single-member or
very small clusters
Clustering variables should be standardized whenever
possible to avoid problems resulting from the use of different
scale values among clustering variables.
The most common standardization conversion is Z scores.
If groups are to be identified according to an individual's response style, then within-case or row-centering standardization is appropriate.
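A minimal sketch in Python (not part of the slides) of the two conversions just described, column Z scores and within-case row-centering:

```python
import numpy as np

X = np.array([[2.0, 500.0],
              [4.0, 200.0],
              [7.0, 900.0]])

# Z scores: every clustering variable gets mean 0 and standard deviation 1
z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# row-centering: subtracts each case's own mean response, leaving the
# pattern of responses across variables (the "response style" adjustment)
row_centered = X - X.mean(axis=1, keepdims=True)

print(z)
print(row_centered)
```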
Stage 3: Assumptions of
Cluster Analysis

Representativeness of the
sample.
Impact of multicollinearity.
Rules of Thumb 9-3
ASSUMPTIONS IN CLUSTER ANALYSIS
Input variables should be examined for
substantial multicollinearity and if
present . . .
Reduce the variables to equal numbers in
each set of correlated measures.
Use a distance measure that
compensates for the correlation, like
Mahalanobis Distance.
Take a proactive approach and include only cluster variables that are not highly correlated, as in the sketch below.
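One possible form of that proactive check, sketched in Python; the function name and the 0.9 cutoff are illustrative assumptions, not values from the slides:

```python
import numpy as np

def highly_correlated_pairs(X, cutoff=0.9):
    """Return index pairs of clustering variables with |r| above the cutoff."""
    r = np.corrcoef(X, rowvar=False)
    p = X.shape[1]
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(r[i, j]) > cutoff]

# example: column 1 is almost a copy of column 0, column 2 is independent
rng = np.random.default_rng(0)
a = rng.normal(size=100)
X = np.column_stack([a, a + rng.normal(scale=0.01, size=100), rng.normal(size=100)])
print(highly_correlated_pairs(X))  # -> [(0, 1)]
```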
Stage 4: Deriving Clusters and
Assessing Overall Fit

The researcher must . . .
Select the partitioning procedure used for forming clusters:
Hierarchical
Non-hierarchical
Decide on the number of clusters to be formed.
Two Types of Hierarchical
Clustering Procedures

1. Agglomerative Methods (buildup)
2. Divisive Methods (breakdown)


How Do Agglomerative Hierarchical Approaches Work?
Start with all observations as their own
cluster.
Using the selected similarity measure,
combine the two most similar observations
into a new cluster, now containing two
observations.
Repeat the clustering procedure using the
similarity measure to combine the two most
similar observations or combinations of
observations into another new cluster.
Continue the process until all observations
are in a single cluster.
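These steps map directly onto SciPy's hierarchical clustering routines; a minimal sketch, where the data and the single-linkage choice are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# illustrative observations; each starts as its own cluster
X = np.array([[2.0, 5.0], [4.0, 2.0], [7.0, 9.0], [6.0, 8.0], [3.0, 4.0]])

# 'single' linkage repeatedly merges the two closest clusters
# (nearest neighbors) until all observations sit in one cluster
Z = linkage(X, method='single', metric='euclidean')
print(Z)  # each row: the two clusters merged and the distance at that step

# cut the resulting tree to recover, say, two clusters
print(fcluster(Z, t=2, criterion='maxclust'))
```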
Agglomerative Algorithms
Single Linkage (nearest neighbor)
Complete Linkage (farthest neighbor)
Average Linkage
Centroid Method
Median Method
Ward's Method
Single Linkage (nearest neighbor)
Complete Linkage (farthest neighbor)
Average Linkage
Centroid method
Median method
Ward's method
Similarity between two clusters is not a single
measure of similarity but the sum of squares within
the clusters summed over all variables.
The selection of which two clusters to combine is
based on which combination of clusters minimizes
the within-cluster sum of squares across the
complete set of disjoint or separate clusters.
At each step, the two clusters combined are those
that minimize the increase in the total sum of
squares across all variables in all clusters.
The use of a sum-of-squares measure makes this method easily distorted by outliers.
Deriving Hierarchical Clusters
Hierarchical clustering methods differ in the
method of representing similarity between clusters,
each with advantages and disadvantages:
Single-linkage is probably the most versatile algorithm,
but poorly delineated cluster structures within the data
produce unacceptable snakelike chains for clusters.
Complete linkage eliminates the chaining problem, but
only considers the outermost observations in a cluster,
thus impacted by outliers.
Average linkage is based on the average similarity of all
individuals in a cluster and tends to generate clusters with
small within-cluster variation and is less affected by
outliers.
Centroid linkage measures distance between cluster
centroids and like average linkage, is less affected by
outliers.
Ward's method is based on the total sum of squares within clusters and is most appropriate when the researcher expects somewhat equally sized clusters. But it is easily distorted by outliers.
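A rough sketch, on illustrative data, that runs each of the linkage criteria above over the same observations so their solutions can be compared, in the spirit of the earlier advice to try several measures:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# two illustrative groups of observations
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
               rng.normal(5.0, 1.0, size=(20, 2))])

for method in ['single', 'complete', 'average', 'centroid', 'median', 'ward']:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion='maxclust')
    print(method, np.bincount(labels)[1:])  # cluster sizes under each criterion
```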
K-Means Clustering
Initially, the number of clusters must be known or chosen; call it K.
The initial step is to choose a set of K instances as centres of the clusters, often chosen such that the points are mutually farthest apart in some way.
Next, the algorithm considers each instance
and assigns it to the cluster which is closest.
The cluster centroids are recalculated either
after each instance assignment, or after the
whole cycle of re-assignments.
This process is iterated.
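A bare-bones NumPy sketch of the steps just described; random initialization is a simplification of the "mutually farthest apart" idea, and the helper name kmeans is our own:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initial centres: k instances chosen at random (a simplification)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each instance to the closest centre
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recalculate centroids after the whole cycle of re-assignments
        # (assumes no cluster ends up empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # assignments have stabilized
        centers = new_centers
    return labels, centers

# usage on made-up data
X = np.vstack([np.random.default_rng(2).normal(m, 0.5, size=(15, 2)) for m in (0.0, 4.0)])
labels, centers = kmeans(X, k=2)
print(labels, centers, sep="\n")
```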
Divisive hierarchical
clustering
Start with a single cluster composed of all data points.
Split this into components.
Continue recursively.
Monothetic divisive methods split clusters using one variable/dimension at a time.
Polythetic divisive methods make splits on the basis of all variables together.
Any intercluster distance measure can be used.
Computationally intensive and less widely used than agglomerative methods.
When to stop
A formalization of this procedure was proposed by Mojena:
Choose the number of groups given by the first stage in the dendrogram at which
$\alpha_j > \bar{\alpha} + k s_\alpha$, for $j = 1, 2, \ldots, n$,
where $\alpha_1, \alpha_2, \ldots, \alpha_n$ are the distance values for the stages with $n, n-1, \ldots, 1$ clusters, $\bar{\alpha}$ and $s_\alpha$ are the mean and standard deviation of the $\alpha$'s, and $k$ is a constant.
Mojena suggested using a value of k in the range 2.75 to 3.5, but Milligan and Cooper (1985) recommended k = 1.25, based on a simulation study.
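A hypothetical sketch of this rule on SciPy linkage output, where the fusion heights in Z[:, 2] play the role of the alpha values; the data and the choice of Ward's method are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(3).normal(size=(30, 2))  # illustrative data
Z = linkage(X, method='ward')

alphas = Z[:, 2]  # fusion height at each stage of the dendrogram
threshold = alphas.mean() + 1.25 * alphas.std(ddof=1)  # Milligan & Cooper's k = 1.25

# first fusion whose height exceeds the threshold; stop just before it
exceed = np.nonzero(alphas > threshold)[0]
n_clusters = len(X) - exceed[0] if exceed.size else 1
print(n_clusters)
```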
Stage 5: Interpretation of
the Clusters

This stage involves examining each cluster in terms of the cluster variate to name or assign a label accurately describing the nature of the clusters.
Stage 6: Validation and Profiling
of the Clusters
Validation . . .
Cross-validation
Criterion validity
Profiling . . . . describing the characteristics of
each cluster to explain how they may differ
on relevant dimensions. This typically
involves the use of discriminant analysis or
ANOVA.
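An illustrative sketch of the profiling step using one-way ANOVA; the cluster labels and the profiling variable are made up for the example:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(4)
labels = np.repeat([1, 2, 3], 20)  # three derived clusters (made up)
income = np.concatenate([rng.normal(m, 5.0, size=20) for m in (40.0, 55.0, 70.0)])

groups = [income[labels == c] for c in (1, 2, 3)]
print([round(g.mean(), 1) for g in groups])  # cluster profile on this dimension
print(f_oneway(*groups))  # a small p-value suggests the clusters differ on it
```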
Steps in Cluster
Analysis . . .
1. Select the variables.
2. Determine if clusters exist. To do so,
verify the clusters are statistically
different and theoretically meaningful (a
logical name can be assigned).
3. Decide how many clusters to use.
4. Describe the characteristics of the
derived clusters using demographics,
psychographics, etc.

Step 1: Cluster Analysis Variable
Selection
Variables are typically measured metrically, but the technique can be applied to non-metric variables.
Variables must be logically related
to a single underlying concept or
construct.

Description of HBAT Primary Database Variables
Variable Description Variable Type
Data Warehouse Classification Variables
X1 Customer Type nonmetric
X2 Industry Type nonmetric
X3 Firm Size nonmetric
X4 Region nonmetric
X5 Distribution System nonmetric
Performance Perceptions Variables
X6 Product Quality metric
X7 E-Commerce Activities/Website metric
X8 Technical Support metric
X9 Complaint Resolution metric
X10 Advertising metric
X11 Product Line metric
X12 Salesforce Image metric
X13 Competitive Pricing metric
X14 Warranty & Claims metric
X15 New Products metric
X16 Ordering & Billing metric
X17 Price Flexibility metric
X18 Delivery Speed metric
Outcome/Relationship Measures
X19 Satisfaction metric
X20 Likelihood of Recommendation metric
X21 Likelihood of Future Purchase metric
X22 Current Purchase/Usage Level metric
X23 Consider Strategic Alliance/Partnership in Future nonmetric
Cluster Analysis
Learning Checkpoint

1. Why might we use cluster analysis?
2. What are the three major steps in cluster analysis?
3. How do you decide how many clusters to extract?
4. Why do we validate clusters?

