
Cluster Analysis

Analysis and Output Interpretation


using
Hierarchical Cluster Technique & SPSS 6.00

Dr. Rohit Vishal Kumar
Reader, Department of Marketing
Xavier Institute of Social Service
PO Box No 7, Purulia Road
Ranchi - 834001
Email: rohitvishalkumar@gmail.com
All trademarks & Copyrights Acknowledged.
Presentation Copyright Rohit Vishal Kumar 2002
Cluster Analysis - Introduction
• Cluster Analysis is a multivariate analysis technique that
seeks to organize information about variables so that
relatively homogeneous groups, or "clusters," can be formed.
The clusters formed with this family of methods should be
highly internally homogeneous (members are similar to one
another) and highly externally heterogeneous (members are
not like members of other clusters).
• Although cluster analysis is relatively simple, and can use a
variety of input data, it is a relatively new technique and is not
supported by a comprehensive body of statistical literature.
So, most of the guidelines for using cluster analysis are rules
of thumb, and some authors caution that researchers should
use cluster analysis with care.
Cluster Analysis - Key Features
• Cluster analysis is not so much a typical statistical test as it is
a "collection" of different algorithms that "put objects into
clusters."
• Cluster analysis methods are mostly used when we do not
have any a priori hypotheses, but are still in the exploratory
phase of research. In a sense, cluster analysis finds the
"most significant solution possible." Therefore, statistical
significance testing is really not appropriate here.
Cluster Analysis - Applications
• Medicine: clustering diseases, cures for diseases, or symptoms of
diseases can lead to very useful classification and better diagnosis.
• Psychiatry: the correct diagnosis of clusters of symptoms such as
paranoia, schizophrenia, etc. is essential for successful therapy.
• Archeology: researchers have attempted to establish taxonomies of
stone tools, funeral objects, etc. by applying cluster analytic techniques.
• Marketing: researchers have attempted to use cluster analysis to identify
the closeness or difference (real or perceived) between brand images,
identify relatively homogeneous market segments, identify similarities in
communication ideas, etc.

In general, whenever one needs to classify a "mountain"
of information into manageable, meaningful piles, cluster
analysis is of great utility.
Four Common Distance Measures
• Euclidean distance. This is probably the most commonly chosen type of
distance. It is simply the geometric distance in the multidimensional
space. It is computed as:
$\mathrm{distance}(x, y) = \left\{ \sum_i (x_i - y_i)^2 \right\}^{1/2}$
• Note: Euclidean (and squared Euclidean) distances are usually
computed from raw data, and not from standardized data.
• Advantage: the distance between any two objects is not affected by the
addition of new objects to the analysis, which may be outliers.
• Disadvantage: The distances can be greatly affected by differences in
scale among the dimensions from which the distances are computed.
For example, if one of the dimensions denotes a measured length in
centimeters, and you then convert it to millimeters (by multiplying the
values by 10), the resulting Euclidean or squared Euclidean distances
(computed from multiple dimensions) can be greatly affected, and
consequently, the results of cluster analyses may be very different.
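The scale effect described in the disadvantage can be illustrated with a minimal Python sketch. The two objects and their measurements are hypothetical values chosen for illustration; only the centimeter-to-millimeter conversion comes from the example in the text.

```python
import math

def euclidean(x, y):
    """Geometric (straight-line) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical objects described by (length in cm, weight in kg).
a_cm, b_cm = (10.0, 50.0), (12.0, 53.0)

# The same two objects with the length re-expressed in mm (values x 10).
a_mm, b_mm = (100.0, 50.0), (120.0, 53.0)

d_cm = euclidean(a_cm, b_cm)   # per-dimension differences: 2 and 3
d_mm = euclidean(a_mm, b_mm)   # per-dimension differences: 20 and 3

print(f"length in cm: {d_cm:.3f}")
print(f"length in mm: {d_mm:.3f}")
```

Nothing about the objects has changed, yet the distance is several times larger once length is recorded in millimeters, because the length dimension now dominates the sum of squares.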
Four Common Distance Measures
• Squared Euclidean distance. One may want to square the standard
Euclidean distance in order to place progressively greater weight on objects
that are further apart. This distance is computed as:
$\mathrm{distance}(x, y) = \sum_i (x_i - y_i)^2$
• City-block (Manhattan) distance. This distance is simply the sum of the
absolute differences across dimensions. In most cases, this distance measure
yields results similar to the simple Euclidean distance. However, note that in
this measure, the effect of single large differences (outliers) is dampened
(since they are not squared). The city-block distance is computed as:
$\mathrm{distance}(x, y) = \sum_i |x_i - y_i|$
• Chebychev distance. This distance measure may be appropriate in cases
when one wants to define two objects as "different" if they are different on
any one of the dimensions. The Chebychev distance is computed as:
$\mathrm{distance}(x, y) = \max_i |x_i - y_i|$
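The four distance measures can be written side by side as a short Python sketch. The function names and the two example points are illustrative choices, not part of SPSS.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Summed squared differences (no square root)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def city_block(x, y):
    """Summed absolute differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    """Largest absolute difference on any single dimension."""
    return max(abs(a - b) for a, b in zip(x, y))

# Two hypothetical objects measured on three dimensions.
x = (1, 5, 2)
y = (4, 1, 2)   # per-dimension differences: 3, 4, 0

for measure in (euclidean, squared_euclidean, city_block, chebychev):
    print(f"{measure.__name__:>18}: {measure(x, y)}")
```

For these two points the per-dimension differences are 3, 4 and 0, so the measures give 5, 25, 7 and 4 respectively. The squared Euclidean value shows how squaring inflates the contribution of the larger difference.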
Cluster Analysis

The Example and SPSS Procedure
The Raw Data
Respondent  V1  V2  V3  V4  V5  V6
         1   6   4   7   3   2   3
         2   2   3   1   4   5   4
         3   7   2   6   4   1   3
         4   4   6   4   5   3   6
         5   1   3   2   2   6   4
         6   6   4   6   3   3   4
         7   5   3   6   3   3   4
         8   7   3   7   4   1   4
         9   2   4   3   3   6   3
        10   3   5   3   6   4   6
        11   1   3   2   3   5   3
        12   5   4   5   4   2   4
        13   2   2   1   5   4   4
        14   4   6   4   6   4   7
        15   6   5   4   2   1   4
        16   3   5   4   6   4   7
        17   4   4   7   2   2   5
        18   3   7   2   6   4   3
        19   4   6   3   7   2   7
        20   2   3   2   4   7   2

The above data was collected from 20 respondents. The respondents were
asked to rate the following statements on a 7-point scale:
V1 : Shopping is fun
V2 : Shopping is bad for your budget
V3 : I combine shopping with eating out
V4 : I try to get the best buys while shopping
V5 : I don’t care about shopping
V6 : You can save money by comparing prices
SCALE USED
1 = Completely Disagree, 4 = Neither Agree Nor Disagree, 7 = Completely Agree
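For readers who want to check the figures outside SPSS, the ratings can be held in a plain Python dictionary. This is only a sketch: the names RATINGS and sq_euclid are illustrative, and the three pairs printed are simply pairs that also appear early in the SPSS agglomeration schedule.

```python
# Ratings V1..V6 for respondents 1..20, as listed in the raw data table.
RATINGS = {
    1: (6, 4, 7, 3, 2, 3),   2: (2, 3, 1, 4, 5, 4),
    3: (7, 2, 6, 4, 1, 3),   4: (4, 6, 4, 5, 3, 6),
    5: (1, 3, 2, 2, 6, 4),   6: (6, 4, 6, 3, 3, 4),
    7: (5, 3, 6, 3, 3, 4),   8: (7, 3, 7, 4, 1, 4),
    9: (2, 4, 3, 3, 6, 3),  10: (3, 5, 3, 6, 4, 6),
    11: (1, 3, 2, 3, 5, 3), 12: (5, 4, 5, 4, 2, 4),
    13: (2, 2, 1, 5, 4, 4), 14: (4, 6, 4, 6, 4, 7),
    15: (6, 5, 4, 2, 1, 4), 16: (3, 5, 4, 6, 4, 7),
    17: (4, 4, 7, 2, 2, 5), 18: (3, 7, 2, 6, 4, 3),
    19: (4, 6, 3, 7, 2, 7), 20: (2, 3, 2, 4, 7, 2),
}

def sq_euclid(i, j):
    """Squared Euclidean distance between respondents i and j."""
    return sum((a - b) ** 2 for a, b in zip(RATINGS[i], RATINGS[j]))

for i, j in [(14, 16), (6, 7), (2, 13)]:
    print(f"respondents {i} and {j}: {sq_euclid(i, j)}")
```

Respondents 14 and 16 differ by one point on V1 and one point on V2 only, which gives a squared Euclidean distance of 2.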
SPSS Screen 1
The data entry screen in SPSS

SPSS Screen 2 : Hierarchical Cluster
Choose Statistics -> Data Reduction -> Hierarchical Cluster
We are shown the Hierarchical Cluster screen as follows:

1. Select all six variables (V1-V6) and transfer them to the Variable(s) box

2. Select Cluster “Cases”

3. Select Display “Statistics” and “Plots”

4. Press the “Statistics” button
SPSS Screen 3 : Hierarchical Cluster
On pressing the “Statistics” button we are shown the following screen

1. “Agglomeration Schedule” and “Cluster Membership -> None” should be
checked by default. If not, select these options

2. Press “Continue”

3. Back on Screen 2, press the “Plots” button
SPSS Screen 4 : Hierarchical Cluster
On pressing the “Plots” button we are shown the following screen

1. Select “Dendrogram”

2. Select “All Icicles”

3. Select Orientation “Vertical”

4. Back on Screen 2, press the “Methods” button
SPSS Screen 5 : Hierarchical Cluster
On pressing the “Methods” button we are shown the following screen

1. In Cluster Method, choose “Between Group Linkage”

2. In Measure, select “Interval” and then “Squared Euclidean Distances”

3. In “Transform Values”, select “none” in the Standardize dropdown list

4. Select “Continue”

5. In Screen 2 select “OK”
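The settings chosen above (between-groups average linkage, squared Euclidean distance, no standardization) can be sketched in plain Python to show what the procedure computes. This is a naive illustration, not the SPSS implementation. The names RATINGS, sq_euclid and average_linkage are illustrative. When several pairs are tied at the same distance, this sketch may combine them in a different order from SPSS, so the early stages need not appear in the same sequence.

```python
from itertools import combinations

# Ratings V1..V6 for respondents 1..20, as listed in the raw data table.
RATINGS = {
    1: (6, 4, 7, 3, 2, 3),   2: (2, 3, 1, 4, 5, 4),
    3: (7, 2, 6, 4, 1, 3),   4: (4, 6, 4, 5, 3, 6),
    5: (1, 3, 2, 2, 6, 4),   6: (6, 4, 6, 3, 3, 4),
    7: (5, 3, 6, 3, 3, 4),   8: (7, 3, 7, 4, 1, 4),
    9: (2, 4, 3, 3, 6, 3),  10: (3, 5, 3, 6, 4, 6),
    11: (1, 3, 2, 3, 5, 3), 12: (5, 4, 5, 4, 2, 4),
    13: (2, 2, 1, 5, 4, 4), 14: (4, 6, 4, 6, 4, 7),
    15: (6, 5, 4, 2, 1, 4), 16: (3, 5, 4, 6, 4, 7),
    17: (4, 4, 7, 2, 2, 5), 18: (3, 7, 2, 6, 4, 3),
    19: (4, 6, 3, 7, 2, 7), 20: (2, 3, 2, 4, 7, 2),
}

def sq_euclid(x, y):
    """Squared Euclidean distance on the raw (unstandardized) ratings."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def average_linkage(data):
    """Return one (cluster 1, cluster 2, coefficient) tuple per stage."""
    clusters = {case: [case] for case in data}   # label -> member cases
    schedule = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            # Between-groups linkage: mean distance over all cross pairs.
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            avg = sum(sq_euclid(data[i], data[j]) for i, j in pairs) / len(pairs)
            if best is None or avg < best[2]:
                best = (a, b, avg)
        a, b, avg = best
        clusters[a].extend(clusters.pop(b))      # keep the lower case number
        schedule.append((a, b, avg))
    return schedule

schedule = average_linkage(RATINGS)
for stage, (a, b, coef) in enumerate(schedule, start=1):
    print(f"{stage:>5} {a:>9} {b:>9} {coef:>12.6f}")
```

The printed coefficients can be compared with the Coefficient column of the SPSS agglomeration schedule.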


Cluster Analysis

The SPSS Output
SPSS Output 1 : Hierarchical Cluster
The following output “Proximities” is displayed by SPSS

Data Information
20 unweighted cases accepted.
0 cases rejected because of missing value.
Squared Euclidean measure used.

* * * * * * * * * * * * * * P R O X I M I T I E S * * * * * * * * * * * * * *

Agglomeration Schedule using Average Linkage (Between Groups)

         Clusters Combined                 Stage Cluster 1st Appears    Next
Stage  Cluster 1  Cluster 2    Coefficient    Cluster 1    Cluster 2   Stage

    1         14         16       2.000000            0            0       3
    2          6          7       2.000000            0            0       7
    3         10         14       3.000000            0            1       8
    4          2         13       3.000000            0            0      14
    5          5         11       3.000000            0            0       9
    6          3          8       3.000000            0            0      15
    7          6         12       4.000000            2            0      10
    8          4         10       4.333333            0            3      11
    9          5          9       4.500000            5            0      12
   10          1          6       5.000000            0            7      13
   11          4         19       7.250000            8            0      17
   12          5         20       7.333333            9            0      14
   13          1         17       8.250000           10            0      15
   14          2          5      10.750000            4           12      18
   15          1          3      11.300000           13            6      16
   16          1         15      14.000000           15            0      19
   17          4         18      20.200001           11            0      18
   18          2          4      38.611111           14           17      19
   19          1          2      48.291668           16           18       0
SPSS Output 1 : Hierarchical Cluster
The Analysis : Proximities

•The "average linkage (between groups)" clustering method was used.

•There were a total of 20 data points. In the first stage, two data points (14 and
16) were combined. This information is provided in the "Cluster 1" and "Cluster 2"
columns under "Clusters Combined".

•The squared Euclidean distance between data points 14 and 16 is equal to 2.00.
This is shown in the “Coefficient” column.

•The columns under "Stage Cluster 1st Appears" give, for each of the two items
being combined, the stage at which that item was itself formed as a cluster. An
entry of 0 means the item is a single case that has not yet been combined with
anything, so the entries 0 and 0 at stage 1 show that two individual cases were
joined. At stage 3 the entries are 0 and 1: case 10 is joined to the cluster formed
at stage 1 (cases 14 and 16).

•The “Next Stage” column gives the stage at which the cluster formed at the
current stage is next combined with another case or cluster. For stage 1 the entry
is 3, and at stage 3 we find that data point 10 is combined with the cluster
containing data point 14.
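The two bookkeeping columns can be derived from the "Clusters Combined" columns alone, which may help in reading them. The sketch below is illustrative Python, not SPSS code. It assumes, as the schedule suggests, that a merged cluster keeps the case number shown under Cluster 1. MERGES copies the pairs from the agglomeration schedule.

```python
# (Cluster 1, Cluster 2) for stages 1..19, copied from the schedule.
MERGES = [(14, 16), (6, 7), (10, 14), (2, 13), (5, 11), (3, 8), (6, 12),
          (4, 10), (5, 9), (1, 6), (4, 19), (5, 20), (1, 17), (2, 5),
          (1, 3), (1, 15), (4, 18), (2, 4), (1, 2)]

def bookkeeping(merges):
    """Rows of [stage, c1, c2, first_appears_1, first_appears_2, next_stage]."""
    formed_at = {}   # cluster label -> stage at which that cluster was formed
    rows = []
    for stage, (c1, c2) in enumerate(merges, start=1):
        # 0 means the item is still a single, uncombined case.
        rows.append([stage, c1, c2,
                     formed_at.get(c1, 0), formed_at.get(c2, 0), 0])
        formed_at[c1] = stage        # the merged cluster keeps label c1
        formed_at.pop(c2, None)      # label c2 is absorbed
    for row in rows:
        stage, label = row[0], row[1]
        # Next stage: first later stage that uses this stage's cluster.
        for later in rows[stage:]:
            if label in (later[1], later[2]):
                row[5] = later[0]
                break
    return rows

rows = bookkeeping(MERGES)
for row in rows:
    print("{:>5} {:>4} {:>4} {:>6} {:>6} {:>6}".format(*row))
```

For stage 3 this gives 0 and 1 under "1st Appears" and 8 under "Next Stage", matching the SPSS output: case 10 is a single case, case 14 belongs to the cluster formed at stage 1, and the resulting cluster is next used at stage 8.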
SPSS Output 2 : Hierarchical Cluster
The following output “Icicle Plot” is displayed by SPSS

Vertical Icicle Plot using Average Linkage (Between Groups)

(Down) Number of Clusters (Across) Case Label and number

    18 19 16 14 10 4  20 9  11 5  13 2  15 8  3  17 12 7  6  1
 1 +XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 2 +XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXXXX
 3 +XXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXXXX
 4 +X  XXXXXXXXXXXXX  XXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXXXX
 5 +X  XXXXXXXXXXXXX  XXXXXXXXXXXXXXXX  X  XXXXXXXXXXXXXXXXXXX
 6 +X  XXXXXXXXXXXXX  XXXXXXXXXXXXXXXX  X  XXXX  XXXXXXXXXXXXX
 7 +X  XXXXXXXXXXXXX  XXXXXXXXXX  XXXX  X  XXXX  XXXXXXXXXXXXX
 8 +X  XXXXXXXXXXXXX  XXXXXXXXXX  XXXX  X  XXXX  X  XXXXXXXXXX
 9 +X  XXXXXXXXXXXXX  X  XXXXXXX  XXXX  X  XXXX  X  XXXXXXXXXX
10 +X  X  XXXXXXXXXX  X  XXXXXXX  XXXX  X  XXXX  X  XXXXXXXXXX
11 +X  X  XXXXXXXXXX  X  XXXXXXX  XXXX  X  XXXX  X  XXXXXXX  X
12 +X  X  XXXXXXXXXX  X  X  XXXX  XXXX  X  XXXX  X  XXXXXXX  X
13 +X  X  XXXXXXX  X  X  X  XXXX  XXXX  X  XXXX  X  XXXXXXX  X
14 +X  X  XXXXXXX  X  X  X  XXXX  XXXX  X  XXXX  X  X  XXXX  X
15 +X  X  XXXXXXX  X  X  X  XXXX  XXXX  X  X  X  X  X  XXXX  X
16 +X  X  XXXXXXX  X  X  X  X  X  XXXX  X  X  X  X  X  XXXX  X
17 +X  X  XXXXXXX  X  X  X  X  X  X  X  X  X  X  X  X  XXXX  X
18 +X  X  XXXX  X  X  X  X  X  X  X  X  X  X  X  X  X  XXXX  X
19 +X  X  XXXX  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X
SPSS Output 2 : Hierarchical Cluster
The Analysis : Icicle Plot

•The icicle plot shows how the clusters were combined. It is read from bottom to top.

•Initially, each of the 20 cases is treated as its own cluster. In the row labelled
19, the first combination has been made, leaving 19 clusters.

•The icicle plot represents the whole process of cluster formation in pictorial
form. For example, in the row labelled 7 there are 7 clusters, each denoted by an
unbroken series of X's:
X  XXXXXXXXXXXXX  XXXXXXXXXX  XXXX  X  XXXX  XXXXXXXXXXXXX

•Each subsequent step forms a new cluster in one of the following three (3) ways:
–Two individual cases are grouped together
–A case is joined to an already existing cluster
–Two clusters are grouped together
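Each row of the icicle plot corresponds to replaying the agglomeration schedule up to a given stage. The Python sketch below (illustrative names, not SPSS code) lists the members of each cluster for any chosen number of clusters. MERGES copies the "Clusters Combined" pairs from the schedule.

```python
# (Cluster 1, Cluster 2) for stages 1..19, copied from the schedule.
MERGES = [(14, 16), (6, 7), (10, 14), (2, 13), (5, 11), (3, 8), (6, 12),
          (4, 10), (5, 9), (1, 6), (4, 19), (5, 20), (1, 17), (2, 5),
          (1, 3), (1, 15), (4, 18), (2, 4), (1, 2)]

def clusters_at(k, merges, n_cases=20):
    """Cluster membership when k clusters remain (row k of the icicle plot)."""
    members = {case: {case} for case in range(1, n_cases + 1)}
    # Going from n_cases clusters down to k clusters takes n_cases - k merges.
    for c1, c2 in merges[: n_cases - k]:
        members[c1] |= members.pop(c2)
    return sorted(sorted(group) for group in members.values())

for group in clusters_at(7, MERGES):
    print(group)
```

With k = 7 the sketch lists five multi-case clusters plus the single cases 15 and 18, which are the seven runs of X's in the row labelled 7.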
SPSS Output 3 : Hierarchical Cluster
The following output “Dendrogram” is displayed by SPSS

Dendrogram using Average Linkage (Between Groups)

                      Rescaled Distance Cluster Combine

   C A S E     0         5        10        15        20        25
Label     Num  +---------+---------+---------+---------+---------+

Case 14   14   -+
Case 16   16   -+-+
Case 10   10   -+ +-+
Case 4     4   ---+ +-------------+
Case 19   19   -----+             +-------------------+
Case 18   18   -------------------+                   |
Case 2     2   -+-------+                             +---------+
Case 13   13   -+       |                             |         |
Case 5     5   -+-+     +-----------------------------+         |
Case 11   11   -+ +-+   |                                       |
Case 9     9   ---+ +---+                                       |
Case 20   20   -----+                                           |
Case 3     3   -+---------+                                     |
Case 8     8   -+         |                                     |
Case 6     6   -+-+       +-+                                   |
Case 7     7   -+ |       | |                                   |
Case 12   12   ---+---+   | +-----------------------------------+
Case 1     1   ---+   +---+ |
Case 17   17   -------+     |
Case 15   15   -------------+
SPSS Output 3 : Hierarchical Cluster
The Analysis : Dendrogram

•The dendrogram is a graphical output that is useful in identifying the
clusters. It is read from left to right.

•Vertical lines represent the clusters that are joined together. The position
of the vertical line on the scale indicates the distance at which the clusters
were joined. Because many of the distances in the early stages are of
similar magnitude, it is difficult to tell the sequence in which some of the
early clusters were formed. However, it is clear that in the last two stages,
the distances at which the clusters are combined are large. This
information is useful in deciding the number of clusters to retain.
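One way to make the remark about the last two stages concrete is to look at how much the coefficient rises from one stage to the next. The Python sketch below uses the coefficients from the agglomeration schedule. Stopping just before the largest jump is only one rule of thumb, not an SPSS feature, and the variable names are illustrative.

```python
# Coefficient column of the agglomeration schedule, stages 1..19.
COEFFICIENTS = [2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.333333, 4.5, 5.0,
                7.25, 7.333333, 8.25, 10.75, 11.3, 14.0, 20.200001,
                38.611111, 48.291668]

n_cases = len(COEFFICIENTS) + 1   # 19 merges reduce 20 cases to 1 cluster

# Increase in the coefficient on entering each stage from 2 onwards.
jumps = [(later - earlier, stage)
         for stage, (earlier, later)
         in enumerate(zip(COEFFICIENTS, COEFFICIENTS[1:]), start=2)]

largest_jump, stage = max(jumps)

# Stopping just before that stage leaves this many clusters.
suggested_k = n_cases - (stage - 1)

print(f"largest jump: {largest_jump:.2f} on entering stage {stage}")
print(f"clusters remaining if we stop before it: {suggested_k}")
```

Here the largest rise occurs on entering stage 18 (from about 20.2 to about 38.6), so stopping after stage 17 would leave three clusters. Other rules of thumb may suggest a different number.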
Cluster Analysis

Exercises and Final Notes
Practice Example
• The following data was collected for US basketball champions:
– Height : Height in inches
– Weight : Weight in pounds
– FGPct : Field goal percentage
– Points : Average points per game
– Rebound : Average rebounds per game

Champion        Height  Weight  FGPct  Points  Rebound
Jabbar K.A.         86     230   55.9    24.6     11.2
Barry R             79     205   44.9    23.2      6.7
Baylor E            77     225   43.1    27.4     13.5
Bird L              81     220   50.3    25.0     10.2
Chamberlain W       85     275   54.0    30.1     22.9
Cousy B             73     175   37.5    18.4      5.2
Erving J            79     200   50.6    24.2      8.5
Johnson M           81     215   53.0    19.5      7.4
Jordan M            78     195   51.3    32.6      6.2
Robertson O         77     210   48.5    25.7      7.5
Russell B           82     220   44.0    15.1     22.6
West J              75     180   47.4    27.0      5.8

• Conduct a Hierarchical Cluster Analysis using
a) Height, Weight, FGPct, Points and Rebound
b) Height, FGPct, Points and Rebound
c) FGPct, Points and Rebound
Analyse the dendrograms to identify how the clusters have changed
between (a), (b) and (c)
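As a possible starting point for the exercise, the Python sketch below computes the squared Euclidean distance between one pair of players under each of the three variable sets. The choice of pair is arbitrary, and the names PLAYERS, FIELDS and sq_euclid are illustrative. It does not replace running the full cluster analysis.

```python
FIELDS = ("Height", "Weight", "FGPct", "Points", "Rebound")

# One tuple per player, in the order given by FIELDS.
PLAYERS = {
    "Jabbar K.A.":   (86, 230, 55.9, 24.6, 11.2),
    "Barry R":       (79, 205, 44.9, 23.2, 6.7),
    "Baylor E":      (77, 225, 43.1, 27.4, 13.5),
    "Bird L":        (81, 220, 50.3, 25.0, 10.2),
    "Chamberlain W": (85, 275, 54.0, 30.1, 22.9),
    "Cousy B":       (73, 175, 37.5, 18.4, 5.2),
    "Erving J":      (79, 200, 50.6, 24.2, 8.5),
    "Johnson M":     (81, 215, 53.0, 19.5, 7.4),
    "Jordan M":      (78, 195, 51.3, 32.6, 6.2),
    "Robertson O":   (77, 210, 48.5, 25.7, 7.5),
    "Russell B":     (82, 220, 44.0, 15.1, 22.6),
    "West J":        (75, 180, 47.4, 27.0, 5.8),
}

def sq_euclid(a, b, fields):
    """Squared Euclidean distance between players a and b on `fields` only."""
    idx = [FIELDS.index(f) for f in fields]
    return sum((PLAYERS[a][i] - PLAYERS[b][i]) ** 2 for i in idx)

variable_sets = {
    "a": FIELDS,
    "b": ("Height", "FGPct", "Points", "Rebound"),
    "c": ("FGPct", "Points", "Rebound"),
}

distances = {name: sq_euclid("Jabbar K.A.", "Chamberlain W", fields)
             for name, fields in variable_sets.items()}

for name, value in distances.items():
    print(f"variable set ({name}): {value:.2f}")
```

For this pair the distance drops sharply once Weight is left out, because the 45-pound difference is squared. This is the scale effect described under the Euclidean distance, and it is worth keeping in mind when comparing the three dendrograms.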
Warning
• We have only shown the output of a hierarchical cluster analysis.
• Similar interpretations may or may not be applicable to
non-hierarchical cluster analysis.
• The analysis software used was SPSS® 6.0. The output may
vary with the type of analysis tool selected.
• Cluster analysis should be run more than once using different
distance measures, and the results compared, before a final
interpretation is attempted.
Thank You
Feel Free to revert with your
comments and suggestions
