
Cluster Analysis

Introduction:
Cluster analysis is an exploratory data analysis tool used to organize observed
data into sub-groups, or clusters, that are easier to manage than individual
data points. The clusters are defined through the analysis itself and require no
prior knowledge of which item belongs to which group. The analysis works by
maximizing the similarity between elements within a group while maximizing the
dissimilarity between different groups. Geometrically, objects within a cluster
will be close together and different clusters will be far apart. There are
various ways in which clusters can be formed. SPSS provides three different
methods for performing cluster analysis: hierarchical cluster analysis, k-means
cluster and two-step cluster.

1) Hierarchical Cluster Analysis


Hierarchical cluster analysis can be of two types: agglomerative or divisive.
In agglomerative clustering, each object starts in its own cluster and the most
similar clusters are merged at successive steps; the algorithm ends when all
objects end up in one cluster. In divisive clustering, all objects start in one
cluster and dissimilar objects are successively split off into separate
clusters; in this case the algorithm ends when every object is in its own
cluster.

General Agglomerative Algorithm


STEP1: The distance between every pair of clusters is calculated (initially each
object is its own cluster)
STEP2: The two most similar clusters are fused and the distances are recalculated
STEP3: Repeat STEP2 until all cases are eventually in one cluster
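
This loop can be sketched in a few lines of Python. The sketch below is only an
illustration of the idea, not the code SPSS runs; single linkage is used purely
as a placeholder merge rule, and the four input points are made up:

    import math

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def cluster_distance(c1, c2):
        # placeholder rule: single linkage (closest pair of members)
        return min(euclidean(a, b) for a in c1 for b in c2)

    def agglomerate(points):
        clusters = [[p] for p in points]          # STEP1: every object is its own cluster
        while len(clusters) > 1:                  # STEP3: repeat until one cluster remains
            pairs = [(i, j) for i in range(len(clusters))
                            for j in range(i + 1, len(clusters))]
            i, j = min(pairs, key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
            clusters[i] += clusters[j]            # STEP2: fuse the two most similar clusters
            del clusters[j]
        return clusters

    agglomerate([(1.0, 2.0), (1.2, 1.9), (8.0, 9.0), (8.1, 9.2)])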

Single-Linkage Agglomerative Algorithm


STEP1: The shortest distance from any object in one cluster to any object in the
other cluster is calculated
STEP2: The two most similar clusters are fused and the distances are recalculated
STEP3: Repeat STEP2 until all cases are eventually in one cluster

Complete-Linkage Agglomerative Algorithm


STEP1: The maximum distance between any two members of the two clusters is
calculated
STEP2: The two most similar clusters are fused and the distances are recalculated
STEP3: Repeat STEP2 until all cases are eventually in one cluster

Average-Linkage Agglomerative Algorithm


STEP1: The average distance between all pairs of members of the two clusters is
calculated
STEP2: The two most similar clusters are fused and the distances are recalculated
STEP3: Repeat STEP2 until all cases are eventually in one cluster

Centroid Method Agglomerative Algorithm


STEP1: The distance between the two cluster centroids is calculated, where a
centroid is the vector of mean values of the observations in a cluster on each
variable
STEP2: The two most similar clusters are fused and the distances are recalculated
STEP3: Repeat STEP2 until all cases are eventually in one cluster

Ward's Method Agglomerative Algorithm


STEP1: The within-cluster sum of squares (the sum of squared deviations from the
cluster mean, summed over all variables) is evaluated for every candidate merger
STEP2: The two clusters whose merger results in the minimum increase in the
error sum of squares are merged
STEP3: Repeat STEP2 until all cases are eventually in one cluster
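
In practice these linkage rules rarely have to be coded by hand. As a sketch
(assuming NumPy and SciPy are available, and using a small made-up data matrix),
SciPy exposes all of the variants above through a single method argument:

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    X = np.array([[1.0, 2.0], [1.2, 1.9], [4.5, 5.0],
                  [8.0, 9.0], [8.1, 9.2], [7.9, 8.8]])

    # Each call returns a merge table (one row per fusion) for the chosen rule.
    for method in ["single", "complete", "average", "centroid", "ward"]:
        Z = linkage(X, method=method)   # centroid and ward assume Euclidean distances
        print(method, Z[0])             # first fusion: the two most similar clusters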

Flowchart

Advantages of Hierarchical Cluster Analysis:


1- Clusters can be visualized through simple yet comprehensive dendrograms
or tree diagrams.
2- There is no need to specify the number of clusters beforehand

Disadvantages of Hierarchical Cluster Analysis:

1- It is considerably slower, as there are many split/merge decisions to be
made
2- For the same reason it is not suitable for large datasets
3- Different starting clusters can result in different final clusters

Example:
The dataset used in the example is shown below.
Name       Weight.kilos   Height.cms
Beefy      11.31          33.79
Benny      9.34           34.38
Bertie     10.79          40.86
Biffy      11.04          37.07
Billy      9.74           33.77
Champ      2.94           22.98
Charger    2.99           16.21
Charlie    2.66           22.38
Chewy      2.32           19.68
Chechee    2.82           20.11
Chico      2.34           18.78
Chief      3.12           20.92
Laddy      29.57          61.69
Larry      29.64          59.03
Lassie     28.59          62.98
Lemmy      33.03          60.69
Loco       32.83          60.26
LouLou     31.23          61.34

The dataset contains the name, weight in kilograms and height in centimetres of
18 dogs. Our aim is to categorize the dogs into different groups based on the
above-mentioned fields.

STEP1: Choose the Classify option in the Analyze menu, then choose Hierarchical
Cluster.... A window, as shown below, opens.

STEP2: Click the Statistics option, check the options Agglomeration schedule
and Proximity matrix, and click Continue.

STEP3: Click the Plots option and check Dendrogram, then click Continue.

STEP4: Click Method, choose Ward's method in the Cluster Method dropdown menu,
and choose Squared Euclidean distance in the dropdown under the Interval option
in the Measure category. Click Continue.

Squared Euclidean distance, an extension of Pythagoras' theorem, is considered
the most straightforward and generally accepted way of computing distances
between objects in a multi-dimensional space. In a two- or three-dimensional
space the Euclidean distance is the actual geometric distance between objects.
In a univariate example, the Euclidean distance between two values is simply the
arithmetic difference value1 - value2 (taken in absolute value). In the
bivariate case, the distance is the hypotenuse of the triangle formed from the
points, as in Pythagoras' theorem. The squared Euclidean distance is used more
often than the simple Euclidean distance in order to place progressively greater
weight on objects that are further apart. Euclidean (and squared Euclidean)
distances are usually computed from raw data, and not from standardized data.
Ward's method is chosen over the others because, unlike the other methods, it
uses an analysis of variance approach to evaluate the distances between
clusters, and in comparison with the other methods it is considered to be very
efficient. In this method, the total sum of squared deviations from the mean of
a cluster is calculated to assess cluster membership, and the smallest possible
increase in the error sum of squares is used as the criterion for merging two
clusters.
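
As a small worked illustration (a hand calculation on two of the dogs listed
above, not SPSS output), the squared Euclidean distance is just the sum of
squared differences over the variables:

    def squared_euclidean(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    lemmy = (33.03, 60.69)   # (Weight.kilos, Height.cms) for Lemmy
    loco = (32.83, 60.26)    # (Weight.kilos, Height.cms) for Loco
    print(squared_euclidean(lemmy, loco))   # 0.2249, shown as .225 in the proximity matrix below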
STEP5: Click OK in the main menu.

The Output is as shown:


Case Processing Summary(a)

                        Cases
  Valid              Missing            Total
  N      Percent     N      Percent     N      Percent
  18     100.0       0      .0          18     100.0

a. Ward Linkage

Proximity Matrix (Squared Euclidean Distance)

Case        1:Beefy    2:Benny    3:Bertie   4:Biffy    5:Billy    6:Champ    7:Charger
1:Beefy     .000       4.229      50.255     10.831     2.465      186.913    378.279
2:Benny     4.229      .000       44.093     10.126     .532       170.920    370.471
3:Bertie    50.255     44.093     .000       14.427     51.371     381.317    668.463
4:Biffy     10.831     10.126     14.427     .000       12.580     264.138    499.942
5:Billy     2.465      .532       51.371     12.580     .000       162.664    353.916
6:Champ     186.913    170.920    381.317    264.138    162.664    .000       45.835
7:Charger   378.279    370.471    668.463    499.942    353.916    45.835     .000
8:Charlie   205.011    188.622    407.607    286.021    179.859    .438       38.178
9:Chewy     279.912    265.370    520.333    378.451    253.585    11.274     12.490
10:Chechee  259.223    246.143    494.083    355.210    234.482    8.251      15.239
11:Chico    305.761    292.360    558.929    410.214    279.460    18.000     7.027
12:Chief    232.713    219.860    456.433    323.549    208.947    4.276      22.201
13:Laddy    1111.838   1155.089   786.577    949.505    1172.755   2207.621   2774.927
14:Larry    973.047    1019.713   685.471    828.202    1034.078   2012.492   2543.775
15:Lassie   1150.654   1188.522   806.134    979.331    1208.547   2257.922   2842.793
16:Lemmy    1195.368   1253.432   887.847    1041.465   1267.110   2327.452   2880.872
17:Loco     1163.771   1221.554   862.122    1012.580   1234.868   2283.211   2830.828
18:LouLou   1155.809   1206.014   837.224    996.669    1221.925   2271.814   2834.215

Case        8:Charlie  9:Chewy    10:Chechee 11:Chico   12:Chief   13:Laddy   14:Larry
1:Beefy     205.011    279.912    259.223    305.761    232.713    1111.838   973.047
2:Benny     188.622    265.370    246.143    292.360    219.860    1155.089   1019.713
3:Bertie    407.607    520.333    494.083    558.929    456.433    786.577    685.471
4:Biffy     286.021    378.451    355.210    410.214    323.549    949.505    828.202
5:Billy     179.859    253.585    234.482    279.460    208.947    1172.755   1034.078
6:Champ     .438       11.274     8.251      18.000     4.276      2207.621   2012.492
7:Charger   38.178     12.490     15.239     7.027      22.201     2774.927   2543.775
8:Charlie   .000       7.406      5.179      13.062     2.343      2269.424   2071.143
9:Chewy     7.406      .000       .435       .810       2.178      2507.403   2294.805
10:Chechee  5.179      .435       .000       1.999      .746       2444.459   2234.079
11:Chico    13.062     .810       1.999      .000       5.188      2582.741   2365.353
12:Chief    2.343      2.178      .746       5.188      .000       2361.795   2155.683
13:Laddy    2269.424   2507.403   2444.459   2582.741   2361.795   .000       7.081
14:Larry    2071.143   2294.805   2234.079   2365.353   2155.683   7.081      .000
15:Lassie   2320.725   2565.003   2501.930   2642.702   2417.764   2.625      16.705
16:Lemmy    2389.993   2624.924   2559.380   2698.324   2476.261   12.972     14.248
17:Loco     2345.123   2577.596   2512.623   2650.230   2430.320   12.673     11.689
18:LouLou   2334.127   2571.344   2507.041   2645.986   2423.949   2.878      7.864

Case        15:Lassie  16:Lemmy   17:Loco    18:LouLou
1:Beefy     1150.654   1195.368   1163.771   1155.809
2:Benny     1188.522   1253.432   1221.554   1206.014
3:Bertie    806.134    887.847    862.122    837.224
4:Biffy     979.331    1041.465   1012.580   996.669
5:Billy     1208.547   1267.110   1234.868   1221.925
6:Champ     2257.922   2327.452   2283.211   2271.814
7:Charger   2842.793   2880.872   2830.828   2834.215
8:Charlie   2320.725   2389.993   2345.123   2334.127
9:Chewy     2565.003   2624.924   2577.596   2571.344
10:Chechee  2501.930   2559.380   2512.623   2507.041
11:Chico    2642.702   2698.324   2650.230   2645.986
12:Chief    2417.764   2476.261   2430.320   2423.949
13:Laddy    2.625      12.972     12.673     2.878
14:Larry    16.705     14.248     11.689     7.864
15:Lassie   .000       24.958     25.376     9.659
16:Lemmy    24.958     .000       .225       3.663
17:Loco     25.376     .225       .000       3.726
18:LouLou   9.659      3.663      3.726      .000

This is a dissimilarity matrix

NOTE: The proximity matrix shown above gives us the dissimilarity between different objects
in the dataset, so the smaller the value, the more similar the two objects are. For
example, 16:Lemmy and 17:Loco have a dissimilarity value of 0.225, which is the
lowest in the entire matrix, indicating that they are the most similar objects in the entire
dataset and hence must belong to the same cluster.
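
The same dissimilarity matrix can be reproduced outside SPSS. A sketch using
SciPy on the weight and height columns of the dog data (the array below simply
restates the dataset given earlier):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    # Weight.kilos and Height.cms for the 18 dogs, in the order listed in the dataset.
    X = np.array([[11.31, 33.79], [9.34, 34.38], [10.79, 40.86], [11.04, 37.07],
                  [9.74, 33.77], [2.94, 22.98], [2.99, 16.21], [2.66, 22.38],
                  [2.32, 19.68], [2.82, 20.11], [2.34, 18.78], [3.12, 20.92],
                  [29.57, 61.69], [29.64, 59.03], [28.59, 62.98], [33.03, 60.69],
                  [32.83, 60.26], [31.23, 61.34]])

    D = squareform(pdist(X, metric="sqeuclidean"))   # 18 x 18 squared Euclidean distances
    print(round(D[15, 16], 3))                       # 16:Lemmy vs 17:Loco -> 0.225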

Ward Linkage

Agglomeration Schedule
[Table listing, for each of the 17 stages, the two clusters combined, the
agglomeration coefficient, the stage at which each cluster first appeared, and
the next stage at which the new cluster is used. The coefficients at stages 1
through 17 are .112, .330, .549, .815, 1.679, 2.991, 4.749, 6.892, 9.317,
16.531, 24.022, 34.561, 49.276, 67.473, 98.032, 1001.275 and 8167.574; stage 1
combines cases 16 and 17.]

NOTE: The Agglomeration Schedule tells us which objects were added to which cluster and
at which stage. As we saw in the proximity matrix, 16:Lemmy and 17:Loco had a
dissimilarity value of 0.225, which is the lowest in the entire matrix, indicating that they are
the most similar objects in the entire dataset and hence must belong to the same cluster.
Therefore they were the first two objects to be joined into a cluster. Unlike the proximity
matrix, which gives the dissimilarity between objects, the Agglomeration Schedule tells us
which objects were added to which cluster and when.

NOTE: The above diagram is called an icicle plot. It is a graphical representation of the
agglomeration schedule.

NOTE: The above dendrogram shows which objects belong to which cluster. It can be
seen that objects 16, 17, 18, 13, 15 and 14 form a cluster, objects 6, 8, 9, 10, 11, 12 and 7
form a cluster, and objects 2, 5, 1, 3 and 4 form a cluster. The last two clusters are also
more similar to each other than to the first.
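
The whole hierarchical example can be reproduced in a few lines of SciPy. Below
is a sketch reusing the array X of dog weights and heights defined in the
earlier snippet; the exact merge coefficients will not match SPSS's schedule
because SciPy's Ward implementation works on unsquared Euclidean distances, but
cutting the tree at three clusters should recover the same three groups of dogs:

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

    names = ["Beefy", "Benny", "Bertie", "Biffy", "Billy", "Champ", "Charger",
             "Charlie", "Chewy", "Chechee", "Chico", "Chief", "Laddy", "Larry",
             "Lassie", "Lemmy", "Loco", "LouLou"]

    Z = linkage(X, method="ward")                    # merge table, analogous to the agglomeration schedule
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(dict(zip(names, labels)))

    dendrogram(Z, labels=names)                      # tree diagram, like the SPSS dendrogram
    plt.show()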

2) K-Means Clustering

K-means clustering, unlike hierarchical clustering, involves specifying the
number of clusters (k) required beforehand. Objects are then classified as
belonging to one of the k groups based on their distances from the cluster
centres or centroids: each object is added to the cluster for which its
object-centroid distance is minimum.

Algorithm
STEP1: The number of clusters k is specified
STEP2: The centroid position, or initial cluster centre, for each of the k
clusters is calculated
STEP3: The distance of each object from each centroid is calculated
STEP4: Each object is added to the cluster for which the distance calculated in
STEP3 is minimum
STEP5: Repeat STEP2 to STEP4 until the computed values stabilize or the maximum
number of iterations is reached
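
A minimal NumPy sketch of these steps is shown below. It is illustrative only;
production implementations (including SPSS's) add refinements such as better
initial centres, and empty clusters are not handled here:

    import numpy as np

    def kmeans(X, k, max_iter=10, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # STEP2: initial cluster centres
        for _ in range(max_iter):                                # STEP5: iterate
            # STEP3: distance of every object from every centre
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)                        # STEP4: assign to the nearest centre
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):                # stop once the centres stabilize
                break
            centers = new_centers
        return labels, centers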

Flowchart
Let k be the number of clusters.

Advantages of K-Means Clustering


1- Computationally faster than hierarchical cluster analysis
2- Produces tighter clusters than hierarchical cluster analysis when the
clusters are globular

Disadvantages of K-Means Clustering:


1- Choosing the value of k in advance is difficult
2- Different starting clusters can result in different final clusters
3- Results for non-globular cluster patterns are sub-optimal

Example:
The dataset used in the example is given below
Case   Sepal.Length   Sepal.Width   Petal.Length   Petal.Width
1      5.1            3.5           1.4            0.2
2      4.9            3.0           1.4            0.2
3      4.7            3.2           1.3            0.2
4      4.6            3.1           1.5            0.2
5      5.0            3.6           1.4            0.2
6      5.4            3.9           1.7            0.4
7      4.6            3.4           1.4            0.3
8      5.0            3.4           1.5            0.2
9      4.4            2.9           1.4            0.2
10     4.9            3.1           1.5            0.1
11     5.4            3.7           1.5            0.2
12     4.8            3.4           1.6            0.2
13     4.8            3.0           1.4            0.1
14     4.3            3.0           1.1            0.1
15     5.8            4.0           1.2            0.2
16     5.7            4.4           1.5            0.4
17     5.4            3.9           1.3            0.4
18     5.1            3.5           1.4            0.3
19     5.7            3.8           1.7            0.3
20     5.1            3.8           1.5            0.3
21     5.4            3.4           1.7            0.2
22     5.1            3.7           1.5            0.4
23     4.6            3.6           1.0            0.2
24     5.1            3.3           1.7            0.5
25     4.8            3.4           1.9            0.2
26     5.0            3.0           1.6            0.2
27     5.0            3.4           1.6            0.4
28     5.2            3.5           1.5            0.2
29     5.2            3.4           1.4            0.2
30     4.7            3.2           1.6            0.2
31     4.8            3.1           1.6            0.2
32     5.4            3.4           1.5            0.4
33     5.2            4.1           1.5            0.1
34     5.5            4.2           1.4            0.2
35     4.9            3.1           1.5            0.2
36     5.0            3.2           1.2            0.2
37     5.5            3.5           1.3            0.2
38     4.9            3.6           1.4            0.1
39     4.4            3.0           1.3            0.2
40     5.1            3.4           1.5            0.2
41     5.0            3.5           1.3            0.3
42     4.5            2.3           1.3            0.3
43     4.4            3.2           1.3            0.2
44     5.0            3.5           1.6            0.6
45     5.1            3.8           1.9            0.4
46     4.8            3.0           1.4            0.3
47     5.1            3.8           1.6            0.2
48     4.6            3.2           1.4            0.2
49     5.3            3.7           1.5            0.2
50     5.0            3.3           1.4            0.2
51     7.0            3.2           4.7            1.4
52     6.4            3.2           4.5            1.5
53     6.9            3.1           4.9            1.5
54     5.5            2.3           4.0            1.3
55     6.5            2.8           4.6            1.5
56     5.7            2.8           4.5            1.3
57     6.3            3.3           4.7            1.6
58     4.9            2.4           3.3            1.0
59     6.6            2.9           4.6            1.3
60     5.2            2.7           3.9            1.4
61     5.0            2.0           3.5            1.0
62     5.9            3.0           4.2            1.5
63     6.0            2.2           4.0            1.0
64     6.1            2.9           4.7            1.4
65     5.6            2.9           3.6            1.3
66     6.7            3.1           4.4            1.4
67     5.6            3.0           4.5            1.5
68     5.8            2.7           4.1            1.0
69     6.2            2.2           4.5            1.5
70     5.6            2.5           3.9            1.1
71     5.9            3.2           4.8            1.8
72     6.1            2.8           4.0            1.3
73     6.3            2.5           4.9            1.5
74     6.1            2.8           4.7            1.2
75     6.4            2.9           4.3            1.3
76     6.6            3.0           4.4            1.4
77     6.8            2.8           4.8            1.4
78     6.7            3.0           5.0            1.7
79     6.0            2.9           4.5            1.5
80     5.7            2.6           3.5            1.0
81     5.5            2.4           3.8            1.1
82     5.5            2.4           3.7            1.0
83     5.8            2.7           3.9            1.2
84     6.0            2.7           5.1            1.6
85     5.4            3.0           4.5            1.5
86     6.0            3.4           4.5            1.6
87     6.7            3.1           4.7            1.5
88     6.3            2.3           4.4            1.3
89     5.6            3.0           4.1            1.3
90     5.5            2.5           4.0            1.3
91     5.5            2.6           4.4            1.2
92     6.1            3.0           4.6            1.4
93     5.8            2.6           4.0            1.2
94     5.0            2.3           3.3            1.0
95     5.6            2.7           4.2            1.3
96     5.7            3.0           4.2            1.2
97     5.7            2.9           4.2            1.3
98     6.2            2.9           4.3            1.3
99     5.1            2.5           3.0            1.1
100    5.7            2.8           4.1            1.3
101    6.3            3.3           6.0            2.5
102    5.8            2.7           5.1            1.9
103    7.1            3.0           5.9            2.1
104    6.3            2.9           5.6            1.8
105    6.5            3.0           5.8            2.2
106    7.6            3.0           6.6            2.1
107    4.9            2.5           4.5            1.7
108    7.3            2.9           6.3            1.8
109    6.7            2.5           5.8            1.8
110    7.2            3.6           6.1            2.5
111    6.5            3.2           5.1            2.0
112    6.4            2.7           5.3            1.9
113    6.8            3.0           5.5            2.1
114    5.7            2.5           5.0            2.0
115    5.8            2.8           5.1            2.4
116    6.4            3.2           5.3            2.3
117    6.5            3.0           5.5            1.8
118    7.7            3.8           6.7            2.2
119    7.7            2.6           6.9            2.3
120    6.0            2.2           5.0            1.5
121    6.9            3.2           5.7            2.3
122    5.6            2.8           4.9            2.0
123    7.7            2.8           6.7            2.0
124    6.3            2.7           4.9            1.8
125    6.7            3.3           5.7            2.1
126    7.2            3.2           6.0            1.8
127    6.2            2.8           4.8            1.8
128    6.1            3.0           4.9            1.8
129    6.4            2.8           5.6            2.1
130    7.2            3.0           5.8            1.6
131    7.4            2.8           6.1            1.9
132    7.9            3.8           6.4            2.0
133    6.4            2.8           5.6            2.2
134    6.3            2.8           5.1            1.5
135    6.1            2.6           5.6            1.4
136    7.7            3.0           6.1            2.3
137    6.3            3.4           5.6            2.4
138    6.4            3.1           5.5            1.8
139    6.0            3.0           4.8            1.8
140    6.9            3.1           5.4            2.1
141    6.7            3.1           5.6            2.4
142    6.9            3.1           5.1            2.3
143    5.8            2.7           5.1            1.9
144    6.8            3.2           5.9            2.3
145    6.7            3.3           5.7            2.5
146    6.7            3.0           5.2            2.3
147    6.3            2.5           5.0            1.9
148    6.5            3.0           5.2            2.0
149    6.2            3.4           5.4            2.3
150    5.9            3.0           5.1            1.8

The table gives the physical attributes of a sample of 150 flowers; our aim is
to categorize them into 3 categories of flowers based on the given physical
attributes.
STEP1: Choose the Classify option in the Analyze menu, then choose K-Means
Cluster.... A window, as shown below, opens. Enter 3 in the Number of
Clusters field.

STEP2: Click the Iterate option and enter 10 in the Maximum Iterations field
(10 is chosen for the purpose of demonstration; any number can be chosen), then
click Continue.

STEP3: Click the Save option, check the options Cluster membership and
Distance from cluster center, and click Continue.

Doing so will create 2 new fields in the original data file: Cluster membership
(which tells us which cluster each object belongs to) and Distance from cluster
center.
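
In code terms (an illustrative sketch reusing the labels and centers returned
by the small kmeans function above, with X as the data matrix), the two saved
columns correspond to:

    import numpy as np

    # labels, centers: output of the kmeans sketch shown earlier; X: the data matrix
    cluster_membership = labels + 1                                  # clusters numbered 1..k, as in SPSS
    distance_from_center = np.linalg.norm(X - centers[labels], axis=1)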
STEP4: Click Options, check the options Initial cluster centers and
Cluster information for each case, and click Continue.

STEP5: Click OK in the main menu.


The Output is as shown:
Initial Cluster Centers

               Cluster
               1      2      3
Sepal.Length   7.7    5.7    4.9
Sepal.Width    3.8    4.4    2.5
Petal.Length   6.7    1.5    4.5
Petal.Width    2.2    .4     1.7

Iteration History(a)

             Change in Cluster Centers
Iteration    1        2        3
1            1.226    1.205    1.141
2            .175     .000     .121
3            .070     .000     .047
4            .050     .000     .033
5            .000     .000     .000

a. Convergence achieved due to no or small change in cluster centers. The
maximum absolute coordinate change for any center is .000. The current
iteration is 5. The minimum distance between initial centers is 3.824.

NOTE: The table below is the required result. The value under the column named Cluster
tells us which cluster each case belongs to. For example, Case Numbers 1, 2, 3 etc. belong
to cluster 2 and Case Number 53 belongs to cluster 1.
Cluster Membership
[Table listing, for each of the 150 cases, the Case Number, the Cluster (1, 2
or 3) to which the case was assigned, and its Distance from that cluster's
centre.]

Final Cluster Centers

               Cluster
               1      2      3
Sepal.Length   6.9    5.0    5.9
Sepal.Width    3.1    3.4    2.7
Petal.Length   5.7    1.5    4.4
Petal.Width    2.1    .2     1.4

Distances between Final Cluster Centers

Cluster   1        2        3
1                  5.018    1.797
2         5.018             3.357
3         1.797    3.357

Number of Cases in each Cluster

Cluster   1    38.000
          2    50.000
          3    62.000
Valid          150.000
Missing        .000
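
The same analysis can be sketched with scikit-learn, assuming the 150 x 4 iris
measurements are held in an array X (cluster numbering is arbitrary, so the
labels need not map onto SPSS's 1, 2 and 3 in the same order):

    import numpy as np
    from sklearn.cluster import KMeans

    # X: 150 x 4 array of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width
    km = KMeans(n_clusters=3, max_iter=10, n_init=10, random_state=0).fit(X)

    print(km.cluster_centers_.round(1))              # analogue of the Final Cluster Centers table
    print(np.bincount(km.labels_))                   # analogue of Number of Cases in each Cluster
    print(km.transform(X).min(axis=1)[:5].round(3))  # distance of the first 5 cases from their own centre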

3) Two-Step Cluster

For datasets that are very large, or when a fast clustering procedure is needed
that can build clusters from continuous variables (like age or salary) as well
as categorical variables (like gender or marital status), neither of the above
procedures fulfils the requirement: hierarchical cluster analysis requires a
matrix of distances between all pairs of cases, while k-means clustering
requires shuffling cases in and out of clusters and needs the number of clusters
to be specified in advance. Two-Step Cluster Analysis is designed to meet both
requirements. It requires only one pass over the data (which is important for
very large data files), and it can produce solutions based on mixtures of
continuous and categorical variables and for varying numbers of clusters.

Algorithm
STEP1: Preclusters, which are clusters of the original cases that are used in
place of the raw data in the subsequent hierarchical clustering, are formed
STEP2: Hierarchical clustering of the preclusters formed in STEP1 is carried
out
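
Scikit-learn's Birch estimator follows the same two-step idea: a CF tree builds
preclusters in a single pass over the data, and a global (by default
agglomerative) clustering is then run on those preclusters. The sketch below is
a rough analogue of the procedure, not SPSS's exact algorithm, and assumes a
purely numeric data matrix X:

    import numpy as np
    from sklearn.cluster import Birch

    # Categorical fields would first need encoding (e.g. one-hot), which only
    # approximates SPSS's log-likelihood distance for mixed data.
    birch = Birch(threshold=0.5, n_clusters=3)   # STEP1: CF-tree preclusters, STEP2: global clustering
    labels = birch.fit_predict(X)
    print(np.bincount(labels))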

Example:
The dataset used in the example is given below
Name       age    mem_span   iq       read_ab   Gender
Oscar      4.00   4.20       101.00   5.60      1.00
Susie      4.00   4.20       101.00   5.60      2.00
Kimberly   4.10   3.90       108.00   5.00      2.00
Louise     5.50   4.20       90.00    5.80      2.00
Ronald     5.50   4.20       90.00    5.80      1.00
Charlie    5.50   4.10       105.00   6.00      1.00
Gertrude   5.70   3.60       88.00    5.30      2.00
Beatrice   5.90   4.00       90.00    6.00      2.00
Queenie    5.90   4.00       90.00    6.00      2.00
Thomas     5.90   4.00       90.00    6.00      1.00
Harry      6.15   5.00       95.00    6.40      1.00
Daisy      6.20   4.80       98.00    6.60      2.00
Ethel      6.40   5.00       106.00   7.00      2.00
Angus      6.70   4.40       95.00    7.20      1.00
Morris     6.90   4.50       91.00    6.60      2.00
John       6.90   5.00       104.00   7.30      1.00
Noel       7.20   5.00       92.00    6.80      1.00
Fred       7.30   5.50       100.00   7.20      1.00
Peter      7.30   5.50       100.00   7.20      1.00
Ian        7.50   5.40       96.00    6.60      1.00

The dataset has the fields age, memory span, IQ, reading ability and gender
of students. Our aim is to categorize them into 3 clusters based on age,
mem_span, read_ab and gender, and to evaluate the 3 clusters on their IQ values.

STEP1: Choose the Classify option in the Analyze menu, then choose TwoStep
Cluster.... A window, as shown below, opens. Add Gender to the Categorical
Variables field and age, mem_span and read_ab to the Continuous Variables
field. Enter 4 in the Maximum field under Determine automatically in the
Number of Clusters category.

STEP2: Click the Output option and add IQ to the Evaluation Fields (since we
want to evaluate the formed clusters on the basis of IQ). Check the option
Charts and tables in Model Viewer, then click Continue.
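
Outside SPSS, the same kind of evaluation could be sketched with pandas and
scikit-learn. The rows below are a handful of cases copied from the dataset
above; gender is one-hot encoded before clustering, which is only an
approximation of the way SPSS handles categorical fields, and IQ is deliberately
left out of the clustering so the clusters can be evaluated on it afterwards:

    import pandas as pd
    from sklearn.cluster import Birch

    # A few illustrative rows of the student data (the full file has 20 cases).
    df = pd.DataFrame({
        "Name":     ["Oscar", "Susie", "Louise", "Harry", "Ethel", "Fred"],
        "age":      [4.00, 4.00, 5.50, 6.15, 6.40, 7.30],
        "mem_span": [4.20, 4.20, 4.20, 5.00, 5.00, 5.50],
        "iq":       [101.0, 101.0, 90.0, 95.0, 106.0, 100.0],
        "read_ab":  [5.60, 5.60, 5.80, 6.40, 7.00, 7.20],
        "Gender":   [1, 2, 2, 1, 2, 1],
    })

    features = pd.get_dummies(df[["age", "mem_span", "read_ab", "Gender"]], columns=["Gender"])
    df["cluster"] = Birch(n_clusters=3).fit_predict(features)

    print(df.groupby("cluster")["iq"].agg(["count", "mean"]))   # evaluate the clusters on IQ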

The Output is as follows:

NOTE: On double-clicking the above picture we get:
