
Welcome to Powerpoint slides for

Chapter 13

Cluster Analysis for Market Segmentation


Marketing Research Text and Cases by Rajendra Nargundkar

Slide 1

1. A cluster, by definition, is a group of similar objects

2. There could be clusters of people, brands or other objects


3. If clusters are formed of customers similar to one another, then cluster analysis can help marketers identify segments (clusters)

4. If clusters of brands are formed, this can be used to gain insights into brands that are perceived as similar to each other on a set of attributes

5. This chapter explains the use of cluster analysis for customer segmentation

6. Cluster analysis is best performed when the variables are interval or ratio-scaled

Slide 2

1. There are two major classes of cluster analysis techniques: hierarchical and non-hierarchical

2. In hierarchical clustering, some measure of distance is used to identify distances between all pairs of objects to be clustered. One of the popular distance measures used is Euclidean Distance. Another is the Squared Euclidean Distance

3. We begin with all objects in separate clusters. Say we have ten objects in separate clusters. The two closest objects are joined to form a cluster. The remaining 8 objects remain separate. This is stage 1 of hierarchical clustering.

Slide 2 contd...

4. In stage 2, suppose the next two closest objects form another cluster. Now we have two clusters and 6 unclustered objects. This means a total of eight clusters, two with two objects each, and six with one object each.

5. This process continues, with points joining existing clusters (because they are closest to an existing cluster) and clusters joining other clusters, based on the shortest distance criterion

6. In this way, a range of possible solutions is formed, from a 10-cluster solution in the beginning to a single cluster solution at the end.

7. We have to decide how many clusters the data seems to have, using either the agglomeration schedule or the dendrogram to help make the decision. Both of these are computer outputs that describe, in numbers or visually, the sequence of cluster formation. This decision is somewhat subjective, but there are some guidelines one can follow, as illustrated in the worked example
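The stage-by-stage merging described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the routine used by any statistical package: the five two-variable objects are hypothetical, the function names are invented for this example, and average linkage on squared Euclidean distances is assumed as the rule for measuring distance between clusters.

```python
from itertools import combinations

def sq_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def average_linkage(c1, c2, points):
    """Mean squared distance over all pairs drawn from the two clusters."""
    total = sum(sq_euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[i] for i in range(len(points))]  # every object starts alone
    schedule = []
    while len(clusters) > 1:
        # pick the pair of clusters with the smallest linkage distance
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], points),
        )
        coeff = average_linkage(clusters[a], clusters[b], points)
        schedule.append((clusters[a][:], clusters[b][:], coeff))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return schedule

# Hypothetical toy data: five objects measured on two variables
toy = [(1, 1), (1, 2), (5, 5), (6, 5), (9, 9)]
schedule = agglomerate(toy)
for stage, (left, right, coeff) in enumerate(schedule, start=1):
    print(stage, left, right, round(coeff, 2))
```

With n objects the loop always produces n - 1 stages, which is why 20 cases give a 19-stage agglomeration schedule.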

Slide 3

1. In non-hierarchical clustering methods (commonly referred to as k-means clustering methods), we need to specify the number of clusters we want the objects to be clustered into.

2. This can be done if we have a hypothesis that the objects will group into a certain number of clusters. Alternatively, we can first do a hierarchical clustering on the data, find the approximate number of clusters, and then perform a k-means clustering

3. In our illustration, we have used both hierarchical and non-hierarchical methods in combination with one another

4. Let us move on to our worked example
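To make the k-means idea concrete, here is a minimal sketch in plain Python. It assumes the number of clusters is fixed in advance through the list of starting centres; the data points, the starting centres and the function names are all hypothetical and chosen only for illustration, and packaged routines differ in how they pick starting centres and decide when to stop.

```python
def sq_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(members):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(members) for col in zip(*members))

def k_means(points, initial_centres, max_iter=100):
    """Alternate assignment and centre updates until assignments stop changing."""
    centres = list(initial_centres)  # copy so the caller's list is untouched
    assignment = None
    for _ in range(max_iter):
        # assign every point to its nearest centre
        new_assignment = [
            min(range(len(centres)), key=lambda c: sq_euclidean(p, centres[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the solution is stable
        assignment = new_assignment
        # move each centre to the mean of its current members
        for c in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centre if a cluster ends up empty
                centres[c] = mean_point(members)
    return assignment, centres

# Hypothetical toy data with two visibly separate groups
toy = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, final_centres = k_means(toy, initial_centres=[(1, 1), (8, 8)])
print(labels)
print(final_centres)
```

The final centres returned here play the same role as the "final cluster centres" table in a statistical package's k-means output: each is the mean of the objects assigned to that cluster.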

Slide 4

Worked Out Example

Problem: A major Indian FMCG company wants to map the profile of its target market in terms of lifestyle, attitudes and perceptions. The company's managers prepare, with the help of their marketing research team, a set of 15 statements, which they feel measure many of the variables of interest. These 15 statements are given below. The respondent had to agree or disagree (1 = Strongly Agree, 2 = Agree, 3 = Neither Agree nor Disagree, 4 = Disagree, 5 = Strongly Disagree) with each statement.

1. I prefer to use e-mail rather than write a letter.
2. I feel that quality products are always priced high.
3. I think twice before I buy anything.
4. Television is a major source of entertainment.
5. A car is a necessity rather than a luxury.
6. I prefer fast food and ready to use products.
7. People are more health conscious today.
8. Entry of foreign companies has increased the efficiency of Indian companies.
9. Women are active participants in purchase decisions.
10. I believe politicians can play a positive role.
11. I enjoy watching movies.
12. If I get a chance, I would like to settle abroad.
13. I always buy branded products.
14. I frequently go out on weekends.
15. I prefer to pay by credit card rather than in cash.

Slide 5 For the purpose of this illustration, we will assume that 20 respondents answered the questionnaire above (In a real life situation, the sample size would be higher). The input data matrix of 20 respondents x 15 variables is shown in fig 1.
Fig. 1

Case  var00001  var00002  var00003  var00004  var00005  var00006  var00007  var00008
  1     1.00      3.00      5.00      4.00      3.00      5.00      3.00      2.00
  2     2.00      3.00      2.00      3.00      4.00      4.00      3.00      2.00
  3     3.00      2.00      3.00      4.00      3.00      5.00      3.00      3.00
  4     3.00      2.00      4.00      2.00      2.00      4.00      3.00      4.00
  5     2.00      2.00      4.00      2.00      2.00      5.00      3.00      3.00
  6     2.00      4.00      3.00      3.00      5.00      4.00      4.00      2.00
  7     1.00      1.00      2.00      4.00      4.00      1.00      2.00      4.00
  8     4.00      5.00      1.00      4.00      5.00      4.00      5.00      1.00
  9     2.00      1.00      5.00      3.00      4.00      4.00      2.00      1.00
 10     5.00      2.00      4.00      3.00      2.00      5.00      1.00      5.00
 11     4.00      3.00      3.00      2.00      1.00      2.00      1.00      5.00
 12     3.00      4.00      4.00      4.00      3.00      2.00      5.00      1.00
 13     4.00      3.00      2.00      2.00      3.00      3.00      4.00      2.00
 14     1.00      2.00      2.00      4.00      2.00      5.00      1.00      3.00
 15     2.00      3.00      4.00      1.00      5.00      4.00      2.00      4.00
 16     3.00      2.00      1.00      3.00      4.00      3.00      2.00      3.00
 17     5.00      1.00      1.00      5.00      1.00      2.00      4.00      2.00
 18     3.00      5.00      5.00      3.00      5.00      5.00      5.00      1.00
 19     3.00      2.00      4.00      2.00      4.00      4.00      1.00      4.00
 20     1.00      3.00      3.00      2.00      2.00      5.00      2.00      5.00

Slide 5 contd...

Fig. 1 contd...

Case  var00009  var00010  var00011  var00012  var00013  var00014  var00015
  1     3.00      2.00      4.00      1.00      1.00      1.00      5.00
  2     2.00      2.00      4.00      2.00      2.00      2.00      4.00
  3     4.00      2.00      4.00      3.00      4.00      4.00      3.00
  4     5.00      4.00      5.00      4.00      5.00      5.00      5.00
  5     4.00      4.00      5.00      5.00      3.00      3.00      4.00
  6     3.00      4.00      5.00      4.00      3.00      3.00      3.00
  7     2.00      5.00      4.00      3.00      3.00      3.00      1.00
  8     1.00      5.00      3.00      3.00      5.00      5.00      2.00
  9     2.00      1.00      2.00      2.00      4.00      4.00      3.00
 10     3.00      2.00      5.00      1.00      2.00      2.00      1.00
 11     2.00      2.00      4.00      5.00      1.00      1.00      2.00
 12     5.00      3.00      2.00      4.00      4.00      4.00      3.00
 13     2.00      3.00      4.00      3.00      5.00      5.00      4.00
 14     5.00      4.00      3.00      2.00      2.00      2.00      5.00
 15     4.00      5.00      2.00      1.00      1.00      1.00      4.00
 16     2.00      5.00      1.00      2.00      5.00      5.00      3.00
 17     2.00      4.00      4.00      3.00      3.00      3.00      2.00
 18     2.00      3.00      4.00      4.00      2.00      2.00      1.00
 19     1.00      3.00      4.00      5.00      3.00      3.00      2.00
 20     1.00      3.00      4.00      4.00      3.00      3.00      3.00
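As a small illustration of how a distance between two respondents is built from this data matrix, the sketch below computes the Euclidean and squared Euclidean distances between respondents 4 and 5, using their 15 answers as listed in Fig. 1. The variable names are invented for this example. The squared distance works out to 14, which coincides with the coefficient reported at stage 1 of the agglomeration schedule (Fig. 2), where cases 4 and 5 are the first pair to be joined.

```python
from math import sqrt

# Answers of respondents 4 and 5 on statements 1 to 15, read from Fig. 1
case_4 = [3, 2, 4, 2, 2, 4, 3, 4, 5, 4, 5, 4, 5, 5, 5]
case_5 = [2, 2, 4, 2, 2, 5, 3, 3, 4, 4, 5, 5, 3, 3, 4]

# Squared Euclidean distance: sum of squared differences over all variables
squared_distance = sum((a - b) ** 2 for a, b in zip(case_4, case_5))

# Euclidean distance is simply its square root
distance = sqrt(squared_distance)

print(squared_distance)       # 14
print(round(distance, 3))     # 3.742
```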

Slide 6 The computer output is obtained by first doing a hierarchical cluster analysis to find the number of clusters that exist in the data. These outputs are in figs 2 to 4 (Agglomeration schedule, vertical Icicle Plot and Dendrogram using Average Linkage, respectively). The second stage is a K-means (quick cluster) output with a pre-determined number of clusters to be specified. In this case, the output is for 4 clusters. We will look at both stage 1 and stage 2 outputs to understand the interpretation of both stages.

Slide 7

Fig. 2 : Agglomeration Schedule

         Clusters Combined                    Stage Cluster 1st Appears
Stage   Cluster 1   Cluster 2   Coefficient   Cluster 1   Cluster 2   Next Stage
  1         4           5        14.000000        0           0           5
  2        19          20        16.000000        0           0           7
  3         6          18        17.000000        0           0           9
  4         1           2        17.000000        0           0           8
  5         3           4        20.000000        0           1          11
  6        13          16        25.000000        0           0          13
  7        11          19        28.000000        0           2          11
  8         1          14        28.500000        0           0          10
  9         6           8        32.500000        0           0          12
 10         1          15        34.666668        0           0          14
 11         3          11        36.444443        0           7          15
 12         6          12        36.666668        0           0          19
 13         7          13        39.500000        0           6          17
 14         1           9        41.000000       10           0          16
 15         3          10        41.666668       11           0          16
 16         1           3        46.342857       14          15          18
 17         7          17        47.000000       13           0          18
 18         1           7        51.791668       16          17          19
 19         1           6        58.156250       18          12           0

Slide 8

1. A look at Fig. 2, the agglomeration schedule, can help us to identify large differences in the coefficient (4th column). The agglomeration schedule from top to bottom (stage 1 to 19) indicates the sequence in which cases get combined with others (or one cluster combines with another), until all 20 cases are combined together in one cluster at the last stage (stage 19).

2. Therefore, stage 19 represents a 1 cluster solution, stage 18 represents a 2 cluster solution, stage 17 represents a 3 cluster solution, and so on, going up from the last row to the first row. We have to identify how many clusters are in the data. We use the difference between rows in a measure called the coefficient (also known as the fusion coefficient) in column 4 to identify the number of clusters in the data.

Slide 8 Contd.

3. We will look at this figure from the last row upwards, because we would like to have the lowest possible number of clusters, for reasons of economy and ease of interpretation. We see that there is a difference of (58.15 - 51.79) in the coefficients between the 1 cluster solution (stage 19) and the 2 cluster solution (stage 18). This is a difference of 6.36. The next difference is (51.79 - 47.00), which is equal to 4.79 (between stage 18, the 2 cluster solution, and stage 17, the 3 cluster solution). The next one after that is (47.00 - 46.34), only 0.66, between stage 17 and stage 16. After this, there is again a large difference between the 4 cluster and 5 cluster solutions (stages 16 and 15), of (46.34 - 41.66) or 4.68. Thereafter, the differences between the next several rows of coefficients are smaller.

4. A large difference in the coefficient values between any two rows indicates a solution pertaining to the number of clusters which the lower row represents. Ignoring the first difference of 6.36, which would indicate only 1 cluster in the data, we look at the next largest differences. 4.79 is the difference between row 2 from the bottom and row 3 from the bottom, indicating a 2 cluster solution. But the difference between stages 16 and 15 is almost the same, indicating a 4 cluster solution. At this point, it is the judgement of the researcher that should decide whether to go for a 2 cluster or a 4 cluster solution. Just for illustration, we will choose the 4 cluster solution.
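The row-to-row differences discussed above can be reproduced directly from the coefficient column of Fig. 2. The sketch below is just that arithmetic, written in Python for illustration; the variable names are invented, and limiting the comparison to solutions with five or fewer clusters is an assumption made here to mirror the "last row upwards" reading described in the text.

```python
# Coefficient column of the agglomeration schedule (Fig. 2), stages 1 to 19
coefficients = [
    14.0, 16.0, 17.0, 17.0, 20.0, 25.0, 28.0, 28.5, 32.5, 34.666668,
    36.444443, 36.666668, 39.5, 41.0, 41.666668, 46.342857, 47.0,
    51.791668, 58.15625,
]
n_cases = 20

# Walk from the last stage upwards and record the jump into each stage
jumps = []
for stage in range(len(coefficients), 1, -1):
    diff = coefficients[stage - 1] - coefficients[stage - 2]
    clusters_left = n_cases - stage  # e.g. stage 19 leaves 1 cluster
    jumps.append((clusters_left, stage, round(diff, 2)))

# Jumps for the 1 to 5 cluster solutions
for clusters_left, stage, diff in jumps[:5]:
    print(f"{clusters_left}-cluster solution (stage {stage}): jump of {diff}")

# Ignore the 1-cluster solution and rank the 2 to 5 cluster solutions
candidates = sorted(jumps[1:5], key=lambda j: j[2], reverse=True)[:2]
print([c[0] for c in candidates])  # [2, 4]
```

The two largest jumps among these point to the 2 cluster and 4 cluster solutions, which is the choice left to the researcher's judgement in the text.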

Slide 9

Now, in stage 2, a k-means clustering is run with a 4 cluster solution requested (as identified from the hierarchical clustering done above). In the given problem, figs 5, 6, 7 and 8 are the outputs of k-means clustering for a 4 cluster solution. These outputs give us the initial cluster centres, the case listing of cluster membership (i.e., which case belongs to which of the clusters), the final cluster centres (the solution) and an ANOVA table.

Fig. 7 : Final Cluster Centers

Variable    Cluster 1   Cluster 2   Cluster 3   Cluster 4
VAR00001     2.2000      3.5000      1.7500      3.0000
VAR00002     2.2000      3.6667      2.0000      2.4000
VAR00003     3.8000      2.6667      2.2500      3.6000
VAR00004     3.2000      3.5000      3.0000      2.2000
VAR00005     3.2000      3.6667      3.7500      2.2000
VAR00006     4.4000      3.3333      3.2500      4.2000
VAR00007     2.8000      4.5000      1.7500      1.6000
VAR00008     2.4000      1.5000      3.5000      4.4000

Slide 9 Contd.

Fig. 7 contd...

Variable    Cluster 1   Cluster 2   Cluster 3   Cluster 4
VAR00009     3.2000      2.5000      3.2500      2.2000
VAR00010     2.2000      3.6667      4.7500      2.8000
VAR00011     3.8000      3.6667      2.5000      4.4000
VAR00012     2.4000      3.5000      2.0000      4.0000
VAR00013     2.4000      4.1667      1.2500      3.0000
VAR00014     3.2000      3.6667      2.7500      2.4000
VAR00015     4.0000      2.5000      3.2500      2.4000

Slide 10

1. The final cluster centres (above) describe the mean value of each variable for each of the 4 clusters. For example, cluster 1 is described by the mean values of variable 1 = 2.2, variable 2 = 2.2, variable 3 = 3.8, variable 4 = 3.2 and so on. Similarly, cluster 3 is described by variable 1 = 1.75, variable 2 = 2.0, variable 3 = 2.25, variable 4 = 3.0, and so on.

2. We now go back to the original variables (in this case the 15 statements in our questionnaire), and interpret the clusters in terms of the 15 variables. For example, cluster 3 consists of people who prefer e-mail to writing conventional letters (variable 1 value = 1.75, which is equivalent to agree on the scale of 1 to 5). Similarly, they are also people who tend to think twice before buying anything (variable 3 value = 2.25), in other words, careful spenders. They also agree (variable 2 value = 2.00) that quality products are always priced high; that is, they see a positive correlation between a product's quality and its price.

3. On these same variables, cluster 2 shows people who prefer conventional mail to e-mail (variable 1 value = 3.5, or close to disagree), people who do not necessarily associate high price with good quality (variable 2 value = 3.67), and who tend to be neutral about care in spending (variable 3 value = 2.67). In this way, when we compare final cluster centre values on each of the 15 variables, one cluster at a time, a complete picture of the clusters emerges.
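The reading of cluster centres against the 1 to 5 agreement scale can be sketched as a simple lookup. The centre values below are those listed in Fig. 7 for variables 1 to 3; the cut-off points (below 2.5 counts as "agree", 3.5 and above as "disagree") and the function name are hypothetical, chosen only so that the labels line up with the verbal interpretation given above. A researcher may well draw these lines differently.

```python
# Final cluster centres for variables 1 to 3, as listed in Fig. 7
centres = {
    1: {"VAR00001": 2.2000, "VAR00002": 2.2000, "VAR00003": 3.8000},
    2: {"VAR00001": 3.5000, "VAR00002": 3.6667, "VAR00003": 2.6667},
    3: {"VAR00001": 1.7500, "VAR00002": 2.0000, "VAR00003": 2.2500},
    4: {"VAR00001": 3.0000, "VAR00002": 2.4000, "VAR00003": 3.6000},
}

def describe(mean, agree_below=2.5, disagree_from=3.5):
    """Map a cluster mean to a rough verbal label (cut-offs are assumptions)."""
    if mean < agree_below:
        return "agree"
    if mean >= disagree_from:
        return "disagree"
    return "neutral"

# Build a verbal profile for each cluster
profiles = {
    cluster: {var: describe(mean) for var, mean in values.items()}
    for cluster, values in centres.items()
}

for cluster, profile in profiles.items():
    print(cluster, profile)
```

Under these assumed cut-offs, cluster 3 comes out as "agree" on all three statements, while cluster 2 comes out as "disagree" on the first two and "neutral" on the third, in line with the descriptions above.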

Slide 11

In this case, we will briefly describe each of the 4 clusters as follows:

Cluster 1

E-mail users, feel quality comes at a price, not careful spenders, do not like television much, do not think a car is a necessity, do not like fast food and ready to use products, are not sure whether people are more health-conscious today, think foreign companies have somewhat increased the efficiency of Indian companies, disagree that women are active purchasing decision makers, feel that politicians can play an active role, do not enjoy watching movies, might consider settling abroad, tend to buy branded products, do not go out much on weekends and like to pay cash rather than charging to their credit cards (if they have one). It is thus a cluster exhibiting many traditional values, except that they have adapted to e-mail use. They are also beginning to loosen their purse strings, and are probably in transition on some other factors, like acceptance of women as decision makers and the advent of credit cards.

Slide 11 contd...

Cluster 2

Regular mail writers, bargain hunters or aggressive buyers, not too particular about thinking before spending, not great valuers of TV, believe the car is a luxury, not too fond of fast food and convenience products, do not think people are very health conscious, feel foreign companies have done us good, think women are active purchasing decision makers, do not believe in politicians, do not like movies, do not want to settle abroad, do not stress branded products, do not go out on weekends, but do prefer credit cards for payments.
It is a group which likes to use credit, spends more freely, believes in woman power, believes in economics rather than politics, and feels quality products can be cheap. Also, they seem to have a patriotic streak, as they do not want to settle abroad.

Slide 12

Cluster 3

E-mail users, quality measured by price, think twice before buying, indifferent to TV, car is a luxury to them, not too fond of fast food, agree that people are health conscious, do not think foreign companies have made us efficient, do not believe in woman power, detest politicians, enjoy watching movies, willing to settle abroad, always buy branded products, go out on weekends, slightly prefer cash to credit cards. This group is not a free spending one, but is health conscious, more patriarchal, more loyal to branded products, and outgoing compared to other groups, even willing to go abroad to settle.

Slide 12 contd...

Cluster 4

Not too particular about e-mail, measure quality by price, free spending, enjoy watching TV, think a car is necessary, not fond of fast food, think people are health conscious, do not think foreign companies have made us efficient, believe in woman power, somewhat positive about politicians, not movie watchers, do not want to settle abroad, indifferent to branding, moderately outgoing and moderately in favour of credit cards rather than cash. This group is optimistic, free spending and a good target for TV advertising, particularly for consumer durables and entertainment. But they are not necessarily influenced by brands. They may want value for money, but if they see value, they may spend a lot.

In summary, the cluster analysis of this sample of respondents tells us a lot about the possible segments which exist in the target population.

Slide 13

ANOVA: Fig. 8 : Analysis of Variance

Variable    Cluster MS   DF   Error MS   DF       F        Prob
VAR00001      3.0500      3    1.315     16.0    2.3183    .114
VAR00002      3.0722      3    1.083     16.0    2.8359    .071
VAR00003      2.5722      3    1.630     16.0    1.5778    .234
VAR00004      1.6333      3     .943     16.0    1.7307    .201
VAR00005      2.5056      3    1.605     16.0    1.5609    .238
VAR00006      1.7056      3    1.505     16.0    1.1331    .365
VAR00007      9.6500      3     .390     16.0   24.7040    .000
VAR00008      8.5500      3     .681     16.0   12.5505    .000
VAR00009      1.3000      3    1.865     16.0     .6968    .567
VAR00010      5.5056      3     .730     16.0    7.5397    .002
VAR00011      2.7389      3    1.020     16.0    2.6830    .082
VAR00012      4.0833      3    1.293     16.0    3.1562    .054
VAR00013      7.2556      3     .799     16.0    9.0813    .001
VAR00014      1.6222      3    1.880     16.0     .8628    .480
VAR00015      2.8500      3    1.465     16.0    1.9446    .163

Slide 13 contd...

The ANOVA table (Fig. 8) tells us which of the 15 variables are significantly different across the 4 clusters. The last column indicates that variables 2, 7, 8, 10, 11, 12 and 13 are significant at the 0.10 level (equivalent to a 90% confidence level), as they have prob. values less than 0.10. The other variables are not statistically significant, as they all have prob. values greater than 0.10. But there is divided opinion about the utility of statistical testing for cluster analysis. Most established writers seem to feel that these tests (ANOVA or other tests) are not valid. Therefore, it is left to the researcher's judgement whether to use these in determining which variables are significant. If the tests are used, then the interpretation of clusters and differences across clusters should be only on the basis of those variables which are (statistically) significantly different across clusters at 0.10 or 0.05 or some other level.
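The screening step described above amounts to comparing each prob. value with the chosen significance level. A minimal sketch in Python, using the Prob column of Fig. 8 (the variable and function names are invented for this illustration):

```python
# Prob column of the ANOVA table (Fig. 8), keyed by variable number
prob = {
    1: 0.114, 2: 0.071, 3: 0.234, 4: 0.201, 5: 0.238,
    6: 0.365, 7: 0.000, 8: 0.000, 9: 0.567, 10: 0.002,
    11: 0.082, 12: 0.054, 13: 0.001, 14: 0.480, 15: 0.163,
}

def significant_variables(prob_values, alpha):
    """Return the variables whose prob. value is below the chosen level."""
    return [var for var, p in sorted(prob_values.items()) if p < alpha]

at_10_percent = significant_variables(prob, alpha=0.10)
at_5_percent = significant_variables(prob, alpha=0.05)

print(at_10_percent)  # [2, 7, 8, 10, 11, 12, 13]
print(at_5_percent)   # [7, 8, 10, 13]
```

Tightening the level from 0.10 to 0.05 drops variables 2, 11 and 12, which shows how much the list of "significant" variables depends on the level the researcher chooses.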

Slide 14

Additional Comments on Cluster Analysis

Objects
We have looked at an example of classifying people, with interval-scaled data. It is also possible to classify objects such as brands, products, cities, etc. with cluster analysis. For example, which brands are clustered together in terms of consumer perceptions for a positioning exercise, or which cities are clustered together in terms of the income, education and age profile of their residents.

Number of Clusters
One of the main decisions of a researcher is to decide how many clusters are present in the data. In certain cases, for example if we have a prior hypothesis about how many clusters ought to be present, this decision may already be made. But otherwise, it tends to be a subjective decision. One of the criteria that can be used in addition to the ones we have described in the chapter is that every cluster must have a reasonable or minimum number of objects. This means that if a cluster comes out with only one or two objects in it, we should look for another solution. It may be useful to experiment with two or three possible solutions before deciding on the number of clusters.

Slide 15

Variables
Once the reader is aware of the basics of cluster analysis, he can begin to use it creatively. For example, a cluster analysis can be done on some of the measured variables, and then other variables can be checked to see if they also exhibit differences across clusters. In the worked out example discussed earlier, only psychographic or behavioural variables were used to get the 4 clusters. We could then see if the members of each cluster belonged to different places, had different education levels, or whether one gender figured predominantly in any one of the clusters.

Scale
Cluster analysis is ideally suited to interval scaled variables, because Euclidean distance is a commonly used distance measure in the clustering process. But nominal and ordinal level data can be used after standardisation if appropriate. This may also necessitate the use of other measures of distance, more appropriate to the scales of the variables being used. But this should be done with care. In general, it is a good idea to standardise the variables before clustering, if the units of measurement are radically different.
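One common way to standardise is to convert each variable to z-scores (subtract the mean, divide by the standard deviation) before computing distances. The sketch below assumes that approach; the income and age figures are hypothetical, the function name is invented, and the population standard deviation is used purely as an illustrative choice (the sample standard deviation is an equally common convention).

```python
from statistics import mean, pstdev

def standardise(column):
    """Convert a list of values to z-scores (assumes the values are not all equal)."""
    m = mean(column)
    s = pstdev(column)  # population standard deviation, an illustrative choice
    return [(x - m) / s for x in column]

# Hypothetical respondents measured in very different units
monthly_income = [20000, 35000, 50000, 65000, 80000]  # rupees
age = [25, 30, 35, 40, 45]                            # years

z_income = standardise(monthly_income)
z_age = standardise(age)

print([round(z, 3) for z in z_income])
print([round(z, 3) for z in z_age])
```

Without standardising, differences in income (in the thousands) would swamp differences in age (in single or double digits) in any Euclidean distance. After standardising, both variables are on the same footing and contribute comparably to the distance.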

Slide 15 Contd...

Statistical Tests
As mentioned briefly earlier, some statistical tests for cluster analysis are available. But since their validity is questionable, caution is recommended in using either ANOVA or any other test. A general caution about cluster analysis itself is that it tends to produce different results with different methods, and some methods are quite vulnerable to errors in data. So, the stability of the clusters can be checked by splitting the sample and repeating the cluster analysis.
