
CLAMDA - INTERNATIONAL MANAGEMENT FACULTY OF ECONOMICS

Coffee Drinking Habits in Lithuania, Palestine and Italy


Business Intelligence Written Assignment

Kotryna Garsvaite Mahran Sharqawi Marcello Canetto

Professor Furio Camillo

CONTENT
Introduction
1 The Research
2 Importing Data
3 Simple Statistics
4 Principal Component Analysis
5 Size Effects Removal
   5.1 Principal Component Analysis after Size Effect Removal
6 The Cluster Analysis
7 Ward's Method
   7.1 Dendrogram - graphical representation
8 T-TEST Procedure
   8.1 Preparation of the dataset
   8.2 T-Test for Respondents
   8.3 Cluster 1
   8.4 Cluster 2
   8.5 Cluster 3
9 Proc Freq Procedure (Chi Square Test)
   9.1 Cluster x Variable
   9.2 Country X Variable
10 Strategic Decisions
   10.1 Cluster 1- Sophisticated Coffee and Cigarettes
   10.2 Cluster 2- Fast Coffee
   10.3 Cluster 3- Sweet Break or Take-Away
11 Appendix

Introduction
The energizing effect of the coffee plant is thought to have been discovered in the northeast region of Ethiopia, and the cultivation of coffee first expanded in the Arab world. The earliest credible evidence of coffee drinking appears in the middle of the 15th century, in the Sufi monasteries of Yemen in southern Arabia. From the Muslim world, coffee spread to Italy and then to the rest of Europe. Coffee has been a popular drink for many centuries, and searching through history for the roots of this drink leads to many stories and legends. From South American countries such as Brazil, through the northeast region of Ethiopia, to the Arabian Peninsula and Europe, people have used coffee for the stimulating effect of its caffeine content. Because of the popularity and attractiveness of coffee, and because we could distribute the questionnaire in different countries, we chose Coffee Drinking Habits in Lithuania, Palestine and Italy as the subject of our research. Our goal is to find the best café concept for different groups of people in three different countries.

The Research

In our research, we examined the habits of drinking coffee outside the home in three countries: Italy, Lithuania and Palestine. To this end we formulated a questionnaire aimed at people in these three countries, which lie at different points on the map and have different climates and, of course, different cultures. Our purpose is to find the differences between coffee drinking habits in these three countries and, in addition, to find similarities between specific groups across them. We are aware of the wide range of respondents and of the effect other factors may have on their answers. Coffee is perceived differently in these countries; still, worldwide chains such as Starbucks may have a similar effect on consumers in different places.
Through our questionnaire we tried to get information about coffee drinking habits such as the type of coffee preferred, the times at which people prefer to drink coffee, and other factors which are important for people who drink their coffee outside. The questionnaire was worded as clearly as possible so that everyone could understand the questions and answers without errors. Another feature is its simplicity: we structured the questions as simply as possible so that respondents could answer them in the minimum time. The result was a questionnaire of 18 questions (Appendix 1).
First we placed four qualitative questions: favorite type of coffee, frequency of drinking coffee, how many times, and the modality of drinking coffee outside. Then we identified 9 factors that we consider important for understanding the reasons behind the respondents' decisions. We formulated 9 questions and asked respondents to rate each on a scale from 1 to 10, where 1 is the most negative (not important) evaluation and 10 the most positive (important). These 9 factors are:
1. Interior and atmosphere of the place.
2. Socializing with people.
3. Effect of caffeine.
4. Traditional tastes of coffee.
5. Importance of the price.
6. Smoking.
7. "Take-away culture".
8. Ice coffee.
9. Dessert/croissant.
As our target audience was spread over three countries and it was hard to reach them physically to hand out the questionnaire, we reached them through the internet. We published the questionnaire online for four days. We also placed the generic questions at the end of the questionnaire in order to identify our respondents. In the end our sample consisted of 157 useful observations, around 40-60 from each country. The most active respondents were Lithuanians, the least active Palestinians. The data were originally cataloged in Microsoft Excel and later imported into SAS.

Importing Data

The first step was to import the data from Excel into SAS using the Import Wizard. Then we renamed and labeled the questions as follows:
ID: id
1. Question m_1 (type of coffee): label type
2. Question m_2 (time of drinking): label time
3. Question m_3 (times drinking): label times a day
4. Question m_4 (preferred drinking way): label way to drink
Next came 9 sub-questions which asked the respondents about the importance of different factors in their choice:
5. Question s_1 (interior and atmosphere): label interior
6. Question s_2 (socializing with people): label socializing
7. Question s_3 (effect of caffeine): label caffeine
8. Question s_4 (traditional tastes): label tastes
9. Question s_5 (importance of price): label price
10. Question s_6 (relating to smoking): label smoking
11. Question s_7 (take-away culture): label take away
12. Question s_8 (ice coffee): label ice coffee
13. Question s_9 (dessert/croissant): label dessert
Finally, the generic questions:
14. Question Country: label country
15. Question Gender: label gender
16. Question Age: label age
17. Question Occupation: label occupation
18. Question Smoking: label smoker

To do so, we had the following commands in SAS:


data Coffee.Coffee;
    set Coffee.Coffee;
    /* attach a descriptive label to each questionnaire variable */
    label id='id'
          m_1='type' m_2='time' m_3='times a day' m_4='way to drink'
          s_1='interior' s_2='socializing' s_3='caffeine' s_4='tastes'
          s_5='price' s_6='smoking' s_7='take away' s_8='ice coffee'
          s_9='dessert'
          country='country' gender='gender' age='age'
          occupation='occupation' smoker='smoker';
run;

Our data was ready for the analysis of the values in the respective tables.

Simple Statistics

Once our data was sorted and renamed, we started the first data analysis with PROC MEANS. To do this we gave SAS the command:
proc means data=Coffee.Coffee n mean stddev stderr median cv; var s_1-s_9; run;

With this procedure we obtain the number of observations (n), the average (mean), the standard deviation (stddev), the standard error (stderr), the median (median) and the coefficient of variation (cv).
The MEANS Procedure

Variable  Label          N   Mean       Std Dev    Std Error  Median     Coeff of Variation
s_1       interior     157   6.8343949  2.4543189  0.1958760  8.0000000  35.9112829
s_2       socializing  157   6.3312102  2.4215606  0.1932616  7.0000000  38.2479888
s_3       caffeine     157   6.5796178  2.6339193  0.2102096  7.0000000  40.0314930
s_4       tastes       157   4.1528662  3.0237640  0.2413226  3.0000000  72.8114953
s_5       price        157   5.3630573  2.6485431  0.2113767  5.0000000  49.3849478
s_6       smoking      143   4.7482517  3.8409736  0.3211983  3.0000000  80.8923740
s_7       take away    157   6.1401274  2.8699562  0.2290474  6.0000000  46.7409882
s_8       ice coffee   157   5.5732484  3.1136668  0.2484977  6.0000000  55.8680781
s_9       dessert      157   5.3503185  2.7939179  0.2229789  5.0000000  52.2196568

In this table we marked the variables with the highest means and the lowest standard deviations, which gives a clear view of the important variables in our research. For example, the high value of the interior variable means that respondents attach great importance to this factor when choosing a place to drink coffee. Respondents also gave high importance to Socializing and to the Take-Away option. This tells us that, for those who choose to drink coffee outside, the important factors are these three, which are connected not to the coffee itself but to the habit of drinking coffee. Another important fact is that Caffeine has one of the four highest mean values, which tells us that part of the respondents relate coffee to its primary feature, caffeine.
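The statistics requested from PROC MEANS can also be computed by hand. The following is a minimal sketch, written in Python rather than SAS so that it runs without a SAS installation; the function name `describe` and the ratings are hypothetical and are not taken from the survey data.

```python
import math
import statistics

def describe(values):
    """Return the statistics requested from PROC MEANS for one variable."""
    observed = [v for v in values if v is not None]  # skip missing answers
    n = len(observed)
    mean = statistics.mean(observed)
    std_dev = statistics.stdev(observed)        # sample standard deviation (n - 1)
    return {
        "n": n,
        "mean": mean,
        "std_dev": std_dev,
        "std_err": std_dev / math.sqrt(n),      # standard error of the mean
        "median": statistics.median(observed),
        "cv": 100 * std_dev / mean,             # coefficient of variation, in percent
    }

# Hypothetical ratings on the 1-10 scale; None marks a skipped question.
ratings = [8, 7, 9, None, 6, 10, 8, 8]
summary = describe(ratings)
print(summary)
```

The same relations can be checked against the table: for s_1, 100 x 2.4543189 / 6.8343949 gives about 35.91, the reported coefficient of variation, and for s_6 the standard error 0.3211983 is the standard deviation divided by the square root of 143, the number of respondents who answered that question.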

4 Principal Component Analysis


In this part we look for possible relationships between the different variables. In other words, we want to see whether the answers to the different questions of the questionnaire are related. This requires a multivariate analysis of the responses, which is done through principal component analysis. In SAS we use the program:
proc princomp data=Coffee.Coffee; var s_1-s_9; run;

This gives us 3 useful results to analyze:
1) Correlation coefficients


Correlation Matrix

                      s_1     s_2     s_3     s_4     s_5     s_6     s_7     s_8     s_9
s_1  interior      1.0000  0.5319  -.1563  0.2340  0.1566  -.0505  0.3069  0.0336  0.1214
s_2  socializing   0.5319  1.0000  -.0171  0.3303  0.1707  -.0478  0.3081  0.0174  0.1539
s_3  caffeine      -.1563  -.0171  1.0000  0.0022  0.1269  -.0377  -.0592  0.0135  0.1890
s_4  tastes        0.2340  0.3303  0.0022  1.0000  0.1767  -.1145  0.3631  0.1942  0.2501
s_5  price         0.1566  0.1707  0.1269  0.1767  1.0000  0.1137  -.0102  0.0520  0.0889
s_6  smoking       -.0505  -.0478  -.0377  -.1145  0.1137  1.0000  -.0909  -.0998  -.1030
s_7  take away     0.3069  0.3081  -.0592  0.3631  -.0102  -.0909  1.0000  0.3366  0.1401
s_8  ice coffee    0.0336  0.0174  0.0135  0.1942  0.0520  -.0998  0.3366  1.0000  0.2037
s_9  dessert       0.1214  0.1539  0.1890  0.2501  0.0889  -.1030  0.1401  0.2037  1.0000
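Each cell of the correlation matrix is a Pearson coefficient between two columns of answers. As a minimal sketch of that calculation, in Python rather than SAS, the function `pearson` and the three short lists of ratings below are hypothetical illustrations and not the survey data.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ratings of five respondents (not the survey data).
interior = [8, 6, 9, 3, 7]
socializing = [7, 5, 9, 2, 8]
smoking = [2, 9, 1, 8, 3]

r_pos = pearson(interior, socializing)   # respondents rate both factors alike
r_neg = pearson(interior, smoking)       # high on one factor, low on the other
print(round(r_pos, 4), round(r_neg, 4))
```

A coefficient near +1 means that respondents who rate one factor highly also rate the other highly; a coefficient near -1 means the opposite, and a value near 0 means the two ratings are unrelated.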

The first observation concerning the correlation coefficients is that they mix positive and negative values: the majority are positive, and only about 10 pairs have a negative value. We highlighted the highest values in yellow and the lowest in light blue. Among the positive values, the highest is 0.5319, the correlation between Socializing and Interior/Atmosphere. We can conclude that respondents who gave importance to socializing with people while drinking coffee also gave great importance to the interior of the café, and vice versa.

Although only one correlation is above 0.5, we still consider values above 0.2 as high, since we are testing a wide range of respondents. In second place we find the correlation between Tastes and Take-Away, with a value of 0.3631. We assume that respondents who like to take their coffee away care about the different tastes of coffee. In third place is the correlation between Ice Coffee and Take-Away, with a value of 0.3366, from which we could conclude that ice coffee lovers take their coffee away; Take-Away people, besides the different tastes mentioned before, also like Ice Coffee. Another positive correlation is between Tastes and Dessert (0.25): respondents who like different, probably sweet, tastes of coffee do not refuse dessert either. The last high value in this table is the correlation between Tastes and Socializing. It says that people like spending time with others in a coffee place that offers various tastes. Furthermore, Tastes has another positive correlation of 0.234 with Interior. The negative values, all small in absolute value, show the features which do not go together (Caffeine and Socializing; Smoking with Tastes, Take-Away, Ice Coffee and Dessert). These observations give us a first view of the trends in our research, which will be measured more precisely in later calculations. The values are not all positive, which reduces the possibility of an error called size effect. Still, the data must be corrected to check for such an error and its possible influence on the results obtained. To do this we will use a procedure which will be shown later. Now we continue with the second part of the PRINCOMP analysis.
2) Correlation matrix eigenvalues
We noted the presence of positive correlations between the variables, but we also found negative ones; still, we decided to try to eliminate the size effect.
The PRINCOMP procedure in SAS generates new vectors defining a new coordinate system that is composed of new, uncorrelated dimensions. Each principal component is a linear combination of the original variables, with coefficients given by an eigenvector of the correlation matrix.

Eigenvalues of the Correlation Matrix

     Eigenvalue   Difference   Proportion   Cumulative
1    2.30618151   0.98494347   0.2562       0.2562
2    1.32123803   0.09821232   0.1468       0.4030
3    1.22302572   0.23309129   0.1359       0.5389
4    0.98993443   0.19748232   0.1100       0.6489
5    0.79245210   0.05262728   0.0881       0.7370
6    0.73982482   0.03803349   0.0822       0.8192
7    0.70179132   0.20455790   0.0780       0.8972
8    0.49723342   0.06891478   0.0552       0.9524
9    0.42831865                0.0476       1.0000

The first column shows the eigenvalue of each principal component. We consider the eigenvalues to determine the importance of the principal components. The first 3 eigenvalues are greater than one and are therefore the most significant. However, considering only the first three would stop at about 54% of the variance explained. The other components are less important, but each still explains roughly 5% or more of the total variance. Since our target is not specific, and in order not to lose information, we consider all 9 principal components, which lets us explain the total variance.
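The Proportion and Cumulative columns follow directly from the eigenvalues: each eigenvalue is divided by their sum, which for a correlation matrix equals the number of variables. A short Python sketch of this arithmetic, using the eigenvalues reported in the table (the variable names are our own illustration):

```python
# Eigenvalues of the correlation matrix, copied from the SAS output above.
eigenvalues = [2.30618151, 1.32123803, 1.22302572, 0.98993443, 0.79245210,
               0.73982482, 0.70179132, 0.49723342, 0.42831865]

total = sum(eigenvalues)              # equals the number of variables (9)
proportions = [ev / total for ev in eigenvalues]

# Running total of the explained share of variance.
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

# Components with an eigenvalue above 1 (the rule of thumb used in the text).
above_one = [i + 1 for i, ev in enumerate(eigenvalues) if ev > 1]

for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"Prin{i}: proportion={p:.4f} cumulative={c:.4f}")
print("eigenvalue > 1:", above_one)
```

Under the eigenvalue-greater-than-one rule only the first three components would be kept, and the cumulative column shows what share of the variance that choice would retain.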


3) Eigenvectors

                      Prin1     Prin2     Prin3     Prin4
s_1  interior      0.434038  -.417329  0.046529  -.169922
s_2  socializing   0.461075  -.313510  0.160018  -.240113
s_3  caffeine      -.010706  0.517591  0.462081  -.245532
s_4  tastes        0.447888  0.103031  0.018111  0.051922
s_5  price         0.187179  -.016211  0.637148  0.248980
s_6  smoking       -.129733  -.273119  0.374877  0.665444
s_7  take away     0.441400  0.033958  -.319502  0.240765
s_8  ice coffee    0.260091  0.421480  -.287120  0.516211
s_9  dessert       0.289749  0.442015  0.165447  -.14574

                      Prin5     Prin6     Prin7     Prin8     Prin9
s_1  interior      0.079371  -.036478  0.389923  -.050380  0.666483
s_2  socializing   0.143574  0.169851  0.091209  0.437747  -.597040
s_3  caffeine      0.180006  0.609017  0.062985  0.064716  0.216151
s_4  tastes        -.151813  -.070963  -.787577  0.244862  0.278280
s_5  price         -.606727  -.119484  0.125885  -.279837  -.142138
s_6  smoking       0.546455  0.030417  -.112881  0.092286  0.066560
s_7  take away     0.141779  0.417523  -.065994  -.640667  -.186355
s_8  ice coffee    -.155812  0.007970  0.418987  0.450530  0.054111
s_9  dessert       0.454455  -.635840  0.083081  -.203562  -.113544

In the first column, Prin1, we count 7 variables with positive coefficients and 2 with negative coefficients. This observation shows that there is no significant size effect. However, before further considerations, we will remove the size effect to improve the result of our analysis.


5 Size Effects Removal
Size effect removal is motivated by the fact that the values a respondent assigns to the factors of the questionnaire depend on that person's average level of judgment. This level can vary greatly: we find low values throughout for people with a more pessimistic view, and higher values for those who are accustomed to giving high scores to the different parameters (optimistic). Size effects are removed through a standardization procedure in SAS. We start by creating 9 new variables (n_1-n_9), which represent the new values of the 9 scale questions. These values are centered on the average value of each individual. SAS calculates the maximum, the minimum and the average value for each individual. The software then standardizes the answers into a range between -1 and +1, where the individual's average value is represented by 0. We gave SAS the following command:
data Coffee.Coffee_1;
    set Coffee.Coffee;
    if _n_<158;
    array m1 s_1-s_9;   /* original ratings */
    array m2 n_1-n_9;   /* ratings after size effect removal */
    mean=mean(of s_1-s_9);
    min=min(of s_1-s_9);
    max=max(of s_1-s_9);
    do over m2;
        if m1<mean then m2=(m1-mean)/(mean-min);
        if m1>mean then m2=(m1-mean)/(max-mean);
        if m1=mean then m2=0;
        if mean=min and mean=max then m2=0;
        if m1=. then m2=0;   /* a missing answer is placed at the individual's mean */
    end;
run;

The process described above lets us create a new dataset that we called Coffee_1. It contains the variables n_1-n_9, which are free of size effect.
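To make the effect of this standardization concrete, the same rule can be sketched in Python for a single respondent. The function `remove_size_effect` and the two respondents are hypothetical; the sketch follows the rule described in the text (mean mapped to 0, minimum to -1, maximum to +1, missing answers to 0) and is not a translation checked against the SAS run.

```python
def remove_size_effect(answers):
    """Rescale one respondent's ratings around that respondent's own mean.

    Ratings below the mean are mapped onto [-1, 0), ratings above it onto
    (0, 1]; the mean itself and missing answers (None) become 0.
    """
    observed = [a for a in answers if a is not None]
    if not observed:
        return [0.0] * len(answers)
    mean = sum(observed) / len(observed)
    low, high = min(observed), max(observed)

    rescaled = []
    for a in answers:
        if a is None or a == mean:
            rescaled.append(0.0)
        elif a < mean:
            rescaled.append((a - mean) / (mean - low))
        else:
            rescaled.append((a - mean) / (high - mean))
    return rescaled

# Two hypothetical respondents with the same preferences but different
# habits of scoring: one uses the low end of the scale, one the high end.
strict = [1, 2, 3, 4, 5, 2, 3, 4, 3]
generous = [6, 7, 8, 9, 10, 7, 8, 9, 8]
print(remove_size_effect(strict))
print(remove_size_effect(generous))
```

Both respondents receive the same standardized profile, which is the purpose of the correction: the general tendency to give low or high scores no longer separates them.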


Now, using our new dataset, we repeat the initial procedures to check the actual difference between the original dataset and the corrected one. As before, we use PROC MEANS, so in our SAS program we write:
proc means data=Coffee.Coffee_1; var n_1-n_9; run;

The result is:

The MEANS Procedure

Variable    N   Mean         Std Dev     Minimum      Maximum
n_1       157    0.3196513   0.5870416   -1.0000000   1.0000000
n_2       157    0.1830072   0.5670946   -1.0000000   1.0000000
n_3       157    0.2677898   0.6738488   -1.0000000   1.0000000
n_4       157   -0.3539103   0.6861465   -1.0000000   1.0000000
n_5       157   -0.0736206   0.6409227   -1.0000000   1.0000000
n_6       157   -0.1591141   0.8576755   -1.0000000   1.0000000
n_7       157    0.1395236   0.6865145   -1.0000000   1.0000000
n_8       157    0.0133477   0.7477820   -1.0000000   1.0000000
n_9       157   -0.0531759   0.6824755   -1.0000000   1.0000000

First we notice that after the standardization all values lie between -1 and +1. The standardization does not show surprising effects. In fact, the three factors with the highest mean did not change: Interior, Socializing and Caffeine. However, the factor Take-Away, although still in fourth place, has lost its decisive role and loses importance in the analysis. Looking at the negative signs (highlighted in blue), we report in order: Tastes, Smoking and Price. The trend is again similar to the one seen before size effect removal.


5.1 Principal Component Analysis after Size Effect Removal
After size effect removal we can repeat the principal component procedure using the new, more precise dataset. In our SAS program we write:
Proc princomp data=Coffee.Coffee_1 out=Coffee.cluster; var n:; run;

Correlation Matrix

         n_1     n_2     n_3     n_4     n_5     n_6     n_7     n_8     n_9
n_1   1.0000  0.3349  -.1629  -.0016  0.0090  -.1434  0.1069  -.1875  -.0245
n_2   0.3349  1.0000  -.1091  0.1039  0.0056  -.2060  0.0633  -.2269  -.0505
n_3   -.1629  -.1091  1.0000  -.2004  0.0376  -.1189  -.2219  -.0620  0.0918
n_4   -.0016  0.1039  -.2004  1.0000  -.0055  -.2079  0.1704  0.0148  0.1436
n_5   0.0090  0.0056  0.0376  -.0055  1.0000  0.0012  -.1916  -.1977  -.0554
n_6   -.1434  -.2060  -.1189  -.2079  0.0012  1.0000  -.2312  -.1898  -.1804
n_7   0.1069  0.0633  -.2219  0.1704  -.1916  -.2312  1.0000  0.1870  -.0672
n_8   -.1875  -.2269  -.0620  0.0148  -.1977  -.1898  0.1870  1.0000  0.0374
n_9   -.0245  -.0505  0.0918  0.1436  -.0554  -.1804  -.0672  0.0374  1.0000

The outcome of this correlation matrix shows some differences with respect to the previous one, computed before size effect removal. The highest correlation (0.3349) is between Socializing and Interior. The second highest (0.1870) links Ice Coffee with Take-Away. The third (0.1704) links Take-Away with different Tastes. Analysing the correlations, we can argue that the most correlated factors are similar or logically connected: if a person associates coffee with socializing, then he or she will choose a café with a nice interior in which to spend time with another person. Likewise, ice coffee is a long drink that takes time to finish, so people take it away to drink it slowly.


Furthermore, the negative correlations let us draw some important conclusions. Firstly, the strongest negative correlation (-0.2312) shows that people do not link Smoking with Take-Away. The negative correlation of -0.2269 likewise shows that people do not relate Ice Coffee to Socializing.

Eigenvalues of the Correlation Matrix

     Eigenvalue   Difference   Proportion   Cumulative
1    1.73942362   0.19850007   0.1933       0.1933
2    1.54092355   0.26656185   0.1712       0.3645
3    1.27436170   0.23053503   0.1416       0.5061
4    1.04382667   0.11817241   0.1160       0.6221
5    0.92565426   0.15262272   0.1029       0.7249
6    0.77303154   0.09324188   0.0859       0.8108
7    0.67978966   0.09613972   0.0755       0.8863
8    0.58364994   0.14431088   0.0648       0.9512
9    0.43933906                0.0488       1.0000

The new correlation matrix has 4 principal components with eigenvalues greater than one, which we want to consider. However, they explain only 62% of the information, which would lead us to consider at least 5 components. After some consideration we decided to keep all 9 principal components for our analysis, because our sample is not homogeneous or specific: it is composed of people from three different countries and, consequently, diverse cultures. We need all the principal components because we require specific and detailed data. We want a macro view of the customer, as the scope of this research is to make general macro-marketing and strategic decisions about opening cafés in different countries.


Eigenvectors

         Prin1     Prin2     Prin3     Prin4
n_1   0.371810  0.427576  -.021365  -.273788
n_2   0.394858  0.444974  0.100454  -.181705
n_3   -.324870  -.052409  0.533502  -.428624
n_4   0.389770  -.104202  0.143501  0.644117
n_5   -.165954  0.338640  0.207353  0.394213
n_6   -.429974  0.152762  -.489183  0.222761
n_7   0.469938  -.245918  -.263986  -.124679
n_8   0.096979  -.607624  -.094785  -.145799
n_9   0.075611  -.194617  0.568567  0.227133

         Prin5     Prin6     Prin7     Prin8     Prin9
n_1   -.143199  0.477643  0.150053  -.541602  0.197539
n_2   -.054699  -.283115  -.471510  0.454429  0.301668
n_3   0.124845  -.367410  0.230515  -.208916  0.419537
n_4   -.002649  -.407235  -.032695  -.437329  0.208731
n_5   0.673584  0.353772  0.084423  0.177793  0.204390
n_6   -.358620  -.021224  0.084132  0.051592  0.603378
n_7   0.197221  -.081206  0.649893  0.339844  0.223837
n_8   0.226310  0.331441  -.494236  -.086094  0.422368
n_9   -.537096  0.385772  0.141847  0.328755  0.126714

Principal component analysis is a method to reduce the dimensionality of data. When two or more variables explain the same phenomenon, the purpose is to summarize the information they share. So where we have correlated variables we introduce principal components. The statistical space R^p, where p is the number of variables, is reduced to a space R^x with x < p, while trying to minimize the loss of information. The new macro variables have to describe as much as possible of the variance explained by all the original variables. There are as many principal components as variables; in our case we have 9 principal components, replacing the original 9 variables. The first PC is defined as the linear combination of the p initial variables having maximum variance; the second PC has the largest remaining variance under the constraint of orthogonality with PC1. This construction implies that the PCs are uncorrelated (COV = 0), in contrast to the original variables. To determine PC1 we find the weights of the linear combination that maximize the variance of the component. The same is done for PC2, with the additional condition of no correlation with PC1, and so on for all the others.
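The search for the weights of PC1 can be illustrated with power iteration, a simple numerical technique that repeatedly multiplies a vector by the correlation matrix until it settles on the direction of maximum variance. This is only a sketch in Python: it is not the algorithm PROC PRINCOMP uses internally, and the function name `first_component` is our own. The matrix is the correlation matrix of s_1-s_9 from the first PRINCOMP output, rounded to four decimals as printed.

```python
import math

# Correlation matrix of s_1-s_9, copied from the first PRINCOMP output.
R = [
    [1.0000, 0.5319, -.1563, 0.2340, 0.1566, -.0505, 0.3069, 0.0336, 0.1214],
    [0.5319, 1.0000, -.0171, 0.3303, 0.1707, -.0478, 0.3081, 0.0174, 0.1539],
    [-.1563, -.0171, 1.0000, 0.0022, 0.1269, -.0377, -.0592, 0.0135, 0.1890],
    [0.2340, 0.3303, 0.0022, 1.0000, 0.1767, -.1145, 0.3631, 0.1942, 0.2501],
    [0.1566, 0.1707, 0.1269, 0.1767, 1.0000, 0.1137, -.0102, 0.0520, 0.0889],
    [-.0505, -.0478, -.0377, -.1145, 0.1137, 1.0000, -.0909, -.0998, -.1030],
    [0.3069, 0.3081, -.0592, 0.3631, -.0102, -.0909, 1.0000, 0.3366, 0.1401],
    [0.0336, 0.0174, 0.0135, 0.1942, 0.0520, -.0998, 0.3366, 1.0000, 0.2037],
    [0.1214, 0.1539, 0.1890, 0.2501, 0.0889, -.1030, 0.1401, 0.2037, 1.0000],
]

def first_component(matrix, iterations=200):
    """Power iteration: largest eigenvalue and its unit-length eigenvector."""
    size = len(matrix)
    v = [1.0 / math.sqrt(size)] * size          # start from equal weights
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(size)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient v'Rv is the variance of the component.
    rv = [sum(row[j] * v[j] for j in range(size)) for row in matrix]
    eigenvalue = sum(a * b for a, b in zip(v, rv))
    return eigenvalue, v

eigenvalue, weights = first_component(R)
print(round(eigenvalue, 3), [round(w, 3) for w in weights])
print("share of total variance:", round(eigenvalue / len(R), 3))
```

The variance found this way agrees with the first eigenvalue reported by SAS, and the weights agree with the Prin1 column of the first eigenvector table. The later components would be found in the same way after removing the directions already extracted.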


As we have seen, we have 9 theoretical PCs; as the order of extraction increases, the variance of the components decreases, which indicates that they lose importance (significance). This happens because of the way the PCs are built: the first receives the maximum variance, and the following ones progressively less. However, each PC brings new informative content. We decided to use all the principal components for our analysis because each has a weight of about 5% or more and, as already mentioned, our sample is not homogeneous, so all the collected information is useful for our research.


The Cluster Analysis

The next step is to divide the sample into clusters, also called segments. The purpose of this analysis is to find groups of people who are sufficiently similar with respect to one or more variables (in this case the segmentation is carried out on the scale questions of our questionnaire). These groups should have a very small variance within them (homogeneity of the cluster) and, at the same time, a significant variance among them (the clusters must be as diverse as possible so that we can give them a meaning). In this way we create homogeneous groups of potential customers that are easy to find and study, for example for marketing purposes. The number of clusters is influenced by the purpose of the research: for general marketing and strategic choices the number of clusters should be low. We are trying to find the best concepts for cafés, and too many groups would be too varied to implement. Conversely, for micro-marketing and operational decisions the number of clusters should be very high to be effective. The first step of the cluster analysis is the creation of a distance matrix in which the observations are both the rows and the columns, while the cells contain the measure of similarity or distance for each pair of observations. The distance matrix thus relates the observations to one another in an N x N matrix.
Example of distance matrix
     a    b    c    d    e
a    0   16    1    9   10
b   16    0   17   25    2
c    1   17    0    4    9
d    9   25    4    0   13
e   10    2    9   13    0

There are various ways of measuring the distance between observations, the distance between clusters and their similarity. These elements are then used as selection criteria for deciding whether or not to merge different clusters: the distance measures how different the observations can be considered. In hierarchical clustering, the distance matrix is progressively reduced by merging the observations with the shortest distance (minimum distance fusion). Under this method, in our example, customers a and c are the most similar (distance equal to 1), so they merge into a new unit f. The distance of this new unit from each of the other units is the minimum of the distances of the two old units. Following the example we have:

     f    b    d    e
f    0   16    4    9
b   16    0   25    2
d    4   25    0   13
e    9    2   13    0

Now the most similar customers are b and e, so they merge into a new unit g, and so on until just a single unit remains. This theory leads to Ward's Method.
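The two merging steps of the example can be replayed with a short Python sketch. It implements only the minimum distance rule described here, not Ward's method, and the helper names `closest_pair` and `merge` are our own.

```python
def closest_pair(dist):
    """Return the two distinct units with the smallest distance."""
    units = sorted(dist)
    pairs = [(dist[a][b], a, b) for i, a in enumerate(units) for b in units[i + 1:]]
    _, a, b = min(pairs)
    return a, b

def merge(dist, a, b, new_name):
    """Merge units a and b; the new unit keeps the minimum of the old distances."""
    others = [u for u in dist if u not in (a, b)]
    merged = {u: {v: dist[u][v] for v in others} for u in others}
    merged[new_name] = {new_name: 0}
    for u in others:
        d = min(dist[a][u], dist[b][u])
        merged[new_name][u] = d
        merged[u][new_name] = d
    return merged

# Distance matrix from the example in the text.
dist = {
    "a": {"a": 0, "b": 16, "c": 1, "d": 9, "e": 10},
    "b": {"a": 16, "b": 0, "c": 17, "d": 25, "e": 2},
    "c": {"a": 1, "b": 17, "c": 0, "d": 4, "e": 9},
    "d": {"a": 9, "b": 25, "c": 4, "d": 0, "e": 13},
    "e": {"a": 10, "b": 2, "c": 9, "d": 13, "e": 0},
}

first = closest_pair(dist)             # customers a and c
step1 = merge(dist, *first, "f")       # new unit f replaces a and c
second = closest_pair(step1)           # customers b and e
print(first, step1["f"], second)
```

Repeating `closest_pair` and `merge` until one unit remains gives the full hierarchy that a dendrogram displays.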


Ward's Method

In this analysis we use Ward's method, which creates groups by merging, at each step, the two clusters whose union gives the smallest increase in the within-cluster sum of squares. This distance is calculated as the sum of squared Euclidean distances, so it follows from the Pythagorean theorem.
Sum of squared errors:
$$ SSE = \sum_i \sum_j \sum_k (X_{i,jk} - Y_{i,jk})^2 $$
This method therefore tends to maximize the so-called between variance (among the different clusters) and to minimize the within variance (inside each cluster):
$$ Var(X) = Var_{within} + Var_{between} $$

$$ Var_{within,p} = \Big[ \sum_i (X_{ip} - \bar{X}_p)^2 \Big] / N_p $$

The sum of the squared differences between the value of the i-th unit and the average value of its group, divided by the size of the group. The within variance of the whole sample is the average of these cluster variances, each weighted by N_p / N_{tot}.

$$ Var_{between} = \Big[ \sum_p (\bar{X}_p - \bar{X})^2 N_p \Big] / N_{tot} $$

The sum over the groups of the squared difference between the average value of the group (cluster) and the total average value, multiplied by the size of the group, divided by the whole population.
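The decomposition of the total variance into a within part and a between part can be verified numerically. The following Python sketch uses hypothetical ratings of a single factor, already assigned to three clusters; the function names are our own, and population variances (division by N) are assumed, as in the formulas above.

```python
def variance(values):
    """Population variance (divide by N), as in the formulas above."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def decompose(clusters):
    """Split the total variance of one variable into within and between parts."""
    everything = [v for c in clusters for v in c]
    n_tot = len(everything)
    grand_mean = sum(everything) / n_tot
    # Within: cluster variances weighted by cluster size.
    within = sum(len(c) * variance(c) for c in clusters) / n_tot
    # Between: squared distance of each cluster mean from the grand mean.
    between = sum(len(c) * (sum(c) / len(c) - grand_mean) ** 2 for c in clusters) / n_tot
    return variance(everything), within, between

# Hypothetical ratings of one factor, already split into three clusters.
clusters = [[1, 2, 3], [5, 6, 7], [9, 10, 8, 9]]
total, within, between = decompose(clusters)
print(total, within, between)
```

In this example almost all of the variance lies between the clusters, which is the situation a good partition aims for: homogeneous groups that differ clearly from one another.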

Below is the code we used in SAS to create the dataset Coffee.tree.
proc cluster data=Coffee.cluster method=ward outtree=Coffee.tree; var Prin1-Prin9; id id; run;

Then we utilized this dataset to draw the dendrogram which helped us to find the number of significant clusters for our analysis.


7.1 Dendrogram - Graphical Representation
In a clustering procedure, the dendrogram provides a graphical representation of the process of grouping observations: it shows the relative distance at which the statistical units are merged together. The graph is drawn in a Cartesian plane, with the X axis showing the logical distance of the clusters according to the chosen measure and the Y axis the hierarchical level of aggregation, or fusion distance. The choice of the hierarchical level defines the number of clusters adopted for the analysis. The observations are aggregated according to their degree of proximity: the further apart two observations are on the X axis, the lower the possibility that they end up in the same cluster. However, this also depends on the level of fusion distance accepted: the higher the level of the hierarchy chosen, the lower the number of clusters found. In our research we use the dendrogram to identify the relevant number of clusters to examine. After trying a few different numbers of clusters, we settled on 3. The result of the procedure is the graph below, which was cut so as to identify these 3 clusters. The cut was made at r = 0.06, indicating a pretty good level of accuracy. The dendrogram could be cut lower, giving a higher number of clusters, but for our strategic decisions, as mentioned before, a lower number serves better. The SAS procedure used was:
proc tree data=Coffee.tree nclusters=3 out=Coffee.parti3; id id; run;


[Figure: dendrogram produced by PROC TREE. Vertical axis: Semi-Partial R-Squared (0.000 to 0.100); horizontal axis: id.]

SAS Dendrogram. 3 Clusters


8 T-Test Procedure

8.1 Preparation of the dataset
Having identified the clusters, we now want to understand the distinctive characteristics of each group. To do this we created a generic cluster, in addition to the three clusters found, in which we included all the answers received. This allows us to compare the mean scores of each cluster with the overall mean scores and to understand how each cluster differs from the sample as a whole.

8.2 T-Test for Respondents
To study the relationship between the quantitative variables and the segments obtained with PROC CLUSTER we can use the t-test. For each variable, the test computes a value t and the associated probability that the mean calculated for the cluster differs from the mean calculated on the entire sample only by chance. The smaller this probability, shown in SAS as Pr > |t|, the less likely it is that the difference between the means is due to random effects, and the more likely it is that the variable is significant in explaining that cluster. The significance level usually adopted in the literature is 0.05: the null hypothesis of equal means is rejected, at the 95% confidence level, if the probability is less than 0.05. The t value is calculated as follows:

$$t = \frac{\bar{x}_c - \bar{x}_{tot}}{s_c}$$

where $\bar{x}_c$ is the cluster mean, $\bar{x}_{tot}$ is the mean of the entire sample and $s_c$ is the standard error, an estimate of the variability of the mean estimator.
Since the variance of the population is of course not available, an estimator of the variance must be used. SAS calculates the t-test using two methods, which differ only in how the variances are combined into the standard error.
The Satterthwaite method computes the standard error from the two variances (of the cluster and of the population), each divided by its own sample size. This method does not assume equality of variances and can be applied in all circumstances.
The Pooled method instead obtains the standard error from a single pooled variance, a weighted average of the two variances, and therefore requires equal variances: it can only be applied when the F-test for equality of variances does not reject the null hypothesis. Considering that, when the variances are equal, the two methods produce very similar values of t, it seems safer to use the Satterthwaite method.
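To make the difference between the two standard errors concrete, here is a minimal Python sketch of both calculations using their textbook formulas. The group sizes are those of Cluster 1 and of the whole sample, but the means and variances are hypothetical, and the function names are ours; SAS performs these computations internally.

```python
import math


def satterthwaite_t(mean1, var1, n1, mean2, var2, n2):
    """t value and approximate degrees of freedom without assuming equal variances."""
    v1, v2 = var1 / n1, var2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """t value and degrees of freedom assuming equal variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    t = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df


# Hypothetical scores: one cluster (n = 57) against the whole sample (n = 157).
t_s, df_s = satterthwaite_t(4.2, 1.0, 57, 3.7, 1.5, 157)
t_p, df_p = pooled_t(4.2, 1.0, 57, 3.7, 1.5, 157)
print(f"Satterthwaite: t = {t_s:.2f}, df = {df_s:.1f}")
print(f"Pooled:        t = {t_p:.2f}, df = {df_p}")
```

When the two groups have the same size and the same variance, both functions return the same t value and the same degrees of freedom.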

Before further calculations, the data should be sorted and merged, and a generic cluster 4 containing all respondents created, so that the data can be compared in the t-test:

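proc sort data=Coffee.parti3; by id; run;

/* attach the cluster assignment to the survey data
   (the merge assumes Coffee.cluster is also sorted by id) */
data Coffee.compare;
   merge Coffee.cluster Coffee.parti3;
   by id;
run;

/* copy of every respondent, relabelled as the generic cluster 4 */
data Coffee.compare1;
   set Coffee.compare;
   cluster=4;
run;

/* stack the original clusters and the generic cluster */
data Coffee.compare3;
   set Coffee.compare Coffee.compare1;
run;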
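The idea behind the last two data steps can be sketched in Python as follows. This is only an illustration of the stacking logic; the three respondents and their scores are hypothetical, and the helper name add_generic_cluster is ours.

```python
def add_generic_cluster(rows, generic_label=4):
    """Return the original rows followed by a copy of each row
    relabelled as the generic cluster."""
    copies = [dict(row, cluster=generic_label) for row in rows]
    return list(rows) + copies


# Hypothetical respondents: id, assigned cluster and one score.
respondents = [
    {"id": 1, "cluster": 1, "n_6": 5},
    {"id": 2, "cluster": 2, "n_6": 2},
    {"id": 3, "cluster": 3, "n_6": 1},
]

stacked = add_generic_cluster(respondents)
for row in stacked:
    print(row)
```

Every respondent therefore appears twice in the stacked data: once in the assigned cluster and once in cluster 4, which is why the frequency tables later in the report total twice the number of respondents.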

It is now possible to check, for each cluster, which variables describe its specificity, and the direction and strength of this relationship, if any.


8.3 Cluster 1

proc ttest data=Coffee.compare3;
   var n_1-n_9;
   class cluster;
   where cluster=1 or cluster=4;
run;

8.3.1 Features:

Variable  Method         Variances  DF      t Value  Pr > |t|  Highlighted feature
n_1       Satterthwaite  Unequal    131.32    3.24   0.0015    interior
n_2       Satterthwaite  Unequal    108.6     1.60   0.1124    socializing
n_3       Satterthwaite  Unequal    90.227   -2.45   0.0161    caffeine
n_4       Satterthwaite  Unequal    97.436   -0.96   0.3406
n_5       Satterthwaite  Unequal    108.28   -0.71   0.4784
n_6       Satterthwaite  Unequal    123.69    5.53   <.0001    smoking
n_7       Satterthwaite  Unequal    97.028   -1.57   0.1207    take away
n_8       Satterthwaite  Unequal    124.32   -4.97   <.0001    ice coffee
n_9       Satterthwaite  Unequal    93.623   -1.14   0.2563

8.3.2 Description
The results of the analysis show that the significant variables for the first cluster, highlighted in red (the lowest values of Pr > |t|), are n_1, n_3, n_6 and n_8, which refer to Interior, Caffeine, Smoking and Ice coffee respectively. Somewhat less important, highlighted in yellow, are n_2 and n_7, representing Socializing and Take-Away; their probabilities (0.11 and 0.12) are above the 0.05 threshold. Looking at the sign of the t value, Interior, Socializing and Smoking have a positive t value: members of this cluster score these factors above the overall mean. Ice coffee, Caffeine and Take-Away have a negative t value: they matter less to this cluster than to the sample as a whole. From these data it is already possible to give an initial description of the first cluster. The key features for members of this group to enjoy drinking coffee are Smoking, Interior and Socializing, with different weights in the final choice. The least important factors are Ice coffee, Caffeine and Take-Away.
This leads us to conclude that, if a café is opened, respondents in Cluster 1 place great value on the possibility to smoke, the interior of the café and socializing. For their decision on where to drink coffee it is not important whether ice coffee is offered, whether the coffee is strong, or whether take-away is available.
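The reading of the table can be reproduced with a short Python sketch that keeps the variables with Pr > |t| below 0.05 and splits them by the sign of t. The t values and probabilities are those of the Cluster 1 table; the value shown by SAS as <.0001 is entered as 0.0001, and the labels for n_4, n_5 and n_9 are taken from the tables of the other clusters.

```python
ALPHA = 0.05

# (variable, feature, t value, Pr > |t|) for Cluster 1 against the whole sample.
cluster_1 = [
    ("n_1", "Interior", 3.24, 0.0015),
    ("n_2", "Socializing", 1.60, 0.1124),
    ("n_3", "Caffeine", -2.45, 0.0161),
    ("n_4", "Tastes", -0.96, 0.3406),
    ("n_5", "Price", -0.71, 0.4784),
    ("n_6", "Smoking", 5.53, 0.0001),      # reported as <.0001
    ("n_7", "Take away", -1.57, 0.1207),
    ("n_8", "Ice coffee", -4.97, 0.0001),  # reported as <.0001
    ("n_9", "Dessert", -1.14, 0.2563),
]


def split_by_direction(rows, alpha=ALPHA):
    """Significant features scored above and below the overall mean."""
    above = [feature for _, feature, t, p in rows if p < alpha and t > 0]
    below = [feature for _, feature, t, p in rows if p < alpha and t < 0]
    return above, below


above, below = split_by_direction(cluster_1)
print("Above the overall mean:", above)
print("Below the overall mean:", below)
```

The same function can be applied to the tables of Cluster 2 and Cluster 3 by replacing the list of values.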

8.4 Cluster 2
proc ttest data=Coffee.compare3;
   var n_1-n_9;
   class cluster;
   where cluster=2 or cluster=4;
run;

8.4.1 Features:

Variable  Method         Variances  DF      t Value  Pr > |t|  Highlighted feature
n_1       Satterthwaite  Unequal    69.053   -3.96   0.0002    interior
n_2       Satterthwaite  Unequal    85.885   -5.26   <.0001    socializing
n_3       Satterthwaite  Unequal    90.132    2.40   0.0182    caffeine
n_4       Satterthwaite  Unequal    90.382   -2.50   0.0143    tastes
n_5       Satterthwaite  Unequal    72.708   -1.23   0.2229    price
n_6       Satterthwaite  Unequal    74.282    0.77   0.4455
n_7       Satterthwaite  Unequal    67.627   -0.36   0.7191
n_8       Satterthwaite  Unequal    124.8     6.79   <.0001    ice coffee
n_9       Satterthwaite  Unequal    70.623   -0.56   0.5802

8.4.2 Description
The table for the second cluster shows that the significant variables, with the lowest Pr > |t| values, are n_1 to n_4 and n_8: Interior, Socializing, Caffeine, Tastes and Ice coffee respectively. Less pronounced is n_5, Price (p = 0.22, not significant at the 0.05 level). Looking at the t values, the only two factors with a positive sign are Ice coffee and Caffeine. All the other significant factors have a negative t value, which means that they are less important to this cluster than to the sample as a whole. Unlike in Cluster 1, Interior and Socializing are not important factors for this cluster. Moreover, respondents in Cluster 2 do not pay much attention to the different tastes of coffee or to price.


8.5 Cluster 3
proc ttest data=Coffee.compare3;
   var n_1-n_9;
   class cluster;
   where cluster=3 or cluster=4;
run;

8.5.1 Features:

Variable  Method         Variances  DF      t Value  Pr > |t|  Highlighted feature
n_1       Satterthwaite  Unequal    108.85    1.19   0.2367
n_2       Satterthwaite  Unequal    107.46    2.96   0.0037    socializing
n_3       Satterthwaite  Unequal    104.54    1.02   0.3121
n_4       Satterthwaite  Unequal    94.269    2.99   0.0036    tastes
n_5       Satterthwaite  Unequal    89.599    1.77   0.0802    price
n_6       Satterthwaite  Unequal    199.78  -10.11   <.0001    smoking
n_7       Satterthwaite  Unequal    117.92    2.40   0.0181    take away
n_8       Satterthwaite  Unequal    94.366    0.03   0.9761
n_9       Satterthwaite  Unequal    111.62    2.04   0.0440    dessert

8.5.2 Description

For members of the third cluster, variables n_2, n_4, n_6, n_7 and n_9 (Socializing, Tastes, Smoking, Take-Away and Dessert) are significant (lowest Pr > |t|), while n_5, Price, is marginal (p = 0.08). Looking at the t values, all of these variables except Smoking have a positive sign, which means that members of this group care about these features when drinking coffee more than the sample as a whole. Smoking, by contrast, is not important here. Apart from Smoking, Tastes, Socializing and Take-Away have the biggest effect, Dessert a slightly lower one, and Price the weakest of these five variables. We are clearly dealing with people who like coffee of different tastes, take it with dessert and associate it with socializing with other people.


9 Proc Freq Procedure (Chi Square Test)
After determining the three clusters and their characteristics, it is important to compare each cluster with the qualitative characteristics of the respondents in order to know more about it. We can also compare two qualitative variables with each other to understand our sample better. For this we use the PROC FREQ procedure and the chi-square (CHISQ) test. CHISQ provides chi-square tests of independence and computes measures of association. The chi-square test is used to compare one categorical variable (here the cluster) with another variable that takes two or more values (sex, country, age, etc.). The observed counts in each category are compared with the expected counts, which are calculated under the null hypothesis. The null hypothesis is that the two variables (cluster and country, age, etc.) are independent of each other; the alternative hypothesis is that they are not independent, that is, they are associated. At the level of each cell, the null hypothesis is that the number of observations equals the expected number, and the alternative is that the observed number differs from the expected one. The test lets us reject or retain the main hypothesis that the clusters and the chosen variable are independent and, furthermore, compare the frequencies of each group in each cluster.
The test statistic is calculated by taking each observed number (O), subtracting the expected number (E), and squaring this difference. The larger the deviation from the null hypothesis, the larger the difference between observed and expected. Squaring the differences makes them all positive. Each squared difference is divided by the expected number, and these standardized differences are summed. The shape of the chi-square distribution depends on the number of degrees of freedom. For an extrinsic null hypothesis on a single variable, the number of degrees of freedom is simply the number of values of the variable minus one. When there is more than one nominal variable, the number of degrees of freedom is (number of rows - 1) x (number of columns - 1); in our case, for a 4 x 3 table, there are (4 - 1)(3 - 1) = 6 degrees of freedom.
In practice, the hypotheses for evaluating each variable against the cluster are:
H0: the variables are independent.
H1: the variables are not independent.

We consider the variables independent, and do not reject H0, when the chi-square probability is p > 0.05 (we choose the 95% confidence level as a default); otherwise we reject H0 and accept H1.
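The calculation described above can be sketched in a few lines of Python. The counts are the observed frequencies of the cluster by country table of section 9.1.1 (clusters 1 to 3 plus the generic cluster 4); the function name chi_square is ours, and the critical value 12.592 is the standard tabulated value for 6 degrees of freedom at the 0.05 level.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom of a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Rows: clusters 1-4; columns: Italy, Lithuania, Palestine.
cluster_by_country = [
    [15, 24, 18],
    [23, 13, 10],
    [13, 26, 15],
    [51, 63, 43],
]

statistic, df = chi_square(cluster_by_country)
CRITICAL_VALUE_DF6 = 12.592  # chi-square critical value, df = 6, alpha = 0.05
print(f"chi-square = {statistic:.4f} with {df} degrees of freedom")
print("reject H0" if statistic > CRITICAL_VALUE_DF6 else "do not reject H0")
```

The statistic stays below the critical value, which agrees with the probability above 0.05 reported by SAS for this table.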

9.1 Cluster x Variable


In this part of the analysis we compare each cluster with all the qualitative characteristics, first those from the generic questions, then questions m_1-m_4. The only variable we do not compare with the clusters is occupation, as it gives little information: the majority of respondents are students.

9.1.1 Cluster x Country
The biggest difference among the people who participated in the survey is their country, as they have different cultures and habits. We therefore first compare each cluster with the country where the respondents live.

The SAS program we use is:


proc freq data=Coffee.compare3;
   table cluster*country / all expected;
run;

Cluster  Statistic  Italy    Lithuania  Palestine  Total
1        Frequency  15       24         18         57
         Expected   18.516   22.873     15.611
         Percent    4.78     7.64       5.73       18.15
         Row Pct    26.32    42.11      31.58
         Col Pct    14.71    19.05      20.93
2        Frequency  23       13         10         46
         Expected   14.943   18.459     12.599
         Percent    7.32     4.14       3.18       14.65
         Row Pct    50.00    28.26      21.74
         Col Pct    22.55    10.32      11.63
3        Frequency  13       26         15         54
         Expected   17.541   21.669     14.79
         Percent    4.14     8.28       4.78       17.20
         Row Pct    24.07    48.15      27.78
         Col Pct    12.75    20.63      17.44
4        Frequency  51       63         43         157
         Expected   51       63         43
         Percent    16.24    20.06      13.69      50.00
         Row Pct    32.48    40.13      27.39
         Col Pct    50.00    50.00      50.00
Total    Frequency  102      126        86         314
         Percent    32.48    40.13      27.39      100


Statistics for Table of CLUSTER by country

Statistic                     DF  Value    Prob
Chi-Square                    6   9.6280   0.1412
Likelihood Ratio Chi-Square   6   9.3295   0.1559
Mantel-Haenszel Chi-Square    1   0.0052   0.9428
Phi Coefficient                   0.1751
Contingency Coefficient           0.1725
Cramer's V                        0.1238

The chi-square probability is 0.1412 > 0.05 (chi-square value 9.628, with 6 degrees of freedom). We do not reject the null hypothesis H0 and state that cluster and country are independent. On the other hand, the probability is only 14%, which is not very high, and the chi-square value, above 9, is not extremely low, so there may be some relationship. Looking at the cells one by one, we find some differences between expected and observed frequencies:
Cluster 2 contains 23 Italians instead of the expected 15, that is 50% instead of 32%, which means that the second cluster has features more common to Italians. The same cluster has a somewhat lower than expected frequency of Lithuanians, 28% instead of 40%, so its characteristics are less common among Lithuanians. For Palestinians there is no notable difference between expected and observed frequency.
Cluster 3 does not differ much from the expected values. Italians are slightly less frequent than expected (24% instead of 32%) and Lithuanians slightly more (48% instead of 40%).
All in all, Cluster 1 is common to all three countries. Cluster 2 suits Italians better, and Palestinians do not reject it. Cluster 3 reflects Lithuanian habits more; Palestinians do not reject it, while Italian preferences differ slightly.
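The comparison "observed share against expected share" used in this reading can be sketched in Python as follows. The counts are those of Cluster 2 and of the generic cluster 4 (the whole sample) in the table above; the helper name shares is ours.

```python
def shares(counts):
    """Percentage of each category within one row of the table."""
    total = sum(counts)
    return [100 * count / total for count in counts]


countries = ["Italy", "Lithuania", "Palestine"]
cluster_2 = [23, 13, 10]
whole_sample = [51, 63, 43]  # generic cluster 4: all respondents

for country, row_pct, overall_pct in zip(countries, shares(cluster_2), shares(whole_sample)):
    print(f"{country}: {row_pct:.0f}% in Cluster 2 against {overall_pct:.0f}% overall")
```

Under independence the row percentages of every cluster would be close to the overall percentages, so large gaps such as the one for Italy point to the cells that deserve a comment.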


9.1.2 Cluster x Gender
The second characteristic by which we compare the clusters is gender (sex). Here we use the SAS program:
proc freq data=Coffee.compare3;
   table cluster*sex / all expected;
run;

Cluster  Statistic  Female   Male     Total
1        Frequency  32       25       57
         Expected   32.312   24.688
         Percent    10.19    7.96     18.15
         Row Pct    56.14    43.86
         Col Pct    17.98    18.38
2        Frequency  25       21       46
         Expected   26.076   19.924
         Percent    7.96     6.69     14.65
         Row Pct    54.35    45.65
         Col Pct    14.04    15.44
3        Frequency  32       22       54
         Expected   30.611   23.389
         Percent    10.19    7.01     17.20
         Row Pct    59.26    40.74
         Col Pct    17.98    16.18
4        Frequency  89       68       157
         Expected   89       68
         Percent    28.34    21.66    50.00
         Row Pct    56.69    43.31
         Col Pct    50.00    50.00
Total    Frequency  178      136      314
         Percent    56.69    43.31    100.00

Statistics for Table of CLUSTER by sex

Statistic                     DF  Value    Prob
Chi-Square                    3   0.2550   0.9683
Likelihood Ratio Chi-Square   3   0.2553   0.9682
Mantel-Haenszel Chi-Square    1   0.0272   0.8689
Phi Coefficient                   0.0285
Contingency Coefficient           0.0285
Cramer's V                        0.0285

These results indicate that there is no statistically significant relationship between cluster and gender (chi-square with 3 degrees of freedom = 0.2550, p = 0.9683). The probability is about 97% and the chi-square value is very low; there is clearly no association between cluster and gender, and the features of all three clusters are acceptable for both genders.

9.1.3 Cluster x Age
Another variable to check is age, using the program:
proc freq data=Coffee.compare3;
   table cluster*age / all expected;
run;

Cluster  Statistic  14-19    20-24    25-29    30-39    =>40     Total
1        Frequency  1        26       21       5        4        57
         Expected   1.0892   31.223   18.153   3.9936   2.5414
         Percent    0.32     8.28     6.69     1.59     1.27     18.15
         Row Pct    1.75     45.61    36.84    8.77     7.02
         Col Pct    16.67    15.12    21.00    22.73    28.57
2        Frequency  1        30       10       3        2        46
         Expected   0.879    25.197   14.65    3.2229   2.051
         Percent    0.32     9.55     3.18     0.96     0.64     14.65
         Row Pct    2.17     65.22    21.74    6.52     4.35
         Col Pct    16.67    17.44    10.00    13.64    14.29
3        Frequency  1        30       19       3        1        54
         Expected   1.0318   29.58    17.197   3.7834   2.4076
         Percent    0.32     9.55     6.05     0.96     0.32     17.20
         Row Pct    1.85     55.56    35.19    5.56     1.85
         Col Pct    16.67    17.44    19.00    13.64    7.14
4        Frequency  3        86       50       11       7        157
         Expected   3        86       50       11       7
         Percent    0.96     27.39    15.92    3.50     2.23     50.00
         Row Pct    1.91     54.78    31.85    7.01     4.46
         Col Pct    50.00    50.00    50.00    50.00    50.00
Total    Frequency  6        172      100      22       14       314
         Percent    1.91     54.78    31.85    7.01     4.46     100.00

Statistics for Table of CLUSTER by age

Statistic                     DF  Value    Prob
Chi-Square                    12  6.0238   0.9149
Likelihood Ratio Chi-Square   12  6.2856   0.9010
Mantel-Haenszel Chi-Square    1   0.5908   0.4421
Phi Coefficient                   0.1385
Contingency Coefficient           0.1372
Cramer's V                        0.0800

WARNING: 50% of the cells have expected counts less than 5. Chi-Square may not be a valid test.


These results show that there is no statistically significant relationship between cluster and age (chi-square with 12 degrees of freedom = 6.0238, p = 0.9149). However, SAS warns that 50% of the cells have expected counts less than 5, so the chi-square test may not be valid. For further calculations Fisher's exact test should be used. Fisher's exact test is used when you want to conduct a chi-square test but one or more cells have an expected frequency below five. The chi-square test assumes that each cell has an expected frequency of five or more, whereas Fisher's exact test makes no such assumption and can be used regardless of how small the expected frequencies are. We could use the following program:

proc freq data=Coffee.compare3;
   tables cluster*age / fisher;
run;

On the other hand, the large majority of our respondents are 20-29 years old (87%), so our focus is on young people overall and further calculations are not necessary.
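The SAS warning for this table can be reproduced with a minimal Python sketch that computes the expected count of every cell and the share of cells below 5. The counts are the observed frequencies of the cluster by age table; the function names are ours.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total x column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def share_of_small_cells(table, threshold=5):
    """Share of cells whose expected count is below the threshold."""
    cells = [value for row in expected_counts(table) for value in row]
    return sum(1 for value in cells if value < threshold) / len(cells)


# Rows: clusters 1-4; columns: 14-19, 20-24, 25-29, 30-39, =>40.
cluster_by_age = [
    [1, 26, 21, 5, 4],
    [1, 30, 10, 3, 2],
    [1, 30, 19, 3, 1],
    [3, 86, 50, 11, 7],
]

print(f"{share_of_small_cells(cluster_by_age):.0%} of the cells have expected counts below 5")
```

The small cells all belong to the youngest and the two oldest age groups, where very few respondents were interviewed.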


9.1.4 Cluster x Smoker
When opening a café it is important to know how the respondents relate to smoking, in order to decide whether to prepare places for smokers or not to invest in them. First we determine the frequencies of smokers in each cluster.
proc freq data=Coffee.compare3;
   table cluster*smoker / all expected;
run;

Cluster  Statistic  No       Yes      Total
1        Frequency  19       38       57
         Expected   33.764   23.236
         Percent    6.05     12.10    18.15
         Row Pct    33.33    66.67
         Col Pct    10.22    29.69
2        Frequency  25       21       46
         Expected   27.248   18.752
         Percent    7.96     6.69     14.65
         Row Pct    54.35    45.65
         Col Pct    13.44    16.41
3        Frequency  49       5        54
         Expected   31.987   22.013
         Percent    15.61    1.59     17.20
         Row Pct    90.74    9.26
         Col Pct    26.34    3.91
4        Frequency  93       64       157
         Expected   93       64
         Percent    29.62    20.38    50.00
         Row Pct    59.24    40.76
         Col Pct    50.00    50.00
Total    Frequency  186      128      314
         Percent    59.24    40.76    100.00

Statistics for Table of CLUSTER by smoker

Statistic                     DF  Value     Prob
Chi-Square                    3   38.4895   <.0001
Likelihood Ratio Chi-Square   3   42.9592   <.0001
Mantel-Haenszel Chi-Square    1   9.6723    0.0019
Phi Coefficient                   0.3501
Contingency Coefficient           0.3304
Cramer's V                        0.3501


This time the results show a statistically significant relationship between cluster and smoking habits (chi-square with 3 degrees of freedom = 38.49, p < 0.0001). We reject H0 and accept H1: cluster and smoking habits are associated. Analyzing cluster by cluster, we can make some observations:
Cluster 1 shows a lower than expected percentage of non-smokers, 33% instead of 59%, and a higher than expected percentage of smokers, 67% instead of 41%. This result is expected, since the analysis of the clusters showed that smoking is the most required feature in Cluster 1.
Cluster 2 shows no notable deviation: with 25 non-smokers and 21 smokers, the split is close to 50:50 and near the expected percentages.
Cluster 3 shows the opposite trend to Cluster 1. The percentage of non-smokers is much higher than expected, 91% instead of 59%, and that of smokers much lower, just 9% instead of 41%. This matches the earlier finding that smoking is the one factor that clearly does not influence this cluster's coffee drinking preferences.
To sum up, Cluster 1 has features more common to smokers, Cluster 2 is common to both smokers and non-smokers, while Cluster 3 reflects the coffee drinking habits of non-smokers.
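The measures of association printed by SAS under the chi-square tests can be derived from the chi-square value itself. The Python sketch below uses the standard definitions of the Phi coefficient, the contingency coefficient and Cramer's V, applied to the chi-square value of the cluster by smoker table (N = 314, 4 rows, 2 columns); the function name is ours.

```python
import math


def association_measures(chi2, n, n_rows, n_cols):
    """Phi coefficient, contingency coefficient and Cramer's V from a chi-square value."""
    phi = math.sqrt(chi2 / n)
    contingency = math.sqrt(chi2 / (chi2 + n))
    cramers_v = math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    return phi, contingency, cramers_v


phi, contingency, cramers_v = association_measures(38.4895, n=314, n_rows=4, n_cols=2)
print(f"Phi = {phi:.4f}, Contingency = {contingency:.4f}, Cramer's V = {cramers_v:.4f}")
```

All three measures lie between 0 and 1, with larger values indicating a stronger association; for this table they are clearly higher than for the cluster by gender or cluster by country tables.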


9.1.5 Cluster x The Time of the Day
In this part we check whether there is any link between the time of day at which coffee is drunk (m_1) and the cluster. In this way, for example, the opening hours of a café could be optimized. We use the program:
proc freq data=Coffee.compare3;
   table cluster*m_1 / all expected;
run;

Columns: No matter = "Does not matter the time"; Meeting = "Usually when I meet other people for this purpose".

Cluster  Statistic  After Meals  Afternoon  No matter  Evening  Morning  Meeting  Total
1        Frequency  7            1          23         0        18       8        57
         Expected   6.172        2.9045     20.694     0.3631   20.331   6.535
         Percent    2.23         0.32       7.32       0.00     5.73     2.55     18.15
         Row Pct    12.28        1.75       40.35      0.00     31.58    14.04
         Col Pct    20.59        6.25       20.18      0.00     16.07    22.22
2        Frequency  7            2          14         1        19       3        46
         Expected   4.9809       2.3439     16.701     0.293    16.408   5.2739
         Percent    2.23         0.64       4.46       0.32     6.05     0.96     14.65
         Row Pct    15.22        4.35       30.43      2.17     41.30    6.52
         Col Pct    20.59        12.50      12.28      50.00    16.96    8.33
3        Frequency  3            5          20         0        19       7        54
         Expected   5.8471       2.7516     19.605     0.3439   19.261   6.1911
         Percent    0.96         1.59       6.37       0.00     6.05     2.23     17.20
         Row Pct    5.56         9.26       37.04      0.00     35.19    12.96
         Col Pct    8.82         31.25      17.54      0.00     16.96    19.44
4        Frequency  17           8          57         1        56       18       157
         Expected   17           8          57         1        56       18
         Percent    5.41         2.55       18.15      0.32     17.83    5.73     50.00
         Row Pct    10.83        5.10       36.31      0.64     35.67    11.46
         Col Pct    50.00        50.00      50.00      50.00    50.00    50.00
Total    Frequency  34           16         114        2        112      36       314
         Percent    10.83        5.10       36.31      0.64     35.67    11.46    100.00

Statistics for Table of CLUSTER by m_1

Statistic                     DF  Value     Prob
Chi-Square                    15  10.6619   0.7762
Likelihood Ratio Chi-Square   15  11.1431   0.7424
Mantel-Haenszel Chi-Square    1   0.0290    0.8648
Phi Coefficient                   0.1843
Contingency Coefficient           0.1812
Cramer's V                        0.1064

WARNING: 33% of the cells have expected counts less than 5. Chi-Square may not be a valid test.


These results show that there is no statistically significant relationship between cluster and coffee drinking time (chi-square with 15 degrees of freedom = 10.66, p = 0.7762). However, SAS warns that 33% of the cells have expected counts less than 5, so the chi-square test may not be valid; for further calculations Fisher's exact test should be used. We also notice that most respondents drink coffee in the morning (36%), say that the time does not matter (36%), or drink it when they meet other people (11%). Only one respondent chose the evening, so overall we can say that respondents drink coffee at all times of day except the evening. All the clusters show a similar trend, so further calculations are not necessary for our conclusion.

9.1.6 Cluster x The Type of Coffee
Another variable to analyze is the type of coffee (m_2) preferred by each cluster.
proc freq data=Coffee.compare3;
   table cluster*m_2 / all expected;
run;

Columns: Other = "Other, not traditional types".

Cluster  Statistic  Americano  Caffe Latte  Cappuccino  Espresso  Other    Total
1        Frequency  1          13           8           32        3        57
         Expected   2.5414     13.796       11.981      25.777    2.9045
         Percent    0.32       4.14         2.55        10.19     0.96     18.15
         Row Pct    1.75       22.81        14.04       56.14     5.26
         Col Pct    7.14       17.11        12.12       22.54     18.75
2        Frequency  2          6            10          25        3        46
         Expected   2.051      11.134       9.6688      20.803    2.3439
         Percent    0.64       1.91         3.18        7.96      0.96     14.65
         Row Pct    4.35       13.04        21.74       54.35     6.52
         Col Pct    14.29      7.89         15.15       17.61     18.75
3        Frequency  4          19           15          14        2        54
         Expected   2.4076     13.07        11.35       24.42     2.7516
         Percent    1.27       6.05         4.78        4.46      0.64     17.20
         Row Pct    7.41       35.19        27.78       25.93     3.70
         Col Pct    28.57      25.00        22.73       9.86      12.50
4        Frequency  7          38           33          71        8        157
         Expected   7          38           33          71        8
         Percent    2.23       12.10        10.51       22.61     2.55     50.00
         Row Pct    4.46       24.20        21.02       45.22     5.10
         Col Pct    50.00      50.00        50.00       50.00     50.00
Total    Frequency  14         76           66          142       16       314
         Percent    4.46       24.20        21.02       45.22     5.10     100.00


Statistics for Table of CLUSTER by m_2

Statistic                     DF  Value     Prob
Chi-Square                    12  16.7882   0.1577
Likelihood Ratio Chi-Square   12  17.7738   0.1227
Mantel-Haenszel Chi-Square    1   2.2114    0.1370
Phi Coefficient                   0.2312
Contingency Coefficient           0.2253
Cramer's V                        0.1335

WARNING: 30% of the cells have expected counts less than 5. Chi-Square may not be a valid test.

The results show that there is no statistically significant relationship between cluster and type of coffee preferred (chi-square with 12 degrees of freedom = 16.79, p = 0.1577). Again, SAS warns that 30% of the cells have expected counts less than 5, so the chi-square test may not be valid; for further calculations Fisher's exact test should be used. Espresso is the most popular type of coffee (45%); respondents also choose Caffe Latte (24%) and Cappuccino (21%), while the other choices are marginal.
Cluster 1, judging by the frequencies alone, is fonder of Espresso: 56% instead of the expected 45%. Its members also like Caffe Latte (23%) but choose Cappuccino less often (14% instead of 21%).
Cluster 2 respondents are also Espresso drinkers: 54% against the expected 45%. Their second choice is Cappuccino (22%), while Caffe Latte is less popular here (13% instead of 24%).
Cluster 3 consists mainly of Caffe Latte drinkers (35% instead of 24%). Their second choice is Cappuccino (28% instead of 21%) and their third Espresso, whose frequency is well below the expected one (26% instead of 45%).


9.1.7 Cluster x Times Per Day
In this part we compare each cluster with the number of times per day (m_3) respondents drink coffee.
proc freq data=Coffee.compare3;
   table cluster*m_3 / all expected;
run;

Cluster  Statistic  1 or less  2        3        >3       Total
1        Frequency  13         20       11       13       57
         Expected   17.427     21.783   8.7134   9.0764
         Percent    4.14       6.37     3.50     4.14     18.15
         Row Pct    22.81      35.09    19.30    22.81
         Col Pct    13.54      16.67    22.92    26.00
2        Frequency  12         22       6        6        46
         Expected   14.064     17.58    7.0318   7.3248
         Percent    3.82       7.01     1.91     1.91     14.65
         Row Pct    26.09      47.83    13.04    13.04
         Col Pct    12.50      18.33    12.50    12.00
3        Frequency  23         18       7        6        54
         Expected   16.51      20.637   8.2548   8.5987
         Percent    7.32       5.73     2.23     1.91     17.20
         Row Pct    42.59      33.33    12.96    11.11
         Col Pct    23.96      15.00    14.58    12.00
4        Frequency  48         60       24       25       157
         Expected   48         60       24       25
         Percent    15.29      19.11    7.64     7.96     50.00
         Row Pct    30.57      38.22    15.29    15.92
         Col Pct    50.00      50.00    50.00    50.00
Total    Frequency  96         120      48       50       314
         Percent    30.57      38.22    15.29    15.92    100.00

Statistics for Table of CLUSTER by m_3

Statistic                     DF  Value    Prob
Chi-Square                    9   9.2367   0.4157
Likelihood Ratio Chi-Square   9   8.8972   0.4468
Mantel-Haenszel Chi-Square    1   1.6380   0.2006
Phi Coefficient                   0.1715
Contingency Coefficient           0.1690
Cramer's V                        0.0990


These results do not reject the null hypothesis H0: there is no statistically significant relationship between cluster and the number of times a day coffee is drunk (chi-square with 9 degrees of freedom = 9.2367, p = 0.4157). Looking at the clusters and answers one by one, there are no large differences. The only remark is that respondents in Cluster 3 choose drinking coffee once a day or less more often than expected: 43% instead of 31%. Knowing the characteristics of the clusters, we can say that these respondents probably associate coffee with socializing and dessert rather than with an everyday routine. In Cluster 1 and Cluster 2 respondents most often drink coffee twice a day; in Cluster 3 the twice-a-day choice is also frequent.

9.1.8 Cluster x Way of Drinking Coffee
This time we compare the cluster with the way of drinking coffee (m_4), using the program:
proc freq data=Coffee.compare3;
   table cluster*m_4 / all expected;
run;

Cluster  Statistic  Bar      Take Away  Sitting in Cafe  Total
1        Frequency  11       11         35               57
         Expected   13.433   11.618     31.949
         Percent    3.50     3.50       11.15            18.15
         Row Pct    19.30    19.30      61.40
         Col Pct    14.86    17.19      19.89
2        Frequency  18       9          19               46
         Expected   10.841   9.3758     25.783
         Percent    5.73     2.87       6.05             14.65
         Row Pct    39.13    19.57      41.30
         Col Pct    24.32    14.06      10.80
3        Frequency  8        12         34               54
         Expected   12.726   11.006     30.268
         Percent    2.55     3.82       10.83            17.20
         Row Pct    14.81    22.22      62.96
         Col Pct    10.81    18.75      19.32
4        Frequency  37       32         88               157
         Expected   37       32         88
         Percent    11.78    10.19      28.03            50.00
         Row Pct    23.57    20.38      56.05
         Col Pct    50.00    50.00      50.00
Total    Frequency  74       64         176              314
         Percent    23.57    20.38      56.05            100.00


Statistics for Table of CLUSTER by m_4

Statistic                     DF  Value    Prob
Chi-Square                    6   9.5977   0.1426
Likelihood Ratio Chi-Square   6   9.2570   0.1596
Mantel-Haenszel Chi-Square    1   0.0296   0.8633
Phi Coefficient                   0.1748
Contingency Coefficient           0.1722
Cramer's V                        0.1236

The results do not reject the null hypothesis H0: there is no statistically significant relationship between cluster and the way coffee is drunk (chi-square with 6 degrees of freedom = 9.5977, p = 0.1426). The probability, 14%, is not very high, however. Analyzing the frequencies one by one, we notice values that differ from the expected ones in Cluster 2, where respondents choose more often than expected to drink coffee quickly at the bar (39% instead of 24%), and less often to take a cup of coffee without hurry in a café (41% instead of 56%). Overall, looking at the whole sample, sitting in a café and taking one's time is the most popular way to drink coffee (56%); take-away (20%) and a quick coffee at the bar (24%) are about equally popular.


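9.2 Country X Variable
In this part we compare the variable Country with the other qualitative variables related to personal coffee drinking habits (questions m_1-m_4). As mentioned before, the variable country is important for marketing decisions because of cultural differences. Note that Coffee.compare3 contains every respondent twice (once in the assigned cluster and once in the generic cluster 4), so the counts in the tables of this part are doubled (N = 314 for 157 respondents), which also inflates the chi-square values.

9.2.1 Country x The Time of the Day
First we compare the variable country with the time of day at which coffee is drunk (m_1), using the program: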
proc freq data=Coffee.compare3;
   table country*m_1 / all expected;
run;

Columns: No matter = "Does not matter the time"; Meeting = "Usually when I meet other people for this purpose".

Country    Statistic  After Meals  Afternoon  No matter  Evening  Morning  Meeting  Total
Italy      Frequency  20           4          34         2        42       0        102
           Expected   11.045       5.1975     37.032     0.6497   36.382   11.694
           Percent    6.37         1.27       10.83      0.64     13.38    0.00     32.48
           Row Pct    19.61        3.92       33.33      1.96     41.18    0.00
           Col Pct    58.82        25.00      29.82      100.00   37.50    0.00
Lithuania  Frequency  10           12         48         0        40       16       126
           Expected   13.643       6.4204     45.745     0.8025   44.943   14.446
           Percent    3.18         3.82       15.29      0.00     12.74    5.10     40.13
           Row Pct    7.94         9.52       38.10      0.00     31.75    12.70
           Col Pct    29.41        75.00      42.11      0.00     35.71    44.44
Palestine  Frequency  4            0          32         0        30       20       86
           Expected   9.3121       4.3822     31.223     0.5478   30.675   9.8599
           Percent    1.27         0.00       10.19      0.00     9.55     6.37     27.39
           Row Pct    4.65         0.00       37.21      0.00     34.88    23.26
           Col Pct    11.76        0.00       28.07      0.00     26.79    55.56
Total      Frequency  34           16         114        2        112      36       314
           Percent    10.83        5.10       36.31      0.64     35.67    11.46    100.00


Statistics for Table of country by m_1

Statistic                     DF  Value     Prob
Chi-Square                    10  49.0229   <.0001
Likelihood Ratio Chi-Square   10  61.5408   <.0001
Mantel-Haenszel Chi-Square    1   15.7465   <.0001
Phi Coefficient                   0.3951
Contingency Coefficient           0.3675
Cramer's V                        0.2794

WARNING: 22% of the cells have expected counts less than 5. Chi-Square may not be a valid test.

This time the results show a statistically significant relationship between country and the time of day at which coffee is drunk (chi-square with 10 degrees of freedom = 49.02, p < 0.0001). We reject H0 and accept H1: country and this variable are associated. As SAS warns that 22% of the cells have expected counts less than 5, Fisher's exact test should be used for further calculations; for now we analyze the countries by frequencies:
Italy. Respondents choose the morning somewhat more often than expected (41% instead of 36%); other popular choices are that the time does not matter, or after meals.
Lithuania. Respondents usually drink coffee in the morning (32%), regardless of the time (38%), or when meeting other people (13%). For Lithuanians coffee drinking is less tied to a routine.
Palestine is similar to Lithuania: coffee is drunk regardless of the time (37%), when meeting other people (23%), or in the morning (35%). Palestinians do not associate coffee drinking with meals (5% instead of the expected 11%).


9.2.2 Country x Type of Coffee
Another variable to compare by country is the type of coffee (m_2).
proc freq data=Coffee.compare3;
   table country*m_2 / all expected;
run;

Columns: Other = "Other, not traditional types".

Country    Statistic  Americano  Caffe Latte  Cappuccino  Espresso  Other    Total
Italy      Frequency  6          6            8           82        0        102
           Expected   4.5478     24.688       21.439      46.127    5.1975
           Percent    1.91       1.91         2.55        26.11     0.00     32.48
           Row Pct    5.88       5.88         7.84        80.39     0.00
           Col Pct    42.86      7.89         12.12       57.75     0.00
Lithuania  Frequency  6          64           22          30        4        126
           Expected   5.6178     30.497       26.484      56.981    6.4204
           Percent    1.91       20.38        7.01        9.55      1.27     40.13
           Row Pct    4.76       50.79        17.46       23.81     3.17
           Col Pct    42.86      84.21        33.33       21.13     25.00
Palestine  Frequency  2          6            36          30        12       86
           Expected   3.8344     20.815       18.076      38.892    4.3822
           Percent    0.64       1.91         11.46       9.55      3.82     27.39
           Row Pct    2.33       6.98         41.86       34.88     13.95
           Col Pct    14.29      7.89         54.55       21.13     75.00
Total      Frequency  14         76           66          142       16       314
           Percent    4.46       24.20        21.02       45.22     5.10     100.00

Statistics for Table of country by m_2

Statistic                     DF  Value      Prob
Chi-Square                    8   151.8787   <.0001
Likelihood Ratio Chi-Square   8   150.8345   <.0001
Mantel-Haenszel Chi-Square    1   1.4006     0.2366
Phi Coefficient                   0.6955
Contingency Coefficient           0.5710
Cramer's V                        0.4918


There is definitely a statistically significant relationship between country and type of coffee (chi-square with 8 degrees of freedom = 151.88, p < 0.0001). We reject H0 and accept H1: country and the type of coffee chosen are associated. Some observations can be made by comparing the two variables:
Italy. The dominant type of coffee is, as expected, Espresso (80% instead of 45%). Caffe Latte, by contrast, is hardly chosen compared with the whole sample (6% instead of 24%), and neither is Cappuccino (8% instead of 21%). Italians are heavy Espresso drinkers.
Lithuania, by contrast, prefers Caffe Latte (51% instead of 24%). Although Lithuanians like Cappuccino and Espresso more than Americano or other types, Espresso is chosen less than expected (24% instead of 45%). Lithuania clearly prefers coffee with milk.
Palestine. Palestinians are Cappuccino drinkers (42% instead of 21%). Their second preference is Espresso, although a little below the expected share (35% instead of 45%), and they choose other, non-traditional types more than the other countries (14% instead of 5%). All in all, Palestinians are more open to different coffee tastes; they exclude only Caffe Latte (7% instead of 24%).
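To see which combinations of country and coffee type drive this result, the contribution of each cell to the chi-square value, (O - E)^2 / E, can be listed in decreasing order. The Python sketch below does this for the observed counts of the country by m_2 table; the function name cell_contributions is ours.

```python
def cell_contributions(table, row_labels, col_labels):
    """Contribution (O - E)^2 / E of every cell, largest first."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    cells = []
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            cells.append(((observed - expected) ** 2 / expected, row_labels[i], col_labels[j]))
    return sorted(cells, reverse=True)


countries = ["Italy", "Lithuania", "Palestine"]
types = ["Americano", "Caffe Latte", "Cappuccino", "Espresso", "Other"]
country_by_type = [
    [6, 6, 8, 82, 0],
    [6, 64, 22, 30, 4],
    [2, 6, 36, 30, 12],
]

contributions = cell_contributions(country_by_type, countries, types)
for value, country, coffee in contributions[:4]:
    print(f"{country} x {coffee}: {value:.2f}")
print(f"total chi-square: {sum(value for value, _, _ in contributions):.4f}")
```

The largest contributions come from the cells discussed above: Caffe Latte in Lithuania and Espresso in Italy.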

9.2.3 Country X Times per Day

Going further, we compare one of the last relations: country and the number of times per day coffee is drunk (m_3).
proc freq data=Coffee.compare3;
   table country*m_3 / all expected;
run;
Country     Statistic    1 or less           2           3 More than 3     Total
Italy       Frequency           16          42          22          22       102
            Expected        31.185      38.981      15.592      16.242
            Percent           5.10       13.38        7.01        7.01     32.48
            Row Pct          15.69       41.18       21.57       21.57
            Col Pct          16.67       35.00       45.83       44.00
Lithuanian  Frequency           50          54           8          14       126
            Expected        38.522      48.153      19.261      20.064
            Percent          15.92       17.20        2.55        4.46     40.13
            Row Pct          39.68       42.86        6.35       11.11
            Col Pct          52.08       45.00       16.67       28.00
Palestine   Frequency           30          24          18          14        86
            Expected        26.293      32.866      13.146      13.694
            Percent           9.55        7.64        5.73        4.46     27.39
            Row Pct          34.88       27.91       20.93       16.28
            Col Pct          31.25       20.00       37.50       28.00
Total       Frequency           96         120          48          50       314
            Percent          30.57       38.22       15.29       15.92    100.00

Statistics for Table of country by m_3

Statistic                      DF       Value      Prob
Chi-Square                      6     29.5616    <.0001
Likelihood Ratio Chi-Square     6     32.4846    <.0001
Mantel-Haenszel Chi-Square      1      4.9002    0.0269
Phi Coefficient                        0.3068
Contingency Coefficient                0.2933
Cramer's V                             0.2170

The results show that there is a statistically significant relationship between country and the number of times per day coffee is drunk (chi-square with 6 degrees of freedom = 29.56, p < 0.0001). We reject the H0 hypothesis and accept H1: country and the daily frequency of drinking coffee are associated. Looking in detail:

Italians most often drink coffee twice a day (41%). Compared with the analysis above, this is probably in the morning and at one other chosen time, usually after meals. It is less common than expected for Italians to drink coffee once a day or less (16% instead of 31%).

Lithuanians also tend to drink coffee twice a day (43%). Moreover, slightly more often than expected, they drink coffee once a day or less (40% instead of 31%). Relating this to the information above, Lithuanians probably drink coffee in the morning and/or whenever they have an occasion to meet other people.

Palestinians are a bit different from the other countries. 35% of them drink coffee once a day or less. Twice a day is chosen less often than expected (28% instead of 38%), and the remaining 37% drink coffee three or more times a day.
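Statistical significance says little about how strong the association is, so it can help to look at the chi-square-based effect sizes that SAS prints alongside the test. The sketch below recomputes them in Python from the observed counts using the standard formulas (phi = sqrt(chi2/n), contingency coefficient = sqrt(chi2/(chi2+n)), Cramer's V = sqrt(chi2/(n*min(r-1, c-1)))). It is an illustration under the assumption that the counts above are transcribed correctly, and `association_measures` is a hypothetical helper name.

```python
import math

# Observed counts from the country-by-times-per-day table (m_3).
# Column order: 1 or less, 2, 3, more than 3.
observed = [
    [16, 42, 22, 22],  # Italy
    [50, 54, 8, 14],   # Lithuanian
    [30, 24, 18, 14],  # Palestine
]


def association_measures(rows):
    """Pearson chi-square plus three chi-square-based effect sizes."""
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    chi2 = 0.0
    for row, rt in zip(rows, row_totals):
        for o, ct in zip(row, col_totals):
            e = rt * ct / n  # expected count under independence
            chi2 += (o - e) ** 2 / e
    k = min(len(rows), len(col_totals)) - 1  # min(r - 1, c - 1)
    return {
        "chi2": chi2,
        "phi": math.sqrt(chi2 / n),
        "contingency": math.sqrt(chi2 / (chi2 + n)),
        "cramers_v": math.sqrt(chi2 / (n * k)),
    }


measures = association_measures(observed)
for name, value in measures.items():
    print(f"{name:<12}{value:.4f}")
```

Read this way, the association between country and times per day (Cramer's V about 0.22) is clearly weaker than the one between country and type of coffee (Cramer's V 0.49 in the SAS output for m_2), even though both tests have p < 0.0001.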


9.2.4 Country X The Way of Drinking

Finally we will compare country with the way of drinking coffee (m_4).
proc freq data=Coffee.compare3;
   table country*m_4 / all expected;
run;
Country     Statistic          Bar   Take Away  Sitting in Cafe     Total
Italy       Frequency           66          12               24       102
            Expected        24.038       20.79           57.172
            Percent          21.02        3.82             7.64     32.48
            Row Pct          64.71       11.76            23.53
            Col Pct          89.19       18.75            13.64
Lithuanian  Frequency            4          36               86       126
            Expected        29.694      25.682           70.624
            Percent           1.27       11.46            27.39     40.13
            Row Pct           3.17       28.57            68.25
            Col Pct           5.41       56.25            48.86
Palestine   Frequency            4          16               66        86
            Expected        20.268      17.529           48.204
            Percent           1.27        5.10            21.02     27.39
            Row Pct           4.65       18.60            76.74
            Col Pct           5.41       25.00            37.50
Total       Frequency           74          64              176       314
            Percent          23.57       20.38            56.05    100.00

Statistics for Table of country by m_4

Statistic                      DF       Value      Prob
Chi-Square                      4    145.6996    <.0001
Likelihood Ratio Chi-Square     4    146.2024    <.0001
Mantel-Haenszel Chi-Square      1     91.9405    <.0001
Phi Coefficient                        0.6812
Contingency Coefficient                0.5630
Cramer's V                             0.4817

The results show that there is a statistically significant relationship between country and the way of drinking coffee (chi-square with 4 degrees of freedom = 145.70, p < 0.0001). We reject the H0 hypothesis and accept H1: country and the way of drinking coffee are associated. Some specific features can be noticed here:

Italy. 65% of respondents, instead of the expected 24%, drink coffee fast, standing next to the bar. Italians do not have a culture of sitting in the café for a long time to drink coffee (24% instead of 56%), and take-away is also less common than expected (12% instead of 20%).

Lithuanians, in contrast, like to take their time over a cup of coffee more than expected (68% instead of 56%). When they do not, they save time by taking their coffee away (29%). They do not have a habit of drinking coffee in a hurry next to the bar (3% instead of the expected 24%).

Palestinians are more similar to Lithuanians than to Italians. First of all, they prefer taking their time over a cup of coffee (77% instead of the expected 56%). 19% of Palestinians like take-away coffee. On the other hand, only a few of them (5% instead of the expected 24%) take a fast coffee next to the bar.
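The "X% instead of Y%" comparisons used in this section set each country's row percentage against the share of the same answer in the whole sample (the Total row of the table). A minimal Python sketch of that calculation follows; it is only an illustration of the arithmetic, and the function name `row_vs_overall` and the variable names are assumptions rather than anything from the SAS program.

```python
# Observed counts from the country-by-way-of-drinking table (m_4).
columns = ["Bar", "Take Away", "Sitting in Cafe"]
observed = {
    "Italy":      [66, 12, 24],
    "Lithuanian": [4, 36, 86],
    "Palestine":  [4, 16, 66],
}


def row_vs_overall(table):
    """Return (row percentages per country, overall percentage per column)."""
    rows = list(table.values())
    n = sum(sum(r) for r in rows)
    # Share of each answer in the whole sample (the "expected" percentage).
    overall = [100 * sum(col) / n for col in zip(*rows)]
    # Share of each answer within each country (Row Pct in the SAS output).
    row_pct = {
        country: [100 * count / sum(counts) for count in counts]
        for country, counts in table.items()
    }
    return row_pct, overall


row_pct, overall = row_vs_overall(observed)
for country, pcts in row_pct.items():
    for name, pct, share in zip(columns, pcts, overall):
        print(f"{country:<11}{name:<16}{pct:5.1f}% instead of {share:5.1f}%")
```

Each printed line pairs a Row Pct value with the corresponding Total percentage, which is how a statement such as "65% instead of 24%" for Italians at the bar is obtained.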


9.2.5 The Most Significant Features of The Countries

After comparing the three countries on different variables, we can recognize some clear features and differences between Italy, Lithuania and Palestine.

9.2.5.1 Italy

Italians have specific coffee drinking traditions. They usually drink coffee twice a day: in the morning and after meals, or at any other time of the day. Italians prefer Espresso, most likely taken fast, next to the bar. These are the most significant features of the Italian respondents.

9.2.5.2 Lithuania

Lithuanians drink coffee once or twice a day, usually in the morning and then at any other time of the day, often with the purpose of socializing. Although Espresso is the most popular coffee in the whole sample, Lithuanians are real Caffe Latte lovers. They also enjoy sitting in the café and taking their time.

9.2.5.3 Palestine

Palestinians are more similar to Lithuanians than to Italians. Palestinians usually drink coffee in the morning, and then at times not tied to a schedule, for the purpose of meeting people and socializing. They drink coffee either rarely (once a day or less) or three or more times a day. Palestinians have a strong preference for Cappuccino; as everywhere, Espresso is also important. More than the other nationalities, Palestinians like different, non-traditional tastes of coffee. Like Lithuanians, they have a strong preference for taking their time over coffee in a café.


10 Strategic Decisions
Finally, we come to the conclusions, where we describe the different groups with their particular characteristics and habits, which tell us what kind of café would be popular with each group. First we identify the most significant features of each country. Then we combine these features to make a strategic marketing decision about what kind of café to open in each area.

10.1 Cluster 1- Sophisticated Coffee and Cigarettes

As we saw above, cluster 1 is a group of people who enjoy smoking, a good interior and atmosphere, and socializing while they drink coffee. Respondents with these characteristics are spread across all three countries, without any significant differences from the expected frequencies. The distinguishing feature of this cluster is that most of its respondents are smokers. Moreover, as in the sample as a whole, people in this cluster mostly choose to take their time over coffee.

For this group of respondents, fashionable cafés should be opened. The greatest attention should be paid to creating an interior and a cozy atmosphere that invite customers to stay longer. Even where smoking inside is forbidden, a smoking area should be available, with heaters for winter (especially in Lithuania). The cafés should be situated in locations that are convenient for meetings. Since Italians tend to drink coffee fast, next to the bar, a café in Italy should have fewer seats, and more attention should be paid to an attractive bar for a fast coffee on the way, or with a cigarette outside. In the other countries a bar is not necessary; more attention must be paid to providing enough places to sit. The menu in a sophisticated place should include the traditional types of coffee.

10.2 Cluster 2- Fast Coffee

Cluster 2 describes people whose preference is good-quality strong coffee, mostly Espresso, and also iced coffee (we assume in the summer season). These people do not like to spend a lot of time on the process; they prefer fast coffee. Moreover, Cluster 2 has a higher frequency of Italians than expected.

The café that satisfies the needs of the group described by Cluster 2 should be a simple coffee bar in a convenient location: next to offices, city-center shopping areas, universities and lunch restaurants. Good-quality strong coffee and a choice of iced coffee in summer are essential features. No investment should be made in extended menus with non-traditional tastes of coffee. Given the coffee drinking habits in Italy and the fact that the cluster contains a significant number of Italians, these coffee bars should be opened in Italy first. Since there is strong competition from similar places in this country, we would compete on good-quality coffee, convenient locations and fast service, or the possibility of self-service, to make the process less time-consuming.


10.3 Cluster 3- Sweet Break or Take-Away

Cluster 3 describes people who like meeting other people and, while socializing, having a cup of coffee with different tastes, or a dessert to eat with it. Their preferred type of coffee is Caffe Latte and probably its variations. These respondents also choose take-away coffee. This way of relating coffee to socializing and dessert is more common among Lithuanians.

Here a Starbucks-style coffee place should be opened. The café would offer an extended menu of different tastes. Most of the attention should be paid to Caffe Latte with different syrups. An attractive dessert menu should be available. The place should be cozy enough to spend some time there, but simple, keeping the prices low. Given the features of the three countries and the fact that this cluster is more common among Lithuanians, cafés of this concept should be opened in Lithuania first. Palestine should not be forgotten either, as Palestinians express a preference for different types of coffee, for drinking it many times a day and for taking their time over it. For the moment we should not invest in opening such a place in Italy, as people there have slightly different habits.


11 Appendix
Appendix 1. Questionnaire. Coffee drinking habits in Palestine, Lithuania and Italy
We are conducting research on cafe preferences and coffee drinking habits in three different cultures, with the purpose of finding the best cafe concept for each location. The questionnaire is therefore about your coffee drinking habits in cafes, not at home (if, for example, you drink coffee every morning at home and later in a cafe with other people, please relate your answers to the second). Please keep that in mind when answering the questions. If you do not drink coffee, do not fill in the questionnaire. We kindly ask you to fill in the questionnaire only if you are originally from Lithuania, Italy or Palestine. For each multiple choice question, choose just one best answer. For scale type questions, evaluate the argument or answer the question, where 1 is the most negative answer and 10 is the most positive. The survey is absolutely anonymous.

1. Choose one favourite type of coffee, which you usually drink, from the list below: *
   Espresso
   Americano
   Cappuccino
   Caffe Latte
   Other, not traditional types

2. When do you usually drink coffee? *
   Morning
   Afternoon
   Evening
   After meals
   Does not matter the time
   Usually when I meet other people for this purpose

3. How many times a day do you usually drink coffee? *
   1 or less
   2
   3
   More than 3

4. How do you prefer drinking coffee, when you are not at home? *
   Fast coffee in the cafe, standing next to the bar
   Sitting in a cafe, taking your time
   On the way, take-away coffee

5. Evaluate the importance of the interior and atmosphere of the place, where you prefer drinking coffee: *
   Scale from 1 (Not important) to 10 (Very important)

6. How strongly do you relate drinking coffee process to socializing with people? *
   Scale from 1 (Don't relate) to 10 (Relate a lot)

7. Evaluate the importance of the effect of caffeine in a drinking coffee process for you: *
   Scale from 1 (Not important) to 10 (Very important)

8. Evaluate your preference of not traditional tastes of coffee (coffee with different syrups, spices, etc.) *
   Scale from 1 (Don't like) to 10 (Like a lot)

9. Evaluate the importance of the price in choosing where to drink coffee: *
   Scale from 1 (Not important) to 10 (Very important)

10. Do you relate drinking coffee with smoking? (answer just smokers)
   Scale from 1 (Absolutely not) to 10 (Absolutely yes)

11. Evaluate your preference of the "take-away culture" of coffee: *
   Scale from 1 (Don't like) to 10 (Like a lot)

12. Evaluate your preference of ice coffee: *
   Scale from 1 (Don't like) to 10 (Like a lot)

13. Evaluate your preference/habit to choose a dessert/croissant together with coffee: *
   Scale from 1 (Don't do it) to 10 (Do it always)

14. Your country: *
   Italy
   Lithuanian
   Palestine

15. Your gender: *
   Male
   Female

16. Your age: *
   up to 13
   14-19
   20-24
   25-29
   30-39
   more than 40

17. Your occupation: *
   Student
   Working person
   Unemployed
   Other

18. Are you a smoker? *
   No
   Yes
