
Authors

Martin Klint Hansen, Nicholas Fritsche, Rasmus Porsgaard, Per Surland, Ulrick Tøttrup, Niels Yding Sørensen

Cand. Merc. Manual for SPSS 16.0

Description: Contains statistical methods used at master level.

IT Department, Fall 2008

Table of Contents
1. Introduction ................................................................ 1
2. General facts about SPSS .................................................... 2
   2.1 The contents of the Data Editor ......................................... 2
       2.1.1 The File menu ..................................................... 2
       2.1.2 The Edit menu ..................................................... 2
       2.1.3 The View menu ..................................................... 2
       2.1.4 The Data menu ..................................................... 3
       2.1.5 The Transform menu ................................................ 3
       2.1.6 The Analyze menu .................................................. 3
       2.1.7 The Graphs menu ................................................... 4
       2.1.8 The Utilities menu ................................................ 4
       2.1.9 The Help Menu ..................................................... 4
   2.2 Contents of the Output Window ........................................... 4
3. Cluster analysis ............................................................ 6
   3.1 Introduction ............................................................ 6
   3.2 Hierarchical analysis of clusters ....................................... 6
       3.2.1 Example ........................................................... 6
       3.2.2 Implementation of the analysis .................................... 8
       3.2.3 Output ........................................................... 12
   3.3 K-means cluster analysis (Non-hierarchical cluster analysis) ........... 14
       3.3.1 Example .......................................................... 14
       3.3.2 Implementation of the analysis ................................... 14
       3.3.3 Output ........................................................... 18
4. Correspondence Analysis .................................................... 21
   4.1 Introduction ........................................................... 21
   4.2 Example ................................................................ 21
   4.3 Implementation of the analysis ......................................... 22
       4.3.1 Specification of the variables ................................... 23
       4.3.2 Model ............................................................ 24
       4.3.3 Statistics ....................................................... 25
       4.3.4 Plots ............................................................ 25
   4.4 Output ................................................................. 26
   4.5 HOMALS (Multiple correspondence analysis) .............................. 31
5. Multiple analysis of variance .............................................. 32
   5.1 Introduction ........................................................... 32
   5.2 Example ................................................................ 33
   5.3 Implementation of the analysis ......................................... 33
   5.4 Output ................................................................. 39
6. Discriminant analysis ...................................................... 43
   6.1 Introduction ........................................................... 43
   6.2 Example ................................................................ 44
   6.3 Implementation of the analysis ......................................... 44
       6.3.1 Statistics ....................................................... 46
       6.3.2 Classify ......................................................... 47
       6.3.3 Save ............................................................. 48
   6.4 Output ................................................................. 48
7. Profile analysis ........................................................... 53
   7.1 Introduction ........................................................... 53
   7.2 Example ................................................................ 53
   7.3 Implementation of the analysis ......................................... 54
       7.3.1 Are the profiles parallel? ....................................... 57
       7.3.2 Are the profiles congruent? ...................................... 60
       7.3.3 Are the profiles horizontal? ..................................... 61
8. Factor analysis ............................................................ 63
   8.1 Introduction ........................................................... 63
   8.2 Example ................................................................ 63
   8.3 Implementation of the analysis ......................................... 64
       8.3.1 Descriptives ..................................................... 65
       8.3.2 Extraction ....................................................... 66
       8.3.3 Rotation ......................................................... 67
       8.3.4 Scores ........................................................... 68
       8.3.5 Options .......................................................... 68
   8.4 Output ................................................................. 69
9. Multidimensional scaling (MDS) ............................................. 74
   9.1 Introduction ........................................................... 74
   9.2 Example ................................................................ 74
   9.3 Implementation of the analysis ......................................... 75
       9.3.1 Model ............................................................ 77
       9.3.2 Options .......................................................... 78
   9.4 Output ................................................................. 79
10. Conjoint Analysis ......................................................... 86
   10.1 Introduction .......................................................... 86
   10.2 Example ............................................................... 86
   10.3 Generating an orthogonal design ....................................... 87
       10.3.1 Options ......................................................... 90
   10.4 PlanCards ............................................................. 90
   10.5 Conjoint analysis ..................................................... 92
   10.6 Output ................................................................ 94
11. Item analysis ............................................................. 97
   11.1 Introduction .......................................................... 97
   11.2 Example ............................................................... 97
   11.3 Implementation of the analysis ........................................ 98
       11.3.1 Model ........................................................... 99
       11.3.2 Statistics ...................................................... 99
   11.4 Output ............................................................... 101
12. Introduction to Matrix algebra ........................................... 104
   12.1 Initialization ....................................................... 104
   12.2 Importing variables into matrices .................................... 105
   12.3 Matrix computation ................................................... 105
   12.4 General Matrix operations ............................................ 109
   12.5 Matrix algebraic functions ........................................... 118
APPENDIX A [Choice of Statistical Methods] ................................... 122
Literature List .............................................................. 123


1. Introduction
The purpose of this manual is to introduce the reader to the use of SPSS for the statistical analyses used at Master level (Cand. Merc.). For data manipulation and basic statistical analysis (such as the T-test, ANOVA, loglinear and logit loglinear analysis) please refer to the manual Introduction to SPSS 16.0.

The manual is to be considered a collection of examples. The idea is to introduce the reader to the statistical methods by means of fairly simple examples based on a specific problem. The manual suggests a solution and provides a short interpretation. However, it is important to emphasize that it is no more than a suggestion, not a final solution. The data sets referred to can be found at the following server location (accessible to all students at The Aarhus School of Business): \\okf-filesrv1\exemp\Spss\Manual.

The manual provides a short description of each statistical analysis, including its main purpose, and exemplifies a relevant situation in which to use the treated test. However, no theoretical basis is provided, which means that the reader is expected to find the specific theoretical literature him/herself. Inspiration can be found in the literature list at the back of this manual and at the beginning of each chapter.

The manual has been written for SPSS version 16.0. If other versions are used, differences may occur. Suggestions for improvements or corrections can be handed in at the IT instructors' office in room H14 or sent by e-mail to the following address: analyse@asb.dk

IT Department, The Aarhus School of Business, Autumn 2007.


2. General facts about SPSS


SPSS consists of two parts: a Data Editor and an Output Window. In the Data Editor, manipulation of data is carried out and commands are executed. In the Output Window all results, tables and figures are displayed; it also operates as a log window.

2.1 The contents of the Data Editor


The menu bar is located at the top of the Data Editor, and each of its items is discussed below.

2.1.1 The File menu

This item is used for the administration of data, i.e. loading, saving, importing and exporting data and printing. The options are roughly similar to those of other Windows based programs.

2.1.2 The Edit menu

Edit is used for editing the contents of the current window. Here, the cut, copy and paste functions are found. In addition, the options function makes it possible to change the fonts used in the outputs, the decimal separator, etc.

2.1.3 The View menu

In this menu it is possible to switch the status bar, gridlines, etc. on and off. The font and font size used in the Data Editor are defined here as well.


2.1.4 The Data menu

This is where various manipulations of data are made possible. Examples include the definition of new variables (Define Variable) and the sorting of variables (Sort Cases), etc.

2.1.5 The Transform menu

The Transform menu gives the user the opportunity to recode variables, generate random numbers, rank data sets (construct ordinal data), define missing values, etc.

2.1.6 The Analyze menu

In this menu the user chooses which statistical analysis to use. In the table below the available analyses are listed together with a short description.

Method of Analysis        Description
Reports                   Case- and report summaries
Descriptive Statistics    Descriptive statistics, frequencies, P-P and Q-Q plots, etc.
Custom Tables             Construction of various tables
Compare Means             Comparison of various mean values, e.g. by the use of a T-test or ANOVA
General Linear Model      Estimation by means of GLM, MANOVA
Generalized Linear Model  Expands the possibilities of regression and the General Linear Model, specifically for models on data that do not satisfy the condition of normally distributed errors and models containing links between independent variables
Mixed Models              Flexible modeling of data with correlations and non-constant variability
Correlate                 Various association measures for the variables of the data set; among other things it is possible to calculate covariance, Pearson's correlation coefficient, etc.
Regression                Regression; both linear, logistic and curve estimation
Loglinear                 General loglinear analysis and logit loglinear analysis
Classify                  Cluster and discriminant analysis
Data Reduction            Factor analysis and correspondence analysis
Scale                     Item analysis and multidimensional scaling
Nonparametric Tests       Chi-square, binomial, hypothesis tests and tests of independence
Time Series               Autoregression and ARIMA
Survival                  Kaplan-Meier, Cox regression, etc.
Multiple Response         Frequency tables and crosstabs for multiple response data sets
Missing Value Analysis    Description of patterns concerning missing values

2.1.7 The Graphs menu

In the Graphs menu there are several possibilities for getting a graphical overview of the data set. The options include the construction of histograms, line graphs, pie charts and box plots, as well as Pareto, P-P and Q-Q diagrams.

2.1.8 The Utilities menu

Here it is possible to obtain information about the different variables, e.g. type, length, etc. If only some of the variables are needed, the user has the opportunity to construct a new data set based on the existing variables. This is done by clicking Define Sets and choosing the relevant variables, followed by choosing Use Sets from the menu, where it is possible to use the new data sets.

2.1.9 The Help Menu

In the Help menu it is possible to seek help on how specific calculations and data manipulations are handled in SPSS. The most important submenu is Topics, where it is possible to carry out a search on keywords using the index bar. You can also press the F1 key anywhere in SPSS to get help about the specific menu you are working in. If, for example, you want to run a regression and need some help in the Statistics menu, you can simply press F1, and information about the different possibilities within the Statistics menu will be displayed.

2.2 Contents of the Output Window


As mentioned, the Output Window shows results, tables and figures and works as a log window. In the Window menu the user can switch between the Data Editor and the Output Window.


On the left side of the Output Window a tree structure is displayed, which provides a good overview of results, figures and tables. A specific output can be viewed by clicking on it in the tree structure, whereupon it will appear on the right side of the screen. If an error occurs, a log item will pop up. By clicking this item a window will appear, in which a possible explanation of the problem is provided.


3. Cluster analysis
The following analysis and interpretation of cluster analysis is based on the following literature:

Videregående data-analyse med SPSS og AMOS, Niels Blunch, 1. udgave 2000, Systime. Chp. 1, p. 3-29.

3.1 Introduction
Cluster analysis is a multivariate procedure for detecting groupings in data where no grouping is evident. Cluster analysis is often applied as an explorative technique. The purpose of a cluster analysis is to divide the units of the analysis into smaller clusters, so that the observations within a cluster are homogeneous while the observations in other clusters are in one way or another different from these.

In contrast to discriminant analysis, the groups in cluster analysis are unknown from the outset. This means that the groups are not based on pre-defined characteristics. Instead, the groups are created on the basis of the characteristics of the data material.

It is important to note that cluster analysis should be used with extreme caution, since SPSS will always find a structure, whether one is present or not. The choice of method should therefore always be carefully considered, and the output should be critically examined before the result of the analysis is employed.

Cluster analysis in SPSS is divided into hierarchical and K-means clustering (non-hierarchical analysis). Examples of both types of analyses are found below.

3.2 Hierarchical analysis of clusters


In the hierarchical method, clustering begins by finding the closest pair of objects (cases or variables) according to a distance measure and combining them to form a cluster. The algorithm starts with each case (or variable) in a separate cluster and combines clusters one step at a time until all cases are placed in a single cluster. The method is called hierarchical because it does not allow single objects to change cluster once they have been joined; any change requires the entire cluster to be moved. As indicated, the method can be used for both cases and variables.

3.2.1 Example

Data set: \\okf-filesrv1\exemp\Spss\Manual\Hierakiskdata.sav


Data on a number of economic variables have been registered for 15 randomly chosen countries. By means of hierarchical cluster analysis it is now possible to find out which countries are similar with regard to the variables in question. In other words, the task is to divide the countries into relevant clusters. The chosen countries are: Argentina, Austria, Bangladesh, Bolivia, Brazil, Chile, Denmark, The Dominican Republic, India, Indonesia, Italy, Japan, Norway, Paraguay and Switzerland.

The following variables are included in the analysis: urban (number of people living in cities), lifeexpf (average life span for women), literacy (number of people who can read), pop_incr (increase in population in % per year), babymort (mortality for newborns per 1000 newborns), calories (daily calorie intake), birth_rt (birth rate per capita), death_rt (death rate per 1000 inhabitants), log_gdp (the logarithm of GDP per inhabitant), b_to_d (birth rate in proportion to mortality), fertilty (average number of babies), log_pop (the logarithm of the number of inhabitants).

Both the hierarchical and the non-hierarchical method will be described in the following paragraphs. It should, however, be noted that for both methods the cluster structure is built on cases. As will be mentioned later, it is possible to conduct the hierarchical analysis based on variables, but this is beyond the scope of the manual.


3.2.2 Implementation of the analysis

Choose Analyze → Classify → Hierarchical Cluster from the menu bar:

Hereby the following dialogue box will appear:


The relevant variables are chosen and added to the Variable(s) window (see below). Under Cluster it is chosen whether the analysis concerns observations or variables, i.e. whether the user wants to group observations or variables into clusters. Since this example concerns observations, Cases is chosen. When the hierarchical analysis is based on cases, as in the example above, it is very important that the labelling variable is defined as a string (under Type in Variable View). Under Display it is possible to include plots and/or statistics in the output. The Statistics and Method dialogue boxes must be completed before the analysis can be carried out; once they have been, it is also possible to click the Save button. The above-mentioned buttons are described below:
3.2.2.1 Statistics

Choosing Agglomeration schedule helps identify which clusters are combined in each iterative step. Proximity matrix makes the distances or similarities between the units (cases or variables) appear in the output. Cluster membership makes each observation's or variable's cluster membership appear in the output.

3.2.2.2 Plots

A dendrogram or an icicle plot can be chosen here. In this example as well as in the following, a dendrogram will be chosen, since the interpretation of the cluster analysis is often based on this.


3.2.2.3 Method

In this dialogue box the methods on which the cluster analysis is based are indicated. In Cluster method a number of methods can be chosen. The most important are described below:

Nearest neighbour (single linkage): The distance between two clusters is calculated as the distance between the two closest elements (one from each cluster). Furthest neighbour (complete linkage): The distance between two clusters is calculated as the distance between the two elements (one from each cluster) that are furthest apart.


Between-groups linkage (average linkage): The distance between two clusters is calculated as the average of all distance combinations (which is given by all distances between elements from the two clusters).

Ward's method (Ward's algorithm): For each cluster a stepwise minimization of the sum of the squared distances to the centre (the centroid) within each cluster is carried out.

The result of the cluster analysis depends heavily on the chosen method. As a consequence, one should try different methods before completing the analysis. In this example Between-groups linkage has been chosen.

Measure provides the user with a choice of different distance measures depending on the type of data in question; the type of data should be stated (interval data, frequency table or binary variables). In this example Interval has been chosen. After this, the distance measure must be specified in the Interval drop-down box. A thorough discussion of these measures is beyond the scope of this manual, although several are relevant. It should be noted, however, that Pearson's correlation is often used in connection with variables, while the squared Euclidean distance is often used in connection with cases. The latter is the default in SPSS, and it is chosen in the example as well, since observations are considered here. When using Pearson's correlation one should be aware that larger measures mean smaller distances (the measure ranges from -1 to 1).

Transform Values is used for standardizing data before the calculation of the distance matrix, which is often relevant for the cluster analysis. This is due to the fact that variables with high values contribute more than variables with low values to the calculation of distances. Consequently, this matter should be carefully considered. In this case, the values are transformed to a scale ranging from 0 to 1. Transform Measures is used for transforming data after the calculation of the distance matrix, which is not relevant in this case.
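For reference, the squared Euclidean distance between two cases $x$ and $y$ measured on $p$ (possibly transformed) variables is simply the sum of squared coordinate differences:

    $d(x, y) = \sum_{i=1}^{p} (x_i - y_i)^2$

With the 0-1 range transformation chosen above, each variable contributes at most 1 to this sum, which is why small entries in the output below, such as the Denmark-Norway distance of 0.154, indicate very similar countries.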

3.2.2.4 Save

This dialogue box gives the user the opportunity to save cluster memberships for observations or variables as new variables in the data set. This option is ignored in the example.
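For readers who prefer syntax, the dialogue choices above correspond roughly to the commands below. This is a minimal sketch: the exact syntax pasted by the dialogue may differ, in particular with respect to how the 0-1 range transformation is expressed, and country is assumed to be the name of the string labelling variable.

* Hierarchical clustering of cases: between-groups linkage, squared Euclidean distance.
CLUSTER urban lifeexpf literacy pop_incr babymort calories
    birth_rt death_rt log_gdp b_to_d fertilty log_pop
  /METHOD BAVERAGE
  /MEASURE SEUCLID
  /ID country
  /PRINT SCHEDULE DISTANCE
  /PLOT DENDROGRAM.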


3.2.3 Output

After the specification of the options the analysis is carried out. This results in a number of outputs, three of which are commented on and interpreted below. (1) Proximity Matrix shows the distances that have been calculated in the analysis, in this case the calculated distances between the countries. For example, the distance between Norway and Denmark is 0.154 when using the squared Euclidean distance. The cluster analysis itself is carried out on the basis of this distance matrix. As can be seen, the matrix is symmetric:

Proximity Matrix (Squared Euclidean Distance)

Case              1      2      3      4      5      6      7      8      9     10     11     12     13     14     15
 1 Argentina    .000  1.008  4.883  2.233   .442   .423  1.024   .936  3.164  1.353   .877   .836   .694  2.286   .840
 2 Austria     1.008   .000  7.671  4.809  1.885  1.895   .184  2.667  5.653  2.746   .191   .742   .135  4.659   .128
 3 Bangladesh  4.883  7.671   .000  1.398  3.032  5.109  8.498  3.024   .484  1.727  7.886  7.867  7.655  4.089  7.830
 4 Bolivia     2.233  4.809  1.398   .000  1.590  1.883  5.276   .788  1.367  1.261  5.157  4.817  4.318  1.370  4.600
 5 Brazil       .442  1.885  3.032  1.590   .000   .921  2.106   .787  1.665   .547  1.611  1.452  1.717  2.572  1.837
 6 Chile        .423  1.895  5.109  1.883   .921   .000  2.081   .446  3.543  1.721  1.771  1.217  1.341  1.309  1.441
 7 Denmark     1.024   .184  8.498  5.276  2.106  2.081   .000  3.125  6.437  3.488   .355   .994   .154  5.243   .342
 8 Dominican R. .936  2.667  3.024   .788   .787   .446  3.125   .000  2.160   .911  2.772  2.286  2.231   .920  2.311
 9 India       3.164  5.653   .484  1.367  1.665  3.543  6.437  2.160   .000   .739  5.479  5.270  5.684  3.344  5.763
10 Indonesia   1.353  2.746  1.727  1.261   .547  1.721  3.488   .911   .739   .000  2.657  2.639  2.872  2.320  2.759
11 Italy        .877   .191  7.886  5.157  1.611  1.771   .355  2.772  5.479  2.657   .000   .312   .311  4.883   .236
12 Japan        .836   .742  7.867  4.817  1.452  1.217   .994  2.286  5.270  2.639   .312   .000   .634  4.245   .561
13 Norway       .694   .135  7.655  4.318  1.717  1.341   .154  2.231  5.684  2.872   .311   .634   .000  4.013   .112
14 Paraguay    2.286  4.659  4.089  1.370  2.572  1.309  5.243   .920  3.344  2.320  4.883  4.245  4.013   .000  3.952
15 Switzerland  .840   .128  7.830  4.600  1.837  1.441   .342  2.311  5.763  2.759   .236   .561   .112  3.952   .000

This is a dissimilarity matrix.

The next output is the (2) Agglomeration Schedule, which shows when the individual observations or clusters are combined. In the example, observation 13 (Norway) is first combined with observation 15 (Switzerland), after which observation 2 (Austria) is tied to the very same cluster. This procedure continues until all observations are combined. The Coefficients column shows the distance between the observations/clusters that are combined.


Agglomeration Schedule

Stage   Cluster Combined      Coefficients   Stage Cluster First Appears   Next Stage
        Cluster 1  Cluster 2                 Cluster 1  Cluster 2
  1        13         15          .112           0          0                  2
  2         2         13          .132           0          1                  3
  3         2          7          .227           2          0                  4
  4         2         11          .273           3          0                  8
  5         1          6          .423           0          0                  9
  6         3          9          .484           0          0                 14
  7         5         10          .547           0          0                 10
  8         2         12          .649           4          0                 13
  9         1          8          .691           5          0                 10
 10         1          5         1.023           9          7                 12
 11         4         14         1.370           0          0                 12
 12         1          4         1.716          10         11                 13
 13         1          2         2.718          12          8                 14
 14         1          3         4.651          13          6                  0

The (3) Dendrogram shown below is a simple way of presenting the cluster analysis:

The dendrogram indicates that there are three clusters in the data set. These are:


- Norway, Denmark, Austria, Switzerland, Italy and Japan (i.e. the European countries and Japan).
- Brazil, Argentina, Chile, The Dominican Republic, Bolivia, Paraguay and Indonesia (i.e. the South American countries and Indonesia).
- India and Bangladesh (i.e. the Asian countries).

The cluster structure is found by making a cut through the dendrogram where the distance between fusions is relatively large. In the example above the cut is made between distances 10 and 15. This structure is very much in accordance with the prior expectations one might have regarding the economic variables in question.

3.3 K-means cluster analysis (Non-hierarchical cluster analysis)


Contrary to the hierarchical cluster analysis, the non-hierarchical cluster analysis does allow observations to change from one cluster to another while the analysis is proceeding. The method is based on an iterative procedure in which every single observation is assigned to one of a fixed number of clusters until the ratio between the variance between the clusters and the variance within the clusters is maximized. The procedure itself will not be explained further here. It is noted, however, that the user has the opportunity to choose the number of clusters as well as so-called cluster centres, which will often be an advantage. The non-hierarchical cluster analysis is often used on large data sets, since the method is capable of treating more observations than the hierarchical method. Apart from this, the method is only used for analyses of observations, not of variables.

3.3.1 Example

Data set: \\okf-filesrv1\exemp\Spss\Manual\K-meansdata.sav

The analysis is based on the same example as the hierarchical cluster analysis in section 3.2.1.

3.3.2 Implementation of the analysis

Before the cluster analysis is carried out it is necessary to standardize the variables from the previous example. As mentioned earlier, this is done to prevent variables with large values from carrying too much weight in the analysis. When using the K-means method it is not possible to make transformations during the analysis itself, as was possible in the hierarchical method. Therefore, the standardization must be done prior to the analysis.


Choose Analyze → Descriptive Statistics → Descriptives from the menu bar. This results in a dialogue box in which the relevant variables are moved to the Variable(s) list and Save standardized values as variables is ticked (this creates new Z-score variables):
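In syntax form this standardization step is a single command; a sketch assuming the variable names from section 3.2.1:

* Save z-standardized versions of the variables (Zurban, Zlifeexpf, ...).
DESCRIPTIVES VARIABLES=urban lifeexpf literacy pop_incr babymort calories
    birth_rt death_rt log_gdp b_to_d fertilty log_pop
  /SAVE.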

After this it is time to carry out the analysis on the basis of the new standardized variables. Choose Analyze → Classify → K-Means Cluster from the menu bar:


This results in the following dialogue box:

Here all the standardized variables are moved to the Variables window. Since the names of the countries are meant to be included in the output, the variable country is moved to the Label Cases by window. As opposed to the hierarchical cluster analysis, it is not a requirement for the non-hierarchical cluster analysis that the labelling variable (in this case country) is defined as a string. Method is set to Iterate and classify by default, since the classification is to be performed on the basis of iterations. In that connection it must be decided how many clusters are wanted as the result of the iteration process. This is chosen in the Number of Clusters field. The number of clusters depends on the data set as well as on preliminary considerations. From the result of the hierarchical analysis there is reason to believe that the number of clusters is three, so this is the number that has been entered. If one has detailed knowledge of the number and characteristics of the expected clusters, it is possible to define cluster centres under Centers.


In addition to the options described above, there are a number of ways in which the user can influence the iteration procedure and the output. These are dealt with below.

3.3.2.1 Iterate

Clicking the Iterate button results in the following dialogue box with three changeable settings:

Maximum Iterations indicates the maximum number of iterations of the algorithm and ranges between 1 and 999. The default is 10, and this is used in this example as well. Convergence Criterion defines the limit at which the iteration is ended. The value must be between 0 and 1. In this example a value of 0.001 is used, which means that the iteration ends when none of the cluster centres move by more than 0.1 percent of the shortest distance between two cluster centres. If Use running means is selected, the cluster means are recalculated each time an observation has been assigned to a cluster. By default this only happens when all observations have been assigned to the clusters.

3.3.2.2 Save

Under Save it is possible to save the cluster number in a new variable in the data set. In addition it is possible to create a new variable that indicates the distance between the specific observation and the mean value of the cluster. See the dialogue box below:

3.3.2.3 Options

Clicking Options gives the user the following possibilities:


Initial cluster centers displays the initial estimates of the cluster mean values (before any iterations). ANOVA table carries out an F-test for the cluster variables. However, since the observations are assigned to clusters on the basis of the distances between the clusters, it is not advisable to use this test for the null hypothesis that there is no difference between the clusters. Cluster information for each case ensures that a cluster number is assigned to each observation, i.e. information about the cluster membership; furthermore, it produces information about the distance between the observation and the mean value of its cluster.

Choosing Exclude cases listwise in the Missing Values section means that a respondent will be excluded from the calculation of the clusters if there is a missing value for any of the variables. Exclude cases pairwise will classify a respondent into the closest cluster on the basis of the remaining variables. Therefore, a respondent will always be classified unless there are missing values for all variables.
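In syntax form, the K-means analysis specified above corresponds roughly to the sketch below. The Z-prefixed variable names are those assumed to be produced by Descriptives, and the exact pasted syntax may differ slightly.

* K-means with 3 clusters, at most 10 iterations, convergence criterion 0.001.
QUICK CLUSTER Zurban Zlifeexpf Zliteracy Zpop_incr Zbabymort Zcalories
    Zbirth_rt Zdeath_rt Zlog_gdp Zb_to_d Zfertilty Zlog_pop
  /CRITERIA=CLUSTER(3) MXITER(10) CONVERGE(0.001)
  /METHOD=KMEANS(NOUPDATE)
  /PRINT=INITIAL CLUSTER DISTANCE
  /SAVE=CLUSTER DISTANCE.

Here KMEANS(NOUPDATE) corresponds to leaving Use running means unticked, and /SAVE illustrates the Save options from section 3.3.2.2.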

3.3.3 Output

The analysis results in a number of outputs, of which the most relevant are interpreted and commented on below.


The output (1) Cluster Membership shows, as the name indicates, each country's cluster membership. From the output it is obvious that, except for a few changes, the cluster structure is identical to that of the hierarchical method. Distance is a measure of how far the individual country is located from the centre of its cluster. The output shows that Brazil is the country located furthest away from its cluster centre, with a distance of 3.075. The next output to be commented on is (2) Final Cluster Centers. This output shows the mean value of the standardized variables for each of the three clusters. It is therefore possible to compare the clusters on the basis of each variable.


The last output, which will be commented on is (3) Distances between Final Cluster Centers.
Distances between Final Cluster Centers

Cluster       1        2        3
1                   4.301    4.321
2          4.301             5.818
3          4.321    5.818

This output indicates the pairwise distances between the cluster centres. In this example clusters 1 and 2 are most alike. A closer look at output (2) is required in order to explore the differences and similarities in depth.


4. Correspondence Analysis
The following explanation and interpretation of the correspondence analysis is based on the following literature:

SPSS Categories 10.0, Jacqueline J. Meulman & Willem J. Heiser, SPSS Inc. 1999. Chp. 1, p. 10-11; Chp. 5, p. 45-46; Chp. 11, p. 147.

For further reading see: Theory and Applications of Correspondence Analysis, Michael J. Greenacre, Academic Press Inc., 1984.

4.1 Introduction
The purpose of correspondence analysis is to make a graphical illustration of data in cross tables, since the relation between rows and columns in a cross table can be hard to comprehend. Often, a cross table does not create a well-arranged picture of the characteristics of a data set, i.e. the relation between two variables. This is especially the case when nominally scaled data are considered (data without any natural order or rank).

Factor analysis is normally perceived as the standard technique used to describe the connection between variables. However, an important requirement for factor analysis is that the data are interval scaled. Correspondence analysis makes it possible to analyze the relationship between nominally scaled variables graphically on a multidimensional level (see Appendix A). Row and column scores are calculated, and plots are produced based on these. Categories that are alike will be placed close to each other in a plot. This makes the analysis a useful technique for market segmentation.

Please note that it is also possible to make a multiple correspondence analysis (homogeneity analysis) with more than two variables. See the last section of this chapter.

4.2 Example

Data set: \\okf-filesrv1\exemp\Spss\Manual\Korrespondancedata.sav

A company wants to analyze which brands are preferred by different age groups.¹

¹ The example is based on Blunch, Niels, Analyse af markedsdata, 2. rev. udgave 2000, Systime, pp. 70-78.


Brands: A, B, C and D
Age groups: 20-29 years, 30-39 years, 40-49 years and >50 years

210 randomly chosen respondents distributed across the age groups have been asked about their preference for the four brands. Firstly, it must be clarified whether the objective is to study the differences between the age groups with regard to their choice of brand, to map out the brands' composition in relation to age, or a combination of both. In the latter case, which is shown here, it is necessary to run the analysis twice. The first step is to carry out an analysis based on row profiles, or Row principal as it is called in SPSS. After this the analysis is carried out based on column profiles, or Column principal as SPSS calls it.

4.3 Implementation of the analysis


The starting point of the correspondence analysis is shown below (Analyze → Data Reduction → Correspondence Analysis):


4.3.1 Specification of the variables

A dialogue box now appears, and it should be filled out as follows:

The age variable is moved to the Row window and the brand variable is moved to the Column window. The question marks behind the variables remain until the Range, i.e. the minimum and maximum category values, has been defined. This must be done for each variable by clicking the Define Range button, which results in the dialogue box shown below:

For both variables in the data set the Range is from 1 to 4. The output from the correspondence analysis has yet to be specified. This is done via the settings in Model, Statistics and Plots. Each of these will be described separately below.


4.3.2 Model

By clicking the Model button the user can define which methods should be used for estimating the model. First, the desired number of dimensions for the solution is defined; a two-dimensional solution is desired in this example. Distance Measure determines the method used for the estimation of the distance matrix. For a standard correspondence analysis such as this, the Chi square distance measure is chosen. This automatically activates Row and column means are removed as the Standardization Method.

The last setting defines the method by which normalization takes place. The chosen method is important for the later interpretation of the analysis, which is discussed below (see section 4.4 Output). As mentioned above, the analysis must be carried out twice to obtain a graphical picture of both the row and the column profiles and to provide an opportunity for comparing the two. First, the analysis is carried out on the basis of the row profiles, because the first priority is to examine the difference between the age groups with regard to brand choice. Therefore, Row principal is chosen.


4.3.3 Statistics

Clicking the Statistics button makes it possible to calculate various statistical measures. Correspondence table displays a cross tabulation of the data. This table provides a very clear picture of the distribution in the individual groups. The Row profiles and Column profiles are useful as well, so they are also selected. Under Confidence Statistics it is possible to calculate standard deviations and correlations for either the row or the column points. Which additional measures to choose depends on the individual analysis and its purpose.

4.3.4 Plots

The Plots menu is divided into two parts. Scatterplots makes it possible to plot the individual variables (Row and Column points) against the individual dimensions (in this case age and brand), as well as to plot both variables against these dimensions (Biplot).


ID label width for scatterplots is used to define the maximum number of characters to be used in the description of an individual point in the plot. When there are many points in the plot, it is often a good idea to reduce this number in order to avoid clutter in the scatterplot. In this case a reduction is not necessary. The second part of the Plots menu, Lineplots, makes it possible to plot the two variables individually against each of the dimensions. Transformed row categories produces a plot of the row categories (i.e. age) against the matching row scores. A similar plot with the column categories (i.e. brand) and the column scores is produced when Transformed column categories is chosen. Since Row principal was chosen above, one part of the scatterplot will be identical to the plot of the row profiles. Finally, it is possible to restrict the number of dimensions in the Plot Dimensions submenu. The analysis is carried out by clicking Continue followed by OK; a syntax sketch is given below.
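The choices made in sections 4.3.1-4.3.4 can also be expressed in syntax. The sketch below follows the subcommand names in the SPSS Categories documentation; the exact syntax pasted by the dialogues may differ slightly.

* Correspondence analysis of age by brand: 2 dimensions, row principal normalization.
CORRESPONDENCE TABLE=age(1 4) BY brand(1 4)
  /DIMENSIONS=2
  /MEASURE=CHISQ
  /STANDARDIZE=RCMEAN
  /NORMALIZATION=RPRINCIPAL
  /PRINT=TABLE RPROF CPROF RPOINTS CPOINTS
  /PLOT=BIPLOT(20) RPOINTS(20) CPOINTS(20).

For the second run, /NORMALIZATION=CPRINCIPAL produces the column principal solution.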

4.4 Output
The output from the correspondence analysis is discussed below. Some parts of the output are left out, as they have not been considered relevant. The first output is (1) Correspondence table, which is a cross tabulation of the data set.

Correspondence Table

age              brand A   brand B   brand C   brand D   Active Margin
20-29 years           30        18         6         3              57
30-39 years           10        30        10         5              55
40-49 years            3        10        27         8              48
>50 years              2         2        40         6              50
Active Margin         45        60        83        22             210

The table indicates that 210 respondents are included in the analysis. Furthermore, it shows that there are 55 respondents in the age group 30-39 years. Ten of these prefer brand A, 30 prefer brand B, ten prefer brand C, and five people prefer brand D. It can be concluded that this age group is in favor of brand B in contrast to the age group >50 years, where only two prefer brand B. For all age groups in total there is a preference for brand C with 83 of the 210 respondents in favor of this brand.


The next output, (2) Row Profiles, is a cross tabulation of the row profiles:

Row Profiles

age              brand A   brand B   brand C   brand D   Active Margin
20-29 years         .526      .316      .105      .053           1.000
30-39 years         .182      .545      .182      .091           1.000
40-49 years         .063      .208      .563      .167           1.000
>50 years           .040      .040      .800      .120           1.000
Mass                .214      .286      .395      .105

The horizontal percentages display the difference between the individual age groups with regard to their brand choice. For example, 18.2 % in the age group 30-39 years prefer brand A, 54.5 % prefer B, 18.2 % prefer C, and 9.1 % prefer D. As mentioned above, the largest preference across all 210 respondents is for brand C; this preference represents 39.5 %. A cross tabulation of the column profiles is found in (3) Column Profiles:

Column Profiles

age              brand A   brand B   brand C   brand D   Mass
20-29 years         .667      .300      .072      .136    .271
30-39 years         .222      .500      .120      .227    .262
40-49 years         .067      .167      .325      .364    .229
>50 years           .044      .033      .482      .273    .238
Active Margin      1.000     1.000     1.000     1.000

The vertical percentages display the age composition of each individual brand. For example, of the 45 respondents preferring brand A, 66.7 % belong to the age group 20-29 years, 22.2 % belong to the age group 30-39 years, 6.7 % belong to the age group 40-49 years, and only 4.4 % belong to the age group >50 years. In order to determine whether the differences described above could be pure coincidence, a χ²-test is carried out. It tests a homogeneity hypothesis, i.e. a test of whether the individual age groups can be assumed to be alike. The result is found in (4) Summary:

Summary

                                                        Proportion of Inertia      Confidence Singular Value
Dimension   Singular Value   Inertia   Chi Square Sig.  Accounted for  Cumulative  Std. Deviation  Correlation 2
1                 .648         .420                          .808          .808          .048           .153
2                 .309         .095                          .184          .991          .075
3                 .068         .005                          .009         1.000
Total                          .520      109.191  .000(a)   1.000         1.000

a. 9 degrees of freedom
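As a reading aid (this is not itself part of the SPSS output): the test statistic is the usual Pearson chi-square for the 4 x 4 table, and the total inertia is that statistic divided by the sample size:

    $\chi^2 = \sum_{i=1}^{4} \sum_{j=1}^{4} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} = 109.191, \quad df = (4-1)(4-1) = 9, \quad \text{total inertia} = \frac{\chi^2}{n} = \frac{109.191}{210} \approx 0.520$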

It appears from the output that the hypothesis of equal market profiles for the four age groups must be rejected. This conclusion is based on a χ²-value of 109.19 with a corresponding p-value of 0.000. Therefore, the profiles can be expected to be different, and in the plot below they will be separated from each other. Had the p-value been insignificant, the dispersion would have been small, and it would not have been possible to conclude dissimilarity in the preferences. (4) Summary also shows how much of the dispersion (inertia) is retained in a plot of only two dimensions. The first dimension explains 80.8 % of the inertia, and the second dimension explains 18.4 %. This leaves only 0.9 % for the third, excluded, dimension. Below, the interesting Biplot (5) Row and Column Points/Row Principal Normalization is shown:


The plot shows the row and the column profiles together. However, it is only possible to compare the row points (age) with one another, since Row principal was chosen as the normalization method. The smaller the distance between two age groups, the more similar they are with regard to brand preferences, and vice versa. In this case the age groups 40-49 years and >50 years are most alike. However, it is not possible to conclude anything on the basis of the locations of the brands. As mentioned above, the analysis must be run twice in order to compare the individual column points (brand). In the second analysis Column principal should be chosen as the normalization method. The procedure was shown in section 4.3.2 in the dialogue box Correspondence Analysis: Model:

Part of the dialogue box Correspondence Analysis: Model (see section 4.3.2). This results in a new Biplot (6) Row and Column Points/Column Principal Normalization:


The plot shows the row and the column profiles together. Again, it is only possible to compare the column points (brand) with one another, since Column principal has now been chosen as the normalization method. From the plot it appears that brands C and D are most alike with regard to the age groups to which they appeal. Unfortunately, it is not possible to create "the French plot", a combination of the two plots (5) and (6), directly in SPSS. However, it can be constructed from the output instead. From the analysis where Row principal was used, the coordinates in both the first and the second dimension for the row variable (age) can be read. This is shown in (7) Overview Row Points below.
Overview Row Points(a)

                      Score in Dimension            Contribution of Point     Contribution of Dimension
age           Mass        1        2      Inertia   to Inertia of Dimension   to Inertia of Point
                                                        1        2               1       2      Total
20-29 years    .271    -.756     .354      .189      .369     .356             .820    .180    1.000
30-39 years    .262    -.399    -.445      .094      .099     .542             .443    .552     .995
40-49 years    .229     .462    -.097      .054      .116     .022             .906    .040     .946
>50 years      .238     .856     .179      .183      .416     .080             .952    .041     .993
Active Total  1.000                        .520     1.000    1.000

a. Row Principal normalization

After this, the coordinates in both dimensions for the column variable (brand) can be read from the analysis where Column principal was used. This is shown in (8) Overview Column Points below.
Overview Column Points(a)

                      Score in Dimension            Contribution of Point     Contribution of Dimension
brand         Mass        1        2      Inertia   to Inertia of Dimension   to Inertia of Point
                                                        1        2               1       2      Total
A              .214    -.808     .448      .183      .333     .451             .764    .236    1.000
B              .286    -.494    -.409      .118      .166     .500             .593    .405     .998
C              .395     .710     .086      .203      .475     .031             .983    .014     .998
D              .105     .321    -.127      .016      .026     .018             .657    .103     .761
Active Total  1.000                        .520     1.000    1.000

a. Column Principal normalization

When the coordinates for each point (in this case eight points in total, i.e. four age groups and four brands) have been found, they can be plotted in a coordinate system (Excel can easily be used). This procedure, as well as the interpretation of the French plot, is left to the reader.²

² Blunch, Niels J.: Analyse af markedsdata, p. 70.


4.5 HOMALS (Multiple correspondence analysis)


As mentioned in the beginning, it is also possible to make a multiple correspondence analysis in SPSS, which will be briefly presented here. HOMALS is an abbreviation of homogeneity analysis by means of alternating least squares. The purpose of HOMALS is to find a number of homogeneous groups. The analysis attempts to produce a solution where objects within the same category are close to each other, while objects in different categories are plotted far from each other.

An important difference between multiple correspondence analysis and ordinary correspondence analysis is that the input for the multiple analysis is a data matrix in which the rows represent objects and the columns represent variables. In ordinary correspondence analysis, on the other hand, it is possible for both rows and columns to represent variables. Multiple correspondence analysis also makes an analysis with more than two variables possible, which ordinary correspondence analysis does not.

The purpose of multiple correspondence analysis is to describe the relationship between two or more nominally scaled variables on a multidimensional level, containing the categories of the variables as well as the objects in these categories. It is a qualitative factor analysis in which nominally scaled variables are measured. As an example, multiple correspondence analysis can be used to graphically illustrate the relation between employment, income and gender. It is possible that the analysis will reveal that income and gender discriminate between the respondents, but that there is no difference with regard to employment. The implementation of the analysis will not be shown in this manual.


5. Multiple analysis of variance


The following explanation and interpretation of the multiple analysis of variance is based on the following literature:

Videregående data-analyse med SPSS og AMOS, Niels Blunch, 1. edition 2000, Systime. Chp. 3, p. 51-67.
Analyse af markedsdata, Niels Blunch, 2. rev. edition 2000, Systime. Chp. 6, p. 213-220.
Multivariate data analysis, Hair, Anderson et al., 5. edition, Prentice Hall. Chp. 6, p. 326-387.

5.1 Introduction
Multiple analysis of variance (MANOVA) is an extension of the traditional analysis of variance (ANOVA) and is used in situations where there is more than one dependent variable, each depending on one or more explanatory variables (see Appendix A). Like ANOVA, the method is used to measure or prove differences between groups or experimental treatments. The difference between the two methods is that ANOVA identifies group differences based on one single dependent variable. MANOVA, in contrast, identifies differences simultaneously across several dependent variables. In other words, MANOVA clarifies differences between groups while at the same time considering the correlations between the dependent variables.

Notice that it is not sufficient to carry out an ANOVA test for each of the dependent variables. There will be situations where MANOVA is significant despite the fact that all ANOVA tests are insignificant. Likewise, the opposite can occur, where an ANOVA test is significant and the MANOVA is insignificant.

A short note on the connection between MANOVA and discriminant analysis: provided that MANOVA turns out to be significant, it will be possible to find a linear function of the dependent variables that separates the groups, whereby at least one of the discriminant functions is significant. In this sense, MANOVA and discriminant analysis are each other's reverse. In MANOVA the groups are known in advance, and the objective is to find their relationship with the values of the variables. In discriminant analysis the objective is to find the groupings based on the values of the variables. Obviously, discriminant analysis, treated in chapter 6, should only be employed when the MANOVA test is significant.

5.2 Example
Data set: \\okf-filesrv1\exemp\Spss\Manual\Manovadata.sav

The objective is to test whether taking two pills at the same time is more effective than taking only one of them. 20 patients with the disease in question are randomly assigned to four groups. In group 1 the patients get a placebo, in group 2 they get both pills, in group 3 the patients get one of the pills, and in group 4 they get the other pill. The effect is measured on two variables: Y1 and Y2. The data are shown in table 1 below:

Table 1

                    Treatments (Groups)
            1            2            3            4
          Y1   Y2      Y1   Y2      Y1   Y2      Y1   Y2
           1    2       8    9       2    4       4    5
           2    1       9    8       3    2       3    3
           3    2       7    9       3    3       3    4
           2    3       8    9       3    5       5    6
           2    2       8   10       4    6       5    7
Means      2    2       8    9       3    4       4    5

The null hypothesis for this multivariate analysis of variance is that the treatment has no influence on either Y1 or Y2. The alternative hypothesis is that the treatment can explain the effects on Y1 and Y2.

5.3 Implementation of the analysis


The MANOVA is started from the menu Analyze → General Linear Model → Multivariate:

When choosing GLM Multivariate a dialogue box with various options appears as shown below. First the dependent variables must be chosen. In this case they are Y1 and Y2, so they are marked and moved to the Dependent Variables window. All dependent variables must be interval scaled. After this the group variable(s) that determine the groups to be compared on the basis of the dependent variables must be defined in the Fixed Factor(s) window. In this example the group variable (treatment) is moved to the window so that the four groups are compared with regard to Y1 and Y2.

In the Covariate(s) window covariates can be added. The linear effect of the covariates on the dependent variables will in that case be removed separately from the factor level effects. In this example such variables are assumed to be non-existent. In the WLS Weight window a variable can be added in order to assign different weights to the observations. However, this is not relevant in this example. The output and the estimation of the MANOVA have yet to be specified. This is done by describing the Model, Contrast, Plots, Post Hoc and Options buttons individually below.
5.3.1 Model

Clicking the Model button results in the following dialogue box, where the model can be specified.

In Specify Model the default setting is Full factorial, which means that a full model is specified, i.e. a model containing the intercept, all factor and covariate main effects as well as all interactions. The alternative is Custom, where the model is specified manually, i.e. terms from the full model can be omitted. If Custom is chosen, the desired factors must be marked and moved to the window on the right-hand side of the dialogue box. In this example a Full factorial model has been chosen. At the bottom of the dialogue box it is possible to specify the method by which the sums of squares are calculated. The options are type 1, 2, 3 or 4. Type 3 is the default, and it is also the most widely used method at the Aarhus School of Business. Therefore, type 3 is chosen for this example as well.

Before clicking Continue it is a good idea to make sure that Include intercept in model is activated. This ensures the calculation of an overall mean, which is required if Deviation is chosen as the estimation method for the factors (see next section).
5.3.2 Contrast

Clicking the Contrast button results in the following dialogue box:

In Change Contrast the estimation method for each factor can be changed. No method is chosen by default, but the user can choose between six different ones: Deviation, Simple, Difference, Helmert, Repeated and Polynomial. A method is chosen by marking the factor in the Factors window and clicking the appropriate method in the Contrast drop-down box. The choice of method depends on the individual data set. If Deviation or Simple is chosen, SPSS requires a specification of a Reference Category (either the first or the last category of the factor). In this example Deviation has been chosen.

5.3.3 Plots

Clicking this button makes it possible to get profile plots produced for each factor. A plot is produced for each of the dependent variables. In Horizontal Axis the user must specify the factor to be plotted. When the chosen factor is moved from Factors to Horizontal Axis, the Add button must be clicked for the factor to move to the Plots window in the bottom square.

When there are several factors (grouping variables), the described procedure is repeated for all factors, of which a profile plot is desired. Furthermore, it is possible to use Separate Lines for each variable to combine two factors with each other in the plot. In this example where there is only one factor, the factor (group) is moved to Horizontal Axis, and the Add button is clicked.

5.3.4 Post Hoc

When it has been determined that there is a difference between two or more groups, Post Hoc Multiple Comparisons can be used to determine which groups are significantly different from each other. It is worth mentioning that the chosen tests are only carried out for each dependent variable separately, similar to the comparisons made in ANOVA. In order to specify which test to use, it is first necessary to specify which factors are to be tested. This is done at the top of the dialogue box:

There are several tests, which are categorized on the basis of whether or not equal variances in the groups are assumed. A thorough discussion of each test is beyond the scope of this manual, but at the Aarhus School of Business the tests that have been activated in the dialogue box shown above are normally used.

5.3.5 Options

Clicking the Options button results in the dialogue box shown below. In Estimated Marginal Means the factors or factor combinations (in the case of several factors) are chosen, for which an estimate is desired. Since this example only contains a single factor, group is moved to the Display means for window.

Display offers various ways of controlling the output from the GLM procedure. The boxes that are selected in the dialogue box above are described below. Descriptive statistics displays the observed mean value and the standard deviation of each dependent variable in the factor combinations. Estimates of effect size produces a partial eta-squared value for each effect and each parameter estimate. Observed power indicates the strength of the individual tests. Parameter estimates displays the parameter estimates, standard errors, t-tests, confidence intervals, and the observed power for each test.

SSCP matrices displays the sum-of-squares and cross-product matrices. When Residual SSCP is chosen, a Bartlett's test for sphericity of the residual covariance matrix is added. Homogeneity tests carries out Levene's test for variance homogeneity for each dependent variable across the existing factor levels/combinations. In addition, Box's M test for homogeneity of the covariance matrices of the dependent variables across all factor combinations is carried out.

Spread vs. level plots and residual plots are very useful for checking the model assumptions. Residual plots produce an observed/expected standardized residual plot for each dependent variable.

Furthermore, it is possible to change the level of significance.
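The specification above can also be run from a syntax window instead of the dialogue boxes. The following is only a sketch of the corresponding syntax, using the variable names y1, y2 and gruppe from the data set; the exact subcommands depend on the choices made in sections 5.3.1-5.3.5:

    GLM y1 y2 BY gruppe
      /CONTRAST(gruppe)=DEVIATION
      /METHOD=SSTYPE(3)
      /INTERCEPT=INCLUDE
      /PLOT=PROFILE(gruppe)
      /EMMEANS=TABLES(gruppe)
      /PRINT=DESCRIPTIVE ETASQ OPOWER PARAMETER HOMOGENEITY
      /CRITERIA=ALPHA(.05)
      /DESIGN=gruppe.

Here the CONTRAST, PLOT, EMMEANS and PRINT subcommands correspond to the Contrast, Plots, Options (Estimated Marginal Means) and Display choices described above.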

5.4 Output
When the procedure described above has been carried out, click the OK button in the initial Multivariate dialogue box in order for SPSS to carry out the analysis and produce the output. The outputs, with the exception of a few that are considered irrelevant for the discussion, will be commented on stepwise below.

Between-Subjects Factors

gruppe   Value Label    N
1        placebo        5
2        begge piller   5
3        pille A        5
4        pille B        5

Descriptive Statistics

     gruppe         Mean   Std. Deviation    N
y1   placebo        2,00    ,707              5
     begge piller   8,00    ,707              5
     pille A        3,00    ,707              5
     pille B        4,00   1,000              5
     Total          4,25   2,447             20
y2   placebo        2,00    ,707              5
     begge piller   9,00    ,707              5
     pille A        4,00   1,581              5
     pille B        5,00   1,581              5
     Total          5,00   2,847             20

The outputs shown above contain a description of the data set, among other things the number of observations in each group in output (1a). In output (1b) there are some descriptive statistics, which will not be discussed further. Output (2) displays Box's test, which indicates whether or not the covariance matrices for the individual groups are equal. This constitutes one of the assumptions of MANOVA, and it is similar to the assumption regarding equal variances for all groups known from ANOVA. From the output shown it appears that the null hypothesis of equal covariance matrices is accepted, which means that the assumption is not violated.

Box's Test of Equality of Covariance Matrices(a)

Box's M      13,100
F             1,123
df1               9
df2        2933,711
Sig.           ,343

Tests the null hypothesis that the observed covariance matrices of the dependent variables are equal across groups.
a. Design: Intercept+gruppe

After this test of assumptions, (3) Bartlett's test of sphericity follows. The null hypothesis is that the residual covariance matrix is proportional to an identity matrix, which means that the two dependent variables are uncorrelated. If the null hypothesis is accepted, there is no reason to carry out the MANOVA, since a number of simple ANOVAs would be sufficient. As can be seen from the output below, the null hypothesis is rejected, and MANOVA is therefore relevant in this example.
Bartlett's Test of Sphericity(a)

Likelihood Ratio       ,016
Approx. Chi-Square    6,212
df                        2
Sig.                   ,045

Tests the null hypothesis that the residual covariance matrix is proportional to an identity matrix.
a. Design: Intercept+gruppe

The output (4) Multivariate Tests shown below contains a number of multivariate tests, which are all significant. All four test statistics indicate that the null hypothesis of equal groups is rejected, since the p-values are 0,000. It can therefore be concluded that the combined dependent variables, Y1 and Y2, vary across the treatments.

Multivariate Tests(d)

Effect                            Value     F           Hyp. df   Error df   Sig.   Partial Eta Sq.   Noncent.   Observed Power(a)
Intercept   Pillai's Trace         ,976     303,141(b)  2,000     15,000     ,000   ,976              606,283    1,000
            Wilks' Lambda          ,024     303,141(b)  2,000     15,000     ,000   ,976              606,283    1,000
            Hotelling's Trace    40,419     303,141(b)  2,000     15,000     ,000   ,976              606,283    1,000
            Roy's Largest Root   40,419     303,141(b)  2,000     15,000     ,000   ,976              606,283    1,000
gruppe      Pillai's Trace        1,027       5,631     6,000     32,000     ,000   ,514               33,786     ,989
            Wilks' Lambda          ,073      13,566(b)  6,000     30,000     ,000   ,731               81,396    1,000
            Hotelling's Trace    11,414      26,632     6,000     28,000     ,000   ,851              159,791    1,000
            Roy's Largest Root   11,292      60,223(c)  3,000     16,000     ,000   ,919              180,670    1,000

a. Computed using alpha = ,05
b. Exact statistic
c. The statistic is an upper bound on F that yields a lower bound on the significance level.
d. Design: Intercept+gruppe

The SSCP matrices are shown in output (5) below. The hypothesis (H) and error (E) matrices display the variation between and within the groups, respectively; in the output they are the gruppe and Error rows. It is the relationship between these matrices that forms the basis of the multivariate tests mentioned above.

Between-Subjects SSCP Matrix

                                   y1        y2
Hypothesis   Intercept   y1   361,250   425,000
                         y2   425,000   500,000
             gruppe      y1   103,750   115,000
                         y2   115,000   130,000
Error                    y1    10,000     7,000
                         y2     7,000    24,000

Based on Type III Sum of Squares

The output (6) Tests of Between-Subjects Effects shows whether or not the explanatory variable is significant for the dependent variables. As can be seen, the contribution of the explanatory variable gruppe to the explanation of the variation is significant for both dependent variables.


Tests of Between-Subjects Effects

Source            DV   Type III SS   df   Mean Square   F         Sig.   Partial Eta Sq.   Noncent.   Observed Power(a)
Corrected Model   y1   103,750(b)     3    34,583        55,333   ,000   ,912              166,000    1,000
                  y2   130,000(c)     3    43,333        28,889   ,000   ,844               86,667    1,000
Intercept         y1   361,250        1   361,250       578,000   ,000   ,973              578,000    1,000
                  y2   500,000        1   500,000       333,333   ,000   ,954              333,333    1,000
gruppe            y1   103,750        3    34,583        55,333   ,000   ,912              166,000    1,000
                  y2   130,000        3    43,333        28,889   ,000   ,844               86,667    1,000
Error             y1    10,000       16      ,625
                  y2    24,000       16     1,500
Total             y1   475,000       20
                  y2   654,000       20
Corrected Total   y1   113,750       19
                  y2   154,000       19

a. Computed using alpha = ,05
b. R Squared = ,912 (Adjusted R Squared = ,896)
c. R Squared = ,844 (Adjusted R Squared = ,815)

So far it has only been determined that one or more groups are different. Therefore it is relevant to take a closer look at these differences. In SPSS this is done by means of pairwise comparisons, but as mentioned in section 5.3.4, these comparisons are only carried out for each dependent variable separately. As a consequence these pairwise comparisons are only partly relevant and will not be shown here. As mentioned in the introduction of this chapter, a significant MANOVA means that at least one discriminant function is significant as well. Therefore a discriminant analysis is very often carried out following the MANOVA. Hereby it is possible to interpret the differences between the groups by studying the discriminant functions. Please refer to chapter 6 about discriminant analysis.

6. Discriminant analysis
The explanation and interpretation of discriminant analysis presented in this section is based on the following literature:

- Videregående data-analyse med SPSS og AMOS, Niels Blunch, 1st edition 2000, Systime. Chp. 2, p. 29-48, chp. 3, p. 49-77
- Analyse af markedsdata, Niels Blunch, 2nd rev. edition 2000, Systime. Chp. 6, p. 209-220

6.1 Introduction
Discriminant analysis is used for studying the relationship between one nominal scaled dependent variable and one or more interval scaled explanatory variables (see appendix 1). In other words, discriminant analysis is used instead of regression analysis when the dependent variable is qualitative. The main purpose of discriminant analysis is to find the function of the explanatory variables that discriminates best between the two or more groups that are made up by the dependent variable. After this it is possible to:

- classify and place the units of the analysis (the respondents, the observations, etc.) in several previously defined groups
- assess the discrimination ability of the explanatory variables

A distinction is made between simple and multiple discriminant analysis. The difference is that the simple discriminant analysis only discriminates between two groups, whereas the multiple analysis discriminates between three or more groups. The group relation for both types is decided through discriminant functions. In the multiple discriminant analysis there are two or more of these functions, depending on the number of groups and variables. In this presentation it is assumed that the above-mentioned discriminant functions are linear. The following is an example of the simple discriminant analysis.

As mentioned in the previous section regarding MANOVA, a discriminant analysis is often conducted as an extension of a significant MANOVA test. In this case a discriminant analysis will ease the interpretation of the MANOVA. However, it should be mentioned that the discriminant analysis can be performed even without a preceding MANOVA. The following analysis is carried out under the assumption of a significant MANOVA.

6.2 Example
Data: \\okf-filesrv1\exemp\Spss\Manual\Diskriminantdata.sav

Salmon fishing is a very valuable resource for both Alaska (USA) and Canada. Since it is a limited resource, it is very important that fishermen from both countries observe the fishing quotas. The salmon have a remarkable lifecycle. They are born in freshwater, and after approximately 1-2 years they swim out into the ocean, i.e. into saltwater. After a couple of years in the ocean they return to their place of birth to spawn and die. Before this return takes place, the fishermen try to catch the fish. With a view to regulating the catch, there is a desire to identify where the fish come from. The salmon carry an interesting piece of information about their place of birth in the shape of growth rings on their scales. These growth rings, which can be identified both for the time spent in freshwater and in saltwater, are typically smaller on the Alaskan salmon than on the Canadian salmon. On the basis of this information 50 salmon from Alaska and 50 salmon from Canada are examined. The 100 salmon are examined for the following explanatory variables:

X1 = diameter of growth rings for the first year of the salmon's life in freshwater (1/100 inches)
X2 = diameter of growth rings for the first year of the salmon's life in saltwater (1/100 inches)
X3 = gender

Subsequently, a discriminant analysis based on the collected data is performed. The purpose is to find a discriminant function which is capable of classifying where a fish originates from, based on the salmon's growth rings.

6.3 Implementation of the analysis


The starting point of the discriminant analysis is the menu Analyze → Classify → Discriminant:

The following dialogue box now appears:

In this dialogue box the desired grouping variable must be marked in the left square and moved to the Grouping Variable window by clicking the arrow. In this case the grouping variable is country. The range is defined by clicking the Define Range button so that SPSS can calculate the number of groups. In this example 1 is specified as minimum and 2 as maximum. The two variables, Freshwater [x1] and Saltwater [x2], which have been moved to the Independents window, are the measured variables on the basis of which the classification will be carried out.

Furthermore, it must be specified whether the Stepwise method or the Independents together method is desired. Using the Stepwise method means that the variables that discriminate best are found automatically. However, one must be aware of the possible problems with multicollinearity among the explanatory variables. The Independents together method uses all the variables simultaneously for the discrimination. In this example the latter method has been chosen, since both variables are to be included in the discriminant function. On the side of the dialogue box there are a number of buttons, which are used for specifying how the discriminant analysis is carried out. Statistics, Classify and Save are described below. Method should only be activated if the Stepwise method has been chosen, since this is where the criteria that apply when choosing the independent variables are specified.

6.3.1 Statistics

This button is used for managing the estimation itself and defining which covariance matrices to display in the output. The mean values for each group can be calculated here as well. Furthermore, it is possible to carry out a test of whether pooling the covariance matrices for the two groups is possible. The following dialogue box appears:

The purpose of the individual parts is described as follows: By activating Means in Descriptives, a calculation of the mean and the standard deviation for each group of independent variables and for the entire sample takes place.

Univariate ANOVAs tests the hypothesis that the mean value of each independent variable is the same for each group, i.e. an analysis of variance. Selecting Box's M results in a test of the hypothesis that the covariance matrices for each group are equal, i.e. whether it is possible to pool the matrices. In Function Coefficients it is specified whether Fisher's method or the unstandardized method is used for the calculation of the classification coefficients. In Matrices it is specified which matrices to include in the output. In this example it has been chosen to have Means calculated and to carry out Box's M test to see if the covariance matrices can be pooled. The coefficients are calculated by means of Fisher's method, since an estimate of the discriminant function itself is desired. Furthermore, an output of the covariance matrices for each group as well as a single within-groups covariance matrix has been chosen.

6.3.2 Classify

In the following dialogue box the classification is managed:

The options are as follows: Prior Probabilities specifies the a priori probability that forms the basis of the classification. Either the prior probability for group membership can be determined from the observed group sizes, or it can be equal for all groups.

In Use Covariance Matrix it can be determined whether the pooled or the separate covariance matrices should be used. This is normally not determined until Box's M test has been carried out. Note that this choice only has a consequence for the classification of the observations, and not for the calculation of the discriminant function, since equal covariance matrices within the groups are assumed for that calculation. A violation of this assumption means that SPSS cannot perform the calculation.

By activating Casewise results in Display the group membership will be displayed for each observation. Selecting Summary table displays the result of the classification as well as a table showing observed and expected group membership.

In Plots various plots can be chosen.

In this example the prior probability will be equal for all groups and the estimation of the discriminant function will be performed on the basis of separate covariance matrices, since there is reason to believe that the differences between these are too great to justify a pooled matrix. Furthermore, it has been chosen to display the classification result.

6.3.3 Save

In brief it is possible to save expected group membership, the probability for group membership and the calculated discriminant scores. These values will be saved as new variables in the data set. However, there will be no such new variables in this example.
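All of the choices made in sections 6.3.1-6.3.3 can also be run from a syntax window. The following is only a sketch, assuming the grouping variable is named country and the predictors x1 and x2, as indicated in the dialogue box above:

    DISCRIMINANT
      /GROUPS=country(1 2)
      /VARIABLES=x1 x2
      /ANALYSIS ALL
      /PRIORS EQUAL
      /STATISTICS=MEAN COV GCOV BOXM COEFF TABLE
      /CLASSIFY=NONMISSING SEPARATE.

Here COEFF requests Fisher's classification function coefficients, BOXM the Box's M test, TABLE the classification summary table, and SEPARATE the use of separate covariance matrices for the classification.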

6.4 Output
When the various settings in the dialogue boxes have been specified, SPSS performs the analysis. The result is a number of outputs, of which the most relevant ones will be commented on and interpreted below.
Group Statistics

fødested               Mean     Std. Deviation   Valid N (listwise)
                                                 Unweighted   Weighted
USA       ferskvand     98,38   16,143            50           50,000
          saltvand     429,66   37,404            50           50,000
Canada    ferskvand    137,46   18,058            50           50,000
          saltvand     366,62   29,887            50           50,000
Total     ferskvand    117,92   26,001           100          100,000
          saltvand     398,14   46,240           100          100,000

(1) Group Statistics shows mean values, standard deviations and the number of observations in both groups for each of the explanatory variables. Furthermore, results for the entire data set are shown. The figures below are (2) the pooled covariance matrix and (3) the covariance matrices for the two groups. The two matrices appear to be significantly different from each other, so it is unlikely that the estimation of the discriminant function can be based on the pooled covariance matrix. This is verified by (4) Box's M test below.

Pooled Within-Groups Matrices(a)

Covariance    ferskvand    saltvand
ferskvand       293,349     -27,294
saltvand        -27,294    1146,173

a. The covariance matrix has 98 degrees of freedom.

Covariance Matrices

fødested                ferskvand    saltvand
USA       ferskvand       260,608    -188,093
          saltvand       -188,093    1399,086
Canada    ferskvand       326,090     133,505
          saltvand        133,505     893,261

(4) Box's M test below rejects the null hypothesis of equal covariance matrices. This is the argument behind the earlier statement about performing the analysis on the basis of separate covariance matrices.

Test Results

Box's M        10,938
F   Approx.     3,565
    df1             3
    df2       1728720
    Sig.         ,013

Tests null hypothesis of equal population covariance matrices.

In the part of the output called Summary of Canonical Discriminant Functions it is relevant to look at (5) Wilks' Lambda.

Wilks' Lambda

Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                     ,321            110,223      2    ,000

Wilks' Lambda displays the capability of the discriminant function to discriminate. As the output shows, the discriminant function is significant at all α-levels. From (6) the unstandardized canonical discriminant function below, it appears that the mutual relationship between the two variables is of interest to the analysis.

The discriminant function can be written as follows: D = 1.924 + 0.045x1 - 0.018x2. The standardized discriminant function is displayed in (7) Standardized Canonical Discriminant Function Coefficients. It has been calculated on the basis of a standardization of the explanatory variables.
Standardized Canonical Discriminant Function Coefficients

             Function 1
ferskvand        ,764
saltvand        -,611

After this there are a number of outputs, which will not be commented on here. The part of the output called Classification contains relevant outputs, which indicate how well the discriminant function discriminates. (8) Classification Function Coefficients displays Fisher's linear discriminant functions. These functions are calculated on the basis of the previously mentioned unstandardized canonical discriminant function. From the output below it appears that there is a function for both USA and Canada.

Classification Function Coefficients

             fødested
             USA        Canada
ferskvand        ,371      ,499
saltvand         ,384      ,332
(Constant)   -101,377   -95,835

Fisher's linear discriminant functions

By looking at the difference between the two functions, the following discriminant function can be written:

Y = -5.54121 - 0.12839x1 + 0.05194x2

In this function the values for the thickness of the growth rings of the individual fish can be substituted for x1 and x2. If the value of the function is positive, the fish is classified as American, and if it is negative, the fish is classified as Canadian. If there are more than two groups, Fisher's linear functions are used separately for each group. The function values are then calculated for each function, and the units of the analysis are assigned to the group with the highest function value. (9) Classification Results shows the classification of the fish on the basis of the Y function above. It appears that 93% of the fish are classified correctly by means of the function. Thus 7% of the fish are classified incorrectly, which deserves a comment. In that connection it should be noted that the probability of incorrectly classifying Canadian fish as American is lower than the other way around. This is due to the fact that the standard deviation for American salmon is larger than for Canadian salmon (see output (3)). It is important to be aware of this matter when performing a linear discriminant analysis.
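If the classification rule is to be applied manually, e.g. to new fish, it can be expressed directly in SPSS syntax. The following is only a sketch based on the Y function above; d and predicted are hypothetical helper variables, while x1 and x2 are the predictor names used in this example:

    * Discriminant score as the difference between the two Fisher functions.
    COMPUTE d = -5.54121 - 0.12839*x1 + 0.05194*x2.
    * Positive score -> classified as American (group 1), otherwise Canadian (group 2).
    COMPUTE predicted = 2.
    IF (d > 0) predicted = 1.
    EXECUTE.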

Classification Results(a)

                             Predicted Group Membership
                  fødested    USA     Canada    Total
Original   Count  USA         46       4         50
                  Canada       3      47         50
           %      USA         92,0     8,0      100,0
                  Canada       6,0    94,0      100,0

a. 93,0% of original grouped cases correctly classified.

Another important matter to be aware of is that the discriminant function will often classify the units of the analysis better than new units, since the function has been calculated from the sample data. As a consequence, a verification of the discriminant function is often required before it is used.

7. Profile analysis
The explanation and interpretation of profile analysis is based on the following literature: SPSS Advanced Models 10.0, chp. 1, p. 1-14 and chp. 12, p. 95-108.

7.1 Introduction
Profile analysis is useful for illustrating and visualizing a number of problems. The analysis is often employed to compare averages in statistical models by the use of so-called profile plots. A profile plot is a line in a co-ordinate system where every point corresponds to an average for a dependent variable (corrected for correlation) on a given factor level. A line can be drawn for every factor level. It is the comparison of these lines that is interesting. It should be noted that there is a close connection between MANOVA and the analysis that is made on profile plots. This connection lies in the fact that there are several interval scaled dependent variables and one or more nominal scaled explanatory variables.

7.2 Example
Data set: \\okf-filesrv1\exemp\Spss\Manual\Profildata.sav

The starting point for this example is the data set above. The data set is the result of a survey of the way men and women experience their marriage. The following variables are registered in the data set:

Dependent variables:
- q1 (contribution to the marriage)
- q2 (the lapse of the marriage)
- q3 (the degree of passion in the marriage)
- q4 (the degree of friendship in the marriage)

Explanatory variable:
- Spouse (man/woman)

With the above-mentioned survey as a starting point the objective is to find out whether there is a difference in the profiles that men and women create with reference to the dependent variables.

7.3 Implementation of the analysis


The first thing to do is to plot the profiles in a co-ordinate system. It should be noted that it is possible to choose profile plots in connection with GLM multivariate analyses (MANOVA). However, this is not recommended, since that approach does not allow for multiple plots. Instead the following approach can be adopted, using the menu Graphs → Legacy Dialogs → Line:

This results in a dialogue box, which should be filled out as follows:

This enables the user to plot several variables in the same graph as well as a display of summary measures for the individual variables. After clicking Define the resulting dialogue box is filled out as shown below:

Lines Represent specifies which lines are included in the graph. Category Axis specifies the variable used for categorization.
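The same chart can also be produced from a syntax window. A minimal sketch, using the variable names q1-q4 and spouse from this example:

    GRAPH
      /LINE(MULTIPLE)=MEAN(q1) MEAN(q2) MEAN(q3) MEAN(q4) BY spouse.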

To get the profiles plotted it is now necessary to have the questions grouped according to the profiles and not the other way around, as is currently the case. This is obtained by opening the graph editor (by double-clicking the graph) and selecting Properties in the Edit menu. When this is done, the depiction of the explanatory variable is changed by selecting Group instead of X Axis. The questions will now be grouped by the outcomes of the explanatory variable instead of by the dependent variables. See below.

[Profile plot: the mean scores on q1-q4 are drawn as separate lines for Husband and Wife; the plotted means lie between 3,83 and 4,63.]

The profiles have now been plotted, and a visual comparison can be made. It appears that the profiles are likely to be parallel and perhaps even congruent. Whether the profiles are horizontal is more doubtful. The following sections address the following issues:

- Are the profiles parallel? (7.3.1) Both genders attach equal weight to the factors, but there is a difference in the level.
- Are the profiles congruent? (7.3.2) Both genders attach equal weight to the factors at the same level.
- Are the profiles horizontal? (7.3.3) Both genders find that all factors have equal influence.

7.3.1 Are the profiles parallel?

The starting point for all three issues mentioned above is the GLM Multivariate procedure (Analyze → General Linear Model → Multivariate):

The resulting dialogue box is filled out by moving the dependent variables to the Dependent Variables window and the explanatory variable to the Fixed Factor(s) window as shown below:

By clicking the Model button it is possible to specify the model for the current analysis (parallelism). The dialogue box is filled out as follows:

After this Continue is selected. In order to perform the current analysis it is necessary to modify the procedure slightly. This is done by means of the syntax window.

Since the model has been almost completely set up by SPSS already, it is a good idea to make use of that in the syntax. The conversion of the model to the syntax is done by clicking the Paste button.

After that two additional lines should be added as shown. LMATRIX defines the two levels to be compared in the spouse variable. MMATRIX defines the C matrix. A thorough description of the theory behind the matrices used is beyond the scope of this manual. Please refer instead to SPSS Advanced Models, which is located at the IT instructors' office in room H15.
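A sketch of what the complete syntax might look like, assuming the variable names q1-q4 and spouse from this example (the labels in quotes are hypothetical):

    GLM q1 q2 q3 q4 BY spouse
      /METHOD=SSTYPE(3)
      /INTERCEPT=INCLUDE
      /CRITERIA=ALPHA(.05)
      /LMATRIX="husband vs wife" spouse 1 -1
      /MMATRIX="q1-q2" q1 1 q2 -1; "q2-q3" q2 1 q3 -1; "q3-q4" q3 1 q4 -1
      /DESIGN=spouse.

The LMATRIX line compares the two spouse levels, and the MMATRIX line forms the differences between adjacent dependent variables (the C matrix), so the multivariate test of this contrast becomes a test of parallelism.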

To make SPSS run the analysis, the syntax should be marked (e.g. by dragging the mouse over it while holding down the left mouse button) and Run → Selection chosen from the menu bar.

7.3.1.1 Output

The procedure described above results in a number of outputs. One of them is an output known from the MANOVA analysis (see chapter 5 for an interpretation of this). In the part of the output named Custom Hypothesis Tests #1 an output is found which indicates whether or not the null hypothesis regarding parallelism is true. This is (1) Multivariate Test Results shown below. From the output it appears that the null hypothesis is accepted at α = 0.05, since the p-value is 0.063. The conclusion is therefore that the profiles are parallel. However, this conclusion is sensitive to the choice of the level of significance.

Multivariate Test Results

                     Value   F          Hypothesis df   Error df   Sig.
Pillai's trace       ,121    2,580(a)   3,000           56,000     ,063
Wilks' lambda        ,879    2,580(a)   3,000           56,000     ,063
Hotelling's trace    ,138    2,580(a)   3,000           56,000     ,063
Roy's largest root   ,138    2,580(a)   3,000           56,000     ,063

a. Exact statistic

7.3.2 Are the profiles congruent?

For this analysis the previous syntax can be used again with minor changes. Instead of inserting the C-matrix the 1-matrix should be inserted. The sum of the coefficients must be 1, which explains why each has been multiplied by 0.25. The complete syntax is as follows:
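A sketch of what the modified syntax might look like, again assuming the variable names q1-q4 and spouse (note that with this MMATRIX the single transformed variable appears in the output as T1):

    GLM q1 q2 q3 q4 BY spouse
      /METHOD=SSTYPE(3)
      /INTERCEPT=INCLUDE
      /CRITERIA=ALPHA(.05)
      /LMATRIX="husband vs wife" spouse 1 -1
      /MMATRIX=q1 .25 q2 .25 q3 .25 q4 .25
      /DESIGN=spouse.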

The syntax is run in the same way as the previous.

7.3.2.1 Output

As in the previous section the relevant output is found in Custom Hypothesis Tests #1. This output indicates whether the null hypothesis of congruent profiles is true. In this particular example SPSS only uses the univariate result. This means that the correlation between the dependent variables is not taken into account.

Test Results
Transformed Variable: T1

Source     Sum of Squares   df   Mean Square   F       Sig.
Contrast     2343,750        1     2343,750    1,533   ,221
Error       88687,500       58     1529,095

(2) Test Results shows that the null hypothesis of congruent profiles cannot be rejected at α = 0.05, since the p-value is 0.221. Therefore, it can be concluded that the profiles are congruent.

7.3.3 Are the profiles horizontal?

For this analysis the C-matrix from section 7.3.1 is reinserted, and an intercept is added as shown below:

The syntax is run in the same way. For this analysis to make any sense it is necessary for all variables to be scaled similarly.

7.3.3.1 Output

From (3) Multivariate Test Results shown below it appears that the null hypothesis of horizontal profiles can be rejected at α = 0.05, since the p-value is 0.000. As a consequence, it can be concluded that the mean values for the four dependent variables are not equal.

Multivariate Test Results

                     Value   F          Hypothesis df   Error df   Sig.
Pillai's trace       ,305    8,188(a)   3,000           56,000     ,000
Wilks' lambda        ,695    8,188(a)   3,000           56,000     ,000
Hotelling's trace    ,439    8,188(a)   3,000           56,000     ,000
Roy's largest root   ,439    8,188(a)   3,000           56,000     ,000

a. Exact statistic

8. Factor analysis
The following presentation and interpretation of factor analysis is based on the following literature:

- Videregående data-analyse med SPSS og AMOS, Niels Blunch, 1st edition 2000, Systime. Chp. 6, p. 124-155
- Analyse af markedsdata, Niels Blunch, 2nd rev. edition 2000, Systime. Chp. 3, p. 87-118
- Hair et al. (2006): Multivariate Data Analysis, 6th edition, Pearson. Chp. 3

8.1 Introduction
Factor analysis can be divided into component analysis as well as exploratory and confirmatory factor analysis. The three types of analysis can be used on the same data set but build on different mathematical models. Component analysis and exploratory factor analysis nevertheless produce relatively similar results. In this manual only component analysis is described. In component analysis the original variables are transformed into the same number of new variables, so-called principal components, by a linear transformation. The principal components have the characteristic that they are uncorrelated, which is why they are suitable for further data processing such as regression analysis. Furthermore, the principal components are calculated so that the first component carries the bulk of the information (explains most variance), the second component carries the second-most information and so forth. For this reason component analysis is often used to reduce the number of variables/components, so that the last components with the least information are disregarded. Thereafter the task is to discover what the new components correspond to, which is exemplified below.

8.2 Example
Data: \\okf-filesrv1\Exemp\Spss\Manual\Faktoranalysedata.sav

The following example is based on an evaluation of a course at the Aarhus School of Business. 162 students attending the lecture were asked to fill out a questionnaire containing various questions regarding the assessment of the course. Each of the assessments was based on a five-point Likert scale, where 1 is "No, not at all" and 5 is "Yes, absolutely". The questions in the survey were (label names in parentheses):

- Has this course met your expectations? (Met expectations)

- Was this course more difficult than your other courses? (More difficult than other courses)
- Was there a reasonable relationship between the amount of time spent and the benefit derived? (Relationship time/Benefit)
- Have you found the course interesting? (Course Interesting)
- Have you found the textbooks suitable for the course? (Textbooks suitable)
- Was the literature inspiring? (Literature inspiring)
- Was the curriculum too extensive? (Curriculum too extensive)
- Has your overall benefit from the course been good? (Overall benefit)

The original data consist of more assessments, but these have not been included in the example. The purpose is now, based on the survey, to carry out a component analysis with a view to reducing the number of assessment criteria to a smaller number of components. Furthermore, the new components should be examined with a view to naming them.

8.3 Implementation of the analysis


Component analysis is a method which is used exclusively for uncovering latent factors from manifest variables in a data set. Since these fewer factors usually form the basis of further analysis, the component analysis is found in the menu Analyze → Data Reduction → Factor:

This results in the dialogue box shown below. The variables to be included in the component analysis are marked in the left-hand window, where all numeric variables in the data set are listed, and moved to the Variables window by clicking the arrow. In this case all the variables are chosen.

It has now been specified, which variables SPSS should base the analysis on. However, a more definite method for performing the component analysis has yet to be chosen. This is done by means of the Descriptives, Extraction, Rotation, Scores and Options buttons. These are described individually below.

8.3.1 Descriptives

By clicking the Descriptives button the following dialogue box appears:

In short the purpose of the individual options is as follows:

Statistics Univariate descriptives includes the mean, standard deviation and the number of useful observations for each variable. Initial solution includes initial communalities, eigenvalues, and the percentage of variance explained.

Correlation Matrix: here it is possible to get information about the correlation matrix, among other things the appropriateness of performing a factor analysis on the data set. In this example Initial solution is chosen, because a display of the explained variance for the suggested factors of the component analysis is desired. At the same time this is the most widely used option. Anti-Image, KMO and Bartlett's test of sphericity are checked as well in order to analyze the appropriateness of a factor analysis on the given data set. With regard to the tests selected above, it may be useful to add a few comments. The Anti-Image matrix provides the negative values of the partial correlations between the variables. The anti-image values ought therefore to be low, indicating that the variables do not differ too much from the other variables. KMO and Bartlett's test provide, as previously mentioned, measures of the appropriateness as well. As a rule of thumb, KMO ought to attain values of at least 0.5 and preferably above 0.7 to indicate that the data are suitable for a factor analysis. Equivalently, Bartlett's test should be significant, indicating that significant correlations exist between the variables.

8.3.2 Extraction

By clicking the Extraction button the following dialogue box appears:

This is where the component analysis itself is managed. In this example it has been chosen to use the Principal components method for the component analysis. This is chosen in the Method drop-down box. Since the individual variables of this example are scaled very differently, it has been chosen to base the analysis on the Correlation matrix, i.e. a standardization is carried out.

A display of the un-rotated factor solution is wanted in order to compare this with the rotated solution. Therefore, Unrotated factor solution is activated in Display.

Since the last components do not explain very much of the variance in a data set, it is standard practice to ignore these. This results in a bit of lost information (variance) in the data set, but in return a simpler output is obtained for further analysis. In addition, it makes the interpretation of the data easier. The excluded components are treated as noise in the data set. The important question is just how many components can be excluded without causing too much loss of information. The following rules of thumb for choosing the right number of components apply:

- Scree plot: by selecting this option in Display a graphical illustration of the variance of the components appears. A typical feature of this graph is a break on the curve. This break forms the basis of a judgement of the right number of components to include.
- Kaiser's criterion: components with an eigenvalue of more than 1 are included. This can be observed from the Total Variance Explained table or the scree plot shown in section 8.4.

However, these are only guidelines. The actual number of chosen factors is subjective, and it depends strongly on the data set and the characteristics of the further analysis. If the user wants to carry out the component analysis based on a specific number of factors, this number can be specified in Number of factors in Extract. Last but not least it is possible to specify the maximum number of iterations. The default is 25. This option is not relevant for this example, but for the Maximum Likelihood method it could be relevant.

8.3.3 Rotation

Clicking the Rotation button results in the following dialogue box:

In brief, rotation of the solution is a method where the axes of the original solution are rotated in order to obtain an easier interpretation of the found components. In other words, it is ensured that the individual variables are highly correlated with a small proportion of the components, while being only weakly correlated with the remaining components. In this example Varimax is chosen as the rotation method, since it ensures that the components of the rotated solution are uncorrelated. The remaining methods will not be described further here. A display of the rotated solution has been chosen in Display. Loading plots enables a display of the solution in a three-dimensional plot of the first three components. If the solution consists of only two components, the plot will be two-dimensional instead. With the eight assessment criteria in this example, a three-dimensional plot looks rather confusing, and it has therefore been omitted here.

8.3.4 Scores

Now the solution of the component analysis has been rotated, which should have resulted in a clearer picture of the results. However, there are still two options to bear in mind before the analysis is carried out. By selecting Scores it is possible to save the factor scores, which is sensible if they are to be used for other analyses such as profile analysis, etc. These factor scores will be added to the data set as new variables with default names provided by SPSS. In this example this option has not been chosen, as no further analysis including these scores is to be performed.

8.3.5 Options

Treatment of missing values in the data set is managed in the Options dialogue box shown below:

Missing values can be treated as follows:

Exclude cases listwise excludes observations that have missing values for any of the variables. Exclude cases pairwise excludes observations with missing values for either or both of the pair of variables in computing a specific statistic. Replace with mean replaces missing values with the variable mean.

In this example Exclude cases listwise has been chosen in order to exclude observations with missing values. With regard to the output of the component analysis there are two options: Sorted by size sorts the factor loading and structure matrices so that variables with high loadings on the same factor appear together. The loadings are sorted in descending order. Suppress absolute values less than makes it possible to control the output so that coefficients with absolute values less than a specified value (between 0 and 1) are not shown. This option has no effect on the analysis, but ensures a good overview of the variables in their respective factors. In this analysis it has been chosen that no values below 0.1 are shown in the output.
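All of the choices made in sections 8.3.1-8.3.5 can also be expressed in a single FACTOR command and run from a syntax window. The following is only a sketch, in which the variable names v1 to v8 are hypothetical stand-ins for the eight assessment variables in the data set:

    FACTOR
      /VARIABLES=v1 v2 v3 v4 v5 v6 v7 v8
      /MISSING=LISTWISE
      /PRINT=INITIAL KMO AIC EXTRACTION ROTATION
      /FORMAT=BLANK(.10)
      /CRITERIA=MINEIGEN(1) ITERATE(25)
      /EXTRACTION=PC
      /ROTATION=VARIMAX.

Here AIC requests the anti-image matrices, KMO the KMO and Bartlett test, and BLANK(.10) suppresses loadings below 0.1, corresponding to the dialogue box choices above.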

8.4 Output
When the various settings in the dialogue boxes have been specified, SPSS performs the component analysis. After clicking OK a rather comprehensive output is produced, of which the most relevant parts are commented on below.

KMO and Bartlett's Test

Kaiser-Meyer-Olkin Measure of Sampling Adequacy.              ,791
Bartlett's Test of Sphericity    Approx. Chi-Square        413,377
                                 df                             28
                                 Sig.                         ,000

Anti-image Matrices(a)

Variables: 1 = Faget relevant, 2 = Faget svært, 3 = Tid ift. udbytte, 4 = Faget spændende og interessant, 5 = Egnede lærebøger, 6 = Inspirerende litteratur, 7 = Pensum for stort, 8 = Samlede udbytte

Anti-image Covariance
         1        2        3        4        5        6        7        8
1       ,497     ,070    -,172    -,221    -,032     ,080    -,005    -,076
2       ,070     ,876    -,025    -,079    -,010     ,114    -,152     ,050
3      -,172    -,025     ,579    -,009     ,019    -,061     ,081    -,147
4      -,221    -,079    -,009     ,499     ,005    -,111     ,027    -,104
5      -,032    -,010     ,019     ,005     ,526    -,263    -,087    -,108
6       ,080     ,114    -,061    -,111    -,263     ,467     ,084    -,054
7      -,005    -,152     ,081     ,027    -,087     ,084     ,886     ,019
8      -,076     ,050    -,147    -,104    -,108    -,054     ,019     ,486

Anti-image Correlation
         1        2        3        4        5        6        7        8
1       ,763(a)  ,106    -,321    -,443    -,063     ,166    -,008    -,155
2       ,106     ,728(a) -,036    -,120    -,015     ,178    -,172     ,077
3      -,321    -,036     ,843(a) -,017     ,035    -,117     ,114    -,278
4      -,443    -,120    -,017     ,806(a)  ,010    -,231     ,041    -,211
5      -,063    -,015     ,035     ,010     ,746(a) -,531    -,127    -,214
6       ,166     ,178    -,117    -,231    -,531     ,735(a)  ,130    -,114
7      -,008    -,172     ,114     ,041    -,127     ,130     ,751(a)  ,029
8      -,155     ,077    -,278    -,211    -,214    -,114     ,029     ,875(a)

a. Measures of Sampling Adequacy (MSA)

SPSS has now generated the tables for the KMO and Bartlett's test as well as the Anti-Image matrices. KMO attains a value of 0.791, which satisfies the criterion mentioned above. Equivalently, Bartlett's test attains a probability value of 0.000. Similarly, high correlations are primarily found on the diagonal of the Anti-Image correlation matrix (marked with an a). This confirms our thesis of an underlying structure in the variables.

Total Variance Explained

            Initial Eigenvalues               Extraction Sums of Sq. Loadings    Rotation Sums of Sq. Loadings
Component   Total   % of Var.   Cumul. %      Total   % of Var.   Cumul. %       Total   % of Var.   Cumul. %
1           3,499   43,740       43,740       3,499   43,740       43,740        2,526   31,574       31,574
2           1,114   13,922       57,662       1,114   13,922       57,662        1,853   23,167       54,741
3           1,031   12,887       70,549       1,031   12,887       70,549        1,265   15,807       70,549
4            ,755    9,434       79,983
5            ,547    6,835       86,818
6            ,401    5,013       91,831
7            ,383    4,792       96,624
8            ,270    3,376      100,000

Extraction Method: Principal Component Analysis.

As can be seen from the output shown above, SPSS has produced the table Total Variance Explained, which takes account of both the rotated and the unrotated solution. (1) Initial Eigenvalues displays the calculated eigenvalues as well as the explained and accumulated variance for each of the 8 components. (2a) Extraction Sums of Squared Loadings displays the components which satisfy the criterion chosen in the Extraction section (Kaiser's criterion was chosen in section 8.3.2). In this case there are three components with an eigenvalue above 1. These three components together explain 70.549% of the total variation in the data. The individual contributions are 43.740%, 13.922% and 12.887% of the variation for components 1, 2 and 3, respectively. These are the results for the unrotated solution.

In (2b) Rotation Sums of Squared Loadings similar information to that of (2a) can be found, except that these are the results for the Varimax rotated solution. As shown the sum of the three variances is the same both before and after the rotation. However, there has been a shift in the relationship between the three components, as they contribute more equally to the variation in the rotated solution. The Scree plot below is an illustration of the variance of the principal components:

[Scree plot: the eigenvalues are plotted against the component numbers 1-8.]

After the third component the eigenvalues fall below 1, and the curve flattens. Normally the scree plot will exhibit a clear break on the curve, which confirms how many components to include. The break is not very distinct in the current example; nevertheless the three components that satisfy Kaiser's criterion are included in the further analysis. It is therefore reasonable to treat the remaining components as noise. This agrees with the previous output of the explained variance in the table Total Variance Explained.

Component Matrix(a)

Variables in row order: Samlede udbytte, Faget spændende og interessant, Inspirerende litteratur, Faget relevant, Tid ift. udbytte, Egnede lærebøger, Faget svært, Pensum for stort

Component 1:   ,816   ,762   ,730   ,723   ,722   ,672   -,340   -,331
Component 2:   ,286   -,257   ,330   ,187   ,710   ,547
Component 3:   -,111   ,430   -,328   -,278   ,592   ,540

(Loadings below ,1 are suppressed.)
Extraction Method: Principal Component Analysis.
a. 3 components extracted.

(3a) Component Matrix displays the principal component loadings for the unrotated solution, i.e. the coefficients of the variables in the unrotated solution. For example it can be observed that the correlation between the variable Overall benefit (Samlede udbytte) and component 1 is 0.816, i.e. very high. The unrotated solution does not form an optimal picture of the correlations, however. Therefore, it is a good idea to rotate the solution in the hope of clearer results. (3b) Rotated Component Matrix displays the principal component loadings in the same way, but, as the name reveals, for the rotated solution.

Rotated Component Matrix(a)

Variables in row order: Faget relevant, Faget spændende og interessant, Tid ift. udbytte, Samlede udbytte, Egnede lærebøger, Inspirerende litteratur, Pensum for stort, Faget svært

Component 1:   ,854   ,771   ,764   ,669   ,226   ,261   -,226
Component 2:   ,282   ,153   ,462   ,871   ,819   -,303
Component 3:   -,161   -,123   -,214   ,799   ,731

(Loadings below ,1 are suppressed.)
Extraction Method: Principal Component Analysis. Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 5 iterations.

As previously mentioned, the purpose of rotating the solution is to make some variables correlate highly with one of the components, i.e. to enlarge the large coefficients (loadings) and to reduce the small ones. Whether a variable is highly or lowly correlated is subjective, but one rule of thumb is that loadings below 0.4 are considered low. By means of output (3b) Rotated Component Matrix it is possible to join the most similar assessment criteria in three different groups:

Component 1: For this group the variables Met expectations, Relationship time/Benefit, Course Interesting and Overall benefit are important. Therefore it will be appropriate to categorize component 1 as course benefit.

Component 2: The variables Textbooks suitable and Literature inspiring correlate highly with component 2, and it is therefore named quality of the literature.

Component 3: The variables More difficult than other courses and Curriculum too extensive both have loadings above 0.4 on the third component. It could thus be named difficulty of the course.

This concludes the example. It is worth mentioning that if these results were to be used in later analyses, the factor scores should be used. As mentioned, these scores can be calculated by selecting the option called Scores. SPSS will then calculate a score for each respondent on each of the components.

9. Multidimensional scaling (MDS)


The explanation and interpretation of multidimensional scaling presented in this section is based on the following literature:

- Analyse af markedsdata, Niels Blunch, 2nd rev. edition 2000, Systime. Chp. 4, p. 136-157
- Marketing Research - An Applied Orientation, Naresh K. Malhotra, 3rd edition 1999. Chp. 21, p. 633-646
- SPSS Categories 10.0, Jacqueline J. Meulman, Willem J. Heiser, SPSS Inc. 1999. Chp. 1, p. 13, chp. 13, p. 193
- Multidimensional skalering og conjoint analyse - et produktudviklingsværktøj, Hans Jørn Juhl, Internt undervisningsmateriale H nr. 151, Institut for informationsbehandling 1992, Handelshøjskolen i Århus. Chp. 2 & 4

9.1 Introduction
The purpose of multidimensional scaling is to find a structure in a number of objects or cases based on the distances between them. This is done by placing the objects/units of the analysis in a dimensional plane (normally two- or three-dimensional) based on a proximity matrix. The objects are placed in the plane based on information about the differences between them (the differences may be disclosed by asking a number of respondents). In the literature a distinction is made between two different types of multidimensional scaling: metric and non-metric MDS. For non-metric as well as metric MDS the starting point is a square proximity matrix. For metric MDS it is a requirement that the data are on interval or ratio level, which means that the proximity matrix can be a covariance or a correlation matrix. Non-metric MDS is used when the data are ordinal or nominal scaled. An example could be a matrix of pairwise comparisons or a transformation of rank values.

9.2 Example
Data set: \\okf-filesrv1\exemp\Spss\Manual\MDS.sav

In a test of beverages two respondents have been asked to compare ten different soft drinks. After the consumption of the drinks in random order, the respondents were asked to express their opinion(4) about each single soft drink, ranking them on a 10-point scale. Small values expressed a large degree of homogeneity and large values expressed a small degree of homogeneity (= distance). The following soft drinks were rated: Coca-Cola, Pepsi, Fanta, Afri-Cola, Sprite, Bluna, Sinalco, ClubCola, Red Bull and Mezzo-Mix(5). Because this example deals with ordinal data, a non-metric multidimensional scaling is carried out.

9.3 Implementation of the analysis


The multidimensional scaling is found in the menu Analyze → Scale → Multidimensional Scaling (ALSCAL):

Following this action the following dialogue box appears:

(4) Pairwise comparison.
(5) The survey is fictitious and is taken from the following publication: Multidimensionale Skalierung - Beispieldatei zur Datenanalyse, Lehrstuhl für empirische Wirtschafts- und Sozialforschung, Fachbereich Wirtschaftswissenschaft, BUGH Wuppertal 2001. Bergische Universität Gesamthochschule Wuppertal.

In this dialogue box the desired variables for the multidimensional scaling should be marked in the window on the left-hand side. The chosen variables are then transferred to the Variables window by clicking the arrow next to this field (in this case all the variables should be transferred). In this example the data have been entered in a square and symmetrical matrix and are at this point already distances. Therefore Data are distances is chosen and Shape is activated, which results in the dialogue box shown below. It should be filled out as illustrated:

If the data set is not already a proximity matrix, this should be defined, because MDS requires the input to be proximity data. It is therefore sometimes necessary to transform the data, which can be done by choosing the option Create distances from data. (Please note that this option is not pictured here.)


Hereby the data are expressed as distances. In the section Measure the method for transformation is chosen. If the analysis is based on continuous data, Interval is chosen; if it is based on frequency data, Counts is chosen; and if it is based on binary data, Binary is chosen. The method by which the distances should be calculated is chosen under each item. Normally the Euclidean measure of distance is used, which is also the case in this example. The proximity matrix is generated based on this measure of distance. The menu item Transform Values provides the possibility of standardization before the calculation of the proximity data; the standardization method can be specified in the drop-down menu. However, there are other crucial settings that control the scaling itself. For this reason Model and Options will subsequently be described separately.

9.3.1 Model

The activation of Model is done to specify the model. The dialogue box below will appear:

In this dialogue box there are a number of possibilities:

Level of Measurement: Specification of the scaling of the data. Under this item it is chosen to calculate the distances on an ordinal level, since the data in this example are ordinal scaled.

Conditionality: This item indicates which comparisons make sense. Matrix conditionality should be chosen when the calculated distances must be compared within each single distance matrix. This is the case when the analysis is concerned with one matrix only (as in this case) or when each matrix represents different areas. Row means that the calculated distances are only compared within the same row of the matrix. This is relevant if the analysis is concerned with asymmetrical or rectangular matrices. Row cannot be chosen if a distance matrix has been made for the data, e.g. from proximity data, since these are square and symmetrical. Unconditional should be chosen if comparisons between various distance matrices are needed.

In Dimensions it should be indicated how many dimensions are needed for the analysis. If only one solution is needed, minimum and maximum should have the same value.

In Scaling Model the assumptions behind the scaling are indicated. Euclidean distance can be used for all distance matrices, while Individual differences Euclidean distance (known as INDSCAL) can only be used when the analysis deals with multiple distance matrices and there are at least two dimensions.

In this case MDS should be performed with ordinal data in a two-dimensional solution. Since the analysis only deals with one distance matrix, which is square and symmetrical, matrix conditionality and the Euclidean distance model are chosen.

9.3.2 Options

Options is activated to define the actual output of the multidimensional scaling. Hereby the following dialogue box will appear:

The decision regarding the contents of the output from the MDS-procedure is controlled under Display. The following options are available:


Group Plots produces, as the most important feature, a scatter plot of the linear fit between the model and the data material. Individual subject plots creates a plot of the transformations of the data from a single area; a requirement is that Matrix Conditionality and Ordinal data have been activated in the Model dialogue box as described above.

Data matrix includes both the raw data and the scaled data for each subject area. Model and options summary describes the effect that the scaling options have on the analysis.

Another important item is Criteria, since it controls the scaling itself. Here it is decided when the iteration should stop, based on criteria concerning badness-of-fit.

S-stress convergence: S-stress is a badness-of-fit criterion which indicates how well the distance matrix is pictured in the two-dimensional space. In S-stress convergence the smallest acceptable change in badness-of-fit between the iterations is decided; the iteration procedure stops if the change falls below the given value. This value, of course, cannot be negative. It should be as small as possible, since smaller values enhance precision; however, this will also require more time for the calculations.

Minimum s-stress value indicates the smallest acceptable badness-of-fit. The value should be between 0 and 1, with 1 being the worst measure of fit.

Maximum iterations indicates the maximum number of iterations. MDS is conducted via an iteration procedure by which the products in turn are placed in the coordinate system. The default in SPSS is 30 iterations, but this number can be raised to get a more precise solution (again, this will require more time for the calculations).

Finally it is possible to ensure that distances below a certain size are treated as missing. This is done in Treat distances less than __ as missing. This function has not been used in this example.
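
For reference, the dialogue choices described above can also be expressed in syntax. The following is a sketch of the corresponding ALSCAL command, using the variable names that appear in the output of this example and the criteria values discussed above:

ALSCAL
  VARIABLES=cocacola pepsi fanta afri sprite bluna sinalco club redbull mezzo
  /SHAPE=SYMMETRIC
  /LEVEL=ORDINAL
  /CONDITION=MATRIX
  /MODEL=EUCLID
  /CRITERIA=CONVERGE(.001) STRESSMIN(.005) ITER(30)
  /PLOT=DEFAULT
  /PRINT=DATA.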

9.4 Output
The above actions result in a long and comprehensive output regarding the multidimensional scaling. The most relevant parts will subsequently be commented on. First, some descriptive information about the procedure itself appears, which will only be commented on briefly:


Alscal Procedure Options

Data Options-

  Number of Rows (Observations/Matrix).      10
  Number of Columns (Variables) . . . .      10
  Number of Matrices  . . . . . . . . .       1
  Measurement Level . . . . . . . . . . Ordinal
  Data Matrix Shape . . . . . . . .   Symmetric
  Type  . . . . . . . . . . . .   Dissimilarity
  Approach to Ties  . . . . . . . .  Leave Tied
  Conditionality  . . . . . . . . . . .  Matrix
  Data Cutoff at  . . . . . . . . . . . ,000000

The output above shows that the symmetrical matrix consists of ten rows and ten columns. Furthermore, it can be seen that the data are on the ordinal scale of measurement. A lot of descriptive output follows which will not be commented on. A print-out of the raw data matrices follows (in this case only one); this matrix is similar to the original matrix (not displayed here). The following warning draws attention to the fact that the number of estimated parameters is large relative to the number of data values in the matrix:

>Warning # 14654
>The total number of parameters being estimated (the number of stimulus
>coordinates plus the number of weights, if any) is large relative to the
>number of data values in your data matrix. The results may not be reliable
>since there may not be enough data to precisely estimate the values of the
>parameters. You should reduce the number of parameters (e.g. request
>fewer dimensions) or increase the number of observations.
>Number of parameters is 20. Number of data values is 45


If the analysis has more than ten objects in a two-dimensional scaling, this warning will no longer appear. Then follows a description of the iteration procedure:

Iteration history for the 2 dimensional solution (in squared distances)

Young's S-stress formula 1 is used.

  Iteration     S-stress      Improvement
      1          ,02813
      2          ,02541         ,00272
      3          ,02459         ,00082

Iterations stopped because
S-stress improvement is less than ,001000

As the output illustrates, only three iterations were made before the search process stopped. The reason is that the s-stress improvement was less than the specified value of 0.001. The development in badness-of-fit (s-stress) can also be followed through the improvement from iteration to iteration. Finally the first result of the multidimensional scaling follows:
Stress and squared correlation (RSQ) in distances

RSQ values are the proportion of variance of the scaled data (disparities) in the partition (row, matrix, or entire data) which is accounted for by their corresponding distances. Stress values are Kruskal's stress formula 1.

For matrix
    Stress = ,02179    RSQ = ,99707

Here the stress value, which according to Kruskal's rule of thumb should be less than 0.1, can be read. This requirement is fulfilled, and the conclusion is that it has been possible to illustrate the distance matrix in a two-dimensional space with a very high degree of precision. The RSQ (R-Square = R2), which has a value of 0.99707, is the traditional expression for the degree of explanation; in this case it is quite large and accordingly very satisfactory. The next output indicates the set of co-ordinates in the two dimensions for each soft drink.

Configuration derived in 2 dimensions

Stimulus Coordinates

                           Dimension
Stimulus   Stimulus        1         2
 Number      Name

    1      cocacola     1,0732    -,4818
    2      pepsi         ,7125    -,3833
    3      fanta       -1,3131    -,2381
    4      afri          ,6451     ,8952
    5      sprite       -,0711   -1,4253
    6      bluna       -1,7408    -,0561
    7      sinalco     -1,7354     ,5402
    8      club          ,3461    1,1895
    9      redbull      2,2358     ,1069
   10      mezzo        -,1523    -,1472

These co-ordinates are used to present the configuration of the objects graphically (see (1) Derived Stimulus Configuration below). The co-ordinates may also be transferred to other programs if a graphical representation different from the one SPSS offers is needed. Then follows a matrix with the transformed data (the optimally scaled data):


Optimally scaled data (disparities) for subject 1

         1      2      3      4      5
 1    ,000
 2    ,471   ,000
 3   2,424  2,031   ,000
 4   1,490  1,274  2,194   ,000
 5   1,490  1,274  1,664  2,424   ,000
 6   2,808  2,424   ,471  2,602  2,194
 7   2,989  2,602   ,891  2,424  2,602
 8   1,799  1,664  2,194   ,471  2,602
 9   1,274  1,490  3,566  1,799  2,808
10   1,274   ,891  1,274  1,274  1,274

         6      7      8      9     10
 6    ,000
 7    ,471   ,000
 8   2,424  2,194   ,000
 9   3,987  3,987  2,194   ,000
10   1,664  1,664  1,490  2,424   ,000

With this the text-based output ends and the graphical output follows. The first thing presented is the result of the multidimensional scaling in two dimensions:


Derived Stimulus Configuration
Euclidean distance model
[Scatter plot of the ten soft drinks plotted by their stimulus coordinates: Dimension 1 on the horizontal axis, Dimension 2 on the vertical axis.]

This plot gives a visual impression of the position of the soft drinks in relation to each other, based on the two respondents' evaluation. The two respondents' opinion of the ten soft drinks can be divided into six categories:
1. Orange lemonade (Sinalco, Bluna and Fanta)
2. Lemon lemonade (Sprite)
3. Cola-Fanta mix (Mezzo-Mix)
4. Branded cola (Coca-Cola and Pepsi)
5. Regional brands (Club-Cola and Afri-Cola)
6. Energy drinks (Red Bull)
It seems that the further to the right in the co-ordinate system an observation is located, the larger the caffeine content of the drink. The drinks to the left contain little or no caffeine.


The vertical dimension indicates facts about the brands of the drinks: the well-known brands are situated further south in the diagram, while the lesser-known brands are situated further north. The last scatter plot serves the purpose of showing how good the multidimensional scaling is.

Scatterplot of Linear Fit
Euclidean distance model
[Scatter plot of Distances (vertical axis) against Disparities (horizontal axis).]
This scatter plot shows the linear fit; here Disparities have been plotted against Distances. The closer the points lie to the sloping line through the origin, the larger the degree of explanation.


10. Conjoint Analysis


This chapter's explanation and interpretation of conjoint analysis in SPSS is based on the following literature:
- Analyse af markedsdata, Niels Blunch, 2. rev. edition 2000, Systime.
- Multidimensional skalering og conjoint analyse - et produktudviklingsværktøj, Internt undervisningsmateriale H nr. 151, chp. 3, p. 40-59, 68-73.
- Multivariate Data Analysis, Hair, Anderson and others, fifth edition, Prentice Hall.

10.1 Introduction
Conjoint analysis is a statistical method especially used when conducting market analyses. The analysis can help explain consumer preferences for (rankings of) existing or new products/product concepts. Aside from market analysis, conjoint analysis can also be used for a number of other analyses, such as cost-benefit analysis, segmentation, etc. For a more detailed explanation of these analyses please refer to the above-mentioned literature.
The purpose of the analysis in this chapter is to determine a company's optimal product design. This is done by studying how much a respondent is willing to surrender of one attribute in order to receive more of another. An important part of the conjoint analysis is the experiment plan, where each respondent ranks the presented product designs based on the combination of attributes. However, due to the number of attribute combinations this will often prove to be a difficult task. Therefore an easier method is available in SPSS: here the respondent is exposed to a smaller number of attribute combinations, and based on these combinations it is possible to determine the importance of each attribute.

10.2 Example
Data: Blank dataset (see example)
\\okf-filesrv1\exemp\Spss\Manual\Conjoint_orthogonaldesign.sav
\\okf-filesrv1\exemp\Spss\Manual\Conjointdata.sav

A company is interested in marketing their new carpet and furniture cleaning product. The management has identified five attributes which are believed to influence consumer preferences: package design (package), brand, price, seal type (seal), and money-back guarantee if unsatisfactory (money).


In the table below it is shown how many levels each attribute has.

Attribute    Level 1    Level 2    Level 3
Package      A          B          C
Brand        K2R        Glory      Bissell
Price        $1,19      $1,39      $1,59
Seal         No         Yes
Money        No         Yes

The company is now looking to perform a conjoint analysis in order to examine how much value the consumer assigns to the different levels of each attribute and which attributes the consumer finds most important. Therefore an analysis is conducted where 10 respondents rank a number of product concepts based on their preferences for each concept. However, a full factorial design would require the respondents to rank 3x3x3x2x2 = 108 different alternatives. In order not to confuse the respondents and to keep costs to a minimum, a simplified examination is therefore conducted at first. The entire example basically consists of two parts. First, the reduced number of combinations which the respondents have to assess is determined. Then, based on this result, the importance of each attribute is determined.

10.3 Generating an orthogonal design


SPSS can help you generate a simplified analysis plan; this way the 108 product concepts are reduced to a more reasonable number. In spite of being reduced, a large part of the informational value is still present 6. In order to start this procedure an empty data set is opened in SPSS. Below the full procedure is explained.

6 This function assumes that there are no interactions.


After Orthogonal Design and Generate have been activated (Data - Orthogonal Design - Generate), the following dialogue box appears; this is where the definition of the design is carried out.


In Factor Name you have to specify each variable for the orthogonal design independently. Furthermore, a label has to be specified for each variable, since these will be used on the generated plancards which will be shown to the respondents; e.g. the package variable is defined with the label package design. When this has been done, click Add, and the variable is transferred to the large area. This is done for all variables. In the Data File area you can either choose to save the new dataset on the computer or choose Replace working data file, which will make the generated data appear in the current dataset. After this the levels for each variable have to be specified: select each factor one at a time and click Define Values. Doing so activates the following box:

As can be seen from the figure, a value has to be specified for each level of the variable. The figure above shows the defined values for the variable Brand. It is also possible to specify labels for each of the levels; it is recommended that this is done. Furthermore, it is possible to make SPSS generate a number of levels using Auto-Fill; however, you still have to specify the labels for each level yourself. When this is done for all the variables, the actual design generation must be set up. This is done by activating Options in the main dialogue box.


10.3.1 Options

In the Options dialogue box the number of desired cases is specified. Since the company wishes to test 18 designs, the number 18 is stated in Minimum number of cases to generate; these 18 different designs are then picked from the possible 108. If nothing is specified, SPSS will automatically generate the minimum number of cases necessary to calculate the value of each of the attributes. The more different designs that are generated by SPSS and assessed by the respondents, the better the results will be; but at the same time the costs will also increase, and at some point there can be so many different designs that the validity of each respondent's answers will drop. Holdout Cases is set to 4, which means that 4 extra designs are generated. These will correspondingly be assessed by the respondents but not used for the estimation of utilities (the value of each of the attributes). The Holdout Cases are used to check the validity of the analysis.
The procedure is now ready to be started; therefore OK is clicked in the main dialogue box and the design will be generated.
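
The dialogue steps above can also be expressed in syntax. The following is a sketch using the factor names and levels from the example; in the classic syntax the generated plan is saved with a separate SAVE command:

ORTHOPLAN
  FACTORS=package 'Package design' ('A' 'B' 'C')
    brand 'Brand name' ('K2R' 'Glory' 'Bissell')
    price 'Price' (1.19 1.39 1.59)
    seal 'Good Housekeeping seal' ('No' 'Yes')
    money 'Money-back guarantee' ('No' 'Yes')
  /MINIMUM=18
  /HOLDOUT=4.
SAVE OUTFILE='\\okf-filesrv1\exemp\Spss\Manual\Conjoint_orthogonaldesign.sav'.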

10.4 PlanCards
The above implies that 22 plancards are generated, which the respondents later will have to assess. Furthermore, 2 so-called simulation cards are also added. These simulation cards can be either existing or expected products, which will then be assessed based on the results of the test. The simulation cards have to be created manually in SPSS along with the specification of the attributes. The resulting dataset can be found at: \\okf-filesrv1\exemp\spss\manual\conjoint_orthogonaldesign.sav


These plancards must then be moved to the output window. This procedure is shown below:

By doing this, the dialogue box shown below appears; here it is specified which variables are included in the plancards.

In this case we wish the respondents to assess all the different variables, which means that they are all selected. In addition, Listing for experimenter and Profiles for subjects are activated. Below, examples of two generated plancards are shown.


Profile Number 1 (Card ID 1):
  Package design: A
  Brand name: Glory
  Price: $1,39
  Good Housekeeping seal: Yes
  Money-back guarantee: No

Profile Number 2 (Card ID 2):
  Package design: B
  Brand name: K2R
  Price: $1,19
  Good Housekeeping seal: No
  Money-back guarantee: No
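
The display step can likewise be carried out in syntax. A sketch is shown below; /FORMAT=BOTH requests both the listing for the experimenter and the profiles for the subjects, though the exact subcommands generated by the dialogue may vary between SPSS versions:

PLANCARDS
  FACTORS=package brand price seal money
  /FORMAT=BOTH.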

10.5 Conjoint analysis


The generated plancards are now used in the analysis, where a number of respondents have to assess the different designs. The respondents can be asked to assess the designs in different ways. In this example each respondent has ranked all 22 (18+4) plancards from 1 to 22; the higher the rank value, the higher the respondent's preference for that combination. For this illustration 10 respondents have been used; however, this is clearly below what is normally recommended. The result can be seen here: \\okf-filesrv1\exemp\Spss\Manual\Conjointdata.sav


In SPSS it is not possible to perform a conjoint analysis via the normal point-and-click menus; it is therefore necessary to use the syntax. The syntax code for the conjoint analysis is shown below:
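
A sketch of the command is given here. The names of the preference variables (pref1 to pref22) and of the respondent identifier (id) are assumptions about Conjointdata.sav and must be adjusted to the actual variable names; /SCORE is used because a higher value means a higher preference:

CONJOINT PLAN='\\okf-filesrv1\exemp\Spss\Manual\Conjoint_orthogonaldesign.sav'
  /DATA='\\okf-filesrv1\exemp\Spss\Manual\Conjointdata.sav'
  /SCORE=pref1 TO pref22
  /SUBJECT=id
  /FACTORS=package brand (DISCRETE) price (LINEAR LESS)
    seal (LINEAR MORE) money (LINEAR MORE)
  /PRINT=ALL.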

Furthermore, in SPSS it is possible to specify a relationship between single variables and the respondents' rankings, if one is believed to be present. In the syntax above this is specified as e.g. discrete or linear; less/more specifies whether the relationship is believed to be negative or positive. Further explanation of the code is available in the SPSS help menu: choose Help\Topics and search for Conjoint, and the following command syntax will show:


In order to run the code from the syntax window, choose the Run menu and then All. SPSS then performs the conjoint analysis, and the result can be seen in the output window.

10.6 Output
In the output the results for each respondent are available, as well as an averaged result for all the respondents. The output of the averaged result is shown below in (1) Overall Statistics. Initially only the results for the first 2 respondents are shown; however, if the output window is expanded at the bottom, all results are available.
Importance Values

           Averaged Importance Score
package             35,635
brand               14,911
price               29,410
seal                11,172
money                8,872

The Averaged Importance Score column shows each attribute's importance to the respondents. The higher the score, the more important the specific attribute is to the consumers. Based on the output above it can be seen that the average respondent regards Package Design and Price as the most important attributes. After those comes Brand, then Good Housekeeping seal, and least important is Money-back Guarantee.
Utilities

                        Utility Estimate   Std. Error
package     A                -2,233           ,192
            B                 1,867           ,192
            C                  ,367           ,192
brand       K2R                ,367           ,192
            Glory             -,350           ,192
            Bissell           -,017           ,192
price       $1,19            -1,108           ,166
            $1,39            -2,217           ,332
            $1,59            -3,325           ,498
seal        No                2,000           ,287
            Yes               4,000           ,575
money       No                1,250           ,287
            Yes               2,500           ,575
(Constant)                    7,383           ,650

Coefficients

           B Estimate
price        -1,108
seal          2,000
money         1,250

Utility expresses the utility values for the levels of each attribute. These values are calculated in such a way that if they are added together, with respect to each of the attribute combinations, the product that receives the highest score will also be the product that has received the highest rank. An example of how to calculate the utility for a product concept is shown below (Package = B, Brand Name = K2R, Price = $1,19, Good Housekeeping seal = No and Money-back guarantee = No):

1,8667 + 0,3667 - 1,1083 + 2 + 1,25 = 4,3751

It is these calculated utility values the company can employ as a decision basis for the development of their new product. This, however, requires a closer study of the output, which will not be described here. Factor contains mini plots of the utility values for the levels of each attribute; they basically show the same as the plots found at the bottom of the output. Based on the Factor plots one can see that B is the preferred Package Design and that $1,19 is the preferred Price. These results must of course be seen in relation to the fact that the respondents gave the highest grades to the most preferred product combinations. This means that the higher the utility values, the better.


Correlations(a)

                              Value    Sig.
Pearson's R                    ,982    ,000
Kendall's tau                  ,892    ,000
Kendall's tau for Holdouts     ,667    ,087

a. Correlations between observed and estimated preferences

Pearson's R and Kendall's tau, which are given after each respondent's values, indicate how good the model is. Both are calculated as correlations between the respondents' observed and estimated preferences, and should therefore be as high as possible. By changing the syntax /PRINT=ALL to /PRINT=SUMMARYONLY the following output can be obtained, (2) Preference Probabilities of Simulations:

Preference Probabilities of Simulations(b)

Card Number   ID   Maximum Utility(a)   Bradley-Terry-Luce   Logit
     1        23         30,0%                43,1%          30,9%
     2        24         70,0%                56,9%          69,1%

a. Including tied simulations
b. y out of x subjects are used in the Bradley-Terry-Luce and Logit methods because these subjects have all nonnegative scores.

(2) Preference Probabilities of Simulations shows the probabilities of a certain profile being chosen as the preferred one, calculated with three different probability models. The two profiles assessed here are the simulation designs generated earlier; these could be either current or newly planned products. The three probability models applied are the maximum utility model, the BTL (Bradley-Terry-Luce) model and the logit model. The probabilities are calculated based on the two earlier mentioned simulation cards. It can be seen that all three models indicate that card number 24 is preferred.


11. Item analysis


The explanation and interpretation of item analysis presented in this section is based on the following literature:
- SPSS Base 10.0 User's Guide, SPSS. Ch. 34, page 407-411.

11.1 Introduction
The item analysis is often used for evaluating questions for questionnaires. The analysis is used to confirm the choice of a group of questions which, added together, are presumed to measure a predefined concept. The analysis can be used for several purposes. One purpose is to evaluate each question; in this connection the focus is on whether questions can be left out of an analysis, perhaps because they do not measure what they were intended to measure. Furthermore, it is possible to examine whether the answers should be turned around by making a so-called negation. For example, an answer with the value 1 (on a scale of 1 to 5) might be turned around and assigned the value 5 instead. This can be very important for the further analysis. The item analysis is often used before a factor analysis (see chapter 8) or AMOS models (see also Manual for AMOS 6.0 from ITA).

11.2 Example
Data set: \\okf-filesrv1\exemp\Spss\Manual\Itemanalysedata.sav

This example is based on the questionnaires from the lectures, which are filled out by the students at the Aarhus School of Business every semester. Each questionnaire contains 19 questions. The students are asked to evaluate the subject itself, the lecturer's pedagogical competence and their own performance. The data is the same as that used in the factor analysis (chapter 8). The answer to each single question is indicated on a Likert scale from 1 to 5 where:
1 is: No, not at all
3 is: Acceptable
5 is: Yes, absolutely


In this example of the item analysis, the purpose is to assess to what extent the first eight questions are relevant for the evaluation of the subject and to what extent questions can be left out. Furthermore it is examined whether it is relevant to make so-called negations on the eight questions.

11.3 Implementation of the analysis


In SPSS terminology item analysis is called Reliability Analysis (it is not possible to look for the word Item in the online help function). The following choices are made under Analyze (Analyze - Scale - Reliability Analysis):

This results in the following dialogue box:


In this dialogue box the following variables (questions) are chosen in the window to the left and moved to the right-hand side window by clicking on the arrow: relevant, svrt, tid_udb, spn_int, lit_egn, lit_insp, pensum, udbytte. There are now two places of specification: first a Model must be chosen, and subsequently it is possible to define the output under Statistics.

11.3.1 Model

Model gives the following choices of reliability models:
Alpha (Cronbach): A model of internal consistency, based on the average inter-item correlation.
Split-half: This model splits the scale into two parts and examines the correlation between the parts.
Guttman: This model computes Guttman's lower bounds for true reliability.
Parallel: This model assumes that all items have equal variances and equal error variances across replications.
Strict parallel: This model makes the assumptions of the parallel model and also assumes equal means across items.
In this example Alpha has been chosen as the model.

11.3.2 Statistics

By clicking the Statistics button, the following dialogue box appears. The boxes that are activated indicate the chosen outputs:


Descriptives for provides the opportunity to display the mean and the standard deviation for each item and for the scale. In this case Item and Scale have been chosen. Scale if item deleted indicates how much it is possible to raise the scale (Cronbach's alpha) if the item is removed from the analysis.

Inter-item provides the opportunity to display the correlation matrix and/or the covariance matrix between all items.

Summaries:
Means: Summary statistics for item means. The smallest, largest, and average item means, the range and variance of item means, and the ratio of the largest to the smallest item means are displayed.
Variances: The same as for means, only for the variances instead.
Covariances: The same as for means, only for the covariances between items.
Correlations: The same as for means, only for the correlations instead.

ANOVA Table:


Here it is possible to carry out a repeated-measurement analysis of variance under F-test. The Friedman chi-square test is appropriate for rank data, and the Cochran chi-square for dichotomous data. The following possibilities are also available:
Hotelling's T-square: A multivariate test of the null hypothesis that all items on the scale have the same mean.
Tukey's test of additivity: Estimates the power to which a scale must be raised to achieve additivity, and tests the assumption that there is no multiplicative interaction among the items.
Intraclass correlation coefficient: Produces single and average measure intraclass correlation coefficients, along with a confidence interval, F statistic, and significance value for each.
Model: Here the model for the computation of the intraclass correlation coefficients is specified. Select two-way mixed when people effects are random and item effects are fixed, two-way random when people effects and item effects are both random, and one-way random when people effects are random.
Type: Here the definition of the intraclass correlation coefficient is chosen.
Confidence interval: Here the desired confidence interval is indicated.
Test value: The value with which an estimate of the intraclass correlation coefficient is compared. The value should be between 0 and 1.
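
The dialogue choices made in this example correspond roughly to the following syntax (a sketch):

RELIABILITY
  /VARIABLES=relevant svrt tid_udb spn_int lit_egn lit_insp pensum udbytte
  /SCALE('ALL VARIABLES') ALL
  /MODEL=ALPHA
  /STATISTICS=DESCRIPTIVE SCALE CORR
  /SUMMARY=TOTAL.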

11.4 Output
Below, the relevant output is chosen and commented on. The first output is (1) Correlation matrix, which shows the correlation between the responses on the eight questions.
Inter-Item Correlation Matrix

           relevant   svrt  tid_udb  spn_int  lit_egn  lit_insp  pensum  udbytte
relevant     1,000   -,155    ,399     ,427     ,305      ,333    -,187    ,447
svrt         -,155   1,000   -,276    -,267    -,139     -,144     ,313   -,248
tid_udb       ,399   -,276   1,000     ,404     ,318      ,232    -,232    ,474
spn_int       ,427   -,267    ,404    1,000     ,346      ,438    -,057    ,443
lit_egn       ,305   -,139    ,318     ,346    1,000      ,547    -,214    ,447
lit_insp      ,333   -,144    ,232     ,438     ,547     1,000    -,086    ,332
pensum       -,187    ,313   -,232    -,057    -,214     -,086    1,000   -,193
udbytte       ,447   -,248    ,474     ,443     ,447      ,332    -,193   1,000


This output shows that svrt (the degree of difficulty) and pensum (the extent of the curriculum) correlate negatively with the remainder of the variables. For this reason a so-called negation is made, where the scale is turned around. This is done by creating two new variables (choose Transform - Compute): NY_SVRT = 6 - SVRT and NY_PENSUM = 6 - PENSUM.
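
In syntax form, the two negations can be computed as follows (a minimal sketch):

COMPUTE ny_svrt = 6 - svrt.
COMPUTE ny_pensum = 6 - pensum.
EXECUTE.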

The negation takes place by calculating the new variables as N + 1 - the former value, where N is the number of possible answers. After the new variables have been calculated, the analysis is run once more with the two new variables but without the variables svrt and pensum. The method is the same as previously described and will not be repeated here. The new (2) Correlation Matrix looks as follows:
Inter-Item Correlation Matrix

           relevant  tid_udb  spn_int  lit_egn  lit_insp  udbytte  ny_svrt  Ny_PENSUM
relevant     1,000     ,399     ,427     ,305      ,333     ,447     ,155      ,187
tid_udb       ,399    1,000     ,404     ,318      ,232     ,474     ,276      ,232
spn_int       ,427     ,404    1,000     ,346      ,438     ,443     ,267      ,057
lit_egn       ,305     ,318     ,346    1,000      ,547     ,447     ,139      ,214
lit_insp      ,333     ,232     ,438     ,547     1,000     ,332     ,144      ,086
udbytte       ,447     ,474     ,443     ,447      ,332    1,000     ,248      ,193
ny_svrt       ,155     ,276     ,267     ,139      ,144     ,248    1,000      ,313
Ny_PENSUM     ,187     ,232     ,057     ,214      ,086     ,193     ,313     1,000

The negative correlations are now positive. Then follows (3) Item-Total Statistics. This output shows Cronbach's alpha in case each item/question were removed.
Item-Total Statistics

            Scale Mean if   Scale Variance if   Corrected Item-     Squared Multiple   Cronbach's Alpha
            Item Deleted    Item Deleted        Total Correlation   Correlation        if Item Deleted
relevant       19,7429          12,020               ,511               ,311                ,741
tid_udb        20,0810          12,792               ,534               ,326                ,740
spn_int        19,9667          11,879               ,546               ,377                ,735
lit_egn        20,1048          12,104               ,526               ,399                ,739
lit_insp       20,6429          12,652               ,488               ,380                ,746
udbytte        20,1095          12,404               ,600               ,398                ,729
ny_svrt        21,2905          13,403               ,334               ,183                ,771
Ny_PENSUM      20,5286          13,772               ,275               ,167                ,780

Output (3) must be compared to Cronbach's alpha for the analysis, which is given in the output below, (4) Reliability Statistics.


Reliability Statistics

Cronbach's     Cronbach's Alpha Based      N of
Alpha          on Standardized Items      Items
  ,773                 ,774                  8

By comparing outputs (3) and (4) it can be seen that Cronbach's alpha climbs from 0.773 to 0.780 if NY_PENSUM is removed from the analysis. Hence, a decision should be made whether this question contributes to the analysis; the rule of thumb is that items should be removed if Cronbach's alpha increases. Furthermore, Cronbach's alpha is in this case higher than 0.7, which it should be. It should also be noted that Cronbach's alpha increases with the number of items.
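
For reference, Cronbach's alpha is calculated from the standard formula (not shown in the manual):

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_{i}^{2}}{\sigma_{X}^{2}}\right)

where k is the number of items, \sigma_{i}^{2} is the variance of item i, and \sigma_{X}^{2} is the variance of the total scale. For a fixed average inter-item correlation, alpha increases as k grows, which explains the last remark above.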


12. Introduction to Matrix algebra


The purpose of this introduction to matrix algebra in SPSS is to provide the initial insight needed to become familiar with the programming language. The matrix programming language contains many twists and turns, and you will therefore not become an expert by following this introduction alone. The introduction primarily covers matrix-algebraic operations and is strongly inspired by the Cand.merc. subjects Applied Econometric Methods and Applied Quantitative Methods. The complexities encountered when using interactive matrix programming are NOT associated with identifying and understanding the applied syntax commands, since this is relatively simple. On the other hand, it can be quite difficult to gain an overview of the vast possibilities that the program provides access to. Often it is not difficult, through theoretical considerations, to determine which matrices one desires to compute, but it can be a very problematic task to determine which algebraic operators and functions are needed to achieve the desired computation. Even so, all these difficulties become smaller and smaller with repeated use of the program.

12.1 Initialization
In order to work with matrix algebra in SPSS it is necessary to work in the syntax environment. This is done by opening a syntax file under the menu FILE - NEW - SYNTAX. It is then possible to initialize the recognition of matrix programming code by writing the following syntax command: MATRIX.

It is important to remember that matrix programming code must be terminated with the following syntax statement: END MATRIX.

In other words, the matrix algebra must be written between these two commands.


12.2 Importing variables into matrices


Often it will be necessary for you to import an existing dataset into matrices. For this purpose you can use the following syntax:

GET X /VARIABLES = Variable1, Variable2 /NAMES = VARNAMES .

Above, SPSS reads in all the observations from the open dataset into matrix X. The variable selection is carried out by listing the variable names after the equal sign, separating the variable names by commas. The selected variables will hence be imported as column vectors into the matrix X.

12.3 Matrix computation


The calculations in the matrix language can be carried out on common scalar values, or it is possible to capitalize on the advantages of matrix operators to save time and space. The matrix language establishes scalar values as matrices of dimension 1x1. The establishment of a scalar value is carried out by using the following command:

Compute A = 2.

Establishing a row vector:

Compute B = {1,2, 3}.

Establishing a column vector:


Compute C = {4;5;6}.

Finally, the establishment of matrices:

Compute D = {1,2,3; 4,5,6; 7,8,9}.


Or for an improved overview:

Compute E ={4,2,3; 6,8,6; 1,3,9}.

If you are interested in seeing the matrices that have been computed you can use the following code:

Print E.

This prints the matrix E to the output. One must be aware that it is not possible to list several matrices in the Print command unless the matrices have an equal number of rows or columns AND you employ brackets to embrace the matrices, separated by commas (equal number of rows, horizontal merging) or semicolons (equal number of columns, vertical merging), as shown here:

Print {B;E}.

This prints a matrix composed of the two temporarily merged matrices. To summarize, the following code and corresponding output exemplify the above-mentioned syntax commands:

Example 1 Full syntax file.


MATRIX.
Compute A = 2.
Compute B = {1,2,3}.
Compute C = {4;5;6}.
Compute D = {1,2,3; 4,5,6; 7,8,9}.
Compute E = {4,2,3; 6,8,6; 1,3,9}.
Print A.
Print B.
Print C.
Print {A,B}.
Print {B;D}.
END MATRIX.


Example 1 Output

12.4 General Matrix operations


The following table provides an overview of the different operators that are applied in order to carry out matrix operations.

Symbol   Operator
-        Changes the sign of the matrix. A minus sign alone in front of a matrix will invert the sign of each element in the matrix.
+        Matrix addition.
-        Matrix subtraction.
*        Multiplication.
/        Division.
**       Matrix exponentiation. The matrix is multiplied by itself as many times as the number it is raised to the power of.
&*       Elementwise multiplication.
&/       Elementwise division.
&**      Elementwise exponentiation.


Scalar values and matrices can hence be combined based on conventional arithmetic operations. This can be seen by adding the following syntax to the aforementioned matrix syntax (remember that the code must be placed between the MATRIX. and END MATRIX. statements):

Compute A = B+4.
Compute F = B-4.
Compute G = C*4.
Compute H = D/4.
Print A.
Print F.
Print G.
Print H.

By running (RUN - ALL) the new syntax code it is possible to get the following output:
A
   5   6   7

F
  -3  -2  -1

G
  16
  20
  24

H
  ,250000000   ,500000000   ,750000000
 1,000000000  1,250000000  1,500000000
 1,750000000  2,000000000  2,250000000

Note that the matrix A no longer has the value 2. Two matrices can only be added if they are of the same order:


Compute I = D+E.
Print I.

By adding this code you will add the following to your new output document:

I
   5   4   6
  10  13  12
   8  11  18

The same rules apply for subtraction. As seen in the matrix operator table, it is possible to change the sign of the elements in a matrix by using a minus sign, as seen here:

Compute J = -I.

Multiplication can only be carried out as long as the matrices conform.

Compute K = B*D.
Print K.

In this case the dimensions of B are 1x3 while D is 3x3, meaning that K will be a 1x3 matrix:

K
  30  36  42

The joining of two matrices can be done either horizontally or vertically, but once again it is required that the matrices conform to each other.

Compute L={D,C}.   /* Horizontal: a 3x3 matrix merged with a 3x1 column vector. */
Print L.


Compute M={D;B}.   /* Vertical: a 3x3 matrix merged with a 1x3 row vector. */
Print M.

The results of these mergers are:

L
  1  2  3  4
  4  5  6  5
  7  8  9  6

M
  1  2  3
  4  5  6
  7  8  9
  1  2  3

It is also possible to compute matrix values by using the index operator (:). It works as follows:

Compute N={9:4}.
Print N.

The result is:

N
  9  8  7  6  5  4


It is also possible to generate arithmetic sequences by using the following type of syntax:

Compute O = {2:14:3}.
Print O.

In this case a matrix O is computed. Its first value is 2, after which each following value increases by 3 until the last value reaches 14.

O
  2  5  8  11  14

Finally, it is also possible to transpose a matrix or a vector by using the following syntax:

Compute P=T(K).
Print P.

In this case a new matrix, P, is computed which corresponds to the transposed K matrix. The output resulting from adding the above syntax is:

P
  30
  36
  42

The use of different matrix operations will now be exemplified through the use of a more comprehensive example.

EXAMPLE 2


The Simpson family has decided to keep accounts of the expenses for their meals, so they can track the daily expenses for each person. The family is composed of four people: Homer, Marge, Bart and Lisa. The expenses are registered in three categories: Breakfast, Lunch and Dinner. The accounts for the past week can be seen in the following table:

ACCOUNTS    BREAKFAST    LUNCH    DINNER
Homer         50.50      45.00     67.25
Marge         45.00      40.25     88.75
Bart          30.00      30.25     30.75
Lisa          20.00      40.00     80.50

The Simpson family has chosen to keep track of their expenses using SPSS matrix algebra.

1.part
MATRIX.
COMPUTE Accounts = {50.5, 45, 67.25; 45, 40.25, 88.75; 30, 30.25, 30.75; 20, 40, 80.5}.
COMPUTE names = {"Homer";"Marge";"Bart";"Lisa"}.
COMPUTE meal = {"Breakfast","Lunch","Dinner"}.
PRINT Accounts
 /TITLE = "The Simpson Family"
 /FORMAT = F12.2
 /RNAMES = names
 /CNAMES = meal.
END MATRIX.

This provides the following output:


The Simpson Family
          Breakfast     Lunch    Dinner
Homer       50,50       45,00     67,25
Marge       45,00       40,25     88,75
Bart        30,00       30,25     30,75
Lisa        20,00       40,00     80,50

The family agrees to compute their expenses without VAT, and therefore adds the following syntax within the matrix syntax:

2.part
COMPUTE vat = 0.22.
COMPUTE Accounts = Accounts/(1+vat).
PRINT Accounts
 /TITLE "The Simpson Family, Without VAT"
 /FORMAT = F12.2
 /RNAMES = names
 /CNAMES = meal.

This results in the following output:

The Simpson Family, Without VAT
          Breakfast     Lunch    Dinner
Homer       41,39       36,89     55,12
Marge       36,89       32,99     72,75
Bart        24,59       24,80     25,20
Lisa        16,39       32,79     65,98

When the week is over the accounts are totaled for each person.


3.part
COMPUTE unit={1;1;1}.
COMPUTE pers_tot = Accounts*unit.
PRINT pers_tot
 /TITLE "Total each person"
 /FORMAT = F12.2
 /RNAMES = names.

This results in the following table:

Total each person
Homer     133,40
Marge     142,62
Bart       74,59
Lisa      115,16

In the meantime the Simpson family has received a guest for the holidays: Milhouse, a poor student, will also be entering into the family accounts. Milhouse is exempt from VAT!

4.part
COMPUTE guest = {"Milhouse"}.
COMPUTE newnames = {names; guest}.
COMPUTE mexpense = {80, 120, 150}.
COMPUTE holiday = {Accounts; mexpense}.
PRINT holiday
 /TITLE "Holiday Accounts"
 /FORMAT = F12.2
 /RNAMES = newnames
 /CNAMES = meal.

Below is the output that results from adding the above syntax:

Holiday Accounts
           Breakfast     Lunch    Dinner
Homer        41,39       36,89     55,12
Marge        36,89       32,99     72,75
Bart         24,59       24,80     25,20
Lisa         16,39       32,79     65,98
Milhouse     80,00      120,00    150,00


12.5 Matrix algebraic functions


Function                    Use
DET(squarematrix)           Returns the determinant of a square matrix.
MDIAG(argument)             Returns a diagonal matrix. The argument is a vector or a square matrix.
EVAL(symmetricmatrix)       Returns a column vector containing the eigenvalues in descending order.
IDENT(dimension)            Computes an identity matrix.
INV(matrix)                 Computes the inverse of a matrix.
MAKE(n,m,value)             Computes an n x m matrix of identical values.
MSSQ(matrix)                Computes the squared sum of all elements.
CSUM(matrix)                Computes the sum of all column elements.
TRACE(matrix)               Returns the sum of the diagonal elements.
NCOL(matrix)                Returns a scalar containing the number of columns in a matrix.
NROW(matrix)                Returns a scalar containing the number of rows in a matrix.
SOLVE(Matrix1,Matrix2)      Solves systems of linear equations. If M1*X=M2, then X = SOLVE(M1,M2).
DIAG(matrix)                Computes a column matrix containing the elements of the main diagonal.
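
As a quick illustration of a few of these functions, the following sketch can be run on its own (the matrix Q is chosen here so that it is non-singular):

MATRIX.
Compute Q = {1,2,3; 4,5,6; 7,8,10}.   /* A non-singular 3x3 matrix */
Compute detQ = DET(Q).                /* Determinant */
Compute invQ = INV(Q).                /* Inverse */
Compute trQ = TRACE(Q).               /* Sum of the diagonal elements */
Print detQ /TITLE = "Determinant of Q".
Print invQ /TITLE = "Inverse of Q" /FORMAT = F8.3.
Print trQ /TITLE = "Trace of Q".
END MATRIX.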

Below, the use of some of the above functions is demonstrated in a comprehensive example. The example carries out a regression analysis. The theory used is, where deemed necessary, included as comments in the syntax. The example can be run by entering the code into a syntax window (FILE - NEW - SYNTAX) and then using the menu point Run All (RUN - ALL).


EXAMPLE 3 - Regression analysis, using the dataset found here: \\okf-filesrv1\exemp\Spss\Ha manual\het-km-alder.sav

MATRIX.                                  /* Start recognition of matrix commands in SPSS syntax */
GET y /VARIABLES = km                    /* Import the dependent variable from the open dataset; the variable's name is in this case km */
 /NAMES = varnamey.                      /* Row vector with variable name */
GET x /VARIABLES = alder                 /* Import the independent variables; in this case alder. With more than one variable, separate the variable names with commas */
 /NAMES = varnamex.                      /* Row vector with variable names */
Compute k = ncol(x).                     /* Problem dimension established */
Compute n = nrow(x).
Compute varnames = t({"Constant", varnamex}).   /* Create a name column vector */
Compute con = make(n,1,1).               /* Create a column vector to compute the constant */
Compute x = {con,x}.                     /* Merge the constant with the independent variables */
Compute df2 = n-k-1.                     /* Degrees of freedom established */
Compute invxtx = inv(t(x)*x).
Compute b = invxtx*(t(x)*y).             /* OLS estimation: b = (x'x)^-1 * x'y */
Compute j = nrow(b).                     /* Number of estimated parameters */
Compute yhat = x*b.                      /* Estimated value of y, called yhat */
Compute resid = y-yhat.                  /* Hence calculate the residuals */
Compute sse = cssq(resid).               /* The squared error (sum of e_i^2), an estimate of the unexplained variation */
Compute mse = sse/df2.                   /* Mean Square Error */
Compute tss = cssq(y-(csum(y)/n)).       /* Total variation, TSS */
Compute r2 = (tss-sse)/tss.              /* Explanatory strength, R2 */
Compute covb = invxtx&*mse.              /* Variance-covariance matrix */
Compute stdeb = sqrt(diag(covb)).        /* Column vector with standard errors of the parameters */
Compute tobs = b/stdeb.                  /* Calculate t-values, t = (b-beta)/se */
Compute f = ((tss-sse)/k)/mse.
Compute pf = 1-fcdf(f,k,df2).            /* Lookup in the F-distribution */
Compute s = sqrt(mse).                   /* Regression standard error */
Compute p = 2*(1-tcdf(abs(tobs),n-j)).   /* p-values for parameter estimates */
Compute pf = {r2,f,k,df2,pf}.            /* Result row vector */
Compute oput = {b,stdeb,tobs,p}.         /* Collecting estimates in output matrix */
Print varnamey /title = "Dependent Variable" /format A8.   /* Print the name of the dependent variable */
Print sse /TITLE = "SSE" /FORMAT = F12.3.                  /* Print the measure of the unexplained part */
Print s /TITLE = "Regression Standard Error (RES)" /FORMAT = F12.3.
Print pf /title = "Model Fit:" /clabels = "R-sq" "F" "df1" "df2" "p" /format F10.4.
Print oput /title = 'Regression Results' /clabels = "B(OLS)" "SE(OLS)" "t" "P>|t|" /rnames = varnames /format F10.4.
END MATRIX.                              /* Stop recognition of the matrix language in SPSS syntax */
EXECUTE.

Try to carry out an interpretation of the results of this regression analysis. It is especially instructive to follow the dimensions of the different matrices while the regression analysis is being carried out. It is possible to find more matrix-algebraic functions by using the SPSS help function: choose Help\Topics and search for Matrix functions under Index:


APPENDIX A [Choice of Statistical Methods]


[Flow chart: choice of statistical method. The starting question is "Is any of the variables dependent on other variables?". If no, the choice depends on whether the analysis concerns variables (Factor Analysis), observations (Cluster Analysis), or both (Correspondence Analysis / Multidimensional Scaling). If yes, the choice depends on the number of dependent variables and on the measurement scales (nominal, ordinal, interval) of the dependent and independent variables; the methods in the chart are Regression Analysis, Regression w/dummy, Discriminant Analysis, Logit Analysis, Analysis of Variance, Log-linear Analysis, Correspondence Analysis, Canonical Analysis, Conjoint Analysis and, for several interval-scaled dependent variables, MANOVA/GLM.]

Source: Niels J. Blunch: Analyse af markedsdata, 2. rev. udg. 2000, Systime.


Literature List

Blunch, Niels J. (2000): Analyse af markedsdata, 2. rev. udgave, Systime.

_____________ (2000): Videregående data-analyse med SPSS og AMOS, 1. udgave, Systime.

_____________ (1993): Nogle teknikker til analyse af kvalitative data, Internt undervisningsmateriale E nr. 19, Handelshøjskolen i Århus.

Greenacre, Michael J. (1984): Theory and Applications of Correspondence Analysis, Academic Press Inc.

Hair, Anderson et al. (1998): Multivariate Data Analysis, 5th edition, Prentice Hall.

Johnson, Richard A.; Wichern, Dean W. (1998): Applied Multivariate Statistical Analysis, 4th edition, Prentice Hall.

Juhl, Hans Jørn (1992): Multidimensional skalering og conjoint analyse - et produktudviklingsværktøj. Internt undervisningsmateriale H nr. 151, Institut for informationsbehandling, Handelshøjskolen i Århus.

Kappelhoff, Peter (2001): Multidimensionale Skalierung - Beispieldatei zur Datenanalyse. Lehrstuhl für empirische Wirtschafts- und Sozialforschung, Fachbereich Wirtschaftswissenschaft, (BUGH) Bergische Universität Gesamthochschule Wuppertal.

Malhotra, Naresh K. (1999): Marketing Research - An Applied Orientation, 3rd edition, Prentice Hall.

Meulman, Jacqueline J.; Heiser, Willem J. (1999): SPSS Categories 10.0, SPSS Inc.

___________________________________ (1999): SPSS Advanced Models 10.0, SPSS Inc.

___________________________________ (1999): SPSS Base 10.0 Applications Guide, SPSS Inc.

___________________________________ (1999): SPSS Base 10.0 User's Guide, SPSS Inc.

