
SW388R7 Data Analysis & Computers II Slide 1

Discriminant Analysis: Basic Relationships

Discriminant Functions and Scores
Describing Relationships
Classification Accuracy
Sample Problems

SW388R7 Data Analysis & Computers II Slide 2

Discriminant analysis

Discriminant analysis is used to analyze relationships between a non-metric dependent variable and metric or dichotomous independent variables. Discriminant analysis attempts to use the independent variables to distinguish among the groups or categories of the dependent variable. The usefulness of a discriminant model is based upon its accuracy rate, or ability to predict the known group memberships in the categories of the dependent variable.

SW388R7 Data Analysis & Computers II Slide 3

Discriminant scores

Discriminant analysis works by creating a new variable called the discriminant function score which is used to predict to which group a case belongs. Discriminant function scores are computed similarly to factor scores, i.e. using eigenvalues. The computations find the coefficients for the independent variables that maximize the measure of distance between the groups defined by the dependent variable. The discriminant function is similar to a regression equation in which the independent variables are multiplied by coefficients and summed to produce a score.
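
As a rough illustration of the last point, the sketch below computes a discriminant score as a weighted sum of independent-variable values plus a constant. The coefficients, the constant, and the case values are hypothetical numbers chosen for illustration; they are not taken from any SPSS output in these slides.

```python
def discriminant_score(values, coefficients, constant=0.0):
    """Return the constant plus the sum of coefficient * value for each variable."""
    if len(values) != len(coefficients):
        raise ValueError("need one coefficient per independent variable")
    return constant + sum(b * x for b, x in zip(coefficients, values))


# Hypothetical coefficients and one hypothetical case (not from the slides).
coefficients = [0.05, -0.10]
constant = -1.0
case = [40.0, 12.0]

score = discriminant_score(case, coefficients, constant)
print(round(score, 3))
```

In an actual analysis the coefficients would come from the statistical package rather than being typed in by hand.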

SW388R7 Data Analysis & Computers II Slide 4

Discriminant functions

Conceptually, we can think of the discriminant function or equation as defining the boundary between groups. Discriminant scores are standardized, so that if the score falls on one side of the boundary (standard score less than zero), the case is predicted to be a member of one group, and if the score falls on the other side of the boundary (positive standard score), it is predicted to be a member of the other group.

SW388R7 Data Analysis & Computers II Slide 5

Number of functions

If the dependent variable defines two groups, one statistically significant discriminant function is required to distinguish the groups; if the dependent variable defines three groups, two statistically significant discriminant functions are required to distinguish among the three groups; etc. If a discriminant function is able to distinguish among groups, it must have a strong relationship to at least one of the independent variables.

The number of possible discriminant functions in an analysis is limited to the smaller of the number of independent variables or one less than the number of groups defined by the dependent variable.
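
The limit described above can be written as a one-line rule. The function name below is made up for this sketch.

```python
def max_discriminant_functions(n_independent_vars, n_groups):
    """Smaller of the number of IVs and (number of groups - 1)."""
    if n_independent_vars < 1 or n_groups < 2:
        raise ValueError("need at least 1 independent variable and 2 groups")
    return min(n_independent_vars, n_groups - 1)


# Four independent variables with two, three, and six groups.
print(max_discriminant_functions(4, 2))
print(max_discriminant_functions(4, 3))
print(max_discriminant_functions(4, 6))
```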

SW388R7 Data Analysis & Computers II Slide 6

Overall test of relationship

The overall test of relationship among the independent variables and groups defined by the dependent variable is a series of tests that each of the functions needed to distinguish among the groups is statistically significant. In some analyses, we might discover that two or more of the groups defined by the dependent variable cannot be distinguished using the available independent variables. While it is reasonable to interpret a solution in which there are fewer significant discriminant functions than the maximum number possible, our problems will require that all of the possible discriminant functions be significant.

SW388R7 Data Analysis & Computers II Slide 7

Interpreting the relationship between independent and dependent variables

The interpretative statement about the relationship between the independent variable and the dependent variable is a statement like: cases in group A tended to have higher scores on variable X than cases in group B or group C. This interpretation is complicated by the fact that the relationship is not direct, but operates through the discriminant function. Dependent variable groups are distinguished by scores on discriminant functions, not on values of independent variables. The scores on functions are based on the values of the independent variables that are multiplied by the function coefficients.

SW388R7 Data Analysis & Computers II Slide 8

Groups, functions, and variables

To interpret the relationship between an independent variable and the dependent variable, we must first identify how the discriminant functions separate the groups, and then what the role of the independent variable is for each function. SPSS provides a table called "Functions at Group Centroids" (multivariate means) that indicates which groups are separated by which functions. SPSS provides another table called the "Structure Matrix" which, like its counterpart in factor analysis, identifies the loading, or correlation, between each independent variable and each function. This tells us which variables to interpret for each function. Each variable is interpreted on the function that it loads most highly on.

SW388R7 Data Analysis & Computers II Slide 9

Functions at Group Centroids


In order to specify the role that each independent variable plays in predicting group membership on the dependent variable, we must link together the relationship between the discriminant functions and the groups defined by the dependent variable, the role of the significant independent variables in the discriminant functions, and the differences in group means for each of the variables. Function 2 separates survey respondents who thought we spend too little money on welfare (positive value of 0.235) from survey respondents who thought we spend too much money (negative value of -0.362) on welfare. We ignore the second group (-0.031) in this comparison because it was distinguished from the other two groups by function 1.

Functions at Group Centroids

             Function
WELFARE      1        2
1          -.220     .235
2           .446    -.031
3          -.311    -.362

Unstandardized canonical discriminant functions evaluated at group means

Function 1 separates survey respondents who thought we spend about the right amount of money on welfare (the positive value of 0.446) from survey respondents who thought we spend too much (negative value of -0.311) or too little money (negative value of -0.220) on welfare.

SW388R7 Data Analysis & Computers II Slide 10

Structure Matrix
We do not interpret loadings in the structure matrix unless they are 0.30 or higher.

Based on the structure matrix, the predictor variables strongly associated with discriminant function 1, which distinguished between survey respondents who thought we spend about the right amount of money on welfare and survey respondents who thought we spend too much or too little money on welfare, were number of hours worked in the past week (r=-0.582) and highest year of school completed (r=0.687).

Based on the structure matrix, the predictor variable strongly associated with discriminant function 2, which distinguished between survey respondents who thought we spend too little money on welfare and survey respondents who thought we spend too much money on welfare, was self-employment (r=0.889).

Structure Matrix

                                      Function
                                      1        2
HIGHEST YEAR OF SCHOOL COMPLETED     .687*    .136
NUMBER OF HOURS WORKED LAST WEEK    -.582*    .345
R SELF-EMP OR WORKS FOR SOMEBODY     .223     .889*
RESPONDENTS INCOME (a)               .101     .292*

Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions
Variables ordered by absolute size of correlation within function.
*. Largest absolute correlation between each variable and any discriminant function
a. This variable not used in the analysis.

SW388R7 Data Analysis & Computers II Slide 11

Group Statistics

The average number of hours worked in the past week for survey respondents who thought we spend about the right amount of money on welfare (mean=37.90) was lower than the average number of hours worked in the past week for survey respondents who thought we spend too little money on welfare (mean=43.96) and survey respondents who thought we spend too much money on welfare (mean=42.03).

This enables us to make the statement: "survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too much or too little money on welfare."

Group Statistics

                                                                Valid N (listwise)
WELFARE / Variable                      Mean   Std. Deviation   Unweighted   Weighted
1 TOO LITTLE
  NUMBER OF HOURS WORKED LAST WEEK     43.96       13.240           56         56.000
  HIGHEST YEAR OF SCHOOL COMPLETED     13.73        2.401           56         56.000
  R SELF-EMP OR WORKS FOR SOMEBODY      1.93         .260           56         56.000
  RESPONDENTS INCOME                   13.70        5.034           56         56.000
2 ABOUT RIGHT
  NUMBER OF HOURS WORKED LAST WEEK     37.90       13.235           50         50.000
  HIGHEST YEAR OF SCHOOL COMPLETED     14.78        2.558           50         50.000
  R SELF-EMP OR WORKS FOR SOMEBODY      1.90         .303           50         50.000
  RESPONDENTS INCOME                   14.00        5.503           50         50.000
3 TOO MUCH
  NUMBER OF HOURS WORKED LAST WEEK     42.03       10.456           32         32.000
  HIGHEST YEAR OF SCHOOL COMPLETED     13.38        2.524           32         32.000
  R SELF-EMP OR WORKS FOR SOMEBODY      1.75         .440           32         32.000
  RESPONDENTS INCOME                   14.75        5.304           32         32.000
Total
  NUMBER OF HOURS WORKED LAST WEEK     41.32       12.846          138        138.000
  HIGHEST YEAR OF SCHOOL COMPLETED     14.03        2.537          138        138.000

SW388R7 Data Analysis & Computers II Slide 12

Which independent variables to interpret

In a simultaneous discriminant analysis, in which all independent variables are entered together, we only interpret the relationships for independent variables that have a loading of 0.30 or higher on one or more discriminant functions. A variable can have a high loading on more than one function, which complicates the interpretation. We will interpret the variable for the function on which it has the highest loading. In a stepwise discriminant analysis, we limit the interpretation of relationships between independent variables and groups defined by the dependent variable to those independent variables that met the statistical test for inclusion in the analysis.
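
The rule for a simultaneous analysis can be sketched as follows. The loadings are the structure matrix values from the welfare example on Slide 10 for the three variables used in that analysis; the short variable names and the function name are invented for this sketch.

```python
def variables_to_interpret(loadings, cutoff=0.30):
    """Map each variable to the function (1-based) with its largest absolute
    loading, keeping only variables whose largest loading reaches the cutoff."""
    result = {}
    for name, row in loadings.items():
        best = max(range(len(row)), key=lambda i: abs(row[i]))
        if abs(row[best]) >= cutoff:
            result[name] = best + 1
    return result


# Loadings on function 1 and function 2 (welfare example, Slide 10).
loadings = {
    "educ": [0.687, 0.136],
    "hours_worked": [-0.582, 0.345],
    "self_employed": [0.223, 0.889],
}

assignments = variables_to_interpret(loadings)
print(assignments)
```

Note that hours worked also loads above 0.30 on function 2, but it is interpreted on function 1, where its absolute loading is larger.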

SW388R7 Data Analysis & Computers II Slide 13

Discriminant analysis and classification

Discriminant analysis consists of two stages: in the first stage, the discriminant functions are derived; in the second stage, the discriminant functions are used to classify the cases. While discriminant analysis does compute correlation measures to estimate the strength of the relationship, these correlations measure the relationship between the independent variables and the discriminant scores. A more useful measure to assess the utility of a discriminant model is classification accuracy, which compares predicted group membership based on the discriminant model to the actual, known group membership which is the value for the dependent variable.

SW388R7 Data Analysis & Computers II Slide 14

Evaluating usefulness for discriminant models

The benchmark that we will use to characterize a discriminant model as useful is a 25% improvement over the rate of accuracy achievable by chance alone. Even if the independent variables had no relationship to the groups defined by the dependent variable, we would still expect to be correct in our predictions of group membership some percentage of the time. This is referred to as by chance accuracy.

The estimate of by chance accuracy that we will use is the proportional by chance accuracy rate, computed by squaring the proportion of cases in each group and summing the squared proportions.

SW388R7 Data Analysis & Computers II Slide 15

Comparing accuracy rates

To characterize our model as useful, we compare the cross-validated accuracy rate produced by SPSS to 25% more than the proportional by chance accuracy. The cross-validated accuracy rate is a one-at-a-time hold-out method that classifies each case based on a discriminant solution for all of the other cases in the analysis. It is a more realistic estimate of the accuracy rate we should expect in the population because discriminant analysis inflates accuracy rates when the cases classified are the same cases used to derive the discriminant functions. Cross-validated accuracy rates are not produced by SPSS when separate covariance matrices are used in the classification, which we address further next week.

SW388R7 Data Analysis & Computers II Slide 16

Computing by chance accuracy

The proportion of cases in each group defined by the dependent variable is reported in the table "Prior Probabilities for Groups."

Prior Probabilities for Groups

                          Cases Used in Analysis
WELFARE           Prior   Unweighted   Weighted
1 TOO LITTLE       .406       56         56.000
2 ABOUT RIGHT      .362       50         50.000
3 TOO MUCH         .232       32         32.000
Total             1.000      138        138.000

The proportional by chance accuracy rate was computed by squaring and summing the proportion of cases in each group from the table of prior probabilities for groups (0.406² + 0.362² + 0.232² = 0.350). A 25% increase over this would require that our cross-validated accuracy be 43.7% (1.25 x 35.0% = 43.7%).
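
The arithmetic above can be checked with a short sketch. It uses the unweighted group sizes from the prior probabilities table (56, 50, and 32 cases); the function name is invented for illustration.

```python
def proportional_chance_accuracy(group_sizes):
    """Sum of squared group proportions."""
    total = sum(group_sizes)
    return sum((n / total) ** 2 for n in group_sizes)


group_sizes = [56, 50, 32]  # TOO LITTLE, ABOUT RIGHT, TOO MUCH
chance = proportional_chance_accuracy(group_sizes)
criterion = 1.25 * chance  # 25% improvement over chance

print(round(chance, 3), round(criterion, 3))
```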

SW388R7 Data Analysis & Computers II Slide 17

Comparing the cross-validated accuracy rate


SPSS reports the cross-validated accuracy rate in the footnotes to the table "Classification Results." The cross-validated accuracy rate computed by SPSS was 50.0%, which was greater than or equal to the proportional by chance accuracy criteria of 43.7%.

Classification Results (b, c)

                                              Predicted Group Membership
                          WELFARE           1 TOO LITTLE  2 ABOUT RIGHT  3 TOO MUCH   Total
Original            Count 1 TOO LITTLE           43            15             6          64
                          2 ABOUT RIGHT          26            30             6          62
                          3 TOO MUCH             17            10             9          36
                          Ungrouped cases         3             3             2           8
                    %     1 TOO LITTLE         67.2          23.4           9.4       100.0
                          2 ABOUT RIGHT        41.9          48.4           9.7       100.0
                          3 TOO MUCH           47.2          27.8          25.0       100.0
                          Ungrouped cases      37.5          37.5          25.0       100.0
Cross-validated (a) Count 1 TOO LITTLE           43            15             6          64
                          2 ABOUT RIGHT          26            30             6          62
                          3 TOO MUCH             17            11             8          36
                    %     1 TOO LITTLE         67.2          23.4           9.4       100.0
                          2 ABOUT RIGHT        41.9          48.4           9.7       100.0
                          3 TOO MUCH           47.2          30.6          22.2       100.0

a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
b. 50.6% of original grouped cases correctly classified.
c. 50.0% of cross-validated grouped cases correctly classified.
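
The accuracy rates in the footnotes can be reproduced from the counts in the classification table: the correctly classified cases lie on the diagonal. The sketch below uses the grouped-case counts from this slide (ungrouped cases are left out) and the 43.7% criterion from Slide 16; the function name is hypothetical.

```python
def accuracy_rate(table):
    """Share of cases on the diagonal of an actual-by-predicted count table."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return correct / total


# Rows are actual groups, columns are predicted groups.
original = [[43, 15, 6], [26, 30, 6], [17, 10, 9]]
cross_validated = [[43, 15, 6], [26, 30, 6], [17, 11, 8]]

original_rate = accuracy_rate(original)
cross_validated_rate = accuracy_rate(cross_validated)
is_useful = cross_validated_rate >= 0.437

print(round(original_rate, 3), round(cross_validated_rate, 3), is_useful)
```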

SW388R7 Data Analysis & Computers II Slide 18

Problem 1
1. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, violation of assumptions, or outliers. Use a level of significance of 0.05 for evaluating the statistical relationship. The variables "age" [age], "highest year of school completed" [educ], "sex" [sex], and "income" [rincom98] are useful in distinguishing between groups based on responses to "seen x-rated movie in last year" [xmovie]. These predictors differentiate survey respondents who had seen an x-rated movie in the last year from survey respondents who had not seen an x-rated movie in the last year.

Survey respondents who had seen an x-rated movie in the last year were younger than survey respondents who had not seen an x-rated movie in the last year. Survey respondents who had seen an x-rated movie in the last year were more likely to be male than survey respondents who had not seen an x-rated movie in the last year.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

SW388R7 Data Analysis & Computers II Slide 19

Dissecting problem 1 - 1
For these problems, we will assume that there is no problem with missing data, violation of assumptions, or outliers.

In this problem, we are told to use 0.05 as alpha for the discriminant analysis.

SW388R7 Data Analysis & Computers II Slide 20

Dissecting problem 1 - 2
The variables listed first in the problem statement are the independent variables (IVs): "age" [age], "highest year of school completed" [educ], "sex" [sex], and "income" [rincom98].

The variable used to define groups is the dependent variable (DV): "seen x-rated movie in last year" [xmovie].

When a problem states that a list of independent variables can distinguish among groups, we do a discriminant analysis entering all of the variables simultaneously.

SW388R7 Data Analysis & Computers II Slide 21

Dissecting problem 1 - 3
The problem identifies two groups for the dependent variable:
survey respondents who had seen an x-rated movie in the last year
survey respondents who had not seen an x-rated movie in the last year

To distinguish among two groups, the analysis will be required to find one statistically significant discriminant function.

SW388R7 Data Analysis & Computers II Slide 22

Dissecting problem 1 - 4
The specific relationships listed in the problem indicate how the independent variable relates to groups of the dependent variable, i.e., the mean for age will be lower for respondents who had seen an x-rated movie in the last year.

In order for the discriminant analysis to be true, we must have enough statistically significant functions to distinguish among the groups, the classification accuracy rate must be substantially better than could be obtained by chance alone, and each significant relationship must be interpreted correctly.

SW388R7 Data Analysis & Computers II Slide 23

LEVEL OF MEASUREMENT - 1
Discriminant analysis requires that the dependent variable be non-metric and the independent variables be metric or dichotomous.

"seen x-rated movie in last year" [xmovie] is a dichotomous variable, which satisfies the level of measurement requirement. It contains two categories: survey respondents who had seen an x-rated movie in the last year and survey respondents who had not seen an x-rated movie in the last year.

SW388R7 Data Analysis & Computers II Slide 24

LEVEL OF MEASUREMENT - 2
"Age" [age] and "highest year of school completed" [educ] are interval level variables, which satisfies the level of measurement requirements for discriminant analysis.

"Income" [rincom98] is an ordinal level variable. If we follow the convention of treating ordinal level variables as metric variables, the level of measurement requirement for discriminant analysis is satisfied. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation.

"Sex" [sex] is a dichotomous or dummy-coded nominal variable which may be included in discriminant analysis.

SW388R7 Data Analysis & Computers II Slide 25

Request simultaneous discriminant analysis

Select the Classify | Discriminant command from the Analyze menu.

SW388R7 Data Analysis & Computers II Slide 26

Selecting the dependent variable

First, highlight the dependent variable xmovie in the list of variables.

Second, click on the right arrow button to move the dependent variable to the Grouping Variable text box.

SW388R7 Data Analysis & Computers II Slide 27

Defining the group values


When SPSS moves the dependent variable to the Grouping Variable text box, it puts two question marks in parentheses after the variable name. This is a reminder that we have to enter the numbers that represent the groups we want to include in the analysis.

First, to specify the group numbers, click on the Define Range button.

SW388R7 Data Analysis & Computers II Slide 28

Completing the range of group values

The value labels for xmovie show two categories:
1 = YES
2 = NO

The range of values that we need to enter goes from 1 as the minimum to 2 as the maximum.

First, type in 1 in the Minimum text box.

Second, type in 2 in the Maximum text box.

Third, click on the Continue button to close the dialog box.

SW388R7 Data Analysis & Computers II Slide 29

Selecting the independent variables

Move the independent variables listed in the problem to the Independents list box.

SW388R7 Data Analysis & Computers II Slide 30

Specifying the method for including variables


SPSS provides us with two methods for including variables: to enter all of the independent variables at one time, and a stepwise method for selecting variables using a statistical test to determine the order in which variables are included.

Since the problem states that there is a relationship without requesting the best predictors, we accept the default to Enter independents together.

SW388R7 Data Analysis & Computers II Slide 31

Requesting statistics for the output

Click on the Statistics button to select statistics we will need for the analysis.

SW388R7 Data Analysis & Computers II Slide 32

Specifying statistical output


First, mark the Means checkbox on the Descriptives panel. We will use the group means in our interpretation.

Second, mark the Univariate ANOVAs checkbox on the Descriptives panel. Perusing these tests suggests which variables might be useful discriminators.

Third, mark the Box's M checkbox. Box's M statistic evaluates conformity to the assumption of homogeneity of group variances.

Fourth, click on the Continue button to close the dialog box.

SW388R7 Data Analysis & Computers II Slide 33

Specifying details for classification

Click on the Classify button to specify details for the classification phase of the analysis.

SW388R7 Data Analysis & Computers II Slide 34

Details for classification - 1


First, mark the option button to Compute from group sizes on the Prior Probabilities panel. This incorporates the size of the groups defined by the dependent variable into the classification of cases using the discriminant functions. Second, mark the Casewise results checkbox on the Display panel to include classification details for each case in the output.

Third, mark the Summary table checkbox to include summary tables comparing actual and predicted classification.

SW388R7 Data Analysis & Computers II Slide 35

Details for classification - 2

Fourth, mark the Leave-one-out classification checkbox to request SPSS to include a cross-validated classification in the output. This option produces a less biased estimate of classification accuracy by sequentially holding each case out of the calculations for the discriminant functions, and using the derived functions to classify the case held out.

SW388R7 Data Analysis & Computers II Slide 36

Details for classification - 3

Fifth, accept the default of the Within-groups option button on the Use Covariance Matrix panel. The covariance matrices measure the dispersion in the groups defined by the dependent variable. If we fail the homogeneity of group variances test (Box's M), our option is to use Separate groups covariance in classification.

Sixth, mark the Combined-groups checkbox on the Plots panel to obtain a visual plot of the relationship between functions and groups defined by the dependent variable.

Seventh, click on the Continue button to close the dialog box.

SW388R7 Data Analysis & Computers II Slide 37

Completing the discriminant analysis request

Click on the OK button to request the output for the discriminant analysis.

SW388R7 Data Analysis & Computers II Slide 38

Sample size: ratio of cases to variables


Analysis Case Processing Summary

Unweighted Cases                                                  N    Percent
Valid                                                            119     44.1
Excluded  Missing or out-of-range group codes                     49     18.1
          At least one missing discriminating variable            66     24.4
          Both missing or out-of-range group codes and
          at least one missing discriminating variable            36     13.3
          Total                                                  151     55.9
Total                                                            270    100.0

The minimum ratio of valid cases to independent variables for discriminant analysis is 5 to 1, with a preferred ratio of 20 to 1. In this analysis, there are 119 valid cases and 4 independent variables. The ratio of cases to independent variables is 29.75 to 1, which satisfies the minimum requirement. In addition, the ratio of 29.75 to 1 satisfies the preferred ratio of 20 to 1.
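
The ratio check for this problem can be written out as a small sketch, using the 119 valid cases and 4 independent variables reported above. The variable names are invented for illustration.

```python
valid_cases = 119
independent_variables = 4

ratio = valid_cases / independent_variables
meets_minimum = ratio >= 5     # minimum ratio of 5 to 1
meets_preferred = ratio >= 20  # preferred ratio of 20 to 1

print(ratio, meets_minimum, meets_preferred)
```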

SW388R7 Data Analysis & Computers II Slide 39

Sample size: minimum group size


Prior Probabilities for Groups

                   Cases Used in Analysis
XMOVIE     Prior   Unweighted   Weighted
1           .311       37         37.000
2           .689       82         82.000
Total      1.000      119        119.000

In addition to the requirement for the ratio of cases to independent variables, discriminant analysis requires that there be a minimum number of cases in the smallest group defined by the dependent variable. The number of cases in the smallest group must be larger than the number of independent variables, and the smallest group should preferably contain 20 or more cases. The number of cases in the smallest group in this problem is 37, which is larger than the number of independent variables (4), satisfying the minimum requirement. In addition, the number of cases in the smallest group satisfies the preferred minimum of 20 cases.

If the sample size did not initially satisfy the minimum requirements, discriminant analysis is not appropriate.
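
The group-size requirement can be checked the same way. This sketch uses the group sizes from the prior probabilities table for xmovie (37 and 82 cases); the names are hypothetical.

```python
group_sizes = {"xmovie = 1": 37, "xmovie = 2": 82}
independent_variables = 4

smallest_group = min(group_sizes.values())
meets_minimum = smallest_group > independent_variables  # must exceed number of IVs
meets_preferred = smallest_group >= 20                  # preferred: 20 or more cases

print(smallest_group, meets_minimum, meets_preferred)
```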

SW388R7 Data Analysis & Computers II Slide 40

NUMBER OF DISCRIMINANT FUNCTIONS - 1

The maximum possible number of discriminant functions is the smaller of one less than the number of groups defined by the dependent variable and the number of independent variables. In this analysis there were 2 groups defined by seen x-rated movie in last year and 4 independent variables, so the maximum possible number of discriminant functions was 1.

SW388R7 Data Analysis & Computers II Slide 41

NUMBER OF DISCRIMINANT FUNCTIONS - 2

In the table of Wilks' Lambda, which tested functions for statistical significance, the direct analysis identified 1 discriminant function that was statistically significant. The Wilks' lambda statistic for the test of function 1 (chi-square=24.159) had a probability of <0.001, which was less than or equal to the level of significance of 0.05. The significance of the maximum possible number of discriminant functions supports the interpretation of a solution using 1 discriminant function.
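
The reported probability can be approximately checked from the chi-square value. The slide does not show the degrees of freedom, so the sketch below assumes df = 4 (number of independent variables times one less than the number of groups); treat that as an assumption rather than a value read from the SPSS table. The closed form used here only applies to even degrees of freedom.

```python
import math


def chi_square_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-square with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    if x < 0:
        raise ValueError("chi-square statistic cannot be negative")
    half = x / 2.0
    term = 1.0
    total = 1.0
    # Sum of (x/2)^k / k! for k = 0 .. df/2 - 1
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


chi_square = 24.159
df = 4 * (2 - 1)  # assumed: 4 independent variables, 2 groups
p_value = chi_square_sf_even_df(chi_square, df)
significant = p_value <= 0.05

print(p_value < 0.001, significant)
```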

SW388R7 Data Analysis & Computers II Slide 42

Independent variables and group membership: relationship of functions to groups


In order to specify the role that each independent variable plays in predicting group membership on the dependent variable, we must link together the relationship between the discriminant functions and the groups defined by the dependent variable, the role of the significant independent variables in the discriminant functions, and the differences in group means for each of the variables.

Each function divides the groups into two subgroups by assigning negative values to one subgroup and positive values to the other subgroup. Function 1 separates survey respondents who had seen an x-rated movie in the last year (-.714) from survey respondents who had not seen an x-rated movie in the last year (.322).

Functions at Group Centroids

XMOVIE    Function 1
1            -.714
2             .322

Unstandardized canonical discriminant functions evaluated at group means
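
As a rough illustration of how the centroids separate the groups, the sketch below assigns a discriminant score to the group with the nearer centroid. This is a simplification and an assumption on our part: SPSS also weighs the prior probabilities when it classifies cases, so its predictions can differ. The function name is hypothetical.

```python
# Group centroids on function 1, taken from the table above
CENTROIDS = {1: -0.714, 2: 0.322}

def nearest_centroid_group(score, centroids=CENTROIDS):
    """Return the group whose centroid is closest to the discriminant score."""
    return min(centroids, key=lambda group: abs(score - centroids[group]))

print(nearest_centroid_group(-0.5))  # 1 (had seen an x-rated movie)
print(nearest_centroid_group(0.3))   # 2 (had not seen an x-rated movie)
```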

SW388R7 Data Analysis & Computers II Slide 43

Independent variables and group membership: predictor loadings on functions


We do not interpret loadings in the structure matrix unless they are 0.30 or higher. Based on the structure matrix, the predictor variables strongly associated with discriminant function 1 which distinguished between survey respondents who had seen an x-rated movie in the last year and survey respondents who had not seen an x-rated movie in the last year were age (r=0.467) and sex (r=0.770).

Structure Matrix

Variable    Function 1
SEX            .770
AGE            .467
EDUC           .118
RINCOM98       .044

Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions. Variables ordered by absolute size of correlation within function.
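
The 0.30 rule of thumb for loadings can be applied mechanically. The sketch below uses the loadings from the structure matrix; the function name and the `cutoff` parameter are hypothetical.

```python
# Loadings on function 1 from the structure matrix
loadings = {"SEX": 0.770, "AGE": 0.467, "EDUC": 0.118, "RINCOM98": 0.044}

def interpretable_loadings(loadings, cutoff=0.30):
    """Keep only the predictors whose absolute loading reaches the cutoff."""
    return {name: r for name, r in loadings.items() if abs(r) >= cutoff}

print(interpretable_loadings(loadings))  # {'SEX': 0.77, 'AGE': 0.467}
```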

SW388R7 Data Analysis & Computers II Slide 44

Independent variables and group membership: predictors associated with first function - 1

The average age for survey respondents who had seen an x-rated movie in the last year (mean=37.24) was lower than the average age for survey respondents who had not seen an x-rated movie in the last year (mean=42.70).

This supports the relationship that "survey respondents who had seen an x-rated movie in the last year were younger than survey respondents who had not seen an x-rated movie in the last year."

Group Statistics

                                               Valid N (listwise)
XMOVIE  Variable     Mean   Std. Deviation   Unweighted   Weighted
1       AGE         37.24       10.838            37        37.000
        EDUC        13.86        2.720            37        37.000
        SEX          1.27         .450            37        37.000
        RINCOM98    13.76        5.209            37        37.000
2       AGE         42.70       11.461            82        82.000
        EDUC        14.18        2.534            82        82.000
        SEX          1.65         .481            82        82.000
        RINCOM98    14.00        5.308            82        82.000
Total   AGE         41.00       11.508           119       119.000
        EDUC        14.08        2.586           119       119.000
        SEX          1.53         .501           119       119.000
        RINCOM98    13.92        5.256           119       119.000

SW388R7 Data Analysis & Computers II Slide 45

Independent variables and group membership: predictors associated with first function - 2

Since sex is a dichotomous variable, the mean is not directly interpretable. Its interpretation must take into account the coding by which 1 corresponds to male and 2 corresponds to female.

The lower mean for survey respondents who had seen an x-rated movie in the last year (mean=1.27), when compared to the mean for survey respondents who had not seen an x-rated movie in the last year (mean=1.65), implies that the group contained more survey respondents who were male and fewer survey respondents who were female.

Group Statistics

                                               Valid N (listwise)
XMOVIE  Variable     Mean   Std. Deviation   Unweighted   Weighted
1       AGE         37.24       10.838            37        37.000
        EDUC        13.86        2.720            37        37.000
        SEX          1.27         .450            37        37.000
        RINCOM98    13.76        5.209            37        37.000
2       AGE         42.70       11.461            82        82.000
        EDUC        14.18        2.534            82        82.000
        SEX          1.65         .481            82        82.000
        RINCOM98    14.00        5.308            82        82.000
Total   AGE         41.00       11.508           119       119.000
        EDUC        14.08        2.586           119       119.000
        SEX          1.53         .501           119       119.000
        RINCOM98    13.92        5.256           119       119.000

This supports the relationship that "survey respondents who had seen an x-rated movie in the last year were more likely to be male than survey respondents who had not seen an x-rated movie in the last year."

SW388R7 Data Analysis & Computers II Slide 46

CLASSIFICATION USING THE DISCRIMINANT MODEL: by chance accuracy rate


The independent variables could be characterized as useful predictors of membership in the groups defined by the dependent variable if the cross-validated classification accuracy rate was significantly higher than the accuracy attainable by chance alone. Operationally, the cross-validated classification accuracy rate should be at least 25% higher than the proportional by chance accuracy rate. The proportional by chance accuracy rate was computed by squaring and summing the proportion of cases in each group from the table of prior probabilities for groups (0.311² + 0.689² = 0.571).

Prior Probabilities for Groups

                   Cases Used in Analysis
XMOVIE    Prior    Unweighted    Weighted
1          .311        37          37.000
2          .689        82          82.000
Total     1.000       119         119.000
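
The proportional by chance accuracy rate is the sum of the squared prior probabilities. A minimal sketch, using the rounded priors from the table (the function name is hypothetical):

```python
def proportional_chance_accuracy(priors):
    """Sum of squared group proportions."""
    return sum(p ** 2 for p in priors)

chance = proportional_chance_accuracy([0.311, 0.689])
print(round(chance, 3))  # 0.571
```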

SW388R7 Data Analysis & Computers II Slide 47

CLASSIFICATION USING THE DISCRIMINANT MODEL: criteria for classification accuracy


Classification Results (b, c)

                                              Predicted Group Membership
                         XMOVIE                   1         2       Total
Original          Count  1                       15        22          37
                         2                       12        70          82
                         Ungrouped cases         13        36          49
                  %      1                     40.5      59.5       100.0
                         2                     14.6      85.4       100.0
                         Ungrouped cases       26.5      73.5       100.0
Cross-validated   Count  1                       15        22          37
(a)                      2                       12        70          82
                  %      1                     40.5      59.5       100.0
                         2                     14.6      85.4       100.0

a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
b. 71.4% of original grouped cases correctly classified.
c. 71.4% of cross-validated grouped cases correctly classified.

The cross-validated accuracy rate computed by SPSS was 71.4%, which was greater than or equal to the proportional by chance accuracy criterion of 71.4% (1.25 x 57.1% = 71.4%). The criterion for classification accuracy is satisfied.
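
The 71.4% figure can be recovered from the cross-validated counts in the classification table. The sketch below assumes the table is stored as a nested dictionary keyed by actual group, then predicted group; the function name is hypothetical.

```python
def hit_rate(table):
    """Share of cases on the diagonal (predicted group equals actual group)."""
    correct = sum(table[group][group] for group in table)
    total = sum(sum(row.values()) for row in table.values())
    return correct / total

# Cross-validated counts: {actual group: {predicted group: count}}
cross_validated = {1: {1: 15, 2: 22}, 2: {1: 12, 2: 70}}
print(round(100 * hit_rate(cross_validated), 1))  # 71.4
```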

SW388R7 Data Analysis & Computers II Slide 48

Answering the question in problem 1 - 1


In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, violation of assumptions, or outliers. Use a level of significance of 0.05 for evaluating the statistical relationship.

The variables "age" [age], "highest year of school completed" [educ], "sex" [sex], and "income" [rincom98] are useful in distinguishing between groups based on responses to "seen x-rated movie in last year" [xmovie]. These predictors differentiate survey respondents who had seen an x-rated movie in the last year from survey respondents who had not seen an x-rated movie in the last year. Survey respondents who had seen an x-rated movie in the last year were younger than survey respondents who had not seen an x-rated movie in the last year. Survey respondents who had seen an x-rated movie in the last year were more likely to be male than survey respondents who had not seen an x-rated movie in the last year.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

We found one statistically significant discriminant function, making it possible to distinguish between the two groups defined by the dependent variable.

Moreover, the cross-validated classification accuracy surpassed the by chance accuracy criteria, supporting the utility of the model.

SW388R7 Data Analysis & Computers II Slide 49

Answering the question in problem 1 - 2


We verified that each statement about the relationship between predictors and groups was correct.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The answer to the question is true with caution. A caution is added because of the inclusion of ordinal level variables.

SW388R7 Data Analysis & Computers II Slide 50

Problem 2
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, violation of assumptions, or outliers. Use a level of significance of 0.05 for evaluating the statistical relationship.

From the list of variables "respondent's degree of religious fundamentalism" [fund], "frequency of prayer" [pray], and "frequency of attendance at religious services" [attend], the most useful predictor for distinguishing between groups based on responses to "attitude toward abortion when there is a strong chance of serious defect in the baby" [abdefect] is "frequency of prayer" [pray]. These predictors differentiate survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby from survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby.

The most important predictor of groups based on responses to attitude toward abortion when there is a strong chance of serious defect in the baby was frequency of prayer. Survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby prayed more often than survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

SW388R7 Data Analysis & Computers II Slide 51

Dissecting problem 2 - 1
The variables listed first in the problem statement are the independent variables (IVs): "respondent's degree of religious fundamentalism" [fund], "frequency of prayer" [pray], and "frequency of attendance at religious services" [attend].

When a problem asks us to identify the best or most useful predictors from a list of independent variables, we do stepwise discriminant analysis.

The variable used to define groups is the dependent variable (DV): "attitude toward abortion when there is a strong chance of serious defect in the baby" [abdefect].

SW388R7 Data Analysis & Computers II Slide 52

Dissecting problem 2 - 2

The problem identifies two groups for the dependent variable:
survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby
survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby.

To distinguish between two groups, the analysis will be required to find one statistically significant discriminant function.

The importance of predictors is based upon the stepwise addition of variables to the analysis.

SW388R7 Data Analysis & Computers II Slide 53

Dissecting problem 2 - 3

The specific relationships listed in the problem indicate how the independent variable relates to groups of the dependent variable, i.e., the mean for frequency of prayer will be lower for respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby compared to survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby. This direction holds only if higher values of [pray] indicate more frequent prayer, so the coding of the variable must be checked when the group means are interpreted.

In a stepwise analysis, we only interpret the independent variables that are entered in the stepwise analysis.

In order for the statement in a stepwise problem to be true, we must have enough statistically significant functions to distinguish among the groups, the order of entry must be correct, and each significant relationship must be interpreted correctly.

SW388R7 Data Analysis & Computers II Slide 54

LEVEL OF MEASUREMENT - 1
Discriminant analysis requires that the dependent variable be non-metric and the independent variables be metric or dichotomous.

"Attitude toward abortion when there is a strong chance of serious defect in the baby" [abdefect] is a nominal level variable, which satisfies the level of measurement requirement.

SW388R7 Data Analysis & Computers II Slide 55

LEVEL OF MEASUREMENT - 2
"Respondent's degree of religious fundamentalism" [fund], "frequency of prayer" [pray], and "frequency of attendance at religious services" [attend] are ordinal level variables. If we follow the convention of treating ordinal level variables as metric variables, the level of measurement requirement for discriminant analysis is satisfied. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation.

SW388R7 Data Analysis & Computers II Slide 56

Request stepwise discriminant analysis

Select the Classify | Discriminant command from the Analyze menu.

SW388R7 Data Analysis & Computers II Slide 57

Selecting the dependent variable

First, highlight the dependent variable abdefect in the list of variables.

Second, click on the right arrow button to move the dependent variable to the Grouping Variable text box.

SW388R7 Data Analysis & Computers II Slide 58

Defining the group values


When SPSS moves the dependent variable to the Grouping Variable text box, it puts two question marks in parentheses after the variable name. This is a reminder that we have to enter the numbers that represent the groups we want to include in the analysis.

First, to specify the group numbers, click on the Define Range button.

SW388R7 Data Analysis & Computers II Slide 59

Completing the range of group values

The value labels for abdefect show two categories:
1 = YES
2 = NO

The range of values that we need to enter runs from 1 as the minimum to 2 as the maximum.

First, type in 1 in the Minimum text box.

Second, type in 2 in the Maximum text box.

Third, click on the Continue button to close the dialog box.

SW388R7 Data Analysis & Computers II Slide 60

Selecting the independent variables

Move the independent variables listed in the problem to the Independents list box.

SW388R7 Data Analysis & Computers II Slide 61

Specifying the method for including variables


SPSS provides us with two methods for including variables: to enter all of the independent variables at one time, and a stepwise method for selecting variables using a statistical test to determine the order in which variables are included.

Since the problem calls for identifying the best predictors, we click on the option button to Use stepwise method.

SW388R7 Data Analysis & Computers II Slide 62

Requesting statistics for the output

Click on the Statistics button to select statistics we will need for the analysis.

SW388R7 Data Analysis & Computers II Slide 63

Specifying statistical output


First, mark the Means checkbox on the Descriptives panel. We will use the group means in our interpretation.

Second, mark the Univariate ANOVAs checkbox on the Descriptives panel. Perusing these tests suggests which variables might be useful discriminators.

Third, mark the Box's M checkbox. The Box's M statistic evaluates conformity to the assumption of homogeneity of group variances (equality of the group covariance matrices).

Fourth, click on the Continue button to close the dialog box.

SW388R7 Data Analysis & Computers II Slide 64

Specifying details for the stepwise method

Click on the Method button to specify the specific statistical criteria to use for including variables.

SW388R7 Data Analysis & Computers II Slide 65

Details for the stepwise method


First, mark the Mahalanobis distance option button on the Method panel.

Second, mark the Summary of steps checkbox to produce a summary table when a new variable is added.

Third, click on the option button Use probability of F so that we can incorporate the level of significance specified in the problem.

Fourth, type the level of significance in the Entry text box. The Removal value is twice as large as the entry value.

Fifth, click on the Continue button to close the dialog box.

SW388R7 Data Analysis & Computers II Slide 66

Specifying details for classification

Click on the Classify button to specify details for the classification phase of the analysis.

SW388R7 Data Analysis & Computers II Slide 67

Details for classification - 1


First, mark the option button to Compute from group sizes on the Prior Probabilities panel. This incorporates the size of the groups defined by the dependent variable into the classification of cases using the discriminant functions.

Second, mark the Casewise results checkbox on the Display panel to include classification details for each case in the output.

Third, mark the Summary table checkbox to include summary tables comparing actual and predicted classification.

SW388R7 Data Analysis & Computers II Slide 68

Details for classification - 2

Fourth, mark the Leave-one-out classification checkbox to request SPSS to include a cross-validated classification in the output. This option produces a less biased estimate of classification accuracy by sequentially holding each case out of the calculations for the discriminant functions, and using the derived functions to classify the case held out.
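
The holding-out logic can be sketched with a toy classifier. The example below is not SPSS's procedure: it stands in a one-predictor, nearest-group-mean rule for the discriminant function, and the data are made up purely for illustration (they are not from GSS2000.sav). The function name is hypothetical.

```python
from statistics import mean

def loo_accuracy(scores, groups):
    """Leave-one-out hit rate for a toy nearest-group-mean classifier.

    Each case is held out in turn, the group means are recomputed from the
    remaining cases, and the held-out case is assigned to the nearer mean.
    """
    hits = 0
    for i in range(len(scores)):
        train = [(s, g) for j, (s, g) in enumerate(zip(scores, groups)) if j != i]
        centroids = {
            g: mean(s for s, gg in train if gg == g)
            for g in {g for _, g in train}
        }
        predicted = min(centroids, key=lambda g: abs(scores[i] - centroids[g]))
        hits += predicted == groups[i]
    return hits / len(scores)

# Made-up illustration data: one predictor, two groups of four cases
scores = [1.0, 1.2, 0.8, 1.1, 3.0, 3.2, 2.9, 1.9]
groups = [1, 1, 1, 1, 2, 2, 2, 2]
print(loo_accuracy(scores, groups))  # 0.875 (the last case is misclassified)
```

Because the held-out case never contributes to the means used to classify it, the resulting hit rate is less optimistic than one computed on the same cases that built the rule.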

SW388R7 Data Analysis & Computers II Slide 69

Details for classification - 3

Fifth, accept the default of the Within-groups option button on the Use Covariance Matrix panel. The covariance matrices are the measure of the dispersion in the groups defined by the dependent variable. If we fail the homogeneity of group variances test (Box's M), our option is to use Separate-groups covariance in classification.

Sixth, mark the Combined-groups checkbox on the Plots panel to obtain a visual plot of the relationship between functions and groups defined by the dependent variable.

Seventh, click on the Continue button to close the dialog box.

SW388R7 Data Analysis & Computers II Slide 70

Completing the discriminant analysis request

Click on the OK button to request the output for the discriminant analysis.

SW388R7 Data Analysis & Computers II Slide 71

Sample size ratio of cases to variables


Analysis Case Processing Summary

Unweighted Cases                                                  N    Percent
Valid                                                            77       28.5
Excluded  Missing or out-of-range group codes                    41       15.2
          At least one missing discriminating variable          105       38.9
          Both missing or out-of-range group codes and
            at least one missing discriminating variable         47       17.4
          Total                                                 193       71.5
Total                                                           270      100.0

The minimum ratio of valid cases to independent variables for discriminant analysis is 5 to 1, with a preferred ratio of 20 to 1. In this analysis, there are 77 valid cases and 3 independent variables. The ratio of cases to independent variables is 25.67 to 1, which satisfies the minimum requirement. In addition, the ratio of 25.67 to 1 satisfies the preferred ratio of 20 to 1.
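
The ratio check for this analysis is simple arithmetic. A minimal sketch, with a hypothetical function name:

```python
def cases_per_predictor(valid_cases, n_predictors):
    """Ratio of valid cases to independent variables."""
    return valid_cases / n_predictors

ratio = cases_per_predictor(77, 3)
# Ratio, minimum (5 to 1) satisfied, preferred (20 to 1) satisfied
print(round(ratio, 2), ratio >= 5, ratio >= 20)  # 25.67 True True
```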

SW388R7 Data Analysis & Computers II Slide 72

Sample size minimum group size


Prior Probabilities for Groups

STRONG CHANCE OF              Cases Used in Analysis
SERIOUS DEFECT       Prior    Unweighted    Weighted
1                     .831        64          64.000
2                     .169        13          13.000
Total                1.000        77          77.000

In addition to the requirement for the ratio of cases to independent variables, discriminant analysis requires that there be a minimum number of cases in the smallest group defined by the dependent variable. The number of cases in the smallest group must be larger than the number of independent variables, and preferably contains 20 or more cases. The number of cases in the smallest group in this problem is 13, which is larger than the number of independent variables (3), satisfying the minimum requirement. However, the number of cases in the smallest group is less than the preferred minimum of 20 cases. A caution should be added to the interpretation of the analysis.
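
The two group-size conditions can be checked together. This is a sketch with a hypothetical function name; the `preferred` default of 20 comes from the text above.

```python
def smallest_group_check(group_sizes, n_predictors, preferred=20):
    """Return (minimum requirement met, preferred minimum met)."""
    smallest = min(group_sizes)
    return smallest > n_predictors, smallest >= preferred

# Problem 2: groups of 64 and 13 cases, 3 independent variables
print(smallest_group_check([64, 13], 3))  # (True, False)
```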

If the sample size did not initially satisfy the minimum requirements, discriminant analysis is not appropriate.

SW388R7 Data Analysis & Computers II Slide 73

NUMBER OF DISCRIMINANT FUNCTIONS - 1

The maximum possible number of discriminant functions is the smaller of one less than the number of groups defined by the dependent variable and the number of independent variables. In this analysis there were 2 groups defined by attitude toward abortion when there is a strong chance of serious defect in the baby and 3 independent variables, so the maximum possible number of discriminant functions was 1.

SW388R7 Data Analysis & Computers II Slide 74

NUMBER OF DISCRIMINANT FUNCTIONS - 2

In the table of Wilks' Lambda which tested functions for statistical significance, the stepwise analysis identified 1 discriminant function that was statistically significant. The Wilks' lambda statistic for the test of function 1 (chi-square=3.887) had a probability of 0.049 which was less than or equal to the level of significance of 0.05.

The significance of the maximum possible number of discriminant functions supports the interpretation of a solution using 1 discriminant function.
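
The reported probability can be approximately reproduced from the chi-square statistic. The degrees of freedom are not shown in the slide text; the sketch assumes df = 1 (one entered predictor times one fewer than the two groups). The function name is hypothetical, and it handles only df = 1 or an even df, where closed forms exist.

```python
import math

def chi_square_p_value(chi_square, df):
    """Upper-tail probability of a chi-square statistic (df = 1 or even df only)."""
    half = chi_square / 2
    if df == 1:
        return math.erfc(math.sqrt(half))
    if df > 0 and df % 2 == 0:
        terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    raise ValueError("this sketch handles only df = 1 or an even df")

p = chi_square_p_value(3.887, df=1)  # df = 1 is an assumption, see above
print(round(p, 3), p <= 0.05)
```

Under that assumption the result rounds to about 0.049, in line with the Wilks' lambda table.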

SW388R7 Data Analysis & Computers II Slide 75

Independent variables and group membership: relationship of functions to groups


In order to specify the role that each independent variable plays in predicting group membership on the dependent variable, we must link together the relationship between the discriminant functions and the groups defined by the dependent variable, the role of the significant independent variables in the discriminant functions, and the differences in group means for each of the variables.

Each function divides the groups into two subgroups by assigning negative values to one subgroup and positive values to the other subgroup. Function 1 separates survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby (-.507) from survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby (.103).

Functions at Group Centroids

STRONG CHANCE OF
SERIOUS DEFECT      Function 1
1                       .103
2                      -.507

Unstandardized canonical discriminant functions evaluated at group means

SW388R7 Data Analysis & Computers II Slide 76

Independent variables and group membership: which predictors to interpret


When we use the stepwise method of variable inclusion, we limit our interpretation of independent variable predictors to those listed as statistically significant in the table of Variables Entered/Removed. The stepwise method of variable selection identified 1 variable that satisfied the level of significance of 0.05. The most important predictor of groups based on responses to attitude toward abortion when there is a strong chance of serious defect in the baby was: frequency of prayer.

Had we used simultaneous entry of all variables, we would not have imposed this limitation.

Variables Entered/Removed (a, b, c, d)

                                        Min. D Squared                   Exact F
Step  Entered                  Statistic   Between Groups   Statistic   df1     df2     Sig.
1     HOW OFTEN DOES R PRAY      .372         1 and 2          4.017      1    75.000   .049

At each step, the variable that maximizes the Mahalanobis distance between the two closest groups is entered.
a. Maximum number of steps is 6.
b. Maximum significance of F to enter is .05.
c. Minimum significance of F to remove is .10.

SW388R7 Data Analysis & Computers II Slide 77

Independent variables and group membership: predictor loadings on functions


Based on the structure matrix, the predictor variable strongly associated with discriminant function 1 which distinguished between survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby and survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby was frequency of prayer (r=1.000). The correlation of 1.0 is an artifact of having only one statistically significant variable.

While we would normally interpret loadings in the structure matrix if they are 0.30 or higher, when we do stepwise analysis, we limit ourselves to the variables that were statistically significant.

Structure Matrix

Variable      Function 1
PRAY             1.000
ATTEND (a)       -.511
FUND (a)          .336

Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions. Variables ordered by absolute size of correlation within function.
a. This variable not used in the analysis.

SW388R7 Data Analysis & Computers II Slide 78

Independent variables and group membership: predictors associated with first function - 1

The average frequency of prayer for survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby (mean=2.08) was lower than the average frequency of prayer for survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby (mean=3.05).

Frequency of prayer is an ordinal level variable that is coded so that higher numeric values are associated with survey respondents who prayed less often.

The relationship that "survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby prayed more often than survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby" is supported.

Group Statistics

                                                Valid N (listwise)
ABDEFECT  Variable    Mean   Std. Deviation   Unweighted   Weighted
1         ATTEND      3.05       2.627            64         64.000
          PRAY        3.05       1.608            64         64.000
          FUND        2.03        .776            64         64.000
2         ATTEND      4.23       2.948            13         13.000
          PRAY        2.08       1.498            13         13.000
          FUND        1.69        .630            13         13.000
Total     ATTEND      3.25       2.701            77         77.000
          PRAY        2.88       1.622            77         77.000
          FUND        1.97        .760            77         77.000

SW388R7 Data Analysis & Computers II Slide 79

CLASSIFICATION USING THE DISCRIMINANT MODEL: by chance accuracy rate


The independent variables could be characterized as useful predictors of membership in the groups defined by the dependent variable if the cross-validated classification accuracy rate was significantly higher than the accuracy attainable by chance alone. Operationally, the cross-validated classification accuracy rate should be 25% or more higher than the proportional by chance accuracy rate.

The proportional by chance accuracy rate was computed by squaring and summing the proportion of cases in each group from the table of prior probabilities for groups (0.831² + 0.169² = 0.719).

Prior Probabilities for Groups

ABDEFECT   Prior   Cases Used in Analysis: Unweighted   Weighted
1          .831    64                                   64.000
2          .169    13                                   13.000
Total      1.000   77                                   77.000

SW388R7 Data Analysis & Computers II Slide 80

CLASSIFICATION USING THE DISCRIMINANT MODEL: criteria for classification accuracy


Classification Results (b,c)

                                                Predicted Group Membership
                           ABDEFECT             1        2        Total
Original             Count 1                    72       0        72
                           2                    15       0        15
                           Ungrouped cases      48       0        48
                     %     1                    100.0    .0       100.0
                           2                    100.0    .0       100.0
                           Ungrouped cases      100.0    .0       100.0
Cross-validated (a)  Count 1                    72       0        72
                           2                    15       0        15
                     %     1                    100.0    .0       100.0
                           2                    100.0    .0       100.0

a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
b. 82.8% of original grouped cases correctly classified.
c. 82.8% of cross-validated grouped cases correctly classified.

The cross-validated accuracy rate computed by SPSS was 82.8%, which was less than the proportional by chance accuracy criterion of 89.9% (1.25 x 71.9% = 89.9%). The criterion for classification accuracy is not satisfied.
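The arithmetic behind this criterion can be sketched in a few lines of Python. This is only an illustration of the rule stated on the slides, not SPSS output; the function name `proportional_by_chance` is a hypothetical helper, and the group sizes (64 and 13) and the 82.8% cross-validated rate are taken from the tables above.

```python
def proportional_by_chance(group_sizes):
    """Sum of squared group proportions (hypothetical helper name)."""
    total = sum(group_sizes)
    return sum((n / total) ** 2 for n in group_sizes)

# Group sizes from the table of prior probabilities for abdefect.
group_sizes = [64, 13]

by_chance = proportional_by_chance(group_sizes)   # proportional by chance rate
criterion = 1.25 * by_chance                      # 25% improvement over chance
cross_validated = 0.828                           # reported by SPSS

meets_criterion = cross_validated >= criterion

print(f"by chance rate: {by_chance:.3f}")
print(f"criterion:      {criterion:.3f}")
print(f"criterion met:  {meets_criterion}")
```

Working from the group sizes rather than the rounded priors avoids compounding rounding error, although here both routes agree to three decimals.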

SW388R7 Data Analysis & Computers II Slide 81

Answering the question in problem 2


From the list of variables "respondent's degree of religious fundamentalism" [fund], "frequency of prayer" [pray], and "frequency of attendance at religious services" [attend], the most useful predictor for distinguishing between groups based on responses to "attitude toward abortion when there is a strong chance of serious defect in the baby" [abdefect] is "frequency of prayer" [pray]. These predictors differentiate survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby from survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby. The most important predictor of groups based on responses to attitude toward abortion when there is a strong chance of serious defect in the baby was frequency of prayer. Survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby prayed more often than survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

We found one statistically significant discriminant function, making it possible to distinguish among the two groups defined by the dependent variable. However, the cross-validated classification accuracy was not 25% greater than the by chance accuracy rate, failing to support the utility of the model.

The answer to the question is false.

SW388R7 Data Analysis & Computers II Slide 82

Problem 3
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01 for evaluating assumptions. Use a level of significance of 0.05 for evaluating the statistical relationship. From the list of variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], "highest year of school completed" [educ], and "income" [rincom98], the most useful predictors for distinguishing among groups based on responses to "opinion about spending on welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey respondents who thought we spend too much money on welfare from survey respondents who thought we spend about the right amount of money on welfare who, in turn, are differentiated from survey respondents who thought we spend too little money on welfare. The most important predictor of groups based on responses to opinion about spending on welfare was number of hours worked in the past week. The second most important predictor of groups based on responses to opinion about spending on welfare was self-employment. The third most important predictor of groups based on responses to opinion about spending on welfare was highest year of school completed. Survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too much or little money on welfare. Survey respondents who thought we spend about the right amount of money on welfare had completed more years of school than survey respondents who thought we spend too much or little money on welfare. 
Survey respondents who thought we spend too much money on welfare were more likely to be self-employed than survey respondents who thought we spend too little money on welfare.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

SW388R7 Data Analysis & Computers II Slide 83

Dissecting problem 3 - 1

The variables listed first in the problem statement are the independent variables (IVs): "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], "highest year of school completed" [educ], and "income" [rincom98].

The variable used to define groups is the dependent variable (DV): "opinion about spending on welfare" [natfare].

When a problem asks us to identify the best or most useful predictors from a list of independent variables, we do stepwise discriminant analysis.

SW388R7 Data Analysis & Computers II Slide 84

Dissecting problem 3 - 2
The problem identifies three groups for the dependent variable:
survey respondents who thought we spend too much money on welfare
survey respondents who thought we spend about the right amount of money on welfare
survey respondents who thought we spend too little money on welfare.

To distinguish among three groups, the analysis will be required to find two statistically significant discriminant functions.

The importance of predictors is based upon the stepwise addition of variables to the analysis.

SW388R7 Data Analysis & Computers II Slide 85

Dissecting problem 3 - 3
The specific relationships listed in the problem indicate how the independent variable relates to groups of the dependent variable, i.e., the mean for hours worked in the past week will be lower for respondents who think we spend the right amount of money versus respondents who think we spend too much or too little.

In a stepwise analysis, we only interpret the independent variables that are entered in the stepwise analysis.

In order for a stepwise analysis to be true, we must have enough statistically significant functions to distinguish among the groups, the order of entry must be correct, and each significant relationship must be interpreted correctly.

SW388R7 Data Analysis & Computers II Slide 86

LEVEL OF MEASUREMENT - 1
Discriminant analysis requires that the dependent variable be non-metric and the independent variables be metric or dichotomous.

"Opinion about spending on welfare" [natfare] is an ordinal level variable, which satisfies the level of measurement requirement.

It contains three categories: survey respondents who thought we spend too much money on welfare, survey respondents who thought we spend about the right amount of money on welfare, and survey respondents who thought we spend too little money on welfare.

SW388R7 Data Analysis & Computers II Slide 87

LEVEL OF MEASUREMENT - 2
"Number of hours worked in the past week" [hrs1] and "highest year of school completed" [educ] are interval level variables, which satisfy the level of measurement requirements for discriminant analysis.

"Income" [rincom98] is an ordinal level variable. If we follow the convention of treating ordinal level variables as metric variables, the level of measurement requirement for discriminant analysis is satisfied. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation.

"Self-employment" [wrkslf] is a dichotomous or dummy-coded nominal variable which may be included in discriminant analysis.

SW388R7 Data Analysis & Computers II Slide 88

The stepwise discriminant analysis

To answer the question, we do a stepwise discriminant analysis with natfare as the dependent variable and hrs1, wrkslf, educ, and rincom98 as the independent variables.

Select the Classify | Discriminant command from the Analyze menu.

SW388R7 Data Analysis & Computers II Slide 89

Selecting the dependent variable

First, highlight the dependent variable natfare in the list of variables.

Second, click on the right arrow button to move the dependent variable to the Grouping Variable text box.

SW388R7 Data Analysis & Computers II Slide 90

Defining the group values


When SPSS moves the dependent variable to the Grouping Variable text box, it puts two question marks in parentheses after the variable name. This is a reminder that we have to enter the numbers that represent the groups we want to include in the analysis.

First, to specify the group numbers, click on the Define Range button.

SW388R7 Data Analysis & Computers II Slide 91

Completing the range of group values


The value labels for natfare show three categories:
1 = TOO LITTLE
2 = ABOUT RIGHT
3 = TOO MUCH
The range of values that we need to enter goes from 1 as the minimum to 3 as the maximum.

First, type in 1 in the Minimum text box.

Second, type in 3 in the Maximum text box.

Third, click on the Continue button to close the dialog box.

Note: if we enter the wrong range of group numbers, e.g., 1 to 2 instead of 1 to 3, SPSS will only include groups 1 and 2 in the analysis.

SW388R7 Data Analysis & Computers II Slide 92

Specifying the method for including variables


SPSS provides us with two methods for including variables: to enter all of the independent variables at one time, and a stepwise method for selecting variables using a statistical test to determine the order in which variables are included.

Since the problem calls for identifying the best predictors, we click on the option button to Use stepwise method.

SW388R7 Data Analysis & Computers II Slide 93

Requesting statistics for the output

Click on the Statistics button to select statistics we will need for the analysis.

SW388R7 Data Analysis & Computers II Slide 94

Specifying statistical output


First, mark the Means checkbox on the Descriptives panel. We will use the group means in our interpretation.

Second, mark the Univariate ANOVAs checkbox on the Descriptives panel. Perusing these tests suggests which variables might be useful discriminators.

Third, mark the Box's M checkbox. The Box's M statistic evaluates conformity to the assumption of homogeneity of group variances.

Fourth, click on the Continue button to close the dialog box.

SW388R7 Data Analysis & Computers II Slide 95

Specifying details for the stepwise method

Click on the Method button to specify the specific statistical criteria to use for including variables.

SW388R7 Data Analysis & Computers II Slide 96

Details for the stepwise method


First, mark the Mahalanobis distance option button on the Method panel.

Second, mark the Summary of steps checkbox to produce a summary table when a new variable is added.

Third, click on the option button Use probability of F so that we can incorporate the level of significance specified in the problem.

Fourth, type the level of significance in the Entry text box. The Removal value is twice as large as the entry value.

Fifth, click on the Continue button to close the dialog box.

SW388R7 Data Analysis & Computers II Slide 97

Specifying details for classification

Click on the Classify button to specify details for the classification phase of the analysis.

SW388R7 Data Analysis & Computers II Slide 98

Details for classification - 1


First, mark the option button to Compute from group sizes on the Prior Probabilities panel. This incorporates the size of the groups defined by the dependent variable into the classification of cases using the discriminant functions. Second, mark the Casewise results checkbox on the Display panel to include classification details for each case in the output.

Third, mark the Summary table checkbox to include summary tables comparing actual and predicted classification.

SW388R7 Data Analysis & Computers II Slide 99

Details for classification - 2

Fourth, mark the Leave-one-out classification checkbox to request SPSS to include a cross-validated classification in the output. This option produces a less biased estimate of classification accuracy by sequentially holding each case out of the calculations for the discriminant functions, and using the derived functions to classify the case held out.
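The hold-one-out idea can be sketched in plain Python. This is a minimal illustration under loud assumptions: the nearest-group-mean classifier is a stand-in for the discriminant functions (it is not what SPSS computes), and the scores, group labels, and function names are all hypothetical.

```python
def nearest_mean_classifier(training_cases):
    """Build a classifier that assigns a score to the group with the closest mean.

    Stand-in for the discriminant functions; training_cases is a list of
    (score, group) pairs.
    """
    sums = {}
    for score, group in training_cases:
        total, count = sums.get(group, (0.0, 0))
        sums[group] = (total + score, count + 1)
    means = {group: total / count for group, (total, count) in sums.items()}

    def predict(score):
        return min(means, key=lambda group: abs(score - means[group]))

    return predict


def leave_one_out_accuracy(cases):
    """Classify each case with a model derived from all the other cases."""
    hits = 0
    for i, (score, group) in enumerate(cases):
        training = cases[:i] + cases[i + 1:]      # hold case i out
        predict = nearest_mean_classifier(training)
        hits += predict(score) == group
    return hits / len(cases)


# Hypothetical one-dimensional scores for two groups.
cases = [(1.0, "A"), (1.5, "A"), (2.0, "A"), (3.1, "A"),
         (3.4, "B"), (4.0, "B"), (4.5, "B"), (5.0, "B")]

accuracy = leave_one_out_accuracy(cases)
print(f"cross-validated accuracy: {accuracy:.3f}")
```

Because the held-out case never contributes to the group means used to classify it, the resulting accuracy is not inflated by the model having already seen that case.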

SW388R7 Data Analysis & Computers II Slide 100

Details for classification - 3

Fifth, accept the default of Within-groups option button on the Use Covariance Matrix panel. The Covariance matrices are the measure of the dispersion in the groups defined by the dependent variable. If we fail the homogeneity of group variances test (Box's M), our option is to use Separate-groups covariance in classification.

Sixth, mark the Combined-groups checkbox on the Plots panel to obtain a visual plot of the relationship between functions and groups defined by the dependent variable.

Seventh, click on the Continue button to close the dialog box.

SW388R7 Data Analysis & Computers II Slide 101

Completing the discriminant analysis request

Click on the OK button to request the output for the discriminant analysis.

SW388R7 Data Analysis & Computers II Slide 102

SAMPLE SIZE - 1

Analysis Case Processing Summary

Unweighted Cases                                                                            N     Percent
Valid                                                                                       138   51.1
Excluded   Missing or out-of-range group codes                                              7     2.6
           At least one missing discriminating variable                                     115   42.6
           Both missing or out-of-range group codes and at least one missing
           discriminating variable                                                          10    3.7
           Total                                                                            132   48.9
Total                                                                                       270   100.0

The minimum ratio of valid cases to independent variables for discriminant analysis is 5 to 1, with a preferred ratio of 20 to 1. In this analysis, there are 138 valid cases and 4 independent variables. The ratio of cases to independent variables is 34.5 to 1, which satisfies the minimum requirement. In addition, the ratio of 34.5 to 1 satisfies the preferred ratio of 20 to 1.

SW388R7 Data Analysis & Computers II Slide 103

SAMPLE SIZE - 2
Prior Probabilities for Groups

WELFARE          Prior   Cases Used in Analysis: Unweighted   Weighted
1 TOO LITTLE     .409    56                                   56.000
2 ABOUT RIGHT    .358    49                                   49.000
3 TOO MUCH       .234    32                                   32.000
Total            1.000   137                                  137.000

In addition to the requirement for the ratio of cases to independent variables, discriminant analysis requires that there be a minimum number of cases in the smallest group defined by the dependent variable. The number of cases in the smallest group must be larger than the number of independent variables, and preferably contain 20 or more cases. The number of cases in the smallest group in this problem is 32, which is larger than the number of independent variables (4), satisfying the minimum requirement. In addition, the number of cases in the smallest group satisfies the preferred minimum of 20 cases.
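The two sample size rules on these slides reduce to a handful of comparisons. The sketch below simply restates them in Python; `check_sample_size` and its dictionary keys are hypothetical names, and the inputs (138 valid cases, 4 independent variables, 32 cases in the smallest group) are the figures quoted in the text.

```python
def check_sample_size(valid_cases, n_predictors, smallest_group):
    """Apply the ratio and smallest-group rules described on the slides."""
    ratio = valid_cases / n_predictors
    return {
        "ratio": ratio,
        "meets_minimum_ratio": ratio >= 5,             # 5 to 1 minimum
        "meets_preferred_ratio": ratio >= 20,          # 20 to 1 preferred
        "meets_minimum_group": smallest_group > n_predictors,
        "meets_preferred_group": smallest_group >= 20,
    }

result = check_sample_size(valid_cases=138, n_predictors=4, smallest_group=32)
for name, value in result.items():
    print(f"{name}: {value}")
```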

SW388R7 Data Analysis & Computers II Slide 104

NUMBER OF DISCRIMINANT FUNCTIONS - 1

The maximum possible number of discriminant functions is the smaller of one less than the number of groups defined by the dependent variable and the number of independent variables. In this analysis there were 3 groups defined by opinion about spending on welfare and 4 independent variables, so the maximum possible number of discriminant functions was 2.
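Stated as code, the rule is a single `min`. The function name below is a hypothetical label for this illustration.

```python
def max_discriminant_functions(n_groups, n_predictors):
    """Smaller of (groups - 1) and the number of independent variables."""
    return min(n_groups - 1, n_predictors)

# Problem 3: three welfare-opinion groups, four independent variables.
max_functions = max_discriminant_functions(n_groups=3, n_predictors=4)
print(max_functions)
```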

SW388R7 Data Analysis & Computers II Slide 105

NUMBER OF DISCRIMINANT FUNCTIONS - 2

In the table of Wilks' Lambda which tested functions for statistical significance, the stepwise analysis identified 2 discriminant functions that were statistically significant. The Wilks' lambda statistic for the test of functions 1 through 2 (chi-square=21.853) had a probability of 0.001 which was less than or equal to the level of significance of 0.05.

After removing function 1, the Wilks' lambda statistic for the test of function 2 (chi-square=7.074) had a probability of 0.029 which was less than or equal to the level of significance of 0.05. The significance of the maximum possible number of discriminant functions supports the interpretation of a solution using 2 discriminant functions.
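The reported probabilities can be checked by hand from the chi-square values. The sketch below uses the closed-form upper-tail probability of a chi-square variate with an even number of degrees of freedom. The degrees of freedom are an assumption, since the Wilks' Lambda table itself is not reproduced in the text: with p = 3 entered predictors and g = 3 groups, the usual values are p(g-1) = 6 for the test of both functions and (p-1)(g-2) = 2 for the test of function 2 alone.

```python
import math

def chi_square_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variate with even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k          # (x/2)^k / k!
        total += term
    return math.exp(-half) * total

# Assumed degrees of freedom: 6 and 2 (see the lead-in above).
p_functions_1_through_2 = chi_square_upper_tail_even_df(21.853, 6)
p_function_2 = chi_square_upper_tail_even_df(7.074, 2)

print(f"functions 1 through 2: p = {p_functions_1_through_2:.3f}")
print(f"function 2:            p = {p_function_2:.3f}")
```

Under those assumed degrees of freedom, both values round to the probabilities quoted on the slide, and both fall below the 0.05 level of significance.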

SW388R7 Data Analysis & Computers II Slide 106

Independent variables and group membership: relationship of functions to groups


In order to specify the role that each independent variable plays in predicting group membership on the dependent variable, we must link together the relationship between the discriminant functions and the groups defined by the dependent variable, the role of the significant independent variables in the discriminant functions, and the differences in group means for each of the variables.

Functions at Group Centroids

WELFARE   Function 1   Function 2
1         -.220        .235
2         .446         -.031
3         -.311        -.362

Unstandardized canonical discriminant functions evaluated at group means

Function 1 separates survey respondents who thought we spend about the right amount of money on welfare (the positive value of 0.446) from survey respondents who thought we spend too much (negative value of -0.311) or little money (negative value of -0.220) on welfare.

Function 2 separates survey respondents who thought we spend too little money on welfare (positive value of 0.235) from survey respondents who thought we spend too much money (negative value of -0.362) on welfare. We ignore the second group (-0.031) in this comparison because it was distinguished from the other two groups by function 1.
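Reading the centroid table amounts to asking, for each function, which groups sit on the positive side and which on the negative side. The sketch below does exactly that with the centroid values from the table; the helper name `split_by_sign` is hypothetical, and the sign split is only a reading aid, not the classification rule SPSS applies.

```python
# Group centroids (function 1, function 2) from the table above.
centroids = {
    "too little": (-0.220, 0.235),
    "about right": (0.446, -0.031),
    "too much": (-0.311, -0.362),
}

def split_by_sign(centroids, function_index):
    """Return (groups with positive centroid, groups with negative centroid)."""
    positive = [g for g, c in centroids.items() if c[function_index] > 0]
    negative = [g for g, c in centroids.items() if c[function_index] < 0]
    return positive, negative

function_1_split = split_by_sign(centroids, 0)
function_2_split = split_by_sign(centroids, 1)

print("function 1:", function_1_split)
print("function 2:", function_2_split)
```

On function 2 the "about right" group lands on the negative side only nominally (-0.031 is close to zero), which is consistent with the slide's choice to leave it out of the function 2 comparison.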

SW388R7 Data Analysis & Computers II Slide 107

Independent variables and group membership: which predictors to interpret


Variables Entered/Removed (a,b,c,d)

Step 1: Entered NUMBER OF HOURS WORKED LAST WEEK. Min. D Squared statistic = .023, between groups 1 and 3. Exact F statistic = .475, df1 = 1, df2 = 135.000, Sig. = .492.
Step 2: Entered R SELF-EMP OR WORKS FOR SOMEBODY. Min. D Squared statistic = .251, between groups 1 and 2. Exact F statistic = 3.289, df1 = 2, df2 = 134.000, Sig. = .040.
Step 3: Entered HIGHEST YEAR OF SCHOOL COMPLETED. Min. D Squared statistic = .364, between groups 1 and 3. Exact F statistic = 2.433, df2 = 133.000, Sig. = .068.

At each step, the variable that maximizes the Mahalanobis distance between the two closest groups is entered.
a. Maximum number of steps is 8. b. Maximum significance of F to enter is .05. c.

When we use the stepwise method of variable inclusion, we limit our interpretation of independent variable predictors to those listed as statistically significant in the table of Variables Entered/Removed. We will interpret the impact on membership in groups defined by the dependent variable by the independent variables:
number of hours worked in the past week
self-employment
highest year of school completed

Had we used simultaneous entry of all variables, we would not have imposed this limitation.
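The Exact F statistics in this table can be approximately recovered from the Min. D Squared values. The sketch below uses the standard F transformation of a two-group Mahalanobis D² with pooled within-groups covariance; that this is the formula SPSS applies is an assumption, although it reproduces the tabled values to within rounding. Group sizes (56, 50, 32; N = 138) are taken from the prior probabilities table on the by chance accuracy slide, and the function name is hypothetical.

```python
def f_from_d_squared(d_squared, n1, n2, n_total, n_groups, n_vars):
    """F statistic for the distance between two groups (assumed formula).

    F = [(N - g - p + 1) / (p (N - g))] * [n1 n2 / (n1 + n2)] * D^2
    with df1 = p and df2 = N - g - p + 1.
    """
    df1 = n_vars
    df2 = n_total - n_groups - n_vars + 1
    scale = df2 / (n_vars * (n_total - n_groups))
    f_stat = scale * (n1 * n2) / (n1 + n2) * d_squared
    return f_stat, df1, df2

group_sizes = {1: 56, 2: 50, 3: 32}
n_total = sum(group_sizes.values())

# (Min. D Squared, the two closest groups, variables in the model at that step)
steps = [
    (0.023, (1, 3), 1),
    (0.251, (1, 2), 2),
    (0.364, (1, 3), 3),
]

results = []
for d_squared, (a, b), n_vars in steps:
    f_stat, df1, df2 = f_from_d_squared(
        d_squared, group_sizes[a], group_sizes[b], n_total, 3, n_vars
    )
    results.append((f_stat, df1, df2))
    print(f"D2={d_squared:.3f}  F={f_stat:.3f}  df1={df1}  df2={df2}")
```

The small gap at step 1 comes from starting with D² rounded to three decimals; the df2 values match the table exactly.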

SW388R7 Data Analysis & Computers II Slide 108

Independent variables and group membership: predictor loadings on functions


We do not interpret loadings in the structure matrix unless they are 0.30 or higher.

Structure Matrix

Variable                            Function 1   Function 2
HIGHEST YEAR OF SCHOOL COMPLETED    .687*        .136
NUMBER OF HOURS WORKED LAST WEEK    -.582*       .345
R SELF-EMP OR WORKS FOR SOMEBODY    .223         .889*
RESPONDENTS INCOME (a)              .101         .292*

Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions. Variables ordered by absolute size of correlation within function.
*. Largest absolute correlation between each variable and any discriminant function
a. This variable not used in the analysis.

Based on the structure matrix, the predictor variables strongly associated with discriminant function 1 which distinguished between survey respondents who thought we spend about the right amount of money on welfare and survey respondents who thought we spend too much or little money on welfare were number of hours worked in the past week (r=-0.582) and highest year of school completed (r=0.687).

Based on the structure matrix, the predictor variable strongly associated with discriminant function 2 which distinguished between survey respondents who thought we spend too little money on welfare and survey respondents who thought we spend too much money on welfare was self-employment (r=0.889).
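The reading rules for the structure matrix (interpret only entered variables, assign each to the function where its loading is largest in absolute value, and require a loading of at least 0.30) can be sketched as follows. The loadings are copied from the table; the short variable names and the helper `assign_predictors` are hypothetical labels for this illustration.

```python
# Loadings on (function 1, function 2) from the structure matrix.
loadings = {
    "educ": (0.687, 0.136),
    "hrs1": (-0.582, 0.345),
    "wrkslf": (0.223, 0.889),
    "rincom98": (0.101, 0.292),
}

# Variables entered by the stepwise procedure; rincom98 was not used.
entered = {"hrs1", "wrkslf", "educ"}

def assign_predictors(loadings, entered, cutoff=0.30):
    """Map each function number to the entered variables that load on it."""
    assignment = {}
    for name, values in loadings.items():
        if name not in entered:
            continue
        best = max(range(len(values)), key=lambda i: abs(values[i]))
        if abs(values[best]) >= cutoff:
            assignment.setdefault(best + 1, []).append(name)
    return assignment

assignment = assign_predictors(loadings, entered)
print(assignment)
```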

SW388R7 Data Analysis & Computers II Slide 109

Independent variables and group membership: predictors associated with first function - 1
Group Statistics

WELFARE         Variable                            Mean    Std. Deviation   Valid N (listwise): Unweighted   Weighted
1 TOO LITTLE    NUMBER OF HOURS WORKED LAST WEEK    43.96   13.240           56                               56.000
                HIGHEST YEAR OF SCHOOL COMPLETED    13.73   2.401            56                               56.000
                R SELF-EMP OR WORKS FOR SOMEBODY    1.93    .260             56                               56.000
                RESPONDENTS INCOME                  13.70   5.034            56                               56.000
2 ABOUT RIGHT   NUMBER OF HOURS WORKED LAST WEEK    37.90   13.235           50                               50.000
                HIGHEST YEAR OF SCHOOL COMPLETED    14.78   2.558            50                               50.000
                R SELF-EMP OR WORKS FOR SOMEBODY    1.90    .303             50                               50.000
                RESPONDENTS INCOME                  14.00   5.503            50                               50.000
3 TOO MUCH      NUMBER OF HOURS WORKED LAST WEEK    42.03   10.456           32                               32.000
                HIGHEST YEAR OF SCHOOL COMPLETED    13.38   2.524            32                               32.000
                R SELF-EMP OR WORKS FOR SOMEBODY    1.75    .440             32                               32.000
                RESPONDENTS INCOME                  14.75   5.304            32                               32.000
Total           NUMBER OF HOURS WORKED LAST WEEK    41.32   12.846           138                              138.000
                HIGHEST YEAR OF SCHOOL COMPLETED    14.03   2.537            138                              138.000

The average number of hours worked in the past week for survey respondents who thought we spend about the right amount of money on welfare (mean=37.90) was lower than the average number of hours worked in the past week for survey respondents who thought we spend too little money on welfare (mean=43.96) and survey respondents who thought we spend too much money on welfare (mean=42.03).

This supports the relationship that "survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too little or much money on welfare."

SW388R7 Data Analysis & Computers II Slide 110

Independent variables and group membership: predictors associated with first function - 2
Group Statistics (rows for HIGHEST YEAR OF SCHOOL COMPLETED)

WELFARE         Mean    Std. Deviation   Valid N (listwise): Unweighted   Weighted
1 TOO LITTLE    13.73   2.401            56                               56.000
2 ABOUT RIGHT   14.78   2.558            50                               50.000
3 TOO MUCH      13.38   2.524            32                               32.000
Total           14.03   2.537            138                              138.000

The average highest year of school completed for survey respondents who thought we spend about the right amount of money on welfare (mean=14.78) was higher than the average highest year of school completed for survey respondents who thought we spend too little money on welfare (mean=13.73) and survey respondents who thought we spend too much money on welfare (mean=13.38).

This supports the relationship that "survey respondents who thought we spend about the right amount of money on welfare had completed more years of school than survey respondents who thought we spend too little or much money on welfare."

SW388R7 Data Analysis & Computers II Slide 111

Independent variables and group membership: predictors associated with second function
Group Statistics

                                                                          Valid N (listwise)
WELFARE        Variable                          Mean    Std. Deviation   Unweighted  Weighted
1 TOO LITTLE   NUMBER OF HOURS WORKED LAST WEEK  43.96   13.240           56          56.000
               HIGHEST YEAR OF SCHOOL COMPLETED  13.73   2.401            56          56.000
               R SELF-EMP OR WORKS FOR SOMEBODY  1.93    .260             56          56.000
               RESPONDENTS INCOME                13.70   5.034            56          56.000
2 ABOUT RIGHT  NUMBER OF HOURS WORKED LAST WEEK  37.90   13.235           50          50.000
               HIGHEST YEAR OF SCHOOL COMPLETED  14.78   2.558            50          50.000
               R SELF-EMP OR WORKS FOR SOMEBODY  1.90    .303             50          50.000
               RESPONDENTS INCOME                14.00   5.503            50          50.000
3 TOO MUCH     NUMBER OF HOURS WORKED LAST WEEK  42.03   10.456           32          32.000
               HIGHEST YEAR OF SCHOOL COMPLETED  13.38   2.524            32          32.000
               R SELF-EMP OR WORKS FOR SOMEBODY  1.75    .440             32          32.000
               RESPONDENTS INCOME                14.75   5.304            32          32.000
Total          NUMBER OF HOURS WORKED LAST WEEK  41.32   12.846           138         138.000
               HIGHEST YEAR OF SCHOOL COMPLETED  14.03   2.537            138         138.000

Since self-employment is a dichotomous variable, the mean is not directly interpretable. Its interpretation must take into account the coding by which 1 corresponds to self-employed and 2 corresponds to someone else. The lower mean for survey respondents who thought we spend too much money on welfare (mean=1.75), when compared to the mean for survey respondents who thought we spend too little money on welfare (mean=1.93), implies that the group contained more survey respondents who were self-employed and fewer survey respondents who were working for someone else.

This supports the relationship that "survey respondents who thought we spend too much money on welfare were more likely to be self-employed than survey respondents who thought we spend too little money on welfare."
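The interpretation of a mean for a variable coded 1/2 can be made concrete with a small Python sketch. It assumes the only valid codes are 1 (self-employed) and 2 (someone else), so that mean = 1*p + 2*(1 - p) = 2 - p, where p is the share coded 1. The function name is hypothetical; the means are those reported in the Group Statistics table.

```python
def share_coded_one(mean_of_1_2_variable):
    """Share of cases coded 1 for a variable whose only codes are 1 and 2.

    Assumption: no other codes are present (missing cases were dropped
    listwise), so mean = 2 - p and therefore p = 2 - mean.
    """
    return 2.0 - mean_of_1_2_variable


# Group means for self-employment taken from the Group Statistics table.
group_means = {"1 TOO LITTLE": 1.93, "2 ABOUT RIGHT": 1.90, "3 TOO MUCH": 1.75}

for group, mean in group_means.items():
    print(f"{group}: about {share_coded_one(mean):.0%} self-employed")
```

Under this coding assumption, a lower mean translates directly into a larger share of self-employed respondents, which is the comparison the slide makes between the "too much" and "too little" groups.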

SW388R7 Data Analysis & Computers II Slide 112

CLASSIFICATION USING THE DISCRIMINANT MODEL: by chance accuracy rate

The independent variables could be characterized as useful predictors of membership in the groups defined by the dependent variable if the cross-validated classification accuracy rate was significantly higher than the accuracy attainable by chance alone. Operationally, the cross-validated classification accuracy rate should be 25% or more higher than the proportional by chance accuracy rate.

The proportional by chance accuracy rate was computed by squaring and summing the proportion of cases in each group from the table of prior probabilities for groups (0.406² + 0.362² + 0.232² = 0.350).

Prior Probabilities for Groups

                         Cases Used in Analysis
WELFARE         Prior    Unweighted  Weighted
1 TOO LITTLE    .406     56          56.000
2 ABOUT RIGHT   .362     50          50.000
3 TOO MUCH      .232     32          32.000
Total           1.000    138         138.000
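The by chance computation can be reproduced from the group sizes in the prior probabilities table. The following is a minimal Python sketch, not SPSS output; the function names and the `margin` parameter are illustrative assumptions.

```python
# Unweighted group sizes from the "Prior Probabilities for Groups" table.
group_counts = {"1 TOO LITTLE": 56, "2 ABOUT RIGHT": 50, "3 TOO MUCH": 32}


def proportional_by_chance_accuracy(counts):
    """Square each group's proportion of cases and sum the squares."""
    total = sum(counts.values())
    return sum((n / total) ** 2 for n in counts.values())


def by_chance_criterion(counts, margin=0.25):
    """By chance accuracy raised by the 25% margin used in these slides."""
    return (1 + margin) * proportional_by_chance_accuracy(counts)


chance = proportional_by_chance_accuracy(group_counts)
criterion = by_chance_criterion(group_counts)
print(f"proportional by chance accuracy: {chance:.3f}")
print(f"criterion (1.25 x by chance):    {criterion:.3f}")
```

The cross-validated accuracy rate reported by SPSS is then compared against `criterion` rather than against the raw by chance rate.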

SW388R7 Data Analysis & Computers II Slide 113

CLASSIFICATION USING THE DISCRIMINANT MODEL: criteria for classification accuracy


Classification Results (b,c)

                                              Predicted Group Membership
                            WELFARE           1 TOO LITTLE  2 ABOUT RIGHT  3 TOO MUCH  Total
Original             Count  1 TOO LITTLE      43            15             6           64
                            2 ABOUT RIGHT     26            30             6           62
                            3 TOO MUCH        17            10             9           36
                            Ungrouped cases   3             3              2           8
                     %      1 TOO LITTLE      67.2          23.4           9.4         100.0
                            2 ABOUT RIGHT     41.9          48.4           9.7         100.0
                            3 TOO MUCH        47.2          27.8           25.0        100.0
                            Ungrouped cases   37.5          37.5           25.0        100.0
Cross-validated (a)  Count  1 TOO LITTLE      43            15             6           64
                            2 ABOUT RIGHT     26            30             6           62
                            3 TOO MUCH        17            11             8           36
                     %      1 TOO LITTLE      67.2          23.4           9.4         100.0
                            2 ABOUT RIGHT     41.9          48.4           9.7         100.0
                            3 TOO MUCH        47.2          30.6           22.2        100.0

a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
b. 50.6% of original grouped cases correctly classified.
c. 50.0% of cross-validated grouped cases correctly classified.

The cross-validated accuracy rate computed by SPSS was 50.0%, which was greater than or equal to the proportional by chance accuracy criteria of 43.7% (1.25 x 35.0% = 43.7%). The criteria for classification accuracy is satisfied.
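The accuracy rates in the footnotes can be recovered from the counts in the classification table. The sketch below is illustrative Python with hypothetical names; it assumes the grouped counts and the prior probabilities (.406, .362, .232) reported in the SPSS output for this problem.

```python
# Rows are actual groups, columns are predicted groups
# (1 TOO LITTLE, 2 ABOUT RIGHT, 3 TOO MUCH); ungrouped cases are excluded.
original_counts = [
    [43, 15, 6],
    [26, 30, 6],
    [17, 10, 9],
]
cross_validated_counts = [
    [43, 15, 6],
    [26, 30, 6],
    [17, 11, 8],
]


def accuracy_rate(confusion):
    """Correctly classified cases (the diagonal) divided by all grouped cases."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# By chance criterion: 1.25 times the sum of squared prior probabilities.
priors = [0.406, 0.362, 0.232]
criterion = 1.25 * sum(p ** 2 for p in priors)

cv_accuracy = accuracy_rate(cross_validated_counts)
print(f"original accuracy:        {accuracy_rate(original_counts):.1%}")
print(f"cross-validated accuracy: {cv_accuracy:.1%}")
print(f"by chance criterion:      {criterion:.1%}")
print("criterion satisfied:", cv_accuracy >= criterion)
```

Only the cross-validated rate is compared with the criterion, since the original rate is computed on the same cases that were used to derive the functions.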

SW388R7 Data Analysis & Computers II Slide 114

Answering the question in problem 3 - 1


From the list of variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], "highest year of school completed" [educ], and "income" [rincom98], the most useful predictors for distinguishing among groups based on responses to "opinion about spending on welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey respondents who thought we spend too much money on welfare from survey respondents who thought we spend about the right amount of money on welfare who, in turn, are differentiated from survey respondents who thought we spend too little money on welfare.

The stepwise discriminant analysis included the three variables identified as the most useful predictors.

The most important predictor of groups based on responses to opinion about spending on welfare was number of hours worked in the past week. The second most important predictor of groups based on responses to opinion about spending on welfare was self-employment. The third most important predictor of groups based on responses to opinion about spending on welfare was highest year of school completed. Survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too much or little money on welfare. Survey respondents who thought we spend about the right amount of money on welfare had completed more years of school than survey respondents who thought we spend too much or little money on welfare. Survey respondents who thought we spend too much money on welfare were more likely to be self-employed than survey respondents who thought we spend too little money on welfare.

SW388R7 Data Analysis & Computers II Slide 115

Answering the question in problem 3 - 2


From the list of variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], "highest year of school completed" [educ], and "income" [rincom98], the most useful predictors for distinguishing among groups based on responses to "opinion about spending on welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey respondents who thought we spend too much money on welfare from survey respondents who thought we spend about the right amount of money on welfare who, in turn, are differentiated from survey respondents who thought we spend too little money on welfare.

The most important predictor of groups based on responses to opinion about spending on welfare was number of hours worked in the past week. The second most important predictor of groups based on responses to opinion about spending on welfare was self-employment. The third most important predictor of groups based on responses to opinion about spending on welfare was highest year of school completed. Survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too much or little money on welfare. Survey respondents who thought we spend about the right amount of money on welfare had completed more years of school than survey respondents who thought we spend too much or little money on welfare. Survey respondents who thought we spend too much money on welfare were more likely to be self-employed than survey respondents who thought we spend too little money on welfare.

We found two statistically significant discriminant functions, making it possible to distinguish among the three groups defined by the dependent variable. Moreover, the cross-validated classification accuracy surpassed the by chance accuracy criteria, supporting the utility of the model.

SW388R7 Data Analysis & Computers II Slide 116

Answering the question in problem 3 - 3


From the list of variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], "highest year of school completed" [educ], and "income" [rincom98], the most useful predictors for distinguishing among groups based on responses to "opinion about spending on welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey respondents who thought we spend too much money on welfare from survey respondents who thought we spend about the right amount of money on welfare who, in turn, are differentiated from survey respondents who thought we spend too little money on welfare.

The most important predictor of groups based on responses to opinion about spending on welfare was number of hours worked in the past week. The second most important predictor of groups based on responses to opinion about spending on welfare was self-employment. The third most important predictor of groups based on responses to opinion about spending on welfare was highest year of school completed. Survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too much or little money on welfare. Survey respondents who thought we spend about the right amount of money on welfare had completed more years of school than survey respondents who thought we spend too much or little money on welfare. Survey respondents who thought we spend too much money on welfare were more likely to be self-employed than survey respondents who thought we spend too little money on welfare.

The order of importance matched the order of entry in the table of "Variables Entered/Removed."

SW388R7 Data Analysis & Computers II Slide 117

Answering the question in problem 3 - 4


The most important predictor of groups based on responses to opinion about spending on welfare was number of hours worked in the past week. The second most important predictor of groups based on responses to opinion about spending on welfare was self-employment. The third most important predictor of groups based on responses to opinion about spending on welfare was highest year of school completed. Survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too much or little money on welfare. Survey respondents who thought we spend about the right amount of money on welfare had completed more years of school than survey respondents who thought we spend too much or little money on welfare. Survey respondents who thought we spend too much money on welfare were more likely to be self-employed than survey respondents who thought we spend too little money on welfare.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

We verified that each statement about the relationship between predictors and groups was correct.

The answer to the question is true with caution. A caution is added because of the inclusion of ordinal level variables.

SW388R7 Data Analysis & Computers II Slide 118

Steps in discriminant analysis: level of measurement and initial sample size

The following is a guide to the decision process for answering problems about the basic relationships in discriminant analysis:

Dependent non-metric? Independent variables metric or dichotomous?
  No: Inappropriate application of a statistic
  Yes: continue to the next question

Ratio of cases to independent variables at least 5 to 1?
  No: Inappropriate application of a statistic
  Yes: continue to the next question

Number of cases in smallest group greater than number of independent variables?
  No: Inappropriate application of a statistic
  Yes: continue to the next step
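
The two sample-size questions in this guide can be sketched as a short Python check. The function name is hypothetical; the example values (group sizes 56, 50, and 32, and four independent variables) come from problem 3.

```python
def passes_minimum_sample_size(group_sizes, n_independent_vars, min_ratio=5):
    """Check the two minimum sample-size requirements from the guide.

    1. Ratio of cases to independent variables is at least min_ratio to 1.
    2. The smallest group has more cases than there are independent variables.
    """
    total_cases = sum(group_sizes)
    ratio_ok = total_cases / n_independent_vars >= min_ratio
    smallest_group_ok = min(group_sizes) > n_independent_vars
    return ratio_ok and smallest_group_ok


# Problem 3: 138 valid cases in three groups, four independent variables
# (hrs1, wrkslf, educ, rincom98).
group_sizes = [56, 50, 32]
print("ratio of cases to IVs:", sum(group_sizes) / 4)
print("minimum requirements met:", passes_minimum_sample_size(group_sizes, 4))
```

If either check fails, the guide classifies the problem as an inappropriate application of a statistic and the analysis stops there.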

SW388R7 Data Analysis & Computers II Slide 119

Steps in discriminant analysis: usable discriminant model

Run discriminant analysis, using method for including variables identified in the research question.

Sufficient statistically significant functions to distinguish DV groups?
  No: False
  Yes: continue to the next step

SW388R7 Data Analysis & Computers II Slide 120

Steps in discriminant analysis: relationships between IV's and DV

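Stepwise method of entry used to include independent variables?
  Yes: continue to the entry order question
  No: skip to the question about relationships between individual IVs and DV groups

Entry order of variables interpreted correctly?
  No: False
  Yes: continue to the next question

Relationships between individual IVs and DV groups interpreted correctly?
  No: False
  Yes: continue to the next step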

SW388R7 Data Analysis & Computers II Slide 121

Steps in discriminant analysis: classification accuracy

Cross-validated accuracy is 25% higher than proportional by chance accuracy rate?
  No: False
  Yes: continue to the next step

SW388R7 Data Analysis & Computers II Slide 122

Steps in discriminant analysis: adding cautions to solution

Satisfies preferred ratio of cases to IV's of 20 to 1?
  No: True with caution
  Yes: continue to the next question

Satisfies preferred DV group minimum size of 20 cases?
  No: True with caution
  Yes: continue to the next question

DV is non-metric level and IVs are interval level or dichotomous (not ordinal)?
  No: True with caution
  Yes: True
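
The caution checks can be summarized in a small Python sketch. It assumes the earlier steps have already been passed, so the only remaining outcomes are "True" and "True with caution". The function names and the `has_ordinal_iv` flag are illustrative assumptions, not part of SPSS.

```python
def collect_cautions(group_sizes, n_independent_vars, has_ordinal_iv):
    """Return the list of cautions that apply to an otherwise true answer."""
    cautions = []
    if sum(group_sizes) / n_independent_vars < 20:
        cautions.append("ratio of cases to IVs is below the preferred 20 to 1")
    if min(group_sizes) < 20:
        cautions.append("smallest DV group has fewer than the preferred 20 cases")
    if has_ordinal_iv:
        cautions.append("ordinal level variable treated as metric")
    return cautions


def final_answer(cautions):
    """Any caution downgrades the answer from True to True with caution."""
    return "True with caution" if cautions else "True"


# Problem 3: group sizes 56, 50, and 32, four independent variables, and
# ordinal level variables included (per the answer given for this problem).
cautions = collect_cautions([56, 50, 32], 4, has_ordinal_iv=True)
print(final_answer(cautions), cautions)
```

For problem 3 the sample-size preferences are both met, so the only caution raised is the inclusion of ordinal level variables, which agrees with the answer given for that problem.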
