f1 is the frequency of the modal class
f0 is the frequency of the class before the modal class in the frequency table
f2 is the frequency of the class after the modal class in the frequency table
h is the class interval of the modal class

THE CHI-SQUARE TEST

Introduction: The chi-square test is a statistical test that can be used to determine whether observed frequencies are significantly different from expected frequencies. For example, after we calculated expected frequencies for different allozymes in the HARDY-WEINBERG module we would use a chi-square test to compare the observed and expected frequencies and determine whether there is a statistically significant difference between the two. As in other statistical tests, we begin by stating a null hypothesis (H0: there is no significant difference between observed and expected frequencies) and an alternative hypothesis (H1: there is a significant difference). Based on the outcome of the chi-square test we will either reject or fail to reject the null hypothesis.

Importance: Chi-square tests enable us to compare observed and expected frequencies objectively, since it is not always possible to tell just by looking at them whether they are "different enough" to be considered statistically significant. Statistical significance in this case implies that the differences are unlikely to be due to chance alone, and instead may be indicative of other processes at work.

Question: How is the chi-square test used to compare samples or populations? What does a comparison of observed and expected frequencies tell us about these samples?

Variables:
$\chi^2$  the chi-square test statistic
o   observed count or frequency
e   expected count or frequency
n   total number of observations
RT  row total
CT  column total

Methods: Shaklee et al. (1993) collected data to study genetic variation within a species of fish called the barramundi perch (Lates calcarifer). Many fish species are composed of breeding groups called stocks, which are populations that are genetically distinct from one another. One of the goals of Shaklee et al.'s study was to identify individual stocks of the barramundi perch on the basis of significant genetic differentiation. Of the 25 collections examined, those that were not significantly genetically distinct from one another were considered to be from the same stock; collections that were genetically distinct were considered to be from different stocks. Understanding species subdivision into stocks has important implications for conservation and fisheries management, since maintaining the genetic diversity of the species as a whole will require conservation of the different stocks.

We'll use some of their data here to illustrate the application of a simple chi-square test. Below are data showing allele frequencies at seven loci for eight collections of perch from different parts of the Australian coast (table adapted from Shaklee et al. 1993; all errors due to rounding are mine).
Locus    Allele    #1    #2   #14   #15   #18   #21   #22   #25
EST-2*   *100+    249    78    97   115   101   242   128   116
         *98       26     4     0     1     2     0     2    30
         *95      126    41    60    60    52   226   125    70
ESTD*    *100+    390   120   155   176   171   465   335   210
         *114      15     4     0     0     0     9     2     6
mIDHP*   *100     387   123   152   167   152   474   333   216
         *78        0     0     5    10     4     1     0     0
sIDHP*   *100     354   113   111   137   143   432   310   177
         *121+     37     7    44    33    27    39    18    28
         *83        9     3     0     0     0     1     1     3
LDH-C*   *100     373   115   156   175   154   400   245   208
         *90+      29     9     1     1     1    75    25     5
PGDH*    *100     382   122   130   145   153   378   240   199
         *88+       5     2    21    18    16    95    89     3
PROT*    *100+    399   120   149   168   147   453   326   207
         *97        8     4     8     9     9    22     5     9

We can use the chi-square test to compare collections #1 and #25 at the EST-2* locus. The expected values are the allele frequencies we would expect if there were no difference between the two collections at this locus. We can calculate the expected allele frequencies using the row and column totals from a table of the observed frequencies for these two collections.

For the first cell (collection #1, allele *100+) we begin by calculating the probability of an observation being in the first row, regardless of column. To do this, take the row total (365) and divide it by n (617) (note that n changes depending on which locus and which pair of populations is being compared). Based on these two collections, the probability of a barramundi perch having the *100+ allele at the EST-2* locus is 0.5916 (365/617). Next, we calculate the probability of an observation being in the first column, regardless of row, by taking the column total (401) and dividing it by n (617). The probability of an observation coming from collection #1 as opposed to collection #25 is 0.6499 (401/617).

We have now determined the probability of a perch having a given allele at this locus, and the probability of being in a given collection. But what is the probability that an individual observation will have the *100+ allele at the EST-2* locus and be from collection #1?
The probability of two outcomes occurring together is called the joint probability. When the two outcomes are independent, as the null hypothesis assumes, it is calculated by multiplying the two separate probabilities: 0.5916 x 0.6499 = 0.3845. It follows that in a sample of 617 fish we would expect 617 x 0.3845 = 237 individuals to be from collection #1 and have the *100+ allele, and we have now calculated our expected value for the first cell in the table. This calculation can be simplified with the following formula:

$$e = \frac{RT}{n} \cdot \frac{CT}{n} \cdot n = \frac{RT \times CT}{n}$$

Verify that the other expected frequencies have been calculated correctly.

Observed frequencies
allele     #1    #25    RT
*100+     249    116    365
*98        26     30     56
*95       126     70    196
CT        401    216    n=617

Expected frequencies
allele     #1    #25    RT
*100+     237    128    365
*98        36     20     56
*95       127     69    196
CT        401    216    n=617

Note also that the row and column totals remain the same. Now we can use the chi-square test to compare the observed and expected frequencies. The chi-square test statistic is calculated with the following formula:

$$\chi^2 = \sum \frac{(o - e)^2}{e}$$

For each cell, the expected frequency is subtracted from the observed frequency, the difference is squared, and the result is divided by the expected frequency. The values are then summed across all cells. This sum is the chi-square test statistic. For the example here (cells listed column by column),

$$\chi^2 = 0.608 + 2.778 + 0.008 + 1.125 + 5.000 + 0.014 = 9.533$$

Interpretation: The critical value for the chi-square in this case ($\chi^2_{0.05,\,2}$) is 5.991. If the calculated chi-square value is equal to or greater than this critical value, then the probability of obtaining a value this large when the null hypothesis is true is 0.05 or less, a very small probability indeed! Our calculated value of 9.533 is greater than the critical value of 5.991. We therefore reject the null hypothesis, and conclude that there is a significant difference between the observed and expected frequencies of alleles at the EST-2* locus for these two collections of barramundi perch.
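The arithmetic in this example can be checked with a short script. The following is a minimal Python sketch, not part of the original module; the function names `expected_counts` and `chi_square` are illustrative assumptions. It rounds the expected frequencies to whole numbers, as the tables above do, so that it reproduces the value of 9.533.

```python
# Observed allele counts at the EST-2* locus (rows: alleles, columns: collections #1 and #25).
observed = [
    [249, 116],  # *100+
    [26, 30],    # *98
    [126, 70],   # *95
]


def expected_counts(table):
    """Expected count for each cell: e = RT * CT / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


def chi_square(obs, exp):
    """Sum of (o - e)^2 / e over all cells."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(obs, exp)
        for o, e in zip(obs_row, exp_row)
    )


expected = expected_counts(observed)
# The worked example rounds expected frequencies to whole fish before computing chi-square.
expected_rounded = [[round(e) for e in row] for row in expected]

chi2_rounded = chi_square(observed, expected_rounded)
chi2_exact = chi_square(observed, expected)

critical_value = 5.991  # chi-square critical value at the 0.05 level with 2 degrees of freedom
print(f"chi-square (rounded expected) = {chi2_rounded:.3f}")
print(f"chi-square (unrounded expected) = {chi2_exact:.3f}")
print("reject H0" if chi2_rounded >= critical_value else "fail to reject H0")
```

With unrounded expected frequencies the statistic comes out somewhat larger (about 10.2), so the conclusion is the same either way.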
(Critical values for the chi-square are determined from a statistical table based on the significance level at which the test is being performed [0.05 in our case] and a number called degrees of freedom [2 in this example], but the details are beyond the scope of this module.)

Conclusions: Our rejection of the null hypothesis allows us to conclude that the two collections of barramundi perch compared here are genetically distinct at the EST-2* locus. In other words, the frequencies of the three alleles at this locus are significantly different between the two populations. Using somewhat more complicated applications of the chi-square test, the authors concluded that the 25 collections they analyzed came from seven genetically distinct stocks, or populations, from adjacent stretches of the northeastern Australian coast. One of the goals of conservation and/or management is the preservation of genetic diversity within a species. Management decisions based on the assumption that a species' genetic variation is distributed across populations could have disastrous consequences for the future of the species if the populations are indeed genetically distinct. Techniques for identifying amounts and patterns of genetic variation within a species are critical tools for biologists.

Additional Questions:

1) Are the allele frequencies at the other six loci also significantly different between collections #1 and #25? (For loci with two alleles instead of three, the critical value of the chi-square is 3.841, but otherwise the procedure is the same.)

2) Use the chi-square test to compare allele frequencies for collections #14 and #15. Can you determine whether or not these two collections are from the same stock?

Sources:

Sokal, R. R. and F. J. Rohlf. 1995. Biometry, 3rd ed. W. H. Freeman and Company, New York, NY.

Rohlf, F. J. and R. R. Sokal. 1995. Statistical Tables, 3rd ed. W. H. Freeman and Company, New York, NY.

Shaklee, J. B., J. Salini, and R. N. Garrett.
1993. Electrophoretic characterization of multiple genetic stocks of barramundi perch in Queensland, Australia. Transactions of the American Fisheries Society 122:685-701.

copyright 1999 by M. Beals, L. Gross, and S. Harrell

Chi-Square Test for Independence

This lesson explains how to conduct a chi-square test for independence. The test is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables.

For example, in an election survey, voters might be classified by gender (male or female) and voting preference (Democrat, Republican, or Independent). We could use a chi-square test for independence to determine whether gender is related to voting preference. The sample problem at the end of the lesson considers this example.

When to Use Chi-Square Test for Independence

The test procedure described in this lesson is appropriate when the following conditions are met:

- The sampling method is simple random sampling.
- Each population is at least 10 times as large as its respective sample.
- The variables under study are each categorical.
- If sample data are displayed in a contingency table, the expected frequency count for each cell of the table is at least 5.

This approach consists of four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results.

State the Hypotheses

Suppose that Variable A has r levels, and Variable B has c levels. The null hypothesis states that knowing the level of Variable A does not help you predict the level of Variable B. That is, the variables are independent.
H0: Variable A and Variable B are independent.
Ha: Variable A and Variable B are not independent.

The alternative hypothesis is that knowing the level of Variable A can help you predict the level of Variable B.

Note: Support for the alternative hypothesis suggests that the variables are related; but the relationship is not necessarily causal, in the sense that one variable "causes" the other.

Formulate an Analysis Plan

The analysis plan describes how to use sample data to reject or fail to reject the null hypothesis. The plan should specify the following elements.

- Significance level. Often, researchers choose significance levels equal to 0.01, 0.05, or 0.10; but any value between 0 and 1 can be used.
- Test method. Use the chi-square test for independence to determine whether there is a significant relationship between two categorical variables.

Analyze Sample Data

Using sample data, find the degrees of freedom, expected frequencies, test statistic, and the P-value associated with the test statistic. The approach described in this section is illustrated in the sample problem at the end of this lesson.

Degrees of freedom. The degrees of freedom (DF) is equal to:

$$DF = (r - 1) \times (c - 1)$$

where r is the number of levels for one categorical variable, and c is the number of levels for the other categorical variable.

Expected frequencies. The expected frequency counts are computed separately for each level of one categorical variable at each level of the other categorical variable. Compute r * c expected frequencies, according to the following formula.

$$E_{r,c} = \frac{n_r \times n_c}{n}$$

where $E_{r,c}$ is the expected frequency count for level r of Variable A and level c of Variable B, $n_r$ is the total number of sample observations at level r of Variable A, $n_c$ is the total number of sample observations at level c of Variable B, and n is the total sample size.

Test statistic. The test statistic is a chi-square random variable ($\chi^2$) defined by the following equation.
$$\chi^2 = \sum \left[ \frac{(O_{r,c} - E_{r,c})^2}{E_{r,c}} \right]$$

where $O_{r,c}$ is the observed frequency count at level r of Variable A and level c of Variable B, and $E_{r,c}$ is the expected frequency count at level r of Variable A and level c of Variable B.

P-value. The P-value is the probability of observing a sample statistic at least as extreme as the test statistic, assuming the null hypothesis is true. Since the test statistic is a chi-square, use the Chi-Square Distribution Calculator to assess the probability associated with the test statistic. Use the degrees of freedom computed above.

Interpret Results

If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null hypothesis. Typically, this involves comparing the P-value to the significance level, and rejecting the null hypothesis when the P-value is less than the significance level.

Test Your Understanding of This Lesson

Problem

A public opinion poll surveyed a simple random sample of 1000 voters. Respondents were classified by gender (male or female) and by voting preference (Republican, Democrat, or Independent). Results are shown in the contingency table below.

                      Voting Preferences
                Republican   Democrat   Independent   Row total
Male               200          150          50          400
Female             250          300          50          600
Column total       450          450         100         1000

Is there a gender gap? Do the men's voting preferences differ significantly from the women's preferences? Use a 0.05 level of significance.

Solution

The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results. We work through those steps below:

State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis.

H0: Gender and voting preferences are independent.
Ha: Gender and voting preferences are not independent.

Formulate an analysis plan. For this analysis, the significance level is 0.05. Using sample data, we will conduct a chi-square test for independence.

Analyze sample data.
Applying the chi-square test for independence to sample data, we compute the degrees of freedom, the expected frequency counts, and the chi-square test statistic. Based on the chi-square statistic and the degrees of freedom, we determine the P-value.

$$DF = (r - 1) \times (c - 1) = (2 - 1) \times (3 - 1) = 2$$

$$E_{r,c} = \frac{n_r \times n_c}{n}$$

$$E_{1,1} = (400 \times 450) / 1000 = 180000/1000 = 180$$
$$E_{1,2} = (400 \times 450) / 1000 = 180000/1000 = 180$$
$$E_{1,3} = (400 \times 100) / 1000 = 40000/1000 = 40$$
$$E_{2,1} = (600 \times 450) / 1000 = 270000/1000 = 270$$
$$E_{2,2} = (600 \times 450) / 1000 = 270000/1000 = 270$$
$$E_{2,3} = (600 \times 100) / 1000 = 60000/1000 = 60$$

$$\chi^2 = \sum \left[ \frac{(O_{r,c} - E_{r,c})^2}{E_{r,c}} \right]$$

$$\chi^2 = \frac{(200 - 180)^2}{180} + \frac{(150 - 180)^2}{180} + \frac{(50 - 40)^2}{40} + \frac{(250 - 270)^2}{270} + \frac{(300 - 270)^2}{270} + \frac{(50 - 60)^2}{60}$$

$$\chi^2 = 400/180 + 900/180 + 100/40 + 400/270 + 900/270 + 100/60$$

$$\chi^2 = 2.22 + 5.00 + 2.50 + 1.48 + 3.33 + 1.67 = 16.2$$

where DF is the degrees of freedom, r is the number of levels of gender, c is the number of levels of the voting preference, $n_r$ is the number of observations from level r of gender, $n_c$ is the number of observations from level c of voting preference, n is the number of observations in the sample, $E_{r,c}$ is the expected frequency count when gender is level r and voting preference is level c, and $O_{r,c}$ is the observed frequency count when gender is level r and voting preference is level c.

The P-value is the probability that a chi-square statistic having 2 degrees of freedom is more extreme than 16.2. We use the Chi-Square Distribution Calculator to find $P(\chi^2 > 16.2) = 0.0003$.

Interpret results. Since the P-value (0.0003) is less than the significance level (0.05), we reject the null hypothesis. Thus, we conclude that there is a relationship between gender and voting preference.

Note: If you use this approach on an exam, you may also want to mention why this approach is appropriate.
Specifically, the approach is appropriate because the sampling method was simple random sampling, each population was at least 10 times as large as its respective sample, the variables under study were categorical, and the expected frequency count was at least 5 in each cell of the contingency table.
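The voter problem above can also be worked through in code. The following is a minimal Python sketch using only the standard library; the variable names are illustrative assumptions, not part of the lesson. In place of the Chi-Square Distribution Calculator it uses the closed form P(X > x) = exp(-x/2), which holds only for a chi-square variable with exactly 2 degrees of freedom.

```python
import math

# Observed counts from the poll: rows are gender, columns are voting preference
# in the order Republican, Democrat, Independent.
observed = [
    [200, 150, 50],  # Male
    [250, 300, 50],  # Female
]

row_totals = [sum(row) for row in observed]        # n_r
col_totals = [sum(col) for col in zip(*observed)]  # n_c
n = sum(row_totals)

# DF = (r - 1) * (c - 1)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# E[r][c] = (n_r * n_c) / n
expected = [[nr * nc / n for nc in col_totals] for nr in row_totals]

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# The shortcut below is valid only for 2 degrees of freedom.
if df != 2:
    raise ValueError("exp(-x/2) gives the P-value only when DF = 2")
p_value = math.exp(-chi2 / 2)

alpha = 0.05
print(f"DF = {df}, chi-square = {chi2:.2f}, P-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

For tables with other degrees of freedom, a statistical table or a library routine would be needed to obtain the P-value.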