Chapter 8: Chi-Square
CHAPTER OVERVIEW
Introduction
Assumptions
Goodness of fit test
χ2 test for independence
Summary
Key Terms
Chapter 1: Introduction
Chapter 2: Descriptive Statistics
Chapter 3: The Normal Distribution
Chapter 4: Hypothesis Testing
Chapter 5: T-test
Chapter 6: Oneway Analysis of Variance
Chapter 7: Correlation
Chapter 8: Chi-Square
This chapter introduces the chi-square test, a non-parametric statistical tool that
does not require the assumption of normality. The goodness of fit test determines
whether the observed frequencies match the expected frequencies. If the observed
frequencies differ a great deal from the expected frequencies, then it is likely
that there are significant differences between the groups.
Introduction
So far we have discussed inferential statistical tools such as the t-test and
ANOVA, which demand strict adherence to certain assumptions such as normality of
the population. When the assumptions of parametric tests are seriously violated,
you can use non-parametric techniques instead. These tests tend to be less
powerful than their parametric counterparts.
Also, in some situations you need to use non-parametric statistics because the
variables measured are not interval or ratio but categorical, such as religion,
ethnic origin, socioeconomic class, political preference and so forth. To examine
hypotheses using such variables, the chi-square test has been widely used. In this
chapter, we will discuss this popular non-parametric test, called the CHI-SQUARE
(pronounced as “kai-square”) and denoted by this symbol: χ2
Assumptions
Even though certain assumptions, such as normality, are not critical for using the
chi-square, you need to address a number of generic assumptions:
Size of Expected Frequencies ─ When the number of cells is less than 10 and
particularly when the total sample size is small, the lowest expected frequency
required for a chi-square test is 5. However, the observed frequencies can be
any value, including zero.
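The expected-frequency rule can be checked mechanically before running the test. The sketch below is only an illustration: the function name and the example frequencies are hypothetical, and the threshold of 5 is the one stated above.

```python
def meets_minimum_expected(expected_frequencies, minimum=5):
    """Return True if every expected frequency is at least `minimum`."""
    return all(value >= minimum for value in expected_frequencies)

# Hypothetical expected frequencies for two four-cell designs.
print(meets_minimum_expected([27.5, 27.5, 27.5, 27.5]))  # True
print(meets_minimum_expected([12.0, 8.0, 6.0, 4.0]))     # False
```

Only the expected frequencies are checked; the observed frequencies may take any value, including zero.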
The goodness of fit test enables us to find out whether a set of Obtained (or
Observed) Frequencies differs from a set of Expected Frequencies. Usually the
Expected Frequencies are the ones that we expect to find if the null hypothesis is
true. We compare our Observed Frequencies with the Expected Frequencies and see how
good the fit is.
EXAMPLE:
Working through the computations for this example will enable you to
understand how the One-Variable Χ2 Or Goodness-Of-Fit Test is used.
A sample of 110 teenagers was asked which of four handphone brands they preferred.
The number of teenagers choosing each brand is recorded in Table 8.1.
We want to find out if one or more brands are preferred over others. If they are
not, then we should expect roughly the same number of people in each category.
There will not be exactly the same number of people in each category, but the
numbers should be nearly equal.
Another way of saying this is: if the null hypothesis is TRUE, and no brand is
preferred more than the others, then all brands should be equally represented.
We expect roughly EQUAL NUMBERS IN EACH CATEGORY if the NULL HYPOTHESIS is TRUE.
Expected Frequencies
There are 110 people and four categories. If the null hypothesis is true, then
we should expect 110 / 4 = 27.5 teenagers to be in each category. This is because, if
all brands of handphones are equally popular, we would expect roughly equal
numbers of people in each category. In other words, the number of teenagers should
be evenly distributed among the four brands.
The numbers that we would find in the four categories if the null hypothesis
is true (i.e. all brands are equally popular) are called the EXPECTED
FREQUENCIES.
The numbers that we actually find in the four categories are called the OBSERVED
FREQUENCIES (i.e. based on the data we collected).
See Table 8.2. What χ2 does is to compare the Observed Frequencies with the
Expected Frequencies.
If all brands of handphones are equally popular, the Observed Frequencies will
not differ from the Expected Frequencies.
If the Observed Frequencies differ a great deal from the Expected Frequencies,
then it is likely that all four brands of handphones are not equally popular.
Table 8.2 shows the observed and expected frequencies for the four brands of
handphones. It is often difficult to tell just by looking at the data which is why you
have to use the χ2 test.
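[Table 8.2: Observed and Expected Frequencies for the four brands of handphones, with
columns for the difference, the squared difference, and the squared difference divided
by the Expected Frequency. Column 6 TOTAL = 53.65]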
Step 1:
Calculate the differences between the Expected Frequencies and Observed
Frequencies (see Column 4). Do not worry about the minus and plus signs!
Step 2:
Square the differences (see Column 5). Squaring removes the minus signs, so every
value becomes positive.
Step 3:
Divide each squared difference by the measure of variance (see Column 6). The
"measure of variance" is the Expected Frequency (i.e. 27.5). For Brand A it is
56.25 / 27.5 = 2.05. Do the same for the other brands.
Step 4:
Add up the figures you obtained in Column 6 and you get 53.65. So the χ2 is 53.65.
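Steps 1 to 4 can be sketched in Python. Table 8.1 is not reproduced here, so the observed counts below are assumed values: they are chosen to be consistent with the total of 110 teenagers, the 60 teenagers who chose Brand B, and the squared difference of 56.25 for Brand A. Treat them as illustrative rather than as the chapter's own data.

```python
# Observed counts are ASSUMED for illustration (Table 8.1 is not shown);
# they sum to 110 and agree with the figures quoted in the steps above.
observed = {"A": 20, "B": 60, "C": 20, "D": 10}

total = sum(observed.values())          # 110 teenagers
expected = total / len(observed)        # 110 / 4 = 27.5 per brand

chi_square = 0.0
for brand, count in observed.items():
    difference = count - expected       # Step 1: observed minus expected
    squared = difference ** 2           # Step 2: squaring removes the sign
    contribution = squared / expected   # Step 3: divide by expected frequency
    chi_square += contribution          # Step 4: running total
    print(f"Brand {brand}: diff={difference:+.1f}, "
          f"squared={squared:.2f}, contribution={contribution:.2f}")

print(f"chi-square = {chi_square:.2f}")
```

Summing the unrounded contributions gives 53.64. Adding the column values after rounding each one to two decimals gives 53.65, the figure reported in Step 4.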
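The FORMULA for the χ2 which you calculated above is as follows:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where O is the Observed Frequency and E is the Expected Frequency for each category.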
Step 5:
The degrees of freedom (DF) is one less than the number of categories. In this case
DF is 4 categories – 1 = 3. We need to know this because it is usual to report the
DF along with the χ2 and the associated probability level.
SPSS Output
HANDPHONES
Chi-Square 53.636
Df 3
Asymp. Sig. .0000
The χ2 value of 53.65 (rounded to 53.6) is compared with the value that would be
expected for a χ2 with 3 DF if the null hypothesis were true (i.e. all brands of
handphones are preferred equally). SPSS computes this comparison. The SPSS Output
shows that, for this χ2 value, the associated probability value is less than 0.0001.
This means that a difference this large would be very unlikely to arise by chance if
the null hypothesis were true. We can conclude that there is a significant difference
between the Observed and Expected Frequencies; i.e. the four brands of handphones are
not equally popular. More people prefer brand B (60) than the other handphone brands.
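The comparison that SPSS performs can be approximated by hand. The sketch below uses only Python's standard library and the closed-form upper-tail probability of the chi-square distribution, which in this form holds for 3 degrees of freedom only. The function name is a hypothetical helper, not part of SPSS or of any library.

```python
import math

def chi_square_p_value_df3(chi_square):
    """Upper-tail probability of a chi-square variable with 3 DF.

    Closed form valid for DF = 3 only:
    P(X > x) = erfc(sqrt(x / 2)) + sqrt(2x / pi) * exp(-x / 2)
    """
    x = chi_square
    tail = math.erfc(math.sqrt(x / 2))
    correction = math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    return tail + correction

p_value = chi_square_p_value_df3(53.65)   # chi-square value from Step 4
print(f"p = {p_value:.2e}")
if p_value < 0.05:
    print("Reject the null hypothesis: the brands are not equally popular.")
else:
    print("Do not reject the null hypothesis.")
```

The resulting probability is far below 0.0001, which is why SPSS displays it as a string of zeros.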
EXAMPLE:
Say for example you ask 110 students the following questions:
How many of you smoke and are active in sports?
The other primary use of the chi-square test is to examine whether two
variables are independent or not. What does it mean to be independent? It means that
the two factors are not related. Typically in educational research, we are interested
in finding factors that are related, for example education and income, occupation and
prestige, or age and job satisfaction. In such cases, the chi-square can be used to
assess whether two variables are independent or not.
More generally, we say that variable Y is "not correlated with" or
"independent of" variable X if more of one is not associated with more of another. If
two categorical variables are correlated their values tend to move together, either in
the same direction or in the opposite.
Example
A researcher is interested in finding out whether male students from high income or
low income families get into trouble more often in school. Table 8.4 documents the
percentage of high income and low income students who have discipline problems in
school:
To examine statistically whether one income group gets into trouble in school more
often, we need to frame the question in terms of hypotheses.
Step 1: Establish hypotheses
The first step of the chi-square test for independence is to establish hypotheses. The
null hypothesis is that the two variables are independent or, in this particular case,
that the likelihood of getting into discipline problems is the same for high income and
low income students. The alternative hypothesis to be tested is that the likelihood of
getting into discipline problems is not the same for high income and low income
students.
It is important to keep in mind that the chi-square test only tests whether two
variables are independent. It cannot address questions of which is greater or less.
Using the chi-square test, we cannot evaluate directly the hypothesis that low income
students get in trouble more than high income students; rather, the test (strictly
speaking) can only test whether the two variables are independent or not.
Step 2: Calculate the expected value for each cell of the table
As with the goodness-of-fit example described earlier, the key idea of the chi-square
test for independence is a comparison of observed and expected values. How many of
something were expected and how many were observed in some process? In the case
of tabular data, however, we usually do not know what the distribution should look
like. Rather, in this use of the chi-square test, expected values are calculated based on
the row and column totals from the table.
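The expected value for each cell of the table can be calculated using the following
formula:
$$\text{Expected Frequency} = \frac{\text{Row total} \times \text{Column total}}{\text{Grand total}}$$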
For example, in the table comparing the percentage of high income and low income
students involved in discipline problems, the expected count for the number of low
income students with discipline problems (Cell A) is:
Expected Frequency (E1) = (117 x 83) / 237 = 40.97
Similarly, for the cell with a row total of 120 and a column total of 154:
Expected Frequency (E4) = (120 x 154) / 237 = 77.97
Use the formula and compute the Expected Frequencies for E2 and E3. Table 8.5
shows the completed expected frequencies for all the four cells.
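The expected frequencies for all four cells can be sketched from the row and column totals alone. The totals 117, 83, 120 and 154 and the grand total 237 come from the worked computation above; the row and column labels are assumptions, inferred from the description of Cell A.

```python
# Totals taken from the worked example; labels are assumed from the text.
row_totals = {"low income": 117, "high income": 120}
column_totals = {"discipline problems": 83, "no discipline problems": 154}
grand_total = sum(row_totals.values())   # 237 students

# Expected frequency = (row total x column total) / grand total
expected = {
    (row, col): row_total * col_total / grand_total
    for row, row_total in row_totals.items()
    for col, col_total in column_totals.items()
}

for (row, col), value in expected.items():
    print(f"{row:11s} | {col:22s} | {value:.2f}")
```

The four expected frequencies add up to the grand total of 237, which is a quick check on the arithmetic.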
$$\text{Chi-square} = \text{Sum of } \frac{(\text{Observed Frequency} - \text{Expected Frequency})^2}{\text{Expected Frequency}}$$
a) Degrees of Freedom
Before we can proceed we need to know how many degrees of freedom we have.
When a comparison is made between one sample and another, a simple rule is that the
degrees of freedom equal (number of columns minus one) x (number of rows minus
one), not counting the totals for rows or columns. The table here has two rows and
two columns, so the degrees of freedom are (2 - 1) x (2 - 1) = 1.
b) Statistical Significance
We now have our chi square statistic (χ2 = 1.87), our predetermined alpha
level of significance (0.05), and our degrees of freedom (df =1). Entering the
Chi square distribution table with 1 degree of freedom and reading along the
row we find our value of χ2 = 1.87 is below 3.841 (see Table 8.6).
When the computed χ2 statistic is less than the critical value in the table for a
0.05 probability level, then we DO NOT reject the null hypothesis of equal
distributions.
Since our χ2 = 1.87 statistic is less than the critical value for 0.05 probability
level (3.841) we DO NOT reject the null hypothesis and conclude that
students from low income families are NOT SIGNIFICANTLY more likely to
have discipline problems than students from high income families.
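The whole test for independence can be sketched end to end. Table 8.4 is not reproduced here, so the observed counts below are assumed values, reconstructed to agree with the row totals (117 and 120), the column total of 83, and the reported χ2 of 1.87. The probability uses the closed form for 1 degree of freedom, and no continuity correction is applied.

```python
import math

# ASSUMED observed counts (Table 8.4 is not shown). Rows: low income,
# high income. Columns: discipline problems, no discipline problems.
observed = [[46, 71],
            [37, 83]]

row_totals = [sum(row) for row in observed]
column_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * column_totals[j] / grand_total
        chi_square += (count - expected) ** 2 / expected

# (number of rows - 1) x (number of columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed-form upper-tail probability, valid for 1 degree of freedom only.
p_value = math.erfc(math.sqrt(chi_square / 2))

CRITICAL_VALUE = 3.841  # 0.05 level with 1 DF, as quoted in the text
print(f"chi-square = {chi_square:.2f}, df = {df}, p = {p_value:.3f}")
if chi_square > CRITICAL_VALUE:
    print("Reject the null hypothesis of independence.")
else:
    print("Do not reject the null hypothesis of independence.")
```

Comparing the statistic with the critical value and comparing the probability with 0.05 lead to the same decision; the table lookup is simply the manual version of the probability calculation.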
Select a row variable and click on > button to move the variable into the
Row(s): box
Select a column variable and click on the > button to move the variable into
the Column(s): box
Click on Continue
In the Counts box, click on the Observed and Expected check boxes
In the Percentages box, click on the Row, Column and Total check boxes
LEARNING ACTIVITY
         Yes    No    Total
Urban     36    14       50
Rural     30    25       55
Total     66    39      105
SUMMARY
Goodness of fit test enables us to find out whether a set of Obtained (or
Observed) Frequencies differs from a set of Expected Frequencies.
If the Observed Frequencies differ a great deal from the Expected Frequencies,
then it is likely that there are significant differences.
The degrees of freedom (DF) is one less than the number of categories.
The chi-square test is used to examine whether two variables are independent
or not; i.e. whether the two factors are not related.
KEY WORDS:
Goodness of fit
Chi-square
Test of independence
Observed frequencies
Expected frequencies
Row total
Column total
Degrees of freedom
--------000--------