
UNIT 7 APPLICATIONS OF CHI-SQUARE IN PROBLEMS WITH CATEGORICAL DATA

Structure

7.1 Introduction
Objectives
7.2 Goodness-of-fit
7.3 Test of Independence
7.4 Summary
7.5 Solutions to Exercises
Appendix : Multinomial Distribution

7.1 INTRODUCTION
In this block, you have already studied several problems of testing of hypotheses.
The tests that you have studied so far relate to problems where the sample data
have been obtained from a continuous distribution, for example, the normal
distribution. In practice, however, one often obtains data in which the sampled
"observations" are classified into classes according to one or more attributes. For
example, a sample of flowers can be classified according to their colour -- some
of the flowers in the sample could be white, the others could be purple. Again,
suppose it is claimed that a vaccine controls a disease. To 'verify' the truth of this
claim, a sample of N individuals is taken and these individuals can be classified
according to two attributes: inoculated or not inoculated, and affected or not
affected by the disease.

When the sampled data are classified according to one or more attributes, we say
that we have a set of categorical data. How do we tackle the inference problems
arising out of categorical data? In this unit we shall discuss the use of one of the
most widely used tests, the chi-square test, in this context.

To start with, in Sec.7.2, we shall consider the use of the chi-square test in
"goodness-of-fit" problems. Then, in Sec.7.3, we shall see how the chi-square
test can help us compare two features of a population to see if there is any
relationship between them or not. In other words, we test to see if the features
occur independent of each other or not.

While studying this unit, please keep comparing the situations in this unit and
Units 5 and 6 to really understand the difference in the questions being asked and
answered.

Objectives

After studying this unit, you should be able to

define categorical data;
identify inference problems associated with categorical data;
use the chi-square test for solving some inference problems arising in
categorical data.
Statistical Inference
7.2 GOODNESS-OF-FIT
Let us begin by trying to solve Ms. Dalta's problem. She is the marketing director
of a company that sells four types of steel almirahs. As part of her duties, she has
to make sure that there is no loss of sales due to less stock availability. So far she
has been ordering new cupboards assuming that the demand for all four types is
the same.

Recently, however, the stock inventories have become more difficult to control.
Therefore, Ms.Dalta feels that she should check whether her hypothesis of
uniform demand is valid or not.

Can you apply any of the methods you have studied so far for helping Ms.Dalta?
There is no parameter that she is estimating and no assumption regarding the
distribution of the population. So Ms.Dalta needs to look for some new tools.
What she needs to do is to test the hypothesis :
$H_0$ : The demand is uniform for all four types of almirahs
against
$H_A$ : The demand is not uniform for all four types of almirahs.
(Here $H_0$ is the null hypothesis, and $H_A$ is the alternative hypothesis.)

For doing this, she selects a sample of 80 almirahs sold over the past few months.
Ms. Dalta assumes that the demand is uniform. So the probability of an almirah of
Type i being bought is the same, for i = 1, 2, 3, 4. If we denote this probability by
$p_i$, then $p_1 = p_2 = p_3 = p_4 = \frac{1}{4}$. So, if the demand is uniform, she can expect
$80 \times \frac{1}{4} = 20$ almirahs of each type to be sold. But the observed sales of each
type are 23, 19, 18 and 20, respectively. Her problem is to see how well her
hypothesis of uniform demand fits the observed sales. In other words, how can
this data set be used for testing $H_0$?

More generally, suppose a sample of n individuals is classified into k classes.
Suppose the number of individuals falling in the ith class is $O_i$ (i = 1, ..., k).
The problem of "goodness-of-fit" consists in testing the hypothesis, $H_0$, that the
probability of an individual falling in the ith class is $p_i$ (i = 1, ..., k), where
$\sum_{i=1}^{k} p_i = 1$. In other words, the hypothesis $H_0$ to be tested is that the number of
individuals (in a sample of size n) falling in the ith class is $np_i$ (i = 1, ..., k). This
is to be tested against the hypothesis $H_A$, that $H_0$ is not true.

(Note that $\chi^2_{k-1}$ is denoted by '$\chi^2$ with k-1 degrees of freedom' in Unit 4.)

So, in this general situation, the "expected" number of individuals falling in the ith
class is $E_i = np_i$ (i = 1, ..., k). Note that these "expected" numbers are known to
us because to start with we assume $H_0$ and calculate them. That is, we assume that
the probability of an individual falling in the ith class is $p_i$ (i = 1, ..., k). Based on
these numbers $O_1, O_2, ..., O_k$, $E_1, E_2, ..., E_k$, there is a way of testing the validity
of $H_0$. Let us see what this method is.

Let
$$U = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}.$$
This statistic U, under some mild conditions, is known to have an approximate
chi-square distribution with k - 1 degrees of freedom, where k is the number of
classes. (U is also called the sample $\chi^2$ value, or the observed value of $\chi^2$
for the data.) If we want to test the hypothesis $H_0$ at the $\alpha$ level of
significance, then we need to find $\chi^2_{\alpha,k-1}$ from the standard $\chi^2$
distribution tables (given at the end of this block). If $U > \chi^2_{\alpha,k-1}$, we reject $H_0$.
Otherwise we do not reject $H_0$.
To see how this test works, let us consider Ms. Dalta's data, presented in Table 1.

Table 1
Type of almirah    Observed sales ($O_i$)    Expected sales ($E_i = np_i = 80 \times \frac{1}{4}$)
I                  23                        20
II                 19                        20
III                18                        20
IV                 20                        20

Here
$$U = \frac{(O_1-E_1)^2}{E_1} + \frac{(O_2-E_2)^2}{E_2} + \frac{(O_3-E_3)^2}{E_3} + \frac{(O_4-E_4)^2}{E_4}$$
$$= \frac{(23-20)^2}{20} + \frac{(19-20)^2}{20} + \frac{(18-20)^2}{20} + \frac{(20-20)^2}{20}$$
$$= \frac{9 + 1 + 4 + 0}{20} = 0.7.$$
Here k = 4. If Ms. Dalta wants to test $H_0$ at a 5% level of significance, $\alpha$ = 0.05.
Now, $\chi^2_{0.05,3}$ = 7.815. Since $U < \chi^2_{0.05,3}$, Ms. Dalta does not reject $H_0$. In other
words, Ms. Dalta concludes that the data are consistent with uniform demand for
the four types of almirahs.
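
The calculation above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `chi_square_statistic` is our own, and the critical value 7.815 is simply copied from the chi-square table rather than computed.

```python
def chi_square_statistic(observed, expected):
    """Return U = sum of (O_i - E_i)^2 / E_i over all classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Ms. Dalta's data: observed sales of the four types of almirahs.
observed_sales = [23, 19, 18, 20]
n = sum(observed_sales)              # sample size, 80
expected_sales = [n * 0.25] * 4      # uniform demand: p_i = 1/4 for each type

u = chi_square_statistic(observed_sales, expected_sales)

# Critical value for alpha = 0.05 and k - 1 = 3 degrees of freedom,
# read from the chi-square table (not computed here).
critical_value = 7.815
reject_h0 = u > critical_value

print(f"U = {u:.2f}, reject H0: {reject_h0}")
```

The same function works for any number of classes, as long as the expected frequencies are computed under the null hypothesis first.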

Another example may help you to see how this test works.

Example 1 (Experiment on the breeding of flowers of a certain species) :


Jaswant is interested in breeding flowers of a certain species. The experimental
breeding can result in four possible types of flowers :
(a) magenta flowers with a green stigma (MG),
(b) magenta flowers with a red stigma (MR),
(c) red flowers with a green stigma (RG),
(d) red flowers with a red stigma(RR).

According to the well-known Mendel's law, these four kinds of flowers should
come out in the ratio 9 : 3 : 3 : 1. Jaswant found that under her experiment, out of
160 flowers that bloomed, the number of flowers with types MG, MR, RG and RR
were 84, 35, 28 and 13, respectively. She wants to find out whether these data
are compatible with Mendel's law or not.

If they are compatible, then the probabilities of each of these types blooming are
$p_1 = \frac{9}{16}$, $p_2 = \frac{3}{16}$, $p_3 = \frac{3}{16}$ and $p_4 = \frac{1}{16}$. So Jaswant wants to test the hypothesis

$H_0$ : The distribution of the flower types is multinomial with
$p_1 = \frac{9}{16}$, $p_2 = \frac{3}{16}$, $p_3 = \frac{3}{16}$, $p_4 = \frac{1}{16}$
against
$H_A$ : $H_0$ is not true, that is, the distribution is not multinomial with the specified
probabilities.

(See the appendix to the unit for a brief introduction to the multinomial distribution.)

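Jaswant's data can be presented as shown in Table 2.

Table 2
Flower type    Observed number ($O_i$)    Expected number ($E_i = np_i$)
MG             84                         90
MR             35                         30
RG             28                         30
RR             13                         10

Here k = 4, and
$$U = \frac{(84-90)^2}{90} + \frac{(35-30)^2}{30} + \frac{(28-30)^2}{30} + \frac{(13-10)^2}{10} = 0.400 + 0.833 + 0.133 + 0.900 \approx 2.27.$$
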
Jaswant needs to compare this value with the appropriate critical $\chi^2$-value. She
takes the significance level of the test as $\alpha$ = 0.05. Also, in this case, since the
number of classes is 4, the degrees of freedom are 4 - 1 = 3. So, she finds
$\chi^2_{0.05,3}$, which is 7.81. Since $U = 2.27 < 7.81 = \chi^2_{0.05,3}$,
she does not reject $H_0$.
Thus, Jaswant concludes that her data are compatible with Mendel's law.
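
When the null hypothesis is stated as a ratio, as in Mendel's 9 : 3 : 3 : 1, the expected frequencies follow directly from the ratio. The sketch below (with a helper name, `expected_from_ratio`, of our own choosing) uses exact fractions so that no rounding enters before the final comparison. The critical value 7.81 is taken from the table, as in the text.

```python
from fractions import Fraction


def expected_from_ratio(ratio, n):
    """Expected frequencies n * p_i, where p_i = ratio_i / sum(ratio)."""
    total = sum(ratio)
    return [n * Fraction(r, total) for r in ratio]


# Jaswant's data for the flower types MG, MR, RG, RR.
observed = [84, 35, 28, 13]
expected = expected_from_ratio([9, 3, 3, 1], sum(observed))

# Sample chi-square value, kept as an exact fraction.
u = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical_value = 7.81  # alpha = 0.05, 3 degrees of freedom, from the table
print([int(e) for e in expected], float(u), float(u) > critical_value)
```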

In the two situations above, the hypothetical probabilities $p_1, p_2, ...$ were known to
us from before because of the type of assumption $H_0$ was. However, in some
problems, these probabilities may have to be estimated from the data itself. The
following example illustrates this.

Example 2 : A consultant was employed by a city council to study the pattern of
bus arrivals and departures at a very busy interstate bus terminus. Since many
arrival processes fit the Poisson distribution, she decided to test the following
hypothesis :

$H_0$ : The arrivals are distributed as a Poisson random variable,
against
$H_A$ : The arrivals are not Poisson distributed.

She sampled the number of arrivals in 200 minutes. Then she grouped the arrivals
into k = 6 categories, and noted her observations, as shown in Column 2 of Table
3 below.

However, since the parameter of the Poisson distribution is unspecified in the
hypothesis, the consultant needed to estimate this from the data itself. For this
she first computed the sample mean as $\bar{x}$ = 2.96.

So, she estimated the parameter of the Poisson distribution as $\hat{\lambda}$ = 2.96.

With this value of $\hat{\lambda}$, she computed the Poisson probabilities for the different
classes from the tables (which are also provided at the end of this block). These
are shown in Column 3 of the table below.

Table 3 : Arrivals at ISBT

No. of        Observed              Prob. according to       Expected
arrivals      frequencies $O_i$     Poisson dist. $p_i$      frequencies $E_i$ (= $np_i$)
0             10                    0.0524                   10.48
1             23                    0.1545                   30.90
2             45                    0.2277                   45.54
3             49                    0.2238                   44.76
4             32                    0.1651                   33.02
5 or more     41                    0.1765                   35.30
Total         200                   1.0000                   200.00

According to her data,
$$U = \sum_{i=1}^{6} \frac{(O_i - E_i)^2}{E_i} = 3.402.$$

Here k = 6 but one parameter has been estimated. So, the degrees of freedom
associated with the chi-square distribution is (k - 1) - 1 = k - 2 = 4. The critical
value of chi-square at 4 degrees of freedom and 1 percent level of significance is
13.277. Since 3.402 < 13.277, the consultant did not reject the null hypothesis. In
other words, she was in a position to conclude that the arrivals at
the bus terminus were Poisson distributed.

(To find $p_i$ for $\hat{\lambda}$ = 2.96, we take the average of the values in the columns
of the Poisson table corresponding to 2.9 and 3.0, respectively. Thus, the probability
of 0 arrivals is $\frac{0.055 + 0.0498}{2} = 0.0524$.)
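
As a rough check on Example 2, the Poisson class probabilities can also be computed directly from the formula $P[X = x] = e^{-\lambda}\lambda^x / x!$ instead of averaging table columns. The sketch below is an illustration only: the helper names are our own, and the directly computed probabilities differ slightly from the table values in Table 3, so the resulting U differs slightly from 3.402 as well. The decision is unchanged.

```python
import math


def poisson_class_probs(lam, last):
    """P(X = 0), ..., P(X = last - 1) and P(X >= last) for Poisson(lam)."""
    probs = [math.exp(-lam) * lam ** x / math.factorial(x) for x in range(last)]
    probs.append(1.0 - sum(probs))   # the open-ended class "last or more"
    return probs


def chi_square_from_probs(observed, probs):
    """U = sum (O_i - n p_i)^2 / (n p_i)."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))


observed = [10, 23, 45, 49, 32, 41]        # 0, 1, 2, 3, 4, "5 or more" arrivals
lam_hat = 2.96                             # estimated from the sample mean

table_probs = [0.0524, 0.1545, 0.2277, 0.2238, 0.1651, 0.1765]  # Table 3
exact_probs = poisson_class_probs(lam_hat, last=5)

u_table = chi_square_from_probs(observed, table_probs)
u_exact = chi_square_from_probs(observed, exact_probs)

# One parameter (lambda) was estimated, so d.f. = (k - 1) - 1.
df = len(observed) - 1 - 1
critical_value = 13.277                    # alpha = 0.01, 4 d.f., from the table

print(round(u_table, 3), round(u_exact, 3), df, u_exact > critical_value)
```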
Let us now look at a problem involving the normal distribution. While solving it, the
following very important point about applying the $\chi^2$-test will show up.

Remark 1 : If, corresponding to a category, say j, the expected value $E_j$ is small,
i.e., less than 5, then the chi-square approximation for the distribution of U will
not be good. So, if the condition $E_i \ge 5$ is not satisfied for all i, then we should
combine the category j with $E_j < 5$ with its adjacent categories j + 1, j + 2, ..., j + r,
where $E_j + E_{j+1} + ... + E_{j+r} \ge 5$ but $E_j + E_{j+1} + ... + E_{j+r-1} < 5$. The number of
classes, accordingly, gets reduced by r.

This remark will become more clear as you study the solution of Problem 1.
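
The pooling rule of Remark 1 can be sketched as a small routine. This is one possible reading of the rule, not the only one: classes are scanned from the first to the last and merged with their successors until the pooled expected value reaches 5, and a leftover small group at the end is merged into the class before it. The function name and the numbers in the example are hypothetical and chosen only for illustration.

```python
def pool_small_classes(observed, expected, minimum=5.0):
    """Merge adjacent classes until every pooled expected value is >= minimum."""
    pooled_obs, pooled_exp = [], []
    run_obs, run_exp, pending = 0, 0.0, 0
    for o, e in zip(observed, expected):
        run_obs += o
        run_exp += e
        pending += 1
        if run_exp >= minimum:
            # The running group is large enough: close it.
            pooled_obs.append(run_obs)
            pooled_exp.append(run_exp)
            run_obs, run_exp, pending = 0, 0.0, 0
    if pending:
        # Trailing classes that never reached the minimum join the last group.
        if pooled_exp:
            pooled_obs[-1] += run_obs
            pooled_exp[-1] += run_exp
        else:
            pooled_obs.append(run_obs)
            pooled_exp.append(run_exp)
    return pooled_obs, pooled_exp


# Hypothetical frequencies: the first and last classes have E < 5.
observed = [1, 6, 12, 15, 9, 5, 2]
expected = [2.0, 5.5, 12.0, 13.5, 9.5, 5.5, 2.0]

pooled_obs, pooled_exp = pool_small_classes(observed, expected)
print(pooled_obs, pooled_exp)
```

Here seven classes reduce to five, so the degrees of freedom must be counted from the five pooled classes.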
Problem 1 : A chemical company wants to know if its sales of a liquid chemical
are normally distributed. This information will help them in planning and
controlling the inventory. The sales record for a random sample of 200 days is
given in Table 4.

Table 4
Sales (in 1000 litres)    Observed frequency ($O_i$)
Less than 34.0            0
34.0-35.5                 13
35.5-37.0                 20
37.0-38.5                 35
38.5-40.0                 43
40.0-41.5                 51
41.5-43.0                 27
43.0-44.5                 10
44.5-46.0                 1
46.0 or more              0
Total                     200

(We assume that the upper limit of a class shows that quantities less than that
limit are in the class. So, for example, 35.5 will be included in the third class
interval, not the second one.)

At the 5% level of significance, test the hypothesis that the company's sales are
normally distributed.
Solution : Let us start by clearly stating our hypotheses.
$H_0$ : The company's sales are normally distributed.
against
$H_A$ : The company's sales are not normally distributed.
Now, we assume for just now that $H_0$ is valid. By methods known to us, we can
calculate the sample mean and sample standard deviation, $\bar{x}$ and $s$. You can
check that these are :
$\bar{x}$ = 40 thousand litres, $s$ = 2.5 thousand litres.
Now, we need to find the expected frequencies $E_i$ corresponding to each $O_i$. You
know that $E_i = 200 \times p_i$, where $p_i$ is the probability for each class in Table 4,
computed under the assumption of normal distribution.
So, let us expand Table 4 to include all the class probabilities (Column 3), the
expected frequencies (Column 4) and the corresponding values of $\frac{(O_i - E_i)^2}{E_i}$
(Column 5).
To get the first entry in Column 3, we compute $z = \frac{x - \mu}{\sigma}$ for x = 34. As you
know, $\mu$ and $\sigma$ are estimated by $\bar{x}$ and $s$, respectively. So, $z = \frac{34 - 40}{2.5} = -2.4$.
Now, from the table of normal probabilities in the Block Appendix, you know that
$P[-2.4 < Z < 0] = P[0 < Z < 2.4] = 0.4918$.
So, the probability we want is
$p_1 = 0.5 - P[-2.4 < Z < 0] = 0.5 - 0.4918 = 0.0082$.
Therefore, $E_1 = 200 (0.0082) = 1.64$.
Similarly, you can compute the other expected frequencies and complete the 4th
column of Table 5. You may wonder about the brackets in Columns 2, 3 and 4
of the table. This is because, as we have mentioned in Remark 1, the $\chi^2$
goodness-of-fit test is a good approximation only if the $E_i$ are not very small.
This is why we have grouped the first two classes and the last two classes in Table
5.
To fill in the fifth column of Table 5, we treat the bracketed classes as a single
class. So, for example, $\frac{(O_1 - E_1)^2}{E_1} = \frac{(13 - 7.18)^2}{7.18} = 4.7176$. Similarly, calculate the
other entries of Column 5 in the table below.
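Table 5
Sales              Observed        Class                Expected          $\frac{(O_i - E_i)^2}{E_i}$
(in 1000 litres)   frequency       probability          frequency
less than 34.0      0 }            0.0082 }              1.64 }
34.0 - 35.5        13 } 13         0.0277 } 0.0359       5.54 } 7.18      4.7176
35.5 - 37.0        20              0.0792               15.84             1.0925
37.0 - 38.5        35              0.1592               31.84             0.3136
38.5 - 40.0        43              0.2257               45.14             0.1015
40.0 - 41.5        51              0.2257               45.14             0.7607
41.5 - 43.0        27              0.1592               31.84             0.7357
43.0 - 44.5        10              0.0792               15.84             2.1531
44.5 - 46.0         1 }            0.0277 }              5.54 }
greater than 46.0   0 } 1          0.0082 } 0.0359       1.64 } 7.18      5.3193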

Now, summing up the entries in the last column of Table 5, we get U = 15.194.

Next, to see whether we accept or reject $H_0$, we look up the value of $\chi^2$ at the 5%
level of significance and for the appropriate number of degrees of freedom. Note
that, though we started with the data categorised into 10 classes, we needed to
group two sets of 2 frequencies each. So, for purposes of the test we now have
8 classes. Also, we have estimated two parameters, $\mu$ and $\sigma$. Therefore, the
degrees of freedom are (8 - 1) - 2 = 5.

So, from the $\chi^2$ table, we find $\chi^2_{0.05,5}$ = 11.07.

Since $U > \chi^2_{0.05,5}$, we must reject $H_0$. That is, the normal distribution is not a good
fit to the data.
***
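
The steps of Problem 1 can be retraced numerically. The sketch below is an illustration under our own assumptions: it evaluates the normal distribution function with `math.erf` instead of reading the printed table, so the class probabilities carry more decimal places and the resulting U is close to, but not exactly, 15.194. The pooling of the first two and last two classes is done by hand, as in the solution.

```python
import math


def normal_cdf(x, mean, sd):
    """P[X <= x] for a normal variable with the given mean and s.d."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


# Class boundaries (in 1000 litres) and observed frequencies from Table 4.
boundaries = [34.0, 35.5, 37.0, 38.5, 40.0, 41.5, 43.0, 44.5, 46.0]
observed = [0, 13, 20, 35, 43, 51, 27, 10, 1, 0]
n, mean, sd = sum(observed), 40.0, 2.5   # estimates used in the solution

# Class probabilities are differences of successive distribution-function values.
cdf = [0.0] + [normal_cdf(b, mean, sd) for b in boundaries] + [1.0]
probs = [hi - lo for lo, hi in zip(cdf, cdf[1:])]
expected = [n * p for p in probs]

# Pool the first two and the last two classes (their E values are below 5).
obs_pooled = [observed[0] + observed[1]] + observed[2:-2] + [observed[-2] + observed[-1]]
exp_pooled = [expected[0] + expected[1]] + expected[2:-2] + [expected[-2] + expected[-1]]

u = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp_pooled))

# Two parameters (mean and s.d.) were estimated from the data.
df = len(obs_pooled) - 1 - 2
critical_value = 11.07                   # alpha = 0.05, 5 d.f., from the table

print(round(u, 2), df, u > critical_value)
```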
Now try the following exercises.

E1) In Table 6 below you find the distribution of the heights of 100 college
students. Estimate the mean and the standard deviation of the
distribution. Check whether the sample is drawn from a normally
distributed population at 5% level of significance.

Table 6
Class (cm)       Number of students ($O_i$)
Less than 161    4
161 - 164        11
164 - 167        16
167 - 170        19
170 - 173        25
173 - 176        18
176 - 179        4
179 - 182        2
182 or more      1
Total            100

E2) Test whether the observed frequencies, as given below, in 4 phenotypic
classes AB, Ab, aB, ab are in agreement with the expected ratio 9 : 3 : 3 : 1.

Class        AB     Ab    aB    ab
Frequency    102    25    28    5

E3) A die is rolled 1200 times with the following results

No. that comes up    1    2    3    4    5    6

Test if the die is unbiased.

In all the situations so far, the problem was related to data that were classified
according to one attribute. Now let us see how the $\chi^2$ test can be used to infer
about situations in which the data are classified according to two or more
attributes.

7.3 TEST OF INDEPENDENCE


In this section we shall look at inference problems like the following one.

Dr.Surya had recently developed a serum that she thought might be effective in
preventing colds. But, she needed to verify its efficacy. For this purpose she
carried out an experiment.

One thousand individuals were classified into two groups of the same size. The
serum was administered to the members of the first group only. The number of
individuals in each group who caught a cold zero times, or once, or more than
once during some period after the treatment was noted. The data are shown in the
following table having 2 rows and 3 columns.

Table 7 : Table showing the effect of serum

                    The number catching a cold
                    Never    Once    More than once    Total
Treated group       252      145     103               500
Untreated group     224      136     140               500
Total               476      281     243               1000

Fig. 3 : "Don't worry! You take this medicine, and you won't have any more colds in future."

Dr. Surya's problem was to examine whether or not this serum is effective in
preventing a cold. In other words, she wants to know whether the number of times
a person catches a cold depends on whether or not s/he has taken the serum. We can
reword this as: is the treatment by the serum independent of the number of times
of catching a cold?

So, Dr. Surya formulated the following null and alternative hypotheses to be
tested:

$H_0$ : There is no interdependence between the serum treatment and the number of
times of getting a cold.

$H_A$ : $H_0$ is not true, i.e., the serum has some effect on preventing colds.

To test $H_0$ against $H_A$ she planned to use the $\chi^2$ test at the 5% significance level.
As you know, to do so she needed to calculate the expected frequencies
corresponding to each of the 6 entries in the 2 x 3 table, Table 7, assuming $H_0$,
i.e., the independence of the number of times one gets a cold and of taking the serum
treatment.

Let us see how she obtained $E_{11}$. For this, she used the fact that out of the 1000
people, 476 had no cold. So, out of the 500 in the treatment group,
$\frac{476}{1000} \times 500 = 238$ were expected to not have any cold. Note that this is
$$\frac{(\text{sum of the first row entries}) \times (\text{sum of first column entries})}{(\text{total sample size})}.$$

Similarly, she calculated the other expected frequencies :
$E_{12} = 500 \times \frac{281}{1000} = 140.5$, $E_{13} = 121.5$, $E_{21} = 238$, $E_{22} = 140.5$, $E_{23} = 121.5$.

Then, Surya calculated the sample statistic U as
$$U = \sum_{i=1}^{2} \sum_{j=1}^{3} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \approx 7.57.$$

She took the significance level of the test as $\alpha$ = 0.05. Also, in this case the
number of degrees of freedom was (2 - 1) (3 - 1) = 2. (For an m x n table, the
number of degrees of freedom is (m-1) (n-1).) So, comparing the value of
U with $\chi^2_{0.05,2}$ = 5.99, she found that $U > \chi^2_{0.05,2}$.
So, she rejected $H_0$, and concluded that the serum has some effect in preventing
colds.

Let us look closely at the steps Dr. Surya went through for testing the
independence of two features of the population under study.

Step 1 : She stated the hypothesis regarding the independence of two features of
the sample.

Step 2 : As in the case of the goodness-of-fit tests, she noted the frequencies -
how many of each type of person (treated or untreated) had which kind
of feature (the number of times they catch a cold). These frequencies
were written in a table, called a contingency table.

In this case, the contingency table had 2 rows and 3 columns, because
corresponding to each of the two groups of people there were 3
possibilities about the cold they did or did not catch. In brief, we say
that the table was a 2 x 3 contingency table.

Step 3 : Corresponding to each of the 6 cells of the contingency table,
         Dr. Surya calculated the expected frequency. She did this as follows :
         $E_{ij}$ = expected frequency for ith row and jth column
         $= \frac{(\text{sum of entries of ith row}) (\text{sum of entries of jth column})}{(\text{total sample size})}$,
         where i = 1, 2 and j = 1, 2, 3.

Step 4 : Then the sample $\chi^2$, U, was calculated by
         $$U = \sum_{i=1}^{2} \sum_{j=1}^{3} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$
         where $O_{ij}$ was the entry in the ith row and jth column.
         Note that, more generally, if she had had an m x n contingency table,
         the value would be
         $$U = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$
         where m and n are natural numbers.

Step 5 : She compared this value with the value of $\chi^2_{\alpha,d}$, where $\alpha$ is the level
         of significance and d = (m - 1) (n - 1) is the number of degrees of
         freedom. Then, as you saw in Sec. 7.2, if $U < \chi^2_{\alpha,d}$, $H_0$ is accepted.
         Otherwise $H_0$ is rejected.
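
Steps 3 to 5 can be sketched in Python for a contingency table of any size. The function names below are our own. The serum table is entered with 103 as the third entry of the treated row, a value inferred from the row total of 500, and the critical value 5.99 is taken from the chi-square table.

```python
def expected_frequencies(table):
    """E_ij = (sum of row i) * (sum of column j) / (total sample size)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def independence_statistic(table):
    """Return (U, d) for an m x n contingency table of observed counts."""
    expected = expected_frequencies(table)
    u = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    d = (len(table) - 1) * (len(table[0]) - 1)   # degrees of freedom
    return u, d


# Rows: treated, untreated. Columns: never, once, more than once.
serum_table = [[252, 145, 103],
               [224, 136, 140]]

u, d = independence_statistic(serum_table)
critical_value = 5.99        # alpha = 0.05, 2 d.f., from the table
print(round(u, 2), d, u > critical_value)
```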

Another example may help you to clarify your understanding regarding the
process of testing for independence.

Example 4 : The Glorious Watch Company wants to find out if there is any
relationship between the income of a person and the importance she attaches to
the price of a brand name. Mr. Zafar, the Chief of the Marketing Division, wants
to test the hypothesis

$H_0$ : The income of a person and the importance she attaches to price are independent.
against
$H_A$ : $H_0$ is not true.

Zafar does a survey among the customers. To analyse his results, he groups them
into 3 income levels, and asks them to mark the level of importance they give on a
3-point scale: great, moderate or low. He noted the results in a contingency
table (see Table 8). In this table, you will also find the expected frequency
corresponding to each observed frequency written alongside. As you know, these
will be calculated as follows :

$$E_{11} = \frac{(\text{sum of entries of first row}) (\text{sum of entries of first column})}{(\text{total sample size})}$$

All the other $E_{ij}$ are calculated in the same way.

Table 8

                        Feature 2 (Income)
Feature 1               Low                     Middle                  High                    Total
(Importance Level)      $O_{i1}$   $E_{i1}$     $O_{i2}$   $E_{i2}$     $O_{i3}$   $E_{i3}$
Great                   79         63.58        58         61.2         33         45.22        170
Moderate                48         59.09        65         56.88        45         42.03        158
Low                     60         64.33        57         61.92        55         45.75        172
Total                   187                     180                     133                     500

So, the sample $\chi^2$ value that Zafar calculated was
$$U = \frac{(79-63.58)^2}{63.58} + \frac{(58-61.2)^2}{61.2} + \cdots + \frac{(57-61.92)^2}{61.92} + \frac{(55-45.75)^2}{45.75}$$
= 3.74 + 0.167 + 3.302 + 2.081 + 1.159 + 0.21 + 0.291 + 0.391 + 1.87
= 13.211.

Then Zafar compared this with the value of $\chi^2$ for (3-1) (3-1) = 4 degrees of
freedom and at the 2% level of significance, which is $\chi^2_{0.02,4}$ = 11.668.

He found $U > \chi^2_{0.02,4}$, which made him decide that he should reject $H_0$. In other
words, at the 2% level of significance, Zafar concludes that the level of income of
a person is related to the importance she gives to the price of the brand of watches.

In the example above, it is interesting to note that if Zafar had chosen the 1%
level of significance, then $\chi^2_{0.01,4}$ = 13.277 > U. So he would not have rejected $H_0$. What
does this tell us about statistical analyses? Think about it.
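
The dependence of the decision on the chosen level of significance can be made concrete with a short sketch. The code below recomputes U from the observed counts of Table 8 (using unrounded expected frequencies, so the last decimal may differ from the hand calculation) and compares it with the two critical values quoted in the text. The variable names are our own.

```python
# Observed counts from Table 8.
# Rows: importance great, moderate, low. Columns: income low, middle, high.
income_table = [[79, 58, 33],
                [48, 65, 45],
                [60, 57, 55]]

row_totals = [sum(row) for row in income_table]
col_totals = [sum(col) for col in zip(*income_table)]
n = sum(row_totals)

# U = sum over all cells of (O_ij - E_ij)^2 / E_ij, with E_ij = row * column / n.
u = 0.0
for i, row in enumerate(income_table):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        u += (o - e) ** 2 / e

# Critical values for 4 degrees of freedom, as quoted in the text.
critical_values = {0.02: 11.668, 0.01: 13.277}
decisions = {alpha: u > cv for alpha, cv in critical_values.items()}

print(round(u, 2), decisions)
```

The same data lead to rejection at the 2% level but not at the 1% level, which is why the level of significance should be fixed before the data are examined.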


And now here are some problems for you to solve.

E4) The data in the following table give mortality rates among vaccinated and
non-vaccinated patients. Test if the vaccine has any effect in curing the
disease.
Categories        Living    Dead    Total
Vaccinated        320       125     445
Non-vaccinated    98        230     328
Total             418       355     773

E5) Do the following data on sociability of soldiers recruited in cities and
villages suggest that city soldiers are more sociable than village soldiers?

           Sociable    Non-sociable
City       13          6
Village    7           14

E6) A group of 1650 school children were classified according to their
performance in school tests and family economic level. Test if there is
any association between these two attributes.

Very Good Good Average Poor Total

Very Rich 4 7 16 25 52
Rich 13 37 79 73 202
Average 105 372 298 175 950
Poor 36 213 75 123 446
Total 157 629 468 396 1650

E7) In an experiment to study whether smoking affects health, the following


data were collected. Test the hypothesis that smoking does not affect
health.
                       Light      Moderate    Heavy
                       smoking    smoking     smoking
Health affected        16         29          35
Health not affected    36         23          17
In this section you have seen situations in which the population is tested
to see if two or more common features of the population are related or
not. This is as far as we intend to discuss the use of $\chi^2$ for analysing
categorical data. Let us end with a brief look at what we have covered in
this unit.

7.4 SUMMARY
In this unit we have started with a look at data presented in the form of
frequencies falling in different categories or classes. Based on such data we have
undertaken different tests of hypotheses using the chi-squared distribution.
We have considered two types of tests :

1) Test of goodness-of-fit : The hypotheses are given by

$H_0$ : The data fit a given distribution ; against
$H_A$ : $H_0$ is not true, i.e., the data do not fit that distribution.

For testing whether $H_0$ is acceptable, we consider the observed and
expected frequencies of the various categories in the data.

Suppose there are k categories with $O_i$ as the observed frequency and $E_i$ as
the expected frequency of the ith category. Then the sample $\chi^2$ value is
$$U = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}.$$

If $H_0$ were acceptable, then this value should be less than $\chi^2_{\alpha,k-s-1}$ with
$100\alpha$% significance level, where s is the number of parameters estimated
in finding the expected frequencies.

So, if $U < \chi^2_{\alpha,k-s-1}$, then $H_0$ is not rejected. Otherwise, $H_0$ is rejected.

2) Test of independence : Suppose a population can be classified into r
categories on the basis of feature A, and into c categories on the basis of
feature B. The hypotheses are given by :

$H_0$ : There is no interdependence between the features A and B
$H_A$ : $H_0$ is not true, that is, A has an effect on B.

The data are presented in the form of an r x c contingency table.

Let $n_{ij}$ be the frequency in the ith row and jth column and let $n_{i\cdot}$ and
$n_{\cdot j}$ denote the sum of the entries of the ith row and of the jth column,
respectively, with n the total sample size.

If the two classification criteria are mutually independent, the expected
value $E_{ij}$ for the ith row and jth column is given by
$$E_{ij} = \frac{n_{i\cdot} \, n_{\cdot j}}{n}.$$

Then, the sample $\chi^2$ value is
$$U = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \quad \text{where } O_{ij} = n_{ij}.$$

If this value is less than $\chi^2_{\alpha,(r-1)(c-1)}$, then $H_0$ is acceptable at the $\alpha$ level
of significance.
And now you may like to check whether you have achieved the objectives of the
unit listed in Sec.7.1. Also, while doing the exercises in this unit, you may have
had some doubts. If so, please go through the following section also.

7.5 SOLUTIONS/ANSWERS
E1) Here $\bar{x}$ = 170, $s^2$ = 36 and n = 100.

$H_0$ : The sample is drawn from a population with normal distribution
N(170.0, $6^2$).
$H_A$ : $H_0$ is not true.

In order to solve this problem by the same method as in Example 1, we
consider the classes in Table 6 corresponding to categories of a
multinomial distribution. Let $O_i$ be the observed value for the ith
category. Then, what is the expected value for the ith category in this
case? Since the population distribution is completely specified as
N(170, $6^2$) under the null hypothesis $H_0$, we can obtain the probability $p_i$
with which the height of a student chosen randomly falls into the ith
category. The expected value for the ith category is obtained by $E_i = np_i$.
To compute the values $p_i$, the boundary points of the classes should be
standardised by the population mean and the standard deviation so as to
make use of the table for a standard normal distribution. The
standardised boundary points are given below in Table 11.

Table 11
Boundary points of class ($x_i$)         161    164    167    170    173    176    179    182
Standardised boundary points ($z_i$)     -1.5   -1.0   -0.5   0.0    0.5    1.0    1.5    2.0

Here, $z_i = \frac{x_i - 170.0}{6.0}$. The $p_i$ and $E_i$ values can be obtained as follows.
For example, $p_1 = P[Z < -1.5] = 0.5 - 0.4332 = 0.0668$, so that $E_1 = 100 \times 0.0668 = 6.68$.

The values are all given in Table 12 below.


Table 12
Class (cm)       Number of           Probabilities         Expected values
                 students ($O_i$)    ($p_i$)               $E_i$
Less than 161    4                   0.0668                6.68
161 - 164        11                  0.0919                9.19
164 - 167        16                  0.1498                14.98
167 - 170        19                  0.1915                19.15
170 - 173        25                  0.1915                19.15
173 - 176        18                  0.1498                14.98
176 - 179        4                   0.0919                9.19
179 - 182        2 }                 0.0440 }              4.40 }
182 or more      1 } 3               0.0228 } 0.0668       2.28 } 6.68
Total            100                 1.0000                100.00

Let us now test $H_0$ against $H_A$ at the 5% significance level.

Now, from Table 12,
$$U = \sum_{i=1}^{8} \frac{(O_i - E_i)^2}{E_i} = 8.86.$$

Taking the significance level of the test $\alpha$ = 0.05, we have $\chi^2_{0.05,5}$ = 11.07.

The degrees of freedom are 8 - 3 = 5, because the number of categories, after
combining the last two categories, is 8, and the number of parameters estimated is
2.

Since $U = 8.86 < 11.07 = \chi^2_{0.05,5}$, we conclude that there is good
agreement between the observed frequencies and the fitted values. So $H_0$
is accepted.

E2) If the data are compatible with the given ratios, the expected frequencies
are : AB: 90, Ab: 30, aB: 30, ab: 10.

The value of U is 5.07. This is less than $\chi^2_{0.05,3}$ = 7.815. Hence, we
accept the null hypothesis that the given data are in agreement with the
expected ratios.

E3) Here $H_0$ : the expected frequency is 200 in each class.
$H_A$ : $H_0$ is not true.

Here U = 112.87 > $\chi^2_{0.05,5}$ = 11.070. Hence, on the
basis of the given data, we reject $H_0$. So the die is not unbiased.
E4) The hypothesis here is:

$H_0$ : There is no effect of the vaccine on mortality.
against
$H_A$ : $H_0$ is not true.
The expected frequencies $E_{ij}$ are given in the table below.
                  Living    Dead      Total
Vaccinated        240.63    204.37    445
Non-Vaccinated    177.37    150.63    328
Total             418       355       773

The observed value of $\chi^2$ is U = 134.33.

The number of degrees of freedom = (2 - 1) (2 - 1) = 1.
$\chi^2_{0.05,1}$ = 3.84 < U.
Hence, we conclude that we cannot accept $H_0$. So, on the basis of the
given data, we conclude that the vaccine has a definite effect on the
mortality rate.

E5) $H_0$ : There is no interdependence between place and sociability level.
$H_A$ : $H_0$ is not true.
The table of expected frequencies is

Place      Sociable    Non-sociable    Total
City       9.5         9.5             19
Village    10.5        10.5            21
Total      20          20              40

The observed value of $\chi^2$ is U = 4.91.
The number of degrees of freedom = 1.
$\chi^2_{0.05,1}$ = 3.84 < U.
Therefore, we reject $H_0$. So, the data suggest that the place a soldier
comes from affects her/his sociability level.

E6) The expected frequencies are given below :

             Very Good    Good      Average    Poor
Very Rich    4.95         19.82     14.75      12.48
Rich         19.22        77.00     57.29      48.48
Average      90.39        362.15    269.45     228.00
Poor         42.44        170.02    126.50     107.04

The number of degrees of freedom = (4 - 1) (4 - 1) = 9.
The value of the sample $\chi^2$ is U = 127.61 > 16.919 = $\chi^2_{0.05,9}$. Hence, the
hypothesis of independence between the categories is rejected.

E7) Under the assumption that smoking does not affect health, the expected
frequencies are given below.

                       Light smoking    Moderate smoking    Heavy smoking
Health affected        26.67            26.67               26.67
Health not affected    25.33            25.33               25.33

The observed value of $\chi^2$ is U = 14.52 > $\chi^2_{0.05,2}$ = 5.991. Hence, it is
concluded on the basis of the given data that smoking affects health.

APPENDIX : MULTINOMIAL DISTRIBUTION

This distribution is an extension of the binomial distribution that you studied in
Unit 3. It shows up in the following situation :
There is an experiment which consists of n identical trials, which are independent.
Each trial can have k possible outcomes. Suppose the probabilities of these
outcomes are $p_1, p_2, ..., p_k$, with $p_1 + p_2 + ... + p_k = 1$. These probabilities remain the
same from trial to trial.
Mathematically, this situation is represented by considering k random variables $X_1$,
..., $X_k$, where $X_i$ counts the trials that result in the ith outcome, which take the
values $x_1, ..., x_k$, respectively, where
$\sum_{i=1}^{k} x_i = n$, $\sum_{i=1}^{k} p_i = 1$, $p_i \ne 0 \ \forall \ i = 1, ..., k$. If the random vector $(X_1, ..., X_k)$ is
multinomially distributed, then
$$P[X_1 = x_1, ..., X_k = x_k] = \frac{n!}{x_1! \cdots x_k!} \, p_1^{x_1} \cdots p_k^{x_k}.$$
So, if we are testing if a certain population is multinomially distributed, we will test

$H_0$ : The population of size n is multinomially distributed with probabilities
$p_1, p_2, ..., p_k$ (known to us);
against
$H_A$ : The population is not multinomially distributed.

As in all the other cases discussed in the unit, if the $np_i$ are not very small, then the
test statistic $U = \sum_{i=1}^{k} \frac{(O_i - np_i)^2}{np_i}$ has approximately a chi-squared distribution with
(k - 1) degrees of freedom. The approximation is usually good for $E_i = np_i \ge 5$.
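
The multinomial probability formula of this appendix can be evaluated directly for small n. The sketch below is illustrative only, and the function name `multinomial_pmf` is our own. With k = 2 it reduces to the binomial probability, which gives a simple check.

```python
import math


def multinomial_pmf(counts, probs):
    """P[X_1 = x_1, ..., X_k = x_k] = n!/(x_1! ... x_k!) * p_1^x_1 ... p_k^x_k."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    n = sum(counts)
    coefficient = math.factorial(n)
    for x in counts:
        coefficient //= math.factorial(x)   # stays an integer at every step
    probability = float(coefficient)
    for x, p in zip(counts, probs):
        probability *= p ** x
    return probability


# With k = 2 this is the binomial case: one success in two fair trials.
p_example = multinomial_pmf([1, 1], [0.5, 0.5])
print(p_example)
```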
APPENDIX - 1

[Fig. 1 : Sampling distribution of $\bar{X}$ for different populations and different
sample sizes. Panels : original population; sampling distribution of $\bar{X}$ for n = 2,
for n = 5 and for n = 30.]
TABLE 1
AREAS UNDER THE STANDARD NORMAL CURVE
This table shows the area between zero (the mean of a standard normal
variable) and z. For example, if z = 1.50, this is the shaded area shown
below which equals .4332.

Source: This table is adapted from National Bureau of Standards, Tables of Normal Probability Functions,
Applied Mathematics Series 23, U.S. Department of Commerce, 1953.
TABLE-2
t-distribution
TABLE-3
CHI-SQUARED DISTRIBUTION