APPLICATIONS OF CHI-SQUARE IN PROBLEMS WITH
CATEGORICAL DATA
Structure
7.1 Introduction
Objectives
7.2 Goodness-of-fit
7.3 Test of Independence
7.4 Summary
7.5 Solutions to Exercises
Appendix : Multinomial Distribution
7.1 INTRODUCTION
In this block, you have already studied several problems of testing of hypotheses.
The tests that you have studied so far relate to problems where the sample data
have been obtained from a continuous distribution, for example, the normal
distribution. In practice, however, one often obtains data in which the sampled
observations are classified into classes according to one or more attributes. For
example, a sample of flowers can be classified according to their colour: some
of the flowers in the sample could be white, the others could be purple. Again,
suppose it is claimed that a vaccine controls a disease. To verify the truth of this
claim, a sample of N individuals is taken and these individuals are classified
according to two attributes: inoculated or not inoculated, and affected or not
affected by the disease.
When the sampled data are classified according to one or more attributes, we say
that we have a set of categorical data. How do we tackle the inference problems
arising out of categorical data? In this unit we shall discuss the use of one of the
most widely used tests, the chi-square test, in this context.
To start with, in Sec. 7.2, we shall consider the use of the chi-square test in
"goodness-of-fit" problems. Then, in Sec. 7.3, we shall see how the chi-square
test can help us compare two features of a population to see if there is any
relationship between them or not. In other words, we test whether the features
occur independently of each other or not.
While studying this unit, please keep comparing the situations in this unit and
Units 5 and 6 to really understand the difference in the questions being asked and
answered.
Objectives
Recently, however, the stock inventories have become more difficult to control.
Therefore, Ms. Dalta feels that she should check whether her hypothesis of
uniform demand is valid or not.
Can you apply any of the methods you have studied so far to help Ms. Dalta?
There is no parameter that she is estimating and no assumption regarding the
distribution of the population. So Ms. Dalta needs to look for some new tools.
What she needs to do is to test the hypothesis
$H_0$ : The demand is uniform for all four types of almirahs
against
$H_A$ : The demand is not uniform for all four types of almirahs.
(Note: $H_0$ is the null hypothesis, and $H_A$ is the alternative hypothesis.)
For doing this, she selects a sample of 80 almirahs sold over the past few months.
Ms. Dalta assumes that the demand is uniform. So the probability of an almirah of
Type i being bought is the same for i = 1, 2, 3, 4. If we denote this probability by
$p_i$, then $p_1 = p_2 = p_3 = p_4 = \frac{1}{4}$. So, if the demand is uniform, she can expect
$80 \times \frac{1}{4} = 20$ almirahs of each type to be sold. But the observed sales of each
type are 23, 19, 18 and 20, respectively. Her problem is to see how well her
hypothesis of uniform demand fits the observed sales. In other words, how can
this data set be used for testing $H_0$?
So, in this general situation, the "expected" number of individuals falling in the ith
class is $E_i = np_i$ (i = 1, ..., k). Note that these "expected" numbers are known to
us because to start with we assume $H_0$ and calculate them. That is, we assume that
the probability of an individual falling in the ith class is $p_i$ (i = 1, ..., k). Based on
these numbers $O_1, O_2, \ldots, O_k$, $E_1, E_2, \ldots, E_k$, there is a way of testing the validity
of $H_0$. Let us see what this method is.
$$U = \frac{(23-20)^2}{20} + \frac{(19-20)^2}{20} + \frac{(18-20)^2}{20} + \frac{(20-20)^2}{20} = \frac{9+1+4+0}{20} = 0.7$$
Here k = 4. If Ms. Dalta wants to test $H_0$ at a 5% level of significance, $\alpha = 0.05$.
Now, $\chi^2_{3,\,0.05} = 7.815$. Since $U < \chi^2_{3,\,0.05}$, Ms. Dalta does not reject $H_0$. In other
words, Ms. Dalta concludes that the demand for the four types of almirahs is
uniform.
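The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `chi_square_statistic` and the variable names are chosen here and do not come from the unit, and the critical value 7.815 is simply copied from the text rather than computed.

```python
def chi_square_statistic(observed, probabilities):
    """Return U = sum of (O_i - E_i)^2 / E_i, where E_i = n * p_i under H0."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Observed sales of the four types of almirahs (sample of 80).
observed_sales = [23, 19, 18, 20]
# Under H0 the demand is uniform, so each type has probability 1/4.
uniform_probabilities = [0.25] * 4
# Critical chi-square value for 3 degrees of freedom at the 5% level (from the text).
CRITICAL_VALUE = 7.815

u = chi_square_statistic(observed_sales, uniform_probabilities)
reject_h0 = u > CRITICAL_VALUE
print(f"U = {u:.3f}, reject H0: {reject_h0}")
```

Because U is far below the critical value, the sketch reaches the same decision as Ms. Dalta: the hypothesis of uniform demand is not rejected.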
Another example may help you to see how this test works.
According to the well-known Mendel's law, these four kinds of flowers should
come out in the ratio 9 : 3 : 3 : 1. Jaswant found that under her experiment, out of
160 flowers that bloomed, the number of flowers with types MG, MR, RG and RR
were 84, 35, 28 and 13, respectively. She wants to find out whether these data
are compatible with Mendel's law or not.
If they are compatible, then the probabilities of each of these types blooming are
$p_1 = \frac{9}{16}$, $p_2 = \frac{3}{16}$, $p_3 = \frac{3}{16}$ and $p_4 = \frac{1}{16}$. So Jaswant wants to test the hypothesis
Jaswant needs to compare this value with the appropriate critical $\chi^2$-value. She
takes the significance level of the test as $\alpha = 0.05$. Also, in this case, since the
number of classes is 4, the degrees of freedom are 4 - 1 = 3. So, she finds
$\chi^2_{3,\,0.05}$, which is 7.81. Since $U = 2.27 < 7.81 = \chi^2_{3,\,0.05}$,
she does not reject $H_0$.
Thus, Jaswant concludes that her data is compatible with Mendel's law.
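As a check on the value U = 2.27, here is a minimal sketch that derives the expected frequencies from the ratio 9 : 3 : 3 : 1 and computes the statistic with exact fractions. The helper name `expected_from_ratio` is hypothetical and is used only for this illustration.

```python
from fractions import Fraction


def expected_from_ratio(total, ratio):
    """Split `total` into expected frequencies proportional to `ratio`."""
    parts = sum(ratio)
    return [Fraction(total * r, parts) for r in ratio]


# Observed numbers of flowers of types MG, MR, RG and RR (160 in all).
observed = [84, 35, 28, 13]
# Expected numbers if the 9 : 3 : 3 : 1 ratio holds.
expected = expected_from_ratio(sum(observed), [9, 3, 3, 1])

# U = sum of (O_i - E_i)^2 / E_i, kept as an exact fraction.
u = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print([int(e) for e in expected], float(u))
```

Working with `Fraction` avoids rounding in the intermediate terms; the result, 34/15, rounds to the 2.27 used in the text.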
In the two situations above, the hypothetical probabilities $p_1, p_2, \ldots$ were known to
us from before because of the type of assumption $H_0$ was. However, in some
problems, these probabilities may have to be estimated from the data itself. The
following example illustrates this.
arrival processes fit the Poisson distribution, she decided to test the following
hypothesis :
Number of     Observed     Probability    Expected
arrivals      frequency    ($p_i$)        frequency
0             10           0.0524          10.48
1             23           0.1545          30.90
2             45           0.2277          45.54
3             49           0.2238          44.76
4             32           0.1651          33.02
5 or more     41           0.1765          35.30
Total         200          1.0000         200.00
According to her data,
Here k = 6 but one parameter has been estimated. So, the degrees of freedom
associated with the chi-square distribution is (k - 1) - 1 = k - 2 = 4. The critical
value of chi-square at 4 degrees of freedom and 1 percent level of significance is
13.27. Since 3.402 < 13.27, the consultant did not reject the null hypothesis. In
other words, she was in a position to conclude that the arrivals and departures at
the bus terminus were Poisson distributed.
(Note: To find $p_i$ for $\lambda = 2.96$, we take the average of the values in the columns
corresponding to 2.9 and 3.0, respectively. Thus, the first probability in the table is
$\frac{0.055 + 0.0498}{2} = 0.0524$.)
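The following sketch recomputes the statistic from the observed and expected frequencies in the table, and also shows how the class probabilities could be obtained directly from the Poisson formula instead of by averaging table columns. It assumes $\lambda = 2.96$ as quoted in the note; the function names are illustrative only, and the exact probabilities differ slightly from the tabulated ones because the text interpolates between table columns.

```python
import math

LAMBDA = 2.96  # estimated Poisson mean quoted in the text
# Observed frequencies for 0, 1, 2, 3, 4 and "5 or more" arrivals.
observed = [10, 23, 45, 49, 32, 41]
# Expected frequencies as tabulated in the text.
tabulated_expected = [10.48, 30.90, 45.54, 44.76, 33.02, 35.30]


def poisson_class_probabilities(lam, last_class):
    """P(X = 0), ..., P(X = last_class - 1), then P(X >= last_class)."""
    probs = [math.exp(-lam) * lam ** i / math.factorial(i) for i in range(last_class)]
    probs.append(1.0 - sum(probs))
    return probs


def chi_square(obs, exp):
    """U = sum of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


n = sum(observed)
exact_expected = [n * p for p in poisson_class_probabilities(LAMBDA, last_class=5)]

u_tabulated = chi_square(observed, tabulated_expected)  # matches the text's 3.402
u_exact = chi_square(observed, exact_expected)          # uses exact probabilities
# One parameter (lambda) was estimated, so df = (k - 1) - 1.
degrees_of_freedom = (len(observed) - 1) - 1
print(round(u_tabulated, 3), round(u_exact, 3), degrees_of_freedom)
```

Either way the statistic stays well below the critical value 13.27, so the decision not to reject the null hypothesis does not depend on the interpolation.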
Let us now look at a problem involving the normal distribution. While solving it, the
following very important point about applying the $\chi^2$-test will show up.
This remark will become more clear as you study the solution of Problem 1.
Problem 1 : A chemical company wants to know if its sales of a liquid chemical
are normally distributed. This information will help them in planning and
controlling the inventory. The sales record for a random sample of 200 days is
given in Table 4.
Table 4
Sales (in 1000 litres)    Observed frequency
Less than 34.0                  0
34.0-35.5                      13
35.5-37.0                      20
37.0-38.5                      35
38.5-40.0                      43
40.0-41.5                      51
41.5-43.0                      27
43.0-44.5                      10
44.5-46.0                       1
46.0 or more                    0
Total                         200
(Note: We assume that the upper limit of a class shows that quantities less
than that limit are in the class. So, for example, 35.5 will be included in
the third class interval, not the second one.)
At the 5% level of significance, test the hypothesis that the company's sales are
normally distributed.
Solution : Let us start by clearly stating our hypotheses.
$H_0$ : The company's sales are normally distributed.
against
$H_A$ : The company's sales are not normally distributed.
Now, we assume for the moment that $H_0$ is valid. By methods known to us, we can
calculate the sample mean $\bar{x}$ and the sample standard deviation $s$. You can
check that these are :
$\bar{x}$ = 40,000 litres, $s$ = 2.5 thousand litres.
Now, we need to find the expected frequencies $E_i$ corresponding to each $O_i$. You
know that $E_i = 200 \times p_i$, where $p_i$ is the probability for each class in Table 4,
computed under the assumption of normal distribution.
So, let us expand Table 4 to include all the class probabilities (Column 3), the
expected frequencies (Column 4) and the corresponding values of $\frac{(O_i - E_i)^2}{E_i}$
(Column 5).
To get the first entry in Column 3, we compute $z = \frac{x - \mu}{\sigma}$ for x = 34. As you
know, $\mu$ and $\sigma$ are estimated by $\bar{x}$ and $s$, respectively. So, $z = \frac{34 - 40}{2.5} = -2.4$.
Now, from the table of normal probabilities in the Block Appendix, you know that
$P[-2.4 < Z < 0] = P[0 \le Z \le 2.4] = 0.4918$.
So, the probability we want is
$p_1 = 0.5 - P[-2.4 < Z \le 0] = 0.5 - 0.4918 = 0.0082$.
Therefore, $E_1 = 200 \, (0.0082) = 1.64$.
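If you prefer to check this table lookup numerically, a minimal sketch using Python's standard `statistics.NormalDist` class is given below. The variable names are illustrative; the mean 40 and standard deviation 2.5 (in thousand litres) are the estimates quoted in the solution.

```python
from statistics import NormalDist

SAMPLE_SIZE = 200
# Normal distribution with the estimated mean and standard deviation
# (both in thousand litres).
sales = NormalDist(mu=40.0, sigma=2.5)

# Standardised value of the boundary x = 34.
z = (34.0 - sales.mean) / sales.stdev
# Probability of the first class, "less than 34.0".
p1 = sales.cdf(34.0)
# Expected frequency of the first class.
e1 = SAMPLE_SIZE * p1
print(z, round(p1, 4), round(e1, 2))
```

The computed probability agrees with the table-based value 0.0082 to four decimal places.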
Similarly, you can compute the other expected frequencies and complete the 4th
column of Table 5. You may wonder about the brackets in Columns 2, 3 and 4
of the table. This is because, as we have mentioned in Remark 1, the $\chi^2$
goodness-of-fit test is a good approximation only if the $E_i$ are not very small.
This is why we have grouped the first two classes and the last two classes in Table
5.
To fill in the fifth column of Table 5, we treat the bracketed classes as a single
class. For example, $\frac{(O_1 - E_1)^2}{E_1} = \frac{(13 - 7.18)^2}{7.18} = 4.7176$. You can similarly calculate the
other entries of Column 5 in the table below.
Table 5
Sales (in 1000 litres)    Observed frequency    Class probability    Expected frequency
Now, summing up the entries in the last column of Table 5, we get U = 15.194.
Next, to see whether we accept or reject $H_0$, we look up the value of $\chi^2$ at the 5%
level of significance and for the appropriate number of degrees of freedom. Note
that, though we started with the data categorised into 10 classes, we needed to
group two sets of 2 frequencies each. So, for purposes of the test we now have
8 classes. Also, we have estimated two parameters, $\mu$ and $\sigma$. Therefore, the
degrees of freedom are (8 - 1) - 2 = 5.
Since $U > \chi^2_{5,\,0.05} = 11.07$, we must reject $H_0$. That is, the normal distribution is not a good
fit to the data.
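The whole of Problem 1, including the grouping of classes with small expected frequencies, can be sketched as follows. This is an illustration under stated assumptions: the helper `merge_small_tails` and the threshold of 5 are choices made here (the threshold follows the usual rule of thumb for this test), the critical value 11.07 for 5 degrees of freedom at the 5% level is taken from standard chi-square tables, and the class probabilities are computed exactly rather than read from the normal table, so U differs from 15.194 in the second decimal place.

```python
from statistics import NormalDist

# Class boundaries (in thousand litres) and observed frequencies from Table 4.
boundaries = [34.0, 35.5, 37.0, 38.5, 40.0, 41.5, 43.0, 44.5, 46.0]
observed = [0, 13, 20, 35, 43, 51, 27, 10, 1, 0]
n = sum(observed)

# Expected frequencies under a normal distribution with mean 40 and s.d. 2.5.
dist = NormalDist(mu=40.0, sigma=2.5)
cdf = [0.0] + [dist.cdf(b) for b in boundaries] + [1.0]
expected = [n * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]


def merge_small_tails(obs, exp, minimum=5.0):
    """Pool end classes into their neighbours until both end classes
    have an expected frequency of at least `minimum`."""
    obs, exp = list(obs), list(exp)
    while len(exp) > 2 and exp[0] < minimum:
        first_obs, first_exp = obs.pop(0), exp.pop(0)
        obs[0] += first_obs
        exp[0] += first_exp
    while len(exp) > 2 and exp[-1] < minimum:
        last_obs, last_exp = obs.pop(), exp.pop()
        obs[-1] += last_obs
        exp[-1] += last_exp
    return obs, exp


grouped_obs, grouped_exp = merge_small_tails(observed, expected)
u = sum((o - e) ** 2 / e for o, e in zip(grouped_obs, grouped_exp))
# Two parameters (mean and standard deviation) were estimated from the data.
degrees_of_freedom = (len(grouped_obs) - 1) - 2
CRITICAL_VALUE = 11.07  # chi-square, 5 d.f., 5% level (standard tables)
print(len(grouped_obs), degrees_of_freedom, round(u, 2), u > CRITICAL_VALUE)
```

The sketch pools only the end classes, which is all that Table 4 requires; data with small expected frequencies in interior classes would need a more general grouping rule.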
***
Now try the following exercises.
E1) In Table 6 below you find the distribution of the heights for 100 college
students. Estimate the mean and the standard deviation of the
distribution. Check whether the sample is drawn from a normally
distributed population at 5% level of significance.
Table 6
No. that comes up    1    2    3    4    5    6
In all the situations so far, the problem was related to data that were classified
according to one attribute. Now let us see how the $\chi^2$ test can be used to infer
about situations in which the data are classified according to two or more
attributes.
Dr. Surya had recently developed a serum that she thought might be effective in
preventing colds. But she needed to verify its efficacy. For this purpose she
carried out an experiment.
One thousand individuals were classified into two groups of the same size. The
serum was administered to the members of the first group only. The number of
individuals in each group who caught a cold zero times, or once, or more than
once during some period after the treatment was noted. The data are shown in the
following table having 2 rows and 3 columns.
$H_A$ : $H_0$ is not true, i.e., the serum has some effect on preventing colds.
To test $H_0$ against $H_A$ she planned to use the $\chi^2$ test at the 5% significance level.
As you know, to do so she needed to calculate the expected frequencies
corresponding to each of the 6 entries in the 2 x 3 table, Table 7, assuming $H_0$,
i.e., the independence of the number of times one gets a cold and of taking the serum
treatment.
Let us see how she obtained $E_{11}$. For this, she used the fact that out of the 1000
people, 476 had no cold. So, out of the 500 in the treatment group,
$\frac{476}{1000} \times 500 = 238$ were expected to not have any cold. Note that this is
$$\frac{(\text{sum of the first row entries}) \times (\text{sum of the first column entries})}{\text{total sample size}}$$
She took the significance level of the test as $\alpha = 0.05$. Also, in this case the
number of degrees of freedom was (2 - 1) (3 - 1) = 2. (Note: For an m x n
table, the number of degrees of freedom is (m - 1) (n - 1).) So, comparing the value of
U with $\chi^2_{2,\,0.05} = 5.99$, she found that U was larger than this critical value.
So, she rejected $H_0$, and concluded that the serum has some effect in preventing
colds.
Let us look closely at the steps Dr. Surya went through for testing the
independence of two features of the population under study.
Step 1 : She stated the hypothesis regarding the independence of two features of
the sample.
Step 2 : As in the case of the goodness-of-fit tests, she noted the frequencies -
how many of each type of person (treated or untreated) had which kind
of feature (the number of times they catch a cold). These frequencies
were written in a table, called a contingency table.
In this case, the contingency table had 2 rows and 3 columns, because
corresponding to each of the two groups of people there were 3
possibilities about the cold they did or did not catch. In brief, we say
that the table was a 2 x 3 contingency table.
column.
Note that, more generally, if she had had an m x n contingency table,
the value would be
$$U = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$
where m and n are natural numbers.
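The general m x n computation can be sketched as below. Since the entries of Table 7 are not reproduced here, the sketch uses the 2 x 2 vaccination counts from exercise E4 of this unit as input; the function names are hypothetical, and the critical value 3.841 (1 degree of freedom, 5% level) is taken from standard chi-square tables.

```python
def expected_frequencies(table):
    """E_ij = (row i total) * (column j total) / (total sample size)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def independence_statistic(table):
    """Return (U, degrees of freedom) for an m x n contingency table."""
    expected = expected_frequencies(table)
    u = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return u, df


# Rows: vaccinated, non-vaccinated; columns: living, dead (exercise E4).
vaccination = [[320, 125], [98, 230]]
CRITICAL_VALUE = 3.841  # chi-square, 1 d.f., 5% level (standard tables)

u, df = independence_statistic(vaccination)
print(round(u, 2), df, u > CRITICAL_VALUE)
```

The same two functions work unchanged for a 2 x 3 table such as Dr. Surya's or a 3 x 3 table such as the one in the next example, because the row and column totals are computed from whatever table is passed in.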
Another example may help you to clarify your understanding regarding the
process of testing for independence.
Example 4 : The Glorious Watch Company wants to find out if there is any
relationship between the income of a person and the importance she attaches to
the price of a brand name. Mr. Zafar, the Chief of the Marketing Division, wants
to test the hypothesis
Zafar does a survey among the customers. To analyse his results, he groups them
into 3 income levels, and asks them to mark the level of importance they give on a
3-point scale: great, moderate or low. He noted the results in a contingency
table (see Table 8). In this table, you will also find the expected frequency
corresponding to each observed frequency written alongside. As you know, these
will be calculated as follows :
Table 8
                        Feature 2 (Income)
Feature 1               Low                 Middle              High                Total
(Importance Level)      $O_{i1}$  $E_{i1}$    $O_{i2}$  $E_{i2}$    $O_{i3}$  $E_{i3}$
Great                   79      63.58       58      61.2        33      45.22       170
$$U = \frac{(79-63.58)^2}{63.58} + \frac{(58-61.2)^2}{61.2} + \cdots + \frac{(57-61.92)^2}{61.92} + \frac{(55-45.75)^2}{45.75}$$
In the example above, it is interesting to note that if Zafar had chosen to be 99%
certain, then $\chi^2_{4,\,0.01} = 13.277 > U$, so he would not have rejected $H_0$. What
E4) The data in the following table give mortality rates among vaccinated and
non-vaccinated patients. Test if the vaccine has any effect in curing the
disease.
Categories         Living    Dead    Total
Vaccinated           320     125      445
Non-vaccinated        98     230      328
Total                418     355      773
Sociable Non-sociable
City 13 6
Very Rich      4      7     16     25      52
Rich          13     37     79     73     202
Average      105    372    298    175     950
Poor          35    213     75    123     446
Total        157    629    468    396    1650
7.4 SUMMARY
In this unit we have started with a look at data presented in the form of
frequencies falling in different categories or classes. Based on such data we have
undertaken different tests of hypotheses using the chi-squared distribution.
We have considered two types of tests :
Then, the sample $\chi^2$ value, $U = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
7.5 SOLUTIONS/ANSWERS
E1) Here $\bar{x} = 170$, $s^2 = 36$ and $n = 100$.
Table 11
Boundary Points of Class ($x_i$)           161    164    167    170    173    176    179    182
Standardised Boundary Points ($z_i$)      -1.5   -1.0   -0.5    0.0    0.5    1.0    1.5    2.0
Here, $z_i = \frac{x_i - 170.0}{6.0}$. The $p_i$ and $E_i$ values can be obtained as follows.
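The standardisation in Table 11 and the resulting class probabilities can be sketched as below. Because the observed frequencies of Table 6 are not reproduced here, the sketch stops at the expected frequencies; it also assumes, for illustration only, that the classes include an open class below 161 and an open class above 182.

```python
from statistics import NormalDist

# Class boundaries from Table 11 and the estimates from the solution.
boundaries = [161, 164, 167, 170, 173, 176, 179, 182]
mean, sd, n = 170.0, 6.0, 100  # s^2 = 36, so s = 6

# Standardised boundary points z_i = (x_i - mean) / sd.
z = [(x - mean) / sd for x in boundaries]

# Class probabilities are differences of the standard normal CDF at
# consecutive boundaries, with open classes at both ends (an assumption).
standard_normal = NormalDist()
cdf = [0.0] + [standard_normal.cdf(v) for v in z] + [1.0]
probabilities = [hi - lo for lo, hi in zip(cdf, cdf[1:])]
expected = [n * p for p in probabilities]
print(z)
print([round(e, 2) for e in expected])
```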
E2) If the data are compatible with the given ratios, the expected frequencies
are : AB: 90, Ab: 30, aB: 30, ab: 10.
Place \ Sociability    Social    Non-social    Total

Very Rich       4.95     19.82     14.75     12.48
Rich           19.22     77.00     57.29     48.48
Average        90.39    362.15    269.45    228.00
Poor           42.44    170.02    126.50    107.04
The value of the sample $\chi^2$ is U = 127.61 > 25.0, the critical $\chi^2$ value at the 5% level of significance. Hence, the
hypothesis of independence between the categories is rejected.
E7) Under the assumption that smoking does not affect health, the expected
frequencies are given below.
As in all the other cases discussed in the unit, if the $np_i$ are not very small, then the
test statistic
$$U = \sum_{i=1}^{k} \frac{(O_i - np_i)^2}{np_i}$$
has approximately a chi-squared distribution with
(k - 1) degrees of freedom. The approximation is usually good for $E_i = np_i \ge 5$.
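One way to get a feel for this approximation is a small simulation. The sketch below draws repeated multinomial samples under $H_0$ and averages the resulting values of U; since a chi-squared distribution with (k - 1) degrees of freedom has mean k - 1, the average should be close to k - 1. The function name, the seed and the number of repetitions are arbitrary choices made for this illustration, and the setting (k = 4 equally likely classes, n = 80) is borrowed from the almirah example.

```python
import random


def simulate_mean_u(probabilities, n, repetitions, seed=0):
    """Average value of U over repeated multinomial samples drawn under H0."""
    rng = random.Random(seed)
    k = len(probabilities)
    expected = [n * p for p in probabilities]
    total = 0.0
    for _ in range(repetitions):
        # Draw n individuals and count how many fall in each class.
        counts = [0] * k
        for category in rng.choices(range(k), weights=probabilities, k=n):
            counts[category] += 1
        total += sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    return total / repetitions


# Four equally likely classes with n = 80, so every E_i = 20 >= 5.
mean_u = simulate_mean_u([0.25] * 4, n=80, repetitions=4000)
print(round(mean_u, 2))  # should be close to k - 1 = 3
```

Comparing only the mean is a rough check; a fuller comparison would also look at how often U exceeds the tabulated critical values.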
APPENDIX - 1
Fig. 1 Sampling distribution of $\bar{X}$ for different populations
and different sample sizes.
TABLE 1
AREAS UNDER THE STANDARD NORMAL CURVE
This table shows the area between zero (the mean of a standard normal
variable) and z. For example, if z = 1.50, this is the shaded area shown
below which equals .4332.
Source: This table is adapted from National Bureau of Standards, Tables of Normal Probability Functions,
Applied Mathematics Series 23, U.S. Department of Commerce, 1953.
TABLE-2
t-distribution
TABLE-3
CHI-SQUARED DISTRIBUTION