
Recall that there are basically 4 steps in the process of hypothesis testing:

1. State the null and alternative hypotheses.
2. Collect relevant data from a random sample and summarize them (using a test statistic).
3. Find the p-value, the probability of observing data like those observed assuming that $H_0$ is true.
4. Based on the p-value, decide whether we have enough evidence to reject $H_0$ (and accept $H_a$), and draw our conclusions in context.

We are now going to go through these steps as they apply to the hypothesis testing for the population proportion p. It should be noted that even though the details will be specific to this particular test, some of the ideas that we will add apply to hypothesis testing in general.

1. Stating the Hypotheses


Here again are the three sets of hypotheses that are being tested in each of our three examples:

EXAMPLE 1
Has the proportion of defective products been reduced as a result of the repair?
$H_0: p = .20$ (No change; the repair did not help).
$H_a: p < .20$ (The repair was effective).

EXAMPLE 2
Is the proportion of marijuana users in the college higher than the national figure?
$H_0: p = .157$ (Same as among all college students in the country).
$H_a: p > .157$ (Higher than the national figure).

EXAMPLE 3
Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
$H_0: p = .64$ (No change from 2003).
$H_a: p \neq .64$ (Some change since 2003).

Note that the null hypothesis always takes the form:

$H_0$: p = some value

and the alternative hypothesis takes one of the following three forms:

$H_a$: p < that value (like in example 1) or
$H_a$: p > that value (like in example 2) or
$H_a$: p ≠ that value (like in example 3).

Note that it was quite clear from the context which form of the alternative hypothesis would be appropriate.

The value that is specified in the null hypothesis is called the null value, and is generally denoted by $p_0$. We can say, therefore, that in general the null hypothesis about the population proportion (p) would take the form:

$H_0: p = p_0$

We write $H_0: p = p_0$ to say that we are making the hypothesis that the population proportion has the value of $p_0$. In other words, p is the unknown population proportion and $p_0$ is the number we think p might be for the given situation.

The alternative hypothesis takes one of the following three forms (depending on the context):

$H_a: p < p_0$ (one-sided)
$H_a: p > p_0$ (one-sided)
$H_a: p \neq p_0$ (two-sided)

The first two possible forms of the alternatives (where the = sign in $H_0$ is challenged by < or >) are called one-sided alternatives, and the third form of alternative (where the = sign in $H_0$ is challenged by ≠) is called a two-sided alternative. To understand the intuition behind these names let's go back to our examples.

Example 3 (death penalty) is a case where we have a two-sided alternative:

$H_0: p = .64$ (No change from 2003).
$H_a: p \neq .64$ (Some change since 2003).

In this case, in order to reject $H_0$ and accept $H_a$ we will need to get a sample proportion of death penalty supporters which is very different from .64 in either direction, either much larger or much smaller than .64.

In example 2 (marijuana use) we have a one-sided alternative:

$H_0: p = .157$ (Same as among all college students in the country).
$H_a: p > .157$ (Higher than the national figure).

Here, in order to reject $H_0$ and accept $H_a$ we will need to get a sample proportion of marijuana users which is much higher than .157.

Similarly, in example 1 (defective products), where we are testing:

$H_0: p = .20$ (No change; the repair did not help).
$H_a: p < .20$ (The repair was effective).

in order to reject $H_0$ and accept $H_a$, we will need to get a sample proportion of defective products which is much smaller than .20.

Learn By Doing

In each of the following examples, a test for the population proportion (p) is called for. You are asked to select the right null and alternative hypotheses.

Scenario 1: The UCLA Internet Report (February 2003) estimated that roughly 8.7% of Internet users are extremely concerned about credit card fraud when buying online. Has that figure changed since? To test this, a random sample of 100 Internet users was chosen, and when interviewed, 10 said that they were extremely worried about credit card fraud when buying online. Let p be the proportion of all Internet users who are concerned about credit card fraud.

Scenario 2: The UCLA Internet Report (February 2003) estimated that a proportion of roughly .75 of online homes are still using dial-up access, but claimed that the use of dial-up is declining. Is that really the case? To examine this, a follow-up study was conducted a year later in which out of a random sample of 1,308 households that had Internet access, 804 were connecting using a dial-up modem. Let p be the proportion of all U.S. Internet-using households that have dial-up access.

Scenario 3: According to the UCLA Internet Report (February 2003) the use of the Internet at home is growing steadily and it is estimated that roughly 59.3% of households in the United States have Internet access at home. Has that trend continued since the report was released? To study this, a random sample of 1,200 households from a big metropolitan area was chosen for a more recent study, and it was found that 972 had an Internet connection. Let p be the proportion of U.S. households that have Internet access.

Did I Get This?


In each of the following examples, a test for the population proportion (p) is called for. You are asked to select the right null and alternative hypotheses.

Scenario 1: When shirts are made, there can occasionally be defects (such as improper stitching). But too many such defective shirts can be a sign of substandard manufacturing. Suppose, in the past, your favorite department store has had only one defective shirt per 200 shirts (a prior defective rate of only .005). But you suspect that the store has recently switched to a substandard manufacturer. So you decide to test to see if their overall proportion of defective shirts today is higher. Suppose that, in a random sample of 200 shirts from the store, you find that 27 of them are defective, for a sample proportion of defective shirts of .135. You want to test whether this is evidence that the store is "guilty" of substandard manufacturing, compared to their prior rate of defective shirts.

Scenario 2: It is a known medical fact that just slightly fewer females than males are born (although the reasons are not completely understood): the known "proper" baseline female birthrate is about 49% females. In some cultures, male children are traditionally looked on more favorably than female children, and there is concern that the increasing availability of ultrasound may lead to pregnant mothers deciding to abort the fetus if it's not the culturally "desired" gender. If this is happening, then the proportion of females in those nations would be significantly lower than the proper baseline rate. To test whether the proportion of females born in India is lower than the proper baseline female birthrate, a study investigates a random sample of 6,500 births from hospital files in India, and finds 44.8% females born among the sample.

Scenario 3: A properly-balanced 6-sided game die should give a 1 in exactly 1/6 (16.7%) of all rolls. A casino wants to test its game die. If the die is not properly balanced one way or another, it could give either too many 1's or too few 1's, either of which could be bad. The casino wants to use the proportion of 1's to test whether the die is out of balance. So the casino test-rolls the die 60 times and gets a 1 in 9 of the rolls (15%).

2. Collecting and Summarizing the Data (Using a Test Statistic)


After the hypotheses have been stated, the next step is to obtain a sample (on which the inference will be based), collect relevant data, and summarize them. It is extremely important that our sample is representative of the population about which we want to draw conclusions. This is ensured when the sample is chosen at random. Beyond the practical issue of ensuring representativeness, choosing a random sample has theoretical importance that we will mention later.

In the case of hypothesis testing for the population proportion (p), we will collect data on the relevant categorical variable from the individuals in the sample and start by calculating the sample proportion, $\hat{p}$ (the natural quantity to calculate when the parameter of interest is p). Let's go back to our three examples and add this step to our figures.
[Figures: Examples 1, 2, and 3, each updated with the sample data and the sample proportion $\hat{p}$.]

As we mentioned earlier without going into details, when we summarize the data in hypothesis testing, we go a step beyond calculating the sample statistic and summarize the data with a test statistic. Every test has a test statistic, which to some degree captures the essence of the test. In fact, the p-value, which so far we have looked upon as "the king" (in the sense that everything is determined by it), is actually determined by (or derived from) the test statistic. We will now gradually introduce the test statistic.

The test statistic is a measure of how far the sample proportion $\hat{p}$ is from the null value $p_0$, the value that the null hypothesis claims is the value of p. In other words, since $\hat{p}$ is what the data estimates p to be, the test statistic can be viewed as a measure of the "distance" between what the data tells us about p and what the null hypothesis claims p to be. Let's use our examples to understand this:
EXAMPLE 1
The parameter of interest is p, the proportion of defective products following the repair. The data estimate p to be $\hat{p} = .16$. The null hypothesis claims that p = .20. The data are therefore .04 (or 4 percentage points) below the null hypothesis with respect to what they each tell us about p. It is hard to evaluate whether this difference of 4% in defective products is enough evidence to say that the repair was effective, but clearly, the larger the difference, the more evidence it is against the null hypothesis. So if, for example, our sample proportion of defective products had been, say, .10 instead of .16, then I think you would all agree that cutting the proportion of defective products in half (from 20% to 10%) would be extremely strong evidence that the repair was effective.
EXAMPLE 2
The parameter of interest is p, the proportion of students in a college who use marijuana. The data estimate p to be $\hat{p} = .19$. The null hypothesis claims that p = .157. The data are therefore .033 (or 3.3 percentage points) above the null hypothesis with respect to what they each tell us about p.
EXAMPLE 3
The parameter of interest is p, the proportion of U.S. adults who support the death penalty for convicted murderers. The data estimate p to be $\hat{p} = .675$. The null hypothesis claims that p = .64. There is a difference of .035 (3.5 percentage points) between the data and the null hypothesis with respect to what they each tell us about p.

There is a problem with just looking at the difference between the sample proportion $\hat{p}$ and the null value $p_0$. Examples 2 and 3 illustrate this problem very well. In example 2 we have a difference of 3.3 percentage points between the data and the null hypothesis, which is approximately the same as the difference in example 3 of 3.5 percentage points. However, the difference in example 3 of 3.5 percentage points is based on a sample of size 1,000 and therefore it is much more impressive than the difference of 3.3 percentage points in example 2, which was obtained from a sample of size only 100.

For the reason illustrated in the examples above, the test statistic cannot simply be the difference $\hat{p} - p_0$, but must be some form of that formula that accounts for the sample size. In other words, we need to somehow standardize the difference $\hat{p} - p_0$ so that comparison between different situations will be possible. We are very close to revealing the test statistic, but before we construct it, let's be reminded of the following two facts from probability:

1. When we take a random sample of size n from a population with population proportion p, the possible values of the sample proportion $\hat{p}$ (when certain conditions are met) have approximately a normal distribution with:
• mean: $p$
• standard deviation: $\sqrt{\frac{p(1 - p)}{n}}$

2. The z-score of a normal value (a value that comes from a normal distribution) is:

$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$$

and it represents how many standard deviations below or above the mean the value is. We are finally ready to reveal the test statistic:

The test statistic for this test measures the difference between the sample proportion $\hat{p}$ and the null value $p_0$ by the z-score (standardized score) of the sample proportion $\hat{p}$, assuming that the null hypothesis is true (i.e., assuming that $p = p_0$). From fact 1, we know that the values of the sample proportion ($\hat{p}$) are approximately normal, and we are given the mean and standard deviation. Using fact 2, we conclude that the z-score of $\hat{p}$ when $p = p_0$ is:

$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$$

This is the test statistic. It represents the difference between the sample proportion ($\hat{p}$) and the null value ($p_0$), measured in standard deviations.

Here is a representation of the sampling distribution of $\hat{p}$, assuming $p = p_0$. In other words, this is a model of how $\hat{p}$'s behave if we are drawing random samples from a population for which $H_0$ is true. Notice the center of the sampling distribution is at $p_0$, which is the hypothesized proportion given in the null hypothesis ($H_0: p = p_0$). We could also mark the axis in standard deviation units, $\sqrt{\frac{p_0(1 - p_0)}{n}}$. For example, if our null hypothesis claims that the proportion of U.S. adults supporting the death penalty is 0.64, then the sampling distribution is drawn as if the null is true. We draw a normal distribution centered at p = 0.64 with a standard deviation dependent on sample size, $\sqrt{\frac{0.64(1 - 0.64)}{n}}$.

Important Comment
Note that under the assumption that $H_0$ is true (i.e., $p = p_0$), the test statistic, by the nature of the fact that it is a z-score, has a N(0,1) (standard normal) distribution. Another way to say the same thing which is quite common is: "The null distribution of the test statistic is N(0,1)." By "null distribution," we mean the distribution under the assumption that $H_0$ is true. As we'll see and stress again later, the null distribution of the test statistic is what the calculation of the p-value is based on.
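One way to make this comment concrete is a small simulation. The sketch below is ours, not part of the course materials: the function names, the seed, the number of repetitions, and the choice of a null value of .20 with n = 400 (borrowed from example 1) are illustrative assumptions. It repeatedly draws random samples from a population in which the null hypothesis is true and computes the z statistic of each sample; if the null distribution is approximately N(0,1), the simulated values should have a mean near 0 and a standard deviation near 1.

```python
import math
import random
import statistics


def z_statistic(p_hat, p0, n):
    """z-score of the sample proportion, assuming the null value p0 is the true p."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)


def simulate_null_z(p0, n, reps, seed=0):
    """Draw `reps` random samples of size n from a population where p = p0
    and return the z statistic computed from each sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    zs = []
    for _ in range(reps):
        # Each individual falls in the category of interest with probability p0.
        successes = sum(rng.random() < p0 for _ in range(n))
        zs.append(z_statistic(successes / n, p0, n))
    return zs


zs = simulate_null_z(p0=0.20, n=400, reps=2000)
# Both numbers should be close to the N(0,1) values of 0 and 1.
print(round(statistics.mean(zs), 2), round(statistics.stdev(zs), 2))
```

The simulation only illustrates the claim; it is not how the p-value is computed in practice.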

Let's go back to our three examples and find the test statistic in each case:

EXAMPLE 1
Since the null hypothesis is $H_0: p = .20$, the standardized score of $\hat{p} = .16$ is:

$$z = \frac{.16 - .20}{\sqrt{\frac{.20(1 - .20)}{400}}} = -2$$

This is the value of the test statistic for this example. What does this tell me? This z-score of -2 tells me that (assuming that $H_0$ is true) the sample proportion $\hat{p} = .16$ is 2 standard deviations below the null value (.20).

EXAMPLE 2
Since the null hypothesis is $H_0: p = .157$, the standardized score of $\hat{p} = .19$ is:

$$z = \frac{.19 - .157}{\sqrt{\frac{.157(1 - .157)}{100}}} \approx .91$$

This is the value of the test statistic for this example. We interpret this to mean that, assuming that $H_0$ is true, the sample proportion $\hat{p} = .19$ is .91 standard deviations above the null value (.157).

EXAMPLE 3
Since the null hypothesis is $H_0: p = .64$, the standardized score of $\hat{p} = .675$ is:

$$z = \frac{.675 - .64}{\sqrt{\frac{.64(1 - .64)}{1000}}} \approx 2.31$$

This is the value of the test statistic for this example. We interpret this to mean that, assuming that $H_0$ is true, the sample proportion $\hat{p} = .675$ is 2.31 standard deviations above the null value (.64).
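The three calculations above can be checked with a few lines of Python. This is a minimal sketch rather than the course's own software; the function name `z_statistic` and the dictionary labels are ours. It also shows why standardizing matters: the raw differences in examples 2 and 3 are nearly the same, but the larger sample in example 3 gives a much larger z.

```python
import math


def z_statistic(p_hat, p0, n):
    """Test statistic for the z-test for the population proportion."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)


# (sample proportion, null value, sample size) as given in the three examples
examples = {
    "Example 1 (defective products)": (0.16, 0.20, 400),
    "Example 2 (marijuana use)": (0.19, 0.157, 100),
    "Example 3 (death penalty)": (0.675, 0.64, 1000),
}

for name, (p_hat, p0, n) in examples.items():
    z = z_statistic(p_hat, p0, n)
    print(f"{name}: difference = {p_hat - p0:+.3f}, z = {z:.2f}")
# Example 1: z = -2.00, Example 2: z = 0.91, Example 3: z = 2.31
```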

Comments

1. It should now be clear why this test is commonly known as the z-test for the population proportion. The name comes from the fact that it is based on a test statistic that is a z-score.

2. Recall fact 1 that we used for constructing the z-test statistic. Here is part of it again:

When we take a random sample of size n from a population with population proportion p, the possible values of the sample proportion ($\hat{p}$) (when certain conditions are met) have approximately a normal distribution with a mean of ... and a standard deviation of ....

This result provides the theoretical justification for constructing the test statistic the way we did, and therefore the assumptions under which this result holds (the "certain conditions" in the statement above) are the conditions that our data need to satisfy so that we can use this test. These two conditions are:

i. The sample has to be random.
ii. The conditions under which the sampling distribution of $\hat{p}$ is normal are met. In other words: [condition not shown]

3. Here we will pause to say more about condition (i.) above, the need for a random sample. In the Probability Unit we discussed sampling plans based on probability (such as a simple random sample, cluster, or stratified sampling) that produce a non-biased sample, which can be safely used in order to make inferences about a population. We noted in the Probability Unit that, in practice, other (non-random) sampling techniques are sometimes used when random sampling is not feasible. It is important though, when these techniques are used, to be aware of the type of bias that they introduce, and thus the limitations of the conclusions that can be drawn from them.

For our purpose here, we will focus on one such practice, the situation in which a sample is not really chosen randomly, but in the context of the categorical variable that is being studied, the sample is regarded as random. For example, say that you are interested in the proportion of students at a certain college who suffer from seasonal allergies. For that purpose, the students in a large engineering class could be considered as a random sample, since there is nothing about being in an engineering class that makes you more or less likely to suffer from seasonal allergies. Technically, the engineering class is a convenience sample, but it is treated as a random sample in the context of this categorical variable. On the other hand, if you are interested in the proportion of students in the college who have math anxiety, then the class of engineering students clearly could not be viewed as a random sample, since engineering students probably have a much lower incidence of math anxiety than the college population overall.

3. Finding the P-value of the Test


So far we've talked about the p-value at the intuitive level: understanding what it is (or what it measures) and how we use it to draw conclusions about the significance of our results. We will now go more deeply into how the p-value is calculated. It should be mentioned that eventually we will rely on technology to calculate the p-value for us (as well as the test statistic), but in order to make intelligent use of the output, it is important to first understand the details, and only then let the computer do the calculations for us. Let's start.

Recall that so far we have said that the p-value is the probability of obtaining data like those observed assuming that $H_0$ is true. Like the test statistic, the p-value is, therefore, a measure of the evidence against $H_0$. In the case of the test statistic, the larger it is in magnitude (positive or negative), the further $\hat{p}$ is from $p_0$, the more evidence we have against $H_0$. In the case of the p-value, it is the opposite: the smaller it is, the more unlikely it is to get data like those observed when $H_0$ is true, the more evidence it is against $H_0$. One can actually draw conclusions in hypothesis testing just using the test statistic, and as we'll see the p-value is, in a sense, just another way of looking at the test statistic. The reason that we actually take the extra step in this course and derive the p-value from the test statistic is that even though in this case (the test about the population proportion) and some other tests, the value of the test statistic has a very clear and intuitive interpretation, there are some tests where its value is not as easy to interpret. On the other hand, the p-value keeps its intuitive appeal across all statistical tests.

How is the p-value calculated?
Intuitively, the p-value is the probability of observing data like those observed assuming that $H_0$ is true. Let's be a bit more formal:

Since this is a probability question about the data, it makes sense that the calculation will involve the data summary, the test statistic.

What do we mean by "like" those observed? By "like" we mean "as extreme or even more extreme."

Putting it all together, we get that in general:

The p-value is the probability of observing a test statistic as extreme as that observed (or even more extreme) assuming that the null hypothesis is true.

Comment
By "extreme" we mean extreme in the direction of the alternative hypothesis. Specifically, for the z-test for the population proportion:

1. If the alternative hypothesis is $H_a: p < p_0$ (less than), then "extreme" means small, and the p-value is: The probability of observing a test statistic as small as that observed or smaller if the null hypothesis is true.

2. If the alternative hypothesis is $H_a: p > p_0$ (greater than), then "extreme" means large, and the p-value is: The probability of observing a test statistic as large as that observed or larger if the null hypothesis is true.

3. If the alternative is $H_a: p \neq p_0$ (different from), then "extreme" means extreme in either direction, either small or large (i.e., large in magnitude), and the p-value therefore is: The probability of observing a test statistic as large in magnitude as that observed or larger if the null hypothesis is true.

(Examples: If z = -2.5: p-value = probability of observing a test statistic as small as -2.5 or smaller or as large as 2.5 or larger. If z = 1.5: p-value = probability of observing a test statistic as large as 1.5 or larger, or as small as -1.5 or smaller.)

OK, that makes sense. But how do we actually calculate it?

Recall the important comment from our discussion about our test statistic,

$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$$

which said that when the null hypothesis is true (i.e., when $p = p_0$), the possible values of our test statistic (because it is a z-score) follow a standard normal (N(0,1), denoted by Z) distribution. Therefore, the p-value calculations (which assume that $H_0$ is true) are simply standard normal distribution calculations for the 3 possible alternative hypotheses.

Less Than
The probability of observing a test statistic as small as that observed or smaller, assuming that the values of the test statistic follow a standard normal distribution. We will now represent this probability in symbols and also using the normal distribution. In symbols: p-value = $P(Z \le z)$.

Looking at the shaded region, you can see why this is often referred to as a left-tailed test. We shaded to the left of the test statistic, since less than is to the left.

Greater Than
The probability of observing a test statistic as large as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution. Again, we will represent this probability in symbols and using the normal distribution. In symbols: p-value = $P(Z \ge z)$.

Looking at the shaded region, you can see why this is often referred to as a right-tailed test. We shaded to the right of the test statistic, since greater than is to the right.

Not Equal To
The probability of observing a test statistic which is as large in magnitude as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution. In symbols: p-value = $2P(Z \ge |z|)$.

This is often referred to as a two-tailed test, since we shaded in both directions.

As noted earlier, before the widespread use of statistical software, it was common to use 'critical values' instead of p-values to assess the evidence provided by the data. Even though the critical values approach is not used in this course, students might find it insightful. Thus, the interested students are encouraged to review the critical value method in the following "Many Students Wonder...." link. If your instructor clearly states that you are required to have knowledge of the critical value method, you should definitely review the information.
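The three tail-area calculations described above (less than, greater than, not equal to) can be sketched in Python using the standard library's `statistics.NormalDist`. The function name `p_value` and the labels "less", "greater", and "two-sided" are our own choices for illustration, not notation from the course.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)  # the null distribution, N(0,1)


def p_value(z, alternative):
    """p-value of the z-test for an observed test statistic z.

    alternative: "less" (Ha: p < p0), "greater" (Ha: p > p0),
                 or "two-sided" (Ha: p != p0).
    """
    if alternative == "less":
        return STANDARD_NORMAL.cdf(z)                 # left tail: P(Z <= z)
    if alternative == "greater":
        return 1 - STANDARD_NORMAL.cdf(z)             # right tail: P(Z >= z)
    if alternative == "two-sided":
        return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))  # both tails: 2P(Z >= |z|)
    raise ValueError(f"unknown alternative: {alternative!r}")


# The two-sided illustrations used earlier in the text:
print(round(p_value(-2.5, "two-sided"), 4))  # 0.0124
print(round(p_value(1.5, "two-sided"), 4))   # 0.1336
```

Because the standard normal curve is symmetric, z = -2.5 and z = 2.5 give the same two-sided p-value.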

EXAMPLE 1
The p-value in this case is:
• The probability of observing a test statistic as small as -2 or smaller, assuming that $H_0$ is true.
OR (recalling what the test statistic actually means in this case),
• The probability of observing a sample proportion that is 2 standard deviations or more below $p_0 = .20$, assuming that $p_0$ is the true population proportion.
OR, more specifically,
• The probability of observing a sample proportion of .16 or lower in a random sample of size 400, when the true population proportion is $p_0 = .20$.

In either case, the p-value is found as shown in the following figure:

To find $P(Z \le -2)$ we can either use a table or software. Eventually, after we understand the details, we will use software to run the test for us and the output will give us all the information we need. The p-value that the statistical software provides for this specific example is 0.023. The p-value tells me that it is pretty unlikely (probability of .023) to get data like those observed (test statistic of -2 or less) assuming that $H_0$ is true.

EXAMPLE 2
The p-value in this case is:
• The probability of observing a test statistic as large as .91 or larger, assuming that $H_0$ is true.
OR (recalling what the test statistic actually means in this case),
• The probability of observing a sample proportion that is .91 standard deviations or more above $p_0 = .157$, assuming that $p_0$ is the true population proportion.
OR, more specifically,
• The probability of observing a sample proportion of .19 or higher in a random sample of size 100, when the true population proportion is $p_0 = .157$.

In either case, the p-value is found as shown in the following figure:

Again, at this point we can either use a table or software to find that the p-value is 0.182.

The p-value tells us that it is not very surprising (probability of .182) to get data like those observed (which yield a test statistic of .91 or higher) assuming that the null hypothesis is true.
EXAMPLE 3
The p-value in this case is:
• The probability of observing a test statistic as large as 2.31 (or larger) or as small as -2.31 (or smaller), assuming that $H_0$ is true.
OR (recalling what the test statistic actually means in this case),
• The probability of observing a sample proportion that is 2.31 standard deviations or more away from $p_0 = .64$, assuming that $p_0$ is the true population proportion.
OR, more specifically,
• The probability of observing a sample proportion as different as .675 is from .64, or even more different (i.e. as high as .675 or higher or as low as .605 or lower) in a random sample of size 1000, when the true population proportion is $p_0 = .64$.

In either case, the p-value is found as shown in the following figure:

Again, at this point we can either use a table or software to find that the p-value is 0.021.

The p-value tells us that it is pretty unlikely (probability of .021) to get data like those observed (test statistic as high as 2.31 or higher or as low as -2.31 or lower) assuming that $H_0$ is true.
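The three p-values above can be reproduced from the sample proportions, null values, and sample sizes given in the examples. The following is a sketch under our own naming (`z_test_p_value` and the alternative labels are not from the course), and it stands in for whatever statistical package the course uses. Note that it works from the unrounded test statistic, which is why example 2 gives .182 rather than the value one would read from a table at z = .91.

```python
import math
from statistics import NormalDist


def z_test_p_value(p_hat, p0, n, alternative):
    """Return (z, p-value) for the z-test for the population proportion."""
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    cdf = NormalDist().cdf  # standard normal CDF
    if alternative == "less":
        p = cdf(z)
    elif alternative == "greater":
        p = 1 - cdf(z)
    elif alternative == "two-sided":
        p = 2 * (1 - cdf(abs(z)))
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    return z, p


# (sample proportion, null value, sample size, form of Ha) from the examples
cases = [
    ("Example 1", 0.16, 0.20, 400, "less"),
    ("Example 2", 0.19, 0.157, 100, "greater"),
    ("Example 3", 0.675, 0.64, 1000, "two-sided"),
]
for name, p_hat, p0, n, alt in cases:
    z, p = z_test_p_value(p_hat, p0, n, alt)
    print(f"{name}: z = {z:.2f}, p-value = {p:.3f}")
# p-values: 0.023, 0.182, 0.021
```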

Comment
We've just seen that finding p-values involves probability calculations about the value of the test statistic assuming that $H_0$ is true. In this case, when $H_0$ is true, the values of the test statistic follow a standard normal distribution (i.e., the sampling distribution of the test statistic when the null hypothesis is true is N(0,1)). Therefore, p-values correspond to areas (probabilities) under the standard normal curve.

Similarly, in any test, p-values are found using the sampling distribution of the test statistic when the null hypothesis is true (also known as the "null distribution" of the test statistic). In this case, it was relatively easy to argue that the null distribution of our test statistic is N(0,1). As we'll see, in other tests, other distributions come up (like the t-distribution and the F-distribution), which we will just mention briefly, and rely heavily on the output of our statistical package for obtaining the p-values.

We've just completed our discussion about the p-value, and how it is calculated both in general and more specifically for the z-test for the population proportion. Let's go back to the four-step process of hypothesis testing and see what we've covered and what still needs to be discussed.

The Four Steps in Hypothesis Testing

1. State the appropriate null and alternative hypotheses, $H_0$ and $H_a$.
2. Obtain a random sample, collect relevant data, and check whether the data meet the conditions under which the test can be used. If the conditions are met, summarize the data using a test statistic.
3. Find the p-value of the test.
4. Based on the p-value, decide whether or not the results are significant, and draw your conclusions in context.

With respect to the z-test for the population proportion:
Step 1: Completed
Step 2: Completed
Step 3: Completed
Step 4: This is what we will work on next.

4. Drawing Conclusions Based on the P-Value


This last part of the four-step process of hypothesis testing is the same across all statistical tests, and actually, we've already said basically everything there is to say about it, but it can't hurt to say it again.

The p-value is a measure of how much evidence the data present against $H_0$. The smaller the p-value, the more evidence the data present against $H_0$.

We already mentioned that what determines what constitutes enough evidence against $H_0$ is the significance level (α), a cutoff point below which the p-value is considered small enough to reject $H_0$ in favor of $H_a$. The most commonly used significance level is .05.

It is important to mention again that this step has essentially two sub-steps:

(i) Based on the p-value, determine whether or not the results are significant (i.e., the data present enough evidence to reject $H_0$).

(ii) State your conclusions in the context of the problem.

Let's go back to our three examples and draw conclusions.


EXAMPLE 1
(Has the proportion of defective products been reduced from .20 as a result of the repair?)
We found that the p-value for this test was .023.

Since .023 is small (in particular, .023 < .05), the data provide enough evidence to reject $H_0$ and conclude that as a result of the repair the proportion of defective products has been reduced to below .20. The following figure is the complete story of this example, and includes all the steps we went through, starting from stating the hypotheses and ending with our conclusions:

EXAMPLE 2
(Is the proportion of students who use marijuana at the college higher than the national proportion, which is .157?)
We found that the p-value for this test was .182.

Since .182 is not small (in particular, .182 > .05), the data do not provide enough evidence to reject $H_0$. We therefore do not have enough evidence to conclude that the proportion of students at the college who use marijuana is higher than the national figure. Here is the complete story of this example:



EXAMPLE 3
(Has the proportion of U.S. adults who support the death penalty for convicted murderers changed since 2003, when it was .64?)
We found that the p-value for this test was .021.

Since .021 is small (in particular, .021 < .05), the data provide enough evidence to reject $H_0$, and we conclude that the proportion of adults who support the death penalty for convicted murderers has changed since 2003. Here is the complete story of this example:
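The first sub-step, deciding whether the results are significant, is a simple comparison of the p-value with the significance level. Here is a minimal sketch; the function name `decide` and the wording of the returned strings are ours, and .05 is used only because it is the most common choice. The second sub-step, stating the conclusion in context, still has to be done by a person.

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value with the significance level alpha.

    Note the wording: when the p-value is not small we do not "accept" H0,
    we only fail to find enough evidence against it.
    """
    if p_value < alpha:
        return "reject H0"
    return "do not reject H0"


# p-values found for the three examples in the text
for name, p in [("Example 1", 0.023), ("Example 2", 0.182), ("Example 3", 0.021)]:
    print(f"{name}: p-value = {p:.3f} -> {decide(p)}")
```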

We have now completed going through the four steps of hypothesis testing, and in particular we learned how they are applied to the z-test for the population proportion. Let's briefly summarize:

Step 1
State the null and alternative hypotheses:

$H_0: p = p_0$
$H_a: p < p_0$ or $H_a: p > p_0$ or $H_a: p \neq p_0$

where the choice of the appropriate alternative (out of the three) is usually quite clear from the context of the problem.

Step 2
Obtain data from a sample and:

(i) Check whether the data satisfy the conditions which allow you to use this test:
• random sample (or at least a sample that can be considered random in context)
• the conditions under which the sampling distribution of $\hat{p}$ is normal are met

(ii) Calculate the sample proportion $\hat{p}$, and summarize the data using the test statistic:

$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$$

(Recall: This standardized test statistic represents how many standard deviations above or below $p_0$ our sample proportion $\hat{p}$ is.)

Step "
Bind the p3'alue of the test either by using soft$are or by using the test statistic as follo$s: K for Ha:p,p0:P(Z/z K for Ha:p-p0:P(Z0z K for Ha:p.p0:'P(Z01z1

Step 4
Reach a conclusion first regarding the significance of the results, and then determine what it means in the context of the problem. Recall that:

If the p-value is small (in particular, smaller than the significance level, which is usually .05), the results are significant (in the sense that there is a significant difference between what was observed in the sample and what was claimed in Ho), and so we reject Ho. If the p-value is not small, we do not have enough statistical evidence to reject Ho, and so we continue to believe that Ho may be true. (Remember: In hypothesis testing we never "accept" Ho.)
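The four steps can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the course materials: the function name `proportion_z_test` and its `alternative` argument are hypothetical, and the standard normal probabilities come from `statistics.NormalDist` in the Python standard library. Checking the conditions of the test is left to the reader.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, p0, alternative):
    """Return (z, p_value) for the z-test about a population proportion.

    alternative is "less" (Ha: p < p0), "greater" (Ha: p > p0)
    or "two-sided" (Ha: p != p0).
    """
    p_hat = successes / n                           # sample proportion
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)      # Step 2: test statistic
    std_normal = NormalDist()                       # null distribution N(0, 1)
    if alternative == "less":
        p_value = std_normal.cdf(z)                 # P(Z <= z)
    elif alternative == "greater":
        p_value = 1 - std_normal.cdf(z)             # P(Z >= z)
    elif alternative == "two-sided":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # 2 P(Z >= |z|)
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return z, p_value


# Example 2: 19 marijuana users in a sample of 100, null value .157
z, p_value = proportion_z_test(19, 100, 0.157, "greater")
print(round(z, 2), round(p_value, 3))

# Step 4: compare the p-value with the significance level
print("reject Ho" if p_value < 0.05 else "do not reject Ho")
```

For the data of example 2 the p-value should come out close to the .182 reported above, so Ho is not rejected at the .05 level.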

What's next? Before we move on to the next test, we are going to use the z-test for proportions to bring up and illustrate some very important issues regarding hypothesis testing.

More About Hypothesis Testing

The issues regarding hypothesis testing that we will discuss are:
1. The effect of sample size on hypothesis testing.
2. Statistical significance vs. practical importance. (This will be discussed in the activity following number 1.)
3. One-sided alternative vs. two-sided alternative: understanding what is going on.
4. Hypothesis testing and confidence intervals: how are they related?
Let's start.

1. The Effect of Sample Size on Hypothesis Testing

We have already seen the effect that the sample size has on inference, when we discussed point and interval estimation for the population mean (μ) and population proportion (p). Intuitively, larger sample sizes give us more information to pin down the true nature of the population. We can therefore expect the sample mean and sample proportion obtained from a larger sample to be closer to the population mean and proportion, respectively. As a result, for the same level of confidence, we can report a smaller margin of error, and get a narrower confidence interval. What we've seen, then, is that larger sample size gives a boost to how much we trust our sample results.

In hypothesis testing, larger sample sizes have a similar effect. The following two examples will illustrate that a larger sample size provides more convincing evidence, and how the evidence manifests itself in hypothesis testing. Let's go back to our example 2 (marijuana use at a certain liberal arts college).
EXAMPLE 2

The data do not provide enough evidence that the proportion of marijuana users at the college is higher than the proportion among all U.S. college students, which is .157. So far, nothing new. Let's make small changes to the problem (and call it example 2*). The changes are highlighted and the problem is followed by a new figure that reflects the changes.

EXAMPLE 2*
There are rumors that students in a certain liberal arts college are more inclined to use drugs than U.S. college students in general. Suppose that in a simple random sample of 400 students from the college, 76 admitted to marijuana use. Do the data provide enough evidence to conclude that the proportion of marijuana users among the students in the college (p) is higher than the national proportion, which is .157? (Reported by the Harvard School of Public Health.)

We now have a larger sample (400 instead of 100), and also we changed the number of marijuana users (76 instead of 19). Let's carry out the test in this case.

I. The question of interest did not change, so we are testing the same hypotheses:
Ho: p = .157
Ha: p > .157

II. We select a random sample of size 400 and find that 76 are marijuana users. (Note that the data satisfy the conditions that allow us to use this test. Verify this yourself.) Let's summarize the data:

$$\hat{p} = \frac{76}{400} = .19$$

This is the same sample proportion as in the original problem, so it seems that the data give us the same evidence, but when we calculate the test statistic, we see that actually this is not the case:

$$z = \frac{.19 - .157}{\sqrt{\dfrac{.157(1 - .157)}{400}}} \approx 1.81$$

Even though the sample proportion is the same (.19), since here it is based on a larger sample (400 instead of 100), it is 1.81 standard deviations above the null value of .157 (as opposed to .91 standard deviations in the original problem).

III. For the p-value, we use statistical software to find p-value = 0.035.

The p-value here is .035 (as opposed to .182 in the original problem). In other words, when Ho is true (i.e. when p = .157) it is quite unlikely (probability of .035) to get a sample proportion of .19 or higher based on a sample of size 400, and not very unlikely when the sample size is 100 (probability .182).

IV. Our results here are significant. In other words, in example 2* the data provide enough evidence to reject Ho and conclude that the proportion of marijuana users at the college is higher than among all U.S. students. Let's summarize with a figure:
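The comparison between example 2 and example 2* can be reproduced numerically. The sketch below is illustrative only: the helper name `one_sided_p_value` is hypothetical, and `statistics.NormalDist` from the Python standard library stands in for the "statistical software" mentioned in the text. The same sample proportion (.19) is tested against the same null value (.157) with n = 100 and n = 400.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(p_hat, n, p0):
    """Return (z, p_value) for Ha: p > p0 in the z-test for a proportion."""
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    return z, 1 - NormalDist().cdf(z)   # upper-tail area P(Z >= z)


# Same sample proportion, two different sample sizes
for n in (100, 400):
    z, p_value = one_sided_p_value(0.19, n, 0.157)
    print(f"n = {n}: z = {z:.2f}, p-value = {p_value:.3f}")
```

Because the standard deviation of the sample proportion shrinks like 1/√n, quadrupling the sample size doubles the test statistic, which is what moves the p-value below .05.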

What do we learn from these two examples? We see that sample results that are based on a larger sample carry more weight. In example 2, we saw that a sample proportion of .19 based on a sample of size 100 was not enough evidence that the proportion of marijuana users in the college is higher than .157. Recall, from our general overview of hypothesis testing, that this conclusion (not having enough evidence to reject the null hypothesis) doesn't mean the null hypothesis is necessarily true (so, we never "accept" the null); it only means that the particular study didn't yield sufficient evidence to reject the null. It might be that the sample size was simply too small to detect a statistically significant difference. However, in example 2*, we saw that when the sample proportion of .19 is obtained from a sample of size 400, it carries much more weight, and in particular, provides enough evidence that the proportion of marijuana users in the college is higher than .157 (the national figure). In this case, the sample size of 400 was large enough to detect a statistically significant difference. The following activity will allow you to practice the ideas and terminology used in hypothesis testing when a result is not statistically significant.

3. One-Sided Alternative vs. Two-Sided Alternative

Recall that earlier we noticed (only visually) that for a given value of the test statistic z, the p-value of the two-sided test is twice as large as the p-value of the one-sided test. We will now further discuss this issue. In particular, we will use our example 2 (marijuana users at a certain college) to gain better intuition about this fact. For illustration purposes, we are actually going to use example 2* (where out of a sample of size 400, 76 were marijuana users). Let's recall example 2*, but this time give two versions of it: the original version, and a slightly changed version, which we'll call example 2**. The differences are highlighted.
EXAMPLE 2*
There are rumors that students at a certain liberal arts college are more inclined to use drugs than U.S. college students in general. Suppose that in a simple random sample of 400 students from the college, 76 admitted to marijuana use. Do the data provide enough evidence to conclude that the proportion of marijuana users among the students in the college (p) is higher than the national proportion, which is .157? (This number is reported by the Harvard School of Public Health.)

EXAMPLE 2**
The dean of students in a certain liberal arts college was interested in whether the proportion of students who use drugs in her college is different than the proportion among U.S. college students in general. Suppose that in a simple random sample of 400 students from the college, 76 admitted to marijuana use. Do the data provide enough evidence to conclude that the proportion of marijuana users among the students in the college (p) differs from the national proportion, which is .157? (This number is reported by the Harvard School of Public Health.)

Indeed, in example 2* we suspect from the outset (based on the rumors) that the overall proportion (p) of marijuana smokers at the college is higher than the reported national proportion of .157, and therefore the appropriate alternative is Ha: p > .157. In example 2**, as a result of the change of wording (which eliminated the part about the rumors), we simply wonder if p is different (in either direction) from the reported national proportion of .157, and therefore the appropriate alternative is the two-sided one, Ha: p ≠ p0 (with p0 = .157). Would switching to the two-sided alternative have an effect on our results? Let's explore that.
EXAMPLE 2*
We already carried out the test for this example, and the results are summarized in the following figure:

The following figure reminds you how the p-value was found (using the test statistic):

EXAMPLE 2**
I. Here we are testing:
Ho: p = .157
Ha: p ≠ .157

II. Since we have the same data as in example 2* (76 marijuana users out of 400), we have the same sample proportion (p̂ = .19) and the same test statistic (z ≈ 1.81).

III. Since the calculation of the p-value depends on the type of alternative we have, here is where things start to be different. Statistical software tells us that the p-value for example 2** is 0.070. Here is a figure that reminds us how the p-value was calculated (based on the test statistic):

IV. If we use the .05 level of significance, the p-value we got is not small enough (.07 > .05), and therefore we cannot reject Ho. In other words, the data do not provide enough evidence to conclude that the proportion of marijuana smokers in the college is different from the national proportion (.157).

What happened here? It should be pretty clear what happened here numerically. The p-value of the one-sided test (example 2*) is .035, suggesting the results are significant at the .05 significance level. However, the p-value of the two-sided test (example 2**) is twice the p-value of the one-sided test, and is therefore 2 * .035 = .07, suggesting that the results are not significant at the .05 significance level. Here is a more conceptual explanation:

The idea is that in example 2*, we began our hypothesis test with a piece of information (in the form of a rumor) about the unknown population proportion p, which gave us a sort of head-start towards the goal of rejecting the null hypothesis. We found that the evidence that the data provided was then enough to cross the finish line and reject Ho. In example 2**, we had no prior information to go on, and the data alone were not enough evidence to cross the finish line and reject Ho. The following figure illustrates this idea:

We can summarize and say that in general it is harder to reject Ho against a two-sided Ha because the p-value is twice as large. Intuitively, a one-sided alternative gives us a head-start, and on top of that we have the evidence provided by the data. When our alternative is the two-sided test, we get no head-start and all we have are the data, and therefore it is harder to cross the finish line and reject Ho.
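The doubling of the p-value can be checked directly. The following is a minimal sketch, assuming `statistics.NormalDist` from the Python standard library as the source of standard normal probabilities. The variable names are illustrative only. It uses the data shared by examples 2* and 2** (76 users out of 400, null value .157).

```python
from math import sqrt
from statistics import NormalDist

p_hat, n, p0 = 76 / 400, 400, 0.157
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)       # same test statistic for both versions

std_normal = NormalDist()
one_sided = 1 - std_normal.cdf(z)                # P(Z >= z), for Ha: p > p0
two_sided = 2 * (1 - std_normal.cdf(abs(z)))     # 2 P(Z >= |z|), for Ha: p != p0

alpha = 0.05                                     # significance level
print(f"one-sided p-value = {one_sided:.3f}, reject Ho: {one_sided < alpha}")
print(f"two-sided p-value = {two_sided:.3f}, reject Ho: {two_sided < alpha}")
```

The same data lead to different decisions only because the two-sided p-value counts both tails of the null distribution.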

4. Hypothesis Testing and Confidence Intervals

The last topic we want to discuss is the relationship between hypothesis testing and confidence intervals. Even though the flavor of these two forms of inference is different (confidence intervals estimate a parameter, and hypothesis testing assesses the evidence in the data against one claim and in favor of another), there is a strong link between them. We will explain this link (using the z-test and confidence interval for the population proportion), and then explain how confidence intervals can be used after a test has been carried out.

Recall that a confidence interval gives us a set of plausible values for the unknown population parameter. We may therefore examine a confidence interval to informally decide if a proposed value of population proportion seems plausible. For example, if a 95% confidence interval for p, the proportion of all U.S. adults already familiar with Viagra in May 1998, was (.61, .67), then it seems clear that we should be able to reject a claim that only 50% of all U.S. adults were familiar with the drug, since based on the confidence interval, .50 is not one of the plausible values for p. In fact, the information provided by a confidence interval can be formally related to the information provided by a hypothesis test. (Comment: The relationship is more straightforward for two-sided alternatives, and so we will not present results for the one-sided cases.) Suppose we want to carry out the two-sided test:

Ho: p = p0
Ha: p ≠ p0

using a significance level of .05. An alternative way to perform this test is to find a 95% confidence interval for p and check:
• If p0 falls outside the confidence interval, reject Ho.
• If p0 falls inside the confidence interval, do not reject Ho.
In other words, if p0 is not one of the plausible values for p, we reject Ho. If p0 is a plausible value for p, we cannot reject Ho. (Comment: Similarly, the results of a test using a significance level of .01 can be related to the 99% confidence interval.)

Let's look at two examples:

EXAMPLE

Recall example 3, where we wanted to know whether the proportion of U.S. adults who support the death penalty for convicted murderers has changed since 2003, when it was .64.

We are testing:
Ho: p = .64
Ha: p ≠ .64

and as the figure reminds us, we took a sample of 1000 U.S. adults, and the data told us that 675 supported the death penalty for convicted murderers (i.e. p̂ = .675). A 95% confidence interval for p, the proportion of all U.S. adults who support the death penalty, is:

$$.675 \pm 2\sqrt{\frac{.675(1 - .675)}{1000}} \approx .675 \pm .03 = (.645, .705)$$

Since the 95% confidence interval for p does not include .64 as a plausible value for p, we can reject Ho and conclude (as we did before) that the proportion of U.S. adults who support the death penalty for convicted murderers has changed since 2003.

EXAMPLE

You and your roommate are arguing about whose turn it is to clean the apartment. Your roommate suggests that you settle this by tossing a coin and takes one out of a locked box he has on the shelf. Suspecting that the coin might not be fair, you decide to test it first. You toss the coin 80 times, thinking to yourself that if, indeed, the coin is fair, you should get around 40 heads. Instead you get 48 heads. You are puzzled. You are not sure whether getting 48 heads out of 80 is enough evidence to conclude that the coin is unbalanced, or whether this is a result that could have happened just by chance when the coin is fair. Statistics can help you answer this question. Let p be the true proportion (probability) of heads. We want to test whether the coin is fair or not:

Ho: p = .5
Ha: p ≠ .5

The data we have are that out of n = 80 tosses, we got 48 heads, or that the sample proportion of heads is p̂ = 48/80 = .6. The 95% confidence interval for p, the true proportion of heads for this coin, is:

$$.6 \pm 2\sqrt{\frac{.6(1 - .6)}{80}} \approx .6 \pm .11 = (.49, .71)$$

Since in this case .5 is one of the plausible values for p, we cannot reject Ho. In other words, the data do not provide enough evidence to conclude that the coin is not fair.
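The confidence-interval version of the two-sided test is easy to sketch in code. The function names `ci_95` and `reject_by_ci` below are hypothetical, and the interval uses the multiplier 2 that appears in the formulas above (an approximation to the 95% normal critical value), so this is a sketch under those assumptions and not a general-purpose routine.

```python
from math import sqrt


def ci_95(p_hat, n):
    """Approximate 95% confidence interval for p, using the multiplier 2."""
    margin = 2 * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def reject_by_ci(p_hat, n, p0):
    """Two-sided test at the .05 level: reject Ho when p0 is outside the interval."""
    low, high = ci_95(p_hat, n)
    return not (low <= p0 <= high)


# Death penalty example: 675 out of 1000, null value .64
print(ci_95(675 / 1000, 1000), reject_by_ci(675 / 1000, 1000, 0.64))

# Coin example: 48 heads out of 80 tosses, null value .5
print(ci_95(48 / 80, 80), reject_by_ci(48 / 80, 80, 0.5))
```

In the first example the null value .64 falls just below the interval, so Ho is rejected. In the second, .5 lies inside the interval, so Ho is not rejected.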

Comment
The context of the last example is a good opportunity to bring up an important point that was discussed earlier.

Even though we use .05 as a cutoff to guide our decision about whether the results are significant, we should not treat it as inviolable and we should always add our own judgment. Let's look at the last example again. It turns out that the p-value of this test is .0734. In other words, it is maybe not extremely unlikely, but it is quite unlikely (probability of .0734) that when you toss a fair coin 80 times you'll get a sample proportion of heads of 48/80 = .6 (or even more extreme). It is true that using the .05 significance level (cutoff), .0734 is not considered small enough to conclude that the coin is not fair. However, if you really don't want to clean the apartment, the p-value might be small enough for you to ask your roommate to use a different coin, or to provide one yourself!

Here is our final point on this subject: When the data provide enough evidence to reject Ho, we can conclude (depending on the alternative hypothesis) that the population proportion is either less than, greater than or not equal to the null value p0. However, we do not get a more informative statement about its actual value. It might be of interest, then, to follow the test with a 95% confidence interval that will give us more insight into the actual value of p.

EXAMPLE

In our example 3, we concluded that the proportion of U.S. adults who support the death penalty for convicted murderers has changed since 2003, when it was .64. It is probably of interest not only to know that the proportion has changed, but also to estimate what it has changed to. We calculated the 95% confidence interval for p earlier and found that it is (.645, .705).

We can combine our conclusions from the test and the confidence interval and say: Data provide evidence that the proportion of U.S. adults who support the death penalty for convicted murderers has changed since 2003, and we are 95% confident that it is now between .645 and .705 (i.e. between 64.5% and 70.5%).
EXAMPLE

Let's look at our example 1 to see how a confidence interval following a test might be insightful in a different way. Here is a summary of example 1:

We conclude that as a result of the repair, the proportion of defective products has been reduced to below .20 (which was the proportion prior to the repair). It is probably of great interest to the company not only to know that the proportion of defective products has been reduced, but also to estimate what it has been reduced to, to get a better sense of how effective the repair was. A 95% confidence interval for p in this case is:

$$.16 \pm 2\sqrt{\frac{.16(1 - .16)}{400}} \approx .16 \pm .037 = (.123, .197)$$

We can therefore say that the data provide evidence that the proportion of defective products has been reduced, and we are 95% sure that it has been reduced to somewhere between 12.3% and 19.7%. This is very useful information, since it tells us that even though the results were significant (i.e., the repair reduced the number of defective products), the repair might not have been effective enough, if it managed to reduce the number of defective products only to the range provided by the confidence interval. This, of course, ties back in to the idea of statistical significance vs. practical importance that we discussed earlier. Even though the results are significant (Ho was rejected), practically speaking, the repair might be considered ineffective.

Let's summarize
Even though this unit is about the z-test for population proportion, it is loaded with very important ideas that apply to hypothesis testing in general. We've already summarized the details that are specific to the z-test for proportions, so the purpose of this summary is to highlight the general ideas.

The process of hypothesis testing has four steps:

I. Stating the null and alternative hypotheses (Ho and Ha).

II. Obtaining a random sample (or at least one that can be considered random) and collecting data. Using the data:
• Check that the conditions under which the test can be reliably used are met.
• Summarize the data using a test statistic. The test statistic is a measure of the evidence in the data against Ho. The larger the test statistic is in magnitude, the more evidence the data present against Ho.

III. Finding the p-value of the test. The p-value is the probability of getting data like those observed (or even more extreme) assuming that the null hypothesis is true, and is calculated using the null distribution of the test statistic. The p-value is a measure of the evidence against Ho. The smaller the p-value, the more evidence the data present against Ho.

IV. Making conclusions.
- Conclusions about the significance of the results: If the p-value is small, the data present enough evidence to reject Ho (and accept Ha). If the p-value is not small, the data do not provide enough evidence to reject Ho. To help guide our decision, we use the significance level as a cutoff for what is considered a small p-value. The significance cutoff is usually set at .05, but should not be considered inviolable.
- Conclusions in the context of the problem.

Results that are based on a larger sample carry more weight, and therefore as the sample size increases, the same observed effect becomes more significant. Even a very small and practically unimportant effect becomes statistically significant with a large enough sample size. The distinction between statistical significance and practical importance should therefore always be considered.

For given data, the p-value of the two-sided test is twice as large as the p-value of the one-sided test (when the data fall in the direction of the one-sided alternative). It is therefore harder to reject Ho in the two-sided case than it is in the one-sided case in the sense that stronger evidence is required. Intuitively, the hunch or information that leads us to use the one-sided test can be regarded as a head-start toward the goal of rejecting Ho.

Confidence intervals can be used in order to carry out two-sided tests (a 95% confidence interval corresponds to the .05 significance level). If the null value is not included in the confidence interval (i.e., is not one of the plausible values for the parameter), we have enough evidence to reject Ho. Otherwise, we cannot reject Ho.

If the results are significant, it might be of interest to follow up the tests with a confidence interval in order to get insight into the actual value of the parameter of interest.

So far we have talked about the logic behind hypothesis testing and then illustrated how this process proceeds in practice, using the z-test for the population proportion (p). We are now moving on to discuss testing for the population mean (μ), which is the parameter of interest when the variable of interest is quantitative. Two comments about the structure of this section:
1. The basic groundwork for carrying out hypothesis tests has already been laid in the unit on tests about proportions, and therefore we can easily modify the four steps to carry out tests about means instead, without going into all the little details.
2. This unit will have two parts, as we need to distinguish between two cases: the case where the population standard deviation (σ) is known, and the case where σ is unknown.
• In the first case (σ known), the test is called the "z-test for the population mean μ."
• In the second case (σ unknown), the test is called the "t-test for the population mean μ."

The reason for the different names (z vs. t) is exactly the same reason that the test for the proportion (p) is called a z-test. In the first case, the test statistic will have a standard normal (z) distribution (when Ho is true), and in the second case, the test statistic will have a t-distribution (when Ho is true).

Let's start.

Tests about μ when sigma (σ) is known: The z-test for the population mean

In this section, we will proceed under the assumption that the population standard deviation (σ) is known. We've already discussed the practicality of this assumption. In most situations, the population standard deviation is not known, but in some cases, especially when the variables of interest have been investigated thoroughly over the years, it would make sense to assume that σ is known. Such variables are, for example, IQs and other standardized test scores, or heights, weights, and other physical characteristics. We'll start by introducing two examples that will be our leading examples for this part.
EXAMPLE 1
The SAT is constructed so that scores in each portion have a national average of 500 and standard deviation of 100. The distribution is close to normal. The dean of students of Ross College suspects that in recent years the college attracts students who are more quantitatively inclined. A random sample of 4 students from a recent entering class at Ross College had an average math SAT (SAT-M) score of 550. Does this provide enough evidence for the dean to conclude that the mean SAT-M of all Ross College students is higher than the national mean of 500? Assume that the standard deviation of 100 applies also to all Ross College students.

Comment: This is a situation where it is quite reasonable to assume that the population standard deviation (σ) is known. SAT tests are constructed so that the standard deviation is 100, and provided that there is nothing special about students at Ross College, we can assume that σ = 100 in the population of Ross College students. Here is a figure that represents this example:

EXAMPLE 2
A certain prescription medicine is supposed to contain an average of 250 parts per million (ppm) of a certain chemical. If the concentration is higher than this, the drug may cause harmful side effects; if it is lower, the drug may be ineffective. The manufacturer runs a check to see if the mean concentration in a large shipment conforms to the target level of 250 ppm or not. A simple random sample of 100 portions is tested, and the sample mean concentration is found to be 247 ppm. It is assumed that the concentration standard deviation in the entire shipment is σ = 12 ppm.

Comment: Here it is not that clear why the assumption that σ is known to be 12 is reasonable. If shipments are checked on a regular basis, then maybe past experience has shown that indeed σ = 12. In any case, we will come back to this problem and discuss this point again later.

Like any other test, the z-test for the population mean follows the four-step process:
I. Stating the hypotheses Ho and Ha.
II. Collecting relevant data, checking that the data satisfy the conditions which allow us to use this test, and summarizing the data using a test statistic.
III. Finding the p-value of the test, the probability of obtaining data as extreme as those collected (or even more extreme, in the direction of the alternative hypothesis), assuming that the null hypothesis is true. In other words, how likely is it that the only reason for getting data like those observed is sampling variability (and not because Ho is not true)?
IV. Drawing conclusions, assessing the significance of the results based on the p-value, and stating our conclusions in context. (Do we or don't we have evidence to reject Ho and accept Ha?)

We will now go through the four steps specifically for the z-test for the population mean and apply them to our two examples.

1. Stating the Hypotheses

The null and alternative hypotheses for the z-test for the population mean (μ) have exactly the same structure as the hypotheses for the z-test for the population proportion (p):
• The null hypothesis has the form:
Ho: μ = μ0 (where μ0 is the null value).

• The alternative hypothesis takes one of the following three forms (depending on the context):
Ha: μ < μ0 (one-sided)
Ha: μ > μ0 (one-sided)
Ha: μ ≠ μ0 (two-sided)
EXAMPLE 1
In our example 1, based on a sample of 4 students from Ross College, we were testing whether the mean SAT-M of all Ross College students is higher than the national mean (which, by construction, is 500). Here is the figure that summarizes example 1:

EXAMPLE 2
Here we want to test whether the mean concentration of a certain chemical in a large shipment of a certain prescription drug is the required 250 ppm:

The null and alternative hypotheses in this case are therefore:
Ho: μ = 250
Ha: μ ≠ 250

2. Collecting Data and Summarizing Them

Since our parameter of interest is the population mean (μ), once we collect the data, we find the sample mean (x̄). However, we already know that in hypothesis testing we go a step beyond calculating the relevant sample statistic and summarize the data with a test statistic.

Recall that in the z-test for the proportion, the test statistic is the z-score (standardized value) of the sample proportion, assuming that Ho is true. It should not be very surprising that in the z-test for the population mean, we do exactly the same thing. The test statistic is the z-score (standardized value) of the sample mean (x̄) assuming that Ho is true (in other words, assuming that μ = μ0). We rely once again on probability results: in this case, we refer to results about the sampling distribution of the sample mean (X̄).

When we discussed probability models based on sampling distributions, we concluded that sample means behave as follows:
Center: The mean of the sample means is μ, the population mean.
Spread: The standard deviation of the sample means is σ/√n.
Shape: The sample means are normally distributed if the variable is normally distributed in the population or the sample size is large enough to guarantee approximate normality. Recall that this last statement is the Central Limit Theorem. As a general guideline, we said that if n > 30, the Central Limit Theorem applies and we can use a normal curve as a probability model.

Based on this description of the sampling distribution of X̄, we can define a test statistic that measures the distance between the hypothesized value of μ (denoted μ0) and the sample mean (determined by the data) in standard deviation units. The test statistic is:

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
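As a quick illustration of the formula, the sketch below computes the test statistic for the two leading examples introduced earlier. The function name `mean_z_statistic` is hypothetical, and the code assumes σ is known, as this section does.

```python
from math import sqrt


def mean_z_statistic(x_bar, mu0, sigma, n):
    """z-score of the sample mean under Ho: mu = mu0, with sigma known."""
    standard_error = sigma / sqrt(n)   # standard deviation of the sample means
    return (x_bar - mu0) / standard_error


# SAT-M example: sample mean 550, null value 500, sigma = 100, n = 4
print(mean_z_statistic(550, 500, 100, 4))

# Concentration example: sample mean 247, null value 250, sigma = 12, n = 100
print(mean_z_statistic(247, 250, 12, 100))
```

The sign of the result tells us whether the sample mean lies above or below the null value, and its magnitude gives the distance in standard deviation units.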

Comments
1. Note that our test statistic (because it is a z-score) tells us how far x̄ is from the null value μ0, measured in standard deviations. Since x̄ represents the data and μ0 represents the null hypothesis, the test statistic is a measure of how different our data are from what is claimed in the null hypothesis. The larger the test statistic is in magnitude, the more evidence we have against Ho, since what we saw in our data is very different from what Ho claims. This is an idea that we mentioned in the previous test as well.

2. As we established earlier, all inference procedures are based on probability. We are trying to determine if our sample results are likely or unlikely based on our assumptions about the population. This requires that we have a probability model that describes the long-term behavior of sample results that are randomly collected from a population that fits our hypothesis. For this reason, the Central Limit Theorem gives us criteria for deciding if the z-test for the population mean can be used. We need to verify:

(i) The sample is random (or at least can be considered as random in context).
(ii) We are in one of the three situations marked with a green check mark in the following table:

3. If the conditions are met, then X̄ values vary normally, or at least close enough to normally to use a normal model to calculate probabilities. When X̄ values are normal, then the z-scores will be normally distributed with a mean of 0 and a standard deviation of 1. Let's go back to our examples.
EXAMPLE 1
Here is a summary of example 1:

Let's start by checking the conditions:
(i) The sample is random.
(ii) The variable of interest, SAT-M scores, is assumed to vary normally in the population, so the fact that the sample size is small (n = 4) is not a problem. Sample means will be normally distributed and we can use a normal probability model based on z-scores to determine probabilities.

The sample mean is x̄ = 550, and so the test statistic is:

$$z = \frac{550 - 500}{100 / \sqrt{4}} = 1$$

This means that our data (represented by the sample mean) are only 1 standard deviation above the null value (500). Clearly, this provides some evidence against Ho, but is this strong enough evidence to reject it? Probably not. This will be confirmed when we find the p-value. Here is an updated figure:

EXAMP

!n this case0 the conditions that allo since4 (i) The sa"ple is rando".

us to carr( out the 57test are "et

(ii) The sa"ple si5e (n = $00) is large enough for the Fentral =i"it Theore" to appl( (note that in this case the large sa"ple is essential since the concentration level is not 3no n to var( nor"all().

The 57statistic in this case is4 z='&7'501'100!='.5 Cur data (represented b( the sa"ple "ean concentration levelI2.&) are 2.% standard deviations belo the null value. ; difference of 2.% standard deviations is considered @uite strong evidence against H . (:ssentiall( an( difference that is above 2 standard deviations is considered @uite large.) This ill be confir"ed hen e find the p7value of the test. Here is an updated figure that represents the h(pothesis testing process for this proble" so far4

3. Finding the p-value of the test

The p-value is the probability of getting data (summarized with the test statistic) as extreme as those observed, or even more extreme (in the direction of the alternative hypothesis), when Ho is true. For the z-test for the population mean, it is found exactly like the p-value in the z-test for the population proportion. We've already learned that the p-value is found under the null distribution of the test statistic, and since for both means (with σ known) and proportions the null distribution of the test statistic is N(0,1), the p-value is calculated as follows:

Less Than (Ha: μ < μ0): p-value = P(Z ≤ z)
Greater Than (Ha: μ > μ0): p-value = P(Z ≥ z)
Not Equal To (Ha: μ ≠ μ0): p-value = 2P(Z ≥ |z|)

EXAMPLE 1

In the example about the SAT-M scores of students at Ross College, the test statistic was found to be z = 1. The p-value is therefore P(Z > 1).

To find the p-value, we can either:
• use the (68% part of the) Standard Deviation Rule for the normal distribution, which tells us that the p-value is approximately 0.16 (since P(-1 < Z < 1) = 0.68), or
• use the normal table, or carry out the test using statistical software. In this case, we get a p-value of 0.159.

EXAMPLE 2

In the concentration level example, the test statistic was found to be -2.5. Since this is the two-sided test, the p-value is the combination of the two shaded areas in the following figure.

The p-value is therefore twice P(Z > 2.5). We can either use the table, or carry out the test using statistical software. In this case, we get a p-value of 0.012.
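
For readers who would like to check these numbers without a table, here is a minimal Python sketch of the z-test for the population mean. It is not part of the original course material (which relies on tables and statistical software), and the function names `z_statistic`, `normal_cdf`, and `z_test_p_value` are our own. It assumes σ is known and that the conditions for the z-test are met.

```python
import math


def z_statistic(xbar, mu0, sigma, n):
    """Standardized distance of the sample mean from the null value."""
    return (xbar - mu0) / (sigma / math.sqrt(n))


def normal_cdf(z):
    """P(Z <= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def z_test_p_value(z, alternative):
    """p-value under N(0,1); alternative is 'less', 'greater', or 'two-sided'."""
    if alternative == "less":
        return normal_cdf(z)
    if alternative == "greater":
        return normal_cdf(-z)  # by symmetry, P(Z >= z) = P(Z <= -z)
    if alternative == "two-sided":
        return 2 * normal_cdf(-abs(z))
    raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")


# Example 1: SAT-M, Ha: mu > 500
z1 = z_statistic(550, 500, 100, 4)
p1 = z_test_p_value(z1, "greater")

# Example 2: concentration level, Ha: mu != 250
z2 = z_statistic(247, 250, 12, 100)
p2 = z_test_p_value(z2, "two-sided")

print(f"Example 1: z = {z1:.2f}, p-value = {p1:.3f}")
print(f"Example 2: z = {z2:.2f}, p-value = {p2:.3f}")
```

The three branches of `z_test_p_value` correspond to the three forms of the alternative hypothesis; the two-sided branch doubles the tail area beyond |z|.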

So far, we've discussed the first three steps in the hypothesis testing process of the z-test for the population mean (μ). The last step is to draw conclusions.

4. Drawing Conclusions
Here we assess the significance of the results (based on the p-value compared with some significance level of choice), and state our conclusions in context.
EXAMPLE 1

Here the p-value is quite large (.16), which means that it is not very surprising to get data like those observed when Ho is true. The results are therefore not significant, and so we do not have enough evidence to reject Ho and conclude that the mean SAT-M of all Ross College students is higher than the national mean (500). Note that even though the average SAT-M in our sample was 550 (which is substantially larger than 500), since this result was based on a sample of only 4 students, it does not provide enough evidence to conclude that the mean SAT-M is higher than 500. We'll further explore this point in the next activity. Here is the completed figure representing the hypothesis testing process for this example:

EXAMPLE 2

In this example, the p-value is quite small (.012). In particular, for a significance level of .05, the p-value indicates that the results are significant. The data provide enough evidence for us to reject Ho and conclude that the mean concentration level in the shipment is not the required 250 ppm. Here is the completed figure representing the hypothesis testing process for this example:

Relating Hypothesis Tests and Confidence Intervals

Just as we did for proportions, we may examine a confidence interval to decide whether a proposed value of the population mean is plausible.

Suppose we want to test Ho: μ = μ0 vs. Ha: μ ≠ μ0 using a significance level of α = .05. An alternative way to perform this test is to find a 95% confidence interval for μ and make the following conclusions:
• If μ0 falls outside the confidence interval, reject Ho.
• If μ0 falls inside the confidence interval, do not reject Ho.

EXAMPLE

We'll use example 2, in which the alternative was two-sided. Recall that we want to check whether a medication conforms to a target concentration of a chemical ingredient by testing

Ho: μ = 250 vs. Ha: μ ≠ 250

We assume that σ = 12, and in a sample of size n = 100 we obtained a sample mean of x̄ = 247. A 95% confidence interval for μ is

$\bar{x} \pm 2 \cdot \frac{\sigma}{\sqrt{n}} = 247 \pm 2 \cdot \frac{12}{\sqrt{100}} = 247 \pm 2.4 = (244.6, 249.4)$

Since the interval does not contain 250, we reject Ho and conclude that the alternative is true: the population mean concentration differs from 250.
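
The confidence-interval version of the two-sided test can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the course's own software: the helper names are ours, σ is treated as known, and the multiplier defaults to 2 as in the calculation above (statistical software uses the more precise 1.96 for a 95% interval).

```python
import math


def mean_confidence_interval(xbar, sigma, n, multiplier=2.0):
    """Interval xbar +/- multiplier * sigma / sqrt(n), with sigma assumed known."""
    margin = multiplier * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin


def two_sided_decision(mu0, interval):
    """Reject Ho exactly when the null value falls outside the interval."""
    low, high = interval
    if mu0 < low or mu0 > high:
        return "reject Ho"
    return "do not reject Ho"


ci = mean_confidence_interval(xbar=247, sigma=12, n=100)
decision = two_sided_decision(250, ci)
print(f"95% CI: ({ci[0]:.1f}, {ci[1]:.1f}); decision: {decision}")
```

With `multiplier=1.96` the same function gives a slightly narrower interval, and the decision for this example does not change.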

Comment
Beyond using the confidence interval as a quick way to carry out the two-sided test, the confidence interval can provide insight into the actual value of the population mean if Ho is rejected. In the concentration level example, Ho was rejected, and all we could conclude about the mean concentration level of the entire shipment, μ, was that it was not 250. The 95% confidence interval for μ (244.6, 249.4) gives us an idea of what plausible values for μ would be. In particular, we can conclude that since the confidence interval lies below 250, at least a large portion of the shipment contains medication that is ineffective.

Tests About μ When σ is Unknown: The t-test for the Population Mean

As we mentioned earlier, only in a few cases is it reasonable to assume that the population standard deviation, σ, is known. The case where σ is unknown is much more common in practice. What can we use to replace σ? If you don't know the population standard deviation, the best you can do is find the sample standard deviation, S, and use it instead of σ. (Note that this is exactly what we did when we discussed confidence intervals.)

Is that it? Can we just use S instead of σ, and the rest is the same as the previous case? Unfortunately, it's not that simple, but not very complicated either. We will first go through the four steps of the t-test for the population mean and explain in what way this test is different from the z-test in the previous case. For comparison purposes, we will then apply the t-test to a variation of the two examples we used in the previous case, and end with an activity where you'll get to carry out the t-test yourself. Let's start by describing the four steps for the t-test:

I. Stating the hypotheses. In this step there are no changes:

• The null hypothesis has the form:

Ho: μ = μ0 (where μ0 is the null value).

• The alternative hypothesis takes one of the following three forms (depending on the context):

Ha: μ < μ0 (one-sided)
Ha: μ > μ0 (one-sided)
Ha: μ ≠ μ0 (two-sided)

II. Checking the conditions under which the t-test can be safely used and summarizing the data. Technically, this step only changes slightly compared to what we do in the z-test. However, as you'll see, this small change has important implications. The conditions under which the t-test can be safely carried out are exactly the same as those for the z-test:
(i) The sample is random (or at least can be considered random in context).
(ii) We are in one of the three situations marked with a green check mark in the following table (which ensure that X̄ is at least approximately normal):

Assuming that the conditions are met, we calculate the sample mean x̄ and the sample standard deviation, S (which replaces σ), and summarize the data with a test statistic. As in the z-test, our test statistic will be the standardized score of x̄ assuming that μ = μ0 (Ho is true). The difference here is that we don't know σ, so we use S instead. The test statistic for the t-test for the population mean is therefore:

$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$

The change is in the denominator: while in the z-test we divided by the standard deviation of X̄, namely $\sigma/\sqrt{n}$, here we divide by the standard error of X̄, namely $s/\sqrt{n}$. Does this have an effect on the rest of the test? Yes. The t-test statistic in the test for the mean does not follow a standard normal distribution. Rather, it follows another bell-shaped distribution called the t distribution. So we first need to introduce you to this new distribution as a general object. Then, we'll come back to our discussion of the t-test for the mean and how the t distribution arises in that context.
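
To make the role of s concrete, here is a small Python sketch that computes the t statistic directly from raw data. It is only an illustration: the function name `t_statistic` is ours, and the four scores are made-up values chosen so that the sample mean is 550 and the sample standard deviation is 100.

```python
import math
import statistics


def t_statistic(sample, mu0):
    """Standardized score of the sample mean, using the sample standard deviation s."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)          # sample standard deviation (divides by n - 1)
    standard_error = s / math.sqrt(n)     # estimated standard deviation of the sample mean
    return (xbar - mu0) / standard_error


# Hypothetical SAT-M scores, invented for illustration only.
scores = [700, 500, 500, 500]
t = t_statistic(scores, mu0=500)
print(f"x-bar = {statistics.mean(scores)}, s = {statistics.stdev(scores):.1f}, t = {t:.2f}")
```

Unlike σ, the value of `s` is recomputed from every new sample, which is the extra source of variability that the t distribution accounts for.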

The t Distribution
We have seen that variables can be visually modeled by many different sorts of shapes, and we call these shapes distributions. Several distributions arise so frequently that they have been given special names, and they have been studied mathematically. So far in the course, the only one we've named is the normal distribution, but there are others. One of them is called the t distribution.

The t distribution is another bell-shaped (unimodal and symmetric) distribution, like the normal distribution; and the center of the t distribution is standardized at zero, like the center of the normal distribution. Like all distributions that are used as probability models, the normal and the t distribution are both scaled, so the total area under each of them is 1. So how is the t distribution fundamentally different from the normal distribution? The spread. The following picture illustrates the fundamental difference between the normal distribution and the t distribution:

You can see in the picture that the t distribution has slightly less area near the expected central value than the normal distribution does, and you can see that the t distribution has correspondingly more area in the "tails" than the normal distribution does. (It's often said that the t distribution has "fatter tails" or "heavier tails" than the normal distribution.) This reflects the fact that the t distribution has a larger spread than the normal distribution. The same total area of 1 is spread out over a slightly wider range on the t distribution, making it a bit lower near the center compared to the normal distribution, and giving the t distribution slightly more probability in the "tails" compared to the normal distribution. Therefore, the t distribution ends up being the appropriate model in certain cases where there is more variability than would be predicted by the normal distribution. One of these cases is stock values, which have more variability (or "volatility," to use the economic term) than would be predicted by the normal distribution.

There's actually an entire family of t distributions. They all have similar formulas (but the math is beyond the scope of this introductory course in statistics), and they all have slightly "fatter tails" than the normal distribution. But some are closer to normal than others. The t distributions that are closer to normal are said to have higher "degrees of freedom" (that's a mathematical concept that we won't use in this course, beyond merely mentioning it here). So, there's a t distribution "with one degree of freedom," another t distribution "with 2 degrees of freedom," which is slightly closer to normal, another t distribution "with 3 degrees of freedom," which is a bit closer to normal than the previous ones, and so on. The following picture illustrates this idea with just a couple of t distributions (note that "degrees of freedom" is abbreviated "d.f." on the picture):

Recall that we were discussing the situation of testing for a mean, in the case when sigma is unknown. We've seen previously that when sigma is known, the test statistic is $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ (note the sigma (σ) in the formula), which follows a normal distribution. But when sigma is unknown, the test statistic in the test for a mean becomes $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ (note the use of "s" in the formula, in place of the unknown sigma). Here is where the t distribution arises in the context of a test for a mean, because $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ (with "s" in the formula in place of the unknown sigma) follows a t distribution. Notice the only difference between the formula for the Z statistic and the formula for the t statistic: In the formula for the Z statistic, sigma (the standard deviation of the population) must be known; whereas, when sigma isn't known, then "s" (the standard deviation of the sample data) is used in place of the unknown sigma. That's the change that causes the statistic to be a t statistic.

Why would this single change (using "s" in place of "sigma") result in a sampling distribution that is the t distribution instead of the standard normal (Z) distribution? Remember that the t distribution is more appropriate in cases where there is more variability. So why is there more variability when s is used in place of the unknown sigma? Well, remember that sigma (σ) is a parameter (it's the standard deviation of the population), whose value therefore never changes. Whereas, s (the standard deviation of the sample data) varies from sample to sample, and therefore it's another source of variation. So, using s in place of sigma causes the sampling distribution to be the t distribution because of that extra source of variation:
• In the formula $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$, the only source of variation is the sampling variability of the sample mean X̄ (none of the other terms in that formula vary randomly in a given study).
• In the formula $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$, there are two sources of variation: one source is the sampling variability of the sample mean X̄; the other source is the sampling variability of the sample standard deviation s.
So, in a test for a mean, if sigma isn't known, then s is used in place of the unknown sigma and that results in the test statistic being a t score. The t score, in the context of a test for a mean, is summarized by the following figure:

In fact, the t score that arises in the context of a test for a mean is a t score with (n - 1) degrees of freedom. Recall that each t distribution is indexed according to "degrees of freedom." Notice that, in the context of a test for a mean, the degrees of freedom depend on the sample size in the study. Remember that we said that higher degrees of freedom indicate that the t distribution is closer to normal. So in the context of a test for the mean, the larger the sample size, the higher the degrees of freedom, and the closer the t distribution is to a normal z distribution. This is summarized with the notation near the bottom on the following image:

As a result, in the context of a test for a mean, the effect of the t distribution is most important for a study with a relatively small sample size. We are now done introducing the t distribution. What are the implications of all of this?

1. The null distribution of our t-test statistic $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ is the t distribution with (n - 1) d.f. In other words, when Ho is true (i.e., when μ = μ0), our test statistic has a t distribution with (n - 1) d.f., and this is the distribution under which we find p-values.

2. For a large sample size (n), the null distribution of the test statistic is approximately Z, so whether we use t(n - 1) or Z to calculate the p-values should not make a big difference. Here is another practical way to look at this point. If we have a large n, our sample has more information about the population. Therefore, we can expect the sample standard deviation s to be close enough to the population standard deviation, σ, so that for practical purposes we can use s as the known σ, and we're back to the z-test.
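
The claim that t(n - 1) approaches Z as the degrees of freedom grow can be checked numerically. The sketch below goes slightly beyond the course, which does not cover the formula: it uses the standard textbook density of the t distribution, and the function names `t_pdf` and `normal_pdf` are our own. It compares the height of each curve at the center (x = 0) and far out in the tail (x = 3).

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom (standard formula)."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


print("d.f.   height at 0   height at 3")
for df in (1, 3, 10, 30, 99):
    print(f"{df:>4}   {t_pdf(0, df):.4f}        {t_pdf(3, df):.4f}")
print(f"   Z   {normal_pdf(0):.4f}        {normal_pdf(3):.4f}")
```

As the degrees of freedom increase, the center of the t curve rises toward the normal height of about 0.3989 and the tail height falls toward the normal value, which is the "fatter tails, lower center" picture described earlier.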

3. Finding the p-value


The p-value of the t-test is found exactly the same way as it is found for the z-test, except that the t distribution is used instead of the Z distribution, as the figures below illustrate.

Comment:
Even though tables exist for the different t distributions, we will only use software to do the calculation for us.

Comment:
Note that due to the symmetry of the t distribution, for a given value of the test statistic t, the p-value for the two-sided test is twice as large as the p-value of the one-sided test whose alternative points in the direction of the observed t. The same thing happens when p-values are calculated under the t distribution as when they are calculated under the Z distribution.

4. Drawing Conclusions
As usual, based on the p-value (and some significance level of choice) we assess the significance of results, and draw our conclusions in context.

To summarize: The main difference between the z-test and the t-test for the population mean is that we use the sample standard deviation s instead of the unknown population standard deviation σ. As a result, the p-values are calculated under the t distribution instead of under the Z distribution. Since we are using software, this doesn't really impact us practically. However, it is important to understand what is going on behind the scenes, and not just use the software mechanically. This is why we went through the trouble of explaining the t distribution. We are now ready to look at two examples.

For comparison purposes, we will use a modified version of the two problems we used in the previous case. We'll first introduce the modified versions and explain the changes.
EXAMPLE 1

The SAT is constructed so that scores have a national average of 500. The distribution is close to normal. The dean of students of Ross College suspects that in recent years the college attracts students who are more quantitatively inclined. A random sample of 4 students entering Ross College had an average math SAT (SAT-M) score of 550, and a sample standard deviation of 100. Does this provide enough evidence for the dean to conclude that the mean SAT-M of all Ross College students is higher than the national mean of 500? Here is a figure that represents this example where the changes are marked in blue:

Note that the problem was changed so that the population standard deviation (which was assumed to be 100 before) is now unknown, and instead we assume that the sample of 4 students produced a sample mean of 550 (no change) and a sample standard deviation of s = 100. (Sample standard deviations are never such nice rounded numbers, but for the sake of comparison we left it as 100.) Note that due to the changes, the z-test for the population mean is no longer appropriate, and we need to use the t-test.
EXAMPLE 2

A certain prescription medicine is supposed to contain an average of 250 parts per million (ppm) of a certain chemical. If the concentration is higher than this, the drug may cause harmful side effects; if it is lower, the drug may be ineffective. The manufacturer runs a check to see if the mean concentration in a large shipment conforms to the target level of 250 ppm or not. A simple random sample of 100 portions is tested, and the sample mean concentration is found to be 247 ppm with a sample standard deviation of 12 ppm. Again, here is a figure that represents this example where the changes are marked in blue:

The changes are similar to example 1: we no longer assume that the population standard deviation is known, and instead use the sample standard deviation of 12. Again, the problem was thus changed from a z-test problem to a t-test problem. However, as we mentioned earlier, due to the large sample size (n = 100) there should not be much difference whether we use the z-test or the t-test. The sample standard deviation, s, is expected to be close enough to the population standard deviation σ. We'll see this as we solve the problem.

Let's carry out the t-test for both of these problems:

Example 1:
1. There are no changes in the hypotheses being tested: Ho: μ = 500 vs. Ha: μ > 500.

2. The conditions that allow us to use the t-test are met since:
(i) The sample is random.
(ii) SAT-M is known to vary normally in the population (which is crucial here, since the sample size is only 4).
In other words, we are in the following situation:

The test statistic is

$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{550 - 500}{100/\sqrt{4}} = 1$

The data (represented by the sample mean) are 1 standard error above the null value.

3. Finding the p-value. Recall that in general the p-value is calculated under the null distribution of the test statistic, which, in the t-test case, is t(n - 1). In our case, in which n = 4, the p-value is calculated under the t(3) distribution:

Using statistical software, we find that the p-value is 0.196. For comparison purposes, the p-value that we got when we carried out the z-test for this problem (when we assumed that 100 is the known σ rather than the calculated sample standard deviation, s) was 0.159. It is not surprising that the p-value of the t-test is larger, since the t distribution has fatter tails. Even though in this particular case the difference between the two values does not have practical implications (since both are large and will lead to the same conclusion), the difference is not trivial.

4. Making conclusions. The p-value (0.196) is large, indicating that the results are not significant. The data do not provide enough evidence to conclude that the mean SAT-M among Ross College students is higher than the national mean (500). Here is a summary:

Example 2:
1. There are no changes in the hypotheses being tested: Ho: μ = 250 vs. Ha: μ ≠ 250.

2. The conditions that allow us to use the t-test are met:
(i) The sample is random.
(ii) The sample size is large enough for the Central Limit Theorem to apply and ensure the normality of X̄.
In other words, we are in the following situation:

The test statistic is:

$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{247 - 250}{12/\sqrt{100}} = -2.5$

The data (represented by the sample mean) are 2.5 standard errors below the null value.

3. Finding the p-value. To find the p-value we use statistical software, and we calculate a p-value of 0.014 with a 95% confidence interval of (244.619, 249.381). For comparison purposes, the output we got when we carried out the z-test for the same problem was a p-value of 0.012 with a 95% confidence interval of (244.648, 249.352). Note that here the difference between the p-values is quite negligible (.002). This is not surprising, since the sample size is quite large (n = 100), in which case, as we mentioned, the z-test (in which we are treating s as the known σ) is a very good approximation to the t-test. Note also how the two 95% confidence intervals are similar (for the same reason).

4. Conclusions: The p-value is small (.014), indicating that at the 5% significance level, the results are significant. The data therefore provide evidence to conclude that the mean concentration in the entire shipment is not the required 250. Here is a summary:

Comments
1. The 95% confidence interval for μ can be used here in the same way it is used when σ is known: either as a way to conduct the two-sided test (checking whether the null value falls inside or outside the confidence interval) or following a t-test where Ho was rejected (in order to get insight into the value of μ).
2. While it is true that when σ is unknown and for large sample sizes the z-test is a good approximation for the t-test, since we are using software to carry out the t-test anyway, there is not much gain in using the z-test as an approximation instead. We might as well use the more exact t-test regardless of the sample size. However, it is always worthwhile knowing what happens behind the scenes.
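
In that "behind the scenes" spirit, the two t-test p-values quoted in the examples can be reproduced with a short Python sketch. This is not how statistical packages do it internally; it is a rough illustration that integrates the standard t density numerically with Simpson's rule. The names `t_pdf` and `t_upper_tail` and the choice of 2000 steps are our own assumptions.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom (standard formula)."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=2000):
    """P(T >= t): 0.5 minus the area between 0 and t (Simpson's rule, steps must be even)."""
    if t < 0:
        return 1 - t_upper_tail(-t, df, steps)
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2   # Simpson weights: 4 for odd, 2 for even interior points
        total += weight * t_pdf(i * h, df)
    return 0.5 - total * h / 3


# Example 1: t = 1, n = 4, one-sided (Ha: mu > 500), null distribution t(3)
p_example1 = t_upper_tail(1.0, df=3)

# Example 2: t = -2.5, n = 100, two-sided (Ha: mu != 250), null distribution t(99)
p_example2 = 2 * t_upper_tail(abs(-2.5), df=99)

print(f"Example 1 p-value: {p_example1:.3f}")
print(f"Example 2 p-value: {p_example2:.3f}")
```

The results agree with the software values of 0.196 and 0.014 to three decimal places, and both are larger than the corresponding z-test p-values, as the fatter tails of the t distribution would suggest.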

To Summarize
1. In hypothesis testing for the population mean (μ), we distinguish between two cases:
I. The less common case when the population standard deviation (σ) is known.
II. The more practical case when the population standard deviation is unknown and the sample standard deviation (s) is used instead.
2. In the case when σ is known, the test for μ is called the z-test, and in the case when σ is unknown and s is used instead, the test is called the t-test.
3. In both cases, the null hypothesis is Ho: μ = μ0, and the alternative, depending on the context, is one of the following: Ha: μ < μ0, or Ha: μ > μ0, or Ha: μ ≠ μ0.
4. Both tests can be safely used as long as the following two conditions are met:
(i) The sample is random (or can at least be considered random in context).
(ii) Either the sample size is large (n > 30) or, if not, the variable of interest can be assumed to vary normally in the population.
5. In the z-test, the test statistic is $z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$, whose null distribution is the standard normal distribution (under which the p-values are calculated).
6. In the t-test, the test statistic is $t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$, whose null distribution is t(n - 1) (under which the p-values are calculated).
7. For large sample sizes, the z-test is a good approximation for the t-test.
8. Confidence intervals can be used to carry out the two-sided test (Ha: μ ≠ μ0), and in cases where Ho is rejected, the confidence interval can give insight into the value of the population mean (μ).
9. Here is a summary of which test to use under which conditions:

This module covered the z-test for the population proportion and both the z-test and t-test for the population mean. The following table summarizes when each of the tests is used:

The module is also loaded with very important ideas that apply to the general process of hypothesis testing. Thus, the following summary discusses each of the above named hypothesis tests within the context of the hypothesis testing process. The process of hypothesis testing has four steps:

I. Stating the null and alternative hypotheses (Ho and Ha).

Type of Hypothesis Test | Null Hypothesis | Alternative Hypothesis
z-test for the Population Proportion | Ho: p = p0 | Ha: p ≠ p0 or Ha: p < p0 or Ha: p > p0
z-test for the Population Mean | Ho: μ = μ0 | Ha: μ ≠ μ0 or Ha: μ < μ0 or Ha: μ > μ0
t-test for the Population Mean | Ho: μ = μ0 | Ha: μ ≠ μ0 or Ha: μ < μ0 or Ha: μ > μ0

II. Obtaining a random sample (or at least one that can be considered random) and collecting data. Using the data:

• Check that the conditions under which the test can be reliably used are met.

For the z-test for the Population Proportion, we can reliably use the test if the following conditions hold: n·p0 ≥ 10 and n(1 - p0) ≥ 10. For the z-test for the Population Mean and the t-test for the Population Mean, the following table is a summary of the conditions under which they can be reliably used, and which test to use when:

• Summarize the data using a test statistic. The test statistic is a measure of the evidence in the data against Ho. The larger the test statistic is in magnitude, the more evidence the data present against Ho.

Hypothesis Test | Test Statistic
z-test for the Population Proportion | $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$
z-test for the Population Mean | $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
t-test for the Population Mean | $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$

III. Finding the p-value of the test. The p-value is the probability of getting data like those observed (or even more extreme) assuming that the null hypothesis is true, and is calculated using the null distribution of the test statistic. The p-value is a measure of the evidence against Ho. The smaller the p-value, the more evidence the data present against Ho.

In this module, we learned how to compute the p-value for the two z-tests (the z-test for the population proportion and the z-test for the population mean). However, for the t-test (and, actually, from this point on in the course), we will use software to find the p-value for us.

IV. Making conclusions.

• Conclusions about the significance of the results: If the p-value is small, the data present enough evidence to reject Ho (and accept Ha). If the p-value is not small, the data do not provide enough evidence to reject Ho.

To help guide our decision, we use the significance level as a cutoff for what is considered a small p-value. The significance cutoff is usually set at .05, but should not be considered inviolable.

• Conclusions should always be made in the context of the problem.

Additional "Big Ideas" about hypothesis testing.

Note: These ideas were already mentioned in the summary for hypothesis testing for the population proportion p; however, it is worth repeating them to stress that these ideas apply to hypothesis testing in general.

• Results that are based on a larger sample carry more weight, and therefore results that are not significant (do not provide evidence to reject Ho) may become significant if based on a larger sample size. As a result...

• Even a very small and practically unimportant effect becomes statistically significant with a large enough sample size. The distinction between statistical significance and practical importance should therefore always be considered.

• For given data, the p-value of the two-sided test is always twice as large as the p-value of the one-sided test whose alternative agrees with the direction of the data. It is therefore harder to reject Ho in the two-sided case than it is in the one-sided case in the sense that stronger evidence is required. Intuitively, the hunch or information that leads us to use the one-sided test can be regarded as a head start toward the goal of rejecting Ho.

• 95% confidence intervals can be used in order to carry out two-sided tests (at the .05 significance level). If the null value is not included in the confidence interval (i.e., is not one of the plausible values for the parameter), we have enough evidence to reject Ho. Otherwise, we cannot reject Ho.

• If the results are significant, it might be of interest to follow up the tests with a confidence interval in order to get insight into the actual value of the parameter of interest.

The following StatTutor exercise is the first one in which the "More Formal Analysis" (i.e., "inference") node is active. This exercise will therefore give you a chance to practice some of the methods you learned, but most importantly, will give you a sense of how inference fits into and completes the big picture of statistics.

What Can Go Wrong: Two Types of Errors

Statistical investigations involve making decisions in the face of uncertainty. So there is always some chance of making a wrong decision. In hypothesis testing, the following decisions can occur:
• If the null hypothesis is true and we do not reject it, it is a correct decision.
• If the null hypothesis is false and we reject it, it is a correct decision.
• If the null hypothesis is true, but we reject it, this is a type I error.
• If the null hypothesis is false, but we fail to reject it, this is a type II error.
Type I and type II errors are not caused by mistakes. They are the result of random chance. The data provide evidence for a conclusion that is false. It's no one's fault!
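
The idea that a type I error comes from random chance rather than from a mistake can be illustrated with a small simulation. The Python sketch below is our own illustration, not part of the course: it repeatedly draws samples from a population in which the null hypothesis really is true and counts how often a two-sided z-test at the .05 level rejects it anyway. The population values (mean 500, standard deviation 100), the sample size of 25, the number of trials, and the random seed are all arbitrary assumptions.

```python
import math
import random


def z_test_rejects(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test with sigma known; returns True when Ho is rejected."""
    n = len(sample)
    xbar = sum(sample) / n
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    p_value = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * P(Z >= |z|)
    return p_value < alpha


def type_one_error_rate(trials=20000, n=25, mu0=500, sigma=100, seed=1):
    """Fraction of samples, drawn with Ho true, for which the test rejects Ho."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        if z_test_rejects(sample, mu0, sigma):
            rejections += 1
    return rejections / trials


rate = type_one_error_rate()
print(f"Ho was true in every trial, yet it was rejected in {rate:.1%} of them")
```

Every rejection counted here is a type I error: the analysis was carried out correctly each time, and the sample simply happened to look unusual.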

EXAMPLE

A Courtroom Analogy for Hypothesis Tests


Defendants standing trial for a crime are considered innocent until evidence shows they are guilty. It is the job of the prosecution to present evidence that shows the defendant is guilty "beyond a reasonable doubt." It is the job of the defense to challenge this evidence and establish a reasonable doubt. The jury weighs the evidence and makes a decision. When a jury makes a decision, only two verdicts are possible:

Guilty: The jury concludes that there is enough evidence to convict the defendant. The evidence is so strong that there is not a reasonable doubt of the defendant's guilt.
Not guilty: The jury concludes that there is not enough evidence to conclude beyond a reasonable doubt that the person is guilty.

Notice that a verdict of "not guilty" is not a conclusion that the defendant is innocent. This verdict says only that there is not enough evidence to return a guilty verdict.

How is this like a hypothesis test?
The null hypothesis, Ho, in American courtrooms is "the defendant is innocent." The alternative hypothesis, Ha, is "the defendant is guilty." The evidence presented in the case is the data on which the verdict is based. In a courtroom, the defendant is assumed to be innocent until proven guilty. In a hypothesis test, we assume the null hypothesis is true until the data indicate otherwise.

The two possible verdicts are similar to the two conclusions that are possible in a hypothesis test.

Reject the null hypothesis: When we reject a null hypothesis, we accept the alternative hypothesis. This is equivalent to a guilty verdict in the courtroom. The evidence is strong enough for the jury to reject the initial assumption of innocence. In a hypothesis test, the data is strong enough for us to reject the assumption that the null hypothesis is true.

Fail to reject the null hypothesis: When we fail to reject the null hypothesis, we are delivering a "not guilty" verdict. The jury concludes that the evidence is not strong enough to reject the assumption of innocence. So the data is too weak to support a guilty verdict. We conclude the data is not strong enough to reject the null hypothesis. In other words, the data is too weak to accept the alternative hypothesis.

How does the courtroom analogy relate to Type I and Type II errors?

Type I error: The evidence leads the jury to convict an innocent person. By analogy, we reject a true null hypothesis and accept a false alternative hypothesis.

Type II error: The evidence leads the jury to declare a defendant not guilty, when he is in fact guilty. By analogy, we fail to reject a null hypothesis that is false. In other words, we do not accept an alternative hypothesis when it is really true.

It would be nice to know when each of these errors is happening when we make our decision about the null hypothesis, but statistical decisions are based on evidence gathered through sampling, and our sampling evidence will sometimes fool us. As long as we are making a decision, we will never be able to eliminate the potential for these two types of errors. Thus, we need to learn how to adjust to the consequences of making these types of errors.

What is the probability that we will make a Type I error?


If the significance level is 5 percent (α = .05), then 5 percent of the time, we will reject the null hypothesis, even if it is true. Of course we will not know whether the null hypothesis is true. But if it is, the natural variability that we expect in random samples will produce "rare" results 5 percent of the time. This makes sense, because when we create the sampling distribution, we assume the null hypothesis is true. We look at the variability in random samples selected from the population described by the null hypothesis.

Similarly, if the significance level is 1 percent, then we can expect the sample results to lead us to reject the null hypothesis 1 percent of the time. In other words, about one in 100 data sets would show "rare" results that contrast with the other 99 data sets, leading us to reject the null hypothesis. If the null hypothesis is actually true, then by chance alone, we will reject a true null hypothesis 1 percent of the time. So the probability of a type I error in this case is 1 percent. In general, the probability of a type I error is α.
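The claim that a true null hypothesis is rejected about α of the time can be illustrated with a small simulation. This is only a sketch under stated assumptions: it uses a one-sided z-test for a proportion, a made-up null value of 0.5, samples of size 200, and the usual one-sided critical value 1.645 for α = .05. The function names are illustrative and do not come from the course.

```python
import math
import random

def one_sided_z(p_hat, p0, n):
    """z statistic for H0: p = p0 against Ha: p > p0."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

def type_one_error_rate(p0, n, z_crit, n_samples, seed=0):
    """Share of samples drawn with H0 actually true that still reject H0."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(n_samples):
        # Draw one random sample from a population where p really equals p0
        successes = sum(rng.random() < p0 for _ in range(n))
        if one_sided_z(successes / n, p0, n) > z_crit:
            rejections += 1
    return rejections / n_samples

# 1.645 is the one-sided critical value for alpha = 0.05 (hypothetical setup)
rate = type_one_error_rate(p0=0.5, n=200, z_crit=1.645, n_samples=2000)
print(f"share of true-H0 samples that were rejected: {rate:.3f}")
```

Because counts are discrete and the number of simulated samples is finite, the printed share should land near 0.05 rather than exactly on it.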

What is the probability that we will make a Type II error?


As you have just seen, the probability of a type I error is equal to the significance level, α. The probability of a type II error is much more complicated to calculate, but for a fixed sample size it is inversely related to the probability of making a type I error. Thus, reducing the chance of making a type II error causes an increase in the likelihood of a type I error.

Decreasing the Chance of Type I or Type II Error


How can we decrease the chance of a type I or type II error? Well, decreasing the chance of a type I error increases the chance of a type II error, so we must weigh the consequences of these errors before deciding how to proceed.

Recall that the probability of committing a type I error is α. Why is this? Well, when we choose a level of significance (α), we are choosing a benchmark for rejecting the null hypothesis. If the null hypothesis is true, then the probability that we will reject a true null hypothesis is α. So the smaller α is, the smaller the probability of a type I error will be.

It is more complicated to calculate the probability of a type II error. The best way to reduce the probability of a type II error is to increase the sample size. But once the sample size is set, larger values of α will decrease the probability of a type II error while increasing the probability of a type I error.
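The trade-off described above can be made concrete with a rough calculation. The sketch below assumes a one-sided test of a proportion with a normal approximation, a null value of 0.61, and a hypothetical true proportion of 0.64 (the type II error probability always depends on which alternative value is actually true). The helper names and the critical values 1.645 (α = .05) and 2.326 (α = .01) are assumptions for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def type_two_error_prob(p0, p_true, n, z_crit):
    """Approximate P(fail to reject H0: p = p0) when the true proportion is p_true.

    One-sided test with Ha: p > p0, using the normal approximation.
    """
    # H0 is rejected when the sample proportion exceeds this cutoff
    cutoff = p0 + z_crit * math.sqrt(p0 * (1 - p0) / n)
    # Spread of the sample proportion when p_true is the real value
    se_true = math.sqrt(p_true * (1 - p_true) / n)
    return normal_cdf((cutoff - p_true) / se_true)

beta_alpha_05 = type_two_error_prob(0.61, 0.64, n=1000, z_crit=1.645)
beta_alpha_01 = type_two_error_prob(0.61, 0.64, n=1000, z_crit=2.326)
beta_bigger_n = type_two_error_prob(0.61, 0.64, n=4000, z_crit=1.645)

print(f"n=1000, alpha=.05: {beta_alpha_05:.3f}")
print(f"n=1000, alpha=.01: {beta_alpha_01:.3f}")
print(f"n=4000, alpha=.05: {beta_bigger_n:.3f}")
```

Under these assumptions, lowering α at a fixed sample size raises the type II error probability, while a larger sample lowers it without touching α.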
Learn by doing

A double-blind experiment is conducted to investigate the side effects of hormone replacement therapy for women with menopausal symptoms. The experiment randomly assigns more than 16,000 American women to either a hormone treatment or a placebo. After five years, the HRT study finds no significant difference in the proportion of women developing breast cancer and heart disease. Researchers decide, based on this finding, to allow the study to continue. As the null hypothesis was not rejected, there is a chance that the researchers made a type II error.

Suppose that at the end of the five-year study described above, a greater proportion of the hormone-treated group have breast cancer and heart disease. This observed difference is statistically significant. Researchers are so alarmed by the results that the experiment is ended early to prevent further harm to the health of the women participating in the hormone group. Since the null hypothesis was rejected, it is possible researchers made a type I error.

Did I get this?

An NBC News/Wall Street Journal poll conducted in 2010 determined that 61 percent of Americans did not support the Tea Party movement. In a poll of 1,000 Americans this year, 64 percent say they do not support the Tea Party movement. Has opposition to the Tea Party movement increased since 2010? We tested the following hypotheses at the 5 percent level of significance:

H0: The proportion of Americans this year who oppose the Tea Party movement is 0.61.
Ha: The proportion of Americans this year who oppose the Tea Party movement is greater than 0.61.

The p-value is .026, so we reject the null hypothesis, H0, and accept the alternative hypothesis, Ha. We conclude that public opposition to the Tea Party movement is greater than 61% this year. Because the null hypothesis has been rejected, it is possible that the researchers made a type I error.
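The p-value quoted in the poll exercise can be checked with a few lines of arithmetic. This sketch assumes the one-sample z-test for a proportion from earlier in the course and a normal approximation for the p-value; the variable and function names are illustrative only.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p0 = 0.61      # null value: proportion opposed in 2010
p_hat = 0.64   # sample proportion opposed this year
n = 1000       # sample size
alpha = 0.05   # significance level used in the exercise

# Test statistic: (sample estimate - null value) / standard error under H0
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Ha is "greater than", so the p-value is the upper-tail probability
p_value = 1.0 - normal_cdf(z)
reject_h0 = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.3f}, reject H0: {reject_h0}")
```

Since the p-value falls below α = .05, the decision is to reject H0, which is exactly the situation in which a type I error is possible.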


General guidelines for choosing a level of significance:
If the consequences of a type I error are more serious, choose a small level of significance (α).
If the consequences of a type II error are more serious, choose a larger level of significance (α). But remember that the level of significance is the probability of committing a type I error.

In general, we choose the largest level of significance that we can tolerate as the chance of making a type I error.

Note: It is not always the case that one type of error is worse than the other.

Wrap Up: Type I and Type II Errors


It is important to remember to take into consideration the possibility of the occurrence of Type I or Type II error when drawing conclusions from hypothesis tests. Thus:

Whenever there is a failure to reject the null hypothesis, it is possible that a Type II error has occurred. That is, it is concluded from the results of the study that there are no significant differences, even though, in reality, there are significant differences.
Whenever the null hypothesis is rejected, it is possible that a Type I error has been committed. That is, it is concluded from the results of the study that there are significant differences, when, in reality, there are no differences.

However, it is not possible to know when either Type I or Type II error has occurred.

Introduction
Case C→Q (1 of 2)
Case C→Q (2 of 2)
Two Independent Samples
Two Independent Samples (1 of 7) through (7 of 7)
Matched Pairs
Matched Pairs (1 of 8) through (8 of 8)
ANOVA
ANOVA (1 of 7) through (7 of 7)
Conclusion of Case C→Q

Introduction
In the previous two modules we performed inference for one variable. More specifically, we learned about inference for the population proportion p (when the variable of interest is categorical) and inference for the population mean μ (when the variable of interest is quantitative). In the previous two modules we were also exposed to the following three forms of inference, which will continue to be central as we move forward in the course:

Point estimation: estimating an unknown parameter with a single value that is computed from the sample.
Interval estimation: estimating an unknown parameter by an interval of plausible values. To each such interval we attach a level of confidence that indeed the interval captures the value of the unknown parameter, and hence the name confidence intervals.
Hypothesis testing: a four-step process in which we are assessing evidence provided by the data in favor or against some claim about the population parameter.

Our next (and final) goal for this course is to perform inference about relationships between two variables in a population, based on an observed relationship between variables in a sample. Here is what the process looks like:

We are interested in studying whether a relationship exists between the variables X and Y in a population of interest. We choose a random sample and collect data on both variables from the subjects. Our goal is to determine whether these data provide strong enough evidence for us to generalize the observed relationship in the sample and conclude (with some acceptable and agreed-upon level of uncertainty) that a relationship between X and Y exists in the entire population.

The primary inference form that we will use in this module, then, is hypothesis testing. Conceptually, across all the inferential methods that we will learn, we'll test some form of:

Ho: There is no relationship between X and Y
Ha: There is a relationship between X and Y

(We will also discuss point and interval estimation, but our discussion about these forms of inference will be framed around the test.)

Recall that in the module about examining the relationship between two variables in the Exploratory Data Analysis unit, our discussion was framed around the role-type classification table. This part of the course will be structured exactly in the same way. In other words, we will go through 3 sections corresponding to cases C→Q, C→C, and Q→Q in the table below.

(Recall that case Q→C is not discussed in this course.)

In total, we will introduce 5 inferential methods: three in case C→Q (corresponding to a division of this case into 3 sub-cases) and one in each of the cases C→C and Q→Q.

Unlike the previous part of the course on Inference for One Variable, where we discussed in some detail the theory behind the machinery of the test (such as the null distribution of the test statistic, under which the p-values are calculated), in the 5 inferential procedures that we will introduce in Inference for Relationships, we will discuss much less of that kind of detail. The principles are the same, but the details behind the null distribution of the test statistic (under which the p-value is calculated) become more complicated and require knowledge of theoretical results that are definitely beyond the scope of this course.

Instead, within each of the five inferential methods we will focus on:

When the inferential method is appropriate for use.
Under what conditions the procedure can safely be used.
The conceptual idea behind the test (as it is usually captured by the test statistic).
How to use software to carry out the procedure in order to get the p-value of the test.
Interpreting the results in the context of the problem.

Also, we will continue to introduce each test according to the four-step process of hypothesis testing. We are now ready to start with Case C→Q. Recall the role-type classification table framing our discussion on inference about the relationship between two variables.

We start with case C→Q, where the explanatory variable is categorical and the response variable is quantitative. Recall that in the Exploratory Data Analysis unit, examining the relationship between X and Y in this case amounts, in practice, to comparing the distributions of the (quantitative) response Y for each value (category) of the explanatory X. To do that, we used side-by-side boxplots (each representing the distribution of Y in one of the groups defined by X), and supplemented the display with the corresponding descriptive statistics. What will we do in inference? To understand the logic, we'll start with an example and then generalize.

EXAMPLE

GPA and Year in College


Say that our variable of interest is the GPA of college students in the United States. From the previous module we know that since GPA is quantitative, we do inference on μ, the (population) mean GPA among all U.S. college students. Since this module is about relationships, let's assume that what we are really interested in is not simply GPA, but the relationship between:

X : year in college (1 = freshmen, 2 = sophomore, 3 = junior, 4 = senior) and
Y : GPA

In other words, we want to explore whether GPA is related to year in college. The way to think about this is that the population of U.S. college students is now broken into 4 sub-populations: freshmen, sophomores, juniors and seniors. Within each of these four groups, we are interested in the GPA. The inference must therefore involve the 4 sub-population means:

μ1 : mean GPA among freshmen in the United States
μ2 : mean GPA among sophomores in the United States
μ3 : mean GPA among juniors in the United States
μ4 : mean GPA among seniors in the United States

It makes sense that the inference about the relationship between year and GPA has to be based on some kind of comparison of these four means. If we infer that these four means are not all equal (i.e., that there are some differences in GPA across years in college) then that's equivalent to saying GPA is related to year in college. Let's summarize this example with a figure:

In general, then, making inferences about the relationship between X and Y in Case C→Q boils down to comparing the means of Y in the sub-populations, which are created by the categories defined in X (say k categories). The following figure summarizes this:

As the introduction to this module mentioned, we will learn three inferential methods in Case C→Q, corresponding to a sub-division of this case. First we will distinguish between cases where the explanatory X has only two categories (k = 2), and cases where X has more than two categories (k > 2). In other words, we will look separately at cases where we are comparing two sub-population means:

and cases where we are comparing more than 2 sub-population means:

For example, if we are interested in whether GPA (Y) is related to gender (X), this is a case where k = 2 (since gender has only two categories: M, F), and the inference will boil down to comparing the mean GPA in the sub-population of males to that in the sub-population of females. On the other hand, in the example we looked at earlier, the relationship between GPA (Y) and year (X) is a case where k > 2 or more specifically, k = 4 (since year has four categories). In terms of inference, these two examples will be treated differently!

Furthermore, within the sub-case of comparing two means (i.e., examining the relationship between X and Y, when X has only two categories) we will distinguish between two (sub-sub) cases. Here, the distinction is somewhat subtle, and has to do with how the samples from each of the two sub-populations we're comparing are chosen. In other words, what study design will be implemented.

We have learned that many experiments, as well as observational studies, make a comparison between two groups (sub-populations) in order to see how responses differ for the two possible categorical values. In some cases, one group (sub-population 1) has one categorical value, and another independent group (sub-population 2) has the other value. Independent samples are then taken from each group for comparison.

In other cases, a matched pairs sample design may be used, where each observation in one sample is matched/paired/linked with an observation in the other sample. These are sometimes called "dependent samples."

Matching could be by person (if the same person is measured twice), or could actually be a pair of individuals who belong together in a relevant way (husband and wife, siblings). In this design, then, the same individual or a matched pair of individuals is used to make two measurements of the response: one for each of the two categorical values.

Comment
Note that in the first figure, where the samples are independent, the sample sizes of the two independent samples need not be the same (and thus we used n1 and n2 to indicate the two sample sizes). On the other hand, it is obvious from the design that in the matched pairs the sample sizes of the two samples must be the same (and thus we used n for both).

EXAMPLE

The department of motor vehicles wants to check whether drivers are impaired after drinking two beers. Consider the following two designs:

1. The reaction times (measured in seconds) in an obstacle course are measured for a group of 10 drivers who had no beer. Two beers are given to each of a different group of 9 drivers, and their reaction times on the same obstacle course are measured. (In practice, this was done by selecting a random sample of 19 drivers and randomly assigning them to one of the two groups. The random assignment guarantees, at least in theory, that the two groups are independent.)

2. The reaction times (measured in seconds) in an obstacle course are measured for 8 randomly selected drivers before and then after the consumption of two beers.

In the first design, we have two independent samples, and the second design is a matched-pairs design, since each individual was measured twice, once before and once after. The two figures highlight the main difference between the two designs. As we'll see, when we have two independent samples, the comparison of the reaction times is a comparison between two groups. In matched pairs, the comparison between the reaction times is done for each individual.
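The difference between the two designs shows up in how the data are summarized. The sketch below uses made-up reaction times (hypothetical numbers, not data from the example, with smaller groups than in the text) to contrast a between-groups comparison with a per-individual comparison; all variable names are illustrative.

```python
from statistics import mean

# Design 1: two independent groups; the group sizes need not match.
# Reaction times in seconds (hypothetical values).
no_beer = [3.9, 4.2, 4.4, 4.0, 4.5]
two_beers = [4.8, 5.1, 4.6, 5.3]
# The comparison is between the two group means
difference_between_groups = mean(two_beers) - mean(no_beer)

# Design 2: matched pairs; each driver is measured before and after,
# so both lists must have the same length and the same driver order.
before = [4.1, 3.8, 4.6, 4.3]
after = [4.9, 4.2, 5.0, 5.1]
# The comparison is made for each individual, then summarized
change_per_driver = [a - b for a, b in zip(after, before)]
mean_change = mean(change_per_driver)

print(f"independent samples: difference of means = {difference_between_groups:.2f}")
print(f"matched pairs: mean of differences = {mean_change:.2f}")
```

In the first design there is no natural way to pair a reaction time in one group with one in the other, which is why only the group summaries are compared.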

Comparing Two Means: Two Independent Samples (The Two-Sample t-test)

As we mentioned in the summary of the introduction to Case C→Q, the first case that we will deal with is comparing two means when the two samples are independent:

Recall that here we are interested in the effect of a two-valued (k = 2) categorical variable (X) on a quantitative response (Y). Samples are drawn independently from the two sub-populations (defined by the two categories of X), and we need to evaluate whether or not the data provide enough evidence for us to believe that the two sub-population means are different.

In other words, our goal is to test whether the means μ1 and μ2 (which are the means of the variable of interest in the two sub-populations) are equal or not, and in order to do that we have two samples, one from each sub-population, which were chosen independently of each other. As the title of this part suggests, the test that we will learn here is commonly known as the two-sample t-test. As the name suggests, this is a t-test, which as we know means that the p-values for this test are calculated under some t distribution. Here is how this part will be organized.

We'll first introduce our leading example, and then go in detail through the four steps of the two-sample t-test, illustrating each step using our example.
NOTE...

Up until now, we have been dividing our population into sub-populations, then sampling from these sub-populations. From now on, instead of calling them sub-populations, we will usually call the groups we wish to compare population 1, population 2, etc. These two descriptions of the groups we are comparing can be used interchangeably. The number will be determined by the number of values the explanatory categorical variable (X) has.

EXAMPLE

What is more important to you: personality or looks? This question was asked of a random sample of 239 college students, who were to answer on a scale of 1 to 25. An answer of 1 means personality has maximum importance and looks no importance at all, whereas an answer of 25 means looks have maximum importance and personality no importance at all. The purpose of this survey was to examine whether males and females differ with respect to the importance of looks vs. personality.

To open R with the dataset preloaded, right-click here and choose "Save Target As" to download the file to your computer. Then find the downloaded file and double-click it to open it in R. The data have been loaded into the variable "looks." Enter the command looks to see the data. Note that the data have the following format:

Score (Y)    Gender (X)
17           Male
1!           Female
12           Female
12           Male
14           Female
14           Male
<            Male
14           Male
etc.

The format of the data reminds us that we are essentially examining the relationship between the two-valued categorical variable, gender, and the quantitative response, score. The two values of the categorical explanatory variable define the two populations that we are comparing: males and females. The comparison is with respect to the response variable score. Here is a figure that summarizes the example:

Comments:
1. Note that this figure emphasizes how the fact that our explanatory is a two-valued categorical variable means that in practice we are comparing two populations (defined by these two values) with respect to our response Y.
2. Note that even though the problem description just says that we had 239 students, the figure tells us that there were 85 males in the sample, and 150 females.
3. Following up on comment 2, note that 85 + 150 = 235 and not 239. In these data (which are real) there are four "missing observations": 4 students for which we do not have the value of the response variable, "importance." This could be due to a number of reasons, such as recording error or nonresponse. The bottom line is that even though data were collected from 239 students, effectively we have data from only 235. (Recommended: Go through the data file and note that there are 4 cases of missing observations: students 34, 138, 179, and 183.)

The Two-Sample t-test

Here again is the general situation which requires us to use the two-sample t-test:

Our goal is to compare the means μ1 and μ2 based on the two independent samples.

Step 1: Stating the Hypotheses

The hypotheses represent our goal, comparing the means: μ1 and μ2.

The null hypothesis has the form:

Ho: μ1 - μ2 = 0 (which is the same as Ho: μ1 = μ2)

The alternative hypothesis takes one of the following three forms (depending on the context):

Ha: μ1 - μ2 < 0 (which is the same as Ha: μ1 < μ2) (one-sided)
Ha: μ1 - μ2 > 0 (which is the same as Ha: μ1 > μ2) (one-sided)
Ha: μ1 - μ2 ≠ 0 (which is the same as Ha: μ1 ≠ μ2) (two-sided)

Note that the null hypothesis claims that there is no difference between the means, which can either be represented as the difference is 0 (no difference), or as its (algebraically and conceptually) equivalent, μ1 = μ2 (the means are equal). Either way, conceptually, Ho claims that there is no relationship between the two relevant variables. The first way of writing the hypotheses (using a difference between the means) will be easier to use when (in the future) we look for a difference that is not 0.

Each one of the three alternatives claims that there is a difference between the means. The two one-sided alternatives specify the nature of the difference: either negative, indicating that μ1 is smaller than μ2, or positive, indicating that μ1 is larger than μ2. The two-sided alternative, as usual, is more general and simply claims that a difference exists. As before, it should be clear from the context of the problem which of the three alternatives is appropriate.

Comment
Note that our parameter of interest in this case (the parameter about which we are making an inference) is the difference between the means, μ1 - μ2, and that the null value is 0.
EXAMPLE

Recall that the purpose of this survey was to examine whether the opinions of females and males differ with respect to the importance of looks vs. personality. The hypotheses in this case are therefore:

Ho: μ1 - μ2 = 0
Ha: μ1 - μ2 ≠ 0

where μ1 represents the mean importance for females and μ2 represents the mean importance for males. It is important to understand that conceptually, the two hypotheses claim:

Ho: Score (of looks vs. personality) is not related to gender
Ha: Score (of looks vs. personality) is related to gender

Did I Get This?


In order to check the claim that the pregnancy length of women who smoke during pregnancy is shorter, on average, than the pregnancy length of women who do not smoke, a random sample of 35 pregnant women who smoke and a random sample of 35 pregnant women who do not smoke were chosen and their pregnancy lengths were recorded. Here is a figure of this example:

Step 2: Check Conditions, and Summarize the Data Using a Test Statistic
The two-sample t-test can be safely used as long as the following conditions are met:

1. The two samples are indeed independent.
2. We are in one of the following two scenarios:
a. Both populations are normal, or more specifically, the distribution of the response Y in both populations is normal, and both samples are random (or at least can be considered as such). In practice, checking normality in the populations is done by looking at each of the samples using a histogram and checking whether there are any signs that the populations are not normal. Such signs could be extreme skewness and/or extreme outliers.
b. The populations are known or discovered not to be normal, but the sample size of each of the random samples is large enough (we can use the rule of thumb that > 30 is considered large enough).
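The conditions above amount to a short decision rule, which can be written out as a sketch. The function name and arguments below are hypothetical; whether the samples are independent, random, and free of signs of non-normality remains a judgment made from the study design and the histograms, so those are simply passed in as True or False.

```python
def two_sample_t_conditions_met(independent, random_samples,
                                populations_look_normal, n1, n2, large_n=30):
    """Return True when the two-sample t-test can safely be used.

    large_n=30 encodes the rule of thumb that samples larger than 30
    are considered large enough.
    """
    # Condition 1, plus the random-sampling requirement shared by both scenarios
    if not (independent and random_samples):
        return False
    # Scenario (a): normal populations; scenario (b): both samples large
    both_samples_large = n1 > large_n and n2 > large_n
    return populations_look_normal or both_samples_large

# Sample sizes of 150 and 85 with no normality assumption: scenario (b) applies
print(two_sample_t_conditions_met(True, True, False, 150, 85))
```

With small samples and clearly non-normal data the function returns False, signaling that the test should not be used as is.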

Assuming that we can safely use the two-sample t-test, we need to summarize the data, and in particular, calculate our data summary: the test statistic.

The two-sample t-test statistic is:

$$t = \frac{\bar{y}_1 - \bar{y}_2 - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Where:

ȳ1, ȳ2 are the sample means of the samples from population 1 and population 2 respectively.
s1, s2 are the sample standard deviations of the samples from population 1 and population 2 respectively.
n1, n2 are the sample sizes of the two samples.

Comment
Let's see why this test statistic makes sense, bearing in mind that our inference is about μ1 - μ2.

ȳ1 estimates μ1 and ȳ2 estimates μ2, and therefore ȳ1 - ȳ2 is what the data tell me about (or, how the data estimate) μ1 - μ2.
0 is the "null value": what the null hypothesis, Ho, claims that μ1 - μ2 is.
The denominator $\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$ is the standard error of ȳ1 - ȳ2. (We will not go into the details of why this is true.)

We therefore see that our test statistic, like the previous test statistics we encountered, has the structure:

$$\frac{\text{sample estimate} - \text{null value}}{\text{standard error}}$$

and therefore, like the previous test statistics, measures (in standard errors) the difference between what the data tell me about the parameter of interest μ1 - μ2 (sample estimate) and what the null hypothesis claims the value of the parameter is (null value).
MANY STUDENTS WONDER...

How to Read Statistical Software Output


EXAMPLE

Let's first check whether the conditions that allow us to safely use the two-sample t-test are met.

1. Here, 239 students were chosen and were naturally divided into a sample of females and a sample of males. Since the students were chosen at random, the sample of females is independent of the sample of males.

2. Here we are in the second scenario: the sample sizes (150 and 85) are definitely large enough, and so we can proceed regardless of whether the populations are normal or not.

In order to avoid tedious calculations, we use R to find the test statistic. In this case, R gives us a value of -4.66:

Just for this first example, let's make sure that we understand what the ingredients are that go into the test statistic calculation and how we use them to find the test statistic. We already know the sample sizes (150 and 85), the means are given in the R output above, and the sample standard deviations can be found using this command:

tapply(looks$Score,factor(looks$Gender),sd, na.rm=TRUE)

And when we put it all together we get that indeed,

$$t = \frac{\bar{y}_1 - \bar{y}_2 - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{10.73 - 13.33}{\sqrt{\frac{4.25^2}{150} + \frac{4.02^2}{85}}} = -4.66$$

The test statistic tells us what the data tell us about μ1 - μ2. In this case that difference (10.73 - 13.33) is 4.66 standard errors below what the null hypothesis claims this difference to be (0). 4.66 standard errors is quite a lot, and probably indicates that the data provide evidence against Ho.
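The same arithmetic can be reproduced from the summary values alone. This is a minimal sketch: the function name `two_sample_t` and its argument names are illustrative, and the inputs are the rounded means, standard deviations (4.25 and 4.02), and sample sizes from the worked calculation.

```python
import math

def two_sample_t(mean1, mean2, sd1, sd2, n1, n2, null_value=0.0):
    """Two-sample t statistic: (sample estimate - null value) / standard error."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2 - null_value) / standard_error

# Population 1 = females (n = 150), population 2 = males (n = 85)
t = two_sample_t(mean1=10.73, mean2=13.33, sd1=4.25, sd2=4.02, n1=150, n2=85)
print(f"t = {t:.2f}")
```

Working from rounded summaries gives a value of about -4.67, a hair away from the -4.66 that the software reports from the unrounded data.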

Step 3: Finding the p-value of the test

Since our test is called the two-sample t-test, we know that the p-values are calculated under a t distribution. Indeed, it turns out that the null distribution of our test statistic is approximately t. Figuring out which one of the t distributions (in other words, how many degrees of freedom this t distribution has) is quite involved and will not be discussed here. Instead, we use a statistics package to find that the p-value in this case is essentially 0.
EXAMPLE

Here, again, is the relevant output for our example:

According to R the p-value of this test is so small that it is essentially 0. How do we interpret this?

A p-value which is practically 0 means that it would be almost impossible to get data like that observed (or even more extreme) had the null hypothesis been true. More specifically to our example, if there were no differences between females and males with respect to whether they value looks vs. personality, it would be almost impossible (probability approximately 0) to get data where the difference between the sample means of females and males is -2.6 (that difference is 10.73 - 13.33 = -2.6) or more extreme.

Comment: Note that the output tells us that ȳ1 - ȳ2 is approximately -2.6. But more importantly, we want to know if this difference is significant. To answer this, we use the fact that this difference is 4.66 standard errors below the null value.
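To get a feel for how small this p-value is, a rough check is possible without a statistics package. The sketch below is not the exact calculation: it swaps the t distribution for the standard normal distribution, which is a reasonable stand-in only because both samples are large. The function names and the `alternative` labels are assumptions made for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def approx_p_value(t, alternative):
    """Large-sample (normal) approximation to the p-value of a t statistic.

    alternative is "less", "greater", or "two-sided", matching the three
    forms of Ha.
    """
    if alternative == "less":
        return normal_cdf(t)
    if alternative == "greater":
        return 1.0 - normal_cdf(t)
    if alternative == "two-sided":
        return 2.0 * (1.0 - normal_cdf(abs(t)))
    raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")

# Test statistic from the looks vs. personality example, two-sided Ha
p_value = approx_p_value(-4.66, "two-sided")
print(f"approximate p-value: {p_value:.7f}")
```

The result is a few in a million, which is why the software output rounds it to 0.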

Step 4: Conclusion in context

As usual, a small p-value provides evidence against Ho. In our case our p-value is practically 0 (which is smaller than any level of significance that we will choose). The data therefore provide very strong evidence against Ho, so we reject it and conclude that the mean importance score (of looks vs. personality) of males differs from that of females. In other words, males and females differ with respect to how they value looks vs. personality.

Comments

You might ask yourself: "Where do we use the test statistic?" It is true that for all practical purposes all we have to do is check that the conditions which allow us to use the two-sample t-test are met, lift the p-value from the output, and draw our conclusions accordingly. However, we feel that it is important to mention the test statistic for two reasons:

1. The test statistic is what's behind the scenes: based on its null distribution and its value, the p-value is calculated.
2. Apart from being the key for calculating the p-value, the test statistic is also itself a measure of the evidence stored in the data against Ho. As we mentioned, it measures (in standard errors) how different our data is from what is claimed in the null hypothesis.

Let's look at another example, and then you'll do one yourself.


EXAMPLE

According to the National Health and Nutrition Examination Survey (NHANES) sponsored by the U.S. government, a random sample of 712 males between 20 and 29 years of age and a random sample of 1,001 males over the age of 75 were chosen, and the weight of each of the males was recorded (in kg). Here is a summary of the results (Source: http://www.cdc.gov/nchs/data/ad/ad347.pdf):

Do the data provide evidence that the younger male population weighs more (on average) than the older male population? (Comment: Note that here the data are given in a summarized form, unlike the previous problem, where the raw data were given.) Here is a figure that summarizes this example:

Note that we defined the younger age group and the older age group as population 1 and population 2, respectively, and μ1 and μ2 as the mean weight of population 1 and population 2, respectively.

Step 1: Since we want to test whether the older age group (population 2) weighs less on average than the younger age group (population 1), we are testing:

Ho: μ1 - μ2 = 0 vs. Ha: μ1 - μ2 > 0

or equivalently,

Ho: μ1 = μ2 vs. Ha: μ1 > μ2

Step 2: We can safely use the two-sample t-test in this case since:

1. The samples are independent, since each of the samples was chosen at random.

2. Both sample sizes are very large (712 and 1,001), and therefore we can proceed regardless of whether the populations are normal or not.

It is possible from these data to calculate the t-statistic of 5.31 and the p-value of 0.000. The t-value is quite large, and the p-value correspondingly small, indicating that our data are very different from what is claimed in the null hypothesis.

Step 3: The p-value is essentially 0, indicating that it would be nearly impossible to observe a difference between the sample mean weights of 4.9 (or more) if the mean weights in the age group populations were the same (i.e., if Ho were true).

Step 4: A p-value of 0 (or very close to it) indicates that the data provide strong evidence against Ho, so we reject it and conclude that the mean weight of males 20-29 years old is higher than the mean weight of males 75 years old and older. In other words, males in the younger age group weigh more, on average, than males in the older age group.

Confidence Interval for μ1 - μ2 (two-sample t confidence interval)


So far we've discussed the two-sample t-test, which checks whether there is enough evidence stored in the data to reject the claim that μ1 - μ2 = 0 (or equivalently, that μ1 = μ2) in favor of one of the three possible alternatives. If we would like to estimate μ1 - μ2, we can use the natural point estimate, ȳ1 - ȳ2, or preferably, a 95% confidence interval, which will provide us with a set of plausible values for the difference between the population means μ1 - μ2. In particular, if the test has rejected Ho: μ1 - μ2 = 0, a confidence interval for μ1 - μ2 can be insightful since it quantifies the effect that the categorical explanatory variable has on the response.

Comment
We will not go into the formula and calculation of the confidence interval, but rather ask our software to do it for us, and focus on interpretation.
EXAMPLE

Recall our leading example about the looks vs. personality score of females and males:


Here again is the output:

Recall that we rejected the null hypothesis in favor of the two-sided alternative and concluded that the mean score of females is different from the mean score of males. It would be interesting to supplement this conclusion with more details about this difference between the means, and the 95% confidence interval for μ1 - μ2 does exactly that. According to the output, the 95% confidence interval for μ1 - μ2 is roughly (-3.7, -1.5). First, note that the confidence interval is strictly negative, suggesting that μ1 is lower than μ2. Furthermore, the confidence interval tells us that we are 95% confident that the mean "looks vs. personality score" of females (μ1) is between 1.5 and 3.7 points lower than the mean looks vs. personality score of males (μ2). The confidence interval therefore quantifies the effect that the explanatory variable (gender) has on the response (looks vs. personality score).

Comment
As we've seen in previous tests, as well as in the two-samples case, the 95% confidence interval for μ1 - μ2 can be used for testing in the two-sided case (Ho: μ1 - μ2 = 0 vs. Ha: μ1 - μ2 ≠ 0):
If the null value, 0, falls outside the confidence interval, Ho is rejected.
If the null value, 0, falls inside the confidence interval, Ho is not rejected.

EXAMPLE

Let's go back to our leading example of the looks vs. personality score, where we had a two-sided test.

We used the fact that the p-value is so small to conclude that Ho can be rejected. We can also use the confidence interval to reach the same conclusion, since 0 falls outside the confidence interval. In other words, since 0 is not a plausible value for μ1 - μ2, we can reject Ho, which claims that μ1 - μ2 = 0.
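The confidence-interval rule for the two-sided test can be sketched in a few lines of Python. This is only an illustration: the function name `rejects_null_at_95` is hypothetical (it is not part of the course software), and the interval endpoints are the rounded values quoted in the text.

```python
def rejects_null_at_95(ci_low, ci_high, null_value=0.0):
    """Two-sided test via a 95% CI: reject Ho exactly when the
    null value falls outside the interval (hypothetical helper)."""
    return not (ci_low <= null_value <= ci_high)

# Rounded 95% CI for mu1 - mu2 from the looks vs. personality example.
looks_ci = (-3.7, -1.5)
print(rejects_null_at_95(*looks_ci))  # True: 0 is outside, so Ho is rejected
```

The same check applies to any two-sided test whose null value is 0; only the interval changes.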

Let's summarize

We are now done with the two-sample t-test, which is used for comparing two population means when the two samples (drawn from the two populations) are independent. In the background we have a two-valued categorical explanatory variable whose categories define the two populations that we are comparing with respect to the mean of the response variable. We learned under what conditions we can reliably use this test (independent samples, and either normal populations or large sample sizes). We introduced the test statistic and mentioned that its null distribution is approximately t.

We obtained the p-value from the output and, as usual, based our conclusion on its value. A 95% confidence interval for μ1 - μ2 can be very insightful after a test has rejected the null hypothesis, and can also be used for testing in the two-sided case.

Comparing Two Means: Matched Pairs (Paired t-test)


We are still in Case C→Q of inference about relationships, where the explanatory variable is categorical and the response variable is quantitative. As we mentioned in the introduction, we are going to introduce three inferential procedures in this case. So far we have introduced the first procedure: the two-sample t-test, which is used when we are comparing two means and the samples are independent. We are now moving on to the second procedure, where we are also comparing two means, but the samples are paired or matched. Every observation in one sample is linked with an observation in the other sample. In this case, the samples are dependent.

One of the most common cases where dependent samples occur is when both samples have the same subjects and they are "paired by subject." In other words, each subject is measured twice on the response variable, typically before and then after some kind of treatment/intervention in order to assess its effectiveness.
EXAMPLE

SAT Prep Class


Suppose you want to assess the effectiveness of an SAT prep class. It would make sense to use the matched pairs design and record each sampled student's SAT score before and after the SAT prep classes are attended:

Recall that the two populations represent the two values of the explanatory variable. In this situation, those two values come from a single set of subjects. In other words, both populations really have the same students. However, each population has a different value of the explanatory variable. Those values are: no prep class, prep class.

This, however, is not the only case where the paired design is used. Other cases are when the pairs are "natural pairs," such as siblings, twins, or couples. We will present two examples in this part. The first one will be of the type where each subject is measured twice, and the second one will be a study involving twins. This section on matched pairs design will be organized very much like the previous section on two independent samples. We will first introduce our leading example, and then present the paired t-test, illustrating each step using our example. We will then look at another example, and finally talk about estimation using a confidence interval. As usual, you'll be able to check your understanding along the way, and will learn how to use software to carry out this test.
EXAMPLE

Drunk Drivers
Drunk driving is one of the main causes of car accidents. Interviews with drunk drivers who were involved in accidents and survived revealed that one of the main problems is that drivers do not realize that they are impaired, thinking "I only had 1-2 drinks ... I am OK to drive." A sample of 20 drivers was chosen, and their reaction times in an obstacle course were measured before and after drinking two beers. The purpose of this study was to check whether drivers are impaired after drinking two beers. Here is a figure summarizing this study:

Comments
1. Note that the categorical explanatory variable here is "drinking 2 beers (Yes/No)", and the quantitative response variable is the reaction time.

2. Note that by using the matched pairs design in this study (i.e., by measuring each driver twice), the researchers isolated the effect of the two beers on the drivers and eliminated any other confounding factors that might influence the reaction times (such as the driver's experience, age, etc.).

3. For each driver, the two measurements are the total reaction time before drinking two beers, and after. You can see the data by following these instructions:

To open R with the dataset preloaded, right-click here and choose "Save Target As" to download the file to your computer. Then find the downloaded file and double-click it to open it in R. The data have been loaded into the variable "beers." Enter the command beers to see the data.

So far, we have discussed and illustrated cases in which the matched pairs design comes up, and we are now ready to discuss how to carry out the test in this case. We will first present the idea behind the paired t-test, and then go through the four steps in the testing process.

The Paired t-test


Idea

The idea behind the paired t-test is to reduce this two-sample situation, where we are comparing two means, to a single sample situation where we are doing inference on a single mean, and then use a simple t-test that we introduced in the previous module. We will first illustrate this idea using our example, and then more generally.

In other words, by reducing the two samples to one sample of differences, we are essentially reducing the problem from a problem where we're comparing two means (i.e., doing inference on μ1 - μ2):

to a problem where we are making an inference about a single mean, the mean of the differences:

In general, in every matched pairs problem, our data consist of 2 samples which are organized in n pairs:

We reduce the two samples to only one by calculating for each pair the difference between the two observations (in the figure we used d1, d2, d3, ..., dn to denote the differences).

The paired t-test is based on this one sample of n differences,

and it uses those differences as data for a simple t-test on a single mean, the mean of the differences. This is the general idea behind the paired t-test: it is nothing more than a regular one-sample t-test for the mean of the differences. We will now go through the 4-step process of the paired t-test.
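The reduction from two paired samples to one sample of differences can be sketched as follows. This is a minimal illustration, not the course's software workflow: the helper name `paired_differences` is hypothetical, and the two before/after pairs are the only reaction times quoted in this text (6.25 and 6.85, 2.96 and 4.78), not the full data set of 20 drivers.

```python
def paired_differences(before, after):
    """Reduce two paired samples to one sample of differences (before - after).

    Hypothetical helper: pair i is (before[i], after[i])."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    return [b - a for b, a in zip(before, after)]

# Only the two pairs quoted in the text; the real data set has n = 20 pairs.
before = [6.25, 2.96]
after = [6.85, 4.78]

differences = paired_differences(before, after)
print([round(d, 2) for d in differences])  # [-0.6, -1.82]
```

Everything that follows in the paired t-test (the sample mean, the sample standard deviation, and the test statistic) is computed from this single list of differences.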

Step 1: Stating the hypotheses.


Recall that in the t-test for a single mean our null hypothesis was Ho: μ = μ0 and the alternative was one of Ha: μ < μ0, μ > μ0, or μ ≠ μ0. Since the paired t-test is a special case of the one-sample t-test, the hypotheses are the same except that:
Instead of simply μ, we use the notation μd to denote that the parameter of interest is the mean of the differences.
In this course our null value μ0 is always 0 (although technically, it does not have to be).
Therefore, in the paired t-test, the null hypothesis is always:
Ho: μd = 0

and the alternative is one of:

Ha: μd < 0, Ha: μd > 0, or Ha: μd ≠ 0

depending on the context.

Let's go back to our example to see how this works and why it makes sense.
EXAMPLE

Drunk Driving
Recall that in our "Are drivers impaired after drinking two beers?" example, our data were reduced to one sample of differences (one for each driver),

so our problem

was reduced to inference about the mean of the differences, μd.

As we mentioned, the null hypothesis is:

Ho: μd = 0

The null hypothesis claims that the differences in reaction times are centered at (or around) 0, indicating that drinking two beers has no real impact on reaction times. In other words, drivers are not impaired after drinking two beers. In order to decide which of the alternatives is appropriate here, we have to think about the context of the problem. Recall that we want to check whether drivers are impaired after drinking two beers. Thus, we want to know whether their reaction times are longer after the two beers. Since the differences were calculated before-after, longer reaction times after the beers would translate into negative differences. These differences are: 6.25 - 6.85, 2.96 - 4.78, etc. Therefore, the appropriate alternative here is:

Ha: μd < 0
indicating that the differences are centered at a negative number.

Comment
Recall that originally, the following figure represented our problem:

Later, we reduced the problem to inference about a single mean, the mean of the differences:

Some students find it helpful to know that it turns out that μd = μ1 - μ2. In other words, the difference between the means μ1 - μ2 in the first representation is the same as the mean of the differences, μd, in the second one. Some students find it easier to first think about the hypotheses in terms of μ1 - μ2 (as we did in the two-sample case) and then represent them in terms of μd. In our example, since we want to test whether the reaction times in population 1 are shorter, we are testing Ho: μ1 - μ2 = 0 vs. Ha: μ1 - μ2 < 0, which in the matched pairs design notation is translated to Ho: μd = 0 vs. Ha: μd < 0. Here is another example:
EXAMPLE

Suppose the effectiveness of a low-carb diet is studied with a matched pairs design, recording each participant's weight before and after dieting. What would be the appropriate hypotheses in this case? As before, μd is the mean of the differences (weight before diet) - (weight after diet). In this case, if the diet is effective and participants' weight after the diet was indeed lower, we would expect the differences to be positive, and therefore the appropriate hypotheses in this case are:

Ho: μd = 0 vs. Ha: μd > 0

Step 2: Checking Conditions and Calculating the Test Statistic


The paired t-test, as a special case of a one-sample t-test, can be safely used as long as:

1. The sample of differences is random (or at least can be considered so in context).

2. We are in one of the three situations marked with a green check mark in the following table:

In other words, in order to use the paired t-test safely, the differences should vary normally unless the sample size is large, in which case it is safe to use the paired t-test regardless of whether the differences vary normally or not. As we indicated in the figure above (and have seen many times already), in practice, normality is checked by looking at the histogram of differences, and as long as no clear violation of normality (such as extreme skewness and/or outliers) is apparent, normality is assumed. Assuming that we can safely use the paired t-test, the data are summarized by a test statistic:

t = (x̄d - 0) / (sd / √n)

where x̄d is the sample mean of the differences, and sd is the sample standard deviation of the differences. This is the test statistic we've developed for the one-sample t-test (with μ0 = 0), and it has the same conceptual interpretation: it measures (in standard errors) how far our data are (represented by the average of the differences) from the null hypothesis (represented by the null value, 0).
EXAMPLE

Let's first check whether we can safely proceed with the paired t-test, by checking the two conditions.

1. The sample of drivers was chosen at random.

2. The sample size is not large enough (n = 20), so in order to proceed, we need to look at the histogram of the differences and make sure there is no evidence that the normality assumption is not met. Here is the histogram:

There is no evidence of violation of the normality assumption (on the contrary, the histogram looks quite normal). Also note that the vast majority of the differences are negative (i.e., the total reaction times for most of the drivers are larger after the two beers), suggesting that the data provide evidence against the null hypothesis. The question (which the p-value will answer) is whether these data provide strong enough evidence or not. We can safely proceed to calculate the test statistic (which in practice we leave to the software to calculate for us). Here is the output of the paired t-test for our example:

According to the output, the test statistic is -2.58, indicating that the data (represented by the sample mean of the differences) are 2.58 standard errors below the null hypothesis (represented by the null value, 0). Note in the output that, beyond the test statistic itself, we also highlighted the part of the output that provides the ingredients needed in order to calculate it: n = 20, x̄d = -0.5015, sd = 0.8686. Indeed, (-0.5015 - 0) / (0.8686 / √20) = -2.58.
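The arithmetic behind the test statistic can be reproduced directly from the three summary numbers in the output. The sketch below assumes the values n = 20, a mean difference of -0.5015, and a standard deviation of 0.8686; the function name `paired_t_statistic` is hypothetical and stands in for whatever the statistical software does internally.

```python
import math

def paired_t_statistic(mean_diff, sd_diff, n, null_value=0.0):
    """One-sample t statistic applied to the differences:
    (mean_diff - null_value) / (sd_diff / sqrt(n)). Hypothetical helper."""
    standard_error = sd_diff / math.sqrt(n)
    return (mean_diff - null_value) / standard_error

# Summary values for the drunk-driving example (assumed from the output).
t_stat = paired_t_statistic(mean_diff=-0.5015, sd_diff=0.8686, n=20)
print(round(t_stat, 2))  # -2.58
```

The denominator, sd / √n, is the standard error of the mean difference, which is why the statistic reads as "standard errors away from the null value."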

Step 3: Finding the p-value

As a special case of the one-sample t-test, the null distribution of the paired t-test statistic is a t distribution (with n - 1 degrees of freedom), which is the distribution under which the p-values are calculated. We will let the software find the p-value for us, and in this case, Excel gives us a p-value of 0.009. The small p-value tells us that there is very little chance of getting data like those observed (or even more extreme) if the null hypothesis were true. More specifically, there is less than a 1% chance (.009 = .9%) of obtaining a test statistic of -2.58 (or lower), assuming that 2 beers have no impact on reaction times.
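For readers curious about what the software does at this step, here is a rough standard-library sketch that approximates the lower-tail probability of a t distribution by numerically integrating its density with Simpson's rule. It is an illustration only, not the method Excel or R actually uses; the function names `t_density` and `lower_tail_p` are hypothetical.

```python
import math

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def lower_tail_p(t_stat, df, steps=2000):
    """Approximate P(T <= t_stat) using Simpson's rule on [0, |t_stat|].

    steps must be even for Simpson's rule."""
    b = abs(t_stat)
    h = b / steps
    total = t_density(0.0, df) + t_density(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area = total * h / 3        # P(0 <= T <= |t_stat|)
    upper = 0.5 - area          # P(T >= |t_stat|), by symmetry
    return upper if t_stat < 0 else 1 - upper

# Drunk-driving example: t = -2.58 with n - 1 = 19 degrees of freedom, Ha: mu_d < 0.
p_value = lower_tail_p(-2.58, df=19)
print(round(p_value, 3))
```

With t = -2.58 and 19 degrees of freedom, the printed value should land close to the 0.009 reported in the text; small discrepancies are expected because the test statistic itself is rounded.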

Step 4: Conclusion in Context.


As usual, we draw our conclusion based on the p-value. If the p-value is small, there is a significant difference between what was observed in the sample and what was claimed in Ho, so we reject Ho and conclude that the categorical explanatory variable does affect the quantitative response variable as specified in Ha. If the p-value is not small, we do not have enough statistical evidence to reject Ho. In particular, if a cutoff probability, α (significance level), is specified, we reject Ho if the p-value is less than α. Otherwise, we do not reject Ho. In our example, the p-value is .009, indicating that the data provide enough evidence to reject Ho and conclude that drinking two beers does slow the reaction times of drivers, and thus that drivers are impaired after drinking two beers.

Comment
It is very important to pay attention to whether the two-sample t-test or the paired t-test is appropriate. In other words, being aware of the study design is extremely important. Consider our example. If we had not "caught" that this is a matched pairs design, and had analyzed the data as if the two samples were independent using the two-sample t-test, we would have obtained a p-value of 0.057. Note that using this (wrong) method to analyze the data, and a significance level of .05, we would conclude that the data do not provide enough evidence for us to conclude that drivers are impaired after drinking two beers. This is an example of how using the wrong statistical method can lead you to wrong conclusions, which in this context can have very serious implications.
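The decision rule, and the way the choice of method flips the conclusion, can be sketched as a simple comparison of each p-value with the significance level. The function name `conclusion` is hypothetical; the two p-values are the ones quoted in this section (0.009 for the paired analysis, 0.057 for the inappropriate two-sample analysis).

```python
def conclusion(p_value, alpha=0.05):
    """Apply the cutoff rule: reject Ho only when the p-value is below alpha."""
    return "reject Ho" if p_value < alpha else "do not reject Ho"

# p-values quoted in the text for the same drunk-driving data.
print("paired t-test:    ", conclusion(0.009))   # reject Ho
print("two-sample t-test:", conclusion(0.057))   # do not reject Ho
```

The data are identical in both lines; only the analysis differs, which is why matching the test to the study design matters.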

The "driving after having 2 beers" example is a case in which observations are paired by subject. In other words, both samples have the same subjects, so that each subject is measured twice. Typically, as in our example, one of the measurements occurs before a treatment/intervention (2 beers in our case), and the other measurement after the treatment/intervention. Our next example is another typical type of study where the matched pairs design is used: a study involving twins.
EXAMPLE

Researchers have long been interested in the extent to which intelligence, as measured by IQ score, is affected by "nurture" as opposed to "nature": that is, are people's IQ scores mainly a result of their upbringing and environment, or are they mainly an inherited trait? A study was designed to measure the effect of home environment on intelligence, or more specifically, the study was designed to address the question: "Are there significant differences in IQ scores between people who were raised by their birth parents, and those who were raised by someone else?" In order to be able to answer this question, the researchers needed to get two groups of subjects (one from the population of people who were raised by their birth parents, and one from the population of people who were raised by someone else) who are as similar as possible in all other respects. In particular, since genetic differences may also affect intelligence, the researchers wanted to control for this confounding factor. We know from our discussion on study design (in the Producing Data unit of the course) that one way to (at least theoretically) control for all confounding factors is randomization: randomizing subjects to the different treatment groups. In this case, however, this is not possible. This is an observational study; you cannot randomize children to either be raised by their birth parents or to be raised by someone else. How else can we eliminate the genetics factor? We can conduct a "twin study." Because identical twins are genetically the same, a good design for obtaining information to answer this question would be to compare IQ scores for identical twins, one of whom is raised by birth parents and the other by someone else. Such a design (matched pairs) is an excellent way of making a comparison between individuals who only differ with respect to the explanatory variable of interest (upbringing) but are as alike as they can possibly be in all other important aspects (inborn intelligence).
Identical twins raised apart were studied by Susan Farber, who published her studies in the book "Identical Twins Reared Apart" (1981, Basic Books). In this problem, we are going to use the data that appear in Farber's book in table E6, of the IQ scores of 32 pairs of identical twins who were reared apart. Here is a figure that will help you understand this study:

Here are the important things to note in the figure:

1. We are essentially comparing the mean IQ scores in two populations that are defined by our (two-valued categorical) explanatory variable, upbringing (X), whose two values are: raised by birth parents, raised by someone else.

2. This is a matched pairs design (as opposed to a two independent samples design), since each observation in one sample is linked (matched) with an observation in the second sample. The observations are paired by twins.

To look at the data set, follow these instructions:

To open R with the dataset preloaded, right-click here and choose "Save Target As" to download the file to your computer. Then find the downloaded file and double-click it to open it in R. The data have been loaded into the variable "twins." Enter the command twins to see the data.

Each of the 32 rows represents one pair of twins. Keeping the notation that we used above, twin 1 is the twin that was raised by his/her birth parents, and twin 2 is the twin that was raised by someone else. Let's carry out the analysis.

1. Stating the hypotheses. Recall that in matched pairs, we reduce the data from two samples to one sample of differences:

and we state our hypotheses in terms of the mean of the differences, μd. Since we would like to test whether there are differences in IQ scores between people who were raised by their birth parents and those who weren't, we are carrying out the two-sided test:

Ho: μd = 0 vs. Ha: μd ≠ 0

Comment: Again, some students find it easier to first think about the hypotheses in terms of μ1 and μ2, and then write them in terms of μd. In this case, since we are testing for differences between the two populations, the hypotheses will be:

Ho: μ1 - μ2 = 0 vs. Ha: μ1 - μ2 ≠ 0

and since μd = μ1 - μ2 we get back to the hypotheses above.

2. Checking conditions and summarizing the data with a test statistic. Is it safe to use the paired t-test in this case?

a. Clearly, the samples of twins are not random samples from the two populations. However, in this context, they can be considered as random, assuming that there is nothing special about the IQ of a person just because he/she has an identical twin.

b. The sample size here is n = 32. Even though it's the case that if we use the n > 30 rule of thumb our sample can be considered large, it is sort of a borderline case, so just to be on the safe side, we should look at the histogram of the differences just to make sure that we do not see anything extreme. (Comment: Looking at the histogram of differences in every case is useful even if the sample is very large, just in order to get a sense of the data. Recall: "Always look at the data.")

The data don't reveal anything that we should be worried about (like very extreme skewness or outliers), so we can safely proceed. Looking at the histogram, we note that most of the differences are negative, indicating that in most of the 32 pairs of twins, twin 2 (raised by someone else) has a higher IQ. From this point we rely on statistical software, and find that:

t-value = -1.85
p-value = 0.074

Our test statistic is -1.85. Our data (represented by the average of the differences) are 1.85 standard errors below the null hypothesis (represented by the null value 0).
3. Finding the p-value. The p-value is 0.074, indicating that there is a 7.4% chance of obtaining data like those observed (or even more extreme) assuming that Ho is true (i.e., assuming that there are no significant differences in IQ scores between people who were raised by their natural parents and those who weren't).

4. Making conclusions. Using the conventional significance level (cut-off probability) of .05, our p-value is not small enough, and we therefore cannot reject Ho. In other words, our data do not provide enough evidence to conclude that whether a person was raised by his/her natural parents has an impact on the person's intelligence (as measured by IQ scores).


Comment:
For the one-sided alternative Ha: μd < 0, the p-value is half of the two-sided p-value: .074 / 2 = .037. This means that if, based on prior knowledge, prior research, or just a hunch, we had wanted to test the hypothesis that the IQ level of people raised by their birth parents is lower, on average, than the IQ level of people who were raised by someone else, we would have rejected Ho and accepted that hypothesis (at the .05 significance level, since .037 < .05).
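The relationship between the two-sided and one-sided p-values of a t-test can be sketched as below. This relies on the symmetry of the t distribution and is offered as an illustration only; the function name `one_sided_p` and its `alternative` argument are hypothetical, and the inputs are the rounded values reported for the twin study.

```python
def one_sided_p(two_sided_p, t_stat, alternative):
    """Convert a two-sided t-test p-value into a one-sided one.

    alternative is "less" (Ha: mu_d < 0) or "greater" (Ha: mu_d > 0).
    If the statistic points in the direction of Ha, the one-sided p-value
    is half the two-sided one; otherwise it is one minus that half."""
    half = two_sided_p / 2
    if alternative == "less":
        return half if t_stat < 0 else 1 - half
    if alternative == "greater":
        return half if t_stat > 0 else 1 - half
    raise ValueError("alternative must be 'less' or 'greater'")

# Twin study: t = -1.85, two-sided p-value = 0.074.
print(round(one_sided_p(0.074, -1.85, "less"), 3))     # 0.037
print(round(one_sided_p(0.074, -1.85, "greater"), 3))  # 0.963
```

Note that the halving only helps when the alternative was fixed before seeing the data; choosing the direction afterwards is the data snooping problem.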

It should be stressed, though, that one should set the hypotheses before looking at the data. It would be ethically wrong to look at the histogram of differences, note that most of the differences are negative, and then decide to carry out the one-sided test that the data seem to support. This is known as "data snooping," and is considered to be a very bad statistical practice.

Confidence Interval for μd (Paired t Confidence Interval)


So far we've discussed the paired t-test, which checks whether there is enough evidence stored in the data to reject the claim that μd = 0 in favor of one of the three possible alternatives. If we would like to estimate μd, the mean of the differences (response 1 - response 2), we can use the natural point estimate, x̄d, the sample mean of the differences, or preferably, use a 95% confidence interval, which will provide us with a set of plausible values for μd.

In particular, if the test has rejected Ho: μd = 0, a confidence interval for μd can be insightful, since it quantifies the effect that the categorical explanatory variable has on the response variable.

Comment: We will not go into the formula and calculation of the confidence interval, but rather ask our statistical software to do it for us, and focus on interpretation.
EXAMPLE

Recall our leading example about whether drivers are impaired after having two beers:

which is reduced to inference about a single mean, the mean of the differences (before - after):

The p-value of our test, Ho: μd = 0 vs. Ha: μd < 0, was .009, and we therefore rejected Ho and concluded that the mean difference in total reaction time (before beer - after beer) was negative, or in other words, that drivers are impaired after having two beers. As a follow-up to this conclusion, it would be interesting to quantify the effect that two beers have on the driver, using the 95% confidence interval for μd. Using statistical software, we find that the 95% confidence interval for μd, the mean of the differences (before - after), is roughly (-.9, -.1). We can therefore say with 95% confidence that drinking two beers increases the total reaction time of the driver by between .1 and .9 of a second.

Comment
As we've seen in previous tests, as well as in the matched pairs case, the 95% confidence interval for μd can be used for testing in the two-sided case (Ho: μd = 0 vs. Ha: μd ≠ 0):
If the null value, 0, falls outside the confidence interval, Ho is rejected.
If the null value, 0, falls inside the confidence interval, Ho is not rejected.

EXAMPLE

Let's go back to our twin study example, where we found a 95% confidence interval for μd of (-6.11322, 0.30072) and a p-value of 0.074. We used the fact that the p-value is .074 to conclude that Ho cannot be rejected (at the .05 significance level), and that whether or not a person was raised by his or her birth parents doesn't necessarily have an effect on intelligence (as measured by IQ scores). The last comment tells us that we can also use the confidence interval to reach the same conclusion, since 0 falls inside the confidence interval for μd. In other words, since 0 is a plausible value for μd, we cannot reject Ho, which claims that μd = 0.

Let's summarize

The paired t-test is used to compare two population means when the two samples (drawn from the two populations) are dependent in the sense that every observation in one sample can be linked to an observation in the other sample. Such a design is called "matched pairs." The most common case in which the matched pairs design is used is when the same subjects are measured twice, usually before and then after some kind of treatment and/or intervention. Another classic case is studies involving twins. As in the "two independent samples" case, in the background we have a two-valued categorical explanatory variable whose categories define the two populations we are comparing and whose effect on the response variable we are trying to assess.

The idea behind the paired t-test is to reduce the data from two samples to just one sample of the differences, and use these observed differences as data for inference about a single mean, the mean of the differences, μd. The paired t-test is therefore simply a one-sample t-test for the mean of the differences μd, where the null value is 0. Once we verify that we can safely proceed with the paired t-test, we use software output to carry it out. A 95% confidence interval for μd can be very insightful after a test has rejected the null hypothesis, and can also be used for testing in the two-sided case.

Comparing More Than Two Means: ANOVA


Introduction

In this part, we continue to handle situations involving one categorical explanatory variable and one quantitative response variable, which is case C→Q in our role/type classification table:

So far we have discussed the two samples and matched pairs designs, in which the categorical explanatory variable is two-valued. As we saw, in these cases, examining the relationship between the explanatory and the response variables amounts to comparing the mean of the response variable (Y) in two populations, which are defined by the two values of the explanatory variable (X). The difference between the two samples and matched pairs designs is that in the former, the two samples are independent, and in the latter, the samples are dependent.

We are now moving on to cases in which the categorical explanatory variable takes more than two values. Here, as in the two-valued case, making inferences about the relationship between the explanatory (X) and the response (Y) variables amounts to comparing the means of the response variable in the populations defined by the values of the explanatory variable, where the number of means we are comparing depends, of course, on the number of values of X. Unlike the two-valued case, where we looked at two sub-cases, (1) when the samples are independent (two samples design) and (2) when the samples are dependent (matched pairs design), here we are just going to discuss the case where the samples are independent. In other words, we are just going to extend the two samples design to more than two independent samples.

Comment

The extension of the matched pairs design to more than two dependent samples is called "Repeated Measures" and is beyond the scope of this course.

The inferential method for comparing more than two means that we will introduce in this part is called Analysis Of Variance (abbreviated as ANOVA), and the test associated with this method is called the ANOVA F-test. The structure of this part will be very similar to that of the previous two. We will first present our leading example, and then introduce the ANOVA F-test by going through its 4 steps, illustrating each one using the example. (It will become clear as we explain the idea behind the test where the name "Analysis of Variance" comes from.) We will then present another complete example, and conclude with some comments about possible follow-ups to the test. As usual, you'll have activities along the way to check your understanding, and learn how to use software to carry out the test. Let's start by introducing our leading example.

EXAMPLE

Is "academic frustration" related to major? A college dean believes that students with different majors may experience different levels of academic frustration. Random samples of size 35 of Business, English, Mathematics, and Psychology majors are asked to rate their level of academic frustration on a scale of 1 (lowest) to 20 (highest).

The figure highlights what we have already mentioned: examining the relationship between major (X) and frustration level (Y) amounts to comparing the mean frustration levels (μ1, μ2, μ3, μ4) among the four majors defined by X. Also, the figure reminds us that we are dealing with a case where the samples are independent.

Comment

There are two ways to record data in the ANOVA setting:

Unstacked: One column for each of the four majors, with each column listing the frustration levels reported by all sampled students in that major:
[Table: unstacked data, one column per major (Business, English, Math, Psychology), one row per sampled student]

Stacked: one column for all the frustration levels, and next to it a column to keep track of which major a student is in:
[Table: stacked data, with the columns Frustration (Y) and Major (X), one row per sampled student]

The "unstacked" format helps us to look at the four groups separately, while the "stacked" format helps us remember that there are, in fact, two variables involved: frustration level (the quantitative response variable) and major (the categorical explanatory variable).
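To make the two layouts concrete, here is a minimal sketch of converting between them. The course itself works in R; Python is used here purely for illustration. The scores are made-up values rather than the course dataset, and the helper names `stack` and `unstack` are hypothetical.

```python
# Hypothetical frustration scores (made-up values, NOT the course dataset).
unstacked = {
    "Business": [11, 6, 9],
    "English": [11, 9, 14],
    "Math": [9, 16, 13],
    "Psychology": [11, 19, 12],
}


def stack(columns):
    """Unstacked -> stacked: one (frustration, major) row per student."""
    return [(value, group) for group, values in columns.items() for value in values]


def unstack(rows):
    """Stacked -> unstacked: collect the values of each major into one column."""
    columns = {}
    for value, group in rows:
        columns.setdefault(group, []).append(value)
    return columns


stacked = stack(unstacked)
for value, group in stacked[:4]:
    print(value, group)
```

Both layouts carry the same information; the stacked rows simply make the two variables (response and explanatory) explicit.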

To open R with the dataset preloaded, right-click here and choose "Save Target As" to download the file to your computer. Then find the downloaded file and double-click it to open it in R. The data have been loaded into the variable "frustration." Enter the command frustration to see the data. Note that in the first 4 columns, the data are in unstacked format, and in the next two columns the data are stacked.

The ANOVA F-test


Now that we understand in what kind of situations ANOVA is used, we are ready to learn how it works, or more specifically, what the idea is behind comparing more than two means. As we mentioned earlier, the test that we will present is called the ANOVA F-test, and as you'll see, this test is different in two ways from all the tests we have presented so far:

• Unlike the previous tests, where we had three possible alternative hypotheses to choose from (depending on the context of the problem), in the ANOVA F-test there is only one alternative, which actually makes life simpler.

• The test statistic will not have the same structure as the test statistics we've seen so far. In other words, it will not have the form (sample statistic - null value) / standard error, but a different structure that captures the essence of the F-test, and clarifies where the name "analysis of variance" is coming from.

Let's start.

Step 1: Stating the Hypotheses


The null hypothesis claims that there is no relationship between X and Y. Since the relationship is examined by comparing μ1, μ2, μ3, ..., μk, the means of Y in the populations defined by the values of X, no relationship would mean that all the means are equal. Therefore the null hypothesis of the F-test is:

Ho: μ1 = μ2 = ... = μk

As we mentioned earlier, here we have just one alternative hypothesis, which claims that there is a relationship between X and Y. In terms of the means μ1, μ2, μ3, ..., μk, it simply says the opposite of the null hypothesis, that not all the means are equal, and we simply write:

Ha: not all the μ's are equal.
EXAMPLE

Recall our "Is academic frustration related to major?" example:


The correct hypotheses for our example are:

Ho: μ1 = μ2 = μ3 = μ4
Ha: not all the μ's are equal

Note that there are many ways for μ1, μ2, μ3, μ4 not to be all equal, and μ1 ≠ μ2 ≠ μ3 ≠ μ4 is just one of them. Another way could be μ1 = μ2 = μ3 ≠ μ4 or μ1 = μ2 ≠ μ3 = μ4. The alternative of the ANOVA F-test simply states that not all of the means are equal, and is not specific about the way in which they are different.

Before we move on to the next step (checking conditions and summarizing the data with a test statistic), we will present the idea behind the ANOVA F-test using our example.

The idea behind the ANOVA F-Test


Let's think about how we would go about testing whether the population means μ1, μ2, μ3, μ4 are equal. It seems as if the best we could do is to calculate their point estimates, the sample mean in each of our 4 samples (denote them by ȳ1, ȳ2, ȳ3, ȳ4), and see how far apart these sample means are, or in other words, measure the variation between the sample means. If we find that the four sample means are not all close together, we'll say that we have evidence against Ho, and otherwise, if they are close together, we'll say that we do not have evidence against Ho. This seems quite simple, but is this enough? Let's see.

It turns out that:
• The sample mean frustration score of the 35 business majors is: ȳ1 = 7.3
• The sample mean frustration score of the 35 English majors is: ȳ2 = 11.8
• The sample mean frustration score of the 35 math majors is: ȳ3 = 13.2
• The sample mean frustration score of the 35 psychology majors is: ȳ4 = 14.0

Below we present two possible scenarios for our example. In both cases, we construct side-by-side boxplots for four groups of frustration levels that have the same variation among their means. Thus, Scenario #1 and Scenario #2 both show data for four groups with the sample means 7.3, 11.8, 13.2, and 14.0 (indicated with red marks).


The important difference between the two scenarios is that the first represents data with a large amount of variation within each of the four groups; the second represents data with a small amount of variation within each of the four groups.

Scenario 1, because of the large amount of spread within the groups, shows boxplots with plenty of overlap. One could imagine the data arising from 4 random samples taken from 4 populations, all having the same mean of about 11 or 12. The first group of values may have been a bit on the low side, and the other three a bit on the high side, but such differences could conceivably have come about by chance. This would be the case if the null hypothesis, claiming equal population means, were true.

Scenario 2, because of the small amount of spread within the groups, shows boxplots with very little overlap. It would be very hard to believe that we are sampling from four groups that have equal population means. This would be the case if the null hypothesis, claiming equal population means, were false.

Thus, in the language of hypothesis tests, we would say that if the data were configured as they are in scenario 1, we would not reject the null hypothesis that population mean frustration levels were equal for the four majors. If the data were configured as they are in scenario 2, we would reject the null hypothesis, and we would conclude that mean frustration levels differ, depending on major.

Let's summarize what we learned from this. The question we need to answer is: Are the differences among the sample means (ȳ's) due to true differences among the μ's (alternative hypothesis), or merely due to sampling variability (null hypothesis)? In order to answer this question using our data, we obviously need to look at the variation among the sample means, but this alone is not enough. We need to look at the variation among the sample means relative to the variation within the groups. In other words, we need to look at the quantity:

(variation among the sample means) / (variation within the groups)

which measures to what extent the difference among the sampled groups' means dominates over the usual variation within sampled groups (which reflects differences in individuals that are typical in random samples). When the variation within groups is large (like in scenario 1), the variation (differences) among the sample means could become negligible and the data provide very little evidence against Ho. When the variation within groups is small (like in scenario 2), the variation among the sample means dominates over it, and the data have stronger evidence against Ho.

Looking at this ratio of variations is the idea behind comparing more than two means; hence the name analysis of variance (ANOVA).
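The ratio can be sketched numerically. The course leaves the measurement of the two variations to software; the sketch below assumes the standard one-way ANOVA definitions (mean square between groups divided by mean square within groups). The data are hypothetical, built so that both scenarios share the same group means but differ in spread, and the function name `f_statistic` is illustrative.

```python
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F: variation among group means / variation within groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(v for g in groups for v in g)
    # Variation among the sample means (between-groups sum of squares).
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation within the groups (within-groups sum of squares).
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical data: identical group means (7, 12, 13, 14), different spread.
centers = [7, 12, 13, 14]
wide = [[c - 6, c - 3, c, c + 3, c + 6] for c in centers]    # like scenario 1
narrow = [[c - 2, c - 1, c, c + 1, c + 2] for c in centers]  # like scenario 2

f_wide = f_statistic(wide)
f_narrow = f_statistic(narrow)
print(round(f_wide, 2), round(f_narrow, 2))
```

With the same sample means, shrinking the within-group spread by a factor of 3 makes the ratio 9 times larger, which is why only the second configuration gives strong evidence against Ho.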

Now that we understand the idea behind the ANOVA F-test, let's move on to step 2. We'll start by talking about the test statistic, since it will be a natural continuation of what we've just discussed, and then move on to talk about the conditions under which the ANOVA F-test can be used. In practice, however, the conditions need to be checked first, as we did before.

Step 2: Checking Conditions and Finding the Test Statistic


The test statistic of the ANOVA F-test, called the F statistic, has the form

F = (variation among the sample means) / (variation within the groups)

As we mentioned earlier, it has a different structure from all the test statistics we've looked at so far; however, it is similar in that it is still a measure of the evidence against Ho. The larger F is (which happens when the denominator, the variation within groups, is small relative to the numerator, the variation among the sample means), the more evidence we have against Ho.

Did I Get This?

Consider the following generic situation:

where we're testing:

The following are two possible scenarios of the data (note in both scenarios the sample means are 25, 30, and 35).

Comments

1. The focus here is for you to understand the idea behind this test statistic. We are not going to go into any of the details about how the two variations are measured. This will be included in an extension module to this course in the future. We will rely on software output to obtain the F-statistic.

2. This test is called the ANOVA F-test. So far, we have explained the ANOVA part of the name. Based on the previous tests we introduced, it should not be surprising that the "F-test" part comes from the fact that the null distribution of the test statistic, under which the p-values are calculated, is called an F-distribution. We will say very little about the F-distribution in this course, which will essentially be limited to this comment and the next one.

3. It is fairly straightforward to decide if a z-statistic is large. Even without tables, we should realize by now that a z-statistic of .8 is not especially large, whereas a z-statistic of 2.5 is large. In the case of the t-statistic, it is less straightforward, because there is a different t-distribution for every sample size n (and degrees of freedom n - 1). However, the fact that a t-distribution with a large number of degrees of freedom is very close to the Z (standard normal) distribution can help to assess the magnitude of the t-test statistic. When the size of the F-statistic must be assessed, the task is even more complicated, because there is a different F-distribution for every combination of the number of groups we are comparing and the total sample size. We will nevertheless say that for most situations, an F-statistic greater than 4 would be considered rather large, but tables or software are needed to get a truly accurate assessment.

EXAMPLE

Here is the R output for the ANOVA F-test. In particular, note that the F-statistic is 46.60, which is very large, indicating that the data provide a lot of evidence against Ho (we can also see that the p-value is so small that it is essentially 0, which supports that conclusion as well).

Let's move on to talk about the conditions under which we can safely use the ANOVA F-test, where the first two conditions are very similar to ones we've seen before, but there is a new third condition. It is safe to use the ANOVA procedure when the following conditions hold:

1. The samples drawn from each of the populations we're comparing are independent.

2. The response variable varies normally within each of the populations we're comparing. As you already know, in practice this is checked by looking at the histograms of the samples and making sure that there is no evidence of extreme departure from normality in the form of extreme skewness and outliers. Another possibility is to look at side-by-side boxplots of the data, and add histograms if a more detailed view is necessary. For large sample sizes, we don't really need to worry about normality, although it is always a good idea to look at the data.

3. The populations all have the same standard deviation. The best we can do to check this condition is to find the sample standard deviations of our samples and check whether they are "close." A common rule of thumb is to check whether the ratio between the largest sample standard deviation and the smallest is less than 2. If that's the case, this condition is considered to be satisfied.

EXAMPLE

In our example all the conditions are satisfied:

1. All the samples were chosen randomly, and are therefore independent.

2. The sample sizes are large enough (n = 35) that we really don't have to worry about the normality; however, let's look at the data using side-by-side boxplots, just to get a sense of it:

You'll recognize this plot as Scenario 2 from earlier. The data suggest that the frustration level of the business students is generally lower than students from the other three majors. The ANOVA F-test will tell us whether these differences are significant.

3. In order to use the rule of thumb, we need to get the sample standard deviations of our samples. Let's use R to calculate the standard deviation for each of the four samples:

The rule of thumb is satisfied since 3.082 / 2.088 < 2.
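The rule-of-thumb check is simple enough to sketch in a few lines. The function name `equal_sd_rule_of_thumb` and its `max_ratio` parameter are hypothetical; the two inputs are the largest and smallest sample standard deviations from the frustration example (3.082 and 2.088).

```python
def equal_sd_rule_of_thumb(sample_sds, max_ratio=2.0):
    """Return (ratio, satisfied): largest sample SD over smallest, compared with 2."""
    ratio = max(sample_sds) / min(sample_sds)
    return ratio, ratio < max_ratio


# Only the largest and smallest SDs matter for the check.
ratio, satisfied = equal_sd_rule_of_thumb([3.082, 2.088])
print(round(ratio, 3), satisfied)
```

In practice all four sample standard deviations would be passed in; the function picks out the extremes itself.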

Step 3: Finding the p-value


The p-value of the ANOVA F-test is the probability of getting an F statistic as large as we got (or even larger), had Ho: μ1 = μ2 = ... = μk been true. In other words, it tells us how surprising it is to find data like those observed, assuming that there is no difference among the population means μ1, μ2, ..., μk.
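One way to see what this probability means is to approximate it by simulation. The course obtains the p-value from the F-distribution via software; the sketch below swaps in a permutation approach instead (shuffling the group labels to mimic "no difference among the means"), so it illustrates the idea rather than the course's method. The data, the function names, and the `n_permutations` and `seed` parameters are all hypothetical.

```python
import random
from statistics import mean


def f_statistic(groups):
    """Standard one-way ANOVA F statistic (between / within mean squares)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(v for g in groups for v in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def permutation_p_value(groups, n_permutations=2000, seed=0):
    """Share of label shufflings whose F is at least as large as the observed F."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_permutations


# Hypothetical data: clearly separated groups vs. groups with similar means.
separated = [[5, 6, 7, 8, 9], [11, 12, 13, 14, 15], [17, 18, 19, 20, 21]]
similar = [[10, 12, 14, 16, 18], [11, 13, 15, 17, 19], [10, 13, 15, 16, 18]]

p_separated = permutation_p_value(separated)
p_similar = permutation_p_value(similar)
print(p_separated, p_similar)
```

A p-value near 0 says that an F statistic this large almost never arises when the group labels carry no information, which is the sense in which the observed data would be "surprising" under Ho.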
EXAMPLE


As we already noticed before, the p-value in our example is so small that it is essentially 0, telling us that it would be next to impossible to get data like those observed had the mean frustration level of the four majors been the same (as the null hypothesis claims).

Step 4: Making Conclusions in Context


As usual, we base our conclusion on the p-value. A small p-value tells us that our data contain a lot of evidence against Ho. More specifically, a small p-value tells us that the differences between the sample means are statistically significant (unlikely to have happened by chance), and therefore we reject Ho. If the p-value is not small, the data do not provide enough evidence to reject Ho, and so we continue to believe that it may be true. A significance level (cut-off probability) of .05 can help determine what is considered a small p-value.

EXAMPLE

In our example, the p-value is extremely small (close to 0), indicating that our data provide extremely strong evidence to reject Ho. We conclude that the frustration level means of the four majors are not all the same, or in other words, that majors do have an effect on students' academic frustration levels at the school where the test was conducted.

Before we give you hands-on practice in carrying out the ANOVA F-test, let's look at another example:

EXAMPLE

Do advertisers alter the reading level of their ads based on the target audience of the magazine they advertise in? In 1981, a study of magazine advertisements was conducted (F.K. Shuptrine and D.D. McVicker, "Readability Levels of Magazine Ads," Journal of Advertising Research, 21:5, October 1981). Researchers selected random samples of advertisements from each of three groups of magazines:

Group 1: highest educational level magazines (such as Scientific American, Fortune, The New Yorker)
Group 2: middle educational level magazines (such as Sports Illustrated, Newsweek, People)
Group 3: lowest educational level magazines (such as National Enquirer, Grit, True Confessions)

The measure that the researchers used to assess the level of the ads was the number of words in the ad. 18 ads were randomly selected from each of the magazine groups, and the number of words per ad was recorded. The following figure summarizes this problem:

Our question of interest is whether the number of words in ads (Y) is related to the educational level of the magazine (X). To answer this question, we need to compare μ1, μ2, μ3, the mean number of words in ads of the three magazine groups. Note in the figure that the sample means are provided. It seems that what the data suggest makes sense; the magazines in group 1 have the largest number of words per ad (on average), followed by group 2, and then group 3. The question is whether these differences between the sample means are significant. In other words, are the differences among the observed sample means due to true differences among the μ's or merely due to sampling variability? To answer this question, we need to carry out the ANOVA F-test.

Step 1: Stating the hypotheses. We are testing:

Ho: μ1 = μ2 = μ3
Ha: not all the μ's are equal

Conceptually, the null hypothesis claims that the number of words in ads is not related to the educational level of the magazine, and the alternative hypothesis claims that there is a relationship.

Step 2: Checking conditions and summarizing the data.

(i) The ads were selected at random from each magazine group, so the three samples are independent. In order to check the next two conditions, we'll need to look at the data (condition ii), and calculate the sample standard deviations of the three samples (condition iii). Here are the side-by-side boxplots of the data, followed by the standard deviations:

(ii) The graph does not display any alarming violations of the normality assumption. It seems like there is some skewness in groups 2 and 3, but not extremely so, and there are no outliers in the data.

(iii) We can assume that the equal standard deviation assumption is met since the rule of thumb is satisfied: the largest sample standard deviation of the three is 74 (group 1), the smallest one is 57.6 (group 3), and 74/57.6 < 2.

Before we move on, let's look again at the graph. It is easy to see the trend of the sample means (indicated by red circles). However, there is so much variation within each of the groups that there is almost a complete overlap between the three boxplots, and the differences between the means are overshadowed and seem like something that could have happened just by chance. Let's move on and see whether the ANOVA F-test will support this observation. Using statistical software to conduct the ANOVA F-test, we find that the F statistic is 1.18, which is not very large. We also find that the p-value is 0.317.

Step 3: Finding the p-value. The p-value is 0.317, which tells us that getting data like those observed is not very surprising assuming that there are no differences between the three magazine groups with respect to the mean number of words in ads (which is what Ho claims). In other words, the large p-value tells us that it is quite reasonable that the differences between the observed sample means could have happened just by chance (i.e., due to sampling variability) and not because of true differences between the means.

Step 4: Making conclusions in context. The large p-value indicates that the results are not significant, and that we cannot reject Ho. We therefore conclude that the study does not provide evidence that the mean number of words in ads is related to the educational level of the magazine. In other words, the study does not provide evidence that advertisers alter the reading level of their ads (as measured by the number of words) based on the educational level of the target audience of the magazine.

Final Comment
When the ANOVA F-test rejects Ho, we conclude that not all the μ's are equal. However, the ANOVA F-test does not provide any insight into why Ho was rejected; it does not tell us in what way μ1, μ2, μ3, ..., μk are not all equal. We would like to know which pairs of μ's are not equal. As an exploratory (or visual) aid to get that insight, we may take a look at the confidence intervals for the group population means μ1, μ2, μ3, ..., μk that appear in the output. More specifically, we should look at the position of the confidence intervals and the overlap/no overlap between them.

• If the confidence interval for, say, μi overlaps with the confidence interval for μj, then μi and μj share some plausible values, which means that based on the data we have no evidence that these two μ's are different.

• If the confidence interval for μi does not overlap with the confidence interval for μj, then μi and μj do not share plausible values, which means that the data suggest that these two μ's are different.

Furthermore, if, like in the figure above, the confidence interval (set of plausible values) for μi lies entirely below the confidence interval (set of plausible values) for μj, then the data suggest that μi is smaller than μj.
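The overlap reasoning can be sketched as a small helper. The interval endpoints below are hypothetical (they are not read from the course output), and the function names are illustrative. As in the text, this is only an exploratory aid, not a formal multiple-comparisons procedure.

```python
def intervals_overlap(a, b):
    """True if two (low, high) confidence intervals share at least one value."""
    (a_low, a_high), (b_low, b_high) = a, b
    return a_low <= b_high and b_low <= a_high


def compare_means(a, b):
    """Exploratory reading of two confidence intervals for two population means."""
    if intervals_overlap(a, b):
        return "overlap: no evidence that the two means differ"
    if a[1] < b[0]:
        return "no overlap: data suggest the first mean is smaller"
    return "no overlap: data suggest the first mean is larger"


# Hypothetical (low, high) intervals for three group means.
group_a = (6.4, 8.2)
group_b = (10.9, 12.7)
group_c = (12.3, 14.1)

print(compare_means(group_a, group_b))
print(compare_means(group_b, group_c))
```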
EXAMPLE

Consider our first example on the level of academic frustration.

Based on the small p-value, we rejected Ho and concluded that not all four frustration level means are equal, or in other words that frustration level is related to the student's major. To get more insight into that relationship, we can look at the confidence intervals above (marked in red). The top confidence interval is the set of plausible values for μ1, the mean frustration level of business students. The confidence interval below it is the set of plausible values for μ2, the mean frustration level of English students, etc.

What we see is that the business confidence interval is way below the other three (it doesn't overlap with any of them). The math confidence interval overlaps with both the English and the psychology confidence intervals; however, there is no overlap between the English and psychology confidence intervals.

This gives us the impression that the mean frustration level of business students is lower than the mean in the other three majors. Within the other three majors, we get the impression that the mean frustration of math students may not differ much from the mean of both English and psychology students; however, the mean frustration of English students may be lower than the mean of psychology students.

Note that this is only an exploratory/visual way of getting an impression of why Ho was rejected, not a formal one. There is a formal way of doing it that is called "multiple comparisons," which is beyond the scope of this course. An extension to this course will include this topic in the future.

Let's summarize

• The ANOVA F-test is used for comparing more than two population means when the samples (drawn from each of the populations we are comparing) are independent. We encounter this situation when we want to examine the relationship between a quantitative response variable and a categorical explanatory variable that has more than two values.

• The hypotheses that are being tested in the ANOVA F-test are:
Ho: μ1 = μ2 = ... = μk
Ha: not all the μ's are equal

• The idea behind the ANOVA F-test is to check whether the variation among the sample means is due to true differences among the μ's or merely due to sampling variability by looking at:
(variation among the sample means) / (variation within the groups)

• Once we verify that we can safely proceed with the ANOVA F-test, we use software to carry it out.

• If the ANOVA F-test has rejected the null hypothesis, we can look at the confidence intervals for the population means that are in the output to get a visual insight into why Ho was rejected (i.e., which of the means differ).

Conclusion of Case C→Q


We are now done with case C→Q. We learned that this case is further classified into sub-cases, depending on the number of groups that we are comparing (i.e., the number of categories that the explanatory variable has), and the design of the study (independent vs. dependent samples). For each of the three sub-cases that we covered, we learned the appropriate inferential method, and emphasized the idea behind the method, the conditions under which it can be safely used, how to carry it out using software, and the interpretation of the results. The following table summarizes when each of the three sub-cases covered in this module is used:

The following summary discusses each of the above named sub-cases of C→Q within the context of the hypothesis testing process.

I. Stating the null and alternative hypotheses (Ho and Ha).
II. Check Conditions, and Summarize the Data Using a Test Statistic

Check that the conditions under which the test can be reliably used are met.

For the Two-Sample t-test, the conditions are:
1. The two samples are independent and random.
2. One of the following two scenarios holds:
• Both populations are normal
• Populations are not normal, but the sample size is large (>30)

For the Paired t-test (as a special case of a one-sample t-test), the conditions are:
1. The sample of differences is random (or at least can be considered so in context).
2. We are in one of the three situations marked with a green check mark in the following table:

For an ANOVA, the conditions are:
1. The samples drawn from each of the populations being compared are independent.
2. The response variable varies normally within each of the populations being compared. As is often the case, we do not have to worry about this assumption for large sample sizes.
3. The populations all have the same standard deviation.

Summarize the data using a test statistic.

Special Case of C→Q: Test Statistic

Two-Sample t-test: $t = \dfrac{\bar{y}_1 - \bar{y}_2 - 0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$

Paired t-test: $t = \dfrac{\bar{x}_d - 0}{s_d / \sqrt{n}}$

ANOVA: $F = \dfrac{\text{Variation Among the Sample Means}}{\text{Variation Within the Groups}}$
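The two t statistics in the table can be computed directly from their formulas. This is a minimal sketch with hypothetical data and illustrative function names; it covers only the test statistics, not the p-values, which the course reads from software output.

```python
from math import sqrt
from statistics import mean, stdev, variance


def two_sample_t(sample1, sample2):
    """t = (ybar1 - ybar2 - 0) / sqrt(s1^2/n1 + s2^2/n2), for independent samples."""
    standard_error = sqrt(variance(sample1) / len(sample1)
                          + variance(sample2) / len(sample2))
    return (mean(sample1) - mean(sample2) - 0) / standard_error


def paired_t(differences):
    """t = (xbar_d - 0) / (s_d / sqrt(n)), a one-sample t on the differences."""
    n = len(differences)
    return (mean(differences) - 0) / (stdev(differences) / sqrt(n))


# Hypothetical measurements.
before = [1, 2, 3, 4, 5]
after = [2, 4, 6, 8, 10]

t_independent = two_sample_t(before, after)
t_paired = paired_t([a - b for a, b in zip(after, before)])
print(round(t_independent, 3), round(t_paired, 3))
```

The same numbers give very different statistics depending on whether the samples are treated as independent or as matched pairs, which is why the study design has to be identified before choosing the test.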

III. Finding the p-value of the test

Use statistical software to determine the p-value. The p-value is the probability of getting data like those observed (or even more extreme) assuming that the null hypothesis is true, and is calculated using the null distribution of the test statistic. The p-value is a measure of the evidence against Ho. The smaller the p-value, the more evidence the data present against Ho. The p-values for the three C→Q tests are obtained from the output.

IV. Making conclusions.

Conclusions about the significance of the results: If the p-value is small, the data present enough evidence to reject Ho (and accept Ha). If the p-value is not small, the data do not provide enough evidence to reject Ho.

To help guide our decision, we use the significance level as a cutoff for what is considered a small p-value. The significance cutoff is usually set at .05, but should not be considered inviolable. Conclusions should always be stated in the context of the problem.

Following the test...

For a two-sample t-test, a 95% confidence interval for μ1 - μ2 can be very insightful after a test has rejected the null hypothesis, and can also be used for testing in the two-sided case.

For a paired t-test, a 95% confidence interval for μd can be very insightful after a test has rejected the null hypothesis, and can also be used for testing in the two-sided case.

If the ANOVA F-test has rejected the null hypothesis, looking at the confidence intervals for the population means that are in the output can provide visual insight into why Ho was rejected (i.e., which of the means differ).

Inference for the Relationships Between 2 Categorical Variables (The Chi-Square Test for Independence)
The last three procedures that we studied (two-sample t, paired t, and ANOVA) all involve the relationship between a categorical explanatory variable and a quantitative response variable, corresponding to Case C→Q in the role/type classification table below. Next, we will consider inferences about the relationships between two categorical variables, corresponding to case C→C.

In the Exploratory Data Analysis unit of the course, we summarized the relationship between two categorical variables for a given data set (using a two-way table and conditional percents), without trying to generalize beyond the sample data. Now we will perform statistical inference for two categorical variables, using the sample data to draw conclusions about whether or not we have evidence that the variables are related in the larger population from which the sample was drawn. In other words, we would like to assess whether the relationship between X and Y that we observed in the data is due to a real relationship between X and Y in the population, or if it is something that could have happened just by chance due to sampling variability.

The statistical test that will answer this question is called the chi-square test for independence. Chi is a Greek letter that looks like this: χ, so the test is sometimes referred to as the χ² test for independence. The structure of this section will be very similar to that of the previous ones in this module. We will first present our leading example, and then introduce the chi-square test by going through its 4 steps, illustrating each one using the example. We will conclude by presenting another complete example. As usual, you'll have activities along the way to check your understanding, and to learn how to use software to carry out the test. Let's start with our leading example.
EXAMPLE

In the early 1970s, a young man challenged an Oklahoma state law that prohibited the sale of 3.2% beer to males under age 21 but allowed its sale to females in the same age group. The case (Craig v. Boren, 429 U.S. 190, 1976) was ultimately heard by the U.S. Supreme Court.

The main justification provided by Oklahoma for the law was traffic safety. One of the 3 main pieces of data presented to the court was the result of a "random roadside survey" that recorded information on gender, and whether or not the driver had been drinking alcohol in the previous two hours. There were a total of 619 drivers under 20 years of age included in the survey. Here is what the collected data looked like:

The following two-way table summarizes the observed counts in the roadside survey:

Our task is to assess whether these results provide evidence of a significant ("real") relationship between gender and drunk driving. The following figure summarizes this example:

Note that as the figure stresses, since we are looking to see whether drunk driving is related to gender, our explanatory variable (X) is gender, and the response variable (Y) is drunk driving. Both variables are two-valued categorical variables, and therefore our two-way table of observed counts is 2-by-2. It should be mentioned that the chi-square procedure that we are going to introduce here is not limited to 2-by-2 situations, but can be applied to any r-by-c situation where r is the number of rows (corresponding to the number of values of one of the variables) and c is the number of columns (corresponding to the number of values of the other variable). Before we introduce the chi-square test, let's conduct an exploratory data analysis (that is, look at the data to get an initial feel for it). By doing that, we will also get a better conceptual understanding of the role of the test.

Exploratory Analysis

Recall that the key to reporting appropriate summaries for a two-way table is deciding which of the two categorical variables plays the role of explanatory variable, and then calculating the conditional percentages (the percentages of the response variable for each value of the explanatory variable) separately. In this case, since the explanatory variable is gender, we would calculate the percentages of drivers who did (and did not) drink alcohol for males and females separately. Here is the table of conditional percentages:

For the 619 sampled drivers, a larger percentage of males were found to be drunk than females (16.0% vs. 11.6%). Our data, in other words, provide some evidence that drunk driving is related to gender; however, this in itself is not enough to conclude that such a relationship exists in the larger population of drivers under 20. We need to further investigate the data and decide between the following two points of view:

1. The evidence provided by the roadside survey (16.0% vs. 11.6%) is strong enough to conclude (beyond a reasonable doubt) that it must be due to a relationship between drunk driving and gender in the population of drivers under 20.
2. The evidence provided by the roadside survey (16.0% vs. 11.6%) is not strong enough to make that conclusion, and could have happened just by chance, due to sampling variability, and not necessarily because a relationship exists in the population.
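The conditional percentages quoted above (16.0% vs. 11.6%) can be reproduced directly from the observed counts. The following is a minimal sketch, assuming the counts are stored in a small dictionary; the names `observed` and `conditional_percentages` are ours and are not part of the course materials.

```python
# Observed counts from the roadside survey, as quoted in the text.
observed = {
    "Male":   {"drunk": 77, "not drunk": 404},
    "Female": {"drunk": 16, "not drunk": 122},
}

def conditional_percentages(table):
    """Percentage of each response value within each explanatory group."""
    result = {}
    for group, counts in table.items():
        group_total = sum(counts.values())
        result[group] = {
            response: 100 * count / group_total
            for response, count in counts.items()
        }
    return result

percentages = conditional_percentages(observed)
for group, row in percentages.items():
    print(group, {response: round(value, 1) for response, value in row.items()})
```

Each row is divided by its own total because gender is the explanatory variable, so the percentages within each gender add up to 100.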

Actually, these two opposing points of view constitute the null and alternative hypotheses of the chi-square test for independence, so now that we understand our example and what we still need to find out, let's introduce the four-step process of this test.

The Chi-Square Test for Independence


The chi-square test for independence examines our observed data and tells us whether we have enough evidence to conclude beyond a reasonable doubt that two categorical variables are related. Much like the previous part on the ANOVA F-test, we are going to introduce the hypotheses (step 1), and then discuss the idea behind the test, which will naturally lead to the test statistic (step 2). Let's start.

Step 1: Stating the hypotheses

Unlike all the previous tests that we presented, the null and alternative hypotheses in the chi-square test are stated in words rather than in terms of population parameters. They are:
Ho: There is no relationship between the two categorical variables. (They are independent.)
Ha: There is a relationship between the two categorical variables. (They are not independent.)

EXAMPLE

In our example, the null and alternative hypotheses would then state:
Ho: There is no relationship between gender and drunk driving.
Ha: There is a relationship between gender and drunk driving.
Or equivalently,
Ho: Drunk driving and gender are independent.
Ha: Drunk driving and gender are not independent.
Hence the name "chi-square test for independence."

Comment
Algebraically, independence between gender and driving drunk is equivalent to having equal proportions who drank (or did not drink) for males vs. females. In fact, the null and alternative hypotheses could have been re-formulated as
Ho: proportion of male drunk drivers = proportion of female drunk drivers
Ha: proportion of male drunk drivers ≠ proportion of female drunk drivers

Expressing the hypotheses in terms of proportions works well and is quite intuitive for two-by-two tables, but the formulation becomes very cumbersome when at least one of the variables has several possible values, not just two. We are therefore going to always stick with the "wordy" form of the hypotheses presented in step 1 above.

The Idea of the Chi-Square Test


The idea behind the chi-square test, much like previous tests that we've introduced, is to measure how far the data are from what is claimed in the null hypothesis. The further the data are from the null hypothesis, the more evidence the data present against it. We'll use our data to develop this idea. Our data are represented by the observed counts:

How will we represent the null hypothesis? In the previous tests we introduced, the null hypothesis was represented by the null value. Here there is not really a null value, but rather a claim that the two categorical variables (drunk driving and gender, in this case) are independent. To represent the null hypothesis, we will calculate another set of counts: the counts that we would expect to see (instead of the observed ones) if drunk driving and gender were really independent (i.e., if Ho were true). For example, we actually observed 77 males who drove drunk; if drunk driving and gender were indeed independent (if Ho were true), how many male drunk drivers would we expect to see instead of 77? Similarly, we can ask the same kind of question about (and calculate) the other three cells in our table.

In other words, we will have two sets of counts:
1. the observed counts (the data)
2. the expected counts (if Ho were true)

We will measure how far the observed counts are from the expected ones. Ultimately, we will base our decision on the size of the discrepancy between what we observed and what we would expect to observe if Ho were true.

How are the expected counts calculated? Once again, we are in need of probability results. Recall from the probability section that if events A and B are independent, then P(A and B) = P(A) * P(B). We use this rule for calculating expected counts, one cell at a time. Here again are the observed counts:

Applying the rule to the first (top left) cell, if driving drunk and gender were independent then:
P(Drunk and Male) = P(Drunk) * P(Male)
By dividing the counts in our table, we see that P(Drunk) = 93 / 619 and P(Male) = 481 / 619, and so
P(Drunk and Male) = (93 / 619)(481 / 619)
Therefore, since there are a total of 619 drivers, if drunk driving and gender were independent, the count of drunk male drivers that we would expect to see is:
$$619 \cdot P(\text{Drunk and Male}) = 619 \cdot \frac{93}{619} \cdot \frac{481}{619} = \frac{93 \cdot 481}{619} \approx 72.3$$
Notice that this expression is the product of the column and row totals for that particular cell, divided by the overall table total.

Similarly, if the variables are independent,
P(Drunk and Female) = P(Drunk) * P(Female) = (93 / 619)(138 / 619)
and the expected count of females driving drunk would be
$$619 \cdot \frac{93}{619} \cdot \frac{138}{619} = \frac{93 \cdot 138}{619} \approx 20.7$$

Again, the expected count equals the product of the corresponding column and row totals, divided by the overall table total.

This will always be the case, and will help streamline our calculations:
$$\text{Expected Count} = \frac{\text{Column Total} \cdot \text{Row Total}}{\text{Table Total}}$$
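The expected-count rule above can be applied to every cell at once. Here is a minimal sketch, assuming the observed counts are stored as a list of rows (males first, drunk drivers in the first column); the function name `expected_counts` is ours.

```python
# Observed counts from the roadside survey.
observed = [
    [77, 404],   # Male:   drunk, not drunk
    [16, 122],   # Female: drunk, not drunk
]

def expected_counts(table):
    """Expected count for each cell: row total * column total / table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    table_total = sum(row_totals)
    return [
        [row_total * col_total / table_total for col_total in col_totals]
        for row_total in row_totals
    ]

expected = expected_counts(observed)
for row in expected:
    print([round(value, 1) for value in row])
```

Because the function only uses row and column totals, the same sketch works for any r-by-c table, not just the 2-by-2 case.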

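Here is the complete table of expected counts, followed by the table of observed counts:

Expected counts
            Drunk    Not drunk    Total
Male         72.3      408.7       481
Female       20.7      117.3       138
Total        93        526         619

Observed counts
            Drunk    Not drunk    Total
Male          77        404        481
Female        16        122        138
Total         93        526        619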


Did I Get This?


A study was done on the relationship between gender and piercing among high-school students. A sample of 1,000 students was chosen, then classified according to gender and according to whether or not they had any of their ears pierced. The results of the study are summarized in the following 2-by-2 table:

We see that there are differences between the observed and expected counts in the respective cells. We now have to come up with a measure that will quantify these differences. This is the chi-square test statistic.

Step 2: Checking the Conditions and Calculating the Test Statistic


Given our discussion so far, it would be natural to present the test statistic, and then come back to the conditions that allow us to safely use the chi-square test, although in practice this is done the other way around. The single number that summarizes the overall difference between observed and expected counts is the chi-square statistic $\chi^2$, which tells us in a standardized way how far what we observed (data) is from what would be expected if Ho were true.

Here it is:
$$\chi^2 = \sum_{\text{all cells}} \frac{(\text{Observed Count} - \text{Expected Count})^2}{\text{Expected Count}}$$

Comment
As we expected, $\chi^2$ is based on each of the differences: observed count - expected count (one such difference for each cell), but why is it squared? Why do we divide each squared difference by the expected count? The reason we do that is so that $\chi^2$ will have a known null distribution (under which p-values can be easily calculated). The details are really beyond the scope of this course, but we will just say that the null distribution of $\chi^2$ is called chi-square (which is not very surprising given that the test is called the chi-square test), and like the t-distributions there are many chi-square distributions distinguished by the number of degrees of freedom associated with them.

Conditions Under Which the Chi-Square Test Can Safely Be Used


1. The sample should be random.
2. In general, the larger the sample, the more accurate and reliable the test results are. There are different versions of what the conditions are that will ensure reliable use of the test, all of which involve the expected counts. One version of the conditions says that all expected counts need to be greater than 1, and at least 80% of expected counts need to be greater than 5. A more conservative version requires that all expected counts are larger than 5.
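Both versions of the expected-count condition are easy to check mechanically. The sketch below is one possible way to do it; the function name `check_expected_counts` is ours, and the second table is a hypothetical one, made up only to show a case where the two versions disagree.

```python
def check_expected_counts(expected):
    """Evaluate both versions of the expected-count condition."""
    cells = [value for row in expected for value in row]
    share_above_5 = sum(value > 5 for value in cells) / len(cells)
    return {
        # All counts above 1 and at least 80% of counts above 5.
        "lenient": min(cells) > 1 and share_above_5 >= 0.8,
        # All counts above 5.
        "conservative": min(cells) > 5,
    }

# Expected counts for the drunk driving example.
drunk_driving = [[72.3, 408.7], [20.7, 117.3]]
# Hypothetical 2-by-3 table of expected counts (illustration only).
hypothetical = [[3.0, 8.0, 9.0], [6.0, 7.0, 10.0]]

drunk_driving_check = check_expected_counts(drunk_driving)
hypothetical_check = check_expected_counts(hypothetical)
print(drunk_driving_check)
print(hypothetical_check)
```

Note that the randomness of the sample (condition 1) cannot be verified from the counts; it has to come from knowledge of how the data were collected.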
EXAMPLE

Here, again, are the observed and expected counts.

Checking the conditions:
1. The roadside survey is known to have been random.
2. All the expected counts are above 5.

We can therefore safely proceed with the chi-square test, and the chi-square test statistic is:
$$\chi^2 = \frac{(77 - 72.3)^2}{72.3} + \frac{(404 - 408.7)^2}{408.7} + \frac{(16 - 20.7)^2}{20.7} + \frac{(122 - 117.3)^2}{117.3}$$
$$= .306 + .054 + 1.067 + .188 = 1.62$$
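The hand calculation above uses expected counts rounded to one decimal place. As a rough check, the sketch below repeats it with unrounded expected counts; the function name `chi_square_contributions` is ours. Without the intermediate rounding the statistic comes out slightly higher (about 1.637) than the 1.62 obtained by hand.

```python
# Observed counts from the roadside survey.
observed = [
    [77, 404],   # Male:   drunk, not drunk
    [16, 122],   # Female: drunk, not drunk
]

def chi_square_contributions(table):
    """Return each cell's (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    table_total = sum(row_totals)
    contributions = []
    for i, row in enumerate(table):
        contribution_row = []
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / table_total
            contribution_row.append((count - expected) ** 2 / expected)
        contributions.append(contribution_row)
    return contributions

contributions = chi_square_contributions(observed)
chi_square = sum(sum(row) for row in contributions)
print([[round(value, 3) for value in row] for row in contributions])
print(round(chi_square, 3))
```

Keeping the per-cell contributions, rather than only their sum, makes it easy to see which cells are furthest from what independence would predict.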

Comment
Once the chi-square statistic has been calculated, we can get a feel for its size: is there a relatively large difference between what we observed and what the null hypothesis claims, or a relatively small one? It turns out that for a 2-by-2 case like ours, we are inclined to call the chi-square statistic "large" if it is larger than 3.84. Therefore, our test statistic is not large, indicating that the data are not different enough from the null hypothesis for us to reject it (we will also see that in the p-value not being small). For other cases (other than 2-by-2) there are different cut-offs for what is considered large, which are determined by the null distribution in that case. We are therefore going to rely only on the p-value to draw our conclusions. Even though we will not use the chi-square statistic directly to draw conclusions, it was important to learn about it, since it encompasses the idea behind the test.

Step 3: Finding the p-value


The p-value for the chi-square test for independence is the probability of getting counts like those observed, assuming that the two variables are not related (which is what is claimed by the null hypothesis). The smaller the p-value, the more surprising it would be to get counts like we did, if the null hypothesis were true. Technically, the p-value is the probability of observing $\chi^2$ at least as large as the one observed. Using statistical software, we find that the p-value for this test is 0.201.
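For a 2-by-2 table the p-value can be reproduced without specialized software. The sketch below relies on two facts that this course does not develop: an r-by-c table has (r - 1)(c - 1) degrees of freedom (so 1 for a 2-by-2 table), and a chi-square variable with 1 degree of freedom is the square of a standard normal variable, which lets us use the complementary error function from Python's standard library. The function name is ours, and the shortcut is valid only for 1 degree of freedom.

```python
import math

def chi_square_p_value_df1(statistic):
    """P(chi-square with 1 degree of freedom >= statistic)."""
    return math.erfc(math.sqrt(statistic / 2))

# Software value of the statistic for the drunk driving example.
p_value = chi_square_p_value_df1(1.637)
# The cut-off of 3.84 mentioned for 2-by-2 tables.
p_at_cutoff = chi_square_p_value_df1(3.84)

print(round(p_value, 3))
print(round(p_at_cutoff, 3))
```

The second call suggests where the 3.84 cut-off comes from: a statistic of 3.84 corresponds to a p-value of about .05 in the 2-by-2 case.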

Step 4: Stating the conclusion in context


As usual, we use the magnitude of the p-value to draw our conclusions. A small p-value indicates that the evidence provided by the data is strong enough to reject Ho and conclude (beyond a reasonable doubt) that the two variables are related. In particular, if a significance level of .05 is used, we will reject Ho if the p-value is less than .05.

EXAMPLE

A p-value of .201 is not small at all. There is no compelling statistical evidence to reject Ho, and so we will continue to assume it may be true. Gender and drunk driving may be independent, and so the data suggest that a law that forbids sale of 3.2% beer to males and permits it to females is unwarranted. In fact, the Supreme Court, by a 7-2 majority, struck down the Oklahoma law as discriminatory and unjustified. In the majority opinion Justice Brennan wrote (http://www.law.umkc.edu/faculty/projects/ftrials/conlaw/craig.html):

"Clearly, the protection of public health and safety represents an important function of state and local governments. However, appellees' statistics in our view cannot support the conclusion that the gender-based distinction closely serves to achieve that objective and therefore the distinction cannot under [prior case law] withstand equal protection challenge."

Comment
This is a good opportunity to illustrate an important idea that was discussed earlier in this unit: the larger the sample the results are based on, the more evidence they carry. Let's take the previous example and simply multiply each of the counts by 3:

and see what would have happened if these were the original data. Obviously, the conditional percentages would remain the same:

In other words, the sample provides the "same" results, but this time they are based on a much larger sample (1857 instead of 619). This is reflected by the chi-square test. In this case, software gives us a chi-square statistic of 4.910 and a p-value of 0.027. As before, Ho states that gender and drunk driving are not related; Ha states that they are related. Since the observed counts are triple what they were before, the expected counts are also tripled. When done with software the original chi-square statistic was 1.637, since software doesn't round as much. The chi-square statistic when we tripled the data is 3 times 1.637, or 4.91 (which now is in the "large" range). Therefore, the p-value is smaller and is now .027. Now, we do reject Ho, and we conclude that gender and drunk driving are related. In this case, the "largest contribution to chi-square" is large enough to provide evidence of a relationship. This is due to the fact that so few females drove drunk (48) compared to the number that would be expected (62.2, which is 414 * 279 / 1857) if the variables gender and drunk driving were not related. This contribution is
$$\frac{(48 - 62.2)^2}{62.2} = 3.242$$
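The effect of tripling the counts can be checked numerically. The sketch below recomputes the statistic for both tables and converts it to a p-value using the 1-degree-of-freedom shortcut P(chi-square >= x) = erfc(sqrt(x / 2)), which applies to 2-by-2 tables only; the function names are ours.

```python
import math

def chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    table_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / table_total
            statistic += (count - expected) ** 2 / expected
    return statistic

def p_value_df1(statistic):
    """p-value for 1 degree of freedom (2-by-2 tables only)."""
    return math.erfc(math.sqrt(statistic / 2))

original = [[77, 404], [16, 122]]
tripled = [[3 * count for count in row] for row in original]

stat_original = chi_square(original)
stat_tripled = chi_square(tripled)
print(round(stat_original, 3), round(p_value_df1(stat_original), 3))
print(round(stat_tripled, 3), round(p_value_df1(stat_tripled), 3))
```

Multiplying every count by 3 multiplies each squared difference by 9 and each expected count by 3, so every term, and therefore the whole statistic, is multiplied by exactly 3.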

Steroid Use in Sports


Major-league baseball star Barry Bonds admitted to using a steroid cream during the 2003 season. Is steroid use different in baseball than in other sports? According to the 2001 National Collegiate Athletic Association (NCAA) survey (http://www.ncaa.org/library/research/substance_use_habits/2001/substance_use_habits.pdf), which is self-reported and asked of a stratified random selection of teams from each of the three NCAA divisions, reported steroid use among the top 5 college sports was as follows:

Do the data provide evidence of a significant relationship between steroid use and the type of sport? In other words, are there significant differences in steroid use among the different sports? Before we carry out the chi-square test for independence, let's get a sense of the data by calculating the conditional percents:

It seems as if there are differences in steroid use among the different sports. Even though the differences do not seem to be overwhelming, since the sample size is so large, these differences might be significant. Let's carry out the test and see.

Step 1: Stating the hypotheses
The hypotheses are:
Ho: Steroid use is not related to the type of sport (or: type of sport and steroid use are independent).
Ha: Steroid use is related to the type of sport (or: type of sport and steroid use are not independent).

Step 2: Checking conditions and finding the test statistic

Here is the Minitab output of the chi-square test for this example:

Conditions:
1. We are told that the sample was random.
2. All the expected counts are above 5.

Test statistic: The test statistic is 14.626. Note that the "largest contributors" to the test statistic are 5.729 and 3.990. The first cell corresponds to football players who used steroids, with an observed count larger than we would expect to see under independence. The second cell corresponds to tennis players who used steroids, and has an observed count lower than we would expect under independence.

Step 3: Finding the p-value
According to the p-value in the output, it would be extremely unlikely (probability of 0.006) to get counts like those observed if the null hypothesis were true. In other words, it would be very surprising to get data like those observed if steroid use were not related to sport type.

Step 4: Conclusion

The small p-value indicates that the data provide strong evidence against the null hypothesis, so we reject it and conclude that steroid use is related to the type of sport.

Let's Summarize
1. The chi-square test for independence is used to test whether the relationship between two categorical variables is significant. In other words, the chi-square procedure assesses whether the data provide enough evidence that a true relationship between the two variables exists in the population.
2. The hypotheses that are being tested in the chi-square test for independence are:
   Ho: There is no relationship between ..... and ......
   Ha: There is a relationship between ..... and ......
   or equivalently,
   Ho: The variables ..... and ..... are independent.
   Ha: The variables ..... and ..... are not independent.
3. The idea behind the test is measuring how far the observed data are from the null hypothesis by comparing the observed counts to the expected counts, that is, the counts that we would expect to see (instead of the observed ones) had the null hypothesis been true.
4. The measure of the difference between the observed and expected counts is the chi-square test statistic, whose null distribution is called the chi-square distribution.
5. Once we verify that the conditions that allow us to safely use the chi-square test are met, we use software to carry it out and use the p-value to guide our conclusions.

Inference for the Linear Relationships Between 2 Quantitative Variables


Introduction

In inference for relationships, so far we have learned inference procedures for both cases C→Q and C→C from the role/type classification table below. The last case to be considered in this course is case Q→Q, where both the explanatory and response variables are quantitative. (Case Q→C requires statistical methods that go beyond the scope of this course, but might be part of extension modules in the future.)

In the Exploratory Data Analysis section, we examined the relationship between sample values for two quantitative variables by looking at a scatterplot and focused on the linear relationship by supplementing the scatterplot with the correlation coefficient r. There was no attempt made to claim that whatever relationship was observed in the sample necessarily held for the larger population from which the sample originated. Now that we have a better understanding of the process of statistical inference, we will present the method for inferring something about the relationship between two quantitative variables in an entire population, based on the relationship seen in the sample. In particular, the method will focus on linear relationships and will answer the following question: Is the observed linear relationship due to a true linear relationship between two variables in the population, or could it be that we obtained this kind of pattern in the data just by chance? If we conclude that we can generalize the observed linear relationship to the entire population, we will then use the data to estimate the line that governs the linear relationship between the two variables in the population, and use it to make predictions. The following figure summarizes this process:

Note that the figure summarizes the whole process. Let's review it again.

We start by asking whether the two quantitative variables are related (in any way). We collect data, and when we summarize them with a scatterplot and the correlation r, we observe a linear relationship.

Then we get to the inference part of the process, which we are going to learn here:

We will carry out a test that will tell us whether the observed linear relationship is significant (i.e., can be generalized to the entire population).
- If the observed linear relationship is not significant, we cannot generalize it to the population, and the process ends there.
- If the observed linear relationship is significant, we can use the data to estimate the line that governs the linear relationship between X and Y in the population, and can use it to make predictions (see comment 1 below).

Comments

1. We estimate the line that governs the linear relationship between X and Y in the population by the line that best fits the linear pattern in our observed data. Recall that in the Exploratory Data Analysis unit we've actually already learned how to find the least squares regression line, the line that best fits the observed data. You can now see that finding the least squares regression line actually belongs to the inference unit, and while it is true that it is the line that best fits (in some sense) the observed data, it is really an estimate of the true linear relationship that exists in the population. The good thing is that we already learned how to obtain this line, so we'll only need to review it.
2. This section on regression will be very qualitative in nature and will rely mostly on conceptual ideas and on output. An extension module to this course, which will go deeper into the inferential processes of regression, will exist in the near future.
3. This section will be organized around a leading example. At some stages along the way, you'll be directed to an activity, where you'll get to have hands-on practice with a different example.

Let's introduce our leading example, which was actually our leading example in the Exploratory Data Analysis unit as well.
EXAMPLE

In a study of the legibility and visibility of highway signs, a Pennsylvania research firm determined the maximum distance at which each of 30 drivers could read a newly designed sign. The 30 participants in the study ranged in age from 18 to 82 years old. The government agency that funded the research hoped to improve highway safety for older drivers and wanted to examine the relationship between age and sign legibility distance. (Data adapted with permission from Utts and Heckard, Mind on Statistics.) Let's go through the entire process (outlined above) for this example.

Starting point: The researchers wanted to examine the relationship between age and sign legibility distance in the population of drivers. The researchers collected data from a random sample of 30 drivers:

Exploratory Analysis: The researchers display the data on a scatterplot:

and observe a negative linear relationship in the data. In order to quantify the strength of that linear relationship, the researchers supplement the scatterplot with a numerical measure, the correlation coefficient (r), which turns out to be r = -.801, confirming the researchers' visual assessment of a negative, fairly strong linear relationship between age and legibility distance.

Inference: The researchers would now like to see whether the observed linear relationship between age and legibility distance can be generalized to the entire population of drivers. In other words, the researchers want to check whether the observed linearity is due to true linearity in the population, or a pattern that could have happened just by chance. The test that the researchers are going to carry out is a t-test (most commonly known as the "t-test for the slope" for reasons that we are not going to get into) which is testing the following two hypotheses (step 1):
Ho: There is no linear relationship between age and distance.
Ha: There is a linear relationship between age and distance.

and in general,
Ho: There is no linear relationship between X and Y.
Ha: There is a linear relationship between X and Y.


Comments
1. As we mentioned earlier, we are going to keep this discussion on the qualitative side and in particular we will not go very deeply into step 2 of the hypothesis test. As for the test statistic in this case, we'll just say that the test is a t-test, which, as we know, means that the null distribution of its test statistic (under which the p-values are calculated) is some t distribution.
2. We are also going to focus on only some of the conditions that allow us to safely use this t-test. They are:
   - the observed data indeed look linear (otherwise it would not make sense to try and generalize them)
   - the observations are independent
   - there are no extreme outliers in the data
   - the sample size is fairly large
   Note that in our example all these conditions are met: the data definitely look linear, the observations (drivers) are independent of each other (since they were randomly chosen), there are no extreme observations in the data, and a sample size of n = 30 is fairly large.

For step 3, the researchers use statistical software to find a test statistic value of -7.09, and a p-value that is so small that it is essentially 0. This means that it would be extremely unlikely (actually, virtually impossible) to get data like those observed if age and legibility distance were not linearly related. In other words, it would be extremely unlikely to get data like those observed just by chance. The researchers conclude (step 4) that since the p-value is so small, the data provide extremely strong evidence to reject Ho and conclude that age and legibility distance are linearly related.
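Although the course does not derive the test statistic, readers who want a rough sanity check on the software output can use a standard identity for simple linear regression that is not stated in this course: the slope t statistic equals r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below applies it to the reported r = -.801 and n = 30; the function name is ours, and because r is rounded to three decimals the result (about -7.08) only approximately matches the reported -7.09.

```python
import math

def slope_t_statistic(r, n):
    """t statistic for the slope, computed from the correlation r and sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t_statistic = slope_t_statistic(-0.801, 30)
print(round(t_statistic, 2))
```

The identity also shows why a given correlation carries more evidence in a larger sample: for fixed r, the statistic grows in magnitude with sqrt(n - 2).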

Note: It is important to distinguish between the information provided by r and by the p-value. The correlation coefficient r informs us about the strength of the linear relationship in the data: close to +1 or -1 for a strong linear relationship, close to 0 for a weak linear relationship. In contrast, the regression p-value informs us about the strength of evidence that there is a linear relationship in the population from which the data were obtained. In our example, since the p-value is 0.000 (to three decimal places) and r = -.801, we would say that we have extremely strong evidence of a fairly strong relationship between age and distance in the population of drivers.

So far the researchers have observed linearity in the data, and based on a test concluded that this linear relationship between age and legibility distance can be generalized to the entire population of drivers. Since that is the case, the researchers would now like to estimate the equation of the straight line that governs the linear relationship between age and legibility distance among drivers. As we commented earlier, this is done by finding the line that best fits the pattern of our observed data. Recall that this line is called the least squares regression line, which is the line that minimizes the sum of the squared vertical deviations:

In the Exploratory Data Analysis section, we presented the actual formulas for the slope and intercept of the line. We are not going to repeat those here; we will obtain those values from the output:

and ask the software to plot it for us on the scatterplot so we can see how well it fits the data.
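To make the idea of "minimizing the sum of the squared vertical deviations" concrete, here is a small sketch that computes a least squares line and compares its sum of squared deviations with that of a slightly different line. The five (age, distance) pairs are hypothetical values made up for illustration; they are not the study's data, and the function names are ours.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_deviations(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only (not the sign legibility study).
ages = [20, 35, 50, 65, 80]
distances = [520, 470, 430, 380, 340]

intercept, slope = least_squares_line(ages, distances)
best = sum_squared_deviations(ages, distances, intercept, slope)
other = sum_squared_deviations(ages, distances, intercept, slope + 0.1)
print(intercept, slope, best, other)
```

Any other choice of slope or intercept, such as the perturbed slope used here, gives a larger sum of squared deviations than the least squares line.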

Based on the observed data, the researchers conclude that the linear relationship between age and legibility distance among drivers can be summarized with the line:
$$\text{DISTANCE} = 576.7 - 3.007 \cdot \text{AGE}$$
In particular, the slope of the line is roughly -3, which means that for every year that a driver gets older (1 unit increase in X), the maximum legibility distance is reduced, on average, by 3 feet (Y changes by the value of the slope). The researchers can also use this line to make predictions, remembering to beware of extrapolations (predictions for X values that are outside of the range of the original data). For example, using the equation of the line, we predict that the maximum legibility distance of a 60-year-old driver is:
$$\text{distance} = 576.7 - 3.007(60) = 396.28$$
The following figure illustrates this prediction.
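The prediction above, together with the warning about extrapolation, can be sketched in a few lines. The intercept and slope are taken from the fitted line, and the allowed age range (18 to 82) from the description of the study; the function name `predict_distance` and the choice to raise an error outside that range are ours.

```python
INTERCEPT = 576.7      # from the fitted line
SLOPE = -3.007         # from the fitted line
AGE_RANGE = (18, 82)   # ages observed in the study

def predict_distance(age):
    """Predicted maximum legibility distance (feet) for a driver of the given age."""
    low, high = AGE_RANGE
    if not low <= age <= high:
        # Refuse to extrapolate beyond the ages that were actually observed.
        raise ValueError(f"age {age} is outside the observed range {low}-{high}")
    return INTERCEPT + SLOPE * age

predicted = predict_distance(60)
print(round(predicted, 2))

try:
    predict_distance(90)
except ValueError as error:
    print("No prediction:", error)
```

Raising an error is only one way to guard against extrapolation; a warning would also do, as long as the user is told that the line has not been checked against data in that range.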

Let's summarize all that the researchers have done in a figure:

Wrap-Up (Inference for Relationships)


We've just completed the part of the course about the inferential methods for relationships between variables. The overall goal of inference for relationships is to assess whether the observed data provide evidence of a significant relationship between the two variables (i.e., a true relationship that exists in the population). Much like the module about relationships in the Exploratory Data Analysis (EDA) unit, this part of the course was organized according to the role and type classification of the two variables involved. However, unlike the EDA module, when it comes to inferential methods, we further distinguished between three sub-cases in case C→Q, so essentially we covered 5 cases in total. The following very detailed role-type classification table summarizes both EDA and inference for the relationship between variables:

Summary (Inference)
This summary provides a quick recap of the big ideas you've learned in the inference section (without going into any of the technical details). Therefore, this summary does not provide complete coverage of the material and thus should be used only as a checklist or a quick review of the "big ideas" before an exam.

In the Inference unit, which is the last step in the "Big Picture," we use the evidence provided by the data to infer about the relevant population. The inference could be about the value of unknown parameters in our population (mean, proportion, difference between means, etc.) or about the existence of a certain relationship between two variables in the population. We discussed 3 forms of inference: point estimation, interval estimation and hypothesis testing.

Point Estimation
1. Idea: estimating an unknown parameter with a single value (that was obtained from the observed data).
2. We typically estimate:
   - the population mean $\mu$ by the sample mean $\bar{x}$
   - the population proportion $p$ by the sample proportion $\hat{p}$
   - the population standard deviation $\sigma$ and the population variance $\sigma^2$ by the sample standard deviation $s$ and the sample variance $s^2$
   Note that the last two parameters ($\sigma$ and $\sigma^2$) are not covered in this course.
3. $\bar{x}$, $\hat{p}$, and $s^2$ are unbiased estimators for $\mu$, $p$, and $\sigma^2$, respectively ($s$ is the usual estimator of $\sigma$, although it is slightly biased). Their precision increases with the sample size.

Interval Estimation
1. Idea: Estimating an unknown parameter with an interval of plausible values and attaching to the interval our level of confidence that it indeed covers the true value of the parameter. Such an interval is therefore called a confidence interval.
2. The general form of confidence intervals is:
   point estimate ± margin of error
   where the margin of error represents the maximum estimation error for a given level of confidence, and is the product of the confidence multiplier and the standard deviation (or standard error) of the point estimator.
3. Since the margin of error (and therefore the width of the confidence interval) increases with the level of confidence, there is a trade-off between the level of confidence and the precision of the interval estimation. The price you have to pay for more confidence is less precision (a wider confidence interval) and vice versa.
4. A way to get better precision for a given level of confidence is to increase the sample size. Sample size calculations can be carried out in order to determine the sample size needed for a desired margin of error at a certain level of confidence. We should keep in mind, though, that in practice, larger sample sizes are not always available.
5. For the confidence interval for the population mean, $\mu$, we distinguished between:
   - the case where the population standard deviation $\sigma$ is known (in which case we use the z* confidence multipliers), and
   - the case where $\sigma$ is unknown and is replaced by the sample standard deviation s (in which case we use the t* confidence multipliers, and rely on software to do the calculations).
   For large sample sizes, though, and for a given level of confidence, z* is approximately equal to t*. In either case, we can safely use the confidence interval as long as the population is normal and/or the sample size is large (> 30).
6. The confidence interval for the population proportion p is the primary statistical method used in the analysis of polls, and can be safely used as long as $n \cdot \hat{p} \ge 10$ and $n \cdot (1 - \hat{p}) \ge 10$.
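As an illustration of the general form "point estimate ± margin of error," the sketch below builds a confidence interval for a proportion and checks the two conditions first. It assumes the standard error of the sample proportion, sqrt(p̂(1 - p̂)/n), and the multiplier 1.96 for 95% confidence, both standard results that this summary does not restate. The example reuses the 93 drunk drivers out of 619 from the roadside survey purely for illustration; it is not an analysis carried out in the course, and the function name is ours.

```python
import math

def proportion_confidence_interval(successes, n, multiplier=1.96):
    """Confidence interval for a population proportion (default multiplier: 95%)."""
    p_hat = successes / n
    # Conditions for safely using the interval.
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("conditions n*p_hat >= 10 and n*(1 - p_hat) >= 10 are not met")
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin_of_error = multiplier * standard_error
    return p_hat - margin_of_error, p_hat + margin_of_error

low, high = proportion_confidence_interval(93, 619)
print(round(low, 3), round(high, 3))
```

A larger multiplier (more confidence) widens the interval, while a larger n shrinks the standard error and narrows it, which is the trade-off described in items 3 and 4.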

Hypothesis Testing
1. Idea: Unlike point and interval estimation, in which the goal is estimating an unknown parameter, in hypothesis testing we are assessing the evidence provided by the data in favor of or against some claim about the population.
2. In practice, we have two competing hypotheses, Ho, which is challenged by Ha, and we are assessing whether or not the data provide evidence (beyond a reasonable doubt) that we can reject Ho in favor of Ha. If they do, we say that the results are significant; otherwise, if Ho cannot be rejected, we say that the results are not significant.
3. Ho and Ha are two claims about the population. In the one variable case, these claims are about the value of a parameter in the population. In inference about relationships, Ho and Ha are about the existence/nonexistence of a certain relationship between the two variables. Recall that in case C→Q the existence/nonexistence of the relationship is stated in terms of means: $\mu_1 - \mu_2$, $\mu_d$, or $\mu_1, \mu_2, \mu_3, \ldots, \mu_k$. In cases C→C and Q→Q, the relationship is stated in words.
4. After the hypotheses have been formulated, data are collected, and conditions for use are checked, the evidence in the data is assessed by finding the p-value of the test, the probability of getting data like those observed (or even more extreme) if Ho were true. If the p-value is small (smaller than some cut-off called the significance level, typically set at .05), meaning that it would be unlikely to get data like those observed if Ho were true, we reject Ho in favor of Ha. Otherwise, if the p-value > 0.05, we cannot reject Ho. Note that 0.05 represents our "reasonable doubt." The p-value can be viewed as a measure of the evidence in the data against Ho, where the smaller the p-value, the stronger the evidence against Ho.
5. In practice, to find the p-value we use the test statistic, a summary of the data which is some measure of "how far" or "how different" the observed data are from what is claimed in Ho. The p-value of a test is the probability of getting a test statistic (based on the data) like that observed (or even more extreme), if Ho were true. The p-value is therefore calculated using the sampling distribution of the test statistic when Ho is true (called the null distribution).
6. Conclusions are then based on the p-value, and should always be stated in context.
