
Estimation and Hypothesis Testing II

Data Analysis Using R 1 / 25


Table of contents

1 Nonparametric tests for the population mean

2 Tests for proportions

3 Tests for variance/standard deviation

4 ANOVA

5 Kruskal-Wallis test



Nonparametric tests for the population mean I

When the populations do not have a normal distribution, we use the
Wilcoxon test (paired observations) and the Mann-Whitney test (independent samples).

In R: wilcox.test(x, y, alternative = "two.sided"/"less"/"greater",
paired = TRUE/FALSE)

The test hypotheses are the same as for the t-tests.


Example
In the built-in data set named immer, the barley yield in years 1931 and
1932 of the same field are recorded. The yield data are presented in the
data frame columns Y1 and Y2. Without assuming the data to have
normal distribution, test at .05 significance level if the barley yields of
1931 and 1932 in data set immer have identical data distributions.



Nonparametric tests for the population mean II

Example
A study is designed in order to test the efficiency of vitamin E supplements
for Alzheimer's disease prevention. 20 subjects aged over 65 are randomly
distributed to two groups. The first group (10 people) receives 400 IU/day
of vitamin E, while the second group receives a placebo treatment. The
initial vitamin E levels are measured:
Group 1 : 7.5; 12.6; 3.8; 20.2; 6.8; 403.3; 2.9; 7.2; 10.5; 205.4
Group 2 : 8.2; 13.3; 102.0; 12.7; 6.3; 4.8; 19.5; 8.3; 407.1; 10.2
Test if there is a significant difference between the two groups initially.
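
One possible way to run this example is sketched below. The two groups are independent, so the Mann-Whitney version of the test (paired = FALSE) applies. The vector names g1 and g2 are illustrative and not part of the original material.

```r
# Initial vitamin E levels in the two independent groups
g1 <- c(7.5, 12.6, 3.8, 20.2, 6.8, 403.3, 2.9, 7.2, 10.5, 205.4)
g2 <- c(8.2, 13.3, 102.0, 12.7, 6.3, 4.8, 19.5, 8.3, 407.1, 10.2)

# Mann-Whitney (Wilcoxon rank-sum) test, two-sided
res_vit <- wilcox.test(g1, g2, alternative = "two.sided", paired = FALSE)
print(res_vit)

# TRUE would indicate a significant initial difference at the 0.05 level
significant_vit <- res_vit$p.value < 0.05
```

A rank-based test is a natural choice here because both groups contain a few very large values that would dominate a comparison of means.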



Tests for the proportions I

Hypotheses:
$H_0: p = p_0$
$H_a: p \neq p_0$, or $p < p_0$, or $p > p_0$
The test statistic
$$Z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 (1 - p_0)}{n}}}$$
has approximately the standard normal distribution $N(0, 1)$, where $\hat{p} = x/n$, $x$ is the
number of cases that exhibit the studied characteristic, and $n$ is the
sample size.
Conditions: $n p_0 \geq 5$ and $n (1 - p_0) \geq 5$.



Tests for the proportions II

Example
We study the effect of birth weight on the cognitive abilities of babies. In
order to do this the IQ score of 33 randomly chosen children that were
underweight at birth (< 1500 g) is measured and it is found that 8 of
them have a score less than 70. In normal children, this proportion is
3.2%. Test if the low birth weight has a significant effect on the cognitive
abilities of children.
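
The Z statistic for this example can be computed directly, as in the sketch below. A two-sided alternative is assumed here, since the example asks only whether there is an effect. Note that n × p0 = 33 × 0.032 is below 5, so the condition for the normal approximation is not met; for that reason the sketch also runs the exact binomial test (binom.test), which is an addition made here and not part of the original material.

```r
x  <- 8      # children with an IQ score below 70
n  <- 33     # sample size
p0 <- 0.032  # proportion in normal children

p_hat <- x / n
z_stat <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value_z <- 2 * pnorm(-abs(z_stat))  # two-sided p-value

# Check the condition for the normal approximation
approx_ok <- (n * p0 >= 5) && (n * (1 - p0) >= 5)

# Exact binomial test as a cross-check when the condition fails
res_exact <- binom.test(x, n, p = p0, alternative = "two.sided")

print(c(z = z_stat, p_value = p_value_z))
print(approx_ok)
print(res_exact$p.value)
```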
Confidence intervals for proportions:
The $100(1 - \alpha)\%$ confidence interval for the proportion of a population is
$$\left( \hat{p} - z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}},\; \hat{p} + z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \right),$$
where $\hat{p} = x/n$, $x$ is the number of successes in the sample, $n$ is the sample size, and
$z_{\alpha/2}$ is the corresponding quantile of the standard normal distribution.



Tests for the proportions III

Example
Compute the 95% and 99% confidence intervals for the proportion of
underweight newborns that have an IQ score less than 70.
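
A short sketch of this computation, using the normal-approximation interval and the data of the birth weight example (8 of 33 children). The helper name prop_ci is hypothetical and introduced only for illustration.

```r
x <- 8
n <- 33
p_hat <- x / n
se <- sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error of p_hat

# Normal-approximation confidence interval for a given confidence level
prop_ci <- function(level) {
  z <- qnorm(1 - (1 - level) / 2)    # quantile z_{alpha/2}
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

ci95 <- prop_ci(0.95)
ci99 <- prop_ci(0.99)
print(round(ci95, 4))
print(round(ci99, 4))
```

The 99% interval is wider than the 95% interval, since a higher confidence level requires a larger quantile.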



χ2 test for a proportion I

Hypotheses:
$H_0: p = p_0$
$H_a: p \neq p_0$, or $p < p_0$, or $p > p_0$
Test statistic:
$$\chi^2 = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2}$$
has a $\chi^2(1)$ distribution.
In R: prop.test(x, n, p, alternative = "two.sided"/"less"/"greater") where
x is the number of individuals in the sample exhibiting the studied
characteristic
n is the sample size



χ2 test for a proportion II

p is the proportion assumed as true by the null hypothesis H0

Example
The breast cancer incidence rate in the US female population in the 50-54
age group is approximately 2%. We would like to test if the proportion of
breast cancer patients among women whose mothers were diagnosed with the same
condition is larger than that of the general population. A random sample
of 10000 women is chosen in the age group 50-54 whose mothers had
breast cancer at some point in their lives and we observe that 400 of the
women also have this condition. At the 0.05 level of significance, what is
your conclusion?
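
The sketch below shows one way to work through this example. It first computes the χ2 statistic by hand from observed and expected counts, then calls prop.test with a one-sided alternative. By default prop.test applies Yates' continuity correction, so correct = FALSE is used for the call that is compared with the hand computation.

```r
x  <- 400    # women in the sample with breast cancer
n  <- 10000  # sample size
p0 <- 0.02   # incidence in the general population

# Chi-square statistic by hand: observed vs. expected counts
observed <- c(x, n - x)
expected <- c(n * p0, n * (1 - p0))
chi2_manual <- sum((observed - expected)^2 / expected)

# Same statistic from prop.test without continuity correction
res_plain <- prop.test(x, n, p = p0, alternative = "greater", correct = FALSE)

# Default call (with continuity correction)
res_yates <- prop.test(x, n, p = p0, alternative = "greater")

print(chi2_manual)
print(res_plain$statistic)
print(res_yates$p.value)
```

A p-value below 0.05 leads to rejecting H0 in favour of a higher proportion among women whose mothers had breast cancer.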



The χ2 test for several proportions I

Hypotheses:
$H_0: p_1 = p_2 = \ldots = p_k$
$H_a$: at least one proportion is different
Test statistic:
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
has a $\chi^2(k - 1)$ distribution.
In R: prop.test(x, n, alternative = "two.sided"/"less"/"greater") where
x is the vector containing the number of individuals in each sample
exhibiting the studied characteristic
n is the vector of sample sizes
(the one-sided alternatives are only used when comparing at most two proportions)



The χ2 test for several proportions II

Example
A study is conducted in order to analyse the effects of oral contraception
(OC) on heart conditions of women aged 40-44. The researchers find that
among 5000 women who participated in the study and took OC, 13
suffered a heart attack in the next 3 years, while among 10000 women
that did not take OC, 7 suffered a heart attack in the next 3 years. What
is the conclusion of the study (α = 0.05)?
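
A minimal sketch for this example, passing the two counts and the two sample sizes as vectors to prop.test. The variable names are illustrative; the default continuity correction is kept.

```r
heart_attacks <- c(OC = 13, no_OC = 7)         # events in each group
group_sizes   <- c(OC = 5000, no_OC = 10000)   # women in each group

res_oc <- prop.test(heart_attacks, group_sizes)
print(res_oc)

# Decision at alpha = 0.05
reject_oc <- res_oc$p.value < 0.05
```

If reject_oc is TRUE, the proportions of heart attacks in the two groups differ significantly at the 0.05 level.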



Comparison of two proportions I

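Hypotheses:
$H_0: p_1 = p_2$
$H_a: p_1 \neq p_2$, or $p_1 < p_2$, or $p_1 > p_2$

The test statistic
$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
has approximately the $N(0, 1)$ distribution, where $\hat{p}_1 = x_1/n_1$ and $\hat{p}_2 = x_2/n_2$
are the proportions of successes in the two samples, $n_1, n_2$ are the sample sizes, and
$\hat{p} = \dfrac{x_1 + x_2}{n_1 + n_2}$ is the pooled proportion.

Conditions: $n_1 \hat{p},\; n_1 (1 - \hat{p}),\; n_2 \hat{p},\; n_2 (1 - \hat{p}) > 5$.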



Comparison of two proportions II

Example
A study is conducted to find out the factors that help spread tuberculosis among
drug users. Two samples are chosen and tested: 97 drug users who admitted to sharing
needles and 161 who did not. The reports show that 34 individuals in the
first group and 28 in the second have tuberculosis. Test whether the proportion of
tuberculosis-infected people is larger among drug users that share needles, with a 0.05
level of significance.
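
The following sketch computes the Z statistic for this example from the pooled proportion and cross-checks it against prop.test. The continuity correction is switched off (correct = FALSE) so that the χ2 statistic reported by prop.test equals Z squared; that setting is a choice made here for the comparison.

```r
x1 <- 34; n1 <- 97    # shared needles
x2 <- 28; n2 <- 161   # did not share needles

p1_hat <- x1 / n1
p2_hat <- x2 / n2
p_pool <- (x1 + x2) / (n1 + n2)   # pooled proportion

z_tb <- (p1_hat - p2_hat) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
p_value_tb <- pnorm(z_tb, lower.tail = FALSE)   # Ha: p1 > p2

# Cross-check with prop.test (one-sided, no continuity correction)
res_tb <- prop.test(c(x1, x2), c(n1, n2), alternative = "greater", correct = FALSE)

print(c(z = z_tb, p_value = p_value_tb))
print(res_tb$p.value)
```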



The χ2 test for variance I

The population has a normal distribution.

Hypotheses:
$H_0: \sigma^2 = \sigma_0^2$
$H_a: \sigma^2 \neq \sigma_0^2$, or $\sigma^2 < \sigma_0^2$, or $\sigma^2 > \sigma_0^2$
Test statistic:
$$\chi^2 = \frac{(n - 1)s^2}{\sigma_0^2} \sim \chi^2(n - 1)$$
where $s^2$ is the sample variance and $n$ is the sample size.



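The χ2 test for variance II
The critical region:

$H_a: \sigma^2 \neq \sigma_0^2 \rightarrow W = (0, \chi^2_{\alpha/2}) \cup (\chi^2_{1-\alpha/2}, \infty)$
$H_a: \sigma^2 < \sigma_0^2 \rightarrow W = (0, \chi^2_{\alpha})$
$H_a: \sigma^2 > \sigma_0^2 \rightarrow W = (\chi^2_{1-\alpha}, \infty)$

where $\chi^2_{q}$ denotes the quantile of order $q$ of the $\chi^2(n - 1)$ distribution. A one-sided
test at level $\alpha$ places the whole of $\alpha$ in a single tail.

Example
Test whether the variance of the data in the "weight" variable of the
"PlantGrowth" data frame is equal to 0.5 or larger, α = 0.05.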
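
Base R has no dedicated function for the one-sample variance test, so the sketch below computes the statistic, the critical value and the p-value by hand. It assumes the hypotheses H0: σ² = 0.5 against Ha: σ² > 0.5, and that normality of the weights is acceptable.

```r
w <- PlantGrowth$weight
n <- length(w)
sigma0_sq <- 0.5   # variance under H0
alpha <- 0.05

s_sq <- var(w)                              # sample variance
chi2_stat <- (n - 1) * s_sq / sigma0_sq     # test statistic

# Right-tailed test: reject H0 when the statistic exceeds the 1 - alpha quantile
crit <- qchisq(1 - alpha, df = n - 1)
p_value_var <- pchisq(chi2_stat, df = n - 1, lower.tail = FALSE)
reject_var <- chi2_stat > crit

print(c(statistic = chi2_stat, critical = crit, p_value = p_value_var))
```

The two decision rules agree: the statistic falls in the critical region exactly when the p-value is below α.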



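The χ2 test for variance III

The $100(1 - \alpha)\%$ confidence intervals for variance/standard deviation:

$$\frac{(n - 1)s^2}{\chi^2_{1-\alpha/2}} \leq \sigma^2 \leq \frac{(n - 1)s^2}{\chi^2_{\alpha/2}} \quad \text{(variance)}$$
$$\sqrt{\frac{(n - 1)s^2}{\chi^2_{1-\alpha/2}}} \leq \sigma \leq \sqrt{\frac{(n - 1)s^2}{\chi^2_{\alpha/2}}} \quad \text{(standard deviation)}$$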

Example
Compute the 95% confidence interval for the variance of the weight of
plants in the ”PlantGrowth” data frame.
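
A sketch of this computation, applying the interval formula above with quantiles from qchisq. The variable names are illustrative.

```r
w <- PlantGrowth$weight
n <- length(w)
s_sq <- var(w)
alpha <- 0.05

# Quantiles of the chi-square distribution with n - 1 degrees of freedom
q_low  <- qchisq(alpha / 2, df = n - 1)
q_high <- qchisq(1 - alpha / 2, df = n - 1)

# The larger quantile gives the lower bound, the smaller one the upper bound
ci_var <- c(lower = (n - 1) * s_sq / q_high,
            upper = (n - 1) * s_sq / q_low)
ci_sd  <- sqrt(ci_var)   # interval for the standard deviation

print(round(ci_var, 4))
print(round(ci_sd, 4))
```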



F test for the variance of two populations I

The two populations must have a normal distribution.


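Hypotheses:
$H_0: \sigma_1^2 = \sigma_2^2$
$H_a: \sigma_1^2 \neq \sigma_2^2$, or $\sigma_1^2 < \sigma_2^2$, or $\sigma_1^2 > \sigma_2^2$
Test statistic:
$$F = \frac{s_1^2 / \sigma_1^2}{s_2^2 / \sigma_2^2} \sim F(n - 1, m - 1)$$
where $s_1^2, s_2^2$ are the sample variances and $n, m$ are the sample sizes. Under $H_0$ the
statistic reduces to $F = s_1^2 / s_2^2$.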



F test for the variance of two populations II

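The critical region:

$H_a: \sigma_1^2 \neq \sigma_2^2 \rightarrow W = (0, F_{\alpha/2}) \cup (F_{1-\alpha/2}, \infty)$
$H_a: \sigma_1^2 < \sigma_2^2 \rightarrow W = (0, F_{\alpha})$
$H_a: \sigma_1^2 > \sigma_2^2 \rightarrow W = (F_{1-\alpha}, \infty)$

where $F_{q}$ denotes the quantile of order $q$ of the distribution $F(n - 1, m - 1)$.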



F test for the variance of two populations III

Example
The ”ToothGrowth” data frame contains the length of odontoblasts (cells
responsible for tooth growth) in 60 guinea pigs. Each animal received one
of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two
delivery methods, orange juice or ascorbic acid (a form of vitamin C and
coded as VC). Test if the variance of the length of cells is the same in the
two groups determined by the delivery method. Test if the standard
deviation of the length of cells is the same depending on the dose given
(Bartlett test).
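
Both parts of this example can be sketched with functions from base R: var.test for the F test between the two delivery methods, and bartlett.test for the three dose levels. Converting dose to a factor is done explicitly here because it is stored as a number in the data frame.

```r
# F test: variance of len in the two delivery-method groups (OJ vs. VC)
res_f <- var.test(len ~ supp, data = ToothGrowth)
print(res_f)

# Bartlett test: equality of variances across the three dose levels
dose_f <- factor(ToothGrowth$dose)
res_bartlett <- bartlett.test(ToothGrowth$len, dose_f)
print(res_bartlett)
```

In both cases a p-value above the chosen significance level means the hypothesis of equal variances is not rejected.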

For non-normal populations and for comparing more than two groups,
there is the Levene test.
In R: leveneTest(var ~ groups) in the "car" package.
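
As a rough illustration, the Levene test can be reproduced in base R as a one-way ANOVA on the absolute deviations from the group medians, which corresponds to the default setting (center = median) of leveneTest. The sketch applies it to the ToothGrowth dose groups and calls car::leveneTest only if the car package happens to be installed.

```r
len <- ToothGrowth$len
dose_f <- factor(ToothGrowth$dose)

# Absolute deviations of each observation from its group median
abs_dev <- abs(len - ave(len, dose_f, FUN = median))

# One-way ANOVA on the deviations gives the Levene statistic and p-value
levene_manual <- anova(lm(abs_dev ~ dose_f))
levene_p <- levene_manual[["Pr(>F)"]][1]
print(levene_manual)

# Optional cross-check, only when the car package is available
if (requireNamespace("car", quietly = TRUE)) {
  print(car::leveneTest(len, dose_f))
}
```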



Comparison of means of k > 2 populations

Let $\mu_1, \mu_2, \ldots, \mu_k$ be the means of k populations. We test the hypotheses:
$H_0: \mu_1 = \mu_2 = \ldots = \mu_k$
$H_a$: at least one mean is different
The following must hold:
i) the observations must be independent
ii) the groups (samples) have the same variance
iii) the errors (differences between the values and the group means) have
a normal distribution
The ANOVA test is used.



ANOVA I

In R: summary(aov(var ~ group))

Example
The data set ”WeightLoss” in the ”car” package contains data about the
weight loss of 34 subjects included in a study, belonging to three groups -
control, diet and diet+sport. We would like to compare the weight loss
after two months (”wl2” variable) between these groups.
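
A possible sketch for this example follows. In current versions of car the data sets are stored in the companion package carData, so the code loads WeightLoss from there and skips the analysis if that package is not installed; this guard is an addition made here.

```r
if (requireNamespace("carData", quietly = TRUE)) {
  data("WeightLoss", package = "carData")

  # One-way ANOVA: weight loss after two months (wl2) by group
  anova_wl2 <- summary(aov(wl2 ~ group, data = WeightLoss))
  print(anova_wl2)
} else {
  message("Package 'carData' is not installed; skipping the WeightLoss example.")
}
```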

Notations:
xij is the jth observation (value) in group i
ni the number of observations in group i
x̄i the mean for group i
The ANOVA model:
xij = µ + αi + εij
where
ANOVA II

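$\mu$ is a constant
$\alpha_i$ is a constant specific to the $i$th group, with $\sum_{i=1}^{k} \alpha_i = 0$
$\varepsilon_{ij}$ is an error term with the $N(0, \sigma^2)$ distribution.
Equivalent hypotheses:
$H_0: \alpha_1 = \alpha_2 = \ldots = \alpha_k = 0$
$H_a$: there exists an $\alpha_i \neq 0$
We define:
between-groups variation (sum of squares):
$$SSB = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2$$
where $\bar{x}$ is the overall mean of all observations.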



ANOVA III
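within-groups variation (sum of squares):
$$SSW = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2$$

total variation (sum of squares):
$$SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2$$

We have that
$$SST = SSB + SSW$$
The principle behind ANOVA: if the means of the groups are significantly
different, SSB will be large relative to SSW.
The test statistic:
$$F = \frac{SSB/(k - 1)}{SSW/(n - k)},$$
where $n = \sum_{i=1}^{k} n_i$ is the total number of observations.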
ANOVA IV

Assuming the null hypothesis is true, F has an F(k − 1, n − k) distribution.

We reject H0 if the p-value associated with the calculated value of the test
statistic is below the significance level of the test, α.
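
To make the decomposition concrete, the sketch below computes SSB, SSW, SST and the F statistic by hand and compares them with the output of aov. The built-in PlantGrowth data (weight by group) is used purely as an illustration; it is a choice made here and not one of the ANOVA examples above.

```r
y <- PlantGrowth$weight
g <- PlantGrowth$group
n <- length(y)     # total number of observations
k <- nlevels(g)    # number of groups

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

SSB <- sum(group_sizes * (group_means - grand_mean)^2)  # between groups
SSW <- sum((y - ave(y, g))^2)                           # within groups
SST <- sum((y - grand_mean)^2)                          # total

F_stat  <- (SSB / (k - 1)) / (SSW / (n - k))
p_anova <- pf(F_stat, k - 1, n - k, lower.tail = FALSE)

# Cross-check against R's built-in ANOVA table
fit_table <- summary(aov(weight ~ group, data = PlantGrowth))[[1]]
print(c(SSB = SSB, SSW = SSW, SST = SST, F = F_stat, p = p_anova))
print(fit_table)
```

The hand-computed F and p-value should match the first row of the aov table, and SSB + SSW should equal SST up to rounding.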



The Kruskal-Wallis test

If the assumptions ii) or iii) of the ANOVA test are not met, then we use
the Kruskal-Wallis test instead.

In R: kruskal.test(var ~ group)
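
A minimal usage sketch, again on the built-in PlantGrowth data (an illustrative choice made here), comparing the weight distributions across the three treatment groups:

```r
# Kruskal-Wallis rank sum test: weight by treatment group
res_kw <- kruskal.test(weight ~ group, data = PlantGrowth)
print(res_kw)

# Decision at the 0.05 significance level
reject_kw <- res_kw$p.value < 0.05
```

The statistic is compared with a χ2 distribution with k − 1 degrees of freedom, here 2.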
