Association 1

Measures of Association
Why do we use statistics in epidemiology?
Generalization of conclusions: sample → population
Assess strength of evidence
Make comparisons
Make predictions
Role of Statistics in Public Health and Medicine
Science                                   Statistics
1. Idea or question                       1. Statistical model / hypothesis
2. Collect data and make observations     2. Study design
3. Describe data                          3. Descriptive statistics
4. Assess the strength of evidence        4. Inferential statistics
   for or against the hypothesis
There are basically two main fields in
statistics:
Point Estimation
For example, estimating Relative Risks
Hypothesis Testing
For example, testing whether Relative Risk = 1
Types of Data
Categorical (qualitative)
Nominal scale - no natural order
gender, marital status, race
Ordinal scale
severity scale, good/better/best
Types of Data
Numerical (quantitative)
Discrete - (few) integer values
number of children in a family
Continuous - measured to arbitrary precision
blood pressure, weight
Dependent versus Independent Variables
These terms developed out of an experimental research paradigm
Dependent variables are the traits that we are trying to explain or predict
Independent variables are the traits that we are using to try to explain or predict the dependent variable
Overview of Association Methods
Dependent variable       Independent variable     Method
Categorical (discrete)   Categorical (discrete)   Relative risk (C.I.)
                                                  Odds ratio (C.I.)
                                                  Chi-square test
                                                  Test of proportions
Categorical (discrete)   Continuous               Logistic regression
                                                  Discriminant analysis
Continuous               Continuous               Linear regression
                                                  Correlations
Continuous               Categorical (discrete)   t-test
                                                  Analysis of variance
WHAT is statistical significance and WHAT does it tell us?
We use statistics to tell us whether apparent differences between samples are real or due to chance
A p-value is the probability that a TEST STATISTIC would be as extreme or more extreme than observed if the null hypothesis were true.
For example, a p-value of 0.05 means that, if the null hypothesis were true, a test statistic at least this extreme would be observed 5% of the time by chance alone (i.e., when there is no real difference).
What is a confidence interval?
A 95% confidence interval is built by a procedure that, over repeated samples, produces intervals containing the true value of the parameter 95% of the time.
[Figure: a 95% confidence interval nested inside a wider 99% confidence interval, both centered on the point estimate]
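The "95%" in a confidence interval can be read as a long-run coverage rate, and a small simulation makes that concrete. This is only a sketch: the population mean, standard deviation, sample size, and number of repetitions are made-up values that do not come from these slides, and it assumes a known-sigma normal interval for simplicity.

```python
import math
import random

def coverage(true_mean=120.0, sigma=15.0, n=30, reps=2000, z=1.96, seed=1):
    """Fraction of known-sigma z-intervals that contain the true mean.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        # Draw one sample of size n and compute its mean.
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        # Does this sample's interval contain the true mean?
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / reps

print(f"Share of intervals covering the true mean: {coverage():.3f}")
```

The printed share should land near 0.95. Any single interval either contains the true value or it does not.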
Measures of association for categorical dependent variables and categorical independent variables
Test of Proportions and Chi-Square test are used in related but different situations
1) We sample members of 2 groups and classify each member according to some qualitative characteristic (e.g. cigarette smoking). The hypothesis is
H0: groups are homogeneous ($p_{1j} = p_{2j}$ for every category j)
HA: groups are not homogeneous
2) We sample members of a population and cross-classify each member according to two qualitative characteristics. The hypothesis is
H0: factors are independent ($p_{ij} = p_{i\cdot}\, p_{\cdot j}$)
HA: factors are not independent
Test of Proportions
The hypothesis that two groups are the same is addressed by the hypotheses:
H0 : p1 = p2
HA : p1 ≠ p2
A statistic useful for this comparison is the difference in the observed, or sample, proportions
$$\hat{p}_1 - \hat{p}_2$$
Q: What is the distribution of this statistic?
A: Approximately normal.
$$\hat{p}_1 - \hat{p}_2 \sim N\left(p_1 - p_2,\ \sigma^2\right), \qquad \sigma^2 = \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}$$
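Estimator for (p1 − p2):
$$\hat{p}_1 - \hat{p}_2 = \frac{X_1}{n_1} - \frac{X_2}{n_2}$$
Standard deviation:
$$\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}, \qquad q_i = 1 - p_i$$
A (1 − α)100% large-sample confidence interval for (p1 − p2):
$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$$
[Figure: Standard Normal Distribution, Z]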
Confidence Interval for a difference in proportions
Note: A common estimate isn't used when confidence intervals are computed for the difference in the population proportions, p1 − p2. In this case, we don't have any assumption regarding the relationship between p1 and p2, so we use the following 95% interval built from the separate estimates:
$$(\hat{p}_1 - \hat{p}_2) \pm 1.96 \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$
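Simple Test of proportions
The test statistic used for testing H0: p1 = p2 is:
$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\frac{\hat{p}_0(1-\hat{p}_0)}{n_1} + \frac{\hat{p}_0(1-\hat{p}_0)}{n_2}}}$$
where $\hat{p}_0 = (X_1 + X_2)/(n_1 + n_2)$ is the common (pooled) estimate of the proportion under H0.
Note: The test would still be valid had we simply used the separate estimates, $\hat{p}_1$ and $\hat{p}_2$, instead of the common estimate based on H0.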
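A minimal sketch of the pooled test statistic, using only the Python standard library. The function names are illustrative, and the counts are the ones from the blood-pressure example in these slides (38 of 50 women, 65 of 100 men). The two-sided p-value uses the standard normal tail via `math.erfc`.

```python
import math

def pooled_z(x1, n1, x2, n2):
    """z statistic for H0: p1 = p2 using the common (pooled) estimate."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p0_hat = (x1 + x2) / (n1 + n2)  # pooled proportion under H0
    se0 = math.sqrt(p0_hat * (1 - p0_hat) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / se0

def two_sided_p(z):
    """P(|Z| >= |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2))

z = pooled_z(38, 50, 65, 100)
print(f"z = {z:.3f}, two-sided p = {two_sided_p(z):.3f}")
```

Because the standard error here is pooled, the z value differs slightly from one computed with the separate estimates. Both versions lead to the same decision for these counts.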
Example:
Test for the difference between
a) the proportion of women who had their blood pressure reduced by a certain drug
versus
b) the proportion of men who had their blood pressure reduced.
In a study to investigate drug Y's potential in lowering blood pressure in hypertensive men and women, 50 women and 100 men were given the drug. At the end of the study, the results below were reported:
                     Men    Women
Sample size          100    50
# with reduced BP    65     38
a. Estimate the difference in the true proportions who had their blood pressure reduced with a 99% confidence interval.
H0: The proportion of men who had their blood pressure reduced is the same as that of the women who had their blood pressure reduced.
HA: The proportion of men who had their blood pressure reduced is not the same as that of the women who had their blood pressure reduced.
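Taking women as group 1 (n1 = 50) and men as group 2 (n2 = 100):
Proportion of women with reduced blood pressure:
38/50 = .76 = $\hat{p}_1$
Proportion of men with reduced blood pressure:
65/100 = .65 = $\hat{p}_2$
Point estimate for p1 − p2 = .76 − .65 = .11
Standard deviation of ($\hat{p}_1 - \hat{p}_2$):
$$\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} = \sqrt{\frac{.76 \times .24}{50} + \frac{.65 \times .35}{100}} = .0770$$
$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\text{standard deviation}} = \frac{.76 - .65}{.0770} = \frac{.11}{.0770} = 1.429$$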
Is this significant at a 0.05 level of significance?
[Figure: Standard Normal Distribution, Z]
What is the 99% confidence interval?
$$(\hat{p}_1 - \hat{p}_2) \pm 2.58 \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$
= .11 +/- (2.58)(.0770)
= .11 +/- .199
= (-0.089, 0.309)
Conclusion:
Since the 99% confidence interval contains 0, there is no significant difference between p1 and p2 at the 0.01 level.
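The arithmetic of this example can be rechecked with a short script. This is a sketch, not a reference implementation: the function name is illustrative, women are treated as group 1 and men as group 2, and the critical value 2.58 is the one used in the slides for 99% confidence.

```python
import math

def diff_in_proportions(x1, n1, x2, n2, z_crit):
    """Difference in sample proportions, its unpooled SE, and a z-based CI."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    diff = p1_hat - p2_hat
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    return diff, se, (diff - z_crit * se, diff + z_crit * se)

# Women: 38 of 50 with reduced BP; men: 65 of 100.
diff, se, (lo, hi) = diff_in_proportions(38, 50, 65, 100, z_crit=2.58)
z = diff / se
print(f"difference = {diff:.2f}, SE = {se:.4f}, z = {z:.3f}")
print(f"99% CI = ({lo:.3f}, {hi:.3f})")
print("significant at 0.05 (two-sided)?", abs(z) > 1.96)
```

The interval straddles 0 and |z| is below 1.96, so neither the 99% interval nor the 0.05-level test indicates a difference.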
Chi-square test
Is for the case where we sample members of a population and cross-classify each member according to two qualitative characteristics.
The hypothesis is
H0: factors are independent ($p_{ij} = p_{i\cdot}\, p_{\cdot j}$)
HA: factors are not independent
Chi-Square Test of Independence
Null Hypothesis: Variable 1 is independent of variable 2.
Alternative Hypothesis: Variable 1 is associated with variable 2.
$$\chi^2_{(df)} = \sum \frac{(O - E)^2}{E}$$
or
$$\chi^2 = \sum_{i,j=1}^{k} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})}, \qquad E(n_{ij}) = \frac{n_{i\cdot}\, n_{\cdot j}}{n_{\cdot\cdot}}$$
Degrees of freedom: (r-1)(c-1)
Chi-square Test Example
               Diabetic     Not Diabetic   Total
Smoking        50 (n11)     25 (n12)       75 (n1.)
Not smoking    20 (n21)     5 (n22)        25 (n2.)
Total          70 (n.1)     30 (n.2)       100 (n..)
H0: No association between smoking and diabetes.
HA: An association between smoking and diabetes.
Using:
$$\chi^2 = \sum_{i,j=1}^{k} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})}$$
1. Calculate the expected values:
               Diabetic             Not Diabetic         Total
Smoking        75*70/100 = 52.5     75*30/100 = 22.5     75
Not smoking    25*70/100 = 17.5     25*30/100 = 7.5      25
Total          70                   30                   100
2. Divide each squared difference (Obs - Exp)^2 by its expected value and add up the results
$$\chi^2 = \frac{(50 - 52.5)^2}{52.5} + \frac{(25 - 22.5)^2}{22.5} + \frac{(20 - 17.5)^2}{17.5} + \frac{(5 - 7.5)^2}{7.5} = 1.59$$
3. Figure out the degrees of freedom
df = (r-1)(c-1) = (2-1)(2-1) = 1
Upper percentiles of χ² distributions
[Figure: χ² density with k degrees of freedom; the upper-tail area beyond the percentile χ²(p, k) equals 1 - p]
df    0.90      0.95      0.99
1     2.706     3.841     6.635
5     9.236     11.070    15.086
10    15.987    18.307    23.209
15    22.307    24.996    30.578
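The three steps of the smoking and diabetes example can be sketched in code and compared against the table's 0.95 percentile for 1 degree of freedom (3.841). The function name is illustrative. The p-value line relies on the identity that, for df = 1 only, the upper tail of χ² equals `erfc(sqrt(x/2))`, so it should not be reused for larger tables.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic, df, and expected counts for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, expected

observed = [[50, 25],   # smoking: diabetic, not diabetic
            [20, 5]]    # not smoking
stat, df, expected = chi_square_independence(observed)
p_value_df1 = math.erfc(math.sqrt(stat / 2))  # valid for df = 1 only
print(f"chi-square = {stat:.2f}, df = {df}, p = {p_value_df1:.3f}")
print("exceeds 3.841 (0.95 percentile, df = 1)?", stat > 3.841)
```

The statistic falls below 3.841, so the null hypothesis of no association is not rejected at the 0.05 level.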
Back to RRs and ORs:
For assessing whether the main epidemiological measures of association (relative risk and odds ratio) are different from the null value, we focus on confidence interval estimation.
Confidence Interval of the RR
               Diseased   Not Diseased   Total
Exposure       a          b              a+b
No exposure    c          d              c+d
Total          a+c        b+d            a+b+c+d
$$RR = \frac{a/(a+b)}{c/(c+d)}$$
It's easiest to determine the confidence interval by taking the natural logarithm (ln) of the RR, because the sampling variance of the RR has a simple approximate formula only on the natural-log scale.
Variance for ln RR:
$$\mathrm{Var}(\ln RR) = \frac{b}{a(a+b)} + \frac{d}{c(c+d)}$$
95% Confidence Interval:
$$RR \cdot \exp\left[\pm 1.96 \sqrt{\mathrm{Var}(\ln RR)}\right] = \exp\left[\ln RR \pm 1.96 \sqrt{\mathrm{Var}(\ln RR)}\right]$$

Illustration:
Results of a cohort study that followed 100 non-diabetic nurses for 15 years. At the end of the 15 years their smoking behavior was related to their diabetic status.
               Diabetic   Not Diabetic   Total
Smoking        50         25             75
Not smoking    20         5              25
Total          70         30             100
RR = (50/75) / (20/25) = 0.666667/0.8 = 0.833333
ln RR = -0.18232
SD (ln RR) = ((25/(50*75)) + (5/(20*25)))^(1/2) = 0.1291
95% CI for the ln RR = -0.18232 +/- 1.96*0.1291
= (-0.4354, 0.0707)
95% CI for the RR = (e^-0.4354, e^0.0707)
= (0.647, 1.073)
Conclusion: since the CI contains 1, there is not a significant relationship between smoking and diabetes.
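The log-scale interval for the relative risk can be sketched as a small function. The function name and argument order (a, b, c, d, matching the 2x2 layout in these slides) are illustrative, and 1.96 is assumed for 95% confidence.

```python
import math

def relative_risk_ci(a, b, c, d, z=1.96):
    """Relative risk, SD of ln(RR), and a CI computed on the log scale.

    a, b = exposed diseased / not diseased; c, d = unexposed diseased / not diseased.
    """
    rr = (a / (a + b)) / (c / (c + d))
    sd_ln_rr = math.sqrt(b / (a * (a + b)) + d / (c * (c + d)))
    lower = math.exp(math.log(rr) - z * sd_ln_rr)
    upper = math.exp(math.log(rr) + z * sd_ln_rr)
    return rr, sd_ln_rr, (lower, upper)

rr, sd_ln_rr, (lower, upper) = relative_risk_ci(50, 25, 20, 5)
print(f"RR = {rr:.4f}, ln RR = {math.log(rr):.5f}, SD(ln RR) = {sd_ln_rr:.4f}")
print(f"95% CI = ({lower:.3f}, {upper:.3f}); contains 1: {lower < 1 < upper}")
```

Working on the log scale and exponentiating at the end keeps both limits positive, which a symmetric interval around the RR itself would not guarantee.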
Confidence Interval of the Odds Ratio
Odds for disease = p/(1-p)
$$OR = \frac{p_1/(1-p_1)}{p_2/(1-p_2)}$$
               Diseased   Not Diseased   Total
Exposure       a          b              a+b
No exposure    c          d              c+d
Total          a+c        b+d            a+b+c+d
Using the Taylor expansion method, an approximate (1-α)100% CI for the OR is:
$$CI = \frac{ad}{bc} \exp\left[\pm z_{\alpha/2} \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}\right]$$
Results of a case-control study that investigated the relation between diabetes and smoking.
               Diabetic   Not Diabetic   Total
Smoking        50         25             75
Not smoking    20         5              25
Total          70         30             100
H0: No association between smoking and diabetes.
HA: An association between smoking and diabetes.
OR = (50*5)/(20*25) = 0.5
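A minimal sketch of the odds ratio and its approximate confidence interval for the same 2x2 counts. The function name is illustrative, and z = 1.96 is an assumption for a 95% interval, since the slides do not carry this example through to a numeric interval.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio ad/bc and an approximate CI computed on the log scale."""
    odds_ratio = (a * d) / (b * c)
    sd_ln_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = odds_ratio * math.exp(-z * sd_ln_or)
    upper = odds_ratio * math.exp(z * sd_ln_or)
    return odds_ratio, (lower, upper)

odds_ratio, (lower, upper) = odds_ratio_ci(50, 25, 20, 5)
print(f"OR = {odds_ratio:.2f}, 95% CI = ({lower:.3f}, {upper:.3f})")
print("CI contains the null value 1?", lower < 1 < upper)
```

As with the relative risk, the null value for the odds ratio is 1, so the check is whether the interval contains 1.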
Measures of association for continuous dependent variables and categorical independent variables
Normal Distribution
A common probability model for continuous
data
Can be used to characterize the Binomial or
Poisson under certain circumstances
Bell-shaped curve
takes values between -∞ and +∞
symmetric about mean
mean=median=mode
Examples
birthweights, height, weight
The arithmetic mean is the most common measure of the central location of a sample.
$$\bar{X} = \frac{1}{n} \sum_{j=1}^{n} X_j$$
The standard deviation tells us how widely dispersed the values are around the mean. It is a measure of variation.
$$s^2 = \frac{1}{n-1} \sum_{j=1}^{n} (X_j - \bar{X})^2, \qquad s = \sqrt{s^2}$$
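The two formulas can be written out directly and cross-checked against Python's `statistics` module. The birthweight values below are hypothetical numbers invented for illustration. They are not data from these slides.

```python
import math
import statistics

# Hypothetical birthweights in grams (illustrative only).
birthweights = [3200, 3400, 2900, 3600, 3100]

n = len(birthweights)
mean = sum(birthweights) / n
# Sample variance divides by n - 1, not n.
variance = sum((x - mean) ** 2 for x in birthweights) / (n - 1)
sd = math.sqrt(variance)

print(f"mean = {mean:.1f}, sd = {sd:.2f}")
# Cross-check against the standard library's sample statistics.
print("matches statistics.mean:", math.isclose(mean, statistics.mean(birthweights)))
print("matches statistics.stdev:", math.isclose(sd, statistics.stdev(birthweights)))
```

`statistics.stdev` uses the same n - 1 divisor as the formula above, while `statistics.pstdev` divides by n.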
The T-Test
Tests for the equality of means in 2 groups
Null Hypothesis:
The two population means are equal
H0: μ1 - μ2 = 0 or μ1 = μ2
Alternate Hypothesis:
The two population means are different
HA: μ1 - μ2 ≠ 0 or μ1 ≠ μ2
Test Statistic:
$$t_{(df)} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
where
$$s_1^2 = \frac{\sum_{i=1}^{n_1} (X_{1i} - \bar{X}_1)^2}{n_1 - 1} \quad \text{and} \quad s_2^2 = \frac{\sum_{i=1}^{n_2} (X_{2i} - \bar{X}_2)^2}{n_2 - 1}$$
Degrees of freedom (df): n1 + n2 - 2

Example: Does drug X influence bilirubin levels?
In a study to determine the effectiveness of a drug in lowering the plasma bilirubin level, 14 subjects were randomly divided into two groups (drug X vs. placebo). After 14 days, the change in bilirubin was estimated in the two groups (day 1 - day 14).
                      Drug X        Placebo
Sample size           7             7
Mean change           1.26 units    0.78 units
Standard deviation    0.32 units    0.32 units
H0: The reduction in plasma bilirubin levels in the treatment and placebo groups was the same after 14 days.
HA: The reduction in plasma bilirubin levels in the treatment and placebo groups was different after 14 days.
Calculating the t-statistic:
$$t = \frac{1.26 - 0.78}{\sqrt{\frac{(.32)^2}{7} + \frac{(.32)^2}{7}}} = 2.806$$
Calculating the degrees of freedom:
n1 + n2 - 2 = 7 + 7 - 2 = 12
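The same calculation can be sketched from the summary statistics alone. The function name is illustrative. It follows the slide's formula (separate variances in the standard error, n1 + n2 - 2 degrees of freedom) rather than any particular library routine.

```python
import math

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic and degrees of freedom from group summary statistics."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t_stat = (mean1 - mean2) / standard_error
    df = n1 + n2 - 2
    return t_stat, df

# Drug X: mean change 1.26, SD 0.32, n = 7; placebo: 0.78, 0.32, n = 7.
t_stat, df = two_sample_t(1.26, 0.32, 7, 0.78, 0.32, 7)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

With equal group sizes and equal standard deviations, as here, this standard error coincides with the pooled-variance version.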
What is the p-value?
Upper percentile of t distribution
[Figure: t distribution; each tabulated value cuts off the stated upper-tail area]
df    0.10     0.05      0.01
1     3.078    6.3138    31.821
5     1.476    2.0150    3.365
10    1.372    1.8125    2.764
15    1.341    1.7530    2.602
Conclusion:
The mean drop in bilirubin levels of individuals on drug X was significantly greater than for individuals taking the placebo.
CI around a mean of a normally distributed variable
$$\text{Sample mean} = \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
Small sample (1-α)100% Confidence interval
$$\bar{X} \pm t_{\alpha/2} \frac{s}{\sqrt{n}}$$
Where s/√n is the estimated standard error of the mean

Illustration:
Given a mean of 0.53 and a standard deviation of .0559 where n=6, the 95% confidence interval would be
$$.53 \pm 2.571 \frac{.0559}{\sqrt{6}} \quad \text{or} \quad .53 \pm .059$$
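A short sketch of the small-sample interval. The critical value is passed in as a parameter because the Python standard library has no t-distribution quantile function. Here it is 2.571, the value used in this illustration for 5 degrees of freedom, and the standard deviation is taken as .0559. The function name is illustrative.

```python
import math

def mean_ci(mean, sd, n, t_crit):
    """Small-sample CI for a mean: mean +/- t_crit * sd / sqrt(n)."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width, half_width

lower, upper, half_width = mean_ci(0.53, 0.0559, 6, t_crit=2.571)
print(f"half-width = {half_width:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

For other sample sizes or confidence levels, the critical value would need to be looked up with n - 1 degrees of freedom.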
When the confidence intervals for two subgroups do not overlap, it indicates that the means are significantly different.
Summary: Steps in Statistical Analysis
1. Identify H0 and HA
2. Identify a test statistic
3. Determine a significance level, e.g. α = 0.05 or α = 0.01
4. Critical value determines rejection / acceptance region
5. p-value
6. Interpret the result
