
Measures of Association - Part II
Overview of Association Methods

Dependent Variable       Independent Variable     Method
Categorical (Discrete)   Categorical (Discrete)   Relative Risk (C.I.)
                                                  Odds Ratio (C.I.)
                                                  Chi-square test
                                                  Test of proportions
Categorical (Discrete)   Continuous               Logistic regression
                                                  Discriminant analysis
Overview of Association Methods (continued)

Dependent Variable   Independent Variable     Method
Continuous           Continuous               Linear regression
                                              Correlations
Continuous           Categorical (Discrete)   t-test
                                              Analysis of Variance
Summary: Steps in Statistical Analysis

1. Identify H0 and HA
2. Identify a test statistic
3. Determine a significance level, e.g. α = 0.05 or α = 0.01
4. The critical value determines the rejection / acceptance region
5. Compute the p-value
6. Interpret the result
Measures of association for
continuous dependent
variables
and
continuous independent
variables
Linear Regression
In linear regression one variable (X) is used to
predict another (Y).
X = independent, predictor variable
Y = dependent, response variable
We assume that we collect a sample of pairs
of observations,
(Xi, Yi) for i = 1, 2, ..., n
Modeling the relationship between X and Y
requires the specification of two components:
a systematic component and a random component.
Systematic Component:

E(Yi | Xi) = α + β·Xi
α = intercept
β = slope

Example: Percent body fat = α + β·(Abdomen Circumference)
Fitted: Percent body fat = -39.28 + 0.6313·(Abdomen Circumference)

[Scatterplot: percent body fat (0-60) versus abdomen circumference (50-150 cm), with the fitted regression line.]
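The fitted equation above can be used for prediction. A minimal sketch using the slide's fitted coefficients:

```python
# Fitted simple linear regression from the slide:
# Percent body fat = -39.28 + 0.6313 * (abdomen circumference in cm)
alpha, beta = -39.28, 0.6313  # intercept and slope reported on the slide

def predicted_body_fat(abdomen_cm):
    """Systematic component: E(Y | X) = alpha + beta * X."""
    return alpha + beta * abdomen_cm

# Predicted percent body fat for a 100 cm abdomen circumference
print(predicted_body_fat(100))  # -39.28 + 63.13 = 23.85
```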
Examples of Different Beta Values

[Four scatterplot panels plotting y against x over the range -2 to 2: beta positive, beta negative, beta zero, and nonlinear.]
For the simple linear model we can test
hypotheses regarding the estimated β:
H0: β = 0
HA: β ≠ 0

For a test statistic we use the standardized
estimator:

T = (b - 0) / √V(b) ~ t(n - 2)

where b is the estimated slope.
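The slope test can be sketched end to end in pure Python; the (x, y) pairs below are made up for illustration, not from the slides:

```python
import math

# Hypothetical data -- illustrative only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = sxy / sxx                       # estimated slope
a = ybar - b * xbar                 # estimated intercept
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s2 = sum(e ** 2 for e in residuals) / (n - 2)  # residual variance
se_b = math.sqrt(s2 / sxx)          # estimated standard error of the slope

T = (b - 0) / se_b                  # compare with t(n - 2)
print(b, T)
```

With a large T relative to the t(n - 2) critical value, H0: β = 0 would be rejected.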
Student's t distribution - two-tailed test

H0: β = 0
HA: β ≠ 0
(two-tailed because HA allows β > 0 or β < 0)

[Plot: density of the t distribution, symmetric about 0, shown over roughly -4 to 4.]
Upper percentile of t distribution

Area = α to the right of t(n, α)

df    α = 0.10   α = 0.05   α = 0.01
1     3.078      6.3138     31.821
5     1.476      2.0150     3.365
10    1.372      1.8125     2.764
15    1.341      1.7530     2.602

This is for a one-tailed test such as
H0: β = 0 and HA: β > 0.
Correlation

In our discussion of linear regression we designated
one variable as the explanatory, independent, or
predictor variable and the other as the dependent
variable. Sometimes there is no such distinction.
Sometimes all we want is a measure of the strength
of association between two (quantitative) variables.

[Scatterplot: thigh circumference (40-100 cm) versus knee circumference (30-50 cm).]
There are two common correlation measures:

1. Pearson Correlation Coefficient:
   - Based on the actual data values.
   - Measure of linear association.
   - Natural when the variables have Gaussian (normal) distributions.
   - Related to linear regression and R².

2. Spearman Rank Correlation:
   - Based on ranks of each variable (ranks assigned separately).
   - Useful measure of monotone association, which may not be linear.
Pearson's Correlation Coefficient

The correlation between two variables X
and Y is defined as:

ρ = E[(X - μX)(Y - μY)] / √(V(X)·V(Y))

Properties:
- The correlation is constrained: -1 ≤ ρ ≤ +1
- |ρ| = 1 means a perfect linear relationship: Y = a + bX
[Scatterplots: perfect positive correlation (ρ = 1), points lying exactly on an increasing line; perfect negative correlation (ρ = -1), points lying exactly on a decreasing line.]
We estimate the sample correlation
coefficient r using:

r = (1 / (N - 1)) · Σ(i=1 to N) [(Xi - X̄)/sX] · [(Yi - Ȳ)/sY]
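The sample correlation formula can be computed directly; a minimal sketch in pure Python:

```python
import math

def sample_correlation(xs, ys):
    """Pearson r = (1/(N-1)) * sum of standardized cross-products."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    return sum((x - xbar) * (y - ybar) / (sx * sy)
               for x, y in zip(xs, ys)) / (n - 1)

# Exactly linear data (y = 2x + 1), so r should be 1
print(sample_correlation([1, 2, 3, 4], [3, 5, 7, 9]))
```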
To test the hypothesis:

H0: ρ = 0
HA: ρ ≠ 0

We use the statistic:

T = r·√(n - 2) / √(1 - r²)

Under the null hypothesis:

T ~ t(n - 2)

NOTE: For the validity of the test we assume
that both X and Y are normally distributed
(bivariate normality).
Example: Knee circumference and
thigh circumference
n = 252, r = 0.799
H0: ρ = 0
HA: ρ ≠ 0

T = r·√(n - 2) / √(1 - r²)
  = 0.799·√(252 - 2) / √(1 - 0.799²)
  ≈ 21

Conclusion: reject H0 with p < .0001
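The arithmetic of this example can be checked directly:

```python
import math

# Values from the knee/thigh circumference example on the slide
n, r = 252, 0.799

# T = r * sqrt(n - 2) / sqrt(1 - r^2), compared with t(n - 2)
T = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(T)  # about 21, far beyond any t critical value for df = 250
```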


Spearman Rank Correlation

A nonparametric analogue to Pearson's
correlation coefficient is Spearman's rank
correlation coefficient. Use Spearman's
correlation when the assumption of
(bivariate) normality is not met.
- A measure of monotonic association (not necessarily linear)
- Based on the ranked data
- Rank each sample separately
- Compute Pearson's correlation rs on the ranks
- Test with T = rs·√(n - 2) / √(1 - rs²) ~ t(n - 2)
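The rank-then-correlate recipe above can be sketched in a few lines (assuming no tied values, for simplicity):

```python
def ranks(values):
    """Assign ranks 1..n to each value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(xs, ys):
    """Pearson correlation via cross-products of deviations."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Monotone but nonlinear relationship: y = x**3
x = [1, 2, 3, 4, 5]
y = [xi ** 3 for xi in x]
rs = pearson(ranks(x), ranks(y))  # Spearman rank correlation
print(rs, pearson(x, y))
```

Because the relationship is perfectly monotone, the Spearman coefficient is 1 even though the Pearson coefficient on the raw data is below 1.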
So much of the work in public health
has become heavily multidimensional.
Correlations provide a quick and
meaningful way to look at associations
among many variables, as a way of
categorizing variables into those that are:
- positively associated
- negatively associated
[Scatterplot matrix of pairwise associations among: density determined from underwater weighing; percent body fat from Siri's (1956) equation; weight (lbs); height (inches); neck circumference (cm); chest circumference (cm); abdomen circumference (cm); hip circumference (cm); knee circumference (cm).]
Measures of association for
categorical dependent
variables
and
continuous independent
variables
Logistic Regression

Logistic regression provides a powerful and
flexible tool for describing the relationship
between:
- Y, a dichotomous (diseased/non-diseased) outcome variable, and
- X, a set of predictor variables (continuous or discrete).
However, trying to model something that
is bounded between 0 and 1 is
mathematically difficult, so
some genius came up with the idea of
modeling the log odds of disease.

We already know that:
p/(1 - p) = disease odds
where 0 < p/(1 - p) < ∞
but this is still bounded on one side.
By taking the natural logarithm, we get a
nice symmetrically distributed variable to
work with:
ln(p/(1 - p)) = log odds of p = logit(p)
where -∞ < ln(p/(1 - p)) < +∞
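The odds and logit transformations are easy to verify numerically:

```python
import math

def odds(p):
    """Disease odds: bounded below by 0, unbounded above."""
    return p / (1 - p)

def logit(p):
    """Log odds: unbounded in both directions, symmetric about p = 0.5."""
    return math.log(odds(p))

print(logit(0.5))               # even odds give a logit of exactly 0
print(logit(0.9), logit(0.1))   # symmetric around zero
```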
Because of the statistical properties of the logit
of p we can now think of logistic regression as
if it were regular linear regression:

logit(p) = B0 + B1·X1 + B2·X2 + . . .

logit(p) = B0 + B1·Age + B2·Sex + . . .

But what is the interpretation of these Betas?
What is the interpretation of these Betas?

Intercept:
B0 = log odds of disease when all Xs are zero
It's not a very meaningful term unless the Xs
are discrete (0, 1); then
B0 = log odds of disease in the unexposed group
Ex: Suppose X1 = 1 if smoker, = 0 if non-smoker
Just like in linear regression, the regression
coefficient

B1 = change in (natural) log odds of
disease associated with a 1-unit
change in X1.

However, there is also an underlying
mathematical relationship such that:

B1 = ln(Odds Ratio)

Odds Ratio = e^B1

The details of this are left to a good biostats course.
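The relationship B1 = ln(OR) can be checked numerically; the coefficient below is hypothetical, chosen only to illustrate the exponentiation:

```python
import math

# Hypothetical fitted coefficient -- illustrative, not from the slides.
B1 = math.log(2.0)  # suppose the fitted log odds ratio is ln(2)

# Odds Ratio = e^B1: a 1-unit increase in X1 multiplies the odds by OR
odds_ratio = math.exp(B1)
print(odds_ratio)  # 2.0: each unit of X1 doubles the odds of disease
```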


Summarizing Measures of Association

When we examine the results of an
analytical epidemiological analysis of
associations we look at several features of
the statistics we estimate:
- The magnitude of the association statistic
- The direction of the association (positive versus negative)
- Whether these attributes change over important strata - age, sex, smoking, etc.
- The confidence interval or statistical significance of the association
Three studies examined relative risk for lung cancer:

                Study 1   Study 2   Study 3
Non-smokers     1.0       1.0       1.0
<1 pack/day     1.9       0.6       1.6
1-2 packs/day   3.9       3.2       10.6
>2 packs/day    7.9       1.0       100.5

Using your common sense, what do you think of these results?
These attributes of the association
statistics also help us:
- to identify new risk factors
- to rank risk factors according to their strength of association
- to test whether suspected confounders are significantly associated with exposure and disease
- to look for interaction (also known as effect modification) between two factors
What would evidence for interactions look like . . .

                               RR for CHD
Non-Smoker / Non-Drinker       1.0
Non-Smoker / Heavy Drinker     0.6
Smoker / Non-Drinker           2.1
Smoker / Heavy Drinker         9.3

It looks like something surprising is going on for
particular subgroups of a population relative to
other subgroups.
A summary of the types of measures of
association used in analytical epidemiology:

Type         Examples                     Usual application
Absolute     AR in exposed                Primary prevention impact
Difference   PAR
             Efficacy
             Mean differences             Search for causes
             Differences in proportions   Search for determinants
Relative     Relative Risk
Difference   Odds Ratio
             AR %, PAR %                  Search for causes
Measures of Association for 2 x 2
tables using different study designs:

Randomized trial
1. H0: P(Disease | Intervention) = P(Disease | No intervention)
2. RR for incident disease
3. χ² test

Cohort Analysis
1. H0: P(Disease | Exposed) = P(Disease | Not exposed)
2. RR for incident disease
3. χ² test

Case-Control Analysis
1. H0: P(Exposed | Disease) = P(Exposed | No disease)
2. OR
3. χ² test

Cross-sectional Analysis
1. H0: P(Disease | Exposed) = P(Disease | Not exposed)
2. RR or OR for prevalent disease
3. χ² test
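The RR and OR computations for a 2 x 2 table can be sketched as follows; the counts are hypothetical, chosen only for illustration:

```python
# Hypothetical 2 x 2 table (illustrative counts only):
#                 Disease   No disease
# Exposed          a = 30      b = 70
# Not exposed      c = 10      d = 90
a, b, c, d = 30, 70, 10, 90

risk_exposed = a / (a + b)      # 0.30
risk_unexposed = c / (c + d)    # 0.10
RR = risk_exposed / risk_unexposed  # relative risk (cohort / trial)
OR = (a * d) / (b * c)              # odds ratio (case-control)
print(RR, OR)
```

Note the OR (about 3.86) overstates the RR (3.0) here because the disease is not rare in this table.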
Statistics aren't the only way to talk about
association:
A non-statistical way to describe the strength of
association between a risk factor and an outcome is to
put it in terms of associations that are well known.

For example: A relative risk of 2.2 for coronary heart
disease mortality comparing men drinking 9+ cups of
coffee/day versus <1 cup/day corresponds to:
- Smoking 4.3 cigarettes/day
- Increasing Systolic BP by 6.9 mm Hg
- Increasing Total Serum Cholesterol by 0.47 mmol/l

Tverdal et al. (1990) Coffee consumption and death from CHD in middle-aged
Norwegian men and women. BMJ 300:566-569.
Example Problem: Suppose that 1000
individuals agreed to participate in a study
to test whether a new 8-week exercise
program really reduces physiological and
psychological stress. Participants are
randomized into two exercise groups, given
a questionnaire to collect demographic
information, and their stress levels followed
for eight weeks.

What kind of study is this?

How could we tell if the randomization procedure worked?
What kind of study is this?
Randomized trial or intervention trial
Key features: randomized, followed
over time, investigator designed
the experiment.

How could we tell if the randomization
procedure worked?
Look at the distribution of background
characteristics (means, proportions) in one
treatment group versus the other treatment group.
The descriptive statistics for background
demographic characteristics are given
below.

             Group 1   Group 2
% Females    44.0      51.0
% Age >25    58.0      52.0
% BMI >25    65.0      45.0
% CESD>16    32.0      29.0

How many people are in Group 1 and Group 2?

How many people are female in Group 1?
Simple Test of Proportions

The test statistic used for testing H0: p1 = p2 is:

Z = (p̂1 - p̂2 - 0) / √[ p̂0(1 - p̂0)/n1 + p̂0(1 - p̂0)/n2 ]

where p̂0 is the common (pooled) estimate of the proportion under H0.

Note: The test is still valid if we had simply used the
separate estimates, p̂1 and p̂2, instead of the common
estimate based on H0.
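The Z statistic above can be computed for the % Females comparison from the exercise example. The equal group sizes of 500 are an assumption (1000 participants split evenly), not stated on the slide:

```python
import math

# % Females in the two randomized groups (44% vs 51%); group sizes of
# 500 each are an ASSUMPTION -- the slide only gives 1000 total.
p1, n1 = 0.44, 500
p2, n2 = 0.51, 500

p0 = (p1 * n1 + p2 * n2) / (n1 + n2)   # pooled proportion under H0
se = math.sqrt(p0 * (1 - p0) / n1 + p0 * (1 - p0) / n2)
Z = (p1 - p2 - 0) / se
print(Z)  # compare |Z| with 1.96 for a two-sided test at alpha = 0.05
```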
Another Example Problem:
In a study to determine the effects
of alcohol on plasma cholesterol levels,
investigators selected 300 physicians
aged 45-55 years. The investigators
measured the cholesterol of all the
physicians and classified them as either
heavy or not heavy drinkers. The average
cholesterol among the 75 heavy drinkers
was 190 mg/dl (Standard Deviation
[SD] = 15), while that among the not
heavy drinkers was 210 mg/dl (SD = 15).
What kind of study is this?
What's the null hypothesis?

Study Design: Cross-Sectional
Key features: no time element, no
imposed design
Null hypothesis:
Mean cholesterol in heavy drinkers =
Mean cholesterol in not heavy drinkers

What kind of test of association would you do?
Could you approach this problem using
confidence intervals?
Yes, there are two ways.
Way 1: Take the difference in the means
and get a confidence interval around that
difference. (This is a direct analog to the t-test.)

Way 2: Estimate the confidence interval
around the mean of heavy drinkers and also
around the mean of not heavy drinkers. If
the C.I.s overlap, they are not significantly
different.
CI around a mean of a normally distributed variable:

Sample mean = X̄ = Σ(i=1 to n) Xi / n

Small sample (1 - α)·100% confidence interval:

X̄ ± t(α/2) · s/√n

where s/√n is the estimated standard error of the mean.
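The small-sample CI can be computed for the heavy-drinker group from the alcohol example. The critical value t(0.025, df = 74) ≈ 1.993 is taken from a t table and is an assumption here, since the slides do not give it:

```python
import math

# Heavy-drinker group from the alcohol example: mean 190 mg/dl, SD 15, n = 75
xbar, s, n = 190.0, 15.0, 75
t_crit = 1.993  # t(0.025, df = 74) from a t table -- an assumed value

se = s / math.sqrt(n)                     # estimated standard error of the mean
lo, hi = xbar - t_crit * se, xbar + t_crit * se
print(lo, hi)  # the 95% CI around the heavy-drinker mean
```

Repeating this for the not-heavy drinkers (mean 210, SD 15, n = 225) and checking whether the two intervals overlap is Way 2 above.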


[Diagram: 95% CIs for Group 1 and Group 2 shown side by side, first overlapping, then not overlapping.]
