
Measures of Association - Part II
Overview of Association Methods

Dependent Variable       Independent Variable     Method
Categorical (Discrete)   Categorical (Discrete)   Relative Risk (C.I.)
                                                  Odds Ratio (C.I.)
                                                  Chi-square test
                                                  Test of proportions
Categorical (Discrete)   Continuous               Logistic regression
                                                  Discriminant analysis
Overview of Association Methods (continued)

Dependent Variable   Independent Variable     Method
Continuous           Continuous               Linear regression
                                              Correlations
Continuous           Categorical (Discrete)   t-test
                                              Analysis of Variance
Summary: Steps in Statistical Analysis

1. Identify H0 and HA
2. Identify a test statistic
3. Determine a significance level, e.g. α = 0.05 or α = 0.01
4. The critical value determines the rejection / acceptance region
5. Compute the p-value
6. Interpret the result
Measures of association for
continuous dependent
variables
and
continuous independent
variables
Linear Regression
In linear regression one variable (X) is used to
predict another (Y).
X = independent, predictor variable
Y = dependent, response variable
We assume that we collect a sample of pairs
of observations,
(Xi, Yi) for i = 1, 2, ..., n
Modeling the relationship between X and Y
requires the specification of two components:
a systematic component and a random component.
Systematic Component:

E(Yi | Xi) = α + β·Xi
α = intercept
β = slope

Example: Percent body fat = α + β·(Abdomen Circumference)
Fitted: Percent body fat = -39.28 + 0.6313·(Abdomen Circumference)

[Scatterplot: percent body fat (0-60) versus abdomen circumference (50-150 cm), with the fitted regression line.]
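The fitted equation above can be used for prediction. A minimal sketch using the slide's fitted coefficients:

```python
# Fitted simple linear regression from the slide:
# Percent body fat = -39.28 + 0.6313 * (abdomen circumference in cm)
alpha, beta = -39.28, 0.6313  # intercept and slope reported on the slide

def predicted_body_fat(abdomen_cm):
    """Systematic component: E(Y | X) = alpha + beta * X."""
    return alpha + beta * abdomen_cm

# Predicted percent body fat for a 100 cm abdomen circumference
print(predicted_body_fat(100))  # -39.28 + 63.13 = 23.85
```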
Examples of Different Beta Values

[Four scatterplot panels plotting y against x over the range -2 to 2: beta positive, beta negative, beta zero, and nonlinear.]
For the simple linear model we can test
hypotheses regarding the estimated β:
H0: β = 0
HA: β ≠ 0

For a test statistic we use the standardized
estimator:

T = (b - 0) / √V(b) ~ t(n - 2)

where b is the estimated slope.
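The slope test can be sketched end to end in pure Python; the (x, y) pairs below are made up for illustration, not from the slides:

```python
import math

# Hypothetical data -- illustrative only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = sxy / sxx                       # estimated slope
a = ybar - b * xbar                 # estimated intercept
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s2 = sum(e ** 2 for e in residuals) / (n - 2)  # residual variance
se_b = math.sqrt(s2 / sxx)          # estimated standard error of the slope

T = (b - 0) / se_b                  # compare with t(n - 2)
print(b, T)
```

With a large T relative to the t(n - 2) critical value, H0: β = 0 would be rejected.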
Student's t distribution - two-tailed test

H0: β = 0
HA: β ≠ 0
(two-tailed because HA allows β > 0 or β < 0)

[Plot: density of the t distribution, symmetric about 0, shown over roughly -4 to 4.]
Upper percentile of t distribution

Area = α to the right of t(n, α)

df    α = 0.10   α = 0.05   α = 0.01
1     3.078      6.3138     31.821
5     1.476      2.0150     3.365
10    1.372      1.8125     2.764
15    1.341      1.7530     2.602

This is for a one-tailed test such as
H0: β = 0 and HA: β > 0.
Correlation

In our discussion of linear regression we designated
one variable as the explanatory, independent, or
predictor variable and the other as the dependent
variable. Sometimes there is no such distinction.
Sometimes all we want is a measure of the strength
of association between two (quantitative) variables.

[Scatterplot: thigh circumference (40-100 cm) versus knee circumference (30-50 cm).]
There are two common correlation measures:

1. Pearson Correlation Coefficient:
   - Based on the actual data values.
   - Measure of linear association.
   - Natural when the variables have Gaussian (normal) distributions.
   - Related to linear regression and R².

2. Spearman Rank Correlation:
   - Based on ranks of each variable (ranks assigned separately).
   - Useful measure of monotone association, which may not be linear.
Pearson's Correlation Coefficient

The correlation between two variables X
and Y is defined as:

ρ = E[(X - μX)(Y - μY)] / √(V(X)·V(Y))

Properties:
- The correlation is constrained: -1 ≤ ρ ≤ +1
- |ρ| = 1 means a perfect linear relationship: Y = a + bX
[Scatterplots: perfect positive correlation (ρ = 1), points lying exactly on an increasing line; perfect negative correlation (ρ = -1), points lying exactly on a decreasing line.]
We estimate the sample correlation
coefficient r using:

r = (1 / (N - 1)) · Σ(i=1 to N) [(Xi - X̄)/sX] · [(Yi - Ȳ)/sY]
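The sample correlation formula can be computed directly; a minimal sketch in pure Python:

```python
import math

def sample_correlation(xs, ys):
    """Pearson r = (1/(N-1)) * sum of standardized cross-products."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    return sum((x - xbar) * (y - ybar) / (sx * sy)
               for x, y in zip(xs, ys)) / (n - 1)

# Exactly linear data (y = 2x + 1), so r should be 1
print(sample_correlation([1, 2, 3, 4], [3, 5, 7, 9]))
```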
To test the hypothesis:

H0: ρ = 0
HA: ρ ≠ 0

We use the statistic:

T = r·√(n - 2) / √(1 - r²)

Under the null hypothesis:

T ~ t(n - 2)

NOTE: For the validity of the test we assume
that both X and Y are normally distributed
(bivariate normality).
Example: Knee circumference and
thigh circumference
n = 252, r = 0.799
H0: ρ = 0
HA: ρ ≠ 0

T = r·√(n - 2) / √(1 - r²)
  = 0.799·√(252 - 2) / √(1 - 0.799²)
  ≈ 21

Conclusion: reject H0 with p < .0001
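The arithmetic of this example can be checked directly:

```python
import math

# Values from the knee/thigh circumference example on the slide
n, r = 252, 0.799

# T = r * sqrt(n - 2) / sqrt(1 - r^2), compared with t(n - 2)
T = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(T)  # about 21, far beyond any t critical value for df = 250
```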


Spearman Rank Correlation

A nonparametric analogue to Pearson's
correlation coefficient is Spearman's rank
correlation coefficient. Use Spearman's
correlation when the assumption of
(bivariate) normality is not met.
- A measure of monotonic association (not necessarily linear)
- Based on the ranked data
- Rank each sample separately
- Compute Pearson's correlation rs on the ranks
- Test with T = rs·√(n - 2) / √(1 - rs²) ~ t(n - 2)
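The rank-then-correlate recipe above can be sketched in a few lines (assuming no tied values, for simplicity):

```python
def ranks(values):
    """Assign ranks 1..n to each value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(xs, ys):
    """Pearson correlation via cross-products of deviations."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Monotone but nonlinear relationship: y = x**3
x = [1, 2, 3, 4, 5]
y = [xi ** 3 for xi in x]
rs = pearson(ranks(x), ranks(y))  # Spearman rank correlation
print(rs, pearson(x, y))
```

Because the relationship is perfectly monotone, the Spearman coefficient is 1 even though the Pearson coefficient on the raw data is below 1.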
So much of the work in public health
has become heavily multidimensional.
Correlations provide a quick and
meaningful way to look at associations
among many variables, as a way of
categorizing variables into those that are:
- positively associated
- negatively associated
[Scatterplot matrix of pairwise associations among: density determined from underwater weighing; percent body fat from Siri's (1956) equation; weight (lbs); height (inches); neck circumference (cm); chest circumference (cm); abdomen circumference (cm); hip circumference (cm); knee circumference (cm).]
Measures of association for
categorical dependent
variables
and
continuous independent
variables
Logistic Regression

Logistic regression provides a powerful and
flexible tool for describing the relationship
between:
- Y, a dichotomous (diseased/non-diseased) outcome variable, and
- X, a set of predictor variables (continuous or discrete).
However, trying to model something that
is bounded between 0 and 1 is
mathematically difficult, so
some genius came up with the idea of
modeling the log odds of disease.

We already know that:
p/(1 - p) = disease odds
where 0 < p/(1 - p) < ∞
but this is still bounded on one side.
By taking the natural logarithm, we get a
nice symmetrically distributed variable to
work with:
ln(p/(1 - p)) = log odds of p = logit(p)
where -∞ < ln(p/(1 - p)) < +∞
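The odds and logit transformations are easy to verify numerically:

```python
import math

def odds(p):
    """Disease odds: bounded below by 0, unbounded above."""
    return p / (1 - p)

def logit(p):
    """Log odds: unbounded in both directions, symmetric about p = 0.5."""
    return math.log(odds(p))

print(logit(0.5))               # even odds give a logit of exactly 0
print(logit(0.9), logit(0.1))   # symmetric around zero
```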
Because of the statistical properties of the logit
of p we can now think of logistic regression as
if it were regular linear regression:

logit(p) = B0 + B1·X1 + B2·X2 + . . .

logit(p) = B0 + B1·Age + B2·Sex + . . .

But what is the interpretation of these Betas?
What is the interpretation of these Betas?

Intercept:
B0 = log odds of disease when all Xs are zero
It's not a very meaningful term unless the Xs
are discrete (0, 1); then
B0 = log odds of disease in the unexposed group
Ex: Suppose X1 = 1 if smoker, = 0 if non-smoker
Just like in linear regression, the regression
coefficient

B1 = change in (natural) log odds of
disease associated with a 1-unit
change in X1.

However, there is also an underlying
mathematical relationship such that:

B1 = ln(Odds Ratio)

Odds Ratio = e^B1

The details of this are left to a good biostats course.
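The relationship B1 = ln(OR) can be checked numerically; the coefficient below is hypothetical, chosen only to illustrate the exponentiation:

```python
import math

# Hypothetical fitted coefficient -- illustrative, not from the slides.
B1 = math.log(2.0)  # suppose the fitted log odds ratio is ln(2)

# Odds Ratio = e^B1: a 1-unit increase in X1 multiplies the odds by OR
odds_ratio = math.exp(B1)
print(odds_ratio)  # 2.0: each unit of X1 doubles the odds of disease
```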


Summarizing Measures of Association

When we examine the results of an
analytical epidemiological analysis of
associations we look at several features of
the statistics we estimate:
- The magnitude of the association statistic
- The direction of the association (positive versus negative)
- Whether these attributes change over important strata - age, sex, smoking, etc.
- The confidence interval or statistical significance of the association
Three studies examined relative risk for lung cancer:

                Study 1   Study 2   Study 3
Non-smokers     1.0       1.0       1.0
<1 pack/day     1.9       0.6       1.6
1-2 packs/day   3.9       3.2       10.6
>2 packs/day    7.9       1.0       100.5

Using your common sense, what do you think of these results?
These attributes of the association
statistics also help us:
- to identify new risk factors
- to rank risk factors according to their strength of association
- to test whether suspected confounders are significantly associated with exposure and disease
- to look for interaction (also known as effect modification) between two factors
What would evidence for interactions look like . . .

                               RR for CHD
Non-Smoker / Non-Drinker       1.0
Non-Smoker / Heavy Drinker     0.6
Smoker / Non-Drinker           2.1
Smoker / Heavy Drinker         9.3

It looks like something surprising is going on for
particular subgroups of a population relative to
other subgroups.
A summary of the types of measures of
association used in analytical epidemiology:

Type         Examples                     Usual application
Absolute     AR in exposed                Primary prevention impact
Difference   PAR
             Efficacy
             Mean differences             Search for causes
             Differences in proportions   Search for determinants
Relative     Relative Risk
Difference   Odds Ratio
             AR %, PAR %                  Search for causes
Measures of Association for 2 x 2
tables using different study designs:

Randomized trial
1. H0: P(Disease | Intervention) = P(Disease | No intervention)
2. RR for incident disease
3. χ² test

Cohort Analysis
1. H0: P(Disease | Exposed) = P(Disease | Not exposed)
2. RR for incident disease
3. χ² test

Case-Control Analysis
1. H0: P(Exposed | Disease) = P(Exposed | No disease)
2. OR
3. χ² test

Cross-sectional Analysis
1. H0: P(Disease | Exposed) = P(Disease | Not exposed)
2. RR or OR for prevalent disease
3. χ² test
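The RR and OR computations for a 2 x 2 table can be sketched as follows; the counts are hypothetical, chosen only for illustration:

```python
# Hypothetical 2 x 2 table (illustrative counts only):
#                 Disease   No disease
# Exposed          a = 30      b = 70
# Not exposed      c = 10      d = 90
a, b, c, d = 30, 70, 10, 90

risk_exposed = a / (a + b)      # 0.30
risk_unexposed = c / (c + d)    # 0.10
RR = risk_exposed / risk_unexposed  # relative risk (cohort / trial)
OR = (a * d) / (b * c)              # odds ratio (case-control)
print(RR, OR)
```

Note the OR (about 3.86) overstates the RR (3.0) here because the disease is not rare in this table.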
Statistics aren't the only way to talk about
association:
A non-statistical way to describe the strength of
association between a risk factor and an outcome is to
put it in terms of associations that are well known.

For example: A relative risk of 2.2 for coronary heart
disease mortality comparing men drinking 9+ cups of
coffee/day versus <1 cup/day corresponds to:
- Smoking 4.3 cigarettes/day
- Increasing Systolic BP by 6.9 mm Hg
- Increasing Total Serum Cholesterol by 0.47 mmol/l

Tverdal et al. (1990) Coffee consumption and death from CHD in middle-aged
Norwegian men and women. BMJ 300:566-569.
Example Problem: Suppose that 1000
individuals agreed to participate in a study
to test whether a new 8-week exercise
program really reduces physiological and
psychological stress. Participants are
randomized into two exercise groups, given
a questionnaire to collect demographic
information, and their stress levels followed
for eight weeks.

What kind of study is this?

How could we tell if the randomization procedure worked?
What kind of study is this?
Randomized trial or intervention trial
Key features: randomized, followed
over time, investigator designed
the experiment.

How could we tell if the randomization
procedure worked?
Look at the distribution of background
characteristics (means, proportions) in one
treatment group versus the other treatment group.
The descriptive statistics for background
demographic characteristics are given
below.

             Group 1   Group 2
% Females    44.0      51.0
% Age >25    58.0      52.0
% BMI >25    65.0      45.0
% CESD>16    32.0      29.0

How many people are in Group 1 and Group 2?

How many people are female in Group 1?
Simple Test of Proportions

The test statistic used for testing H0: p1 = p2 is:

Z = (p̂1 - p̂2 - 0) / √[ p̂0(1 - p̂0)/n1 + p̂0(1 - p̂0)/n2 ]

where p̂0 is the common (pooled) estimate of the proportion under H0.

Note: The test is still valid if we had simply used the
separate estimates, p̂1 and p̂2, instead of the common
estimate based on H0.
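The Z statistic above can be computed for the % Females comparison from the exercise example. The equal group sizes of 500 are an assumption (1000 participants split evenly), not stated on the slide:

```python
import math

# % Females in the two randomized groups (44% vs 51%); group sizes of
# 500 each are an ASSUMPTION -- the slide only gives 1000 total.
p1, n1 = 0.44, 500
p2, n2 = 0.51, 500

p0 = (p1 * n1 + p2 * n2) / (n1 + n2)   # pooled proportion under H0
se = math.sqrt(p0 * (1 - p0) / n1 + p0 * (1 - p0) / n2)
Z = (p1 - p2 - 0) / se
print(Z)  # compare |Z| with 1.96 for a two-sided test at alpha = 0.05
```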
Another Example Problem:
In a study to determine the effects
of alcohol on plasma cholesterol levels,
investigators selected 300 physicians
aged 45-55 years. The investigators
measured the cholesterol of all the
physicians and classified them as either
heavy or not heavy drinkers. The average
cholesterol among the 75 heavy drinkers
was 190 mg/dl (Standard Deviation
[SD] = 15), while that among the not
heavy drinkers was 210 mg/dl (SD = 15).
What kind of study is this?
What's the null hypothesis?

Study Design: Cross-Sectional
Key features: no time element, no
imposed design
Null hypothesis:
Mean cholesterol in heavy drinkers =
Mean cholesterol in not heavy drinkers

What kind of test of association would you do?
Could you approach this problem using
confidence intervals?
Yes, there are two ways.
Way 1: Take the difference in the means
and get a confidence interval around that
difference. (This is a direct analog to the t-test.)

Way 2: Estimate the confidence interval
around the mean of heavy drinkers and also
around the mean of not heavy drinkers. If
the C.I.s overlap, they are not significantly
different.
CI around a mean of a normally distributed variable:

Sample mean = X̄ = Σ(i=1 to n) Xi / n

Small sample (1 - α)·100% confidence interval:

X̄ ± t(α/2) · s/√n

where s/√n is the estimated standard error of the mean.
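The small-sample CI can be computed for the heavy-drinker group from the alcohol example. The critical value t(0.025, df = 74) ≈ 1.993 is taken from a t table and is an assumption here, since the slides do not give it:

```python
import math

# Heavy-drinker group from the alcohol example: mean 190 mg/dl, SD 15, n = 75
xbar, s, n = 190.0, 15.0, 75
t_crit = 1.993  # t(0.025, df = 74) from a t table -- an assumed value

se = s / math.sqrt(n)                     # estimated standard error of the mean
lo, hi = xbar - t_crit * se, xbar + t_crit * se
print(lo, hi)  # the 95% CI around the heavy-drinker mean
```

Repeating this for the not-heavy drinkers (mean 210, SD 15, n = 225) and checking whether the two intervals overlap is Way 2 above.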


[Diagram: 95% CIs for Group 1 and Group 2 shown side by side, first overlapping, then not overlapping.]
