
BS704 Review for Final Exam

Final Exam
You may bring 5 pages of notes
You MUST bring full copies of
statistical tables (on Blackboard)
You MUST bring a calculator

Topics Covered Since Midterm:


Hypothesis Testing for a single mean and
proportion, and for two means
One-way ANOVA
Chi-square Tests
Power and Sample size
Regression and Correlation

Logistic Regression
Survival analysis

Hypothesis Tests for...


Single mean μ
Single proportion p
Comparing two means μ1 - μ2
Paired (or matched) data μd

Conducting a Hypothesis Test


Define null and research hypotheses
Define test statistic, level of
significance and decision rule

Calculate test statistic based upon
sample data.
Use decision rule or p-value to decide
whether to reject or not reject the null
hypothesis.

Conducting a Hypothesis Test


For a single mean μ
If n ≥ 30, use z-test statistic
If n < 30 use t-test statistic

For a single proportion p


Use z-test statistic
Check assumptions

Conducting a Hypothesis Test


For comparing two means μ1 - μ2
If n1 and n2 both ≥ 30, use z-test statistic
If n1 and/or n2 < 30 use t-test statistic

For comparing two proportions p1 - p2


Use chi-square test

Type I and II errors


Type I error occurs when we reject the null
hypothesis when we shouldn't.
Pr(Type I error) = α
Type II error occurs when we don't
reject the null hypothesis when we should
have.
Pr(Type II error) = β

One-Way ANOVA
Used when we want to compare the means
of three or more groups from independent
populations.
Continuous outcome measured on each
subject.
We set up an analysis of variance table and
compare the between-group variance with the
within-group variance.
An F-test is used with two different degrees
of freedom terms.

Chi-Square Test
Chi-square goodness of fit test
Assess whether responses fit a specified
distribution for one sample of people

Chi-square test of independence


Test if two discrete variables are associated in
some way for a sample of people

Chi-square test comparing distributions


Compare distributions of proportions among two
or more independent groups

Calculating a Sample Size for a Study


Need a large enough sample to ensure
you have the pre-specified amount of
precision in analysis
Sample size determined based on type
of planned analysis:
Confidence interval
Hypothesis test

Calculating a Sample Size for a Study


We always round up our calculation.
Need to account for possible dropout
from study. This always increases the
required sample size.

Power
Linked up with Type II error
Power = 1 - β
=P(Reject H0 | H0 false)
= Probability of correctly
rejecting H0 when H0 is false.

Correlation
Correlation measures the nature and
strength of linear association between
two variables at a time.
Regression gives the equation that best
describes the relationship between the
variables.

Correlation Coefficient
Population correlation is ρ (rho)
Sample correlation is r where
-1 ≤ r ≤ +1
Sign indicates nature of relationship
(positive or direct, negative or inverse)

Magnitude indicates strength

Linear Regression
A very popular method for describing
the linear relationship between two
variables (usually continuous
variables).
We use a scatterplot to display the
data graphically and a line to show the
association between the two variables.

Simple Linear Regression


Y = Dependent, Outcome variable
X = Independent, Covariate, Predictor
variable

ŷ = b0 + b1 x

b0 is the Y-intercept, b1 is the slope

Multiple Linear Regression


Useful when we want to jointly
examine the effect of several X
variables on the outcome Y variable.
Y = continuous outcome variable
X1, X2, ..., Xp = set of independent or
predictor variables

ŷ = b0 + b1 x1 + b2 x2 + ... + bp xp

Linear Regression
Predictors can be continuous, indicator
variables (0/1) or a set of dummy variables
Confounding: the effect of a risk factor on
an outcome is distorted by the
effect of another factor.
Effect modification: the relationship
between the risk factor and an outcome
differs depending on the level of another variable.

Logistic Regression
Used when the outcome is dichotomous
(binary), e.g. diseased, not diseased.
Our goals remain the same as for linear
regression:
is there an association between a
variable X and our outcome variable Y?
If so, what type?

Simple Logistic Regression


We model the probability p of having
the disease.

$$p = \frac{e^{b_0 + b_1 X}}{1 + e^{b_0 + b_1 X}}$$

$$\operatorname{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 X$$
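
The two forms of the simple logistic model are inverses of each other, which a short Python sketch can make concrete. The coefficients b0 and b1 and the value x below are hypothetical, chosen only for illustration; they do not come from the course data.

```python
import math

def inverse_logit(log_odds):
    """Convert a log odds value (b0 + b1*x) into a probability p."""
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))

def logit(p):
    """Log odds ln(p / (1 - p)) for a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and predictor value, for illustration only.
b0, b1 = -2.0, 0.5
x = 3.0

log_odds = b0 + b1 * x        # linear predictor on the log odds scale
p = inverse_logit(log_odds)   # modeled probability of disease

print(round(p, 3))
print(round(logit(p), 3))     # applying logit recovers b0 + b1*x
```

Applying logit to the modeled probability returns the linear predictor, which is why the coefficients are interpreted on the log odds scale.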

Multiple Logistic Regression


Outcome is dichotomous (1=event,
0=non-event) and p=P(event)
Outcome is modeled as log odds

$$\ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$$

Exp(bi) = OR

Survival Analysis
Outcome is the time to an event.
An event could be a heart attack,
cancer remission or death.

Measure whether person has event or not
(Yes/No) and if so, their time to event.
Determine factors associated with longer
survival.

Survival Analysis
Incomplete follow-up information
Censoring
Measure follow-up time and not time to
event
We know survival time > follow-up time

Log rank test to compare survival in
two or more independent groups

Cox Proportional Hazards Model


Model:
ln(h(t)/h0(t)) = b1X1 + b2X2 + ... + bpXp

Exp(bi) = hazard ratio


Model used to jointly assess effects of
independent variables on outcome
(time to an event).

BS704 Practice Problems for Final Exam

Problem 1.
Suppose a cross-sectional study is
conducted to investigate cardiovascular risk
factors among a sample of patients seeking
medical care at one of three local hospitals.
A total of 300 patients are enrolled. Using
the following data, test if there is an
association between enrollment site (i.e.,
hospital) and family history of CVD. Run
the appropriate test at a 5% level of
significance.

Problem 1.

Family Hx    Hosp 1    Hosp 2    Hosp 3
Definite     24        14        22
Probable     8         14        8
No           68        72        70
Total        100       100       100

Problem 1.
H0: Site and family history are
independent
H1: H0 is false
α = 0.05

Df = (r-1)(c-1) = (3-1)(3-1) = 4.
Reject H0 if χ² > 9.49

Problem 1.
Family Hx    Hosp 1     Hosp 2     Hosp 3
Definite     24 (20)    14 (20)    22 (20)
Probable     8 (10)     14 (10)    8 (10)
No           68 (70)    72 (70)    70 (70)
Total        100        100        100

Problem 1.
$$\chi^2 = \frac{(24-20)^2}{20} + \frac{(14-20)^2}{20} + \frac{(22-20)^2}{20} + \frac{(8-10)^2}{10} + \frac{(14-10)^2}{10} + \frac{(8-10)^2}{10} + \frac{(68-70)^2}{70} + \frac{(72-70)^2}{70} + \frac{(70-70)^2}{70}$$

= 0.8 + 1.8 + 0.2 + 0.4 + 1.6 + 0.4 + 0.06 + 0.06 + 0 = 5.32

Do not reject H0 because 5.32 < 9.49.
We do not have significant evidence,
α = 0.05, to show that site and family
history are not independent.
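
As a check on the hand calculation, here is a minimal Python sketch that derives the expected counts from the row and column totals and sums the (O - E)²/E terms. The function name chi_square_independence is our own, and the critical value 9.49 is the one read from the chi-square table for df = 4.

```python
def chi_square_independence(observed):
    """Return (statistic, df, expected) for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count for each cell = row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

# Rows: Definite, Probable, No; columns: Hosp 1, Hosp 2, Hosp 3
observed = [
    [24, 14, 22],
    [8, 14, 8],
    [68, 72, 70],
]
CRITICAL_VALUE = 9.49  # chi-square table, df = 4, alpha = 0.05

statistic, df, expected = chi_square_independence(observed)
reject_h0 = statistic > CRITICAL_VALUE
print(round(statistic, 2), df, reject_h0)
```

Without intermediate rounding the statistic comes out near 5.31 rather than 5.32; the decision not to reject H0 is unchanged.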

Problem 2.
The following table summarizes data collected
in the study described in problem 1. The
variable summarized below is body mass
index (BMI) computed as the ratio of weight
in kilograms to height in meters squared.
BMI        N      Mean    Std Dev
Overall    300    24.8    2.5
Hosp 1     100    21.6    2.1
Hosp 2     100    24.8    1.8
Hosp 3     100    27.9    1.3

Problem 2.
Test if there is a significant difference in the mean BMI
scores among hospitals. Show all parts of the test and
use a 5% level of significance. (HINT: MSE = 3.1).
H0: μ1 = μ2 = μ3
H1: means not all equal
α = 0.05

$$SS_b = \sum n_j (\bar{X}_j - \bar{X})^2 = 100\left[(21.6-24.8)^2 + (24.8-24.8)^2 + (27.9-24.8)^2\right] = 100(10.24 + 0 + 9.61) = 1985$$

Problem 2.
Source     SS        Df     MS       F
Between    1985      2      992.5    320.2
Error      920.7     297    3.1
Total      2905.7    299

Reject H0 if F > 3.09
F = 320.2
Reject H0 since 320.2 > 3.09. We have significant
evidence, α = 0.05, to show that the means are not
all equal.
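
The ANOVA table can be rebuilt from the summary statistics alone. This is a minimal Python sketch assuming the MSE of 3.1 given in the hint; the variable names are our own.

```python
# Summary statistics from the BMI table (Hosp 1, Hosp 2, Hosp 3)
group_n = [100, 100, 100]
group_means = [21.6, 24.8, 27.9]
overall_mean = 24.8
mse = 3.1  # mean square error, given in the hint

k = len(group_n)        # number of groups
n_total = sum(group_n)  # total sample size

# Between-group sum of squares: sum of n_j * (group mean - overall mean)^2
ss_between = sum(n * (m - overall_mean) ** 2 for n, m in zip(group_n, group_means))
df_between = k - 1
df_error = n_total - k

ms_between = ss_between / df_between
ss_error = mse * df_error
f_stat = ms_between / mse

print(round(ss_between, 1), df_between, round(ms_between, 1))
print(round(ss_error, 1), df_error, round(f_stat, 1))
```

The F statistic is the ratio of the between-group mean square to the error mean square, compared against the table value with (k - 1, N - k) degrees of freedom.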

Problem 3.
Suppose each participant in the study
described in problem 1 is assigned a
cardiovascular risk (a value between 0 and
100 with higher scores indicative of more
risk of cardiovascular disease). The mean
cardiovascular risk is 21.7 with a standard
deviation of 5.6. Suppose that the
covariance between BMI and cardiovascular
risk is 4.5.

Problem 3.
Compute the sample correlation coefficient between
BMI and cardiovascular risk.
Var(BMI) = $s_x^2 = 2.5^2$
Var(Risk) = $s_y^2 = 5.6^2$

$$r = \frac{\operatorname{Cov}(X,Y)}{\sqrt{s_x^2 s_y^2}} = \frac{4.5}{\sqrt{(2.5)^2 (5.6)^2}} \approx 0.3$$

Is this correlation statistically significant?
Run the appropriate test at a 5% level of significance.
H0: ρ = 0
H1: ρ ≠ 0
α = 0.05

$$Z = r \sqrt{\frac{n - 2}{1 - r^2}}$$

Reject H0 if Z < -1.96 or if Z > 1.96

$$Z = 0.3 \sqrt{\frac{298}{1 - (0.3)^2}} = 5.4$$

Reject H0 since 5.4 > 1.96. We have significant
evidence, α = 0.05, to show that ρ ≠ 0.
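
The same two steps in Python, as a rough sketch. It follows the worked solution in rounding r to one decimal place before computing the test statistic; keeping more decimals would give a slightly larger Z and the same conclusion.

```python
import math

cov_xy = 4.5          # covariance between BMI and cardiovascular risk
s_x, s_y = 2.5, 5.6   # standard deviations of BMI and risk
n = 300               # number of participants

r = cov_xy / (s_x * s_y)   # sample correlation, about 0.32
r_rounded = round(r, 1)    # 0.3, the value used in the worked solution

# Test statistic for H0: rho = 0
z = r_rounded * math.sqrt((n - 2) / (1 - r_rounded ** 2))
reject_h0 = abs(z) > 1.96  # two-sided test at alpha = 0.05

print(round(r, 2), round(z, 1), reject_h0)
```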

Problem 4.
Compute the equation of the line that best describes
the relationship between BMI and cardiovascular risk
(Assume that cardiovascular risk is the dependent
variable).
$$b_1 = r \frac{s_y}{s_x} = 0.3 \cdot \frac{5.6}{2.5} = 0.67$$

$$b_0 = \bar{Y} - b_1 \bar{X} = 21.7 - 0.67(24.8) = 5.08$$

$$\hat{y} = 5.08 + 0.67X$$
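
A short Python sketch of the same calculation, with a helper for predicted values. The function name predict_risk is our own; the slope is rounded to two decimals before computing the intercept, as in the worked solution.

```python
r = 0.3                        # sample correlation from Problem 3
s_x, s_y = 2.5, 5.6            # std dev of BMI (X) and risk (Y)
mean_x, mean_y = 24.8, 21.7    # mean BMI and mean risk

b1 = round(r * s_y / s_x, 2)   # slope, rounded as in the worked solution
b0 = mean_y - b1 * mean_x      # intercept

def predict_risk(bmi):
    """Predicted cardiovascular risk for a given BMI."""
    return b0 + b1 * bmi

print(b1, round(b0, 2))
print(round(predict_risk(24.8), 1))  # the line passes through the two means
```

Evaluating the line at the mean BMI returns the mean risk, which is a quick sanity check on the intercept.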

Problem 5.
Suppose we restrict our attention to the
subgroup of patients at high risk for
cardiovascular disease (cardiovascular
risk score of 30 or more).
Using the following data, test if BMI is
significantly different in men versus
women. Use a 5% level of significance.

Problem 5.
BMI        Men     Women
N          20      10
Mean       31.6    28.1
Std Dev    1.7     2.1

H0: μ1 = μ2
H1: μ1 ≠ μ2
α = 0.05

$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

Df = 20 + 10 - 2 = 28
Reject H0 if t < -2.048 or if t > 2.048

Problem 5.
$$S_p = \sqrt{\frac{19(1.7)^2 + 9(2.1)^2}{20 + 10 - 2}} = 1.84$$

$$t = \frac{31.6 - 28.1}{1.84 \sqrt{\frac{1}{20} + \frac{1}{10}}} = 4.91$$

Reject H0 since 4.91 > 2.048. We have significant evidence,
α = 0.05, to show there is a difference in mean BMI
between men and women.
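
The pooled two-sample t test can be sketched in Python from the summary statistics. The critical value 2.048 is the table value for df = 28; variable names are our own.

```python
import math

# Summary statistics for BMI among high-risk patients
n1, mean1, sd1 = 20, 31.6, 1.7   # men
n2, mean2, sd2 = 10, 28.1, 2.1   # women

df = n1 + n2 - 2

# Pooled estimate of the common standard deviation
sp = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)

# Two-sample t statistic for H0: mu1 = mu2
t = (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))
reject_h0 = abs(t) > 2.048       # t table, df = 28, two-sided alpha = 0.05

print(df, round(sp, 2), round(t, 2), reject_h0)
```

Carrying full precision gives t near 4.92 instead of 4.91; the difference comes only from rounding Sp to 1.84 in the hand calculation.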

Problem 6.
How many men and women would be required to
estimate a difference in mean BMI with a 95%
confidence interval and a margin of error not
exceeding 1 unit? (Use data from problem 5 as
needed.)

$$n_i = 2\left(\frac{Z\sigma}{E}\right)^2$$

Use Sp from #5

$$n_i = 2\left(\frac{1.96(1.84)}{1}\right)^2 = 26.01$$

Need 27 men and 27 women.
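
A small Python sketch of the sample size formula, including the round-up step. The function name n_per_group is our own.

```python
import math

def n_per_group(z, sigma, margin):
    """Per-group sample size for a CI on a difference in two means."""
    raw = 2 * (z * sigma / margin) ** 2
    return math.ceil(raw)   # always round up to the next whole person

z = 1.96       # 95% confidence
sigma = 1.84   # pooled standard deviation Sp from Problem 5
margin = 1.0   # margin of error E

n_needed = n_per_group(z, sigma, margin)
print(n_needed)
```

Rounding up rather than to the nearest integer guarantees the margin of error does not exceed the target.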

Problem 7.
The following table was constructed based on a
comparison of various sociodemographic
characteristics between men and women enrolled in
the study of cardiovascular risk factors.
Which, if any, of the characteristics shown
in the table are significantly different between men
and women? Justify.

Problem 7.
Characteristic       Men (n=160)    Women (n=140)    p
Mean Age, yrs        45             47               0.7256
Race                                                 0.0354
  % White            32             38
  % Black            41             37
  % Hispanic         25             19
  % Other
% HS Graduate        78             64               0.0245
Mean Income $000s    47             31               0.0001
% No Insurance                                       0.9876

Problem 8.

What test was used to compare ages between
men and women?
Two sample test for equality of independent
means.
What test was used to compare race between
men and women?
Chi-square test of independence.
What test was used to compare educational
level (% high school graduates) between men
and women?
Two sample test for equality of independent
proportions or chi-square test of independence.

Problem 9.
Two different scales are used in a particular
laboratory. There is some concern that one
scale gives different readings than the other.
Ten specimens are randomly selected and
weighed on each scale. The data are shown
below.

Test if there is a significant difference in
weights between the two scales at α = 0.05.

Problem 9.
Specimen    Scale 1    Scale 2
1           1.2        2.1
2           3.5        3.6
3           1.8        1.9
4           4.0        4.0
5           5.0        4.9
6           1.9        2.0
7           2.7        2.7
8           2.2        2.3
9           2.8        2.9
10          3.5        3.7

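$$\bar{X}_d = \frac{\sum \text{diff}}{n} = \frac{1.5}{10} = 0.15$$

$$s_d = \sqrt{\frac{\sum \text{diff}^2 - \left(\sum \text{diff}\right)^2 / n}{n - 1}} = \sqrt{\frac{0.91 - (1.5)^2/10}{9}} = 0.276$$

H0: μd = 0
H1: μd ≠ 0
α = 0.05

$$t = \frac{\bar{X}_d}{s_d / \sqrt{n}}, \quad df = n - 1$$

Reject H0 if t < -2.262 or if t > 2.262

$$t = \frac{0.15}{0.276 / \sqrt{10}} = 1.72$$

Do not reject H0 because -2.262 < 1.72 < 2.262. We do not
have significant evidence at α = 0.05 to show that μd ≠ 0.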
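
The paired analysis can be reproduced in Python from the raw readings. This sketch takes each difference as Scale 2 minus Scale 1, which matches the sum of 1.5 used in the worked solution; the critical value 2.262 is the table value for df = 9.

```python
import math

scale1 = [1.2, 3.5, 1.8, 4.0, 5.0, 1.9, 2.7, 2.2, 2.8, 3.5]
scale2 = [2.1, 3.6, 1.9, 4.0, 4.9, 2.0, 2.7, 2.3, 2.9, 3.7]

# One difference per specimen (Scale 2 minus Scale 1)
diffs = [b - a for a, b in zip(scale1, scale2)]
n = len(diffs)

mean_d = sum(diffs) / n
sum_sq = sum(d * d for d in diffs)

# Standard deviation of the differences (computational formula)
sd_d = math.sqrt((sum_sq - sum(diffs) ** 2 / n) / (n - 1))

# Paired t statistic for H0: mu_d = 0
t = mean_d / (sd_d / math.sqrt(n))
reject_h0 = abs(t) > 2.262   # t table, df = 9, two-sided alpha = 0.05

print(round(mean_d, 2), round(sd_d, 3), round(t, 2), reject_h0)
```

Because each specimen is weighed on both scales, the test is run on the ten differences rather than on two independent samples.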

Problem 10.
Patients with hypertension are generally
recommended to follow a low salt diet.
Surveys report that approximately 75% of
patients adhere to these diets. In a random
sample of 100 patients with hypertension,
70% report following a low-salt diet. Are
these patients significantly low in terms of
adherence? Run the test at α = 0.05.

Problem 10.
H0: p = 0.75
H1: p < 0.75
α = 0.05

$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}}$$

Reject H0 if Z < -1.645

$$Z = \frac{0.70 - 0.75}{\sqrt{\frac{0.75 (1 - 0.75)}{100}}} = -1.15$$

Do not reject H0 because -1.15 > -1.645. We do not
have significant evidence at α = 0.05 to show that p < 0.75.
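
A minimal Python sketch of the one-sample test for a proportion. The lower-tailed critical value -1.645 is taken from the Z table; variable names are our own.

```python
import math

p_hat = 0.70   # sample proportion following a low-salt diet
p0 = 0.75      # proportion under H0
n = 100        # sample size

# Standard error uses the null value p0, not the sample proportion
se = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se

reject_h0 = z < -1.645   # lower-tailed test at alpha = 0.05

print(round(z, 2), reject_h0)
```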

Problem 11.
The following table was presented in a journal and describes
the associations between demographic and clinical risk
factors and systolic blood pressure.
Outcome = Systolic Blood Pressure

Risk Factors                   Regression Coefficient    p
Intercept                      105.3                     0.0001
Age                            1.2                       0.0042
Male Sex                       4.5                       0.0956
Current Smoker                 -0.5                      0.2354
Number of Hrs Exercise/Week    -2.4                      0.0003

Problem 11.
a) What type of analysis generated the results summarized
above?
Multiple linear regression analysis because the outcome
(systolic blood pressure) is continuous.
b) Which of the risk factors are significantly associated with
systolic blood pressure?

Age and number of hours of exercise are statistically
significant at the 5% level (both have p values < 0.05). Male
sex is marginally significant with a p value of 0.0956.

Problem 11.
c) What is the relative importance of the risk factors?
The most important (statistically significant) risk factor is number of
hours of exercise per week, followed by age and then male sex.
Current smoking status is not statistically significant.
d) How would you interpret the regression coefficient associated with
male sex? With number of hours of exercise per week?
Men's systolic blood pressure is, on average, 4.5 units higher than women's,
holding age, smoking status and number of hours of exercise
constant. Each additional hour of exercise per week is associated
with a reduction of 2.4 units of systolic blood pressure, holding age,
sex and current smoking status constant.

Problem 12.
The following table was presented in a journal and describes
the associations between demographic and clinical risk factors
and hypertension.
Outcome = Hypertension

Risk Factors                   Regression Coefficient    p
Intercept                      3.5                       0.0001
Age                            0.02                      0.0357
Male Sex                       0.27                      0.0264
Current Smoker                 -0.005                    0.7564
Number of Hrs Exercise/Week    -0.36                     0.0111

Problem 12.
a) What type of analysis generated the results summarized above?
Multiple logistic regression analysis because the outcome
(hypertension) is dichotomous.
b) Which of the risk factors are significantly associated with
hypertension?
Age, male sex and number of hours of exercise are statistically
significant at the 5% level (all have p values < 0.05).
c) What is the relative importance of the risk factors?
The most important (statistically significant) risk factor is number of
hours of exercise per week, followed by male sex and then age.
Current smoking status is not statistically significant.

Problem 12.
d) Compute odds ratios for each of the risk factors.
Outcome = Hypertension

Risk Factors                   Regression Coefficient    Odds Ratio
Age                            0.02                      1.02
Male Sex                       0.27                      1.31
Current Smoker                 -0.005                    0.995
Number of Hrs Exercise/Week    -0.36                     0.70

e) How would you interpret the regression coefficient associated with
male sex? With number of hours of exercise per week?
Men have 1.31 times the odds of hypertension compared with women, holding
age, current smoking status and number of hours of exercise per week
constant.
Each additional hour of exercise per week is associated with a 30% reduction in
the odds of hypertension, holding age, sex and current
smoking status constant.
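
Since Exp(bi) = OR, the odds ratios follow directly from the coefficients. A short Python sketch, with the dictionary keys copied from the table:

```python
import math

# Logistic regression coefficients from the table (intercept omitted)
coefficients = {
    "Age": 0.02,
    "Male Sex": 0.27,
    "Current Smoker": -0.005,
    "Number of Hrs Exercise/Week": -0.36,
}

# Odds ratio for each risk factor is exp(coefficient)
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, odds_ratio in odds_ratios.items():
    print(f"{name}: {odds_ratio:.3f}")
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient; the current smoker value is about 0.995, essentially no association.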

Problem 13.
A study is conducted to assess whether there is a difference in physicians'
opinions regarding the treatment of early stage throat cancer. Specifically,
physicians were asked if they would recommend radiation, surgery or
neither upon initial diagnosis. Based on the data below, is there a
relationship between treatment recommendations and physicians' age?
Run the test at a 5% level of significance.
Age      Radiation    Surgery    Neither    Total
<40      35           15         50         100
40-59    29           30         41         100
60-79    40           43         22         105
Total    104          88         113        305

Problem 13.
H0: Age and treatment recommendation are independent
H1: H0 is false
α = 0.05

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Df = (r-1)(c-1) = (3-1)(3-1) = 4.
Reject H0 if χ² > 9.49

Age      Radiation    Surgery      Neither      Total
<40      35 (34.1)    15 (28.9)    50 (37.0)    100
40-59    29 (34.1)    30 (28.9)    41 (37.0)    100
60-79    40 (35.8)    43 (30.3)    22 (38.9)    105
Total    104          88           113          305

$$\chi^2 = \frac{(35-34.1)^2}{34.1} + \frac{(15-28.9)^2}{28.9} + \frac{(50-37)^2}{37} + \frac{(29-34.1)^2}{34.1} + \frac{(30-28.9)^2}{28.9} + \frac{(41-37)^2}{37} + \frac{(40-35.8)^2}{35.8} + \frac{(43-30.3)^2}{30.3} + \frac{(22-38.9)^2}{38.9}$$

= 0.02 + 6.69 + 4.57 + 0.76 + 0.04 + 0.43 + 0.49 + 5.32 + 7.34 = 25.66

Reject H0 because 25.66 > 9.49. We have significant evidence, α = 0.05,
to show that age and treatment recommendation are not independent.

Problem 14.
For each of the following scenarios,
indicate which test would be used. Use
the letters below to indicate the test in
the space provided. Note that the same
test might be used for more than one
scenario.

Problem 14.
a) Compare mean to historical/external control
b) Compare proportion to historical/external control
c) Compare two independent means
d) Compare two matched/paired means
e) Analysis of variance
f) Chi-square goodness of fit test
g) Chi-square test of independence
h) Correlation analysis
i) Linear regression analysis
j) Logistic regression analysis
k) Survival analysis

Problem 14.
Scenario
1. We want to test if there is a significant association between BMI (kg/m²) and
incident myocardial infarction adjusting for age, sex, systolic blood pressure and
smoking.
Test: j
2. We want to test if a new environmental intervention is effective in reducing
exposure to second-hand smoke. Each participant in the study has levels of exposure
measured before and after the intervention is implemented.
Test: d
3. We wish to test if there is a significant association between GRE scores and first
year GPA in MPH students who matriculated in fall 2011.
Test: h or i
4. We want to determine if there are significant differences in ages of participants
enrolled in a study comparing those with a family history of cardiovascular disease to
those without.
Test: c
5. A study reports that 15% of college freshmen smoke. We want to test if
significantly more BU freshmen smoke.
Test: b
6. We want to test if there is a difference in preterm versus term deliveries among
women of black, Hispanic and white race.
Test: g
7. We want to test if nutritional supplements prolong life (minimize time to death) in
persons over 65 years of age, adjusted for sex and other comorbid conditions.
Test: k
8. A clinical trial is run to assess the safety of a new drug compared to a standard
drug and the outcome is development of skin rash or not.
Test: g or j
9. We want to test if there is a difference in mean time to complete a physical task
when comparing 12, 13, 14 and 15 year olds.
Test: e
10. We want to test whether smoking in pregnancy increases the risk of infection in
newborns.
Test: g or k