Final Exam Review Class

BS704 Review for Final Exam
Final Exam
You may bring 5 pages of notes
You MUST bring full copies of
statistical tables (on Blackboard)
You MUST bring a calculator
Topics Covered Since Midterm:

Hypothesis Testing for a single mean and
proportion, and for two means
One-way ANOVA
Chi-square Tests
Power and Sample size
Regression and Correlation
Logistic Regression
Survival analysis
Hypothesis Tests for...

Single mean $\mu$
Single proportion p
Comparing two means $\mu_1 - \mu_2$
Paired (or matched) data $\mu_d$
Conducting a Hypothesis Test

Define null and research hypotheses
Define test statistic, level of
significance and decision rule
Calculate test statistic based upon
sample data.
Use decision rule or p-value to decide
whether to reject or not reject the null
hypothesis.

For a single mean $\mu$
If n $\ge$ 30, use z-test statistic
If n < 30 use t-test statistic
For a single proportion p
Use z-test statistic
Check assumptions
For comparing two means $\mu_1 - \mu_2$
If n1 and n2 both $\ge$ 30, use z-test statistic
If n1 and/or n2 < 30 use t-test statistic
For comparing two proportions $p_1 - p_2$
Use chi-square test
Type I and II errors

Type I error occurs when we reject the null
hypothesis when we shouldn't.
Pr(Type I error) = $\alpha$
Type II error occurs when we don't
reject the null hypothesis when we should
have.
Pr(Type II error) = $\beta$
One-Way ANOVA
Used when we want to compare the means
of three or more groups from independent
populations.
Continuous outcome measured on each
subject.
We set up an analysis of variance table and
compare the variance between groups with
the variance within groups.
An F-test is used with two different degrees
of freedom terms.
Chi-Square Test
Chi-square goodness of fit test
Assess whether responses fit a specified
distribution for one sample of people
Chi-square test of independence

Test if two discrete variables are associated in
some way for a sample of people
Chi-square test comparing distributions

Compare distributions of proportions among two
or more independent groups
Calculating a Sample Size for a Study

Need a large enough sample to ensure
you have the pre-specified amount of
precision in analysis
Sample size determined based on type
of planned analysis:
Confidence interval
Hypothesis test
Calculating a Sample Size for a Study

We always round up our calculation.
Need to account for possible dropout
from study. This always increases the
required sample size.
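The two rules above (round up, then allow for dropout) can be sketched as follows. This is a minimal illustration, assuming the common adjustment of dividing the required number by the expected retention proportion; the function name, the example value 26.01 and the 10% dropout figure are hypothetical and not taken from the slides.

```python
import math

def enrollment_target(n_required: float, dropout_rate: float = 0.0) -> int:
    """Round a computed sample size up and inflate it for expected dropout.

    Assumption: a fraction `dropout_rate` of enrolled participants is lost,
    so we enroll n_required / (1 - dropout_rate) and round up.
    """
    if not 0.0 <= dropout_rate < 1.0:
        raise ValueError("dropout_rate must be in [0, 1)")
    return math.ceil(n_required / (1.0 - dropout_rate))

# 26.01 is an example of a raw calculation result; 0.10 is a hypothetical dropout rate.
print(enrollment_target(26.01))        # no dropout: 26.01 is rounded up
print(enrollment_target(26.01, 0.10))  # with dropout the target is larger
```

Because the divisor is at most 1, the dropout adjustment can never reduce the target, which matches the statement that it always increases the required sample size.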
Power
Linked up with Type II error
Power = 1 - $\beta$
=P(Reject H0 | H0 false)
= Probability of correctly
rejecting H0 when H0 is false.
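To make the link between power and Type II error concrete, here is a rough sketch for one specific case: an upper-tailed z test of a single mean with known standard deviation. That test, the function name and all input numbers are assumptions chosen for illustration; the slides do not give a power formula.

```python
from statistics import NormalDist

def power_upper_tailed_z(mu0, mu1, sigma, n, alpha=0.05):
    """Power of an upper-tailed z test of H0: mu = mu0 when the true mean is mu1."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)      # reject H0 if Z > z_crit
    shift = (mu1 - mu0) / (sigma / n ** 0.5)    # mean of Z when the true mean is mu1
    beta = std_normal.cdf(z_crit - shift)       # P(do not reject H0 | H0 false)
    return 1 - beta                             # power = 1 - beta

# Hypothetical inputs, for illustration only.
for n in (25, 50, 100):
    print(n, round(power_upper_tailed_z(mu0=100, mu1=105, sigma=15, n=n), 3))
```

In this sketch, power rises with the sample size, and when mu1 equals mu0 the function returns alpha, the Type I error rate.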
Correlation
Correlation measures the nature and
strength of linear association between
two variables at a time.
Regression gives the equation that best
describes the relationship between
the variables.
Correlation Coefficient
Population correlation is $\rho$ (rho)
Sample correlation is r where
$-1 \le r \le +1$
Sign indicates nature of relationship
(positive or direct, negative or inverse)
Magnitude indicates strength
Linear Regression
A very popular method for describing
the linear relationship between two
variables (usually continuous
variables).
We use a scatterplot to display the
data graphically
A line to show the association between
the two variables.
Simple Linear Regression

Y = Dependent, Outcome variable
X = Independent, Covariate, Predictor
variable
$\hat{y} = b_0 + b_1 x$
b0 is the Y-intercept, b1 is the slope
Multiple Linear Regression

Useful when we want to jointly
examine the effect of several X
variables on the outcome Y variable.
Y = continuous outcome variable
X1, X2, ..., Xp = set of independent or
predictor variables
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$
Linear Regression
Predictors can be continuous, indicator
variables (0/1) or a set of dummy variables
Confounding: the effect of a risk factor on
an outcome is somehow changed due to the
effect of another factor.
Effect Modification: a different relationship
between the risk factor and an outcome
depending on the level of another variable.
Logistic Regression
Used when the outcome is dichotomous
(binary), e.g. diseased, not diseased.
Our goals remain the same as for linear
regression:
is there an association between a
variable X and our outcome variable Y?
If so, what type?
Simple Logistic Regression

We model the probability p of having
the disease.
$p = \dfrac{e^{b_0 + b_1 X}}{1 + e^{b_0 + b_1 X}}$
$\mathrm{logit}(p) = \ln\left(\dfrac{p}{1 - p}\right) = b_0 + b_1 X$
Multiple Logistic Regression

Outcome is dichotomous (1=event,
0=non-event) and p=P(event)
Outcome is modeled as log odds
$\ln\left(\dfrac{p}{1 - p}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$
Exp(bi) = OR
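The relationship between the probability and the log odds can be checked numerically. The sketch below uses a single predictor with hypothetical coefficients (b0 = -2.0 and b1 = 0.5 are made up for illustration) and shows that taking the logit of the modeled probability recovers the linear predictor.

```python
import math

def inverse_logit(log_odds: float) -> float:
    """Convert log odds b0 + b1*X into a probability p."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))

def logit(p: float) -> float:
    """Convert a probability p into log odds ln(p / (1 - p))."""
    return math.log(p / (1 - p))

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -2.0, 0.5
for x in (0, 2, 4, 6):
    p = inverse_logit(b0 + b1 * x)
    # logit(p) recovers the linear predictor b0 + b1*x
    print(x, round(p, 3), round(logit(p), 3))
```

The probability always stays between 0 and 1, while the log odds can take any value, which is why the model is linear on the log odds scale.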
Survival Analysis
Outcome is the time to an event.
An event could be a heart attack,
cancer remission or death.
Measure whether person has event or not
(Yes/No) and if so, their time to event.
Determine factors associated with longer
survival.
Survival Analysis
Incomplete follow-up information
Censoring
Measure follow-up time and not time to
event
We know survival time > follow-up time
Log rank test to compare survival in
two or more independent groups
Cox Proportional Hazards Model

Model:
$\ln\left(\dfrac{h(t)}{h_0(t)}\right) = b_1 X_1 + b_2 X_2 + \dots + b_p X_p$
Exp(bi) = hazard ratio

Model used to jointly assess effects of
independent variables on outcome
(time to an event).
BS704 Practice Problems for Final Exam
Problem 1.
Suppose a cross-sectional study is
conducted to investigate cardiovascular risk
factors among a sample of patients seeking
medical care at one of three local hospitals.
A total of 300 patients are enrolled. Using
the following data, test if there is an
association between enrollment site (i.e.,
hospital) and family history of CVD. Run
the appropriate test at a 5% level of
significance.
Problem 1.
Family Hx   Hosp 1   Hosp 2   Hosp 3
Definite    24       14       22
Probable    8        14       8
No          68       72       70
Total       100      100      100
Problem 1.
H0: Site and family history are
independent
H1: H0 is false
$\alpha = 0.05$
Df = (r-1)(c-1) = (3-1)(3-1) = 4.
Reject H0 if $\chi^2$ > 9.49
Problem 1.
Observed counts (expected counts in parentheses)
Family Hx   Hosp 1    Hosp 2    Hosp 3
Definite    24 (20)   14 (20)   22 (20)
Probable    8 (10)    14 (10)   8 (10)
No          68 (70)   72 (70)   70 (70)
Total       100       100       100
Problem 1.
$\chi^2 = \dfrac{(24-20)^2}{20} + \dfrac{(14-20)^2}{20} + \dfrac{(22-20)^2}{20} + \dfrac{(8-10)^2}{10} + \dfrac{(14-10)^2}{10} + \dfrac{(8-10)^2}{10} + \dfrac{(68-70)^2}{70} + \dfrac{(72-70)^2}{70} + \dfrac{(70-70)^2}{70}$
= 0.8 + 1.8 + 0.2 + 0.4 + 1.6 + 0.4 + 0.06 + 0.06 + 0 = 5.32
Do not reject H0 because 5.32 < 9.49.
We do not have significant evidence,
$\alpha = 0.05$, to show that site and family
history are not independent.
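The calculation for Problem 1 can be reproduced with a short script. This is a sketch rather than a reference implementation: the function name is illustrative, and the critical value 9.49 is simply copied from the problem instead of being computed. Expected counts are built from the row and column totals, so nothing is rounded along the way.

```python
def chi_square_independence(observed):
    """Return (statistic, df, expected) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count = row total * column total / overall total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df, expected

# Rows: Definite, Probable, No; columns: Hosp 1, Hosp 2, Hosp 3.
observed = [[24, 14, 22], [8, 14, 8], [68, 72, 70]]
stat, df, expected = chi_square_independence(observed)
CRITICAL_VALUE = 9.49  # table value for df = 4, alpha = 0.05 (taken from the problem)
print(round(stat, 2), df, stat > CRITICAL_VALUE)
```

Without intermediate rounding the statistic is about 5.31. The hand calculation gives 5.32 because each of the 4/70 terms is rounded up to 0.06. The decision is the same either way.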
Problem 2.
The following table summarizes data collected
in the study described in problem 1. The
variable summarized below is body mass
index (BMI) computed as the ratio of weight
in kilograms to height in meters squared.
BMI       N     Mean   Std Dev
Overall   300   24.8   2.5
Hosp 1    100   21.6   2.1
Hosp 2    100   24.8   1.8
Hosp 3    100   27.9   1.3
Problem 2.
Test if there is a significant difference in the mean BMI
scores among hospitals. Show all parts of the test and
use a 5% level of significance. (HINT: MSE = 3.1).
$H_0: \mu_1 = \mu_2 = \mu_3$
H1: means not all equal
$\alpha = 0.05$
$SS_b = \sum n_j (\bar{X}_j - \bar{X})^2$
$= 100\left((21.6 - 24.8)^2 + (24.8 - 24.8)^2 + (27.9 - 24.8)^2\right)$
$= 100(10.24 + 0 + 9.61) = 1985$
Problem 2.
Source    SS       Df    MS      F
Between   1985     2     992.5   320.2
Error     920.7    297   3.1
Total     2905.7   299
Reject H0 if F > 3.09
F = 320.2
Reject H0 since 320.2 > 3.09. We have significant
evidence, $\alpha = 0.05$, to show that the means are not
all equal.
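The ANOVA table entries follow directly from the summary statistics. The sketch below rebuilds them, taking MSE = 3.1 from the hint rather than deriving it; the function name is illustrative.

```python
def anova_from_summary(ns, means, grand_mean, mse):
    """One-way ANOVA quantities from group sizes, group means and a given MSE."""
    k = len(ns)
    n_total = sum(ns)
    # Between-group sum of squares: sum of n_j * (group mean - overall mean)^2
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    df_between, df_error = k - 1, n_total - k
    ms_between = ss_between / df_between
    f_stat = ms_between / mse
    return ss_between, df_between, df_error, ms_between, f_stat

ssb, df_b, df_e, msb, f = anova_from_summary(
    ns=[100, 100, 100], means=[21.6, 24.8, 27.9], grand_mean=24.8, mse=3.1
)
print(round(ssb, 1), df_b, df_e, round(msb, 1), round(f, 1))
```

The between-groups degrees of freedom are k - 1 = 2 and the error degrees of freedom are N - k = 297, matching the table.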
Problem 3.
Suppose each participant in the study
described in problem 1 is assigned a
cardiovascular risk score (a value between 0 and
100 with higher scores indicative of more
risk of cardiovascular disease). The mean
cardiovascular risk is 21.7 with a standard
deviation of 5.6. Suppose that the
covariance between BMI and cardiovascular
risk is 4.5.
Problem 3.
Compute the sample correlation coefficient between
BMI and cardiovascular risk.
Var(BMI) = $s_x^2 = 2.5^2$
Var(Risk) = $s_y^2 = 5.6^2$
$r = \dfrac{\mathrm{Cov}(X,Y)}{\sqrt{s_x^2 s_y^2}} = \dfrac{4.5}{\sqrt{(2.5)^2 (5.6)^2}} = 0.32 \approx 0.3$
Is this correlation statistically significant?
Run the appropriate test at a 5% level of significance.
$H_0: \rho = 0$
$H_1: \rho \ne 0$
$\alpha = 0.05$
$Z = r\sqrt{\dfrac{n - 2}{1 - r^2}}$
Reject H0 if Z < -1.96 or if Z > 1.96
$Z = 0.3\sqrt{\dfrac{298}{1 - (0.3)^2}} = 5.4$
Reject H0 since 5.4 > 1.96. We have significant
evidence, $\alpha = 0.05$, to show that $\rho \ne 0$.
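A quick numerical check of Problem 3, as a sketch with illustrative function names. It computes r from the covariance and the two standard deviations, then evaluates the test statistic twice: once with r rounded to 0.3, as in the worked solution, and once with the unrounded r.

```python
import math

def correlation_from_cov(cov_xy, sd_x, sd_y):
    """Sample correlation: covariance divided by the product of standard deviations."""
    return cov_xy / (sd_x * sd_y)

def correlation_test_statistic(r, n):
    """Test statistic r * sqrt((n - 2) / (1 - r^2)) for H0: rho = 0."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r = correlation_from_cov(4.5, 2.5, 5.6)
z_rounded = correlation_test_statistic(0.3, 300)   # r rounded to 0.3, as in the solution
z_unrounded = correlation_test_statistic(r, 300)   # carrying the unrounded r
print(round(r, 2), round(z_rounded, 1), round(z_unrounded, 1))
```

Rounding r early lowers the statistic somewhat, but both versions are far above 1.96, so the conclusion does not depend on the rounding.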
Problem 4.
Compute the equation of the line that best describes
the relationship between BMI and cardiovascular risk
(Assume that cardiovascular risk is the dependent
variable).
$b_1 = r\dfrac{s_y}{s_x} = 0.3\left(\dfrac{5.6}{2.5}\right) = 0.67$
$b_0 = \bar{Y} - b_1\bar{X} = 21.7 - 0.67(24.8) = 5.08$
$\hat{y} = 5.08 + 0.67X$
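The slope and intercept formulas can be wrapped in a small helper, sketched below with an illustrative function name. The BMI value of 30 used for the prediction is a hypothetical input, not part of the problem.

```python
def least_squares_line(r, sd_x, sd_y, mean_x, mean_y):
    """Intercept and slope of the regression of Y on X from summary statistics."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

b0, b1 = least_squares_line(r=0.3, sd_x=2.5, sd_y=5.6, mean_x=24.8, mean_y=21.7)
predicted_risk = b0 + b1 * 30  # BMI of 30 is a hypothetical input
print(round(b0, 2), round(b1, 2), round(predicted_risk, 1))
```

Carrying the unrounded slope (0.672) gives an intercept of about 5.03 instead of the 5.08 obtained with the slope rounded to 0.67. In both versions the line passes through the point of means (24.8, 21.7), up to rounding.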
Problem 5.
Suppose we restrict our attention to the
subgroup of patients at high risk for
cardiovascular disease (cardiovascular
risk score of 30 or more).
Using the following data, test if BMI is
significantly different in men versus
women. Use a 5% level of significance.
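Problem 5.
$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 \ne \mu_2$
$\alpha = 0.05$
BMI       Men    Women
N         20     10
Mean      31.6   28.1
Std Dev   1.7    2.1
$t = \dfrac{\bar{X}_1 - \bar{X}_2}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
Df = 20 + 10 - 2 = 28
Reject H0 if t < -2.048 or if t > 2.048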
Problem 5.
$S_p = \sqrt{\dfrac{19(1.7)^2 + 9(2.1)^2}{20 + 10 - 2}} = 1.84$
$t = \dfrac{31.6 - 28.1}{1.84\sqrt{\dfrac{1}{20} + \dfrac{1}{10}}} = 4.91$
Reject H0 since 4.91 > 2.048. We have significant evidence,
$\alpha = 0.05$, to show there is a difference in mean BMI
between men and women.
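The pooled standard deviation and the t statistic can be verified with a few lines. This is a sketch with an illustrative function name; the critical value 2.048 is copied from the problem rather than computed.

```python
import math

def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Pooled standard deviation, t statistic and df for two independent means."""
    df = n1 + n2 - 2
    sp = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    t = (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return sp, t, df

sp, t, df = pooled_t(20, 31.6, 1.7, 10, 28.1, 2.1)
CRITICAL_T = 2.048  # two-sided, alpha = 0.05, df = 28 (taken from the problem)
print(round(sp, 2), round(t, 2), df, abs(t) > CRITICAL_T)
```

The script keeps Sp unrounded, so its t statistic can differ from the hand calculation in the second decimal place; the decision to reject H0 is unaffected.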
Problem 6.
How many men and women would be required to
estimate a difference in mean BMI with a 95%
confidence interval and a margin of error not
exceeding 1 unit? (Use data from problem 5 as
needed.)
$n_i = 2\left(\dfrac{Z\sigma}{E}\right)^2$
Use Sp from #5
$n_i = 2\left(\dfrac{1.96(1.84)}{1}\right)^2 = 26.01$
Need 27 men and 27 women.
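The sample size formula for estimating a difference in means translates directly into code. In this sketch the function name is illustrative, and the pooled standard deviation 1.84 stands in for sigma, as the solution does.

```python
import math

def n_per_group_for_mean_difference(z, sigma, margin):
    """Per-group n for a confidence interval on a difference in two means."""
    return 2 * (z * sigma / margin) ** 2

raw = n_per_group_for_mean_difference(z=1.96, sigma=1.84, margin=1.0)
print(round(raw, 2), math.ceil(raw))  # raw value, then rounded up per group
```

Rounding up rather than to the nearest integer guarantees the margin of error does not exceed the target.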
Problem 7.
The following table was constructed based on a
comparison of various sociodemographic
characteristics between men and women enrolled in
the study of cardiovascular risk factors.
Which, if any, of the characteristics shown
above are significantly different between men
and women? Justify.
Problem 7.
Characteristic        Men (n=160)   Women (n=140)   p
Mean Age, yrs         45            47              0.7256
Race                                                0.0354
  % White             32            38
  % Black             41            37
  % Hispanic          25            19
  % Other
% HS Graduate         78            64              0.0245
Mean Income $000s     47            31              0.0001
% No Insurance                                      0.9876
Problem 8.
What test was used to compare ages between
men and women?
Two sample test for equality of independent
means.
What test was used to compare race between
men and women?
Chi-square test of independence.
What test was used to compare educational
level (% high school graduates) between men
and women?
Two sample test for equality of independent
proportions or chi-square test of independence.
Problem 9.
Two different scales are used in a particular
laboratory. There is some concern that one
scale gives different readings than the other.
Ten specimens are randomly selected and
weighed on each scale. The data are shown
below.
Test if there is a significant difference in
weights between the two scales at $\alpha = 0.05$.
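Problem 9.
Specimen   Scale 1   Scale 2
1          1.2       2.1
2          3.5       3.6
3          1.8       1.9
4          4.0       4.0
5          5.0       4.9
6          1.9       2.0
7          2.7       2.7
8          2.2       2.3
9          2.8       2.9
10         3.5       3.7
$\bar{X}_d = \dfrac{\sum \text{diff}}{n} = \dfrac{1.5}{10} = 0.15$
$s_d = \sqrt{\dfrac{\sum \text{diff}^2 - (\sum \text{diff})^2 / n}{n - 1}} = \sqrt{\dfrac{0.91 - (1.5)^2/10}{9}} = 0.276$
$H_0: \mu_d = 0$
$H_1: \mu_d \ne 0$
$\alpha = 0.05$
$t = \dfrac{\bar{X}_d}{s_d / \sqrt{n}}$, df = n - 1
Reject H0 if t < -2.262 or if t > 2.262
$t = \dfrac{0.15}{0.276 / \sqrt{10}} = 1.72$
Do not reject H0 because -2.262 < 1.72 < 2.262. We do not
have significant evidence at $\alpha = 0.05$ to show that $\mu_d \ne 0$.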
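The paired analysis can be reproduced from the raw readings. The sketch below takes each difference as Scale 2 minus Scale 1, which is the direction that reproduces the sum of 1.5 used in the solution; the critical value 2.262 is copied from the problem.

```python
import math

scale_1 = [1.2, 3.5, 1.8, 4.0, 5.0, 1.9, 2.7, 2.2, 2.8, 3.5]
scale_2 = [2.1, 3.6, 1.9, 4.0, 4.9, 2.0, 2.7, 2.3, 2.9, 3.7]

# One difference per specimen (Scale 2 minus Scale 1)
diffs = [b - a for a, b in zip(scale_1, scale_2)]
n = len(diffs)
mean_d = sum(diffs) / n
# Shortcut formula for the standard deviation of the differences
sd_d = math.sqrt((sum(d * d for d in diffs) - sum(diffs) ** 2 / n) / (n - 1))
t = mean_d / (sd_d / math.sqrt(n))

CRITICAL_T = 2.262  # two-sided, alpha = 0.05, df = 9 (taken from the problem)
print(round(mean_d, 2), round(sd_d, 3), round(t, 2), abs(t) > CRITICAL_T)
```

Most of the spread in the differences comes from the first specimen (0.9), which is worth noticing when reading the result.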
Problem 10.
Patients with hypertension are generally
recommended to follow a low salt diet.
Surveys report that approximately 75% of
patients adhere to these diets. In a random
sample of 100 patients with hypertension,
70% report following a low-salt diet. Are
these patients significantly low in terms of
adherence? Run the test at $\alpha = 0.05$.
Problem 10.
$H_0: p = 0.75$
$H_1: p < 0.75$
$\alpha = 0.05$
$Z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}}$
Reject H0 if Z < -1.645
$Z = \dfrac{0.70 - 0.75}{\sqrt{\dfrac{0.75(1 - 0.75)}{100}}} = -1.15$
Do not reject H0 because -1.15 > -1.645. We do not
have significant evidence at $\alpha = 0.05$ to show that p < 0.75.
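A sketch of the same one-sample test for a proportion in code. The function name is illustrative, and the sample-size check (both n*p0 and n*(1 - p0) at least 5) is a common rule of thumb assumed here, since the slides only say to check assumptions. The lower-tailed p-value is an addition for illustration and comes from the standard normal distribution.

```python
import math
from statistics import NormalDist

def one_sample_proportion_z(p_hat, p0, n):
    """Z statistic for H0: p = p0 using the null standard error."""
    # Rule-of-thumb check for the normal approximation (an assumption, see lead-in)
    if min(n * p0, n * (1 - p0)) < 5:
        raise ValueError("sample too small for the normal approximation")
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

z = one_sample_proportion_z(p_hat=0.70, p0=0.75, n=100)
p_value = NormalDist().cdf(z)  # lower-tailed, matching H1: p < 0.75
print(round(z, 2), round(p_value, 3), z < -1.645)
```

The p-value is larger than 0.05, which agrees with the decision reached by comparing Z with -1.645.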
Problem 11.
The following table was presented in a journal and describes
the associations between demographic and clinical risk
factors and systolic blood pressure.
Outcome = Systolic Blood Pressure
Risk Factors                  Regression Coefficient   p
Intercept                     105.3                    0.0001
Age                           1.2                      0.0042
Male Sex                      4.5                      0.0956
Current Smoker                -0.5                     0.2354
Number of Hrs Exercise/Week   -2.4                     0.0003
Problem 11.
a) What type of analysis generated the results summarized
above?
Multiple linear regression analysis because the outcome
(systolic blood pressure) is continuous.
b) Which of the risk factors are significantly associated with
systolic blood pressure?
Age and number of hours of exercise are statistically
significant at the 5% level (both have p values < 0.05). Male
sex is marginally significant with a p value of 0.0956.
Problem 11.
c) What is the relative importance of the risk factors?
The most important (statistically significant) risk factor is number of
hours of exercise per week, followed by age and then male sex.
Current smoking status is not statistically significant.
d) How would you interpret the regression coefficient associated with
male sex? With number of hours of exercise per week?
Men's systolic blood pressure is, on average, 4.5 units higher than women's,
holding age, smoking status and number of hours of exercise
constant. Each additional hour of exercise per week is associated
with a reduction of 2.4 units of systolic blood pressure holding age,
sex and current smoking status constant.
Problem 12.
The following table was presented in a journal and describes
the associations between demographic and clinical risk factors
and hypertension.
Outcome = Hypertension
Risk Factors                  Regression Coefficient   p
Intercept                     3.5                      0.0001
Age                           0.02                     0.0357
Male Sex                      0.27                     0.0264
Current Smoker                -0.005                   0.7564
Number of Hrs Exercise/Week   -0.36                    0.0111
Problem 12.
a) What type of analysis generated the results summarized above?
Multiple logistic regression analysis because the outcome
(hypertension) is dichotomous.
b) Which of the risk factors are significantly associated with
hypertension?
Age, male sex and number of hours of exercise are statistically
significant at the 5% level (all three have p values < 0.05).
c) What is the relative importance of the risk factors?
The most important (statistically significant) risk factor is number of
hours of exercise per week, followed by male sex and then age.
Current smoking status is not statistically significant.
Problem 12.
d) Compute odds ratios for each of the risk factors.
Outcome = Hypertension
Risk Factors                  Regression Coefficient   Odds Ratio
Age                           0.02                     1.02
Male Sex                      0.27                     1.31
Current Smoker                -0.005                   0.995
Number of Hrs Exercise/Week   -0.36                    0.70
e) How would you interpret the regression coefficient associated with
male sex? With number of hours of exercise per week?
Men have 1.31 times the odds of hypertension compared with women, holding
age, current smoking status and number of hours of exercise per week
constant.
Each additional hour of exercise per week is associated with a 30% reduction in
the odds of hypertension, holding age, sex and current
smoking status constant.
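Each odds ratio is just the exponentiated regression coefficient, so the table in part d) can be regenerated in a few lines. This sketch also prints the percent change in odds per one-unit increase, which is the quantity behind the "30% reduction" wording.

```python
import math

coefficients = {
    "Age": 0.02,
    "Male Sex": 0.27,
    "Current Smoker": -0.005,
    "Number of Hrs Exercise/Week": -0.36,
}

# Odds ratio = exp(regression coefficient)
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, value in odds_ratios.items():
    pct_change = (value - 1) * 100  # percent change in odds per one-unit increase
    print(f"{name}: OR = {value:.3f} ({pct_change:+.1f}% odds)")
```

An odds ratio above 1 indicates higher odds of hypertension and one below 1 indicates lower odds. The value for current smoking is very close to 1, consistent with its non-significant p value.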
Problem 13.
A study is conducted to assess whether there is a difference in physicians'
opinions regarding the treatment of early stage throat cancer. Specifically,
physicians were asked if they would recommend radiation, surgery or
neither upon initial diagnosis. Based on the data below, is there a
relationship between treatment recommendations and physicians' age?
Run the test at a 5% level of significance.
Age     Radiation   Surgery   Neither   Total
<40     35          15        50        100
40-59   29          30        41        100
60-79   40          43        22        105
Total   104         88        113       305
Problem 13.
H0: Age and treatment recommendation are independent
H1: H0 is false
$\alpha = 0.05$
$\chi^2 = \sum \dfrac{(O - E)^2}{E}$
Df = (r-1)(c-1) = (3-1)(3-1) = 4.
Reject H0 if $\chi^2$ > 9.49
Observed counts (expected counts in parentheses)
Age     Radiation   Surgery     Neither     Total
<40     35 (34.1)   15 (28.9)   50 (37.0)   100
40-59   29 (34.1)   30 (28.9)   41 (37.0)   100
60-79   40 (35.8)   43 (30.3)   22 (38.9)   105
Total   104         88          113         305
$\chi^2 = \dfrac{(35-34.1)^2}{34.1} + \dfrac{(15-28.9)^2}{28.9} + \dfrac{(50-37)^2}{37} + \dfrac{(29-34.1)^2}{34.1} + \dfrac{(30-28.9)^2}{28.9} + \dfrac{(41-37)^2}{37} + \dfrac{(40-35.8)^2}{35.8} + \dfrac{(43-30.3)^2}{30.3} + \dfrac{(22-38.9)^2}{38.9}$
= 0.02 + 6.69 + 4.57 + 0.76 + 0.04 + 0.43 + 0.49 + 5.32 + 7.34 = 25.66
Reject H0 because 25.66 > 9.49. We have significant evidence, $\alpha = 0.05$,
to show that age and treatment recommendation are not independent.
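Once H0 is rejected, it can help to see which cells contribute most to the statistic. The sketch below recomputes the per-cell contributions from the observed counts with unrounded expected counts; the dictionary layout and variable names are illustrative choices.

```python
observed = {
    "<40":   {"Radiation": 35, "Surgery": 15, "Neither": 50},
    "40-59": {"Radiation": 29, "Surgery": 30, "Neither": 41},
    "60-79": {"Radiation": 40, "Surgery": 43, "Neither": 22},
}

row_totals = {age: sum(cells.values()) for age, cells in observed.items()}
col_totals = {}
for cells in observed.values():
    for treatment, count in cells.items():
        col_totals[treatment] = col_totals.get(treatment, 0) + count
n = sum(row_totals.values())

# Contribution of each cell: (O - E)^2 / E with E = row total * column total / n
contributions = {}
for age, cells in observed.items():
    for treatment, o in cells.items():
        e = row_totals[age] * col_totals[treatment] / n
        contributions[(age, treatment)] = (o - e) ** 2 / e

chi_square = sum(contributions.values())
largest = sorted(contributions, key=contributions.get, reverse=True)[:3]
print(round(chi_square, 2), largest)
```

The total is close to the hand-calculated 25.66; small differences come from rounding the expected counts to one decimal place. The largest contributions come from the oldest group recommending "Neither" less often than expected and the youngest group recommending surgery less often than expected.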
Problem 14.
For each of the following scenarios,
indicate which test would be used. Use
the letters below to indicate the test in
the space provided. Note that the same
test might be used for more than one
scenario.
Problem 14.
a) Compare mean to historical/external control
b) Compare proportion to historical/external control
c) Compare two independent means
d) Compare two matched/paired means
e) Analysis of variance
f) Chi-square goodness of fit test
g) Chi-square test of independence
h) Correlation analysis
i) Linear regression analysis
j) Logistic regression analysis
k) Survival analysis
Problem 14.
Scenario (the test is listed after each scenario)
1. We want to test if there is a significant association between BMI (kg/m2) and
incident myocardial infarction adjusting for age, sex, systolic blood pressure and
smoking.
Test: j
2. We want to test if a new environmental intervention is effective in reducing
exposure to second-hand smoke. Each participant in the study has levels of exposure
measured before and after the intervention is implemented.
Test: d
3. We wish to test if there is a significant association between GRE scores and first
year GPA in MPH students who matriculated in fall 2011.
Test: h or i
4. We want to determine if there are significant differences in ages of participants
enrolled in a study comparing those with a family history of cardiovascular disease to
those without.
Test: c
5. A study reports that 15% of college freshmen smoke. We want to test if
significantly more BU freshmen smoke.
Test: b
6. We want to test if there is a difference in preterm versus term deliveries among
women of black, Hispanic and white race.
Test: g
7. We want to test if nutritional supplements prolong life (maximize time to death) in
persons over 65 years of age, adjusted for sex and other comorbid conditions.
Test: k
8. A clinical trial is run to assess the safety of a new drug compared to a standard
drug and the outcome is development of skin rash or not.
Test: g or j
9. We want to test if there is a difference in mean time to complete a physical task
when comparing 12, 13, 14 and 15 year olds.
Test: e
10. We want to test whether smoking in pregnancy increases the risk of infection in
newborns.
Test: g or k
