Research Literature
Hypothesis: Surgeon-directed institutional peer review, associated with
positive physician feedback, can decrease the morbidity and mortality
rates associated with carotid endarterectomy. Results: The stroke rate
decreased from 3.8% (1993-1994) to 0% (1997-1998). The mortality rate
decreased from 2.8% (1993-1994) to 0% (1997-1998). The (average) length
of stay decreased from 4.7 days (1993-1994) to 2.6 days (1997-1998).
The (average) total cost decreased from $13,344 (1993-1994) to $9,548
(1997-1998).
Biostatistics
SGU
Popular Press
July 2014
Planning
Design
Data Collection
Data Analysis
Presentation
Interpretation
Biostatistics
Design of Studies
Sample size
Selection of study participants
Role of randomization
Data Collection Variability
Important patterns in data are obscured by variability.
Distinguish real patterns from random variation.
Inference
Draw general conclusions from limited data
e.g. survey
Summarize
What summary measures will best convey the results
How to convey uncertainty in results
Interpretation
What do the results mean in terms of practice, the program,
the population etc.
Vaccine
Placebo
Polio Cases
82
162
Reference: Meier P, The Biggest Public Health Experiment Ever: The 1954 Field Trial of the
Salk Poliomyelitis Vaccine, In: Statistics: A Guide to the Unknown, 1972.
Comparison Group
Randomized
Placebo Controls
Double Blind
Polio Cases
Vaccine 82 out of 200,745
Placebo 162 out of 201,229
p-value=?
Question: Could the results be due to chance?
Statistical methods tell us how to make these probability
calculations.
7
Types of Data
1
Categorical data
Race/ethnicity: nominal, no ordering
Country of birth: nominal, no ordering
Degree of agreement: ordinal, ordering
Blood pressure
Weight
Height
Age
Sample Mean ($\bar{X}$)
X1 = 120
X2 = 80
X3 = 90
X4 = 110
X5 = 95
Notes on Sample Mean ($\bar{X}$)
Formula
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
Also called sample average or arithmetic mean
12
Population
Population mean $\mu$
Sample
Sample mean $\bar{X}$
How close is $\bar{X}$ to $\mu$?
Statistical theory will tell us how close $\bar{X}$ is to $\mu$
13
14
Sample Median
80
90
95
110
120
Median
We will return to this later
1
15
16
Sample Median
90
95
110
120
125
Median = $\frac{95 + 110}{2}$ = 102.5 mmHg
17
Describing Variability
18
Describing Variability
Sample variance ($s^2$)
$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
Why n - 1?
Stay tuned
Sample standard deviation (s or SD) is the square root of $s^2$
19
20
Calculating s
Example: n = 5
Systolic Blood Pressures (mmHg)
X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95
Sample Mean: $\bar{X}$ = 99 (mmHg)
Sample Variance: $s^2$ = 255
Sample Standard Deviation (SD): s = 15.97 (mmHg)
Notes on s
Often abbreviated SD
Interpretation
Most of the population (about 95%) will be within about 2 standard
deviations of the mean
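The n = 5 blood pressure example can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are illustrative and not part of the slides.

```python
import statistics

# Systolic blood pressures (mmHg) from the n = 5 example
pressures = [120, 80, 90, 110, 95]

sample_mean = statistics.mean(pressures)          # sum of the values divided by n
sample_variance = statistics.variance(pressures)  # sum of squared deviations divided by n - 1
sample_sd = statistics.stdev(pressures)           # square root of the sample variance

print(f"mean = {sample_mean} mmHg")
print(f"variance = {sample_variance}")
print(f"SD = {sample_sd:.2f} mmHg")
```

Note that `statistics.variance` and `statistics.stdev` divide by n - 1, matching the sample formulas above.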
Range=Max-Min
. Tend to increase?
. Tend to decrease?
. Remain about the same?
25
26
Percent
13.0
5.7
13.0
14.0
10.6
9.7
13.8
13.0
17.6
9.6
13.3
11.3
12.1
12.4
14.9
13.3
12.5
State             Percent
Louisiana         11.6
Maine             14.4
Maryland          11.3
Massachusetts     13.5
Michigan          12.3
Minnesota         12.1
Mississippi       12.1
Missouri          13.5
Montana           13.4
Nebraska          13.6
Nevada            11.0
New Hampshire     12.0
New Jersey        13.2
New Mexico        11.7
New York          12.9
North Carolina    12.0
North Dakota      14.7
Ohio              13.3
Oklahoma          13.2
Oregon            12.8
Pennsylvania      15.6
Rhode Island      14.5
South Carolina    12.1
South Dakota      14.3
Tennessee         12.4
Texas             9.9
Utah              8.5
Vermont           12.7
Virginia          11.2
Washington        11.2
West Virginia     15.3
Wisconsin         13.1
Wyoming           11.7
Class           Count
4.1 to 5.0      0
5.1 to 6.0      1
6.1 to 7.0      0
7.1 to 8.0      0
8.1 to 9.0      1
9.1 to 10.0     3
10.1 to 11.0    2
11.1 to 12.0    9
12.1 to 13.0    14
13.1 to 14.0    12
14.1 to 15.0    5
15.1 to 16.0    2
16.1 to 17.0    0
17.1 to 18.0    1
18.1 to 19.0    0
Histogram
[Figure: histogram of the state percentages; y-axis: Number of states]
[Figure: histograms of the same data with bin widths of 1 mm Hg, 5 mm Hg, and 20 mm Hg; y-axis: Frequency]
Number of Intervals
n = 10: about 3
n = 50: about 7
n = 100: about 10
[Figure: Relative Frequency Histogram and Relative Frequency Polygon]
Histogram applet at
http://www.stat.sc.edu/~west/javahtml/Histogram.html
Boxplots
[Figure: boxplot with labels Outlier, Largest Non-Outlier, 75th Percentile, Sample Median, 25th Percentile, Smallest Non-Outlier]
 9 | 79
10 | 1166788999
11 | 0112333444555667777889
12 | 00111111233445555566777777889
13 | 0011123333456667788999
14 | 0111224446
15 | 003
16 | 05
33
34
1976
Symmetrical and
bell-shaped
Positively skewed or
skewed to the right
Negatively skewed or
skewed to the left
Bimodal
Reverse J-shaped
Uniform
1988
35
36
Distribution Characteristics
Mode
Median
Mean
Mean=Balancing Point
[Figure: histograms for a Medium Sample, a Large Sample, and the Entire Population; y-axis: Probability Density]
[Figure: histogram of Serum Albumin (g/l); y-axis: Frequency]
Symmetric
Bell-shaped
Mean = Median = Mode
You can tell which normal distribution you have by knowing the
mean and standard deviation:
Mean ($\mu$) is the center
Standard deviation ($\sigma$) measures the spread (variability)
Applet at http://stat-www.berkeley.edu/~stark/Java/Html/StandardNormal.htm
within 1 SD of X
60
62.5
65
67.5
70
72.5
A standard score of
Z = 1 = observation lies one SD above the mean
Z = 2 = observation lies two SD above the mean
=0
=2
49
Z-Scores
Height = 60 inches
$Z = \frac{60 - 65}{2.5} = -2.0$
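The height example can be written as a small function. This is a minimal sketch; the mean of 65 inches and SD of 2.5 inches are read off the calculation above, and the function name is illustrative.

```python
def z_score(observation, mean, sd):
    """Number of standard deviations the observation lies above (+) or below (-) the mean."""
    return (observation - mean) / sd

# Height of 60 inches, with mean 65 and SD 2.5 taken from the slide
z = z_score(60, mean=65, sd=2.5)
print(z)
```

A negative Z means the observation lies below the mean; here the height is two SDs below.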
51
52
Z      Within Z SDs    More than Z SDs    More than Z SDs    More than Z SDs above
       of the mean     above the mean     below the mean     or below the mean
0.5    38.29%          30.85%             30.85%             61.71%
1.0    68.27%          15.87%             15.87%             31.73%
1.5    86.64%          6.68%              6.68%              13.36%
2.0    95.45%          2.28%              2.28%              4.55%
2.5    98.76%          0.62%              0.62%              1.24%
3.0    99.73%          0.13%              0.13%              0.27%
3.5    99.95%          0.02%              0.02%              0.05%
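The percentages for the normal distribution can be reproduced from the error function in Python's standard library. This is a sketch, not the method used to build the slide; the function names are illustrative.

```python
import math

def percent_within(z):
    """Percent of a normal distribution lying within z SDs of the mean."""
    return 100 * math.erf(z / math.sqrt(2))

def percent_beyond_one_side(z):
    """Percent lying more than z SDs above the mean (same as below, by symmetry)."""
    return (100 - percent_within(z)) / 2

for z in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5):
    within = percent_within(z)
    one_side = percent_beyond_one_side(z)
    print(f"Z = {z}: within {within:.2f}%, one tail {one_side:.2f}%, both tails {100 - within:.2f}%")
```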
53
Problems
Normal Distribution
z
0.00
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.10
0.11
0.12
0.13
0.14
0.15
0.16
0.17
0.18
0.19
0.20
0.21
0.22
0.23
0.24
0.25
0.26
0.27
0.28
0.29
P
1.0000
0.9920
0.9840
0.9761
0.9681
0.9601
0.9522
0.9442
0.9362
0.9283
0.9203
0.9124
0.9045
0.8966
0.8887
0.8808
0.8729
0.8650
0.8572
0.8493
0.8415
0.8337
0.8259
0.8181
0.8103
0.8026
0.7949
0.7872
0.7795
0.7718
z
0.30
0.31
0.32
0.33
0.34
0.35
0.36
0.37
0.38
0.39
0.40
0.41
0.42
0.43
0.44
0.45
0.46
0.47
0.48
0.49
0.50
0.51
0.52
0.53
0.54
0.55
0.56
0.57
0.58
0.59
P
0.7642
0.7566
0.7490
0.7414
0.7339
0.7263
0.7188
0.7114
0.7039
0.6965
0.6892
0.6818
0.6745
0.6672
0.6599
0.6527
0.6455
0.6384
0.6312
0.6241
0.6171
0.6101
0.6031
0.5961
0.5892
0.5823
0.5755
0.5687
0.5619
0.5552
z
0.60
0.61
0.62
0.63
0.64
0.65
0.66
0.67
0.68
0.69
0.70
0.71
0.72
0.73
0.74
0.75
0.76
0.77
0.78
0.79
0.80
0.81
0.82
0.83
0.84
0.85
0.86
0.87
0.88
0.89
P
0.5485
0.5419
0.5353
0.5287
0.5222
0.5157
0.5093
0.5029
0.4965
0.4902
0.4839
0.4777
0.4715
0.4654
0.4593
0.4533
0.4473
0.4413
0.4354
0.4295
0.4237
0.4179
0.4122
0.4065
0.4009
0.3953
0.3898
0.3843
0.3789
0.3735
z
0.90
0.91
0.92
0.93
0.94
0.95
0.96
0.97
0.98
0.99
1.00
1.01
1.02
1.03
1.04
1.05
1.06
1.07
1.08
1.09
1.10
1.11
1.12
1.13
1.14
1.15
1.16
1.17
1.18
1.19
The above results will turn out to be very important later in our
discussion of p-values.
58
Normal Distribution
z
1.50
1.51
1.52
1.53
1.54
1.55
1.56
1.57
1.58
1.59
1.60
1.61
1.62
1.63
1.64
1.65
1.66
1.67
1.68
1.69
1.70
1.71
1.72
1.73
1.74
1.75
1.76
1.77
1.78
1.79
59
P
0.1336
0.1310
0.1285
0.1260
0.1236
0.1211
0.1188
0.1164
0.1141
0.1118
0.1096
0.1074
0.1052
0.1031
0.1010
0.0989
0.0969
0.0949
0.0930
0.0910
0.0891
0.0873
0.0854
0.0836
0.0819
0.0801
0.0784
0.0767
0.0751
0.0735
z
1.80
1.81
1.82
1.83
1.84
1.85
1.86
1.87
1.88
1.89
1.90
1.91
1.92
1.93
1.94
1.95
1.96
1.97
1.98
1.99
2.00
2.01
2.02
2.03
2.04
2.05
2.06
2.07
2.08
2.09
P
0.0719
0.0703
0.0688
0.0672
0.0658
0.0643
0.0629
0.0615
0.0601
0.0588
0.0574
0.0561
0.0549
0.0536
0.0524
0.0512
0.0500
0.0488
0.0477
0.0466
0.0455
0.0444
0.0434
0.0424
0.0414
0.0404
0.0394
0.0385
0.0375
0.0366
z
2.10
2.11
2.12
2.13
2.14
2.15
2.16
2.17
2.18
2.19
2.20
2.21
2.22
2.23
2.24
2.25
2.26
2.27
2.28
2.29
2.30
2.31
2.32
2.33
2.34
2.35
2.36
2.37
2.38
2.39
P
0.0357
0.0349
0.0340
0.0332
0.0324
0.0316
0.0308
0.0300
0.0293
0.0285
0.0278
0.0271
0.0264
0.0257
0.0251
0.0244
0.0238
0.0232
0.0226
0.0220
0.0214
0.0209
0.0203
0.0198
0.0193
0.0188
0.0183
0.0178
0.0173
0.0168
z
2.40
2.41
2.42
2.43
2.44
2.45
2.46
2.47
2.48
2.49
2.50
2.51
2.52
2.53
2.54
2.55
2.56
2.57
2.58
2.59
2.60
2.61
2.62
2.63
2.64
2.65
2.66
2.67
2.68
2.69
P
0.0164
0.0160
0.0155
0.0151
0.0147
0.0143
0.0139
0.0135
0.0131
0.0128
0.0124
0.0121
0.0117
0.0114
0.0111
0.0108
0.0105
0.0102
0.0099
0.0096
0.0093
0.0091
0.0088
0.0085
0.0083
0.0080
0.0078
0.0076
0.0074
0.0071
z
2.70
2.71
2.72
2.73
2.74
2.75
2.76
2.77
2.78
2.79
2.80
2.81
2.82
2.83
2.84
2.85
2.86
2.87
2.88
2.89
2.90
2.91
2.92
2.93
2.94
2.95
2.96
2.97
2.98
2.99
P
0.0069
0.0067
0.0065
0.0063
0.0061
0.0060
0.0058
0.0056
0.0054
0.0053
0.0051
0.0050
0.0048
0.0047
0.0045
0.0044
0.0042
0.0041
0.0040
0.0039
0.0037
0.0036
0.0035
0.0034
0.0033
0.0032
0.0031
0.0030
0.0029
0.0028
60
Absolutely not.
The population of interest could be
Then why do we spend so much time
studying the normal distribution?
1
61
62
We know the sample mean $\bar{X}$ (e.g., $\bar{X}$ = 99 mmHg)
We don't know the population mean $\mu$ but we would like to
Key Question
How close is the sample mean (or proportion) to
the population mean (or proportion)?
63
64
Sources of Error
Solution:
Random sampling
Family members
Non-random; not independent
Telephone survey; random digit dial
Random or non-random sample?
. Convenience sampling
Errors from (Random) Sampling
. Caused by chance occurrence
be random.
66
Bottom Line
Selection Bias
Mail questionnaire to 10 million people
Sources: telephone books, clubs
Poor people are unlikely to have telephone
(only 25% had telephones)
67
68
Random Sample
Sampling Variability
IDEA
If the statistic does not change
much if you repeated the study
(you get the same answer each time),
then it is fairly reliable
(not a lot of variability)
variability or error
69
70
Example
Estimate the proportion of persons in a population who have
health insurance
n = 1373
Sample 1: $\hat{p} = \frac{1100}{1373} = .8012$
Sample 2: $\hat{p} = \frac{1090}{1373} = .7939$
Sample 3: $\hat{p} = .8347$
Sample 4: $\hat{p} = .7786$
and so on
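The idea of repeating the study many times can be imitated by simulation. The sketch below assumes a hypothetical population in which exactly 80% have health insurance, and draws 1000 samples of n = 1373; the seed, the function name, and the population value are illustrative assumptions, not data from the study.

```python
import math
import random
import statistics

def simulate_sample_proportions(p, n, n_samples, seed=0):
    """Draw n_samples random samples of size n and return each sample proportion."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    proportions = []
    for _ in range(n_samples):
        insured = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(insured / n)
    return proportions

# Assumed population proportion of 0.80 (hypothetical)
props = simulate_sample_proportions(p=0.8, n=1373, n_samples=1000)

print("first four sample proportions:", [round(x, 4) for x in props[:4]])
print("mean of sample proportions:", round(statistics.mean(props), 4))
print("SD of sample proportions:", round(statistics.stdev(props), 4))
print("sqrt(p(1-p)/n):", round(math.sqrt(0.8 * 0.2 / 1373), 4))
```

The sample proportions differ from sample to sample, but their spread is small and close to the value given by the standard error formula.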
71
72
Proportions based on
25
15
30
30
15
10
0
10
20
10
20
Histogram
of 1000
Sample
Proportions
0.70
0.75
0.80
0.85
0.90
0.70
0.76
0.78
0.80
0.82
0.75
0.80
0.85
0.90
0.84
73
74
0.0
0.2
Percentage
p = .80
0.4
0.6
0.8
1.0
No Health Insurance
75
Health Insurance
76
Let's do an experiment...
1.0
Sample 2
1.0
Sample 1
0.8
0.6
Percentage
0.4
0.0
0.0
Ready, set,
go...
0.2
Percentage
0.2
0.4
0.6
0.8
No Health Insurance
Health Insurance
No Health Insurance
p = 0.9
Health Insurance
p = 0.6
77
78
3.0
1.5
2.0
p
= 0.8
s = 0.11
0.0
0.5
1.0
Ready, set,
go...
0.0
0.2
0.4
0.6
0.8
1.0
79
80
0.6
0.2
0.0
Health Insurance
No Health Insurance
Health Insurance
No Health Insurance
p = 0.8
s = 0.06
Percentage
0.4
0.6
0.4
0.0
0.2
Percentage
0.8
0.8
1.0
Sample 2
1.0
Sample 1
p = 0.8
0.0
p = 0.7
0.2
0.4
0.6
0.8
1.0
81
82
1.0
Sample 2
1.0
Sample 1
0.8
0.6
Percentage
No Health Insurance
Health Insurance
p = 0.76
83
0.4
0.0
0.0
Ready, set,
go...
0.2
Percentage
0.2
0.4
0.6
0.8
No Health Insurance
Health Insurance
p = 0.83
84
Let's Review
p = .8
Population
Health Insurance
No Health Insurance
n = 20
p
= 0.8
s = 0.04
0.0
0.2
0.4
0.6
0.8
p
= 0.799
sp = 0.11
p
= 0.803
sp = 0.06
p
= 0.798
sp = 0.04
1.0
n = 50
0.2
0.4
0.6
0.8
1.0
0.0
0.0
0.2
0.4
0.6
0.8
1.0
n = 100
0.0
0.2
0.4
0.6
0.8
1.0
85
Let's do an experiment...
0.025
0.015
0.020
0.005
0.010
= 125 mm Hg
= 14 mm Hg
Ready, set,
go...
0.000
0.030
86
80
100
120
140
160
87
88
0.12
0.10
= 125
X
0.08
0.04
sX = 3.07
100
120
140
160
180
80
100
120
140
160
180
= 125.17
X
= 124.3
X
s = 12.36
s = 11.65
0.00
0.02
80
0.04
0.00
0.00
0.06
0.02
0.01
0.03
0.02
0.01
0.04
Sample 2
0.03
Sample 1
80
100
120
140
160
180
89
90
Ready, set,
go...
0.01
0.00
0.00
0.01
0.02
0.02
0.03
0.03
0.04
Sample 2
0.04
Sample 1
80
91
100
120
140
160
180
80
100
120
140
= 124.98
X
= 126.72
X
s = 14.05
s = 13.64
160
180
92
0.20
0.15
0.10
= 125.01
X
sX = 1.93
0.00
0.05
Ready, set,
go...
80
100
120
140
160
180
93
120
140
160
180
0.25
0.20
0.15
80
100
120
140
160
180
0.05
100
sX = 1.41
= 127.32
X
= 125.06
X
s = 14.93
s = 13.15
0.00
80
= 124.93
X
0.10
0.03
0.02
0.01
0.00
0.03
0.02
0.00
0.01
0.04
Sample 2
0.04
Sample 1
94
80
95
100
120
140
160
180
96
Let's Review
Population: $\mu$ = 125, $\sigma$ = 14
n = 20: mean of sample means = 124.997, $s_{\bar{X}}$ = 3.07
n = 50: mean of sample means = 125.015, $s_{\bar{X}}$ = 1.93
n = 100: mean of sample means = 124.934, $s_{\bar{X}}$ = 1.41
[Figure: population distribution and histograms of sample means for n = 20, 50, and 100]
[Figure: second population with $\mu$ = 4 days, $\sigma$ = 3 days; y-axis: Percentage]
98
Let's do an experiment...
Ready, set,
go...
0.25
0.15
0.05
0.00
0.00
0.05
0.10
0.10
0.15
0.20
0.20
0.25
0.30
Sample 2
0.30
Sample 1
99
10
15
20
25
10
15
= 4.7
X
= 5.01
X
s = 2.88
s = 2.73
20
25
100
0.4
sX = 0.74
0.2
= 4.08
X
0.3
0.5
0.0
0.1
Ready, set,
go...
0
10
15
20
25
101
0.8
0.25
= 4.1
X
0.6
0.20
0.15
0.10
sX = 0.37
0.00
0.4
0.05
0.20
0.15
10
15
20
25
10
15
20
25
0.0
0.2
0.00
0.05
0.10
0.25
0.30
Sample 2
0.30
Sample 1
102
= 4.26
X
= 4.08
X
s = 2.72
s = 2.45
103
10
15
20
25
104
Ready, set,
go...
0.25
0.00
0.05
0.10
0.15
0.00
0.05
0.10
0.15
0.20
0.20
0.25
0.30
Sample 2
0.30
Sample 1
10
15
20
25
10
15
= 4.48
X
= 4.29
X
s = 3.32
s = 2.76
20
25
105
106
Let's Review
Population: $\mu$ = 4, $\sigma$ = 3
n = 16: mean of sample means = 4.081, $s_{\bar{X}}$ = 0.74
n = 64: mean of sample means = 4.104, $s_{\bar{X}}$ = 0.37
n = 256: mean of sample means = 4.1, $s_{\bar{X}}$ = 0.19
[Figure: population distribution and histograms of sample means for n = 16, 64, and 256]
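The spreads seen in the simulated sample means are close to what the formula SD divided by the square root of n predicts. The sketch below assumes the population SD is 3, as read from the review slide; the function name is illustrative.

```python
import math

def standard_error(sd, n):
    """Standard error of the sample mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

# Assumed population SD of 3, sample sizes from the review slide
for n in (16, 64, 256):
    print(f"n = {n}: standard error = {standard_error(3, n)}")
```

Each fourfold increase in n halves the standard error.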
www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html
500
5000
Simulations
500
5000
Simulations
500
5000
Simulations
n=16
n=64
n=256
Amazing Result
Mathematical statisticians have figured out how to predict what
the sampling distribution will look like without actually repeating
the study numerous times and having to choose a sample each time
30
distributed.
10
20
0.76
0.78
0.80
0.82
0.84
113
114
Two Dice
10
Five Dice
Means based
on n = 16
2
10
Number of Occurrences
Means based
on n = 32
0
1
Mean Value
Mean Value
10
Mean Value
Means based
on n = 64
0
115
10
116
standard error), then if you took another sample its likely you
would get a very different result
About 95% of the time the sample mean (or proportion) will
population parameter
117
118
of the statistic
Mathematical statisticians have come up with formulas for the
standard error. There are different formulae for:
119
120
Notes on SEM
1
2
3
Example
Measure systolic
Sample size
Sample mean
Sample SD
The smaller SEM is, the more precise $\bar{X}$ is.
SEM depends on n and s.
SEM gets smaller if
s gets smaller
n gets bigger
Question:
122
ANSWER
The standard error of the sample mean tells us 95% of the time
the population mean will lie within about 2 standard errors of the
sample mean
$\bar{X} \pm 2\,\mathrm{SEM}$
(More accurately: $\bar{X} \pm 1.96\,\mathrm{SEM}$)
123.4 ± 2 × 1.4
123.4 ± 2.8
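The interval arithmetic can be sketched in Python. The sample mean of 123.4 and SEM of 1.4 come from the example; the function name and the default multiplier of 2 are illustrative choices.

```python
def approx_95_ci(sample_mean, sem, multiplier=2.0):
    """Approximate 95% confidence interval: sample mean plus or minus multiplier * SEM."""
    margin = multiplier * sem
    return sample_mean - margin, sample_mean + margin

low, high = approx_95_ci(123.4, 1.4)
print(f"approximate 95% CI: ({low:.1f}, {high:.1f})")

# Using 1.96 instead of 2 gives a slightly narrower interval
low_exact, high_exact = approx_95_ci(123.4, 1.4, multiplier=1.96)
print(f"with 1.96: ({low_exact:.2f}, {high_exact:.2f})")
```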
124
Interpretation:
Plausible values for the population mean with high
confidence
n increases
s decreases
Level of confidence decreases (e.g. 90%, 80% vs 95%)
125
126
Technical Interpretation
The CI works (includes $\mu$) 95% of the time
$\bar{X} \pm 2\,\mathrm{SEM}$
$\bar{X} \pm 2\,\frac{s}{\sqrt{n}}$
Important
Sample size n is at least 60 to use $\pm 2\,\mathrm{SEM}$
128
df     t          df     t
1      12.706     12     2.179
2      4.303      13     2.160
3      3.182      14     2.145
4      2.776      15     2.131
5      2.571      20     2.086
6      2.447      25     2.060
7      2.365      30     2.042
8      2.306      40     2.021
9      2.262      60     2.000
10     2.228      120    1.980
11     2.201      ∞      1.960
Notes
$\bar{X} \pm t \cdot \frac{s}{\sqrt{n}}$
129
Students t-Distribution
df
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
0.2
3.078
1.886
1.638
1.533
1.476
1.440
1.415
1.397
1.383
1.372
1.363
1.356
1.350
1.345
1.341
1.337
1.333
1.330
1.328
1.325
1.323
1.321
1.319
1.318
1.316
1.315
1.314
1.313
1.311
1.310
0.05
12.706
4.303
3.182
2.776
2.571
2.447
2.365
2.306
2.262
2.228
2.201
2.179
2.160
2.145
2.131
2.120
2.110
2.101
2.093
2.086
2.080
2.074
2.069
2.064
2.060
2.056
2.052
2.048
2.045
2.042
0.02
31.821
6.965
4.541
3.747
3.365
3.143
2.998
2.896
2.821
2.764
2.718
2.681
2.650
2.624
2.602
2.583
2.567
2.552
2.539
2.528
2.518
2.508
2.500
2.492
2.485
2.479
2.473
2.467
2.462
2.457
0.01
63.657
9.925
5.841
4.604
4.032
3.707
3.499
3.355
3.250
3.169
3.106
3.055
3.012
2.977
2.947
2.921
2.898
2.878
2.861
2.845
2.831
2.819
2.807
2.797
2.787
2.779
2.771
2.763
2.756
2.750
130
Students t-Distribution
df
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
0.001
636.619
31.599
12.924
8.610
6.869
5.959
5.408
5.041
4.781
4.587
4.437
4.318
4.221
4.140
4.073
4.015
3.965
3.922
3.883
3.850
3.819
3.792
3.768
3.745
3.725
3.707
3.690
3.674
3.659
3.646
131
0.2
1.309
1.309
1.308
1.307
1.306
1.306
1.305
1.304
1.304
1.303
1.303
1.302
1.302
1.301
1.301
1.300
1.300
1.299
1.299
1.299
1.298
1.298
1.298
1.297
1.297
1.297
1.297
1.296
1.296
1.296
1.282
0.1
1.696
1.694
1.692
1.691
1.690
1.688
1.687
1.686
1.685
1.684
1.683
1.682
1.681
1.680
1.679
1.679
1.678
1.677
1.677
1.676
1.675
1.675
1.674
1.674
1.673
1.673
1.672
1.672
1.671
1.671
1.645
0.05
2.040
2.037
2.035
2.032
2.030
2.028
2.026
2.024
2.023
2.021
2.020
2.018
2.017
2.015
2.014
2.013
2.012
2.011
2.010
2.009
2.008
2.007
2.006
2.005
2.004
2.003
2.002
2.002
2.001
2.000
1.960
0.02
2.453
2.449
2.445
2.441
2.438
2.434
2.431
2.429
2.426
2.423
2.421
2.418
2.416
2.414
2.412
2.410
2.408
2.407
2.405
2.403
2.402
2.400
2.399
2.397
2.396
2.395
2.394
2.392
2.391
2.390
2.326
0.01
2.744
2.738
2.733
2.728
2.724
2.719
2.715
2.712
2.708
2.704
2.701
2.698
2.695
2.692
2.690
2.687
2.685
2.682
2.680
2.678
2.676
2.674
2.672
2.670
2.668
2.667
2.665
2.663
2.662
2.660
2.576
0.001
3.633
3.622
3.611
3.601
3.591
3.582
3.574
3.566
3.558
3.551
3.544
3.538
3.532
3.526
3.520
3.515
3.510
3.505
3.500
3.496
3.492
3.488
3.484
3.480
3.476
3.473
3.470
3.466
3.463
3.460
3.291
132
t-distribution Applets
http://www.stat.sc.edu/~west/applets/tdemo.html
http://www.econtools.com/jevons/java/Graphics2D/tDist.html
$\bar{X}$ = 99 mmHg
s = 15.97
95% CI is $\bar{X} \pm 2.776 \times \mathrm{SEM}$
99 ± 2.776 × 7.142
99 ± 19.83
The 95% CI for mean blood pressure is (79.17, 118.83)
Rounding off is okay too: (79, 119)
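The small-sample interval can be reproduced from the five blood pressures used earlier. In this sketch the t multiplier 2.776 (df = 4) is typed in from the t table rather than computed, and the variable names are illustrative.

```python
import math
import statistics

# The five systolic blood pressures (mmHg) from the earlier example
pressures = [120, 80, 90, 110, 95]

n = len(pressures)
mean = statistics.mean(pressures)
sem = statistics.stdev(pressures) / math.sqrt(n)  # s divided by the square root of n

t_crit = 2.776  # table value for df = n - 1 = 4, 95% confidence
low = mean - t_crit * sem
high = mean + t_crit * sem

print(f"SEM = {sem:.3f}")
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

With only five observations the t multiplier is much larger than 2, so the interval is wide.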
133
134
PROPORTIONS (p)
The standard error of the sample mean depends on the sample size.
135
136
Proportions
Example
n = 200 patients
X = 90 with an adverse drug reaction
The estimated proportion who experience an adverse drug reaction
is
p = 90/200 = .45
or 45%
population proportion?
NOTES
n = 200 patients
If we had studied a much larger number of patients, would we
population
137
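$\hat{p} \pm 1.96\,SE(\hat{p})$
$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
$\hat{p}$ = 90/200 = .45
$.45 \pm 1.96 \sqrt{\frac{.45 \times .55}{200}}$
.45 ± 1.96 × 0.035
.45 ± 0.07
[Figure: histogram of sample proportions; x-axis: Sample Proportion, y-axis: Number of Samples]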
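The confidence interval for the adverse drug reaction proportion can be computed as follows. This is a minimal sketch of the large-sample formula; the function name is illustrative.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Large-sample 95% CI for a proportion: p-hat plus or minus z * sqrt(p-hat(1 - p-hat)/n)."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

# 90 of 200 patients had an adverse drug reaction
p_hat, se, (low, high) = proportion_ci(90, 200)
print(f"p-hat = {p_hat}, SE = {se:.3f}")
print(f"95% CI: ({low:.2f}, {high:.2f})")
```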
Example
Sometimes $1.96\,SE(\hat{p})$ is called
$\hat{p} = \frac{31}{39} = 0.79$
143
144
Comparison of 2 Groups
Are confidence intervals needed even though all infants were
studied?
Paired Design
Before-after data
Twin data
145
Paired Design
146
Before
After
Why Pairing?
147
148
Subject        BP Before OC    BP After OC    Difference (After - Before)
1              115             128            13
2              112             115            3
3              107             106            -1
4              119             128            9
5              115             122            7
6              138             145            7
7              126             132            6
8              105             109            4
9              104             102            -2
10             115             117            2
sample mean    115.6           120.4          4.8
$4.8 \pm t_{.95,\,df=9} \times \mathrm{SEM}$
$4.8 \pm 2.262 \times \frac{4.57}{\sqrt{10}}$
4.8 ± 2.262 × 1.445
1.53 mm Hg to 8.07 mm Hg
Notes
s = 4.57.
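The paired analysis works on the ten after-minus-before differences. The sketch below recomputes the mean difference, its SD, and the 95% interval from the blood pressure data; the t multiplier 2.262 (df = 9) is taken from the t table, and the variable names are illustrative.

```python
import math
import statistics

# Blood pressures before and after oral contraceptive (OC) use, same 10 women
before = [115, 112, 107, 119, 115, 138, 126, 105, 104, 115]
after = [128, 115, 106, 128, 122, 145, 132, 109, 102, 117]

# Pairing reduces the problem to a single sample of differences
diffs = [a - b for a, b in zip(after, before)]

mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)
sem = sd_diff / math.sqrt(len(diffs))

t_crit = 2.262  # table value for df = 9, 95% confidence
low = mean_diff - t_crit * sem
high = mean_diff + t_crit * sem

print(f"mean difference = {mean_diff}, s = {sd_diff:.2f}, SEM = {sem:.3f}")
print(f"95% CI: {low:.2f} to {high:.2f} mm Hg")
```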
149
150
1.53
8.07
population average
152
The Hypotheses
$H_0: \mu = 0$
The alternative hypothesis $H_1$
Typically represents what you are trying to prove.
For example, oral contraceptives affect blood
pressure:
$H_1: \mu \neq 0$
153
154
The p-value
155
http://xkcd.com/892/
156
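$t = \frac{\text{sample mean} - 0}{\mathrm{SEM}}$
$t = \frac{4.8}{4.57/\sqrt{10}} = \frac{4.8}{1.45} = 3.31$
$Z = \frac{\text{observation} - \text{mean}}{\mathrm{SD}}$
$t = \frac{\bar{X} - \mu_0}{\mathrm{SEM}}$
Z                     t
observation           sample mean
standard deviation    standard error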
mean
0 because
158
be made
n 1 degrees of freedom
The p-values gets a little larger
3.31
159
160
161
162
. We reject H0
163
164
Summary: Paired t-test
$t = \frac{\bar{X}_d - 0}{\mathrm{SEM}} = \frac{\bar{X}_d - 0}{s_d/\sqrt{n}}$
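The paired t statistic can be sketched from the summary numbers alone. The inputs (mean difference 4.8, SD 4.57, n = 10) come from the oral contraceptive example; the function name is illustrative, and the comparison uses the table value 2.262 for df = 9 rather than an exact p-value.

```python
import math

def paired_t_statistic(mean_diff, sd_diff, n, null_value=0.0):
    """t = (mean difference - null value) / (SD of differences / sqrt(n))."""
    sem = sd_diff / math.sqrt(n)
    return (mean_diff - null_value) / sem

t = paired_t_statistic(mean_diff=4.8, sd_diff=4.57, n=10)

t_crit = 2.262  # table value for df = 9, two-sided alpha = .05
reject_h0 = abs(t) > t_crit

print(f"t = {t:.2f}")
print("reject H0 at the .05 level:", reject_h0)
```

The result is about 3.3; the small difference from the 3.31 shown earlier comes from rounding the SEM to 1.45 before dividing.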
165
166
Statistically significant
The p-value is less than a preset threshold value, $\alpha$.
Do Not Say
The result is statistically significant
The result is statistically significant at $\alpha$ = .05
The result is significant (p < .05)
167
168
parameter
here are two possibilities for the truth, data help me choose one
169
170
1.53
that the p-value is less than .05, but it doesn't tell us that it
is p = .009
8.07
Why?
171
172
n = 100, 000
= .03 mmHg
X
blood pressure.
s = 4.57
the explanation.
p-value
= .04
174
n = 5
= 5.0 mmHg
X
Do not reject H0
s = 4.57
Accept H0
p-value
Claim H0 is true
= .07
176
Note
Infants
2 independent groups
How do we calculate
Confidence interval for difference
p-value to determine if the difference in two groups is
significant
2-sample (unpaired) t-test
Treatment n = 85
                             Control    Tx
n                            84         85
Mean stool output (ml/kg)    260        182
Standard deviation (s)       254        197
178
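Generic CI formula:
estimate ± 1.96 SE
$(\bar{X}_1 - \bar{X}_2) \pm 1.96\,SE(\bar{X}_1 - \bar{X}_2)$
Principle: Variation from independent sources can be added
$\mathrm{Variance}(\bar{X}_1 - \bar{X}_2) = (SE(\bar{X}_1))^2 + (SE(\bar{X}_2))^2$
$SE(\bar{X}_1 - \bar{X}_2) = \sqrt{(SE(\bar{X}_1))^2 + (SE(\bar{X}_2))^2}$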
180
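                             Control    Tx
n                            84         85
Mean stool output (ml/kg)    260        182
Standard deviation (s)       254        197
$SE(\bar{X}_1 - \bar{X}_2) = \sqrt{(254/\sqrt{84})^2 + (197/\sqrt{85})^2} = \sqrt{27.71^2 + 21.37^2} = 34.94$
$78 \pm 1.96\,SE(\bar{X}_1 - \bar{X}_2)$
78 ± 1.96 × 34.94
78 ± 68.48
9 to 147
Note
The confidence interval does not include 0.
Thus, p < .05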
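The two-group interval can be recomputed from the summary table. This is a sketch with an illustrative function name; carrying full precision gives a standard error near 34.99, marginally different from the 34.94 shown, which does not change the rounded interval of 9 to 147.

```python
import math

def two_sample_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """95% CI for a difference in means of two independent groups."""
    se1 = sd1 / math.sqrt(n1)
    se2 = sd2 / math.sqrt(n2)
    # Variation from independent sources adds on the variance scale
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
    diff = mean1 - mean2
    return diff, se_diff, (diff - z * se_diff, diff + z * se_diff)

# Control: n = 84, mean 260, SD 254; Treatment: n = 85, mean 182, SD 197
diff, se_diff, (low, high) = two_sample_ci(260, 254, 84, 182, 197, 85)
t = diff / se_diff

print(f"difference = {diff} ml/kg, SE = {se_diff:.2f}")
print(f"95% CI: {low:.0f} to {high:.0f} ml/kg")
print(f"t = {t:.2f}")
```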
181
182
2.23
2.23
$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 \neq \mu_2$
$t = \frac{\text{difference in means} - 0}{\text{SE of the difference}}$
$t = \frac{260 - 182}{34.94} = \frac{78}{34.94} = 2.23$
183
Diarrhea and Pepto Bismol: Summary
Objective
Advantages
Result The mean stool outputs in the treated and control groups were
182 and 260 ml/kg respectively. The control group stool output was
significantly higher than that of the treated group (p = .03). The
control group mean was 78 ml/kg higher than the treated group mean
(95% confidence interval 9 to 147 ml/kg).
186
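Only 5 individuals in each sample
Randomize
Intervention (I): 5, 0, 7, 2, 19
Control (C): 6, -5, -6, 1, 4
Rank:     1    2    3    4    5    6    7    8    9    10
Group:    C    C    I    C    I    C    I    C    I    I
Average rank, Intervention (I): $\frac{3+5+7+9+10}{5} = 6.8$
Average rank, Control (C): $\frac{1+2+4+6+8}{5} = 4.2$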
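The ranking step can be sketched in Python. The score changes are from the example; the p-value below uses the normal approximation to the rank-sum statistic, which is an assumption on my part about how such a p-value might be obtained, not necessarily the method behind the slides. The data have no ties, so simple ranking is enough.

```python
import math

intervention = [5, 0, 7, 2, 19]
control = [6, -5, -6, 1, 4]

# Rank all 10 observations together (1 = smallest); there are no ties here
pooled = sorted(intervention + control)
rank_of = {value: position + 1 for position, value in enumerate(pooled)}

ranks_i = [rank_of[v] for v in intervention]
ranks_c = [rank_of[v] for v in control]
mean_rank_i = sum(ranks_i) / len(ranks_i)
mean_rank_c = sum(ranks_c) / len(ranks_c)

# Normal approximation to the rank-sum statistic (assumed method)
n1, n2 = len(intervention), len(control)
w = sum(ranks_i)
expected_w = n1 * (n1 + n2 + 1) / 2
sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (w - expected_w) / sd_w
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print("intervention ranks:", sorted(ranks_i), "mean", mean_rank_i)
print("control ranks:", sorted(ranks_c), "mean", mean_rank_c)
print(f"z = {z:.2f}, two-sided p = {p_two_sided:.2f}")
```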
p-value calculations:
187
188
Note
Result The median score change was four points higher in the
intervention group than in the control group. The difference in
test score improvements between the intervention and control
groups was not statistically significant (p = .17)
189
190
Statisticians will not always agree, but there are some guidelines:
Use nonparametric test if sample size is small and distribution
191
192
F -test
A p-value is then calculated
Are there any differences among the populations?
t-tests (pairwise)
That could be a lot of statistical testing!
Instead, perform an ANOVA
Nonsmokers (NS)
194
Group     Group                Mean FEF    sd FEF
number    name       n         (L/s)       (L/s)
1         NS         200       3.78        0.79
2         PS         200       3.30        0.77
3         NI         50        3.32        0.86
4         LS         200       3.23        0.78
5         MS         200       2.73        0.81
6         HS         200       2.59        0.82
[Figure: Mean FEF ± 2 SE by Group (NS, PS, NI, LS, MS, HS); y-axis: FEF (L/s)]
you whether there were any differences amongst the six groups
195
196
Statistics
Methods 200 men were randomly selected from each of five smoking
classification groups (non-smoker, passive smokers, light
smokers, moderate smokers, and heavy smokers), as well as 50
men classified as non-inhaling smokers for a study designed to
analyze the relationship between smoking and respiratory
function
Results
197
198
$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$
$H_1$: at least one mean is different
The variation in the sample means between groups is compared to
the variation within a group.
Age
< 20
20
3.3
3.1
2.9
FEF (L/s)
3.5
3.7
n
97
88
Sample
Mean
17.8
24.6
2.7
2.5
NS
PS
NI
LS
MS
HS
Group
If the between group variation is a lot bigger than the within group
variation, that suggests there are some differences among the
populations.
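The comparison of between-group and within-group variation can be sketched from the summary table of the six smoking groups. This is an illustration of the one-way ANOVA F ratio computed from group sizes, means, and SDs; the function name is illustrative, and a p-value would still have to be read from an F distribution.

```python
# (name, n, mean FEF, sd FEF) for the six smoking groups
groups = [
    ("NS", 200, 3.78, 0.79),
    ("PS", 200, 3.30, 0.77),
    ("NI", 50, 3.32, 0.86),
    ("LS", 200, 3.23, 0.78),
    ("MS", 200, 2.73, 0.81),
    ("HS", 200, 2.59, 0.82),
]

def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)
    total_n = sum(n for _, n, _, _ in groups)
    grand_mean = sum(n * mean for _, n, mean, _ in groups) / total_n

    # Between groups: how far the group means are from the grand mean
    ss_between = sum(n * (mean - grand_mean) ** 2 for _, n, mean, _ in groups)
    # Within groups: pooled variation of individuals about their own group mean
    ss_within = sum((n - 1) * sd ** 2 for _, n, _, sd in groups)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (total_n - k)
    return ms_between / ms_within

f_ratio = one_way_anova_f(groups)
print(f"F = {f_ratio:.1f}")
```

An F ratio far above 1 means the group means differ by much more than within-group scatter would explain.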
http://www.ruf.rice.edu/~lane/stat_sim/one_way/index.html
199
200
Notes on Design
9 infected
infants
Placebo
n = 127
31 infected
infants
Random assignment of Tx
Helps ensure the 2 groups are comparable
Patient & physician could not request particular Tx
Double blind
Patient & physician did not know Tx assignment
Randomize
Definition of infection
Two positive cultures (infant > 32 weeks)
201
202
AZT: 9/121 = .074 (7.4%), 95% CI .03 to .14
Placebo: 31/127 = .244 (24.4%), 95% CI .17 to .32
Note
These are NOT the true population parameters for the
transmission rates.
There is sampling variability
204
Hypothesis Testing
$H_0: p_1 = p_2$
$H_1: p_1 \neq p_2$
HIV transmission
(infected)    AZT    Placebo    Total
Yes           9      31         40
No            112    96         208
Total         121    127        248
205
206
208
HIV-AZTSummary
209
O = observed
E = expected
Expected refers to the values for the cell counts that would be
expected if the null hypothesis is true
$\chi^2 = \sum_{4\ \text{cells}} \frac{(O - E)^2}{E}$
211
212
HIV transmission
(infected)
AZT
Placebo
Yes
31
40
No
112
96
208
121
127
248
Observed = 9
3.84
Expected = $\frac{121 \times 40}{248}$ = 19.52
213
214
Observed
           HIV Yes    HIV No    Total
AZT        9          112       121
Placebo    31         96        127
Total      40         208       248
Expected
           HIV Yes    HIV No    Total
AZT        19.52      101.48    121
Placebo    20.48      106.52    127
Total      40         208       248
$\chi^2 = 13.19$
$\sqrt{13.19} = 3.63$
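The chi-square calculation for the AZT table can be sketched as follows. The expected counts are row total times column total divided by the grand total; the p-value line uses the fact that, with one degree of freedom, the chi-square statistic is the square of a standard normal Z. The function name is illustrative.

```python
import math

# Observed counts: rows are AZT and Placebo, columns are infected Yes and No
observed = [[9, 112],
            [31, 96]]

def chi_square_2x2(table):
    """Sum of (O - E)^2 / E over the 4 cells, with E = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected
    return chi2

chi2 = chi_square_2x2(observed)
# With 1 degree of freedom, P(chi-square > x) = P(|Z| > sqrt(x))
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi-square = {chi2:.2f}")
print(f"p-value = {p_value:.4f}")
```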
216
This table assumes that you have one degree of freedom, the case when analyzing a
2×2 table:
2
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
2.0
2.1
2.2
2.3
2.4
P
1.0000
0.7518
0.6547
0.5839
0.5271
0.4795
0.4386
0.4028
0.3711
0.3428
0.3173
0.2943
0.2733
0.2542
0.2367
0.2207
0.2059
0.1923
0.1797
0.1681
0.1573
0.1473
0.1380
0.1294
0.1213
2
2.5
2.6
2.7
2.8
2.9
3.0
3.1
3.2
3.3
3.4
3.5
3.6
3.7
3.8
3.9
4.0
4.1
4.2
4.3
4.4
4.5
4.6
4.7
4.8
4.9
P
0.1138
0.1069
0.1003
0.0943
0.0886
0.0833
0.0783
0.0736
0.0693
0.0652
0.0614
0.0578
0.0544
0.0513
0.0483
0.0455
0.0429
0.0404
0.0381
0.0359
0.0339
0.0320
0.0302
0.0285
0.0269
2
5.0
5.1
5.2
5.3
5.4
5.5
5.6
5.7
5.8
5.9
6.0
6.1
6.2
6.3
6.4
6.5
6.6
6.7
6.8
6.9
7.0
7.1
7.2
7.3
7.4
P
0.0253
0.0239
0.0226
0.0213
0.0201
0.0190
0.0180
0.0170
0.0160
0.0151
0.0143
0.0135
0.0128
0.0121
0.0114
0.0108
0.0102
0.0096
0.0091
0.0086
0.0082
0.0077
0.0073
0.0069
0.0065
2
7.5
7.6
7.7
7.8
7.9
8.0
8.1
8.2
8.3
8.4
8.5
8.6
8.7
8.8
8.9
9.0
9.1
9.2
9.3
9.4
9.5
9.6
9.7
9.8
9.9
P
0.0062
0.0058
0.0055
0.0052
0.0049
0.0047
0.0044
0.0042
0.0040
0.0038
0.0036
0.0034
0.0032
0.0030
0.0029
0.0027
0.0026
0.0024
0.0023
0.0022
0.0021
0.0019
0.0018
0.0017
0.0017
2
10.0
10.1
10.2
10.3
10.4
10.5
10.6
10.7
10.8
10.9
11.0
11.1
11.2
11.3
11.4
11.5
11.6
11.7
11.8
11.9
12.0
12.1
12.2
12.3
12.4
P
0.0016
0.0015
0.0014
0.0013
0.0013
0.0012
0.0011
0.0011
0.0010
0.0010
0.0009
0.0009
0.0008
0.0008
0.0007
0.0007
0.0007
0.0006
0.0006
0.0006
0.0005
0.0005
0.0005
0.0005
0.0004
2
12.5
12.6
12.7
12.8
12.9
13.0
13.1
13.2
13.3
13.4
13.5
13.6
13.7
13.8
13.9
14.0
14.1
14.2
14.3
14.4
14.5
14.6
14.7
14.8
14.9
P
0.0004
0.0004
0.0004
0.0003
0.0003
0.0003
0.0003
0.0003
0.0003
0.0003
0.0002
0.0002
0.0002
0.0002
0.0002
0.0002
0.0002
0.0002
0.0002
0.0001
0.0001
0.0001
0.0001
0.0001
0.0001
217
218
HIV
transmission    AZT    Placebo    Total
Yes             2      8          10
No              28     24         52
Total           30     32         62
p = .083
219
220
Note
Relative Risk
Ratio of Proportions:
Relative risk = p1 /p2
AZT Example
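As a sketch, the relative risk for the AZT trial can be computed from the counts given earlier (9 of 121 infants infected on AZT, 31 of 127 on placebo). The function name is illustrative.

```python
def relative_risk(events1, n1, events2, n2):
    """Ratio of proportions p1 / p2."""
    p1 = events1 / n1
    p2 = events2 / n2
    return p1 / p2

# AZT: 9 of 121 infected; Placebo: 31 of 127 infected
rr = relative_risk(9, 121, 31, 127)
print(f"relative risk = {rr:.2f}")
```

A relative risk below 1 means the risk in the first group (AZT) is lower than in the second (placebo).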
221
222
A big relative risk does not necessarily mean that the p-value is
small.
224
Scatter plot
Correlation coefficient
1
2
3
4
5
6
7
8
225
226
3.2
3.0
2.8
65
r = 1
r =0
227
r = .7
r =1
70
60
2.6
3.4
r = .7
228
Perfect Positive
Uncorrelated
$-1 \le r \le 1$
Corr(X, Y) = r
r
r
r
r
r
Weak Positive
r = 0: no linear association
Weak Negative
229
230
Correlation Slider:
noppa5.pc.helsinki.fi/koe/corr/cor7.html
Correlation Guessing Game:
http://istics.net/stat/Correlations/
r =0
231
232
Sensitive to outliers
X = height
Y = weight
Anscombe's Data
233
234
3.2
3.0
2.8
2.6
3.4
60
65
70
r = .76
235
236
3.6
3.4
2.6
3.0
2.8
3.2
2.4
2.2
linear regression
55
60
65
70
75
237
238
Regression by Eye
vertical deviations
Least squares line
www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html
variable (X )
Each distance is $y_i - \hat{y}_i = y_i - (a + b x_i)$
this is computed for each data point in the sample
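The least squares line minimizes the sum of the squared vertical deviations. The sketch below uses the usual closed-form solution on a small made-up data set; the data values and function names are hypothetical and are not the plasma volume data from the slides.

```python
def least_squares(xs, ys):
    """Return intercept a and slope b that minimize the sum of squared vertical deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def residuals(xs, ys, a, b):
    """Vertical deviations y_i - (a + b * x_i), one per data point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
res = residuals(xs, ys, a, b)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print("sum of squared deviations:", round(sum(r ** 2 for r in res), 4))
```

For the fitted line the deviations sum to zero, with some points above the line and some below.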
239
240
Y
Intercept The expected value of Y when X is 0
Warning: The intercept is not always easily
interpretable.
Only meaningful if X can take the value 0.
Is body weight ever really 0?
b
1
a
X
241
242
$Y = a + bX + \varepsilon$
Plasma volume = 0.0857 + .0436 × weight
$\varepsilon$ is
Noise
Error
Scatter
Assumptions About $\varepsilon$
Random noise
Sometimes positive, sometimes negative
244
Prediction
Estimated Slope
Versus
Population Parameter Slope
(n = 8) of data.
$\hat{Y}$ = 0.0857 + .0436 × 60 = 2.7 liters
2
245
246
Example
Slope = 0.0436
Standard error of slope = 0.0153
(.0062, .081)
248
Notes
b = 0 is not plausible; thus we would reject $H_0: b = 0$ at the $\alpha$ = .05 level
Hypothesis Testing
$H_0: b = 0$ (slope is zero)
$H_1: b \neq 0$
test statistic = $t = \frac{\text{slope} - 0}{SE(\text{slope})} = \frac{0.0436}{0.0153} = 2.85$
p-value = .03
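The slope inference can be sketched from the two reported numbers. The slope and its standard error are from the example; the degrees of freedom (n - 2 = 6 for n = 8 points) and the table value 2.447 are my assumptions for a simple linear regression, and they reproduce the interval shown.

```python
slope = 0.0436
se_slope = 0.0153
n = 8  # number of data points in the example

# Assumed: df = n - 2 = 6 for simple linear regression, t table value 2.447
t_crit = 2.447

t_stat = (slope - 0) / se_slope
low = slope - t_crit * se_slope
high = slope + t_crit * se_slope

print(f"t = {t_stat:.2f}")
print(f"95% CI for the slope: ({low:.4f}, {high:.3f})")
print("reject H0 (b = 0) at the .05 level:", abs(t_stat) > t_crit)
```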
251
252
Beware of extrapolation
253
254
BUT
association
The correlation coefficient is scaled between -1 and +1
The correlation coefficient measures how close points fall on a
linear line
255
256
What's a Good $R^2$?
data
Idea
individuals in a sample/population
individual Y -values vary less about their estimated means
based on X
Example
r = .76
$R^2 = .76^2 = .58$
Caveat
258
Observation
Negative correlation between death rates from ovarian cancer and
family size based on 20 countries?
Death Rate
from
Drowning
Faulty Interpretation?
Does this mean that having a large family
will protect you from ovarian cancer?
Amount of Ice Cream Sold
259
260
Spurious Correlation
Dietary Salt
261
262
Y
X1
X2
$y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + \varepsilon$
Data
Parameters
dependent variable
b0
intercept
X1
b1
X2
second independent
b2
X3
b3
X5 = Age
X6 = Gender
263
264
confounding variables
To develop models to predict the expected value of Y from
the X s
265
Example:
Is there an association between hemoglobin and packed cell volume?
Hemoglobin level, packed cell volume, age, and menopausal status for 20 women
Subject Number    Hb (g/dl)    PCV (%)    Age (yrs)    Menopause (0=No, 1=Yes)
1                 11.1         35         20           0
2                 10.7         45         22           0
3                 12.4         47         25           0
4                 14.0         50         28           0
5                 13.1         31         28           0
6                 10.5         30         31           0
7                 9.6          25         32           0
8                 12.5         33         35           0
9                 13.5         35         38           0
10                13.9         40         40           0
11                15.1         45         45           1
12                13.9         47         49           0
13                16.2         49         54           1
14                16.3         42         55           1
15                16.8         40         57           1
16                17.1         50         60           1
17                16.6         46         62           1
18                16.9         55         63           1
19                15.7         42         65           1
20                16.5         46         67           1
Source: Pan data from Campbell et al. (1985)
[Figure: scatter plots of Hb (g/dl) against PCV (%), Hb (g/dl) against Age (years), and PCV (%) against Age (years)]
Interpretation
Variable            Regression coefficient    SE       t-value    p-value
Constant ($b_0$)    5.24                      1.21     4.34       0.0004
Age ($b_1$)         0.110                     0.016    6.74       0.0001
PCV ($b_2$)         0.097                     0.033    2.98       0.0085
271
272
A parameter estimate
(regression coefficient)
has an associated standard error
(SE)
$H_0: b = 0$
$H_1: b \neq 0$
Test statistic: $t = \frac{b}{SE}$
Example: df = 20 3 = 17
To test H0 : b2 = 0
size
$t = \frac{.097}{.033} = 2.98$
p-value is .0085
273
Example
274
Solution
$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$
$X_1$ = Age (years)
$X_2$ = Menopause = 0 if NO (pre), 1 if YES (post)
275
276
Note
post-menopausal
277
Results:
MLR of Hb against age and menopausal status
278
$b_2 \pm t \times \mathrm{S.E.}$
Variable              Regression coefficient    SE       t-value    p-value
Constant ($b_0$)      9.74                      1.11     8.77       <0.001
Age ($b_1$)           0.081                     0.033    2.41       0.03
Menopausal ($b_2$)    1.88                      1.03     1.82       0.09
Interpretation
After accounting for age, post-menopausal women have, on
average, hemoglobin levels about 1.88 g/dl higher than
pre-menopausal women (95% CI is -0.28 to 4.08, p = .09)
279
280
association
281
                      Truth
Decision              H0              H1
Reject H0             Type I Error    Power
Fail to reject H0     -               Type II Error
283
284
Statistical Power
Power
Which is harder?
To detect very small differences
To find a large (obvious) difference
287
288
Effect of $\alpha$
or How certain we want to be to avoid type I error
292
$H_0: \mu_{OC} = \mu_{NO\,OC}$
$H_1: \mu_{OC} \neq \mu_{NO\,OC}$
Effect size
Variability in measurements
Chosen significance level ($\alpha$)
Sample size
Pilot Study
                n     Sample Mean systolic BP    Sample SD (s)
OC users        8     132.8                      15.3
Non-OC users    21    127.4                      18.2
But...
But...
294
Design
132.8 - 127.4 = 5.4
This could be considered scientifically significant, however, the
-9.5 to +20.3
Using the pilot data, we estimate that the standard deviations are
15.3 and 18.2 in OC and non-OC users respectively
295
296
Specify
$\alpha$-level of the test: probability of type I error
p-values below this are called statistically significant
each group...
297
298
What if sample size calculation yields group sizes that are too big?
Increase minimum difference of interest
Increase $\alpha$-level
Decrease SD???
Sample size
299
300
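Formulae
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
$SE(\bar{X}) = \frac{s}{\sqrt{n}}$
$SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
$\chi^2 = \sum_{4\ \text{cells}} \frac{(O - E)^2}{E}$
Planning
Design
Data Collection
Data Analysis
Presentation
Interpretation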
301
Types of Data
302
Measures of Center
Binary data
Categorical data
Continuous data
Survival data
303
304
Measures of Spread
Pictures of Data
305
Shapes of Distributions
306
Normal Distribution
Right skew
Left skew
Symmetric
Uniform
Rule?
Bimodal
307
308
Constructing Intervals
Z -score =
Measures
309
Sampling Distribution
310
Standard Error
when multiple
311
312
Confidence Intervals
95% of the time, the population mean will lie within about two
standard errors of the sample mean.
313
Hypothesis Testing
314
Step 0:
Situation
Parameter
H0
Statistic
SE
Distribution
Step 1:
Step 2:
Step 3:
Step 4:
315
316
Type I error
Type II error
Power
p-value
317
Power
318
Sample Size
Parameter
versus
Statistic
Difference to Detect
Variability
Significance level ($\alpha$)
319
320
Correlation
r = -1
r <0
r =0
r >0
Assumptions
r =1
CAUTION:
Prediction
Coefficient of Determination
321
Inference on Slope
322