
Chapter 3-14.
Case-Cohort Study Design
Comparison of Three Relative Effect Measures in Cohort Studies (Risk Ratio, Rate Ratio,
and Hazard Ratio)
This comparison, pages 1 to 6, was presented in Chapter 11. It is repeated here for review. The
topic of this chapter is how to design and analyze case-control studies to obtain the same types
of effect estimates.
For illustration, we will use the following data in life table format from a hypothetical cohort
study.
Life Table of Hypothetical Data

            ---------- Exposed ----------    -------- Non-Exposed --------
Follow-up   Begin   Disease   Day-Specific    Begin   Disease   Day-Specific    Day-Specific
day         N       Cases     Risk            N       Cases     Risk            Risk Ratio
1           50      5         0.10            50      2         0.04            2.5
2           30      10        0.33            40      8         0.20            1.7
3           10      10        1.00            20      10        0.50            2.0
totals      90      25                        110     20
These data can be entered into Stata using the following commands in the gregchapter4.do do-file.
clear
input day exposure disease count
1 1 1 5
1 1 0 15
2 1 1 10
2 1 0 10
3 1 1 10
1 0 1 2
1 0 0 8
2 0 1 8
2 0 0 12
3 0 1 10
3 0 0 10
end
drop if count==0
expand count
drop count
_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2010.
Chapter 3-14 (revision 16 May 2010) p. 1

To verify the data were correctly entered, we request a life table using
ltable day disease ,by(exposure) noadjust intervals(1) hazard
                 Beg.      Cum.      Std.                Std.
   Interval     Total    Failure    Error     Hazard    Error     [95% Conf. Int.]
-------------------------------------------------------------------------------
exposure 0
    1     2        50     0.0400    0.0277    0.0400    0.0283    0.0048    0.1114
    2     3        40     0.2320    0.0646    0.2000    0.0707    0.0863    0.3606
    3     4        20     0.6160    0.0917    0.5000    0.1581    0.2398    0.8542
exposure 1
    1     2        50     0.1000    0.0424    0.1000    0.0447    0.0325    0.2048
    2     3        30     0.4000    0.0825    0.3333    0.1054    0.1598    0.5695
    3     4        10     1.0000         .    1.0000    0.3162    0.4795    1.7085
-------------------------------------------------------------------------------
which agrees with the original table.

Risk Ratio Analysis
This type of analysis ignores time-at-risk. For that reason, it assumes an equal follow-up time for
every study subject.

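            ---------- Exposed ----------    -------- Non-Exposed --------
Follow-up   Begin   Disease   Day-Specific    Begin   Disease   Day-Specific    Day-Specific
day         N       Cases     Risk            N       Cases     Risk            Risk Ratio
1           50      5         0.10            50      2         0.04            2.5
2           30      10        0.33            40      8         0.20            1.7
3           10      10        1.00            20      10        0.50            2.0
totals      90      25                        110     20
The risk ratio analysis uses only part of the information in the life table: the number of subjects
at the start of follow-up (the day-1 Begin N) and the total number of cases in each group.

Risk Ratio Analysis Data
               Exposed      Non-Exposed
Disease        25 (50%)     20 (40%)
Non-Disease    25           30
N              50           50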
cs disease exposure
| exposure |
| Exposed Unexposed | Total
-----------------+------------------------+----------
Cases | 25 20 | 45
Noncases | 25 30 | 55
-----------------+------------------------+----------
Total | 50 50 | 100
| |
Risk | .5 .4 | .45
| |
| Point estimate | [95% Conf. Interval]
|------------------------+----------------------
Risk difference | .1 | -.0940265 .2940265
Risk ratio | 1.25 | .8064465 1.937512
Attr. frac. ex. | .2 | -.2400079 .4838742
Attr. frac. pop | .1111111 |
+-----------------------------------------------
chi2(1) = 1.01 Pr>chi2 = 0.3149
Analyzing these data in this way, we do not demonstrate a significant effect (RR=1.25, p=0.315).
In fact, this crude RR underestimates each of the day-specific RR estimates.
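As a quick check on the crude risk ratio, the same 2 x 2 counts can be passed directly to Stata's immediate command csi, without building the dataset. This is a minimal sketch that assumes the counts from the Risk Ratio Analysis Data table; by hand, the calculation is (25/50)/(20/50) = 1.25.

```stata
* cells, in order: exposed cases, unexposed cases,
*                  exposed noncases, unexposed noncases
csi 25 20 25 30
display "crude risk ratio = " r(rr)
```

The csi command leaves the risk ratio in r(rr), so it can be reused in later calculations.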

Rate Ratio Analysis
This type of analysis uses time-at-risk, but in a crude way. It does not assume an equal follow-up
time for each study subject. It assumes, however, that risk is constant across the follow-up time.

            ---------- Exposed ----------    -------- Non-Exposed --------
Follow-up   Begin   Disease   Day-Specific    Begin   Disease   Day-Specific    Day-Specific Risk
day         N       Cases     Risk            N       Cases     Risk            (and Rate*) Ratio
1           50      5         0.10            50      2         0.04            2.5
2           30      10        0.33            40      8         0.20            1.7
3           10      10        1.00            20      10        0.50            2.0
totals      90      25                        110     20
* Since each interval is one day, the day-specific person-time is just the day-specific beginning
sample size, and so the day-specific risk is also the day-specific rate.
The rate ratio analysis uses only part of the information in the life table: the total number of
cases and the total person-days (the sum of the Begin N column) in each group.

Rate Ratio Analysis Data
               Exposed      Non-Exposed
Disease        25 (50%)     20 (40%)
Person-Days    90           110
ir disease exposure day
| exposure |
-----------------+------------------------+----------
disease | 25 20 | 45
day | 90 110 | 200
-----------------+------------------------+----------
| |
Incidence Rate | .2777778 .1818182 | .225
| |
|------------------------+----------------------
Inc. rate diff. | .0959596 | -.0389695 .2308887
Inc. rate ratio | 1.527778 | .8147248 2.900724 (exact)
Attr. frac. ex. | .3454545 | -.2274083 .6552584 (exact)
Attr. frac. pop | .1919192 |
+-----------------------------------------------
(midp) Pr(k>=25) = 0.0800 (exact)
(midp) 2*Pr(k>=25) = 0.1599 (exact)
Analyzing these data in this way, we come closer to demonstrating a significant effect (IRR=1.53,
one-sided mid-p=0.080, two-sided mid-p=0.160). Notice again that this crude IRR underestimates
each of the day-specific IRR estimates.
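The crude rate ratio can be checked the same way with the immediate command iri, which takes the case counts and the person-time totals. This is a minimal sketch that assumes the counts from the Rate Ratio Analysis Data table; by hand, the calculation is (25/90)/(20/110) = 1.53.

```stata
* cells, in order: exposed cases, unexposed cases,
*                  exposed person-days, unexposed person-days
iri 25 20 90 110
display "crude rate ratio = " r(irr)
```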

Hazard Ratio Analysis (Survival Analysis)
This type of analysis uses time-at-risk in a very complete way, using all of the information from
the life table. It does not assume an equal follow-up time for each study subject. It allows for
and models a changing risk across the follow-up time.
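            ---------- Exposed ----------    -------- Non-Exposed --------
Follow-up   Begin   Disease   Day-Specific    Begin   Disease   Day-Specific    Day-Specific
day         N       Cases     Risk            N       Cases     Risk            Risk Ratio
1           50      5         0.10            50      2         0.04            2.5
2           30      10        0.33            40      8         0.20            1.7
3           10      10        1.00            20      10        0.50            2.0
totals      90      25                        110     20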
Analyzing these data using survival analysis,
stset day ,failure(disease==1)

stcox exposure
Cox regression -- Breslow method for ties
No. of subjects = 100 Number of obs = 100

No. of failures = 45
Time at risk = 200
LR chi2(1) = 4.65
Log likelihood = -174.40643 Prob > chi2 = 0.0310
------------------------------------------------------------------------------
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
exposure | 1.916208 .5796004 2.15 0.032 1.059198 3.466632
------------------------------------------------------------------------------
This time, we observe a significant effect (HR = 1.92, p = 0.032).

Inefficient Use of Time in Rate Ratio Analysis
The rate ratio analysis only considers the ratio of cases to average person-time, without
distinguishing times to event and times to censored.
person-time = total time for subjects
            = mean time × N
Suppose the individual times-at-risk for a sample are: 10, 20, and 30. The person-time is
computed as:
PT = total time for subjects
   = 10 + 20 + 30
   = 60
which is equivalent to
PT = mean time × N
   = (10 + 20 + 30)/3 × 3
   = 20 × 3
   = 60
Thus, suppose that in one study group the events occur early and the censoring occurs late, while
in the other study group an equal number of events occur late and the censoring occurs early. The
person-time could then be equal for the two study groups, and we would erroneously conclude
that the rates do not differ (rate ratio = 1) between the two groups.
(let x----x denote time)
Group A
x-------------------------------------x (censored)
x-----x (died)
x--------x (died)
x--------------------------------------------x (censored)
Group B
x-------------------------------------x (died)
x-----x (censored)
x--------x (censored)
x--------------------------------------------x (died)
In this example, the person-time is equal and the rate ratio = 1, yet clearly Group A shows greater
risk for death (Group B survives longer).
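The diagram can be turned into a small numeric check. This is a minimal sketch in which the follow-up times (40, 6, 9, and 45) are hypothetical values chosen only to mimic the line lengths in the diagram. Each group contributes 2 deaths over 100 units of person-time, so the crude rate ratio is exactly 1 even though the deaths in Group A occur much earlier.

```stata
clear
* group: 1 = Group A, 0 = Group B; died: 1 = died, 0 = censored
* the times are hypothetical, not taken from the course data
input group time died
1 40 0
1  6 1
1  9 1
1 45 0
0 40 1
0  6 0
0  9 0
0 45 1
end
ir died group time      // 2 deaths per 100 person-time units in each group
display "rate ratio = " r(irr)
```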
Conclusion: Cox regression is sensitive to a changing risk across time, while Poisson regression
(or a rate ratio analysis) is not. Usually both approaches beat the risk ratio approach, since they
have the advantage of using more information, namely time, in the analysis.

How This Relates to Case-Control Studies
It would be nice if we could gain the additional power of a hazard ratio analysis somehow in a
case-control study.
In a case-control study, we do not use time in the analysis. It would seem, then, that a case-
control study can do no better than the risk ratio analysis from a cohort study, which also does
not use time in the analysis.
If the rare-disease assumption is met, the ordinary case-control study OR is approximately the
RR that would be obtained in a cohort study.
However, if a case-cohort study design is used, which we will see how to design in this chapter,
the OR provides an unbiased estimate of the RR, without the need for the rare-disease
assumption.
Furthermore, if a density case-control study design is used, which is also referred to by many as a
case-cohort study design, the OR provides an unbiased estimate of the HR, also without the need
for the rare-disease assumption. Thus, we can obtain the benefit of a HR analysis, which
improves our chances of demonstrating an exposure-disease association.
We will see how to conduct the sampling required for these three variants of the case-control
study design, use simulation to demonstrate what the OR estimates, and explain why it estimates
these effect measures.

Dataset
We will use the dataset found in Breslow and Day (1987, Appendix VIII and Appendix ID).
Men (n=679) employed in a nickel refinery in South Wales were investigated to determine
whether the risk of developing carcinoma of the bronchi and nasal sinuses (ICD = 160), which
had been associated with the refining of nickel from previous studies in the 1930s, was present in
this cohort. The data are in the file nickelrefinary.csv. The variables are:
CaseID Case ID
PrimaryICD Primary ICD Code
Exposure Nickel Exposure Level
DateBirth Date of Birth
AgeEmp Age First Employed
AgeBegFol Age Follow-up Began
AgeEndFol Age at Death or Loss
Executing the following commands in the do-file editor, we set up the variables and save them to
a new file (this has already been done, with nickelrefinary.dta in the datasets & do-files
subdirectory).
* -- set up variables -------------------------------
clear
set mem 10M
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "IntroEpiCourse\datasets & do-files"
insheet using nickelrefinary.csv
* -- set up tumor disease variable

gen tumor = cond(primaryicd==160,1,0)
replace tumor = . if primaryicd == .
label define tumorlab 0 "0. no tumor" 1 "1. tumor"
label values tumor tumorlab
label variable tumor "carcinoma of the bronchi and nasal sinuses"
tab tumor
* -- set up nickel exposure variable

gen nickel = cond(exposure>0,1,0)
replace nickel = . if exposure == .
label define nickellab 0 "0. no exposure" 1 "1. exposure"
label values nickel nickellab
label variable nickel "occupational exposure to nickel"
tab nickel
* -- set up time-at-risk variable

gen timerisk = ageendfol - agebegfol
label variable timerisk "time at risk"
sum timerisk
save nickelrefinary , replace
* -- end set up variables ---------------------------

Creating a Dataset Without Rare Disease
So that we have a dataset where the rare disease assumption is not met, we next replicate the
cases so that each appears five times, and save this augmented dataset to a separate file. (This is
for illustration only; of course you would not do this in an actual data analysis.)
* -- set up data with 5 x cases
use nickelrefinary, clear
tab tumor
keep if tumor==1 // reduce to cases only
save tumorcases, replace // save cases to file
use nickelrefinary, clear // bring data back in
forvalues i=1(1)4{
    append using tumorcases // 4 extra copies + the original = 5 x cases
}
tab tumor // 280 cases, 623 controls
save nickelrefinary5xcases, replace
This has already been done. The file nickelrefinary5xcases.dta is in the datasets & do-files
subdirectory.
Population Effect Measures (Original Data With Rare Disease)
For illustration, we will assume our N=679 represents the population that we will sample from.
Doing this, we can determine the population effect measures that our samples will be estimates
of. Reading in the original data (not the 5 x cases dataset),
File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on nickelrefinary.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\nickelrefinary.dta", clear
* which must be all on one line, or use:

cd "Biostats & Epi With Stata\datasets & do-files"
use nickelrefinary.dta, clear

Computing the “population” (full cohort dataset before we take a sample) odds ratio,
Statistics
Observational/Epi. analysis
Tables for epidemiologists
case control odds ratio
Main tab: Case variable: tumor
Exposed variable: nickel
OK
cc tumor nickel
Proportion
| Exposed Unexposed | Total Exposed
-----------------+------------------------+----------------------
Cases | 46 10 | 56 0.8214
Controls | 343 280 | 623 0.5506
-----------------+------------------------+----------------------
Total | 389 290 | 679 0.5729
| |
|------------------------+----------------------
Odds ratio | 3.755102 | 1.824533 8.48588 (exact)
Attr. frac. ex. | .7336957 | .4519145 .8821572 (exact)
Attr. frac. pop | .6026786 |
+-----------------------------------------------
chi2(1) = 15.41 Pr>chi2 = 0.0001

Computing the “population” (full cohort dataset before we take a sample) risk ratio,
Statistics
Cohort study: risk ratio etc.
Exposure variable: nickel
OK
cs tumor nickel
| occupational exposure to|

| nickel |
-----------------+------------------------+----------
Cases | 46 10 | 56
Noncases | 343 280 | 623
-----------------+------------------------+----------
Total | 389 290 | 679
| |
Risk | .1182519 .0344828 | .0824742
| |
|------------------------+----------------------
Risk difference | .0837692 | .0454195 .1221188
Risk ratio | 3.429306 | 1.760546 6.679826
Attr. frac. ex. | .7083958 | .4319943 .8502955
Attr. frac. pop | .5818966 |
+-----------------------------------------------
chi2(1) = 15.41 Pr>chi2 = 0.0001

Computing the “population” (full cohort dataset before we take a sample) rate ratio,
Statistics
Incidence rate ratios
Exposed variable: nickel
Person-time variable: timerisk
OK
ir tumor nickel timerisk

| nickel |
-----------------+------------------------+----------
carcinoma of the | 46 10 | 56
time at risk | 7546.318 7801.739 | 15348.06
-----------------+------------------------+----------
| |
Incidence Rate | .0060957 .0012818 | .0036487
| |
|------------------------+----------------------
Inc. rate diff. | .0048139 | .0028815 .0067463
Inc. rate ratio | 4.755696 | 2.367283 10.56925 (exact)
Attr. frac. pop | .6487034 |
+-----------------------------------------------
(midp) Pr(k>=46) = 0.0000 (exact)
(midp) 2*Pr(k>=46) = 0.0000 (exact)

Computing the “population” (full cohort dataset before we take a sample) hazard ratio,
1) informing Stata we have survival time variables:
Statistics
Survival analysis
Set up and utilities
Declare data to be survival time data
Main tab: Time variable: timerisk
Failure event: Failure variable: tumor
Failure values: 1
OK
stset timerisk , failure(tumor==1)
2) requesting a Cox regression,
Statistics
Survival analysis
Regression models
Cox proportional hazards model
Model tab: Independent variables: nickel
OK
stcox nickel

Time at risk = 15348.05715
LR chi2(1) = 27.68
------------------------------------------------------------------------------
-------------+----------------------------------------------------------------
nickel | 5.022065 1.765996 4.59 0.000 2.520922 10.00472
------------------------------------------------------------------------------

Population Effect Measures (Augmented Data With Frequent Disease)
We then do the same with the augmented dataset:
* 5 x cases sample stats
use nickelrefinary5xcases , clear
cc tumor nickel
cs tumor nickel
ir tumor nickel timerisk
stset timerisk , failure(tumor==1) // declare survival data before stcox
stcox nickel
. cc tumor nickel
Proportion
-----------------+------------------------+----------------------
Cases | 230 50 | 280 0.8214
Controls | 343 280 | 623 0.5506
-----------------+------------------------+----------------------
Total | 573 330 | 903 0.6346
| |
|------------------------+----------------------
Odds ratio | 3.755102 | 2.635221 5.405224 (exact)
Attr. frac. pop | .6026786 |
+-----------------------------------------------
chi2(1) = 61.12 Pr>chi2 = 0.0000
. cs tumor nickel

| nickel |
-----------------+------------------------+----------
Cases | 230 50 | 280
Noncases | 343 280 | 623
-----------------+------------------------+----------
Total | 573 330 | 903
| |
Risk | .4013962 .1515152 | .3100775
| |
|------------------------+----------------------
Risk difference | .249881 | .1941373 .3056248
Risk ratio | 2.649215 | 2.013878 3.484987
Attr. frac. ex. | .6225296 | .5034455 .7130549
Attr. frac. pop | .5113636 |
+-----------------------------------------------
chi2(1) = 61.12 Pr>chi2 = 0.0000

. ir tumor nickel timerisk

| nickel |
-----------------+------------------------+----------
carcinoma of the | 230 50 | 280
time at risk | 10233.57 8608.377 | 18841.95
-----------------+------------------------+----------
| |
Incidence Rate | .022475 .0058083 | .0148605
| |
|------------------------+----------------------
Inc. rate diff. | .0166667 | .0133458 .0199877
Inc. rate ratio | 3.869473 | 2.839303 5.365002 (exact)
Attr. frac. pop | .6091442 |
+-----------------------------------------------
(midp) Pr(k>=230) = 0.0000 (exact)
(midp) 2*Pr(k>=230) = 0.0000 (exact)
. stset timerisk , failure(tumor==1)

. stcox nickel

Time at risk = 18841.95076
LR chi2(1) = 105.83
------------------------------------------------------------------------------
-------------+----------------------------------------------------------------
nickel | 4.188323 .660423 9.08 0.000 3.074829 5.705048
------------------------------------------------------------------------------
Population Relative Effects
Considering our total sample as our population, we observed the following population effect
measures.

Population Relative           Actual Dataset              Augmented Dataset
Effect Measure                with almost rare disease    with frequent disease
                              (3% in unexposed,           (15% in unexposed,
                              12% in exposed)             40% in exposed)
Odds Ratio (OR)               3.76                        3.76
Risk Ratio (RR)               3.43                        2.65
Incidence Rate Ratio (IRR)    4.76                        3.87
Hazard Ratio (HR)             5.02                        4.19
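A worked check shows why replicating each case five times leaves the OR unchanged but pulls the RR toward 1. The symbols a, b, c, and d are introduced here for illustration: they are the exposed cases, unexposed cases, exposed noncases, and unexposed noncases from the cc and cs output above (a = 46, b = 10, c = 343, d = 280).

```latex
\begin{align*}
OR_{\text{augmented}} &= \frac{(5a)\,d}{(5b)\,c} = \frac{ad}{bc}
                       = \frac{46 \times 280}{10 \times 343} \approx 3.76 \\[4pt]
RR_{\text{original}}  &= \frac{a/(a+c)}{b/(b+d)}
                       = \frac{46/389}{10/290} \approx 3.43 \\[4pt]
RR_{\text{augmented}} &= \frac{5a/(5a+c)}{5b/(5b+d)}
                       = \frac{230/573}{50/330} \approx 2.65
\end{align*}
```

The factor of 5 cancels in the OR, but not in the RR, because the cases also enter the denominators of the two risks.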

Classical Case-Control Study (Controls Are Sampled From Population Controls Only)
Most researchers choose their controls from the population controls only. First we will do this
for one sample.
We will use 2 controls for each case (2:1 sampling ratio). In real practice, you might choose a
greater number, such as 8 controls for each case. We use 2:1 in this illustration, in order to keep
the sample size much smaller than the population size, which makes the simulation more
believable.
Using the original dataset,
File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on nickelrefinary.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\nickelrefinary.dta", clear
* which must be all on one line, or use:

cd "Biostats & Epi With Stata\datasets & do-files"
use nickelrefinary.dta, clear

Crosstabulating disease and exposure,
Statistics
Summaries, tables & tests
Tables
Two-way tables with measures of association
Main tab: Row variable: tumor
Column variable: nickel
Cell contents: Within-column relative frequencies
OK
tabulate tumor nickel, column
carcinoma |
of the |
bronchi and | occupational exposure
nasal | to nickel
sinuses | 0. no exp 1. exposu | Total
------------+----------------------+----------
0. no tumor | 280 343 | 623
| 96.55 88.17 | 91.75
------------+----------------------+----------
1. tumor | 10 46 | 56
| 3.45 11.83 | 8.25
------------+----------------------+----------
Total | 290 389 | 679
| 100.00 100.00 | 100.00
From this population 2 × 2 table, we want to use all of the cases (n=56, the entire tumor row) and
twice as many controls (sample 112 controls from the “no tumor” row),
carcinoma |
of the |
bronchi and | occupational exposure
nasal | to nickel
sinuses | 0. no exp 1. exposu | Total
------------+----------------------+----------
0. no tumor | 280 343 | 623 <= sample 56 x 2 = 112 controls
| 96.55 88.17 | 91.75
------------+----------------------+----------
1. tumor | 10 46 | 56 <= use all 56 cases
| 3.45 11.83 | 8.25
------------+----------------------+----------
Total | 290 389 | 679
| 100.00 100.00 | 100.00
First, we set the random number generator seed so we can get the same sample if we need to
replicate our results later,
set seed 999
We now want to sample n=112 if tumor = 0 and just keep all of the cases (tumor = 1),

Statistics
Resampling
Draw random sample
Main tab: Sample size: 112
by/if/in tab: Restrict to observations (if expression): tumor == 0
OK
sample 112 if tumor==0, count

<or>
sample 112 , count , if tumor==0
Seeing what we got,
tabulate tumor nickel, column
carcinoma |
of the |
bronchi and | occupational exposure
nasal | to nickel
sinuses | 0. no exp 1. exposu | Total
------------+----------------------+----------
0. no tumor | 56 56 | 112
| 84.85 54.90 | 66.67
------------+----------------------+----------
1. tumor | 10 46 | 56
| 15.15 45.10 | 33.33
------------+----------------------+----------
Total | 66 102 | 168
| 100.00 100.00 | 100.00
which is just what we wanted.

Now, computing the sample odds ratio,
cc tumor nickel
Proportion
-----------------+------------------------+------------------------
Cases | 46 10 | 56 0.8214
Controls | 56 56 | 112 0.5000
-----------------+------------------------+------------------------
Total | 102 66 | 168 0.6071
| |
|------------------------+------------------------
Odds ratio | 4.6 | 2.019615 11.17006 (exact)
Attr. frac. pop | .6428571 |
+-------------------------------------------------
chi2(1) = 16.17 Pr>chi2 = 0.0001
In our sample, we get an OR of 4.60 (in contrast to the population OR of 3.76), which seems
rather off. However, we cannot judge if this is an unbiased estimate for the population OR,
because this estimate is subject to sampling variability.
Summarizing the steps to conducting a classical case-control study,
* -- classical sampling for case-control study
* (sample controls from controls)
use nickelrefinary , clear
tab tumor nickel, col // observe 56 cases, 623 controls
set seed 999
sample 112 if tumor==0, count // cases (tumor==1) are all kept
* used 2:1 sampling ratio selecting controls from control row
tab tumor nickel, col // 56 cases, 112 controls
cc tumor nickel // compute odds ratio

Using a Monte Carlo simulation, we compute the OR from 1,000 separate samples, to determine
the long-run average OR. This will inform us whether or not the approach we used produces
unbiased estimates of the population OR.
* -- long-run average ordinary case-control study (control row sampling)
clear
set obs 1
gen or=.
* create a file with 1 missing observation (a blank file)
save or_control_row, replace
*
set seed 999
forvalues i=1(1)1000{
    quietly use nickelrefinary, clear
    quietly gen or=. // variable to hold odds ratio
    quietly sample 112 if tumor==0, count // control row sampling
    quietly cc tumor nickel
    quietly replace or=r(or) in 1/1
    quietly keep or
    quietly keep in 1/1
    quietly append using or_control_row
    quietly save or_control_row, replace
}
use or_control_row, clear
histogram or
sum or
The result is:

[Histogram of or: horizontal axis "or", from 2 to 8; vertical axis "Density", from 0 to .8]
Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------
or | 1000 3.814517 .6563894 2.090909 7.965854
Using the mean of the 1,000 ORs as the long-run average, we get OR = 3.81.

Performing a similar simulation for the augmented dataset, with a 1:1 sampling ratio to keep the
sample as small as possible relative to the population size, using the mean of the 1,000 ORs as
the long-run average, we get OR = 3.77.
The simulation results are:

Classic Case-Control Design (sample controls from controls only)

             Almost Rare Disease                  Frequent Disease
             (3% unexposed, 12% exposed)          (15% unexposed, 40% exposed)
Relative     Population    Simulation Long-       Population    Simulation Long-
Effect       Measures      Run Average OR         Measures      Run Average OR
Measure
OR           3.76          3.81                   3.76          3.77
RR           3.43                                 2.65
IRR          4.76                                 3.87
HR           5.02                                 4.19
We see that the sample OR is an unbiased estimate of the population OR, regardless of the rare
disease assumption. The OR is not an unbiased estimate of the RR, however. If the rare disease
assumption were met, it would be a reasonably close estimate. Notice that the OR=3.81 is much
closer to the RR=3.43 in the “almost rare disease” column than the OR=3.77 is to the RR=2.65 in
the “frequent disease” column.

Let’s see why the odds ratio is not affected by our choice of sampling ratio (e.g., 2:1, 3:1, etc.).

Data Layout for Case-Control Study
(Stata’s cc and cci commands)

                       Exposure
Disease         Exposed (1)   Unexposed (0)   Totals*
cases (1)       a             b               M1
noncases (0)    c             d               M0
Totals*         m1            m0
*The uppercase totals (M1, M0) are fixed by the researcher,
and the lowercase totals (m1, m0) are observed.

odds ratio = ad/bc

With a 1:1 sampling ratio,
1 × M1 = M0 = 1 × (c+d) = 1c + 1d
With a 2:1 sampling ratio,
2 × M1 = M0 = 2 × (c+d) = 2c + 2d
We see that both c and d are multiplied by the same constant.
This has no effect on the odds ratio, regardless of the constant k,
odds ratio = a(kd)/[b(kc)]
           = ad/bc since the k’s cancel
Sampling in general,

                       Exposure
Disease         Exposed (1)   Unexposed (0)   Totals*
cases (1)       a             b               M1   <- select some fraction n of cases
noncases (0)    c             d               M0   <- select some fraction k of controls
Totals*         m1            m0

odds ratio = (na)(kd)/[(nb)(kc)]
           = ad/bc since both the n’s and k’s cancel
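The cancellation can be confirmed numerically with the immediate command cci. This is a minimal sketch using the nickel refinery population counts (a = 46, b = 10, c = 343, d = 280). The fractions k = 1/7 and n = 1/2 are hypothetical choices, picked only because they give whole-number cell counts; these are expected counts, not a random sample.

```stata
* full population table
cci 46 10 343 280
display "OR, full table          = " r(or)
* all cases, 1/7 of the controls in each exposure column (343/7=49, 280/7=40)
cci 46 10 49 40
display "OR, k = 1/7             = " r(or)
* half of the cases and 1/7 of the controls
cci 23 5 49 40
display "OR, n = 1/2 and k = 1/7 = " r(or)
```

All three calls return the same odds ratio, 3.755102, because the sampling fractions cancel from the numerator and denominator.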

Case-Cohort Study Design
Rothman (2002, pp.84-86) suggests sampling controls from the entire population, regardless of
case or control status. Thus some cases may be selected as controls as well. (Rothman does not
even mention the classical case-control design, where controls are sampled only from
non-diseased subjects.) When controls are selected this way, that is, from the entire population at
risk, the study design is called a case-cohort design, rather than a case-control design
(Rothman and Greenland, 1998, p.108).
This time we sample from the total row:

                       Exposure
Disease         Exposed (1)   Unexposed (0)   Totals*
cases (1)       a             b               M1
noncases (0)    c             d               M0
Totals*         m1            m0                   <- select some fraction k of this row as controls
so choosing our controls as some fraction k of the total row, we have
c = km1
d = km0
Our odds ratio is then

\[
OR = \frac{ad}{bc}
   = \frac{a\,(k m_0)}{b\,(k m_1)}
   = \frac{a/(k m_1)}{b/(k m_0)}
   = \frac{\frac{1}{k}\cdot\frac{a}{m_1}}{\frac{1}{k}\cdot\frac{b}{m_0}}
   = \frac{a/m_1}{b/m_0}
   = RR
\]
So if we choose our controls as some fraction of the total row, our odds ratio is identically the
risk ratio. That is, in a case-cohort study, the OR directly estimates the RR, regardless of the rare
disease assumption.
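This identity can be checked against the nickel refinery population counts. This is a minimal sketch that takes k = 1, so the "control" cells are simply the column totals m1 = 389 and m0 = 290. These are expected counts under total-row sampling, not a random sample.

```stata
* odds ratio with the total row used as the control row
cci 46 10 389 290
display "OR from total-row controls = " r(or)
* risk ratio from the full cohort table, for comparison
csi 46 10 343 280
display "cohort risk ratio          = " r(rr)
```

Both commands return 3.429306, the population risk ratio reported earlier by cs tumor nickel.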
Exercise. Look at the Cai methods paper.
Notice in the second sentence of the abstract, they point out that the controls are sampled
from the “total row of the full cohort 2 x 2 table” when they state,
“...a case-cohort design, which consists of a small random sample of the whole cohort and
all of the disease subjects...”
In the second paragraph, they cite some studies that have used the case-cohort design.

Conducting a case-cohort study on this full cohort, we use all of the cases and take a random
sample of controls from the total row of the 2 x 2 table.
This is a bit more complex, so we will use commands rather than menus, since it will be easier to
see what we are doing.
* -- case-cohort study (sample controls from total sample)
use nickelrefinary, clear // begin with full dataset
tab tumor nickel, col // observe 56 cases, 623 controls
keep if tumor==1 // reduce sample to cases
save tumorcases, replace // save cases to file
use nickelrefinary, clear // bring full dataset back in
*
set seed 999
sample 112 , count // 2:1 sampling ratio, selecting controls from
// total row since we do not use the "if
// tumor==0" this time
replace tumor=0 if tumor==1 // set these all to control status
append using tumorcases // bring cases back in
tab tumor nickel, col // 56 cases, 112 controls
cc tumor nickel
Then using the Monte Carlo method to obtain the long-run average OR from the original (rare
disease) dataset
* -- long-run average case-cohort study (total row sampling)
* (uses tumorcases.dta, saved in the previous step)
clear
set obs 1
gen or=.
save or_control_row, replace // create a file with 1 missing observation
*
set seed 999
forvalues i=1(1)1000{
    quietly use nickelrefinary, clear
    quietly gen or=. // variable to hold odds ratio
    quietly sample 112 , count // total row sampling
    quietly replace tumor=0 if tumor==1 // set all to control status
    quietly append using tumorcases // bring cases back in
    quietly cc tumor nickel
    quietly replace or=r(or) in 1/1
    quietly keep or
    quietly keep in 1/1
    quietly append using or_control_row
    quietly save or_control_row, replace
}
use or_control_row, clear
histogram or
sum or
with a similar simulation using the augmented (frequent disease) dataset, also in the do-file.

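Case-Cohort Design (sample controls from the total row)

             Almost Rare Disease                  Frequent Disease
             (3% unexposed, 12% exposed)          (15% unexposed, 40% exposed)
Relative     Population    Simulation Long-       Population    Simulation Long-
Effect       Measures      Run Average OR         Measures      Run Average OR
Measure
OR           3.76                                 3.76
RR           3.43          3.48                   2.65          2.67
IRR          4.76                                 3.87
HR           5.02                                 4.19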
For this sampling approach, we see that the sample OR is an unbiased estimator of the population
RR, as Rothman claims it should be.
For the case-cohort design, the rare-disease assumption is not required for the OR to be an
estimate of RR (Rothman and Greenland, 1998, p.110). We have demonstrated that to be the
case.

Density Case-Control Study Where Controls Are Sampled From Cases & Controls Which
Have Same Or Longer Time-At-Risk (Risk Set Sampling)
Rothman (2002, pp.76-80) suggests sampling controls from the entire population, regardless of
case or control status, but also selecting the controls, in a matched fashion, from subjects with
a time at risk similar to or longer than that of the case. Again, some cases may be selected as
controls as well. This is called risk-set sampling.
In this study design, we want the OR to be an unbiased estimate of the hazard ratio, HR, where
HR is a type of weighted average of the day-specific risk ratios.
Using the hypothetical data table we used above,

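            ---------- Exposed ----------    -------- Non-Exposed --------
Follow-up   Begin   Disease   Day-Specific    Begin   Disease   Day-Specific    Day-Specific
day         N       Cases     Risk            N       Cases     Risk            Risk Ratio
1           50      5         0.10            50      2         0.04            2.5
2           30      10        0.33            40      8         0.20            1.7
3           10      10        1.00            20      10        0.50            2.0
totals      90      25                        110     20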
In this design, we also use a type of “total row sampling”. That is, we select our controls from the
“Begin N” columns of the life table.
For the 5+2 cases that occurred on day 1, we sample our controls from the 50+50 persons still at
risk on day 1.
For the 10+8 cases that occurred on day 2, we sample our controls from the 30+40 persons still at
risk on day 2,
and so on.
We do this by forming risk sets. For every case, we form a risk set that includes all subjects with
an equal or longer follow-up time. Then we sample 2 controls from that risk set, if we use a 2:1
sampling ratio, that we match with that case.
This is identical to sampling from the correct row of the Begin N column.
Just as we saw in the case-cohort study, proportionality is maintained, which guarantees that our
OR is an estimate of RR for each row of the life table.
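The row-by-row claim can be checked with the hypothetical life table. This is a minimal sketch in which the "controls" on each day are the whole Begin N risk set (sampling fraction k = 1), so the counts are expected values rather than a random sample.

```stata
* cells, in order: exposed cases, unexposed cases,
*                  exposed Begin N, unexposed Begin N
cci 5 2 50 50      // day 1
display "day 1 OR = " r(or)
cci 10 8 30 40     // day 2
display "day 2 OR = " r(or)
cci 10 10 10 20    // day 3
display "day 3 OR = " r(or)
```

The three odds ratios are 2.5, 1.67, and 2.0, matching the day-specific risk ratios in the last column of the life table (where 1.67 is shown rounded to 1.7).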

Since we will analyze the data with conditional logistic regression, which maintains the matches
with the case and controls on the same row of the life table, we maintain the day-specific RR
analysis. Finally, the OR that comes out of the conditional logistic regression is a type of
weighted average across the rows.
Cox regression, which computes the HR directly, also summarizes the day-specific RR, (which is
also called the day-specific HR), computing a type of weighted average which it reports as the
HR.
Thus, the conditional logistic regression from the case-control study does the same thing as the
Cox regression from a cohort study. (NOTE: the conditional logistic approach is biased and so
should not be used, as is pointed out below.)
Let’s do it.
Obtaining a risk-set sample (density sample) and computing the OR,

* --- density case-control study ----
use nickelrefinary, clear // begin with full dataset
stset timerisk , failure(tumor==1) // sttocc requires stset data
set seed 999 // set seed so can replicate the analysis
* following command selects 2 controls per case from risk set of
* subjects with same or longer time-at-risk
sttocc, number(2)
clogit _case nickel, group(_set) or
we get
Conditional (fixed-effects) logistic regression Number of obs = 168
LR chi2(1) = 17.65
Prob > chi2 = 0.0000
Log likelihood = -52.698159 Pseudo R2 = 0.1434
------------------------------------------------------------------------------
_case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nickel | 4.951997 2.131085 3.72 0.000 2.130428 11.51049
------------------------------------------------------------------------------
Notice we used conditional logistic regression to obtain the odds ratio. Given our matching of
time-at-risk (from risk-set sampling), we had to obtain an odds ratio using a matched sample
approach since matched studies require a matched analysis (Rothman and Greenland, 1998, p.
98; Greenland and Thomas, 1982).

Look at the data in Stata after running this. Notice Stata created three new variables, called
_case, _set, and _time. Interpret these.
list tumor nickel timerisk _case _set _time in 1/9 , sep(3)
+--------------------------------------------------------------------+
| tumor nickel timerisk _case _set _time |
|--------------------------------------------------------------------|
1. | 0. no tumor 1. exposure 35.6823 0 1 .70149994 |
2. | 0. no tumor 1. exposure 16.3425 0 1 .70149994 |
3. | 1. tumor 1. exposure .7014999 1 1 .70149994 |
|--------------------------------------------------------------------|
4. | 0. no tumor 1. exposure 9.000103 0 2 1.1797981 |
6. | 1. tumor 1. exposure 1.179798 1 2 1.1797981 |
|--------------------------------------------------------------------|
8. | 0. no tumor 0. no exposure 41.7836 0 3 1.4220009 |
9. | 1. tumor 0. no exposure 1.422001 1 3 1.4220009 |
+--------------------------------------------------------------------+
Notice that for each risk set, the follow-up time of the matched controls was greater than or equal
to the follow-up time of the case.
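The risk-set property noted above can also be checked for every matched set, rather than by eye on the first few rows. The following is a minimal sketch, assuming nickelrefinary.dta is in the working directory as in the commands above; the variable name n_cases is a hypothetical helper introduced only for this check.

```stata
* Sketch: verify the risk-set property after sttocc
* (assumes nickelrefinary.dta is in the working directory)
use nickelrefinary, clear
stset timerisk , failure(tumor==1)
set seed 999
sttocc, number(2)

* every control must still be at risk when its matched case fails
assert timerisk >= _time if _case == 0

* every matched set should contain exactly one case
bysort _set: egen n_cases = total(_case) // n_cases is a hypothetical helper variable
assert n_cases == 1
```

If either assert fails, Stata stops with an error, which would indicate that the sampled controls do not form proper risk sets.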

To check whether the OR using this study design provides an unbiased estimate of the HR, we
next obtain the long-run average.
* --- density case-control study (long-run average)----

clear
set obs 1
gen or=.
save or_density, replace // create a file with 1 missing observation
*
set more off // turn off scrolling prompt when displaying iteration number
set seed 999
forvalues i = 1/1000 { // 1000 iterations assumed, as in the HR simulation later in this chapter
use nickelrefinary, clear // reload the full cohort on each iteration
quietly stset timerisk , failure(tumor==1)
quietly sttocc, number(2)
quietly clogit _case nickel, group(_set) or
quietly gen or = exp(_b[nickel]) in 1/1 // convert coefficient to OR
quietly keep or
quietly keep in 1/1
quietly append using or_density
quietly save or_density, replace
display `i' // display iteration number
}
set more on
use or_density, clear
*histogram or
sum or
We then do the same thing for the augmented dataset.
Density Case-Control Design

            Actual Dataset              Augmented Dataset
Measure     Population   Long-run OR    Population   Long-run OR
OR          3.76                        3.76
RR          3.43                        2.65
IRR         4.76                        3.87
HR          5.02         5.42           4.19         4.43
We see that the OR is a biased estimate of the HR, and so the conditional logistic regression
model should not be used for the analysis. It is close though. Another approach is taught below.

Terminology Inconsistencies
Not all authors use the study design names consistently. For example,
A) Rothman (2002, pp. 84-86) uses the term case-cohort study design to refer to the situation when
follow-up is assumed equal for all subjects, or just simply ignored, so controls are simply
selected from all subjects at risk (from the total row of the 2 × 2 table).
Prentice (1986, p. 2) calls this study design a case-cohort design: binary response.
B) Rothman (2002, pp. 76-80) uses the term density case-control study design to refer to the
situation when follow-up is not equal for all subjects, so risk-set sampling is used to select
controls from all subjects at risk with equal or longer follow-up times (from the appropriate row
of a life table).
Prentice (1986, p. 4) calls this study design a case-cohort design: time to response data.
Some Methods Papers
Jewell (2004, pp.51-53) presents risk-set sampling (density case-control study design) as a way
to use a case-control study to obtain an estimate of the hazard ratio (HR).
Rothman (2002, pp.76-80) presents the density case-control study design as a way to obtain an
estimate of the incidence rate ratio (IRR). Although he does not say so, Rothman is apparently
making the assumption that risk is constant across time. Under that assumption, HR = IRR.
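The step from constant risk to HR = IRR can be written out briefly. This is a sketch under the stated constant-hazard assumption; the symbols \lambda_k, D_k, and PT_k (hazard, number of events, and person-time in exposure group k) are notation introduced here, not taken from Rothman.

```latex
% Constant hazards in the exposed (k=1) and unexposed (k=0) groups
h_1(t) = \lambda_1, \qquad h_0(t) = \lambda_0
\quad\Longrightarrow\quad
\mathrm{HR} = \frac{h_1(t)}{h_0(t)} = \frac{\lambda_1}{\lambda_0}
\quad \text{for all } t .

% Under a constant hazard, events divided by person-time estimates the hazard
\hat{\lambda}_k = \frac{D_k}{PT_k},
\qquad
\widehat{\mathrm{IRR}} = \frac{D_1 / PT_1}{D_0 / PT_0}
= \frac{\hat{\lambda}_1}{\hat{\lambda}_0} .
```

So when the hazard does not change over follow-up, the incidence rate ratio and the hazard ratio estimate the same quantity. When the hazard varies over time, the two can differ.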
Prentice RL. (1986). A case-cohort design for epidemiologic cohort studies and disease prevention
trials. Biometrika 73:1-11.
King G, Zeng L. (2002). Estimating risk and rate levels, ratios and differences in case-control
studies. Statist Med 21:1409-1427.
Volovics A, van den Brandt PA. (1997). Methods for the analysis of case-cohort studies. Biom J
39(2):195-214.

Some Studies That Used the Case-Cohort Approach
1) Rossing MA, Daling JR, Weiss NS, Moore DE, Self SG. (1996). Risk of breast cancer in a
cohort of infertile women. Gynecol Oncol 60(1):3-7.
Abstract
The purpose of this study was to assess: (1) the risk of breast cancer associated with use
of ovulation-inducing agents (such as clomiphene citrate) as treatment for infertility; and
(2) the risk associated with ovulatory abnormalities that result in infertility. We
performed a case-cohort study among 3837 women evaluated for infertility at clinics in
Seattle, Washington, at some time during 1974–1985. Computer linkage with a
population-based tumor registry was used to identify women diagnosed with breast cancer
before January 1, 1992. Data regarding infertility testing and treatment were abstracted
from the infertility clinic medical records for women who developed breast cancer and a
randomly selected subcohort. Twenty-seven women in the cohort developed in situ or
invasive breast cancer, in comparison with an expected number of 28.8 cases
(standardized incidence ratio, 0.9; 95% confidence interval (CI), 0.6–1.4). Infertile
women with evidence of an ovulatory abnormality were at a risk of breast cancer similar
to that of women whose infertility was believed to be due to other causes. The risk among
women who had taken clomiphene was reduced relative to infertile women who had not
used this drug (adjusted relative risk, 0.5; 95% CI, 0.2–1.2), but the reduction in risk did
not increase with duration of use. The possibility that use of clomiphene as treatment for
infertility lowers the risk of breast cancer should be examined in other, larger studies.
Notice this study has a long and unequal follow-up, so the density case-control design is well-
suited for this study.
2) Savitz DA, Cai J, van Wijngaarden E, et al. (2000). Case-cohort analysis of brain cancer and
leukemia in electric utility workers using a refined magnetic field job-exposure matrix.
American Journal of Industrial Medicine 38:417-425.
3) Voorrips LE, Goldbohm RA, Brants HA, et al. (2000). A prospective cohort study on
antioxidant and folate intake and male lung cancer risk. Cancer Epidemiol Biomarkers Prev
9:357-65.
In the 3rd paragraph of their Data Analysis section, they state they did something special to
adjust the variance estimates, using a software routine they developed:
“Because standard software was not available for case-cohort analysis, specific macros
were developed to account for the additional variance introduced by sampling from the
cohort instead of using the entire cohort (29).”

Something similar is available in Stata as the user-contributed package sbe41, which you must
install before you can use it. First use the help facility to search on “case cohort”. Then click
on the sbe41 link when you see this:
STB-59 sbe41 . . . . . . . . . . . . Ordinary case-cohort design and analysis
       (help stcascoh, stselpre if installed) . . . . . . . . . V. Coviello
       1/01 pp.12--18; STB Reprints Vol 10, pp.121--129
       selects a sample from a cohort, prepares the dataset for
       analysis using a Cox regression model, and computes the
       Self-Prentice variance estimator of the parameters
This approach in Stata fits a Cox regression model to the data with an appropriate variance
estimate, so the p values and confidence intervals are correct.
What Researchers Actually Use
Usually when sampling is from a larger cohort, follow-up times are available. Rather than using
the risk-set sampling and conditional logistic regression approach described above, researchers
instead use a Cox regression model with a special variance estimator (at least three such
estimators have been proposed). All of the example papers presented in this chapter followed this
suitably adapted Cox regression analysis approach.
After installing the stcascoh and stselpre commands,

use nickelrefinary, clear
stset timerisk , failure(tumor==1) id(caseid)
stcox nickel // full cohort
stcascoh, alpha(.2) seed(999) // sample .2 or 20% of the cohort as the subcohort
stselpre nickel
The results are:

1) full cohort

------------------------------------------------------------------------------
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nickel | 5.020635 1.765523 4.59 0.000 2.520176 10.00199
------------------------------------------------------------------------------
2) sampled cohort
. stcascoh, alpha(.2) seed(999) // .2 or 20% of the cohort
failure _d: tumor == 1

analysis time _t: timerisk
id: caseid
Total sample = 174

------------------------------------------------------------------------------
191 total obs.
0 exclusions
------------------------------------------------------------------------------
191 obs. remaining, representing
174 subjects
56 failures in single failure-per-subject data
Self Prentice Variance Estimate for Case-Cohort Design
Self Prentice Scheme
------------------------------------------------------------------------------
| Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nickel | 4.652655 1.846243 3.87 0.000 2.137624 10.12676
------------------------------------------------------------------------------
Prentice Scheme
------------------------------------------------------------------------------
| Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nickel | 4.587959 1.82057 3.84 0.000 2.1079 9.98594
------------------------------------------------------------------------------
Saving the Case-Cohort Sample

At this point, only the case-cohort sample (191 observations representing 174 subjects) is in
memory. Be sure to save this file if you want to do something like a chart review to collect
further predictor variables on this case-cohort sample.
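One way to save the sample, together with a list of unique subject IDs for a chart review, is sketched below. It assumes stcascoh has just been run so the case-cohort sample is in memory; the filenames nickel_casecohort and nickel_casecohort_ids are hypothetical.

```stata
* Sketch: save the case-cohort sample right after stcascoh
* (filenames are hypothetical; caseid is the id() variable used in stset above)
describe, short                        // confirm what is in memory
save nickel_casecohort, replace        // full case-cohort sample

* one row per subject, for use as a chart-review list
preserve
keep caseid
duplicates drop
count                                  // number of distinct subjects in the sample
save nickel_casecohort_ids, replace
restore
```

The preserve and restore pair leaves the full sample in memory afterward, so stselpre can still be run on it.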

Using simulation to check the unbiasedness of this approach, with the Prentice Scheme,
clear
set obs 1
gen hr=.
save hr_simulation, replace // create a file with 1 missing observation
*
set more off // turn off scrolling prompt
* set seed 999 // doesn't work outside of stcascoh command
forvalues i = 1/1000 { // 1000 iterations, matching Obs = 1000 in the output below
use nickelrefinary, clear // reload the full cohort on each iteration
quietly stset timerisk , failure(tumor==1) id(caseid)
quietly stcascoh, alpha(.1798) // sampling fraction of about 18% (n=112 controls)
quietly stselpre nickel // fit model
quietly matrix A=e(b)
quietly svmat A // creates variables from matrix columns
quietly gen hr = exp(A1) in 1/1 // convert coefficient to HR
quietly keep hr
quietly keep in 1/1
quietly append using hr_simulation
quietly save hr_simulation, replace
display `i' // display iteration number
}
set more on
use hr_simulation, clear
sum hr
we get
. sum hr
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
hr | 1000 5.051037 1.065216 2.663914 9.633725
so the long-run average HR is 5.05, compared to the population HR of 5.02, which is consistent
with an unbiased estimate. In contrast, risk-set sampling followed by conditional logistic
regression, which was demonstrated above and gave a long-run average OR of 5.42, produces a
biased estimate of the HR and so should not be used.
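As a supplementary check, which is not part of the original analysis, the gap between the simulated mean and the population HR can be compared with the Monte Carlo standard error of that mean. The sketch below uses only the numbers printed in the summarize output and the full-cohort Cox model above; the scalar names are hypothetical.

```stata
* Sketch: is the simulated mean HR within Monte Carlo error of the population HR?
scalar mean_hr = 5.051037   // mean of hr from the summarize output
scalar sd_hr   = 1.065216   // Std. Dev. of hr from the summarize output
scalar n_sims  = 1000       // Obs from the summarize output
scalar pop_hr  = 5.020635   // HR from the full-cohort Cox model

scalar mcse = sd_hr / sqrt(n_sims)        // Monte Carlo standard error of the mean
scalar z    = (mean_hr - pop_hr) / mcse   // gap expressed in Monte Carlo SE units

display "Monte Carlo SE = " %6.4f mcse
display "z = " %5.2f z
```

The Monte Carlo standard error is roughly 0.034, so the simulated mean lies less than one standard error from the population HR, which is what an unbiased estimator would typically produce.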

Sample Size Determination
We have seen that subjects can be included as both cases and controls in the case-cohort
approach. This overlap requires that the sample size be inflated to allow for this. Rothman,
Greenland, and Lash (2008) comment,
“Case-cohort designs have other advantages as well as disadvantages relative to alternative
case-control designs (Wacholder, 1991). One disadvantage is that, because of the overlap of
membership in the case and control groups (controls who are selected may also develop
disease and enter the study as cases), one will need to select more controls in a
case-cohort study than in an ordinary case-control study with the same number of cases, if one
is to achieve the same amount of statistical precision. Extra controls are needed because
the statistical precision of a study is strongly determined by the numbers of distinct cases
and noncases. Thus, if 20% of the source cohort members will become cases, and all
cases will be included in the study, one will have to select 1.25 times as many controls as
cases in a case-cohort study to ensure that there will be as many controls who never
become cases in the study. On average, only 80% of the controls in such a situation will
remain noncases; the other 20% will become cases. Of course, if the disease is
uncommon, the number of extra controls needed for a case-cohort study will be small.”
------
Wacholder S. (1991). Practical considerations in choosing between the case-cohort and
nested case-control design. Epidemiology 2:155-158.
The “1.25” comes from: 80%, or 4/5, of the controls are “controls only”. To bring the number of
“controls only” subjects back up to equal the number of cases, you multiply the number of
controls by 5/4, since (5/4)(4/5) = 1, where 5/4 = 1.25.
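The same arithmetic extends to any proportion of cases: if a proportion p of the source cohort become cases, the inflation factor is 1/(1 - p). A minimal sketch follows; the 20% figure is from the quotation, while the 2% figure and the scalar names are hypothetical and chosen only to illustrate the rare-disease remark.

```stata
* Sketch: control inflation factor for a case-cohort design
scalar p_case  = 0.20                  // proportion of the cohort who become cases (from the quotation)
scalar inflate = 1 / (1 - p_case)      // controls to sample per "controls only" subject needed
display "inflation factor (20% cases) = " inflate

scalar p_rare       = 0.02             // hypothetical rare disease
scalar inflate_rare = 1 / (1 - p_rare)
display "inflation factor (2% cases)  = " inflate_rare
```

With 20% cases the factor is 1.25, as in the quotation. With 2% cases it is only about 1.02, so very few extra controls are needed.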

References
Breslow NE, Day NE. (1987). Statistical Methods in Cancer Research, Vol II: The Design and
Analysis of Cohort Studies, Lyon, France, IARC.
Cai J, Zeng D. (2004). Sample size/power calculation for case-cohort studies. Biometrics
60:1015-1024.
Dupont WD. (2002). Statistical Modeling for Biomedical Researchers: A Simple Introduction to
the Analysis of Complex Data. Cambridge UK, Cambridge University Press.
Greenland S, Thomas DC. (1982). On the need for the rare disease assumption in case-control
studies. Am J Epidemiol 116(3):547-553. with erratum in Am J Epidemiol
1990;131(6):1102.
Jewell NP. (2004). Statistics for Epidemiology. New York, Chapman & Hall/CRC.
King G, Zeng L. (2002). Estimating risk and rate levels, ratios and differences in case-control
studies. Statist Med 21:1409-1427.
National Heart, Lung, and Blood Institute. (1998). Clinical guidelines for the identification,
evaluation, and treatment of overweight and obesity in adults: the evidence report.
Bethesda, MD, National Heart, Lung, and Blood Institute.
Onyike CU, Crum RM, Lee HB, Lyketsos CG, Eaton WW. (2003). Is obesity associated with
major depression? Results from the third national health and nutrition examination
survey. Am J Epidemiol 158(12):1139-1153.
Prentice RL. (1986). A case-cohort design for epidemiologic cohort studies and disease prevention
trials. Biometrika 73:1-11.
Rothman KJ. (2002). Epidemiology: An Introduction. New York, Oxford University Press.
Rothman KJ, Greenland S. (1998). Modern Epidemiology, 2nd ed. Philadelphia, PA.
Rothman KJ, Greenland S, Lash TL. (2008). Case-control studies. In Rothman KJ, Greenland S,
Lash TL, Modern Epidemiology, Philadelphia, Lippincott Williams & Wilkins, 2008,
pp.111-127.
Savitz DA, Cai J, van Wijngaarden E, et al. (2000). Case-cohort analysis of brain cancer and
leukemia in electric utility workers using a refined magnetic field job-exposure matrix.
American Journal of Industrial Medicine 38:417-425.
Volovics A, van den Brandt PA. (1997). Methods for the analysis of case-cohort studies. Biom J
39(2):195-214.
Voorrips LE, Goldbohm RA, Brants HA, et al. (2000). A prospective cohort study on antioxidant
and folate intake and male lung cancer risk. Cancer Epidemiol Biomarkers Prev
9:357-65.
