This superb book on research philosophy and methodology that Drs. Phyllis
Supino and Jeffrey Borer have written and edited came out of an experience
common to most of us involved in training investigators beginning their
research careers. How do you teach these investigators the mostly unwritten
ways of an area as complex as medical research? How do you help the
research neophyte develop into a creative and reliable researcher? For me and
my associates in the Cardiology Branch of the NIH (of which Dr. Borer was
one) in the 1970s and 1980s, the teaching process was mostly based on an
apprenticeship model, with learning coming in the actual doing of the
research. This time-honored approach led to the development, in many
research centers, of a cadre of superb researchers, but it was hard to master
and the results were necessarily inconsistent, with many young investigators
going down wrong paths.
Drs. Supino and Borer's book represents a unique collaboration between
an accomplished educator specializing in research methodology and a promi-
nent physician-scientist. Drs. Supino and Borer began their collaboration
more than 20 years ago at Cornell University Medical College, continuing
their work together in what became the Howard Gilman Institute for Valvular
Heart Diseases. The Institute, of which Dr. Borer is the Director, now is
located at the State University of New York Downstate Medical Center.
Working within the context of a research institute housed within a medical
school, Dr. Borer soon discovered that most of the fellows coming into his
program had no formal research training and scant knowledge of research
methodology. Prior to joining the Institute, Dr. Supino had been conducting
continuing education in research methodology for scientists and health
professionals since the late 1970s. When Dr. Supino joined the Institute in 1990, she
applied her accumulated expertise in this field to develop a curriculum and
lead a comprehensive course providing formal training in research methodology
for Dr. Borer's fellows and others at the institution. This curriculum and
course, developed in partnership with Dr. Borer, turned out to be our good
fortune. During the ensuing 20+ years Drs. Supino and Borer gradually devel-
oped the pedagogical framework for writing what is one of the best books in
the field.
This book provides in-depth chapters containing information critical to
creating good research, from the kind of mind-set that generates valuable
research questions to study design, to exploring a variety of online data
bases, to the elements making for compelling research grants and papers,
and to the wonderfully informing chapter on the history of the application
of ethics to medical research. There also is a valuable chapter on statistical
considerations and a fascinating discussion on the origins and elements of
hypothesis generation.
It's also important to emphasize that this superb text is not only for the
new investigator, but for experienced investigators as well. This results from
the fact that Drs. Supino, Borer, and their coauthors write their chapters in
ways that are not only easily accessible to the new investigator, but at the
same time are sufficiently sophisticated so that the seasoned investigator will
As an example, I particularly enjoyed the first chapter, written by
Dr. Supino, which provides some down-to-earth examples of, in essence, why
there should be a clearly defined primary endpoint in clinical investigations.
As I was reading her chapter, I realized I had forgotten the "why" of this
requirement, and that I was just taking the requirement for granted, a situation
that could make investigators vulnerable to dismissing its importance. In this
regard, over the years I've found it not uncommon for investigators, who find
that the intervention they're studying significantly improves one or another
secondary endpoint but not the primary endpoint, to freely attack this
requirement and argue they've proven the efficacy of their intervention.
But Dr. Supino reminds us what good science is by providing an
elegantly simple example of the marksman who boasts of his skills after
interpreting the results of his shooting a gun at a piece of paper hung on the side
of a barn. The marksman, it turns out, does not prospectively define the
bull's-eye. Rather, after multiple bullets are fired at the piece of paper, he inspects
the bullet hole-riddled paper, sees the random bullet hole patterns, and then
draws a circle (bull's-eye) around a group of holes that by chance have fallen
into a tight cluster. The post hoc definition of the bull's-eye (i.e., now the
primary endpoint) speaks (unjustifiably) to the marksman's skill. By this
simple anecdote, Dr. Supino makes the critical importance of prospectively
defining the primary endpoint exquisitely clear.
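The arithmetic behind the marksman's error can be sketched in a few lines.
The example below is an illustration only and is not taken from the book: it
assumes an intervention with no true effect, several independent endpoints,
and a conventional significance threshold of 0.05. The function names
`chance_of_spurious_hit` and `exact_chance` and all parameter values are
hypothetical. Under those assumptions each endpoint's p-value behaves like a
uniform random draw, so the chance that at least one endpoint looks
significant grows quickly with the number of endpoints examined.

```python
import random


def chance_of_spurious_hit(n_endpoints, alpha=0.05, n_trials=20_000, seed=0):
    """Estimate how often at least one endpoint looks 'significant'
    when the intervention truly affects none of them.

    Under a true null hypothesis a p-value is uniform on [0, 1], so each
    endpoint is simulated as one uniform draw. Independence between
    endpoints is an assumption made for simplicity.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        p_values = [rng.random() for _ in range(n_endpoints)]
        # Drawing the bull's-eye afterwards: keep only the best-looking result.
        if min(p_values) < alpha:
            hits += 1
    return hits / n_trials


def exact_chance(n_endpoints, alpha=0.05):
    """Closed form for independent endpoints: 1 - (1 - alpha)^k."""
    return 1 - (1 - alpha) ** n_endpoints


if __name__ == "__main__":
    for k in (1, 5, 10):
        print(k, round(chance_of_spurious_hit(k), 3), round(exact_chance(k), 3))
```

For independent endpoints the closed form gives about 0.23 for five endpoints
and about 0.40 for ten, which is why a finding on a secondary endpoint chosen
after the fact is weak evidence of efficacy. Real endpoints are usually
correlated, so these figures are a rough guide rather than an exact rate.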
A foreword is no place to provide extensive details of what a book contains.
I'll therefore limit myself and just enthusiastically say this first chapter
I read is representative of the high quality of the chapters to come. Drs.
Supino and Borer have used the many years they have developed their course
extraordinarily well; they and their outstanding coauthors have produced a
book that is well written, beautifully edited, and contains wisdom and insight.
It is a book, whether reading it in its entirety or perusing individual chapters,
that presents the reader with a superb learning experience. The authors have
certainly hit the bull's-eye.

Washington, DC, USA Stephen E. Epstein, MD


This book has been written to aid medical students, physicians, and other
health professionals as they probe the increasingly complex and varied
medical/scientific literature for knowledge to improve patient care and search for
guidance in the conduct of their own research. It also is intended for basic
scientists involved in translational research who wish to better understand the
unique challenges and demands of clinical research and, thus, become more
successful members of interdisciplinary medical research teams.
The book is based largely on a lecture series on research methodology,
with particular emphasis on issues affecting clinical research, that the editors
designed and have offered for 21 years to more than 1,000 members of the
academic medical communities of Weill Cornell Medical College and the
State University of New York (SUNY) Downstate Medical Center, both
located in New York City. The book spans the entire research process, from
the conception of the research problem to publication of findings.
The need for such a book has become increasingly clear to us during many
years of conducting a program of training and research in cardiovascular dis-
eases and in our general teaching of research methodology to students, train-
ees, and postgraduate clinical physicians and researchers. Though agreement
on the fundamental principles of scientific research has existed for more than
a century, the application of these principles has changed over time. The
precision required in defining study populations and in detailing methodologies
(and their deficiencies) is continually increasing. In addition, a bewildering
arsenal of statistical tools has developed (and continues to grow) to identify
and define the magnitude and consistency of relationships. Simultaneously,
acceptable formats for communicating scientific data have changed in
response to parallel changes in the world at large, and under the pressure of
an information explosion which mandates succinctness and clarity.
Despite these demands, there are few books, if any, that comprehensively and
concisely present these concepts in a manner that is relevant and comprehensible
to a broad professional medical community. This text is designed to resolve this
deficiency by combining theory and practical application to familiarize the
reader with the logic of research design and hypothesis construction, the impor-
tance of research planning, the ethical basis of human subjects research, the
basics of writing a clinical protocol, the logic and techniques of data generation
and management, and the fundamentals and implications of various sampling
techniques and alternative statistical methodologies. This book also aims to offer
guidance for assembling and interpreting results, writing scientic papers, and
publishing studies.
The book's 13 chapters emphasize the role and structure of the scientific
hypothesis (reinforced throughout the various chapters) in informing meth-
ods and in guiding data interpretation. Chapter 1 describes the general
characteristics of research and differentiates among various types of research;
it also summarizes the steps typically utilized in the hypothesis-testing
(hypothetico-deductive) method and underscores the importance of proper
planning. Chapter 2 reviews the origins of clinical research problems and the
types of questions that are commonly asked in clinical investigations; it also
identifies the characteristics of well-conceived research problems and explains
the role of the literature search in research problem development. Chapter 3
introduces the reader to various modes of logical inference utilized for
hypothesis generation, describes the characteristics of well-designed research
hypotheses, distinguishes among various types of hypotheses, and provides
guidelines for constructing them. Chapter 4 takes the reader through classic
epidemiological (observational) methods, including cohort, case-control,
and cross-sectional designs, and describes their respective advantages and
limitations. Chapter 5 discusses the meaning of internal and external validity
in the context of studies that aim to examine the effects of purposively applied
interventions, identifies the most important sources of bias in these types of
studies, and presents a variety of alternative study designs that can be used to
evaluate interventions, together with their respective strengths and weaknesses
for controlling each of the identified biases. Chapter 6 defines and
describes the purpose of the clinical trial and provides in-depth guidelines for
writing the clinical protocol that governs its conduct. Chapter 7 describes
methodologies used for data capture and management in clinical trials and
reviews associated regulatory requirements. Chapter 8 explains the steps
involved in designing, implementing, and evaluating questionnaires and
interviews that seek to obtain self-reported information. Chapter 9 reviews
the pros and cons of systematic reviews and meta-analyses for generating
secondary data by synthesizing evidence from previously conducted studies,
and discusses methods for locating, evaluating, and writing them. Chapter 10
describes the various methods by which subjects can be sampled and the
implications of these methods for drawing conclusions from clinical research
findings. Chapter 11 introduces the reader to fundamental statistical principles
used in biomedical research and describes the basis of determination of
sample size and definition of statistical power. Chapter 12 describes the ethical
basis of human subjects research, identifies areas of greatest concern to
institutional review boards, and outlines the basic responsibilities of investigators
towards their subjects. Finally, Chapter 13 provides practical guidance
on how to write a publishable scientific paper.
The authors of this book include prominent medical scientists and methodologists
who have extensive personal experience in biomedical investigation
and in teaching various key aspects of research methodology to medical
students, physicians, and other health professionals. They have endeavored to
integrate theory and examples to promote concept acquisition and to employ
language that will be clear and useful for a general medical audience. We hope
that this text will serve as a helpful resource for those individuals for whom
performing or understanding the process of research is important.

Brooklyn, NY, USA Phyllis G. Supino

Jeffrey S. Borer
Special Acknowledgments

We wish to give special thanks to the following individuals, who provided
particular assistance to the editors and authors in the preparation of this book:
From our publishers, we especially thank Richard Lansing for his belief in
the importance of our project as well as Kevin Wright, senior developmental
editor, for his excellent pre-production work.
From SUNY Downstate Medical Center, we thank Ofek Hai DO for his
efforts in the preparation of figures and tables; Rachel Reece BS for her assistance
in helping us to secure permission for the reproduction of images; and
Dany Bouraad BA, Jaclyn Wilkens BA, Daniel Santarsieri BS, and Romina
Arias BA for their assistance in literature searching, proof reading, and other
essential background work.
Finally, we thank our colleagues at Weill Cornell Medical College and
SUNY Downstate Medical Center who participated in our teaching programs
on which this book is largely based, and our families for their unfailing
support of this project.


Overview of the Research Process
Phyllis G. Supino

The term "research" can be defined broadly as a process of solving problems
and resolving previously unanswered questions. This is done by careful
consideration or examination of a subject or occurrence. Although approach and
specific objectives may vary, the ultimate goal of research always is to
discover new knowledge. In biomedical research, this may include the
description of a new phenomenon, the definition of a new relationship, the
development of a new model, or the application of an existing principle or
procedure to a new context. Increasingly, the methodology of research is
acknowledged as an academic discipline of its own, whose specific rules and
requirements for securing evidence, though applicable across disciplines,
mandate special study. This chapter describes the characteristics of the
research process and its relation to the scientific method, distinguishes
among the various forms of research used in the biomedical sciences, outlines
the principal steps involved in initiating a research project, and highlights
the importance of planning.

P.G. Supino, EdD
Department of Medicine, College of Medicine,
SUNY Downstate Medical Center,
450 Clarkson Avenue, Box 1199, Brooklyn, NY 11203, USA

Characteristics of Research

No discussion of research methodology should begin without examining the
characteristics of research and its relation to the scientific method. The
reason for this starting point is that the term "research" has been used so
loosely in common parlance and defined in so many different ways by scholars
in various fields of inquiry [1] that its meaning is not always appreciated by
those without a formal background. To understand more readily what research
is, it is useful to begin by considering some examples of what it is not.
Leedy, in his book Practical Research [2], describes two young students: one
whose teacher has sent him to the library to do research by gleaning a few
facts about Christopher Columbus and another who completes a research paper
on the Dark Lady in Shakespeare's sonnets by gathering facts, assembling a
bibliography, and referencing statements without drawing conclusions or
otherwise interpreting the collected data. Both students think that research
has taken place when, in fact, all that has occurred has been information
gathering and transport from one location to another. Leedy argues that these
misconceptions are reinforced at every grade level and that most students
facing the rigors of a graduate program lack clear understanding about the
specific requirements of the research process and underestimate what is
involved. In academic medical programs, it is not uncommon for a resident
to comment, I have a 2-week block available to

P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators, 1
DOI 10.1007/978-1-4614-3360-6_1, Phyllis G. Supino and Jeffrey S. Borer 2012
conduct a research project and to expect to design, execute, and complete it
in that time frame.
There is general consensus that information gathering, including reviewing
and synthesizing the literature, is a critically important activity to be
undertaken by an investigator. However, in and of itself, it is not research.
The same can be said for data gathering activities aimed at personal
edification or those undertaken to resolve organization-specific issues. So
what, then, characterizes research?
Tuckman [3] has argued that in order for an activity to qualify as research,
it should possess a minimum of five characteristics:
1. It should be systematic.
While some important research findings have occurred serendipitously (e.g.,
Fleming's accidental discovery of penicillin, Pasteur's chance finding of
microbial antibiosis), most arise out of purposeful, structured activity.
Structure is engendered by a series of rules for defining variables,
constructing hypotheses, and developing research designs. Rules also exist
for collecting, recording, and analyzing data, as well as for relating results
to the problem statement or hypotheses. These rules are used to generate
formal plans (or protocols) which guide the research effort, thereby
optimizing the likelihood of achieving valid results.
2. It should be logical.
Research employs logic that may be inductive, deductive, or abductive in
nature. Inductive logic is employed to develop generalizations from repeated
observations, abductive logic is used to form generalizations that serve as
explanations for anomalous events, and deductive logic is used to generate
specific assertions from known scientific principles or generalizations.
Further elaboration of these distinctions is covered in Chap. 3. Logic is used
both in the development of the research design and selection of statistics to
ensure that valid inferences may be drawn from data (internal validity). Logic
also is used to generalize from the results of the particular study to a
broader context (external validity or extrapolability).
3. It should be empirical.
Despite the deductive processes that may precede data collection, the findings
of research must always be based on observation or experience and, thus, must
relate to reality. It is the empirical quality of research that sets it apart
from other logical disciplines, such as philosophy, which also attempts to
explain reality. Recognition of this fact may pose a problem for physicians
who, according to some researchers [4, 5], have a cognitive style that tends
to be more deterministic than probabilistic, causing personal experience to be
valued more than data. Under these circumstances, the importance of
subordinating the hypothesis to data may not be fully appreciated. As part of
the education of the physician scientist, he or she must learn that when
confronted with data that do not support the study hypothesis, it is the
hypothesis and not the data that must be discarded, unless it is abundantly
clear that something untoward occurred during the performance of the study.
4. It should be reductive.
As Tuckman [3] has noted, a fundamental purpose of research is to reduce the
confusion of individual events and objects to more understandable categories
of concepts (p. 11). One heuristic tool used by scientists for this purpose is
the creation of abstractive constructs such as intervening variables (e.g.,
resistance and solubility in the physical sciences, conditioning or reflex
reserve in the behavioral sciences) to explain how phenomena cause or
otherwise interact with each other [6]. Another powerful tool available to the
researcher for this purpose is a constellation of techniques for numerical and
graphical data analysis (the specific methodology employed depending on the
objectives and design of the study as well as the number of observations
generated by the study). As Tuckman observes, whenever data are subjected to
analysis, some information is lost, specifically the uniqueness of the
individual observation. However, such losses are offset by gains in the
capacity to
conceptualize general relationships based on the data. As a result, the
investigator can explain and predict, rather than merely describe.
5. It should be replicable and transmittable.
The fact that research procedures are documented makes it possible for others
to conduct and attempt to replicate the investigation. The ability to
replicate research results in the confirmation (or, in some unhappy cases,
refutation) of conclusions. Confirmation of conclusions, in turn, results in
the validation of research and confers upon research a respectability that
generally is absent in other problem-solving processes. In addition, the fact
that research is transmittable also enables the general body of knowledge to
be extended by subsequent investigations based on the research. For this
reason, researchers are encouraged to present their findings as soon as
possible at local, national, and international scientific sessions and to
publish them expeditiously as letters (communications) or full-length articles
in peer-reviewed journals (to ensure their quality and validity).
6. It should contribute to generalizable knowledge.
The Tuckman criteria speak to the structure and process of research, but not
to its intended objectives. The Belmont Report [7], which codified the
definition of human subjects research for the US Department of Health and
Human Services, argues additionally that for an activity to be considered
research, it must contribute to generalizable knowledge (the latter expressed
in theories, principles, and statements of relationships). For knowledge to be
generalizable, the intent of the activity must be to extrapolate findings from
a sample (e.g., the study subjects) to a larger (reference) population to
define some universal truth, and be conducted by individuals with the
requisite knowledge to draw such inferences [8]. Because research seeks
generalizable knowledge, it differs from information gathering for diagnosis
and management of individual patients. It also differs from formal evaluation
procedures (e.g., review of data performed for clinical quality improvement
[CQI] or formative and summative appraisals of educational programs) which,
while employing many of the same rigorous and systematic methodologies as
scientific research, principally aim to inform decision making about
particular activities or policies rather than to advance more wide-ranging
knowledge or theory. As Smith and Brandon [9] have noted, research generalizes
whereas evaluation particularizes.

Types of Research

There are multiple ways of classifying research, and the categorizations noted
below are by no means exhaustive. Research can be classified according to its
theoretical versus practical emphasis, the type of inferential processes used,
its orientation with respect to data collection and analysis, its temporal
characteristics, its analytic objective, the degree of control exercised by
the investigator, or the characteristics of the measurements made during the
investigation. These yield the following categorizations: basic versus applied
versus translational, hypothesis testing versus hypothesis generating,
retrospective versus prospective, longitudinal versus cross-sectional,
descriptive versus analytic, experimental versus observational, and
quantitative versus qualitative research.

Basic Versus Applied Versus Translational Research

Traditionally, research in medicine, as in other disciplines, has been
classified as basic or applied, though the lines between the two can, and do,
intersect. In basic research (alternatively termed "fundamental" or "pure"
research), the investigation often is driven by scientific curiosity or
interest in a conceptual problem; its objective is to expand knowledge by
exploring ideas and questions and developing models and theories to explain
phenomena. Basic research typically does not seek to provide immediate
solutions to
practical problems (indeed, it can progress for decades before leading to
breakthroughs and paradigm shifts in practice), though it can yield unexpected
applications (e.g., the discovery of the laser and its value for fiber-optic
communications [10]), and it often provides the theoretical underpinnings of
applied research. Applied research, in contrast, is conducted specifically to
find solutions to practical problems in as rapid a time frame as possible. In
medicine, applied research searches for explicit knowledge to improve the
treatment of a specific disease or its sequelae. Examples of applied research
include clinical trials of new drugs and devices in human subjects or
evaluation of new uses for existing therapeutic interventions.
In recent years, translational or translative research has emerged as a
paradigm alternative to the dichotomy between basic and applied research.
Currently practiced in the natural, behavioral, and social sciences, and
heavily reliant on multidisciplinary collaboration, translational research is
a method of conceptualizing and conducting basic research to render its
findings directly and more immediately applicable to the population under
study. In medicine, this iterative approach is used to translate results of
laboratory research more rapidly into clinical practice and vice versa ("bench
to bedside and back" or "T1 translation") and from clinical practice to the
population at large ("to the community and beyond and back" or "T2
translation") to enhance public knowledge. This is one of the major
initiatives of the US National Institutes of Health (NIH) Roadmap for Medical
Research. Examples of T1 translation include the development of a technique
for evaluating endothelium-dependent vasodilator responses as a diagnostic
test in patients with atherosclerosis and the elucidation of the role of the
p53 tumor suppressor gene in the regulation of apoptosis in the treatment of
patients with cancer [11]. Examples of T2 translation would include the
implementation, evaluation, and ultimate adoption of interventions that have
been shown to be effective in clinical research for primary or secondary
prevention in heart disease, stroke, and other disorders. (For an in-depth
discussion of purpose, challenges, and techniques of translational research in
clinical medicine and associated career opportunities, the reader is referred
to the collective works of Schuster and Powers [12], Woolf [13], Robertson and
Williams [14], and Goldblatt and Lee [15].)

Hypothesis-Generating Versus Hypothesis-Testing Research

Although some studies are undertaken to describe a phenomenon (e.g., incidence
of a new disease or prevalence of an existing disorder in a new population),
most research is performed to generate a hypothesis or to test a hypothesis.
In hypothesis-generating research, the investigator begins with an observation
(e.g., a newly discovered pattern, a rare event) and constructs an argument to
explain it. Hypothesis-generating research typically is conducted when
existing theory or knowledge is insufficient to explain particular phenomena.
Popular tools for hypothesis generation in preclinical research include gene
expression microarray studies; hypotheses for clinical or epidemiological
research may be generated secondary to a project's initial purpose by mining
existing datasets. In contrast, in hypothesis-testing research (sometimes
called the hypothetico-deductive approach), the investigator begins with a
general conjecture or hunch put forth to explain a prior observation or to
clarify a gap in the existing knowledge base.
It is vitally important that the investigator keep these differences in mind
when designing and drawing inferences from a study. To underscore what can
happen when these distinctions are blurred, it is instructive to step back
from scientific inquiry and mull over the following scenario:
A Texas cowboy fires his gun randomly at the side of a barn. Figure 1.1 (left
panel) shows his results. He pores over his efforts, paints a target centered
around his largest number of hits (Fig. 1.1, right panel), and claims to be a
sharpshooter.
Do you agree that the Texan is a sharpshooter? Do you think that if he
repeated his so-called
target practice, he would again be able to get that many bullets in the
circle? Note: the Texan defined his target only after he saw his results. He
also ignored the bullets that were not in the cluster! This parable
illustrates what epidemiologists call the Texas Sharpshooter Fallacy [16] to
underscore the dangers of forming causal conclusions about cases of disease
that happen to cluster in a population due to chance alone or to reasons other
than the chosen cause. As per Atul Gawande, in his classic article in The New
Yorker, of the myriad of cancer clusters studied by scientists in the United
States, not one has convincingly identified an underlying environmental cause
[17]. In a more general sense (and particularly germane to the activities of
some biomedical researchers), the Texas Sharpshooter Fallacy is related to the
clustering illusion, which refers to the tendency of individuals to interpret
patterns in randomness when none actually exists, often due to an underlying
cognitive bias.

Fig. 1.1 The Texas sharpshooter fallacy

Fig. 1.2 Variables included in an exploratory dataset based on 95 patients
with chronic coronary artery disease

Consider a more clinical example: A resident inherits a dataset that contains
information about 95 patients with chronic coronary artery disease. Figure 1.2
depicts the variables in that dataset. He believes that he could satisfy his
research elective if he could draw inferences about this study group, though
he has no a priori idea about what relationships would be most reasonable to
explore. He recruits a friend who happens to have a statistical package
installed on his computer, enters all of the variables in the dataset into a
multiple regression model, and comes up with some statistically significant findings, as noted below:
Ischemia severity and benefit of coronary artery bypass grafting (CABG): p < 0.001
Hair color and severity of myocardial infarction (MI): p < 0.03
Zip code and height: p < 0.04
He concludes that he has confirmed the hypothesis that there is a strong association between preoperative ischemia severity and benefit of coronary artery bypass grafting because not only was the obtained probability (p) value low, but his hypothesis also makes clinical sense. He also decides that he will not report the other findings because, while they also are statistically significant, he cannot explain them. What methodological error has the resident made in drawing his conclusion?
The answer is that, analogous to the rifleman who defined his target only after the fact, the resident confirmed a hypothesis that did not exist before he examined patterns in his data. The fallacy would not have occurred if the resident had, in mind, a prior expectation of a particular association. It also would not have occurred had the resident used the data to generate a hypothesis and validated it, as he should have, with an independent group of observations if he wanted to draw such a definitive conclusion. This is an important distinction because the identification of an association between two or more variables may be the result of a chance difference in the distribution of these variables, and hypotheses identified this way are suggestive at best, not proven. What one cannot do is to use the same data to generate and test a hypothesis.
Moreover, the resident compounded his error by capitalizing on only one association that he found, ignoring all of the others. Working with hypotheses is like playing a game of cards. You cannot make up rules after seeing your hand, or change the rules midstream if you do not like the hand that you have been dealt. Similarly, if you gather your data first and draw conclusions based only on those you believe to be true, you have, in the words of the famed behavioral scientist, Fred Kerlinger, violated the rules of the scientific game [18]. The most important take-home point is that a hypothesis, if you wish to test it, always should be generated before data collection begins.
Hypothesis-testing studies (especially randomized clinical trials [RCTs]) are highly regarded in medicine because, when based on correct premises, properly designed, and adequately powered, they are likely to yield accurate conclusions [19]; in contrast, conclusions drawn from hypothesis-generating studies, even when well designed, are more tentative than those of hypothesis-testing studies due to the myriad of explanations (hypotheses) one can infer from the observation of a phenomenon.
For these reasons, hypothesis-generating studies are appropriately regarded as exploratory in nature. These differences notwithstanding, there is general consensus that hypothesis-testing and hypothesis-generating activities both are vital aspects of the research process. Indeed, the latter are the crucial initial steps for making discoveries in medicine. As Andersen [20] has correctly argued, without hypothesis-generating activities, there would be no hypotheses to test and the body of theory and knowledge would stagnate. The critical role of the hypothesis in the research process and the logical issues entailed in formulating and testing hypotheses are further discussed in Chap. 3.

Retrospective Versus Prospective Research

Research often is classified as retrospective or prospective. However, as pointed out by Catherine DeAngelis, former editor-in-chief of the Journal of the American Medical Association (JAMA), these terms are among the most frequently misunderstood in research [21] in part because they are used in different ways by different workers in the field and because some forms of research do not neatly fall within this dichotomy. Many methodologists [22, 23] consider research to be retrospective when data (typically recorded for purposes other than research) are generated prior to initiation of the study and to be prospective when data are collected starting with or subsequent to initiation of the study. Others, including
DeAngelis, prefer to distinguish retrospective from prospective research according to the investigator's and subject's orientation in the data acquisition process. According to the latter view, a study is retrospective if subjects are initially identified and classified on the basis of an outcome (e.g., a disease, mortality, or other event) and are followed backward in time to determine the relation of the outcome to exposure to one or more risk factors (genetic, biological, environmental, or behavioral); conversely, the study is prospective if it begins by identifying and classifying subjects on the basis of the exposure (even if the exposure preceded the investigation), with outcome(s) observed at a later point in time [21].
There are various types of retrospective studies. The simplest (and least credible from the standpoint of scientific evidence) is the case study (or case report), which typically provides instructive, albeit anecdotal, information about unusual symptoms not previously observed in a medical condition or new combinations of conditions within a single individual [24]. The case series (or clinical series) is an uncontrolled study that furnishes information about exposures, outcomes, and other variables of interest among multiple similar cases. Though lack of control precludes evaluation of cause and effect, this type of study can provide useful information about unusual presentations or infrequently occurring diseases and can be used to generate hypotheses for testing, using more rigorous studies [24]. The most common type of retrospective research used to draw inferences about the relation of prior exposures to diseases (and the most rigorous of the various retrospective designs) is the case-control study. In this type of investigation, a group of individuals who are positive for a disease state (e.g., lung cancer) is compared with a group comprised of those who are negative for that disease state (e.g., free of lung cancer). By looking back at the medical record, we attempt to determine differences in risk factors (e.g., prior exposure to cigarette smoke or asbestos) that may account for the disease. Because of the temporal sequence and interval between the factor and the outcome variable and the availability of a comparison group (e.g., nondiseased subjects), the case-control study can be used to infer cause and effect associations, though various biases (discussed in depth in Chap. 4) may limit its value for this purpose.
The two most typical examples of prospective research in clinical medicine are observational cohort and experimental studies. In an observational cohort study, subjects within a defined group who share a common attribute of interest (e.g., newly diagnosed cardiac patients, new dialysis patients) who are free of some outcome of interest are identified on the basis of exposure to risk factors whose presence or absence is outside the control of the investigator. These individuals are followed over time until the occurrence of an outcome (or outcomes) that usually (but not always) is measured at a later date. In an experimental study, outcomes also are assessed at a later date, but subjects initially are differentiated according to their exposure to one or more interventions which have been purposively applied. (Further distinctions between observational and experimental studies are discussed below.)
Prospective research is less prevalent in the literature than retrospective research principally due to its relatively greater cost. In most prospective studies, the investigator must invest the time and resources to follow subjects and sometimes even apply an intervention if the study is experimental. Moreover, prospective studies usually require larger sample sizes. Why, then, would anyone choose a prospective design over a retrospective approach? One reason is that prospective studies (particularly RCTs and concurrent cohort studies, described below) potentially have more control over temporal sequence and extraneous factors, including selection and recall bias, although loss to follow-up can be problematic. Second, prospective designs are more appropriate than retrospective designs for rare exposures and relatively more common outcomes. Finally, if it is desired that the exposure be manipulated by the investigator, as in an experimental study, the relation between exposure and outcome can be evaluated only with a prospective design.
In many prospective studies (all RCTs, many cohort studies), the exposure takes place coincident with or following the initiation of the study.
This type of prospective research has been termed concurrent [25, 26] because the investigator moves along in parallel with the research process (i.e., from application or assessment of the exposure to ascertainment of the outcomes associated with the exposure). In other instances, the exposure and even the outcomes will have taken place in the past, i.e., before the investigator's involvement in the study. If the logic of the study is to follow subjects from exposure to outcome, the research may be termed a nonconcurrent prospective study [25, 26], a historical cohort study, or a retrospective cohort study (departing from the view of prospective research held by DeAngelis and others). These distinctions are shown in Fig. 1.3.

Fig. 1.3 Concurrent versus nonconcurrent prospective research (Reprinted with permission from [21])

Longitudinal Versus Cross-Sectional Research

As noted above, prospective studies sample members of a defined group at a common starting point (e.g., exposure to a putative risk factor or intervention) and follow them forward in time until the occurrence of a specified outcome (e.g., a disease state or event), whereas retrospective studies begin with existing cases and look back in time at the history of the subject to identify relevant exposures or other instructive trends. Both are examples of longitudinal research because subjects are examined on multiple occasions that are separated in time.
Not all studies have defined temporal windows between putative risk factors and outcomes. One that does not is the cross-sectional (or prevalence) study. With this approach, several variables are measured at the same point in time to determine their frequency and/or possible association within a group of individuals who are selected without regard to exposure or disease status. Such studies usually are based on data collected in the past for other purposes but can be based on information acquired de novo. When used with large representative samples (to permit valid generalizations), cross-sectional studies can
provide useful information about the prevalence of risk factors, disease states, and health-related knowledge, attitudes, and behaviors in a specified population. Cross-sectional studies are prevalent in the literature principally because they are relatively quick and easy to conduct and can be used to evaluate multiple associations. However, unlike the case-control study, where temporality between risk factor and outcome variables can be established (or at least inferred) in order to buttress a cause and effect relationship, cross-sectional studies are best suited for generating, rather than testing, such hypotheses [23].

Descriptive Versus Analytic Research

Research can be further subdivided into descriptive and analytical subtypes. In descriptive studies, the presence and distribution of characteristics (e.g., health events or problems) of a single group of subjects are examined and summarized (but are not intervened upon or otherwise modified) to determine who, how, and when they were affected and the magnitude of these effects. Descriptive studies can involve a single case or a large population. Though they are considered to be among the simplest types of investigation, they can yield fundamental information about an individual or group that is of importance when little is known about the subject in question. Modes of data collection for descriptive studies are primarily observational and include survey methods, objective assessments of physiological measures, and review of historical records. Methods of analysis include computation of descriptive statistics such as measures of central tendency and dispersion (quantitative studies) and verbal descriptions and content analysis (qualitative studies) [27]. Because descriptive studies contain no reference groups, they cannot be used to test hypotheses about cause and effect; however, they can be useful for hypothesis generation, thus providing the foundation for future analytic studies. Descriptive studies may be either retrospective or prospective. Retrospective descriptive studies include the single case study and case series formats. Prospective descriptive studies include natural history investigations that follow individual subjects or groups over time to determine changes in parameters of interest.
While descriptive studies attempt to examine what types of problems exist in a population, analytic studies attempt to determine how or why these problems came to be. Thus, the ultimate goal of analytic studies is to test prestated hypotheses about risk factors or interventions versus outcomes to elucidate causality. Analytic studies can be performed with two or more equivalent or matched comparison groups, in which case inferences are drawn on the basis of analysis of intergroup differences (comparative research) or by comparisons within a single group in which assessments are made over time before and after imposition of an intervention or a naturally occurring event. Analytic research can be retrospective (e.g., case-control studies) or prospective (e.g., observational cohort or experimental studies). Correlational analysis of cross-sectional data is classified as analytic by some [28] but not all [22] workers in the field.

Observational Versus Experimental Research

In this dichotomy, research is differentiated by the amount of control that the investigator has over the factors in the study by which the outcome variables are compared. In observational studies, the investigator is passive with respect to the factors of interest as these usually are naturally occurring risk factors or exposures outside of the investigator's control. He or she can identify them and measure them but cannot allocate subjects to treatment groups or deliberately manipulate a treatment to systematically study its effect. The investigator's sole responsibility is to select a design which can validly assess the impact of the risk factor on the outcome variable. In contrast, in experimental studies, the input of interest not only is measured or observed but is purposively applied by the investigator, who manipulates events by arranging for the intervention to occur
or, at the very least, arranges for random allocation of subjects to alternative treatment or control groups. As a consequence, most of the inherent differences that exist between comparison groups are minimized, if not eliminated, thereby providing greater capacity to determine cause and effect relationships between the intervention and the outcome. Unlike observational studies, which can either be prospective or retrospective, experimental studies, as noted earlier, always are prospective. Midway between observational and experimental studies is a methodology known as quasi-experimental research. With this approach, the investigator evaluates the impact of an intervention (e.g., a therapeutic agent, policy, program, etc.) which has been applied either to an entire population or to one or more subgroups on a nonrandom basis. Although he or she may have been directly involved in arranging the intervention, control is nonetheless suboptimal due to limitations in the quality of reference data; as such, inferences drawn from quasi-experimental studies, while stronger than those generated with purely observational data, are less robust than those drawn from true experimental investigations. Characteristics of the true experimental and quasi-experimental approaches are detailed more fully in Chap. 5.

Quantitative Versus Qualitative Research

Finally, research also can be differentiated according to whether the information sought is collected quantitatively or qualitatively. Quantitative research involves measurement of parameters (e.g., demographic, functional, geometric, or physiological characteristics; mortality, morbidity, and other outcome data; attitudes, knowledge, and behaviors) that have been obtained under standardized conditions by structured or semi-structured instrumentation and that may be subjected to formal statistical analysis. Typically, numerous subjects are studied and the investigator's contact with them is relatively brief and minimally interactive to avoid introduction of bias. In contrast, qualitative research gathers information about how phenomena are experienced by individuals or groups of individuals (and the context of these experiences) based on open-ended (unstructured) interviews, questionnaires, observation, and focus group methodology. Fewer subjects are studied than with quantitative research, but the investigator's contact with them is longer and more interactive. As Portney and Watkins [29] have noted, quantitative methods can be used across the continuum of research approaches to describe, generate, and test hypotheses, whereas qualitative methods typically are used for descriptive or exploratory (hypothesis-generating) research. Quantitative and qualitative research each subsumes many different methodologies.

Steps in the Research Process

As mentioned earlier, research is structured by a series of methodological rules which govern the nature and order of procedures used in the investigation. It is, therefore, necessary that a plan be developed prior to the study which incorporates these procedures. This is true, irrespective of the type of research involved. The following is a brief listing of the steps, identified by DeAngelis [21], which comprise the research process in general and the hypothetico-deductive approach in particular:
In the first stages of the project, the investigator will:
1. Identify the problem area or question.
2. Optimally, restate the question as a hypothesis.
3. Review the published literature and other information resources, including meeting abstracts and databases of funded resource summaries or blogs, to determine whether the hypothesis has been adequately evaluated or is in need of further study.
Prior to developing the research design, he/she will:
4. Identify all relevant study variables, knowledge of whose presence, absence, change, or interrelationship is the objective of the study.
In order to bring precision to the research, he/she will:
5. Construct operational definitions of all variables.
6. Develop a research design and analytic plan to test the hypothesis. The design will identify the nature and number of subjects from whom data will be obtained, the timing and sequence of measurements, and the presence or absence of comparison groups or other procedures for controlling bias. The analytic plan will define the statistical procedures to be performed on the data and must be prespecified to minimize the likelihood of reaching spurious conclusions.
7. If data collection instruments are available, they must be specified. If not, they must be constructed. (Data collection instruments include all tools used to collect relevant observations in the study such as physiological measurements, questionnaires, interviews, and case report forms, to name a few.)
8. A data collection plan, containing provisions for accrual of subjects and for recording and management of data, must be designed.
Only after these important preparatory steps have been taken should the investigator proceed to:
9. Collect and process the data.
10. Conduct statistical analysis to describe the dataset and test hypotheses.
11. After the data are analyzed, conclusions are drawn and these are related to the problem statement and/or hypotheses.
12. Finally, the research report is written and, if accepted after peer review, is presented and/or published as a journal article.
The importance of following a research plan was addressed by Marks [30], who described a number of typical planning errors and their negative consequences. To cite one example, Marks detailed the experience of an investigator who failed to receive renewal of his multiyear research grant because he could not report the results of the data analysis to the granting agency. This occurred because he failed to develop a mechanism for the storage, handling, and analysis of data. Due to staffing changes and other factors, some of the data were lost, and what was located had not been recorded uniformly. As a result, years of hard work were wasted. In a second example, addressing scheduling problems, Marks describes the failure of an investigator, studying the effects of a drug developed for patients undergoing elective coronary artery bypass grafting, to complete his research project within his specified time frame. Though the investigator had the foresight to calculate his required sample size and to estimate patient accrual rates, he made the mistake of allowing only 4 months to study 30 patients. Much to his chagrin, a poorly worded consent form submitted to his institutional review board (IRB) delayed him approximately 6 weeks and, by then, the number of nonemergency operations had dropped dramatically due to the winter holidays. After 4 months, only a quarter of his sample had been accrued, and no data analysis had been performed.
Other common problems associated with poor planning include inability to implement or complete a study (due to disregard of organizational, political, or ethical factors), loss of statistical power to confirm hypotheses (due to inadequate attention to patient accrual factors, attrition of subjects, or excess variability in the study population), ambiguity of findings (due to lack of operational definitions or nonuniformity of data collection procedures), and unsound conclusions brought about by weak research designs, among others.
Marks' vignettes about the adverse consequences of poor research planning depict errors that unfortunately are not uncommon. A number of years ago, in this author's first position as a research director (at an institution that I shall decline to name), I was asked to implement a research project, previously designed by a principal investigator (PI) who was senior to me at the time. The purpose of the project was to evaluate the impact of an in-hospital patient education program after a first myocardial infarction. Four hospitals were involved in the study: two intervention sites and two controls ("business as usual"). In the first phase, patients at Hospital A received the new educational program and patients in Hospital B did not. In the replication
phase, patients at Hospital C received the new intervention and patients at Hospital D did not. The instrument chosen to evaluate depression was the Beck Depression Scale and the instrument chosen to evaluate anxiety was the State-Trait Anxiety Scale. The study design compared responses before and after the educational program by site. Being schooled in psychometrics, I was concerned about the reliability and validity of these instruments for this population but was told that these had been extensively used and previously validated in other patient populations. I also had concerns about the quality of the experiences that patients were receiving at the control hospitals but was told that for political reasons, we could not ask too many questions. Additionally, I had concerns about the implementation of the educational intervention but was told that this was firmly under the control of the nurse coordinator. I next argued for a pilot before launching this very costly and lengthy research project but was told that there was no time and that the PI did not wish to "waste patients."
And so the intervention proceeded according to protocol for well over 2 years. No interim analysis ever was performed because the PI thought that would be too expensive and waste time. When the primary data finally were analyzed, there were no detectable differences whatsoever between the outcomes obtained in the experimental versus control hospitals. The PI was horrified and did not understand how this could have happened. When the process data were analyzed post hoc, we learned that, due to staffing problems at the experimental sites, many nurses who were entrusted to implement the educational intervention had attended few, if any, in-service sessions about the intervention. Moreover, even though the new intervention had a beautifully designed curriculum that had been packaged in a glossy binder, it became known only after the fact that quality patient education also had taken place at Control Hospital B, and we never knew what was done at Control Hospital D, again, for political reasons.
A final problem concerned the instrumentation. Though, in fact, both the Beck Depression and State-Trait Anxiety Scales had been validated, the validation had not been performed on patients shortly after an acute myocardial infarction. An analysis of baseline scores revealed that most patients were neither depressed nor anxious, apparently due to the unanticipated effects of sedation or denial. Thus, low scores on these primary measures (which clearly were administered too soon after the index event) could not possibly improve due to what are called "floor effects." Needless to say, the private foundation that funded this study was less than thrilled, and none of you have ever seen it in published form. Examples like these abound in research but usually are not reflected in the literature because aborted or incomplete research investigations are never published, and those failing to demonstrate statistically significant differences (or associations) are published far less often than those that do, a phenomenon known as publication bias [31], further discussed in Chap. 9.
A number of years ago, a pediatric emergency fellow at another area hospital approached me for assistance with a dataset that she had compiled over a 4-month period. The data profiled the presenting complaints, diagnoses, and disposition of a series of children who had presented to an emergency room after having complained of largely nonserious illnesses during school. I asked her for a copy of her protocol, but she told me that she did not have one because her study was a chart review, based on de-identified anonymous data and, therefore, was "IRB Exempt." I next asked her for a written copy of her research plan to which she responded, "I never developed one because my clinical mentor told me that it wasn't necessary, and I didn't know that I needed one." I asked her what schools the children had come from and who had made the decision to bring them to the emergency room, but she couldn't answer these questions because that information was not routinely included in the medical chart, which was the source of all of her data. I asked
her why she had selected a retrospective chart review as her study design, and she answered that the charts were readily available and that she hadn't thought about any other approach. I asked her why she thought the research study was worth doing, to which she responded, "I'm not sure, but maybe the data will encourage emergency physicians to better counsel parents and school officials who refer relatively healthy children to the emergency room and, thus, cut down on inappropriate visits."
Feeling sorry for her, I helped her to sort out whatever data she had, and to write an abstract and manuscript that appeared to be respectable, at least superficially. The abstract was accepted at an international meeting (which had somewhat less stringent standards than domestic meetings in her specialty), but when she submitted her manuscript for publication in an academic journal, it was rejected. The reviewers correctly argued that without knowing who made the decision to bring the child to the emergency room, the study had failed its primary objective, which was to furnish information that potentially could alter decision-making patterns for this patient population. Had the fellow developed a proper research plan in the first place, she would have better conceptualized her study and saved months of her time on what was essentially a fruitless undertaking.
The moral posed by these stories is that adequate planning is vital for achieving research objectives and for minimizing the risk of wasting time and resources. As Marks correctly argues, "The success of a research project depends on how well thought out a project is and how potential problems have been identified and resolved before data collection begins" [30].
In subsequent chapters, we will consider many of the fundamental concepts, principles, and issues involved in planning and implementing a well-designed study. It is hoped that awareness of these factors will help you to achieve your research objectives, minimize your risk of wasting time and resources, and result in a more rewarding research experience.
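
The scheduling failure in Marks' second vignette can be checked with simple arithmetic before a study starts. The following is a minimal sketch rather than a method taken from Marks or DeAngelis. The function names and the monthly accrual rates are hypothetical, chosen only for illustration. Only the sample of 30, the 4-month window, and the delay of roughly 6 weeks (1.5 months) come from the vignette.

```python
def projected_accrual(monthly_rates, delay_months):
    """Cumulative subjects enrolled by the end of each month.

    monthly_rates[i] is the (assumed) number of eligible subjects in month i
    if enrollment were open all month; delay_months is the start-up delay
    (e.g., waiting for IRB approval) before any subject can be enrolled.
    """
    cumulative = []
    total = 0.0
    for month_index, rate in enumerate(monthly_rates):
        # Fraction of this month during which enrollment is actually open.
        open_fraction = min(1.0, max(0.0, (month_index + 1) - delay_months))
        total += rate * open_fraction
        cumulative.append(total)
    return cumulative


def first_month_reaching(cumulative, target_n):
    """First month (1-based) in which accrual meets the target, else None."""
    for month_number, enrolled in enumerate(cumulative, start=1):
        if enrolled >= target_n:
            return month_number
    return None


if __name__ == "__main__":
    target = 30
    # Plan as originally imagined: steady accrual, no delay (hypothetical rates).
    optimistic = projected_accrual([7.5, 7.5, 7.5, 7.5], delay_months=0.0)
    # Same plan with a 1.5-month delay and an assumed holiday slowdown.
    delayed = projected_accrual([7.5, 7.5, 2.0, 2.0], delay_months=1.5)
    print("optimistic:", optimistic, first_month_reaching(optimistic, target))
    print("delayed:   ", delayed, first_month_reaching(delayed, target))
```

Under these assumed rates, the delayed schedule reaches only about a quarter of the target by month 4. That is the kind of shortfall a written data collection plan (step 8) is meant to expose before enrollment begins.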

Take-Home Points

Research is a rigorous problem-solving process whose ultimate goal is the discovery of new knowledge.
Research may include the description of a new phenomenon, definition of a new relationship, development of a new model, or application of an existing principle or procedure to a new context.
Research is systematic, logical, empirical, reductive, replicable and transmittable, and
Research can be classified according to a variety of dimensions: basic, applied, or translational; hypothesis generating or hypothesis testing; retrospective or prospective; longitudinal or cross-sectional; observational or experimental; quantitative or qualitative.
The ultimate success of a research project is heavily dependent on adequate planning.
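
The distinction between hypothesis generating and hypothesis testing, illustrated earlier in the chapter by the resident who screened every variable in his dataset, can be made concrete with a short simulation. This is a minimal sketch under stated assumptions, not the resident's analysis. It replaces his multiple regression with independent two-group z-tests on pure noise (known variance of 1), and all function names and default parameters are hypothetical. It estimates how often at least one "significant" association appears when no real association exists.

```python
import random
from statistics import NormalDist, fmean


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive among n_tests independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def null_p_value(rng, n_per_group):
    """Two-sided z-test p-value for two groups drawn from the SAME N(0, 1) population."""
    group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    # With known unit variance, the difference in means has SD sqrt(2 / n).
    z = (fmean(group_a) - fmean(group_b)) / (2.0 / n_per_group) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate_dredging(n_tests, n_per_group=20, n_datasets=1000, alpha=0.05, seed=1):
    """Fraction of noise-only datasets yielding at least one p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_datasets):
        if any(null_p_value(rng, n_per_group) < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_datasets


if __name__ == "__main__":
    for n_tests in (1, 5, 20):
        print(n_tests,
              round(familywise_error_rate(n_tests), 3),
              simulate_dredging(n_tests))
```

With 20 unrelated variables screened at the 0.05 level, the analytic rate is about 0.64, so a "finding" is more likely than not even when nothing is there. This is why an association discovered this way is suggestive at best and must be tested on an independent group of observations.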

References

1. Calvert J, Martin BR (2001) Changing conceptions of basic research? Brighton, England: Background document for the Workshop on Policy Relevance and Measurement of Basic Research, Oslo, 29-30 Oct 2001. Brighton, England: SPRU.
2. Leedy PD. Practical research. Planning and design. 6th ed. Upper Saddle River: Prentice Hall; 1997.
3. Tuckman BW. Conducting educational research. 3rd ed. New York: Harcourt Brace Jovanovich; 1972.
4. Tanenbaum SJ. Knowing and acting in medical practice. The epistemological policies of outcomes research. J Health Polit Policy Law. 1994;19:27-44.
5. Richardson WS. We should overcome the barriers to evidence-based clinical diagnosis! J Clin Epidemiol. 2007;60:217-27.
6. MacCorquodale K, Meehl PE. On a distinction between hypothetical constructs and intervening variables. Psych Rev. 1948;55:95-107.
7. The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research: The Belmont Report: Ethical principles and guidelines for the protection of human subjects of research. Washington: DHEW Publication No. (OS) 78-0012, Appendix I, DHEW Publication No. (OS) 78-0013, Appendix II, DHEW Publication (OS) 78-0014; 1978.
8. Coryn CLS. The fundamental characteristics of research. J Multidisciplinary Eval. 2006;3:124-33.
9. Smith NL, Brandon PR. Fundamental issues in evaluation. New York: Guilford; 2008.
10. Committee on Criteria for Federal Support of Research and Development, National Academy of Sciences, National Academy of Engineering, Institute of Medicine, National Research Council. Allocating federal funds for science and technology. Washington, DC: The National Academies; 1995.
11. Busse R, Fleming I. A critical look at cardiovascular translational research. Am J Physiol Heart Circ Physiol. 1999;277:H1655-60.
12. Schuster DP, Powers WJ. Translational and experimental clinical research. Philadelphia: Lippincott, Williams & Wilkins; 2005.
13. Woolf SH. The meaning of translational research and why it matters. JAMA. 2008;299:211-21.
14. Robertson D, Williams GH. Clinical and translational science: principles of human research. London: Elsevier; 2009.
15. Goldblatt EM, Lee WH. From bench to bedside: the growing use of translational research in cancer medicine. Am J Transl Res. 2010;2:1-18.
16. Milloy SJ. Science without sense: the risky business of public health research. In: Chapter 5, Mining for statistical associations. Cato Institute. 2009. http:// Retrieved 29 Oct 2009.
17. Gawande A. The cancer-cluster myth. The New Yorker, 8 Feb 1999, p. 34-37.
18. Kerlinger F. [Chapter 2: problems and hypotheses]. In: Foundations of behavioral research 3rd edn. Orlando: Harcourt, Brace; 1986.
19. Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2:e124. Epub 2005 Aug 30.
20. Andersen B. Methodological errors in medical research. Oxford: Blackwell Scientific Publications; 1990.
21. DeAngelis C. An introduction to clinical research. New York: Oxford University Press; 1990.
22. Hennekens CH, Buring JE. Epidemiology in medicine. 1st ed. Boston: Little Brown; 1987.
23. Jekel JF. Epidemiology, biostatistics, and preventive medicine. 3rd ed. Philadelphia: Saunders Elsevier; 2007.
24. Hess DR. Retrospective studies and chart reviews. Respir Care. 2004;49:1171-4.
25. Wissow L, Pascoe J. Types of research models and methods (chapter four). In: An introduction to clinical research. New York: Oxford University Press; 1990.
26. Bacchieri A, Della Cioppa G. Fundamentals of clinical research: bridging medicine, statistics and operations. Milan: Springer; 2007.
27. Wood MJ, Ross-Kerr JC. Basic steps in planning nursing research. From question to proposal. 6th ed. Boston: Jones and Bartlett; 2005.
28. DeVita VT, Lawrence TS, Rosenberg SA, Weinberg RA, DePinho RA. Cancer. Principles and practice of oncology, vol. 1. Philadelphia: Wolters Kluwer/Lippincott Williams & Wilkins; 2008.
29. Portney LG, Watkins MP. Foundations of clinical research. Applications to practice. 2nd ed. Upper Saddle River: Prentice Hall Health; 2000.
30. Marks RG. Designing a research project. The basics of biomedical research methodology. Belmont: Lifetime Learning Publications: A division of Wadsworth; 1982.
31. Easterbrook PJ, Berlin JA, Gopalan R, Matthews DR. Publication bias in clinical research. Lancet. 1991;337:867-72.
Developing a Research Problem
Phyllis G. Supino and Helen Ann Brown Epstein

P.G. Supino, EdD
Department of Medicine, College of Medicine,
SUNY Downstate Medical Center,
450 Clarkson Avenue, Box 1199, Brooklyn, NY 11203, USA

H.A.B. Epstein, MLS, MS, AHIP
Clinical Librarian, Samuel J. Wood Library
and C.V. Starr Biomedical Information Center,
Weill Cornell Medical College, New York, NY, USA

P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology:
A Guide for Clinical Investigators,
DOI 10.1007/978-1-4614-3360-6_2, © Phyllis G. Supino and Jeffrey S. Borer 2012

Origins of Research Problems

A well-designed research project, in any discipline, will begin by
conceptualizing the problem: in its most general sense, an unresolved
issue of concern (e.g., a contradiction, an unproven relationship, an
unclear mechanism, a puzzling or enigmatic state) that warrants
investigation. This intellectual activity arguably is the most critical
part of the study, and many researchers consider it to be the most
difficult. This is particularly true in the early stages of a developing
science when theoretical frameworks are poorly articulated and when there
is little in the literature about the topic. Although formal rules and
procedures exist to guide the development of the research design, data
collection protocol, and statistical approach, there are few, if any,
guidelines for conceptualizing or identifying research problems, which
may take years of thought and exploration to define.

In his discussion of how problems are generated in science, Kerlinger
described the personal and, often, unsettling nature of the birth of the
research problem:

    The scientist will usually experience an obstacle to understanding, a
    vague unrest about observed and unobserved phenomena, a curiosity as
    to why something is as it is. His first and most important step is to
    get the idea out in the open, to express the problem in some
    reasonably manageable form. Rarely or never will the problem spring
    full-blown at this stage. He must struggle with it, try it out, live
    with it. Sooner or later, explicitly or implicitly, he states the
    problem, even if his expression of it is inchoate or tentative. In
    some respects, this is the most difficult and most important part of
    the whole process. Without some sort of statement of the problem, the
    scientist can rarely go further and expect his work to be fruitful [1].

Kerlinger's comments point up an important but, nonetheless, poorly
recognized fact. Namely, one of the most challenging aspects of the
research process is to develop the idea for the research in the first
place.

So, from where do research problems come? In general, most spring from
the intellectual curiosity of the investigator and, of necessity, are
shaped by his or her critical reasoning skills, experience, and
environment. Probably the most common source of clinical research
problems is the plethora of practical issues that clinicians confront in
managing patients which mandate data-driven decisions. For example, among
cardiologists, there has been long-standing interest in optimizing
management of patients with known or suspected coronary disease. What are
the best
algorithms and diagnostic modalities for differentiating symptoms of
myocardial ischemia from symptoms that mimic ischemia? When should such
patients be medically managed and when should they undergo invasive
therapeutic procedures? What is the risk-benefit ratio of percutaneous
coronary angioplasty vs. coronary artery bypass grafting? How often and
how should patients undergoing these procedures be evaluated after
intervention? What patient-level, societal, and economic factors
influence these decisions? Issues such as these have enormous public
health implications and have spawned hundreds of research studies.

Research problems also can be generated from observations collected in
conjunction with medical procedures [2]. A radiologist might have a set
of interesting data collected in conjunction with a new imaging modality
(e.g., full-field digital mammography) and might wish to know how much
more sensitive and specific this new modality is vs. older technology for
breast cancer screening. Alternatively, he might be interested in a new
application of an existing modality. A thoracic surgeon may have outcomes
data available from two competing surgical techniques. The process of
critically thinking about these data, sharing them with colleagues, and
obtaining their feedback can lead to interesting questions for analysis
and stimulate additional research.

Another source of research problems is the published scientific
literature, where an observed exception to the findings of past research
or accepted theory, unresolved discrepancies between studies, or a
general paucity of quality data on a clinically significant topic can
motivate thinking and point to an opportunity for future study. In
addition, well-crafted manuscripts typically document limitations in the
investigation (e.g., potential selection bias, inadequate sample size,
low number of endpoint events, loss to follow-up) and may suggest areas
for future research. Thus, thoughtful review of published research can
point to gaps in knowledge that potentially could be filled by new
investigations designed to refine or extend previous research.

Research problems also can be suggested by governmental and private
funding agencies which publish requests for proposals (RFPs) or
applications (RFAs) to address understudied areas affecting the public
health. These publications will explicitly identify a problem that the
agency would like an investigator to address, provide a background and
context for the problem, stipulate a study population (and, on occasion,
specify the approach to be taken), and indicate the level of support
offered to the potential investigator.

Finally, research problems can be fostered by environments that stimulate
an open interchange of ideas. These environments include scientific
sessions conducted by professional societies and organizations, grand
rounds given at hospitals and medical schools, and other conferences and
seminars. In recent years, methodological approaches such as
brainstorming, Delphi methods, and nominal group techniques [3–5] have
been developed and sometimes are utilized to facilitate the rapid
generation (and prioritization) of research problems by individuals and
groups.

Characteristics of Well-Conceived Research Problems

Although the genesis of a research problem is a complex, variable, and
inherently unpredictable process, fortunately, there are generally
agreed-upon criteria, described below, for evaluating the merits of the
problem once it has been generated [6–8]. Attention to these at the
outset will ensure a solid footing for the remainder of the
investigation.

The Problem Should Be Important

The most significant characteristic of a good research problem is
importance. A clinical research problem is considered important if its
resolution has the potential to clarify a significant issue affecting the
public health and, ultimately, cause the clinician (or health-care policy
maker) to make a decision or undertake an action that he or she would not
have made or undertaken had the problem not been addressed. The greater
the
need for clarification and the larger the number of individuals
potentially impacted (i.e., the greater the disease burden), the more
important the problem. For this reason, when research proposals are
submitted to a funding agency or when research manuscripts are submitted
to a journal for publication, perceived importance of the problem is
heavily weighted during the peer-review process. Indeed, importance of
the problem typically overshadows other criticisms such as incomplete
consideration of the literature, suboptimal methodology, and poor writing
style, as these flaws often can be remedied. Studies that merely
replicate other studies, with no significant alteration in methods,
content, or population (or that reflect only a minor incremental advance
over previous information) are considered unimportant and tend to fare
poorly in the peer-review process. This is true even if the study is well
designed. This point is illustrated below by the divergent comments
actually made by a reviewer in response to two different manuscripts
submitted for publication to a cardiology journal:

- Manuscript #1: This is a superb contribution which adds importantly to
  our knowledge about the pathophysiology of heart failure. The results
  of this well-focused study are of great clinical importance.
  (Recommendation: Accept)
- Manuscript #2: Comment: Despite a great deal of very precise and
  laborious effort and the generation of an extraordinary mass of
  numbers, little forethought was given to the focus or importance of the
  questions to be asked ... The finding is not unexpected, having been
  suggested by several earlier studies which have evaluated the issue of
  regional performance in different ways ... (Thus,) the authors'
  observations add little that is important or useful to the currently
  available literature. (Recommendation: Reject)

Evaluating the importance of a research problem requires considerable
knowledge of and experience in the discipline. For this reason, the new
investigator should seek the assistance of mentors and other experts
early on to maximize the likelihood that the proposed research will be
fruitful.

The Problem Should Be Interesting

As Hulley and Cummings have noted, a good research problem, especially if
suggested by someone else, must be interesting to the investigator to
provide the intensity of effort needed for overcoming the many hurdles
and frustrations of the research process [7]. It also should be
interesting to:

- The investigator's peers and associates to attract collaborators
- Senior scientists at the investigator's institution who can provide
  necessary mentorial support to guide the study (if the investigator is
  relatively junior)
- Potential sponsors to motivate them to fund the study (if outside
  funding is sought)
- Fellow researchers within the larger scientific community who,
  ultimately, will read and judge its findings
- Individuals outside the scientific community (e.g., clinicians in
  private practice, policy makers, the popular media, and consumers) who,
  optimally, will consider, disseminate, and/or utilize the eventual
  products of the research (if the problem is applied or translational in
  nature)

Gauging the potential interest of a research problem is difficult
because, as Shugan has noted, no research findings are innately
interesting. Instead, they are interesting only relative to a particular
audience within some context that they define [9]. While research can be
interesting simply because it is new, in general, a research problem will
tend to be viewed as noteworthy if it impacts a wide audience, has the
potential to cause significant change in what members of that audience
will do [9] (i.e., has importance), and is clearly framed within the
context of a current hot-button issue (or an older but nonetheless viable
issue). Before investing substantial time pursuing a research problem, it
is advisable that new researchers check with their mentors and/or other
experienced investigators with broad insights into the general area of
inquiry to confirm that the problem satisfies these criteria and, thus,
is likely to be interesting to others [10].

The Problem Should Lead to Clear, Researchable Questions

Many workers in the field use the terms "research problem" and "research
question" interchangeably. Others view the research problem as an
assertion about an issue of perceived importance that implies a gap in
knowledge from which questions may be developed (the position taken in
this chapter). Whichever view is held, there is general consensus that a
research problem should be clearly defined (see section "Crafting the
Problem and Purpose Statements" at the end of this chapter) and serve as
a springboard for questions whose answers can be found by conducting a
study [7, 10–12]. Ellis and Levy [13] argue that research questions are
important because they serve to operationalize the goals of the study by
narrowing them into specific areas of inquiry. Leedy and Ormrod [14]
assert that attaining answers to research questions both satisfies the
goals of the study and generally contributes to problem solving within
the area of interest. Kerlinger and Lee [15] further contend that an
investigation has meaning only when there is a clear nexus between the
answers obtained to the research questions and the primary research
problem. Like the problem itself, the questions should be clear, concise,
optimally lead to testable hypotheses, and collectively capture the
overall goal or purpose of the research project. In so doing, they serve
to guide the methodology used in the study. The reader should note that a
distinction is drawn between a research question and practical or
methodological questions that arise during the design or implementation
phases of the research (e.g., How many subjects are needed to provide
sufficient power for testing the hypothesis or to achieve a given level
of precision for estimating a population parameter? Which approaches are
best for enhancing patient recruitment, improving follow-up, and reducing
the likelihood of missing data? Given the investigator's constraints,
what study design(s) should be used to control for threats to valid
inference? Which statistical approach or approaches are most appropriate
given the nature of the data?) These and related methodological issues
are discussed, in depth, in other chapters of this book.

The Problem Should Be Feasible

A research problem (or a research question) should be feasible in two
respects. As Sim and Wright [16] have noted, it should be feasible on a
conceptual-empirical level, meaning that the concepts and propositions
embodied in the research should be susceptible to empirical evaluation.
As indicated in Chap. 1, it is the empirical quality of research that
differentiates it from other problem-solving processes. Accordingly, it
is important that the research question(s) central to the problem be
answerable and that answers to the question(s) be generated by the acts
of data collection and analysis (i.e., be produced empirically). These
criteria are sometimes difficult to satisfy. In order for a question to
be answerable, it must be clear, precise, and have a manageable set of
possible answers (the latter criterion also relates to the issues of
feasibility and scope, described below). The answer or answers also must
be inherently knowable and measurable. The question "how many angels can
dance on the head of a pin" is philosophically interesting, but it is
neither knowable nor measurable since there is no way to count angels,
assuming that they existed in the first place. The question also must be
framed in such a way that it will be obvious what type of data are needed
to answer it, and it must be possible to collect empirical evidence that,
when analyzed, will make a convincing argument when interpreted in
relation to that question. In order for empirical evidence to be
gathered, suitable subjects (for a clinical study) or material (for a
preclinical study) must be available, and valid and reliable instruments
must exist or be developed for measurement of the elements that comprise
the question. If these elements cannot be measured, the question cannot
be answered empirically. Examples of problems that would be difficult to
evaluate are:

- How well do patients adjust to life following an initial myocardial
  infarction?
- Following death of the cancer patient, how well do spouses handle their
  grief?

Both "adjustment" and "handling grief" clearly are difficult to evaluate
by empirical means, unless operational definitions and objective measures
are developed for both terms. In a
similar vein, questions soliciting opinions (e.g., "what should be done
to improve the health of a specific population?") and value-laden
questions such as "should terminally ill comatose patients be
disconnected from life support?" certainly are important and make
excellent subjects for argument. However, they (like any question
including the word "should") are not always assessable empirically and
may require special methods for data gathering (e.g., qualitative
techniques).

The problem also should be feasible on a practical level [16]. An
investigator must decide, early on, if he or she has the resources to
address it within a realistic time frame and at a reasonable cost. A
primary determinant of feasibility is the scope of the proposed problem.
In planning a research study, it is important to avoid selecting a
problem that is too broad because a single investigation cannot possibly
provide all relevant information about a problem. The process of
identifying the problem can raise ancillary questions that may be of
interest to the investigator, but it is important to prioritize these and
reserve some for future research so that the time and resources of the
investigator are not strained. An axiom in research planning is that it
is better to provide quality answers to a small number of questions than
to provide inferior information in volume. For example, should an
investigator wish to study the effect of drug therapy on patients with
heart disease, the question "What is the effect of drug therapy on
patients with poor heart function?", while conceptually interesting and
clinically important, is much too broad for one study and, in fact, would
require hundreds of investigations to answer adequately. The investigator
would do well to narrow the problem to include a given class of drugs
(e.g., adrenal steroids), a specific index of heart function (e.g., left
ventricular performance), and a specific population (e.g., patients with
chronic severe aortic regurgitation). On the other hand, the problem
should not be too narrowly defined. A question such as "what are the
effects of Inderal on the change in ejection fraction from rest to
exercise in 75-year-old Queens residents?" probably would result in a
criticism of the study as trivial.

The scope of a study can be gauged by the number of subproblems (discrete
areas of inquiry within the investigation) needed to express the main
problem. If the number of subproblems exceeds six, there is high
likelihood that the problem is too broad. In contrast, if an investigator
is unable to define a minimum of two subproblems, it may be too narrow
[17].

The issue of scope of the problem has direct practical implications for
the researcher. Even if the problem is important and empirically
testable, the investigator must balance these factors against the cost of
doing the research. Long before data are collected, the researcher must
decide whether he or she has the time or resources to collect and analyze
the data.

Factors affecting time include:

- The interval needed for subject accrual
- The time involved in administering the intervention (if the research is
  experimental)
- The time involved in collecting data on inputs such as risk factors (if
  the research is observational)
- The time involved in assessing outcome

Factors potentially affecting resources include:

- Costs of accruing and managing subjects (purchasing and housing of
  animals for a preclinical study, reimbursing human subjects for
  participation in a clinical research study)
- Cost of the intervention (if any)
- Costs of measurement procedures
- Cost of data collection, processing, and analysis
- Costs of equipment, supplies, and travel
- Technical expertise (the investigator's own research skills or access
  to skilled collaborators or consultants)

One way an investigator can determine feasibility is by conducting a
pilot study. A pilot study (sometimes called a feasibility study)
typically attempts to determine whether it is possible to address the
research problem (or subproblems) under conditions approximating those of
the larger, proposed study but with a smaller number of subjects over an
abbreviated period of time. The pilot can provide information about the
complexities of patient recruitment and the appropriateness of data
collection procedures (including acceptability of the research
instruments to the study subjects and approaches to detecting endpoints
and resolving issues associated with follow-up), and obtain preliminary
estimates of morbidity and event rates (among other variables) that can
be useful in informing sample size calculations for future
investigations. Occasionally, the pilot will produce preliminary answers
to the proposed research questions. If the investigator concludes that
examining the problem is unaffordable or is unfeasible time-wise, he or
she should consider modifications that may include:

- Delimiting the scope of the problem
- Broadening the inclusion criteria
- Relaxing the exclusion criteria
- Adding study sites
- Altering the study design used to address the problem (e.g., from a
  prospective to a retrospective design or from parallel group comparison
  to a repeated measures design to permit assessment of outcomes with
  fewer subjects [the pros and cons of these approaches are discussed in
  further detail in Chap. 5])

If successful, the results of a pilot study can be helpful in convincing
a potential funding agency that the proposed research is feasible and,
depending on the nature of the preliminary data, that the hypotheses are
likely to be confirmed by a larger study conducted by the same
investigators.

Another way to try out a research question is to present an idea or
preliminary data in a poster or emerging ideas section of a professional
meeting. Thoughts exchanged during a curbside chat may crystallize an
idea and may lead to valuable networking connections. Social media, like
wikis (collaborative, directly editable websites) and blogs (online
personal journals), are rich platforms to float ideas and exchange
comments. An example of a wiki is Medpedia: an open platform connecting
people and information to advance medicine (see www. Useful blogs include
Medical Discoveries (, Public Library of Science (PLoS Blog, accessible
at, Discovery Buzz (, and Trust the Evidence (

Examination of the Problem Should Not Violate the Ethical Standards
of the Scientific Community

The investigator may be interested in a problem that has significant
scientific or medical importance, but addressing it might expose patients
to significant risk. For example, a psychiatrist might be interested in
the effects of a particular psychotropic drug on patients with
obsessive-compulsive disorder. She believes that examination of this
problem is both clinically relevant and scientifically important because
review of the existing literature suggests that the agent not only has
the potential for reducing symptoms but also might provide insights into
the underlying processes related to this illness. Pilot data, however,
suggest that this drug is highly addictive and, in addition, may
adversely affect certain organ systems. Thus, despite scientific merit,
the conclusions generated might be at the expense of the overall
well-being of the subject. According to accepted standards of scientific
conduct, the study should not be done. These rules apply in industrial,
as well as in academic, settings. Thus, in the USA, when a pharmaceutical
company launches a new drug, it is required by the Food and Drug
Administration to perform highly regulated trials of feasibility (phase
I) and safety (phase II), before proceeding to a large, randomized phase
III efficacy trial. Generally, if the drug produced significant toxicity
in patients prior to or during the phase III trial, the investigation
would be aborted at that time, despite otherwise beneficial effects.
Similar guidelines are followed in most Western European countries.
Likewise, prior to conducting research in most academic medical centers,
an investigator is required to obtain approval of his or her research
protocol from that center's institutional review board (IRB),
particularly when that protocol poses more than minimal risk to the
subject. During this approval process, the
ethical considerations entailed in studying the problem are heavily
weighed. In clinical studies, these typically include:

- Proportionate risk: Is the risk to the subject outweighed by the
  potential benefit to that subject? If the IRB concludes that it is not,
  the study would not be permitted to go forward, despite its possible
  benefit to the same patient in the future or to society in general.
- Informed consent: Is the subject truly aware of the aims of the study?
  If so, is the subject also aware of the potential for any adverse
  consequences that might arise due to his or her participation? Several
  years ago, it came to light that a research investigation, undertaken
  at a medical center in New York, had been conducted on 28 adult
  schizophrenics who were not advised that they were participating in a
  study in which psychosis was temporarily induced [18]. The ethics of
  performing research on such vulnerable subjects, without their full
  knowledge, triggered a firestorm of controversy that caused their IRB
  to mandate an entirely new approach to studies of this nature.
- Role reversal: Would the investigator be willing to trade places with
  the subject? Would he or she be willing to suffer the same pain,
  discomfort, or, at the very least, inconvenience as the subject, as a
  result of participating in his or her own research study?
- Integrity of the design (validity): Is the study designed well enough
  to warrant the expenditure of time and effort, or the potential risk to
  the patient (i.e., is it likely to yield valid answers to the questions
  being asked?) If not, not only may the investigators be wasting their
  own time and that of their subjects, they also may be producing results
  that have the potential to mislead the medical community and,
  ultimately, their patients.

These and other ethical problems will be explored more fully in Chap. 12.

Types of Research Questions

Research questions in any discipline may be categorized in multiple ways.
Trochim [19] has argued that all research questions may be classified as
descriptive (What is occurring? What exists?), relational (What is the
association between two or more variables? Is the predictive value of one
variable greater than or independent of another variable?), or causal
(Does a treatment, program, policy, etc., affect one or more outcomes?).
Blaikie [20] contends that all research questions can be classified as
inquiries about "what," "why," or "how." According to this trichotomy,
"what" questions describe presence, magnitude, and variations of
characteristics in individuals, patterns in the relationships among these
characteristics, and associated outcomes; "why" questions ask about
causes of, or reasons for, the existence of phenomena, explanations about
relationships between events, and mechanisms underlying various
processes, whereas "how" questions deal with methods for bringing about
desired changes in outcomes via intervention. Research questions also can
be classified according to the type of inferences to be drawn. In
medicine, for example, questions characteristically target issues about
magnitude of disease burden, prevention, or patient management. Thus,
questions may be asked about prevalence and incidence of a disease (or
diseases) in a population:

- What influenza virus was most dominant in 2010?
- How many types of respiratory illness have been identified among the
  World Trade Center Disaster first responders?
- How many cases of breast cancer that were identified in Long Island,
  New York, occurred in Suffolk County?
- Is resistant tuberculosis on the rise in New York City?
- Is AIDS in Africa still considered to be an epidemic?

Questions also can focus on issues of primary prevention:

- Does use of margarine instead of butter protect against
  hypercholesterolemia and
- Does use of hormone replacement after menopause protect against the
  development of cardiovascular diseases among women?
- Is physical fitness protective against osteoporosis?
- Does application of dental sealants actually prevent the development of
  tooth decay?
- Have current local and global interventions and services reduced the
  transmission and acquisition of HIV infection?

Questions of most interest to clinicians, however, typically center on
issues related to the clinical management of patients with known or
suspected diseases. Borrowing from an evidence-based practice framework,
these can be subcategorized as questions about screening/diagnosis,
treatment, prognosis, etiology, or harm (from treatment) [21]. Examples
are given below:

- What is the most cost-effective way to differentiate children who are
  at risk for developmental delays from those who are not? (screening)
- What are the sensitivity, specificity, and positive and negative
  predictive values of positron emission tomography [PET] among women
  with suspected coronary artery disease? What is the diagnostic accuracy
  of PET vs. other available tests such as thallium scintigraphy?
  (diagnosis)
- What is the best (most effective, tolerable, cost-effective) currently
  available chemotherapy regimen for acute myeloid leukemia? (treatment)
- Is combination therapy better than single agent therapy for benign
  prostatic hypertrophy? (treatment)
- What is the probable clinical course of patients with aortic stenosis?
  (prognosis)
- Which patients with chronic, severe aortic regurgitation progress most
  rapidly to surgical indications? (prognosis)
- Is autoimmunity causally related to the development of Crohn's disease?
  Is it also implicated in the development of lupus and rheumatoid
  arthritis? (etiology)
- Do enzymes involved in the synthesis of the extracellular matrix play a
  role in the development of fibrotic diseases and cancer? (etiology)
- What is the magnitude of risk for adverse outcome of carotid
  endarterectomy among the elderly? (harm)
- What is the in-hospital mortality associated with valvular replacement?
  Is it greater with concomitant coronary artery bypass grafting? (harm)

Role of the Literature Search

Even if the research problem was sparked by previously published
research, once its basic elements have been defined, it is necessary to
conduct a comprehensive search of the literature to acquire a thorough
knowledge of relevant earlier findings, ongoing research, or new
theories. Although there is no set rule governing the optimal time frame
for a literature search or the number of publications to be included,
there is general consensus that the search should be of sufficient length
and breadth to include existing pertinent seminal and landmark studies
[22] as well as current studies in the field (i.e., those conducted
within the past 10 years). A proper literature search will help the
investigator to determine answers to the following questions:

- Has the problem been previously addressed? If so, was it adequately
  studied?
- Are the proposed hypotheses, if any, supported by current theory or
  knowledge?
- Does the methodology cited in the literature provide guidance on
  available instrumentation for measuring variables?
- Are the results of prior studies informative for calculation of sample
  size and power?
- Did previous investigators describe the limitations of their research
  or suggest areas for future study?

Seeking answers to these questions early in the planning process will
enable the investigator to determine whether performance of the present
study is feasible, whether it is likely to significantly contribute to
the existing knowledge base (thus supporting the need for the study), and
also whether it may provide guidance on the construction of hypotheses
and choice of study design. In addition, creating an automatic search
profile early in the planning process will keep the investigator informed
about the latest research related to his or her problem. The search
profile will

generate updated lists of new literature and provide alerts to these updates
via e-mail or RSS feed on a daily, weekly, or monthly basis, as desired. The
updates also can be used to alert the investigator to research performed by
other investigators and provide an opportunity for collaboration.
Like other aspects of a research project, the performance of a proper
literature search requires a significant investment of time and effort. This
is true in part because the results of most scientific investigations
(particularly those reflecting recent work or primary literature) are
dispersed over a myriad of e-mail communications, meeting abstracts, web
documents, and periodicals, rather than organized collectively in books or
other single sources of research. Traditionally, if an investigator needed
to learn more about earlier related work, he or she would begin by examining
key references cited in known relevant published studies. Today, continuing
this principle of "it only takes one good article to get you going," online
systems like PubMed from the National Library of Medicine, ISI Web of
Knowledge from Thomson Reuters, the EBSCOhost family of databases from EBSCO
Publishing, and the databases of Ovid Technologies, Wolters Kluwer Health,
and Google Scholar, generate a list of possible important citations and
invite you to click on the "related articles" link or the "times cited" link
to find similarly indexed papers or cited references from these papers to
locate additional relevant citations. A summary of selected core online
resources is provided in Table 2.1.
Most investigators will choose to search MEDLINE, the premier bibliographic
database from the National Library of Medicine. It is available by searching
PubMed, ISI Web of Knowledge, EBSCOhost, and Ovid plus many other free or
fee-based searching systems. The database covers the life sciences with a
concentration in biomedicine. Bibliographic citations with author abstracts
and linking to full text of many articles come from more than 5,400
biomedical journals published in the USA and around the world. Most
citations are written in English with English abstracts. MEDLINE contains
over 21 million citations dating back to the mid 1940s. For more information
about PubMed, see Many of the MEDLINE citations in PubMed link to the Gene,
Nucleotide, and Protein databases from the National Center for Biotechnology
Information (NCBI) for coverage of molecular biology. Google Scholar pulls
in freely available scholarly literature from PubMed and other sources, with
some linking to the full text of the articles.
MEDLINE may not provide adequate information about a research problem. Thus,
many investigators consider searching EMBASE in lieu of or in addition to
MEDLINE (which now is included within EMBASE). EMBASE is created by Excerpta
Medica and produced by Elsevier. One can subscribe to it individually from
Elsevier or through Ovid from Wolters Kluwer Health in three separate
databases: EMBASE, EMBASE Drugs and Pharmacology, and EMBASE Psychiatry.
There are over 24 million indexed records from more than 7,500 current,
mostly peer-reviewed journals covering biomedical and pharmacological
literature. In addition, there is extensive coverage of meeting abstracts.
Like MeSH from MEDLINE, EMBASE uses a hierarchical classification of subject
headings called EMTREE that can be expanded. EMBASE can be searched with
significant words, significant phrases, and EMTREE terms. Links to full text
of the journal articles are available from many medical libraries.
An investigator may also consider searching BIOSIS Previews, Biological
Abstracts, and Zoological Record together as a package from ISI Web of
Knowledge, a product of Thomson Scientific. This resource represents a
comprehensive index to the life sciences and biomedical research, including
meeting abstracts, journals, books and patents, and contains more than 18
million records taken from more than 5,000 international resources from 90
countries (1926 to present). BIOSIS Previews is available by searching the
Ovid suite of databases and ISI Web of Knowledge.
Web of Science's Science Citation Index Expanded, part of ISI Web of
Knowledge from Thomson Reuters, covers scientific literature from 1900 to
present. An investigator can search

Table 2.1 Selected core online resources

Name of resource | Description | Link to full text | Producer | Fee
BIOSIS | Bibliographic database; suite includes Biological Abstracts (1926–present), BIOSIS Previews (1926–present), and Zoological Record (1864–present) | Yes | Thomson Reuters; Ovid Technologies (Wolters Kluwer) | Yes
CINAHL | Bibliographic database for nursing and allied health disciplines | Yes | EBSCOhost | Yes
Cochrane Library | Family of systematic reviews, RCTs, health technology assessments, economic assessments | Yes | Wiley; Ovid Technologies (Wolters Kluwer) | Yes
EMBASE | Bibliographic database with international coverage; emphasis on biomedicine and drugs | Yes | EMBASE (available from various vendors) | Yes
MEDLINE | Bibliographic database of clinical medicine | Yes | National Library of Medicine (available from various vendors) | No
PsycInfo | Bibliographic database of scholarly journal articles, book chapters, and dissertations in behavioral science and mental health | Yes | American Psychological Association (available from various vendors) | Yes
PubMed | Premier database of biomedical literature, primarily MEDLINE | Yes | National Library of Medicine | No
Social Science Citation Index | Bibliographic database with links to citations in bibliography and items cited | Yes | Thomson Reuters | Yes
Web of Science | Bibliographic database indicating number of references, number of times cited (1900–present) | Yes | Thomson Reuters | Yes

this resource by subject topics and keywords. The citation display features
a summary abstract, a bibliography, and publications that have cited that
paper. As with many systems today, full text of the paper as well as related
article citations also may be linked. A citation map can be generated to
visually display for two generations the references in the bibliography and
cited papers.
If the investigator is interested in behavioral science research, the
American Psychological Association offers a suite of databases: PsycINFO,
PsycARTICLES, PsycBOOKS, PsycCritiques, and PsycEXTRA. Information can be
found on psychology and related disciplines (e.g., psychiatry, nursing,
neuroscience, law, education, sociology, social work). Available in a
variety of formats (e.g., journal articles, books or book chapters,
dissertations, technical and annual reports, government reports, conference
presentations, consumer brochures, magazines, among others), PsycINFO can be
searched with words, phrases, and terms from the Psyc thesaurus. Like MeSH,
the terms are arranged in alphabetical and hierarchical order.
Web of Science's Social Science Citation Index can be explored by those
interested in social sciences research. Almost 2,500 journals are indexed,
representing 50 social science and related disciplines, including
anthropology, urban studies, industrial relations, law, linguistics,
substance abuse, public health, and information and library sciences, among
others. Like Science Citation Index, the citation display features a summary
abstract, bibliography, and publications that have cited the paper; full
text of the paper and related article citations also may be linked. This
database also can be searched with words and phrases.
The EBSCOhost family of databases covers the humanities and social sciences.
It also includes CINAHL (Cumulative Index to Nursing and Allied Health
Literature). This database provides indexing for nearly 3,000 journals from
the fields of nursing and allied health, including librarianship, and
contains more than 2.2 million records dating back to 1981. Like MEDLINE,
EMBASE, and PsycINFO, one searches CINAHL with significant words and phrases
as well as CINAHL descriptors that can be expanded. Searchers can add
citations to a folder, permitting them to be printed, e-mailed, or saved.
Also, like other databases, CINAHL links to cited references.
Finally, for those seeking the latest information on evidence-based health
care, the Cochrane Library is an excellent source of systematic reviews
(discussed in depth in Chap. 9), RCTs, and health technology and economic
assessments. It is produced by the Cochrane Collaboration, a worldwide
effort dedicated to systematically reviewing the effectiveness of
health-care interventions, and is available from Wiley and Wolters Kluwer
Health via Ovid. Though the Cochrane Library can be searched with words,
phrases, and MeSH descriptors, its central database of randomized trials is
extensive (mandating a more precise searching strategy), whereas its
database of systematic reviews contains fewer than 5,000 elements (requiring
a broader search strategy). If the searcher is able to identify a systematic
review that contains a reasonable number of trials from which valid and
consistent inferences have been drawn, it may provide most of the literature
needed to support a research project.
Although web-based bibliographic programs have become increasingly
user-friendly by encouraging the searcher to place significant words,
phrases, and database subject terms in a search box, the search process
itself remains a combination of science and art which requires practice and
patience. In view of this, some investigators may opt to complete an online
tutorial, sign on to a web-based training session, attend an in-person
course at their local library, or consult with a librarian for training and
search planning. Some investigators will team up with a searching
professional to run the search together or, after a rigorous interview (in
which the goals of the study are carefully discussed), will have the
searching professional perform the search. For those without access to such
instructional resources, we offer the following recommendations:
• Frame your search topic in the form of a specific question or statement.
• Depending on your choice of search system(s), plan your search strategy
  accordingly with significant words, phrases, and database subject headings
  or descriptors.

• Decide whether empirical and/or theoretical literature is to be included:
  – Empirical literature comprises primary research reports (e.g.,
    observational studies, controlled trials) and systematic reviews of
    research.
  – Theoretical literature includes descriptions of concepts, models, and
    theoretical
• Identify preferred literature sources, for example, articles, book
  chapters, and dissertations.
• Determine the amount of information needed and the temporal period of
  interest.
• Evaluate the likelihood of finding specific information about your topic.
  If you think the topic is voluminous, use a more narrow approach to search
  the literature. If you think the topic will yield a small amount of
  literature, use a broader approach.
• Display and review all citations with as much text, searching terms, and
  related links as possible. Many articles will be available in full text
  directly from the searching system.
• If you determine that your retrieval is inadequate for your needs,
  consider modifying your search strategy and running your search again.
• Obtain and organize all source documents.
Once the key references have been compiled, these should be carefully
reviewed to identify the methodologies employed, conclusions drawn, and
limitations of the selected studies. It is of paramount importance that the
investigator carefully read the entire published study and any accompanying
editorials, comments, and letters, rather than rely on information given in
an abstract or in published reviews of the literature written by others.
This is because abstracts and review articles provide only incomplete
information; in addition, the perspective of the reviewing author may bias
the interpretation of primary findings contained in the review articles.
The information contained within each reference should be related to the
problem statement to form a nexus between the earlier studies and the
current research project. If the investigator determines that the literature
supports the need to study the proposed problem, he or she can proceed with
confidence, knowing that pursuit of the research project (if properly
designed and implemented) is likely to modify or extend the existing body of
knowledge. Moreover, information gained from the literature review
(including successes or failures of previous published work) can, as
indicated earlier, prove invaluable for refining the problem (if necessary),
buttressing or revising hypotheses, and validating or modifying the approach
taken.

Crafting the Problem and Purpose Statements

Once the problem has been conceptualized and the literature search
completed, the investigator is in a position to communicate to interested
parties (e.g., mentors, colleagues, potential sponsors) the nature, context,
and significance of the problem, including, typically, the type and size of
the affected population, what is known and not yet known, and the
consequences of the lack of knowledge (i.e., the implied or directly
stated), thus elucidating the active challenge to be addressed and
justifying the logical argument underlying the study. These elements are
incorporated collectively into a problem statement, a declarative set of
assertions, interwoven with literature support, which customarily appears in
the Introduction of the research report or in the Background and
Significance section of a research proposal (though, as Polit et al. [12]
have observed, the problem statement rarely is labeled as such and must be
ferreted out). As a general rule, a well-constructed problem statement
should be written as concisely as possible for optimal clarity yet contain
sufficient information to make a viable argument in support of the study and
elicit interest [13]. Abbreviated problem statements, condensed into a
sentence or two with minimal supporting argumentation, commonly are provided
in the beginning of the abstract accompanying the main body of the research
report or research proposal. (Ellis and Levy [13] refer to these reductions
as "statements of the problem" to differentiate them from fully developed
problem statements with appropriate argumentation.)
If the study is broad, it is recommended that the investigator divide the
main problem into subproblems, each of which addresses a single issue. It is
important that the sum of the content

Table 2.2 Examples of well-defined problem statements from two research reports

Fleming et al., Circulation, 2008 [23] | Walker et al., CMAJ, 2000 [24]
Atrial fibrillation (AF), the most common complication after cardiac surgery, is associated with significant morbidity, increased mortality, longer hospital stay, and higher hospital costs ... Because ventricular dysfunction is common following cardiac surgery, inotropic drugs are often necessary to improve hemodynamic status; however, the effect of inotropic drugs on postoperative AF has not been extensively studied ... Milrinone has been reported to be associated with a lower risk of postoperative AF compared to dobutamine use, but milrinone increases the risk of atrial arrhythmias in patients with acute exacerbation of chronic heart failure | Asymptomatic bacteriuria is common in institutionalized elderly people. The prevalence increases with age, occurring in up to 50% of elderly women and 35% of elderly men who reside in long-term facilities ... Despite lack of benefit, institutionalized older adults with asymptomatic bacteriuria are frequently treated with antibiotics. This practice is of particular concern given the deleterious effects of antibiotics, including the potential for the development of antibiotic resistance and adverse reactions seen in this population. Why antibiotics continue to be prescribed for asymptomatic bacteriuria is unclear

Table 2.3 Examples of well-defined statements of purpose from two published research studies

Fleming et al., Circulation, 2008 [23] | Walker et al., CMAJ, 2000 [24]
The aim of this analysis was to test the hypothesis that the use of inotropic drugs is associated with an increased risk of postoperative AF in cardiac surgery patients participating in an ongoing randomized, double blinded, placebo controlled trial | The aim of our study was to explore the perceptions, attitudes, and opinions of physicians and nurses involved in the process of prescribing antibiotics for asymptomatic bacteriuria in institutionalized elderly people

reflected in the subproblems equates to no more or no less than the content
reflected in the main problem. Like the main problem, the subproblems should
be stated clearly and be related to each other in a meaningful way so that
the research will maintain coherence.
Two examples of well-defined problem statements are given in Table 2.2. The
first (shown in the left column) is drawn from a quantitative study by
Fleming et al. [23] about the impact of milrinone on risk for atrial
fibrillation after cardiac surgery. The second (shown in the right column)
is drawn from a qualitative study by Walker et al. [24] addressing reasons
for prescription of antibiotic therapy among the asymptomatic
institutionalized elderly with bacteriuria. Note that, in each case, the
problem statement makes the argument that there is an important unresolved
issue that should be addressed, and sets the stage for what the investigator
intends to do to facilitate a solution.
The problem statement typically is followed by a statement of purpose
(usually the last sentence or two in the Introduction of the research report
or given as a list in the Specific Aims of the research proposal), which
succinctly identifies what the investigator intends to do (the type of
inquiry) to resolve the unknowns explicated in the problem statement.
Although, like the problem statement, the statement of purpose typically is
not labeled as such, it is easily identifiable as it includes the words
"purpose" ("the purpose of the study was/is ..."), "goal" ("the goal of the
study was/is ..."), or, alternatively, "intent," "aim," or "objective" [12].
In a quantitative study, the statement of purpose also identifies the key
variables to be examined and/or interrelated (parameters to be estimated,
hypotheses to be tested), the nature of the study population (who is
included), and, occasionally, the nature of the study design; in a
qualitative investigation, the purpose statement commonly will include the
phenomenon or phenomena under study (rather than hypotheses), as well as the
study group, community, or setting [12]. Shown in Table 2.3 are the purpose
statements from the Fleming and Walker studies. In both cases, the reader
will note that the statements of purpose flow directly from the problem
statements.
As Polit et al. have noted (and as illustrated above), the use of verbs in a
purpose statement is key to determining the thrust of the inquiry and also
helps to differentiate quantitative from qualitative studies [12]. The
former typically include terms such as "compare," "contrast," "correlate,"
"estimate," and "test," whereas the

Table 2.4 Examples of research questions restated from two statements of purpose

Fleming et al., Circulation, 2008 [23] | Walker et al., CMAJ, 2000 [24]
Does the use of inotropic drugs increase risk of postoperative AF in cardiac surgery patients? | What are the perceptions, attitudes, and opinions of physicians and nurses involved in the process of prescribing antibiotics for asymptomatic bacteriuria in institutionalized elderly people?

latter include terms such as "describe," "explore," "understand,"
"discover," and "develop." Verbs such as "prove" or "show" should be avoided
in purpose statements of research studies as these can be construed as
indicative of investigator bias [12].
As noted above, a statement of purpose can be expressed in declarative form.
However, some investigators instead will frame the purpose of their study
interrogatively as one or more research questions (each addressing a single
concept) that are directed at the unknowns in the problem statement.
Alternatively, these questions can be added to a global statement of purpose
to improve clarity and specificity. As Polit et al. contend, research
questions invite an answer and help focus attention on the kinds of data
that would have to be collected to provide that answer [12]. Listed in
Table 2.4 are research questions that could have been framed by Fleming et
al. and Walker et al. to address the targets of inquiry in their studies.
However written, both the problem and purpose of the study (or the research
questions) should be apparent to the reader in the Introduction of the
research report (or in the Background, Significance, and Specific Aims of
the research proposal) and should possess sufficient clarity for the reader
to understand them without the presence of the author. Unfortunately, this
is not always the case. Consider the statements articulated by Houck and
Hampson in the introduction to their study about carbon monoxide poisoning
following a winter storm in January 1993, when charcoal briquettes commonly
were used for heating in certain areas of the USA:

    A major epidemic of carbon monoxide poisoning occurred after a severe
    winter storm struck western Washington State during the morning of 20
    January 1993. Charcoal briquettes and gasoline-powered generators were
    principal sources of CO. Although previous reports have described CO
    poisoning following winter storms in the Eastern United States, the
    large number and wide distribution of cases following this storm are
    unique. Unintentional carbon monoxide (CO) poisoning is a substantial
    health problem in the US, causing an estimated 11,547 deaths from 1979
    through 1988. The US Consumer Product Safety Commission estimates that
    there was an average of about 28 charcoal-related deaths per year from
    1986 through 1992. Charcoal briquettes are not an uncommon source of CO
    poisoning in Washington State: 16% of the 509 unintentional poisoning
    cases that required hyperbaric oxygen treatment between October 1982 and
    October 1993 involved charcoal. Our investigation suggests that CO
    poisoning following severe winter storms should be anticipated. It also
    suggests that preventive messages are important public health messages,
    but that they should be understandable to those in the community who
    neither read nor speak English. [25]

Does the Introduction contain a clear statement of the problem so that it is
evident why the investigation was important? Is there a statement of purpose
(or a set of questions) that explains what the investigators did to address
the problem? Do the authors' introductory statements prepare the reader to
follow the rest of the paper? After all, that is the principal role of the
Introduction in a research manuscript. (For further details about the role
and proper construction of the Introduction of the scientific paper, the
reader is referred to Chap. 13.) Note that the authors have provided the
reader with a general background statement and also have presented their
conclusions in their Introduction, repeating information already given in
their Abstract. However, other than suggesting that their data were unique,
the rationale and aims of their study have not been articulated, and their
research questions remain undefined even after reading their comments. The
moral illustrated by this example is that for the published paper to engage
and edify the reader, the research problem, purpose, and/or research
questions must be unambiguously stated early in the research report.
When there is poor definition of problem and purpose, not only may the
reader become

confused, but these deficiencies may adversely impact the study methodology
because all subsequent steps in the research process (e.g., construction of
the research questions or hypotheses, development of the research design,
collection and analysis of data) are guided by the statements of problem and
purpose. Houck and Hampson were fortunate. When their article was written,
there were relatively few experienced peer reviewers in their discipline
(emergency medicine). This may well have helped the authors' efforts to gain
publication.
More commonly, deficiencies in the wording of these statements and their
connection to the remainder of the paper can be a primary cause of a
manuscript being rejected for publication, or being sent back to the author
for revision, following the peer-review process. The following criticisms,
made by a reviewer in response to two different submissions to a cardiology
journal, are illustrative of this point:
• Submission #1: Comment: The focus of the study is not clearly apparent,
  even from the last paragraph which specifically describes the goals. The
  first page does not point directly to the study hypothesis.
  (Recommendation: Consider after revision)
• Submission #2: Comment: In its current form, the manuscript resembles a
  mystery story with a good outcome more than a scientific study. Thus,
  while indicating the general aim of the authors, the Introduction
  misstates the specific goals required by the apparent design of the
  reported work, thus misfocusing the reader. (Recommendation: Consider
  after revision)
In sum, all research (whether basic or applied, quantitative or qualitative,
hypothesis generating or hypothesis testing, retrospective or prospective,
observational or experimental) may be considered as a response to a problem
(an ambiguity, gap in knowledge, or other perplexing state) that requires
resolution. In thinking through the problem and communicating it to others,
the investigator must provide a clear and convincing argument that indicates
why the problem must be addressed (the problem statement), articulate a
solution to the problem to clarify the ambiguity or fill the gap in
knowledge (the purpose statement or research questions), and tie these
statements to the methods used. The challenge to the investigator is to
define and interrelate these elements well enough to justify the research
study and maximize the likelihood that the findings will be understood,
appreciated, and utilized.

Take-Home Points

• A well-designed research project, in any discipline, begins with
  conceptualizing the
• Research problems in clinical medicine may be stimulated by practical
  issues in the clinical care of patients, new or unexpected observations,
  discrepancies and knowledge gaps in the published literature,
  solicitations from government or other funding sources, and public forums
  such as scientific sessions, grand rounds, and seminars.
• Well-conceived research problems are important, interesting, feasible, and
  ethical and serve as a springboard for clearly focused questions.
• Research questions most relevant to clinicians include those pertaining to
  disease prevalence/incidence, prevention, detection (diagnosis or
  screening), etiology, prognosis, and outcomes of treatment (benefit or
  harm).
• A comprehensive literature search, conducted early in the planning
  process, can help to determine whether the proposed study is feasible,
  whether it is likely to substantively contribute to the existing knowledge
  base, and whether it can provide guidance in the construction of
  hypotheses, determination of sample size, and choice of study design.
• Proper framing of the problem and purpose statements is essential for
  communicating and justifying the research.
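
The search recommendations given earlier in this chapter (framing a specific
question, combining significant words with database subject headings, and
fixing the temporal period of interest) can be illustrated with a short
sketch. The Python example below assembles a PubMed query for the prognosis
question about aortic stenosis posed earlier and builds a request address
for the esearch service of the NCBI E-utilities. The helper names, the MeSH
heading chosen ("Aortic Valve Stenosis"), the free-text word, and the date
range are illustrative assumptions rather than a strategy recommended by the
authors; the example only constructs the address and does not contact the
server.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(mesh_terms, text_words, start_year, end_year):
    """Combine subject headings, free-text words, and a date range.

    Hypothetical helper written for this illustration.
    """
    # [mh] limits a term to MeSH headings; [tiab] to title/abstract words.
    parts = [f'"{term}"[mh]' for term in mesh_terms]
    parts += [f'"{word}"[tiab]' for word in text_words]
    # [dp] limits by publication date; a colon separates the range bounds.
    parts.append(f"{start_year}:{end_year}[dp]")
    return " AND ".join(parts)


def build_esearch_url(query, max_results=20):
    """Return an esearch request address for PubMed (hypothetical helper)."""
    params = {
        "db": "pubmed",
        "term": query,
        "retmax": max_results,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Illustrative values only: heading, text word, and years are assumptions.
query = build_pubmed_query(["Aortic Valve Stenosis"], ["prognosis"], 2002, 2012)
url = build_esearch_url(query)
print(query)
print(url)
```

In this sketch, narrowing a voluminous topic corresponds to passing
additional terms (each is joined with AND), whereas broadening a sparse one
corresponds to removing terms or widening the date range.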

References

1. Kerlinger F. Foundations of behavioral research: educational and
   psychological inquiry. New York: Holt, Rinehart and Winston; 1964.
2. Eng J. Getting started in radiology research: asking the right question
   and identifying an appropriate population: critical thinking skills
   symposium. Acad Radiol. 2004;11:149–54.
3. Albrecht MN, Perry KM. Home health care: delineation of research
   priorities and formation of a national network group. Clin Nurs Res.
   1992;1:305–11.
4. Davidson P, Merritt-Gray M, Buchanan J, Noel J. Voices from practice:
   mental health nurses identify research priorities. Arch Psychiatr Nurs.
   1997;11:340–5.
5. Gallagher M, Hares T, Spencer J, Bradshaw C, Webb I. The nominal group
   technique: a research tool for general practice? Fam Pract.
   1993;10:76–81.
   research-worthy problem. Inform Sci: Int J Emerg Transdiscipl.
   2008;11:17–33.
14. Leedy PD, Ormrod JE. Practical research: planning and design. 8th ed.
    Upper Saddle River: Prentice Hall; 2005.
15. Kerlinger FN, Lee HB. Foundations of behavioral research. 4th ed. Holt:
    Harcourt College; 2000.
16. Sim J, Wright C. Research in health care. Concepts, designs and methods.
    Cheltenham: Nelson Thornes; 2000.
17. Leedy PD. Practical research planning and design. 2nd ed. New York:
    MacMillan; 1980.
18. Sharav VH. The ethics of conducting psychosis-inducing experiments.
    Account Res. 1999;7:137–67.
19. Trochim WM. The research methods knowledge base. 2nd ed. Internet WWW
    page at URL: www. Version current as of 20 Oct 2006.
20. Blaikie NWH. Designing social research: the logic of anticipation.
    Malden: Blackwell; 2000.
3 The Research Hypothesis: Role and Construction
Phyllis G. Supino

Wrong hypotheses, rightly worked from, have produced more results than
unguided observation.
Augustus De Morgan, 1872 [1]

P.G. Supino, EdD
Department of Medicine, College of Medicine,
SUNY Downstate Medical Center, 450 Clarkson Avenue,
Box 1199, Brooklyn, NY 11203, USA
e-mail:

Overview

Once a problem has been defined, the investigator can formulate a hypothesis
(or set of hypotheses, if there are multiple subproblems) about the outcome
of the study designed to resolve the problem. A hypothesis (from the Greek,
foundation) is a logical construct, interposed between a problem and its
solution, which represents a proposed answer to a research question. It
gives direction to the investigator's thinking about the problem and,
therefore, facilitates a solution. Unlike facts and assumptions (presumed
true and, therefore, not tested in the study) or theory (a relatively
well-supported unifying system explicating a broad spectrum of observations
and inferences, including previously tested hypotheses), the research
hypothesis is a reasoned but tentative proposition typically expressing a
relation between variables. For it to be useful and, more importantly,
assessable, it must generate predictions that can be tested by subsequent
acquisition, analysis, and interpretation of data (i.e., through formal
observation or experimentation). When the results of the study are as
predicted, the hypothesis is supported. As noted below, such support does
not necessarily indicate verification of the hypothesis. Consistent
replication of predictions in subsequent studies may be needed if the
hypothesis is to be accepted as a theory or a component of a theory. If
results are not as predicted, the hypothesis is rejected (or, at minimum,
revised or removed from active consideration until future developments in
science and/or technology provide new tools for retesting). As Leedy has
stated, a hypothesis is to a researcher what a point of triangulation is to
a surveyor: it provides a position from which he may orient his exploration
into the unknown and a checkpoint against which to test his findings [2].
The paramount role of the hypothesis for guiding biomedical investigations
was first highlighted by the eminent physiologist Claude Bernard (1813–1878)
[3]. In the current era, hypotheses are considered fundamental to rigorous
research, and biomedical studies without hypotheses have been largely
abandoned in favor of those designed to generate or test them [4].

Hypotheses Versus Assumptions

It is important to recognize the difference between a hypothesis and an
assumption. These terms share the same etymological root and are often
confused. An assumption is accepted as fact in designing or justifying a
study (though it is likely to have been the subject of previous research).

Thus, the investigator does not set out to test it. Examples of assumptions include:
• Radionuclide cineangiography measures ventricular performance.
• Chest x-rays measure the extent of lung infiltrates.
• The SF-36 measures general health-related quality of life.
• Medical education improves knowledge of clinical medicine.
• "An apple a day keeps the doctor away" (the most famous [albeit untested] assumption of
  them all).

In contrast, the hypothesis is an expectation that an investigator will attempt to confirm
through observation or experiment. Examples in clinical medicine include:
• Among patients with chronic nonischemic mitral regurgitation (insufficiency), survival
  will be better among those whose valves have been repaired or replaced than among those
  who have been maintained on medical therapy.
• Among patients hospitalized with community-acquired pneumonia, posthospital course will
  be better among those with a low-risk profile than among those with a high-risk profile
  before hospitalization.
• Life expectancy will be greater among individuals consuming low-calorie diets than among
  those consuming high-calorie diets.
• Health-related quality of life is better among those whose mitral valves have been
  repaired than among those whose mitral valves have been replaced.

Hypothesis Generation: Modes of Inference

There is a paucity of empirical data regarding the way (or ways) in which hypotheses are
formulated by scientists and even less information about whether these methods vary across
disciplines. Nonetheless, philosophers and research methodologists have suggested three
fundamentally different modes of inference: deduction, induction, and abduction [5]. These
differ primarily according to (1) whether the origin of the hypothesis is a body of
knowledge or theory (the rationalist perspective), an empirical event (the inductivist
perspective), or some combination of the two (the abductivist perspective); (2) the logical
structure of the argument; and (3) the probability of a correct conclusion.

Hypothesis by Deduction

Deduction (from the Latin de [out of] and ducere [to draw or lead]) is one of the oldest
forms of logical argument. It was introduced by the ancient Greeks who believed that
acquisition of scientific knowledge (insight into the principles and character of natural
substances and their causes) could be achieved largely by the same logical processes used
to prove the validity of mathematical propositions [6]. Today, deduction remains the
predominant mode of formal inference in research in mathematics and in the fundamental
sciences, but it also plays an important role in the empirical sciences. A deductively
derived hypothesis arises directly from logical analysis of a theoretical framework,
previously developed to provide an explanation of events or phenomena. It is considered to
be nonampliative because, while it helps to provide proof of principle, it adds nothing new
beyond the theory. The validity of a theory can never be directly examined. Therefore,
scientists wishing to evaluate it, or to test its utility within a given (perhaps new)
context, will formulate a conjecture (hypothesis) that can be subjected to empirical
appraisal. In forming a hypothesis by deduction, the investigator typically moves from a
general proposition to a more specific case that is thought to be subsumed by the
generalization (i.e., from theory to a conceptual hypothesis or from a conceptual
hypothesis to a precise prediction based on the hypothesis). Deductive arguments can be
conditional or syllogistic (e.g., categorical [all, some, or none], disjunctive [or], or
linear [including a quantitative or qualitative
comparison]) and contain at least two premises (statements of evidence) and a conclusion.
A well-known categorical syllogism and example are given below:
  All A's are B (e.g., All men are mortal)
  C is an A (e.g., Socrates is a man)
  ∴ C is a B (e.g., Socrates is mortal)
If the premises of a deductive argument are true and the reasoning used to reach the
conclusion is valid (i.e., the form of the argument is correct), it will necessarily follow
that the conclusion is sound (i.e., the premises, if true, guarantee the conclusion). If
the form of the deductive argument is invalid (i.e., the premises are such that they do not
lead to the conclusion: e.g., Socrates is mortal, all cats are mortal, ∴ Socrates is a cat)
and/or the premises are untrue (e.g., all mortals are men [or cats]), the conclusion will
be unsound. It should be noted that deductive reasoning is the only form of logical
argument to which the term "validity" is appropriate.

The theory from which the hypothesis is derived can be specific to the discipline or it can
be borrowed from another discipline. Polit and Beck [7] provide two examples of deductively
formulated hypotheses, germane to nursing, derived from general reinforcement theory which
posits that behaviors that are rewarded tend to be learned or repeated:
1. Nursing home residents who are praised (reinforced) by nursing personnel for
   self-feeding require less assistance in feeding than those who are not praised.
2. Pediatric patients who are given a reward (e.g., a balloon or permission to watch
   television when they cooperate during nursing procedures) tend to be more cooperative
   during those procedures than unrewarded peers.
Deduction also is used to translate broad hypotheses such as these to more specific
operational hypotheses (i.e., working hypotheses or predictions) that can be directly
tested by observation or experiment. When empirical support is obtained for a hypothesis,
this, in turn, strengthens the theory or body of knowledge from which the hypothesis was
deduced.

Hypothesis by Induction

Not all hypotheses are derived from theory. Frequently, in the empirical sciences,
patterns, trends, and associations are observed serendipitously in clinical settings or in
preclinical laboratories or, purposively, through exploratory data analysis or other
hypothesis-generating research. Sometimes, they may result from specific findings gleaned
from the research literature. These observations may be generalized to produce inductively
derived hypotheses that may serve as the basis for predicting future events, phenomena, or
patterns. Induction (from the Latin in [meaning into] and ducere [to draw or to lead]) is
defined by Jenicek and Hitchcock as any method of logical analysis that proceeds from the
particular to the general [8] and represents the logical opposite of deduction which, as
noted above, typically proceeds from the general to the specific. Induction can be used not
only to formulate hypotheses but to confirm or refute them, which may be its most
appropriate use, as noted below (see Abduction). Inductive reasoning, which is based
heavily on the senses rather than on intellectual reflection, was popularized by the
English philosopher and scientist, Sir Francis Bacon (1561–1626) [9], who proposed it as
the logic of scientific discovery, a position that, subsequently, has been vigorously
disputed by the Austrian logician, Sir Karl Popper (1902–1994) [10] and other philosophers
of science. There are various forms of inductive inference. One of the most common is
enumerative induction (or inductive generalization). Jenicek and Hitchcock [8] describe it
as a mode by which one concludes that all cases of a specified kind have a specified
property on the basis of observation that all examined cases of that kind have the property
[8]. It is called "enumerative" because it itemizes cases in which some pattern is found
and, solely for this reason (i.e., without the benefit of a theoretical framework),
forecasts its recurrence. Other forms of induction include argument from analogy (forming
inferences based on a shared property or properties of individual cases) or prediction
(drawing conclusions about future cases from a current sample), causal inference
(concluding that association implies causality), and Bayesian inference (given new evidence
[data], using probability theory [Bayes' theorem] to alter belief in a hypothesis).

All inductive arguments contain multiple premises that provide grounds for a conclusion but
do not necessitate it (in contrast to a deductive argument where the premises, if true,
entail the conclusion). In other words, a conclusion drawn from an inductive argument is
probable (at best), even if its premises are correct. For this reason, all inductive
arguments, while ampliative, are considered to be logically invalid and are judged,
instead, according to their strength (i.e., whether they are inductively strong or
inductively weak). The strength of an inductive generalization is determined by the number
of observations supporting it and the extent to which the observations reflect all
observations that could be made. The more (consistent) observations that exist, the more
likely the conclusion is correct (inconsistent observations, of course, reduce the
argument's inductive strength). The typical form of an inductive generalization is given
below:
  A1 is a B
  A2 is a B
  (All A's I have observed are B's)
  ∴ All A's are B's
Like deductive arguments, inductive generalizations can be categorical, that is, represent
conclusions about all (as above), no, or some members of a class, or they may involve
quantitative arguments, for example, "50% of all coins I have sampled are quarters;
therefore, 50% of all coins coming from the same lot that I have sampled probably are
quarters" (or, as a clinical example, "30% of the patients I have examined are obese;
therefore, 30% of patients sampled from the same population as those whom I have examined
probably are obese").

Not all inductive hypotheses used by scientists have been formulated by scientists; some,
in fact, owe their origin to folklore. For example, by the late eighteenth century, it was
common knowledge among English farm workers that when humans were exposed to cows infected
with cowpox (vaccinia), they became immune to its more severe human analogue, smallpox. The
English surgeon, Edward Jenner (1749–1823), used this hypothesis as the basis of a series
of scientific experiments, using exudates from an infected milkmaid, to develop and
formally test a vaccine against this disease [11]. He became famous for using vaccination
as a method for preventing infection, though there is growing recognition that the first
successful inoculations against smallpox actually were performed by a farmer, Benjamin
Jesty, some 20 years earlier, who vaccinated his family using cowpox taken directly from a
local cow [12]. It also has been claimed that Charles Darwin used inductive reasoning when
generalizing about the shapes of the beaks of finches from the various Galapagos Islands
[13] and when forming conjectures from observations based on the breeding of dogs, pigeons,
and farm animals at home (inferences that formed underpinnings of his theory of evolution)
and that Gregor Mendel used the same form of reasoning to conceptualize his law of
hybridization [14]. Even if these claims are true (and there is far from universal
agreement on this matter), inductive generalizations typically are regarded as inferior to
hypothesis-generating methods that involve more theoretical reasoning, that consider
variations in circumstances (i.e., possible confounding factors) that may account for
spurious patterns, and that provide possible causal explanation for observed phenomena.
Moreover, recent research in cognition and the relatively new field of neural modeling
suggest that simple induction across a limited set of observations may have a far smaller
role in scientific reasoning than previously realized [15].

Hypothesis by Abduction

Of the three primary methods of reasoning, the one that has been most implicated in the
creation of novel ideas, including scientific discoveries, is the logical process of
abduction (from the Latin ab [meaning away from] and ducere [to draw or to lead]). It also
is the most common mode of reasoning employed by clinicians when making
diagnostic inferences. Abduction was introduced into modern logic by American philosopher
and mathematician, Charles Sanders Peirce (1839–1914) [16], and remains an important,
albeit controversial, topic of research among philosophers of science and students of
artificial intelligence. It refers to the process of formulation and acceptance on
probation of a hypothesis to explain a surprising observation. Thus, hypotheses formed by
abduction (unlike those formed by induction) are always explanatory. (The reader should
note that other synonyms for, and definitions of, abduction exist, e.g., retroduction,
reduction, inference to the best explanation, etc., the latter reflecting the evaluative
and selective functions that also have been associated with this term.)

Abductive reasoning entails moving from a consequent (the observation or current fact) to
its antecedent (presumed cause or precondition) through a general rule. It is considered
"backward" because the inference about the antecedent is drawn from the consequent.

Peirce devoted his earliest work (before 1900), as did Aristotle long before him, to
furthering the development of syllogistic theory to express logical relations. During this
early period, abduction (then termed by him "hypothesis") was taken to mean the use of a
known rule to explain an observation (result); accordingly, his initial efforts were
devoted to demonstrating how the hypothesis relates to the premises of the argument and how
it differs from the logical structure of other forms of reasoning (i.e., deduction or
induction). In his essay, "Deduction, Induction, and Hypothesis," Peirce presents an
abductive syllogism:
  Rule: All the beans from this bag are white.
  Result: These beans are white.
  Case: These beans are from this bag. [16]
In this argument, the rule and result represent the premises (background knowledge and
observation, respectively [the order is arbitrary]) and the case represents the conclusion
(here, the hypothesis). Had this argument been expressed deductively, the case would have
been the second premise, and the result, the conclusion (i.e., all the beans from this bag
are white, these beans are from this bag; therefore, these beans are white). It should be
obvious to the reader that the abductive argument is logically less secure than a deductive
argument (or even an inductive argument). It represents a possible conclusion only (after
all, the beans might come from some other bag, or from no bag at all). Therefore, like an
inductive argument, it is ampliative though logically invalid. Its strength is based on how
well the argument accounts for all available evidence, including that which is seemingly
contradictory.

As Peirce's work evolved, he shifted his efforts to developing a theory of inferential
reasoning in which abduction was taken to mean the generation of new rules to explain new
observations. In so doing, he focused on what some have termed the creative character of
abduction [17]. Peirce argued that abduction had a major role in the process of scientific
inquiry and, indeed, was the only inferential process by which new knowledge was created, a
view that was, and continues to be, hotly debated by the philosophical community. In his
later work, Peirce described the logical structure of abduction as follows:
  The surprising fact, C, is observed.
  But if A were true, C would be a matter of course.
  Hence, there is reason to suspect that A is true. [18]
The surprise (the stimulus to the abductive inference) arises because the observation is
viewed, at that moment in time, as an anomaly, given the observer's preexisting corpus of
knowledge (theory base) which cannot account for it. The lack of compatibility between the
observation and expectation introduces a type of cognitive dissonance that seeks resolution
through the adoption of a coherent explanation. In Peirce's opinion, the explanation might
be nothing more than a guess (Peirce believed that humans were "hardwired" with the ability
for guessing correctly) that, unlike an inductive generalization, enters the mind "like a
flash" [18] or, as it is commonly termed, as a "eureka moment" or an "ah ha!" experience.
Because a guess (insightful or not), by its very nature, is speculative (and, as noted
above, is a relatively insecure form of reasoning), Peirce recognized that an abductive
hypothesis must be rigorously tested before it could be admitted into scientific theory.
This, he
reasoned, is accomplished by using deduction to explicate the consequences of the
hypothesis (i.e., the predictions) and induction to form a conclusion about the likelihood
of their truthfulness, based on experimental verification. According to Peirce, these are
the primary roles of deduction and induction in the scientific process. Figure 3.1
illustrates the Peircian view of the relation between abduction, deduction, and induction
as interpreted by Flach and Kakas [19].

Fig. 3.1 The three stages of scientific inquiry (From Abduction and Induction. Essays on
their Relation and Integration, Flach PA and Kakas AC. Abductive and Inductive Reasoning:
Background and Issues, Chap. 1, pp. 1–27, Copyright 2000, with permission from Kluwer
Academic Publishers)

Countless abductively derived hypotheses, principles, theories, and laws have been put
forward in science. Many, if not most, owe their origin to the serendipitous consequences
of an unexpected observation made while looking for something else [20]. Well-known
examples of such "happy accidents" include:
• Archimedes' principles of density and buoyancy
• Hans Christian Oersted's theory of electromagnetism
• Luigi Galvani's principle of bioelectricity
• Claude Bernard's neuroregulatory principle of circulation
• Paul Gross' protease-antiprotease hypothesis of pulmonary emphysema

Although, as Peirce points out, all three modes of inference (abduction, deduction, and
induction) are used in the process of scientific inquiry, each requires different skills.
As scholars have noted, deduction requires the capacity to reason logically and inductive
reasoning requires understanding of the statistical implications of drawing conclusions
from samples to populations. In contrast, as Danmark et al. have noted, abduction requires
the discernment of new relations and connections not immediately obvious [21]; in other
words, it requires one to think outside the box. For this reason, the best abductive
hypotheses in science have been made by those who not only are observant, wise, and well
grounded in their disciplines but who also are imaginative and receptive to new ideas. This
view was, perhaps, best expressed by Louis Pasteur (1822–1895) when he argued, "In the
fields of observation, chance favors only prepared minds" [22]. Accordingly, developing
"the prepared mind," in general, and enhancing the capacity to reason abductively,
deductively, and inductively, in particular, should be among the most important goals of
those seeking to effectively engage in the process of scientific discovery.
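
The contrast among the three modes of inference can be made concrete with Peirce's bean
example. What follows is only an illustrative sketch in Python, not a formalization
proposed in this chapter: the function names (deduce_result, induce_rule, abduce_case,
bayes_update) are hypothetical, and the probabilities used in the demonstration are
assumed values chosen for easy arithmetic. Deduction returns a guaranteed result from rule
and case; enumerative induction returns a tentative rule together with the number of
supporting observations; abduction returns only a candidate explanation to be held on
probation; and the Bayesian update shows one way new evidence may alter belief in such a
hypothesis.

```python
from typing import List, Optional, Tuple


def deduce_result(rule_all_white: bool, from_this_bag: bool) -> Optional[str]:
    """Deduction: rule + case -> result.

    If both premises are true, the conclusion is guaranteed.
    """
    if rule_all_white and from_this_bag:
        return "white"
    return None  # the premises entail nothing about the color


def induce_rule(colors_from_bag: List[str]) -> Tuple[bool, int]:
    """Enumerative induction: cases + results -> tentative rule.

    Returns whether every examined bean was white and how many beans
    were examined. More consistent observations make the rule more
    probable, but never certain.
    """
    all_white = bool(colors_from_bag) and all(c == "white" for c in colors_from_bag)
    return all_white, len(colors_from_bag)


def abduce_case(rule_all_white: bool, observed_color: str) -> bool:
    """Abduction: rule + result -> candidate case.

    True means "these beans may be from this bag" is worth adopting on
    probation; it is a conjecture, not a proven conclusion.
    """
    return rule_all_white and observed_color == "white"


def bayes_update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Bayes' theorem: revise belief in hypothesis H after evidence E."""
    numerator = p_e_given_h * prior
    return numerator / (numerator + p_e_given_not_h * (1.0 - prior))


if __name__ == "__main__":
    print(deduce_result(True, True))
    print(induce_rule(["white"] * 20))
    print(abduce_case(True, "white"))
    # Assumed numbers: prior belief 0.5 that the beans are from this bag;
    # white beans are certain from this bag and have probability 0.5
    # from any other source.
    print(bayes_update(0.5, 1.0, 0.5))
```

Note that only deduce_result yields a conclusion that cannot be wrong when its premises are
true. The other functions return conclusions that remain open to refutation, which is why
an abductive or inductive hypothesis still has to be tested.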

Characteristics of the Research Hypothesis

Irrespective of how it is formulated (or the problem or discipline for which it is
formulated), a research hypothesis should fulfill the following five requirements:

1. It should reflect an inference about variables.
The purpose of any hypothesis is to make an inference about one or more variables. The
inference can be as simple as predicting a single characteristic in a population (e.g.,
mean height, prevalence of lung cancer, incidence of acute myocardial infarction, or other
population parameter) or, more commonly, it represents a supposition about an association
between two or more variables (e.g., smoking and lung cancer, diet and hypertension, age
and exercise tolerance, etc.). It is, therefore, important for the investigator to
understand what is meant by a variable and how it functions in the setting of a hypothesis.

In its broadest sense, a variable is any feature, attribute, or characteristic that can
assume different values (levels) within the same individual at different points in time or
can differ from one member of the study population to another. Typical variables of
interest to biomedical researchers include subject profile characteristics (e.g., age,
weight, gender, etiology, stage of disease), nature, place, duration of naturally occurring
exposures (e.g., risk factors, environmental influences) or purposively applied
interventions, and subject outcomes or responses (e.g., morbidity, mortality, symptom
relief, physiological, behavioral, or attitudinal changes) among others.

It is important to recognize that a characteristic that functions as a variable in one
study does not necessarily serve as a variable in another. For example, if an investigator
wished to determine the relation of gender to prevalence of diabetes, it would be necessary
to study this problem in a group comprising males and females, some with and some without
this disease. Because intersubject differences exist for both characteristics, gender and
diabetes would be considered study variables, and a hypothesis could be constructed about
their association. However, if all patients in a study group were women with diabetes, no
hypothesis could be developed about the relation between gender and diabetes since these
attributes would be invariable. (Fuller discussion of nature and role of variables, and
their relation to the hypothesis, is presented later in this chapter.)

2. It should be stated as a grammatically complete, declarative sentence.
A hypothesis should contain, at minimum, a subject and predicate (the verb or verb phrase
and other parts of the predicate modifying the verb). The statements "relaxation (subject)
decreases (verb) blood pressure (object, or predicate noun)," "depression (subject)
increases (verb) the rate of suicide (predicate)," and "consumption of diet cola (subject)
is related to (verb phrase) body weight (object, or predicate noun)" are illustrative of
hypotheses that meet this requirement. In these examples, the subject and predicate
modifiers reflect the variables to be related, and the verb (or verb phrase) defines the
nature of the expected association.

3. It should be expressed simply and unambiguously.
For a hypothesis to be of value in a study, it must be clear in meaning, contain only one
answer to any one question, and reflect only the essential elements of solution. The reason
is that the hypothesis guides all subsequent research activities, including selection of
the population and measurement instruments, collection and analysis of data, and
interpretation of results. For example, the hypothesis "right ventricular performance is
the best predictor of survival among patients with valvular heart disease, but is less
important in others" would be difficult to validate. First, what is meant by "right
ventricular performance"? Does this refer to ejection fraction at rest, at exercise, or the
change from rest to exercise, or to some other parameter? Second, what is the meaning of
"best"? Does it signify ease of measurement or does it relate to the strength of statistical
association? Third, to what is right ventricular performance compared? Is the contrast
between right ventricular performance and clinical descriptors, anatomic descriptors, other
functional descriptors, or between all of these? Fourth, what type of valvular heart
disease is being studied? Is it regurgitant, stenotic, or both? Does it involve the mitral,
aortic, or some other heart valve? Finally, what is meant by "less important"? Who (or
what) are the "others"? As is true for the research problem, the clearer and less complex
the statement of the hypothesis, the more straightforward the study and the more useful the
findings.

4. It should provide an adequate answer to the research problem.
For a hypothesis to be adequate, it must address, in a satisfactory manner, both the
content and scope of the central question; that is, whether the problem is narrow or broad,
simple or complex, evaluation of the hypothesis(es) should result in the full resolution of
the research problem. For this reason, it is recommended that the investigator formulate at
least one hypothesis for every subproblem articulated in the study. Equally important, a
hypothesis must be plausible; for this condition to be satisfied, the hypothesis should be
based on prior relevant observation and experience, buttressed by consideration of existing
theory, and should reflect sound reasoning and knowledge of the problem at hand. In
contrast, speculations which have neither empirical support nor a legitimate theoretical
basis, even if interesting, constitute poor hypotheses and typically yield weak or
uninterpretable study outcomes. Finally, if the hypothesis is explanatory in nature (rather
than an inductive generalization), all else being equal, it should represent the simplest
of all possible competing explanations for the phenomenon or data at hand [23], a principle
known as Occam's razor or entia non sunt multiplicanda praeter necessitatem (Latin for
"entities must not be multiplied beyond necessity").

5. It should be testable.
A hypothesis must be stated in such a way as to allow for its examination which, in the
biomedical and other empirical sciences, is achieved through the acts of observation or
experimentation, analysis, and judicious interpretation. If one or more of the elements
comprising the hypothesis is not present in the population or sample, or if a phenomenon or
characteristic contained within the hypothesis is highly subjective or otherwise difficult
to measure, the hypothesis cannot be properly evaluated. For example, the statement "female
patients cope better with stress than male patients" would be a poor hypothesis if the
investigator did not have access to both male and female patients or was unable to generate
acceptable definitions and measures to evaluate "coping" and "stress." An even more
egregious example is the hypothesis "prognosis following diagnosis of ovarian cancer is
related to the patient's survival instinct," as it would be extremely difficult to develop
empirical data in support of a "survival instinct" (assuming it did exist).

For many years, philosophers of science have argued about what constitutes evidence in
science or support for a scientific hypothesis. By the mid-twentieth century, the tenets of
logical positivism (or logical empiricism) dominated the philosophy of science in the
United States as well as throughout the English-speaking world [24], replacing the
Cartesian emphasis on rationalism as a primary epistemological tool. Strongly eschewing
metaphysical and theological explanations of reality, the logical positivists argued that a
proposition held meaning only if it could be verified (i.e., if its truth could be
determined conclusively by observable facts). Early critics of logical positivism, most
notable among them Karl Popper, believed that verifiability was too stringent a criterion
for scientific discovery. This, he argued, was due to the logical limitations inherent in
inductive reasoning, namely, the deductive invalidity of forming a generalization based on
the observation of particulars, and the attendant uncertainty of such an inference. Thus,
while both positive existential claims (e.g., there is at least one white swan) and
negative universal claims
(e.g., not all swans are white) could be confirmed by finding, respectively, at least one
white swan or one black swan, it would be impossible to verify a positive universal claim
(e.g., all swans are white). To accomplish that, one would have to observe every swan in
existence, at all times and in all places, or risk being wrong.

According to Popper, the hallmark of a testable claim is its capacity to be falsified [25].
In his view, falsification (not verification) is the criterion for demarcation between
those hypotheses, theories, and other propositions that are scientific versus those that
are not scientific. This, of course, did not mean that a scientific hypothesis or theory
must be false; rather, if it were false, it could be shown to be so. Returning to our
earlier example, all that would be required to disprove the claim "all swans are white" is
to find a swan that is not white. Indeed, this inductive inference, based on the
observation of millions of white swans in Europe, was shown to be false when black swans
were discovered in Western Australia in the eighteenth century [26], an event that was not
unnoticed by Popper. It provided clear support for his assertion that no matter how many
observations are made that appear to confirm a proposition, there is always the possibility
that an event not yet seen could refute it. Similarly, any scientific hypothesis, theory,
or law could be falsified by finding a single counterexample.

Popper's greatest contribution to science was his characterization of scientific inquiry,
based on a cyclical system of conjectures and refutations (a form of critical rationalism)
widely known as the hypothetico-deductive method [27]. A schematic of Popper's view of this
method is shown in Fig. 3.2. Consistent with Popper's writing on the subject, the terms
"hypothesis" and "theory" are used interchangeably as both are viewed as tentative, though
most workers in the field currently reserve the latter term for hypotheses (or related
systems of hypotheses) that have received consistent and long-standing empirical support.

Fig. 3.2 The hypothetico-deductive model: Popper's view of the role of falsification in
scientific reasoning

The reader will note that the hypothetico-deductive method begins with an early postulation
of a hypothesis. The investigator then uses deductive logic to form predictions from the
hypothesis that should be true if the hypothesis is, in fact, correct. The nature of the
predictions can vary from study to study, but they share the common attribute of being
unknown before data collection. The predictions are then evaluated by formal
experimentation or observation. Assuming a properly designed study, those predictions that
are discordant with data falsify the hypothesis, which is then discarded or revised,
leading to additional study. Although a hypothesis can never
be shown to be true via collection of compatible information (as Popper
noted, a subsequent demonstration of counterfactual data can overturn any
hypothesis), the extent to which it survives repeated attempts at
falsification provides support (corroboration) for its validity. As a
result, testing of a hypothesis serves to advance the existing theory base
and body of knowledge. Popper argued that the hypothetico-deductive method
was the only sound approach to scientific reasoning; moreover, in his
opinion, it was the only method by which science made any progress.
Although Popper did not originate the hypothetico-deductive method, he was
the first to explicate the central role of falsification versus confirmation
of a hypothesis in the developing science. While his arguments have been
criticized by other philosophers of science who assert that scientists do
not necessarily reason that way [28], his views remain prominent in modern
philosophy and continue to appeal to many modern scientists [29]. Today, the
Popperian view of the hypothetico-deductive method, with its emphasis on
testing to falsify a proposed hypothesis, generally is taken to represent an
ideal (if not universal) approach to curbing excessive inductive speculation
and ensuring scientific objectivity, and is considered to be the primary
methodology by which biological knowledge is acquired and disseminated [30].

Types of Hypotheses

Hypotheses can be classified in several ways, as shown below.
1. Conceptual Versus Operational Hypotheses
Hypotheses can vary according to their degree of specificity or precision
and theoretical relatedness. Hypotheses can be written as broad or general
statements, in which case they are termed conceptual hypotheses. For
example, an investigator may hypothesize that a high-fat diet is related to
severity of coronary artery disease or another may conjecture that
depression is associated with a relatively high incidence of morbid events.
Although these may be important hypotheses, these statements cannot be
directly tested as they are fundamentally abstract. What do the
investigators mean by "high fat," "depression," "severity of coronary artery
disease," "relatively high," or "morbid events"? How will these terms be
evaluated?
To render conceptual hypotheses testable, they must be recast as more
specific statements with elements (variables) that are precisely defined
according to explicit observable or measurable criteria. Hypotheses of this
type are referred to as operational hypotheses or, alternatively, specific
hypotheses or predictions and represent the specific (observable)
manifestation of the conceptual hypothesis that the study is designed to
test. Once the study is designed, data will be collected and analyzed to
determine whether they are concordant or discordant with the operational
hypothesis which, ultimately, will be reinterpreted in terms of its broader
meaning as a conceptual hypothesis. Figure 3.3 below illustrates a
simplified version of the hypothetico-deductive method, as conceptualized by
Kleinbaum, Kupper, and Morgenstern [31] depicting the relation of conceptual
and operational hypotheses to the design and interpretation of the study.
Construction of operational hypotheses represents an important preliminary
step in the development of the research design, data collection strategy,
and statistical analysis plan and is described in greater detail in
subsequent sections of this chapter.
2. Single Variable Versus Multiple Variable Hypotheses
Some investigations are undertaken to determine whether a mean, proportion,
or other parameter from a sample varies from a specified value. For example,
a group of obstetricians may have read a report that concludes that,
throughout the nation, the average length of stay following uncomplicated
caesarian section is 5 days. They may have reason to believe that the length
of stay for similar patients at their institution differs from the national
average and would like to know if
their belief is correct. To study the question, they must first recast their
question as a hypothesis including the stipulated variable, select a
representative sample of patients from their institution, and compare data
from their sample with the national average (stipulated value) using an
appropriate one-sample statistical test. (The reader should note that the
only variable being tested within this hypothesis is length of stay. In this
case, caesarian section is only a descriptor of the target population
because all data to be examined are from patients undergoing this
procedure.)

Fig. 3.3 Interrelation of conceptual hypotheses, operational hypotheses, and
the hypothetico-deductive method (Reprinted with permission Kleinbaum DG,
Kupper LL, Morgenstern H. Epidemiologic Research: Principles and
Quantitative Methods, Fig. 2.2: An Idealized Conceptualization of the
Scientific Method (New York: Van Nostrand Reinhold 1982), p. 35)

However, the objective of most hypotheses is not to draw inferences about
population parameters but to facilitate evaluation of a proposition that two
or more variables are systematically related in some manner [32].
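
The one-sample comparison described for the obstetricians (institutional
length of stay versus a stipulated national average of 5 days) can be
sketched as follows. This is only an illustration: the ten lengths of stay
are invented, the function name one_sample_t is hypothetical, and a
one-sample t-test is assumed to be the "appropriate one-sample statistical
test"; the chapter does not name a specific test. The critical value 2.262
is the standard two-sided 5% cutoff of the t distribution with 9 degrees of
freedom.

```python
import math
import statistics


def one_sample_t(sample, stipulated_value):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # sample SD uses n - 1
    return (mean - stipulated_value) / std_err, n - 1


# Hypothetical lengths of stay (days) for 10 patients at one institution.
stays = [4, 5, 3, 4, 6, 4, 3, 5, 4, 4]
NATIONAL_AVERAGE = 5.0   # stipulated value taken from the example in the text
T_CRITICAL_DF9 = 2.262   # two-sided critical value, alpha = 0.05, df = 9

t_stat, df = one_sample_t(stays, NATIONAL_AVERAGE)
differs = abs(t_stat) > T_CRITICAL_DF9
print(f"t = {t_stat:.2f} on {df} df; differs from national average: {differs}")
```

Only one variable (length of stay) enters the calculation; the stipulated
value is a constant, which is what distinguishes this single-variable
hypothesis from the bivariable and multivariable forms.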

Indeed, some methodologists recognize only the latter form of argument as a
legitimate hypothesis [7, 33–35]. The simplest hypotheses about
intervariable association contain two variables (bivariable hypotheses), for
example:
• Caffeine consumption is more frequent among smokers than nonsmokers.
• Women have a higher fat-to-muscle ratio than men.
• Heart attacks are more common in winter than in other seasons.
If the objective of the study is to compare the relative association of
several characteristics, it usually will be necessary to construct a single
hypothesis which relates three or more variables (multivariable hypotheses),
for example:
• Ischemia severity is a stronger predictor of cardiac events than symptom
  status and risk factor score.
• Response to physical training is affected more by age than gender.
• Improvement in health-related quality of life after cardiac surgery is
  influenced more by preoperative symptoms than by ventricular performance
  or geometry.
The number and type of variables contained within the hypothesis (as well as
the nature of the proposed association) will dictate the study design,
measurement procedures, and statistical analysis of the results. These
concepts are addressed in Chaps. 5 and 11.
3. Hypotheses of Causality Versus Association or Difference
The relation posited between variables may be cast as one of
cause-and-effect, in which case the researcher hypothesizes that one
variable affects or influences the other(s) in some manner. For example:
• Estrogen produces an increase in coronary flow.
• Smoking promotes lung cancer.
• Patient education improves compliance.
• Coronary artery bypass grafting causes a reduction in the number of
  subsequent cardiac events.
However, hypotheses often are not written this way because support for a
cause-and-effect relation requires not only biological plausibility and a
strong statistical result but also an appropriate (and usually rigorous)
study design. If the investigator believes that the variables are related,
but prefers not to speculate on the influence of one variable on another,
the hypothesis may be cast to propose an association only, without explicit
reference to causality. For example:
• Surgical benefit is related to preoperative ischemia severity.
• Exercise tolerance is correlated with chronological age.
• Consumption of low-calorie beverages is associated with body weight.
Finally, hypotheses also can be written to assert that there will be a
difference between levels of a variable among two or more groups of
individuals or within a single group of individuals at different points in
time, as shown by the following examples:
• Patients enrolled in a health maintenance organization (HMO) will have a
  different number of hospitalizations than those enrolled in preferred
  provider organizations (PPOs) or traditional fee-for-service insurance
  plans.
• Among patients undergoing mitral valve repair or replacement, left
  ventricular performance will be dissimilar at 1 versus 3 years after
  operation.
The hypothesis also can be framed so that the nature of the association
(e.g., linear, curvilinear, positive, inverse, etc.) or difference (larger
or smaller, better or poorer, etc.) will be specified (see below,
Alternative hypotheses [directional]).
4. Mechanistic Versus Nonmechanistic Hypotheses
Hypotheses can be written so as to provide a mechanism (i.e., an
explanation) for an asserted relationship or prediction, or they can be
written without defining an underlying mechanism. Mechanistic hypotheses are
common in preclinical research which typically
attempts to define biochemical and physiological causes of disease or
dysfunction and pathways amenable to therapeutic intervention. Shown below
are two examples of mechanistic hypotheses that were evaluated in two
different preclinical investigations: (Note the use of the phrase "as a
result of" in the first hypothesis evaluating the impact of endothelial
nitric oxide synthase [eNOS] and "due to" in the second hypothesis
evaluating antagonism of endothelin [ET]-induced inotropy. Italics have been
added for emphasis.)
• Gender-specific protection against myocardial infarction occurs in adult
  female as compared to male rabbits as a result of eNOS upregulation [36].
• ET-induced direct positive inotropy is antagonized in vivo by an indirect
  cardiodepressant effect due to a mainly ETA-mediated and ET-induced
  coronary constriction with consequent myocardial ischemia [37].
In clinical research, hypotheses more commonly are nonmechanistic (i.e.,
framed without including an explicit explanation). Shown below are two
published literature examples:
• Patients with medically unexplained symptoms attending the clinic of a
  general adult neurologist will have delayed earliest and continuous
  memories compared with patients whose symptoms were explained by
  neurological disease [38].
• Patients with acute mental changes will be scanned more frequently than
  other elder patients [39].
The reader will note that these hypotheses do not include the mechanism for
memory variations in these patient populations (first example) or the
reasons why elderly patients with acute mental changes should be scanned
more frequently than comparable patients without such changes (second
example). In situations like this, it is critical that the justification be
clear from the introductory section of the research paper or protocol.
5. Alternative Versus Null Hypotheses
The requirement that a hypothesis should be capable of corroboration or
unsupportability (falsification) reflects the fact that two outcomes always
can arise out of a study of any single research problem. Thus, prior to
collecting and evaluating empirical evidence to resolve a problem, the
investigator will posit two opposing assertions. The first assertion will
indicate the supposition for which support actually is sought (e.g., that
there is a difference between a population parameter and an expected value
or, more commonly, that there is some form of relation between variables
within a particular population); the other will indicate that there is no
support for this supposition. This first type of assertion is termed the
alternative hypothesis and is generally denoted HA or H1. The alternative
hypothesis can be differentiated further according to its quantitative
attributes. As an example, in a study evaluating the impact of
beta-adrenergic antagonist treatment (β-blockade) on the incidence of
recurrent myocardial infarctions (MIs), an investigator could frame three
contrasting alternative hypotheses:
1. The proportion of recurrent MIs among comparable patients treated with
   versus without β-blockade is different.
2. The proportion of recurrent MIs among patients treated with β-blockade is
   less than that among comparable patients treated without β-blockade.
3. The proportion of recurrent MIs among patients treated with β-blockade is
   greater than that among comparable patients treated without β-blockade.
The first of these statements is termed a nondirectional hypothesis because
the nature of the expected relation (i.e., the direction of the intergroup
difference in the proportion of recurrent infarctions) is not specified. The
second and third statements are termed directional hypotheses since, in
addition to positing a difference between groups, the nature of the expected
difference (positive or negative) is predefined. Generally, the decision to
state an alternative hypothesis in a directional versus nondirectional
manner is based on theoretical considerations and/or the availability of
prior empirical information. (In statistics, a
nondirectional hypothesis is usually referred to as a two-tailed or
two-sided hypothesis; a directional hypothesis is referred to as a
one-tailed or one-sided hypothesis.)
As noted, the hypothesis reflects a tentative conjecture which, to gain
validity, ultimately must be substantiated by experience (empirical
evidence). However, even objectively measured experience varies from time to
time, place to place, observer to observer, and subject to subject. Thus, it
is difficult to know whether an observed difference or association was
produced by random variation or actually reflects a true underlying
difference or association in the population of interest. To deal with the
problem of uncertainty, the investigator must implicitly formulate and test
what, in essence, is the logical opposite of his or her alternative
hypothesis (i.e., that the population parameter is the same as the expected
value or that the variables of interest are not related as posited). Thus,
the investigator must attempt to set up a straw man to be knocked down. This
construct (which need not be stated in the research report) is termed a null
(or no difference) hypothesis and is designated H0. A null hypothesis
asserts that any observable differences or associations found within a
population are due to chance and is assumed true until contradicted by
empirical evidence. In the single variable (one-sample) hypothesis, the
assertion is that the parameter of interest is not different from some
expected population value, whereas in a bivariable or multivariable
hypothesis, the assertion is that the variables of interest are unrelated to
some factor or to each other.
A null hypothesis is framed by inserting a negative modifier into the
statement of the alternative hypothesis. In the examples given above, the
following null statements could be developed:
1. The proportion of recurrent MIs among comparable patients treated with
   versus without β-blockade is not different.
2. The proportion of recurrent MIs among patients treated with β-blockade is
   not less than that among comparable patients treated without β-blockade.
3. The proportion of recurrent MIs among patients treated with β-blockade is
   not greater than that among comparable patients treated without
   β-blockade.
Only after both the null and alternative hypotheses have been specified, and
the data collected, can an appropriate test of statistical significance be
performed. If the results of statistical analysis reveal that chance is an
unlikely explanation of the findings, the null hypothesis is rejected and
the alternative hypothesis is accepted. Under these circumstances, the
investigator can conclude that there is a statistically significant relation
between the variables under study (or a statistically significant difference
between a parameter and an expected value). On the other hand, if chance
cannot be excluded as a probable explanation for the findings, the null,
rather than the alternative, hypothesis must be accepted. It is important to
note that acceptance of the null hypothesis does not mean that the
investigator has demonstrated a true lack of association between variables
(or equality between a population parameter and an expected value) any more
than a verdict of "not guilty" constitutes proof of a defendant's innocence
in a legal proceeding. Indeed, in criminal law, such a verdict means only
that the prosecution, upon whom the burden of proof rests, has failed to
provide sufficient evidence that a crime was committed. Similarly, in
research, failure to overturn a null hypothesis (particularly when the
alternative hypothesis has been argued) generally is taken to mean that the
investigator, upon whom the burden of proof (or, more appropriately,
corroboration) also rests, has failed to demonstrate the expected difference
or association. Null results may reflect reality, but they may also be due
to measurement error or inadequate sample size. For this reason, negative
studies, a term for research that yields null findings, are far less likely
to gain publication than studies that demonstrate a statistically
significant association [40, 41]. (See Chap. 9 for a more detailed
discussion of publication bias.)

Constructing the Hypothesis: Differentiating Among Variables

As indicated earlier, hypotheses most commonly entail statements about
variables. Variables, in turn, can be differentiated according to their
level of measurement (or scaling characteristics) or the role that they play
in the hypothesis.

Level of Measurement
Variables can be classified according to how well they can be measured
(i.e., the amount of information that can be obtained in a given measurement
of an attribute). One factor that determines the informational
characteristics of a variable is the nature of its associated measurement
scale, that is, whether it is nominal, ordinal, interval, or ratio, a
classification system framed in 1946 by Stevens [42]. Understanding these
distinctions is important because scaling characteristics influence the
nature of the statistical methods that can be used for analyzing data
associated with a variable.
1. The Nominal Variable
Nominal variables represent names or categories. Examples include blood
type, gender, marital status, hair color, etiology, presence versus absence
of a risk factor or disease, and vital status. Nominal variables represent
the weakest level of measurement as they have no intrinsic order or other
mathematical properties and allow only for qualitative classification or
grouping. Their lack of mathematical properties precludes calculation of
means, medians, or measures of dispersion; the mode (the most frequent
category) is the only measure of central tendency that can be reported. When
all variables in a hypothesis are nominal, this limits the types of
statistical operations that can be performed to tests involving
cross-classification (e.g., tests of differences between proportions).
Sometimes, variables that are on an ordinal, interval, or ratio scale are
transformed into nominal categories using cutoff points (e.g., age in years
can be recoded into old versus young; height in meters to tall versus short;
left ventricular ejection fraction in percent to normal versus subnormal).
2. The Ordinal Variable
Ordinal variables are considered to be semiquantitative. They are similar to
nominal variables in that they are composed of categories, but their
categories are arranged in a meaningful sequence (rank order), such that
successive values indicate more or less of some quantity (i.e., relative
magnitude). Typical examples of ordinal variables include socioeconomic
status, tumor classification scores, New York Heart Association (NYHA)
functional class for angina or heart failure, disease severity, birth order,
perceived level of pain, and all opinion survey scores. However, distances
between scale points are arbitrary. For example, a patient categorized as
NYHA functional class IV may have more symptomatic debility than one
categorized as functional class II, but he or she does not necessarily have
twice as much debility; indeed, he or she may have considerably more than
twice as much debility. Appropriate measures of central tendency for ordinal
variables are the mode and median (rather than the mean or arithmetic
average) or percentile. Similarly, hypothesis tests of subgroup differences
based on ordinal outcome variables are limited to nonparametric approaches
employing analysis of ranks or sums of ranks.
3. The Interval Variable
Interval variables, like ratio variables (below), are considered
quantitative or metric variables because they answer the question "how
much?" or "how many?" Both may take on positive or negative values. A common
example of an interval variable is temperature on a Celsius or Fahrenheit
scale. Both interval and ratio variables provide more precise information
than ordinal variables because the distances between successive data values
represent true, equal, and meaningful intervals. For example, the difference
between 70°F and 80°F is equivalent to the difference between 80°F and 90°F.
However, the zero point on an interval scale is arbitrary (note, freezing on
a Celsius scale is 0° but is 32° on a Fahrenheit scale) and does not
necessarily
connote absence of a property (in this case, absence of kinetic energy).
When analyzing interval data, one can add or subtract but not multiply or
divide. Most statistical operations are permissible, including calculation
of measures of central tendency (e.g., mean, median, or mode), measures of
dispersion (e.g., standard deviation, standard error of the mean, range),
and performance of many statistical tests of hypotheses including
correlation, regression, t-tests, and analysis of variance. However, due to
the absence of a true zero point, ratios between values on an interval scale
are not meaningful (though ratios of differences can be computed).
4. The Ratio Variable
As with interval variables, the distances between successive values on a
ratio scale are equal. However, ratio variables reflect the highest level of
measurement because they contain a true, nonarbitrary zero point that
reflects complete absence of a property. Examples of ratio variables include
temperature on a Kelvin scale (where zero reflects absence of kinetic
energy), mass, length, volume, weight, and income. When ratio data are
analyzed, all arithmetic operations are available (i.e., addition,
subtraction, multiplication, and division). The same statistical operations
that can be performed with interval variables can be performed with ratio
variables. However, ratio variables also permit meaningful calculation of
absolute and relative (or ratio) changes in a variable and computation of
geometric and harmonic means, coefficients of variation, and logarithms.
Quantitative variables (interval or ratio) can be either continuous or
discrete. Continuous variables (e.g., weight, height, temperature) differ
from discrete variables in that the former may take on any conceivable value
within a given range, including fractional values or decimal values. For
example, within the range 150–151 lbs, an individual theoretically can weigh
150 lbs, 150.5 lbs or 150.95 lbs, though the capacity to distinguish between
these values clearly is limited by the precision of the measurement device.
In contrast, discrete variables (e.g., number of dental caries, number of
white cells per cubic centimeter of blood, number of readers of medical
journals, or other count-based data) can take on only whole numbers. Nominal
and ordinal variables are intrinsically discrete, though in some disciplines
(e.g., behavioral sciences), ordinally scaled data often are treated as
continuous variables. This practice is considered reasonable when ordinal
data intuitively represent equivalent intervals (e.g., visual analogue
scales), when they contain numerous (e.g., 10 or more) possible scale values
or orderings [43], or when shorter individual measurement scales are
combined to yield summary scores. The reader should note, however, that in
other disciplines and settings, treating all data as continuous data is
controversial and generally is not recommended [44].

Role in the Research Hypothesis
Another method of classifying variables is based on the specific role
(function) that the variable plays in the hypothesis. Accordingly, a
variable can represent (1) the putative cause (or be associated with a
causal factor) that initiates a subsequent response or event, (2) the
response or event itself, (3) a mediator between the causal factor and its
effect, (4) a potential confounder whose influence must be neutralized, or
(5) an explanation for the underlying association between the hypothesized
cause and effect. Viewed this way, variables may be independent, dependent,
or may serve as moderator, control, or intervening variables. Understanding
these distinctions is crucial for constructing a research design, executing
a statistical program, or communicating effectively with a statistician.
1. The Independent Variable
The independent variable is that attribute within an individual, object, or
event which affects some outcome. The independent variable is conceptualized
as an input in the study that may be manipulated by the investigator (such
as a treatment in an experimental study) or reflect a naturally occurring
risk factor. In either case, the independent variable is viewed as
antecedent to some outcome and is presumed
to be the cause, or a predictor of that outcome, or a marker of a causal
agent or risk factor. We call this type of variable "independent" because
the researcher is interested only in its impact on other variables in the
study rather than the impact of other variables on it. Independent variables
are sometimes termed factors and their variations are called levels. If, for
example, an investigator were to conduct an observational study of the
effects of diabetes mellitus on subsequent cardiac events, the independent
variable (or factor) would be history of diabetes, and its variants
(positive or negative history) would be levels of the factor. As a second
example, in an intervention study examining the relative impact of inpatient
versus outpatient counseling on patient morbidity after a first MI, the
independent variable (factor) would be the counseling, and its variants
(inpatient counseling vs. outpatient counseling) would correspond to the
alternative levels of the factor. The reader should note that in both of
these hypothetical examples, there was only one independent variable (or
factor) and that each factor had two levels. It is possible and, in fact,
common for studies to have several independent variables and for each to
have multiple factor levels (indeed, the number of factor levels in
dose-response studies is potentially infinite). Care needs to be exercised
as researchers often confuse a factor with two levels for two factors.
Levels are always components of the factor. Understanding this distinction
is essential for conducting statistical tests such as analysis of variance
(ANOVA).
2. The Dependent Variable
In contrast to the independent variable, the dependent variable is that
attribute within an individual or its environment that represents an outcome
of the study. The dependent variable is sometimes called a response variable
because one can observe its presence, absence, or degree of change as a
function of variation in the independent variable. Therefore, the dependent
variable is always a measure of effect.
As an example, suppose that an investigator wished to study the effects of
adrenal corticosteroid therapy on systolic performance among patients with
heart failure. In this study, systolic performance would be the dependent
variable; the investigator would measure its degree of improvement or
deterioration in response to introducing versus not introducing steroid
treatment. Because it is a measure of effect, the dependent variable can be
observed and measured but, unlike the independent variable, it can never be
manipulated.
Independent and dependent variables are relatively simple to identify within
the context of a specific investigation, for example, a prospective cohort
or an experimental study or a well-designed retrospective study in which one
variable clearly is an input, the second is a response or effect, and an
adequately defined temporal interval exists between their appearance.
However, when research is cross-sectional, and variables merely are being
correlated, it is sometimes difficult or impossible to infer which is
independent and which is dependent. Under these circumstances, variables are
often termed covariates.
3. The Moderator Variable
Often, an independent variable does not affect all individuals in the same
way, and an investigator may have reason to believe that some other variable
may be involved. If he or she wishes to systematically study the effect of
this other variable, rather than merely neutralize it, it may be introduced
into the study design as a moderator variable (also known as an "effect
modifier"). The term "moderator variable" refers to a secondary variable
that is measured or manipulated by the investigator to determine whether it
alters the relationship between the independent variable of central interest
and the dependent (response) variable. The moderator variable may be
incorporated into a multivariate statistical model to examine its
interactive effects with the independent variable or it may be used to
provide a basis for stratifying the sample into two or more subgroups within
which the effects of the independent variable may be examined separately.
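
The stratification approach just described can be sketched as follows. The
records are invented for a hypothetical two-arm trial with a binary
moderator, and the function name treatment_effect is likewise hypothetical;
the sketch only compares group means and is not a substitute for a formal
test of interaction.

```python
from statistics import mean

# Hypothetical records: (treatment arm, moderator present?, change in outcome).
records = [
    ("drug", False, 6), ("drug", False, 8), ("drug", False, 7),
    ("drug", True, -5), ("drug", True, -4), ("drug", True, -6),
    ("placebo", False, 1), ("placebo", False, 2), ("placebo", False, 0),
    ("placebo", True, 1), ("placebo", True, 0), ("placebo", True, 2),
]


def treatment_effect(rows):
    """Mean outcome on drug minus mean outcome on placebo."""
    drug = [y for arm, _, y in rows if arm == "drug"]
    placebo = [y for arm, _, y in rows if arm == "placebo"]
    return mean(drug) - mean(placebo)


overall = treatment_effect(records)
absent = treatment_effect([r for r in records if not r[1]])
present = treatment_effect([r for r in records if r[1]])
interaction = absent - present  # difference between stratum-specific effects

print(f"overall effect:            {overall}")
print(f"effect, moderator absent:  {absent}")
print(f"effect, moderator present: {present}")
print(f"interaction contrast:      {interaction}")
```

In this invented data set the overall effect is zero because opposite
effects in the two strata cancel, which is the kind of pattern that an
analysis ignoring the moderator would miss.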

Fig. 3.4 A hypothetical example of the effects of a moderator variable:
influence of chronic anxiety on the impact of a new drug for patients with
attention deficit hyperactivity disorder

For example, suppose a psychiatrist wishes to study the effects of a new amphetamine-type drug on task persistence in patients with attention deficit hyperactivity disorder (ADHD) who have not responded well to current medical therapy. She believes that the drug may have efficacy but suspects that its effect may be diminished by the comorbidity of chronic anxiety. Rather than give the new drug to patients with ADHD who do not also have anxiety and placebo to patients with ADHD plus anxiety, to avoid confounding, she enrolls both types of patients, randomly administers drug or placebo to members of each subgroup, and measures task persistence among all subjects at a fixed interval after onset of therapy. In this hypothetical study, the independent variable would be type of therapy (factor levels: new drug, placebo), the dependent variable would be task persistence, and chronic anxiety (presence, absence) would be the moderator. Figure 3.4 illustrates the importance of a moderator variable. If none had been used in the study, the data would have led the investigator to conclude that the new drug was ineffective as no overall treatment effect would have been observed for the ADHD group (left panel, diagonal patterned bar), with change in task persistence for the entire treated group similar to subjects on placebo (right panel). However, as noted, the new drug was not ineffective but instead was differentially effective, promoting greater task persistence among patients without associated anxiety but decreasing task persistence among those with anxiety, as hypothesized.

A cautionary note is in order. Although moderator variables can increase the yield or accuracy of information from a study, an investigator needs to be very selective in using them as each additional factor introduced into the study design increases the sample size needed to enable the impact of these secondary factors to be satisfactorily evaluated. During the study planning process, the investigator must determine the likelihood of a potential interaction, the theoretical or practical knowledge to be gained by discovery of an interaction, and decide whether sufficient resources exist for such evaluation.

4. The Control Variable
In this last example, the investigator chose to evaluate the interactive effects of a secondary variable on the relation of the independent and dependent variables. Others in similar situations might choose not to study a secondary independent variable, particularly if it is viewed as extraneous to the primary hypothesis or focus of the study. Additionally, it is impractical to examine the effects of every ancillary variable. However, extraneous variables cannot be ignored because they can confound study results and render the data uninterpretable. Variables such as these usually are treated as control variables.

A control variable is defined as any potentially confounding aspect of the study that is manipulated by the investigator to neutralize its effects on the dependent variable. Common control variables are age, gender, clinical history, comorbidity, test order, etc. In the hypothetical example given above, if the psychiatrist had wanted to control for associated anxiety and not evaluate its interactive effects, she could have chosen patients with similar anxiety levels or, had her study employed a parallel design (which it did not), she could have made certain that different treatment groups were counterbalanced for that variable.

5. The Intervening Variable
Just as the moderator variable defines when (under what conditions) the independent variable exerts its action on the dependent variable, the intervening variable may help explain how and why the independent and dependent variables are related. This can be especially important when the association between independent and dependent variables appears ambiguous. There is general consensus that the intervening variable underlies, and accounts for, the relation between the independent and dependent variable. However, historically, workers in the field have defined it in different (and often contradictory) ways [45]. For example, Tuckman describes the intervening variable as a hypothetical internal state (construct) within an individual (motivation, drive, goal orientation, intention, awareness, etc.) that theoretically affects the observed phenomenon but cannot be seen, measured, or manipulated; its effect must be inferred from the effects of the independent and moderator variables on the observed phenomenon [35]. In the previous hypothetical example, which examined the interactive effects of drug treatment and anxiety on task persistence, the intervening variable was attention. In educational research, the intervening variable between an innovative pedagogical approach and the acquisition of new concepts or skills is the learning process impacted by the former. In clinical or epidemiological research, the intervening variable can represent a disease process or physiological parameter that links an exposure or purposively applied intervention to an outcome (e.g., secondhand smoking causes lung cancer by inducing lung damage; valvular surgery increases LV ejection fraction by improving contractility). Others such as Baron and Kenny [46] view an intervening variable as a factor that can be measured (directly or by operational definitions, described later in this chapter), fully derived ("abstractable") from empirical findings (data), and statistically analyzed to demonstrate its capacity to mediate the relation between the independent and dependent variables. As an example, Williamson and Schulz [47] measured and evaluated the relation between pain, functional disability, and depression among patients with cancer. They determined that the observed relation of pain to depression was due to diminution of function, operationally defined as activities of daily living (the intervening or mediating variable), which, in turn, caused depression. Similarly, Song and Lee [48] found that depression mediated the relation of sensory deficits (the independent variable in their study) to functional capacity (their dependent variable) in the elderly. (For a comprehensive discussion of mediation and statistical approaches to test for mediation, the reader is referred to MacKinnon 2008 [49].) Whether viewed as a hypothetical construct or as a measurable mediator, an intervening variable is always intermediate in the causal pathway by which the independent variable affects the dependent variable and is useful in explaining the mechanism linking these variables and, potentially, for suggesting additional interventions.

Below are two hypotheses from cardiovascular medicine in which constituent variables have been analyzed and labeled according to their role in each hypothesis.

Hypothesis 1: Among patients with heart failure who have similar clinical histories, those receiving adrenal corticosteroid treatment will demonstrate a greater improvement in systolic performance than those not receiving steroid treatment.

Independent variable: adrenal corticosteroid treatment
Factor levels: 2 (treatment, no treatment)
Dependent variable: systolic performance
Control variable: clinical history
Moderator variable: none
Intervening variable: change in magnitude of the inflammatory process

Hypothesis 2: Patients with angina who are treated with β-blockade will have a greater improvement in their capacity for physical activity than those of the same sex and age who are not treated with β-blockade; this improvement will vary as a function of severity of initial symptoms.
Independent variable: β-blockade treatment
Factor levels: 2 (treatment, no treatment)
Dependent variable: capacity for physical activity
Moderator variable: severity of initial symptoms
Control variables: sex and age
Intervening variable: alteration in myocardial work

In sum, many research designs, particularly those intended to test hypotheses about cause or prediction and effect, contain independent, dependent, control, and intervening variables. Some also contain moderator variables. Figure 3.5 illustrates their interrelationship.

Fig. 3.5 Interrelation among variables in a study design

Role of Operational Definitions

As indicated earlier, one of the characteristics of a hypothesis that sets it apart from other types of statements is that it is testable. The hypotheses discussed thus far are conceptual. A conceptual hypothesis cannot be directly tested unless it is transformed into an operational hypothesis. To accomplish this, operational definitions must be developed for each element specified in the hypothesis.

An operational definition identifies the observable characteristics of that which is being studied. Its use imparts specificity and precision to the research, enabling others to understand exactly how the hypothesis was tested. As a corollary, it enables the scientific community to evaluate the appropriateness of the methodology selected for studying the problem. Operational definitions are required because a concept, object, or situation can have multiple interpretations. While double entendre is one basis of Western humor, inconsistent (or vague) definitions within a study are not comical as they typically lead to confused findings (and readers). Imagine, for example, what might occur if one member of an investigative team, studying the relative impact of two procedures for treating hemodynamically important coronary artery disease, defined "important" as >50% luminal diameter narrowing of one or more

coronary vessels and another, working in the same study, defined it as ≥70% luminal diameter narrowing; or if one investigator studying new onset angina used 1 week as the criterion for "new" and another used 1 month. Operational definitions can describe the manipulations that the investigator performs (e.g., the intervention), or they can describe behaviors or responses. Still others describe the observable characteristics of objects or individuals. Once the investigator has selected appropriate operational definitions (this choice is entirely study dependent), all hypotheses in the study can be operationalized.

A hypothesis is rendered operational when its broadly (conceptually) stated variables are replaced by operational definitions of those variables. Hypotheses stated in this manner are called operational hypotheses, specific hypotheses, or predictions.

Let us consider two hypotheses previously given in this chapter:

"Patients with heart failure who are treated with adrenal corticosteroids will have better systolic performance than those who are not" is sufficiently general to be considered a conceptual hypothesis and, as such, is not directly testable. To render this hypothesis testable, the investigator could operationally define its constituent elements as follows:
Heart failure = secondary hypodynamic cardiomyopathy
Adrenal corticosteroids = cortisol
Better systolic performance = higher left ventricular ejection fractions at rest
The hypothesis, in its operational form, would state: "Patients with secondary hypodynamic cardiomyopathy who have received cortisol will have higher ventricular ejection fractions at rest than those who have not received cortisol treatment."

Similarly, the hypothesis that "patients with angina who are treated with β-blockers will have a greater improvement in their capacity for physical activity than those not treated with β-blockers, and that this improvement will vary as a function of initial symptoms," while complex, is still general enough to be considered conceptual. To render this hypothesis testable, its constituent elements could be defined as follows:
β-blockers = propranolol (assuming that the investigator was specifically interested in this drug)
Capacity for physical activity = New York Heart Association functional class
Severity of symptoms = angina class 1–2 versus angina class 3–4
This hypothesis, in its operational form, would be stated: "Patients with angina who are treated with propranolol will have greater improvement in New York Heart Association functional class than those not treated with propranolol, and this improvement will vary as a function of initial angina class (1–2 vs. 3–4)." In this form, the hypothesis could be directly tested, although the investigator would still need to specify measurement criteria and develop an appropriate design.

Any element of a hypothesis can have more than one operational definition and, as noted, it is the investigator's responsibility to select the one that is most suitable for his or her study. This is an important judgment because the remaining research procedures (i.e., specification of subject inclusion/exclusion criteria, the nature of the intervention and outcome measures, and data analysis methodology) are derived from operational hypotheses. Investigators must be careful to use a sufficient number of operational definitions so that reviewers will have a basis upon which to judge the appropriateness of the methodology outlined in submitted grant proposals and manuscripts, so that other investigators will be able to replicate their work, and so that the general readership can understand precisely what was done and have sufficient information to properly interpret findings.

Once operational definitions have been developed and the hypothesis has been restated in operational form, the investigator can conduct the study. The next step will be to select a research design that can yield data to support optimal statistical hypothesis testing. The strengths, weaknesses, and requirements of various study designs will be discussed in Chaps. 4 and 5.

Take-Home Points

• A hypothesis is a logical construct, interposed between a problem and its solution, which represents a proposed answer to a research question. It gives direction to the investigator's thinking about the problem and, therefore, facilitates a solution.
• There are three primary modes of inference by which hypotheses are developed: deduction (reasoning from a general proposition to specific instances), induction (reasoning from specific instances to a general proposition), and abduction (formulation/acceptance on probation of a hypothesis to explain a surprising observation).
• A research hypothesis should reflect an inference about variables; be stated as a grammatically complete, declarative sentence; be expressed simply and unambiguously; provide an adequate answer to the research problem; and be testable.
• Hypotheses can be classified as conceptual versus operational, single versus bi- or multivariable, causal or not causal, mechanistic versus nonmechanistic, and null or alternative.
• Hypotheses most commonly entail statements about variables which, in turn, can be classified according to their level of measurement (scaling characteristics) or according to their role in the hypothesis (independent, dependent, moderator, control, or intervening).
• A hypothesis is rendered operational when its broadly (conceptually) stated variables are replaced by operational definitions of those variables. Hypotheses stated in this manner are called operational hypotheses, specific hypotheses, or predictions and facilitate testing.

4 Design and Interpretation of Observational Studies: Cohort, Case–Control, and Cross-Sectional

Martin L. Lesser

M.L. Lesser, PhD
Biostatistics Unit, Departments of Molecular Medicine and Population Health, Feinstein Institute for Medical Research, Hofstra North Shore-LIJ School of Medicine, 350 Community Drive, 1st floor, Manhasset, NY 11030, USA

P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators, DOI 10.1007/978-1-4614-3360-6_4, © Phyllis G. Supino and Jeffrey S. Borer 2012

Introduction

Perhaps one of the most common undertakings in biomedical research is to determine whether there is an association between a particular factor (usually referred to as a "risk factor") and an event. That event might be a disease (e.g., lung cancer) or an outcome in subjects who already have a disease (e.g., sudden death among subjects with valvular heart disease). For example, an investigator might want to know whether smoking is a risk factor for lung cancer or whether oral contraceptive use is a risk factor for a myocardial infarction in women. These kinds of research questions are often answered using specific types of research designs, the two most common being the case–control and cohort study designs. (In this chapter, we will use the term "disease" interchangeably with "disease outcome" as both represent endpoints of interest.)

While both types of study designs aim to answer the same kind of research question, the method of conducting these designs is quite different. For example, in a cohort study (more specifically a "prospective" cohort study, to be further elucidated below), we might start out by gathering together hundreds of college students who are smokers and follow them over their lifetimes to see what fraction develop lung cancer (i.e., estimate the incidence rate). Likewise, we might follow a similar cohort of college nonsmokers to determine their lung cancer incidence rate. In the end, we would compare the incidence rates of lung cancer and, using appropriate statistical methodology, determine whether the incidence rates were significantly different from one another, thereby supporting or not supporting the hypothesis that smoking is associated with lung cancer.

On the other hand, in the case–control design, we would begin by selecting individuals who have a diagnosis of lung cancer (cases) and a group of appropriate individuals without lung cancer (controls) and "look back in time" to see how many smokers there were in each of the two groups. We would then, once again, using appropriate statistical methods, compare the prevalence rates of a smoking history to determine whether such an association between smoking and lung cancer is supported by the data.

Thus, the essential difference between these two study designs is that, in the cohort design, we first identify subjects with and without a given risk factor and then follow them forward in time to determine the respective disease incidence rates, whereas in the case–control design, we first identify subjects with and without the disease and then determine the fraction with the risk factor in each group.
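The contrast between the two designs described above can be sketched with a few lines of arithmetic. The counts are hypothetical and chosen only to show which quantity each design estimates; "incidence" is treated here as a simple proportion of subjects followed, which is an assumption of this sketch, and no significance test is attempted.

```python
def proportion(events, total):
    """Return events / total, guarding against an empty group."""
    if total <= 0:
        raise ValueError("group size must be positive")
    return events / total


# Cohort design: start from exposure status, follow subjects forward,
# and count new cases of disease. Counts are invented for illustration.
incidence_smokers = proportion(events=50, total=500)
incidence_nonsmokers = proportion(events=5, total=500)
incidence_ratio = incidence_smokers / incidence_nonsmokers

# Case-control design: start from disease status, look back in time,
# and count how many subjects in each group had the exposure.
smoking_among_cases = proportion(events=160, total=200)
smoking_among_controls = proportion(events=80, total=200)

print(f"Cohort: incidence {incidence_smokers:.3f} in smokers vs "
      f"{incidence_nonsmokers:.3f} in nonsmokers "
      f"(ratio {incidence_ratio:.1f})")
print(f"Case-control: smoking history in {smoking_among_cases:.0%} of cases "
      f"vs {smoking_among_controls:.0%} of controls")
```

Note that the case-control arm yields the fraction exposed among cases and controls, not the incidence of disease, because the investigator fixed the numbers of cases and controls when selecting them.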

In both of these study designs, the timing of the suspected risk factor exposure in relation to the development or diagnosis of the disease is important. Both study designs consider the situation where exposure to the risk factor precedes the disease. While such designs cannot prove causality (as will be discussed below), this ordering of exposure and disease is a necessary condition for causality.

A third type of commonly used observational design is the cross-sectional study. As will be discussed below, this design does not specifically examine the timing of exposure and disease.

It should be pointed out that case–control and cohort study designs are not necessarily restricted to the study of risk factors for a disease, per se. For example, if we wanted to conduct a study to determine risk factors for a patient dropping out of a clinical trial, we could select cases to be those who dropped out of a clinical trial and controls would be those who did not drop out of the clinical trial. Of course, dropping out of a clinical trial is not a disease (we might refer to it as an "outcome"), yet it can be studied in the context of a case–control study design.

The case–control, cohort, and cross-sectional studies are considered "observational" study designs, which means that no particular therapeutic or other interventions are being purposively applied to the subjects of the study. The subjects of the study simply are being observed in their natural settings to determine, in this example, how many developed lung cancer or how many were smokers. A study design where an intervention is purposively applied to subjects to determine, for example, whether one treatment modality is better than another would be called an "experimental design" or, more specific to biomedical research, a "clinical trial" in which the intervention (e.g., drug, device, etc.) is assigned to the subject as per protocol. (For detailed discussions of studies of interventions and how to prepare for them, the reader is referred to Chaps. 5 and 6.)

The important issue of whether to choose a case–control or cohort study design for a particular research study will be discussed later in this chapter. Each has relative advantages and disadvantages that need to be weighed when such a choice is being considered.

Cohort Studies

Basic Notation

In the most general setting, we will hypothesize that exposure (E) to a particular agent, environmental factor, gene, life event, or some other specific factor increases the risk of developing a particular disease (D) or condition. Perhaps a better way to state the hypothesis would be that exposure is "associated" with the disease.

More formally, we might use the following hypothesis testing notation:
H0: Exposure to the factor is not associated with an increased risk of developing the disease.
HA: Exposure to the factor increases the risk of developing the disease.

In statistical terms, H0 and HA are the null and alternative hypotheses, respectively. (A discussion of hypothesis specification and testing can be found in Chaps. 3 and 11 in this text.) As in most hypothesis testing problems, the objective is to refute the null hypothesis and demonstrate support for the alternative hypothesis.

It is important to note that the hypotheses relating E and D do not use the word "cause" because in observational studies, we cannot prove causality; we can only hope to show that an association exists between E and D, which may not necessarily be causal. We will have more to say about establishing causality from observational studies later in this chapter.

Selection of Exposed Subjects

In order to conduct a cohort study, one must first select subjects who have been exposed to the hypothesized risk factor. It is not the purpose of this chapter to provide detailed guidance on alternative sampling methodologies, which are discussed in greater detail in Chap. 10. Here, our goal is to provide general guidance as to how to sample subjects and from where they might be

sampled, with the specific details left to the reader in consultation, perhaps, with a statistician or epidemiologist.

Definition of the Exposure

To select exposed subjects, there must be a clear definition of what it means to be, or have been, exposed to the risk factor under study. Suppose, for example, a study was conducted to determine the effect of exposure to heavy metals (e.g., gold, silver, etc.) on semen and sperm quality in men during their peak reproductive years. We might enlist the support of a company that works with heavy metals in a factory setting and then obtain seminal fluid samples from men working in that factory. However, we would still need to know what it means to be "exposed." Exposure can be defined in many ways. For example, just working in that factory environment for at least 6 months might be one definition of exposure; another definition might involve the direct measurement of heavy metal particles in the factory or on a detector worn by each factory worker, from which a determination of exposure might be made based on some minimum threshold exposure level indicated on the detector. If one were to study the effect of cigarette smoking in pregnant women on the birth weight of newborns, once again, one would need to have a definition of what it means to be a smoker during pregnancy: is having smoked one cigarette during pregnancy enough to define the smoking status, or does it need to be a more consistent and higher frequency of cigarettes during the pregnancy? As for measurability, it is desirable but not always possible to define exposure based on some directly measurable quantity.

Sources of Exposed Subjects
Where might exposed subjects be found? Certainly, in the prior example of occupational exposure, one might look to identify potentially exposed subjects from the roster of companies in certain lines of manufacturing or other work, labor unions, or other organizations or groups of individuals that would be associated with a particular occupation and, potentially, with such an exposure. One also might enroll persons living near environmental hazards or persons with certain lifestyles, such as those who regularly attend an exercise gym. In an epidemiologic study of long-term effects of prescription drugs, one might utilize a roster or list of individuals who have been prescribed a certain type of drug. When selecting cohorts of exposed subjects, an attempt should be made to select these cohorts for their ability to facilitate the collection of relevant data, possibly over a long period of time. For example, there are several large-scale prospective cohort studies that involve physicians [1, 2].

Sources of Exposure Information
To determine whether or not a subject has been exposed to a particular risk factor, the investigator has several sources of information that might be used for making this determination (Table 4.1). First, preexisting records (medical charts, school records, etc.) might be used for determining whether a particular exposure occurred. While preexisting records may be easy and inexpensive to retrieve, they may be inaccurate with respect to the information that an investigator needs in his or her research investigation because data in the chart were not collected with this research study in mind; rather, the data were collected for clinical reasons only.

Table 4.1 Sources of exposure information
Preexisting records
Interviews, questionnaires
Direct physical examination or tests
Direct measurement of the environment
Daily logs

A second source of exposure information, which represents an improvement upon preexisting records, is self-reported information (e.g., interviews or questionnaires that may be administered to prospective participants in the cohort study). This approach allows the investigator some flexibility about which questions should be asked and how they should be asked, which might not be available in preexisting records. Of course, conducting interviews or administering questionnaires has associated costs that may be substantially greater than retrieving preexisting records or charts.

Beyond direct interviews and questionnaires, the investigator also can perform physical examinations or tests on individual subjects to determine certain exposures. Direct measurement of environmental variables (e.g., in an occupational exposure type of cohort study) also would be reasonable. Of course, these approaches to determining exposure status generally have higher associated costs and logistical difficulties than do interviews, questionnaires, or use of preexisting records. Finally, the investigator might ask subjects to maintain daily logs of certain activities, environmental exposures, foods, etc., in order to determine levels of exposure over time. Daily logs have the advantage of providing information on a detailed and regular basis but have the shortcoming of being inaccurate due to the self-report nature of a daily log.

In summary, there are many sources of exposure information available to the investigator. The use of a particular source depends on its relative advantages and disadvantages with respect to accuracy, feasibility, and cost.

Selection of Unexposed Subjects (Controls)

The control group for the exposed subjects should comprise individuals who have been unexposed to the factor being studied. As will be seen for case–control studies, the proper selection of a control group can be a difficult task.

First, one must have a good definition of exposure in order to operationalize the definition of "unexposed." Obviously, we want the unexposed subjects to be free of the exposure in question but similar to the exposed cohort in all other respects. How an investigator would determine the exposure status of a potential control certainly depends on the type of exposure one is studying. In the example of heavy metal exposure given above, one would probably administer some sort of interview to determine whether the potential control has ever been in an occupation or an environmental situation where there might have been heavy metal exposure. Additionally, there are different degrees of exposure to a risk factor that would need to be considered. For example, in our hypothetical study on heavy metal exposure and male fertility, it might be convenient to select controls from the business offices of the same company, which might be located at some distance from the factory. However, if one were to select office workers as potential unexposed controls, the investigator would have to be careful that those potential controls are not regularly exposed to the heavy metals in the factory. This could happen if, for example, the vice president for quality control, who worked in the business office, made daily tours of the factory and, therefore, was exposed (albeit a small amount of exposure) to the heavy metals.

Sources of Outcome Information

Once a cohort study is underway, it is essential for the investigator to determine whether the particular outcome has or has not occurred. Once again, there are various sources of information (Table 4.2), each of which has its advantages and disadvantages from a logistical, cost, and accuracy perspective. Death certificates often are used to determine cause of death and comorbidities at the time of death for a participating subject. Unfortunately, death certificates can be inaccurate with regard to the specific details of cause of death and, of course, may not capture information about other outcomes that the investigator is seeking.

Table 4.2 Sources of outcome information
Death certificates
Physician and hospital records
Disease registries
Self-report
Direct physical examination or tests

Physician and hospital records represent good sources of outcome information provided that the subject has maintained contact in that particular health-care or physician system. If the outcome in question was whether a patient suffered a myocardial infarction (MI), there is no guarantee that the patient will be seen for that MI at the investigator's hospital, and therefore, the investigator

may not have access to that information based on his or her immediate hospital records.

Disease registries can be useful sources of information, but, once again, they are very similar to physician and hospital records in that disease registries are often specific to a particular hospital or large regional health area. Also, confidentiality issues may preclude the ability to access records in disease registries for subjects.

Self-report (described in detail in Chap. 8) is a relatively inexpensive and logistically simple method for determining outcome but can be inaccurate because patients may not be cognizant of the subtleties of various diseases or outcomes that have been diagnosed. However, written permission from the patient sometimes can be obtained for the investigator to contact the patient's physicians and hospital records in order to make definite ascertainment of whether or not an outcome occurred.

Finally, direct physical examination or tests conducted on the subject might reveal whether an outcome has occurred, of course, depending on the nature of the outcome being studied. Once again, this type of information might be very accurate but could be costly or logistically difficult to obtain in all subjects.

In sum, different sources of outcome information have their advantages and disadvantages relative to accuracy, logistics, and cost and should be weighed carefully by the investigator in designing a cohort study.

Confounding in Cohort Studies

Nature of the Problem
While the identification of a potential unexposed group might seem rather straightforward in many study designs, there is always an underlying problem in the choice of these unexposed controls, i.e., confounding. Essentially, confounding can be described in two ways. It is the phenomenon that occurs when an exposure and a disease are not associated but a third variable (known as the confounding variable) makes it appear that there is an association between exposure and disease. Conversely, confounding can also occur when a third variable makes it appear that there is no association between an exposure and a disease when, in fact, there is.

Before providing concrete examples of confounding, it is important to formally define the concept. Let E denote the exposure and D denote the disease being studied. A third factor, F, is called a confounding variable if it meets two criteria: (1) F is associated with exposure, E; and (2) independent of exposure, F is associated with the risk of developing the disease, D. It should be emphasized that a confounding factor, F, must meet both of these conditions in order to be a confounder. Often, in error, research investigators treat variables as confounders when they only meet one of those criteria (Table 4.3).

Table 4.3 Criteria for confounding
1. The presumed confounder (F) is associated with the exposure (E)
2. Independent of exposure, F must be associated with the risk of disease (D)

As an example of confounding, suppose that an investigator wished to determine whether smoking during pregnancy was a risk factor for an adverse outcome (defined as spontaneous abortion or low birth weight). The investigator would recruit two cohorts of pregnant women, one whose members smoke while pregnant and the other whose members do not. (The finer details of how to identify and recruit these cohorts are not within the scope of this chapter.) The two cohorts are then followed through their pregnancies, and the rates of adverse outcomes are compared (using a measure known as relative risk, which will be described later). Further, suppose that the investigator does find an increased risk of adverse outcomes in the smoking group. He submits his results to a peer-reviewed journal but is unsuccessful in gaining publication because one of the reviewers notes that the explanation for the increased risk may not be due to smoking, but, rather, to the effect of a confounding variable, namely, educational status. Why might educational status be a confounder? First, individuals with low educational levels are more likely to be smokers. (This satisfies criterion #1 of the definition of confounding.) Second, irrespective
60 M.L. Lesser

of smoking, women with low educational levels are at greater risk for adverse maternal-fetal
outcomes. (This satisfies criterion #2.) Thus, it is unclear whether the increased risk is
attributable to smoking, educational level, or both. How does one eliminate the effect of a
confounding variable?

Minimizing Confounding by Matching
One solution to the confounding problem in cohort studies is to match the exposed and
unexposed cohorts on the confounding variables. (This approach will be discussed in greater
detail later on in the section on case-control designs.) For example, a smoker who did not
achieve a high school education would be paired (or matched) with a nonsmoker who was also a
non-high school graduate. By matching in this way, the representation of education level will
wind up being identical in both cohorts; thus, the effect of the confounding variable is
eliminated. Of course, matching could be carried out for multiple confounders, but usually,
only two or three are considered for practical reasons.

Although matching exposed and unexposed subjects on confounding variables is theoretically
desirable, such matching often is not carried out in cohort studies due to sample size,
expense, and logistics. Many cohort studies are rather large, and to perform matching can be
practically difficult. Matching in small cohort studies also may be limited by the sample
size in that it may be difficult to find appropriate matches for the exposed subjects.

Typically, in cohort studies, confounding variables are dealt with in the statistical
analysis phase where adjustments can be made for these variables as covariates in a
statistical regression model. Also, it should be pointed out that in cohort studies, which
often are conducted over a long period of time, a subject's confounding variable may change
over time, and a more complicated accounting for that change would need to be dealt with in
the analysis phase. Matching is more common in case-control studies and will be discussed in
greater detail below.

Table 4.4 Bias and related problems in cohort studies
1. Exposure misclassification bias
2. Change in exposure level over time
3. Loss to follow-up
4. Nonparticipation bias
5. Reporting bias

Sources of Bias in Cohort Studies

As in any type of study design, there are potential flaws (or biases) that may creep into the
study design and affect interpretation of the results. As also noted in Chaps. 5 and 8, bias
refers to an error in the design or execution of a study that produces results that are
distorted in one direction or another due to systematic factors. In other words, bias causes
us to draw (incorrect) inferences based on faulty assumptions about the nature of the data.

There are many types of bias that can occur in research designs. Given in Table 4.4 are some
of the more common types that would be encountered in cohort studies. (See Hennekens and
Buring 1987 [3] for a more complete description.)

1. Exposure Misclassification Bias. This type of bias occurs when there is a tendency for
exposed subjects to be misclassified as unexposed or vice versa. The example cited above in
selection of controls is an example of misclassification bias. In that example, the quality
control personnel who work in the white-collar business office might be classified as
unexposed when, in fact, they are routinely exposed to the heavy metals because they tour the
factory twice a day (even though they do not work in the factory). Typically, exposure
misclassification bias occurs in the direction of erroneously classifying an individual as
unexposed when, in fact, he or she is exposed. This would have the effect of reducing the
degree of association between the exposure and the disease. In other words, if, in fact,
exposure did increase the risk of disease, it is possible that we would declare little or no
association. If the bias went in the other direction (i.e., unexposed subjects are
misclassified as exposed), then we run the risk of finding an
4 Design and Interpretation of Observational Studies 61

association when, in fact, none exists. A solution to the misclassification problem is to
have strict, measurable criteria for exposure. Of course, the ability to accurately measure
or determine exposure may be limited by available resources.

2. Change in Exposure Level over Time. Bias may occur when a subject's exposure status
changes with time. For example, a subject in the smoking cohort may quit smoking 10 years
after high school. Is that subject in the smoking or nonsmoking cohort? In cases like this,
it is common to classify the subject's time periods with respect to smoking or nonsmoking and
to use the person-years method (see Kleinbaum et al. 1982 [4]) to analyze the data. Using
this method, the subject is not classified as exposed or unexposed, only his follow-up time
periods. Nevertheless, if crossover from one cohort to the other occurs, particularly in one
direction only (e.g., smokers become nonsmokers, but nonsmokers do not start to smoke after
high school), this may impart a bias that confounds interpretation of the study. For example,
if many quitters develop lung cancer (presumably because they were exposed for several
years), this occurrence might reduce the observed association between smoking and lung
cancer.

3. Loss to Follow-up Bias. Bias can occur when members of one of the groups are
differentially lost to follow-up compared to the other, and the reason for their loss is
related, in part, to their level of exposure. Consider the following hypothetical
observational study that evaluates newly diagnosed heterosexual AIDS patients. The two
cohorts in this example are those patients who were IV drug users (IVDUs) and those who were
not. Both cohorts are started on the same antiretroviral therapy at diagnosis. The research
question is whether there is a difference between the two groups in viral load at the end of
one year.
As the study progresses, some patients die. To illustrate this bias using an exaggerated
scenario, suppose that there are 50 IVDUs (the exposed cohort) and 50 non-IVDUs (the
unexposed cohort), and, of the 50 IVDUs, 20 have died before the end of the 1-year follow-up
period, leaving only 30 with measured viral load levels at follow-up (as there is no
follow-up viral load recorded on the 20 IVDUs who died). The effect of this might be that the
30 IVDUs who completed the 1-year follow-up might have been, in general, healthier than the
IVDUs who died, leading to a biased comparison.

4. Nonparticipation Bias. Nonparticipation bias is somewhat similar to loss to follow-up bias
except that the bias occurs at the time of enrollment into the study. Suppose we were
conducting a cohort study to determine whether child abuse is a risk factor for psychiatric
disorders in teenage years. Although this might be a problematic study to conduct, due to the
sensitive nature of the risk factor (i.e., child abuse), one might consider contacting
families who were seen at a psychiatric facility once child abuse was discovered and asking
them to participate in the study to follow their children through their teenage years to
determine their psychiatric status. Controls would be families or subjects without histories
of abuse who would be followed in the same way. In a situation such as this, it is likely
that many families with histories of child abuse would decline to participate and that those
who would participate might be psychologically healthier, rendering them unrepresentative of
the general group of families with child abuse. Furthermore, if this group were, indeed,
psychologically healthier, then the incidence of teen psychological disorders might be lower,
thus attenuating the true association between child abuse and psychological disorders.

5. Reporting Accuracy Bias. Reporting accuracy bias in cohort studies is similar to that in
case-control studies. It refers to a situation where either the exposed or unexposed subjects
deliberately misreport either their exposure or their outcome status, usually due to the
sensitive nature of the variables being studied. (See the section on case-control studies for
examples.)
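
As a rough illustration of the person-years method mentioned under item 2 above, the
following Python sketch tallies follow-up time and events separately for exposed and
unexposed periods and then computes an incidence rate per person-year for each. The four
subjects, their follow-up periods, the `Interval` class, and the function name are
hypothetical, invented only for illustration. Attributing each event to the exposure status
in force when it occurred is a simplifying assumption and is not taken from Kleinbaum et al.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    years: float          # length of this follow-up period, in years
    exposed: bool         # exposure status during the period (e.g., smoking)
    event: bool = False   # True if the outcome occurred at the end of the period


def person_year_rates(subjects):
    """Return {label: (person_years, events, rate)} pooled over all subjects."""
    person_years = {True: 0.0, False: 0.0}
    events = {True: 0, False: 0}
    for intervals in subjects:
        for iv in intervals:
            # Time and events are credited to the period, not to the subject.
            person_years[iv.exposed] += iv.years
            events[iv.exposed] += int(iv.event)
    result = {}
    for status, label in ((True, "exposed"), (False, "unexposed")):
        py = person_years[status]
        rate = events[status] / py if py > 0 else float("nan")
        result[label] = (py, events[status], rate)
    return result


# Hypothetical cohort: each subject is a list of follow-up periods.
subjects = [
    [Interval(10, True), Interval(5, False)],   # smoked 10 years, then quit
    [Interval(20, True, event=True)],           # smoker who developed disease
    [Interval(25, False)],                      # lifelong nonsmoker
    [Interval(15, False, event=True)],          # nonsmoker who developed disease
]

rates = person_year_rates(subjects)
for label, (py, n, rate) in rates.items():
    print(f"{label}: {n} event(s) / {py:.0f} person-years = {rate:.4f} per person-year")
```

In this sketch, the subject who quit contributes time to both categories, which is why the
subject as a whole is never labeled exposed or unexposed.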

Fig. 4.1 Computing the relative risk

Computing and Interpreting Relative Risk

The foregoing discussion dealt primarily with issues surrounding the design and
interpretation of cohort studies. Between design and interpretation is a phase during which
various calculations are carried out to quantify the relationship between the presumed risk
factor and the disease under investigation. The most common measure used in cohort studies
for quantifying such risk is the relative risk (RR). The calculation and interpretation of RR
can be illustrated by referring to Fig. 4.1. Here, a and b, respectively, represent the
number of exposed subjects who did and did not develop the disease in question. Likewise, c
and d represent the unexposed subjects who, respectively, did and did not develop the
disease. In a cohort study, one usually selects exposed subjects so that the row total of
exposed (a + b) is fixed at some predetermined sample size. Likewise, the sample size for the
row of unexposed (c + d) is also fixed. The two row totals do not necessarily have to be
equal. This table is often referred to as a 2 x 2 table (pronounced "two-by-two") since it
contains two rows and two columns corresponding to Exposure and Disease status.

In the exposed group, the fraction of subjects who developed disease (i.e., incidence rate in
the exposed) is a/(a + b); the corresponding incidence rate in the unexposed is c/(c + d).
The relative risk is then defined as

\[
\mathrm{RR} = \frac{\text{incidence rate in exposed}}{\text{incidence rate in unexposed}}
            = \frac{a/(a+b)}{c/(c+d)}.
\]

Typically, one might compare the rates to determine whether they are different, since, if the
rates are the same (i.e., RR = 1), that effectively tells us that there is no association
between the risk factor and the disease. On the other hand, if the rate is greater in the
exposed (i.e., RR > 1), that would suggest that the risk factor is positively associated with
the disease. (RR < 1 would suggest that the subjects with the risk factor actually have a
lower likelihood of disease.)

It should be noted that RR is always a positive number unless one or more of the cells in the
above 2 x 2 table contains a zero, in which case it is common to compute the RR by adding 0.5
to a, b, c, and d and using the formula given above (see Agresti 2002 [5]).

The following example (see Fig. 4.2) computes the RR for a cohort study investigating oral
contraceptive use as a risk factor for MI in women. In this example, 1,000 women who used an
OC were followed over a period of

Fig. 4.2 Relative risk: an


time to see who developed an MI. Likewise, 1,000 OC nonusers were followed in a similar way.
The incidence rates of MI were 0.03 and 0.003, respectively, yielding an RR = 10, which means
that women who used OC had 10 times the risk of MI of nonusers. For determining whether an RR
is significantly different from 1, the reader is referred to Kleinbaum et al. 1982 [4].

Prospective Versus Retrospective Cohort Designs

One usually thinks of a cohort study as prospective because it looks forward from an exposure
to the subsequent development of disease. However, a cohort study can be classified as
"retrospective" or "prospective," depending on when it is being conducted with respect to the
outcome. If, at the time the investigator initiates the study, the outcome (e.g., disease)
has not yet occurred in the study subjects, then the study is prospective because the
investigator must follow the subjects in real time in order to ascertain outcome status. On
the other hand, if the study is conducted after the exposures and outcomes have already
occurred, this type of design often is classified as a retrospective cohort study.

For example, referring back to the section on confounding, there is general consensus that
the study of smoking during pregnancy as a risk factor for adverse maternal-fetal outcomes is
of the prospective type because, as described, the investigator must wait from the time of
exposure to observe the outcome of the pregnancy. However, suppose that the study were to be
conducted by reviewing patient charts from 2 years prior to the initiation of the study and
identifying women who smoked and did not smoke during pregnancy at that time. Then, the
investigator would determine the pregnancy outcome from the chart data (i.e., the outcomes
are already known and documented in the charts). This is an example of what many term a
retrospective cohort study. (As noted in Chap. 1, DeAngelis [6] and others would refer to
this as a historical or nonconcurrent cohort study.)

To the reader, the distinction between retrospective and prospective cohort studies may not
seem important since the logic of the two approaches is essentially the same. However, in a
prospective cohort study, the investigator typically has more quality control of the conduct
of the study and how data are to be collected than in a retrospective study because the
former is being conducted in real time. In a retrospective cohort study, the investigator is
limited by the nature and quality of data already available, which most likely were collected
for routine clinical purposes using criteria and standards that are different from those of
the current research investigation.
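
The relative risk calculation described earlier in this chapter (Figs. 4.1 and 4.2) can be
sketched in a few lines of Python. The function name and the `correction` parameter are
illustrative assumptions. The cell counts 30, 970, 3, and 997 follow from the stated
incidence rates of 0.03 and 0.003 in two cohorts of 1,000 women; the second table, with a
zero cell, is hypothetical, and the value 0.5 added to every cell in that case is the
commonly used correction, assumed here.

```python
import math


def relative_risk(a, b, c, d, correction=0.5):
    """RR from a 2 x 2 table.

    a, b: exposed subjects with and without disease
    c, d: unexposed subjects with and without disease
    """
    if 0 in (a, b, c, d):
        # Assumed convention: add a small constant to every cell if any cell is zero.
        a, b, c, d = (x + correction for x in (a, b, c, d))
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


# OC example: 1,000 users (30 MIs) and 1,000 nonusers (3 MIs).
rr_oc = relative_risk(30, 970, 3, 997)
print(f"RR for OC use and MI: {rr_oc:.1f}")

# Hypothetical table with a zero cell, handled by the correction.
rr_zero = relative_risk(5, 95, 0, 100)
print(f"RR with zero-cell correction: {rr_zero:.1f}")
```

The point estimate alone says nothing about statistical significance; a confidence interval
or test would still be needed, as the text notes.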

Case-Control Studies

The purpose of a case-control study, like a cohort study, is to determine whether an
association exists between exposure (E) to a proposed risk factor and occurrence of a disease
(D). The essential difference between the two designs is that in a cohort study, exposed and
unexposed subjects are identified and then followed over time to determine the incidence
rates of disease in those two groups, whereas in a case-control study, subjects with and
without the disease are classified as having or not having been exposed to the proposed risk
factor. More simply put, the cohort study follows subjects forward in time, whereas the
case-control study looks backward for an associated factor by first identifying subjects with
and without the disease.

Selection of Cases

If we are to conduct a case-control study, then we first need to determine who our "cases"
will be and how we will select them for inclusion in the study.

Case Definition
The first step in selecting cases is to define what is meant by a "case." For example, if we
were studying lung cancer, we might specify that a case would be any subject with
biopsy-proven adenocarcinoma of the lung. If the research question itself necessarily
distinguished between small cell and non-small cell lung cancer and only the latter type was
to be studied, then we would have to add that to the definition. Other examples of strict
definitions might be as follows: if one were studying nutritional factors and their
association with MI, we would define a patient to have an MI if the patient exhibits a
certain degree of enzyme elevation and has clearly defined prespecified changes on an
electrocardiogram.

Homogeneity of Cases
Most diseases vary according to severity or subtype. If we were to include in our study only
cases with a very specific subtype and/or severity (e.g., a particular histology of lung
cancer), then the study design may benefit from decreased "noise" or variation, but the
results may be less generalizable. Furthermore, restriction of the case definition will
result in a smaller potential pool of subjects (i.e., smaller sample size). Conversely, if
the case definition is expanded to include, say, multiple subtypes of the disease, then the
results may be more generalizable, and the subject pool size may increase. However, there
will be greater variability, which may reduce the ability to detect an association between E
and D (i.e., reduced statistical power). Determining the heterogeneity of case definition is
a fine balancing act between addressing the specific research question and sample size
considerations.

Sources of Cases
In most research studies, a case of disease will be identified and selected from medical
practices or facilities such as hospitals or physician practices. Occasionally, cases of
disease can be obtained by using disease registries.

Prevalent Versus Incident Cases
An important consideration in the selection of cases is whether a case is considered a
prevalent or incident case. A subject is said to be a prevalent case if the patient has the
disease in question regardless of when it was diagnosed. It may have been diagnosed 2 days
ago, 2 years ago, or 10 or 20 years ago. But, as long as the subject is available, that
subject is considered a prevalent case. An incident case refers to a more restrictive
criterion. In order to be an incident case, an individual needs to have been diagnosed
"recently." "Recently" may have different connotations in different disease entities, but,
for example, in a chronic disease like cancer, an incident case might be a case that was
diagnosed within the past 2-3 months. On the other hand, for a disease that is rapidly fatal,
such as anthrax poisoning, an incident case might be defined as a case that was diagnosed an
hour or 2 ago. The essential point to remember in designing case-control studies is that,
when selecting cases, we should select incident cases, not prevalent cases. The reasons are
as follows.

First, case-control studies often involve the recall of information about past exposures.
This type of information often is obtained by interviewing the subject him or herself or by
interviewing family members or friends who might have such information. Of course, some
exposure information may also be gleaned from patient charts or other documents that exist
independent of an interview with a subject. It stands to reason that if the interval of time
between diagnosis of the disease and the interview for exposure information is lengthy, then
the ability to properly recall exposures will be reduced. Certain exposures such as smoking
are not likely to be forgotten, but, for example, if we were studying more complex and/or
rare exposures, the ability to accurately recall such exposures and associated details would
decrease over time. Thus, the shorter the interval between diagnosis and gathering of
exposure information, the more likely the recall of information will be accurate.

A second reason for selecting incident cases is illustrated by the following example. Suppose
we were studying the association between smoking and lung cancer. We might go to the tumor
registry of our hospital and find 1,000 lung cancer cases that were diagnosed over the past
10 years. The next step in our research design would be to contact these subjects and ask
them whether or not they were smokers prior to their development of lung cancer. One of the
problems associated with this approach is that out of those 1,000 lung cancer cases diagnosed
over the past 10 years, many will have expired before we would be able to contact them. Cases
that are still alive probably would fall into two broad groups: (a) those who have been
recently diagnosed and have not had enough exposure to lung cancer yet to die from the
disease and (b) those who were diagnosed in the more distant past but who have survived. The
latter group (b) is likely made up of those with lower grade disease or those who have been
more successful in combating their disease with therapy. That group may be very different
from those who were diagnosed in the more distant past who already have died of their
disease. In fact, it is conceivable that smoking may not just be associated with developing
lung cancer but may, instead, be associated with its lethality. Thus, it is possible that the
smokers are those who died early in the group that was diagnosed in the more distant past
whereas nonsmokers are the ones who have survived despite their disease. In this case, when
comparing this biased group of cases to non-cancer controls, we would observe an attenuated
association between smoking and lung cancer. This bias would provide potentially misleading
results.

On the other hand, if one were to simply sample recently diagnosed cases and assuming that
the disease is not rapidly fatal (even small cell lung cancer patients would survive to be
interviewed), almost all of the available lung cancer cases would be included in the study
since, at that point, no one would be lost to follow-up or death. Therefore, the sample would
not be biased as it might have been had the sampling methodology been based on prevalent case
selection.

Selection of Controls

Perhaps the most difficult aspect of conducting a case-control study is the selection of
controls. In principle, controls should be a group of individuals who are free of the disease
or outcome in question (i.e., unaffected) and are as similar as possible in all other
respects to the case group.

Definition of Controls
Controls should be free of the disease in question. One of the difficulties in selecting
controls is determining how far we should go to ensure that someone is free of the disease or
outcome. For example, if we were to select as a control for our lung cancer cases an
individual who has never had a diagnosis of lung cancer, do we need to perform a bronchoscopy
on that patient for certainty of that fact, or do we simply take his self-report as the truth
that he has never had lung cancer? Of course, there are subtleties that arise when
subclinical disease exists at the time an individual is being selected as a control. These
are fine points that would need to be dealt with in a very careful manner, in consultation
with a statistician or an epidemiologist.
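
To make the second argument for incident cases (the tumor registry example above) concrete,
the sketch below compares the odds ratio, the measure of association used for case-control
studies later in this chapter, computed from all diagnosed cases with the one computed from
surviving (prevalent) cases only. Every number in it is hypothetical: the 700/300 split of
smokers among the 1,000 registry cases, the survival fractions, and the control counts are
assumptions chosen only to show the direction of the bias, and the function name is
illustrative.

```python
def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Exposure odds ratio for a case-control comparison."""
    return (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)


# Hypothetical registry: 1,000 lung cancer cases diagnosed over 10 years.
smokers, nonsmokers = 700, 300

# Hypothetical controls without lung cancer.
control_smokers, control_nonsmokers = 300, 700

# Assumed fractions still alive at interview; smokers are assumed to fare worse.
surviving_smokers = round(smokers * 0.20)
surviving_nonsmokers = round(nonsmokers * 0.50)

# Odds ratio if every diagnosed case could be interviewed.
or_all_cases = odds_ratio(smokers, nonsmokers, control_smokers, control_nonsmokers)

# Odds ratio if only surviving (prevalent) cases are interviewed.
or_survivors = odds_ratio(
    surviving_smokers, surviving_nonsmokers, control_smokers, control_nonsmokers
)

print(f"OR using all diagnosed cases: {or_all_cases:.2f}")
print(f"OR using surviving cases only: {or_survivors:.2f}")
```

Under these assumed figures, the odds ratio based on survivors is smaller than the one based
on all diagnosed cases, which is the attenuation the text warns about.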

At this point, it is instructive to provide an example of where verification of non-disease
status might be problematic and require some additional thought about the design of the
study. Suppose we were conducting a case-control study to determine whether there is an
association between a high fat diet and colon cancer. Specifically, our hypothesis is that
colon cancer cases will report a higher frequency of high fat diets than non-cancer controls.
To test our hypothesis, we would select our colon cancer cases in some way consistent with
the guidelines already stated above and then select controls. One possible source of controls
would be adults visiting a large shopping mall. (We might choose to select individuals over
50 years old if our case-control study was designed to answer the question in this
population.) Next, we could set up a colon cancer information booth in the mall and invite
the passersby to answer a question or two about history of colon cancer and, if they wished,
to pick up a fecal occult blood test kit so that they can screen themselves for colon cancer.
Those who self-reported that they had never had a diagnosis of colon cancer could be invited
to participate as controls for our case-control study. We might use as an exclusion criterion
a positive test result on the fecal occult blood test (even though that finding obviously
does not equate to a diagnosis of colon cancer).

A member of our investigative team might object to this approach since self-report and fecal
occult blood testing, in and of themselves, would not completely verify the disease-free
status of someone passing through the shopping mall. Thus, we might be more rigorous in our
selection of controls. This might be done by enlisting the collaboration of a
gastroenterologist who performs colonoscopies and selecting from his or her colonoscopy
practice those subjects who have colonoscopies with a benign or negative outcome. Such
outcomes might include diverticulosis, inflammatory bowel disease, a benign polyp, other
benign tumors of the colon, etc. If we were to view colonoscopy as a close to foolproof way
of determining an individual's colon cancer status, then this would be a better way of
selecting controls for such a study than selecting them from visitors to a shopping mall
(even though colonoscopy, itself, is not infallible). Of course, subjects who have a
diagnosis of colon cancer based on the colonoscopy would be excluded from the control group.

The selection of controls from among those undergoing colonoscopy, nonetheless, could
potentiate a different problem, namely, selection bias. Generally speaking, there are two
broad groups of individuals who undergo colonoscopy: (a) those who are symptomatic and who
are referred by their physician to a gastroenterologist to determine the cause of their
rectal bleeding, abdominal pains, cramping, diarrhea, etc., and (b) those who are
asymptomatic who undergo colonoscopy for screening purposes only. However, these two groups
differ in ways that can influence the results of the investigation. For example, a high fat
diet may not be specific to the risk of colon cancer but may be associated with other
intestinal problems (e.g., some of the benign conditions cited above). If this association
was not appreciated during the study design stage, and individuals from the symptomatic group
were selected as controls, their rate of high fat diets would be spuriously inflated, thus
reducing the observed degree of association between fatty diets and colon cancer. On the
other hand, the asymptomatic individuals who undergo cancer screening are more likely to be
health-conscious individuals since they are voluntarily attending a screening program.
Because these individuals are more health conscious, they may have an artificially lower
level of fat intake than a standard population of individuals without colon cancer.
Accordingly, when we compare the fat intake for this control group against the colon cancer
group, we may observe an exaggerated association because of the artificially reduced levels
of fat intake in our control group.

There are several ways to address this problem, none of which constitutes a perfect
resolution of the issue. In this example, some investigators might employ only one of the
control groups with the understanding that the bias would need to be considered when
interpreting the results. Thus, for example, if the benign disease group were used as the
control and only a small

association was observed (i.e., odds ratio [OR] is close to 1), the association would be
inconclusive because of the directionality of the bias. However, if a large and statistically
significant association (i.e., OR > 1) were found, then, because the bias is working against
the hypothesis of positive association, this larger OR would provide evidence in favor of the
association. Another approach might be to include both groups as separate controls and,
knowing the opposite directions of the bias, compare cases to each control group and draw
inferences accordingly.

Sources of Controls
Recall that in a case-control study, cases of disease are most conveniently selected from a
medical practice or facility, but controls need not be selected from such sources even though
it might also be convenient to do so. Controls also can be selected from the community
at-large using sophisticated sampling techniques or by simply placing advertisements in
community media to recruit individuals who meet the control criteria. Very often,
investigators will collaborate with various work places that will permit access to their
employees as potential controls for a particular study. Over the years, departments of motor
vehicles often have served as a source of controls for many research studies. Occasionally,
close friends, relatives, or neighbors of an individual case will serve as controls. Choosing
such individuals can solve a myriad of problems because this type of control sometimes will
share the same environmental conditions as the case or have a similar genetic disposition.
The approach also facilitates cooperation because, very often, friends, relatives, or
neighbors will cooperate with an investigator who is also working with that individual's
relative. However, selecting friends and relatives as controls may have adverse consequences
because it often forces the cases and controls to be similar on the very risk factors being
investigated, thus reducing the association between the risk factor and disease.

In summary, the selection of controls requires careful thought and knowledge of the
underlying subject matter.

Confounding in Case-Control Studies

The Nature of the Problem
The impact of confounding on interpretation of findings from cohort studies has previously
been addressed. The reader should note that its adverse effects are not limited to cohort
studies but represent a potentially serious problem in case-control designs as well.
Schlesselman [7] provides interesting examples of such confounding, which we now describe.

Consider a hypothetical case-control study designed to test the hypothesis of association
between alcohol use (E) and lung cancer (D). Cases of lung cancer are selected for study, and
a group of controls without lung cancer is identified. Suppose that the rate of alcohol use
in the lung cancer cases is found to be significantly greater than that of the controls. The
conclusion would be that alcohol use increases the risk of lung cancer. However, one might
criticize the study because smoking should have been considered a confounding variable.

Why is smoking a confounding variable? One needs to refer back to the definition. Certainly,
smoking is associated with lung cancer (criterion #2), independent of any other factors.
However, smoking's association with lung cancer does not, in itself, make it a confounding
variable. Smoking must also be associated with alcohol use (criterion #1). How is smoking
associated with alcohol use? The answer lies in the fact that individuals who drink alcohol
tend to have a higher rate of smoking than individuals who do not drink alcohol. Therefore,
smoking is related both to alcohol use (E) and lung cancer (D) and is, therefore, a
confounding variable.

As another example of a confounding variable that may obscure an association between a
putative risk factor and disease, consider a case-control study to determine whether there is
an association between oral contraceptive (OC) use and MI in women. Once again, one would
pick cases of women who had suffered a recent MI and determine whether or not they had used
OC in, say, the past 5 years. A possible result of this study would be that the level of OC
use was not

substantially greater in the MI cases than in the non-MI controls, thereby
resulting in the conclusion that there is little or no association between
OC use and MI. However, once again, smoking could be considered a
confounding factor because it meets the two criteria of a confounder: first,
smoking is associated with MI. Second, smoking is associated with OC use.
Why is this so? The reason is that women who are smokers are less likely to
be prescribed an OC than women who are nonsmokers because of the risk of
thrombophlebitis and other cardiovascular disorders. In this example, the OC
users were underrepresented in the MI case group because there were many
smokers in the MI group, many of whom were never prescribed OC. Thus, the
confounding effect of smoking potentially masks a relationship (i.e.,
reduces the association) between OC use and MI.

Although it is important to identify confounders, it is just as important to
recognize factors that may appear to be confounders but, in fact, are not.
Once again, two examples from Schlesselman [7] are instructive. Consider a
case-control study designed to investigate whether a sedentary lifestyle is
a risk factor for MI. Cases are those with a recent history of MI and
controls are individuals without MI (appropriately chosen). The exposure
variable is (for simplicity) sedentary lifestyle (coded as "no" or "yes"),
as derived from some validated measure of physical activity. One might
consider levels of fluid intake (F) as a possible confounding variable
because physically active, non-sedentary subjects might have higher levels
of fluid intake than sedentary subjects; in other words, F is associated
with E. Accordingly, we would consider matching cases to controls on fluid
intake. However, fluid intake is not a true confounder because there is no
known or presumed association between fluid intake and MI (D). Thus,
matching on fluid intake is not necessary.

Reducing Confounding by Matching

If confounding is an important problem in epidemiologic studies, how do we
deal with it? A common solution is matching. Matching is a technique whereby
cases and controls are made to appear similar with respect to one or more
confounding variables. When cases and controls are properly matched, the
representation of the confounding variables is similar in both groups and,
therefore, should have no appreciable effect on the results and
interpretation of the case-control study.

Most students in the medical sciences are familiar with the idea of matching
since they probably have read many studies where matching was employed.
However, it is our objective in this chapter to describe the logistics of
matching in somewhat more detail. The first step in matching cases to
controls is to identify the confounding variables. The next step is to
determine the desired method of matching. Typically, one should not match on
more than a few variables (i.e., two or three), but this also depends on the
sample size in the case-control study and on the distribution of the
confounding variables in the samples being studied. Let us consider a simple
example where we have determined that age and sex are important confounders.
(It is important to emphasize that, while age, sex, race, and socioeconomic
status are four of the most commonly encountered confounders, it is not
always necessary to match on any of these variables. The reader should be
reminded again that in order for a variable to be a confounder, it must meet
the two criteria given in the definition above.)

1. Group Versus Calipers Matching. When age and sex are potential
   confounders, one way to match cases and controls is to classify male and
   female subjects into age groupings (a common method of classification for
   age is by decades, i.e., age 20–29, 30–39, 40–49, 50–59, or 60 and
   above). This approach would yield up to 10 different age/sex combinations
   corresponding to each of the 5 age categories cross-classified with sex
   (male, female). Therefore, if a case were to be chosen and that
   particular subject was a 30-year-old male, we would choose a control who
   was a male in the 30- to 39-year age group; these two individuals (the
   case and the control) would be naturally matched and paired.

   The reader should note, however, that there is a disadvantage to creating
   groups on a measured variable such as age. Suppose, in the
4 Design and Interpretation of Observational Studies 69

   above example, we required a match for a 30-year-old male, and, based on
   the pool of potential controls, a 29-year-old male and a 39-year-old male
   were both available. Using the grouping criteria defined above, the
   30-year-old male would have to be matched with the 39-year-old male
   because they were in the same age category. However, it would make more
   sense to match a 30-year-old male with a 29-year-old male because the two
   are closer in age.

   A solution to this problem is to use what is known as calipers matching
   whereby, on a measured variable, a control would be matched to a case
   based on being within a certain number of units away from that case's
   measurement (hence the use of the term "calipers"). For example, we might
   define a rule to match age to within (±) three years. In this case, the
   29-year-old male is within three years of the 30-year-old male and would
   be matched to the 30-year-old male, whereas the 39-year-old male would be
   outside the defined three-year limit. A compromise between broad grouping
   and calipers would be to arrange the potentially confounding variable (in
   this case, age) into narrow categories (e.g., 30–33, 34–37, 38–41, etc.).
   This would reduce the effect of the disparity that occurred in the
   example given above involving grouping by decades. When using this method
   for age matching, the investigator must take care to consider the nature
   of the study population. For example, if one were matching on age using
   three-year calipers in a case-control study evaluating utilization of
   health-care services, a 64-year-old case could be matched to any control
   ranging from 61 to 67 years old. However, in this example, matching a
   64-year-old to, say, a 66-year-old in a health services utilization study
   might result in matching a non-Medicare subject with a Medicare subject.
   As these two types of patients might have very different utilization
   patterns, a bias could be introduced into the study design. Similarly,
   when conducting research with pediatric patients, it is important to
   match as closely and precisely to actual age as possible (which is
   equivalent to making the calipers extremely narrow). For example, one
   would not match children to within three years (e.g., matching a
   10-year-old girl to a seven- or 13-year-old girl) since individuals at
   these ages could have very different outcomes due to variations in
   socialization, sexual maturity, body size, and other developmental
   variables. Effective matching, under these circumstances, requires that
   there be a large pool of available controls to pair with cases.

2. Individual Versus Frequency Matching. Another consideration in matching
   is whether the investigator wishes to use individual versus frequency
   matching. Typically, with individual matching, one case and one control
   are matched to one another (1:1 matching). Occasionally, the statistician
   or epidemiologist will recommend many-to-one matching, which might
   involve matching two or three controls to each case. It is uncommon to
   match more than three controls to a case because it can be shown that the
   statistical power benefits do not substantially increase once two or
   three controls have been matched to each case. The reader should keep in
   mind that if he or she conducts a case-control study with 1:1 matching,
   it is necessary that there be an equal number of cases and controls. A
   common misstatement that is seen in many research proposals employing
   case-control studies is, for example, "there will be 50 cases with
   disease and they will be matched to 20 controls without disease." If the
   investigator was thinking of performing individual matching, then this
   statement makes no sense as it would require a constant ratio of controls
   to cases. Usually, what the investigator intends is that they will select
   cases and controls so that, for example, the average age (or sex
   distribution) of both groups is approximately the same. However, this
   approach is not matching; it is simply determining how comparable the two
   groups are after they have been selected. Unless one prospectively
   selects controls in a deliberate way so as to match them directly to a
   given case, the term "matching" is not appropriate.

   When an investigator does not perform individual matching but instead
   wants to
   ensure that the confounding variables have the same joint distributions
   among both cases and controls, the method of choice is frequency
   matching. Frequency matching refers to the deliberate and prospective
   selection of controls so that the joint distribution of the confounding
   variables is approximately the same in both the case and control groups.
   As an example, suppose we were performing a case-control study to
   determine whether maternal smoking during pregnancy was a risk factor for
   premature birth. Our cases might be 100 premature infants delivered
   during the past year, and our controls would be drawn from the hundreds
   of normal term births delivered during the same time period. Further, we
   have determined that parity (i.e., nulliparous vs. parous) and age
   (grouped in 3-year intervals) are confounding variables for which
   matching will be performed. Suppose we have decided, based on statistical
   power and resources available to conduct the study, that the number of
   controls will be 250. Further, suppose that in the case group, 10% of the
   cases were born to nulliparous 30- to 33-year-old women. We would then
   identify from our vast pool of term-delivery controls all women who are
   nulliparous 30- to 33-year-olds. From this pool of candidates, we would
   randomly select 25 nulliparous 30- to 33-year-old women. Selecting 25 at
   random would assure that 10% of the control group (10% of 250 = 25) would
   be nulliparous 30- to 33-year-olds. Likewise, suppose that 16% of the
   cases are parous 25- to 28-year-old women; then, in a similar way, we
   would identify all parous 25- to 28-year-old women who had full-term
   deliveries and, from that group, randomly select 40 matching controls, as
   40 would constitute 16% of the control group. If we continued in this
   fashion, we would obtain a control group that had either precisely or
   approximately the same joint distribution of parity and age as the case
   group. It is important to note that to use frequency matching, one would
   need to know the distribution of the confounding variables in the case
   group prior to selecting the matched controls. This certainly would be
   workable in a study such as this where ascertainment of smoking status
   (the risk factor) could be made by chart review so that one could first
   constitute the case group and then return to select the control group.
   Frequency matching may be logistically more difficult to conduct in other
   types of case-control studies, but the concept is still the same.

3. Propensity Matching. A recently developed method for matching cases and
   controls (which also may be used for matching exposed and unexposed
   subjects in a cohort study) is known as propensity scoring (Rosenbaum and
   Rubin [8, 9]). Briefly, this method involves predicting whether a subject
   is a case or a control based on observed predictor covariates. Thus, one
   subject may be a case and the other a control, but their covariate
   profiles are similar as reflected by their predicted probability of being
   in, say, the case group. Specifically, the probability of being a case
   (i.e., the propensity score) is computed for each subject in the study
   (both cases and controls) using a statistical method known as multiple
   logistic regression (see Chap. 11). Then, cases are matched to controls
   on the propensity score. So, for example, suppose that in a particular
   study, the score is being computed as a function of age, sex, smoking
   status, family history, and socioeconomic status. If a particular case
   has a score of, for example, 0.75, we would try to match this case to a
   control that also has a score of 0.75. In this way, cases and controls
   are matched based on a measure of their similarity. An advantage of the
   propensity score method is that it allows the investigator to match cases
   and controls on a single criterion (the score) that is a function of
   multiple confounding variables, rather than having to match on each of
   the individual confounders.

Sources of Bias in Case-Control Studies

As in cohort studies, case-control studies are subject to a variety of
biases. Given below
are some of the more common types that may be encountered.

Recall Bias

Recall bias occurs when one of the groups recalls exposure to the risk
factor more accurately than the other group. It is not uncommon for recall
bias to manifest itself as cases remembering exposures better than controls.
As an example, suppose one were conducting a case-control study to examine
risk factors for early childhood leukemia. The cases in such a study might
be parents of children with leukemia who were diagnosed before their fourth
birthday, and the controls might be parents of children who did not have a
diagnosis of leukemia. The investigator interviews both groups of parents
with respect to exposure to a variety of potential risk factors. It would
not be unlikely that the mother of a young child with leukemia would
remember many household exposures better than a mother whose child was
healthy since it is human nature to recall antecedent events potentially
leading up to a serious disease or traumatic event better than someone who
has no reason to remember those events or exposures. Another example of
recall bias might be found in a study examining antecedents of lower back
pain. Subjects who experience lower back pain probably would have better
recall of events related to lifting of heavy objects that may have preceded
the diagnosis of the back pain versus those without back pain who may not
have any particular reason to remember such events.

Reporting Accuracy Bias

This term refers to lying or deception in the response to questions
concerning exposure, as frequently occurs in the setting of case-control
studies where sensitive questions are being asked of the subject. A classic
example of reporting accuracy bias might be as follows: Suppose one were to
conduct a case-control study among women to determine whether the number of
sexual partners during the past year is a risk factor for contracting
venereal disease (VD). One might conduct this study at a women's health
center and select as cases women with newly diagnosed VD. Controls could be
women from the same clinic who do not have a diagnosis of VD. The important
question in the epidemiologic interview would be "how many sexual partners
have you had in the past year?" The responses in the case group (those with
VD) might look as follows: 1, 1, 2, 2, 2, 3, 4, 5, 5, 6, 6, 6, 8, 9, and 10.
(The responses have been ordered from smallest to largest in order to better
visualize the data.) When the control group is asked to respond to the same
question, the results might be 1, 1, 1, 1, 1, 1, 1, 2, 2, and 2. Based on
these responses, the average number of sexual partners in the case group
would be 4.7 versus 1.3 in the control group, thus suggesting (subject to a
formal statistical test) that increased number of sexual partners is a risk
factor for venereal disease.

Although, at face value, the interpretation of the results might be as just
stated, there is a potential reporting accuracy bias. The bias might occur
because women who have VD may be more likely to be truthful about the number
of sexual partners they have had, whereas women who are controls may not be,
thus causing the average number of sexual partners to be artifactually
greater in the case group than in the control group. Why might such a bias
exist? One hypothesis is that individuals with a particular disease (in this
case, VD) tend to be more candid with their physicians about past medical
history and behaviors [10]. In fact, many patients (rightly or wrongly)
believe that if they are truthful, then their physicians may be able to
better treat their disease than if they are not truthful. Assuming that this
women's health center serves women who are married, have boyfriends or other
male partners, etc., those in the control group might be less likely to be
truthful about the number of sexual partners because they would perceive
that they have something to lose and nothing to gain by admitting multiple
sexual partners. Of course, the ethical conduct of such a study would
require an assurance of confidentiality with respect to responses to the
epidemiologic questions, but such an assurance does not guarantee that
subjects will be cooperative when confronted with a highly personal and
sensitive question.
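
The group averages quoted in the reporting accuracy example can be checked
directly from the listed responses. The short Python sketch below is only an
arithmetic illustration; the variable names are ours, and the data are the
hypothetical responses given in the text.

```python
from statistics import mean

# Hypothetical responses from the reporting accuracy example in the text
case_responses = [1, 1, 2, 2, 2, 3, 4, 5, 5, 6, 6, 6, 8, 9, 10]
control_responses = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2]

# Average number of reported partners in each group
case_mean = mean(case_responses)        # 70 responses-sum over 15 subjects
control_mean = mean(control_responses)  # 13 responses-sum over 10 subjects

print(f"cases: {case_mean:.1f}, controls: {control_mean:.1f}")
```

The calculation describes only what subjects reported; it cannot reveal
whether the gap between the groups reflects a true difference or
underreporting by controls, which is the point of the example.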
Selection Bias

Selection bias in case-control studies occurs when identification and/or
inclusion of cases (or controls) depends, in part, on the subject's level of
exposure to the risk factor under study. There are several common forms of
selection bias (i.e., detection and referral bias) as discussed below.

1. Detection Bias. Detection bias occurs when subjects exposed to the risk
   factor are more (or less) likely to be screened for the disease. A good
   example can be found in a hypothetical case-control study to determine
   whether exogenous estrogen use is a risk factor for endometrial cancer in
   women. One might choose as cases women with newly diagnosed endometrial
   cancer and as controls those without a diagnosis of endometrial cancer
   (suitably matched on various confounding variables). The study would then
   determine what fraction of cases had been exposed to estrogen (according
   to some predefined criteria) and similarly for the controls. A problem
   (potential bias) in this type of study is that when a woman undergoes
   estrogen therapy, it is likely that she will visit her gynecologist more
   often than if she does not since she would need to be monitored more
   frequently for potential side effects, such as vaginal bleeding.
   Consequently, if the woman were to develop endometrial cancer
   (irrespective of whether estrogen increased the risk), then it is more
   likely that the gynecologist will discover it due to the increased
   surveillance. Thus, when one selects cases for this study, unbeknownst to
   the investigator, the cases may have a higher likelihood of exposure
   simply because of the way that they were selected to enter the case pool.

2. Referral Bias. Referral bias occurs when controls are referred into the
   control pool for reasons that are related to the disease under study. As
   an example, suppose that a case-control study was being conducted to
   determine whether caloric intake was a risk factor for coronary artery
   disease (CAD). Since the investigator works in a hospital, she would like
   to select her controls, for convenience, from her hospital environment.
   She reasons that selecting controls from the pulmonary, endocrinology, or
   renal clinic might create a bias because many of those patients already
   have heart disease (or are at risk for heart disease), so she decides to
   select controls from the podiatry clinic around the block. She further
   reasons that most of the patients visiting the podiatry clinic are
   presenting for a variety of foot problems unrelated to heart disease or
   diet. However, she does not realize that some of these patients also have
   been referred for foot problems related to diabetes, and diabetes, of
   course, is related both to heart disease and caloric intake. Therefore,
   using these subjects as controls (without excluding controls seen for
   diabetes-related problems) might weaken any true association between diet
   (caloric intake) and CAD.

   Another type of referral bias relates to the situation where included
   cases are not truly representative of all cases of the disease. For
   example, suppose we were investigating a possible increased risk of
   pediatric inflammatory bowel disease (IBD) among children who were
   formula-fed during infancy, as opposed to having been breast-fed. If we
   were to select the IBD cases from a medical practice at a prominent
   teaching hospital that specializes in pediatric IBD, we might be seeing a
   disproportionately high number of severe cases since it is likely that
   severe, difficult-to-manage cases would be referred to this center.
   Furthermore, if, in fact, formula feeding is not a risk factor for
   development of IBD but is a risk factor for having a more severe case of
   IBD among those with such a diagnosis, then it is likely that these cases
   will have a higher percentage of formula-fed individuals than the non-IBD
   control group. Accordingly, we would be more likely to conclude that
   formula feeding is a risk factor for IBD, when it is not.

Computing and Interpreting Measures of Risk

The foregoing discussion dealt primarily with issues surrounding the design
and interpretation of case-control studies. Between the design and
interpretation of a case-control study is a phase
during which various calculations are carried out to quantify the
relationship between the presumed risk factor and the disease under
investigation. The most common measure used for drawing inferences in a
case-control study is the odds ratio (OR). The calculation and
interpretation of the OR can be illustrated by reference to Fig. 4.3. Here,
a and c, respectively, represent the number of cases who were exposed and
not exposed to the risk factor. Likewise, b and d, respectively, represent
the number of controls who were exposed and not exposed. In a case-control
study, one usually selects cases so that the column total of cases (a + c)
is fixed at some predetermined sample size; likewise for the control column
(b + d). Frequently, the cases and controls are sampled in equal numbers (so
that a + c = b + d), but there are circumstances where equality may not
hold, as pointed out in the section on matching.

Fig. 4.3 Computing the odds ratio

In the case group, the fraction of subjects who were exposed to the
candidate risk factor is a/(a + c); the corresponding proportion exposed in
the control group is b/(b + d). Typically, one might compare the two
proportions to determine whether they are different since if the proportions
are the same, that effectively tells us that the risk factor is not
associated with the disease; on the other hand, if the proportion of exposed
cases is much larger than that of the controls, that would suggest that the
risk factor is associated with the disease.

For various mathematical reasons, it is more convenient to express the risk,
not as a difference between proportions but as a ratio of odds. To the
unfamiliar reader, the odds of an event occurring is defined as the
probability that the event will occur divided by the probability that it
will not occur. For example, if the probability of an event is 25%, the odds
of the event occurring is 25/75 (or, as some would prefer to express it, 1:3
odds). Thus, the odds of exposure among cases is [a/(a + c)]/[c/(a + c)]
whereas the odds of exposure among controls is [b/(b + d)]/[d/(b + d)]. If
we denote these quantities by O1 and O2, respectively, then OR = O1/O2 =
(ad)/(bc). Computation of the OR in this fashion always will result in a
positive number unless one or more of the cells in the above 2 × 2 table
contains a zero; in the latter instance, it is common to compute the OR by
adding 0.5 to a, b, c, and d and using the same formula [5] employed for
computation of the relative risk (RR) in a cohort study. Just as in the
interpretation of the RR, if OR > 1, this is taken to mean that the exposure
to the risk factor increases the risk of disease by that many times or by
that fold increase. Thus, for example, if OR = 1.5, this means that
individuals with the risk factor are 1.5 times more likely to get the
disease than those without the risk factor. Conversely, if OR < 1, exposure
to the risk factor is protective. Thus, if OR = 0.5, that means that those
with the risk factor are half as likely to get the disease as
those without the risk factor. An OR that is close to 1.0 means the factor
is not associated with risk of disease. Figure 4.4 illustrates computation
of the OR for a hypothetical case-control study investigating family history
of coronary artery disease (CAD) as a risk factor for myocardial infarction
(MI) in men. In this example, OR = 1.56, which means that men with a family
history of CAD have a 1.56 times greater risk of MI than those without such
a family history.

Fig. 4.4 The odds ratio: an example

Case-Control and Cohort Designs: Advantages Versus Disadvantages

As with any scientific study design, each of these designs has distinct
advantages and disadvantages. Below, we provide a concise listing of some of
the important pros and cons of case-control and cohort designs, as
identified by Schlesselman [7].

Cohort Studies

Advantages
• Allow complete information on the subject's exposure, including quality
  control of data, and experience thereafter.
• Provide a clear temporal sequence of exposure and disease.
• Afford an opportunity to study multiple outcomes related to a specific
  exposure.
• Permit calculation of incidence rates (absolute risk) as well as relative
  risk.
• Enable the study of relatively rare exposures.
• Methodology and results are easily understood by non-epidemiologists.

Disadvantages
• Not suited for the study of rare diseases because a large number of
  subjects is required.
• Not suitable when the time between exposure and disease manifestation is
  very long, although this can be overcome in historical cohort studies.
• Exposure patterns, for example, the composition of oral contraceptives,
  may change during the course of the study and make the results irrelevant.
• Maintaining high rates of follow-up can be difficult.
• Expensive to carry out because a large number of subjects usually is
  required.
• Baseline data may be sparse as the large number of subjects often required
  for these studies does not allow for long interviews.

Case-Control Studies

Advantages
• Permit the study of rare diseases.
• Permit the study of diseases with long latency between exposure and
  manifestation.
• Can be launched and conducted over relatively short time periods.
• Relatively inexpensive as compared to cohort studies.
• Can study multiple potential causes of disease.

Disadvantages
• Information on exposure and past history primarily is based on interview
  and may be subject to recall bias.
• Validation of information on exposure is difficult, or incomplete, or even
  impossible.
• By definition, concerned with one disease only.
• Cannot usually provide information on incidence rates of disease.
• Generally incomplete control of extraneous variables.
• Choice of appropriate control group may be difficult.
• Methodology may be hard to comprehend for non-epidemiologists, and correct
  interpretation of results may be difficult.

Cross-Sectional Studies

The question addressed by a cross-sectional study is similar to that
addressed by case-control or cohort studies: Is there an association between
a particular factor and a disease or other event? The essential difference
is that in a cross-sectional study, both the disease and exposure are
assessed at the same time, with no attention to the timing of the exposure
relative to the disease in question. For example, suppose we wanted to know
whether artificial sweeteners were a risk factor for diabetes (type II). We
could distribute a questionnaire to some large group of subjects, perhaps by
direct mail. The questionnaire would ask whether the subject has had a
diagnosis of type II diabetes and whether the subject consumes artificial
sweeteners. Such a study would provide an estimate of prevalence of both
diabetes and of artificial sweetener consumption in the targeted population.
However, if the ultimate objective is to know whether artificial sweeteners
might have some causal role in diabetes, the data collected via this study
design would not shed any light on this question because (given the way the
study was conducted) it would not be known whether the sweetener exposure
came before or after the diagnosis of diabetes. Obviously, to be implicated
in a causal process, the exposure would have had to occur prior to the
disease. (This would be a necessary but not sufficient condition for
causality [see below].)

Thus, one of the disadvantages of a cross-sectional study is that a causal
(or suggested causal) association cannot be determined. Another disadvantage
is that rare diseases are difficult to study since a very large number of
subjects would be needed to yield a sufficient number of diseased
individuals (likewise, if the risk factor were rare). Despite these
important drawbacks, cross-sectional designs usually are quicker and less
expensive to conduct than case-control or cohort studies since no follow-up
is needed. Another advantage of the cross-sectional study is that it can
provide some evidence suggesting an association between exposure and disease
and, thus, help in designing a more formalized cohort or case-control study.

The Question of Causality

In most studies of risk factors and the occurrence of disease, the ultimate
goal is to determine if exposure (E) to the risk factor causes the disease
(D) in question. In experimental studies (e.g., laboratory experiments with
animals or randomized clinical trials in humans), establishing causality is
more straightforward than in observational studies, such as case-control or
cohort studies. This is because in the experimental situation, many
confounding variables can be controlled by the experimenter or by
randomization, and, therefore, it becomes a more direct method for
establishing causality.

In the observational study, association between E and D can be readily
established, but there is no direct method to prove causality. However,
epidemiologists [7, 11] have provided a set of guidelines for determining
whether there is a causal association between E and D. These guidelines
state that,
in order to establish causality, all five of the following criteria must be
satisfied:

1. Temporal association. If causation is to hold, then exposure must precede
   the disease. Sometimes, the time sequence of E and D may be difficult to
   determine, but this criterion of temporal association is certainly a
   necessary condition.
2. Consistency of association. Loosely translated, this means that different
   studies of the same risk factor–disease question result in similar, or
   consistent, results. If results among several similar studies were
   discordant, this would weaken the causality hypothesis.
3. Strength of association. The greater the value of the relative risk or
   odds ratio, the less likely the association is spurious, lending evidence
   toward the causality hypothesis.
4. Dose-response relationship. If it can be shown that the risk of disease
   increases as the dose of the risk factor increases, this makes causality
   more plausible.
5. Biological plausibility. While satisfaction of the above criteria is
   important, causality ultimately will be more believable if there is some
   acceptable biological explanation as to why such causal association might
   exist.

In summary, it is not possible to directly prove a causal hypothesis using
case-control or cohort study designs. However, the causal hypothesis becomes
much more tenable if the above five criteria can be established for the
problem at hand.
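
Because the strength-of-association criterion rests on the size of the odds
ratio, it may help to see the OR = (ad)/(bc) calculation written out. The
Python sketch below is illustrative only: the function name odds_ratio and
the cell counts are hypothetical, and the step of adding 0.5 to every cell
when a zero is present is a commonly used convention that is assumed here.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2 x 2 case-control table.

    a = exposed cases,   b = exposed controls,
    c = unexposed cases, d = unexposed controls.
    """
    # Assumed convention: if any cell is zero, add 0.5 to all four cells
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_cases = a / c      # O1: odds of exposure among cases
    odds_controls = b / d   # O2: odds of exposure among controls
    return odds_cases / odds_controls  # equals (a * d) / (b * c)


# Hypothetical counts, not taken from Fig. 4.4
example_or = odds_ratio(a=30, b=20, c=70, d=80)
print(f"OR = {example_or:.2f}")
```

With these made-up counts the odds of exposure are 30/70 among cases and
20/80 among controls, so the ratio is greater than 1; whether such a value
reflects a causal effect still depends on the other four criteria.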

Take-Home Points

The use of a proper study design is essential to the investigation of risk factors for disease or other outcomes.
Observational studies are useful in studying risk factors for disease or clinical outcomes.
Cohort and case-control study designs are the most common strategies used in observational research, with cross-sectional studies playing a less important role.
The choice between utilizing a cohort or case-control design depends upon several factors including disease prevalence and/or incidence, data availability and quality, and time required for follow-up.
Confounding is a potentially serious problem that can affect the interpretation of either a cohort or a case-control study.
Matching is a method used to reduce the effects of confounding.
The degree of risk is quantified by the relative risk for cohort studies and the odds ratio for case-control studies.
There are numerous sources of bias that can affect the interpretation of observational studies.
In general, causality cannot be directly proven in observational studies, but certain criteria can suggest a causal hypothesis.
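The two risk measures named above can be sketched for a standard 2 x 2 table. This is a minimal illustration under stated assumptions: the cell labels a, b, c, d, the function names, and the counts are all hypothetical and chosen only to show the arithmetic.

```python
def relative_risk(a, b, c, d):
    """Relative risk, as used in cohort studies.

    a = exposed with disease,   b = exposed without disease
    c = unexposed with disease, d = unexposed without disease
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed

def odds_ratio(a, b, c, d):
    """Odds ratio, as used in case-control studies (cross-product ratio)."""
    return (a * d) / (b * c)

# Hypothetical counts (invented for illustration)
a, b, c, d = 30, 970, 10, 990

rr = relative_risk(a, b, c, d)
odds = odds_ratio(a, b, c, d)
print(f"Relative risk: {rr:.2f}")
print(f"Odds ratio:    {odds:.2f}")
```

With these invented counts the disease is rare in both groups, so the odds ratio lies close to the relative risk; the two measures diverge as the disease becomes more common.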

References

1. Manson JE, Nathan DM, Krolewsky AS, Stampfer MJ, Willett WC, Hennekens CH. A prospective study of exercise and incidence of diabetes among US male physicians. JAMA. 1992;268:63-7.
2. Colditz GA, Manson JE, Hankinson SE. The nurses' health study: contribution to the understanding of health among women. J Womens Health. 1997;6:49-62.
3. Hennekens CH, Buring JE. Epidemiology in medicine. Boston: Little, Brown; 1987.
4. Kleinbaum D, Kupper L, Morgenstern H. Epidemiologic research: principles and quantitative methods. Belmont: Lifetime Learning; 1982.
5. Agresti A. Categorical data analysis. 2nd ed. Hoboken: Wiley; 2002.
6. DeAngelis C. An introduction to clinical research. New York: Oxford University Press; 1990.
7. Schlesselman JJ. Case-control studies. New York: Oxford University Press; 1982.
8. Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat. 1985;39:33-8.
9. Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc. 1991;79:516-24.
10. Swan SH, Shaw GM, Schulman J. Reporting and selection bias in case-control studies of congenital malformations. Epidemiology. 1992;3:356-63.
11. MacMahon B, Pugh TF. Epidemiology: principles and methods. Boston: Little, Brown and Company; 1970.
5 Fundamental Issues in Evaluating the Impact of Interventions: Sources and Control of Bias

Phyllis G. Supino

P.G. Supino, EdD
Department of Medicine, College of Medicine, SUNY Downstate Medical Center, 450 Clarkson Avenue, 1199, Brooklyn, NY 11203, USA

Introduction

The ability to draw valid inferences from data is the cornerstone of research and provides the basis for understanding the new knowledge that research results represent. In clinical research and, most importantly, in trials of therapy, such inferences determine whether the findings have any value in the real world. This chapter will review potential threats to validity of data-based inferences that may result from specific study design elements in assessment of purposively applied interventions and will present critical analyses of several published examples. It draws heavily on the seminal work of Donald T. Campbell and Julian C. Stanley [1] whose analysis, originally developed for the social sciences, provides a cogent theoretical framework for understanding the logical structure, strengths, and weaknesses of alternative study designs.

Potential Threats to Validity

In its broadest sense, validity is defined as the best available approximation to the truth or falsity of a proposition [2]. In scientific inquiry, validity refers to whether assertions made in a research study, including those about cause and effect, are likely to be true. Campbell and Stanley argued that two different types of validity, internal and external (described below), must be considered when evaluating the legitimacy of conclusions drawn from an interventional study. Both forms of validity are protected or jeopardized (threatened) by the choice of study design and related methodological issues.

Threats to Internal Validity
Internal validity refers to the extent to which evaluators of the research are confident that a manipulated independent variable accounts for changes in a dependent variable. It is the indispensable element for interpreting the experiment. The independent variable is the treatment (e.g., drug, surgery) that is applied to study subjects; the dependent variable is the observed outcome (or response). To draw internally valid conclusions from an interventional study, dependence of outcome on treatment must be clearly apparent; other, potentially confounding factors must not be plausibly responsible for outcomes, or their impact must be definitively determinable so that the effect of the intervention can be unambiguously assessed. In other words, demonstration of an association between intervention and outcome, as in an observational study, would be inadequate; cause and effect must be inferable. Thus, the study design must effectively control for competing explanations (i.e., rival hypotheses) for the findings. For

P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators, 79
DOI 10.1007/978-1-4614-3360-6_5, Phyllis G. Supino and Jeffrey S. Borer 2012

the clinician, this would be equivalent to the logic underlying the protocols for "ruling out" myocardial infarction in the setting of chest pain. Campbell and Stanley identified eight factors that may threaten the internal validity of an interventional study. They referred to these as internal validity threats because they can provide competing explanations for observed outcomes and, thus, obscure true causal linkages. It is incumbent on a good investigator to use study designs devoid of these potential internal validity threats insofar as is possible.
1. Selection Bias. Selection bias is the improper assignment (allocation) of subjects for comparison. It is one of the most commonly recognized threats to the internal validity of an interventional study. An investigator may inadvertently contribute to this bias by nonrigorous matching (or failed randomization) techniques, or by choosing subjects for the experimental treatment who are believed to be most likely to benefit from it (a form of referral bias). For example, in a trial comparing surgery with medical treatment, those with the most favorable clinical profile might be assigned (referred) to the surgical group (based on presumed benefit), while the less robust patients might be assigned to the medically treated group. This approach is almost always optimistically biased in favor of the surgical group, which is why it is so difficult to form confident conclusions from trials conducted in this manner. It is equally incorrect to allow subjects to self-select their treatment assignments because volunteers for experimental treatments have been shown in various studies [3-5] to be different from the total ambient population in terms of personality (e.g., risk tolerance, decisiveness, action orientation), severity of disease or symptoms, and race, among other variables. These characteristics could skew associated outcomes in any direction (though it is generally thought that the direction of the bias induced by self-selection bias, like referral bias, is in favor of the experimental treatment).
When groups to be compared are not equivalent initially for these or for any other reason, observed differences on outcome measures among the groups may be due to (or at least strongly influenced by) these baseline differences rather than to the intervention. Selection bias sometimes can be neutralized after data collection through statistical processes. However, the best strategy is to preclude the problem by using an appropriate study design to maximize the comparability of the compared groups prior to intervention.
2. History Effects. History effects are caused by events not related to, or anticipated by, the research protocol that occur during the study and influence outcomes. History effects potentially threaten internal validity when a study is performed in a less than isolated setting, particularly when effects on the dependent variable are assessed before and after the intervention and the temporal interval separating these assessments is relatively long. When history effects occur, measured outcomes may partially or completely reflect the outside event and not the intervention. History effects can be caused by factors such as unintended procedural or environmental changes in the experimental setting, changes in the social climate that can influence attitudes, media campaigns that can increase general knowledge, and newsworthy events relevant to the altered health concerns of subjects in the study. As an example of the latter, if an investigator was evaluating the impact of a breast cancer awareness program to promote increased use of mammography and a well-known public figure was diagnosed with breast cancer, it would be difficult to determine whether the ensuing increased use of mammography was due to the program or to the media attention surrounding the public figure's diagnosis. In the clinical setting, history effects can be induced by changes in routine care (e.g., introduction of a new medication or other treatment, alterations in patient management, variations in patient reimbursement rules) that could impact study outcomes. The effects of history are best minimized by closely monitoring

to ensure that ancillary factors not directly integral to the intervention remain equivalent for all compared groups for the duration of the study. History effects also can be minimized by using contemporaneous (parallel) control group designs, where comparators would have equal likelihood of exposure to significant external events extraneous to the experimental setting.
3. Maturation Effects. Maturation effects are due to dynamic processes within subjects that may change with time and are independent of the intervention (e.g., growing older, progression or regression of illness). Like history, maturation may threaten internal validity when analysis of outcome depends on comparison of pre- and post-intervention measures. It is a particular concern when studies extend over long periods of time (longitudinal studies) during which biological alterations naturally can be expected and, thus, may affect outcomes. The effects of maturation, like selection bias and history effects, are minimized in parallel designs by selecting comparison groups likely to have similar developmental patterns.
4. Testing Effects. Testing effects are the influences of taking a test, being measured, or otherwise being observed, on the results of subsequent testing, measurement, or observation. Testing effects may occur whenever the testing process is itself a stimulus to change, even in the absence of a treatment. Examples are the act of being weighed during a weight-reduction program, or requiring patients receiving nicotine substitutes to document and periodically report the number of cigarettes they have smoked. In these cases, assuming the subjects are aware of the results of testing, the process of being measured may cause subjects to undertake lifestyle changes that will affect outcome independently of the intervention. Testing effects are potential concerns when measurement assesses knowledge, attitudes, behaviors, and (especially) skills, because the testing itself can provide an opportunity for altering subsequent results through practice or learning. The threat to internal validity can be minimized by using alternate forms of measurement for testing before and after intervention, or by eliminating pre- and post-intervention comparisons from the data analysis plan. Of course, as is true in virtually all interventional research, the latter approach requires demonstration of equivalence of the compared groups before the intervention is applied (i.e., at baseline, the pre-intervention period, or control condition).
5. Instrumentation Effects. Instrumentation effects (also known as "instrument decay" or "instrument drift") are caused by changing measurement instruments or observers during the course of a study, or by intra-study changes in the original instruments or observers, that may cause systematic error (bias) in measuring the outcome variable. If the error entails consistent overprediction versus baseline, the bias is said to be positive; consistent underprediction is a negative bias [6]. For example, if alternate versions of a test instrument are used before and after an intervention to reduce testing effects, any observed changes may be due to differences in difficulty level (e.g., easier posttests in studies assessing educational impact) or other systematic variations in the alternative instruments, rather than to the intervention. To avoid instrument effects when alternate forms of measurement are employed, they should be previously evaluated to assure equivalence. Parallel problems can occur when observers are changed during the course of study since new observers may use different criteria for scoring and interpreting data than the original observers. Instrumentation effects also can occur when the same instrument (or observer) is used throughout the study since instrument calibration may change with time (or observer attitudes/assessment criteria may change with experience).
Like history and maturation, instrumentation effects are a potential threat to internal validity in any longitudinal study involving serial measurements. They are of particular

concern when subjective measures (e.g., interviews or questionnaires) are used; in this situation, care must be taken to assure that instruments have demonstrated high reliability (internal consistency) to ensure stability. However, whether objective or subjective measures are used, observers may alter their interpretation of data as they grow more proficient or fatigued. Thus, instrumentation effects also can be minimized through development of standardized data collection protocols so that any fluctuations in measurement will occur randomly rather than systematically (or, when comparing treatments, by using the same observers across treatment conditions [counterbalancing] to avoid confounding).
6. Statistical Regression. Statistical regression is the tendency of individuals who scored extremely high or low on initial testing to score closer to the previously established population mean on subsequent retesting, independent of the intervention. This is one of the most often overlooked threats to internal validity, even among investigators who are well trained in statistics. Statistical regression results from measurement error, as extreme or highly deviant scores may arise due to chance. Such deviant scores are less likely to reappear on reevaluation. Regression effects can be minimized by avoiding the selection of a subject pool based on extreme scores, for example, very high blood pressure or low IQ scores. Another useful strategy to avoid regression effects is to obtain multiple measurements on each patient at several different appropriate times prior to intervention, or several measurements at the protocol-mandated baseline and time after intervention, which may then be averaged to optimize reliability of the estimate.
7. Experimental Mortality. Experimental mortality (or attrition bias) is caused by the loss of subjects from a study who were originally included at baseline. Because subjects who withdraw may have different attributes than those who remain, their withdrawal may bias pre- to post-intervention comparisons, especially if these attributes are related to the outcome. Experimental mortality can bias outcome even for post-interventional comparisons if dropout is due to some characteristic of an intervention that is not related to the mechanism underlying its presumed efficacy. When comparison groups are used in an experimental design, a mortality bias also is introduced if the subjects lost to follow-up differ diagnostically among these groups. For example, a psychiatrist might wish to follow two groups of psychotic patients, one of which had been given an innovative treatment (the experimental group) while the other had been managed traditionally (the control group) to determine whether the intervention decreased return visits to his/her practice. If more severely ill patients were lost to follow-up in the intervention group than in the control group, the investigator might falsely conclude that reductions in return visits among the intervention group were attributable to the innovative treatment when, in fact, they may have occurred merely as a result of differences in attrition rates due to differences in illness severity. Experimental mortality is best minimized by using large groups of subjects who are geographically stable, accessible to investigators (i.e., have working telephone numbers and valid postal or e-mail addresses), and who are interested in participating in the study, and by developing strategies to facilitate follow-up. When subjects are lost, it is prudent to compare their baseline characteristics with those who remain in study to identify potential bias, and to utilize external vital statistics databases (e.g., the National Death Index) to identify and confirm deaths that may not be known to investigators.
8. Interaction of Factors. Sometimes two or more threats to validity can exist concurrently. These may combine to further restrict validity. Two factors that might be expected to combine are selection and maturation. For example, if two groups of patients were not initially equivalent in severity of illness (a selection bias), their illnesses might

progress at different rates (a maturation bias). Thus, one of the two groups might end up sicker, or healthier, than the other, irrespective of any intervention. This threat is best controlled by procedures to minimize individual biases (e.g., randomized allocation to treatment groups).
9. Experimenter Bias. In a perfect world, an investigator involved in a quantitative study would be detached and objective, maintaining a highly circumscribed relationship with the subject. In an interventional study, his or her responsibility is to administer or allocate subjects to a treatment and to impartially measure outcomes and other variables of interest. Experimenter bias, not identified by Campbell and Stanley, occurs when the expectations of the investigator (usually unknowingly and unintentionally) influence the outcome of the study, thereby confounding the results. The profound impact of experimenter bias on internal validity was demonstrated by Rosenthal (1964) in his seminal studies of expectancy on experimenter judgment and learning outcomes conducted during the mid-1960s [7]. The experimenter's expectations typically arise from deeply seated views about his or her study hypothesis and can impact the study in a number of ways. For example, the investigator could subtly communicate expectations (cues) to participants about anticipated outcomes and influence them through the power of suggestion. The investigator could provide extra attention or care to subjects that is outside of the intervention (the latter is also termed "performance bias" when systematically done for members of only one of the comparison groups or "compensatory treatment bias" when specifically applied to controls). The investigator also can bias the study through improper ascertainment or verification of outcomes, for example, by searching more diligently for adverse events in patients with versus without hypothesized risk factors ("detection bias") or by assigning a more favorable rating on a subjective scale to subjects in the experimental versus control arm (a form of instrumentation bias). Experimenter bias is best controlled by techniques that blind both the investigator and the subject to the latter's treatment assignment, by the use of observers from whom the purpose of the study is withheld, and by standardization of the methodology of outcome assessment to ensure that subjects in the control group are evaluated as thoroughly and as frequently as those receiving the intervention.
10. Subject Expectancy Effects. The subject expectancy effect (also termed "nonspecific effects"), also not identified by Campbell and Stanley, is a cognitive bias that arises when a subject anticipates an outcome (positive or negative) from an intervention, and reports a response to the intervention that is premised on this belief. This is the basis of the "placebo effect," long recognized in clinical medicine. It occurs when a patient responds positively to an inactive intervention (e.g., a pharmacologically inert pill) and appears to improve subjectively and even, occasionally, objectively. This effect on outcome is due to the patient's belief that the intervention is curative. It may be stimulated or reinforced by suggestion of therapeutic benefit by an authority figure (e.g., physician or other investigator, as noted above under "Experimenter Bias") and/or by the subject's inherent desire to please him or her. Indeed, the term placebo is derived from the Latin, "I will please." An opposite phenomenon is the "nocebo" (Latin for, "I will harm") effect which occurs when a subject reports negative responses to administration of an inert intervention due to his/her pessimistic expectation that it would produce harmful or unpleasant consequences. Although the magnitude of these subject expectancy effects is variable and somewhat controversial, there is general consensus that they can impact the validity of any study in which the subject is aware of receiving a treatment for which the outcome is subjective (e.g., studies involving pain control or symptom relief). As with experimenter bias, subject expectancy is best

controlled by utilizing study designs that blind the subject to his/her treatment assignment. For some types of interventions such as those involving lifestyle changes (e.g., dietary alterations, smoking cessation) or surgical studies, subject blinding may be difficult, if not impossible. (This is also true for those conducting these interventions and other members of the investigational team.) In these instances, blinded assessment of outcomes by external adjudicators could reduce, if not eliminate, expectancy biases. However, in many biomedical studies (e.g., those evaluating the effects of pharmacological agents), subjects (and investigators) can be blinded to treatment assignments through the use of placebos. The incorporation of placebos enables determination of treatment effects above and beyond those arising from subject (or investigator) expectancy. Obviously, placebos work best when they closely approximate the physical characteristics of the active intervention. (This problem is avoided in early phase I clinical trials of therapeutics where both placebo and active drug may be administered intravenously, or when the investigational intervention does not cause characteristic physiological effects that might unmask the treatment assignment.) When the treatment assignment is known to both subject and investigator, it is said to be "unblinded" (or "open"); when only the subject or the investigator (but not both) is unaware of the treatment assignment, the study is said to be "single blinded"; when treatment assignment is unknown both to subject and investigator, the study is said to be "double blinded"; and when it is unknown to the subject, investigator, and others analyzing or monitoring the data, the study is said to be "triple blinded."

Threats to External Validity
External validity refers to generalizability, that is, can the study findings be extrapolated to subjects, contexts, and times other than those for which the findings were obtained? Internal validity is a prerequisite for external validity. However, external validity is not assured even when internal validity has been established. In fact, the rigorous controls required to establish internal validity may inadvertently compromise a study's generalizability. The investigator must use a variety of strategies to strike a delicate balance between both concerns, if the study is to be both accurate (internally valid) and have practical utility (be externally valid). The four most common threats to external validity, identified in the seminal works of Campbell and Stanley, are given below.
1. Reactive Effects of Testing. The reactive effects of testing involve sensitization (or desensitization) of study subjects to interventions caused by the pre-intervention testing that might not be undertaken in the general, nonstudy population. This threat to external validity is most often encountered when pretests are obtrusive and/or outside of the normal experience of the subject. For example, to study the effects of a new nutrition program, an investigator might assess baseline knowledge of food groups and portion control, for the purpose of comparing pre- to post-intervention changes. If the pretest had focused attention on the intervention, any treatment effects that were observed might not be replicable if the pretest was not given. To diminish this bias, the investigator should minimize or, ideally, dispense with the use of pretests. However, as with its internal validity analog (testing effects), this approach is valid only when there is reasonable certainty that the comparison groups are equivalent at baseline. Alternatively, the investigator could opt to use the least obtrusive pre-intervention assessments to minimize reactivity. Special research designs (e.g., the Solomon four-square design), in which pretests are given to some but not all study subjects, can be used to determine the reactive effects of testing on study outcomes.
2. Interactive Effects of Selection and Treatment. Sometimes two investigators will run similar studies and obtain different findings. One possible cause of this outcome is the interactive effects of selection and treatment (or "selection-treatment interaction"). The interactive effects of selection and treatment are the

presumed basis of the failure of results found in an intervention study to be generalizable to other subjects to whom that intervention is applied. This failure occurs because the study was conducted on a sample that was not representative of the larger population to which results should be extrapolated. The selection-treatment interaction frequently is seen in clinical research when research subjects are scarce (a common situation) and the investigator is limited to those who present themselves and are willing to participate. In these situations, study subjects typically are selected by convenience, rather than by population-based sampling. A convenience sample includes all, or a portion, of patients who are being seen in a practice, hospital, or clinic, provided they meet the inclusion criteria of the study, and consent to participate. If the subjects selected for the study are, for example, healthier, wealthier, or wiser than the general population, or if they come from a unique geographic area, they may benefit more or less from a treatment, and it may not be possible to replicate the study, or to extrapolate its results to the larger population of interest. In theory, the interactive effects of selection and treatment are best controlled by random selection of subjects from the target population. Because this seldom is possible in clinical research (especially in randomized clinical trials [RCTs] in which strict inclusion/exclusion criteria and possibility of a subject's receiving a placebo sharply narrow the pool of study-eligible patients), the investigator should endeavor to select subjects who have characteristics similar to those to which he or she wishes to extrapolate results. Multicenter studies, drawing from diverse demographic populations, tend to suffer less than single-center studies from this external validity threat. Nonetheless, even small, single-center studies have value provided the investigator identifies and reports potential biases in his or her selection plan and is also careful to limit generalizations to appropriate populations.
3. Reactive Effects of Experimental Arrangements. This validity threat is defined as aberrant behavior exhibited by subjects that results solely as a consequence of their participation in an experiment, and that may not occur outside the experimental setting. The reactive effects of experimental arrangements are often confused with the placebo effect. Although there are cognitive components inherent in both validity threats, the primary difference is that with the reactive effects of experimental arrangements, the subject's bias is based on the idiosyncrasies of the research environment, whereas with the placebo effect, the subject's bias is based on expectations about the treatment (that may or may not be part of a research study). The reactive effects of experimental arrangements were serendipitously discovered in a series of trials evaluating the impact of the work environment on employee productivity, conducted by Harvard University researchers between 1924 and 1932 at the Hawthorne Works, a factory plant of the Western Electric Company in Cicero, Illinois. The initial studies ("illumination experiments") varied the level of light intensity to which employees were exposed. When the light intensity increased, worker output (and positive affect) improved but, much to the investigators' surprise, worker performance also improved when lighting intensity was diminished. The same pattern emerged when other environmental factors were manipulated. These unintended outcomes (also known as the "Hawthorne effect") [8] led the researchers to conclude that the mere act of being studied changed the participants' behavior (i.e., brought about a pseudo-treatment effect), confounding inferences about effects of the various interventions imposed upon them. Underlying mechanisms proposed to explain these findings include unintended special attention and benefits that may have been given to subjects by observers, uncontrolled novelty due to the artificiality of the experimental arrangements, and inadvertent responses to subjects from observers leading to learning effects that positively impacted performance. While there is no consensus as to the cause, the reactive effects of experiments

currently are recognized as a potential threat eliminate the effects of the prior exposure.
both to external and internal validity in Under these conditions, it will be difficult to
research from various disciplines (e.g., medi- determine how much of the ultimate treatment
cine, education, psychology, and management outcome was attributable to the first treatment
science). Their impact is potentially problem- and how much was due to the second, thus
atic in any situation in which there is human limiting the applicability of the study findings
awareness of participation in a study and in to the real world in which patterns of treat-
which study outcomes can be motivated by ment availability may not mirror those of
that knowledge. A related threat to validity study. Multiple treatment interference is very
that is caused by experimental arrangements is difficult to eradicate. It is best controlled by
known as the John Henry effect [9]. This avoiding the use of within-subject designs.
may occur when subjects in the control group, Where this is not possible, the investigator
being aware of their treatment assignment, must carefully counterbalance or randomly
view themselves as competing with subjects order treatments across subjects and provide
in the intervention group and change their appropriate washout periods.
behavior (i.e., try harder) in an attempt to out-
perform them.
Whenever possible, the investigator should Elements of the Research Design
take steps to reduce the reactive effects of
experimental arrangements to increase the In analyzing the anatomy of a study to evaluate
likelihood that the ndings from a study will the impact of an intervention, it can be very help-
be replicated beyond the experimental con- ful to employ shorthand that displays the major
text. Methodological options for achieving elements of the design, the sequence of events,
this objective include (1) minimizing the and certain of the constraints within the design.
obtrusiveness of experimental manipulations This shorthand, based largely on the notation
and measurements, (2) blinding subjects to developed by Campbell and Stanley, will be used
their treatment assignment (to control for in the remainder of this chapter to examine the
John Henry effects), and (3) providing strengths and weaknesses of ten alternative study
equivalent attention to intervention and con- designs.
trol groups, especially in studies involving The symbol X denotes the intervention (pri-
psychological, behavioral, and educational mary treatment or independent variable) that
outcomes. To accomplish this, investigators is applied to the subjects in the study. When
may include a Hawthorne control group that more than one level of a treatment is included
receives an irrelevant intervention to equalize in a design, they are labeled X0 (control), X1,
subject contact with project staff. X2, and so on; XP indicates that a placebo has
4. Multiple Treatment Interference. A fourth been given to control subjects (in designs
threat to the external validity of an interven- incorporating parallel treatment arms) or dur-
tion study is multiple treatment interference, ing the control condition (in time-series or
dened as the inuence of one treatment on crossover design) to control for expectancy.
another, which may produce results that would Y indicates that a secondary treatment has been
not be found if either were applied alone. coadministered, concomitant with the primary
Multiple treatment interference is a potential treatment. Variations in levels of the secondary
problem in any study in which more than one treatment, if any, may be distinguished by sub-
treatment (or treatment level) is given to, and scripts in a similar manner as for X. Absence
formally evaluated in, the same subject. The of Y indicates absence of co-treatment.
threat applies even when the treatments are O is the observation (or measurement of the
given in sequence because treatment effects dependent variable) in the study. O may repre-
may carry over and it may not be possible to sent a test result, a record, or other data; when
5 Fundamental Issues in Evaluating the Impact of Interventions: Sources and Control of Bias 87

more than one observation is involved over time, they are variously labeled as O1,
O2, etc., to distinguish them.
An arrow represents the experimental order (sequence of events during the study
period).
A dashed line indicates that intact groups (e.g., hospitals, clinics, or wards) have
been compared (in other words, that subjects have not been allocated to treatment on
a random basis).
R indicates that study subjects have been allocated to treatment groups on a random
basis. (Thus, a dashed line and R generally will not appear in the same design as
these represent alternative methods of subject allocation to treatment.)

Alternative Research Designs

Several alternative research designs have been used to evaluate the effects of an
intervention on some specified outcome. Each of these differs according to its
adequacy in ensuring that valid inferences are made about the effects and
generalizability of an intervention.

Pre-experimental Research Designs

The literature regrettably includes many studies that use designs which fail to
control for most threats to internal validity. These are most properly termed
"pre-experimental designs" because they contain only a few of the essential
structural elements needed to draw unambiguous inferences about the impact of an
intervention. They are presented below to heighten the reader's awareness of their
glaring deficiencies. The three most common are the following:
1. The one-shot case study
2. The pretest-posttest only design
3. The static-group comparison

Pre-Experimental Research Design # 1:
The One-Shot Case Study
X → O

Some studies in medicine utilize a design in which a single patient (or series of
patients) is studied only once, following the administration of an intervention. No
pre- to post-intervention comparisons are made, and no concurrent control groups are
used. Instead, inferences about causality are predicated on expectations of what
would have been observed in the absence of the intervention, usually based on
implicit comparison with past information. This most rudimentary pre-experimental
design is termed the one-shot case study and is diagrammed as follows: X for the
intervention, followed by an arrow, and O for the observation. Consider an example
from the literature by R.F. Visser, published in the journal Clinical Cardiology [10]
(summary and design structure are given in Fig. 5.1).

Fig. 5.1 Example of a one-shot case study
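
The shorthand introduced above can also be written down as a small data structure.
The following is only a minimal sketch under stated assumptions: the names Arm,
render_arm, and render_design are hypothetical, the ASCII "->" stands in for the
arrow, and a row of hyphens stands in for the dashed line that separates intact
groups. It renders the three pre-experimental designs listed above from their
symbols.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Arm:
    """One row of a design diagram: the ordered events for one group."""
    events: List[str]          # e.g. ["O1", "X", "O2"]
    randomized: bool = False   # True prefixes the row with R


def render_arm(arm: Arm) -> str:
    """Join the symbols of one group with arrows showing experimental order."""
    body = " -> ".join(arm.events)
    return f"R {body}" if arm.randomized else body


def render_design(arms: List[Arm]) -> str:
    """Render every group; intact (non-randomized) groups get a dashed separator."""
    rows = [render_arm(arm) for arm in arms]
    if len(rows) > 1 and not any(arm.randomized for arm in arms):
        dashed = "-" * max(len(row) for row in rows)
        return f"\n{dashed}\n".join(rows)
    return "\n".join(rows)


# The three pre-experimental designs named in the text (illustrative encoding).
one_shot = [Arm(["X", "O"])]
one_group_pretest_posttest = [Arm(["O1", "X", "O2"])]
static_group = [Arm(["X", "O"]), Arm(["O"])]

for name, design in [
    ("One-shot case study", one_shot),
    ("One-group pretest-posttest only design", one_group_pretest_posttest),
    ("Static-group comparison", static_group),
]:
    print(name)
    print(render_design(design))
    print()
```

Writing a design this way makes its missing elements easy to see: the one-shot case
study has neither a second observation nor a second group to compare against.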

In this study, the X represents the anistreplase, and the O represents the patency of
the infarct-related vessels, as measured by TIMI criteria for perfusion. The authors
contend that the X probably caused O, but have they presented convincing evidence of
that association and protected the internal validity of their study?
The answer is that studies such as this have almost no value for determining cause
and effect because, as Campbell and Stanley have noted, securing evidence of this
nature involves, at minimum, making one direct comparison. Although the authors
allude to the results of previous studies of patency following an AMI, no explicit
data are presented against which patency in this investigation is compared; the
absence of such control is even more striking for reocclusion rates. Even if data
from historical controls were given, there is no assurance that previous patient
characteristics and ancillary medical management were equivalent; in fact, they
usually are not, due to differences in the health of a given population and
alterations in medical technology over time. In addition, while standardized
methodology (TIMI criteria) was used to determine initial patency and reocclusion,
those reading the angiograms were aware of (and may have been influenced by) the
intervention. Thus, history, maturation, selection, experimental mortality, and
expectancy (experimenter bias) potentially threaten the internal validity of this
study because each could explain the outcome. Furthermore, the external validity of
this study also is threatened by the interaction of selection and treatment (due to
small numbers of highly selected patients who may not be representative of the
general population of patients with AMI), as well as by multiple treatment
interference (note: heparin also was given to all subjects). As noted earlier,
importance usually is not attached to the generalizability of a study that cannot be
shown to be internally valid.

Pre-Experimental Research Design # 2
The One-group Pretest-Posttest Only Design
O1 → X → O2

The one-group pretest-posttest only design represents a very slight improvement over
the one-shot case study; it is a second pre-experimental design which also is
commonly found in the medical literature. This design differs from the one-shot case
study in that it collects baseline observations on study subjects that can be
compared to observations made after the intervention. (The terms "pretest" and
"posttest" are used generically in this chapter to refer to assessments of the
dependent variable made, respectively, before and after the intervention.) Because
study subjects are observed under more than one treatment condition, the one-group
pretest-posttest study is considered one of the simplest versions of repeated
measurement designs, described later in this chapter. Like the one-shot case study,
this design contains no parallel comparison groups and is diagrammed as an O1, for
the pre-intervention observation; followed by an X, for the intervention; and
followed by O2, denoting the post-intervention observation, with arrows between. An
example of a study employing this design was published by Wender and Reimmer in the
American Journal of Psychiatry [11] (summary and design structure are given in
Fig. 5.2).
In this study, O1 represents the baseline attention deficit hyperactivity disorder
(ADHD) score, X is the bupropion treatment, and O2 is the post-treatment ADHD score
(Fig. 5.2). In the opinion of the authors, the improvement in O2 relative to O1 is
the result of X. Can the authors' primary conclusion withstand scrutiny?
Again, we first consider internal validity. As in any repeated measures design,
study subjects served as their own controls, effectively eliminating the threat of
selection (allocation) bias. However, this design fails completely to control for
the following other factors that also could account for the results. First, history
effects are a potential threat to the internal validity of this study because it is
possible that patients may have experienced an event external to the intervention,
and that this event, not the intervention, may have improved their ability to focus.
A second potential threat is maturation because, as in any longitudinal study, the
conditions of the study subjects may have changed on their own. Yet another threat
to internal validity is testing, since exposure to the pretest may have improved
performance on the posttest. There is also the threat of instrumentation effects as
the tests may not

Fig. 5.2 Example of the one-group pretest-posttest only design

have been well standardized. (Indeed, the authors are silent about the test-retest
reliability of their instruments.) Statistical regression poses another possible
threat, if the study subjects had been chosen on the basis of extremely poor scores
on the initial test. In the final analysis, because so many potential individual
biases are uncontrolled in this study, there is also the strong likelihood that
interaction of these factors could undermine its internal validity and the
conclusions drawn from it. Indeed, Campbell and Stanley argued that this type of
design should be used only when nothing else can be done.
The study also suffers from two threats to external validity. First, very few
subjects were studied, and it is highly unlikely that they were representative of all
patients being treated for ADHD (selection-treatment interaction). Second, the
subjects (as well as their doctors) were unblinded, and subjects may have improved
due to the effects of their participation in the study (reactive effects of
experimental arrangements). These issues are noted only for completeness. As noted
above, this study fails to meet criteria for internal validity; thus, its
generalizability is unimportant.

Pre-Experimental Research Design # 3
The Static-Group Comparison

A third pre-experimental design also found in the literature is the static-group
comparison. This design incorporates two groups: one that receives an intervention
(again denoted as X) and a second that does not receive an intervention and which
serves as a control (denoted by the absence of X). Groups one and two typically are
observed concurrently after the intervention has been applied in one of the groups,
and the observations made in these groups are denoted by the Os. This design includes
no pretesting or baseline measurements. Note that both intervention and control
groups are separated, schematically, by a dashed line to indicate that study subjects
were assigned to treatment as intact groups, that is, they were not randomly
allocated to treatment. A study, published by Bolland et al. in the Journal of the
American Dietetic Association [12], employed a variant of this design which tested
for effects extended over time (summary and design structure are given in Fig. 5.3).
Are these conclusions credible? A review of the structure of this design will be
revealing. In this study, X represents the food quantity estimation intervention, and
the O represents the post-intervention assessments of knowledge of food quantities in
the experimental (trained) and control (untrained) groups, assessed at three
different times among trained subjects. (The reader should note that the use of
deferred assessments is not typical of the static-group comparison design but was
used in this study in an attempt to define persistence of treatment effects.) The
broken line

Fig. 5.3 Example of the static-group comparison design

between the experimental and control groups indicates the intact nature of the
comparison groups, signifying that subject assignment to the intervention or control
comparison group was not random.
The static-group comparison design represents an improvement over the one-shot case
study because the inclusion of a contemporaneous control group permits comparison of
the results of the trained study subjects with the other, untrained study subjects,
evaluated approximately in parallel, thereby avoiding the obvious biases inherent in
the use of external or historical controls (or, in the worst-case scenario, no
controls). Moreover, the fact that study subjects in both groups are being evaluated
in the same way during a relatively short interval decreases the potential for
maturation and instrumentation effects (assuming uniform data collection). Finally,
this design also represents an improvement over the one-group pretest-posttest only
design because the absence of pretesting and subject selection based on extreme
pretest scores obviates the threat of testing effects and statistical regression.
Nonetheless, there are two potential threats to internal validity for which this
design affords absolutely no protection. The first threat is selection (or
allocation) bias. The authors do not tell us how the study subjects were divided into
treatment groups. Was it by instructor preference or self-selection by the study
subjects? Either of these scenarios would be equally flawed because without baseline
(pre-intervention) assessments, there is no way to determine whether the observed
outcomes were due to the training or to pre-intervention differences in the subjects'
knowledge about estimating food quantities. Even if the investigators had attempted
to match the groups on other variables, such matching would be ineffective in
achieving true baseline parity among trained versus untrained subjects, especially if
subjects had, indeed, self-selected participation in the intervention. In addition,
even though the study was relatively short in duration, the validity of the
conclusions, nonetheless, is threatened by the potential for experimental mortality
(attrition bias) as no information is given about whether all subjects who began this
study actually completed it or whether attrition (if it did occur) differed
systematically between the two groups. Thus, even if subjects were comparable on
average before training, the apparent superiority of the
trained group (relative to the untrained group) on the outcome measure possibly could
have been due to several of the less knowledgeable students dropping from the former
group (or, conversely, due to some of the more knowledgeable students dropping from
the latter group) prior to testing.
The primary threat to external validity is the interaction of selection and
treatment. (After all, how representative is one class of introductory nutrition
students of the larger relevant population?) However, since the internal validity of
the study is severely compromised, this threat to external validity has little if any
importance.

True-Experimental Research Designs

The most prominent characteristic of true-experimental designs is random allocation
of study subjects, drawn from a common population, to alternative treatment
conditions. When this approach is employed, participants' baseline characteristics
can be expected to be equally distributed across the various comparisons according to
the laws of probability, especially when sample size is large. Even when
randomization does not result in perfect equivalence, most workers in the field
believe that this form of treatment allocation is the best way to reduce the threat
of selection bias. The theoretical underpinnings of randomized designs can be traced
to Fisher and Mackenzie's agricultural experiments in the 1920s [13]; however, it was
not until the late 1940s that they made their appearance in the medical literature,
when the RCT was first used to demonstrate the efficacy of streptomycin in the
treatment of tuberculosis [14]. Since that time, the RCT has been considered the
standard to be met for clinical research, even though investigations of this type
comprise only a minority of the clinical research ever conducted or published.
Randomization also is important in many preclinical/basic science research protocols,
though other considerations may minimize application of this approach in the
nonclinical setting.
Most commonly, randomization is fixed; less commonly, it is adaptive. With fixed
random allocation, each subject has an equal probability of assignment to the
alternative study arms, and that probability remains constant throughout the study.
The randomization process can be performed by a coin toss or with a table of random
numbers, or special computer software can be used. This type of randomization is
known as simple randomization and works best when sample size is relatively large.
However, when sample size is small, simple randomization may result in statistically
unequal groupings. Under these circumstances, restrictive randomization methods
(e.g., blocked randomized designs or stratified random allocation) can be employed.
With blocked randomization, subjects are assigned to treatment in groups (blocks)
that are similar to one another with regard to a source (or several important
sources) of variability that is (are) not of primary interest to the experimenter
(e.g., a potential confounding variable such as gender or geographic area).
Stratified randomization is performed by conducting separate randomization procedures
within each of two or more subgroups of subjects that are defined according to
prespecified patient characteristics (usually important prognostic risk factors) and
increases the likelihood that allocation to treatment is well balanced within each
stratum. With adaptive methods (a Bayesian approach increasingly used in contemporary
clinical trials) [15], the probability of allocation changes in response to
accumulating information during the study about the composition of, or outcomes
associated with, the alternative treatment arms. (For a comprehensive discussion of
the theory and techniques of adaptive randomization, the reader is referred to Hu and
Rosenberger, 2006 [16].)
As noted, the purpose of randomization is to render the comparison groups as similar
as possible at study entry to permit valid inferences to be drawn about the effects
of an intervention. However, some patients may not initially receive the intended
intervention or, during the course of the study, may drop out or cross over to the
alternate treatment for a variety of reasons. One widely used solution to circumvent
these problems is intention-to-treat (ITT) analysis, which defines the comparison
groups according to initial assigned treatment
rather than to the treatment actually received or completed (i.e., "once randomized,
always analyzed"). Many workers in the field consider ITT analysis to be the "gold
standard" method of analysis for clinical trials [17], describing it as the "least
biased" for drawing inferences about trial results [17, 18], and it is considered the
pivotal analysis by major regulatory bodies in Europe and in the USA for approval of
new therapeutics. However, the reader should note that ITT analysis provides only a
pragmatic estimate of the benefit of a new treatment policy rather than an estimate
of potential benefit in patients who receive treatment exactly as planned; moreover,
full application of this method is possible only when complete outcome data are
available for all randomized subjects [19]. Thus, the ITT approach is not without its
critics [20]. Some clinical trialists argue that efficacy is best demonstrated when
analysis focuses on subjects who actually received the treatment of interest
(sometimes termed "efficacy subset analysis"), arguing that ITT approaches provide an
overly conservative estimate of the magnitude of treatment effects principally due to
dilution of effects by nonadherence. In addition, ITT analysis creates difficulty in
interpretation of findings if numerous participants cross over to opposite treatment
arms. Finally, it is suboptimal for studies of equivalence, generally increasing the
likelihood of erroneously concluding that no difference exists between two test
articles [21]. A common solution is to employ both methods of analysis in the same
study, using ITT and on-treatment approaches as primary and secondary analysis,
respectively.
Four of the most common true-experimental designs found in the biomedical literature
are the following:
1. The pretest-posttest control group design
2. The posttest only control group design
3. The true-experimental 2 × 2 factorial design
4. The crossover study (two-period design)
The first two designs can be used to evaluate the impact of a single intervention
(vs. control or an alternate intervention), and the third and fourth permit the
investigator to examine the separate effects of two interventions (again, vs. control
or an alternate intervention) applied within the same study. All provide much better
protection than do pre-experimental designs against most threats to internal
validity.

True-Experimental Design # 1
The Pretest-Posttest Control Group Design
R   O1   X   O2
R   O3       O4

In the most common form of the pretest-posttest control group design, study subjects
are randomly allocated to two comparison groups or treatment arms. One group receives
the experimental intervention and the second, no intervention, a placebo, or an
alternate intervention. Both groups are observed, in parallel, before and after the
intervention on the same outcome measure(s) to determine whether change varied as a
function of the treatment. The structure of this design is represented symbolically
above: R denotes that subjects have been randomly allocated to the comparison groups;
X denotes that a treatment has been given to the first group; absence of X in the
second group indicates that this is a control group (the control group also could
have been denoted by X0 [or XP if a placebo had been given]). O and its positioning
indicate the observations made in both groups before and after the intervention. An
example of a study incorporating this design was published by Gorbach et al. in the
Journal of the American Dietetic Association [22] (summary and design structure are
given in Fig. 5.4).
The structural representation of this study is a clue to the strength of its internal
validity. Here, X represents the fat reduction dietary intervention; the absence of X
represents no dietary intervention, the control group; O1 and O3 represent baseline
fat intake in the experimental and control groups; O2 and O4 represent
post-intervention fat intake in both groups; R signifies that the study is
randomized.
Because study subjects have been randomly allocated to comparison groups from a
common subject pool, selection bias has been removed as a serious threat to internal
validity, assuming that the randomization was effective. Having baseline measures of
the dependent variable (and other

Fig. 5.4 Example of the pretest-posttest control group design
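
The R in this design stands for random allocation. As a rough illustration of the
fixed allocation schemes described earlier in this chapter, the sketch below
contrasts simple randomization with permuted-block randomization. This is a minimal
sketch under stated assumptions, not the procedure used in any study cited here:
permuted blocks are one common way of restricting randomization to keep arm sizes
balanced and are not necessarily the blocked design the text describes, and the arm
labels, function names, block size, and seed are all hypothetical.

```python
import random
from typing import List

ARMS = ("intervention", "control")  # hypothetical arm labels


def simple_randomization(n_subjects: int, rng: random.Random) -> List[str]:
    """Fixed allocation: every subject has the same 1-in-2 chance of each arm."""
    return [rng.choice(ARMS) for _ in range(n_subjects)]


def permuted_block_randomization(n_subjects: int, block_size: int,
                                 rng: random.Random) -> List[str]:
    """Allocate in shuffled blocks that hold equal numbers of each arm."""
    if block_size <= 0 or block_size % len(ARMS) != 0:
        raise ValueError("block_size must be a positive multiple of the number of arms")
    per_arm = block_size // len(ARMS)
    allocation: List[str] = []
    while len(allocation) < n_subjects:
        block = [arm for arm in ARMS for _ in range(per_arm)]
        rng.shuffle(block)          # random order within the block
        allocation.extend(block)
    return allocation[:n_subjects]


rng = random.Random(2024)  # arbitrary seed so the sketch is reproducible
simple = simple_randomization(12, rng)
blocked = permuted_block_randomization(12, block_size=4, rng=rng)

print("simple :", simple.count("intervention"), "vs", simple.count("control"))
print("blocked:", blocked.count("intervention"), "vs", blocked.count("control"))
```

With 12 subjects and blocks of 4, the blocked allocation always ends with 6 subjects
per arm, whereas the simple allocation may be unbalanced in a sample this small,
which is the concern the chapter raises about simple randomization.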

key variables that potentially could influence it) and comparing them between groups
permits us to confirm or reject this assumption; these comparisons typically are
expressed in tabular form in most published RCTs. History effects are controlled
because if a potentially confounding general external event had occurred, it should
have affected the comparison groups equally since they are studied in parallel;
nonetheless, as noted earlier in this chapter, the investigator must be vigilant and
attempt to control for differences between comparison groups that might occur on a
more "micro" level (i.e., within-group variations in temperature, time of day,
season, etc.). For similar reasons, the use of a parallel design also protects
against the threats of maturation, testing, and instrumentation effects because
natural variations in these factors should impact comparison groups equally;
instrumentation effects also are minimized here because all data were collected using
standardized techniques. In this study, subjects were selected on the basis of high
risk for breast cancer, not on the basis of extremes in pre-intervention fat and
energy intake. However, even if they had been chosen according to the latter
criterion, average regression effects would not confound interpretation of the
results because if they had occurred, they should have been equivalent in the
comparison groups, given that the subjects were randomly allocated from a common
subject pool. Thus, this design also protects against statistical regression.
Finally, while treatment assignment could not be fully blinded (as noted earlier, a
common characteristic of studies evaluating impact of lifestyle interventions) to
entirely eliminate the threat of expectancy effects, the investigators endeavored to
reduce them by standardizing their methodology for outcome ascertainment and by
blind-coding data to ensure that subjects in the control group and those receiving
the intervention were evaluated uniformly and impartially. The one error made in this
study was the use of an incorrect test of statistical significance (i.e., computing
two sets of t-tests, one for the experimental group and one for the control group,
rather than conducting direct statistical comparisons of the changes between the
groups). With this single exception (which Campbell identified as a "wrong statistic
in common use" among investigators employing
these designs [1]), the use of random allocation to parallel treatment groups
afforded by the application of the pretest-posttest parallel group design, coupled
with standardized data collection methodology, protected this study very well from
most factors that could have undermined its internal validity, thus maximizing the
likelihood that the intervention, rather than other factors, was responsible for the
observed outcomes.
However, the external validity of this study is open to question. The reason is that
randomized designs, including this model, may lead to conclusions that, while
internally valid for the study, may not generalize to the reference population for
the following three reasons.
First of all, in this study, pretests were used to assess relative change in fat and
energy intake in the comparison groups. Their use may have sensitized study subjects
to the intervention, with the possibility that results might not generalize when the
intervention is applied without pretesting. This threat to external validity, known
as the interactive effect of testing and treatment and described earlier, is a
potential problem for any pretest-posttest comparison design, randomized or not,
unless the testing itself is considered a component of the intervention being
studied.
Another potential threat to external validity is the interaction of selection and
treatment. Since the purpose of hypothesis testing is to make inferences about the
reference population from which study subjects are drawn, the representativeness of
the study group must be ascertained for the general population of women at high risk
for breast cancer. As noted, the majority of subjects in this study were well
educated, and a quarter had annual household incomes that were relatively high for
the time (1990). It is also relevant that patients were excluded from the study for a
number of reasons including, but not limited to, their unwillingness to sign an
informed consent form, or because they were judged by the study nutritionist to be
potentially unreliable in complying with the study protocol. Unfortunately, as is the
case for many published RCTs, the authors fail to state how many patients were
excluded for these reasons, making it difficult to evaluate the potential adverse
impact of the selection-treatment interaction, which must be considered, even though
hundreds of subjects were enrolled in the trial.
A third potential threat to the external validity is the reactive effects of the
experimental arrangements. Because the intervention was not part of the routine care
of this population and informed consent was required, subjects certainly were aware
of their participation in an experiment. All subjects would have been exposed to the
novelty associated with random allocation techniques and new ways of keeping food
records. Subjects in the intervention group would have been exposed to new
health-care providers (in this study, the nutritionists) and, as a part of such
intervention, may well have received more attention from project personnel than those
told to follow customary diets (i.e., the control group), unless a Hawthorne control
had been built into the study (which it had not). Any of these factors might have led
to changes that were due to reactivity to the experiment (a possibility that is
supported by changes in fat and energy consumption, albeit of a lesser magnitude,
among control group participants), raising the concern that the effects of the
intervention might not be replicated when applied nonexperimentally.

True-Experimental Design # 2
The Posttest Only Control Group Design
R   X   O1
R       O2

The next approach, called a posttest only control group design, again utilizes two
groups: each has been randomly allocated to treatment; as before, one group receives
the intervention, represented by X, and the second group either receives no
intervention, an alternate intervention, or (if it is a drug study) sometimes a
placebo (designated as XP). Both are observed after the intervention only, as shown
by the positioning of O. The major distinction between this design and the preceding
one is that, here, study subjects are not assessed on the dependent (outcome)
variable at baseline. Instead, they are compared only after the intervention. Unless
knowledge of relative change on an

Fig. 5.5 Example of the posttest only control group design

outcome is required, baseline assessments of the dependent variable are not necessary
for establishing comparability of the comparison groups in true-experimental designs,
since random allocation to treatment should eliminate the threat of selection bias.
As noted earlier, this is especially true if the number of study subjects is large
and the randomization strategy is properly executed. Nevertheless, baseline data on
relevant demographic and clinical variables other than study outcomes typically are
collected to permit examination of this assumption.
The posttest only control group design is especially appropriate in situations where
within-subject outcomes logically cannot be defined before application of the
intervention (e.g., in studies relating impact of the intervention on survival). A
classic example was published by the β-Blocker Heart Attack Research Group in the
Journal of the American Medical Association [23] (summary and design structure are
given in Fig. 5.5).
In this study design, X represents the experimental drug, in this case propranolol,
and XP is the placebo. O1 and O2, respectively, represent the percent mortality for
the propranolol and placebo groups. As before, the symbol R denotes the use of
randomized allocation to treatment group.
How well does this study design protect against threats to internal validity? The
answer is very well. Again, as for the pretest-posttest parallel control group
design, the use of random allocation of almost 4,000 patients to treatment assignment
controls for selection bias (the comparability of the distributions of baseline
clinical variables, electrocardiographic abnormalities, age, gender, and other
descriptors between the propranolol and placebo groups noted in their manuscript
illustrates this point). In addition, the use of parallel comparison group
post-intervention comparisons, rather than sole reliance on within-group changes
without controls, effectively rules out history, maturation, testing, mortality,
regression, and instrumentation effects and their interactions as competing
explanations for the outcomes. In addition, because the study was double blinded,
both subject expectancy and experimenter bias also are eliminated as potential
threats to validity.
The study also is superior to that of Gorbach et al. with regard to external
validity. The reason is that the posttest only comparison group design does not
require pre-intervention assessments as a benchmark against which to establish
intervention effects. Thus, by definition, it controls for the reactive effects of
testing. Indeed, this is the primary

advantage of this design versus the pretest-posttest parallel group design. In this study, the outcomes of the intervention were all hard events rather than behavioral or educational outcomes, and the intervention, itself, involved medication rather than promotion of lifestyle change. Therefore, the reactive effects of experimental arrangements, if any, should be minimal, provided that the investigators took care to minimize the obtrusiveness of the experimental manipulations and measurements. Nonetheless, while the study was large and multicentered, the authors reported that 77% of those patients invited to participate did not do so. Therefore, despite the many thousands of patients enrolled, there is still a question of how representative the sample was of the general population after a recent MI. Consequently, the external validity of this study potentially is threatened by the selection-treatment interaction which, as noted earlier, is a common problem in many RCTs.

True-Experimental Design # 3
The 2 × 2 Factorial Study

The first two true-experimental designs permitted the investigator to evaluate the impact of a primary treatment versus an alternative primary treatment or control. True-experimental factorial designs are modifications that include a secondary treatment administered concurrently with the primary treatment to permit examination of the modification of the main and interactive effects of each. They can be designed with and without pretests (as above) and with or without blinding, if the latter is not practical or possible.

An example of these designs is diagrammed above. This exemplar is termed a 2 × 2 factorial true-experimental design and includes four concurrent parallel groups: the first two groups receive a primary treatment, denoted by X, and the second two receive no primary treatment, denoted by the absence of X (or, alternatively, X0) or Xp if placebos are given to the nontreatment controls. In a variation of this design (for evaluation of comparative effectiveness), the second group might receive an alternative primary treatment (in this case, these treatments would be designated X1 and X2 to differentiate them). One group receiving the primary treatment and one receiving an alternate treatment, or no primary treatment, also receive a secondary treatment, denoted here as Y. The remaining two groups do not receive the secondary treatment or may receive a placebo. The groups are observed in parallel after application of the intervention, as denoted by O. A 2 × 2 true-experimental design, published by the International Study Group in The Lancet [24], was employed to evaluate the relative effectiveness and safety of two thrombolytic drugs administered with or without heparin (summary and design structure are given in Fig. 5.6).

In this study, X1 represents streptokinase, and X2 represents alteplase. Y indicates concomitant administration of heparin; the absence of Y indicates that no heparin was given. O1–O4 denote the percentages of in-hospital deaths in each of the comparison groups (Fig. 5.6).

Because this study (like those using true-experimental designs #1 and #2) employed a design that randomly allocated subjects to four large parallel treatment arms, selection bias is controlled as are history effects, maturation, instrumentation, testing, experimental mortality, and regression. Unfortunately, neither patients nor investigators were blinded to the former's treatment assignment. Thus, the study did not control for the potential effects of expectancy. This omission is important because even though the dependent variable clearly was an objective outcome (i.e., death) and randomization led to groups that appeared to be well balanced at study entry, knowledge of the treatment assignment still could have resulted in unintended differences between the treatment arms in the use of nonprotocol-mandated co-interventions (e.g., percutaneous coronary angioplasty or coronary bypass grafting) that, themselves, could have influenced study outcomes. This design flaw, of course, is not a limitation of the true-experimental factorial design (which, otherwise, controls very well for major threats to internal validity) but, as noted earlier, is a problem associated with any open (unblinded) study. Had the study been blinded,

Fig. 5.6 Example of a 2 × 2 factorial true-experimental design
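
The main and interactive effects that a 2 × 2 factorial design is meant to separate can be written as simple contrasts among the four group outcomes. The following is a minimal sketch, not the analysis used in the trial: the assignment of O1 to O4 to particular cells, the function name `factorial_contrasts`, and the mortality percentages are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
def factorial_contrasts(o1, o2, o3, o4):
    """Contrasts for a 2 x 2 factorial trial from four cell outcomes.

    Assumed (hypothetical) cell layout:
      o1: X1 with Y       o2: X1 without Y
      o3: X2 with Y       o4: X2 without Y
    """
    # Main effect of the primary treatment: X2 versus X1, averaged over Y.
    main_x = (o3 + o4) / 2 - (o1 + o2) / 2
    # Main effect of the secondary treatment: Y versus no Y, averaged over X.
    main_y = (o1 + o3) / 2 - (o2 + o4) / 2
    # Interaction: does the effect of Y differ between X1 and X2?
    interaction = (o1 - o2) - (o3 - o4)
    return main_x, main_y, interaction


# Hypothetical in-hospital mortality percentages, not trial results.
print(factorial_contrasts(8.0, 9.0, 8.5, 9.5))  # (0.5, -1.0, 0.0)
```

In this made-up example the secondary treatment lowers mortality by the same amount under either primary treatment, so the interaction contrast is zero; a nonzero value would indicate that the two treatments modify each other's effects.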

this true-experimental factorial design, like the two preceding true-experimental designs, would, in theory, have afforded full protection against most, if not all, serious threats to internal validity.

The chief advantage of this study design for external validity (vs. the crossover study, discussed below) lies in the fact that its structure permits a purposive and systematic evaluation of the separate and combined (i.e., interactive) effects of concomitant investigational therapies, thereby avoiding unplanned carryover effects and precluding the threat of multiple treatment interference. Though this design can increase the efficiency of interventional trials by permitting simultaneous tests of several hypotheses, the reader should be aware that if interactions are severe, loss of statistical power is possible [25]. A limitation to the external validity of this particular study (but not to factorial designs in general) is the coadministration of noninvestigational drugs (i.e., β-blockade and aspirin) among all patients without contraindications to these therapies, which prevents us from generalizing the mortality findings to similar patients in whom these therapies are not given.

True-Experimental Design # 4
The Two-Period Crossover (Change-Over) Design

[Period A] [Period B]

In the previous example, the main and interactive effects of two treatments were evaluated. To accomplish this, a factorial parallel (between-subjects) design was used that required allocation of large numbers of subjects into four different treatment arms, resulting in one protocol-mandated exposure per subject during the course of the study. In contrast, if the study objectives were to determine only the main (isolated) effects of two treatments, rather than their interactions, this objective could be accomplished more

efficiently (i.e., with fewer subjects producing equivalent statistical power or precision) using the true-experimental crossover (or changeover) design. A crossover design is a type of repeated measures design in which each subject is exposed to different treatments during the study (that is, subjects "cross" or "change over" from one treatment to another). The order of treatment administration (determined a priori via randomization) is termed a "sequence," and the time of the treatment administration is called a "period." The statistical efficiency of the design results from the fact that each subject acts as his or her own control, thereby minimizing error due to (and sample size needed to overcome) the effects of between-subject variability. Crossover designs have enjoyed popularity in many disciplines including medicine, psychology, and agriculture. They are commonly used in the early stages of clinical trials to assess the efficacy and safety of pharmacological agents and constitute the preferred methodological approach for establishing bioequivalence. A variant that can be used for these purposes is the n-of-1 study, a mini-RCT in which a single patient is observed during exposure to randomly ordered sequences of treatment (frequently given in varying doses) and placebo. Both the patient and clinician are blinded as to treatment allocation, and the codes are broken after the trial. Responses, such as reported side effects, are graphed or analyzed through a variety of parametric and nonparametric statistical techniques. When performed in series, the n-of-1 study can provide valuable information for subsequent parallel group trials.

A crossover study has utility for clinical research only when three conditions are satisfied: (1) subjects must have a chronic stable disease that is not likely to progress during the study; (2) study endpoints must be transitory, that is, must reflect temporary physiological changes (e.g., blood pressure) or relief of pain, rather than cure (or death); and (3) the investigational treatments must be able to deliver relatively rapid effects that are quickly reversible after their withdrawal. The latter point is especially critical. If the effects of the investigational interventions are permanent or more long lasting than anticipated, their carryover effects could compromise the validity of data obtained after the initial period (e.g., cause under- or overestimation of the efficacy of the second treatment) and undermine the efficiency of the study.

Although crossover studies can involve multiple periods and sequences, the most common is true-experimental design #4, the two-period crossover design, illustrated symbolically above. When this approach is used to test the efficacy and safety of different investigational drugs, subjects normally will undergo a run-in period during which noninvestigational medications are discontinued and a suitably long washout interval between the two active treatment periods, A and B (the latter guided by the bioavailability of the drugs), so as to minimize the likelihood of carryover effects. Typically, half of the sample initially receives the first drug, denoted by X1, and the other half initially receives the second drug, denoted by X2. Following the washout, study subjects who received the first drug are given the second drug, and vice versa, resulting in a fully counterbalanced design. Observations are recorded pre- and postdrug administration in the two treatment periods, denoted by O. The symbol R to the left of the diagram indicates that the order of initial treatment assignment is allocated at random to counter possible order effects. An example of a study employing a crossover design was conducted by Seabra-Gomes et al. [26] who evaluated the relative effects of two antianginal drugs on exercise performance in men with stable angina (summary and design structure are given in Fig. 5.7).

In this study, X1 denotes isosorbide-5-mononitrate and X2 stands for isosorbide dinitrate. O1–O3 are the outcome variables measured among patients receiving X1 during period A; O4–O6 are the same variables measured during period B. O7–O12 are the outcome variables measured among patients initially receiving X2. R indicates that the order of the initial drug assignments was randomly allocated.

As with all other true-experimental models, internal validity is very well controlled with this design. Selection bias is eliminated because study subjects are their own controls and comparisons of outcomes are made within rather than between

Fig. 5.7 Example of a true-experimental two-period crossover study
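
The way a two-period crossover uses each subject as his or her own control can be sketched numerically. The example below is a simplified illustration that assumes no carryover between periods; the function name `crossover_estimates` and the exercise-time values are hypothetical and are not data from the study summarized in Fig. 5.7. It takes the within-subject difference between periods for each randomized sequence and combines the two sequence means so that any period effect cancels out of the treatment estimate.

```python
from statistics import fmean


def crossover_estimates(seq_x1_x2, seq_x2_x1):
    """Estimate treatment and period effects in a two-period crossover.

    Each argument is a list of (period_A, period_B) outcome pairs,
    one pair per subject, for one randomized sequence.
    Assumes no carryover from period A into period B.
    """
    # Within-subject differences (period A minus period B) remove
    # stable between-subject variability.
    d12 = fmean(a - b for a, b in seq_x1_x2)
    d21 = fmean(a - b for a, b in seq_x2_x1)
    treatment = (d12 - d21) / 2  # X1 minus X2, free of the period effect
    period = (d12 + d21) / 2     # period A minus period B
    return treatment, period


# Hypothetical exercise times in minutes, not study data.
x1_then_x2 = [(10.0, 7.0), (12.0, 9.0), (11.0, 8.0)]
x2_then_x1 = [(8.0, 9.0), (9.0, 10.0), (7.0, 8.0)]
print(crossover_estimates(x1_then_x2, x2_then_x1))  # (2.0, 1.0)
```

If a subject drops out after period A, no within-subject difference can be formed for that subject, which is why incomplete data cannot contribute to an estimate of this kind.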

individuals. As for true-experimental designs #1–3, the use of parallel comparison groups studied within a relatively short time interval generally affords good control of history, maturation, and similar effects. In addition, the use of double blinding (specific to this study, though not necessarily to this design) eliminates the threat of expectancy on the part of the investigator and subject.

There are, however, a number of potential threats to the external validity of any crossover study. Most prominent are the interactive effects of testing and treatment which could limit generalizability due to the potential sensitization (or desensitization) effects of multiple pre-intervention assessments. Of course, here again, the less obtrusive the measures, the less worrisome the threat. Second, a study of this nature is vulnerable to the threat of a selection-treatment interaction. The reason is that the number of study subjects in this study is relatively small, as is commonly the case in crossover studies (indeed, as noted previously, this is an advantage of these studies compared with parallel designs without crossover, which require larger numbers of subjects for equivalent power). This reduces the number of comparisons that can be made and amplifies the

impact on outcome of the choice to participate or not to participate based on factors extraneous to the aims of the study. The number of available comparisons is further reduced if subjects discontinue their participation before the study has ended because failure to complete the study precludes determination of within-subject treatment differences, the underpinning of the crossover study. If the number of dropouts were high, the study could be underpowered despite initial planning to avoid this. (The reader should note that in the Seabra-Gomes study, 15% of subjects initially participating failed to complete it; their data could not be used.) In addition, unless the experimenter took care to reduce the obtrusiveness of the study, the inherently novel aspects of the crossover design (alternating treatments, coupled with multiple observations) could cause reactive effects that might not appear in a more natural setting (reactive effects of experimental arrangements). Perhaps the greatest potential threat to the external validity of a crossover study lies in the potential for multiple treatment interference because, as noted above, there may be carryover effects between treatments that may not be generalized to the single treatments under investigation. This may occur when the alternative treatments being compared are not adequately separated in time (washed out) or, unbeknownst to the investigator when designing the trial, lead to permanent change (e.g., cause liver or kidney damage). Under these circumstances, the response to treatment in period B may be importantly influenced by a residual effect from the treatment given during period A, producing an under- or overestimation of the efficacy of the second treatment. Because of this potential limitation, crossover studies generally are less favored than parallel designs for definition of treatment efficacy. Indeed, as a practical matter, when such studies are undertaken to obtain regulatory approval or labeling elements for a treatment, investigators should consult with the appropriate regulatory body (in the United States, the Food and Drug Administration [FDA]) as to the acceptability of the design for the particular purpose.

Quasi-experimental Designs

If the value of an intervention study were to be judged solely on considerations of internal validity, most investigators would opt to employ fully blinded true-experimental designs. Yet, despite their undisputed methodological superiority for providing evidence of cause and effect relationships, these designs are employed in only a minority of published studies that have evaluated the impact of interventions on outcomes of interest. As noted above, even well-constructed, true-experimental designs are subject to limitations in external validity. They also can be difficult, if not impossible, to apply within the constraints of many research environments. Such constraints may include the lack of concurrently available comparison groups (commonly due to ethical problems caused by withholding a preferred treatment from control subjects) and, especially, the inability to randomly allocate study subjects into different treatment groups in order to minimize the threat of selection bias (in clinical research, commonly due to physician or patient refusal based on assumptions about outcome or to more complex psychological factors). To compensate for these deficiencies, and to render research feasible in constrained situations, Campbell and Stanley popularized a concept known as quasi-experimental design. This approach can be applied to individual subjects or to populations and to evaluations conducted in practice-based and field settings. It can help the investigator to control some threats to internal validity that would be uncontrolled with pre-experimental designs or externally controlled studies and can be very useful in the evaluation of therapies, educational programs, and policy changes in many disciplines.

Like true-experimental designs, all quasi-experimental designs involve the application of an intervention and observations of at least one outcome that is related to the intervention. However, quasi-experimental designs typically lack the hallmark of the true experiment, that is, random allocation to treatment group. Of these, the most widely used for evaluating the impact of

clinical and other health-related interventions on group outcomes are the following:
1. The nonequivalent control group design
2. The time-series design
3. The multiple time-series design

The first design can be used to evaluate the impact of an intervention using a single before and after assessment of the dependent variables in two or more comparison groups. The second uses multiple assessments, conducted over time, of the dependent variable in a single group of subjects. The third (a combination of quasi-experimental designs #1 and #2) includes multiple assessments, again over time, but in two or more parallel groups. Because the observations in designs #2 and #3 are broken up by the imposition of the intervention, both also are termed interrupted time-series designs. (The reader is referred to Kazdin [27] or to Janosky et al. [28], for a detailed discussion of other quasi-experimental designs used for research with single or small groups of subjects, and to Campbell and Stanley [1], Cook and Campbell [2], and Shadish, Cook, and Campbell [29], for additional quasi-experimental designs used with larger groups or populations).

Quasi-Experimental Design # 1
The Nonequivalent Control Group Design

O1   X   O2
------------------
O3       O4

The nonequivalent control group design (also termed the nonequivalent comparison design) compares outcomes among two or more intact groups, at least one of which receives the intervention; another serves as the control. This design is most useful when concurrent comparison groups are available, when random allocation to treatment condition is not possible, and when pretesting of the dependent variable can be performed so that baseline similarity of the comparison groups can be evaluated. It is commonly used when comparison groups are spontaneously or previously assembled entities (e.g., different clinics, wards, schools, or geographic areas) or when logistic difficulties preclude random allocation to treatment within the same entity.

The basic structure of this design is symbolized above. It is almost identical to the pretest-posttest true-experimental control group design except that study subjects are not randomly assigned to treatment groups; therefore, the groups cannot be assumed to be equivalent before the intervention. As before, X symbolizes the intervention, O denotes the pre- and post-intervention assessments in each of the comparison groups, and the dashed line (and absence of R) indicates that intervention was applied to an intact group (i.e., allocation was not random). Steyn et al. [30] used a nonequivalent control group design to examine the intervention effects of a community-based hypertension control program (the Coronary Risk Factor Study [CORIS]) that was introduced for 4 years among white hypertensive residents in two rural South African towns (summary and design structure are given in Fig. 5.8).

In this study, O1, O3, and O5 represent baseline systolic blood pressure and diastolic blood pressure in the intervention and control towns; O2, O4, and O6 represent post-intervention blood pressures in these towns. X1 represents the low-intensity hypertension reduction intervention, X2 represents the high-intensity intervention, and the absence of X denotes the lack of intervention (the control). The dashed line indicates intact (nonrandom) treatment assignment.

Because allocation to the intervention was not performed randomly, confounding variables may have influenced the observed outcomes. Therefore, internal validity is not as well protected as it is with true-experimental design #1 (the pretest-posttest control group design), which has a similar structure but includes random allocation. The greatest potential threat to internal validity with this design is differential selection, which could cause the comparison groups to vary on key factors related to the dependent variable; if present, selection bias could interact with other potential biases such as maturation (e.g., a sicker group could have disease that might progress more rapidly) or regression (if one of the two groups were chosen on the basis of extreme values). Selection bias can occur if the investigator evaluates the intervention in two intrinsically

Fig. 5.8 Example of a nonequivalent control group study
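
One way baseline measures can be used to adjust for initial differences between intact groups is a covariance analysis. The following is a minimal sketch of the classical adjustment with a single baseline covariate and a common within-group slope; it is not the analysis reported for CORIS. The function name `ancova_adjusted_difference` and the blood pressure values are hypothetical.

```python
from statistics import fmean


def ancova_adjusted_difference(treated, control):
    """Baseline-adjusted follow-up difference (treated minus control).

    Each argument is a list of (baseline, follow_up) pairs. A common
    within-group slope is assumed, as in a classical covariance analysis.
    """
    sxy = sxx = 0.0
    means = []
    for group in (treated, control):
        mx = fmean(x for x, _ in group)
        my = fmean(y for _, y in group)
        means.append((mx, my))
        sxy += sum((x - mx) * (y - my) for x, y in group)
        sxx += sum((x - mx) ** 2 for x, _ in group)
    slope = sxy / sxx  # pooled within-group slope of follow-up on baseline
    (mx_t, my_t), (mx_c, my_c) = means
    raw = my_t - my_c
    # Remove the part of the raw difference explained by baseline imbalance.
    adjusted = raw - slope * (mx_t - mx_c)
    return raw, adjusted, slope


# Hypothetical systolic pressures (mmHg); the intervention group starts higher.
intervention = [(150.0, 135.0), (160.0, 140.0), (170.0, 145.0)]
comparison = [(140.0, 140.0), (150.0, 145.0), (160.0, 150.0)]
print(ancova_adjusted_difference(intervention, comparison))  # (-5.0, -10.0, 0.5)
```

In this made-up example the unadjusted difference understates the apparent benefit because the intervention group began with higher pressures; the adjustment can only account for measured baseline variables, so unmeasured differences between intact groups remain a possible source of selection bias.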

dissimilar populations or uses a nonuniform subject recruitment approach (e.g., permits subjects to self-select their treatment assignment). However, if care is taken to avoid these practices, the availability of baseline measures of the dependent variable, a critical component of the nonequivalent control group design, permits the investigator to evaluate the extent and direction of a potential selection bias and to minimize it, as appropriate, through covariance analysis. Therefore, this design affords much greater control for selection bias than pre-experimental design #3 (the static-group comparison) which also contrasts outcomes across intact groups, but which lacks critical baseline data needed to establish initial comparability. Where pre-intervention data show relative comparability between groups on relevant variables, the nonequivalent control group design generally is appropriate; when pre-intervention comparability is not present, an alternative design should be used. In the CORIS study, the authors state that the groups had similar blood pressures prior to the intervention. Thus, it is not likely (though, certainly, it is not impossible) that the differences found after the intervention were attributable to selection bias. The inclusion of baseline measures also permits the investigator to evaluate the potential threat of experimental mortality (attrition bias). If there were losses to follow-up among the comparison groups, their potential impact could be evaluated by comparing baseline characteristics of those who withdrew with those who completed the study. The authors of CORIS, who performed this analysis, found that those who withdrew were similar to those who remained with regard to age, gender, initial cholesterol levels, blood pressure, body mass index, and smoking behavior. Thus, the potential threat of experimental mortality was effectively ruled out.

In the absence of differential selection and a hypothesized interaction between selection and the day-to-day experiences of the subjects, history effects are not plausible as an alternative (rival)

explanation for the observed outcomes and, thus, also can be ruled out as a major potential threat to internal validity when using the nonequivalent control group design. The reason is that, barring evidence to the contrary, external events occurring in one comparison group should be just as likely to occur in the other when subjects are evaluated in parallel. However, as with true-experimental designs, the burden remains with the investigator to ascertain the degree to which other relevant events may be occurring in the intact group settings that might also affect outcomes; this is especially important when comparators are geographically separated, as in this study. Also, because groups are studied in parallel, internal validity threats such as maturation, testing, instrumentation, and regression effects are fairly well controlled (again, assuming the groups share common baseline characteristics). Finally, any potential biases associated with expectancy are not inherently greater than those found with true-experimental designs and may be reduced, at least in part, by uniform standards for data collection and analysis (as was done in CORIS).

As with true-experimental design #1, the use of pre-intervention testing (essential with this design for establishing baseline comparability of the comparison groups) may pose a threat to external validity unless the testing itself were deemed to be part of the intervention, as it would appear to be in the CORIS study. Additionally, as with any design, a selection-treatment interaction can occur if the study subjects are not representative of all subjects who potentially could be studied. Indeed, the authors of CORIS recognized that their findings did not necessarily apply to individuals of ethnic backgrounds and socioeconomic statuses not included in CORIS. In general, however, the nonequivalent control group design places far fewer restrictions on sampling and, therefore, tends to be more generalizable than the typical randomized parallel group trial. Lastly, the reactive effects of experimental arrangements potentially could limit the external validity of studies using this design, but because they entail comparisons of interventions applied to naturally occurring groupings, they tend to be less reactive and, thus, have better external validity than most true experiments.

Quasi-Experimental Design # 2
The Time-Series Design

O1 O2 O3 O4 X O5 O6 O7 O8

The previous example compared the impact of an intervention on outcomes using several intact groups. Occasionally, an investigator planning to evaluate an intervention may be unable to identify a suitable (or any) comparison group. This might occur when patients are candidates for a treatment, the effectiveness of which is to be tested, but an alternate treatment is not available, or if available, is viewed as unacceptable by the patients or their physicians; a similar problem frequently occurs when a specific treatment cannot be withheld for what are considered ethical reasons. Thus, sometimes, interventions must be presented to entire groups, for example, all patients potentially at risk. In these cases, an investigator might opt for a pre-experimental design without a control group (e.g., the pretest-posttest only design), in which a single group of study subjects is observed on just one occasion before and after the intervention, or might compare results obtained in study subjects with external or historical controls. The literature reflects many such examples. Unfortunately, as noted earlier, pre-experimental designs provide very poor control against important threats to internal validity, and comparing results from a current treatment group with those obtained among historical controls is almost always biased in favor of the former, principally due to improvement in the general health of the population over time.

The time-series design (sometimes called an interrupted time-series) represents an improvement over both of these pre-experimental approaches. In its simplest form, multiple observations (the number depending on the stability of the data) are generated for a single group of subjects both before and after application of an intervention. The objective of any study using such a design is to provide evidence that observations made before (and sometimes after) imposition of the intervention differ in a consistent manner from

observations made during the intervention. While special autoregressive statistical procedures often are used for analysis, the hallmark of this and other types of time-series designs is visual analysis of temporal outcome changes in relation to the intervention. Shown below are examples of hypothetical data, reflecting various levels of evidence for inferring cause and effect that, theoretically, can be produced with a time-series design. The reader should note that patterns reflected by lines A and B provide the strongest evidence for inferring intervention effects (note that both show sharp increases in slope concomitant with the intervention, following a stable baseline), those reflected by lines C–E are equivocal, whereas those shown by lines F–H provide no justification for such an inference (Fig. 5.9).

Fig. 5.9 Some possible outcome patterns from the introduction of an experimental variable at point X into a time-series of measurements, O1–O8. Except for D, the O4–O5 gain is the same for all time series, while the legitimacy of inferring an effect varies widely, being strongest in A and B, and totally unjustified in F, G, and H. From Campbell and Stanley, Experimental and Quasi-Experimental Designs for Research, 1E 1966 Wadsworth, a part of Cengage Learning, Inc. (Reproduced by permission,

Time-series designs can be used to evaluate continuous or temporary interventions and can incorporate retrospectively or prospectively acquired data. They are especially useful and appropriate for modeling temporal changes in response to programmatic interventions or health policy changes in otherwise stable populations. A time-series design was used by Delate et al. [31] to evaluate economic outcomes of a cost-containment policy for Medicaid recipients that was applied continuously throughout their study (summary and design structure are given in Fig. 5.10).

In this study, O represents the number of antisecretory drug claims and expenditures per member per month (PPIs and H2As) before and during the post-policy period (24 such outcomes were measured in total, though only eight observations are shown here for ease of interpretation). X is the prior authorization policy; the symbol indicates that the intervention is applied continuously. The pattern of the observed H2A data (which emulates line A of Fig. 5.9) and the obverse pattern of the PPI data are used to buttress the investigators' conclusions that the observed changes in the number of claims filed for, and expenses associated with, antisecretory drugs are due to the imposition of the policy.

An example of a time-series design evaluating a temporary intervention can be found in the work of Reding and Raphelson [32] who evaluated the impact of an addition of a psychiatrist to a mobile psychiatric crisis team on psychiatric hospitalization admission rates in Kalamazoo County, Michigan (summary and design structure are given in Fig. 5.11).

In this study, X denotes the mobile psychiatrist intervention and O, the number of state hospitalizations during each of the monthly assessments before, during, and after the intervention (again, 30 outcome assessments actually were performed, reduced to eight for ease of presentation here). The authors' conclusion that the intervention caused the changes in the pattern of
5 Fundamental Issues in Evaluating the Impact of Interventions: Sources and Control of Bias 105

Fig. 5.10 Example of a time-series design (continuous intervention)

Fig. 5.11 Example of a time-series design (temporary intervention)

hospitalizations is based on data patterns that conform to the inverse of those shown in Fig. 5.9, line B (i.e., changes on the dependent variable contemporaneous with the intervention that return to baseline after termination).

In both of these studies, the threats of selection bias and experimental mortality are controlled, provided that the same subjects participate in each of the pre- and post-intervention assessments. Since this is rarely the case in community-based studies, the investigators must take steps to evaluate natural migratory patterns within the community to ensure that these do not confound their results. Dynamic changes within subjects or populations (i.e., maturation effects), if any, usually are well controlled with time-series designs because they (like regression effects) are unlikely to cause variations that occur only when the intervention is applied. For similar reasons, the time-series design controls for testing effects even in cases in which the measurement process is more obtrusive than that used in the Delate and Reding studies.

The chief potential threat to internal validity of studies using time-series designs is history. Because human subjects rarely are studied in a
106 P.G. Supino

vacuum, the investigator must be on the alert for outside influences (e.g., programs, policy changes, or even seasonal fluctuations) occurring coincident with the intervention that also might affect study outcomes. For example, to accept Delate's conclusions, one would have to believe that there were no other factors (e.g., changes in physician prescribing patterns, advertising campaigns) to which the subjects were exposed that would have caused them to use fewer PPIs during the post-program period. Similarly, the Reding conclusions are tenable only if one accepts that nothing else (such as another psychiatric intervention or availability of new treatments, etc.) occurred in Kalamazoo County specifically during the tenure of the mobile psychiatrist that also might have reduced admissions to state hospitals. If careful documentation by the investigator rules this out, then history effects become a less plausible alternative hypothesis for the observed changes. A second internal validity threat is instrumentation. If the calibration of an objective measure (or the instrument itself) changes during the study, and if this change occurs when the intervention is applied, then it is difficult to know whether the observations made after the intervention are due to it or to changes in the instrument. The same problem may occur when measurement criteria or outcome adjudicators change in parallel with the intervention, especially when the latter are aware of the study hypothesis. With administrative data, there is always a chance that the methodology used for record keeping might spuriously influence outcomes. For example, a change in the coding of diagnosis-related groups (DRGs) during an intervention might lead the investigator to conclude incorrectly that there were more (or fewer) hospitalizations for a given disease during this interval. To minimize these potential effects, the investigator should endeavor to standardize measures and educate research personnel about such issues. Finally, whenever possible, steps should be taken to blind those interpreting outcomes to knowledge of the treatment period to reduce the influence of expectancy on these assessments.

As with all designs that evaluate change over time, the use of multiple observations, if obtrusive, could compromise external validity by sensitizing subjects to their treatments. The potential for a testing-treatment interaction (or testing reactivity) is heightened with a time-series design because multiple pre-intervention assessments are required to establish the stable pre-intervention pattern against which changes in slope and/or intercept of the post-intervention assessments are compared. For this reason, studies using these designs generalize best when performed in settings in which data are collected as part of routine practice. Additionally, when based on natural experiments, like those reported by Delate and Reding, they cause few, if any, reactive effects because the interventions are experienced as part of the subjects' normal environment. As with any design, however, the ability to generalize outcomes depends on the similarity of the study group to the reference population.

Readers with clinical experience may recognize a variant of the time-series design in which an intervention is reintroduced after one or more intervals of withdrawal. In behavioral research with single subjects or with series of subjects (e.g., studies designed to extinguish inappropriate actions among children with autism or adult schizophrenics or to improve task performance in the setting of attention deficit hyperactivity disorder), this approach is termed an ABAB Design, where A and B respectively denote alternating control and intervention periods. (It is called a BABA Design when the sequence begins with the intervention, followed by its withdrawal and reintroduction, etc.) In other specialties, it is more commonly termed an equivalent time samples design or a repeated treatment design. This general approach has greater control of history and instrumentation effects than the classic time-series design because the probability of some external event or unintentional instrument or observer change tracking with (and accounting for) the effects of intermittent applications of the intervention is arguably lower than it would be when only one application of the intervention is involved. It can be particularly useful as the basis for relatively rigorous determination of the effects of pharmacological therapies (particularly adverse outcomes of chronically employed drugs), when

such effects are predictably transient or reversible in nature. For example, with age, individuals tend to perceive arthralgias and myalgias with relative frequency. Hypercholesterolemia is fairly widespread according to current epidemiological definitions, and the prescription of HMG-CoA reductase inhibitors (statins) to control cholesterol is quite common. The drugs have been well demonstrated in RCTs to reduce coronary disease events and, specifically, mortality, among patients so treated. In some patients (the minority), statins also can cause myalgias and, in fewer still, polyserositis with arthralgias. Most patients are aware of these potential problems from constant reference to them in the news media and often ascribe their symptoms to the statins because of expectancy. Thus, when patients complain of myalgias and/or arthralgias while taking statins, it is incumbent upon the physician to determine whether the association truly is cause and effect. The best approach is to employ an equivalent time samples design, beginning with a careful history of current symptoms on drug (O) followed by withdrawal of sufficient duration to allow drug effects to dissipate, another careful history, and then reinstitution (rechallenge) with the drug, with another O after some period of use. If the result is unclear, the series can be repeated. Unfortunately, in the real world, patients tend to confound outcome by interposing anti-inflammatory drug use concomitantly with cessation of the statin and often refuse the rechallenge. Nonetheless, this example illustrates the importance of understanding and applying the principles of study design in the course of clinical practice. (For further details about the pros and cons of this design as a tool for research and methods for implementing it in clinical populations, the reader again is referred to the works of Campbell and Stanley [1], Cook and Campbell [2], Kazdin [27], Janosky et al. [28], and to Haukoos et al. [33].)

Quasi-Experimental Design # 3

The Multiple Time-Series Design

The multiple time-series design combines the unique features of nonequivalent control group and time-series designs to maximize internal validity. It evaluates relative change over time on one or more dependent variables in two or more intact comparison groups (again, usually preexisting groups assembled for other purposes), at least one of which receives an intervention and one of which does not (the control). Thus, this design creates two experiments, one in which the intervention is compared against a no-intervention control and the second in which pre-intervention time-series data are compared with those obtained after the intervention, thereby increasing the amount of available evidence to buttress a claim of an intervention effect. In its most general design structure, shown above, X symbolizes the intervention (applied within one of the groups), O is the pre- and post-intervention assessment of the dependent variable(s), and the dashed line denotes the intact nature of the comparators. The design is most appropriate when it is not possible to randomly allocate subjects to an intervention, when a concurrent no-intervention group is available for comparison, and when serial data can be (or have been) generated for both groups during the pre- and post-intervention periods. As for the nonequivalent control group design, the availability of baseline data is necessary to evaluate initial comparability of the intervention and control groups. The multiple time-series design was used by Holder et al. [34] to evaluate the effects of a community-based intervention on high-risk drinking and alcohol-related injuries (summary and design structure are given in Fig. 5.12).

In this study, X represents the community-based alcohol deterrence intervention; O (made approximately monthly over a 5-year interval) denotes average (1) frequency of drinking, (2) number of alcoholic drinks consumed per drinking occasion, (3) instances of driving while intoxicated, (4) motor vehicle crashes (daytime, DUI-related, nighttime injury-associated), and proportion of (5) emergency room and (6) hospital admissions for violent assault among the
Fig. 5.12 Example of a multiple time-series design

intervention versus comparison communities. The investigators' conclusion that the intervention caused reductions in high-risk drinking behavior and associated motor vehicle accidents and assaults is based on sustained differential trends in post- versus pre-intervention outcomes among the intervention versus matched control communities.

All of the potential threats to internal validity that are controlled by the time-series design also are controlled by the multiple time-series design. However, with the addition of a parallel comparison group, there is better control for the potential threat of history unless, as with the nonequivalent control group design, the comparison groups are so poorly selected as to have different external experiences. Similarly, as previously noted, the use of a parallel control group generally affords good protection against threats to validity potentially caused by instrumentation, maturation, and testing because if pre- to post-intervention differences were influenced by these factors, they should be just as likely to impact both the experimental and control groups (again, assuming reasonable baseline equivalence between comparators). Indeed, when properly executed, the multiple time-series design essentially is free from the most important threats to internal validity of an intervention study and, for this reason, generally is considered to be among the most rigorous of the various quasi-experimental designs. The threats to the external validity of a multiple time-series design are similar to those of the nonequivalent control group and time-series designs and, as for these designs, are minimized by the use of unobtrusive measures, naturalistic interventions, and careful selection of comparators.
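
The logic of comparing changes in "slope and/or intercept" across an intervention series and a parallel control series can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: it uses ordinary least squares on a simple segmented (piecewise linear) model rather than the autoregressive procedures often used in practice, the function names `segmented_fit` and `controlled_effect` are hypothetical, and the monthly values are invented, noise-free data rather than results from any study cited here.

```python
import numpy as np

def segmented_fit(y, start):
    """Least-squares fit of y_t = b0 + b1*t + b2*post_t + b3*(t - start)*post_t.

    b2 estimates the change in level (intercept) and b3 the change in
    slope once the intervention begins at observation index `start`.
    Autocorrelation is ignored in this simplified sketch.
    """
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y), dtype=float)
    post = (t >= start).astype(float)
    X = np.column_stack([np.ones_like(t), t, post, (t - start) * post])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return {"level_change": coef[2], "slope_change": coef[3]}

def controlled_effect(intervention, control, start):
    """Subtract the control series' changes from the intervention series' changes."""
    a = segmented_fit(intervention, start)
    b = segmented_fit(control, start)
    return {key: a[key] - b[key] for key in a}

# Hypothetical data: 12 pre-intervention and 12 post-intervention monthly values.
months = np.arange(24)
history = np.where(months >= 12, -2.0, 0.0)  # outside event affecting both groups
control = 50.0 + 0.5 * months + history
treated = 50.0 + 0.5 * months + history + np.where(months >= 12, -5.0, 0.0)

single_series = segmented_fit(treated, start=12)   # mixes history with the intervention
effect = controlled_effect(treated, control, start=12)
print(single_series, effect)
```

In this contrived case the single-series fit attributes the whole drop in level to the intervention, whereas subtracting the control series' change removes the portion shared by both groups, which is the sense in which a parallel comparison group improves control of history.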

Summary

This chapter has reviewed a variety of alternative research designs commonly used to evaluate the impact of interventions. The examples of their application have been drawn from clinical research. However, the reader should be aware that, to achieve optimal rigor and strength of conclusions, the same design principles can and should be applied in preclinical, cellular, and molecular studies though, because of the relative homogeneity (and nonhuman characteristics) of test material in basic science studies, issues of randomization, blinding, sample sizes, etc., may be handled somewhat differently than in clinical research.

Nonetheless, it should be clear from a comparison of the relative strengths and weaknesses of the various study designs reviewed in this chapter that there is no perfect study. The pre-experimental designs offer the least protection against major potential threats to internal validity, providing weakest evidence to support a claim of cause and effect. The true-experimental designs offer the best control over most internal validity threats, providing strongest evidence to support intervention effects, but their generalizability may be compromised by highly restrictive inclusion criteria, patient reluctance to participate in a randomized study, or reactivity caused by pre-intervention testing or the experimental arrangements. The quasi-experimental designs fall somewhere in the middle, providing more protection against internal validity threats than pre-experimental designs but less than that afforded by true-experimental designs. Because most quasi-experimental designs lend themselves to real-world studies of typical (rather than the ideal) subjects or populations, they also offer certain advantages in external validity. Therefore, in many situations, they represent a good compromise for the researcher, particularly when their strengths and limitations are recognized.

Take-Home Points

• The ability to draw valid inferences from data is the cornerstone of research and provides the basis for understanding the new knowledge that research results represent.
• Internal validity reflects the extent to which a manipulated variable can be shown to account for changes in a dependent variable. It is indispensable for interpreting the experiment.
• Ten common threats to internal validity include selection bias, history effects, maturation effects, testing effects, instrumentation effects, statistical regression, experimental mortality, interaction of these factors, experimenter bias, and subject expectancy effects.
• Four threats to external validity (generalizability) are reactive effects of testing, interactive effects of selection and treatment, reactive effects of experimental arrangements, and multiple treatment interference.
• A variety of research designs can be used to evaluate interventions. Each differs in its adequacy for ensuring that valid inferences are made about effects and generalizability.
• The poorest for controlling threats to internal validity are termed pre-experimental designs. These lack adequate control groups.
• The strongest are termed true-experimental designs. They incorporate control groups to which subjects have been randomly allocated but may suffer from lack of generalizability.
• Quasi-experimental designs represent a good compromise when randomization is not

6 Protocol Development and Preparation for a Clinical Trial
Joseph A. Franciosa

J.A. Franciosa, MD
Department of Medicine, State University of New York Downstate Medical Center, Brooklyn, NY, USA

Introduction

A clinical trial protocol is a written document that provides a detailed description of the rationale for the trial, the hypothesis to be tested, the overall design, and the methods to be used in carrying out the trial and in analyzing its results. The protocol represents the means by which a hypothesis will be tested. As such, it must be written in its entirety before the study is performed to help assure the credibility of the results. In addition, it must be prepared in as detailed a manner as possible in order that the elements of the trial can be subjected to constructive critique and that others can replicate it in the future with the expectation of obtaining essentially the same results.

A protocol has a structure and organization made up of elements that follow the conception, development, and conduct of a clinical trial in a chronological fashion. Although these elements vary from protocol to protocol, they typically include the following, in this suggested order: a statement of the background and rationale for the trial; a brief overview of the study design, including the purpose of the study or statement of the hypothesis being tested and the significance of its possible results; a detailed description of the study population, including patient eligibility criteria; implementation of the intervention, study-specific visits, and observations made; a plan for safety monitoring, including reporting of adverse events; ethical considerations; a description of data management plans, including methods of data generation, recording and processing; and statistical considerations, including a detailed description of the study design.

The purposes of this chapter are to briefly describe the clinical trial and to discuss, in depth, the various stepwise components of the protocol structure and organization that guide it. This chapter will focus primarily on protocols for conducting trials in human subjects or patients, with special emphasis on randomized controlled clinical trials that test specific hypotheses. Protocols for other types of clinical research (e.g., epidemiological studies) or for preclinical research (e.g., animal or laboratory bench studies) will not be specifically addressed here, though many of the principles of clinical trials generally are applicable. Though there is ample published information available about protocol development for clinical trials, much of this is dispersed throughout websites, institutional guidelines, proceedings, literature, books, and software and may be difficult to locate [1, 2].
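
For readers who keep draft protocols in electronic form, the suggested ordering of elements can be treated as a simple checklist. The sketch below is illustrative only: the short section labels are paraphrases chosen for this example, the function name `review_outline` is hypothetical, and real protocols may legitimately deviate from this sequence.

```python
# Suggested order of protocol elements, paraphrased from the Introduction.
# The short labels are assumptions made for this example.
SUGGESTED_ORDER = [
    "background and rationale",
    "overview of study design",
    "study population",
    "intervention, visits, and observations",
    "safety monitoring",
    "ethical considerations",
    "data management",
    "statistical considerations",
]

def review_outline(sections):
    """Return (missing, out_of_order) for a draft protocol outline."""
    normalized = [s.strip().lower() for s in sections]
    missing = [s for s in SUGGESTED_ORDER if s not in normalized]
    # Recognized sections, in the order the draft presents them.
    present = [s for s in normalized if s in SUGGESTED_ORDER]
    # The same sections, in the suggested order.
    expected = [s for s in SUGGESTED_ORDER if s in present]
    out_of_order = [s for s, e in zip(present, expected) if s != e]
    return missing, out_of_order

# Hypothetical draft outline with two sections swapped and three absent.
draft = [
    "Background and Rationale",
    "Study Population",
    "Overview of Study Design",
    "Safety Monitoring",
    "Statistical Considerations",
]
missing, out_of_order = review_outline(draft)
print("Missing:", missing)
print("Out of order:", out_of_order)
```

A check of this kind only flags omissions and ordering; it says nothing about whether each section is sufficiently detailed to permit critique and replication.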


Background, Rationale, and Overview of Study Design

Background and Rationale

The background and rationale section of a protocol is a brief but comprehensive introductory section that should provide a compelling argument to justify the proposed research. Some key components of this section are shown in Table 6.1. It should succinctly summarize what has been done by the investigators and others in the specific and related areas of research, it should highlight what deficiencies exist and what additional information is needed, and it should state how the proposed research will address those needs. It is important to stress the unique characteristics of the proposed research, which may involve new methods, unique patients, a new intervention, an innovative study design, or other new approaches that distinguish the proposed research and warrant its conduct. This should all logically flow to a concise statement of the hypothesis addressed by the proposed research and be concluded by a statement of the significance of the anticipated results, whether they confirm or fail to confirm the hypothesis.

The importance of this section cannot be overstated as it provides the first impression of the investigators to reviewers, funding agencies, and others who may have to approve or support the proposed research. It offers these others a glimpse of the investigators' thought processes, their analytic and synthetic abilities, the thoroughness of their methods, and their objectivity. Finally, it should be written in a style that is suitable to both scientific and nonscientific lay persons who may be members of reviewing and approving bodies.

Table 6.1 Components of the background and rationale section of the clinical trial protocol
General description of the disease being treated/managed and why improved treatment/intervention/management is needed
Description of current treatment/management of the disease/condition and any problems with available therapy/management
Description of known properties of the proposed treatment/management intervention that justify its use
Brief summary of relevant preclinical and clinical experience with the proposed treatment/management intervention
Rationale for the current study and its role in the overall research program
Statement of the hypothesis and objectives of the proposed research
Brief description of the significance of the study

Statement of Hypothesis

The hypothesis (described in detail in Chap. 3) must be asserted early in the protocol. Therefore, we offer a few key points here on how it should be stated in the protocol.

The introductory section should logically lead to a statement of the hypothesis of the proposed research. This section is the key to the entire protocol as it describes the purpose of the trial and guides the rest of the protocol which, subsequently, is developed to provide the details about the methodology to be used in assessing the stated hypothesis. In other words, the hypothesis addresses the primary question by providing a tentative answer. The rest of the protocol describes how the hypothesis will be tested to provide a more definitive answer.

The section stating the hypothesis or hypotheses (there may be more than one in a given study) typically begins with a broad description of the overall goal of the research within the context of the investigators' overall research program. For example, the investigators may have an interest in seeking new treatments for a given disease, and the broad purpose of the proposed study is to test a new drug for treating that disease. The broad purpose in this case is an attempt to answer a new question. In some situations, the broad purpose might be to confirm previous preliminary work in the field in a larger or different patient population. The overall purpose might also be preliminary in nature as a proof-of-concept study to assess whether a hypothesized pathogenetic mechanism plays an

important enough role in a disease such that it might be a therapeutic target.

In addition to stating the broad programmatic goal of the proposed research, the statement of the hypothesis also presents a more specific broad objective of the research followed by some more detailed specific aims of the research. For example, a broad objective might be to test the hypothesis that a new drug improves symptoms in patients with the disease of interest to the overall research program. The specific aims might be to determine whether certain of those symptoms improve by a specified amount over a specified period of time without producing major side effects. The specific aims typically include major outcomes (primary endpoint[s]) that essentially drive the study design and other outcomes of lesser importance (secondary endpoints) that provide supportive information, as will be discussed in greater detail below.

The statement of hypothesis should be succinctly phrased and should provide a basis for the overall study design being employed to test it, i.e., to determine whether the hypothesis is supported by the study results. As noted in Chap. 3, the operational restatement of the hypothesis should, at minimum, clearly identify the patient population, intervention (if any), primary endpoint, key methods, duration, and anticipated outcomes.

Significance of the Research

The Introduction should conclude with some discussion, even if largely speculative, about the significance of the proposed research and its possible outcomes. If the hypothesis is confirmed, what does that mean in terms of the initial objectives? Is it conclusive or does it indicate a direction for future research? Results which are not confirmatory may lead to outright rejection of the hypothesis or may imply a need for modification of the research approach. Finally, some findings of the study may generate new hypotheses to be addressed by future research.

Overview of Study Design Summary

It is common practice and helpful to include an overall summary or synopsis of the study design before embarking on the detailed discussion of the various protocol components that will ensue. This summary is especially useful to certain reviewers, e.g., research administrators, funding agency officials, or institutional review board (IRB) members, who may not be scientists or may not require the level of detail of the full protocol in order to perform their specific review or critique functions. Thus, this section is typically very brief and to the point, as details of everything addressed here will be provided in the sections that follow. Table 6.2 shows the key components of this summary.

Table 6.2 Components of the study design summary
Statement of study type (e.g., controlled clinical trial)
Overview of study design
  Parallel-group, crossover
  Level of blinding (e.g., open-label, single-blind, double-blind)
  Method of treatment assignment (e.g., randomization, stratification)
Statement of treatment/intervention to be used
  Investigative drug or device
  Dosage of drugs or usage of devices
  Type of control (e.g., placebo, active drug, no treatment)
Description of study population
  Planned sample size
  Source of patients
  Number of centers
  Note any unique patient characteristics (age, race, sex) required
Description of the disease or condition being studied and any characteristics of that disease/condition that might affect patient eligibility or study outcomes
  Duration
  Severity
  Treatment
Sequence and duration of study visits
Description of study endpoints

The summary should include a statement of the nature of the study design (e.g., whether it is controlled or uncontrolled, parallel or crossover,

blinded or unblinded, and the number and nature of treatment arms). A brief description of any randomization methods should be provided (the details of which should be given in the Statistical Considerations section). It also should indicate the number of centers involved (single or multicenter), total number of patients to be enrolled, and the geographical area included, e.g., United States, North America, Europe, China, or a region of a country. The study population should be characterized, especially any unique demographic characteristics, e.g., women only, African-Americans only, or a certain age group. In addition to patient demographics, a brief description of the underlying disease or condition being studied should be included, along with any important information about the current status, duration, severity, and treatment of the condition that might affect patient eligibility as well as outcomes. The active intervention being tested, along with any control interventions, should be briefly described. In addition, the frequency and duration of the intervention should be stated along with the total study duration, which may be longer than the intervention period. Finally, the primary study endpoint should be described along with a statement about how it will be assessed, when it will be assessed, and how often it will be assessed. Key secondary endpoints may be simply listed.

Endpoints

It is desirable to present the study endpoints early in the protocol, as these tend to drive the rest of the study design, which is developed to measure an effect on those same endpoints. Thus, the sample size, methodology, duration of study, and analytical methods are all influenced by the choice of endpoints.

The endpoints are defined as primary and secondary. The primary endpoint is usually a single one, though it may include two endpoints, or may consist of a single composite endpoint made of two or more components. It is important to strictly limit the number of primary endpoints, as attempts to address multiple primary endpoints almost invariably lead to methodological inconsistencies and difficulties, resulting in a trial that fails to achieve any meaningful result in terms of primary endpoints. The primary endpoint(s) should be specifically defined, along with an explanation of how and when it will be measured. The secondary endpoints may be more numerous than the primary ones. They may represent additional measures of efficacy or safety but also may be included for other reasons such as exploration of mechanisms, particular safety concerns, and development of data for future research. The secondary endpoints also should be specifically defined, and the timing and methodology of their measurements should be briefly stated.

Factors considered in the selection of endpoints (especially the primary endpoints), such as relevance, practicality, acceptability, validation, and experience, should be discussed. Clearly, it is necessary to establish that the endpoint chosen is relevant to the patients and conditions being studied; that is, it addresses real and significant needs such as improving symptoms, survival, diagnosis, or other outcomes. In addition, the endpoints should be practical, not only by addressing real needs but by utilizing readily applied methods of objective measurement. Furthermore, the methods used must be acceptable to both investigators and patients in terms of ease of application, safety, comfort, and cost. Optimally, they should be standard methods that are appropriate for the group under study to avoid the necessity of validating them, which usually must be done in separate preliminary studies [3]. Validation involves establishing (via the literature or the investigators' own work) that the proposed methods perform as intended in both the patients and conditions being studied. The investigators must indicate that they have sufficient experience with the successful use of the proposed methods. Finally, it is critical that there be a consensus regarding study endpoints among all investigators, study administrators, and committees before the study starts in order to avoid disputes when the final results become available [4]. Table 6.3 lists guidelines for describing the key components

of primary study endpoints; secondary endpoints should follow this same
sequence, though with less detail.

Table 6.3 Primary study endpoints
- State the primary study endpoint(s)
- Briefly mention the appropriateness and relevance of the endpoint
- Describe the methods, timing, and frequency for assessing the endpoint
- As needed, describe any special personnel performing the assessment (e.g.,
  an unblinded assessor in a double-blind study)
- Additional details about collecting endpoint data may need to include:
  - Details about the use of subjects' diaries
  - Any instructions on timing/conditions of
  - Details about unusual collection, storage, or analysis of laboratory
    samples
- Provide information about the standardization and validation of the methods
  to be used for endpoint
- Describe the investigators' experience with the methods to be used
- As needed, describe any training that might be required in using the methods
  for endpoint measurements

It should be noted that endpoints, as discussed above, refer primarily to
clinical trials. Other kinds of studies, such as nonprospective observational
studies that evaluate associations or distributional characteristics (e.g.,
prevalences) rather than intervention effects, may not employ endpoints as
described above for their study objectives. Observational studies are
discussed in greater detail in Chap. 4.

Study Population

This section is a detailed description of the patients/subjects to be included
in the study and should provide a broad description of the study population,
the source of patients, and a comprehensive listing of the inclusion
(eligibility) and exclusion criteria for study participation.

Although the terms "patients" and/or "subjects" often are used interchangeably
or may be established according to convention of the sponsoring group, we
prefer to use the term "patients" for those individuals with a medical
diagnosis or condition that is the target of the proposed research. We reserve
the term "subjects" for normal healthy individuals who typically are included
in some studies as the control population but who also may represent the
primary population, e.g., in studies of the clinical pharmacological
properties of a new drug before it is given to patients.

General Description of the Study Population

The study population should be described in terms of its general demographics,
as well as the characteristics of the disease or condition being studied that
the patients should have, along with the number of such patients that will be
recruited and enrolled. The demographic characteristics typically describe the
sex and age group of patients and, if appropriate, their race. If any of these
characteristics are particularly restrictive, the reason for that restriction
should also be given. For example, if one is studying only Asian females in
their 20s, the reason for focusing on that population should be presented. In
many instances, this may have been addressed in the introductory sections and
need not be gone into in great detail in this section. The selection of these
demographic characteristics (especially age) should not be taken lightly, as
they may have important effects on adverse events and study outcomes [5]. In
fact, it has been suggested that these kinds of patient characteristics may
impact study results more than other features of the study design itself [6].
These characteristics will be expanded upon in greater detail as needed in the
list of inclusion/exclusion criteria, as discussed below. The medical
condition these patients must have in order to participate in the study also
should be described in terms of its diagnostic criteria, duration, etiology
(if appropriate), treatment, present status, and severity. If normal subjects
are included, then operational criteria for defining the normal subject also
must be presented. Subjects may be required to be completely normal, with no
significant past or current medical conditions, especially if these subjects
constitute the primary study population. If normal subjects are included as a
control group, they may only be required to be relatively normal, i.e., they
should not have the same disease as the other patients in the study. These
disease characteristics will be expanded upon in greater detail in the list of
inclusion/exclusion criteria. This section also should include a description
of the number of patients to be studied. Whereas a sample size estimate
typically is included in the statistical analysis section (see below), that
estimate usually refers to the number of patients needed to complete the
study. Since, typically, some patients fail to complete a trial for several
different reasons, it is necessary to try to estimate the total number of
patients that will be recruited in order to achieve the number needed to
complete the trial. Depending on the disease, study population, and treatment,
patients may drop out of the trial for many reasons, including death and side
effects of the treatment. In addition to these reasons, which will vary, some
patients withdraw consent, move, or just never return for follow-up. The
investigator must make every attempt to estimate the number of expected
dropouts and decide what to do about them, i.e., to replace them or not in the
study. It is critical to estimate the number of patients that need to be
recruited not only in order to achieve the desired number of study completers
but also to properly estimate resource needs, e.g., study medications, case
report forms, and laboratory supplies.

Patient Sources

The techniques to be used for recruiting patients for the trial should be
discussed in detail in this section. One should describe the number and
location of investigative sites that will provide patients and/or participate
in the trial. Not all sites may actually have study investigators; some may
serve only as sources that will identify and refer patients to an
investigator's site. Methods to be used for finding patients should be
described. These may include various ways of publicizing the study, ranging
from notices within the local institution to advertising in various media.
These techniques and the individuals responsible for implementing them should
be described. It also is necessary to describe how patients, once identified,
will be further screened and by whom. A detailed description of the screening
process to determine eligibility should be included, listing the specific
initial parameters that will be used preliminarily to identify potentially
eligible patients. It is common practice to identify patients who meet initial
screening criteria by history, then follow them for a brief interval to
determine whether they subsequently meet all study eligibility criteria. For
example, in a study of treatment of hypertension, patients initially may be
screened on the basis of having a history of hypertension or of having a
single reading of elevated blood pressure. Typically, such patients would be
followed for a limited period to see if they, in fact, do currently have
hypertension.

The location of screening procedures should be specified. This could involve
screening of clinic records, emergency room logs, diagnostic laboratory
reports, etc., depending on the population being sought. For example, in a
study of patients with documented coronary artery disease, one might screen
the cardiac catheterization and intensive care unit logs. The protocol should
describe who will do this, when it will be done, and how it will be done.
Unlike some sections of the protocol (e.g., endpoint definitions, patient
inclusion/exclusion criteria, and analytic methods to be used), the screening
procedures are not carved in stone and may be modified as needed.

For a more detailed description of recruiting techniques and the many issues
that may become involved, the reader should consult standard references and
the medical literature [1, 7–11].
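
The recruitment estimate described under the general description of the study
population can be illustrated with a short calculation. The sketch below is
only one simple way to do it and is not a method prescribed by this chapter:
it assumes a single overall dropout percentage, assumes that dropouts are not
replaced, and uses a hypothetical function name (patients_to_recruit) and
hypothetical numbers.

```python
def patients_to_recruit(completers_needed: int, dropout_percent: int) -> int:
    """Estimate how many patients must be recruited so that, after the
    expected dropouts, about `completers_needed` patients finish the trial.

    Assumption (not from the chapter): one overall dropout percentage
    applies to every recruited patient and dropouts are not replaced.
    """
    if completers_needed < 0:
        raise ValueError("completers_needed must be non-negative")
    if not 0 <= dropout_percent < 100:
        raise ValueError("dropout_percent must be between 0 and 99")
    retained_percent = 100 - dropout_percent
    # Integer ceiling division: always round up, because a fraction of a
    # patient cannot be recruited.
    return -(-completers_needed * 100 // retained_percent)


# Hypothetical example: the sample size estimate calls for 200 completers
# and roughly 20% of recruited patients are expected to drop out.
print(patients_to_recruit(200, 20))  # 250
```

The same recruited total, rather than the number of completers, would then
feed the estimates of study medication, case report forms, and laboratory
supplies.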

Inclusion/Exclusion Criteria

A list of all inclusion and exclusion criteria to be used in determining
eligibility of patients for the trial must include a detailed description of
all the requirements a patient or subject must meet to be eligible for
enrollment in the trial, along with a detailed description of all variables
that would render the patient ineligible for enrollment. Each patient enrolled
must satisfy all of the inclusion criteria and none of the exclusion criteria,
without exception, in order to be enrolled in the trial.

It is important that the list be very detailed, leaving no ambiguities for the
study personnel who must use the list to screen for potential patients. Thus,
specific criteria, along with any relevant methods needed to apply them,
should be provided for each item on the inclusion/exclusion list. For example,
eligibility for inclusion in a trial of antihypertensive treatment might
require that the patient have a systolic blood pressure above 140 mmHg or a
diastolic pressure above 90 mmHg, as determined by the average of 3 readings
taken 5 min apart with the patient seated and using a standard
sphygmomanometer. In addition, if the study is to include untreated patients,
the exclusion criteria might state that a patient may not be included if
he/she has received any antihypertensive drugs within the past 6 months,
specifically any diuretics, beta-blockers, calcium blockers, ACE-inhibitors,
angiotensin-receptor blockers, or alpha-blockers. For other agents with
possible antihypertensive activity, the investigator must obtain approval of
the study chairperson before enrolling the patient. It is extremely important
that this list be as comprehensive and detailed as possible since it will
serve as a checklist for many of the personnel involved in the study,
including those doing the screening, designing the case report forms,
developing the database, analyzing the results, and auditing the study's
conduct.

Since the inclusion/exclusion criteria are critical to defining the study
population (whose characteristics, in turn, may greatly impact the study
results), it is mandatory that these criteria be carefully thought through and
decided upon prospectively. It is highly undesirable to change these in any
way after the study has started, as such post hoc changes may introduce bias,
thereby impacting the results and their interpretation, and raise doubts about
the validity of the study in adequately testing the original hypothesis.
Occasionally, circumstances can arise that may mandate a change in patient
eligibility criteria, but these are rare and usually involve ethical issues.
For example, an effective new treatment may become available for some or all
of the patients in the trial, making it potentially unacceptable to leave them
on a placebo or on an unproven treatment. It generally is unacceptable to
change eligibility criteria simply because the investigators have found it
extremely difficult to recruit patients meeting the current criteria. In such
instances, it may be wiser to terminate the study and start a new one with
different eligibility criteria. Obviously, such decisions have important
consequences and should never be taken lightly.

Table 6.4 shows some items to be considered when constructing an
inclusion/exclusion criteria list. The first requirement is written informed
patient consent. Without this, it would not be permissible to proceed to the
subsequent criteria, which require obtaining confidential medical historical
information from the patient. The criteria list follows a structured
progression from the general demographic characteristics to those that are
more related to a specific disease, and concludes with criteria that relate to
characteristics that might confound the conduct or outcomes of the study or
that might impair the patient's ability to complete the study. The exclusion
criteria often mirror the inclusion criteria by stating the converse of the
corresponding inclusion criteria, thereby providing a means of double checking
the patient's eligibility. For a more detailed discussion of how to construct
an inclusion/exclusion list, the reader should consult standard references
[1].
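
The hypertension example in this section can be written out as an explicit
checklist to show how "all inclusion criteria and no exclusion criteria"
works in practice. The following is a minimal sketch under stated
assumptions, not a validated screening tool: the class and function names
are hypothetical, only the criteria named in the example are modeled, and
"within the past 6 months" is interpreted as fewer than 6 months since the
last dose.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional, Tuple

SYSTOLIC_LIMIT_MMHG = 140   # inclusion: mean systolic must be above this
DIASTOLIC_LIMIT_MMHG = 90   # ...or mean diastolic must be above this
WASHOUT_MONTHS = 6          # exclusion: antihypertensive drug use more recent than this


@dataclass
class Candidate:
    """Hypothetical screening record for one patient."""
    consented: bool                                  # written informed consent obtained
    readings: List[Tuple[int, int]]                  # (systolic, diastolic) in mmHg
    months_since_antihypertensive: Optional[float]   # None means never treated


def has_qualifying_pressure(readings: List[Tuple[int, int]]) -> bool:
    """Apply the blood pressure criterion to the average of 3 seated readings."""
    if len(readings) != 3:
        raise ValueError("the criterion is defined on exactly 3 readings")
    mean_systolic = mean(systolic for systolic, _ in readings)
    mean_diastolic = mean(diastolic for _, diastolic in readings)
    return mean_systolic > SYSTOLIC_LIMIT_MMHG or mean_diastolic > DIASTOLIC_LIMIT_MMHG


def is_eligible(candidate: Candidate) -> bool:
    """Eligible only if every inclusion criterion and no exclusion criterion is met."""
    # Consent comes first: without it no further information may be collected.
    if not candidate.consented:
        return False
    inclusion_met = [has_qualifying_pressure(candidate.readings)]
    exclusion_met = [
        candidate.months_since_antihypertensive is not None
        and candidate.months_since_antihypertensive < WASHOUT_MONTHS
    ]
    return all(inclusion_met) and not any(exclusion_met)


untreated = Candidate(True, [(150, 85), (146, 88), (148, 86)], None)
recently_treated = Candidate(True, [(150, 85), (146, 88), (148, 86)], 3)
print(is_eligible(untreated))         # True
print(is_eligible(recently_treated))  # False
```

Keeping each criterion as a separate entry mirrors the checklist role the
text assigns to the inclusion/exclusion list: screeners, database designers,
and auditors can all trace a decision back to a single named criterion.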

Table 6.4 Patient eligibility considerations

Category: Patient characteristics
  Inclusion criteria:
    - Provision of written informed consent
    - Demographics (age, sex, race)
    - Body weight
    - Pregnancy or childbearing potential
    - Behaviors (alcohol, smoking, activity, diet)
    - Mental status
  Exclusion criteria:
    - Failure to provide written informed consent
    - Hypersensitivity/intolerance to study interventions
    - Medical history (current or preexisting conditions and treatments)
    - Allergies/food intolerance
    - Occupational risk/hazard
    - Breast feeding

Category: Characteristics of disease being studied
  Inclusion criteria:
    - Diagnostic criteria
    - Duration
    - Etiology
    - Status/severity
    - Treatment (required, permitted)
  Exclusion criteria:
    - Nonpermitted treatment
    - Status/severity that might bias results
    - Confounding concomitant conditions/complications

Category: Screening examinations
  Inclusion criteria:
    - Within limits specified
    - Meets all run-in requirements (compliance,
  Exclusion criteria:
    - Outside of limits specified
    - Fails to meet run-in criteria

Category: Other factors
  Inclusion criteria:
    - Cooperative attitude
    - Occupation
    - Availability for all study requirements for full duration
  Exclusion criteria:
    - Inability to perform study requirements/procedures
    - Lack of availability
    - Increased risk of lack of cooperation
    - Current/recent participation in another clinical trial

Implementation of the Intervention, Study Visits, and Observations

Once the study endpoints and population have been defined, one must provide
the detailed methods by which the actual study data will be generated. This
section provides all interested parties with a precise description of how the
patients will be entered into the study, how they will be started and followed
while on the study intervention, how and when required observations will be
made, and when and how the patient's participation in the study will be
terminated.

Study Initiation

After eligible patients have been defined, screened, identified, and
consented, they are ready to be enrolled in the study and begin their active
participation. Depending on the specific study design and intervention,
patients may be started immediately in the active phase of the study or may be
observed during a preliminary phase (run-in period), before entering the
active phase of the study.

Run-In Periods

It is common to have patients enter a run-in period between the time that they
qualify for a clinical trial and the time that they begin active involvement,
i.e., are started on the actual study intervention. There are several reasons
for using a run-in period. Common reasons include establishing final patient
eligibility, demonstrating stability, and assessing compliance. Not all
inclusion/exclusion criteria may be completely available for assessment at the
time of screening, especially if there is a requirement for recent laboratory
or diagnostic information. A run-in period just before starting a new drug or
a special procedure may be used to allow for obtaining any assessments that
must be current (e.g., an echocardiogram to document presence of abnormal
cardiac function) to confirm that the patient
actually has the medical condition required for study participation. A run-in
period also may be used to demonstrate that a patient has the required status
of the condition being studied. For example, it may be required that a patient
have stable symptoms while taking all standard treatment for the condition in
order to minimize difficulty in interpreting changes in the patient's
condition after starting active treatment. If the patient was not stable or if
other treatments were started after the study intervention, it would be
extremely difficult to assess the cause of a change in the patient's
condition. Another common reason for using a run-in is to assess the
tolerability of the study intervention. A patient may have difficulty
complying with an intervention if it produces significant side effects or is
difficult to administer. Furthermore, patient compliance may be influenced by
other patient conditions or behaviors, e.g., substance abuse or alcoholism. A
run-in period may be useful to assess the patient's likelihood of complying
with and completing all study requirements.

Treatment during run-in periods may vary. If the purpose is only to acquire
final inclusion/exclusion information, no treatment may be needed. Obviously,
if the purpose is to assess stability and/or compliance with an intervention
such as a study drug, it would be necessary that it be given according to the
same regimen that would be used in the active phase of the study. This phase
usually involves either active study intervention in all patients if its
purpose is primarily to assess tolerability or placebo in all patients to
assess patient compliance for reasons other than tolerability of the
intervention. Clearly, the patient is kept blinded to treatment if the active
phase is to be double-blinded.

Finally, the duration of the run-in periods should be as short as possible,
typically not more than 2–3 weeks. In general, less time is needed to obtain
laboratory tests, and more time would be needed to assess tolerability or
compliance. The problem with excessively long run-in periods is that patients
may change during this time. In cases where a run-in period has had to be
extended, it is common practice to terminate that patient from the study at
that point and restart him/her as a new patient in the screening phase.
Another potential risk and criticism of run-in periods is that they may
introduce bias by selecting the better responders to the active study
intervention [12].

Start of Study Treatment/Intervention

Once all inclusion criteria are satisfied and no exclusion criteria are met,
whether at the end of screening or after a run-in period, the patient is ready
to initiate study-mandated activities. At this time, the patient will be
assigned his/her study treatment or intervention. If the study is not
controlled, the patient will be started on the study intervention. If the
study is controlled, the patient is randomized to his/her study treatment. The
method of randomization, e.g., consulting a list, opening an envelope, or
contacting a central randomization center, should be briefly described here.
If the intervention being evaluated in the trial includes pharmacological
therapy, the study drug may also be dispensed at this time or arrangements may
be made for procuring it. The patient should be given any applicable
instructions at this time and scheduled for the next clinic visit. Typically,
the details of the randomization technique, and the administration and
management of the intervention, respectively, are provided in the statistical
and administrative sections of the protocol.

Schedule of Visits and Observations

The protocol must provide a schedule of patient visits with details about when
these will be conducted and what information will be collected at each visit.
This section is used and closely adhered to by study personnel, much as a
recipe is followed by a cook. Study visits typically consist of a baseline or
study initiation visit, follow-up interim visits, a final on-treatment study
visit, and a post-study follow-up visit. It is important to specify the timing
of these visits, with a window of plus or minus a small number of days, if
possible, to allow the patient some
flexibility in scheduling appointments. Typically, the time is set relative to
randomization or baseline, i.e., at some time ± a time window following the
date of randomization or the baseline visit. The observations recorded at each
visit often are variable, with fewer items observed at interim visits.

Baseline Visit

The baseline visit is performed at or very close to the time when the patients
are randomized to study treatment/intervention, whether or not that
treatment/intervention has actually been instituted. This is a critical visit
as all observations recorded at this time will be the basis for comparison
with all observations made while on study treatment. Thus, a complete medical
history and physical examination usually are performed, along with laboratory
tests. All concomitant medications are recorded with details about dose and
duration of administration. In addition to this general medical examination,
there is usually information collected that is specific to the status of the
medical condition being studied, e.g., its duration, severity, history of
complications, current symptoms and status, and current treatment. Any special
tests, assessments, or procedures relating to study endpoints are carried out
at this visit or are scheduled to be obtained very soon after this visit, if
not already done. One cannot overemphasize the importance of all baseline
determinations. They must be thorough and comprehensive, as any medical and/or
laboratory findings that appear later must be ascribed in some way to study
participation if they were not present at baseline. In trials evaluating
experimental medications, the baseline visit is concluded by dispensing any
study drugs or other required materials to the patient, scheduling the next
visit, and arranging for any procedures or tests needed for the next visit.

Interim Visits

Following the baseline visit, the patient is seen at intervals specified in
the protocol to occur at some set time, e.g., every 3 months ± 1 week from the
date of the baseline visit. These interim visits primarily are intended to
monitor the patient's progress and his/her tolerability of the study
intervention. A brief medical history and physical examination are carried
out, with the emphasis on looking for any adverse events or findings.
Information on one or more study endpoints may be collected, but not
necessarily the primary endpoint, especially if that involves a special
procedure, e.g., cardiac catheterization, which might be done only at the end
of the study or once during an interim visit. In trials evaluating
medications, patient compliance usually is assessed, typically by having the
patient bring any unused study medications with him/her and calculating the
percentage of pills taken relative to those prescribed. The interim visit also
is concluded by dispensing any study drugs or other required materials to the
patient, scheduling the next visit, and arranging for any procedures or tests
needed for the next visit.

Of course, patients may develop complications and may need to be seen between
scheduled visits. All clinical trials must include provisions for patients to
be seen by physicians who may be associated with the study in order to deal
with clinical necessities whether or not a visit is specifically related to a
protocol-based assessment. The reasons for, and findings obtained during, any
unscheduled visits must be recorded as study data on appropriate forms.

Final Visit

The final visit is the last one during which the patient is still receiving
the study intervention. Its observations are essentially the same as those
obtained at the baseline visit and are just as critical since they represent
the study results and outcomes that will be compared to those from the
baseline visit. In addition, the same kind of information collected at the
interim visits is obtained to cover the interval since the preceding interim
visit.

Whereas a final visit is obtained routinely in all patients at the end of the
study, it may be necessary to perform a final visit if a patient terminates
his/her study participation prematurely, as might happen for intolerable side
effects or other
reasons. In such cases, every attempt must be made to have the patient return
and perform all the procedures and observations required at a regularly
scheduled final visit. Without this, that patient's entire dataset may be
useless, excluding the patient from the study analysis. In most instances,
final visit data obtained even prematurely may still be analyzable and allow
the patient to be included in the results.

At the end of the final visit, study drug/intervention is terminated, and the
patient is scheduled for a study follow-up visit.

Post-study Follow-Up Visit

Often by regulatory requirement, but more in the interests of good clinical
practice, patients should be seen at least once after completing their study
participation to ensure that they are not experiencing any sequelae that might
be attributed to their study involvement. Such visits usually are scheduled at
1 week to 1 month after the final on-treatment study visit, depending on the
possible duration of effects of the study intervention. (As used here, the
term "on treatment" means the patient is still receiving a study-mandated
intervention, regardless of whether he/she is receiving active therapy or an
inactive control substance [or other control condition].) In some instances,
especially by regulatory requirements, it may be mandatory to attribute any
side effects or complications occurring during this period to the study
intervention. These post-study visits also are of value in helping to document
patient status and to protect all study personnel and institutions in the
event of any future allegations stemming from the patient's study involvement.
It is strongly suggested that a flow chart of all scheduled visits and related
procedures be included, a template of which is shown in Table 6.5.

Table 6.5 Template schedule of study events in protocol no. XXXX

Screening Baseline Treatment period Follow-up
Evaluation (Day xx) Day 0 Day # Day # Day # Day # Day #
Informed consent X
Inclusion/exclusion criteria X X
Demographics X
Medical history X
Full physical examination X X X
Partial physical examination X X X X
Laboratory tests X X X
Special tests/procedures X X X
Randomization X
Vital signs and weight X X X X X X X
Study intervention administration X X X X
Adverse event assessment X X X X X X X
Concomitant medication assessment X X X X X X
Terminate study drug X

Data Management

A clinical trial, along with its data generation and acquisition, is driven by
the thoroughness and objectivity of the research protocol. The research data
to be generated, collected, processed, and stored in the clinical database
must support the objectives of the study, as specified in the protocol. This,
in turn, relies on designing data management processes that correctly capture
the required research data. All data generated by the trial must be captured
and managed to ultimately yield the results of the trial. Data management has
been enhanced dramatically in recent years as a result of technological
advancements including computerization of databases, bioinformatics, and
Internet applications to facilitate
acquisition and processing of data [13–15]. As a consequence, modern data
management processes involve specialized personnel and methods, which are
discussed in detail in Chap. 7. For all these processes to be properly carried
out, it is necessary that a detailed, comprehensive, and unambiguous protocol
be developed, as the protocol drives the data management processes, which tend
to follow the protocol in a chronological fashion. Obviously, the tools used
for data collection will be developed in accordance with protocol
specifications. Ideally, data management processes should be developed in
advance of data collection because post hoc changes potentially introduce a
risk of bias, threatening the validity and credibility of the results, as
noted above.

The data management plan closely follows the structure and sequence of the
protocol. A well-written data management section will provide detailed
descriptions of each data item to be collected, how it will be collected, and
when it will be collected. The data management group must work very closely
with the team that is preparing the actual protocol to help ensure that all
the data described are readily obtainable, complete, unambiguous, objective,
and easily programmable and quantifiable. Furthermore, it must be ascertained
that all of the data generation methods are generally accepted and that the
research team is adequately experienced in using these methods so as to help
ensure reliability and validity of the data obtained.

Whereas trials generally try to limit the amount of information collected to
that which is necessary to obtain valid results, it is common to collect
additional information, especially at baseline, because this is the last time
one can make observations before the effects of the trial interventions come
into play. Just being in a clinical trial may affect patient outcomes because
of the level and frequency of care provided (see also Chap. 5). It is critical
that every attempt be made to capture all the required data at the times
specified by the protocol, as incomplete, inaccurate, and/or missing data can
undermine the reliability and credibility of results. The completeness,
accuracy, and timeliness of data collection are key indicators of the quality
of conduct of the study; as such, they are commonly audited after study
completion to help ascertain the validity and reliability of the study
conclusions.

Safety Monitoring Procedures

A complete protocol should describe all procedures that will be in place to
ensure and assess the safety of study participants. Whereas much of this
information already is included in different parts of the protocol, e.g., on
the schedule of visits and procedures, it is recommended that a specific
section be devoted to summarizing all safety monitoring procedures. It should
summarize how often patients will be seen, that an interim history and
physical examination will be performed, and that laboratory tests will be
obtained. It is important to point out any special visits, examinations,
tests, or procedures that will be conducted specifically to look for known
side effects of the treatment. For example, liver function tests would be
obtained in a trial of a new drug suspected of possibly producing liver
toxicity, or the eyes would be examined often in a trial of an intervention
that could potentially be associated with cataract formation.

In addition to describing what will be done and how often, it is important to
specify who is responsible for carrying out these procedures and what will be
done with the information in case something is found, i.e., instructing the
investigators whom to contact, how to establish contact, and the timeframe for
making contact. It is important that all study personnel know what constitutes
an adverse event or serious adverse event. These are not simply clinical
impressions but are specifically defined by regulations. These regulations
also establish what information about the adverse event must be collected
(start date, duration, severity, drug dose, concomitant drugs, action taken,
outcomes, etc.) and who must be notified within the specified time frame
(other investigators, IRBs, study administrators, regulatory agencies, etc.).
Instruction also should be provided to the investigators regarding possible
discontinuation of the study drug, premature termination of
the patient's study participation, unblinding of any study medication, etc.
It is critical that all study personnel understand that an adverse event is
any undesirable sign, symptom, or medical condition that occurs after starting
study participation regardless of its relationship to the study intervention,
i.e., even if a cause other than the study intervention is present. Any
condition that was present before starting study participation must be
considered an adverse event if it worsened. Furthermore, the seriousness of an
adverse event is not synonymous with its severity or potential outcomes. An
adverse event is considered serious if it (1) is fatal or life-threatening,
(2) requires or prolongs hospitalization, (3) is significantly or permanently
disabling or incapacitating, (4) constitutes a congenital anomaly or birth
defect, or (5) requires medical/surgical intervention to prevent any one of
the preceding. There is no mention of severity or potential seriousness. Thus,
a severe symptom or abnormal laboratory finding that does not meet one of
these criteria is not considered a serious adverse event.
Above all, it is critical that adverse events be looked for, recognized,
recorded, and reported as quickly as possible to the appropriate study
governing personnel to allow any necessary actions to be taken to safeguard
all other study participants.

Ethical Considerations (See Also Chap. 12)

The protocol must state that all patients will provide informed consent prior
to being enrolled in the study. The consent form must be written in language
the patient can fully understand and must contain certain elements. These
include a description of the study; what is expected of the patient; what
risks are involved with any tests, procedures, and treatments; what
alternative treatments are available; and assurance that the patient will be
given the best available treatment for his/her condition whether or not he/she
chooses to participate initially or to terminate prematurely. The patient
should also be informed that his/her study records may be reviewed by the
sponsor, auditors, or other regulatory authorities and that his/her study
information may be used in publications. In any of these instances, the
patient must be assured that his/her identity will be kept strictly
confidential. The process of obtaining informed consent offers an excellent
opportunity to establish good communications and rapport between the patient
and the investigators and, as such, may impact the study outcome [21–23]. It
is important to recognize that consent for study participation contains
important elements that distinguish it from consent to a procedure, be it a
routine clinical procedure or one required as part of the study; thus, consent
to participate in a research study should be obtained separately from other
permissions obtained in caring for a patient [24]. The informed consent form
itself is considered a part of the protocol. The protocol also should contain
a statement that IRB approval will be obtained and that the investigators and
all study personnel will obtain all periodic re-approvals and comply with all
other requirements of that review board.
In addition, the protocol often includes a description of the investigators'
responsibilities regarding patient safety. This description typically points
out the research policies, regulations, and requirements of governmental,
international, institutional, and sponsoring bodies. The investigators are
required to comply with all of these. In addition, the investigators agree to
accept full responsibility for protecting the rights, safety, and welfare of
patients under their care during the study. The principles of good clinical
practice mandate that the investigators provide the best available care,
themselves or by appropriate referral, for any medically related problems that
arise during the study, regardless of their relationship to the study itself.

Statistical Considerations

All protocols should contain a section that describes trial-specific
statistical evaluation plans. For randomized controlled clinical trials, such
considerations typically include (but are not limited to): the specific nature
of the study design
124 J.A. Franciosa

and related issues, the specifics of the randomization procedure and rationale
employed, justification of sample size and associated power (see also
Chap. 11), the statistical analysis planned for assessing primary and
secondary outcome measures, and a statement of the null hypothesis for primary
efficacy comparison. When appropriate (e.g., a randomized controlled trial
evaluating high-risk patients), this section also may articulate
statistically-based stopping rules for premature termination of the study
(e.g., early evidence of efficacy in the absence of safety problems).

Protocol Implementation and Study Conduct

Recent observations suggest that the conduct of certain types of clinical
trials has decreased, raising concerns about adequacy of planning and
implementation. For example, late phase clinical trials represented about 20%
of all clinical trials in 1994 whereas in 2008, they accounted for only 4.4%
of all clinical trials [16]. Possible reasons that may contribute to this
apparent decline include inadequate organization and infrastructure, lack of
coordinated research team effort, and insufficient training [16–18]. No matter
how well a protocol is written, it is of little value if it cannot be
implemented and carried out to completion.

Study Organization, Structure, and Administration

In addition to describing how the study will be done, protocols typically
address issues which help safeguard the well-being of patients during their
study participation, while ensuring the integrity and proper conduct of the
study. Many of the topics discussed in this section are addressed at great
length in other publications and reference materials which the reader should
consult [1, 9]. We will focus here on some of these topics, especially those
that are typically required for inclusion in a protocol by sponsoring
institutions, funding agencies, and regulatory authorities.
Studies typically encounter unforeseen problems and questions during their
conduct. In addition, some potential issues can be foreseen prior to study
initiation; these need to be prospectively addressed so that solutions can be
decided quickly according to plan should they, indeed, arise during the course
of the study. Examples of such issues include endpoint criteria, rules for
early termination of the study, need for protocol changes, etc. It is
important for any study, and is mandatory for multicenter studies, that the
protocol identify those individuals responsible for making decisions about the
study's conduct. Thus, the protocol should specify the individuals and
committees who are responsible for study leadership and charged with making
the kinds of decisions mentioned above.
Multicenter studies should have a chairperson who is empowered to make and/or
delegate day-to-day decisions regarding such things as deciding if a patient
satisfies all inclusion/exclusion criteria or if a patient or center has
violated protocol requirements, etc. In addition, there may be a steering or
executive committee to address broader issues, e.g., protocol changes, and to
address recommendations of any subcommittees. The subcommittees may typically
include an independent data safety and monitoring board (DSMB) that
periodically reviews study data to assess the need for possible premature
termination of the study if a clear benefit or risk appears that makes it
unethical to continue the study. Another subcommittee might analyze study
endpoint outcomes, e.g., cause of death or reason for hospital admission. It
is mandatory that subcommittees and committees prospectively define the rules
and criteria to be used in arriving at any decisions they make and that
information required to satisfy these rules be included as a part of the
protocol. Subcommittees and other committees generally make recommendations to
the steering or executive committee, which has responsibility for making final
decisions based on those recommendations.
In summary, the leadership of the study is responsible for the general
satisfactory conduct of the study in all of its aspects. This includes
resource recruitment and allocation, providing
any training required, ensuring timeliness of patient recruitment, overseeing
data management, and reporting of the results.

Resource Allocation and Management

Key resources include funds, manpower, and supplies. Funding may be available
prior to study initiation in some settings with predetermined budgets, e.g.,
industry. In other settings, funding must be applied for, and its procurement
often depends heavily on the quality of the research proposal and/or protocol.
Once funds are secured, the study leadership must oversee their allocation,
accountability, and continuing availability, as well as identify the
individuals who will be responsible for these matters.
The success of the study also will depend on the availability of sufficient
and qualified personnel to carry out all the required functions. For certain
functions, especially those that might only be required from time to time to
address specific issues that might arise, it may be preferable to use
consultants. For example, if patient recruitment lags, the advice of persons
specialized in recruitment techniques might be sought. It is critical that all
personnel be qualified to carry out whatever responsibilities they are
assigned and that the study leadership provides the proper training needed to
ensure their qualifications.
Availability of all supplies needed to carry out the study is critical and may
be a rate-limiting factor in starting and completing the study in a timely
fashion. Obviously, the study cannot start without materials for gathering and
reporting data, e.g., case report forms (see also Chap. 7). Similarly, study
drugs and/or devices must be available and ready for use, i.e., properly coded
and allocated for a randomized trial. Any supplies for laboratory tests and
study procedures also must be available. Not only is it important that all
supplies be available to start the study, but it also is necessary to assure
that they will continue to be available throughout the study until its
conclusion. A key responsibility of study leadership is to oversee the
individuals responsible for ensuring timely availability of supplies. In
addition, study leaders must be readily available to these same individuals to
try to resolve any supply problems that might arise.
The protocol should contain information about study materials the patient will
need, including study drugs, laboratory kits, questionnaires, diaries, etc.
Information should be provided on who is responsible for procuring and
dispensing these materials, how and where they will be procured, how they will
be supplied (kits, bottles, etc.), how they will be labeled to correctly
identify content and the study patient, and instructions for their use. There
also should be a description of how the supplies will be stored. Finally,
there must be an accurate inventory of all materials, with dates of receipt,
dispensing, names of recipients, etc. There also must be a procedure for
returning study material and recording their receipt. All of these records are
mandatory for accountability of supplies and are subject to strict
regulations, especially when any controlled substances are involved. This
section is critical to the study sponsor who generally provides the materials
and must be able to show that adequate instructions for their correct handling
were provided to investigators.

Recruitment of Study Participants

The recruitment of eligible patients/subjects into the study in a timely
fashion is one of the key rate-limiting processes that has a major impact on
study results. Failure to recruit patients in a timely manner may have serious
consequences by precipitating retrospective protocol changes, such as relaxing
eligibility/exclusion requirements or modifying procedures and observations.
Any such changes can significantly affect the study and potentially undermine
its original intent and capacity to properly test the study hypothesis, thus
yielding results that may not be valid and conclusive relative to the original
intent. Failure to recruit patients quickly enough in sufficient numbers can
lead to early termination of the study itself as well as discontinuation of
its funding, thereby jeopardizing the power of the trial to
achieve its projected sample size needed to achieve statistically conclusive
results.
Techniques for recruiting study subjects vary considerably and represent a
specialized topic in and of itself [1, 19, 20] that is beyond the scope of
this chapter. The study leadership must identify the individuals responsible
for recruitment and provide them with adequate resources and training for
whatever recruitment techniques are employed. The specific techniques to be
used should be spelled out in detail in the protocol. Numerous recruitment
techniques are available and include screening subjects from (1) the local
research site (office, clinic, hospital, etc.), (2) collaborating local sites,
and (3) collaborating regional, national, and/or international sites. Within
each of these sites, local areas of interest must be identified, e.g., office,
laboratory, and emergency room. Screening-type trials seeking large or broad
populations of subjects may establish recruitment centers in churches,
schools, supermarkets, shopping centers, commercial establishments, etc., to
identify appropriate patients. In addition, advertising through various media
should be utilized to reach potentially eligible participants. Other sources
are colleagues, bulletin board notices, direct mailings, and telephone
screening [1]. The final decision regarding recruiting methods will depend on
the overall number and kinds of patients/subjects needed. Importantly, the
duration of active recruiting efforts commonly is specified in a protocol.
These timelines should be closely monitored and adjusted as needed by the
study leadership.

Study Monitoring

Implementation of the protocol should be carefully monitored. The persons
assigned this task should be identified and adequately trained in the
monitoring procedures to be used. These individuals should identify the
personnel responsible for overseeing study conduct at the various centers and
should ensure that all personnel at the center are well aware of and able to
properly carry out all the investigators' responsibilities. It is important to
describe the procedures that these individuals will follow to ensure (1)
adherence to the protocol, (2) provision of complete and accurate data, (3)
response to queries, and (4) compliance with auditing. Instructions on record
keeping and record retention should also be provided.
Monitoring techniques vary and may include simple periodic telephone or e-mail
contact with mailing or electronic submission of study documents between
investigator sites and the monitors. Monitors may visit sites on a periodic
basis to retrieve and deliver study materials as well as directly observe the
site's performance. For a more detailed description of monitoring methods and
procedures, the reader should consult standard references on the subject [1].

Data Acquisition and Processing

The principles of data acquisition and management are described in detail in
Chap. 7. From the study conduct perspective, it is important that adequate
numbers of qualified personnel are available for data processing and
management. Furthermore, these individuals must have expertise or be trained
in the required methods to be used for acquiring and processing data.
Similarly, study leadership must ensure that all appropriate materials,
especially equipment, hardware and software, are available to properly process
the data.

End of Study Procedures

Once all study visits have been completed in all subjects, the study itself
can be terminated. Procedures for terminating the study may include a final
monitoring visit to retrieve all outstanding study materials such as case
report forms and study supplies. Data processing procedures, e.g., quality
control and source document verification, should be initiated and completed.
Record retention procedures should be implemented.
The final results should be tabulated, analyzed, and presented in a final
study report to be submitted as required to funding agencies, IRBs,
regulatory agencies, etc. Most importantly, it is strongly recommended that
all final results be published. Only in this manner can the study be
critically analyzed by all those with a stake in its outcome as well as be
replicated if deemed desirable.

Overview of the Interventional Clinical Trial

Most of what is discussed above has derived from, and has been best defined
by, interventional clinical trials which represent the culmination of clinical
research and merit special consideration because of their impact on clinical
research methodology. Interventional clinical trials are designed and
conducted for the primary purpose of testing a treatment or management
strategy in patients with a specific disease. Such trials typically are
sponsored by large research organizations, such as the United States National
Institutes of Health (NIH), or by private organizations such as pharmaceutical
companies or medical device manufacturers.
An interventional clinical trial is a formal experiment designed to elucidate
and evaluate the relative efficacy and safety of different treatments or
management strategies for patients with a specific medical condition [25].
Healthy volunteers often are used in the early phases of assessment of a new
therapy primarily to assure sufficient safety of an intervention before
applying it to patients with the disease targeted by the intervention. Such
studies typically involve establishing the proper dosing and/or administration
of the intervention along with demonstrating that the intervention is
tolerated well enough to permit further studies in patients. However, healthy
human volunteers provide only indirect evidence of effects on patients.
Therefore, ultimately, clinical trials of putative interventions must be
conducted among individuals with disease. The results obtained from this
limited sample then are used to make inferences about how treatment can be
applied in the diseased population in the future [25]. Most commonly, a
clinical trial takes the form of a prospective study comparing the effect of
an intervention, usually a new drug or device, with a comparator or control
(i.e., a placebo or a treatment already available) [26]. The fundamental
design of the clinical trial can be widely applied to many different
disciplines or areas of clinical research. (For a comprehensive discussion of
contemporary clinical trial methodology, the reader is referred to the seminal
writings of Spilker [1].) Clinical trials can be employed to evaluate many
forms of therapy, including surgical interventions and radiation therapy. In
addition, clinical trials can be used to test other nontherapeutic approaches
to patient care, such as diagnostic tests or procedures [27]. Thus, the NIH
classifies clinical trials into five categories according to their purpose,
i.e., treatment trials, prevention trials, diagnostic trials, screening
trials, and health-related quality of life trials. These categories reflect
the way in which clinical trials fit within the entirety of the clinical
research spectrum, as they can be instrumental in assisting clinical efforts
to improve not only the treatment of a particular disease (as is most often
the case) but also its prevention and detection [27].
The clinical trial is the most widespread application of experimental study
design in humans [26]. Indeed, it is the adherence of the trial to the
principles of scientific experimentation, perhaps more so than a reliance on
therapeutic comparison, that most aptly validates the results of the trial. In
this vein, a number of general characteristics of the scientific method play a
substantial role in the modern conduct of clinical trials including, most
notably, the control of extraneous factors that might influence outcome
variability, selection bias, or interpretation of results [28]. For example,
an important feature of the randomized controlled trial, which is widely
accepted as the primary standard of evidence when interventions are evaluated,
is the requirement to randomly allocate patients to alternative interventions,
strengthening the internal validity of the study (see also Chap. 5).
In any clinical trial, regardless of which interventions or tests are
administered, investigators
must carefully follow the progress of recruited subjects, collecting data for
a prespecified time interval according to the requirements of the study
protocol; subsequently, statistical analyses are performed that might yield
valuable conclusions relevant to predefined research objectives. Some studies
might involve more tests or medical visits than are clinically necessary,
while others interfere only minimally with normal patient care practices. In
general, the details of the procedure, including the specific conceptual plans
for observation, data capture, follow-up, and analysis, depend on what type of
clinical trial is being conducted. Due to their broadening scope of
applicability since the mid-1900s, clinical trials currently play a paramount
role in examining the impact of interventions among human subjects. What has
further cemented the clinical trial as a valuable tool for the clinical
investigator has been the recognition by health-care professionals that, if
insights into disease prevention and improvement to patient care are to be
gained, experimental methodology should be followed as rigorously in a
clinical setting as it is in basic science [28]. Proper preparation of the
research protocol, therefore, is essential to the successful and ethical
application of the clinical trial to modern clinical research.

Conclusions

The study protocol is the most important and critical document available to
the investigator and is central to the conduct of any study. It provides the
necessary guidance and serves as the main reference for all study personnel,
while also providing for the welfare and safety of all study participants. It
must be detailed and comprehensive and must be prospectively defined. Whereas
it is not possible to foresee all things that might occur during the course of
the study, it behooves the investigators to plan for all foreseeable
developments in the protocol. Virtually anything that must be done post hoc
has the potential to introduce bias and undermine the credibility and validity
of the study. The degree to which the investigators can achieve these
requirements will serve as testimony to their thoughtfulness, attention to
detail, and overall quality of work. A high-quality protocol should allow
others who follow it rigorously to obtain the same results. Most importantly,
a high-quality protocol will likely lead to a valid and credible conclusion,
whether it confirms or refutes the hypothesis, thereby reducing the likelihood
of needing a costly repeat study because of a faulty protocol.
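
The value of defining rules prospectively can be illustrated with a small
example. The following Python sketch encodes the five seriousness criteria for
adverse events described earlier in this chapter. It is an illustration only,
not part of any regulatory standard or study system: the class name, the field
names, and the reading of the first criterion as "fatal or life-threatening"
are assumptions made for the example. Its purpose is to show that severity is
recorded but plays no part in deciding seriousness.

```python
from dataclasses import dataclass


@dataclass
class AdverseEvent:
    """One recorded adverse event. All field names are hypothetical."""
    description: str
    severity: str  # e.g. "mild", "moderate", "severe"; recorded only
    fatal_or_life_threatening: bool = False             # criterion 1 (assumed reading)
    requires_or_prolongs_hospitalization: bool = False  # criterion 2
    disabling_or_incapacitating: bool = False           # criterion 3
    congenital_anomaly_or_birth_defect: bool = False    # criterion 4
    needs_intervention_to_prevent_above: bool = False   # criterion 5


def is_serious(event: AdverseEvent) -> bool:
    """Return True if the event meets at least one seriousness criterion.

    Severity is deliberately ignored: a severe event that meets none of
    the five criteria is not a serious adverse event.
    """
    return any((
        event.fatal_or_life_threatening,
        event.requires_or_prolongs_hospitalization,
        event.disabling_or_incapacitating,
        event.congenital_anomaly_or_birth_defect,
        event.needs_intervention_to_prevent_above,
    ))


# A severe symptom that meets no criterion, and a mild one that does.
severe_headache = AdverseEvent("headache", severity="severe")
mild_rash_admitted = AdverseEvent(
    "rash", severity="mild", requires_or_prolongs_hospitalization=True
)

print(is_serious(severe_headache))     # False
print(is_serious(mild_rash_admitted))  # True
```

Writing the rule down in this form before the study starts leaves no room for
post hoc judgment about which events count as serious, which is the point made
in the Conclusions above.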

Take-Home Points

A protocol is the most critical document in a research study.

It plays a central role in the conduct of a study by describing how a hypothesis will be tested.
It provides the necessary guidance and serves as the main reference for all study personnel, while also providing for the welfare and safety of all study participants; it must be prospective, detailed, and comprehensive.
A protocol is organized in chronological divisions; the background and rationale provide the first impression of the investigators; study endpoints, especially the primary ones, drive the rest of the study design.
The study population, schedule of visits/procedures, and methods for ensuring patient safety, along with other human subjects issues, must be described in detail.
A high-quality protocol will enhance the likelihood of drawing valid conclusions, whether they confirm or refute the hypothesis, thereby reducing the likelihood of needing a costly repeat study.
References

1. Spilker B. Guide to clinical trials. New York: Raven; 1991.
2. Treweek S, McCormack K, Abalos E, Campbell M, Ramsay C, Zwarenstein M, PRACTIHC Collaboration. The trial protocol tool: the PRACTIHC software tool that supported the writing of protocols for pragmatic randomized controlled trials. J Clin Epidemiol. 2006;59:1127–33.
3. Sellier P, Chatellier G, D'Agrosa-Boiteux MC, Douard H, Dubois C, Goepfert PC, Monpère C, Saint Pierre A, Investigators of the PERISCOP study. Use of non-invasive cardiac investigations to predict clinical endpoints after coronary bypass graft surgery in coronary artery disease patients: results from the prognosis and evaluation of risk in the coronary operated patient (PERISCOP) study. Eur Heart J. 2003;24:916–26.
4. Mahaffey KW, Harrington RA, Akkerhuis M, Kleiman NS, Berdan LG, Crenshaw BS, Tardiff BE, Granger CB, DeJong I, Bhapkar M, Widimsky P, Corbalon R, Lee KL, Deckers JW, Simoons ML, Topol EJ, Califf RM, For the PURSUIT Investigators. Disagreements between central clinical events committee and site investigator assessments of myocardial infarction endpoints in an international clinical trial: review of the PURSUIT study. Curr Control Trials Cardiovasc Med. 2001;2:187–94.
5. Marang van de Mheen PJ, Hollander EJ, Kievit J. Effects of study methodology on adverse outcome occurrence and mortality. Int J Qual Health Care. 2007;19:399–406.
6. Borgsteede SD, Deliens L, Francke AL, Stalman WA, Willems DL, van Eijk JT, van der Wal G. Defining the patient population: one of the problems for palliative care research. Palliat Med. 2006;20:63–8.
7. Chin Feman SP, Nguyen LT, Quilty MT, Kerr CE, Nam BH, Conboy LA, Singer JP, Park M, Lembo AJ, Kaptchuk TJ, Davis RB. Effectiveness of recruitment in clinical trials: an analysis of methods used in a trial for irritable bowel syndrome patients. Contemp Clin Trials. 2008;29:241–51.
8. Sisk JE, Horowitz CR, Wang JJ, McLaughlin MA, Hebert PL, Tuzzio L. The success of recruiting minorities, women, and elderly into a randomized controlled effectiveness trial. Mt Sinai J Med. 2008;75:37–43.
9. Armitage J, Souhami R, Friedman L, Hilbrich L, Holland J, Muhlbaier LH, Shannon J, Van Nie A. The impact of privacy and confidentiality laws on the conduct of clinical trials. Clin Trials. 2008;5:70–4.
10. Anisimov VV, Fedorov VV. Modelling, prediction and adaptive adjustment of recruitment in multicentre trials. Stat Med. 2007;26:4958–75.
11. Abbas I, Rovira J, Casanovas J. Clinical trial optimization: Monte Carlo simulation Markov model for planning clinical trials recruitment. Contemp Clin Trials. 2007;28:220–31.
12. Franciosa JA. Commentary on the use of run-in periods in clinical trials. Am J Cardiol. 1999;83:942–4.
13. Romano P. Automation of in-silico data analysis processes through workflow management systems. Brief Bioinform. 2008;9:57–68.
14. Lacroix Z. Biological data integration: wrapping data and tools. IEEE Trans Inf Technol Biomed. 2002;6:123–8.
15. Shah AR, Singhal M, Klicker KR, Stephan EG, Wiley HS, Waters KM. Enabling high-throughput data management for systems biology: the Bioinformatics Resource Manager. Bioinformatics. 2007;23:906–9.
16. Nussenblatt RB, Meinert CL. The status of clinical trials: cause for concern. J Transl Med. 2010;8:65–8.
17. Smith A, Palmer S, Johnson DW, Navaneethan S, Valentini M, Strippoli GF. How to conduct a randomized trial. Nephrology. 2010;15:740–6.
18. Paschoale HS, Barbosa FR, Nita ME, Carrilho FJ, Ono-Nita SK. Clinical trials profile: professionals and sites. Contemp Clin Trials. 2010;31:438–42.
19. Bader JD, Robinson DS, Gilbert GH, Ritter AV, Makhija SK, Funkhouser KA, Amaechi BT, Shugars DA, Laws R. X-ACT collaborative research group. Four lessons learned while implementing a multi-site caries prevention trial. J Public Health Dent. 2010;70:171–5.
20. Treweek S, Pitkethly M, Cook J, Kjeldstrøm M, Taskila T, Johansen M, Sullivan F, Wilson S, Jackson C, Jones R, Mitchell E. Strategies to improve recruitment to randomised controlled trials. Cochrane Database Syst Rev. 2010;4:MR000013.
21. Helgesson G, Ludvigsson J, Gustafsson Stolt U. How to handle informed consent in longitudinal studies when participants have a limited understanding of the study. J Med Ethics. 2005;31:670–3.
22. Jones JW, McCullough LB, Richman BW. Informed consent: it's not just signing a form. Thorac Surg Clin. 2005;15:451–60.
23. Albrecht TL, Franks MM, Ruckdeschel JC. Communication and informed consent. Curr Opin Oncol. 2005;17:336–9.
24. del Carmen MG, Joffe S. Informed consent for medical treatment and research: a review. Oncologist. 2005;10:636–41.
25. Pocock SJ. Clinical trials: a practical approach. New York: Wiley; 1983.
26. Portney LG, Watkins MP. Foundations of clinical research: applications to practice. 2nd ed. New Jersey: Prentice-Hall; 2000.
27. Basic questions and answers about clinical trials. Rockville (MD): Food and Drug Administration (US). Last Updated: 07/16/2009. forconsumers/byaudience/forpatientadvocates/hivandaidsactivities/ucm121345.htm Accessed 11 Aug 2011.
28. Piantadosi S. Clinical trials: a methodologic perspective. New York: Wiley; 1997.
7 Data Collection and Management in Clinical Research
Mario Guralnik

M. Guralnik, PhD
Synergy Research Inc, 3943 Irvine Blvd #627, Irvine, CA 92602, USA
e-mail:

Introduction

As noted elsewhere in this volume, all successful clinical trials begin with a
good study question or questions, optimally framed as one or more hypotheses,
and an appropriate research design that clearly defines appropriate study
endpoints as well as other key variables. As in most serious endeavors, the
old adage "Failing to plan is planning to fail" applies when conducting
clinical research, where poorly conceived study objectives and incompletely
defined endpoints can almost guarantee that a study's conclusions will be
faulty. In such cases, the best the researcher may hope for are anecdotal
observations of questionable validity; at worst, they could mislead the
community of patients, clinicians, and/or health policy decision makers for
whom the research was conducted.
Once these elements have been rigorously defined, the next most important step
is the designation of the data to be collected among the subjects to be
included in the trial and the manner of data collection. Optimally, these will
be detailed in procedural manuals, which outline the plans and processes for
data flow, entry, and quality control and represent the essential documents
for managing the conduct of the research. Not surprisingly, most data are
collected to address the research study objectives. However, trial
administration and compliance data also are often collected to provide
evidence of the quality of the conduct of the study.
Having developed the proper study design and data definitions, the researcher
next is faced with the challenge of selecting the systems to be used to
collect and manage the trial data. Well-designed data management processes,
collection tools, and systems will help ensure the validity and integrity of
the data to be analyzed. Only data whose sources can be trusted as accurate,
complete, and protected from tampering can be used to substantiate conclusions
about a trial's outcomes. Also, clinical research inherently raises issues of
patient privacy and data security; thus, data management processes and systems
used in clinical trials must address both of these areas as well. Overall,
defects and inefficiencies in methods and procedures of data identification,
collection, and management translate into defects in documented evidence and
waste in the conduct of the trial itself [1]. These problems may be compounded
when studies are large, are long-term, or utilize multiple centers [2].
Therefore, well-designed trials and data management

P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators, 131
DOI 10.1007/978-1-4614-3360-6_7, Phyllis G. Supino and Jeffrey S. Borer 2012
132 M. Guralnik

methods are essential to the integrity of the findings from clinical trials
and containing the costs of conducting them.
The methods by which data are collected must be addressed during the research
design step. Attention must be paid to identifying existing or creating new
research documents or devices into which trial observations can be recorded.
Selecting the documents/devices that provide the most reliable and valid data
is a critical component of the research design process. Historically, the
cornerstone of data collection has been the structured paper case report form
(CRF) into which the required data are transcribed from the research
documents. However, inherent inefficiencies are present in paper-based data
collection due to its time and resource-intensive nature and the error-prone
aspects of data transcription and database entry. Not surprisingly, in the
last decade, studies once steadfastly done on paper now routinely use
electronic data capture (EDC) in an attempt to overcome these inefficiencies.
Specifically, these EDC systems reduce redundancy, trap errors in real time
(allowing their prompt resolution), and promote the uniform collection of data
which can be analyzed and shared in a more consistent and timely manner.

Data Types

The term "data" in clinical research refers to observations that are
structured in such a way as to be amenable to inspection and/or analysis [3].
In other words, they represent the evidence for conclusions drawn in a trial.
All data collected in biomedical research studies are either numerical or
nonnumerical. Nonnumerical data typically are based on written text but also
could include data from sources ranging from digital photography to voice
dictation. Any individual study may collect either or both of these data
types. The approaches required to analyze, summarize, and interpret each type
vary, so the differences between the various approaches must be considered
when designing a study and collecting the data [4].

Quantitative Data

The data collected in randomized clinical trials (RCTs), where the
effectiveness and safety of new clinical treatments are evaluated, primarily
are quantitative (numerical) in nature. Such data may be discrete or
count-based (e.g., number of
Procedural manuals typically outline processes white blood cells or hospitalizations) or continu-
for data generation, ow, entry, and quality con- ous measurements (e.g., dimensions, tempera-
trol. They are essential for managing the conduct ture, ow) and are collected using such methods
of the research. Verifying, validating, and correct- as objective (laboratory) testing or patient
ing data entered into a clinical research database response questionnaires and surveys that ask
are critical steps for quality control. Several data the respondent how much or how many.
cleaning processes are available for this purpose. Quantitative data may be displayed graphically
This chapter will consider the tools and pro- or summarized and otherwise analyzed through
cesses that support the development of accurate the use of descriptive and/or inferential statistics.
clinical research data and efcient trial manage- Descriptive statistics, including distributional
ment. These tools and processes are designed to characteristics of a sample (e.g., frequencies or
satisfy the requirements of funding agencies, percentages), measures of central tendency (e.g.,
Institutional Research Boards (IRBs), and other means, medians, or modes), and measures of
regulatory bodies with regard to protecting variability (e.g., ranges or standard deviations),
human subjects, provide timely access to safety provide a way by which the voluminous numeri-
and efcacy data, and maintain patient cal data collected can be reduced to a manageable
condentiality. Topics to be covered include the and more easily interpretable set of numbers.
various types of data used in clinical research, Inferential statistics provide levels of probability
basic source and research documents, data cap- by which the research hypotheses can be tested
ture methods, and procedures for monitoring and and conclusions drawn (see Chap. 11 for an in-
securely storing data. depth discussion of these methods).
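
As a small illustration of the descriptive statistics named above, the following Python sketch summarizes a discrete, count-based variable using only the standard library. The data values and the function name describe are hypothetical and are not taken from any trial; the sketch is meant only to show how frequencies, percentages, central tendency, and variability reduce a list of observations to a few numbers.

```python
from collections import Counter
import statistics

# Hypothetical data: number of hospitalizations per patient (discrete counts).
hospitalizations = [0, 1, 1, 2, 0, 3, 1, 0, 2, 1]


def describe(values):
    """Return common descriptive statistics for a list of numbers."""
    counts = Counter(values)
    n = len(values)
    return {
        "n": n,
        # Distributional characteristics: frequencies and percentages.
        "frequencies": dict(sorted(counts.items())),
        "percentages": {k: 100 * v / n for k, v in sorted(counts.items())},
        # Measures of central tendency.
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        # Measures of variability.
        "range": max(values) - min(values),
        "sd": statistics.stdev(values),  # sample standard deviation
    }


summary = describe(hospitalizations)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The same function could be applied to a continuous measurement such as temperature, although the frequency table is then usually less informative than the mean, median, and standard deviation.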
Qualitative Data

Exploratory trials, in which one of the purposes is to generate information
for use in the planning and design of RCTs, rely heavily on nominal and other
forms of nonnumeric data produced using such methods as patient free-text
opinion surveys, diaries, and translations of verbal communications (e.g.,
interviews). The summarization and analysis of nonnumeric data typically
involve the use of descriptive statistics (as is the case for quantitative
study data), but additional work is required before the descriptive statistics
can be calculated. Specifically, the nonnumeric data first must be translated
into numeric codes based on a coding scheme preferably specified in the
protocol or, at least, prior to the collection of the data. The coding scheme,
however, is by its very nature a subjective process which has the potential
for investigator bias resulting from selective collection and recording of the
data (or from interpretation based on personal perspectives). The potential
bias can be minimized by having at least two researchers independently collect
and record the data based on the same information and coding scheme.

Reliability and Validity

Reliability and validity are concepts that reflect the rigor of the research
and the trustworthiness of the research findings [5]. Reliability describes
the extent to which a particular test, procedure, or data collection method
(e.g., a questionnaire) will produce similar results under different
circumstances. Highly reliable data are in evidence when the research tool or
method used in the collection of the data provides similar information when
used by different individuals (interrater reliability) or at different times
(reproducibility). Validity is a subtler concept which describes the extent to
which what we believe we are measuring accurately represents what we intended
to measure. Internal validity indicates the accuracy of causal inferences
drawn from a study's findings. External validity indicates the extent to which
a study's findings can be applied to other similar groups of people or
situations. Additional information about validity and reliability can be found
in Chaps. 5 and 8.

Principles of Data Identification and Collection

As previously described, the research data to be collected in any clinical
trial and stored in the clinical database must support the objectives of the
study and be specified in the protocol. This, in turn, relies on designing
data collection instruments and computer databases that correctly capture the
defined research data. To support trial administration and to document
compliance with regulations and Good Clinical Practice (GCP), source documents
also are expected to capture subject participation data, though such data are
not necessarily included in the research database [4].

The research data represent the information that is analyzed to answer the
questions stated in the study objectives. In most protocols, addressing
primary and potentially secondary objectives requires collection of both
efficacy and safety endpoints. To appropriately design the data collection
documents and collection methods, it is important to consider the value or
weight that each study objective contributes to the overall outcome of the
study. Emphasis must be placed on accurate and complete collection of the
specific data points necessary to investigate the study's primary objectives,
while the collection of extensive data in support of secondary objectives
should never be allowed to detract from satisfying the study's primary
objectives.

When considering the collection of administrative source data to help with the
management of a trial, the amount of such data required depends to a large
degree on the complexity of the trial structure. For example, in a small,
single-institution trial, much less information typically is needed than in a
large multicenter trial [4]. The specific types of data to be collected will
depend on the details of the trial and could include information about
transport of study
materials, monitors assigned to each site, dates of monitor visits, or drug
supply levels at each site [4]. Regardless of the amount and the type of
administrative data collected, it is not uncommon for trial management
information to be manually and/or electronically stored separately from the
clinical trial research results.

Other administrative data include personal patient information. The Study
Coordinating Center must be able to link a patient to a specific institution
and maintain a roster of contact details for that institution (e.g., patient
name, address, telephone and fax numbers, and names, titles, and e-mail
addresses for key trial personnel at that institution) [4]. However, due to
privacy considerations, the patient identification information must be stored
separately from the trial database. The trial database contains uniquely
assigned patient numbers (and possibly randomization numbers), which can also
be used to link data from multiple sources to the same patient (e.g.,
laboratory data, demographic information, and medical history).

Most studies also will need some documentation of compliance with regulatory
requirements. The level of detail for such compliance data depends on the type
and purpose of the trial. Studies to be submitted to regulatory agencies in
support of New Drug or Product License Applications typically require the most
complete set of compliance source data. Types of documentation can include
ethics committee approvals for the protocol, original copies of patient
consent forms, and personnel qualifications and training at participating
sites [4].

The bottom line is that, in most clinical trials, a large volume of data is
collected. According to a recent review of data monitoring in clinical trials
[6], the more data that are collected, the more cumbersome and complicated
data management becomes. Therefore, one goal in trial design should be to
minimize the volume of noncritical data required so as to increase the
integrity and quality of the study's results. This requires a realistic
appraisal of the ability of investigators and other study personnel to manage
the amount of data collected with a minimum of confusion and error.

Data Sources

During the research design step, attention must be paid to identifying
existing or creating new research source documents or electronic devices into
which trial observations can be recorded. Selecting the source
documents/devices that provide the most reliable and valid data to investigate
the research study objectives is a critical component of the research design
process. Data may be extracted from research-independent sources, e.g., health
insurance databases or electronic health records (EHRs), or research-dependent
sources, e.g., lab reports generated from the performance of procedures
conducted according to a trial protocol's schedule. Both types of sources may
provide data for a research study and are described in greater detail below.

Source Documentation and the Concept of Original Ink

Source documents for research can be defined as all information contained in
original records and certified copies of results, observations, and other
aspects required for the reconstruction and evaluation of a study and its
conduct [7]. Source documentation in a clinical trial includes medical or
physiological, social, and psychological indicators of health that can be used
to determine the effectiveness of a clinical intervention. These can involve
copies of any or all of the following original confidential medical records:
pharmacy dispensing records, physicians' notes, clinic and office charts,
nurses' notes, clinical laboratory reports, diagnostic imaging reports,
patient diaries and questionnaires, hospital admission records, hospital
discharge records, emergency room reports, autopsy reports, electronic
diagnostic or research test results, vital sign records, electronically
captured original study data, photographs, diagrams, and sketches. Source
documents also can be created or provided by a trial sponsor, by a third party
(e.g., a contract research organization
[CRO] or a site management organization [SMO]), or by the investigator or site
staff, and may include study case report forms (CRFs) or electronic case
report forms (eCRFs) if used as the first point of data capture. A source
document could even be a cafeteria napkin containing laboratory results or
other observations, although a more formal data collection source document
would be much preferred.

Use of the "original ink" concept can help to differentiate a source document
from subsequent documentation. Original ink is a term that may be used to
define the first-ever written documentation of an event or observation
pertaining to the study subject. Thus, documents containing original ink are
considered source documents for research. The US Food and Drug Administration
(FDA) as well as other regulatory agencies also recognize a CRF as source
documentation when it has captured the original ink of an event or observation
in a clinical trial. In contrast, transcriptions or reproductions are
considered subsequent documentation based on the source original ink document.
With today's use of advanced computer technology, ranging from digital
photography to voice dictation, we must consider other forms of original ink
or, more appropriately termed, "original electronic chronicles." These include
voice, electronic, magnetic, photo-optical, and other source documentation and
records. For further information on the FDA's position on source
documentation, the reader is referred to Guidance for Industry: Electronic
Source Documentation in Clinical Investigations (2010) [8].

Confusing these issues can lead to misrepresentation of clinical trial data.
For example, after site staff has collected a subject's history directly on
sponsor-designated CRFs, the study monitor might remind the investigator's
staff that preprinted sponsor source documents exist and that they are
designed to assist the site in capturing all necessary data elements. The site
staff might then proceed to transcribe data from the CRF onto the sponsor's
source documents. To further confuse the matter, subsequent monitoring or
query resolution activities by the sponsor would then rely on the transcribed
sponsor's source documents to be the accurate and overriding data points for
resolution. Simply stated, erroneous data could be considered the factual
representation of an event or observation. A simple but effective tool for
avoiding such situations is to define in advance, on a site-by-site as well as
a form-by-form basis, what is and what is not source documentation. When
clarifying the definition of source documentation, an important point to keep
in mind is that the study staff may habitually record original ink data in
certain places. For example, a patient's temperature and pulse may be
routinely taken at the bedside by the study coordinator and recorded on a copy
of the CRF. If the patient's blood pressure is then taken from the physician's
notes and recorded on the copy, then that copy becomes the source
documentation for the first two measurements, but not for the third.
Interviewing the staff prior to source document verification is an effective
time-saving tool. When done early in the study initiation process, this method
can very effectively clarify potential discrepancies.

Research-Independent Data Sources

A wealth of medical information is generated every day for nonresearch
purposes. Patient medical records maintained by hospitals, clinics, and
doctors' offices are a significant source of such data accessible for research
purposes. Even the simplest medical records could contain important
information for research purposes, such as sociodemographic data, clinical
data, administrative data, economic data, and behavioral data.

Additional potential research-independent primary data sources are (a) claims
data (such as those from managed care databases), (b) encounter data (such as
those from a staff/group model of health maintenance organizations), (c)
expert opinions, (d) results of published literature, (e) patient registries,
and (f) national survey databases. Since these data sources contain historical
as well as current data that are updated on an ongoing basis, these sources
provide data that
are potentially useful in both retrospective studies (designed to investigate
past events) and prospective studies (designed to investigate events occurring
after patients have been enrolled in a study).

Research-Dependent Data Sources

Controlled evaluation of investigational products or interventions requires
prospective data collection which typically involves identifying one or more
patient groups, collecting baseline data, delivering one or more products or
interventions, collecting follow-up data, and comparing the changes from
baseline among the different patient groups. Although some data in these
controlled evaluations may come from research-independent sources (e.g.,
demographic characteristics, medical history), most of the baseline data and,
obviously, the follow-up data must be collected from research-dependent
sources. Well-designed investigations of this nature specify, prior to the
initiation of the study, the data to be collected and the collection methods
to be used.

Data Collection Methods

The study design and the study data to be collected dictate the methods by
which the data are to be collected. Laboratory data (e.g., hematology,
urinalysis, serology) and vital signs (e.g., height, weight, blood pressure)
may be required in a clinical trial to evaluate efficacy and, often, to
evaluate patient safety. These data typically would be collected using
standard methods for these data types and recorded in the patient's medical
records, often designed specifically for the research study. Other data
collected to address the research question(s) may require clinical information
(e.g., events experienced by the patient, nonstudy medications used by the
patient), tracking information (e.g., timing and amount of study medications
received, alcohol consumption, sexual activity), or subjective information
(e.g., personal opinions of medical condition or ease of treatment). These
data must be obtained directly from the patient, most often through the use of
a questionnaire or survey.

Questionnaires and surveys consist of a predetermined set of questions
administered verbally, as a part of a structured interview, or nonverbally on
paper or an electronic device. The responses to the questions may be discrete
bits of data or may be grouped as measures of study outcomes (e.g.,
psychological scales). If the questionnaire is intended to measure study
outcomes, establishing its reliability and validity and minimizing bias are
essential. Administering a published questionnaire for which reliability and
validity have been previously determined is recommended when possible.
However, the use of some published questionnaires requires permission of their
authors and may have a cost associated with their use. When the use of
published questionnaires is not feasible, new questionnaires will need to be
developed. Such questionnaires should be pretested systematically (i.e.,
piloted) with a small subgroup of the patient population in order to identify
and correct ambiguities or biases in the way the questions are stated.
Training interviewers who verbally administer a questionnaire will also
increase the quality of the data generated from both published and newly
developed data collection instruments. (See Chap. 8 for a detailed description
of various item formats used in questionnaires and general rules to consider
when constructing questionnaire items.)

Data Capture

Paper-Based Methods

Efficient analysis, summarization, and reporting of biomedical research data
require that data be available in an electronic database, such as a
spreadsheet or one of several available databases, some of which have been
designed specifically for clinical research data. The manner in which the data
are entered into these databases has been evolving. Historically, most data in
biomedical research, particularly in RCTs, were entered from a set of paper
CRFs specifically designed for the study. Figure 7.1 shows an example of a
typical paper CRF used to collect data obtained from physical examination.
Fig. 7.1 Example of a paper CRF used to collect research data from a physical
examination. Downloaded from the National Cancer Institute at the National
Institutes of Health, Division of Cancer Prevention.
http://dcp.cancer.gov/Files/clinical-trials/FINAL_DCP_CRF_Templates_Version_3.doc
(Accessed 10 Nov 2011)

Electronic Systems

Despite their long-term use, paper-based systems for data collection and
management have been found to be inefficient and error prone because of
multiple iterations of data transcription, entry, and validation [9]. Thus,
due to recent technological advances, paper-based CRFs are being replaced by
eCRFs, through which the data are entered directly into trial databases from
source documents. Features of eCRFs are presented in Table 7.1, but they may
vary depending on the computer software upon which they are based.
Table 7.1 Features that may be available for electronic CRFs, depending on the clinical trial data management software used (reproduced with permission from Brandt et al. [2])

Feature: Function
Primary electronic data entry: Data entered into CRF by interviewer or subject (rather than into a paper form first)
Context-sensitive help: Help is given in the context of the problem (immediately)
Default values set: Based upon predefined criteria or previously entered data, values of fields may be set
Skip patterns: Disabling of questions that become inapplicable based on response to a previous question
Computed (derived) values: Certain questions may be based on values of other questions (such as body mass index (BMI), which is derived from height and weight). Computed values may also control skip patterns on a CRF. If BMI exceeds a preset threshold, questions related to high BMI may be enabled
Interactive validation: Immediate checking of the values entered into the CRF based upon predefined criteria such as ranges, other values in the CRF or study, etc.
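
Three of the features in Table 7.1 (interactive validation, computed values, and skip patterns) can be illustrated with a short Python sketch. Everything specific in it is an assumption made for illustration: the field names, the allowable ranges, the BMI threshold of 30, and the function names are hypothetical and do not describe any particular EDC product.

```python
# Hypothetical allowable ranges used for interactive validation.
RANGES = {"height_cm": (100, 250), "weight_kg": (30, 300)}

# Hypothetical preset threshold above which BMI-related questions are enabled.
BMI_THRESHOLD = 30.0


def compute_bmi(height_cm, weight_kg):
    """Computed (derived) value: BMI = weight in kg / (height in m) squared."""
    return weight_kg / (height_cm / 100) ** 2


def validate_entry(entry):
    """Interactive validation: return a list of range or missing-value errors."""
    errors = []
    for field, (low, high) in RANGES.items():
        value = entry.get(field)
        if value is None:
            errors.append(f"{field}: missing")
        elif not low <= value <= high:
            errors.append(f"{field}: {value} is outside {low}-{high}")
    return errors


def process_entry(entry):
    """Validate an entry, derive BMI, and apply the BMI-based skip pattern."""
    errors = validate_entry(entry)
    if errors:
        # In a real eCRF the user would be prompted to correct these at once.
        return {"errors": errors}
    bmi = compute_bmi(entry["height_cm"], entry["weight_kg"])
    return {
        "errors": [],
        "bmi": round(bmi, 1),
        "high_bmi_questions_enabled": bmi > BMI_THRESHOLD,
    }


accepted = process_entry({"height_cm": 170, "weight_kg": 95})
rejected = process_entry({"height_cm": 17, "weight_kg": 95})  # likely typo
print(accepted)
print(rejected)
```

In this sketch the mistyped height is trapped at entry time, before a BMI is derived from it, which is the behavior that distinguishes interactive validation from cleaning performed after the data have been stored.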

Principles of Case Report Form Design

Regardless of whether a CRF or eCRF is used, meaningful collection of
high-quality data begins with a CRF that is based on the trial protocol [8].
Hence, consistency with source documents is an essential feature of a
well-designed CRF. However, an analysis of source document verification
performed by the sponsors of clinical trials has identified areas of
inconsistency in 70% of cases [10]. Several items were either covered in the
CRF but not in the source documents (including those pertaining to patient
history and informed consent) or were described in the source documents but
not in the CRF (including those regarding patient history, complications,
adverse events, and concomitant drugs or other therapies). Sources of such
discrepancies need to be resolved before a trial begins. Although CRFs play a
pivotal role in the successful conduct of a trial, the design of these forms
often is neglected in the haste to launch a trial [11]. The content, format,
coding, and data-entry requirement principles of good CRF design, described
more than 20 years ago by Bernd [11], remain applicable today (Table 7.2).

Table 7.2 Content, format, and data-entry principles of good case report form design

CRF content principles
  Collect data that support questions (as defined in the protocol) that are to be answered by the statistical analysis.
  Define terminology and scales.
  Avoid questions that address ancillary issues.
  Ask each question only once.
CRF format principles
  Ask questions directly and unambiguously, using conventional and professional terminology.
  For long-term studies, provide a separate CRF for each visit and group of visits.
  Arrange the questions in a logical sequence (i.e., the order in which a physician would ordinarily collect the data).
  Specify how precise answers should be (i.e., whether values should be rounded off or carried to one or more decimal places).
  When possible, collect direct numerical measurements rather than broad categorical judgments.
  Use design techniques that simplify reading and completing the form:
    Balance white space with text.
    When possible, use check-off blocks instead of asking for a code, value, or term.
    Block sections of the form to make them easy to locate and complete.
    Use variations in size and boldness to show the hierarchy of headings.
    Highlight areas of the form where entries are
CRF coding and data-entry requirements
  Use consistent reference codes (e.g., if code [1] represents "no" for one question, it should not represent "yes" for another question).

To avoid the excessive costs and delays often associated with printing CRFs,
sponsors that use paper-based data capturing have found alternatives
to the traditional outsourcing of this task. Desktop-publishing systems and
precollated no-carbon-required (NCR) paper allow printing, collating, and
binding of CRFs, with multicolored two- or three-part sets [11]. Over the
course of a longitudinal study, CRFs often are improved or refined, including
the addition of new entries and modification or deletion of entries on
previous versions [2]. Some newly requested data (such as information about
the patient's history) may be obtainable later, whereas time-dependent
observations (such as measurements taken at a certain clinic visit) will not.
Data for new or modified questions that cannot be obtained must be treated as
missing. Conversely, when a question is deleted, data for patients evaluated
under the older CRF version must be archived or purged or both [2]. Regardless
of the types of changes made, the FDA requires that the sponsor preserve all
electronic versions for agency review and copying [12].

Electronic systems are designed to support data entry in which data are
entered directly from source documents, with most data validations executed in
real time as the data are entered and errors promptly resolved, typically by
study site staff. As will be noted below, EDC systems also support the
monitoring, cleaning, storage, retrieval, and analysis of research data [2],
as well as promote the uniform collection of data, which can then be more
easily analyzed and shared across a variety of platforms and databases [13].

EDC systems, however, are not without their own constraints. To be useful in
multicenter trials, EDC systems must allow electronic submission of data from
different sites to a central data center, be easy to implement and use, and
minimize disruption at the clinical sites [9]. Timing is essential to the
successful implementation of an EDC system. Considerable information
technology (IT) support is needed to build the eCRFs, and considerable time
must be dedicated to educating the trial site staff on the proper use of the
new systems. To be successful and reap the benefits of EDC systems, this
effort should be undertaken prior to the initiation of any research study
[14].

Although EDC systems are most often used by formally organized research
centers with data management staff, many clinical investigators in private
practice or in academia conduct studies without the support of qualified
biomedical informatics consultants and sophisticated EDC systems [15].
Nevertheless, EDC systems are available that can be implemented without
specialized software for investigators with small budgets or limited access to
data management staff.

Data collection has naturally evolved alongside computer and information
technology. Major milestones in this evolution include personal computers,
relational databases, user-friendly interfaces for software once reserved for
engineering and systems design staff, and broadened connectivity options such
as computer to computer, internet networking, wireless to Ethernet, and
cellular data connectivity. These advances, along with the current
availability of mobile computing and electronic devices such as the iPad, have
a potentially huge impact on how we gather data, as well as on where data
capture is heading.

The iPad is a major step forward for clinical data management. These truly
remarkable devices, resting in the hands of all members of the research team,
would allow quick access to tools for capturing data, in real time or
otherwise. They also offer two-way connectivity along with the portability and
functionality of the hardware, thereby lending them the exact adaptability
needed for clinical medicine and research roles.

Newer generation iPads allow data to migrate from text-based field entry, or
PDF form data entry, through to server-based relational databases. Using
methods from e-mail as a carrier to internet-connected applications, the data
stream can be instantaneous, allowing for immediate two-way data efforts,
relaying back from sponsor to investigator. Third-party communications further
enhance the iPad platform. All of this has begun to evolve because the iPad
platform has simplified the process of data capture and transfer via its
accessible hardware and novel data management applications.
Electronic Training Manuals

Procedural and training manuals, core documents of any clinical study, outline
plans and processes for study coordination, creation of CRFs, data entry,
quality control, data audits, data-entry verification, and site/data
restrictions [2, 16]. Recent technological advances have not only made
possible efficiencies in data collection and data processing but also made
possible electronic manuals created with HTML-based content, which offer
several advantages over paper-based core documents. Special software can be
used to edit multiple discrete documents, organize them hierarchically, and
provide hyperlinking between related topics. When a manual is produced in this
way, many authors can work simultaneously on different topics that are
subsequently integrated with a version-control system as the content evolves.
The version-control software can also manage changes and updates to protocol
documents. Other advantages of this system over simple text documents include
the capability for single-source authoring with generation of multiple output
formats (e.g., JavaHelp), distribution of the complete manual through a
dedicated web site (with hyperlinks to support online browsing), and support
for highly efficient, full-text searching, with results ranked by relevance
[2].

Data Error Identification and Resolution

Verifying, validating, and correcting data entered into a clinical research
database are critical steps for quality control. Several data cleaning
processes are available, including the following: double computer data entry,
which captures entry inconsistencies (though it cannot detect errors made by
the person supplying the data without additional exploratory data analysis);
random data-entry audits (which are based on a predetermined level of
criticality for each data category); and electronic data validation (which
identifies entry errors by their deviation from allowable and expected
responses and interactively prompts for corrected data). Compared with
paper-based systems, EDC systems can more efficiently clean data by reducing
the number of data discrepancies and requests for clarifications, as well as
lower the cost of each data query resolution by lessening the amount of manual
input required.

Data-Entry Cleaning

Cleaning, the process of verifying, validating, and correcting data entered
onto the CRF or into the database, is essential to quality control in a
clinical trial. Double data entry, the most common way to verify data entered
onto CRFs [17], begins with reentry of data from the CRF into the study
database at a later point than the original entry; often, this step is
performed by a person other than the operator who made the first entry. Next,
the two versions are automatically compared, and any discrepancies are
corrected [17]. Despite the widespread use of this method, the quality of data
so corrected has been debated for many years [18]. Commenting that "the
concept of typing a final report twice to check for typographical errors is
almost laughable," one group questioned "why double data entry but not double
everything else?" Because double data entry rests on the assumption that
original records are correct and all errors are introduced during data entry,
this system can never trap errors made by the person completing the form
without exploratory data analysis (EDA) [19]. EDA, which challenges the
plausibility of the written data on the CRF, should therefore be performed
either as data entry is ongoing or as the first stage in an analysis when
double data entry is used.

Random data-entry audits are another way to check the quality of data on a
CRF. This method is based on a predetermined level of criticality (assigned by
the data management/investigator team) for each data category, with respect to
the adverse consequences of entering erroneous data.
For each category, a proportion of the CRFs is sampled by a
random-sample-generating program, and entered data are compared with the
source documents for discrepancies. For very important categories (i.e.,
data that are central to the study objective and must be correct), as many
as 100% of CRFs may be sampled [2, 6]. Noncritical data, which should be
correct but would not affect the study outcome if incorrect, would require a
lower proportion of CRFs to be checked [6]. After sampling, the number of
discrepancies is reported and corrective action taken. The proportion of
audited CRFs for any category may be modified for a given site in light of
site-specific discrepancy rates [2].

Electronic data validation identifies entry errors by their deviation from
allowable and expected values or answers. These include laboratory
measurements, answers that contradict answers to other questions entered
elsewhere on the CRF, spelling errors, and missing values [2]. Because of
their concrete nature, these errors can easily be identified.

Data Queries

To support the full process of study monitoring and auditing, the data
management system should have querying tools in place [2]. After the data
entry/verification process discovers an entry that requires clarification
and determines that the data were accurately entered into the database, the
data coordinator sends the participating institution a paper or electronic
query. Examples of entries that warrant queries include missing data values,
values out of range, values that fail logic checks, or data that appear to
be inconsistent [20]. The query should include protocol and patient
identifiers, specific descriptions of the form/data item in question and the
clarification needed, and instructions on how and when to send a response.
In turn, the coordinating center should have a mechanism for recording the
issue and response to each query [20].

EDC systems have a proven superiority to paper-based systems with respect to
the querying process. The use of eCRFs in combination with manual ad hoc
queries by study monitors has been able to reduce data discrepancies and the
consequent need for clarifications by more than 50%. The enhanced ability to
clean and analyze data has resulted in the generation of more accurate data
[21]. Moreover, compared with a paper-based system, EDC systems with
built-in error checking for data quality have been shown to reduce the total
number of queries and decrease the cost of each query resolution from $60 to
$10 [14].

Document Retention, Security, and Storage

Retention

All clinical investigators should ensure that relevant forms such as CRFs
are always accessible in an organized fashion. Informed-consent forms, CRFs,
laboratory forms, medical records, and correspondence should be retained by
the investigator until the end of the study and, thereafter, by the sponsor
for at least 2 years after clinical development of the investigational
product has been formally discontinued or 6 years after the trial has ended.
Even after the completion of the study, side effects or benefits of the
intervention may be present and the relevant forms may need to be retrieved.
Factors to be considered are the availability of storage space and the
possibility of off-site storage if there is insufficient storage space [22].

Security and Privileging

Both during and after completion of a study, investigators and their staff
must prevent unauthorized access, preserve patient confidentiality, and
prevent retrospective tampering/falsification of data. Under the FDA's Title
21 Code of Federal Regulations [23], access must be restricted to authorized
personnel, the system must prevent malicious changes to research data
through selective data locking, and an audit trail must exist [2].
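
The three requirements just listed (restricted access, selective data
locking, and an audit trail) can be illustrated with a minimal Python
sketch. It is not drawn from any particular EDC product: the class name
CaseReportForm, the role names, and the PRIVILEGES table are hypothetical
and serve only to show how the three safeguards interact.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical role-to-privilege mapping; a real study would define its own.
PRIVILEGES = {
    "coordinator": {"view", "edit", "lock"},
    "investigator": {"view", "edit"},
    "biostatistician": {"view"},
}


@dataclass
class CaseReportForm:
    subject_id: str
    values: dict = field(default_factory=dict)
    locked: bool = False
    audit_trail: list = field(default_factory=list)

    def _log(self, user, action, detail=""):
        # Every attempt, allowed or refused, is appended; nothing is overwritten.
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_trail.append((stamp, user, action, detail))

    def _require(self, user, role, privilege):
        # Restricted access: refuse (and record) any action the role lacks.
        if privilege not in PRIVILEGES.get(role, set()):
            self._log(user, f"{privilege}_denied", f"role={role}")
            raise PermissionError(f"role {role!r} lacks the {privilege!r} privilege")

    def edit(self, user, role, item, new_value):
        self._require(user, role, "edit")
        if self.locked:
            # Data locking: no changes once the form has been locked.
            self._log(user, "edit_denied", "form is locked")
            raise PermissionError("form is locked")
        old_value = self.values.get(item)
        self.values[item] = new_value
        self._log(user, "edit", f"{item}: {old_value!r} -> {new_value!r}")

    def lock(self, user, role):
        self._require(user, role, "lock")
        self.locked = True
        self._log(user, "lock")


if __name__ == "__main__":
    crf = CaseReportForm("S001")
    crf.edit("alice", "investigator", "systolic_bp", 128)
    crf.lock("carol", "coordinator")
    try:
        crf.edit("alice", "investigator", "systolic_bp", 130)
    except PermissionError as err:
        print("refused:", err)
    for entry in crf.audit_trail:
        print(entry)
```

In this sketch refused attempts are logged as well as successful ones, so
that the trail shows who tried to change locked data; whether a given
system records refusals is a design choice rather than something the
chapter specifies.
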
Consideration should be given to software that provides:

• Privileging: Study-specific role-based privileges should be assigned, with
  roles requiring adequate training and documentation of such training prior
  to system use. In the case of multisite studies, it is especially
  important to be able to assure investigators from each site that other
  sites can be restricted from altering their data or, in some cases, even
  seeing their data while the study is in progress. Also, different users
  should have different data access and editing privileges. Software should
  allow site restriction of data and the assignment of both role-based and
  functional privileges. The software should allow the level of restriction
  to be changed as appropriate.
• Storing of De-identified Data: For studies where breach of patient
  confidentiality could have serious repercussions, the software should
  support storing of de-identified data. It is important to note that the
  Health Insurance Portability and Accountability Act (HIPAA) does not
  prohibit the storing of patient-identifiable information: it requires only
  that such information be secure, that it be made accessible strictly on a
  need-to-know basis, and that access to it be audited. The drawback of not
  storing patient-identifiable information in every study is that many of a
  system's useful workflow-automation features, such as generation of
  reminders to be mailed to patients periodically, cannot function
  seamlessly and personalization of reminders requires manual processes.
  Also, in prospective clinical studies for life-threatening conditions such
  as cancer, where decisions such as dose escalation are based on values of
  patient parameters, the storage and selective echoing of protected health
  information (PHI) provides an added safeguard to ensure that data are
  being entered, or the appropriate intervention is being performed, for the
  correct patient.
• Generation of De-identified Data: The software should be able to
  de-identify the data when required in order to share data and should
  utilize information about user role-based privileges as well. For example,
  an investigator may have privileges to view patient identifying
  information, but other personnel, such as biostatisticians performing
  analyses, may view only de-identified data.
• Data Locking: The software should allow a study coordinator to lock all
  the data in the system by study, subject, or CRF level when required.

All investigators, particularly those involved in any type of human subjects
research, must be sure to take adequate steps to preserve the
confidentiality of the data they collect. Investigators must specify who
will have access to the data, how and at what point in the research personal
information will be separated from other data, and how the data will be
retained at the conclusion of the study. The following guidelines for
preserving patient confidentiality should be followed [24, 25]:

• In general, all information collected as part of a study is confidential:
  data must be stored in a secure manner and must not be shared
  inappropriately.
• Information should not be disclosed without the subject's consent.
• The protocol must clearly state who is entitled to see records with
  identifiers, both within and outside the project.
• Wherever possible, potentially eligible subjects should be contacted
  either by the person to whom they originally gave the information or by
  another person with whom they have a trust relationship.
• Information provided to prospective subjects should include descriptions
  of the kind of data that will be collected, the identity of the persons
  who will have access to the data, the safeguards that will be used to
  protect the data from inappropriate disclosure, and the risks that could
  result from disclosure of the data.
• Academic and research organizations should establish patient privacy
  guidelines for non-employee researchers.

Other Responsibilities and Issues

GCP guidelines mandated through the Code of Federal Regulations require that
institutions (or when appropriate, an IRB) maintain records of all research
proposals reviewed (including any
scientific evaluations that accompany the proposals), approved sample
consent documents, progress reports submitted by investigators, and reports
of injuries to subjects [25]. Institutions also must maintain adequate
records on the shipment of the drug product to the trial site and its
receipt there, the inventory at the site, use of the product by study
participants, and the return to the sponsor of unused product and its
disposition [26-28]. Because drug-accountability records must be accurate
and clear, especially for an audit of the study site [29], electronically
based inventory management systems have been devised. In addition to
describing current inventory [20], some of these systems have "look ahead"
capabilities to assess and fulfill future inventory needs [30].

Oversight of Data Management: Role of Institutional Review Boards

As will be noted in Chap. 12, IRBs have a wide range of responsibilities in
the design, conduct, and oversight of clinical trials, and it is important
that clinical researchers be familiar with them. IRB functions that are
particularly germane to those managing data include oversight of protection
of the privacy and confidentiality of human subjects (identifiers and other
data), monitoring of collected data to optimize subjects' safety, and
continuing review of findings during the duration of the research project
[31].

Confidentiality and Privacy of Research Data

Information obtained by researchers about their subjects must not be
improperly divulged. It is essential that researchers be able to offer
subjects assurance of confidentiality and privacy and make explicit
provisions for preventing breaches. For most clinical research studies,
assuring confidentiality typically requires adherence to the following
routine practices: substituting codes for patient identifiers, removing face
sheets (containing such items as names and addresses) from survey
instruments that contain data, properly disposing of computer sheets and
other documents, limiting access to data, and storing research records in
locked cabinets. Although most researchers are familiar with the routine
precautions that should be taken to maintain the confidentiality of data,
more elaborate precautions may be needed in studies involving sensitive
matters such as sexual behavior or criminal activities to give subjects the
confidence they need to participate and answer questions. When information
linked to individuals will be recorded as part of the research design, IRBs
require that data managers ensure that adequate precautions are in place to
safeguard the confidentiality of the information; thus, numerous specialized
security methods have been developed for this purpose and IRBs typically
have at least one member (or consultant) who is familiar with the strengths
and weaknesses of the different systems available. Researchers should also
be aware that federal officials have the right to inspect research records,
including consent forms and individual medical records, to ensure compliance
with the rules and standards of their programs. In the USA, FDA rules
require that information regarding this authority be included in the consent
forms for all research regulated by that agency.

Monitoring and Observation

One of the areas typically reviewed by the IRB is the researcher's plan for
collection, storage, and analysis of data. Regular monitoring of research
findings is important because preliminary data may signal the need to change
the research design or the information that is presented to subjects or even
to terminate the study early if deemed necessary. Thus, for an IRB to
approve proposed research, the protocol must, as appropriate, include plans
for monitoring the data collected to ensure the safety of subjects.
Investigators sometimes misinterpret this requirement as a call for annual
reports to the IRB. Instead, US Federal regulations require that, when
appropriate, researchers provide the IRB with a description of
their plans for analyzing the data during the collection process. Concurrent
collection and analysis enable the researcher to identify flaws in the study
design early in the project. The level of monitoring in the research plan
should be related to the degree of risk posed by the research. Furthermore,
when the research will be performed at foreign sites, the IRB at a US
institution may require different monitoring and/or more frequent reporting
than that required by the foreign institution. Under normal circumstances,
however, the IRB itself does not undertake data monitoring. Rather, other
independent persons (e.g., members of a data safety monitoring board [DSMB])
typically are responsible for monitoring trials and for decisions about
modification or discontinuation of trials. It is the IRB's responsibility,
though, to ensure that these functions are carried out by an appropriate
group. The review group should be required to report its findings to the IRB
on an appropriate schedule.

Continuing Review

At the time of its initial review, the IRB determines how often it should
reevaluate the research project and will set a date for its next review.
Some IRBs set up a complaint procedure that allows subjects to indicate
whether they believe that they were treated unfairly or that they were
placed at greater risk than was agreed upon at the beginning of the study. A
report form available to all researchers and staff may be helpful for
informing the IRB of unforeseen problems or accidents. US Federal policy
requires that investigators inform subjects of any important new information
that might affect their willingness to continue participating in the trial.
Typically, the IRB will make a determination as to whether any new findings,
new knowledge, or adverse effects should be communicated to subjects, and it
should receive copies of any such information conveyed to subjects. Any
necessary changes to the consent document(s) and any variations in the
manner of data collection must be reviewed and approved by the IRB. The IRB
has the authority to observe, or have a third party observe, the consent
process and the research itself. The researcher is required to keep the IRB
informed of unexpected findings involving risks and to report any occurrence
of serious harm to subjects. Reports of preliminary data analysis may be
helpful both to the researcher and the IRB in monitoring the need to
continue the study. An open and cooperative effort between the researcher
and the IRB protects all concerned parties.

Summary and Conclusions

Clearly defined study endpoints combined with well-designed source
documents, CRFs, and systems for capturing, monitoring, cleaning, and
securely storing data are essential to the integrity of findings from
clinical biomedical research trials. Because IRBs have a wide range of
responsibilities in the design, conduct, and oversight of clinical trials,
it is also essential that clinical investigators be familiar with their
requirements.

The inexorable shift from paper-based to EDC systems in large trials
promotes the efficient and uniform collection of data that can be analyzed
and shared across a variety of platforms and databases. EDC systems can
build quality control into the data collection process from its inception, a
more productive approach than building checks onto the end [19]. Although
modern software tools unquestionably improve the potential for data
collection and management, systems alone are worthless without pro-active
study coordinators and investigators who create and enforce policies and
procedures to ensure quality [2]. Therefore, a trial's data collection
system and its findings are only as sound as the commitment by individuals
who formulate and carry out document design, study procedures, training, and
data management plans.
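
Two of the cleaning steps recapped in this summary, the automatic
comparison used in double data entry and the criticality-based random
audit, lend themselves to a short illustration. The following Python sketch
is a simplified picture under stated assumptions, not the procedure of any
cited study: the function names, the dictionary layout of the entered data,
and the 10% audit rate for noncritical data are all hypothetical (the
chapter specifies only that as many as 100% of CRFs may be sampled for
critical categories and a lower proportion for noncritical ones).

```python
import random


def compare_double_entry(first_pass, second_pass):
    """List every (subject, item, first value, second value) that disagrees."""
    discrepancies = []
    for subject in sorted(set(first_pass) | set(second_pass)):
        first = first_pass.get(subject, {})
        second = second_pass.get(subject, {})
        for item in sorted(set(first) | set(second)):
            if first.get(item) != second.get(item):
                discrepancies.append(
                    (subject, item, first.get(item), second.get(item))
                )
    return discrepancies


# Hypothetical audit percentages per criticality level; only the 100%
# ceiling for critical data comes from the text.
AUDIT_PERCENT = {"critical": 100, "noncritical": 10}


def sample_crfs_for_audit(crf_ids, criticality, seed=None):
    """Randomly pick the CRFs to be compared with their source documents."""
    crf_ids = list(crf_ids)
    percent = AUDIT_PERCENT[criticality]
    n = -(-len(crf_ids) * percent // 100)  # ceiling division on integers
    return sorted(random.Random(seed).sample(crf_ids, n))


if __name__ == "__main__":
    first = {"S001": {"sbp": 128, "dbp": 82}, "S002": {"sbp": 141, "dbp": 90}}
    second = {"S001": {"sbp": 128, "dbp": 82}, "S002": {"sbp": 114, "dbp": 90}}
    print(compare_double_entry(first, second))

    ids = [f"CRF{i:03d}" for i in range(50)]
    print(sample_crfs_for_audit(ids, "noncritical", seed=1))
```

The comparison flags only disagreements between the two passes; as the
chapter stresses, it cannot reveal an error that was already on the form,
which is why exploratory data analysis remains necessary.
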
Take-Home Points

• Well-designed trials and data management methods are essential to the
  integrity of the findings from clinical trials, and the completeness,
  accuracy, and timeliness of data collection are key indicators of the
  quality of conduct of the study.
• The research data provide the information to be analyzed in addressing the
  study objectives, and addressing the primary objectives is the critical
  driver of the study.
• Since the data management plan closely follows the structure and sequence
  of the protocol, the data management group and protocol development team
  must work closely together.
• Accurate, thorough, detailed, and complete collection of data is critical,
  especially at baseline as this is the last time observations can be
  recorded before the effects of the trial interventions come into play.
• The shift from paper-based to electronic systems promotes efficient and
  uniform collection of data and can build quality control into the data
  collection process.
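
As one concrete reading of the last point, quality control can be built
into data collection by checking each record at the moment of entry. The
Python sketch below shows the three kinds of electronic validation
discussed in this chapter (missing values, out-of-range values, and answers
that contradict one another) and turns each failure into the text of a
query. The item names, the allowable ranges, and the single logic rule are
hypothetical examples, not values taken from any protocol.

```python
def validate_crf(record, allowed_ranges, required_items):
    """Return one query string per problem found in a single CRF record."""
    queries = []

    # Missing values
    for item in required_items:
        if record.get(item) in (None, ""):
            queries.append(f"{item}: missing value")

    # Values outside the allowable range
    for item, (low, high) in allowed_ranges.items():
        value = record.get(item)
        if value in (None, ""):
            continue  # already reported above if the item is required
        if not low <= value <= high:
            queries.append(
                f"{item}: {value} is outside the allowable range {low}-{high}"
            )

    # Hypothetical logic check: two answers on the same CRF that conflict
    if record.get("sex") == "M" and record.get("pregnant") == "Y":
        queries.append("pregnant: answer contradicts sex")

    return queries


if __name__ == "__main__":
    ranges = {"systolic_bp": (60, 260), "age": (18, 100)}
    required = ["systolic_bp", "age", "visit_date"]
    record = {"systolic_bp": 1280, "age": 54, "sex": "M",
              "pregnant": "Y", "visit_date": ""}
    for query in validate_crf(record, ranges, required):
        print(query)
```

Checks of this kind catch only errors that violate an explicit rule; a
plausible but wrong value still passes, so they complement rather than
replace audits against source documents.
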

8 Constructing and Evaluating Self-Report Measures
Peter L. Flom, Phyllis G. Supino, and N. Philip Ross

A self-report measure, as the name implies, is a measure where the
respondent supplies information about himself or herself. Such information
may include self-reports of behaviors, physical states or emotional states,
attitudes, beliefs, personality constructs, and self-judged ability among
others. A self-report may be obtained via questionnaire, interview, or
related methods. Questionnaires typically are written documents that are
administered without the involvement of an interviewer, whereas interviews
usually (but not always) are administered orally [1]; both are sometimes
termed "surveys."

Self-reports are important in medical research because while some variables
can be evaluated through physiological measures, chart review, physical
exam, direct observation of the respondent, or reports by others, other
variables can be assessed only from information directly furnished by the
patient or other subject. Indeed, the subject often can provide valuable
information about social, demographic, economic, psychological, and other
factors related to the risk of disease or to adverse outcomes of disease.
The choice between self-report, observational, and biophysiological measures
will depend on the data that are available and the nature of the research
questions and hypotheses. It is important to note that while the range of
biophysiological measures is constantly increasing, and while these measures
may permit objective evaluation of clinically relevant attributes, they are
not perfectly reliable (i.e., free from measurement error). Even more
importantly, they may fail to capture the specific quality that the
investigator wishes to evaluate. For example, if an investigator is
interested in blood pressure, this may be evaluated biophysiologically.
However, if the aim of the investigation is to examine the effects of mood
on blood pressure, mood can be evaluated only by self-report as there are no
biophysiological measures of mood (though there may be biophys- [...]
pain or stress, activities of daily living, health-related quality of life,
availability of social support, use and perceived effectiveness of
strategies used to cope with ill-health, satisfaction with the
doctor-patient interaction, and adherence to medication schedules (though
the latter might, at least in theory, also be evaluated through objective
testing).

Although self-report instruments are relatively easy to use, their
construction and validation can be difficult. This chapter will cover
fundamental aspects of, and distinctions among, questionnaires, interviews,
and other methods of self-report and will indicate the circumstances under
which a new self-report measure may be needed. It also will describe methods
of generating and structuring responses; discuss approaches to asking about
sensitive information; describe the rationale for, and processes involved
in, pilot testing, evaluating, and revising a measure; review related
ethical and legal aspects; and provide a general guide to the entire
process.

What Is a Questionnaire?

A questionnaire is a type of self-report instrument that is designed to
elicit specific information from a population of interest. Questionnaires
may be standardized but often are designed (or adapted) specifically for a
particular study. Depending on the objective of the study and resources, the
questionnaire, like other self-report measures, may be administered to all
subjects in the available sample or to a defined subsample. As noted below,
the most common method of administration is direct mailing to subjects,
though other methods exist. Deciding upon the sampling strategy is a complex
process. It can range from a simple random sample to a very complex
hierarchical design involving multiple strata and sampling procedures, as
reviewed in Chap. 10. For additional information on this subject, the reader
is referred to Kish (1995) [2], Groves et al. (2004) [3], and Cochran (1977)
[4].

The questionnaire usually is in the form of a written document, though
sometimes it may be administered by audio or with pictorial methods.
Questionnaires, like tests, can produce a total score or subscores, but also
can yield different types of information that can be separately analyzed.
Questionnaires are almost always a necessity when direct contact with the
subject is not possible. Under these circumstances, questionnaires typically
are administered by mail to the respondent who, in turn, completes and
returns them to the sender. In other circumstances, questionnaires may be
read to the respondent over the telephone or in person as part of a
structured interview, or they may be administered via the Internet in a
variety of ways. A questionnaire can cover virtually any topic, although
here we will emphasize those that capture information related to medical
issues or health-related topics including, but not limited to, diseases,
symptoms, and a patient's experiences with doctors and other health
professionals. Some well-known questionnaires used in medical research are
the Brief Symptom Inventory (a 53-item questionnaire covering nine
dimensions of psychological health [5]); the SF-36 (a 36-item
patient-centered questionnaire about general physical and mental
health-related quality of life [6]); the 26-item World Health Organization
Quality of Life Questionnaire (WHOQOL) [7] assessing general, physical,
emotional, social, and environmental health quality; the Minnesota Living
with Heart Failure Questionnaire (MLHFQ) (comprising 21 questions that
measure the patient's perceived limitations due to heart failure [8]); and
the Morisky Scale (a series of six questions about medication adherence
[9]).

Interviews and Related Methods

There is a large variety of interview and related methods that also can be
used to collect self-report data. These can be categorized along several
dimensions: level of structure of the interview, number of respondents
involved (one vs. two or more), and use of subject narrative (historical or
anecdotal methods). In addition, these types of measures are usually
qualitative (i.e., focus groups, in-depth/unstructured interviews,
ethnographic interviews) as opposed to quantitative (e.g., structured
interviews and questionnaires) in
nature. This chapter provides an overview of some of these qualitative
methods, but the construction of these methods and the analysis of the data
generated from qualitative methods are quite complicated and outside the
scope of this chapter. For further information on qualitative methods and
data analysis, see Strauss and Corbin (1998) [10].

Level of Structure

Unstructured interviews (also known as in-depth interviews [11]) contain
very little organization; the developers of unstructured interviews may have
only a general idea of what sort of information they need or they may wish
to allow the respondents to develop their responses with minimal
interference. Unstructured interviews often resemble conversations,
proceeding from a very general question to more specific ones (the latter
dependent upon responses to the general question). They are advantageous
because they produce data that reflect an exact accounting of what the
respondent has said and can elicit important information that the
interviewer had not considered before the interview. However, they suffer
from a number of limitations. An important one is limited reproducibility,
that is, the same interview, conducted twice with the same subject, can
yield quite different results due to variations in the circumstances of the
interview (including, but not limited to, the influence of unintended
responses by the interviewer) [1]. Other disadvantages include the potential
for digressions by the respondent that can cause this type of interview to
be excessively time-consuming, complexities of coding and categorization of
responses, and difficulty generalizing responses to the reference population
(as unstructured surveys typically are performed on relatively small numbers
of subjects). An example of an unstructured interview can be found in the
work of Cohen et al., who studied patients' perceptions of the psychological
impact of isolation in the setting of bone marrow transplantation; their
interviews began with the question "What was it like to have bone marrow
transplantation?" [11]. Another type of unstructured interview, often found
in the anthropological literature, is termed "ethnographic." With
ethnographic methods, there is even less structure than with traditional
unstructured interviews, as the process begins with the interviewer simply
listening. Perhaps the best-known example of a medical ethnographic study
can be found in the book The Spirit Catches You and You Fall Down [12],
which describes the horrific experiences of a young Hmong immigrant child
and her American doctors, caused by the collision of their vastly differing
cultural views about illness and medical care.

Sometimes investigators may prepare a topic guide or a list of questions of
interest, but the respondents are free to respond in any way they choose.
Interviews of this nature are termed "semistructured" and can be useful when
there is concern about imposition of bias or constraint of potential
responses. Typically, in a semistructured interview, follow-up questions are
simple probes, such as "tell me more," but occasionally they may be more
complex. Because the questions contained in the interview are not fully
articulated before the interview, interviewers using these methods must be
thoroughly trained [13]. Semistructured interviews have been used in a
number of biomedical and health education studies. For example, this
methodology has been used to ascertain cancer patients' views about
disclosing information to their families [14] and to evaluate the
consumption and perceived usefulness of nutritional supplements among
adolescents [15].

As the name implies, a structured interview delineates the questions in
advance, usually with the aid of a written questionnaire or other instrument
[11]. This approach provides more uniformity than is possible with a
semistructured or unstructured interview, but it lacks some of their
advantages. Probably the best-known examples of highly structured interviews
are polls, where the respondents' choices are strictly limited. Although
polls are most familiar in political contexts, they also can be used in
medical research aimed at, for example, eliciting information about patient
preferences regarding types of care or provider characteristics. A greater
degree of structure generally is appropriate when specific hypotheses are
involved and when
the field of study is well developed. A lesser degree of structure is more
appropriate earlier in the development of a field of knowledge or when the
particular research is highly exploratory.

Number of Respondents

While the traditional interview typically entails a one-on-one interaction
between an interviewer and an interviewee (respondent), the joint interview
involves two (or sometimes several) individuals who know each other,
commonly a couple or a family [16]. Joint interviews differ from focus group
methods (described below), where those being interviewed may be strangers.
They have value in survey research because different individuals may have
very different perspectives that may be illuminated by the interaction
between or among them. These different perspectives, in turn, may provide
the researcher with greater insight into the problem at hand; however, to
accomplish this objective, the interviewer must be able to prevent one
respondent from dominating the discussion. Joint interviews have been used
to study family reactions to youth suicide [17] and to study reliability of
reports of pediatric adherence to HIV medication by interviewing both
patients and their caregivers [18]. Note that the term "joint interview"
sometimes is used when there are two interviewers, rather than two subjects.
This approach can be used as a vehicle for interviewer training and for
determination of inter-rater reliability, but it also can be used to provide
better answers to health-care questions, as when a psychiatrist and an
internist jointly interview a patient to obtain information from varying
perspectives [19].

In a focus group, typically four or more individuals (usually a fairly
homogeneous group) collectively discuss an issue, guided by a moderator.
Focus groups are useful for exploring a particular issue in depth. However,
to provide useful information, members of the focus group must be properly
selected. In addition, moderators must be matched well to the subjects, they
must know the subject matter very well, they must be able to elicit
information from those who do not offer it spontaneously [20], and (as in
the case of the joint interview) they must have sufficient skill to ensure
that one member of the group does not dominate the discussion. Focus groups
have been used in medical research to uncover attitudes about a particular
illness or difficulty. For example, Quatromoni and colleagues used focus
groups to explore the attitudes toward, and knowledge about, diabetes among
Caribbean-American women [21], whereas Hicks et al. used focus groups to
explore ethical problems faced by medical students [22].

Narrative Methods

• Life Histories, Oral Histories, and Critical Incidents: Life histories are
  narrative self-disclosures about personal life experiences, typically
  recounted orally or in writing in chronological sequence [1]. They
  commonly are used as an ethnographic tool for identifying and elucidating
  cultural patterns, but the technique also can be of value for eliciting
  the experience of patterns and meanings of health care in populations of
  interest. Oral histories are similar to life histories, but they focus on
  personal recollections of thematic events rather than on individual life
  stories. The critical incident technique, pioneered by Flanagan [23] in
  the mid-1950s, is widely used in many areas of health sciences and health
  sciences education. More focused than life or oral history methods, the
  critical incident technique requires respondents to identify and judge
  past behaviors and related factors that have contributed to their success
  or failure in accomplishing some outcome of interest. The critical
  incident method has been used to explore such wide-ranging topics as
  adverse reactions to sedation among children [24], attitudes of third-year
  medical students toward becoming physicians [25], and reasons why
  physicians changed their areas of clinical practice [26].
• Diaries: A diary is not technically an interview, as no one is asking
  questions. Nonetheless, because diaries have some similarities with
  interview methods, sometimes they are classified with them. A diary is a
  written
record kept by the respondent, usually over a fairly lengthy period of time.
Diaries may have any degree of structure or content; for example, in a study
of diet, a diary might include only what the respondent ate each day. On the
other hand, in a study of reactions to medication, the diary might include
any reactions that a patient may have experienced after taking the
medication. If subjects are not literate, diaries may need to be orally
recorded. Diaries have been used in clinical research to describe somnolence
syndrome in patients after undergoing cranial radiotherapy [27], to measure
morbidity of children experienced at home [28], and for improving heart
failure recognition after intervention [29]; the methodology has been
particularly useful for monitoring symptoms in individual patients in the
setting of "N of 1" randomized clinical trials [30] (see Chap. 5).
Think-Aloud Methods: With think-aloud methods, respondents are asked to
dictate their thoughts into a recorder while they are trying to solve a
problem or make a decision. These methods produce inventories of decisions
as they occur in context [1]. One fundamental aspect of think-aloud methods
that differentiates them from other approaches is that they are concurrent
with the process involved; that is, information is gathered while active
reasoning is taking place. Think-aloud methods have been used to examine
nurses' reasoning and decision-making processes [31] and have been shown to
produce useful information in hospital settings [32]. For further
information about this approach, the reader is referred to the seminal
writings of Ericsson and Simon (1993) [33].

Making the Choice: Questionnaires Versus Interviews

This choice is, in some ways, a false one. Similar questions may be asked in
interviews and questionnaires, and as noted above, interviews may be guided
by written questionnaires. Either approach may be relatively structured or
unstructured. There are even questionnaires that may be completed by couples
or groups. Nevertheless, these methods differ in certain important respects.
As noted, questionnaires tend to be more structured; some forms of
interview, such as those conducted with focus groups, cannot be conducted as
a questionnaire and require a trained moderator. In addition, some
individuals (e.g., young children, stroke patients, nonnative speakers) may
be more comfortable with spoken than with written English and may have a
diminished ability to read, which would limit their ability to complete a
paper-and-pencil questionnaire. These factors notwithstanding, some types of
questions, particularly those that are relatively complex, are better suited
to questionnaires, particularly when skip patterns are clear. (The skip
pattern refers to the idea that some questions will be passed over
appropriately depending on answers to earlier questions or when the
questions do not apply to the respondent.) For example, in a questionnaire
about general health, women might answer questions on topics such as
menstruation and pregnancy, whereas men would not answer these questions.
In addition, because it takes less time to read a question than to speak it,
questionnaires can contain more items, yet be completed within the same
amount of time as an interview covering fewer items. Finally, self-completed
questionnaires may be viewed as less intrusive than face-to-face interviews.
Thus, the choice is a complex process, and a variety of factors must be
weighed.

When Is a New Self-Report Measure Needed?

Creating a new self-report measure entails considerable time and effort for
item construction and for pilot testing, refinement, and validation. Before
undertaking such a project, it makes sense to be sure it is necessary to do
so. As noted above, answers to some questions can be obtained through
biophysiological methods or through direct observation and some cannot.
Should the investigator decide that answers to a research question can be
obtained only through use of a self-report measure, he or she should first
determine whether a suitable measure already exists. (The Internet site
edu/library/reference/publications/tests.html provides directories of tests
and measures in medicine, psychology, and other fields; other good sources
are Tests in Print [34], the Mental Measurements Yearbook [35], and the
Directory of Unpublished Experimental Mental Measures [36].) Should an
existing measure be selected (even if widely used and psychometrically sound
in other populations), the investigator should ensure that it has been
successfully employed and, optimally, validated in the population under
study. If an appropriate preexisting measure cannot be identified, it may be
possible to identify two (or more) measures that together may serve the
needs of the study, though the investigator should be aware that combining
multiple measures (or rewording items) can impact the psychometric
properties of their constituent parts.

Sources of Items

The first source of items for a self-report measure is the existing
literature, which, as noted, includes existing tests and measures. In some
cases, there may be a strong conceptual basis for a set of questions, in
which case the theoretical or discursive literature may be helpful for item
generation.
An additional source of items is observation and interview. One profitable
long-term research strategy is to begin with relatively qualitative methods
(such as unstructured interviews or observation), administered among
relatively small samples, and use the findings obtained with these methods
to develop more structured forms that can be administered to significantly
larger samples. On the other hand, unexpected responses to a highly
structured method may provide the impetus to developing less structured
surveys that can further explore those areas.

Structuring Questions: Key Points

The Respondent's Reading Level: When developing a questionnaire, the
potential respondents' reading level and related characteristics must be
kept in mind. How educated will they be? In which languages will they be
fluent? If subjects are excluded who are not fluent in the language used in
the questionnaire, how will lack of fluency bias the sample? Answers to all
of these questions will vary by sample and by location. If, for example, an
investigator is surveying a group of professionals (e.g., doctors or nurses)
in the United States [USA], England, or in another country in which the
native language is English, it probably is safe to assume that the
respondents will have a reasonable command of English as well as a high
level of education. On the other hand, if patients are being surveyed from
among a heterogeneous population where geographic variations in language
exist, it must be assumed that the patients' language proficiency in the
country's primary language (and their use of alternative languages) will
vary by location and that at least some may have little formal education.
These assumptions can be examined by administering various tests of reading
level. If reading level is low, alternative formats can be used, including
auditory or pictorial methods. For example, pain scales exist that use faces
representing different levels of pain [37]. These can be particularly useful
with young children or with illiterate respondents. (Issues regarding need
for and methods of translating questionnaires are discussed below.)
Clarity: Not only must questions be readable by the target population, they
also must be clearly framed to render the survey process as simple as
possible for the respondent. It is very common to assume that a question
that is clear to the investigator will be clear to others. However, this
often is not the case. The best route to assess clarity is thorough pilot
testing. Questions that are unclear may be skipped by the respondent or,
worse, may be answered in unexpected ways. Unlike readability, lack of
clarity affects respondents at all levels of education and language
proficiency, although it may be more problematic at lower levels.
Ironically, sometimes it can be more
problematic at higher levels of proficiency, as readers may overinterpret
the questions. Lack of clarity can arise from the use of vague or uncommon
words whose meaning is imprecise and not evident in context. However, even
common words such as "assist," "require," and "sufficient" may be
misunderstood [38]. The respondent's perception of clarity will depend
greatly on the population being surveyed. For example, if the population
comprises medical professionals, it may be clearer to use a less common word
because, often, the less common word is more precise. For example, the
choice between "abdomen" and "stomach" might depend on whether the survey is
of medical professionals (for whom the former term is more precise) or the
general population (for whom it may be obscure). Vague words often are found
in the response options associated with the questions. For example, when
asking about the frequency with which a subject does something, words like
"regularly" and "occasionally" are vague; it would be better to specify a
frequency (e.g., "three times a week"). Other common vague words are
"sometimes," "often," "most of the time," and "rarely."
Clarity also can be negatively impacted by ambiguity. Could a word, a
sentence, or a question mean more than one thing within a given context? For
example, if respondents are asked about how much money they made in the last
year, is the question soliciting information about "before-tax" or
"after-tax" income? Does the question imply individual income, household
income, or family income? If the latter, does the term include individuals
not living with the family who contribute financially or individuals living
with the household who are not family members? Should unearned income,
illegal income, odd jobs, capital gains, etc., be included? Complex
questions such as these may be better asked as several questions [39].
Ambiguity also can arise when pronouns are used in unclear ways [40].
Consider, for example, being asked to agree or disagree with the statement:
"Doctors and nurses must educate patients. Otherwise, they will be at risk."
Is it the doctors, the nurses, or the patients who will be at risk? To
ensure clarity, it may be helpful to operationally define terms within the
survey process [39]. However, if definitions of terms are provided, they
should be provided to all respondents, not only to those who ask for them.
Fowler [39] provides a particularly good example of an unclear question of
this nature, in which respondents were asked how often they visited doctors.
Those who asked for clarification were told that doctors included
psychiatrists, ophthalmologists, and anyone else with a medical degree,
whereas those who did not seek clarification may have excluded psychiatrists
and ophthalmologists, or may have included individuals without medical
degrees (e.g., psychologists, nurses, individuals trained in alternative
medicine who did not have MD or similar degrees), rendering interpretation
of these data very difficult.
Avoiding Leading Questions: A leading question is one that guides a
respondent's answers and represents a significant source of bias in any
questionnaire or interview. This can be deliberate or accidental and can
occur in a single question or in a series of questions. For example, "Do you
smoke, even though you know it causes cancer and many other health
problems?" is a leading question framed within a single question. Similarly,
if the respondent is first asked questions about the many dangers of
smoking, and these questions are followed with one that asks the respondents
if they smoke, different answers may be obtained than if the question about
the respondent's smoking history had been posed without the initial
background questions. More subtle leading questions include those that start
with negative wording (e.g., "Don't you agree that ...?" rather than "Do you
agree or disagree that ...?") [38].
Avoiding Double-Barreled Questions: A double-barreled (or multibarreled)
question is one that combines multiple questions. For example, a subject may
be asked to respond to the statement "I exercise regularly and get plenty of
sleep." If the respondent answers
affirmatively, it will not be clear whether he or she exercises, sleeps
adequately, or does both. A negative response is similarly uninterpretable
[38]. A more subtle double barrel is a question that incorporates a
particular reason, for example, "I support civil rights because
discrimination is a crime against God." Such a question may lead to
confusion among individuals who support civil rights for other reasons [40].
Question Order: There are several universal criteria that must be met for
proper ordering of questions. Below is a guide:
- Group similar questions so that respondents can remain focused on one
  area.
- When testing ability, arrange items from easy to difficult to build
  confidence.
- Arrange items from interesting to dull so that respondents do not stop
  answering questions.
- As noted below (see section "Asking About Sensitive Information"), if the
  survey includes questions that are potentially sensitive, these are best
  asked after relatively neutral questions to increase respondent comfort
  level.
- Arrange items from general to specific to avoid biasing the answers. For
  example, if querying patients about their general and specific experiences
  in a hospital, the general question should be asked first; otherwise,
  respondents may answer the general question as the sum of the specific
  questions, ignoring factors that were not included in the specific
  questions (even if those factors were important to the respondents).
- Ideally, all questions should apply to each respondent. When this is not
  possible, skip questions or conditional logic should be used to guide
  respondents through the survey so that they are not required to answer
  irrelevant items or sections. Alternatively, a "not applicable" category
  can be included as a response option to avoid confusion. "Not applicable"
  is not equivalent to "no opinion"; rather, it indicates that the question
  does not apply to the respondent (e.g., questions about complications
  during pregnancy apply neither to men nor to women who have never been
  pregnant).
- Translation issues: If large numbers of potential respondents are not
  fluent in the primary language spoken by the population to which results
  are to be extrapolated (e.g., English in the USA), excluding those
  individuals may introduce sampling bias. However, including them, but
  asking questions only in English, may bias their responses. Under these
  circumstances, the survey may need to be translated. Preparatory to this
  process, it will be necessary to ascertain the primary languages spoken by
  members of the sample. Then, for each language spoken by large numbers of
  the sample, questions and answer choices will need to be carefully
  translated. (If self-reported data are to be collected via an interview
  rather than by questionnaire, it will be necessary to recruit interviewers
  who are fluent in these various languages.) After translation, the
  material will need to be "back-translated" to identify potential
  linguistic problems. However, even these steps may not suffice. Not all
  words and phrases have exact equivalents in other languages, and some
  concepts vary strongly from culture to culture. Chang and colleagues [41]
  investigated premenstrual syndrome in Chinese-American women. Using a
  questionnaire that had been translated and back-translated, they asked
  bilingual women to respond to both the Chinese and English versions. While
  intraclass correlations indicated moderate to high levels of equivalence
  for total scores and scales, some questions showed very little consistency
  between languages.
- Asking the same question in more than one way: Rephrasing a question also
  can help to minimize ambiguity and avoid honest or dishonest errors. As an
  example, studies have found that respondents tend to provide more precise
  and accurate information when they are asked for birth dates compared to
  when they are asked to state their ages [42]. This phenomenon may be due
  to intentional mistruth or to poor recall. Thus, those collecting
  self-report data often will ask for both the respondent's
  birthday and his or her age. However, it is important to be selective, as
  asking all questions in multiple ways not only will make for a very long
  survey, it will invariably irritate the respondents. Therefore, it is best
  to include intentionally redundant items only for key areas and under
  conditions where ambiguity is difficult to avoid.

Structuring Potential Responses

There are two broad types of questions that can be included in a self-report
measure: open-ended (also known as "open") questions and closed-ended (also
known as "closed") questions. These differ according to who (the developer
of the survey or the respondent) is responsible for defining possible
answers to the questions.
Open-Ended Questions: Open-ended questions are those for which the
respondent supplies the answer. These are subcategorized into (1) numeric
open-ended questions that may ask for responses expressed as quantities
(e.g., "How much out-of-pocket money did you spend on medications during the
past week?" "How much weight did you gain during the last year?" "How old
were you when you had your first heart attack?") versus (2) free text
questions (sometimes called "verbatims"). The latter, often seen at the end
of surveys, ask about experiences or satisfaction with services (e.g., "Do
you have any other comments you'd like to share?"). Open-ended questions are
the question-level equivalent of unstructured surveys and share some of the
same problems (in particular, they may be difficult to code). The chief
advantage of open-ended questions is that they do not constrain the range of
possible responses. Indeed, they permit respondents to freely respond to the
question, allowing them to describe their feelings about, attitudes toward,
and understanding of the topic at hand. As such, they potentially can
generate more information about the topic than other formats. Open-ended
responses also tend to reduce the response error associated with answers
supplied by others (i.e., the survey developer). But this approach has its
perils. If a survey includes a question such as "When did you move to New
York?" then, given an open-ended format, respondents may name a year, a
date, or may refer to a time in their lives (e.g., "right after I got
married") or to the history of the area (e.g., "just before the big
blackout"). For a question such as this, it is better to ask for a specific
type of response (e.g., either "How old were you when you moved to New
York?" or "In what year did you move to New York?") because, under these
circumstances, it is unlikely that any response given would be unduly
constrained.
Closed-Ended Questions: Closed-ended questions are those in which the
respondent is asked to choose from a preexisting set of response options
that have been generated by the individuals developing the survey.
Closed-ended questions, therefore, limit the answers that the respondent can
provide. Their primary advantages are that they are easier to code and
analyze, provide more specific and uniform information for a given question,
and generally take less time to answer than open-ended questions.
Closed-ended questions can be subclassified into those calling for
dichotomous responses versus polychotomous (multiple choice) responses.
Dichotomous responses are those that have only two possible values, most
commonly, "yes" or "no." Examples of questions that may generate such
responses are legion ("Did the patient die?" "Do you have a physician?"
"Have you ever had surgery?"). When items are framed as statements rather
than as questions, typical dichotomous responses include "true/false" or
"agree/disagree" response options. Items calling for dichotomous responses
sometimes are combined into scales that can yield an aggregate score. One
well-known example is Thurstone scaling. Thurstone scaling refers not to a
method of soliciting responses to single unrelated items, but to a method of
constructing and scaling several related items. The essential idea is to
construct several dichotomous statements about a respondent's attitudes,
each of which may be answered "Agree" or "Disagree." This method of scaling
can be used to classify respondents with different levels of an attribute
[40].
For example, if the area of inquiry entailed nurses' attitudes about
doctors' orders, the following series of items might be presented:

(a) A nurse must always follow every order that a doctor gives, even if
    he/she thinks it is wrong.
    Agree    Disagree
(b) A nurse should almost always follow a doctor's orders, but may raise
    questions on rare occasions.
    Agree    Disagree
(c) A nurse should generally follow a doctor's orders, but should also voice
    his/her opinions about those orders.
    Agree    Disagree
(d) Nurses should be equal partners in all decisions about patient care and
    should regard doctors' orders as advice.
    Agree    Disagree

In contrast to questions soliciting dichotomous responses, multiple choice
questions include three or more response options. These, in turn, can be
differentiated into questions calling for nominal-level responses and those
that call for ordinal responses.
As noted in Chap. 3, nominal variables are simply names; they have no order.
There are two primary types of questions that call for nominal responses.
The first includes items for which the respondent can provide only one
answer, as the available response options are mutually exclusive. Examples
include questions about demographic characteristics (e.g., religion,
gender), other characteristics such as hair color and blood type, and so on.
The second type includes questions where the respondent can select more than
one response (i.e., "choose all that apply" questions). The latter may
provide very useful information but pose data entry and analytic challenges
that need to be considered when designing the survey instrument. To counter
these, special techniques are needed. For example, if one is interested in
learning about why patients have gone to the hospital, it is advisable to
divide the main question into two subquestions: the first asking the
respondent whether he or she has been to the hospital and (if answered in
the affirmative) a follow-up question asking about reasons for the
hospitalization, with responses entered into separate columns of a
spreadsheet.
Ordinal responses are those that have a meaningful sequence, but no fixed
distances between the levels of the sequence. Questions about subjective
responses are often ordinal. For example, responses to a question such as
"How much pain are you in?" could range from "none," to "a little," to
"some," to "a lot," to "excruciating." They are considered to be ordinal
rather than interval because while they arguably proceed from least to most
pain, it is not at all clear whether the difference between, for example,
"none" and "a little" is larger, smaller, or the same as the difference
between, for example, "a lot" and "excruciating."
As noted, ordinal response scales typically include a number of possible
answers. Usually, an odd number of responses (typically five or seven) is
chosen to allow the respondent a neutral or midrange option, though there is
no consensus about how many choices to include. There are a variety of
different ordinal response scales. The most common are given below:
Traditional Ordinal Rating Scales: These rating scales ask the respondent to
evaluate an attribute such as performance by checking or circling one of
several ordered choices. Rating scales often are used to measure the
direction and intensity of attitude toward the target attribute. An example
of a traditional rating scale is given below:

Excellent    Good    Fair    Poor    Very Poor

Likert Scales represent another traditional type of rating scale that asks
the respondent to indicate his or her level of agreement with a given
statement, with the center of the scale typically representing a neutral
point [40]. Likert scales are most frequently used for items that measure
opinion and take the general form shown below:

Strongly Disagree    Disagree    Neither Agree Nor Disagree    Agree    Strongly Agree
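
The data entry approach described above for "choose all that apply" questions, and the coding of ordered labels such as those of a Likert item, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the list of hospitalization reasons, the column names, and the function names are hypothetical and are not drawn from any particular study or survey package.

```python
# Illustrative sketch only: option lists, column names, and function
# names are hypothetical, chosen to mirror the examples in the text.

# Ordered labels of a 5-point Likert item; list position gives the code.
LIKERT_LEVELS = [
    "Strongly Disagree",
    "Disagree",
    "Neither Agree Nor Disagree",
    "Agree",
    "Strongly Agree",
]

# Hypothetical options for a "choose all that apply" follow-up question.
REASONS = ["injury", "surgery", "childbirth", "other"]


def code_likert(label):
    """Return an ordinal code from 1 to 5, or None for a missing answer.

    The codes preserve order only; differences between codes should not
    be read as fixed distances between response levels.
    An unrecognized label raises ValueError.
    """
    if label is None:
        return None
    return LIKERT_LEVELS.index(label) + 1


def code_hospital_reasons(hospitalized, reasons):
    """Code the screening question plus one 0/1 column per reason.

    Respondents who were not hospitalized skip the follow-up, so their
    reason columns are None (not applicable) rather than 0.
    """
    row = {"hospitalized": int(hospitalized)}
    for reason in REASONS:
        if hospitalized:
            row["reason_" + reason] = int(reason in reasons)
        else:
            row["reason_" + reason] = None
    return row


if __name__ == "__main__":
    print(code_likert("Agree"))
    print(code_hospital_reasons(True, {"surgery", "other"}))
    print(code_hospital_reasons(False, set()))
```

Keeping "not applicable" (None) distinct from "not selected" (0) follows the distinction drawn earlier between questions that do not apply to a respondent and questions that were answered in the negative.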

Semantic Differential Scales: Semantic differential scales measure the
respondent's reactions to stimulus words and concepts using rating scales
with contrasting adjectives at each end [43]. For example, one might ask a
question where the polar extremes are "good" and "bad," with gradations
between these extremes provided as response options.

Good  __  __  __  __  __  __  __  Bad
       3   2   1   0   1   2   3

The Behaviorally Anchored Rating Scale (BARS) is a complex approach to
performance appraisal that combines the elements of traditional rating
scales with critical incident methods. It was developed to counter concerns
about subjectivity associated with traditional rating scales and, thus, to
facilitate relatively more accurate ratings of target behaviors or
performance versus other approaches. A BARS is constructed by compiling
examples of ineffective and effective behaviors (usually based on the
consensus of experts), converting these behaviors into performance
dimensions, and identifying multiple incidents per dimension to form a
numerical scale in which each item is associated with a particular type of
behavior [44]. Respondents may rate their degree of agreement with each item
by checking or circling the appropriate level of the accompanying rating
scale. Shown below is a 7-point BARS that could be used to evaluate an
academic faculty member's research productivity in terms of number and types
of publications produced during a given period (a dimension of interest to
faculty leaders). Note each scale value (1 = extremely poor performance,
2 = very poor performance, 3 = somewhat poor performance, 4 = neither good
nor poor performance, 5 = somewhat good performance, 6 = very good
performance, 7 = outstanding performance) is anchored in specific behaviors
related to the dimension of interest. Unlike traditional rating scales,
which are presented horizontally, BARS typically is arrayed vertically,
comprising between 5 and 9 scale points (values); when the number of scale
values are uneven, the midpoint of the scale typically represents a neutral
response (as is the case in many rating scales).

During the past year, Dr. Heartly has:

Outstanding performance       7  Independently published (as sole or first
                                 author) two or more research manuscripts in
                                 top-tier journals, with others in draft
Very good performance         6  First authored one research manuscript in a
                                 well-regarded peer-reviewed journal, with
                                 minimal input from senior faculty
Somewhat good performance     5  Coauthored one or more published research
                                 manuscripts in a peer-reviewed journal, with
                                 support from senior faculty members
Neither good nor poor         4  Presented a first-authored abstract at a
performance                      scientific meeting but has not completed the
                                 manuscript
Somewhat poor performance     3  Actively coauthored a research abstract, but
                                 provided very limited assistance in manuscript
Very poor performance         2  Provided minimal contribution as coauthor on
                                 a research abstract but no participation in
                                 manuscript development
Extremely poor performance    1  Made no progress in developing scientific
                                 manuscripts or abstracts, due to competing
                                 priorities or interests

Visual Analog Scales: Visual analog scales (VAS) are similar to Likert
scales or semantic differential scales, except, rather than checking a box
or circling a predefined response option, the respondents indicate their
responses by making a mark (denoted here by the "x") along a line anchored
by terms describing opposite values of an attribute, as shown in the
hypothetical example below:

Good |-----------x---------------------| Bad

VAS have the dual advantages of being very sensitive, and, in cases where
the measure is repeated over time, the respondent will not be able to
intentionally duplicate his or her previous response. However, different
individuals may encode physical space differently. Thus, a mark halfway
between good and bad
may not mean the same thing to all respondents. VAS have been used commonly
for the clinical measurement of chronic and acute postoperative pain. In one
study designed to formally assess its psychometric performance in the latter
setting, DeLoach and coworkers [45] administered the VAS to 60 patients in
the immediate postoperative period, using the scale anchors "no pain" and
"worst imaginable pain." The authors found good correlations between the VAS
and a traditional numeric measure, though individual VAS estimates tended to
be relatively imprecise.
Rank Order Scales: With this form of measure, respondents are asked to rank
alternatives in order, rather than rate them on a scale. For example, if
members of a medical school class all had the same professors in one
semester, they could be asked to grade them in relation to one another, as
shown below:

Please rank each of your professors from best to worst, where 1 = best and
5 = worst:
Adams _____  Bassett _____  Cochran _____  Davis _____  Edwards _____

Advantages and Disadvantages of Categorizing Responses

Many times, responses that are fundamentally continuous in nature are
transformed into categorical responses by the design of the questionnaire.
Instead of asking "How old are you?" a respondent can be asked "Are you:
(a) under 18, (b) 19–24, (c) 25–34, (d) 35–44, (e) 45–54, (f) 55–64 or
(g) over 65?" This approach, however, has several important drawbacks.
First, categorical responses cannot be reconverted into continuous
responses. Second, it can limit comparisons with other questionnaires that
utilize different breaks between categories. Third, breaks must be
meaningful, with variations occurring only between those that have been
included. Sometimes the survey developer may choose breaks that are
inappropriate. For example, if, after data collection, it is determined that
most respondents are over age 65, it is not possible to reverse course and
redo the survey adding additional breaks for 65–74 and 75–84. Nonetheless,
there can be advantages to categorical scaling. The primary advantage is
that some respondents may be more willing or able to answer some questions
in categorical form than in numerical form. This is particularly true of
income questions, where respondents may not know their precise income, but
they will know it approximately. (Ironically, self-reported age follows an
opposite pattern as individuals appear to be better able and more willing to
give their birthdates than their ages.)

Asking About Sensitive Information

What is sensitive information? The answer to this question depends on the
respondent, because what is sensitive to one person is not to another. In
general, questions about stigmatized or illegal behaviors, or unusual
beliefs and opinions will be judged to be more sensitive by those who engage
in those behaviors or hold those beliefs than by those who do not [39].
Highly personal questions (e.g., income, weight, some health conditions) or
questions about traumatic events (e.g., rape or child abuse, or other forms
of abuse) also may be viewed as sensitive. When asking about sensitive
information, "warm-up" questions often are used to set the respondent at
ease, thereby increasing the likelihood that the sensitive questions will be
answered. It also may be useful to include a "cool-down" or "cool-off" phase
that can reduce the possible stress induced by the sensitive questions.
Typical warm-up questions include those about nonsensitive demographics
(e.g., county of residence, birth order); cool-down questions often are
quite trivial (e.g., pet ownership, taste in music, food preferences, and
similar items).
Sensitive questions can be uncomfortable to the respondent and may raise
ethical concerns. When included within a research protocol, the investigator
may need to demonstrate to his or her institutional review board (IRB) the
need for such questions and provide assurances that the respondent will not
be compelled to answer them. When asking highly sensitive questions,
interviewer training is essential, and interviewers may need to be aware of
referral services that can be
8 Constructing and Evaluating Self-Report Measures 159

offered if the respondent reveals high-risk behavior, for example, being
involved in an abusive relationship, being suicidal, or using illicit drugs.
In addition, becoming aware of certain types of behavior via self-report may
impose ethical responsibilities on certain classes of professionals. For
example, clinical psychologists have a duty to report certain behaviors.
Clinical researchers typically are obligated to report nonadherence to (or
adverse outcomes associated with) treatment. More generally, anyone who is a
member of a group that has licensure will need to investigate his or her own
specific requirements for such disclosure.

Modes of Administration

Self-reported information can be obtained via a variety of methods. These
include face-to-face interviews, mailed questionnaires, e-mail and web-based
surveys, telephone surveys, computer-assisted response systems, and
randomized response methods.

Face-to-Face Interviews

The chief advantages of face-to-face administration are that response rates
are optimized and that it provides an opportunity for the interviewer to
clarify confusing items. Disadvantages include expense (both time and
money), the possibility that interviewer behaviors may influence (bias)
responses, and the fact that some individuals may be reluctant to answer
some questions in the presence of an interviewer due to embarrassment
(especially sensitive items) or concerns about revealing illegal behavior.

Mail (Postal) Surveys

Administering a questionnaire by mail is relatively inexpensive and helps to
avoid interviewer bias. However, unless care is taken, response rates are
likely to be suboptimal (i.e., <85%) [46], and respondents may not be a
random sample of any particular population, precluding generalizability of
conclusions. These limitations apply even to mail surveys that have been
published in medical journals, where average response rates have been shown
to be approximately 60% [47].

E-mail and Web-Based Surveys

E-mail and web-based surveys are less costly to administer than traditional
postal mail surveys, but have several limitations. Anonymity can be
difficult to ensure, response rates may be low, and responses may not be
random (often, there is no way of knowing exactly who is answering the
questions). Response rates with Internet surveys have been found to differ
from those obtained by postal methods, depending on the group surveyed.
Younger individuals tend to respond more frequently than older individuals
to e-mail, whereas older individuals respond more frequently to traditional
mail [48]; in one study, medical doctors were found to respond more
frequently to traditional mail than to Internet-based methods [49].

Telephone Surveys

Telephone surveys are less costly than face-to-face interviews, but the
telephone-based approach may lead to significant nonresponse. Assuming that
the subject can be reached, the lack of personal contact between the
interviewer and respondent may increase the likelihood that the latter will
decline the interview. In addition, in the current era, many potential
respondents lack landline telephones, and some have multiple telephones,
creating difficulties in achieving a random sample. A recent study using
telephone survey methodology found response rates of only 39% [50].

Computer-Assisted Interviews (CAI)

The availability of computers over the last several decades has created new
methods of administering and responding to surveys. Among the most common
are the Computer-Assisted Personal
160 P.L. Flom et al.

Interview (CAPI), the Computer-Assisted Telephone Interview (CATI), and the
Audio Computer-Assisted Self-Interview (ACASI). With CAPI, the interviewer
typically uses a computer screen to read questions to respondents in the
setting of a face-to-face interview. With CATI, the interviewer follows a
script provided by a software application to ask questions by telephone.
Depending on the system used, the respondent may have the options of
interacting with a live interviewer or listening to a recorded interview and
may answer questions by voice or touch phone mechanisms. CATI also provides
the advantages of automating initial calls and call-backs and keeping notes
on the status of the interviews. With ACASI, the respondent uses a headphone
connected to a computer to listen to preprogrammed questions and enters his
or her responses directly into the computer via a keyboard or keypad. If
respondents have limited computer literacy, these systems can be engineered
to employ a touch screen mechanism whereby the respondent simply pushes a
patch of a certain color. Because absence of an interviewer protects privacy
(broadly defined as control of access of oneself to others), some
respondents may feel more comfortable answering sensitive questions in this
format. Indeed, studies have shown that respondents are more likely to admit
use of illicit drugs and to report sensitive or stigmatized sexual behaviors
with ACASI than when interacting with an interviewer in person or by
telephone [51, 52]. CAI methods have distinct advantages over traditional
paper-and-pencil surveys. They improve turnaround time, avoid problems
associated with skip patterns and branching in complex surveys, and
facilitate entry validation and internal consistency checks. They also
minimize (or entirely eliminate) the requirement for secondary data entry
and cleaning, further improving data quality by avoiding additional
keystroke errors [53]. The primary limitation of CAI methods is their
relatively high initial setup costs. In medicine, computer-assisted methods
have been shown to be of value for obtaining information from stroke victims
[54] or others with limited ability to use a pen. They also have been used
to improve patient care in the setting of HIV infection [55].

Randomized Response

Randomized response is a useful method of assessing the rate of stigmatized
behaviors. In brief, respondents flip a coin (in private) or use some other
randomizing device to determine whether they are about to answer an
innocuous question (e.g., "Is today Monday?") or a sensitive one (e.g.,
"Have you ever used heroin?"). They report their answers (yes/no) without
the investigator being aware which question the respondent was asked, thus
protecting the latter's privacy. At the conclusion of the assessment, a
statistical algorithm is used to calculate the overall prevalence of the
target behavior. Variations on randomized response methods also exist for
ordinal and interval level variables. Randomized response methodology has
been widely used for highly stigmatized behaviors such as illegal drug use
[56] and homosexual sex [57] and has been found to yield more accurate data
than direct surveys [58].

Methods for Boosting Response Rates

There is a large literature comprising methods for increasing response rates
to surveys, some of which involve paying or providing other incentives to
respondents for their participation. Their appropriateness is largely
dependent on the population with which the investigator is working as well
as the nature and magnitude of the inducement. For example, if participants
are members of a low-income, nonprofessional group, offering modest
compensation for time and effort would be ethically appropriate and could
encourage participation in a survey, whereas offering large sums of money or
valuable goods for such participation would be viewed as coercive. Among
more advantaged respondents, offering an inducement could backfire (if the
respondent viewed the inducement as insulting). For such subjects, a
reasonable alternative is to offer money to a charity of the respondent's
choice. Other effective methods, frequently adopted in other domains such as
marketing but applicable to medical research, include making the survey
interesting,
including questions that are relevant to the respondent and keeping the
survey short and simple (KISS). Strategies specific to mail surveys include
the use of personalized questionnaires and/or cover letters that orient the
respondent to the purpose and importance of the study and invite their
participation. Additional strategies include the use of colored ink, first
class mail and recorded delivery, stamped return envelopes (or permitting
use of facsimile), contacting participants before sending surveys,
maintaining follow-up contact with participants, and providing
nonrespondents with replacement questionnaires when the initial
questionnaires were not readily accessible [59]. In one study, the combined
use of replacement questionnaires and chocolate (the inducement) was found
to significantly increase response rates versus either method alone [60].
Strategies specific to telephone surveys include allowing the respondent to
return the call using a toll-free number and sending alerts prior to
initiation of the survey. (For more possibilities, the reader is referred to
the website survey-response-rates.htm.)

Evaluating Psychometric Properties of a Self-Report Measure

Before a self-report measure can be used with confidence, it must be
rigorously evaluated to determine whether it is psychometrically sound; that
is, that it measures the construct of interest (e.g., quality of life,
satisfaction, emotional state of health) accurately in the population of
interest. Such an assessment not only is essential for all newly developed
instruments, it also is important for instruments that have been validated
for other populations. By "accuracy," we mean that the quantitative or
qualitative assessment provided by the instrument should provide as true a
measure of the underlying construct as possible. Unfortunately, all
measurement is accompanied by the possibility of error, which is either
systematic or random, as no data collection technique is perfect. Whenever
we measure a patient characteristic, be it by objective testing or by more
subjective methods, the measurement instrument provides only an estimate of
the quantity of interest. By an "estimate," we mean that the recorded value
is not a direct measure of the underlying quantity of interest or the "true
value." For example, if we are measuring the blood pressure of an
individual, the observed value for the systolic pressure may be 124 mmHg.
However, the true value cannot be observed and is equal to 124 plus or minus
some value reflecting measurement error as well as other sources of error.
Two fundamental components of accuracy, both inversely related to the error
of an observation, are validity and reliability. Physicians and others using
self-report measures for research should have a fundamental understanding of
these concepts if they are to form judgments about the quality of outcomes
based on these measures or develop their own measures. In the setting of
tests and measures, validity relates to how well the instrument measures
what it purports to measure and reliability relates to how consistently the
instrument measures whatever it is that it measures. These qualities exist
on a continuum rather than as absolutes, that is, inferences drawn from an
instrument are neither "valid" nor "invalid" nor are they "reliable" or
"unreliable"; rather, they are valid to a certain degree and reliable to a
certain degree for a given population and setting (i.e., are sample
dependent). Together, validity and reliability reflect the ability of the
instrument to provide an accurate quantitative estimate of the
characteristic of interest to the researcher.

Validity

Validity has been defined as the degree to which conclusions drawn from the
results of any assessment are "well-grounded or justifiable, being at once
relevant and meaningful" [61]. When the term validity is applied to
measurement, it refers to the extent to which the instrument measures the
actual parameter of interest [62]. Thus, a well-built scale should, on
average, produce readings that permit a meaningful conclusion about a
person's actual weight; a well-constructed measure of clinical depression
should yield data that are useful for drawing meaningful conclusions about
the presence and severity of depressive symptoms; and a properly designed
measure of health-related quality of life should provide responses that are
of value for drawing meaningful conclusions about health status or health
utility from the perspective of the patient. In each of these cases, the
quality of the instrument is judged according to the soundness of the
conclusions that can be drawn from the responses that it provides.
Therefore, though the term "valid" is commonly used as a descriptor for
various tests and measures, validity, as Cook and Brown have noted,
represents a property of the inference rather than the instrument itself
[63]. Because these inferences are influenced by the circumstances under
which the instrument is administered, there is no such entity as a
generically valid instrument. Indeed, all instruments should be validated
for each interpretation, including the specific populations and contexts in
which they will be used. For example, a test that measured knowledge of
basic addition and subtraction might be used to draw valid inferences about
mathematics proficiency among first-grade students but would not be useful
for drawing similar inferences about college mathematics majors. Similarly,
a scale that has been validated for one disorder (e.g., depression) would
need to be re-evaluated to establish its validity in the setting of another
(e.g., anxiety). Moreover, an instrument that has been shown to permit valid
inferences under research conditions or in highly selected patients may need
further evaluation before use in a general clinical population [63].
Validation of a measurement instrument is a complex process, in part,
because validity encompasses various dimensions. The most common of these
are summarized below:

Face Validity: Face validity (validity at "face value"), also known as
"representation validity," is concerned with how a measurement instrument or
procedure appears to be relevant to a construct, as judged by a potential
respondent. It is the simplest type of validity to gauge and, typically, is
assessed early in the validation process. Does the assessment seem like a
reasonable way to gain the information the investigator is attempting to
obtain? Does it look as though it will measure what it is supposed to
measure? Does it seem well designed? [64] For example, the Beck Depression
Inventory, which is widely used in clinical medicine, asks questions about
depression; more specifically, it asks about such attributes as sadness,
suicide, and loss of pleasure [65]. It has face validity because these (and
other) items are what most people think of as depression.

Content Validity: Content validity reflects how well the items comprising a
measure cover (sample) the subject of interest or "domain." When a domain is
well defined, content validity is relatively easy to ascertain. If the
domain is less well defined, ascertainment of content validity may require
having experts in the field review the measure [40]. The content validity of
a test of knowledge of women's health was called into question by comparing
the domains it covered with those covered by a set of curriculum guides
[66], and the content validity of the SF-36 health questionnaire was
affirmed by comparing it with the longer instrument from which its items
were drawn [67].

Construct Validity: Construct validity is the degree to which a measure is
related to other measures or attributes, as dictated by theory. It reflects
the extent to which the construct under study (e.g., depression), even if it
cannot directly be assessed, has been properly labeled (operationalized) by
the items comprising the measure. In other words, does the instrument
measure what it was designed to measure? Thus, construct validity is a key
part of validity; no instrument has any value unless it satisfies this
criterion. Inferences about construct validity can be evaluated by a variety
of methods. A common approach to construct validation entails assessment of
the convergent and divergent (or discriminant) validities of a measure.
Convergent validity indicates that the measure correlates highly with other
measures of similar constructs, whereas
divergent validity indicates that it correlates poorly with measures of
other constructs. For example, we would expect a measure of depression to
correlate more highly with measures of anxiety than with measures of most
physical characteristics. Similarly, we would expect measures of
post-traumatic stress disorder to correlate more highly with measures of
similar stressors than with measures of age. A related approach is "known
groups" analysis, which evaluates the extent to which scores on a measure
discriminate between individuals known to possess an attribute versus those
who do not. Known groups validity analysis has been used to provide support
for the construct validity of the Pediatric Evaluation of Disability
Inventory by demonstrating different scores among individuals with different
levels of disability [68]; the method also was used to support the construct
validity of the Multidimensional Fatigue Inventory by demonstrating scores
consistent with greater fatigue among patients presenting with chronic
fatigue-like symptoms or chronically unwell patients versus healthy controls
[69]. An alternative approach involves the use of factor (exploratory or
confirmatory) analysis or principal components analysis to identify clusters
of related items on a scale. Collectively, these methods are useful for
(a) determining how many "latent variables" or "dimensions" underlie a set
of items (thereby helping to elucidate or confirm the structure of the
instrument), (b) condensing a relatively larger number of items into a
smaller number of variables to facilitate statistical analysis, and
(c) clarifying the meaning of these variables [39]. As examples, principal
components analysis was used to define two distinct higher-order clusters
reflecting mental and physical health from among the eight scales comprising
the Medical Outcomes Study Short Form (SF) 36 [70]; exploratory factor
analysis was used to identify three subdimensions of climate (clarity,
challenge, support) in a work-group climate assessment tool for improving
the performance of public health organizations [71], and confirmatory factor
analysis was used to substantiate the single-factor structure of a mental
well-being scale [72].

Criterion Validity: Criterion validity (also known as criterion-related or
instrumental validity) refers to how well the results obtained by an
instrument correlate with or predict some real world behavior or other
attribute. It estimates the accuracy of the measure by comparing it with
some preexisting indicator that has been demonstrated to measure the same
construct (i.e., a "gold standard"). There are two primary forms of
criterion validity: concurrent and predictive. Concurrent validity is
evaluated by comparing two measures in parallel and determining whether they
are concordant. For example, the concurrent validity of a test of fitness
could be defined by determining the extent to which it correlates with
maximum oxygen uptake measured at (or approximately at) the same time [73].
Predictive validity implies that the measure forecasts an expected result.
As examples, a self-report measure of functioning in the elderly was found
to predict mortality [74]; a measure of readiness to change was used to
predict change in drinking behavior in excessive drinkers [75]; and a
measure of adherence to medication instructions was affirmed by predicting
blood pressure 5 years later [9].

Responsiveness to Change: A primary goal of clinical management and target
of clinical investigation is assessment of change over time in a patient's
status in response to treatment. As Portney and Watkins have noted, the use
of change scores as a basis for assessing treatment outcomes is pervasive
throughout clinical research [76]. While some methodologists contend that
the sensitivity of an instrument to change (i.e., its responsiveness) is
distinct from validity [77], others argue that responsiveness is, indeed, an
important component of validity [76, 78]. An instrument is considered to be
responsive if it can accurately detect change when (and only when) it has
occurred [79]. In other words, it should produce scores that change in
proportion to the change in the patient's status, but remain stable when the
patient is unchanged [76].
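
The requirement just stated (scores that move when the patient changes and
stay put otherwise) is usually examined numerically on change scores. As a
minimal sketch, assuming paired before-and-after scores from the same
individuals, the Python below computes two unit-free indices that are
commonly used for this purpose: the effect size (mean change divided by the
standard deviation of the baseline scores) and the standardized response
mean (mean change divided by the standard deviation of the change scores).
The function names and the scores are hypothetical and serve only as
illustration; they are not taken from any of the studies cited here.

```python
from statistics import mean, stdev


def change_scores(before, after):
    """Per-subject change (after minus before) for paired measurements."""
    if len(before) != len(after):
        raise ValueError("before and after must be paired, equal-length lists")
    return [a - b for b, a in zip(before, after)]


def effect_size(before, after):
    """Mean change divided by the sample SD of the baseline scores."""
    return mean(change_scores(before, after)) / stdev(before)


def standardized_response_mean(before, after):
    """Mean change divided by the sample SD of the change scores."""
    changes = change_scores(before, after)
    return mean(changes) / stdev(changes)


# Hypothetical scores for five patients before and after treatment.
before = [10, 12, 14, 16, 18]
after = [13, 14, 18, 19, 21]

print(round(effect_size(before, after), 2))
print(round(standardized_response_mean(before, after), 2))
```

Because both indices divide by a standard deviation, they are undefined when
every baseline score, or every change score, is identical; an analysis of
real data would need to handle that case explicitly.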
Two forms of responsiveness are recognized: internal and external [80].
Internal responsiveness represents the instrument's capacity to detect
change from before to after exposure to an intervention of acknowledged
efficacy [81]. Typically, it is evaluated in the setting of repeated
measures designs that incorporate assessments before and after the
intervention in the same individual. These designs can involve a single
group of subjects followed over time (i.e., a treated cohort, where
intra-subject change is expected) or include two groups (including an
untreated control where change is unexpected). External responsiveness
refers to the degree to which changes in a measurement correlate with
changes in other putatively related measures of health status [81]. Both
forms of responsiveness are influenced by reliability and scale
characteristics. Scales that are unreliable will produce too much "noise" to
allow for determination of meaningful change over time. Scales with too few
response categories may fail to detect all but very large changes. Scales
producing "ceiling" effects (due to restriction at the upper level of the
range of possible values) may leave little room for improvement on
subsequent testing just as those producing "floor" effects (where data
cannot take on lower values) will be insensitive to clinical decline even
when there is a worsening of status or functioning. When instruments with
varying scaling characteristics (type, length, directionality, etc.) are
compared to determine their relative responsiveness, unit-free statistical
approaches including standardized scores and comparisons (e.g., effect sizes
or standardized response means) must be used. (For an excellent discussion
of these techniques and their interpretation, the reader is referred to
Liang et al. [82] and Angst et al. [83].)

As noted throughout this volume, the validity of any study can be threatened
by bias, which broadly is defined as known or unknown systematic error in
the design, sampling, measurement, or other critical aspects in the conduct
of an investigation that can produce distortions of findings. Unlike a
random error, described below, a systematic error consistently affects the
measurement of the variable in the same way each time that the measurement
is done. It provides an incorrect measure of the variable, and the error
will be the same for every subject.
There are several types of bias that specifically affect responses obtained
in self-report measures; some of the most common are listed below. (For a
fuller list, the reader is referred to Aiken and Mardegan [44] and Choi and
Pak [38].) Although adequate quantitative data are not available for
purposes of comparison, there is general agreement that the extent and
impact of these biases vary greatly from discipline to discipline and from
one population to another.

Social Desirability Bias: Social desirability bias (sometimes termed "faking
good" bias) refers to the tendency of respondents to answer questions in
ways that make them look good, rather than honestly [40]. This positive
response bias may be of two types: some respondents may deliberately tell
falsehoods in order to appear acceptable to those conducting the survey,
whereas others may have internalized the dishonest response. (The latter
occurs more commonly than generally recognized [84].) The social
desirability bias can compromise most forms of self-report, but its
potential impact should especially be anticipated when asking about
stigmatized behaviors or attitudes (e.g., when questions involve issues of
criminality, violence, or sexual orientation), or when the respondent has
reason to believe that a socially nondesirable response could cause him or
her to lose something of critical value (e.g., a belief by a patient that
nonadherence to a health-care provider's instructions could negatively
impact future interactions with that provider). Although it may not be
possible to eradicate this form of bias, the extent of its potential
influence can be examined by embedding, in the self-report measure, an item
or two that ask the respondent to answer a question such as "I have never
intentionally told a lie" or "I always know the difference between right and
wrong" or through formal testing. A common test of social desirability is
the Marlowe-Crowne scale [85]; a shorter version
of this scale has been created by Strahan and Gerbasi [86].

Agreement Bias: Agreement bias (also known as acquiescence bias) is the
tendency to say "yes" or "I agree" to every item regardless of content. It
is subtly different from social desirability bias as agreement bias includes
admission to possessing socially undesirable traits. For example,
respondents manifesting agreement bias might respond affirmatively to the
question, "Have you ever used illicit drugs?" whereas those exhibiting
social desirability bias would likely provide the opposite response. The
phenomenon is thought to have multiple causes. First, it has been argued
that most respondents desire to be polite and respectful and, thus, do not
wish to disagree with the questioner [87, 88]. Second, respondents may feel
that they have lower standing than the questioner and agree with questions
based on this perceived status differential [89]. Third, respondents may
select an agreeable (but not necessarily truthful) answer to complete the
survey as rapidly as possible [90]. Whatever the cause, agreement bias can
be detected (and sometimes resolved) by including a balance of positively
and negatively worded items [91], though care must be taken to minimize
confusion to the respondents.

Faking Bad Bias: In contrast to social desirability (or "faking good") bias,
the "faking bad" bias occurs where failure (in the usual sense) is a goal.
In the context of self-reported information, faking bad is a negative
response bias that is caused by the respondent's desire to appear worse
(e.g., manifest symptom amplification) than he or she really is either to
avoid duty or responsibility (i.e., malinger) or to qualify for goods or
services [38]. If faking bad bias is suspected, methods exist to detect it.
(For a comprehensive discussion of one such method [the Fake Bad Scale], the
reader is referred to Nelson et al. [92].)

Halo Effect: The halo effect is a systematic bias that occurs when
respondents fail to rate individual attributes of a person, object, event,
or service in isolation but instead let overall impressions guide their
ratings. It is suspected whenever respondents assign similar ratings to each
dimension measured in a survey (e.g., rate all aspects of performance as
"excellent" or all components of a course or program as "very good"). The
phenomenon, empirically confirmed by Thorndike in 1920 [93], is thought to
result from a cognitive bias, whereby one particular trait, especially a
positive characteristic, influences or extends to perception of other
traits. A commonly cited example is judging an attractive person as more
intelligent. Its logical opposite is sometimes termed the "devil," "horns,"
or "reverse-halo" effect whereby individuals judged to have a single
undesirable trait (e.g., unattractiveness) are subsequently judged to have
other undesirable traits (e.g., lack of intelligence) based on the
evaluator's tendency to allow a single weakness to influence the totality of
impressions [94]. In the setting of a survey, a respondent's prejudices,
recollections of previous observations, and even answers to previous
questions also may influence responses. Thus, the halo (and reverse-halo)
effects collectively represent an important bias that must be recognized
and, if possible, minimized to improve the accuracy of individual ratings.
Several approaches have been recommended including proper introduction of
the purpose of the survey (to emphasize the importance of the respondent's
ratings), increasing the number of attributes to be rated (bearing in mind
that an excessive number of questions may cause the respondent to abandon
the survey), and/or physically arranging scales so that their favorable and
unfavorable ends alternate.

Reliability

Reliability is related to the question "how consistent or reproducible are
the scores that an instrument produces?" Like validity, reliability
technically is considered to be a property of the measurement rather than of
the instrument itself because the same instrument administered in
different settings and to different subjects under varying conditions can
yield widely varying reliability estimates [63]. Reliability is considered to
be a necessary, but insufficient, element of validity [95, 96]. This is because
valid conclusions cannot be drawn from an instrument that yields inconsistent
observations [63]. At the same time, reliability does not imply validity
because an instrument can produce consistent errors.

The concept of reliability can be illustrated using the metaphor of a bathroom
scale. For example, if you are like many people, you probably will step on
your bathroom scale in the morning, check your weight, step off, and step back
on the scale to recheck the reading. You have learned through experience that
the measurement displayed by a bathroom scale the first time you weigh
yourself is not always the same as the second time you try, but usually it is
very close. A good scale might vary by half a pound or so, but if measured
weight differs significantly (e.g., more than 5 lb) at 7:00 a.m., 7:01 a.m.,
and 7:02 a.m., the readings that the scale produces would have very limited
reliability. Similarly, if an instrument is designed to measure a patient's
self-confidence, then it should yield approximately the same result each time
it is administered to the same subject.

Whereas validity is diminished by systematic error, reliability is reduced by
random (chance) error. There are many sources of random error in research
measurement. The most common are those caused by factors related to the
subject, researcher, environment, and instrumentation. For example, a subject
who is tired, sick, hungry, angry, irritable, or confused may produce
measurements that are different from what they would be if the subject were
not so afflicted. Indeed, any changing physical, emotional, or psychological
state of the subject, including the subject's awareness of the researcher's
presence, can introduce error into the measurement process. The researcher can
introduce random error in measurement simply by his or her physical
appearance, demeanor, or other personal attributes or by becoming fatigued,
impatient, bored, ill, or distracted. Many factors that cause random error in
measurement can arise from perturbations of the research setting (e.g.,
unintended variations in temperature, lighting, noise, or interruptions).
Finally, many factors causing random error have their source in the
instrument. For example, unclear questions or directions, inadequate item
sampling, suboptimal format, or even the order in which the questions are
posed are potential sources of random error. Random error (like systematic
error) must be considered in interpreting the results of studies; the greater
the error, the less we can rely on the results of the measurement process for
decision-making. In designing or selecting among instruments, we are
constantly striving to create or identify those that not only measure the
attribute of interest but also measure that attribute reliably.

Like validity, reliability can be classified according to several dimensions.
These include the stability of the measurement over time, the congruence of a
measurement when defined by different assessors (or determined by different
methods), the consistency (homogeneity) of items within a measure or scale,
and the correspondence of parallel measures. These dimensions, typically
expressed as reliability coefficients, are evaluated using various
methodological approaches, as described below:

Test-Retest Reliability (Temporal Stability): Test-retest reliability is the
most commonly recognized form of reliability. It is evaluated by administering
the same item, scale, or instrument to a sample of individuals twice over a
relatively short period (the period depending on the intrinsic stability of
the variable under study) and comparing the results using Pearson's product
moment correlation for interval data or Spearman's rank order correlation for
ordinal data. Test-retest correlation coefficients ranging from 0.70 to 0.80
generally are considered to be satisfactory to good (though criteria for
acceptability vary according to discipline). This measure of reliability is
most appropriate for assessing relatively enduring characteristics such as
personality traits, aptitude, and chronic health status in stable populations
where subjects are willing to undergo multiple administrations of the same
measure. It is less appropriate for
8 Constructing and Evaluating Self-Report Measures 167

estimating temporal consistency of attitudes, mood, and knowledge that can be
influenced by experience(s) or for health states that have been altered by
intercurrent events between measurements.

Interobserver (Inter-rater) Reliability: Inter-rater reliability reflects the
agreement between or among two or more assessors who independently rate the
same item, scale, or instrument administered within a sample of individuals at
a single point in time. Cohen's kappa (κ) is a commonly used statistic for
estimating agreement between two raters for binary data (e.g., heart failure
present vs. absent); a related statistic (weighted kappa) may be used for
ordinally ranked data such as those obtained via Likert-type scales. If the
raters are in complete agreement, then κ = 1. If there is no agreement beyond
that which would be expected by chance, then κ = 0 (values <0 signify that
agreement is even less than that which would be attributable to chance).
Although there is no universal consensus, in the range of values indicating
better than chance agreement, values of 0.01 to 0.20 have been interpreted as
slight agreement, 0.21 to 0.40 as fair agreement, 0.41 to 0.60 as moderate
agreement, 0.61 to 0.80 as substantial agreement, and 0.81 or greater as almost
perfect agreement [97]. When data are at the interval level, inter-rater
reliability can be established via computation of Pearson's correlation
coefficient (r) when sample size is relatively large and by the intraclass
correlation coefficient (ICC) when sample size is smaller (i.e., <15) [98],
and is interpreted in the same manner as kappa.

Internal Consistency: Internal consistency is an approach to reliability
assessment that estimates the homogeneity of items in a scale that are
intended to measure the same construct. The essential idea is that the various
items on a scale all should correlate highly and positively; that is, when one
item is answered in a particular way, other related items ought to be answered
similarly. This approach is preferable to test-retest methods for instruments
that are highly sensitive to change and which, when evaluated as repeated
measures, can falsely create the impression of relatively low reliability
[99]. Internal consistency reliability customarily is evaluated by a variety
of approaches, each of which assesses equivalence of responses within a
related group of items during a single administration of the instrument to the
same subjects. The most common are given below:

Split-Half Reliability is one of the oldest methods for evaluating internal
consistency. It is calculated by dividing a scale into two parts, computing a
correlation coefficient between those parts, and adjusting the correlation
using the Spearman-Brown prophecy formula to correct for foreshortened test
length (as shorter scales tend to yield lower reliability estimates). As a
rule of thumb, coefficients between 0.70 and 0.80 indicate adequate
reliability, and coefficients of 0.90 or greater indicate high reliability. If
the two half measures are highly correlated, this provides evidence that they
are measuring the same attribute. Two common methods for performing this
analysis are to choose the first N items and the last N items, or to choose
odd numbered items and even numbered items. It is important that split-half
reliability be determined for particular scales, not for entire questionnaires
comprising different scales. For example, if a questionnaire assesses both
anxiety and depression, the split-half reliability of the two measures will
need to be evaluated separately.

The Kuder-Richardson Formula 20 (KR-20) [100]. The KR-20 can be used to
provide an estimate of internal consistency for scales calling for binary
(dichotomous) responses (e.g., yes/no, true/false, agree/disagree,
symptomatic/asymptomatic). Unlike the split-half method (described above),
which is based only on a single splitting of items, the KR-20 computes
split-half reliability based on all combinations of splittings and produces an
estimate of the mean correlation of the items comprising the measure. Values

can range from 0.00 to 1.00 (sometimes expressed as whole numbers, 1 to 100).
A high KR-20 coefficient (i.e., >0.90) indicates a homogeneous measure or
scale. A variant, the KR-21, is computationally simpler (it is based only on
the assessment mean, variance, and number of items on the scale), but tends to
produce lower reliability estimates.

Cronbach's Alpha [101] is the best known, and most commonly used, measure of
internal consistency. Like the KR-20, Cronbach's alpha conceptually represents
the mean of all split-half reliability estimates for a scale [76] and is
computed by calculating pair-wise correlations between items in a scale;
however, Cronbach's alpha can be used with scales that include several ordinal
response options (e.g., 1 = strongly agree through 5 = strongly disagree or
0 = not limited by heart failure symptoms through 3 = severely limited by
heart failure symptoms) as well as those that include binary response options,
making it more flexible than the KR-20. Values of 0.70 or above are widely
viewed as acceptable, and values of approximately 0.90 are considered to be
excellent [102]; however, extremely high reliability estimates (i.e., 0.95 or
greater) suggest that some of the items may be redundant, contributing no
information beyond that furnished by other items on the scale. "Alpha if item
is deleted" is a widely used index that can be useful for deleting
nonhomogeneous or redundant items during the process of scale development.
Nonetheless, when using standardized scales, all items (including those that
reduce alpha) should be retained to permit meaningful comparison with previous
as well as future assessments employing the same instrument.

Alternate (Equivalent, Parallel) Form Reliability. An investigator may be
concerned that repeated measurement using the same instrument might threaten
the internal validity of an intervention study because (as noted in Chap. 5)
exposure to the first assessment can influence the results of subsequent
assessment by providing an opportunity for practice or learning independent of
the intervention. This threat to internal validity (testing effects) can be
minimized (though not entirely eliminated) by using alternate forms of
measurement of the same construct or content domain before and after the
intervention. One commonly used approach to creating these alternate forms is
to generate a large pool of items, each of which addresses the construct being
studied, and to divide the items randomly to create two functionally
equivalent instruments of similar difficulty and length. Other methods include
changing the wording or order of the questions in the two instruments. (The
same approach is used to discourage cheating on high stakes achievement or
aptitude tests.) After the alternate forms are created, they are administered
to the same sample, and the results are correlated. If they produce similar
results for the same subjects (i.e., they yield correlation coefficients
>0.80), they are considered to be equivalent forms and can be used
interchangeably [62]. (The reader will note that the methodology for
establishing alternate form reliability, when based on division of a related
item pool, is analogous to that used for estimating split-half reliability.
The primary difference is that with split-half reliability, items within a
single scale or measure are divided solely for the purpose of determining
internal consistency, whereas with the alternate form approach, the objective
is to construct two equivalent instruments that can be used independently of
one another.)

Ethical and Legal Aspects of Survey Methods

Given below is a brief précis of some ethical and legal issues involved in
survey research. Any investigator should carefully review the policies of his
or her institution to ensure compliance.

If the investigator has a professional license, that licensing body may also
have relevant rules and regulations governing survey research.
1. General participation. In all cases, respondents must know that they are
   free to not participate, to skip questions, and to stop the survey at any
   time.
2. Sensitive questions. If sensitive questions are asked, provision should be
   made for debriefing, and respondents should be provided with information
   about relevant services, as appropriate. For example, if an investigator
   asks a subject about illicit drug use, information may need to be provided
   about available treatment facilities.
3. Privacy. Especially when sensitive information is discussed, substantial
   efforts should be made to keep identifying information private. One
   solution is to use code numbers rather than names and, if necessary, to
   store a link of code numbers to names in a separate and secure location.
4. Snowball (chain-referral) sampling. Sometimes, when a sampling
   characteristic is relatively rare within a population, or when a population
   is "concealed" from society at large, an investigator may have difficulty
   locating an adequate number of subjects for a survey. This can occur when
   the population of interest comprises individuals who exhibit illegal or
   otherwise stigmatized behaviors (e.g., illicit drug use or prostitution).
   One approach that sometimes is used to increase sample size under these
   conditions is to recruit a relatively small number of subjects who possess
   the desired sampling attribute and ask each subject to bring in additional
   subjects from among their acquaintances (social network) who possess the
   same attribute. These, in turn, may be called upon to recruit similar
   additional subjects for the study. Thus, the sample grows metaphorically
   like a "snowball." Though snowball sampling can reduce subject search costs
   and provide access to subjects who would otherwise be inaccessible, the
   investigator must take great care to adequately protect the potentially
   sensitive and damaging information given by respondents during the chain
   referral process, as disclosures from the investigator could compromise
   privacy of the subject and confidentiality of their data, destroy the
   relationships within the chain, and militate against further recruitment
   [103].
5. Focus groups. Focus groups pose special ethical problems, because members
   of the focus group share information that can, potentially, be used by one
   participant against another. As a hypothetical example, suppose a focus
   group of medical students were convened to evaluate specific academic
   programs and one member of the focus group identified a certain faculty
   member as incompetent. If another focus group member knew the identity of
   the participant expressing this view, he or she could be threatened or even
   blackmailed. As another example, if a focus group member acknowledged
   having HIV or some other stigmatized condition or admitted to engaging in
   illicit behavior (such as abuse of prescription or nonprescription drugs),
   similar problems could ensue.
6. Children and other special populations. Additional rules apply when
   conducting self-report surveys involving children and other special
   populations (e.g., prisoners, individuals with mental disabilities). These
   populations may have limited ability to supply informed consent, either due
   to lack of comprehension (e.g., young children and individuals with mental
   disabilities) or because of feelings of duress (e.g., prisoners). (A
   listing of these rules can be found in the Code of Federal Regulations,
   Title 45 Public Welfare, Department of Health and Human Services [104].)

Summary: A General Guide to Constructing a Measure

This chapter has highlighted the complexities of constructing a self-report
measure. If the investigator believes that the need for a new measure
outweighs the effort required to develop it, the following provides an outline
of the essential steps involved, adapted from those suggested by

DeVellis [40] and Fowler [39]. (Further details of these steps can be found in
their writings.)
1. Determine precisely what must be measured. It is not sufficient to have a
   vague idea of what is to be measured; one needs to be fairly precise. If
   the study is analytic, how well does the new measure facilitate testing of
   the research hypothesis? If the study is performed to generate a
   hypothesis, how well will the anticipated responses achieve this objective?
   Will the measure assess knowledge, attitudes, behaviors, or a combination
   of these areas? What areas must be covered? How will the new measure differ
   from existing measures? What theory will guide the development of the new
   measure? How specific versus general should the measure be? As is the case
   for all forms of research, time spent clarifying objectives at the outset
   will save a great deal of time later on.
2. Define the population of interest. State, as precisely as possible, whom
   you wish to study. Often, the choice will be a compromise between optimal
   and available subjects. An investigator may be interested in all humans
   with a disease, but it is never possible to study all such individuals. It
   also is very difficult, if not impossible, to obtain a random sample of
   such individuals from around the world. Early in the design of the study,
   the investigator should identify the age group and gender(s) of interest,
   the geographic location of potential respondents, their racial or ethnic
   characteristics, etc.
3. Select the type of self-report to be used. Decide whether the information
   being sought is best obtained via a mailed self-completed questionnaire, an
   in-person or telephone interview, or a computer-based method. Each approach
   has advantages and disadvantages, as noted above.
4. Generate the item pool. Initially, a large pool of items should be
   generated, covering as many different parts of the construct of interest as
   possible from different perspectives. Brainstorm. At this stage, the
   creator of the survey instrument should not fear redundancy or a long list
   of items; these can be narrowed later in the process. It is not uncommon
   for the initial pool to contain four times as many items as the number of
   items comprising the final scale.
5. Determine the measurement format. As previously indicated, questions and
   responses can be framed in numerous ways. The preferred format should be
   considered at the same time that the item pool is generated to maintain
   consistency. For example, will the survey be unstructured, semistructured,
   or structured? If the questions call for closed-ended responses, how many
   response categories will there be? What type of scaling will be used? Will
   the time frame to which the questions refer be specified or implied, etc.?
6. Develop validation items. Validation items are of two types: (a) those
   that do not directly measure the construct under study, but which may be
   useful for detecting flaws (biases) in the measurement process, and (b)
   those which assist in assessing the construct validity of the new measure.
   Including a social desirability scale can help to determine which items
   tend to be influenced by this positive bias and serve as a basis for
   eliminating them. The inclusion of items from a putatively related measure
   can be used to buttress a claim of construct validity or identify poorly
   performing items [40].
7. Pretest. Once a large pool of items has been defined, it can be reduced to
   a manageable number and screened for omissions, errors, and related
   problems. Independent review by content-matter experts, colleagues, and key
   decision makers can be helpful for establishing both the face and content
   validity of the preliminary instrument and for obtaining feedback regarding
   specific items. Reviewers can be asked:
   • How relevant each item is to the construct being measured
   • How clear the items are
   • If there are ways to make the items more concise
   • If key items are missing (there should be at least one question for
     every variable of interest)

   • If items are superfluous or redundant
   • If items are difficult to read or answer (e.g., are ambiguous or
     otherwise unclear)
   It also is helpful to solicit review of the drafted items from individuals
   who are similar to the intended respondents. This can be done within a
   focus group or as a series of one-on-one "cognitive interviews" conducted
   among a small number of individual respondents. Both approaches allow
   exploration of how well the items are understood and are particularly
   useful when the intended respondents differ greatly from the individuals
   writing the survey instrument. Specific questions should be asked about how
   respondents interpreted the questions, how they thought the various
   questions differed from each other, how readable they were, and what their
   responses meant. At this stage, questions can be open-ended, as one of the
   goals of pretesting is to identify response options that may have been
   overlooked (a prespecified list of response options will, by force,
   constrain the respondent to think like the survey developer). Feedback from
   the pretest can be used to add, delete, and otherwise refine questions to
   be included in the preliminary instrument and to frame appropriate response
   options.
8. Pilot test. Pilot testing is crucial to development of a valid and useful
   scale. No matter what care is taken in developing and screening items, some
   will be misinterpreted by respondents. Pilot testing involves administering
   the preliminary questionnaire (including the cover letter and directions)
   to respondents who, again, are as similar as possible to members of the
   target population. The pilot should be performed, to the extent possible,
   under conditions that mirror the conditions under which the final survey
   will be conducted. It should ask respondents to find flaws in the survey
   (e.g., Were directions and skip patterns (if any) clear? Was the survey too
   long? Was the format appropriate? Were any of the questions confusing or
   otherwise unclear? Did any not apply? Were any overly intrusive? Were any
   redundant? Did they flow well?). Statistical methods (e.g., evaluation of
   distributional characteristics, examination of missing answers,
   item-to-item and item-to-scale correlations) can be applied to responses
   obtained in the pilot to detect poorly performing or redundant items and to
   evaluate their impact on internal consistency when retained or deleted. It
   is difficult to find guidance regarding the minimal number of participants
   to be included in a pilot. Some workers in the field have suggested 300
   [105]; others [40] have recommended that for single scales comprising
   relatively few (e.g., 20) items, a smaller number may suffice. A cautionary
   note is in order. If too few respondents are chosen, it may not be possible
   to evaluate the items properly; if the sample is not representative, items
   may have different meanings to the pilot sample versus the target
   population, and the relationships among the items may be different as well
   [40].
9. Edit. Invariably, once a measure is pilot tested, revision will be
   required. Directions may need to be clarified. Confusing, overly intrusive,
   or unanswered questions will need to be deleted or reworded (though
   reworded items may need to be retested). If revisions are extensive, a
   second round of pilot testing may be required. Once poorly performing items
   are eliminated, the length of the instrument should be evaluated. Too short
   a measure will not fully explore the construct of interest. However, one
   that is too long may bore or frustrate the respondents.
10. Assess reliability and validity. Before an instrument can be used for
    formal research purposes, its reliability and validity must be assessed in
    the population of interest. As noted above, the most common test for
    reliability is Cronbach's alpha; for validity, the appropriate method
    depends on the degree of development of substantive knowledge and the
    existence of (a) other measures of the same construct, (b) measures of
    similar but different constructs, and (c) a "gold standard."
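
As a concrete illustration of steps 8 and 10, the following minimal Python sketch computes Cronbach's alpha and the "alpha if item is deleted" index for a small pilot data set. It assumes the common variance-based form of alpha, k/(k - 1) × (1 - sum of item variances / variance of total scores), which the chapter itself does not spell out. The function names and the pilot responses are hypothetical and serve only as an example; item 4 is deliberately worded in the opposite direction and left unrecoded so that it behaves as a poorly performing item.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for rows of item scores (one row per respondent).

    Assumed formula: k/(k-1) * (1 - sum(item variances) / variance(total score)),
    using sample variances throughout.
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    item_columns = list(zip(*rows))                   # one tuple of scores per item
    item_variance_sum = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(r) for r in rows])  # variance of summed scale scores
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


def alpha_if_item_deleted(rows):
    """Alpha recomputed with each item left out in turn (same order as the items)."""
    k = len(rows[0])
    return [
        cronbach_alpha([[v for j, v in enumerate(r) if j != i] for r in rows])
        for i in range(k)
    ]


# Hypothetical pilot responses: 6 respondents x 4 Likert items scored 1-5.
# Item 4 runs in the opposite direction to items 1-3 and has not been recoded.
pilot = [
    [4, 5, 4, 2],
    [2, 2, 3, 4],
    [5, 4, 5, 1],
    [3, 3, 3, 3],
    [1, 2, 1, 5],
    [4, 4, 5, 2],
]

if __name__ == "__main__":
    print("alpha, all items:", round(cronbach_alpha(pilot), 3))
    for i, a in enumerate(alpha_if_item_deleted(pilot), start=1):
        print(f"alpha if item {i} deleted:", round(a, 3))
```

In this made-up data set, alpha for the full scale is very low, and it rises sharply once item 4 is left out, which is the pattern that would prompt a developer to recode, reword, or drop that item during scale development. As the chapter notes, items of an established standardized scale should be retained even when they reduce alpha.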

Take-Home Points

• A self-report (a.k.a. survey) is a measure where the respondent supplies
  information about himself or herself.
• Self-reports are important in medical research because some variables (e.g.,
  attitudes, beliefs, self-judged ability) can be assessed only from
  information directly furnished by the patient or other subject.
• A self-report is obtained by questionnaire, interview, or related methods.
• Questionnaires are written documents that can be self-completed without
  interviewer involvement or read aloud as part of an interview; interviews
  usually (but not always) are administered orally; both can be structured
  (comprise closed-ended questions), unstructured (comprise open-ended
  questions), or semistructured (comprise a mix of both question types).
• If answers to a research question can be obtained only via self-report, the
  investigator should first determine whether an instrument already exists
  that is reliable, valid, and otherwise suitable for the population of
  interest.
• In situations where a new instrument must be developed, the investigator
  must clearly define the question(s) of interest; identify the population to
  be surveyed; select the preferred type of self-report/format of measurement;
  consider inclusion of validation questions; pretest, pilot test, and edit
  the measure; and test the final battery of questions for reliability and
  validity.
• When developing or implementing a survey, the investigator must be certain
  to observe all ethical and legal aspects of survey methodology.

References

1. Polit DF, Beck CT. Nursing research: principles and methods. 7th ed. Philadelphia: Lippincott, Williams and Wilkins; 2004.
2. Kish L. Survey sampling. New York: Wiley; 1995.
3. Groves RM, Fowler FJ, Couper MP, Lepkowski JM, Singer E. Survey methodology. New York: Wiley; 2004.
4. Cochran WG. Sampling techniques. 3rd ed. New York: Wiley; 1977.
5. Derogatis LR. BSI: Brief Symptom Inventory: administration, scoring and procedures manual. Minneapolis: National Computer Systems; 1993.
6. Ware JE, Snow KK, Kosinski M, Gandek B. SF-36 health survey: manual and interpretation guide. Lincoln, RI: QualityMetric, Inc.; 2000.
7. Skevington SM, Bradshaw J, Saxena S. Selecting national items for the WHOQOL: conceptual and psychometric considerations. Soc Sci Med. 1999;48:473-487.
8. Rector TS, Cohn JN. Assessment of patient outcome with the Minnesota Living with Heart Failure Questionnaire: reliability and validity during a randomized, double-blind, placebo-controlled trial of pimobendan. Pimobendan Multicenter Research Group. Am Heart J. 1992;124:1017-25.
9. Morisky DE, Green LW, Levine DM. Concurrent and predictive validity of a self-reported measure of medicine adherence. Med Care. 1986;24:67-72.
10. Strauss AL, Corbin JM. Basics of qualitative research: techniques and procedures for developing grounded theory. 2nd ed. Thousand Oaks: Sage; 1998.
11. Cohen MZ, Ley C, Tarzian AJ. Isolation in blood and marrow transplantation. West J Nurs Res. 2001;25:37-48.
12. Fadiman A. The spirit catches you and you fall down: a Hmong child, her American doctors, and the collision of two cultures. New York: Farrar, Straus and Giroux; 1998.
13. Drever E. Using semi-structured interviews in small-scale research, a teacher's guide. ERIC. Edinburgh: SCRE; 1995.
14. Benson J, Britten N. Respecting the autonomy of cancer patients when talking with their families: qualitative analysis of semistructured interviews with patients. BMJ. 1996;313:729-731.
15. O'Dea JA. Consumption of nutritional supplements among adolescents: usage and perceived benefits. Health Educ Res. 2003;18:98-107.

16. Allan G. A note on interviewing spouses together. J Marriage Fam. 1980;42:205-210.
17. Kalischuk RG, Davies B. A theory of healing in the aftermath of youth suicide. J Holist Nurs. 2001;19:163-186.
18. Dolezal C, Mellins C, Brackis-Cott E, Abrams EJ. The reliability of reports of medical adherence from children with HIV and their adult care givers. J Pediatr Psychol. 2003;28:355-361.
19. Dym B, Berman S. The primary health care team: family physician and family therapist in joint practice. Fam Syst Med. 1986;4:9-21.
20. Morrison-Beedy D, Côté-Arsenault D, Feinstein NF. Maximizing results with focus groups: moderator and analysis issues. Appl Nurs Res. 2001;14:48-53.
21. Quatromoni PA, Milbauer M, Posner BM, Carballeira NP, Brunt M, Chipkin SR. Use of focus groups to explore nutrition practices and health beliefs of urban Caribbean Latinos with diabetes. Diabetes Care. 1994;17:869-873.
22. Hicks LK, Lin Y, Robertson DW, Robinson DL, Woodrow SI. Understanding the clinical dilemmas that shape medical students' ethical development: questionnaire survey and focus group study. BMJ. 2001;322:709-710.
23. Flanagan JC. The critical incident technique. Psychol Bull. 1954;51:327-358.
24. Coté CJ, Notterman DA, Karl HW, Weinberg JA, McClosky C. Adverse sedation events in pediatrics: a critical incident analysis of contributing factors. Pediatrics. 2000;105:805-14.
25. Branch W, Pels RJ, Arky R. Becoming a doctor. Critical-incident reports from third-year medical students. N Engl J Med. 1993;329:1130-1132.
26. Allery LA, Owen PA, Robling MR. Why general practitioners and consultants change their clinical practice: a critical incident study. BMJ. 1997;314:870-874.
27. Faithfull S. The diary method for nursing research. Eur J Cancer Care. 2007;1:13-18.
28. Bruijnzeels NA, Foets M, van der Wooden JC, Prins A, van den Houvel WJ. Measuring morbidity of children in the community: a comparison of interview and diary data. Int J Epidemiol. 1998;27:96-100.
29. White MM, Howie-Esquivel J, Caldwell MA. Improving heart failure symptom recognition: a diary analysis. Cardiovasc Nurs. 2010;25:7-12.
30. Woodfield R, Goodyear-Smith F, Arroll B. N-of-1 trials of quinine efficacy in skeletal muscle cramps of the leg. Br J Gen Pract. 2005;55(512):181-185.
31. Aitken L, Mardegan KJ. "Thinking aloud": data collection in the natural setting. Western J Nurs Res. 2000;22:841-853.
32. Fonetyn M, Fisher A. Use of think aloud method to study nurses' reasoning and decision making in clinical practice settings. J Neurosci Nurs. 1995;27:124-128.
33. Ericsson K, Simon H. Protocol analysis: verbal reports as data. London: MIT Press; 1993.
34. Murphy LL, Spies RA, Plake BS, editors. Tests in print VII. Lincoln: Buros Institute of Mental Measurements; 2006.
35. Geisinger KF, Spies RA, Carlson JF, Plake BS, editors. The seventeenth mental measurements yearbook. Lincoln: Buros Institute of Mental Measurements; 2007.
36. Goldman BA, Mitchell DF, Egelson PE, editors. Directory of unpublished experimental mental measures. Washington, DC: American Psychological Association; 2007.
37. Bieri D, Reeve R, Champion GD, Addicoat L, Ziegler JB. The Faces Pain Scale for the self-assessment of the severity of pain experienced by children: development, initial validation and preliminary investigation for ratio scale properties. Pain. 1990;41:139-150.
38. Choi BCK, Pak AWP. A catalog of biases in questionnaires. Prev Chronic Dis. 2005;2:1-13.
39. Fowler FJ. Improving survey questions. Thousand Oaks: Sage; 1995.
40. DeVellis RF. Scale development: theory and applications. Newbury Park: Sage; 1991.
41. Chang AM, Chau JPC, Holroyd E. Translation of questionnaires and issues of equivalence. J Adv Nurs. 2010;29:316-322.
42. Healey B, Gendall P. Asking the age question in mail and online surveys. Austral and New Zeal Marketing Acad (ANZMAC) Conference 2007. Dunedin; 2007.
43. Heise DR. The semantic differential and attitude research. In: Summers GF, editor. Attitude measurement. Chicago: Rand McNally; 1970.
44. Aiken LR. Rating scales and checklists. New York: Wiley; 1996.
45. DeLoach LJ, Higgins MS, Caplan AB, Stiff JL. The visual analog scale in the immediate postoperative period: intrasubject variability and correlation with a numeric scale. Anesth Analg. 1998;86:102-106.
46. Brealey SD, Atwell C, Bryan S, Coulton S, Cox H, Cross B, Fylan F, Garratt A, Gilbert FG, Gillan MGC, Hendry M, Hood K, Houston H, King D, Morton V, Orchard J, Robling M, Russell IT, Torgerson D, Wadsworth V, Wilkinson C. Improving response rates using a monetary incentive for patient completion of questionnaires: an observational study. BMC Med Res Methodol. 2007;7:12-16.
47. Asch D, Jedrziewski MK, Christakis N. Response rates to mail surveys published in medical journals. J Clin Epidemiol. 1997;50:1129-1136.
48. Diment K, Garrett-Jones S. How demographic characteristics affect mode preference in a postal/web mixed survey of Australian researchers. Soc Sci Comput Rev. 2007;25:410-417.
49. Shih TH. Comparing response rates from web and mail surveys: a meta-analysis. Field Methods. 2008;20:249-271.
50. O'Toole J, Sinclair M, Leder K. Maximising response rates in household telephone surveys. BMC Med Res Methodol. 2008;8:71.
174 P.L. Flom et al.

51. Tourangeau R, Smith TW. Asking sensitive ques- 68. Feldman AB, Haley SM, Coryell J. Concurrent and
tions: the impact of data collection mode, question construct validity of the pediatric evaluation of dis-
format and question context. Public Opin Q. 1996;60: ability inventory. Phys Ther. 1990;70:602610.
275304. 69. Lin JM, Brimmer DJ, Maloney EM, Nyarko E,
52. Turner CF, Al-Tayyib AA, Rogers SM, Eggleston BeLue R, Reeves WC. Further validation of the
MA, Villarroel MA, Roman AM, Chromy JR, Cooley Multidimensional Fatigue Inventory in a US adult
PC. Improving epidemiological surveys of sexual population sample. Popul Health Metr. 2009; 7:18
behavior conducted by telephone. Int J Epidemiol. doi:10.1186/1478-7954-7-18.
2009;38:11181127. 70. McHorney CA, Ware Jr JE, Raczek AE. The MOS
53. Couper MP, Nicholls II WL. The history and 36-item Short-Form Health Survey (SF-36): II.
development of computer assisted survey informa- Psychometric and clinical tests of validity in mea-
tion collection methods. In: Couper MP et al., edi- suring physical and mental health constructs. Med
tors. Computer assisted survey information Care. 1993;31:247263.
collection. New York: Wiley; 1998. 71. Management Sciences for Health. Creating a climate
54. Vataja R, Pohjasvaara T, Leppvuori A, Mntyl R, that motivates staff and improves performance. The
Aronen HJ, Salonen O, Kaste M, Erkinjuntti T. Manager. 2003;11:122.
Magnetic resonance imaging correlates of depres- 72. Tennant R, Hiller L, Fishwick R, Platt S, Joseph S,
sion after ischemic stroke. Arch Gen Psychiatry. Parkinson J, Secker J, Stewart-Brown S. The
2001;58:92531. Warwick-Edinburgh Mental Well-Being Scale
55. Schackman BR, Dastur Z, Rubin DS, Berger J, (WEMWBS): development and UK validation.
Camhi E, Netherland J, Ni Q, Finkelstein R. Health and Quality of Life Outcomes 2007;
Feasibility of using audio computer-assisted self- 5:63doi:10.1186/1477-7525-5-63.
interview (ACASI) screening in routine HIV care. 73. Cooper SM, Baker JS, Tong RJ, Roberts E, Hanford
AIDS Care. 2009;21:992999. M. The repeatability and criterion related validity of
56. Oetting ER, Beauvais F. Adolescent drug use. the 20 m Multistage Fitness Test as a predictor of
J Consult Clin Psychol. 1990;58:385394. maximal oxygen uptake in active young men. Br J
57. Fidler DS, Kleinknec RE. Randomized response Sports Med. 2005;39:e19.
versus direct questioning: two data-collection meth- 74. Reuben DB, Siu AL, Kimpau S. The predictive
ods for sensitive information. Psychol Bull. 1977;84: validity of self-report and performance-based mea-
10451049. sures of function and health. J Gerontol. 1991;47:
58. Lensvelt-Mulders GJLM, Hox JJ, van der Heijden M106M110.
PGM, Maas CJM. Meta-analysis of randomized 75. Heather N, Rollnick S, Bell A. Predictive validity of
response research. Sociol Method Res. 2005;33: the readiness to change questionnaire. Addiction.
319348. 1993;88:16671677.
59. Edwards P, Roberts I, Clarke M, DiGuisseppi C, 76. Portney LG, Watkins MP. Foundations of clinical
Pratap S, Wentz R, Kwan I. Increasing response rates research. Applications to practice. Upper Saddle
to postal questionnaires. BMJ. 2002;324:118391. River: Prentice Hall Health; 2000.
60. Brennan M, Charbonnau J. Improving mail survey 77. Guyatt G, Walter S, Norman G. Measuring change
response rate using chocolate and replacement ques- over time: assessing the usefulness of evaluative
tionnaires. Public Opin Q. 2009;73:368378. instruments. J Chronic Dis. 1987;40:171178.
61. Merriam-Webster Online. Available at http:// 78. Hays RD, Hadorn D. Responsiveness to change: an Accessed 27 July 2010. aspect of validity, not a separate dimension. Qual
62. Waltz CF, Strickland OL, Lenz ER. Measurement in Life Res. 1992;1:7375.
nursing and research. New York: Springer Publishing 79. Beaton DE, Bombadier C, Katz JN, Wright JG. A
Inc; 2005. taxonomy for responsiveness. J Clin Epidemiol.
63. Cook DA, Beckman TJ. Current concepts in validity 2001;54:12041217.
and reliability for psychometric instruments: theory 80. Husted JA, Cook RJ, Farewell VT, Gladman DD.
and application. Am J Med. 2006;119(2):166. Methods for assessing responsiveness: a critical
e7166.e16. review and recommendations. J Clin Epidemiol.
64. Litwin MS. How to measure survey reliability and 2000;53:459468.
validity. Thousand Oaks: Sage; 1995. 81. Roach KE. Measurement of health outcomes: reli-
65. Beck AT, Steer R, Brown GK. Manual for the Beck ability, validity and responsiveness. JPO. 2006;
Depression Inventory-II. San Antonio: Psychological 18:812.
Corporation; 1996. 82. Liang MH, Fossel AH, Larson MG. Comparison of
66. Williams RA. Womens health content validity of ve health status instruments for orthopedic evalua-
the family medicine in-training exam. Fam Med. tion. Med Care. 1990;28:632642.
2007;39:572577. 83. Angst F, Verra ML, Lehmann S, Aeschlimann A.
67. Ware JE, Sherbourne CD. The MOS 36 item short Responsiveness of ve condition-specic and
form health survey. Med Care. 1992;30:473483. generic outcome assessment instruments for chronic
8 Constructing and Evaluating Self-Report Measures 175

pain. BMC Med Res Methodol 2008;8:26 (published 94. Roeckelein J. Elseviers dictionary of psychological
online 2008 April 25 doi:10.1186/1471-2288-8-26). theories. Amsterdam: Elsevier BV; 2006.
84. Tavris C, Aronson E. Mistakes were made, but not 95. Feldt LS, Brennan RL. Reliability. In: Linn RL,
by me. Orlando: Harcourt Books; 2008. editor. Educational measurement. 3rd ed. New York:
85. Crowne DP, Marlowe D. A new scale of social desir- American Council on Education and Macmillan;
ability independent of psychopathology. J Consult 1989.
Psychol. 1960;24:349354. 96. Downing SM. Validity: on the meaningful interpre-
86. Strahan R, Kerbasi K. Short homogenous version of tation of assessment data. Med Educ. 2003;37:
the Marlowe-Crowne Social Desirability Scale. J 830837.
Cin Psychol. 1972;28:191193. 97. Landis JR, Koch GG. The measurement of observer
87. Furnham A, Henderson M. The good, the bad and agreement for categorical data. Biometrics. 1977;33:
the mad: Response bias in self-report measures. Pers 159174.
Indiv Differ. 1982;3:311320. 98. Shrout PE, Fleiss JL. Intraclass correlations: uses in
88. Leary MR, Kowalski RM. Impression management: assessing rater reliability. Psychol Bull. 1979;86:
a literature review and two-component model. 420428.
Psychol Bull. 1990;107:3447. 99. McDowell I, Newell C. Measuring health. A guide to
89. Lenski GE, Leggett JC. Caste, class, and deference rating scales and questionnaires. 2nd ed. New York:
in the research interview. Am J Sociol. 1960;65: Oxford University Press; 1996.
463467. 100. Kuder GF, Richardson MW. The theory of the esti-
90. Krosnick JA, Alwin DF. An evaluation of cognitive mation of test reliability. Psychometrika. 1937;2:
theory of response order effects in survey measure- 15160.
ment. Public Opin Q. 1987;51:201219. 101. Cronbach LJ. Coefcient alpha and the internal
91. Toner B. Impact of agreement bias on the rating of structure of tests. Psychometrika. 1951;16:297334.
questionnaire response. J Soc Psychol. 1987;127: 102. George D, Mallery P. SPSS for Windows step by
221222. step. Boston: Allyn & Bacon; 2003.
92. Nelson NW, Parsons TD, Grote CL, Smith CA, 103. Faugier J, Sargeant M. Sampling hard to reach popu-
Sisung II JR. The MMPI-2 Fake Bad Scale: concor- lations. J Adv Nurs. 1997;26:790797.
dance and specicity of true and estimate scores. 104. Code of Federal Regulations, Title 45 Public wel-
J Clin Exp Neuropsychol. 2006;28:112. fare, department of Health and Human Services,
93. Thorndike EL. A constant error in psychological rat- Revised 15 Jan 2009, (Effective 14 July 2009).
ing. J Appl Psychol. 1920;4:2529. 105. Nunnally JC. Psychometric theory. New York:
McGraw-Hill; 1978.
9 Selecting and Evaluating Secondary Data: The Role of Systematic
Reviews and Meta-analysis

Lorenzo Paladino and Richard H. Sinert

Sorting through the body of available literature is a daunting task.
MEDLINE, only one of many databases, indexed 902,346 articles in 2010. This
number reflects a continuing increase over 2009 (854,506) and 2008
(821,834). How can clinicians have any chance of keeping up with the
literature or use it for guiding research or for formulating clinical
practice decisions if their primary sources are restricted to individual
studies? The answer is that it is difficult, if not increasingly
impractical. Reliance on individual studies is further complicated when
current beliefs and standards of practice are challenged by new studies.
For clinicians to make informed decisions, they must analyze multiple
studies for both their quality and relevance to the patient population of
interest. This is a principal reason for the long lag time before clinical
research is incorporated into standard practice. A representative example
is the 20-year delay between initial reports suggesting the utility of
thrombolytic therapy for myocardial infarctions in the late 1970s and its
adoption in the 1990s [1]. For these reasons, secondary sources such as
narrative reviews, systematic reviews, and meta-analyses are an important
means for physicians to translate clinical research into standard practice
and help reconcile conflicting studies in the literature.

Difference Between a Narrative Review, Systematic Review,
and Meta-analysis

A narrative review (sometimes termed a traditional literature review) is a
summary of primary published studies in which conclusions are drawn by the
reviewer, guided by his or her own interpretations of the studies, rather
than by external criteria. Narrative reviews are well suited for general
topics or broad coverage of a field as they usually cover a wide range of
issues within a given topic [2], e.g., "Update on Multiple Sclerosis."
Typically, they are written by experts in the specific field of study
rather than by experts on research methodology. As such, narrative reviews
do not necessarily explicitly state or follow the rules of evidence-based
search strategies (including selection criteria for articles and abstracts
found) or assess the quality or validity of the included studies. This
deficit leads to lack of transparency and reproducibility and is likely to
reflect a biased selection of the total evidence available (selection
bias). A common bias in narrative reviews is failure to include research
that conflicts with the beliefs or opinions of the expert. Nonetheless, the
majority of published reviews are narrative rather than systematic.

L. Paladino, MD, R.H. Sinert, DO
Department of Emergency Medicine, SUNY Downstate Medical Center,
450 Clarkson Avenue, 1228, Brooklyn, NY 11203, USA

P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators,
DOI 10.1007/978-1-4614-3360-6_9, © Phyllis G. Supino and Jeffrey S. Borer 2012

In contrast, systematic reviews (in medicine, written most commonly about
treatment or diagnostic research) focus on a specific question within a
topic (e.g., "Are steroids effective in controlling flares of multiple
sclerosis?" "Does positron emission tomography have strong positive
predictive value for breast cancer?"), rendering them amenable to an
explicit search strategy. This characteristic makes them excellent tools to
explore clinically relevant topics. Systematic reviews identify the
databases searched and, thus, present clear and reproducible search
strategies. A comprehensive literature search is conducted, and all
identified studies are assessed for relevance and methodology. Selection is
based on predefined inclusion and exclusion criteria, quality is assessed,
and data are abstracted in a standardized format. By explicitly stating how
the evidence was found, how it was appraised or validated, and which
studies were excluded (and why), systematic reviews eliminate many of the
biases inherent in narrative reviews.

A meta-analysis (sometimes termed a quantitative review) often, but not
always, is included as a component of a systematic review. First used for
medicine in 1904 by renowned statistician Karl Pearson to examine the
preventive effect of serum inoculations against enteric fever [3] and later
formalized by contemporary statistician and educational researcher, Gene V.
Glass (who coined the term in 1976) [4], meta-analysis currently is
employed in many disciplines as a statistical methodology to combine the
results of several studies about a topic as if they were from one large
study. In studies of treatment (the most common focus of meta-analysis in
clinical medicine), its principal purposes are to enable detection of
overall and subgroup effects (as statistical power may be suboptimal due to
limitations in sample size of individual trials), to improve estimates of
the magnitude of these effects, and to aid in the resolution of uncertainty
due to inconsistent findings (i.e., interstudy differences) [5]. The
studies included in a meta-analysis should be found using the same rigorous
search methodology as that used for systematic reviews.

Well-constructed systematic reviews and meta-analyses have many of the
characteristics of effective research described by Tuckman [6] and reviewed
in Chap. 1. They are systematic because information gathering is done in a
structured and rigorous way and the data contained within them are
interpreted. They are logical in that their methodologies employ tools for
assessing the studies' bias (internal validity) and procedures to discern
the effects of varying populations on study results (external validity).
They are replicable both because they demonstrate whether the results of
individual studies are congruent and also because the methodology employed
in the review, if properly performed and reported, is sufficiently explicit
to permit reproduction. They are transmittable because, by digesting
available information and coming to a conclusion, they effectively
summarize what is currently known on a specific topic and, when published,
enable clinicians to learn about the conclusions of research. In addition,
meta-analyses, specifically, gather, compare, and pool the empirical
products (data) of the studies collected and are reductive to a clinical
conclusion. As noted above, meta-analyses increase sample size by pooling
the subjects of smaller studies when appropriate. This larger N increases
the generalizability of the results. When the results cannot be pooled,
they often shed light on reasons why results may not be generalizable.

Searching for a Systematic Review or Meta-analysis

Almost all of the databases described in Chap. 2 can be used to search for
meta-analyses and systematic reviews. The "Clinical Queries" link on the
PubMed interface for MEDLINE can be used to apply search filters to focus
on systematic reviews [7]. A variety of databases also are available that
specialize in systematic reviews and meta-analyses. The Cochrane Library
(www., developed under the auspices of the Cochrane Collaboration (an
international network dedicated to promoting well-informed health-care
decision-making), maintains an online collection of systematic reviews on
intervention and treatment. The Database of

Promoting Health Effectiveness Reviews (DoPHER) is a registry of systematic
and nonsystematic reviews of public health. BestBETs (www., ACP Journal
Club (, and the TRIP Database ( index.html) are other sources of systematic
reviews for clinical questions. The Database of Abstracts of Reviews of
Effectiveness (DARE) ( contains abstracts of systematic reviews that have
been assessed for their quality.

Steps in Writing a Systematic Review

There are several steps in writing a systematic review. Below is a brief
outline that may serve as an overview (discussion of these steps is
provided below):
1. Formulate the question.
2. Define the literature searching strategy.
3. Select the studies to be included.
4. Summarize results across studies.
5. Assess heterogeneity.
6. Consider appropriateness of pooling results for meta-analysis.

Formulating the Question

As is true for all well-designed primary studies, the first step in
conducting a systematic review is definition of a clear searchable
question. The importance of this initial step often is underestimated,
leading to frustrating and unsuccessful searches. The process is best
guided by the often-used four-part PICO method, originally defined by the
McMaster University Centre for Evidence-Based Medicine (Hamilton, Ontario)
1992 recommendations for asking focused clinical questions [8]. The PICO
method can help translate a question into terms that will allow whichever
search engine is selected to retrieve the most appropriate literature. Its
components are described below:

P denotes the patient population or problem. The reviewer needs to
carefully define the population from among many available options. What is
the age group of interest? Should the search be focused on males or females
only? For which specific diseases is information sought (e.g., diabetes or
acute myocardial infarction or acute myocardial infarctions in diabetics)?
An overly broad search typically will yield an excessive quantity of
information, whereas an overly narrow search (e.g., females 30-35 years of
age) will result in too few or no results.

I denotes the intervention. In the setting of clinical medicine, the term
intervention commonly is considered to be therapy (e.g., medical or
surgical treatment or a risk-reduction initiative such as a
smoking-cessation or weight-reduction program). However, this component of
the PICO is somewhat of a misnomer as it also can pertain to diagnostic
testing. When the PICO method is applied to analyze questions about
progression of disease, the intervention (more appropriately termed "factor
of interest" as it is not purposively applied) would be presence of a
prognostic factor such as age, gender, morbidity, lifestyle, or family.

C denotes the comparator, that is, to what the intervention in question
will be compared. A clinician might argue, "I don't want to compare two
drugs, I just want to know if giving aspirin is beneficial to my patients?"
This question, however, by its very nature, must include a comparator, that
is, giving aspirin versus giving nothing, in which case the target of the
search likely will include studies that involve administering a placebo as
a comparison. In diagnostic questions such as "Is an ultrasound a good
study for detection of common bile duct stones?" the comparator optimally
is the best available or "gold standard" test (i.e., endoscopic retrograde
cholangiopancreatography [ERCP]). (In questions about prevention and
prognosis, the optimal comparators are, respectively, absence of a
preventive initiative or a given prognostic factor or factors.)

The O (outcome) denotes the component that often spurs the research
question. For example, will this therapy decrease morbidity or mortality?
This element of the PICO typically requires refinement (consider the

Fig. 9.1 MeSH for mortality on PubMed. Available at

concept of mortality reduction: what period of time is clinically
meaningful? 30 days? 6 weeks? 6 months? 1 year?).

Defining the Literature Search Strategy: Keywords, MeSH, and Boolean
Operators

An organized literature search will increase the likelihood of finding
answers to the question of interest. The PICO question described above can
be subdivided into its four components for entry into the database's search
engine. We recommend that the reviewer search broadly at first and then
search more narrowly ("cone down"). The more limited the initial search,
the more likely it will miss relevant articles. Each component of the
question should be searched by keywords, probable synonyms, and, if using
PubMed, its MeSH (medical subject headings) terms (also called
descriptors). MeSH is the US National Library of Medicine's (NLM)
controlled vocabulary thesaurus that is used for indexing articles; it is
hierarchically arrayed to facilitate searching at varying levels of
specificity [9]. Use of all of these tools invariably will yield a more
inclusive search.

Consider the example: "Does drawing blood cultures (intervention) change
mortality (outcome) in adult patients with pneumonia (population)?" (The
comparison implied by the question is not drawing blood cultures.) In some
literature, blood culture may be classified as "microbiological culture,"
"microbial culture," or "microbial testing"; pneumonia as "lung infection"
or "respiratory infection"; and mortality as "death" or "survival." MeSH
terms can help expand the search by including many or all of these synonyms
under one umbrella (Fig. 9.1). However, they should not be solely relied
upon because inclusion or exclusion of an item under a specific MeSH is
determined subjectively by those performing the NLM indexing.

During the search, the selected terms are connected by the Boolean
operators "AND," "OR," and "NOT" (see Venn Diagram,

Fig. 9.2). The meanings of these operators are self-explanatory; however,
the implications of adding them to a search deserve outlining. The "OR"
operator expands the search to include any of the selected terms, whereas
"AND" limits it to those that contain all selected terms. To start a search
broadly, the keywords in the query should be connected by the OR operator
(e.g., mortality OR survival). This strategy provides the sum of all words
as if they were searched individually. By adding "AND pneumonia," the
search will yield articles only about both mortality (OR survival) and
pneumonia. This concept is illustrated by the Venn diagram given in
Fig. 9.3.

Fig. 9.2 Boolean terms "OR," "AND," and "NOT"

Fig. 9.3 (Mortality OR survival) AND pneumonia

Though, as noted, the Boolean "NOT" operator is available, to optimize
inclusiveness, it is better to search positively (i.e., to join desired
concepts) rather than to search by exclusion.

An inclusive search should not miss any relevant information.
Unfortunately, the literature is not centralized, and many databases (e.g.,
MEDLINE, EMBASE, and others listed in Chap. 2) must be queried to assure a
complete search. The bibliographies of relevant papers should be checked
for articles missed by the initial search, a methodology often referred to
as "snowballing." Repeating this process on the additional papers can lead
to greater retrieval. Citation searches using the Web of Science or
SciVerse-Scopus also may yield additional papers. New keywords found on
these papers can be added to augment the original search terms. Consulting
a research librarian to perform expert searches also should be done for
completeness. Unpublished studies can be found by searching clinical trials
registries and by contacting experts and individual authors in the field.
The Cochrane Library maintains a registry of controlled clinical trials,
Cochrane Library Cochrane Central Register of Controlled Trials (CENTRAL)
as does These important steps help to prevent the reviewer from missing
relevant yet unpublished research, common with negative studies (see below:
"Detecting Publication Bias").

Selecting Articles

Having formed the search question, the next step in constructing the
systematic review is consideration of the types of literature available to
answer the question. Selection should be based on several key factors, the
most important of which are listed below.

Levels of Evidence

Medline and other databases contain literature that is very heterogeneous
with regard to the strength of evidence provided. The varying types of
studies contained within the literature are represented here as a pyramid
(Fig. 9.4), with the weakest evidence for answering clinical

Fig. 9.4 Pyramid of evidence

questions shown at the bottom and the strongest evidence shown at the top.
Bias decreases as we move up the pyramid, in direct contrast to the amount
of literature available on a given topic.

In vitro and animal studies, although important for hypothesis generation,
cannot be applied directly for clinical care or provide a direct answer to
a clinical research question, as can case reports, case series,
case-control studies, cohort studies, and randomized controlled clinical
trials (RCTs). As noted in previous chapters, a case report describes the
presentation and/or treatment of an individual patient, whereas a case
series consists of a collection of reports on several individual patients.
Because they do not have control groups with which to compare outcomes,
neither has validity for drawing conclusions about associations or cause
and effect. Case-control studies are always retrospective studies in which
subjects who already have a specific condition are compared with those who
do not. These studies are well suited to test associations between risks or
toxic exposures and diseases, especially when the latter are relatively
rare. Data collection typically is based on the medical record and/or
patient recall. Though case-control studies provide stronger evidence for
association than case reports or case series, caution must be exercised in
interpretation of results because demonstration of a statistical
relationship does not provide proof of causality. Cohort studies follow
individuals with specific risk factors or exposures over time and compare
them with comparable individuals who do not have the risk factor or
exposure being studied to evaluate differences in outcomes. Though cohort
studies (particularly those that are prospective in nature) provide better
evidence of association than case-control studies, they (like case-control
studies) are observational and, as such, are subject to more bias than
studies in which an intervention has been purposely applied; their greatest
utility in clinical epidemiology is for defining prognosis of a disease.
Quasi-experimental studies contain some of the elements of true experiments
(parallel control groups and/or repeated assessments) but (as noted in
Chap. 5), due to lack of random allocation to treatment group, are not
fully protected from all threats to internal validity. In contrast,
randomized controlled clinical trials (RCTs) study the effects of a

Table 9.1 Criteria for calculating the Jadad score (Reprinted with permission from Jadad et al. [12])

Criteria                                                              Yes (1 point)   No (0 points)
1. Was the study described as randomized?
2. Was the randomization process described, and was it appropriate?
3. Was the study described as double blind?
4. Was the method for double blinding appropriate?
5. Were the withdrawals and dropouts of the study enumerated?

Score 0-2: Low-quality study
Score 3-5: High-quality study
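
The tally in Table 9.1 can be sketched in a few lines of Python. This is only an illustration of the table as printed here (one point per "yes," with 0-2 read as low quality and 3-5 as high quality); the class and function names are hypothetical and are not taken from Jadad et al. [12].

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class JadadAnswers:
    """Yes/no answers to the five questions in Table 9.1 (True = yes)."""
    described_as_randomized: bool
    randomization_described_and_appropriate: bool
    described_as_double_blind: bool
    double_blinding_appropriate: bool
    withdrawals_and_dropouts_enumerated: bool


def jadad_score(answers: JadadAnswers) -> int:
    """Award 1 point per 'yes' and 0 points per 'no', as in Table 9.1."""
    return sum(1 for answer in astuple(answers) if answer)


def quality_label(score: int) -> str:
    """Map a 0-5 score to the table's bands: 0-2 low, 3-5 high."""
    if not 0 <= score <= 5:
        raise ValueError("score must be between 0 and 5")
    return "high-quality" if score >= 3 else "low-quality"


# Hypothetical trial: randomized with an appropriate method described,
# not double blind, withdrawals and dropouts reported.
example = JadadAnswers(True, True, False, False, True)
example_score = jadad_score(example)
print(example_score, quality_label(example_score))  # 3 high-quality
```

Recording the five answers separately, rather than only the total, lets two reviewers compare their item-level judgments when their totals disagree.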

purposively applied therapy by comparing an intervention group and control
group to which subjects have been randomly allocated. They also incorporate
additional methodologies such as blinding ("masking") and analysis by
intention-to-treat that reduce the potential for a variety of threats to
internal validity, though they may suffer from limitations in
generalizability (external validity). In theory, as syntheses of prior
research, systematic reviews and meta-analyses, though relatively few in
number, are at the top of the pyramid, providing the strongest evidence for
associations or cause-and-effect relationships. However, for this to be
true, both must meet stringent methodological quality criteria (described
below) and the elements of the meta-analysis (i.e., the included studies),
specifically, must have sufficiently similar study design characteristics
to permit pooling of results, a criterion that is not always met in
practice. When it does not, a meta-analysis, if performed, will be more
useful for hypothesis generation than for hypothesis testing [10].

Standardizing Selection of Articles

The list of abstracts generated from the PICO search query is next screened
for selection of relevant articles. Although inclusion criteria (e.g.,
nature of the patient population, specific outcomes and summary measures)
optimally are predefined, the process is not immune to subjectivity and
bias. The list should be screened independently by two reviewers to
minimize subjectivity. Any discrepancies should be compared and discussed
to reach a consensus. The reviewers' interrater reliability should be
measured and reported. The predefined inclusion and exclusion criteria
should be reported in the methods section and the search strategy in the
appendix, to facilitate replication of results.

Assessing the Quality of Primary Studies

Assessment of bias in the methodology of the individual studies is a core
component of a systematic review; therefore, tools for appraising the
quality of the individual studies should be integrated. Unfortunately, no
gold standard exists to evaluate the methodology of therapeutic trials or
assessments of diagnostic test performance even though their quality and
methods for synthesis are thought by some to be superior to that of other
forms of clinical research (e.g., prognostic studies) [11]. Consensus and
working groups continually reevaluate and improve upon assessment tools;
thus, the preferred methods or systems change over time. Below is a listing
and brief discussion of a cross section of tools for detecting bias in
these types of studies. We present these to introduce the topic rather than
to advocate a specific scoring system. (For the author of a primary study,
they can be used as a checklist to ensure a sufficiently comprehensive
methods section.)

Therapeutic Testing Articles Appraisal

A variety of assessment tools for therapeutic articles exist such as the
Jadad scale [12], shown in Table 9.1. Common to all is evaluation of key
areas prone to bias. Inclusion and exclusion criteria should be reviewed to
decide whether the patients included in the identified study meet the
requirements of the "P" of the PICO. As indicated earlier, the highest
quality studies optimally

will use randomized treatment assignment with concealment of allocation, double blinding, and intention-to-treat analyses. Follow-up should be complete and transparent. In addition, readers should look for an explanation as to why participants may have dropped out of an investigation, as differential attrition from a study may impact conclusions regarding the effectiveness of the investigational treatment (e.g., if the sickest patients dropped out of the treatment arm receiving an investigational new drug, the drug might appear to be more effective than it is).
Studies about treatment, optimally, will express the impact of therapy quantitatively as the number needed to treat (NNT) or the number needed to harm (NNH). The NNT is the number of patients who need to be given the intervention for one patient to benefit, thus expressing the effectiveness of an intervention in a clinically meaningful manner. It is calculated as the reciprocal of the difference in outcome rates between the intervention and control groups (the absolute risk reduction, expressed as a proportion) derived from a therapeutic trial. The closer the NNT is to 1, the greater the efficacy of the intervention; the further from 1, the lesser its efficacy. As an example, in the landmark study ISIS-2 [13], the efficacy of (1) a 1-h IV infusion of 1.5 MU streptokinase (SK), (2) 160 mg of enteric-coated aspirin (ASA) taken daily for 1 month, and (3) both active agents versus placebo was evaluated through 35 days after a suspected acute myocardial infarction (AMI) among 17,187 patients. Analysis revealed that the absolute reductions in risk of vascular mortality associated with SK, ASA, and their combination versus placebo were 2.8%, 2.4%, and 5.2%, respectively, yielding NNTs of 36 (SK), 42 (ASA), and 19 (SK + ASA). These NNTs (not calculated in the original study) indicate that 36 patients would need to be treated with SK and 42 patients with ASA to prevent one vascular death, whereas the same result could be achieved with combination therapy in 19 patients.
A closely related parameter is the number needed to harm (NNH), calculated as the inverse of the absolute risk increase (again expressed as a proportion) and interpreted as the number of patients one would need to treat to expect an adverse outcome. The NNT must be weighed with the baseline risk, NNH, benefit magnitude, and/or cost to have comprehensive meaning to the clinician. It may be more acceptable in clinical practice to apply a treatment that is inexpensive, easy to use, and of almost no adverse risk but has a higher NNT than one that has a lower NNT but is expensive, dangerous, and has only a marginal clinical benefit. For example, while the NNT was relatively higher with aspirin than with SK in ISIS-2, there was no reported bleeding requiring transfusion or confirmed cerebral hemorrhage associated with aspirin (a very low cost, easy-to-manage intervention), whereas there was a very small (though statistically significant) excess occurrence of these events with SK (major bleeds: 0.5% vs. 0.2% with placebo, equivalent to an NNH of 333; cerebrovascular hemorrhage: 0.1% with SK vs. 0.0% with placebo, equivalent to an NNH of 1,000).

Diagnostic Testing Articles Appraisal
Diagnostic accuracy studies investigate how well the results from an index test (the test being evaluated) agree with the results of the reference standard. (As noted above, the reference standard or gold standard is considered the best available method to determine the presence or absence of a condition.) Diagnostic studies have unique design features that differ from those of therapeutic studies; therefore, different methods exist for detecting bias and variability.
The Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool [14] is one such method. The tool comprises 14 items, defined by expert consensus, that examine a variety of important biases and other methodological concerns specific to the evaluation of diagnostic tests (Table 9.2), though it does not address the issue of intra- or interobserver reliability. Responses are framed as binary yes/no questions or, if not enough information is supplied, unclear. The Cochrane Collaboration offers similar tools for assessing diagnostic studies [15].
In the past, calculations of the sensitivity, specificity, and predictive values of a diagnostic test were considered sufficient for evaluation of its utility. In this era, a high-quality diagnostic
9 Selecting and Evaluating Secondary Data 185

Table 9.2 The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews (Reproduced with permission from Whiting et al. [14])
Item Yes No Unclear
1. Was the spectrum of patients representative of the patients who will receive the test in practice? ( ) ( ) ( )
2. Were selection criteria clearly described? ( ) ( ) ( )
3. Is the reference standard likely to correctly classify the target condition? ( ) ( ) ( )
4. Is the time period between reference standard and index test short enough to be reasonably sure that the target condition did not change between the two tests? ( ) ( ) ( )
5. Did the whole sample, or a random selection of the sample, receive verification using a reference standard of diagnosis? ( ) ( ) ( )
6. Did patients receive the same reference standard regardless of the index test result? ( ) ( ) ( )
7. Was the reference standard independent of the index test (i.e., the index test did not form part of the reference standard)? ( ) ( ) ( )
8. Was the execution of the index test described in sufficient detail to permit replication of the test? ( ) ( ) ( )
9. Was the execution of the reference standard described in sufficient detail to permit its replication? ( ) ( ) ( )
10. Were the index test results interpreted without knowledge of the results of the reference standard? ( ) ( ) ( )
11. Were the reference standard results interpreted without knowledge of the results of the index test? ( ) ( ) ( )
12. Were the same clinical data available when test results were interpreted as would be available when the test is used in practice? ( ) ( ) ( )
13. Were uninterpretable/intermediate test results reported? ( ) ( ) ( )
14. Were withdrawals from the study explained? ( ) ( ) ( )
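
The NNT and NNH arithmetic described earlier in this section for ISIS-2 can be checked with a short calculation. What follows is a minimal sketch in Python and is not part of the original study or of any statistical package; the function names `nnt` and `nnh` are illustrative. Rounding to the nearest whole patient is an assumption chosen because it reproduces the figures quoted in the text (some authors round the NNT up instead).

```python
def nnt(absolute_risk_reduction: float) -> int:
    """Number needed to treat: reciprocal of the absolute risk reduction (a proportion)."""
    if absolute_risk_reduction <= 0:
        raise ValueError("absolute risk reduction must be positive")
    # Assumption: round to the nearest whole patient, matching the quoted values.
    return round(1 / absolute_risk_reduction)


def nnh(risk_treated: float, risk_control: float) -> int:
    """Number needed to harm: reciprocal of the absolute risk increase (a proportion)."""
    absolute_risk_increase = risk_treated - risk_control
    if absolute_risk_increase <= 0:
        raise ValueError("treatment does not increase the risk of this event")
    return round(1 / absolute_risk_increase)


# Absolute reductions in vascular mortality versus placebo, as quoted for ISIS-2.
isis2_arr = {"SK": 0.028, "ASA": 0.024, "SK + ASA": 0.052}
isis2_nnt = {arm: nnt(arr) for arm, arr in isis2_arr.items()}

# Adverse event rates quoted in the text for SK versus placebo.
nnh_major_bleed = nnh(0.005, 0.002)
nnh_cerebral_hemorrhage = nnh(0.001, 0.0)

print(isis2_nnt, nnh_major_bleed, nnh_cerebral_hemorrhage)
```

Placing the NNT beside the NNH in this way is what allows the clinician to weigh benefit against harm, as the text recommends.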

study also will define threshold values for the diagnostic test using receiver operator characteristic (ROC) curves, which are plots of the true positive rate (sensitivity) versus the false positive rate (1 - specificity) (Fig. 9.5). The area under the curve reflects the relationship between sensitivity and specificity for a given test. As a curve approaches the upper left-hand corner, the area under the curve approaches 1 (100% sensitivity and specificity). A random guess would generate a point along the diagonal bisecting the graph, also called the line of no discrimination. Points above the diagonal represent better results (greater diagnostic accuracy), while points below the line are poor (lower diagnostic accuracy). (For further discussion of the use of ROC curves for determination of diagnostic accuracy, the reader is referred to Chap. 11.)
Once thresholds for a positive and negative diagnostic test are defined by ROC curves, an evidence-based operating characteristic of the test can be defined by its likelihood ratios (LR). The LR is the probability that a given test result would be expected in a patient with the target disorder divided by the probability (P) that the same result would be expected in a patient without the target disorder. LRs can be calculated both for positive (LR+) and negative (LR-) test results, as shown below.

\[ \mathrm{LR}^{+} = \frac{\text{sensitivity}}{1 - \text{specificity}} = \frac{P(\text{Test}^{+} \mid \text{Disease}^{+})}{P(\text{Test}^{+} \mid \text{Disease}^{-})} \]

\[ \mathrm{LR}^{-} = \frac{1 - \text{sensitivity}}{\text{specificity}} = \frac{P(\text{Test}^{-} \mid \text{Disease}^{+})}{P(\text{Test}^{-} \mid \text{Disease}^{-})} \]

High LR+ values (LR+ > 10) significantly increase the probability of disease, and low LR- values (LR- < 0.1) significantly decrease the probability of disease. The extent to which the result of a diagnostic test changes the probability that the patient has a disease (posttest probability) can be estimated using a graphical tool known as the Fagan nomogram [16] by

Fig. 9.5 Receiver operator characteristic curve
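
The likelihood ratios and the posttest probability discussed in this section can also be computed directly. The sketch below is illustrative only: the function names and the example test characteristics (90% sensitivity, 95% specificity, 20% pretest probability) are hypothetical and do not come from the chapter. It relies on the standard relationship that posttest odds equal pretest odds multiplied by the LR, which is the calculation the Fagan nomogram performs graphically.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple:
    """Return (LR+, LR-) from sensitivity and specificity given as proportions."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


def posttest_probability(pretest_probability: float, likelihood_ratio: float) -> float:
    """Convert a pretest probability to a posttest probability by way of odds."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)


# Hypothetical test characteristics, chosen only for illustration.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.90, specificity=0.95)
p_after_positive = posttest_probability(0.20, lr_pos)
p_after_negative = posttest_probability(0.20, lr_neg)

print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
print(f"posttest probability after a positive result: {p_after_positive:.2f}")
print(f"posttest probability after a negative result: {p_after_negative:.3f}")
```

In this hypothetical case the LR+ exceeds 10, so a positive result moves a 20% pretest probability to roughly 80%.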

using a straight edge to draw a line from the pretest probability through the calculated LR (Fig. 9.6).

Summarizing the Results: The Role of Meta-analysis

As noted earlier, sometimes the size of an individual clinical trial may be too small to detect a treatment effect or to estimate its magnitude reliably. Meta-analysis is a method to increase the power of statistical analyses and the precision of estimates by pooling the results of related trials (i.e., those that address a similar hypothesis) to obtain a quantified synthesis. Not all systematic reviews lead to a meta-analysis. The trials may be so varied in their methodology, end points, or results that combining them may not be appropriate.
In a conventional meta-analysis (sometimes known as aggregate-level meta-analysis), a summary statistic (e.g., a risk ratio, a difference between outcome means) for the observed effect is abstracted or recalculated from each included study. (A less common approach, not reviewed in this chapter, combines original or patient-level data from prior studies; for an excellent discussion of the pros and cons of this method, known as Individual Patient Data [IPD] meta-analysis, the reader is referred to Stewart and Tierney (2002) [17].) Next, a pooled effect estimate is calculated as a weighted average by sample size of the intervention effects reported in the individual studies. By pooling results, the standard error of the weighted average effect size of the included studies and its associated confidence interval are reduced, typically affording greater statistical power to detect an effect than would be possible from any one constituent study. Narrowing of the confidence interval also increases the precision of the estimated population effect size [18]. In assigning weights for generating the pooled

estimate, the evaluator needs to consider whether it is more appropriate to use a fixed or a random effects model, as these make different assumptions about the nature of the included studies. A fixed effects model assumes that all of the studies contained within the meta-analysis have attempted to measure a single true effect value and that variation in observed effect sizes is due only to chance. An assumption underlying such a model is that all of the studies have been conducted under similar conditions with similar subjects, differing only in their power to detect the outcome of interest. (This rarely, if ever, is the case.) In contrast, a random effects model assumes that the true effect size can vary from study to study along a distribution due to differences in the nature of the populations, dosing, timing, and measurable differences other than sampling variability (see also assessment of heterogeneity below). Although more data are required for random effects models to achieve the same statistical power as with fixed effects models, the former represents a more conservative assumption. Unless the author of a meta-analysis has guidance from a statistician indicating that a fixed effects model is appropriate, a random effects model typically is preferable.

Fig. 9.6 The Fagan nomogram (Reproduced with permission from Fagan [16])

Most meta-analyses summarize their findings graphically using a forest plot [19]. A forest plot illustrates the relative effects of multiple studies addressing the same question or hypothesis. The studies are listed in the left-hand column, typically in chronological order. The measured effect for each of these studies is represented by a square, whose area is related to the relative sample size of the individual study. The effect may be an odds ratio, risk difference, or a correlation coefficient. The confidence intervals (CI) are represented by horizontal lines bisecting the square. The width of the CI is related to the power and variability of the study. The combined results of the meta-analysis usually are represented by a diamond, the width of which is the CI for the pooled data. A vertical line representing no effect is placed at 1 for ratios (odds or risks) or at 0 for differences and correlation coefficients. If the CI of an individual study or of the pooled data crosses this line, the null hypothesis cannot be rejected. Figure 9.7 illustrates a forest plot used in a meta-analysis of the effects of administration of beta blockade on in-hospital mortality rates among patients with acute coronary syndrome [20].

Assessment of Heterogeneity: Methods of Investigation

Heterogeneity in meta-analysis refers to the variation in outcomes among included studies. As noted above, a certain degree of variability should be expected when comparing multiple studies (hence, the rationale for suggesting the more conservative random effects model for pooling data). Clinical variability occurs when there are differences in the study population, interventions,

Fig. 9.7 The forest plot (Reproduced with permission from: Brandler et al. [20])
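
The distinction between fixed and random effects pooling that underlies a forest plot's summary diamond can be illustrated numerically. The following is a hedged sketch, not the method of any particular study or software package. The chapter describes weighting by sample size; this sketch instead assumes the widely used inverse-variance weights (which grow with sample size) and, for the random effects model, the DerSimonian-Laird estimate of between-study variance. Both choices, the function names, and the four-trial data set are assumptions made for illustration.

```python
import math


def pool_fixed(effects, variances):
    """Fixed effects pooled estimate using inverse-variance weights (assumed weighting)."""
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    standard_error = math.sqrt(1 / sum(weights))
    return pooled, standard_error


def pool_random(effects, variances):
    """Random effects pooled estimate with a DerSimonian-Laird tau^2 (assumed estimator)."""
    if len(effects) < 2:
        raise ValueError("at least two studies are needed")
    weights = [1 / v for v in variances]
    fixed, _ = pool_fixed(effects, variances)
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c)  # between-study variance, truncated at zero
    star_weights = [1 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(star_weights, effects)) / sum(star_weights)
    standard_error = math.sqrt(1 / sum(star_weights))
    return pooled, standard_error, tau2


# Hypothetical log odds ratios and their variances for four trials.
log_or = [-0.40, -0.10, -0.60, 0.05]
var = [0.04, 0.02, 0.09, 0.03]

fixed_est, fixed_se = pool_fixed(log_or, var)
random_est, random_se, tau2 = pool_random(log_or, var)
fixed_ci = (fixed_est - 1.96 * fixed_se, fixed_est + 1.96 * fixed_se)
random_ci = (random_est - 1.96 * random_se, random_est + 1.96 * random_se)

print(f"fixed:  {fixed_est:.3f}, 95% CI {fixed_ci[0]:.3f} to {fixed_ci[1]:.3f}")
print(f"random: {random_est:.3f}, 95% CI {random_ci[0]:.3f} to {random_ci[1]:.3f}")
```

With these hypothetical data the random effects interval is wider than the fixed effects interval, which is the sense in which the text calls the random effects model the more conservative choice.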

or outcomes measured. Methodological variability occurs when there are differences in study design. Not surprisingly, clinical or methodological differences will cause variations in the effect measured. Heterogeneity refers to this difference in effect size (or direction) between studies. Of course, as with any statistical finding, apparent heterogeneity of the effect size in pooled studies may arise by chance.
Assessment of clinical and methodological heterogeneity includes both qualitative and quantitative elements. One begins by comparing the study populations. Are the studies similar in age, sex, or even type of disease? If not, is it appropriate to pool them together? Are the interventions the same? Some studies may include co-interventions, which may be a source of confounding. Studies also may exhibit variability in terms of the timing of the intervention; thus, imposition of an intervention at different stages during the disease process may cause differences in degree of efficacy. For example, a study on the impact of oncologic surgery would likely exhibit differences in efficacy if conducted early after cancer detection as opposed to after metastases had developed. The question of timing overlaps the issue of population differences, as patients may be sicker at one stage of the disease than another. This can magnify the effects or negate them. An ill population may exaggerate the beneficial effects of an intervention or may be too far along the disease process to show any efficacy. Sometimes, the interventions themselves may be dissimilar. For example, a review of antibiotics in sepsis may include studies that used different classes of antibiotics. Dose size may have an impact on heterogeneity as well. The effects, beneficial or harmful, may increase with increased dose and with the duration or frequency of the intervention.
Clearly, outcome measures also must be similar to permit appropriate comparison. Thus, 6-month mortality after cardiac intervention in one study should not be compared to left ventricular ejection fraction at 6 months in another. Length of follow-up of a trial may have an influence on the estimate of treatment effect. Like applying the intervention at disparate times, follow-up at different stages of the disease likely will impact outcomes. This issue should have been resolved during the study selection stage of a review so that studies lacking the desired outcome measure were excluded. One should also be critical of surrogate marker use as an outcome measure, especially when it is being compared to a direct outcome. Different study methods will have different degrees of bias. Those conducting meta-analyses should consider whether it is appropriate to compare RCTs with blinding and concealment to unblinded cohort studies.

Table 9.3 Assessing heterogeneity with the I² statistic

I² (as a proportion; multiply by 100 for a percentage) Degree of heterogeneity
<0.25 Low
0.25 to 0.50 Moderate
>0.50 High
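
The I² values graded in Table 9.3 are derived from the Cochran Q, as this section explains. As a minimal sketch under stated assumptions, the calculation might look as follows. The inverse-variance weights, the function names, and the four-trial data set are hypothetical and are not taken from the chapter; the cut-offs follow Table 9.3, and the p value for Q (from a chi-square distribution with N - 1 degrees of freedom) is deliberately left out to keep the example free of external libraries.

```python
def cochran_q(effects, variances):
    """Cochran Q: weighted squared deviations of study effects from the pooled effect."""
    weights = [1 / v for v in variances]  # assumed inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1  # number of studies minus one
    return q, df


def i_squared(q: float, df: int) -> float:
    """I^2 as a percentage: 100 x (Q - df) / Q, with negative values set to 0."""
    if q <= 0:
        return 0.0
    return max(0.0, 100.0 * (q - df) / q)


def heterogeneity_category(i2_percent: float) -> str:
    """Grade I^2 using the cut-offs of Table 9.3 (25% and 50%)."""
    if i2_percent < 25:
        return "Low"
    if i2_percent <= 50:
        return "Moderate"
    return "High"


# Hypothetical log odds ratios and their variances for four trials.
log_or = [-0.40, -0.10, -0.60, 0.05]
var = [0.04, 0.02, 0.09, 0.03]

q, df = cochran_q(log_or, var)
i2 = i_squared(q, df)
category = heterogeneity_category(i2)

print(f"Q = {q:.2f} on {df} df, Q/df = {q / df:.2f}, I2 = {i2:.1f}% ({category})")
```

Here Q/df exceeds 1, which the text interprets as possible heterogeneity even when the Q test itself is not statistically significant.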

Fig. 9.8 The L'Abbé plot

Heterogeneity of the effect size can be analyzed graphically or statistically. The following are some of the commonly accepted methods:
• The forest plot (described above) can be visually analyzed to determine whether the effects of the individual studies are scattered about on both sides of the "no difference/association" line or whether they are grouped together (i.e., are on one side or another of this line). If there is very little or no overlap of the confidence intervals, then significant heterogeneity exists and pooling of the results may not be appropriate. If a meta-analysis is carried out, the authors should address the cause of the heterogeneity, whether clinical, methodological, or both, and provide a justification for continuation.
• The L'Abbé plot (Fig. 9.8) also can be used to explore the heterogeneity of effect estimates [21]. The proportion of events in the intervention group (y-axis) is plotted against that in the control group (x-axis). The "no effects" line runs between them at 45°. The symbol size is proportional to sample size.
• The Cochran chi-square (Cochran Q) is a common test for detecting heterogeneity in meta-analyses. It assumes the null hypothesis that all the variability among the individual study results is due to chance. The Cochran Q test generates a p value, based on a chi-square distribution with N - 1 degrees of freedom (df), where N is the number of studies, that indicates whether the individual effects are farther away from the common effect than would be expected by chance. A p value < 0.10 indicates significant heterogeneity. (The level of significance for Cochran Q often is set at 0.1 due to the low power of the test to detect heterogeneity.) If the Cochran Q is not statistically significant, but the ratio of Cochran Q to the degrees of freedom (Q/df) is >1, the result is interpreted to indicate possible heterogeneity. If the Cochran Q is not statistically significant and Q/df is <1, then heterogeneity is much less likely. A limitation of the Cochran Q test is that it is underpowered to detect heterogeneity if there are few studies in the meta-analysis. Conversely, it is overpowered (i.e., may detect negligible variability) when the number of studies is large. An additional limitation is that the Cochran Q test evaluates only the presence or absence of heterogeneity rather than its magnitude.
• The I² statistic represents the percentage of variation across studies due to heterogeneity. I² is an index that quantifies the degree of heterogeneity in a meta-analysis and can be used as a complement to the Cochran Q test. I² is calculated from the Cochran Q according to the formula I² = 100% × (Q - df)/Q, where df = degrees of freedom. Values may range from 0% to 100%, with a value of 0% indicating no observed heterogeneity (Table 9.3). Although negative values are possible from the equation, they are treated as equivalent to 0.
• Sensitivity analysis. A sensitivity analysis tests whether the results of the meta-analysis are affected by restrictions and alterations in the included studies. Examples include removing an outlier (i.e., the study with the largest effect size in either direction) or removing the largest study to test if this

changes the magnitude or direction of the pooled effect size or its statistical significance. This analysis helps to determine whether the pooled result is influenced heavily by a particular trial. Other permutations include using only blinded, higher quality trials (or excluding lower quality trials) or performing the analysis under both fixed and random effects assumptions. If the results are consistent, the sensitivity analysis provides stronger evidence of an effect and of generalizability.

Pooling Results for Meta-analysis: Considerations

Heterogeneity (whether defined graphically or statistically) should be considered alongside a qualitative assessment of the combinability of studies. When significant methodological differences and heterogeneity are detected, a meta-analysis probably should not be performed, as it may be misleading. Under these circumstances, the systematic review should report the results descriptively using text and tables and not pool the data. However, if effect sizes are similar despite clinical and methodological differences, the results probably are robust and generalizable. A cost-free program for producing the tables and graphs and performing the statistics for a meta-analysis is available from the Cochrane group: RevMan 5 (Review Manager, Version 5.0, The Cochrane Collaboration, Copenhagen, Denmark).

Detecting Publication Bias

The literature tends to be biased toward positive findings, a phenomenon known as publication bias [22]. Studies with large sample sizes have a greater probability of achieving statistical significance and, therefore, achieving publication. This holds true for studies demonstrating large treatment effects as well, even if the sample size is small. Indeed, many smaller or negative trials are never published. Publication bias produces a positive relationship between sample or effect size and publication, with potentially adverse consequences (i.e., type I error or inappropriate rejection of the null hypothesis in favor of the alternative hypothesis, further discussed in Chaps. 10 and 11). Fortunately, a variety of graphical and statistical methods are available to help detect it. The most widely used of these are described below:
• Funnel plots. The funnel plot [23] is a graphic display of the sample size or precision (1/standard error) on the y-axis versus the effect estimate on the x-axis, used to detect publication bias. Ideally, the results from small studies will scatter widely at the bottom of the graph, forming the base of the triangle or funnel because they have less precision, with the spread narrowing around the summary effects line at the apex for larger studies. This pattern occurs when publication bias is absent or unlikely. Asymmetry indicates systematic differences, errors of measurement, or publication bias; as noted, small studies with positive results are more likely to be published, whereas negative studies of similar size are not and, therefore, are not found during execution of the search strategy. The absence of these balancing studies is made visually obvious in the asymmetry of the plot (Fig. 9.9). Although funnel plots usually are employed to test for publication bias, there are other causes of asymmetry, such as systematic differences and errors of measurement. When found, the causes of the asymmetry should be investigated and explained to justify the continued grouping of these studies for meta-analysis.
• Fail-safe N. The inability to locate every unpublished study about a subject might be unnerving to authors of a meta-analysis. As a method of compensation for what may be unknown, Rosenthal [24] developed formulae based on the desired level of significance (p value), later named the fail-safe N by Cooper [25]. Orwin [26] adapted the fail-safe N to adjust for small (d = 0.2), medium (d = 0.5), or large (d = 0.8) effect sizes [27]. The formula calculates the number of studies that would be needed to confirm the null hypothesis and, thereby, reverse a conclusion that a significant

Fig. 9.9 The funnel plot
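
Orwin's fail-safe N, discussed in this section, takes one line of arithmetic once the inputs are chosen. The sketch below is illustrative only; the function name and the example values (10 studies, a mean effect size of 0.5, and a criterion of 0.2, the "small" effect size) are hypothetical. It assumes, as in Orwin's formulation, that the hypothetical missing studies have a mean effect size of zero.

```python
def orwin_fail_safe_n(n_studies: int, mean_effect: float, criterion_effect: float) -> float:
    """Orwin's fail-safe N: N_fs = N * (d - d_c) / d_c.

    Assumes the unlocated studies average a null (zero) effect.
    """
    if criterion_effect <= 0:
        raise ValueError("criterion effect size must be positive")
    return n_studies * (mean_effect - criterion_effect) / criterion_effect


# Hypothetical meta-analysis: 10 studies with mean d = 0.5 and criterion d_c = 0.2.
n_fs = orwin_fail_safe_n(n_studies=10, mean_effect=0.5, criterion_effect=0.2)

# Check: adding n_fs null studies should bring the mean effect down to the criterion.
diluted_mean = (10 * 0.5 + n_fs * 0.0) / (10 + n_fs)

print(f"fail-safe N = {n_fs:.1f}, diluted mean effect = {diluted_mean:.2f}")
```

In this hypothetical case about 15 unlocated null studies would be needed to dilute the mean effect to the criterion value, more than the number of studies actually found.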

relationship exists. The formula for Orwin's fail-safe N [26] is given below:

\[ N_{fs} = \frac{N\,(d - d_c)}{d_c} \]

where N = the number of studies in the meta-analysis, d = the average effect size for the studies synthesized, and d_c = the criterion value selected that d would equal when some knowable number of hypothetical studies (N_fs) were added to the meta-analysis. If the fail-safe N is sufficiently high, it may provide reassurance that a few missing studies would not alter the conclusion.

Assessing Quality of Systematic Reviews and Meta-analyses

Systematic reviews and meta-analyses are powerful informational tools. However, unless properly conducted and reported, they can produce erroneous conclusions that potentially could impact the public health [28, 29]. Thus, as there are tools for assessing the quality of individual trials, there also are guidelines for assessing the quality of systematic reviews and meta-analyses. In 1996 (published in 1999) [30], the QUOROM (quality of reporting of meta-analyses) statement was issued to address standards for improving the quality of reporting of meta-analyses of clinical randomized controlled trials. Since that time, many additions, updates, and expansions of this statement for broader applicability have led to the development of the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-analyses) statement, which provides guidelines designed to reduce the risk of flawed reporting of systematic reviews and improve the clarity and transparency in how reviews are conducted [31]. Included are a 27-item checklist (Table 9.4) and a 4-phase flowchart (Fig. 9.10) [32].
Though not part of current checklists, conflicts of interest such as financial funding of individual trials should be reported in the systematic review or meta-analysis.

Limitations of Systematic Reviews and Meta-analyses

The major limitations of narrative reviews have been described above. The reader should be aware that caution also must be exercised when conducting or interpreting a systematic review or meta-analysis. Of note, an evaluation of 300 systematic reviews conducted by Moher et al. in 2007 found that the quality of these reviews was inconsistent [33], a finding that led to the above-mentioned 2009 PRISMA statement. Other criticisms are based on poor methodology including
Table 9.4 PRISMA checklist for reporting systematic reviews with or without meta-analyses (Reproduced with permission from Moher et al. [32])
Section/topic Item No Checklist item Reported on page No
Title 1 Identify the report as a systematic review, meta-analysis, or both
Structured summary 2 Provide a structured summary including, as applicable, background, objectives, data sources, study eligibility criteria, participants, interventions, study appraisal and synthesis methods, results, limitations, conclusions and implications of key findings, systematic review registration number
Rationale 3 Describe the rationale for the review in the context of what is already known
Objectives 4 Provide an explicit statement of questions being addressed with reference to participants, interventions, comparisons, outcomes, and study design (PICOS)
Protocol and registration 5 Indicate if a review protocol exists, if and where it can be accessed (such as web address), and, if available, provide registration information including registration number
Eligibility criteria 6 Specify study characteristics (such as PICOS, length of follow-up) and report characteristics (such as years considered, language, publication status) used as criteria for eligibility, giving rationale
Information sources 7 Describe all information sources (such as databases with dates of coverage, contact with study authors to identify additional studies) in the search and date last searched
Search 8 Present full electronic search strategy for at least one database, including any limits used, such that it could be repeated
Study selection 9 State the process for selecting studies (that is, screening, eligibility, included in systematic review, and, if applicable, included in the meta-analysis)
Data collection process 10 Describe method of data extraction from reports (such as piloted forms, independently, in duplicate) and any processes for obtaining and confirming data from investigators
Data items 11 List and define all variables for which data were sought (such as PICOS, funding sources) and any assumptions and simplifications made
Risk of bias in individual studies 12 Describe methods used for assessing risk of bias of individual studies (including specification of whether this was done at the study or outcome level), and how this information is to be used in any data synthesis
Summary measures 13 State the principal summary measures (such as risk ratio, difference in means)
Synthesis of results 14 Describe the methods of handling data and combining results of studies, if done, including measures of consistency (such as I² statistic) for each meta-analysis
Risk of bias across studies 15 Specify any assessment of risk of bias that may affect the cumulative evidence (such as publication bias, selective reporting within studies)
Additional analyses 16 Describe methods of additional analyses (such as sensitivity or subgroup analyses, meta-regression), if done, indicating which were pre-specified
Study selection 17 Give numbers of studies screened, assessed for eligibility, and included in the review, with reasons for exclusions at each stage, ideally with a flow diagram
Study characteristics 18 For each study, present characteristics for which data were extracted (such as study size, PICOS, follow-up period) and provide the citations
Risk of bias within studies 19 Present data on risk of bias of each study and, if available, any outcome-level assessment (see item 12)
Results of individual studies 20 For all outcomes considered (benefits or harms), present for each study (a) simple summary data for each intervention group and (b) effect estimates and confidence intervals, ideally with a forest plot
Synthesis of results 21 Present results of each meta-analysis done, including confidence intervals and measures of consistency
Risk of bias across studies 22 Present results of any assessment of risk of bias across studies (see item 15)
Additional analysis 23 Give results of additional analyses, if done (such as sensitivity or subgroup analyses, meta-regression) (see item 16)
Summary of evidence 24 Summarize the main findings including the strength of evidence for each main outcome; consider their relevance to key groups (such as health care providers, users, and policy makers)
Limitations 25 Discuss limitations at study and outcome level (such as risk of bias), and at review level (such as incomplete retrieval of identified research, reporting bias)
Conclusions 26 Provide a general interpretation of the results in the context of other evidence, and implications for future research
Funding 27 Describe sources of funding for the systematic review and other support (such as supply of data) and role of funders for the systematic review

Fig. 9.10 PRISMA four-phase flow diagram (Reproduced with permission from Moher et al. [32])

nonadherence to proper searching strategies, lack of statistical rigor, and introduction of bias (intentional or unintentional) in which studies were "cherry picked" to suit the personal agenda of the reviewer/analyst. Unfortunately, not all of the limitations can be minimized by strict methodology. A fundamental limitation of meta-analysis, specifically, is that it comprises studies performed under different protocols and at different times; for purposes of the analysis, it is assumed that the differences in protocol and study design of the elements are obviated by the large number of observations ultimately available. This assumption is highly questionable. As noted above, if clinical and methodological diversity across studies is such that substantial heterogeneity is found, it may be better not to combine them in a meta-analysis (if a meta-analysis is performed under these circumstances, as noted earlier, it should be considered for hypothesis generation only). In addition, the increased power gained by pooling the results of individual studies, which is advantageous for decreasing type II errors, also may allow small biases to be interpreted erroneously as an effect, increasing type I errors. (Again, see Chaps. 10 and 11 for further elaboration of these fundamental concepts.) On occasion, the same dataset may be published multiple times, making the results not independent. If this is not recognized, the dataset will be weighted more than once in the analysis, artificially inflating the results. Finally, the results and conclusions of a systematic review or meta-analysis are only as reliable as the methods used in each of the primary studies. The methodology used for their qualitative or quantitative synthesis does not compensate for flaws or errors in the individual primary studies.

Take-Home Points

• For clinicians to make informed decisions for patient management and research, they must
  analyze multiple studies for quality and relevance to the population of interest.
• Secondary sources of information (especially systematic reviews and meta-analyses) help
  to summarize and reconcile conflicting studies in the literature.
• By explicitly stating how evidence was found, selected, and evaluated, systematic reviews
  eliminate many of the biases inherent in narrative reviews.
• Meta-analysis uses statistical methodology to combine results of several related studies,
  which affords greater statistical power than that of individual studies.
• Though systematic reviews and meta-analyses are retrievable via traditional online literature
  search engines, a variety of databases specialize in them.
• To construct a quality systematic review, one should formulate a clear question, define a
  comprehensive yet efficient literature searching strategy, include all appropriate studies,
  summarize results, assess heterogeneity, and consider appropriateness of pooling results of
  individual studies for meta-analysis.
• Caution must be exercised when conducting or interpreting a systematic review or
  meta-analysis to ensure inclusiveness of literature searching, optimization of statistical
  rigor, minimization of bias, and avoidance of inclusion of multiple publications of the same
  dataset.
• The results and conclusions of a systematic review or meta-analysis are only as reliable as
  the methods used in each of the primary studies; their synthesis does not compensate for
  errors of methodology in the individual primary studies.
• Meta-analyses, constructed as they are of multiple nonidentical studies, must be viewed as a
  hypothesis-generating rather than a hypothesis-testing tool, especially if major methodological
  differences or heterogeneity among their components are detected.
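
As a rough illustration of the "assess heterogeneity" step, the sketch below computes Cochran's Q and the I² statistic for a set of study results. The text above does not name a specific statistic, so the choice of Q and I² is an assumption, as are the function name heterogeneity and the hypothetical effect estimates and standard errors.

```python
def heterogeneity(effects, std_errors):
    """Return Cochran's Q, its degrees of freedom, and I^2 (as a percentage)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q: weighted sum of squared deviations of each study from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of total variability beyond what chance alone would explain.
    i_squared = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i_squared

# Hypothetical effect estimates (e.g., log risk ratios) with standard errors.
similar = heterogeneity([-0.20, -0.25, -0.15], [0.10, 0.12, 0.11])
diverse = heterogeneity([-0.60, 0.30, -0.05], [0.10, 0.12, 0.11])

for label, (q, df, i2) in (("similar studies", similar), ("diverse studies", diverse)):
    print(f"{label}: Q={q:.2f} on {df} df, I^2={i2:.1f}%")
```

A large Q relative to its degrees of freedom, or a high I², suggests that the studies differ by more than chance would explain, which is the situation in which pooling may be inappropriate.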

17. Stewart LA, Tierney JF. To IPD or not to IPD?

References Advantages and disadvantages of systematic reviews
using individual patient data. Eval Health Prof.
1. Lau J, Antman EM, Jimenez-Silva J, Kupelnick B, 2002;25:7697.
Mosteller F, Chalmers TC. Cumulative meta-analysis 18. Matt GE, Cook TD. Threats to the validity of research
of therapeutic trials for myocardial infarction. N Engl synthesis. In: Cooper H, Hedges LV, editors. The
J Med. 1992;327:24854. handbook of research synthesis. NewYork: Russell
2. Collins JA, Fauser BCJM. Balancing the strengths of Sage; 1994.
systematic and narrative reviews. Hum Reprod 19. Lewis S, Clarke C. Forest plots: trying to see the wood
Update. 2005;11:1034. and the trees. BMJ. 2001;322:147980.
3. Pearson K. Report on certain enteric fever inoculation 20. Brandler E, Paladino L, Sinert R. Does the early
statistics. Br Med J. 1904;3:12436. administration of beta-blockers improve the
4. Glass GV. Primary, secondary, and meta-analysis of in-hospital mortality rate of patients admitted with
research. Educ Res. 1976;5:38. acute coronary syndrome? Acad Emerg Med. 2010;17:
5. Sacks HS, Berrier J, Reitman D, Ancona-Berk VA, 110.
Chalmers TC. Meta-analyses of randomized con- 21. LAbbe KL, Detsky AS, ORourke K. Meta-analysis in
trolled trials. N Engl J Med. 1987;316:4505. clinical research. Ann Intern Med. 1987;107:22433.
6. Tuckman BW. Conducting educational research. 3rd 22. Easterbrook PJ, Berlin JA, Gopalan R, Matthews DR.
ed. New York: Harcourt Brace Jovanovich; 1972. Publication bias in clinical research. Lancet.
7. Wilczynski NL, Haynes RB, Lavis JN, 1991;337:86772.
Ramkissoonsingh R, Arnold-Oatley A, HSR Hedges 23. Egger M, Smith DG, Schneider M, Minder C. Bias in
Team. Optimal search strategies for detecting health meta-analysis detected by a simple, graphical test.
services research studies in MEDLINE. CMAJ. BMJ. 1997;315:62934.
2004;171:117985. 24. Rosenthal R. The le drawer problem and tolerance
8. Oxman AD, Sackett DL, Guyatt GH. Users guides to for null results. Psychol Bull. 1979;86:63841.
the medical literature. I. How to get started. The 25. Cooper HM. Statistically combining independent stud-
Evidence-Based Medicine Working Group. JAMA. ies: a meta-