Phyllis G. Supino EdD (Auth.), Phyllis G. Supino, Jeffrey S. Borer (Eds.) - Principles of Research Methodology - A Guide For Clinical Investigators-Springer-Verlag New York (2012)

Principles of Research Methodology
A Guide for Clinical Investigators
Phyllis G. Supino, Jeffrey S. Borer
Editors
Foreword by Stephen E. Epstein

Editors
Phyllis G. Supino
Professor of Medicine, College of Medicine
Professor of Public Health, School of Public Health
Director of Clinical Epidemiology and Clinical Research
Division of Cardiovascular Medicine
State University of New York (SUNY) Downstate Medical Center
Brooklyn, NY, USA
Jeffrey S. Borer
Professor and Chair, Department of Medicine
Chief, Division of Cardiovascular Medicine
Director of The Howard Gilman Institute for Heart Valve Disease
Director of the Cardiovascular Translational Research Institute
SUNY Downstate Medical Center
Brooklyn, NY, USA
ISBN 978-1-4614-3359-0 e-ISBN 978-1-4614-3360-6 (eBook)

DOI 10.1007/978-1-4614-3360-6
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2012937226
© Phyllis G. Supino and Jeffrey S. Borer 2012

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or
part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way,
and transmission or information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed. Exempted from this
legal reservation are brief excerpts in connection with reviews or scholarly analysis or material
supplied specifically for the purpose of being entered and executed on a computer system, for
exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is
permitted only under the provisions of the Copyright Law of the Publisher's location, in its
current version, and permission for use must always be obtained from Springer. Permissions for
use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable
to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal responsibility
for any errors or omissions that may be made. The publisher makes no warranty, express or
implied, with respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Foreword
This superb book on research philosophy and methodology that Drs. Phyllis
Supino and Jeffrey Borer have written and edited came out of an experience
common to most of us involved in training investigators beginning their
research careers. How do you teach these investigators the mostly unwritten
ways of an area as complex as medical research? How do you help the
research neophyte develop into a creative and reliable researcher? For me and
my associates in the Cardiology Branch of the NIH (of which Dr. Borer was
one) in the 1970s and 1980s, the teaching process was mostly based on an
apprenticeship model, with learning coming in the actual doing of the
research. This time-honored approach led to the development, in many
research centers, of a cadre of superb researchers, but it was hard to master
and the results were necessarily inconsistent, with many young investigators
going down wrong paths.
Drs. Supino and Borer's book represents a unique collaboration between
an accomplished educator specializing in research methodology and a prominent
physician-scientist. Drs. Supino and Borer began their collaboration
more than 20 years ago at Cornell University Medical College, continuing
their work together in what became the Howard Gilman Institute for Valvular
Heart Diseases. The Institute, of which Dr. Borer is the Director, now is
located at the State University of New York Downstate Medical Center.
Working within the context of a research institute housed within a medical
school, Dr. Borer soon discovered that most of the fellows coming into his
program had no formal research training and scant knowledge of research
methodology. Prior to joining the Institute, Dr. Supino had been conducting
continuing education in research methodology for scientists and health
professionals since the late 1970s. When Dr. Supino joined the Institute in 1990, she
applied her accumulated expertise in this field to develop a curriculum and
lead a comprehensive course providing formal training in research methodology
for Dr. Borer's fellows and others at the institution. This curriculum and
course, developed in partnership with Dr. Borer, turned out to be our good
fortune. During the ensuing 20+ years Drs. Supino and Borer gradually developed
the pedagogical framework for writing what is one of the best books in
the field.
This book provides in-depth chapters containing information critical to
creating good research, from the kind of mind-set that generates valuable
research questions to study design, to exploring a variety of online
databases, to the elements making for compelling research grants and papers,
and to the wonderfully informing chapter on the history of the application
of ethics to medical research. There also is a valuable chapter on statistical
considerations and a fascinating discussion on the origins and elements of
hypothesis generation.
It's also important to emphasize that this superb text is not only for the
new investigator, but for experienced investigators as well. This results from
the fact that Drs. Supino, Borer, and their coauthors write their chapters in
ways that are not only easily accessible to the new investigator, but at the
same time are sufficiently sophisticated so that the seasoned investigator will
profit.
As an example, I particularly enjoyed the first chapter, written by
Dr. Supino, which provides some down-to-earth examples of, in essence, why
there should be a clearly defined primary endpoint in clinical investigations.
As I was reading her chapter, I realized I had forgotten the "why" of this
requirement, and that I was just taking the requirement for granted, a situation
that could make investigators vulnerable to dismissing its importance. In this
regard, over the years I've found it not uncommon for investigators, who find
that the intervention they're studying significantly improves
one or another secondary endpoint but not the primary endpoint, to freely
attack this requirement and argue they've proven the efficacy of their
intervention. But Dr. Supino reminds us what good science is by providing an
elegantly simple example of the marksman who boasts of his skills after
interpreting the results of his shooting a gun at a piece of paper hung on the side
of a barn. The marksman, it turns out, does not prospectively define the bull's
eye. Rather, after multiple bullets are fired at the piece of paper, he inspects
the bullet hole-riddled paper, sees the random bullet hole patterns, and then
draws a circle (bull's eye) around a group of holes that by chance have fallen
into a tight cluster. The post hoc definition of the bull's eye (i.e., now the
primary endpoint) speaks (unjustifiably) to the marksman's skill. By this
simple anecdote, Dr. Supino makes the critical importance of prospectively
defining the primary endpoint exquisitely clear.
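
The marksman anecdote can be made concrete with a short calculation. The sketch
below is illustrative only and does not come from the book: it assumes that an
intervention has no true effect, that each endpoint is tested independently at a
0.05 significance level, and it computes the chance that at least one endpoint
looks "significant" by chance alone. The function name, the 0.05 threshold, and
the independence assumption are hypothetical simplifications.

```python
def chance_of_spurious_hit(n_endpoints: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent endpoints with no
    true effect reaches significance at level alpha purely by chance."""
    # Each endpoint stays non-significant with probability (1 - alpha);
    # all n stay non-significant with probability (1 - alpha) ** n.
    return 1.0 - (1.0 - alpha) ** n_endpoints


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(f"{k:2d} endpoints: {chance_of_spurious_hit(k):.3f}")
```

Under these assumptions, a single prespecified endpoint carries the nominal 5%
false-positive risk, whereas choosing the "best" of many endpoints after the fact
carries a much larger one. Real endpoints are usually correlated, so the exact
figures would differ in practice.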
A foreword is no place to provide extensive details of what a book
contains. I'll therefore limit myself and just enthusiastically say this first chapter
I read is representative of the high quality of the chapters to come. Drs.
Supino and Borer have used the many years over which they developed their course
extraordinarily well; they and their outstanding coauthors have produced a
book that is well written, beautifully edited, and contains wisdom and insight.
It is a book, whether reading it in its entirety or perusing individual chapters,
that presents the reader with a superb learning experience. The authors have
certainly hit the bull's eye.
Washington, DC, USA Stephen E. Epstein, MD

Preface
This book has been written to aid medical students, physicians, and other
health professionals as they probe the increasingly complex and varied
medical/scientific literature for knowledge to improve patient care and search for
guidance in the conduct of their own research. It also is intended for basic
scientists involved in translational research who wish to better understand the
unique challenges and demands of clinical research and, thus, become more
successful members of interdisciplinary medical research teams.
The book is based largely on a lecture series on research methodology,
with particular emphasis on issues affecting clinical research, that the editors
designed and have offered for 21 years to more than 1,000 members of the
academic medical communities of Weill Cornell Medical College and the
State University of New York (SUNY) Downstate Medical Center, both
located in New York City. The book spans the entire research process, from
the conception of the research problem to publication of findings.
The need for such a book has become increasingly clear to us during many
years of conducting a program of training and research in cardiovascular
diseases and in our general teaching of research methodology to students,
trainees, and postgraduate clinical physicians and researchers. Though agreement
on the fundamental principles of scientific research has existed for more than
a century, the application of these principles has changed over time. The
precision required in defining study populations and in detailing methodologies
(and their deficiencies) is continually increasing. In addition, a bewildering
arsenal of statistical tools has developed (and continues to grow) to identify
and define the magnitude and consistency of relationships. Simultaneously,
acceptable formats for communicating scientific data have changed in
response to parallel changes in the world at large, and under the pressure of
an information explosion which mandates succinctness and clarity.
Despite these demands, there are few books, if any, that comprehensively and
concisely present these concepts in a manner that is relevant and comprehensible
to a broad professional medical community. This text is designed to resolve this
deficiency by combining theory and practical application to familiarize the
reader with the logic of research design and hypothesis construction, the
importance of research planning, the ethical basis of human subjects research, the
basics of writing a clinical protocol, the logic and techniques of data generation
and management, and the fundamentals and implications of various sampling
techniques and alternative statistical methodologies. This book also aims to offer
guidance for assembling and interpreting results, writing scientific papers, and
publishing studies.
The book's 13 chapters emphasize the role and structure of the scientific
hypothesis (reinforced throughout the various chapters) in informing methods
and in guiding data interpretation. Chapter 1 describes the general
characteristics of research and differentiates among various types of research;
it also summarizes the steps typically utilized in the hypothesis-testing
(hypothetico-deductive) method and underscores the importance of proper
planning. Chapter 2 reviews the origins of clinical research problems and the
types of questions that are commonly asked in clinical investigations; it also
identifies the characteristics of well-conceived research problems and explains
the role of the literature search in research problem development. Chapter 3
introduces the reader to various modes of logical inference utilized for
hypothesis generation, describes the characteristics of well-designed research
hypotheses, distinguishes among various types of hypotheses, and provides
guidelines for constructing them. Chapter 4 takes the reader through classic
epidemiological (observational) methods, including cohort, case-control,
and cross-sectional designs, and describes their respective advantages and
limitations. Chapter 5 discusses the meaning of internal and external validity
in the context of studies that aim to examine the effects of purposively applied
interventions, identifies the most important sources of bias in these types of
studies, and presents a variety of alternative study designs that can be used to
evaluate interventions, together with their respective strengths and weaknesses
for controlling each of the identified biases. Chapter 6 defines and
describes the purpose of the clinical trial and provides in-depth guidelines for
writing the clinical protocol that governs its conduct. Chapter 7 describes
methodologies used for data capture and management in clinical trials and
reviews associated regulatory requirements. Chapter 8 explains the steps
involved in designing, implementing, and evaluating questionnaires and
interviews that seek to obtain self-reported information. Chapter 9 reviews
the pros and cons of systematic reviews and meta-analyses for generating
secondary data by synthesizing evidence from previously conducted studies,
and discusses methods for locating, evaluating, and writing them. Chapter 10
describes the various methods by which subjects can be sampled and the
implications of these methods for drawing conclusions from clinical research
findings. Chapter 11 introduces the reader to fundamental statistical principles
used in biomedical research and describes the basis of determination of
sample size and definition of statistical power. Chapter 12 describes the ethical
basis of human subjects research, identifies areas of greatest concern to
institutional review boards, and outlines the basic responsibilities of investigators
towards their subjects. Finally, Chapter 13 provides practical guidance
on how to write a publishable scientific paper.
The authors of this book include prominent medical scientists and methodologists
who have extensive personal experience in biomedical investigation
and in teaching various key aspects of research methodology to medical
students, physicians, and other health professionals. They have endeavored to
integrate theory and examples to promote concept acquisition and to employ
language that will be clear and useful for a general medical audience. We hope
that this text will serve as a helpful resource for those individuals for whom
performing or understanding the process of research is important.
Brooklyn, NY, USA Phyllis G. Supino

Jeffrey S. Borer
Special Acknowledgments
We wish to give special thanks to the following individuals, who provided
particular assistance to the editors and authors in the preparation of this book:
From our publishers, we especially thank Richard Lansing for his belief in
the importance of our project as well as Kevin Wright, senior developmental
editor, for his excellent pre-production work.
From SUNY Downstate Medical Center, we thank Ofek Hai DO for his
efforts in the preparation of figures and tables; Rachel Reece BS for her
assistance in helping us to secure permission for the reproduction of images; and
Dany Bouraad BA, Jaclyn Wilkens BA, Daniel Santarsieri BS, and Romina
Arias BA for their assistance in literature searching, proofreading, and other
essential background work.
Finally, we thank our colleagues at Weill Cornell Medical College and
SUNY Downstate Medical Center who participated in our teaching programs
on which this book is largely based, and our families for their unfailing
support of this project.
Contents
1 Overview of the Research Process
Phyllis G. Supino
2 Developing a Research Problem
Phyllis G. Supino and Helen-Ann Brown Epstein
3 The Research Hypothesis: Role and Construction
Phyllis G. Supino
4 Design and Interpretation of Observational Studies: Cohort, Case-Control, and Cross-Sectional Designs
Martin L. Lesser
5 Fundamental Issues in Evaluating the Impact of Interventions: Sources and Control of Bias
Phyllis G. Supino
6 Protocol Development and Preparation for a Clinical Trial
Joseph A. Franciosa
7 Data Collection and Management in Clinical Research
Mario Guralnik
8 Constructing and Evaluating Self-Report Measures
Peter L. Flom, Phyllis G. Supino, and N. Philip Ross
9 Selecting and Evaluating Secondary Data: The Role of Systematic Reviews and Meta-analysis
Lorenzo Paladino and Richard H. Sinert
10 Sampling Methodology: Implications for Drawing Conclusions from Clinical Research Findings
Richard C. Zink
11 Introductory Statistics in Medical Research
Todd A. Durham, Gary G. Koch, and Lisa M. LaVange
12 Ethical Issues in Clinical Research
Eli A. Friedman
13 How to Prepare a Scientific Paper
Jeffrey S. Borer
About the Editors
Index

Contributors
Jeffrey S. Borer, MD Department of Medicine, Division of Cardiovascular
Diseases, Howard Gilman Institute for Valvular Heart Diseases, and
Cardiovascular Translational Research Institute, State University of New York
(SUNY) Downstate Medical Center, Brooklyn, NY, USA
Todd A. Durham, MS Axio Research, LLC, Seattle, WA, USA
Helen-Ann Brown Epstein, MLS, MS, AHIP Clinical Librarian, Samuel J.
Wood Library and C.V. Starr Biomedical Information Center, Weill Cornell
Medical College, New York, NY, USA
Peter L. Flom, PhD Peter Flom Consulting, LLC, New York, NY, USA
Joseph A. Franciosa, MD Department of Medicine, SUNY, Downstate
Medical Center, Brooklyn, NY, USA
Eli A. Friedman, MD Department of Medicine, SUNY, Downstate Medical
Center, Brooklyn, NY, USA
Mario Guralnik, PhD Synergy Research Inc, Irvine, CA, USA
Gary G. Koch, PhD Department of Biostatistics, University of North
Carolina at Chapel Hill Gillings School of Global Public Health, Chapel
Hill, NC, USA
Lisa M. LaVange, PhD Department of Biostatistics, University of North
Carolina at Chapel Hill Gillings School of Global Public Health, Chapel
Hill, NC, USA
Martin L. Lesser, PhD Biostatistics Unit, Departments of Molecular Medicine
and Population Health, Feinstein Institute for Medical Research, Hofstra
North Shore-LIJ School of Medicine, Manhasset, NY, USA
Lorenzo Paladino, MD Department of Emergency Medicine, SUNY Downstate
Medical Center, Brooklyn, NY, USA
N. Philip Ross, BS, MS, PhD SUNY Downstate Medical Center, Bethesda,
MD, USA
Richard H. Sinert, DO Department of Emergency Medicine, SUNY Downstate
Medical Center, Brooklyn, NY, USA
Phyllis G. Supino, EdD Department of Medicine, College of Medicine,
SUNY Downstate Medical Center, Brooklyn, NY, USA
Richard C. Zink, PhD JMP Life Sciences, SAS Institute, Inc, Cary,
NC, USA
1 Overview of the Research Process
Phyllis G. Supino

P.G. Supino, EdD
Department of Medicine, College of Medicine, SUNY Downstate Medical Center,
450 Clarkson Avenue, Box 1199, Brooklyn, NY 11203, USA
e-mail: phyllissupino@aol.com

P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators,
DOI 10.1007/978-1-4614-3360-6_1, © Phyllis G. Supino and Jeffrey S. Borer 2012

The term "research" can be defined broadly as a process of solving problems and
resolving previously unanswered questions. This is done by careful consideration or
examination of a subject or occurrence. Although approach and specific objectives may
vary, the ultimate goal of research always is to discover new knowledge. In biomedical
research, this may include the description of a new phenomenon, the definition of a new
relationship, the development of a new model, or the application of an existing
principle or procedure to a new context. Increasingly, the methodology of research is
acknowledged as an academic discipline of its own, whose specific rules and requirements
for securing evidence, though applicable across disciplines, mandate special study. This
chapter describes the characteristics of the research process and its relation to the
scientific method, distinguishes among the various forms of research used in the
biomedical sciences, outlines the principal steps involved in initiating a research
project, and highlights the importance of planning.

Characteristics of Research

No discussion of research methodology should begin without examining the
characteristics of research and its relation to the scientific method. The reason for
this starting point is that the term "research" has been used so loosely in common
parlance and defined in so many different ways by scholars in various fields of inquiry
[1] that its meaning is not always appreciated by those without a formal background. To
understand more readily what research is, it is useful to begin by considering some
examples of what it is not.
Leedy, in his book Practical Research [2], describes two young students: one whose
teacher has sent him to the library to do "research" by gleaning a few facts about
Christopher Columbus and another who completes a "research paper" on the Dark Lady in
Shakespeare's sonnets by gathering facts, assembling a bibliography, and referencing
statements without drawing conclusions or otherwise interpreting the collected data.
Both students think that research has taken place when, in fact, all that has occurred
has been information gathering and transport from one location to another. Leedy argues
that these misconceptions are reinforced at every grade level and that most students
facing the rigors of a graduate program lack clear understanding about the specific
requirements of the research process and underestimate what is involved. In academic
medical programs, it is not uncommon for a resident to comment, "I have a 2-week block
available to conduct a research project" and to expect to design, execute, and complete
it in that time frame.
There is general consensus that information gathering, including reviewing and
synthesizing the literature, is a critically important activity to be undertaken by an
investigator. However, in and of itself, it is not research. The same can be said for
data gathering activities aimed at personal edification or those undertaken to resolve
organization-specific issues. So what, then, characterizes research?
Tuckman [3] has argued that in order for an activity to qualify as research, it should
possess a minimum of five characteristics:
1. It should be systematic.
While some important research findings have occurred serendipitously (e.g., Fleming's
accidental discovery of penicillin, Pasteur's chance finding of microbial antibiosis),
most arise out of purposeful, structured activity. Structure is engendered by a series
of rules for defining variables, constructing hypotheses, and developing research
designs. Rules also exist for collecting, recording, and analyzing data, as well as for
relating results to the problem statement or hypotheses. These rules are used to
generate formal plans (or protocols) which guide the research effort, thereby optimizing
the likelihood of achieving valid results.
2. It should be logical.
Research employs logic that may be inductive, deductive, or abductive in nature.
Inductive logic is employed to develop generalizations from repeated observations,
abductive logic is used to form generalizations that serve as explanations for anomalous
events, and deductive logic is used to generate specific assertions from known
scientific principles or generalizations. Further elaboration of these distinctions is
covered in Chap. 3. Logic is used both in the development of the research design and
selection of statistics to ensure that valid inferences may be drawn from data (internal
validity). Logic also is used to generalize from the results of the particular study to
a broader context (external validity or extrapolability).
3. It should be empirical.
Despite the deductive processes that may precede data collection, the findings of
research must always be based on observation or experience and, thus, must relate to
reality. It is the empirical quality of research that sets it apart from other logical
disciplines, such as philosophy, which also attempts to explain reality. Recognition of
this fact may pose a problem for physicians who, according to some researchers [4, 5],
have a cognitive style that tends to be more deterministic than probabilistic, causing
personal experience to be valued more than data. Under these circumstances, the
importance of subordinating the hypothesis to data may not be fully appreciated. As part
of the education of the physician scientist, he or she must learn that when confronted
with data that do not support the study hypothesis, it is the hypothesis and not the
data that must be discarded, unless it is abundantly clear that something untoward
occurred during the performance of the study.
4. It should be reductive.
As Tuckman [3] has noted, a fundamental purpose of research is "to reduce the confusion
of individual events and objects to more understandable categories of concepts" (p. 11).
One heuristic tool used by scientists for this purpose is the creation of abstractive
constructs such as intervening variables (e.g., resistance and solubility in the
physical sciences, conditioning or reflex reserve in the behavioral sciences) to explain
how phenomena cause or otherwise interact with each other [6]. Another powerful tool
available to the researcher for this purpose is a constellation of techniques for
numerical and graphical data analysis (the specific methodology employed depending on
the objectives and design of the study as well as the number of observations generated
by the study). As Tuckman observes, whenever data are subjected to analysis, some
information is lost, specifically the uniqueness of the individual observation. However,
such losses are offset by gains in the capacity to
conceptualize general relationships based on the data. As a result, the investigator can
explain and predict, rather than merely describe.
5. It should be replicable and transmittable.
The fact that research procedures are documented makes it possible for others to conduct
and attempt to replicate the investigation. The ability to replicate research results in
the confirmation (or, in some unhappy cases, refutation) of conclusions. Confirmation of
conclusions, in turn, results in the validation of research and confers upon research a
respectability that generally is absent in other problem-solving processes. In addition,
the fact that research is transmittable also enables the general body of knowledge to be
extended by subsequent investigations based on the research. For this reason,
researchers are encouraged to present their findings as soon as possible at local,
national, and international scientific sessions and to publish them expeditiously as
letters (communications) or full-length articles in peer-reviewed journals (to ensure
their quality and validity).
6. It should contribute to generalizable knowledge.
The Tuckman criteria speak to the structure and process of research, but not to its
intended objectives. The Belmont Report [7], which codified the definition of human
subjects research for the US Department of Health and Human Services, argues
additionally that for an activity to be considered research, it must contribute to
generalizable knowledge (the latter expressed in theories, principles, and statements of
relationships). For knowledge to be generalizable, the intent of the activity must be to
extrapolate findings from a sample (e.g., the study subjects) to a larger (reference)
population to define some universal truth, and be conducted by individuals with the
requisite knowledge to draw such inferences [8]. Because research seeks generalizable
knowledge, it differs from information gathering for diagnosis and management of
individual patients. It also differs from formal evaluation procedures (e.g., review of
data performed for clinical quality improvement [CQI] or formative and summative
appraisals of educational programs) which, while employing many of the same rigorous and
systematic methodologies as scientific research, principally aim to inform decision
making about particular activities or policies rather than to advance more wide-ranging
knowledge or theory. As Smith and Brandon [9] have noted, "research generalizes" whereas
"evaluation particularizes."

Types of Research

There are multiple ways of classifying research, and the categorizations noted below are
by no means exhaustive. Research can be classified according to its theoretical versus
practical emphasis, the type of inferential processes used, its orientation with respect
to data collection and analysis, its temporal characteristics, its analytic objective,
the degree of control exercised by the investigator, or the characteristics of the
measurements made during the investigation. These yield the following categorizations:
basic versus applied versus translational, hypothesis testing versus hypothesis
generating, retrospective versus prospective, longitudinal versus cross-sectional,
descriptive versus analytic, experimental versus observational, and quantitative versus
qualitative research.

Basic Versus Applied Versus Translational Research

Traditionally, research in medicine, as in other disciplines, has been classified as
basic or applied, though the lines between the two can, and do, intersect. In basic
research (alternatively termed "fundamental" or "pure" research), the investigation
often is driven by scientific curiosity or interest in a conceptual problem; its
objective is to expand knowledge by exploring ideas and questions and developing models
and theories to explain phenomena. Basic research typically does not seek to provide
immediate solutions to
practical problems (indeed, it can progress for decades before leading to breakthroughs and paradigm shifts in practice), though it can yield unexpected applications (e.g., the discovery of the laser and its value for fiber-optic communications [10]), and it often provides the theoretical underpinnings of applied research. Applied research, in contrast, is conducted specifically to find solutions to practical problems in as rapid a time frame as possible. In medicine, applied research searches for explicit knowledge to improve the treatment of a specific disease or its sequelae. Examples of applied research include clinical trials of new drugs and devices in human subjects or evaluation of new uses for existing therapeutic interventions.

In recent years, translational or translative research has emerged as a paradigm alternative to the dichotomy between basic and applied research. Currently practiced in the natural, behavioral, and social sciences, and heavily reliant on multidisciplinary collaboration, translational research is a method of conceptualizing and conducting basic research to render its findings directly and more immediately applicable to the population under study. In medicine, this iterative approach is used to translate results of laboratory research more rapidly into clinical practice and vice versa ("bench to bedside and back" or T1 translation) and from clinical practice to the population at large ("to the community and beyond and back" or T2 translation) to enhance public knowledge. This is one of the major initiatives of the US National Institutes of Health (NIH) Roadmap for Medical Research. Examples of T1 translation include the development of a technique for evaluating endothelium-dependent vasodilator responses as a diagnostic test in patients with atherosclerosis and the elucidation of the role of the p53 tumor suppressor gene in the regulation of apoptosis in the treatment of patients with cancer [11]. Examples of T2 translation would include the implementation, evaluation, and ultimate adoption of interventions that have been shown to be effective in clinical research for primary or secondary prevention in heart disease, stroke, and other disorders. (For an in-depth discussion of purpose, challenges, and techniques of translational research in clinical medicine and associated career opportunities, the reader is referred to the collective works of Schuster and Powers [12], Woolf [13], Robertson and Williams [14], and Goldblatt and Lee [15].)

Hypothesis-Generating Versus Hypothesis-Testing Research

Although some studies are undertaken to describe a phenomenon (e.g., incidence of a new disease or prevalence of an existing disorder in a new population), most research is performed to generate a hypothesis or to test a hypothesis. In hypothesis-generating research, the investigator begins with an observation (e.g., a newly discovered pattern, a rare event) and constructs an argument to explain it. Hypothesis-generating research typically is conducted when existing theory or knowledge is insufficient to explain particular phenomena. Popular tools for hypothesis generation in preclinical research include gene expression microarray studies; hypotheses for clinical or epidemiological research may be generated secondary to a project's initial purpose by mining existing datasets. In contrast, in hypothesis-testing research (sometimes called the "hypothetico-deductive approach"), the investigator begins with a general conjecture or hunch put forth to explain a prior observation or to clarify a gap in the existing knowledge base.

It is vitally important that the investigator keep these differences in mind when designing and drawing inferences from a study. To underscore what can happen when these distinctions are blurred, it is instructive to step back from scientific inquiry and mull over the following scenario:

A Texas cowboy fires his gun randomly at the side of a barn. Figure 1.1 (left panel) shows his results. He pores over his efforts, paints a target centered around his largest number of hits (Fig. 1.1, right panel), and claims to be a sharpshooter.

Do you agree that the Texan is a sharpshooter? Do you think that if he repeated his so-called
Fig. 1.1 The Texas sharpshooter fallacy
Fig. 1.2 Variables included in an exploratory dataset based on 95 patients with chronic coronary artery disease
target practice, he would again be able to get that many bullets in the circle? Note: the Texan defined his target only after he saw his results. He also ignored the bullets that were not in the cluster! This parable illustrates what epidemiologists call the Texas Sharpshooter Fallacy [16] to underscore the dangers of forming causal conclusions about cases of disease that happen to cluster in a population due to chance alone or to reasons other than the chosen cause. As per Atul Gawande, in his classic article in The New Yorker, of the myriad of cancer clusters studied by scientists in the United States, not one has convincingly identified an underlying environmental cause [17]. In a more general sense (and particularly germane to the activities of some biomedical researchers), the Texas Sharpshooter Fallacy is related to the "clustering illusion," which refers to the tendency of individuals to interpret patterns in randomness when none actually exists, often due to an underlying cognitive bias.

Consider a more clinical example: A resident inherits a dataset that contains information about 95 patients with chronic coronary artery disease. Figure 1.2 depicts the variables in that dataset. He believes that he could satisfy his research elective if he could draw inferences about this study group, though he has no a priori idea about what relationships would be most reasonable to explore. He recruits a friend who happens to have a statistical package installed on his computer, enters all of the variables in the dataset into a
multiple regression model, and comes up with some statistically significant findings, as noted below:

Ischemia severity and benefit of coronary artery bypass grafting (CABG): p < 0.001
Hair color and severity of myocardial infarction (MI): p < 0.03
Zip code and height: p < 0.04

He concludes that he has confirmed the hypothesis that there is a strong association between preoperative ischemia severity and benefit of coronary artery bypass grafting because not only was the obtained probability (p) value low, his hypothesis also makes clinical sense. He also decides that he would not report the other findings because, while also statistically significant, he cannot explain them. What methodological error has the resident made in drawing his conclusion?

The answer is that, analogous to the rifleman who defined his target only after the fact, the resident confirmed a hypothesis that did not exist before he examined patterns in his data. The fallacy would not have occurred if the resident had, in mind, a prior expectation of a particular association. It also would not have occurred had the resident used the data to generate a hypothesis and validated it, as he should have, with an independent group of observations if he wanted to draw such a definitive conclusion. This is an important distinction because the identification of an association between two or more variables may be the result of a chance difference in the distribution of these variables, and hypotheses identified this way are suggestive at best, not proven. What one cannot do is to use the same data to generate and test a hypothesis.

Moreover, the resident compounded his error by capitalizing on only one association that he found, ignoring all of the others. Working with hypotheses is like playing a game of cards. You cannot make up rules after seeing your hand, or change the rules midstream if you do not like the hand that you have been dealt. Similarly, if you gather your data first and draw conclusions based only on those you believe to be true, you have, in the words of the famed behavioral scientist, Fred Kerlinger, violated the rules of the scientific game [18]. The most important take-home point is that a hypothesis, if you wish to test it, always should be generated before data collection begins.

Hypothesis-testing studies (especially randomized clinical trials [RCTs]) are highly regarded in medicine because, when based on correct premises, properly designed, and adequately powered, they are likely to yield accurate conclusions [19]; in contrast, conclusions drawn from hypothesis-generating studies, even when well designed, are more tentative than those of hypothesis-testing studies due to the myriad of explanations (hypotheses) one can infer from the observation of a phenomenon.

For these reasons, hypothesis-generating studies are appropriately regarded as exploratory in nature. These differences notwithstanding, there is general consensus that hypothesis-testing and hypothesis-generating activities both are vital aspects of the research process. Indeed, the latter are the crucial initial steps for making discoveries in medicine. As Andersen [20] has correctly argued, without hypothesis-generating activities, there would be no hypotheses to test and the body of theory and knowledge would stagnate. The critical role of the hypothesis in the research process and the logical issues entailed in formulating and testing them are further discussed in Chap. 3.

Retrospective Versus Prospective Research

Research often is classified as retrospective or prospective. However, as pointed out by Catherine DeAngelis, former editor-in-chief of the Journal of the American Medical Association (JAMA), these terms are among the most frequently misunderstood in research [21] in part because they are used in different ways by different workers in the field and because some forms of research do not neatly fall within this dichotomy. Many methodologists [22, 23] consider research to be retrospective when data (typically recorded for purposes other than research) are generated prior to initiation of the study and to be prospective when data are collected starting with or subsequent to initiation of the study. Others, including
DeAngelis, prefer to distinguish retrospective from prospective research according to the investigator's and subject's orientation in the data acquisition process. According to the latter view, a study is retrospective if subjects are initially identified and classified on the basis of an outcome (e.g., a disease, mortality, or other event) and are followed backward in time to determine the relation of the outcome to exposure to one or more risk factors (genetic, biological, environmental, or behavioral); conversely, the study is prospective if it begins by identifying and classifying subjects on the basis of the exposure (even if the exposure preceded the investigation), with outcome(s) observed at a later point in time [21].

There are various types of retrospective studies. The simplest (and least credible from the standpoint of scientific evidence) is the case study (or case report), which typically provides instructive, albeit anecdotal, information about unusual symptoms not previously observed in a medical condition or new combinations of conditions within a single individual [24]. The case series (or clinical series) is an uncontrolled study that furnishes information about exposures, outcomes, and other variables of interest among multiple similar cases. Though lack of control precludes evaluation of cause and effect, this type of study can provide useful information about unusual presentations or infrequently occurring diseases and can be used to generate hypotheses for testing, using more rigorous studies [24]. The most common type of retrospective research used to draw inferences about the relation of prior exposures to diseases (and the most rigorous of the various retrospective designs) is the case-control study. In this type of investigation, a group of individuals who are positive for a disease state (e.g., lung cancer) is compared with a group comprised of those who are negative for that disease state (e.g., free of lung cancer). By looking back at the medical record, we attempt to determine differences in risk factors (e.g., prior exposure to cigarette smoke or asbestos) that may account for the disease. Because of the temporal sequence and interval between the factor and the outcome variable and the availability of a comparison group (e.g., nondiseased subjects), the case-control study can be used to infer cause and effect associations, though various biases (discussed in depth in Chap. 4) may limit its value for this purpose.

The two most typical examples of prospective research in clinical medicine are observational cohort and experimental studies. In an observational cohort study, subjects within a defined group who share a common attribute of interest (e.g., newly diagnosed cardiac patients, new dialysis patients) who are free of some outcome of interest are identified on the basis of exposure to risk factors whose presence or absence is outside the control of the investigator. These individuals are followed over time until the occurrence of an outcome (or outcomes) that usually (but not always) is measured at a later date. In an experimental study, outcomes also are assessed at a later date, but subjects initially are differentiated according to their exposure to one or more interventions which have been purposively applied. (Further distinctions between observational and experimental studies are discussed below.)

Prospective research is less prevalent in the literature than retrospective research principally due to its relatively greater cost. In most prospective studies, the investigator must invest the time and resources to follow subjects and sometimes even apply an intervention if the study is experimental. Moreover, prospective studies usually require larger sample sizes. Why, then, would anyone choose a prospective design over a retrospective approach? One reason is that prospective studies (particularly RCTs and concurrent cohort studies, described below) potentially have more control over temporal sequence and extraneous factors, including selection and recall bias, although loss to follow-up can be problematic. Second, prospective designs are more appropriate than retrospective designs for rare exposures and relatively more common outcomes. Finally, if it is desired that the exposure be manipulated by the investigator, as in an experimental study, the relation between exposure and outcome can be evaluated only with a prospective design.

In many prospective studies (all RCTs, many cohort studies), the exposure takes place coincident with or following the initiation of the study.
Fig. 1.3 Concurrent versus nonconcurrent prospective research (Reprinted with permission from [21])
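The point made above, that prospective designs suit relatively common outcomes, can be illustrated with simple arithmetic. The sketch below is not drawn from the source: it assumes a constant annual risk that is identical for every subject and no loss to follow-up, and the function name expected_cases and all of the numbers are hypothetical. It estimates how many outcome events a cohort of a given size would be expected to yield during follow-up.

```python
def expected_cases(cohort_size: int, annual_risk: float, years: int) -> float:
    """Expected number of subjects who develop the outcome during follow-up.

    Illustrative assumptions only: a constant annual risk that is the same
    for every subject, and no loss to follow-up.
    """
    # Probability that one subject develops the outcome within `years` years.
    cumulative_risk = 1 - (1 - annual_risk) ** years
    return cohort_size * cumulative_risk


# Hypothetical cohort of 1,000 subjects followed for 5 years.
common = expected_cases(1000, 0.02, 5)    # outcome affecting 2% of subjects per year
rare = expected_cases(1000, 0.0001, 5)    # outcome affecting 1 in 10,000 per year

print(f"Common outcome: about {common:.0f} expected cases")
print(f"Rare outcome: about {rare:.1f} expected cases")
```

Under these assumed figures, the cohort would be expected to yield roughly 96 cases of the common outcome but fewer than one case of the rare outcome, which suggests why a rare outcome usually is studied by starting from existing cases and looking backward.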
This type of prospective research has been termed concurrent [25, 26] because the investigator moves along in parallel with the research process (i.e., from application or assessment of the exposure to ascertainment of the outcomes associated with the exposure). In other instances, the exposure and even the outcomes will have taken place in the past, i.e., before the investigator's involvement in the study. If the logic of the study is to follow subjects from exposure to outcome, the research may be termed a nonconcurrent prospective study [25, 26], a historical cohort study, or a retrospective cohort study (departing from the view of prospective research held by DeAngelis and others). These distinctions are shown in Fig. 1.3.

Longitudinal Versus Cross-Sectional Research

As noted above, prospective studies sample members of a defined group at a common starting point (e.g., exposure to a putative risk factor or intervention) and follow them forward in time until the occurrence of a specified outcome (e.g., a disease state or event), whereas retrospective studies begin with existing cases and look back in time at the history of the subject to identify relevant exposures or other instructive trends. Both are examples of longitudinal research because subjects are examined on multiple occasions that are separated in time.

Not all studies have defined temporal windows between putative risk factors and outcomes. One that does not is the cross-sectional (or prevalence) study. With this approach, several variables are measured at the same point in time to determine their frequency and/or possible association within a group of individuals who are selected without regard to exposure or disease status. They are usually based on data collected in the past for other purposes but can be based on information acquired de novo. When used with large representative samples (to permit valid generalizations), cross-sectional studies can
provide useful information about the prevalence of risk factors, disease states, and health-related knowledge, attitudes, and behaviors in a specified population. Cross-sectional studies are prevalent in the literature principally because they are relatively quick, easy to conduct, and can be used to evaluate multiple associations. However, unlike the case-control study, where temporality between risk factor and outcome variables can be established (or at least inferred) in order to buttress a cause and effect relationship, cross-sectional studies are best suited for generating, rather than testing, such hypotheses [23].

Descriptive Versus Analytic Research

Research can be further subdivided into descriptive and analytical subtypes. In descriptive studies, the presence and distribution of characteristics (e.g., health events or problems) of a single group of subjects are examined and summarized (but are not intervened upon or otherwise modified) to determine who, how, and when they were affected and the magnitude of these effects. Descriptive studies can involve a single case or a large population. Though they are considered to be among the simplest types of investigation, they can yield fundamental information about an individual or group that is of importance when little is known about the subject in question. Modes of data collection for descriptive studies are primarily observational and include survey methods, objective assessments of physiological measures, and review of historical records. Methods of analysis include computation of descriptive statistics such as measures of central tendency and dispersion (quantitative studies) and verbal descriptions and content analysis (qualitative studies) [27]. Because descriptive studies contain no reference groups, they cannot be used to test hypotheses about cause and effect; however, they can be useful for hypothesis generation, thus providing the foundation for future analytic studies. Descriptive studies may be either retrospective or prospective. Retrospective descriptive studies include the single case study and case series formats. Prospective descriptive studies include natural history investigations that follow individual subjects or groups over time to determine changes in parameters of interest.

While descriptive studies attempt to examine what types of problems exist in a population, analytic studies attempt to determine how or why these problems came to be. Thus, the ultimate goal of analytic studies is to test prestated hypotheses about risk factors or interventions versus outcomes to elucidate causality. Analytic studies can be performed with two or more equivalent or matched comparison groups, in which case inferences are drawn on the basis of analysis of intergroup differences (comparative research) or by comparisons within a single group in which assessments are made over time before and after imposition of an intervention or a naturally occurring event. Analytic research can be retrospective (e.g., case-control studies) or prospective (e.g., observational cohort or experimental studies). Correlational analysis of cross-sectional data is classified as analytic by some [28] but not all [22] workers in the field.

Observational Versus Experimental Research

In this dichotomy, research is differentiated by the amount of control that the investigator has over the factors in the study by which the outcome variables are compared. In observational studies, the investigator is passive with respect to the factors of interest as these usually are naturally occurring risk factors or exposures outside of the investigator's control. He or she can identify them and measure them but cannot allocate subjects to treatment groups or deliberately manipulate a treatment to systematically study its effect. The investigator's sole responsibility is to select a design which can validly assess the impact of the risk factor on the outcome variable. In contrast, in experimental studies, the input of interest not only is measured or observed but is purposively applied by the investigator, who manipulates events by arranging for the intervention to occur
or, at the very least, arranges for random allocation of subjects to alternative treatment or control groups. As a consequence, most of the inherent differences that exist between comparison groups are minimized, if not eliminated, thereby providing greater capacity to determine cause and effect relationships between the intervention and the outcome. Unlike observational studies, which can either be prospective or retrospective, experimental studies, as noted earlier, always are prospective. Midway between observational and experimental studies is a methodology known as quasi-experimental research. With this approach, the investigator evaluates the impact of an intervention (e.g., a therapeutic agent, policy, program, etc.) which has been applied either to an entire population or to one or more subgroups on a nonrandom basis. Although he or she may have been directly involved in arranging the intervention, control is nonetheless suboptimal due to limitations in the quality of reference data; as such, inferences drawn from quasi-experimental studies, while stronger than those generated with purely observational data, are less robust than those drawn from true experimental investigations. Characteristics of the true experimental and quasi-experimental approaches are detailed more fully in Chap. 5.

Quantitative Versus Qualitative Research

Finally, research also can be differentiated according to whether the information sought is collected quantitatively or qualitatively. Quantitative research involves measurement of parameters (e.g., demographic, functional, geometric, or physiological characteristics; mortality, morbidity, and other outcome data; attitudes, knowledge, and behaviors) that have been obtained under standardized conditions by structured or semistructured instrumentation and that may be subjected to formal statistical analysis. Typically, numerous subjects are studied and the investigator's contact with them is relatively brief and minimally interactive to avoid introduction of bias. In contrast, qualitative research gathers information about how phenomena are experienced by individuals or groups of individuals (and the context of these experiences) based on open-ended (unstructured) interviews, questionnaires, observation, and focus group methodology. Fewer subjects are studied than with quantitative research, but the investigator's contact with them is longer and more interactive. As Portney and Watkins [29] have noted, quantitative methods can be used across the continuum of research approaches to describe, generate, and test hypotheses, whereas qualitative methods typically are used for descriptive or exploratory (hypothesis-generating) research. Quantitative and qualitative research each subsumes many different methodologies.

Steps in the Research Process

As mentioned earlier, research is structured by a series of methodological rules which govern the nature and order of procedures used in the investigation. It is, therefore, necessary that a plan be developed prior to the study which incorporates these procedures. This is true, irrespective of the type of research involved. The following is a brief listing of the steps, identified by DeAngelis [21], which comprise the research process in general and the hypothetico-deductive approach in particular:

In the first stages of the project, the investigator will:
1. Identify the problem area or question.
2. Optimally restate the question as a hypothesis.
3. Review the published literature and other information resources, including meeting abstracts and databases of funded resource summaries or blogs, to determine whether the hypothesis has been adequately evaluated or is in need of further study.
Prior to developing the research design, he/she will:
4. Identify all relevant study variables, knowledge of whose presence, absence, change, or interrelationship is the objective of the study.
In order to bring precision to the research, he/she will:
5. Construct operational definitions of all variables.
6. Develop a research design and analytic plan to test the hypothesis. The design will identify the nature and number of subjects from whom data will be obtained, the timing and sequence of measurements, and the presence or absence of comparison groups or other procedures for controlling bias. The analytic plan will define the statistical procedures to be performed on the data and must be prespecified to minimize the likelihood of reaching spurious conclusions.
7. If data collection instruments are available, they must be specified. If not, they must be constructed. (Data collection instruments include all tools used to collect relevant observations in the study such as physiological measurements, questionnaires, interviews, and case report forms, to name a few.)
8. A data collection plan, containing provisions for accrual of subjects and for recording and management of data, must be designed.
Only after these important preparatory steps have been taken should the investigator proceed to:
9. Collect and process the data.
10. Conduct statistical analysis to describe the dataset and test hypotheses.
11. After the data are analyzed, conclusions are drawn and these are related to the problem statement and/or hypotheses.
12. Finally, the research report is written and, if accepted after peer review, is presented and/or published as a journal article.

The importance of following a research plan was addressed by Marks [30], who described a number of typical planning errors and their negative consequences. To cite one example, Marks detailed the experience of an investigator who failed to receive renewal of his multiyear research grant because he could not report the results of the data analysis to the granting agency. This occurred because he failed to develop a mechanism for the storage, handling, and analysis of data. Due to staffing changes and other factors, some of the data were lost, and what was located had not been recorded uniformly. As a result, years of hard work were wasted. In a second example, addressing scheduling problems, Marks describes the failure of an investigator, studying the effects of a drug developed for patients undergoing elective coronary artery bypass grafting, to complete his research project within his specified time frame. Though the investigator had the foresight to calculate his required sample size and to estimate patient accrual rates, he made the mistake of allowing only 4 months to study 30 patients. Much to his chagrin, a poorly worded consent form submitted to his institutional review board (IRB) delayed him approximately 6 weeks and, by then, the number of nonemergency operations had dropped dramatically due to the winter holidays. After 4 months, only a quarter of his sample had been accrued, and no data analysis had been performed.

Other common problems associated with poor planning include inability to implement or complete a study (due to disregard of organizational, political, or ethical factors), loss of statistical power to confirm hypotheses (due to inadequate attention to patient accrual factors, attrition of subjects, or excess variability in the study population), ambiguity of findings (due to lack of operational definitions or nonuniformity of data collection procedures), and unsound conclusions brought about by weak research designs, among others.

Marks' vignettes about the adverse consequences of poor research planning depict errors that unfortunately are not uncommon. A number of years ago, in this author's first position as a research director (at an institution that I shall decline to name), I was asked to implement a research project, previously designed by a principal investigator (PI) who was senior to me at the time. The purpose of the project was to evaluate the impact of an in-hospital patient education program after a first myocardial infarction. Four hospitals were involved in the study: two intervention sites and two controls ("business as usual"). In the first phase, patients at Hospital A received the new educational program and patients in Hospital B did not. In the replication
phase, patients at Hospital C received the new intervention and patients at Hospital D did not. The instrument chosen to evaluate depression was the Beck Depression Scale and the instrument chosen to evaluate anxiety was the State-Trait Anxiety Scale. The study design compared responses before and after the educational program by site. Being schooled in psychometrics, I was concerned about the reliability and validity of these instruments for this population but was told that these had been extensively used and previously validated in other patient populations. I also had concerns about the quality of the experiences that patients were receiving at the control hospitals but was told that for "political reasons," we could not ask too many questions. Additionally, I had concerns about the implementation of the educational intervention but was told that this was firmly under the control of the nurse coordinator. I next argued for a pilot before launching this very costly and lengthy research project but was told that there was no time and that the PI did not wish to "waste patients."

And so the intervention proceeded according to protocol for well over 2 years. No interim analysis ever was performed because the PI thought that would be too expensive and waste time. When the primary data finally were analyzed, there were no detectable differences whatsoever between the outcomes obtained in the experimental versus control hospitals. The PI was horrified and did not understand how this could have happened. When the process data were analyzed post hoc, we learned that, due to staffing problems at the experimental sites, many nurses who were entrusted to implement the educational intervention had attended few, if any, in-service sessions about the intervention. Moreover, even though the new intervention had a beautifully designed curriculum that had been packaged in a glossy binder, it became known only after the fact that quality patient education also had taken place at Control Hospital B, and we never knew what was done at Control Hospital D, again, for "political reasons."

A final problem concerned the instrumentation. Though, in fact, both the Beck Depression and State-Trait Anxiety Scales had been validated, the validation had not been performed on patients shortly after an acute myocardial infarction. An analysis of baseline scores revealed that most patients were neither depressed nor anxious, apparently due to the unanticipated effects of sedation or denial. Thus, low scores on these primary measures (which clearly were administered too soon after the index event) could not possibly improve due to what are called "floor effects." Needless to say, the private foundation that funded this study was less than thrilled, and none of you have ever seen it in published form. Examples like these abound in research but usually are not reflected in the literature because aborted or incomplete research investigations are never published, and those failing to demonstrate statistically significant differences (or associations) are published far less often than those that do, a phenomenon known as publication bias [31], further discussed in Chap. 9.

A number of years ago, a pediatric emergency fellow at another area hospital approached me for assistance with a dataset that she had compiled over a 4-month period. The data profiled the presenting complaints, diagnoses, and disposition of a series of children who had presented to an emergency room after having complained of largely nonserious illnesses during school. I asked her for a copy of her protocol, but she told me that she did not have one because her study was a chart review, based on de-identified anonymous data and, therefore, was "IRB Exempt." I next asked her for a written copy of her research plan to which she responded, "I never developed one because my clinical mentor told me that it wasn't necessary, and I didn't know that I needed one." I asked her what schools the children had come from and who had made the decision to bring them to the emergency room, but she couldn't answer these questions because that information was not routinely included in the medical chart, which was the source of all of her data. I asked
her why she had selected a retrospective chart review as her study design, and she answered that the charts were readily available and that she hadn't thought about any other approach. I asked her why she thought the research study was worth doing, to which she responded, "I'm not sure, but maybe the data will encourage emergency physicians to better counsel parents and school officials who refer relatively healthy children to the emergency room and, thus, cut down on inappropriate visits."
Feeling sorry for her, I helped her to sort out whatever data that she had, and to write an abstract and manuscript that appeared to be respectable, at least superficially. The abstract was accepted at an international meeting (which had somewhat less stringent standards than domestic meetings in her specialty), but when she submitted her manuscript for publication in an academic journal, it was rejected. The reviewers correctly argued that without knowing who made the decision to bring the child to the emergency room, the study had failed its primary objective, which was to furnish information that potentially could alter decision-making patterns for this patient population. Had the fellow developed a proper research plan in the first place, she would have better conceptualized her study and saved months of her time on what was essentially a fruitless undertaking.
The moral posed by these stories is that adequate planning is vital for achieving research objectives and for minimizing the risk of wasting time and resources. As Marks correctly argues, "The success of a research project depends on how well thought out a project is and how potential problems have been identified and resolved before data collection begins" [30].
In subsequent chapters, we will consider many of the fundamental concepts, principles, and issues involved in planning and implementing a well-designed study. It is hoped that awareness of these factors will help you to achieve your research objectives, minimize your risk of wasting time and resources, and result in a more rewarding research experience.
Take-Home Points
• Research is a rigorous problem-solving process whose ultimate goal is the discovery of new knowledge.
• Research may include the description of a new phenomenon, definition of a new relationship, development of a new model, or application of an existing principle or procedure to a new context.
• Research is systematic, logical, empirical, reductive, replicable and transmittable, and generalizable.
• Research can be classified according to a variety of dimensions: basic, applied, or translational; hypothesis generating or hypothesis testing; retrospective or prospective; longitudinal or cross-sectional; observational or experimental; quantitative or qualitative.
• The ultimate success of a research project is heavily dependent on adequate planning.
References
1. Calvert J, Martin BR (2001) Changing conceptions of basic research? Brighton, England: Background document for the Workshop on Policy Relevance and Measurement of Basic Research, Oslo, 29-30 Oct 2001. Brighton, England: SPRU.
2. Leedy PD. Practical research. Planning and design. 6th ed. Upper Saddle River: Prentice Hall; 1997.
3. Tuckman BW. Conducting educational research. 3rd ed. New York: Harcourt Brace Jovanovich; 1972.
4. Tanenbaum SJ. Knowing and acting in medical practice. The epistemological policies of outcomes research. J Health Polit Policy Law. 1994;19:27-44.
5. Richardson WS. We should overcome the barriers to evidence-based clinical diagnosis! J Clin Epidemiol. 2007;60:217-27.
6. MacCorquodale K, Meehl PE. On a distinction between hypothetical constructs and intervening variables. Psych Rev. 1948;55:95-107.
7. The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research: The Belmont Report: Ethical principles and guidelines for the protection of human subjects of research. Washington: DHEW Publication No. (OS) 78-0012, Appendix I, DHEW Publication No. (OS) 78-0013, Appendix II, DHEW Publication (OS) 78-0014; 1978.
8. Coryn CLS. The fundamental characteristics of research. J Multidisciplinary Eval. 2006;3:124-33.
9. Smith NL, Brandon PR. Fundamental issues in evaluation. New York: Guilford; 2008.
10. Committee on Criteria for Federal Support of Research and Development, National Academy of Sciences, National Academy of Engineering, Institute of Medicine, National Research Council. Allocating federal funds for science and technology. Washington, DC: The National Academies; 1995.
11. Busse R, Fleming I. A critical look at cardiovascular translational research. Am J Physiol Heart Circ Physiol. 1999;277:H1655-60.
12. Schuster DP, Powers WJ. Translational and experimental clinical research. Philadelphia: Lippincott Williams & Wilkins; 2005.
13. Woolf SH. The meaning of translational research and why it matters. JAMA. 2008;299:211-21.
14. Robertson D, Williams GH. Clinical and translational science: principles of human research. London: Elsevier; 2009.
15. Goldblatt EM, Lee WH. From bench to bedside: the growing use of translational research in cancer medicine. Am J Transl Res. 2010;2:1-18.
16. Milloy SJ. Science without sense: the risky business of public health research. In: Chapter 5, Mining for statistical associations. Cato Institute. 2009. http://www.junkscience.com/news/sws/sws-chapter5.html. Retrieved 29 Oct 2009.
17. Gawande A. The cancer-cluster myth. The New Yorker, 8 Feb 1999, p. 34-37.
18. Kerlinger F. [Chapter 2: problems and hypotheses]. In: Foundations of behavioral research 3rd edn. Orlando: Harcourt, Brace; 1986.
19. Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2:e124. Epub 2005 Aug 30.
20. Andersen B. Methodological errors in medical research. Oxford: Blackwell Scientific Publications; 1990.
21. DeAngelis C. An introduction to clinical research. New York: Oxford University Press; 1990.
22. Hennekens CH, Buring JE. Epidemiology in medicine. 1st ed. Boston: Little Brown; 1987.
23. Jekel JF. Epidemiology, biostatistics, and preventive medicine. 3rd ed. Philadelphia: Saunders Elsevier; 2007.
24. Hess DR. Retrospective studies and chart reviews. Respir Care. 2004;49:1171-4.
25. Wissow L, Pascoe J. Types of research models and methods (chapter four). In: An introduction to clinical research. New York: Oxford University Press; 1990.
26. Bacchieri A, Della Cioppa G. Fundamentals of clinical research: bridging medicine, statistics and operations. Milan: Springer; 2007.
27. Wood MJ, Ross-Kerr JC. Basic steps in planning nursing research. From question to proposal. 6th ed. Boston: Jones and Bartlett; 2005.
28. DeVita VT, Lawrence TS, Rosenberg SA, Weinberg RA, DePinho RA. Cancer. Principles and practice of oncology, vol. 1. Philadelphia: Wolters Kluwer/Lippincott Williams & Wilkins; 2008.
29. Portney LG, Watkins MP. Foundations of clinical research. Applications to practice. 2nd ed. Upper Saddle River: Prentice Hall Health; 2000.
30. Marks RG. Designing a research project. The basics of biomedical research methodology. Belmont: Lifetime Learning Publications: A division of Wadsworth; 1982.
31. Easterbrook PJ, Berlin JA, Gopalan R, Matthews DR. Publication bias in clinical research. Lancet. 1991;337:867-72.
2 Developing a Research Problem
Phyllis G. Supino and Helen Ann Brown Epstein
P.G. Supino, EdD, Department of Medicine, College of Medicine, SUNY Downstate Medical Center, 450 Clarkson Avenue, Box 1199, Brooklyn, NY 11203, USA; e-mail: phyllissupino@aol.com
H.A.B. Epstein, MLS, MS, AHIP, Clinical Librarian, Samuel J. Wood Library and C.V. Starr Biomedical Information Center, Weill Cornell Medical College, New York, NY, USA
Origins of Research Problems
A well-designed research project, in any discipline, will begin by conceptualizing the problem: in its most general sense, an unresolved issue of concern (e.g., a contradiction, an unproven relationship, an unclear mechanism, a puzzling or enigmatic state) that warrants investigation. This intellectual activity arguably is the most critical part of the study, and many researchers consider it to be the most difficult. This is particularly true in the early stages of a developing science when theoretical frameworks are poorly articulated and when there is little in the literature about the topic. Although formal rules and procedures exist to guide the development of the research design, data collection protocol, and statistical approach, there are few, if any, guidelines for conceptualizing or identifying research problems, which may take years of thought and exploration to define.
In his discussion of how problems are generated in science, Kerlinger described the personal and, often, unsettling nature of the birth of the research problem:
"The scientist will usually experience an obstacle to understanding, a vague unrest about observed and unobserved phenomena, a curiosity as to why something is as it is. His first and most important step is to get the idea out in the open, to express the problem in some reasonably manageable form. Rarely or never will the problem spring full-blown at this stage. He must struggle with it, try it out, live with it. Sooner or later, explicitly or implicitly, he states the problem, even if his expression of it is inchoate or tentative. In some respects, this is the most difficult and most important part of the whole process. Without some sort of statement of the problem, the scientist can rarely go further and expect his work to be fruitful" [1].
Kerlinger's comments point up an important but, nonetheless, poorly recognized fact. Namely, one of the most challenging aspects of the research process is to develop the idea for the research in the first place.
So, from where do research problems come? In general, most spring from the intellectual curiosity of the investigator and, of necessity, are shaped by his or her critical reasoning skills, experience, and environment. Probably the most common source of clinical research problems is the plethora of practical issues that clinicians confront in managing patients which mandate data-driven decisions. For example, among cardiologists, there has been long-standing interest in optimizing management of patients with known or suspected coronary disease. What are the best
algorithms and diagnostic modalities for differentiating symptoms of myocardial ischemia from symptoms that mimic ischemia? When should such patients be medically managed and when should they undergo invasive therapeutic procedures? What is the risk-benefit ratio of percutaneous coronary angioplasty vs. coronary artery bypass grafting? How often and how should patients undergoing these procedures be evaluated after intervention? What patient-level, societal, and economic factors influence these decisions? Issues such as these have enormous public health implications and have spawned hundreds of research studies.
Research problems also can be generated from observations collected in conjunction with medical procedures [2]. A radiologist might have a set of interesting data collected in conjunction with a new imaging modality (e.g., full-field digital mammography) and might wish to know how much more sensitive and specific this new modality is vs. older technology for breast cancer screening. Alternatively, he might be interested in a new application of an existing modality. A thoracic surgeon may have outcomes data available from two competing surgical techniques. The process of critically thinking about these data, sharing them with colleagues, and obtaining their feedback can lead to interesting questions for analysis and stimulate additional research.
Another source of research problems is the published scientific literature, where an observed exception to the findings of past research or accepted theory, unresolved discrepancies between studies, or a general paucity of quality data on a clinically significant topic can motivate thinking and point to an opportunity for future study. In addition, most well-crafted manuscripts typically document limitations in the investigation (e.g., potential selection bias, inadequate sample size, low number of endpoint events, loss to follow-up) and may suggest areas for future research. Thus, thoughtful review of published research can point to gaps in knowledge that potentially could be filled by new investigations designed to refine or extend previous research.
Research problems also can be suggested by governmental and private funding agencies which publish requests for proposals (RFPs) or applications (RFAs) to address understudied areas affecting the public health. These publications will explicitly identify a problem that the agency would like an investigator to address, provide a background and context for the problem, stipulate a study population (as well as, on occasion, specify the approach to be taken), and indicate the level of support offered to the potential investigator.
Finally, research problems can be fostered by environments that stimulate an open interchange of ideas. These environments include scientific sessions conducted by professional societies and organizations, grand rounds given at hospitals and medical schools, and other conferences and seminars. In recent years, methodological approaches such as brainstorming, Delphi methods, and nominal group techniques [3-5] have been developed and sometimes are utilized to facilitate the rapid generation (and prioritization) of research problems by individuals and groups.
Characteristics of Well-Conceived Research Problems
Although the genesis of a research problem is a complex, variable, and inherently unpredictable process, fortunately, there are generally agreed-upon criteria, described below, for evaluating the merits of the problem once it has been generated [6-8]. Attention to these at the outset will ensure a solid footing for the remainder of the investigation.
The Problem Should Be Important
The most significant characteristic of a good research problem is importance. A clinical research problem is considered important if its resolution has the potential to clarify a significant issue affecting the public health and, ultimately, cause the clinician (or health-care policy maker) to make a decision or undertake an action that he or she would not have made or undertaken had the problem not been addressed. The greater the
need for clarification and the larger the number of individuals potentially impacted (i.e., the greater the disease burden), the more important the problem. For this reason, when research proposals are submitted to a funding agency or when research manuscripts are submitted to a journal for publication, perceived importance of the problem is heavily weighted during the peer-review process. Indeed, importance of the problem typically overshadows other criticisms such as incomplete consideration of the literature, suboptimal methodology, and poor writing style, as these flaws often can be remedied. Studies that merely replicate other studies, with no significant alteration in methods, content, or population (or that reflect only a minor incremental advance over previous information) are considered unimportant and tend to fare poorly in the peer-review process. This is true even if the study is well designed. This point is illustrated below by the divergent comments actually made by a reviewer in response to two different manuscripts submitted for publication to a cardiology journal:
Manuscript #1: "This is a superb contribution which adds importantly to our knowledge about the pathophysiology of heart failure. The results of this well-focused study are of great clinical importance." (Recommendation: Accept)
Manuscript #2: Comment: "Despite a great deal of very precise and laborious effort and the generation of an extraordinary mass of numbers ... little forethought was given to the focus or importance of the questions to be asked ... The finding is not unexpected, having been suggested by several earlier studies which have evaluated the issue of regional performance in different ways ... (Thus,) the authors' observations add little that is important or useful to the currently available literature." (Recommendation: Reject)
Evaluating the importance of a research problem requires considerable knowledge of and experience in the discipline. For this reason, the new investigator should seek the assistance of mentors and other experts early on to maximize the likelihood that the proposed research will be fruitful.
The Problem Should Be Interesting
As Hulley and Cummings have noted, a good research problem, especially if suggested by someone else, must be interesting to the investigator "to provide the intensity of effort needed for overcoming the many hurdles and frustrations of the research process" [7]. It also should be interesting to:
• The investigator's peers and associates to attract collaborators
• Senior scientists at the investigator's institution who can provide necessary mentorial support to guide the study (if the investigator is relatively junior)
• Potential sponsors to motivate them to fund the study (if outside funding is sought)
• Fellow researchers within the larger scientific community who, ultimately, will read and judge its findings
• Individuals outside the scientific community (e.g., clinicians in private practice, policy makers, the popular media, and consumers) who, optimally, will consider, disseminate, and/or utilize the eventual products of the research (if the problem is applied or translational in nature)
Gauging the potential interest of a research problem is difficult because, as Shugan has noted, "no research findings are innately interesting. Instead, they are interesting only relative to a particular audience within some context that they define" [9]. While research can be interesting simply because it is new, in general, a research problem will tend to be viewed as noteworthy if it impacts a wide audience, has the potential to cause significant change in what members of that audience will do [9] (i.e., has importance), and is clearly framed within the context of a current "hot-button" issue (or an older but nonetheless viable issue). Before investing substantial time pursuing a research problem, it is advisable that new researchers check with their mentors and/or other experienced investigators with broad insights into the general area of inquiry to confirm that the problem satisfies these criteria and, thus, is likely to be interesting to others [10].
The Problem Should Lead to Clear, Researchable Questions
Many workers in the field use the terms "research problem" and "research question" interchangeably. Others view the research problem as an assertion about an issue of perceived importance that implies a gap in knowledge from which questions may be developed (the position taken in this chapter). Whichever view is held, there is general consensus that a research problem should be clearly defined (see section "Crafting the Problem and Purpose Statements" at the end of this chapter) and serve as a springboard for questions whose answers can be found by conducting a study [7, 10-12]. Ellis and Levy [13] argue that research questions are important because they serve to operationalize the goals of the study by narrowing them into specific areas of inquiry. Leedy and Ormrod [14] assert that attaining answers to research questions both satisfies the goals of the study and generally contributes to problem solving within the area of interest. Kerlinger and Lee [15] further contend that an investigation has meaning only when there is a clear nexus between the answers obtained to the research questions and the primary research problem. Like the problem itself, the questions should be clear, concise, optimally lead to testable hypotheses, and collectively capture the overall goal or purpose of the research project. In so doing, they serve to guide the methodology used in the study. The reader should note that a distinction is drawn between a research question and practical or methodological questions that arise during the design or implementation phases of the research (e.g., How many subjects are needed to provide sufficient power for testing the hypothesis or to achieve a given level of precision for estimating a population parameter? Which approaches are best for enhancing patient recruitment, improving follow-up, and reducing the likelihood of missing data? Given the investigator's constraints, what study design(s) should be used to control for threats to valid inference? Which statistical approach or approaches are most appropriate given the nature of the data?). These and related methodological issues are discussed, in depth, in other chapters of this book.
The Problem Should Be Feasible
A research problem (or a research question) should be feasible in two respects. As Sim and Wright [16] have noted, it should be feasible on a conceptual-empirical level, meaning that the concepts and propositions embodied in the research should be susceptible to empirical evaluation. As indicated in Chap. 1, it is the empirical quality of research that differentiates it from other problem-solving processes. Accordingly, it is important that the research question(s) central to the problem be answerable and that answers to the question(s) be generated by the acts of data collection and analysis (i.e., be produced empirically). These criteria are sometimes difficult to satisfy. In order for a question to be answerable, it must be clear, precise, and have a manageable set of possible answers (the latter criterion also relates to the issues of feasibility and scope, described below). The answer or answers also must be inherently knowable and measurable. The question, "how many angels can dance on the head of a pin" is philosophically interesting, but it is neither knowable nor measurable since there is no way to count angels, assuming that they existed in the first place. The question also must be framed in such a way that it will be obvious what type of data are needed to answer it, and it must be possible to collect empirical evidence that, when analyzed, will make a convincing argument when interpreted in relation to that question. In order for empirical evidence to be gathered, suitable subjects (for a clinical study) or material (for a preclinical study) must be available, and valid and reliable instruments must exist or be developed for measurement of the elements that comprise the question. If these elements cannot be measured, the question cannot be answered empirically. Examples of problems that would be difficult to evaluate are:
• How well do patients adjust to life following an initial myocardial infarction?
• Following death of the cancer patient, how well do spouses handle their grief?
Both "adjustment" and "handling grief" clearly are difficult to evaluate by empirical means, unless operational definitions and objective measures are developed for both terms. In a
similar vein, questions soliciting opinions (e.g., "what should be done to improve the health of a specific population?") and value-laden questions such as "should terminally ill comatose patients be disconnected from life support?" certainly are important and make excellent subjects for argument. However, they (like any question including the word "should") are not always assessable empirically and may require special methods for data gathering (e.g., qualitative techniques).
The problem also should be feasible on a practical level [16]. An investigator must decide, early on, if he or she has the resources to address it within a realistic time frame and at a reasonable cost. A primary determinant of feasibility is the scope of the proposed problem. In planning a research study, it is important to avoid selecting a problem that is too broad because a single investigation cannot possibly provide all relevant information about a problem. The process of identifying the problem can raise ancillary questions that may be of interest to the investigator, but it is important to prioritize these and reserve some for future research so that the time and resources of the investigator are not strained. An axiom in research planning is that it is better to provide quality answers to a small number of questions than to provide inferior information in volume. For example, should an investigator wish to study the effect of drug therapy on patients with heart disease, the question "What is the effect of drug therapy on patients with poor heart function?", while conceptually interesting and clinically important, is much too broad for one study and, in fact, would require hundreds of investigations to answer adequately. The investigator would do well to narrow the problem to include a given class of drugs (e.g., adrenal steroids), a specific index of heart function (e.g., left ventricular performance), and a specific population (e.g., patients with chronic severe aortic regurgitation). On the other hand, the problem should not be too narrowly defined. A question such as "what are the effects of Inderal on the change in ejection fraction from rest to exercise in 75-year-old Queens residents?" probably would result in a criticism of the study as trivial.
The scope of a study can be gauged by the number of subproblems (discrete areas of inquiry within the investigation) needed to express the main problem. If the number of subproblems exceeds six, there is high likelihood that the problem is too broad. In contrast, if an investigator is unable to define a minimum of two subproblems, it may be too narrow [17].
The issue of scope of the problem has direct practical implications for the researcher. Even if the problem is important and empirically testable, the investigator must balance these factors against the cost of doing the research. Long before data are collected, the researcher must decide whether he or she has the time or resources to collect and analyze the data.
Factors affecting time include:
• The interval needed for subject accrual
• The time involved in administering the intervention (if the research is experimental)
• The time involved in collecting data on inputs such as risk factors (if the research is observational)
• The time involved in assessing outcome
Factors potentially affecting resources include:
• Costs of accruing and managing subjects (purchasing and housing of animals for a preclinical study, reimbursing human subjects for participation in a clinical research study)
• Cost of the intervention (if any)
• Costs of measurement procedures
• Cost of data collection, processing, and analysis
• Costs of equipment, supplies, and travel
• Technical expertise (the investigator's own research skills or access to skilled collaborators or consultants)
One way an investigator can determine feasibility is by conducting a pilot study. A pilot study (sometimes called a feasibility study) typically attempts to determine whether it is possible to address the research problem (or subproblems) under conditions approximating those of the larger, proposed study but with a smaller number of subjects over an abbreviated period of time. The pilot can provide information about the complexities of patient recruitment and the appropriateness of data
collection procedures (including acceptability of the research instruments to the study subjects and approaches to detecting endpoints and resolving issues associated with follow-up), and obtain preliminary estimates of morbidity and event rates (among other variables) that can be useful in informing sample size calculations for future investigations. Occasionally, the pilot will produce preliminary answers to the proposed research questions. If the investigator concludes that examining the problem is unaffordable or is unfeasible time-wise, he or she should consider modifications that may include:
• Delimiting the scope of the problem
• Broadening the inclusion criteria
• Relaxing the exclusion criteria
• Adding additional study sites
• Altering the study design used to address the problem (e.g., from a prospective to a retrospective design or from parallel group comparison to a repeated measures design to permit assessment of outcomes with fewer subjects [the pros and cons of these approaches are discussed in further detail in Chap. 5])
If successful, the results of a pilot study can be helpful in convincing a potential funding agency that the proposed research is feasible and, depending on the nature of the preliminary data, that the hypotheses are likely to be confirmed by a larger study conducted by the same investigators.
Another way to "try out" a research question is to present an idea or preliminary data in a poster or "emerging ideas" section of a professional meeting. Thoughts exchanged during a "curbside chat" may crystallize an idea and may lead to valuable networking connections. Social media, like wikis (collaborative, directly editable websites) and blogs (online personal journals), are rich platforms to float ideas and exchange comments. An example of a wiki is Medpedia: an open platform connecting people and information to advance medicine (see www.medpedia.com). Useful blogs include Medical Discoveries (www.medicalhealthdiscoveries.com), Public Library of Science (PLoS Blog, accessible at http://blogs.plos.org), Discovery Buzz (http://discoverybuzz.com/blog), and Trust the Evidence (http://blogs.trusttheevidence.net).
Examination of the Problem Should Not Violate the Ethical Standards of the Scientific Community
The investigator may be interested in a problem that has significant scientific or medical importance, but addressing it might expose patients to significant risk. For example, a psychiatrist might be interested in the effects of a particular psychotropic drug on patients with obsessive-compulsive disorder. She believes that examination of this problem is both clinically relevant and scientifically important because review of the existing literature suggests that the agent not only has the potential for reducing symptoms but also might provide insights into the underlying processes related to this illness. Pilot data, however, suggest that this drug is highly addictive and, in addition, may adversely affect certain organ systems. Thus, despite scientific merit, the conclusions generated might be at the expense of the overall well-being of the subject. According to accepted standards of scientific conduct, the study should not be done. These rules apply in industrial, as well as in academic, settings. Thus, in the USA, when a pharmaceutical company launches a new drug, it is required by the Food and Drug Administration to perform highly regulated trials of feasibility (phase I) and safety (phase II), before proceeding to a large, randomized phase III efficacy trial. Generally, if the drug produced significant toxicity in patients prior to or during the phase III trial, the investigation would be aborted at that time, despite otherwise beneficial effects. Similar guidelines are followed in most Western European countries. Likewise, prior to conducting research in most academic medical centers, an investigator is required to obtain approval of his or her research protocol from that center's institutional review board (IRB), particularly when that protocol poses more than minimal risk to the subject. During this approval process, the
ethical considerations entailed in studying the problem are heavily weighed. In clinical studies, these typically include:
• Proportionate risk: Is the risk to the subject outweighed by the potential benefit to that subject? If your IRB concludes that it is not, the study would not be permitted to go forward, despite its possible benefit to the same patient in the future or to society in general.
• Informed consent: Is the subject truly aware of the aims of the study? If so, is the subject also aware of the potential for any adverse consequences that might arise due to his or her participation? Several years ago, it came to light that a research investigation, undertaken at a medical center in New York, had been conducted on 28 adult schizophrenics who were not advised that they were participating in a study in which psychosis was temporarily induced [18]. The ethics of performing research on such vulnerable subjects, without their full knowledge, triggered a firestorm of controversy that caused their IRB to mandate an entirely new approach to studies of this nature.
• Role reversal: Would the investigator be willing to trade places with the subject? Would he or she be willing to suffer the same pain, discomfort, or, at the very least, inconvenience as the subject, as a result of participating in his or her own research study?
• Integrity of the design (validity): Is the study designed well enough to warrant the expenditure of time and effort, or the potential risk to the patient (i.e., is it likely to yield valid answers to the questions being asked?) If not, not only may the investigators be wasting their own time and that of their subjects, they also may be producing results that have the potential to mislead the medical community and, ultimately, their patients.
These and other ethical problems will be explored more fully in Chap. 12.

Types of Research Questions

Research questions in any discipline may be categorized in multiple ways. Trochim [19] has argued that all research questions may be classified as descriptive (What is occurring? What exists?), relational (What is the association between two or more variables? Is the predictive value of one variable greater than or independent of another variable?), or causal (Does a treatment, program, policy, etc., affect one or more outcomes?). Blaikie [20] contends that all research questions can be classified as inquiries about "what," "why," or "how." According to this trichotomy, "what" questions describe presence, magnitude, and variations of characteristics in individuals, patterns in the relationships among these characteristics, and associated outcomes; "why" questions ask about causes of, or reasons for, the existence of phenomena, explanations about relationships between events, and mechanisms underlying various processes, whereas "how" questions deal with methods for bringing about desired changes in outcomes via intervention. Research questions also can be classified according to the type of inferences to be drawn. In medicine, for example, questions characteristically target issues about magnitude of disease burden, prevention, or patient management. Thus, questions may be asked about prevalence and incidence of a disease (or diseases) in a population:
• What influenza virus was most dominant in 2010?
• How many types of respiratory illness have been identified among the World Trade Center Disaster first responders?
• How many cases of breast cancer that were identified in Long Island, New York, occurred in Suffolk County?
• Is resistant tuberculosis on the rise in New York City?
• Is AIDS in Africa still considered to be an epidemic?
Questions also can focus on issues of primary prevention:
• Does use of margarine instead of butter protect against hypercholesterolemia and hypertriglyceridemia?
• Does use of hormone replacement after menopause protect against the development of cardiovascular diseases among women?
• Is physical fitness protective against osteoporosis?
• Does application of dental sealants actually prevent the development of tooth decay?
• Have current local and global interventions and services reduced the transmission and acquisition of HIV infection?
Questions of most interest to clinicians, however, typically center on issues related to the clinical management of patients with known or suspected diseases. Borrowing from an evidence-based practice framework, these can be subcategorized as questions about screening/diagnosis, treatment, prognosis, etiology, or harm (from treatment) [21]. Examples are given below:
• What is the most cost-effective way to differentiate children who are at risk for developmental delays from those who are not? (screening)
• What are the sensitivity, specificity, and positive and negative predictive values of positron emission tomography [PET] among women with suspected coronary artery disease? What is the diagnostic accuracy of PET vs. other available tests such as thallium scintigraphy? (diagnosis)
• What is the best (most effective, tolerable, cost-effective) currently available chemotherapy regimen for acute myeloid leukemia? (treatment)
• Is combination therapy better than single agent therapy for benign prostatic hypertrophy? (treatment)
• What is the probable clinical course of patients with aortic stenosis? (prognosis)
• Which patients with chronic, severe aortic regurgitation progress most rapidly to surgical indications? (prognosis)
• Is autoimmunity causally related to the development of Crohn's disease? Is it also implicated in the development of lupus and rheumatic arthritis? (etiology)
• Do enzymes involved in the synthesis of the extracellular matrix play a role in the development of fibrotic diseases and cancer? (etiology)
• What is the magnitude of risk for adverse outcome of carotid endarterectomy among the elderly? (harm)
• What is the in-hospital mortality associated with valvular replacement? Is it greater with concomitant coronary artery bypass grafting? (harm)

Role of the Literature Search

Even if the research problem was sparked by previously published research, once its basic elements have been defined, it is necessary to conduct a comprehensive search of the literature to acquire a thorough knowledge of relevant earlier findings, ongoing research, or new theories. Although there is no set rule governing the optimal time frame for a literature search or the number of publications to be included, there is general consensus that the search should be of sufficient length and breadth to include existing pertinent seminal and landmark studies [22] as well as current studies in the field (i.e., those conducted within the past 10 years). A proper literature search will help the investigator to determine answers to the following questions:
• Has the problem been previously addressed? If so, was it adequately studied?
• Are the proposed hypotheses, if any, supported by current theory or knowledge?
• Does the methodology cited in the literature provide guidance on available instrumentation for measuring variables?
• Are the results of prior studies informative for calculation of sample size and power?
• Did previous investigators describe the limitations of their research or suggest areas for future study?
Seeking answers to these questions early in the planning process will enable the investigator to determine whether performance of the present study is feasible, whether it is likely to significantly contribute to the existing knowledge base (thus supporting the need for the study), and also whether it may provide guidance on the construction of hypotheses and choice of study design. In addition, creating an automatic search profile early in the planning process will keep the investigator informed about the latest research related to his or her problem. The search profile will
generate updated lists of new literature and provide alerts to these updates via e-mail or RSS feed on a daily, weekly, or monthly basis, as desired. The updates also can be used to alert the investigator to research performed by other investigators and provide an opportunity for collaboration.

Like other aspects of a research project, the performance of a proper literature search requires a significant investment of time and effort. This is true in part because the results of most scientific investigations (particularly those reflecting recent work or primary literature) are dispersed over a myriad of e-mail communications, meeting abstracts, web documents, and periodicals, rather than organized collectively in books or other single sources of research. Traditionally, if an investigator needed to learn more about earlier related work, he or she would begin by examining key references cited in known relevant published studies. Today, continuing this principle of "it only takes one good article to get you going," online systems like PubMed from the National Library of Medicine, ISI Web of Knowledge from Thomson Reuters, the EBSCOhost family of databases from EBSCO Publishing, and the databases of Ovid Technologies, Wolters Kluwer Health, and Google Scholar, generate a list of possible important citations and invite you to click on the "related articles" link or the "times cited" link to find similarly indexed papers or cited references from these papers to locate additional relevant citations. A summary of selected core online resources is provided in Table 2.1.

Most investigators will choose to search MEDLINE, the premier bibliographic database from the National Library of Medicine. It is available by searching PubMed, ISI Web of Knowledge, EBSCOhost, and Ovid plus many other free or fee-based searching systems. The database covers the life sciences with a concentration in biomedicine. Bibliographic citations with author abstracts and linking to full text of many articles come from more than 5,400 biomedical journals published in the USA and around the world. Most citations are written in English with English abstracts. MEDLINE contains over 21 million citations dating back to the mid 1940s. For more information about PubMed, see www.pubmed.gov. Many of the MEDLINE citations in PubMed link to the Gene, Nucleotide, and Protein databases from the National Center for Biotechnology Information (NCBI) for coverage of molecular biology. Google Scholar pulls in freely available scholarly literature from PubMed and other sources, with some linking to the full text of the articles.

MEDLINE may not provide adequate information about a research problem. Thus, many investigators consider searching EMBASE in lieu of or in addition to MEDLINE (which now is included within EMBASE). EMBASE is created by Excerpta Medica and produced by Elsevier. One can subscribe to it individually from Elsevier or through Ovid from Wolters Kluwer Health in three separate databases: EMBASE, EMBASE Drugs and Pharmacology, and EMBASE Psychiatry. There are over 24 million indexed records from more than 7,500 current, mostly peer-reviewed journals covering biomedical and pharmacological literature. In addition, there is extensive coverage of meeting abstracts. Like MeSH from MEDLINE, EMBASE uses a hierarchical classification of subject headings called EMTREE that can be expanded. EMBASE can be searched with significant words, significant phrases, and EMTREE terms. Links to full text of the journal articles are available from many medical libraries.

An investigator may also consider searching BIOSIS Previews, Biological Abstracts, and Zoological Record together as a package from ISI Web of Knowledge, a product of Thomson Scientific. This resource represents a comprehensive index to the life sciences and biomedical research, including meeting abstracts, journals, books and patents, and contains more than 18 million records taken from more than 5,000 international resources from 90 countries (1926 to present). BIOSIS Previews is available by searching the Ovid suite of databases and ISI Web of Knowledge.

Web of Science's Science Citation Index Expanded, part of ISI Web of Knowledge from Thomson Reuters, covers scientific literature from 1900 to present. An investigator can search
Table 2.1 Selected core online resources

Name of resource | Description | Link to full text | Producer | Fee
BIOSIS | Bibliographic database: suite includes Biological Abstracts (1926-present), BIOSIS Previews (1926-present), Zoological Record (1864-present) | Yes | Thomson Reuters; Ovid Technologies (Wolters Kluwer) | Yes
CINAHL | Bibliographic database for nursing and allied health disciplines | Yes | EBSCOhost | Yes
Cochrane Library | Family of systematic reviews, RCTs, health technology assessments, economic assessments | Yes | Wiley; Ovid Technologies (Wolters Kluwer) | Yes
EMBASE | Bibliographic database with international coverage. Emphasis on biomedicine and drugs | Yes | EMBASE (available from various vendors) | Yes
MEDLINE | Bibliographic database of clinical medicine | Yes | National Library of Medicine (available from various vendors) | No
PsycInfo | Bibliographic database of scholarly journal articles, book chapters, and dissertations in behavioral science and mental health | Yes | American Psychological Association (available from various vendors) | Yes
PubMed | Premier database of biomedical literature, primarily MEDLINE (1947-present) | Yes | National Library of Medicine | No
Social Science Citation Index | Bibliographic database with links to citations in bibliography and items cited | Yes | Thomson Reuters | Yes
Web of Science | Bibliographic database indicating number of references, number of times cited (1900-present) | Yes | Thomson Reuters | Yes
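
The resources in Table 2.1 are generally searched by combining significant words, phrases, and subject headings with Boolean operators. The following is only a minimal sketch of that idea: synonymous terms are joined with OR and distinct concepts with AND. The function name build_query and the example topic are hypothetical, and each search system has its own syntax for phrases and subject headings, so the resulting string may need adapting.

```python
def build_query(concepts):
    """Combine search concepts into one Boolean query string.

    `concepts` is a list of groups; each group lists synonymous terms.
    Terms within a group are joined with OR; groups are joined with AND.
    """
    groups = []
    for terms in concepts:
        # Quote multi-word phrases so they are kept together as phrases.
        quoted = [f'"{t}"' if " " in t else t for t in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


# Hypothetical topic, chosen only for illustration.
concepts = [
    ["atrial fibrillation", "AF"],
    ["cardiac surgery", "heart surgery"],
    ["milrinone", "dobutamine", "inotropic drugs"],
]
query = build_query(concepts)
print(query)
```

Adding another concept group narrows the retrieval, whereas adding synonyms within a group broadens it.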
P.G. Supino and H.A.B. Epstein
this resource by subject topics and keywords. The citation display features a summary abstract, a bibliography, and publications that have cited that paper. As with many systems today, full text of the paper as well as related article citations also may be linked. A citation map can be generated to visually display for two generations the references in the bibliography and cited papers.

If the investigator is interested in behavioral science research, the American Psychological Association offers a suite of databases, PsycINFO, PsycARTICLES, PsycBOOKS, PsycCritiques, and PsycEXTRA. Information can be found on psychology and related disciplines (e.g., psychiatry, nursing, neuroscience, law, education, sociology, social work). Available in a variety of formats (e.g., journal articles, books or book chapters, dissertations, technical and annual reports, government reports, conference presentations, consumer brochures, magazines, among others), PsycINFO can be searched with words, phrases, and terms from the Psyc thesaurus. Like MeSH, the terms are arranged in alphabetical and hierarchical order.

Web of Science's Social Science Citation Index can be explored for those interested in social sciences research. Almost 2,500 journals are indexed, representing 50 social science and related disciplines, including anthropology, urban studies, industrial relations, law, linguistics, substance abuse, public health, and information and library sciences, among others. Like Science Citation Index, the citation display features a summary abstract, bibliography, and publications that have cited the paper; full text of the paper and related article citations also may be linked. This database also can be searched with words and phrases.

The EBSCOhost family of databases covers the humanities and social sciences. It also includes CINAHL (Cumulative Index to Nursing and Allied Health Literature). This database provides indexing for nearly 3,000 journals from the fields of nursing and allied health, including librarianship, and contains more than 2.2 million records dating back to 1981. Like MEDLINE, EMBASE, and PsycINFO, one searches CINAHL with significant words and phrases as well as CINAHL descriptors that can be expanded. Searchers can add citations to a folder, permitting them to be printed, e-mailed, or saved. Also, like other databases, CINAHL links to cited references.

Finally, for those seeking the latest information on evidence-based health care, the Cochrane Library is an excellent source of systematic reviews (discussed in depth in Chap. 9), RCTs, and health technology and economic assessments. It is produced by the Cochrane Collaboration, a worldwide effort dedicated to systematically reviewing the effectiveness of health-care interventions, and is available from Wiley and Wolters Kluwer Health via Ovid. Though the Cochrane Library can be searched with words, phrases, and MeSH descriptors, its central database of randomized trials is extensive (mandating a more precise searching strategy), whereas its database of systematic reviews contains fewer than 5,000 elements (requiring a broader search strategy). If the searcher is able to identify a systematic review that contains a reasonable number of trials from which valid and consistent inferences have been drawn, it may provide most of the literature needed to support a research project.

Although web-based bibliographic programs have become increasingly user-friendly by encouraging the searcher to place significant words, phrases, and database subject terms in a search box, the search process itself remains a combination of science and art which requires practice and patience. In view of this, some investigators may opt to complete an online tutorial, sign on to a web-based training session, attend an in-person course at their local library, or consult with a librarian for training and search planning. Some investigators will team up with a searching professional to run the search together or, after a rigorous interview (in which the goals of the study are carefully discussed), will have the searching professional perform the search. For those without access to such instructional resources, we offer the following recommendations:
• Frame your search topic in the form of a specific question or statement.
• Depending on your choice of search system(s), plan your search strategy accordingly with significant words, phrases, and database subject headings or descriptors.
• Decide whether empirical and/or theoretical literature is to be included:
  – Empirical literature comprises primary research reports (e.g., observational studies, controlled trials) and systematic reviews of research.
  – Theoretical literature includes descriptions of concepts, models, and theoretical frameworks.
• Identify preferred literature sources, for example, articles, book chapters, and dissertations.
• Determine the amount of information needed and the temporal period of interest.
• Evaluate the likelihood of finding specific information about your topic. If you think the topic is voluminous, use a more narrow approach to search the literature. If you think the topic will yield a small amount of literature, use a broader approach.
• Display and review all citations with as much text, searching terms, and related links as possible. Many articles will be available in full text directly from the searching system.
• If you determine that your retrieval is inadequate for your needs, consider modifying your search strategy and running your search again.
• Obtain and organize all source documents.

Once the key references have been compiled, these should be carefully reviewed to identify the methodologies employed, conclusions drawn, and limitations of the selected studies. It is of paramount importance that the investigator carefully read the entire published study and any accompanying editorials, comments, and letters, rather than rely on information given in an abstract or in published reviews of the literature written by others. This is because abstracts and review articles provide only incomplete information; in addition, the perspective of the reviewing author may bias the interpretation of primary findings contained in the review articles.

The information contained within each reference should be related to the problem statement to form a nexus between the earlier studies and the current research project. If the investigator determines that the literature supports the need to study the proposed problem, he or she can proceed with confidence, knowing that pursuit of the research project (if properly designed and implemented) is likely to modify or extend the existing body of knowledge. Moreover, information gained from the literature review (including successes or failures of previous published work) can, as indicated earlier, prove invaluable for refining the problem (if necessary), buttressing or revising hypotheses, and validating or modifying the approach taken.

Crafting the Problem and Purpose Statements

Once the problem has been conceptualized and the literature search completed, the investigator is in a position to communicate to interested parties (e.g., mentors, colleagues, potential sponsors) the nature, context, and significance of the problem, including, typically, the type and size of the affected population, what is known and not yet known, and the consequences of the lack of knowledge (i.e., the implied or directly stated), thus elucidating the active challenge to be addressed and justifying the logical argument underlying the study. These elements are incorporated collectively into a problem statement, a declarative set of assertions, interwoven with literature support, which customarily appears in the Introduction of the research report or in the Background and Significance section of a research proposal (though, as Polit et al. [12] have observed, the problem statement rarely is labeled as such and must be ferreted out). As a general rule, a well-constructed problem statement should be written as concisely as possible for optimal clarity yet contain sufficient information to make a viable argument in support of the study and elicit interest [13]. Abbreviated problem statements, condensed into a sentence or two with minimal supporting argumentation, commonly are provided in the beginning of the abstract accompanying the main body of the research report or research proposal. (Ellis and Levy [13] refer to these reductions as "statements of the problem" to differentiate them from fully developed problem statements with appropriate argumentation.)

If the study is broad, it is recommended that the investigator divide the main problem into subproblems, each of which addresses a single issue. It is important that the sum of the content
Table 2.2 Examples of well-defined problem statements from two research reports

PROBLEM STATEMENT #1: Fleming et al., Circulation, 2008 [23] | PROBLEM STATEMENT #2: Walker et al., CMAJ 2000 [24]
Atrial fibrillation (AF), the most common complication after cardiac surgery, is associated with significant morbidity, increased mortality, longer hospital stay, and higher hospital costs ... Because ventricular dysfunction is common following cardiac surgery, inotropic drugs are often necessary to improve hemodynamic status; however, the effect of inotropic drugs on postoperative AF has not been extensively studied ... Milrinone has been reported to be associated with a lower risk of postoperative AF compared to dobutamine use, but milrinone increases the risk of atrial arrhythmias in patients with acute exacerbation of chronic heart failure | Asymptomatic bacteriuria is common in institutionalized elderly people. The prevalence increases with age, occurring in up to 50% of elderly women and 35% of elderly men who reside in long-term facilities ... Despite lack of benefit, institutionalized older adults with asymptomatic bacteriuria are frequently treated with antibiotics. This practice is of particular concern given the deleterious effects of antibiotics, including the potential for the development of antibiotic resistance and adverse reactions seen in this population. Why antibiotics continue to be prescribed for asymptomatic bacteriuria is unclear
Table 2.3 Examples of well-defined statements of purpose from two published research studies

PURPOSE STATEMENT #1: Fleming et al., Circulation, 2008 [23] | PURPOSE STATEMENT #2: Walker et al., CMAJ 2000 [24]
The aim of this analysis was to test the hypothesis that the use of inotropic drugs is associated with an increased risk of postoperative AF in cardiac surgery patients participating in an ongoing randomized, double blinded, placebo controlled trial | The aim of our study was to explore the perceptions, attitudes, and opinions of physicians and nurses involved in the process of prescribing antibiotics for asymptomatic bacteriuria in institutionalized elderly people
reflected in the subproblems equates to no more or no less than the content reflected in the main problem. Like the main problem, the subproblems should be stated clearly and be related to each other in a meaningful way so that the research will maintain coherence.

Two examples of well-defined problem statements are given in Table 2.2. The first (shown in the left column) is drawn from a quantitative study by Fleming et al. [23] about the impact of milrinone on risk for atrial fibrillation after cardiac surgery. The second (shown in the right column) is a qualitative study by Walker et al. [24] addressing reasons for prescription of antibiotic therapy among the asymptomatic institutionalized elderly with bacteriuria. Note, in each case, the problem statement makes the argument that there is an important unresolved issue that should be addressed, and sets the stage for what the investigator intends to do to facilitate a solution.

The problem statement typically is followed by a statement of purpose (usually the last sentence or two in the Introduction of the research report or given as a list in the Specific Aims of the research proposal), which succinctly identifies what the investigator intends to do (the type of inquiry) to resolve the unknowns explicated in the problem statement. Although, like the problem statement, the statement of purpose typically is not labeled as such, it is easily identifiable as it includes the words "purpose" ("the purpose of the study was/is ..."), "goal" ("the goal of the study was/is ..."), or, alternatively, "intent," "aim," or "objective" [12]. In a quantitative study, the statement of purpose also identifies the key variables to be examined and/or interrelated (parameters to be estimated, hypotheses to be tested), the nature of the study population (who is included), and, occasionally, the nature of the study design; in a qualitative investigation, the purpose statement commonly will include the phenomenon or phenomena under study (rather than hypotheses), as well as the study group, community, or setting [12]. Shown in Table 2.3 are the purpose statements from the Fleming and Walker studies. In both cases, the reader will note that the statements of purpose flow directly from the problem statements.

As Polit et al. have noted (and as illustrated above), the use of verbs in a purpose statement is key to determining the thrust of the inquiry and also helps to differentiate quantitative from qualitative studies [12]. The former typically include terms such as "compare," "contrast," "correlate," "estimate," and "test," whereas the
Table 2.4 Examples of research questions restated from two statements of purpose

PURPOSE STATEMENT #1: RESTATED AS A RESEARCH QUESTION (Fleming et al., Circulation, 2008 [23]) | PURPOSE STATEMENT #2: RESTATED AS A RESEARCH QUESTION (Walker et al., CMAJ 2000 [24])
Does the use of inotropic drugs increase risk of postoperative AF in cardiac surgery patients? | What are the perceptions, attitudes, and opinions of physicians and nurses involved in the process of prescribing antibiotics for asymptomatic bacteriuria in institutionalized elderly people?
latter include terms such as "describe," "explore," "understand," "discover," and "develop." Verbs such as "prove" or "show" should be avoided in purpose statements of research studies as these can be construed as indicative of investigator bias [12].

As noted above, a statement of purpose can be expressed in declarative form. However, some investigators instead will frame the purpose of their study interrogatively as one or more research questions (each addressing a single concept) that are directed at the unknowns in the problem statement. Alternatively, these questions can be added to a global statement of purpose to improve clarity and specificity. As Polit et al. contend, research questions "invite an answer" and help focus attention on the kinds of data that would have to be collected to provide that answer [12]. Listed in Table 2.4 are research questions that could have been framed by Fleming et al. and Walker et al. to address the targets of inquiry in their studies.

However written, both the problem and purpose of the study (or the research questions) should be apparent to the reader in the Introduction of the research report (or in the Background, Significance, and Specific Aims of the research proposal) and should possess sufficient clarity for the reader to understand them without the presence of the author. Unfortunately, this is not always the case. Consider the statements articulated by Houck and Hampson in the introduction to their study about carbon monoxide poisoning following a winter storm in 1993, when charcoal briquettes commonly were used for heating in certain areas of the USA:

"A major epidemic of carbon monoxide poisoning occurred after a severe winter storm struck western Washington State during the morning of 20 January 1993. Charcoal briquettes and gasoline-powered generators were principal sources of CO. Although previous reports have described CO poisoning following winter storms in the Eastern United States, the large number and wide distribution of cases following this storm are unique. Unintentional carbon monoxide (CO) poisoning is a substantial health problem in the US, causing an estimated 11,547 deaths from 1979 through 1988. The US Consumer Product Safety Commission estimates that there was an average of about 28 charcoal-related deaths per year from 1986 through 1992. Charcoal briquettes are not an uncommon source of CO poisoning in Washington State: 16% of the 509 unintentional poisoning cases that required hyperbaric oxygen treatment between October 1982 and October 1993 involved charcoal. Our investigation suggests that CO poisoning following severe winter storms should be anticipated. It also suggests that preventive messages are important public health messages, but that they should be understandable to those in the community who neither read nor speak English." [25]

Does the Introduction contain a clear statement of the problem so that it is evident why the investigation was important? Is there a statement of purpose (or a set of questions) that explains what the investigators did to address the problem? Do the authors' introductory statements prepare the reader to follow the rest of the paper? After all, that is the principal role of the Introduction in a research manuscript. (For further details about the role and proper construction of the Introduction of the scientific paper, the reader is referred to Chap. 13.) Note, the authors have provided the reader with a general background statement and also have presented their conclusions in their Introduction, repeating information already given in their Abstract. However, other than suggesting that their data were unique, the rationale and aims of their study have not been articulated, and their research questions remain undefined even after reading their comments. The moral illustrated by this example is that for the published paper to engage and edify the reader, the research problem, purpose, and/or research questions must be unambiguously stated early in the research report.

When there is poor definition of problem and purpose, not only may the reader become
confused, but these deficiencies may adversely impact the study methodology because all subsequent steps in the research process (e.g., construction of the research questions or hypotheses, development of the research design, collection and analysis of data) are guided by the problem and purpose statements. Houck and Hampson were fortunate. When their article was written, there were relatively few experienced peer reviewers in their discipline (emergency medicine). This may well have helped the authors' efforts to gain publication.

More commonly, deficiencies in the wording of these statements and their connection to the remainder of the paper can be a primary cause of a manuscript being rejected for publication, or being sent back to the author for revision, following the peer-review process. The following criticisms, made by a reviewer in response to two different submissions to a cardiology journal, are illustrative of this point:
• Submission #1: Comment: The focus of the study is not clearly apparent, even from the last paragraph which specifically describes the goals. The first page does not point directly

In its current form, the manuscript resembles a mystery story with a good outcome more than a scientific study. Thus, while indicating the general aim of the authors, the Introduction misstates the specific goals required by the apparent design of the reported work, thus misfocusing the reader. (Recommendation: Consider after revision)

In sum, all research (whether basic or applied, quantitative or qualitative, hypothesis generating or hypothesis testing, retrospective or prospective, observational or experimental) may be considered as a response to a problem (an ambiguity, gap in knowledge, or other perplexing state) that requires resolution. In thinking through the problem and communicating it to others, the investigator must provide a clear and convincing argument that indicates why the problem must be addressed (the problem statement), articulate a solution to the problem to clarify the ambiguity or fill the gap in knowledge (the purpose statement or research questions), and tie these statements to the methods used. The challenge to the investigator is to define and interrelate these elements well enough to justify the research study
to the study hypothesis. (Recommendation: and maximize the likelihood that the ndings will
Consider after revision) be understood, appreciated, and utilized.
Take-Home Points

A well-designed research project, in any discipline, begins with conceptualizing the problem.
Research problems in clinical medicine may be stimulated by practical issues in the clinical care of patients, new or unexpected observations, discrepancies and knowledge gaps in the published literature, solicitations from government or other funding sources, and public forums such as scientific sessions, grand rounds, and seminars.
Well-conceived research problems are important, interesting, feasible, and ethical and serve as a springboard for clearly focused questions.
Research questions most relevant to clinicians include those pertaining to disease prevalence/incidence, prevention, detection (diagnosis or screening), etiology, prognosis, and outcomes of treatment (benefit or harm).
A comprehensive literature search, conducted early in the planning process, can help to determine whether the proposed study is feasible, whether it is likely to substantively contribute to the existing knowledge base, and whether it can provide guidance in the construction of hypotheses, determination of sample size, and choice of study design.
Proper framing of the problem and purpose statements is essential for communicating and justifying the research.

References

1. Kerlinger F. Foundations of behavioral research: educational and psychological inquiry. New York: Holt, Rinehart and Winston; 1964.
2. Eng J. Getting started in radiology research: asking the right question and identifying an appropriate population: critical thinking skills symposium. Acad Radiol. 2004;11:149–54.
3. Albrecht MN, Perry KM. Home health care: delineation of research priorities and formation of a national network group. Clin Nurs Res. 1992;1:305–11.
4. Davidson P, Merritt-Gray M, Buchanan J, Noel J. Voices from practice: mental health nurses identify research priorities. Arch Psychiatr Nurs. 1997;11:340–5.
5. Gallagher M, Hares T, Spencer J, Bradshaw C, Webb I. The nominal group technique: a research tool for general practice? Fam Pract. 1993;10:76–81.
6. Beitz JM. Writing the researchable question. J Wound Ostomy Continence Nurs. 2006;33:122–4.
7. Hulley SB, Cummings SR. Designing clinical research. 1st ed. Baltimore: Williams and Wilkins; 1988.
8. Gliner JA, Morgan GA. Research methods in applied setting. An integrated approach to design and analysis. Mahwah: Lawrence Erlbaum Associates; 2000.
9. Shugan SM. Defining interesting research problems. Market Sci. 2003;22:1–15.
10. Hulley SB, Cummings SR, Browner WS, Grady DG, Newman TB. Designing clinical research. 3rd ed. Philadelphia: Lippincott Williams and Wilkins; 2007.
11. Tuckman BW. Conducting educational research. New York: Harcourt Brace Jovanovich; 1972.
12. Polit DF, Beck CT, Hungler BP. Essentials of nursing research. Methods, appraisal, and utilization. Philadelphia: Lippincott, Williams & Wilkins; 2001.
13. Ellis TJ, Levy Y. Framework of problem-based research: a guide for novice researchers on the development of a research-worthy problem. Inform Sci: Int J Emerg Transdiscipl. 2008;11:17–33.
14. Leedy PD, Ormrod JE. Practical research: planning and design. 8th ed. Upper Saddle River: Prentice Hall; 2005.
15. Kerlinger FN, Lee HB. Foundations of behavioral research. 4th ed. Holt: Harcourt College; 2000.
16. Sim J, Wright C. Research in health care. Concepts, designs and methods. Cheltenham: Nelson Thornes; 2000.
17. Leedy PD. Practical research planning and design. 2nd ed. New York: MacMillan; 1980.
18. Sharav VH. The ethics of conducting psychosis-inducing experiments. Account Res. 1999;7:137–67.
19. Trochim WM. The research methods knowledge base. 2nd ed. Internet WWW page at URL: www.socialresearchmethods.net/kb/. Version current as of 20 Oct 2006.
20. Blaikie NWH. Designing social research: the logic of anticipation. Malden: Blackwell; 2000.
21. Sackett DL, Straus SE, Richardson WS, Haynes RB. Evidence-based medicine: how to practice and teach EBM. 2nd ed. Edinburgh/New York: Churchill-Livingstone; 2000.
22. Burns N, Grove SK. The practice of nursing research. Conduct, critique and utilization. 5th ed. St. Louis: Elsevier Saunders; 2005.
23. Fleming GA, Murray KT, Yu C, Byrne JG, Greelish JP, Petracek MR, Hoff SJ, Ball SK, Brown NJ, Pretorius M. Milrinone use is associated with postoperative atrial fibrillation following cardiac surgery. Circulation. 2008;118:1619–25.
24. Walker S, McGeer A, Simor AE, Armstrong-Evans M, Loeb M. Why are antibiotics prescribed for asymptomatic bacteriuria in institutionalized elderly people? CMAJ. 2000;163:273–7.
25. Houck PM, Hampson NB. Epidemic carbon monoxide poisoning following a winter storm. J Emerg Med. 1997;15:469–73.

3 The Research Hypothesis: Role and Construction

Phyllis G. Supino

P.G. Supino, EdD
Department of Medicine, College of Medicine, SUNY Downstate Medical Center, 450 Clarkson Avenue, Box 1199, Brooklyn, NY 11203, USA
e-mail: phyllissupino@aol.com

"Wrong hypotheses, rightly worked from, have produced more results than unguided observation"
Augustus De Morgan, 1872 [1]

Overview

Once a problem has been defined, the investigator can formulate a hypothesis (or set of hypotheses, if there are multiple subproblems) about the outcome of the study designed to resolve the problem. A hypothesis (from the Greek, "foundation") is a logical construct, interposed between a problem and its solution, which represents a proposed answer to a research question. It gives direction to the investigator's thinking about the problem and, therefore, facilitates a solution. Unlike facts and assumptions (presumed true and, therefore, not tested in the study) or theory (a relatively well-supported unifying system explicating a broad spectrum of observations and inferences, including previously tested hypotheses), the research hypothesis is a reasoned but tentative proposition typically expressing a relation between variables. For it to be useful and, more importantly, assessable, it must generate predictions that can be tested by subsequent acquisition, analysis, and interpretation of data (i.e., through formal observation or experimentation). When the results of the study are as predicted, the hypothesis is supported. As noted below, such support does not necessarily indicate verification of the hypothesis. Consistent replication of predictions in subsequent studies may be needed if the hypothesis is to be accepted as a theory or a component of a theory. If results are not as predicted, the hypothesis is rejected (or, at minimum, revised or removed from active consideration until future developments in science and/or technology provide new tools for retesting). As Leedy has stated, a hypothesis is to a researcher what a point of triangulation is to a surveyor: it provides a position from which he may orient his exploration into the unknown and a checkpoint against which to test his findings [2]. The paramount role of the hypothesis for guiding biomedical investigations was first highlighted by the eminent physiologist Claude Bernard (1813–1878) [3]. In the current era, hypotheses are considered fundamental to rigorous research, and biomedical studies without hypotheses have been largely abandoned in favor of those designed to generate or test them [4].

Hypotheses Versus Assumptions

It is important to recognize the difference between a hypothesis and an assumption. These terms share the same etymological root and are often confused. An assumption is accepted as fact in designing or justifying a study (though it is likely to have been the subject of previous research).
Thus, the investigator does not set out to test it. Examples of assumptions include:
Radionuclide cineangiography measures ventricular performance.
Chest x-rays measure the extent of lung infiltrates.
The SF-36 measures general health-related quality of life.
Medical education improves knowledge of clinical medicine.
"An apple a day keeps the doctor away" (the most famous [albeit untested] assumption of them all).

In contrast, the hypothesis is an expectation that an investigator will attempt to confirm through observation or experiment. Examples in clinical medicine include:
Among patients with chronic nonischemic mitral regurgitation (insufficiency), survival will be better among those whose valves have been repaired or replaced than among those who have been maintained on medical therapy.
Among patients hospitalized with community-acquired pneumonia, posthospital course will be better among those with a low-risk profile than among those with a high-risk profile before hospitalization.
Life expectancy will be greater among individuals consuming low-calorie diets than among those consuming high-calorie diets.
Health-related quality of life is better among those whose mitral valves have been repaired than among those whose mitral valves have been replaced.

Hypothesis Generation: Modes of Inference

There is a paucity of empirical data regarding the way (or ways) in which hypotheses are formulated by scientists and even less information about whether these methods vary across disciplines. Nonetheless, philosophers and research methodologists have suggested three fundamentally different modes of inference: deduction, induction, and abduction [5]. These differ primarily according to (1) whether the origin of the hypothesis is a body of knowledge or theory (the rationalist perspective), an empirical event (the inductivist perspective), or some combination of the two (the abductivist perspective); (2) the logical structure of the argument; and (3) the probability of a correct conclusion.

Hypothesis by Deduction

Deduction (from the Latin de [out of] and dūcere [to draw or lead]) is one of the oldest forms of logical argument. It was introduced by the ancient Greeks who believed that acquisition of scientific knowledge (insight into the principles and character of natural substances and their causes) could be achieved largely by the same logical processes used to prove the validity of mathematical propositions [6]. Today, deduction remains the predominant mode of formal inference in research in mathematics and in the fundamental sciences, but it also plays an important role in the empirical sciences. A deductively derived hypothesis arises directly from logical analysis of a theoretical framework, previously developed to provide an explanation of events or phenomena. It is considered to be non-ampliative because, while it helps to provide proof of principle, it adds nothing new beyond the theory. The validity of a theory can never be directly examined. Therefore, scientists wishing to evaluate it, or to test its utility within a given (perhaps new) context, will formulate a conjecture (hypothesis) that can be subjected to empirical appraisal. In forming a hypothesis by deduction, the investigator typically moves from a general proposition to a more specific case that is thought to be subsumed by the generalization (i.e., from theory to a conceptual hypothesis or from a conceptual hypothesis to a precise prediction based on the hypothesis). Deductive arguments can be conditional or syllogistic (e.g., categorical [all, some, or none], disjunctive [or], or linear [including a quantitative or qualitative
comparison]) and contain at least two premises (statements of "evidence") and a conclusion. A well-known categorical syllogism and example are given below:

All As are B (e.g., All men are mortal)
C is an A (e.g., Socrates is a man)
∴ C is a B (e.g., Socrates is mortal)

If the premises of a deductive argument are true and the reasoning used to reach the conclusion is valid (i.e., the form of the argument is correct), it will necessarily follow that the conclusion is sound (i.e., the premises, if true, guarantee the conclusion). If the form of the deductive argument is invalid (i.e., the premises are such that they do not lead to the conclusion: e.g., Socrates is mortal, all cats are mortal, ∴ Socrates is a cat) and/or the premises are untrue (e.g., all mortals are men [or cats]), the conclusion will be unsound. It should be noted that deductive reasoning is the only form of logical argument to which the term "validity" is appropriate.

The theory from which the hypothesis is derived can be specific to the discipline or it can be borrowed from another discipline. Polit and Beck [7] provide two examples of deductively formulated hypotheses, germane to nursing, derived from general reinforcement theory which posits that behaviors that are rewarded tend to be learned or repeated:
1. Nursing home residents who are praised (reinforced) by nursing personnel for self-feeding require less assistance in feeding than those who are not praised.
2. Pediatric patients who are given a reward (e.g., a balloon or permission to watch television when they cooperate during nursing procedures) tend to be more cooperative during those procedures than unrewarded peers.

Deduction also is used to translate broad hypotheses such as these to more specific operational hypotheses (i.e., working hypotheses or predictions) that can be directly tested by observation or experiment. When empirical support is obtained for a hypothesis, this, in turn, strengthens the theory or body of knowledge from which the hypothesis was deduced.

Hypothesis by Induction

Not all hypotheses are derived from theory. Frequently, in the empirical sciences, patterns, trends, and associations are observed serendipitously in clinical settings or in preclinical laboratories or, purposively, through exploratory data analysis or other hypothesis-generating research. Sometimes, they may result from specific findings gleaned from the research literature. These observations may be generalized to produce inductively derived hypotheses that may serve as the basis for predicting future events, phenomena, or patterns. Induction (from the Latin in [meaning "into"] and dūcere [to draw or to lead]) is defined by Jenicek and Hitchcock as any method of logical analysis that proceeds from the particular to the general [8] and represents the logical opposite of deduction which, as noted above, typically proceeds from the general to the specific. Induction can be used not only to formulate hypotheses but to confirm or refute them, which may be its most appropriate use, as noted below (see "Abduction"). Inductive reasoning, which is based heavily on the senses rather than on intellectual reflection, was popularized by the English philosopher and scientist, Sir Francis Bacon (1561–1626) [9], who proposed it as the logic of scientific discovery, a position that, subsequently, has been vigorously disputed by the Austrian logician, Sir Karl Popper (1902–1994) [10] and other philosophers of science. There are various forms of inductive inference. One of the most common is enumerative induction (or inductive generalization). Jenicek and Hitchcock [8] describe it as a mode by which one concludes that all cases of a specified kind have a specified property on the basis of observation that all examined cases of that kind have the property [8]. It is called "enumerative" because it itemizes cases in which some pattern is found and, solely for this reason (i.e., without the benefit of a theoretical framework), forecasts its recurrence. Other forms of induction include argument from analogy (forming inferences based on a shared property or properties of individual cases) or prediction
(drawing conclusions about the future cases from a current sample), causal inference (concluding that association implies causality), and Bayesian inference (given new evidence [data], using probability theory [Bayes' theorem] to alter belief in a hypothesis).

All inductive arguments contain multiple premises that provide grounds for a conclusion but do not necessitate it (in contrast to a deductive argument where the premises, if true, entail the conclusion). In other words, a conclusion drawn from an inductive argument is probable (at best), even if its premises are correct. For this reason, all inductive arguments, while ampliative, are considered to be logically invalid and are judged, instead, according to their "strength" (i.e., whether they are inductively strong or inductively weak). The strength of an inductive generalization is determined by the number of observations supporting it and the extent to which the observations reflect all observations that could be made. The more (consistent) observations that exist, the more likely the conclusion is correct (inconsistent observations, of course, reduce the argument's inductive strength). The typical form of an inductive generalization is given below:

A1 is a B
A2 is a B
(All As I have observed are Bs)
∴ All As are Bs

Like deductive arguments, inductive generalizations can be categorical, that is, represent conclusions about all (as above), no, or some members of a class, or they may involve quantitative arguments, for example, "50% of all coins I have sampled are quarters; therefore, 50% of all coins coming from the same lot that I have sampled probably are quarters" (or, as a clinical example, "30% of the patients I have examined are obese; therefore, 30% of patients sampled from the same population as those who I have examined probably are obese").

Not all inductive hypotheses used by scientists have been formulated by scientists; some, in fact, owe their origin to folklore. For example, by the late eighteenth century, it was common knowledge among English farm workers that when humans were exposed to cows infected with cowpox (vaccinia), they became immune to its more severe human analogue, smallpox. The English surgeon, Edward Jenner (1749–1823), used this hypothesis as the basis of a series of scientific experiments, using exudates from an infected milkmaid, to develop and formally test a vaccine against this disease [11]. He became famous for using vaccination as a method for preventing infection, though there is growing recognition that the first successful inoculations against smallpox actually were performed by a farmer, Benjamin Jesty, some 20 years earlier, who vaccinated his family using cowpox taken directly from a local cow [12]. It also has been claimed that Charles Darwin used inductive reasoning when generalizing about the shapes of the beaks from finches from the various Galapagos Islands [13] and when forming conjectures from observations based on the breeding of dogs, pigeons, and farm animals at home (inferences that formed underpinnings of his theory of evolution) and that Gregor Mendel used the same form of reasoning to conceptualize his law of hybridization [14]. Even if these claims are true (and there is far from universal agreement on this matter), inductive generalizations typically are regarded as inferior to hypothesis-generating methods that involve more theoretical reasoning, that consider variations in circumstances (i.e., possible confounding factors) that may account for spurious patterns, and that provide possible causal explanation for observed phenomena. Moreover, recent research in cognition and the relatively new field of neural modeling suggest that simple induction across a limited set of observations may have a far smaller role in scientific reasoning than previously realized [15].

Hypothesis by Abduction

Of the three primary methods of reasoning, the one that has been most implicated in the creation of novel ideas, including scientific discoveries, is the logical process of abduction (from the Latin ab [meaning "away from"] and dūcere [to draw or to lead]). It also is the most common mode of reasoning employed by clinicians when making
diagnostic inferences. Abduction was introduced into modern logic by American philosopher and mathematician, Charles Sanders Peirce (1839–1914) [16], and remains an important, albeit controversial, topic of research among philosophers of science and students of artificial intelligence. It refers to the process of formulation and acceptance on probation of a hypothesis to explain a surprising observation. Thus, hypotheses formed by abduction (unlike those formed by induction) are always explanatory. (The reader should note that other synonyms for, and definitions of, abduction exist, e.g., retroduction, reduction, inference to the best explanation, etc., the latter reflecting the evaluative and selective functions that also have been associated with this term.)

Abductive reasoning entails moving from a consequent (the observation or current fact) to its antecedent (presumed cause or precondition) through a general rule. It is considered "backward" because the inference about the antecedent is drawn from the consequent.

Peirce devoted his earliest work (before 1900), as did Aristotle long before him, to furthering the development of syllogistic theory to express logical relations. During this early period, abduction (then termed by him as "hypothesis") was taken to mean the use of a known "rule" to explain an observation ("result"); accordingly, his initial efforts were devoted to demonstrating how the hypothesis relates to the premises of the argument and how it differs from the logical structure of other forms of reasoning (i.e., deduction or induction). In his essay, Deduction, Induction, Hypothesis, Peirce presents an abductive syllogism:

Rule: All the beans from this bag are white.
Result: These beans are white.
Case: These beans are from this bag. [16]

In this argument, the "rule" and "result" represent the premises (background knowledge and observation, respectively [the order is arbitrary]) and the "case" represents the conclusion (here, the hypothesis). Had this argument been expressed deductively, the "case" would have been the second premise, and the "result," the conclusion (i.e., all the beans from this bag are white, these beans are from this bag; therefore, these beans are white). It should be obvious to the reader that the abductive argument is logically less secure than a deductive argument (or even an inductive argument). It represents a possible conclusion only (after all, the beans might come from some other bag, or from no bag at all). Therefore, like an inductive argument, it is ampliative though logically invalid. Its strength is based on how well the argument accounts for all available evidence, including that which is seemingly contradictory.

As Peirce's work evolved, he shifted his efforts to developing a theory of inferential reasoning in which abduction was taken to mean the generation of new rules to explain new observations. In so doing, he focused on, what some have termed, the "creative character of abduction" [17]. Peirce argued that abduction had a major role in the process of scientific inquiry and, indeed, was the only inferential process by which new knowledge was created, a view that was, and continues to be, hotly debated by the philosophical community. In his later work, Peirce described the logical structure of abduction as follows:

The surprising fact, C, is observed.
But if A were true, C would be a matter of course.
Hence, there is reason to suspect that A is true. [18]

The "surprise" (the stimulus to the abductive inference) arises because the observation is viewed, at that moment in time, as an anomaly, given the observer's preexisting corpus of knowledge (theory base) which cannot account for it. The lack of compatibility between the observation and expectation introduces a type of cognitive dissonance that seeks resolution through the adoption of a coherent explanation. In Peirce's opinion, the explanation might be nothing more than a guess (Peirce believed that humans were "hardwired" with the ability for guessing correctly) that, unlike an inductive generalization, enters the mind "like a flash" [18] or, what is commonly termed, as a "eureka moment" or an "ah ha!" experience. Because a guess (insightful or not), by its very nature, is speculative (and, as noted above, is a relatively insecure form of reasoning), Peirce recognized that an abductive hypothesis must be rigorously tested before it could be admitted into scientific theory. This, he
reasoned, is accomplished by using deduction to explicate the consequences of the hypothesis (i.e., the predictions) and induction to form a conclusion about the likelihood of their truthfulness, based on experimental verification. According to Peirce, these are the primary roles of deduction and induction in the scientific process. Figure 3.1 illustrates the Peircian view of the relation between abduction, deduction, and induction as interpreted by Flach and Kakas [19].

Fig. 3.1 The three stages of scientific inquiry (From Abduction and Induction. Essays on their Relation and Integration, Flach PA and Kakas AC. Abductive and Inductive Reasoning: Background and Issues, Chap. 1, pp. 1–27, Copyright 2000, with permission from Kluwer Academic Publishers)

Countless abductively derived hypotheses, principles, theories, and laws have been put forward in science. Many, if not most, owe to the serendipitous consequences of an unexpected observation made while looking for something else [20]. Well-known examples of such "happy accidents" include:
Archimedes' principles of density and buoyancy
Hans Christian Oersted's theory of electromagnetism
Luigi Galvani's principle of bioelectricity
Claude Bernard's neuroregulatory principle of circulation
Paul Gross's protease-antiprotease hypothesis of pulmonary emphysema

Although, as Peirce points out, all three modes of inference (abduction, deduction, and induction) are used in the process of scientific inquiry, each requires different skills. As scholars have noted, deduction requires the capacity to reason logically and inductive reasoning requires understanding of the statistical implications of drawing conclusions from samples to populations. In contrast, as Danmark et al. have noted, abduction requires the discernment of new relations and connections not immediately obvious [21]; in other words, to think outside the box. For this reason, the best abductive hypotheses in science have been made by those who not only are observant, wise, and well grounded in their disciplines but who also are imaginative and receptive to new ideas. This view was, perhaps, best expressed by Louis Pasteur (1822–1895) when he argued, "In the fields of observation, chance favors only prepared minds" [22]. Accordingly, developing the "prepared mind," in general, and enhancing the capacity to reason abductively, deductively, and inductively, in particular, should be among the most important goals of those seeking to effectively engage in the process of scientific discovery.

Characteristics of the Research Hypothesis

Irrespective of how it is formulated (or the problem or discipline for which it is formulated), a research hypothesis should fulfill the following five requirements:

1. It should reflect an inference about variables.
The purpose of any hypothesis is to make an inference about one or more variables. The inference can be as simple as predicting a single characteristic in a population (e.g., mean height, prevalence of lung cancer, incidence of acute myocardial infarction, or other population parameter) or, more commonly, it represents a supposition about an association between two or more variables (e.g., smoking and lung cancer, diet and hypertension, age and exercise tolerance, etc.). It is, therefore, important for the investigator to understand what is meant by a variable and how it functions in the setting of a hypothesis.
In its broadest sense, a variable is any feature, attribute, or characteristic that can assume different values (levels) within the same individual at different points in time or can differ from one member of the study population to another. Typical variables of interest to biomedical researchers include subject profile characteristics (e.g., age, weight, gender, etiology, stage of disease), nature, place, duration of naturally occurring exposures (e.g., risk factors, environmental influences) or purposively applied interventions, and subject outcomes or responses (e.g., morbidity, mortality, symptom relief, physiological, behavioral, or attitudinal changes) among others.
It is important to recognize that a characteristic that functions as a variable in one study does not necessarily serve as a variable in another. For example, if an investigator wished to determine the relation of gender to prevalence of diabetes, it would be necessary to study this problem in a group comprising males and females, some with and some without this disease. Because intersubject differences exist for both characteristics, gender and diabetes would be considered study variables, and a hypothesis could be constructed about their association. However, if all patients in a study group were women with diabetes, no hypothesis could be developed about the relation between gender and diabetes since these attributes would be invariable. (Fuller discussion of nature and role of variables, and their relation to the hypothesis, is presented later in this chapter.)

2. It should be stated as a grammatically complete, declarative sentence.
A hypothesis should contain, at minimum, a subject and predicate (the verb or verb phrase and other parts of the predicate modifying the verb). The statements "relaxation (subject) decreases (verb) blood pressure (object, or predicate noun)," "depression (subject) increases (verb) the rate of suicide (predicate)," and "consumption of diet cola (subject) is related to (verb phrase) body weight (object, or predicate noun)" are illustrative of hypotheses that meet this requirement. In these examples, the subject and predicate modifiers reflect the variables to be related, and the verb (or verb phrase) defines the nature of the expected association.

3. It should be expressed simply and unambiguously.
For a hypothesis to be of value in a study, it must be clear in meaning, contain only one answer to any one question, and reflect only the essential elements of solution. The reason is that the hypothesis guides all subsequent research activities, including selection of the population and measurement instruments, collection and analysis of data, and interpretation of results. For example, the hypothesis "right ventricular performance is the best predictor of survival among patients with valvular heart disease, but is less important in others" would be difficult to validate. First, what is meant by "right ventricular performance"? Does this refer to ejection fraction at rest, at exercise, or the change from rest to exercise, or to some other parameter? Second, what is the meaning of "best"? Does it signify ease of measurement or does it relate to the strength of statistical
association? Third, to what is right ventricular performance compared? Is the contrast between right ventricular performance and clinical descriptors, anatomic descriptors, other functional descriptors, or between all of these? Fourth, what type of valvular heart disease is being studied? Is it regurgitant, stenotic, or both? Does it involve the mitral, aortic, or some other heart valve? Finally, what is meant by "less important"? Who (or what) are the "others"? As is true for the research problem, the clearer and less complex the statement of the hypothesis, the more straightforward the study and the more useful the findings.
4. It should provide an adequate answer to the research problem.
For a hypothesis to be adequate, it must address, in a satisfactory manner, both the content and scope of the central question; that is, whether the problem is narrow or broad, simple or complex, evaluation of the hypothesis(es) should result in the full resolution of the research problem. For this reason, it is recommended that the investigator formulate at least one hypothesis for every subproblem articulated in the study. Equally important, a hypothesis must be plausible; for this condition to be satisfied, the hypothesis should be based on prior relevant observation and experience, buttressed by consideration of existing theory, and should reflect sound reasoning and knowledge of the problem at hand. In contrast, speculations which have either no empirical support or legitimate theoretical basis, even if interesting, constitute poor hypotheses and typically yield weak or uninterpretable study outcomes. Finally, if the hypothesis is explanatory in nature (rather than an inductive generalization), all else being equal, it should represent the simplest of all possible competing explanations for the phenomenon or data at hand [23], a principle known as Occam's razor or entia non sunt multiplicanda praeter necessitatem (Latin for "entities must not be multiplied beyond necessity").
5. It should be testable.
A hypothesis must be stated in such a way as to allow for its examination which, in the biomedical and other empirical sciences, is achieved through the acts of observation or experimentation, analysis, and judicious interpretation. If one or more of the elements comprising the hypothesis is not present in the population or sample, or if a phenomenon or characteristic contained within the hypothesis is highly subjective or otherwise difficult to measure, the hypothesis cannot be properly evaluated. For example, the statement "female patients cope better with stress than male patients" would be a poor hypothesis if the investigator did not have access to both male and female patients or was unable to generate acceptable definitions and measures to evaluate "coping" and "stress." An even more egregious example is the hypothesis "prognosis following diagnosis of ovarian cancer is related to the patient's survival instinct," as it would be extremely difficult to develop empirical data in support of a "survival instinct," assuming it did exist.
For many years, philosophers of science have argued about what constitutes evidence in science or support for a scientific hypothesis. By the mid-twentieth century, the tenets of "logical positivism" (or "logical empiricism") dominated the philosophy of science in the United States as well as throughout the English-speaking world [24], replacing the Cartesian emphasis on rationalism as a primary epistemological tool. Strongly eschewing metaphysical and theological explanations of reality, the logical positivists argued that a proposition held meaning only if it could be "verified" (i.e., if its truth could be determined conclusively by observable facts). Early critics of logical positivism, most notable among them Karl Popper, believed that verifiability was too stringent a criterion for scientific discovery. This, he argued, was due to the logical limitations inherent in inductive reasoning, namely, the deductive invalidity of forming a generalization based on the observation of particulars, and the attendant uncertainty of such an inference. Thus, while both positive existential claims (e.g., "there is at least one white swan") and negative universal claims
(e.g., "not all swans are white") could be confirmed by finding, respectively, at least one white swan or one black swan, it would be impossible to verify a positive universal claim (e.g., "all swans are white"). To accomplish that, one would have to observe every swan in existence, at all times and in all places, or risk being wrong.
According to Popper, the hallmark of a testable claim is its capacity to be falsified [25]. In his view, falsification (not verification) is the criterion for demarcation between those hypotheses, theories, and other propositions that are scientific versus those that are not scientific. This, of course, did not mean that a scientific hypothesis or theory must be false; rather, if it were false, it could be shown to be so. Returning to our earlier example, all that would be required to disprove the claim "all swans are white" is to find a swan that is not white. Indeed, this inductive inference, based on the observation of millions of white swans in Europe, was shown to be false when black swans were discovered in Western Australia in the eighteenth century [26], an event that did not go unnoticed by Popper. It provided clear support for his assertion that no matter how many observations are made that appear to confirm a proposition, there is always the possibility that an event not yet seen could refute it. Similarly, any scientific hypothesis, theory, or law could be falsified by finding a single counterexample.
Popper's greatest contribution to science was his characterization of scientific inquiry, based on a cyclical system of conjectures and refutations (a form of critical rationalism) widely known as the hypothetico-deductive method [27]. A schematic of Popper's view of this method is shown in Fig. 3.2. Consistent with Popper's writing on the subject, the terms "hypothesis" and "theory" are used interchangeably as both are viewed as tentative, though most workers in the field currently reserve the latter term for hypotheses (or related systems of hypotheses) that have received consistent and long-standing empirical support.

Fig. 3.2 The hypothetico-deductive model: Popper's view of the role of falsification in scientific reasoning

The reader will note that the hypothetico-deductive method begins with an early postulation of a hypothesis. The investigator then uses deductive logic to form predictions from the hypothesis that should be true if the hypothesis is, in fact, correct. The nature of the predictions can vary from study to study, but they share the common attribute of being unknown before data collection. The predictions are then evaluated by formal experimentation or observation. Assuming a properly designed study, those predictions that are discordant with data falsify the hypothesis, which is then discarded or revised, leading to additional study. Although a hypothesis can never
be shown to be true via collection of compatible information (as Popper noted, a subsequent demonstration of counterfactual data can overturn any hypothesis), the extent to which it survives repeated attempts at falsification provides support (corroboration) for its validity. As a result, testing of a hypothesis serves to advance the existing theory base and body of knowledge. Popper argued that the hypothetico-deductive method was the only sound approach to scientific reasoning; moreover, in his opinion, it was the only method by which science made any progress.
Although Popper did not originate the hypothetico-deductive method, he was the first to explicate the central role of falsification versus confirmation of a hypothesis in the developing science. While his arguments have been criticized by other philosophers of science who assert that scientists do not necessarily reason that way [28], his views remain prominent in modern philosophy and continue to appeal to many modern scientists [29]. Today, the Popperian view of the hypothetico-deductive method, with its emphasis on testing to falsify a proposed hypothesis, generally is taken to represent an ideal (if not universal) approach to curbing excessive inductive speculation and ensuring scientific objectivity, and is considered to be the primary methodology by which biological knowledge is acquired and disseminated [30].

Types of Hypotheses

Hypotheses can be classified in several ways, as shown below.
1. Conceptual Versus Operational Hypotheses
Hypotheses can vary according to their degree of specificity or precision and theoretical relatedness. Hypotheses can be written as broad or general statements, in which case they are termed conceptual hypotheses. For example, an investigator may hypothesize that "a high-fat diet is related to severity of coronary artery disease" or another may conjecture that "depression is associated with a relatively high incidence of morbid events." Although these may be important hypotheses, these statements cannot be directly tested as they are fundamentally abstract. What do the investigators mean by "high fat," "depression," "severity of coronary artery disease," "relatively high," or "morbid events"? How will these terms be evaluated?
To render conceptual hypotheses testable, they must be recast as more specific statements with elements (variables) that are precisely defined according to explicit observable or measurable criteria. Hypotheses of this type are referred to as operational hypotheses or, alternatively, specific hypotheses or predictions and represent the specific (observable) manifestation of the conceptual hypothesis that the study is designed to test. Once the study is designed, data will be collected and analyzed to determine whether they are concordant or discordant with the operational hypothesis which, ultimately, will be reinterpreted in terms of its broader meaning as a conceptual hypothesis. Figure 3.3 below illustrates a simplified version of the hypothetico-deductive method, as conceptualized by Kleinbaum, Kupper, and Morgenstern [31] depicting the relation of conceptual and operational hypotheses to the design and interpretation of the study.
Construction of operational hypotheses represents an important preliminary step in the development of the research design, data collection strategy, and statistical analysis plan and is described in greater detail in subsequent sections of this chapter.
2. Single Variable Versus Multiple Variable Hypotheses
Some investigations are undertaken to determine whether a mean, proportion, or other parameter from a sample varies from a specified value. For example, a group of obstetricians may have read a report that concludes that, throughout the nation, the average length of stay following uncomplicated caesarian section is 5 days. They may have reason to believe that the length of stay for similar patients at their institution differs from the national average and would like to know if
their belief is correct. To study the question, they must first recast their question as a hypothesis including the stipulated variable, select a representative sample of patients from their institution, and compare data from their sample with the national average (stipulated value) using an appropriate one-sample statistical test. (The reader should note that the only variable being tested within this hypothesis is length of stay. In this case, caesarian section is only a descriptor of the target population because all data to be examined are from patients undergoing this procedure.)

Fig. 3.3 Interrelation of conceptual hypotheses, operational hypotheses, and the hypothetico-deductive method (Reprinted with permission Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic Research: Principles and Quantitative Methods, Fig. 2.2: An Idealized Conceptualization of the Scientific Method (New York: Van Nostrand Reinhold 1982), p. 35)

However, the objective of most hypotheses is not to draw inferences about population parameters but to facilitate evaluation of a proposition that two or more variables are systematically related in some manner [32].
Indeed, some methodologists recognize only the latter form of argument as a legitimate hypothesis [7, 33–35]. The simplest hypotheses about intervariable association contain two variables (bivariable hypotheses), for example:
Caffeine consumption is more frequent among smokers than nonsmokers.
Women have a higher fat-to-muscle ratio than men.
Heart attacks are more common in winter than in other seasons.
If the objective of the study is to compare the relative association of several characteristics, it usually will be necessary to construct a single hypothesis which relates three or more variables (multivariable hypotheses), for example:
Ischemia severity is a stronger predictor of cardiac events than symptom status and risk factor score.
Response to physical training is affected more by age than gender.
Improvement in health-related quality of life after cardiac surgery is influenced more by preoperative symptoms than by ventricular performance or geometry.
The number and type of variables contained within the hypothesis (as well as the nature of the proposed association) will dictate the study design, measurement procedures, and statistical analysis of the results. These concepts are addressed in Chaps. 5 and 11.
3. Hypotheses of Causality Versus Association or Difference
The relation posited between variables may be cast as one of cause-and-effect, in which case the researcher hypothesizes that one variable affects or influences the other(s) in some manner. For example:
Estrogen produces an increase in coronary flow.
Smoking promotes lung cancer.
Patient education improves compliance.
Coronary artery bypass grafting causes a reduction in the number of subsequent cardiac events.
However, hypotheses often are not written this way because support for a cause-and-effect relation requires not only biological plausibility and a strong statistical result but also an appropriate (and usually rigorous) study design. If the investigator believes that the variables are related, but prefers not to speculate on the influence of one variable on another, the hypothesis may be cast to propose an association only, without explicit reference to causality. For example:
Surgical benefit is related to preoperative ischemia severity.
Exercise tolerance is correlated with chronological age.
Consumption of low-calorie beverages is associated with body weight.
Finally, hypotheses also can be written to assert that there will be a difference between levels of a variable among two or more groups of individuals or within a single group of individuals at different points in time, as shown by the following examples:
Patients enrolled in a health maintenance organization (HMO) will have a different number of hospitalizations than those enrolled in preferred provider organizations (PPOs) or traditional fee-for-service insurance plans.
Among patients undergoing mitral valve repair or replacement, left ventricular performance will be dissimilar at 1 versus 3 years after operation.
The hypothesis also can be framed so that the nature of the association (e.g., linear, curvilinear, positive, inverse, etc.) or difference (larger or smaller, better or poorer, etc.) will be specified (see below, "Alternative hypotheses [directional]").
4. Mechanistic Versus Nonmechanistic Hypotheses
Hypotheses can be written so as to provide a mechanism (i.e., an explanation) for an asserted relationship or prediction, or they can be written without defining an underlying mechanism. Mechanistic hypotheses are common in preclinical research which typically
attempts to define biochemical and physiological causes of disease or dysfunction and pathways amenable to therapeutic intervention. Shown below are two examples of mechanistic hypotheses that were evaluated in two different preclinical investigations: (Note the use of the phrase "as a result of" in the first hypothesis evaluating the impact of endothelial nitric oxide synthase [eNOS] and "due to" in the second hypothesis evaluating antagonism of endothelin [ET]-induced inotropy. Italics have been added for emphasis.)
Gender-specific protection against myocardial infarction occurs in adult female as compared to male rabbits as a result of eNOS upregulation [36].
ET-induced direct positive inotropy is antagonized in vivo by an indirect cardiodepressant effect due to a mainly ETA-mediated and ET-induced coronary constriction with consequent myocardial ischemia [37].
In clinical research, hypotheses more commonly are nonmechanistic (i.e., framed without including an explicit explanation). Shown below are two published literature examples:
Patients with medically unexplained symptoms attending the clinic of a general adult neurologist will have delayed earliest and continuous memories compared with patients whose symptoms were explained by neurological disease [38].
Patients with acute mental changes will be scanned more frequently than other elder patients [39].
The reader will note that these hypotheses do not include the mechanism for memory variations in these patient populations (first example) or the reasons why elderly patients with acute mental changes should be scanned more frequently than comparable patients without such changes (second example). In situations like this, it is critical that the justification be clear from the introductory section of the research paper or protocol.
5. Alternative Versus Null Hypotheses
The requirement that a hypothesis should be capable of corroboration or unsupportability (falsification) reflects the fact that two outcomes always can arise out of a study of any single research problem. Thus, prior to collecting and evaluating empirical evidence to resolve a problem, the investigator will posit two opposing assertions. The first assertion will indicate the supposition for which support actually is sought (e.g., that there is a difference between a population parameter and an expected value or, more commonly, that there is some form of relation between variables within a particular population); the other will indicate that there is no support for this supposition. This first type of assertion is termed the alternative hypothesis and is generally denoted HA or H1. The alternative hypothesis can be differentiated further according to its quantitative attributes. As an example, in a study evaluating the impact of beta-adrenergic antagonist treatment (β-blockade) on the incidence of recurrent myocardial infarctions (MIs), an investigator could frame three contrasting alternative hypotheses:
   1. The proportion of recurrent MIs among comparable patients treated with versus without β-blockade is different.
   2. The proportion of recurrent MIs among patients treated with β-blockade is less than that among comparable patients treated without β-blockade.
   3. The proportion of recurrent MIs among patients treated with β-blockade is greater than that among comparable patients treated without β-blockade.
The first of these statements is termed a nondirectional hypothesis because the nature of the expected relation (i.e., the direction of the intergroup difference in the proportion of recurrent infarctions) is not specified. The second and third statements are termed directional hypotheses since, in addition to positing a difference between groups, the nature of the expected difference (positive or negative) is predefined. Generally, the decision to state an alternative hypothesis in a directional versus nondirectional manner is based on theoretical considerations and/or the availability of prior empirical information. (In statistics, a
nondirectional hypothesis is usually referred to as a two-tailed or two-sided hypothesis; a directional hypothesis is referred to as a one-tailed or one-sided hypothesis.)
As noted, the hypothesis reflects a tentative conjecture which, to gain validity, ultimately must be substantiated by experience (empirical evidence). However, even objectively measured experience varies from time to time, place to place, observer to observer, and subject to subject. Thus, it is difficult to know whether an observed difference or association was produced by random variation or actually reflects a true underlying difference or association in the population of interest. To deal with the problem of uncertainty, the investigator must implicitly formulate and test what, in essence, is the logical opposite of his or her alternative hypothesis (i.e., that the population parameter is the same as the expected value or that the variables of interest are not related as posited). Thus, the investigator must attempt to set up a "straw man" to be knocked down. This construct (which need not be stated in the research report) is termed a null (or no difference) hypothesis and is designated H0. A null hypothesis asserts that any observable differences or associations found within a population are due to chance and is assumed true until contradicted by empirical evidence. In the single variable (one-sample) hypothesis, the assertion is that the parameter of interest is not different from some expected population value, whereas in a bivariable or multivariable hypothesis, the assertion is that the variables of interest are unrelated to some factor or to each other.
A null hypothesis is framed by inserting a negative modifier into the statement of the alternative hypothesis. In the examples given above, the following null statements could be developed:
   1. The proportion of recurrent MIs among comparable patients treated with versus without β-blockade is not different.
   2. The proportion of recurrent MIs among patients treated with β-blockade is not less than that among comparable patients treated without β-blockade.
   3. The proportion of recurrent MIs among patients treated with β-blockade is not greater than that among comparable patients treated without β-blockade.
Only after both the null and alternative hypotheses have been specified, and the data collected, can an appropriate test of statistical significance be performed. If the results of statistical analysis reveal that chance is an unlikely explanation of the findings, the null hypothesis is rejected and the alternative hypothesis is accepted. Under these circumstances, the investigator can conclude that there is a statistically significant relation between the variables under study (or a statistically significant difference between a parameter and an expected value). On the other hand, if chance cannot be excluded as a probable explanation for the findings, the null, rather than the alternative, hypothesis must be accepted. It is important to note that acceptance of the null hypothesis does not mean that the investigator has demonstrated a true lack of association between variables (or equation between a population parameter and an expected value) any more than a verdict of "not guilty" constitutes proof of a defendant's innocence in a legal proceeding. Indeed, in criminal law, such a verdict means only that the prosecution, upon whom the burden of proof rests, has failed to provide sufficient evidence that a crime was committed. Similarly, in research, failure to overturn a null hypothesis (particularly when the alternative hypothesis has been argued) generally is taken to mean that the investigator, upon whom the "burden of proof" (or, more appropriately, corroboration) also rests, has failed to demonstrate the expected difference or association. Null results may reflect reality, but they may also be due to measurement error and inadequate sample size. For this reason, negative studies, a term for research that yields null findings, are far less likely to gain publication than studies that demonstrate a statistically significant association [40, 41]. (See Chap. 9 for a more detailed discussion of publication bias.)
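The pairing of each alternative hypothesis with its null counterpart in the β-blockade example can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not a prescribed analysis: the patient counts are hypothetical and come from no study cited in this chapter, the pooled two-proportion z-test is only one common large-sample choice, the 0.05 threshold is a conventional assumption, and the function and variable names are invented for the example.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_proportion_z_test(events_treated, n_treated,
                          events_control, n_control,
                          alternative="two-sided"):
    """Pooled two-proportion z-test (large-sample approximation).

    alternative:
      "two-sided" -> HA: the two proportions differ (nondirectional)
      "less"      -> HA: treated proportion is less than control
      "greater"   -> HA: treated proportion is greater than control
    Returns (z statistic, p-value).
    """
    p_treated = events_treated / n_treated
    p_control = events_control / n_control
    # Under H0 the two groups share one proportion, so the data are pooled.
    pooled = (events_treated + events_control) / (n_treated + n_control)
    std_error = math.sqrt(pooled * (1.0 - pooled)
                          * (1.0 / n_treated + 1.0 / n_control))
    z = (p_treated - p_control) / std_error

    if alternative == "two-sided":
        p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    elif alternative == "less":
        p_value = normal_cdf(z)
    elif alternative == "greater":
        p_value = 1.0 - normal_cdf(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'less', or 'greater'")
    return z, p_value


# Hypothetical counts (assumed for illustration only):
# 30 recurrent MIs among 200 patients treated with beta-blockade,
# 50 recurrent MIs among 200 comparable patients treated without it.
ALPHA = 0.05  # conventional significance threshold (an assumption)

for alt in ("two-sided", "less", "greater"):
    z, p = two_proportion_z_test(30, 200, 50, 200, alternative=alt)
    decision = "reject H0" if p < ALPHA else "fail to reject H0"
    print(f"HA = {alt:9s}  z = {z:.2f}  p = {p:.4f}  -> {decision}")
```

With these hypothetical counts, the nondirectional alternative and the directional alternative positing fewer recurrent MIs under β-blockade both lead to rejection of the null hypothesis, whereas the opposite directional alternative does not. As the text stresses, a failure to reject the null hypothesis is not proof that no difference exists, and the direction of a one-sided alternative should be fixed on theoretical or prior empirical grounds before the data are examined.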
Constructing the Hypothesis: Differentiating Among Variables

As indicated earlier, hypotheses most commonly entail statements about variables. Variables, in turn, can be differentiated according to their level of measurement (or scaling characteristics) or the role that they play in the hypothesis.

Level of Measurement
Variables can be classified according to how well they can be measured (i.e., the amount of information that can be obtained in a given measurement of an attribute). One factor that determines the informational characteristics of a variable is the nature of its associated measurement scale, that is, whether it is nominal, ordinal, interval, or ratio, a classification system framed in 1946 by Stevens [42]. Understanding these distinctions is important because scaling characteristics influence the nature of the statistical methods that can be used for analyzing data associated with a variable.
1. The Nominal Variable
Nominal variables represent names or categories. Examples include blood type, gender, marital status, hair color, etiology, presence versus absence of a risk factor or disease, and vital status. Nominal variables represent the weakest level of measurement as they have no intrinsic order or other mathematical properties and allow only for qualitative classification or grouping. Their lack of mathematical properties precludes calculation of measures of central tendency such as means or medians (only the mode, the most frequent category, can be reported) or of dispersion. When all variables in a hypothesis are nominal, this limits the types of statistical operations that can be performed to tests involving cross-classification (e.g., tests of differences between proportions). Sometimes, variables that are on an ordinal, interval, or ratio scale are transformed into nominal categories using cutoff points (e.g., age in years can be recoded into old versus young; height in meters to tall versus short; left ventricular ejection fraction in percent to normal versus subnormal).
2. The Ordinal Variable
Ordinal variables are considered to be semiquantitative. They are similar to nominal variables in that they are comprised of categories, but their categories are arranged in a meaningful sequence (rank order), such that successive values indicate more or less of some quantity (i.e., relative magnitude). Typical examples of ordinal variables include socioeconomic status, tumor classification scores, New York Heart Association (NYHA) functional class for angina or heart failure, disease severity, birth order, perceived level of pain, and all opinion survey scores. However, distances between scale points are arbitrary. For example, a patient categorized as NYHA functional class IV may have more symptomatic debility than one categorized as functional class II, but he or she does not necessarily have twice as much debility; indeed, he or she may have considerably more than twice as much debility. Appropriate measures of central tendency for ordinal variables are the mode and median (rather than the mean or arithmetic average) or percentile. Similarly, hypothesis tests of subgroup differences based on ordinal outcome variables are limited to nonparametric approaches employing analysis of ranks or sums of ranks.
3. The Interval Variable
Interval variables, like ratio variables (below), are considered quantitative or metric variables because they answer the question "how much?" or "how many?" Both may take on positive or negative values. A common example of an interval variable is temperature on a Celsius or Fahrenheit scale. Both interval and ratio variables provide more precise information than ordinal variables because the distances between successive data values represent true, equal, and meaningful intervals. For example, the difference between 70°F and 80°F is equivalent to the difference between 80°F and 90°F. However, the zero point on an interval scale is arbitrary (note, freezing on a Celsius scale is 0° but is 32° on a Fahrenheit scale) and does not necessarily
connote absence of a property (in this case, absence of kinetic energy). When analyzing interval data, one can add or subtract but not multiply or divide. Most statistical operations are permissible, including calculation of measures of central tendency (e.g., mean, median, or mode), measures of dispersion (e.g., standard deviation, standard error of the mean, range), and performance of many statistical tests of hypotheses including correlation, regression, t-tests, and analysis of variance. However, due to the absence of a true zero point, ratios between values on an interval scale are not meaningful (though ratios of differences can be computed).
4. The Ratio Variable
As with interval variables, the distances between successive values on a ratio scale are equal. However, ratio variables reflect the highest level of measurement because they contain a true, nonarbitrary zero point that reflects complete absence of a property. Examples of ratio variables include temperature on a Kelvin scale (where zero reflects absence of kinetic energy), mass, length, volume, weight, and income. When ratio data are analyzed, all arithmetic operations are available (i.e., addition, subtraction, multiplication, and division). The same statistical operations that can be performed with interval variables can be performed with ratio variables. However, ratio variables also permit meaningful calculation of absolute and relative (or ratio) changes in a variable and computation of geometric and harmonic means, coefficients of variation, and logarithms.
Quantitative variables (interval or ratio) can be either continuous or discrete. Continuous variables (e.g., weight, height, temperature) differ from discrete variables in that the former may take on any conceivable value within a given range, including fractional values or decimal values. For example, within the range 150–151 lbs, an individual theoretically can weigh 150 lbs, 150.5 lbs or 150.95 lbs, though the capacity to distinguish between these values clearly is limited by the precision of the measurement device. In contrast, discrete variables (e.g., number of dental caries, number of white cells per cubic centimeter of blood, number of readers of medical journals, or other count-based data) can take on only whole numbers. Nominal and ordinal variables are intrinsically discrete, though in some disciplines (e.g., behavioral sciences), ordinally scaled data often are treated as continuous variables. This practice is considered reasonable when ordinal data intuitively represent equivalent intervals (e.g., visual analogue scales), when they contain numerous (e.g., 10 or more) possible scale values or "orderings" [43], or when shorter individual measurement scales are combined to yield summary scores. The reader should note, however, that in other disciplines and settings, treating all data as continuous data is controversial and generally is not recommended [44].

Role in the Research Hypothesis
Another method of classifying variables is based on the specific role (function) that the variable plays in the hypothesis. Accordingly, a variable can represent (1) the putative cause (or be associated with a causal factor) that initiates a subsequent response or event, (2) the response or event itself, (3) a mediator between the causal factor and its effect, (4) a potential confounder whose influence must be neutralized, or (5) an explanation for the underlying association between the hypothesized cause and effect. Viewed this way, variables may be independent, dependent, or may serve as moderator, control, or intervening variables. Understanding these distinctions is crucial for constructing a research design, executing a statistical program, or communicating effectively with a statistician.
1. The Independent Variable
The independent variable is that attribute within an individual, object, or event which affects some outcome. The independent variable is conceptualized as an input in the study that may be manipulated by the investigator (such as a treatment in an experimental study) or reflect a naturally occurring risk factor. In either case, the independent variable is viewed as antecedent to some outcome and is presumed
to be the cause, or a predictor of that outcome, or a marker of a causal agent or risk factor. We call this type of variable "independent" because the researcher is interested only in its impact on other variables in the study rather than the impact of other variables on it. Independent variables are sometimes termed "factors" and their variations are called "levels." If, for example, an investigator were to conduct an observational study of the effects of diabetes mellitus on subsequent cardiac events, the independent variable (or factor) would be history of diabetes, and its variants (positive or negative history) would be levels of the factor. As a second example, in an intervention study examining the relative impact of inpatient versus outpatient counseling on patient morbidity after a first MI, the independent variable (factor) would be the counseling, and its variants (inpatient counseling vs. outpatient counseling) would correspond to the alternative levels of the factor. The reader should note that in both of these hypothetical examples, there was only one independent variable (or factor) and that each factor had two levels. It is possible and, in fact, common for studies to have several independent variables and for each to have multiple factor levels (indeed, the number of factor levels in dose–response studies is potentially infinite). Care needs to be exercised as researchers often mistake a factor with two levels for two factors. Levels are always components of the factor. Understanding this distinction is essential for conducting statistical tests such as analysis of variance (ANOVA).

2. The Dependent Variable
In contrast to the independent variable, the dependent variable is that attribute within an individual or its environment that represents an outcome of the study. The dependent variable is sometimes called a response variable because one can observe its presence, absence, or degree of change as a function of variation in the independent variable. Therefore, the dependent variable is always a measure of effect.

As an example, suppose that an investigator wished to study the effects of adrenal corticosteroid therapy on systolic performance among patients with heart failure. In this study, systolic performance would be the dependent variable; the investigator would measure its degree of improvement or deterioration in response to introducing versus not introducing steroid treatment. Because it is a measure of effect, the dependent variable can be observed and measured but, unlike the independent variable, it can never be manipulated.

Independent and dependent variables are relatively simple to identify within the context of a specific investigation, for example, a prospective cohort or an experimental study or a well-designed retrospective study in which one variable clearly is an input, the second is a response or effect, and an adequately defined temporal interval exists between their appearance. However, when research is cross-sectional, and variables merely are being correlated, it is sometimes difficult or impossible to infer which is independent and which is dependent. Under these circumstances, variables are often termed "covariates."

3. The Moderator Variable
Often, an independent variable does not affect all individuals in the same way, and an investigator may have reason to believe that some other variable may be involved. If he or she wishes to systematically study the effect of this other variable, rather than merely neutralize it, it may be introduced into the study design as a moderator variable (also known as an "effect modifier"). The term "moderator variable" refers to a secondary variable that is measured or manipulated by the investigator to determine whether it alters the relationship between the independent variable of central interest and the dependent (response) variable. The moderator variable may be incorporated into a multivariate statistical model to examine its interactive effects with the independent variable or it may be used to provide a basis for stratifying the sample into two or more subgroups within which the effects of the independent variable may be examined separately.
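The distinction drawn above between a factor and its levels, and the way a moderator adds a second factor to a design, can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `design_cells` is hypothetical, and pairing the counseling factor with history of diabetes as a moderator is an assumption made for the example, not a design described in the text.

```python
from itertools import product

# One factor (counseling) with two levels: still a single independent variable.
one_factor = {"counseling": ["inpatient", "outpatient"]}

# Hypothetical two-factor design: counseling plus a moderator (history of diabetes).
two_factors = {
    "counseling": ["inpatient", "outpatient"],
    "history of diabetes": ["positive", "negative"],
}

def design_cells(factors):
    """Return every combination of factor levels, one tuple per study cell."""
    return list(product(*factors.values()))

for label, design in [("one factor", one_factor), ("two factors", two_factors)]:
    cells = design_cells(design)
    print(f"{label}: {len(design)} factor(s), {len(cells)} cell(s)")
    for cell in cells:
        # Pair each level with the factor it belongs to.
        print("   ", dict(zip(design.keys(), cell)))
```

A factor with two levels yields two comparison groups, whereas two factors with two levels each yield four cells, which is one reason each added factor raises the number of subjects a study needs.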
48 P.G. Supino
Fig. 3.4 A hypothetical example of the effects of a moderator variable: influence of chronic anxiety on the impact of a new drug for patients with attention deficit hyperactivity disorder
For example, suppose a psychiatrist wishes to study the effects of a new amphetamine-type drug on task persistence in patients with attention deficit hyperactivity disorder (ADHD) who have not responded well to current medical therapy. She believes that the drug may have efficacy but suspects that its effect may be diminished by the comorbidity of chronic anxiety. Rather than give the new drug to patients with ADHD who do not also have anxiety and placebo to patients with ADHD plus anxiety, to avoid confounding, she enrolls both types of patients, randomly administers drug or placebo to members of each subgroup, and measures task persistence among all subjects at a fixed interval after onset of therapy. In this hypothetical study, the independent variable would be type of therapy (factor levels: new drug, placebo), the dependent variable would be task persistence, and chronic anxiety (presence, absence) would be the moderator. Figure 3.4 illustrates the importance of a moderator variable. If none had been used in the study, the data would have led the investigator to conclude that the new drug was ineffective as no overall treatment effect would have been observed for the ADHD group (left panel, diagonal patterned bar), with change in task persistence for the entire treated group similar to subjects on placebo (right panel). However, as noted, the new drug was not ineffective but instead was differentially effective, promoting greater task persistence among patients without associated anxiety but decreasing task persistence among those with anxiety, as hypothesized.

A cautionary note is in order. Although moderator variables can increase the yield or accuracy of information from a study, an investigator needs to be very selective in using them as each additional factor introduced into the study design increases the sample size needed to enable the impact of these secondary factors to be satisfactorily evaluated. During the study planning process, the investigator must determine the likelihood of a potential interaction, the theoretical or practical knowledge to be gained by discovery of an interaction, and decide whether sufficient resources exist for such evaluation.

4. The Control Variable
In this last example, the investigator chose to evaluate the interactive effects of a secondary variable on the relation of the independent and dependent variables. Others in similar situations might choose not to study a secondary independent variable, particularly if it is viewed as extraneous to the primary hypothesis or focus of the study. Additionally, it is impractical to examine the effects of every ancillary variable. However, extraneous variables cannot be ignored because they can confound study results and render the data uninterpretable. Variables such as these usually are treated as control variables.
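The moderator example illustrated in Fig. 3.4 can be made concrete with a small numerical sketch. The change scores below are invented solely for illustration (they are not data from the text or the figure), and the helper `mean_change` is a hypothetical name. The sketch shows how an overall treatment effect near zero can conceal opposite effects within the two anxiety strata.

```python
from statistics import mean

# Hypothetical change-in-task-persistence scores (arbitrary units).
# Each record: (therapy, chronic anxiety, change score).
records = [
    ("drug", "absent", 6), ("drug", "absent", 4),
    ("drug", "present", -4), ("drug", "present", -6),
    ("placebo", "absent", 0), ("placebo", "absent", 1),
    ("placebo", "present", 0), ("placebo", "present", -1),
]

def mean_change(therapy, anxiety=None):
    """Mean change score for a therapy arm, optionally within one anxiety stratum."""
    values = [change for t, a, change in records
              if t == therapy and (anxiety is None or a == anxiety)]
    return mean(values)

# Effect of therapy ignoring the moderator (drug minus placebo).
overall_effect = mean_change("drug") - mean_change("placebo")

# Effect of therapy within each level of the moderator.
effect_by_stratum = {
    anxiety: mean_change("drug", anxiety) - mean_change("placebo", anxiety)
    for anxiety in ("absent", "present")
}

# Interaction contrast: how much the drug effect differs between strata.
interaction = effect_by_stratum["absent"] - effect_by_stratum["present"]

print("overall effect:", overall_effect)
print("effect by stratum:", effect_by_stratum)
print("interaction contrast:", interaction)
```

With these invented numbers the pooled comparison suggests no benefit, while the stratified comparison reveals a positive effect without anxiety and a negative effect with it. A real analysis would also need a formal test of the interaction and an adequate sample size in each cell.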
A control variable is defined as any potentially confounding aspect of the study that is manipulated by the investigator to neutralize its effects on the dependent variable. Common control variables are age, gender, clinical history, comorbidity, test order, etc. In the hypothetical example given above, if the psychiatrist had wanted to control for associated anxiety and not evaluate its interactive effects, she could have chosen patients with similar anxiety levels or, had her study employed a parallel design (which it did not), she could have made certain that different treatment groups were counterbalanced for that variable.

5. The Intervening Variable
Just as the moderator variable defines when (under what conditions) the independent variable exerts its action on the dependent variable, the intervening variable may help explain how and why the independent and dependent variables are related. This can be especially important when the association between independent and dependent variables appears ambiguous. There is general consensus that the intervening variable underlies, and accounts for, the relation between the independent and dependent variable. However, historically, workers in the field have defined them in different (and often contradictory) ways [45]. For example, Tuckman describes the intervening variable as a hypothetical internal state (construct) within an individual (motivation, drive, goal orientation, intention, awareness, etc.) that theoretically affects the observed phenomenon but cannot be seen, measured, or manipulated; its effect must be inferred from the effects of the independent and moderator variables on the observed phenomenon [35]. In the previous hypothetical example which examined the interactive effects of drug treatment and anxiety on task persistence, the intervening variable was attention. In educational research, the intervening variable between an innovative pedagogical approach and the acquisition of new concepts or skills is the learning process impacted by the former. In clinical or epidemiological research, the intervening variable can represent a disease process or physiological parameter that links an exposure or purposively applied intervention to an outcome (e.g., secondhand smoking causes lung cancer by inducing lung damage; valvular surgery increases LV ejection fraction by improving contractility). Others such as Baron and Kenny [46] view an intervening variable as a factor that can be measured (directly or by operational definitions, described later in this chapter), fully derived ("abstractable") from empirical findings (data), and statistically analyzed to demonstrate its capacity to mediate the relation between the independent and dependent variables. As an example, Williamson and Schulz [47] measured and evaluated the relation between pain, functional disability, and depression among patients with cancer. They determined that the observed relation of pain to depression was due to diminution of function, operationally defined as activities of daily living (the intervening or mediating variable), which, in turn, caused depression. Similarly, Song and Lee [48] found that depression mediated the relation of sensory deficits (the independent variable in their study) to functional capacity (their dependent variable) in the elderly. (For a comprehensive discussion of mediation and statistical approaches to test for mediation, the reader is referred to MacKinnon 2008 [49].) Whether viewed as a hypothetical construct or as a measurable mediator, an intervening variable is always intermediate in the causal pathway by which the independent variable affects the dependent variable and is useful in explaining the mechanism linking these variables and, potentially, for suggesting additional interventions.

Below are two hypotheses from cardiovascular medicine in which constituent variables have been analyzed and labeled according to their role in each hypothesis.

Hypothesis 1: Among patients with heart failure who have similar clinical histories, those receiving adrenal corticosteroid treatment will demonstrate a greater improvement in systolic performance than those not receiving steroid treatment.
Fig. 3.5 Interrelation among variables in a study design
Independent variable: adrenal corticosteroid treatment
Factor levels: 2 (treatment, no treatment)
Dependent variable: systolic performance
Control variable: clinical history
Moderator variable: none
Intervening variable: change in magnitude of the inflammatory process

Hypothesis 2: Patients with angina who are treated with β-blockade will have a greater improvement in their capacity for physical activity than those of the same sex and age who are not treated with β-blockade; this improvement will vary as a function of severity of initial symptoms.
Independent variable: β-blockade treatment
Factor levels: 2 (treatment, no treatment)
Dependent variable: capacity for physical activity
Moderator variable: severity of initial symptoms
Control variables: sex and age
Intervening variable: alteration in myocardial work

In sum, many research designs, particularly those intended to test hypotheses about cause or prediction and effect, contain independent, dependent, control, and intervening variables. Some also contain moderator variables. Figure 3.5 illustrates their interrelationship.

Role of Operational Definitions

As indicated earlier, one of the characteristics of a hypothesis that sets it apart from other types of statements is that it is testable. The hypotheses discussed thus far are conceptual. A conceptual hypothesis cannot be directly tested unless it is transformed into an operational hypothesis. To accomplish this, operational definitions must be developed for each element specified in the hypothesis.

An operational definition identifies the observable characteristics of that which is being studied. Its use imparts specificity and precision to the research, enabling others to understand exactly how the hypothesis was tested. As a corollary, it enables the scientific community to evaluate the appropriateness of the methodology selected for studying the problem. Operational definitions are required because a concept, object, or situation can have multiple interpretations. While double entendre is one basis of Western humor, inconsistent (or vague) definitions within a study are not comical as they typically lead to confused findings (and readers). Imagine, for example, what might occur if one member of an investigative team, studying the relative impact of two procedures for treating hemodynamically important coronary artery disease, defined "important" as >50% luminal diameter narrowing of one or more
coronary vessels and another, working in the same study, defined it as 70% luminal diameter narrowing; or if one investigator studying new onset angina used 1 week as the criterion for "new" and another used 1 month. Operational definitions can describe the manipulations that the investigator performs (e.g., the intervention), or they can describe behaviors or responses. Still others describe the observable characteristics of objects or individuals. Once the investigator has selected appropriate operational definitions (this choice is entirely study dependent), all hypotheses in the study can be operationalized.

A hypothesis is rendered operational when its broadly (conceptually) stated variables are replaced by operational definitions of those variables. Hypotheses stated in this manner are called operational hypotheses, specific hypotheses, or predictions.

Let us consider two hypotheses previously given in this chapter:

"Patients with heart failure who are treated with adrenal corticosteroids will have better systolic performance than those who are not" is sufficiently general to be considered a conceptual hypothesis and, as such, is not directly testable. To render this hypothesis testable, the investigator could operationally define its constituent elements as follows:
Heart failure = secondary hypodynamic cardiomyopathy
Adrenal corticosteroids = cortisol
Better systolic performance = higher left ventricular ejection fractions at rest
The hypothesis, in its operational form, would state: "Patients with secondary hypodynamic cardiomyopathy who have received cortisol will have higher left ventricular ejection fractions at rest than those who have not received cortisol treatment."

Similarly, the hypothesis that "patients with angina who are treated with β-blockers will have a greater improvement in their capacity for physical activity than those not treated with β-blockers, and that this improvement will vary as a function of initial symptoms," while complex, is still general enough to be considered conceptual. To render this hypothesis testable, its constituent elements could be defined as follows:
β-blockers = propranolol (assuming that the investigator was specifically interested in this drug)
Capacity for physical activity = New York Heart Association functional class
Severity of symptoms = angina class 1–2 versus angina class 3–4
This hypothesis, in its operational form, would be stated: "Patients with angina who are treated with propranolol will have greater improvement in New York Heart Association functional class than those not treated with propranolol, and this improvement will vary as a function of initial angina class (1–2 vs. 3–4)." In this form, the hypothesis could be directly tested, although the investigator would still need to specify measurement criteria and develop an appropriate design.

Any element of a hypothesis can have more than one operational definition and, as noted, it is the investigator's responsibility to select the one that is most suitable for his or her study. This is an important judgment because the remaining research procedures (i.e., specification of subject inclusion/exclusion criteria, the nature of the intervention and outcome measures, and data analysis methodology) are derived from operational hypotheses. Investigators must be careful to use a sufficient number of operational definitions so that reviewers will have a basis upon which to judge the appropriateness of the methodology outlined in submitted grant proposals and manuscripts, so that other investigators will be able to replicate their work, and so that the general readership can understand precisely what was done and have sufficient information to properly interpret findings.

Once operational definitions have been developed and the hypothesis has been restated in operational form, the investigator can conduct the study. The next step will be to select a research design that can yield data to support optimal statistical hypothesis testing. The strengths, weaknesses, and requirements of various study designs will be discussed in Chaps. 4 and 5.
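The substitution that turns a conceptual hypothesis into an operational one can be mimicked with a simple lookup table. The sketch below is purely illustrative: the function name `operationalize` and the dictionary layout are assumptions, the sentence wording is lightly adapted so that plain string replacement reads smoothly, and the angina class ranges are written as 1–2 and 3–4 on the assumption that those are the intended groupings.

```python
# Operational definitions for the β-blocker hypothesis discussed above.
operational_definitions = {
    "β-blockers": "propranolol",
    "capacity for physical activity": "New York Heart Association functional class",
    "initial symptoms": "initial angina class (1–2 vs. 3–4)",
}

conceptual = (
    "Patients with angina who are treated with β-blockers will have a greater "
    "improvement in their capacity for physical activity than those not treated "
    "with β-blockers, and this improvement will vary as a function of initial symptoms."
)

def operationalize(hypothesis, definitions):
    """Replace each conceptual term with its operational definition."""
    for term, definition in definitions.items():
        hypothesis = hypothesis.replace(term, definition)
    return hypothesis

operational = operationalize(conceptual, operational_definitions)
print(operational)

# Any conceptual term still present would signal an element left undefined.
undefined = [term for term in operational_definitions if term in operational]
print("terms still undefined:", undefined)
```

Keeping the definitions in one table makes it easy to see that every element of the hypothesis has been defined, and to swap in an alternative definition when a different measure suits the study better.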
Take-Home Points
• A hypothesis is a logical construct, interposed between a problem and its solution, which represents a proposed answer to a research question. It gives direction to the investigator's thinking about the problem and, therefore, facilitates a solution.
• There are three primary modes of inference by which hypotheses are developed: deduction (reasoning from a general proposition to specific instances), induction (reasoning from specific instances to a general proposition), and abduction (formulation/acceptance on probation of a hypothesis to explain a surprising observation).
• A research hypothesis should reflect an inference about variables; be stated as a grammatically complete, declarative sentence; be expressed simply and unambiguously; provide an adequate answer to the research problem; and be testable.
• Hypotheses can be classified as conceptual versus operational, single versus bi- or multivariable, causal or not causal, mechanistic versus nonmechanistic, and null or alternative.
• Hypotheses most commonly entail statements about variables which, in turn, can be classified according to their level of measurement (scaling characteristics) or according to their role in the hypothesis (independent, dependent, moderator, control, or intervening).
• A hypothesis is rendered operational when its broadly (conceptually) stated variables are replaced by operational definitions of those variables. Hypotheses stated in this manner are called operational hypotheses, specific hypotheses, or predictions and facilitate testing.
References

1. De Morgan A, De Morgan S. A budget of paradoxes. London: Longmans Green; 1872.
2. Leedy PD. Practical research. Planning and design. 2nd ed. New York: Macmillan; 1960.
3. Bernard C. Introduction to the study of experimental medicine. New York: Dover; 1957.
4. Erren TC. The quest for questions - on the logical force of science. Med Hypotheses. 2004;62:635–40.
5. Peirce CS. Collected papers of Charles Sanders Peirce, vol. 7. In: Hartshorne C, Weiss P, editors. Boston: The Belknap Press of Harvard University Press; 1966.
6. Aristotle. The complete works of Aristotle: the revised Oxford Translation. In: Barnes J, editor. vol. 2. Princeton/New Jersey: Princeton University Press; 1984.
7. Polit D, Beck CT. Conceptualizing a study to generate evidence for nursing. In: Polit D, Beck CT, editors. Nursing research: generating and assessing evidence for nursing practice. 8th ed. Philadelphia: Wolters Kluwer/Lippincott Williams and Wilkins; 2008. Chapter 4.
8. Jenicek M, Hitchcock DL. Evidence-based practice. Logic and critical thinking in medicine. Chicago: AMA Press; 2005.
9. Bacon F. The novum organon or a true guide to the interpretation of nature. A new translation by the Rev G.W. Kitchin. Oxford: The University Press; 1855.
10. Popper KR. Objective knowledge: an evolutionary approach (revised edition). New York: Oxford University Press; 1979.
11. Morgan AJ, Parker S. Translational mini-review series on vaccines: the Edward Jenner Museum and the history of vaccination. Clin Exp Immunol. 2007;147:389–94.
12. Pead PJ. Benjamin Jesty: new light in the dawn of vaccination. Lancet. 2003;362:2104–9.
13. Lee JA. The scientific endeavor: a primer on scientific principles and practice. San Francisco: Addison-Wesley Longman; 2000.
14. Allchin D. Lawson's shoehorn, or should the philosophy of science be rated 'X'? Science and Education. 2003;12:315–29.
15. Lawson AE. What is the role of induction and deduction in reasoning and scientific inquiry? J Res Sci Teach. 2005;42:716–40.
16. Peirce CS. Collected papers of Charles Sanders Peirce, vol. 2. In: Hartshorne C, Weiss P, editors. Boston: The Belknap Press of Harvard University Press; 1965.
17. Bonfantini MA, Proni G. To guess or not to guess? In: Eco U, Sebeok T, editors. The sign of three: Dupin, Holmes, Peirce. Bloomington: Indiana University Press; 1983. Chapter 5.
18. Peirce CS. Collected papers of Charles Sanders Peirce, vol. 5. In: Hartshorne C, Weiss P, editors. Boston: The Belknap Press of Harvard University Press; 1965.
19. Flach PA, Kakas AC. Abductive and inductive reasoning: background issues. In: Flach PA, Kakas AC, editors. Abduction and induction. Essays on their relation and integration. The Netherlands: Kluwer; 2000. Chapter 1.
20. Murray JF. Voltaire, Walpole and Pasteur: variations on the theme of discovery. Am J Respir Crit Care Med. 2005;172:423–6.
21. Danermark B, Ekstrom M, Jakobsen L, Karlsson JC. Methodological implications, generalization, scientific inference, models (Part II). In: Explaining society. Critical realism in the social sciences. New York: Routledge; 2002.
22. Pasteur L. Inaugural lecture as professor and dean of the faculty of sciences. In: Peterson H, editor. A treasury of the world's greatest speeches. Douai, France: University of Lille 7 Dec 1854.
23. Swinburne R. Simplicity as evidence for truth. Milwaukee: Marquette University Press; 1997.
24. Sarkar S, editor. Logical empiricism at its peak: Schlick, Carnap and Neurath. New York: Garland; 1996.
25. Popper K. The logic of scientific discovery. New York: Basic Books; 1959. 1934, trans. 1959.
26. Caws P. The philosophy of science. Princeton: D. Van Nostrand Company; 1965.
27. Popper K. Conjectures and refutations. The growth of scientific knowledge. 4th ed. London: Routledge and Kegan Paul; 1972.
28. Feyerabend PK. Against method, outline of an anarchistic theory of knowledge. London, UK: Verso; 1978.
29. Smith PG. Popper: conjectures and refutations (Chapter IV). In: Theory and reality: an introduction to the philosophy of science. Chicago: University of Chicago Press; 2003.
30. Blystone RV, Blodgett K. WWW: the scientific method. CBE Life Sci Educ. 2006;5:7–11.
31. Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiological research. Principles and quantitative methods. New York: Van Nostrand Reinhold; 1982.
32. Fortune AE, Reid WJ. Research in social work. 3rd ed. New York: Columbia University Press; 1999.
33. Kerlinger FN. Foundations of behavioral research. 1st ed. New York: Holt, Rinehart and Winston; 1970.
34. Hoskins CN, Mariano C. Research in nursing and health. Understanding and using quantitative and qualitative methods. New York: Springer; 2004.
35. Tuckman BW. Conducting educational research. New York: Harcourt, Brace, Jovanovich; 1972.
36. Wang C, Chiari PC, Weihrauch D, Krolikowski JG, Warltier DC, Kersten JR, Pratt Jr PF, Pagel PS. Gender-specificity of delayed preconditioning by isoflurane in rabbits: potential role of endothelial nitric oxide synthase. Anesth Analg. 2006;103:274–80.
37. Beyer ME, Slesak G, Nerz S, Kazmaier S, Hoffmeister HM. Effects of endothelin-1 and IRL 1620 on myocardial contractility and myocardial energy metabolism. J Cardiovasc Pharmacol. 1995;26(Suppl 3):S150–2.
38. Stone J, Sharpe M. Amnesia for childhood in patients with unexplained neurological symptoms. J Neurol Neurosurg Psychiatry. 2002;72:416–7.
39. Naughton BJ, Moran M, Ghaly Y, Michalakes C. Computer tomography scanning and delirium in elder patients. Acad Emerg Med. 1997;4:1107–10.
40. Easterbrook PJ, Berlin JA, Gopalan R, Matthews DR. Publication bias in clinical research. Lancet. 1991;337:867–72.
41. Stern JM, Simes RJ. Publication bias: evidence of delayed publication in a cohort study of clinical research projects. BMJ. 1997;315:640–5.
42. Stevens SS. On the theory of scales and measurement. Science. 1946;103:677–80.
43. Knapp TR. Treating ordinal scales as interval scales: an attempt to resolve the controversy. Nurs Res. 1990;39:121–3.
44. The Cochrane Collaboration. Open Learning Material. www.cochrane-net.org/openlearning/html/mod14-3.htm. Accessed 12 Oct 2009.
45. MacCorquodale K, Meehl PE. On a distinction between hypothetical constructs and intervening variables. Psychol Rev. 1948;55:95–107.
46. Baron RM, Kenny DA. The moderator-mediator variable distinction in social psychological research: conceptual, strategic and statistical considerations. J Pers Soc Psychol. 1986;51:1173–82.
47. Williamson GM, Schulz R. Activity restriction mediates the association between pain and depressed affect: a study of younger and older adult cancer patients. Psychol Aging. 1995;10:369–78.
48. Song M, Lee EO. Development of a functional capacity model for the elderly. Res Nurs Health. 1998;21:189–98.
49. MacKinnon DP. Introduction to statistical mediation analysis. New York: Routledge; 2008.
4 Design and Interpretation of Observational Studies: Cohort, Case–Control, and Cross-Sectional Designs

Martin L. Lesser
M.L. Lesser, PhD
Biostatistics Unit, Departments of Molecular Medicine and Population Health, Feinstein Institute for Medical Research, Hofstra North Shore-LIJ School of Medicine, 350 Community Drive, 1st floor, Manhasset, NY 11030, USA
e-mail: mlesser@nshs.edu

Introduction

Perhaps one of the most common undertakings in biomedical research is to determine whether there is an association between a particular factor (usually referred to as a "risk factor") and an "event." That event might be a disease (e.g., lung cancer) or an outcome in subjects who already have a disease (e.g., sudden death among subjects with valvular heart disease). For example, an investigator might want to know whether smoking is a risk factor for lung cancer or whether oral contraceptive use is a risk factor for myocardial infarction in women. These kinds of research questions are often answered using specific types of research designs, the two most common being the case–control and cohort study designs. (In this chapter, we will use the term "disease" interchangeably with "disease outcome" as both represent endpoints of interest.)

While both types of study designs aim to answer the same kind of research question, the method of conducting these designs is quite different. For example, in a cohort study (more specifically a "prospective" cohort study, to be further elucidated below), we might start out by gathering together hundreds of college students who are smokers and follow them over their lifetimes to see what fraction develop lung cancer (i.e., estimate the incidence rate). Likewise, we might follow a similar cohort of college nonsmokers to determine their lung cancer incidence rate. In the end, we would compare the incidence rates of lung cancer and, using appropriate statistical methodology, determine whether the incidence rates were significantly different from one another, thereby supporting or not supporting the hypothesis that smoking is associated with lung cancer.

On the other hand, in the case–control design, we would begin by selecting individuals who have a diagnosis of lung cancer ("cases") and a group of appropriate individuals without lung cancer ("controls") and "look back in time" to see how many smokers there were in each of the two groups. We would then, once again, using appropriate statistical methods, compare the prevalence rates of a smoking history to determine whether such an association between smoking and lung cancer is supported by the data.

Thus, the essential difference between these two study designs is that, in the cohort design, we first identify subjects with and without a given risk factor and then follow them forward in time to determine the respective disease incidence rates, whereas in the case–control design, we first identify subjects with and without the disease and then determine the fraction with the risk factor in each group.
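The difference in direction between the two designs can be sketched in a few lines of Python. All counts below are hypothetical and chosen only for illustration, and the function names are assumptions. The ratio measures returned by each function are common summaries added here for concreteness; the passage above refers only to the incidence and prevalence rates themselves.

```python
def cohort_summary(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Cohort design: start from exposure status, follow forward, compare incidence."""
    incidence_exposed = cases_exposed / n_exposed
    incidence_unexposed = cases_unexposed / n_unexposed
    return {
        "incidence_exposed": incidence_exposed,
        "incidence_unexposed": incidence_unexposed,
        "incidence_ratio": incidence_exposed / incidence_unexposed,
    }

def case_control_summary(exposed_cases, n_cases, exposed_controls, n_controls):
    """Case-control design: start from disease status, look back at exposure."""
    odds_cases = exposed_cases / (n_cases - exposed_cases)
    odds_controls = exposed_controls / (n_controls - exposed_controls)
    return {
        "exposure_prevalence_cases": exposed_cases / n_cases,
        "exposure_prevalence_controls": exposed_controls / n_controls,
        "exposure_odds_ratio": odds_cases / odds_controls,
    }

# Hypothetical cohort: 500 smokers and 500 nonsmokers followed for lung cancer.
cohort = cohort_summary(cases_exposed=30, n_exposed=500,
                        cases_unexposed=5, n_unexposed=500)

# Hypothetical case-control study: 100 cases and 100 controls asked about smoking.
case_control = case_control_summary(exposed_cases=80, n_cases=100,
                                    exposed_controls=40, n_controls=100)

for name, summary in [("cohort", cohort), ("case-control", case_control)]:
    print(name)
    for key, value in summary.items():
        print(f"    {key}: {value:.3f}")
```

Note that the cohort function can estimate disease incidence because subjects are sampled by exposure, whereas the case-control function can only describe how common the exposure is among cases and controls, since the investigator fixes the numbers with and without disease.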
56 M.L. Lesser
In both of these study designs, the timing of the suspected risk factor exposure in relation to
the development or diagnosis of the disease is important. Both study designs consider the
situation where exposure to the risk factor precedes the disease. While such designs cannot prove
causality (as will be discussed below), this ordering of exposure and disease is a necessary
condition for causality.

A third type of commonly used observational design is the cross-sectional study. As will be
discussed below, this design does not specifically examine the timing of exposure and disease.

It should be pointed out that case-control and cohort study designs are not necessarily restricted
to the study of risk factors for a disease, per se. For example, if we wanted to conduct a study to
determine risk factors for a patient dropping out of a clinical trial, we could select "cases" to
be those who dropped out of a clinical trial, and "controls" would be those who did not drop out
of the clinical trial. Of course, dropping out of a clinical trial is not a disease (we might refer
to it as an "outcome"), yet it can be studied in the context of a case-control study design.

The case-control, cohort, and cross-sectional studies are considered observational study designs,
which means that no particular therapeutic or other interventions are being purposively applied to
the subjects of the study. The subjects of the study simply are being observed in their natural
settings to determine, in this example, how many developed lung cancer or how many were smokers. A
study design where an intervention is purposively applied to subjects to determine, for example,
whether one treatment modality is better than another would be called an experimental design or,
more specific to biomedical research, a clinical trial in which the intervention (e.g., drug,
device, etc.) is assigned to the subject as per protocol. (For detailed discussions of studies of
interventions and how to prepare for them, the reader is referred to Chaps. 5 and 6.)

The important issue of whether to choose a case-control or cohort study design for a particular
research study will be discussed later in this chapter. Each has relative advantages and
disadvantages that need to be weighed when such a choice is being considered.

Cohort Studies

Basic Notation

In the most general setting, we will hypothesize that exposure (E) to a particular agent,
environmental factor, gene, life event, or some other specific factor increases the risk of
developing a particular disease (D) or condition. Perhaps, a better way to state the hypothesis
would be that exposure is associated with the disease.

More formally, we might use the following hypothesis testing notation:

H0: Exposure to the factor is not associated with an increased risk of developing the disease.
HA: Exposure to the factor increases the risk of developing the disease.

In statistical terms, H0 and HA are the null and alternative hypotheses, respectively. (A
discussion of hypothesis specification and testing can be found in Chaps. 3 and 11 in this text.)
As in most hypothesis testing problems, the objective is to refute the null hypothesis and
demonstrate support for the alternative hypothesis.

It is important to note that the hypotheses relating E and D do not use the word "cause" because,
in observational studies, we cannot prove causality; we can only hope to show that an association
exists between E and D, which may not necessarily be causal. We will have more to say about
establishing causality from observational studies later in this chapter.

Selection of Exposed Subjects

In order to conduct a cohort study, one must first select subjects who have been exposed to the
hypothesized risk factor. It is not the purpose of this chapter to provide detailed guidance on
alternative sampling methodologies, which are discussed in greater detail in Chap. 10. Here, our
goal is to provide general guidance as to how to sample subjects and from where they might be
4 Design and Interpretation of Observational Studies 57
sampled, with the specific details left to the reader in consultation, perhaps, with a
statistician or epidemiologist.

Definition of the Exposure

To select exposed subjects, there must be a clear definition of what it means to be, or have been,
exposed to the risk factor under study. Suppose, for example, a study was conducted to determine
the effect of exposure to heavy metals (e.g., gold, silver, etc.) on semen and sperm quality in men
during their peak reproductive years. We might enlist the support of a company that works with
heavy metals in a factory setting and then obtain seminal fluid samples from men working in that
factory. However, we would still need to know what it means to be exposed. Exposure can be
defined in many ways. For example, just working in that factory environment for at least 6 months
might be one definition of exposure; another definition might involve the direct measurement of
heavy metal particles in the factory or on a detector worn by each factory worker, from which a
determination of exposure might be made based on some minimum threshold exposure level indicated
on the detector. If one were to study the effect of cigarette smoking in pregnant women on the
birth weight of newborns, once again, one would need to have a definition of what it means to be a
smoker during pregnancy: is having smoked one cigarette during pregnancy enough to define the
smoking status, or does it need to be a more consistent and higher frequency of cigarettes during
the pregnancy? As for measurability, it is desirable but not always possible to define exposure
based on some directly measurable quantity.

Sources of Exposed Subjects

Where might exposed subjects be found? Certainly, in the prior example of occupational exposure,
one might look to identify potentially exposed subjects from the roster of companies in certain
lines of manufacturing or other work, labor unions, or other organizations or groups of
individuals that would be associated with a particular occupation and, potentially, with such an
exposure. One also might enroll persons living near environmental hazards or persons with certain
lifestyles, such as those who regularly attend an exercise gym. In an epidemiologic study of
long-term effects of prescription drugs, one might utilize a roster or list of individuals who
have been prescribed a certain type of drug. When selecting cohorts of exposed subjects, an
attempt should be made to select these cohorts for their ability to facilitate the collection of
relevant data, possibly over a long period of time. For example, there are several large-scale
prospective cohort studies that involve physicians [1, 2].

Sources of Exposure Information

To determine whether or not a subject has been exposed to a particular risk factor, the
investigator has several sources of information that might be used for making this determination
(Table 4.1). First, preexisting records (medical charts, school records, etc.) might be used for
determining whether a particular exposure occurred. While preexisting records may be easy and
inexpensive to retrieve, they may be inaccurate with respect to the information that an
investigator needs in his or her research investigation because data in the chart were not
collected with this research study in mind; rather, the data were collected for clinical reasons
only.

Table 4.1 Sources of exposure information
  Preexisting records
  Interviews, questionnaires
  Direct physical examination or tests
  Direct measurement of the environment
  Daily logs

A second source of exposure information, which represents an improvement upon preexisting
records, is self-reported information (e.g., interviews or questionnaires that may be administered
to prospective participants in the cohort study). This approach allows the investigator some
flexibility about which questions should be asked and how they should be asked, which might not be
available in preexisting records. Of course, conducting interviews or administering questionnaires
has associated costs that may be substantially greater than retrieving preexisting records or
charts.
Beyond direct interviews and questionnaires, the investigator also can perform physical
examinations or tests on individual subjects to determine certain exposures. Direct measurement
of environmental variables (e.g., in an occupational exposure type of cohort study) also would be
reasonable. Of course, these approaches to determining exposure status generally have higher
associated costs and logistical difficulties than do interviews, questionnaires, or use of
preexisting records. Finally, the investigator might ask subjects to maintain daily logs of
certain activities, environmental exposures, foods, etc., in order to determine levels of exposure
over time. Daily logs have the advantage of providing information on a detailed and regular basis
but have the shortcoming of being inaccurate due to the self-report nature of a daily log.

In summary, there are many sources of exposure information available to the investigator. The use
of a particular source depends on its relative advantages and disadvantages with respect to
accuracy, feasibility, and cost.

Selection of Unexposed Subjects (Controls)

The control group for the exposed subjects should comprise individuals who have been unexposed to
the factor being studied. As will be seen for case-control studies, the proper selection of a
control group can be a difficult task.

First, one must have a good definition of exposure in order to operationalize the definition of
"unexposed." Obviously, we want the unexposed subjects to be free of the exposure in question but
similar to the exposed cohort in all other respects. How an investigator would determine the
exposure status of a potential control certainly depends on the type of exposure one is studying.
In the example of heavy metal exposure given above, one would probably administer some sort of
interview to determine whether the potential control has ever been in an occupation or an
environmental situation where there might have been heavy metal exposure. Additionally, there are
different degrees of exposure to a risk factor that would need to be considered. For example, in
our hypothetical study on heavy metal exposure and male fertility, it might be convenient to
select controls from the business offices of the same company, which might be located at some
distance from the factory. However, if one were to select office workers as potential unexposed
controls, the investigator would have to be careful that those potential controls are not
regularly exposed to the heavy metals in the factory. This could happen if, for example, the vice
president for quality control, who worked in the business office, made daily tours of the factory
and, therefore, was exposed (albeit to a small amount) to the heavy metals.

Sources of Outcome Information

Once a cohort study is underway, it is essential for the investigator to determine whether the
particular outcome has or has not occurred. Once again, there are various sources of information
(Table 4.2), each of which has its advantages and disadvantages from a logistical, cost, and
accuracy perspective. Death certificates often are used to determine cause of death and
comorbidities at the time of death for a participating subject. Unfortunately, death certificates
can be inaccurate with regard to the specific details of cause of death and, of course, may not
capture information about other outcomes that the investigator is seeking.

Table 4.2 Sources of outcome information
  Death certificates
  Physician and hospital records
  Disease registries
  Self-report
  Direct physical examination or tests

Physician and hospital records represent good sources of outcome information provided that the
subject has maintained contact in that particular health-care or physician system. If the outcome
in question was whether a patient suffered a myocardial infarction (MI), there is no guarantee
that the patient will be seen for that MI at the investigator's hospital, and therefore, the
investigator
may not have access to that information based on his or her immediate hospital records.

Disease registries can be useful sources of information, but, once again, they are very similar
to physician and hospital records in that disease registries are often specific to a particular
hospital or large regional health area. Also, confidentiality issues may preclude the ability to
access records in disease registries for subjects.

Self-report (described in detail in Chap. 8) is a relatively inexpensive and logistically simple
method for determining outcome but can be inaccurate because patients may not be cognizant of the
subtleties of various diseases or outcomes that have been diagnosed. However, written permission
from the patient sometimes can be obtained for the investigator to contact the patient's
physicians and hospital records in order to make definite ascertainment of whether or not an
outcome occurred.

Finally, direct physical examination or tests conducted on the subject might reveal whether an
outcome has occurred, of course, depending on the nature of the outcome being studied. Once again,
this type of information might be very accurate but could be costly or logistically difficult to
obtain in all subjects.

In sum, different sources of outcome information have their advantages and disadvantages relative
to accuracy, logistics, and cost and should be weighed carefully by the investigator in designing
a cohort study.

Confounding in Cohort Studies

Nature of the Problem

While the identification of a potential unexposed group might seem rather straightforward in many
study designs, there is always an underlying problem in the choice of these unexposed controls,
i.e., confounding. Essentially, confounding can be described in two ways. It is the phenomenon
that occurs when an exposure and a disease are not associated but a third variable (known as the
confounding variable) makes it appear that there is an association between exposure and disease.
Conversely, confounding can also occur when a third variable makes it appear that there is no
association between an exposure and a disease when, in fact, there is.

Before providing concrete examples of confounding, it is important to formally define the
concept. Let E denote the exposure and D denote the disease being studied. A third factor, F, is
called a confounding variable if it meets two criteria: (1) F is associated with exposure, E; and
(2) independent of exposure, F is associated with the risk of developing the disease, D. It
should be emphasized that a confounding factor, F, must meet both of these conditions in order to
be a confounder. Often, in error, research investigators treat variables as confounders when they
only meet one of those criteria (Table 4.3).

Table 4.3 Criteria for confounding
  1. The presumed confounder (F) is associated with the exposure (E)
  2. Independent of exposure, F must be associated with the risk of disease (D)

As an example of confounding, suppose that an investigator wished to determine whether smoking
during pregnancy was a risk factor for an adverse outcome (defined as spontaneous abortion or low
birth weight). The investigator would recruit two cohorts of pregnant women, one whose members
smoke while pregnant and the other whose members do not. (The finer details of how to identify and
recruit these cohorts are not within the scope of this chapter.) The two cohorts are then followed
through their pregnancies, and the rates of adverse outcomes are compared (using a measure known
as relative risk, which will be described later). Further, suppose that the investigator does
find an increased risk of adverse outcomes in the smoking group. He submits his results to a
peer-reviewed journal but is unsuccessful in gaining publication because one of the reviewers
notes that the explanation for the increased risk may not be due to smoking, but, rather, to the
effect of a confounding variable, namely, educational status. Why might educational status be a
confounder? First, individuals with low educational levels are more likely to be smokers. (This
satisfies criterion #1 of the definition of confounding.) Second, irrespective
of smoking, women with low educational levels are at greater risk for adverse maternal-fetal
outcomes. (This satisfies criterion #2.) Thus, it is unclear whether the increased risk is
attributable to smoking, educational level, or both. How does one eliminate the effect of a
confounding variable?

Minimizing Confounding by Matching

One solution to the confounding problem in cohort studies is to match the exposed and unexposed
cohorts on the confounding variables. (This approach will be discussed in greater detail later on
in the section on case-control designs.) For example, a smoker who did not achieve a high school
education would be paired (or matched) with a nonsmoker who was also a non-high school graduate.
By matching in this way, the representation of education level will wind up being identical in
both cohorts; thus, the effect of the confounding variable is eliminated. Of course, matching
could be carried out for multiple confounders, but usually, only two or three are considered for
practical reasons.

Although matching exposed and unexposed subjects on confounding variables is theoretically
desirable, such matching often is not carried out in cohort studies due to sample size, expense,
and logistics. Many cohort studies are rather large, and to perform matching can be practically
difficult. Matching in small cohort studies also may be limited by the sample size in that it may
be difficult to find appropriate matches for the exposed subjects.

Typically, in cohort studies, confounding variables are dealt with in the statistical analysis
phase where adjustments can be made for these variables as covariates in a statistical regression
model. Also, it should be pointed out that in cohort studies, which often are conducted over a
long period of time, a subject's confounding variable may change over time, and a more complicated
accounting for that change would need to be dealt with in the analysis phase. Matching is more
common in case-control studies and will be discussed in greater detail below.

Sources of Bias in Cohort Studies

As in any type of study design, there are potential flaws (or biases) that may creep into the
study design and affect interpretation of the results. As also noted in Chaps. 5 and 8, bias
refers to an error in the design or execution of a study that produces results that are distorted
in one direction or another due to systematic factors. In other words, bias causes us to draw
(incorrect) inferences based on faulty assumptions about the nature of the data.

There are many types of bias that can occur in research designs. Given in Table 4.4 are some of
the more common types that would be encountered in cohort studies. (See Hennekens and Buring
1987 [3] for a more complete description.)

Table 4.4 Bias and related problems in cohort studies
  1. Exposure misclassification bias
  2. Change in exposure level over time
  3. Loss to follow-up
  4. Nonparticipation bias
  5. Reporting bias

1. Exposure Misclassification Bias. This type of bias occurs when there is a tendency for exposed
   subjects to be misclassified as unexposed or vice versa. The example cited above in selection
   of controls is an example of misclassification bias. In that example, the quality control
   personnel who work in the white-collar business office might be classified as unexposed when,
   in fact, they are routinely exposed to the heavy metals because they tour the factory twice a
   day (even though they do not work in the factory). Typically, exposure misclassification bias
   occurs in the direction of erroneously classifying an individual as unexposed when, in fact, he
   or she is exposed. This would have the effect of reducing the degree of association between
   the exposure and the disease. In other words, if, in fact, exposure did increase the risk of
   disease, it is possible that we would declare little or no association. If the bias went in the
   other direction (i.e., unexposed subjects are misclassified as exposed), then we run the risk
   of finding an
   association when, in fact, none exists. A solution to the misclassification problem is to have
   strict, measurable criteria for exposure. Of course, the ability to accurately measure or
   determine exposure may be limited by available resources.
2. Change in Exposure Level over Time. Bias may occur when a subject's exposure status changes
   with time. For example, a subject in the smoking cohort may quit smoking 10 years after high
   school. Is that subject in the smoking or nonsmoking cohort? In cases like this, it is common
   to classify the subject's time periods with respect to smoking or nonsmoking and to use the
   person-years method (see Kleinbaum et al. 1982 [4]) to analyze the data. Using this method, the
   subject is not classified as exposed or unexposed; only his follow-up time periods are.
   Nevertheless, if "crossover" from one cohort to the other occurs, particularly in one direction
   only (e.g., smokers become nonsmokers, but nonsmokers do not start to smoke after high school),
   this may impart a bias that confounds interpretation of the study. For example, if many
   "quitters" develop lung cancer (presumably because they were exposed for several years), this
   occurrence might reduce the observed association between smoking and lung cancer.
3. Loss to Follow-up Bias. Bias can occur when members of one of the groups are differentially
   lost to follow-up compared to the other, and the reason for their loss is related, in part, to
   their level of exposure. Consider the following hypothetical observational study that evaluates
   newly diagnosed heterosexual AIDS patients. The two cohorts in this example are those patients
   who were IV drug users (IVDUs) and those who were not. Both cohorts are started on the same
   antiretroviral therapy at diagnosis. The research question is whether there is a difference
   between the two groups in viral load at the end of one year.
   As the study progresses, some patients die. To illustrate this bias using an exaggerated
   scenario, suppose that there are 50 IVDUs (the exposed cohort) and 50 non-IVDUs (the unexposed
   cohort), and, of the 50 IVDUs, 20 have died before the end of the 1-year follow-up period,
   leaving only 30 with measured viral load levels at follow-up (as there is no follow-up viral
   load recorded on the 20 IVDUs who died). The effect of this might be that the 30 IVDUs who
   completed the 1-year follow-up might have been, in general, healthier than the IVDUs who died,
   leading to a biased comparison.
4. Nonparticipation Bias. Nonparticipation bias is somewhat similar to loss to follow-up bias
   except that the bias occurs at the time of enrollment into the study. Suppose we were
   conducting a cohort study to determine whether child abuse is a risk factor for psychiatric
   disorders in teenage years. Although this might be a problematic study to conduct, due to the
   sensitive nature of the risk factor (i.e., child abuse), one might consider contacting families
   who were seen at a psychiatric facility once child abuse was discovered and asking them to
   participate in the study to follow their children through their teenage years to determine
   their psychiatric status. Controls would be families or subjects without histories of abuse
   who would be followed in the same way. In a situation such as this, it is likely that many
   families with histories of child abuse would decline to participate and that those who would
   participate might be psychologically healthier, rendering them unrepresentative of the general
   group of families with child abuse. Furthermore, if this group were, indeed, psychologically
   healthier, then the incidence of teen psychological disorders might be lower, thus attenuating
   the true association between child abuse and psychological disorders.
5. Reporting Accuracy Bias. Reporting accuracy bias in cohort studies is similar to that in
   case-control studies. It refers to a situation where either the exposed or unexposed subjects
   deliberately misreport either their exposure or their outcome status, usually due to the
   sensitive nature of the variables being studied. (See the section on case-control studies for
   examples.)
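
The attenuation described under exposure misclassification bias (item 1 above) can be made
concrete with a small arithmetic sketch. The Python function below is illustrative only: the
function name, cohort sizes, risks, and the 20% misclassification fraction are hypothetical
values chosen for the example, not data from any study. It assumes that a fraction of truly
exposed subjects is labeled unexposed and that no one is misclassified in the other direction.

```python
def observed_relative_risk(n_exposed, n_unexposed, risk_exposed,
                           risk_unexposed, frac_mislabeled):
    """Expected relative risk when some truly exposed subjects are labeled unexposed.

    All arguments are hypothetical inputs used for illustration.
    """
    # Truly exposed subjects who end up in the "unexposed" group.
    moved = n_exposed * frac_mislabeled

    # The group still labeled exposed contains only truly exposed subjects,
    # so its expected incidence is unchanged.
    rate_labeled_exposed = risk_exposed

    # The group labeled unexposed is now a mixture of truly unexposed
    # subjects and mislabeled exposed subjects.
    expected_cases = n_unexposed * risk_unexposed + moved * risk_exposed
    rate_labeled_unexposed = expected_cases / (n_unexposed + moved)

    return rate_labeled_exposed / rate_labeled_unexposed


true_rr = observed_relative_risk(1000, 1000, 0.03, 0.003, 0.0)
biased_rr = observed_relative_risk(1000, 1000, 0.03, 0.003, 0.2)
print(f"RR with no misclassification: {true_rr:.1f}")
print(f"RR with 20% of exposed labeled unexposed: {biased_rr:.1f}")
```

Under these assumed numbers, the apparent relative risk drops from 10 to 4: the mislabeled
exposed subjects raise the incidence in the group labeled unexposed, which pulls the ratio toward
1 even though the true effect of exposure is unchanged.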
Computing and Interpreting Relative Risk

The foregoing discussion dealt primarily with issues surrounding the design and interpretation of
cohort studies. Between design and interpretation is a phase during which various calculations
are carried out to quantify the relationship between the presumed risk factor and the disease
under investigation. The most common measure used in cohort studies for quantifying such risk is
the relative risk (RR). The calculation and interpretation of RR can be illustrated by referring
to Fig. 4.1. Here, a and b, respectively, represent the number of exposed subjects who did and did
not develop the disease in question. Likewise, c and d represent the unexposed subjects who,
respectively, did and did not develop the disease. In a cohort study, one usually selects exposed
subjects so that the row total of exposed (a + b) is fixed at some predetermined sample size.
Likewise, the sample size for the row of unexposed (c + d) is also fixed. The two row totals do
not necessarily have to be equal. This table is often referred to as a 2 × 2 table (pronounced
"two-by-two") since it contains two rows and two columns corresponding to Exposure and Disease
status.

Fig. 4.1 Computing the relative risk

In the exposed group, the fraction of subjects who developed disease (i.e., incidence rate in the
exposed) is a/(a + b); the corresponding incidence rate in the unexposed is c/(c + d).

The relative risk is then defined as

$$\mathrm{RR} = \frac{\text{incidence rate in exposed}}{\text{incidence rate in unexposed}} = \frac{a/(a+b)}{c/(c+d)}.$$

Typically, one might compare the rates to determine whether they are different, since, if the
rates are the same (i.e., RR = 1), that effectively tells us that there is no association between
the risk factor and the disease. On the other hand, if the rate is greater in the exposed (i.e.,
RR > 1), that would suggest that the risk factor is positively associated with the disease.
(RR < 1 would suggest that the subjects with the risk factor actually have a lower likelihood of
disease.)

It should be noted that RR is always a positive number unless one or more of the cells in the
above 2 × 2 table contains a zero, in which case it is common to compute the RR by adding a small
constant (typically 0.5) to a, b, c, and d and using the formula given above (see Agresti
2002 [5]).

The following example (see Fig. 4.2) computes the RR for a cohort study investigating oral
contraceptive (OC) use as a risk factor for MI in women. In this example, 1,000 women who used an
OC were followed over a period of
time to see who developed an MI. Likewise, 1,000 OC nonusers were followed in a similar way. The
incidence rates of MI were 0.03 and 0.003, respectively, yielding an RR of 10, which means that
women who used OCs had 10 times the risk of MI of nonusers. For determining whether an RR is
significantly different from 1, the reader is referred to Kleinbaum et al. 1982 [4].

Fig. 4.2 Relative risk: an example

Prospective Versus Retrospective Cohort Designs

One usually thinks of a cohort study as "prospective" because it looks forward from an exposure
to the subsequent development of disease. However, a cohort study can be classified as
"retrospective" or "prospective," depending on when it is being conducted with respect to the
outcome. If, at the time the investigator initiates the study, the outcome (e.g., disease) has not
yet occurred in the study subjects, then the study is prospective because the investigator must
follow the subjects in real time in order to ascertain outcome status. On the other hand, if the
study is conducted after the exposures and outcomes have already occurred, this type of design
often is classified as a retrospective cohort study.

For example, referring back to the section on confounding, there is general consensus that the
study of smoking during pregnancy as a risk factor for adverse maternal-fetal outcomes is of the
prospective type because, as described, the investigator must wait from the time of exposure to
observe the outcome of the pregnancy. However, suppose that the study were to be conducted by
reviewing patient charts from 2 years prior to the initiation of the study and identifying women
who smoked and did not smoke during pregnancy at that time. Then, the investigator would determine
the pregnancy outcome from the chart data (i.e., the outcomes are already known and documented in
the charts). This is an example of what many term a "retrospective" cohort study. (As noted in
Chap. 1, DeAngelis [6] and others would refer to this as a "historical" or "nonconcurrent" cohort
study.)

To the reader, the distinction between retrospective and prospective cohort studies may not seem
important since the logic of the two approaches is essentially the same. However, in a
prospective cohort study, the investigator typically has more quality control of the conduct of
the study and how data are to be collected than in a retrospective study because the former is
being conducted in real time. In a retrospective cohort study, the investigator is limited by the
nature and quality of data already available, which most likely were collected for routine
clinical purposes using criteria and standards that are different from those of the current
research investigation.
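
As a worked illustration of the relative risk calculation described under "Computing and
Interpreting Relative Risk," the following Python sketch computes RR from the four cells of a
2 × 2 table. The function name is hypothetical, and the cell counts are not read from Fig. 4.2;
they are derived from the cohort sizes (1,000 each) and incidence rates (0.03 and 0.003) quoted in
the text for the oral contraceptive example.

```python
def relative_risk(a, b, c, d):
    """Relative risk from a 2 x 2 table.

    a, b: exposed subjects who did / did not develop the disease
    c, d: unexposed subjects who did / did not develop the disease
    """
    incidence_exposed = a / (a + b)
    incidence_unexposed = c / (c + d)
    return incidence_exposed / incidence_unexposed


# Counts implied by the text: 30 of 1,000 OC users and 3 of 1,000 nonusers had an MI.
rr = relative_risk(a=30, b=970, c=3, d=997)
print(f"RR = {rr:.1f}")
```

This sketch deliberately does not handle a zero cell. With c = 0 the division fails, which is the
situation the text addresses by adding a constant to every cell before applying the formula.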
Case-Control Studies

The purpose of a case-control study, like a cohort study, is to determine whether an association
exists between exposure (E) to a proposed risk factor and occurrence of a disease (D). The
essential difference between the two designs is that in a cohort study, exposed and unexposed
subjects are identified and then followed over time to determine the incidence rates of disease
in those two groups, whereas in a case-control study, subjects with and without the disease are
classified as having or not having been exposed to the proposed risk factor. More simply put, the
cohort study follows subjects forward in time, whereas the case-control study looks backward for
an associated factor by first identifying subjects with and without the disease.

Selection of Cases

If we are to conduct a case-control study, then we first need to determine who our cases will be
and how we will select them for inclusion in the study.

Case Definition

The first step in selecting cases is to define what is meant by a "case." For example, if we were
studying lung cancer, we might specify that a case would be any subject with biopsy-proven
adenocarcinoma of the lung. If the research question itself necessarily distinguished between
small cell and non-small cell lung cancer and only the latter type was to be studied, then we
would have to add that to the definition. Other examples of strict definitions might be as
follows: if one were studying nutritional factors and their association with MI, we would define
a patient to have an MI if the patient exhibits a certain degree of enzyme elevation and has
clearly defined prespecified changes on an electrocardiogram.

Homogeneity of Cases

Most diseases vary according to severity or subtype. If we were to include in our study only cases
with a very specific subtype and/or severity (e.g., a particular histology of lung cancer), then
the study design may benefit from decreased "noise" or variation, but the results may be less
generalizable. Furthermore, restriction of the case definition will result in a smaller potential
pool of subjects (i.e., smaller sample size). Conversely, if the case definition is expanded to
include, say, multiple subtypes of the disease, then the results may be more generalizable, and
the subject pool size may increase. However, there will be greater variability, which may reduce
the ability to detect an association between E and D (i.e., reduced statistical power).
Determining the heterogeneity of case definition is a fine balancing act between addressing the
specific research question and sample size considerations.

Sources of Cases

In most research studies, a case of disease will be identified and selected from medical
practices or facilities such as hospitals or physician practices. Occasionally, cases of disease
can be obtained by using disease registries.

Prevalent Versus Incident Cases

An important consideration in the selection of cases is whether a case is considered a prevalent
or incident case. A subject is said to be a prevalent case if the patient has the disease in
question regardless of when it was diagnosed. It may have been diagnosed 2 days ago, 2 years ago,
or 10 or 20 years ago. But, as long as the subject is available, that subject is considered a
prevalent case. An incident case refers to a more restrictive criterion. In order to be an
incident case, an individual needs to have been diagnosed "recently." "Recently" may have
different connotations in different disease entities, but, for example, in a chronic disease like
cancer, an incident case might be a case that was diagnosed within the past 2-3 months. On the
other hand, for a disease that is rapidly fatal, such as anthrax poisoning, an incident case might
be defined as a case that was diagnosed an hour or two ago. The essential point to remember in
designing case-control studies is that, when selecting cases, we should select incident cases,
not prevalent cases. The reasons are as follows.
First, case-control studies often involve the recall of information about past exposures. This
type of information often is obtained by interviewing the subject himself or herself or by
interviewing family members or friends who might have such information. Of course, some exposure
information may also be gleaned from patient charts or other documents that exist independent of
an interview with a subject. It stands to reason that if the interval of time between diagnosis
of the disease and the interview for exposure information is lengthy, then the ability to
properly recall exposures will be reduced. Certain exposures such as smoking are not likely to be
forgotten, but, for example, if we were studying more complex and/or rare exposures, the ability
to accurately recall such exposures and associated details would decrease over time. Thus, the
shorter the interval between diagnosis and gathering of exposure information, the more likely it
is that the recalled information will be accurate.

A second reason for selecting incident cases is illustrated by the following example. Suppose we
were studying the association between smoking and lung cancer. We might go to the tumor registry
of our hospital and find 1,000 lung cancer cases that were diagnosed over the past 10 years. The
next step in our research design would be to contact these subjects and ask them whether or not
they were smokers prior to their development of lung cancer. One of the problems associated with
this approach is that out of those 1,000 lung cancer cases diagnosed over the past 10 years, many
will have died before we would be able to contact them. Cases that are still alive probably
would fall into two broad groups: (a) those who have been recently diagnosed and have not had the
disease long enough to die from it and (b) those who were diagnosed in the more distant past but
who have survived. The latter group (b) is likely made up of those with lower grade disease or
those who have been more successful in combating their disease with therapy. That group may be
very different from those who were diagnosed in the more distant past who already have died of
their disease. In fact, it is conceivable that smoking may be associated not just with
developing lung cancer but also with its lethality. Thus, it is possible that the smokers are
those who died early in the group that was diagnosed in the more distant past, whereas nonsmokers
are the ones who have survived despite their disease. In this case, when comparing this biased
group of cases to non-cancer controls, we would observe an attenuated association between smoking
and lung cancer. This bias would provide potentially misleading results.

On the other hand, if one were to simply sample recently diagnosed cases, and assuming that the
disease is not rapidly fatal (even small cell lung cancer patients would survive to be
interviewed), almost all of the available lung cancer cases would be included in the study since,
at that point, no one would be lost to follow-up or death. Therefore, the sample would not be
biased as it might have been had the sampling methodology been based on prevalent case selection.

Selection of Controls

Perhaps the most difficult aspect of conducting a case-control study is the selection of
controls. In principle, controls should be a group of individuals who are free of the disease or
outcome in question (i.e., non-diseased) and are as similar as possible in all other respects to
the case group.

Definition of Controls
Controls should be free of the disease in question. One of the difficulties in selecting controls
is determining how far we should go to ensure that someone is free of the disease or outcome.
For example, if we were to select as a control for our lung cancer cases an individual who has
never had a diagnosis of lung cancer, do we need to perform a bronchoscopy on that patient for
certainty of that fact, or do we simply take his self-report as the truth that he has never had
lung cancer? Of course, there are subtleties that arise when subclinical disease exists at the
time an individual is being selected as a control. These are fine points that would need to be
dealt with in a very careful manner, in consultation with a statistician or an epidemiologist.
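The attenuation described above for prevalent lung cancer cases can be illustrated with a small
worked calculation. The sketch below is hypothetical throughout: the counts of smokers and
nonsmokers and the survival percentages are invented for illustration and are not taken from any
study. It compares the odds ratio that would be observed if all diagnosed (incident) cases could
be interviewed with the odds ratio observed when only surviving (prevalent) cases can be reached.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Exposure odds ratio from a 2x2 case-control table."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

# Hypothetical registry of 1,000 lung cancer cases diagnosed over 10 years.
cases_smokers, cases_nonsmokers = 800, 200
# Hypothetical non-cancer controls.
controls_smokers, controls_nonsmokers = 400, 600

# Assumed percentage of cases still alive at interview (smokers assumed to die sooner).
pct_alive_smokers, pct_alive_nonsmokers = 20, 50

# Survivors are the only cases a prevalent-case design can interview.
alive_smokers = cases_smokers * pct_alive_smokers // 100
alive_nonsmokers = cases_nonsmokers * pct_alive_nonsmokers // 100

incident_or = odds_ratio(cases_smokers, cases_nonsmokers,
                         controls_smokers, controls_nonsmokers)
prevalent_or = odds_ratio(alive_smokers, alive_nonsmokers,
                          controls_smokers, controls_nonsmokers)

print(f"OR using all diagnosed cases:  {incident_or:.1f}")
print(f"OR using surviving cases only: {prevalent_or:.1f}")
```

Under these assumed numbers, the odds ratio falls from 6.0 to 2.4 when only survivors are
interviewed, even though the underlying relationship is unchanged. This is the pattern the text
attributes to prevalent case selection.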
66 M.L. Lesser
At this point, it is instructive to provide an example of where verification of non-disease
status might be problematic and require some additional thought about the design of the study.
Suppose we were conducting a case-control study to determine whether there is an association
between a high-fat diet and colon cancer. Specifically, our hypothesis is that colon cancer cases
will report a higher frequency of high-fat diets than non-cancer controls. To test our
hypothesis, we would select our colon cancer cases in some way consistent with the guidelines
already stated above and then select controls. One possible source of controls would be adults
visiting a large shopping mall. (We might choose to select individuals over 50 years old if our
case-control study was designed to answer the question in this population.) Next, we could set up
a colon cancer information booth in the mall and invite the passersby to answer a question or two
about history of colon cancer and, if they wished, to pick up a fecal occult blood test kit so
that they can screen themselves for colon cancer. Those who self-reported that they had never had
a diagnosis of colon cancer could be invited to participate as controls for our case-control
study. We might use as an exclusion criterion a positive test result on the fecal occult blood
test (even though that finding obviously does not equate to a diagnosis of colon cancer).

A member of our investigative team might object to this approach since self-report and fecal
occult blood testing, in and of themselves, would not completely verify the disease-free status
of someone passing through the shopping mall. Thus, we might be more rigorous in our selection of
controls. This might be done by enlisting the collaboration of a gastroenterologist who performs
colonoscopies and selecting from his or her colonoscopy practice those subjects who have
colonoscopies with a benign or negative outcome. Such outcomes might include diverticulosis,
inflammatory bowel disease, a benign polyp, other benign tumors of the colon, etc. If we were to
view colonoscopy as a close to foolproof way of determining an individual's colon cancer status,
then this would be a better way of selecting controls for such a study than selecting them from
visitors to a shopping mall (even though colonoscopy, itself, is not infallible). Of course,
subjects who have a diagnosis of colon cancer based on the colonoscopy would be excluded from the
control group.

The selection of controls from among those undergoing colonoscopy, nonetheless, could introduce
a different problem, namely, selection bias. Generally speaking, there are two broad groups of
individuals who undergo colonoscopy: (a) those who are symptomatic and who are referred by their
physician to a gastroenterologist to determine the cause of their rectal bleeding, abdominal
pains, cramping, diarrhea, etc., and (b) those who are asymptomatic and who undergo colonoscopy
for screening purposes only. However, these two groups differ in ways that can influence the
results of the investigation. For example, a high-fat diet may not be specific to the risk of
colon cancer but may be associated with other intestinal problems (e.g., some of the benign
conditions cited above). If this association was not appreciated during the study design stage,
and individuals from the symptomatic group were selected as controls, their rate of high-fat
diets would be spuriously inflated, thus reducing the observed degree of association between
fatty diets and colon cancer. On the other hand, asymptomatic individuals who undergo cancer
screening are more likely to be health-conscious individuals since they are voluntarily attending
a screening program. Because these individuals are more health conscious, they may have an
artificially lower level of fat intake than a standard population of individuals without colon
cancer. Accordingly, when we compare the fat intake for this control group against the colon
cancer group, we may observe an exaggerated association because of the artificially reduced
levels of fat intake in our control group.

There are several ways to address this problem, none of which constitutes a perfect resolution
of the issue. In this example, some investigators might employ only one of the control groups
with the understanding that the bias would need to be considered when interpreting the results.
Thus, for example, if the benign disease group were used as the control and only a small
association was observed (i.e., odds ratio [OR] is close to 1), the result would be inconclusive
because of the directionality of the bias. However, if a large and statistically significant
association (i.e., OR well above 1) were found, then, because the bias is working against the
hypothesis of positive association, this larger OR would provide evidence in favor of the
association. Another approach might be to include both groups as separate controls and, knowing
the opposite directions of the bias, compare cases to each control group and draw inferences
accordingly.

Sources of Controls
Recall that in a case-control study, cases of disease are most conveniently selected from a
medical practice or facility, but controls need not be selected from such sources even though it
might also be convenient to do so. Controls also can be selected from the community at large
using sophisticated sampling techniques or by simply placing advertisements in community media to
recruit individuals who meet the control criteria. Very often, investigators will collaborate
with various workplaces that will permit access to their employees as potential controls for a
particular study. Over the years, departments of motor vehicles often have served as a source of
controls for many research studies. Occasionally, close friends, relatives, or neighbors of an
individual case will serve as controls. Choosing such individuals can solve a myriad of problems
because this type of control sometimes will share the same environmental conditions as the case
or have a similar genetic disposition. The approach also facilitates cooperation because, very
often, friends, relatives, or neighbors will cooperate with an investigator who is also working
with that individual's relative. However, selecting friends and relatives as controls may have
adverse consequences because it often forces the cases and controls to be similar on the very
risk factors being investigated, thus reducing the association between the risk factor and
disease. In summary, the selection of controls requires careful thought and knowledge of the
underlying subject matter.

Confounding in Case-Control Studies

The Nature of the Problem
The impact of confounding on interpretation of findings from cohort studies has previously been
addressed. The reader should note that its adverse effects are not limited to cohort studies but
represent a potentially serious problem in case-control designs as well. Schlesselman [7]
provides interesting examples of such confounding, which we now describe.

Consider a hypothetical case-control study designed to test the hypothesis of association
between alcohol use (E) and lung cancer (D). Cases of lung cancer are selected for study, and a
group of controls without lung cancer is identified. Suppose that the rate of alcohol use in the
lung cancer cases is found to be significantly greater than that of the controls. The conclusion
would be that alcohol use increases the risk of lung cancer. However, one might criticize the
study because smoking should have been considered a confounding variable.

Why is smoking a confounding variable? One needs to refer back to the definition. Certainly,
smoking is associated with lung cancer (criterion #2), independent of any other factors. However,
smoking's association with lung cancer does not, in itself, make it a confounding variable.
Smoking must also be associated with alcohol use (criterion #1). How is smoking associated with
alcohol use? The answer lies in the fact that individuals who drink alcohol tend to have a higher
rate of smoking than individuals who do not drink alcohol. Thus, smoking is related both to
alcohol use (E) and lung cancer (D) and is, therefore, a confounding variable.

As another example of a confounding variable that may obscure an association between a putative
risk factor and disease, consider a case-control study to determine whether there is an
association between oral contraceptive (OC) use and MI in women. Once again, one would pick cases
of women who had suffered a recent MI and determine whether or not they had used OC in, say, the
past 5 years. A possible result of this study would be that the level of OC use was not
substantially greater in the MI cases than in the non-MI controls, thereby resulting in the
conclusion that there is little or no association between OC use and MI. However, once again,
smoking could be considered a confounding factor because it meets the two criteria of a
confounder: first, smoking is associated with MI. Second, smoking is associated with OC use. Why
is this so? The reason is that women who are smokers are less likely to be prescribed an OC than
women who are nonsmokers because of the risk of thrombophlebitis and other cardiovascular
disorders. In this example, the OC users were underrepresented in the MI case group because
there were many smokers in the MI group, many of whom were never prescribed OC. Thus, the
confounding effect of smoking potentially masks a relationship (i.e., reduces the association)
between OC use and MI.

Although it is important to identify confounders, it is just as important to recognize factors
that may appear to be confounders but, in fact, are not. Once again, two examples from
Schlesselman [7] are instructive. Consider a case-control study designed to investigate whether a
sedentary lifestyle is a risk factor for MI. Cases are those with a recent history of MI and
controls are individuals without MI (appropriately chosen). The exposure variable is (for
simplicity) sedentary lifestyle (coded as "no" or "yes"), as derived from some validated measure
of physical activity. One might consider levels of fluid intake (F) as a possible confounding
variable because physically active, non-sedentary subjects might have higher levels of fluid
intake than sedentary subjects; in other words, F is associated with E. Accordingly, we would
consider matching cases to controls on fluid intake. However, fluid intake is not a true
confounder because there is no known or presumed association between fluid intake and MI (D).
Thus, matching on fluid intake is not necessary.

Reducing Confounding by Matching
If confounding is an important problem in epidemiologic studies, how do we deal with it? A common
solution is matching. Matching is a technique whereby cases and controls are made to appear
similar with respect to one or more confounding variables. When cases and controls are properly
matched, the representation of the confounding variables is similar in both groups and,
therefore, should have no appreciable effect on the results and interpretation of the
case-control study.

Most students in the medical sciences are familiar with the idea of matching since they probably
have read many studies where matching was employed. However, it is our objective in this chapter
to describe the logistics of matching in somewhat more detail. The first step in matching cases
to controls is to identify the confounding variables. The next step is to determine the desired
method of matching. Typically, one should not match on more than a few variables (i.e., two or
three), but this also depends on the sample size in the case-control study and on the
distribution of the confounding variables in the samples being studied. Let us consider a simple
example where we have determined that age and sex are important confounders. (It is important to
emphasize that, while age, sex, race, and socioeconomic status are four of the most commonly
encountered confounders, it is not always necessary to match on any of these variables. The
reader should be reminded again that in order for a variable to be a confounder, it must meet the
two criteria given in the definition above.)

1. Group Versus Calipers Matching. When age and sex are potential confounders, one way to match
cases and controls is to classify male and female subjects into age groupings (a common method of
classification for age is by decades, i.e., age 20-29, 30-39, 40-49, 50-59, or 60 and above).
This approach would yield up to 10 different age/sex combinations corresponding to each of the 5
age categories cross-classified with sex (male, female). Therefore, if a case were to be chosen
and that particular subject was a 30-year-old male, we would choose a control who was a male in
the 30- to 39-year age group; these two individuals (the case and the control) would be
naturally matched and paired.

The reader should note, however, that there is a disadvantage to creating groups on a measured
variable such as age. Suppose, in the
above example, we required a match for a 30-year-old male, and, based on the pool of potential
controls, a 29-year-old male and a 39-year-old male were both available. Using the grouping
criteria defined above, the 30-year-old male would have to be matched with the 39-year-old male
because they were in the same age category. However, it would make more sense to match a
30-year-old male with a 29-year-old male because the two are closer in age.

A solution to this problem is to use what is known as calipers matching whereby, on a measured
variable, a control would be matched to a case based on being within a certain number of units
away from that case's measurement (hence the use of the term "calipers"). For example, we might
define a rule to match age to within (±) three years. In this case, the 29-year-old male is
within three years of the 30-year-old male and would be matched to the 30-year-old male, whereas
the 39-year-old male would be outside the defined three-year limit. A compromise between broad
grouping and calipers would be to arrange the potentially confounding variable (in this case,
age) into narrow categories (e.g., 30-33, 34-37, 38-41, etc.). This would reduce the effect of
the disparity that occurred in the example given above involving grouping by decades. When using
this method for age matching, the investigator must take care to consider the nature of the study
population. For example, if one were matching on age using three-year calipers in a case-control
study evaluating utilization of health-care services, a 64-year-old case could be matched to any
control ranging from 61 to 67 years old. However, in this example, matching a 64-year-old to,
say, a 66-year-old in a health services utilization study might result in matching a
non-Medicare subject with a Medicare subject (Medicare eligibility generally begins at age 65).
As these two types of patients might have very different utilization patterns, a bias could be
introduced into the study design. Similarly, when conducting research with pediatric patients, it
is important to match as closely and precisely to actual age as possible (which is equivalent to
making the calipers extremely narrow). For example, one would not match children to within three
years (e.g., matching a 10-year-old girl to a seven- or 13-year-old girl) since individuals at
these ages could have very different outcomes due to variations in socialization, sexual
maturity, body size, and other developmental variables. Effective matching, under these
circumstances, requires that there be a large pool of available controls to pair with cases.

2. Individual Versus Frequency Matching. Another consideration in matching is whether the
investigator wishes to use individual versus frequency matching. Typically, with individual
matching, one case and one control are matched to one another (1:1 matching). Occasionally, the
statistician or epidemiologist will recommend many-to-one matching, which might involve matching
two or three controls to each case. It is uncommon to match more than three controls to a case
because it can be shown that the statistical power benefits do not substantially increase beyond
two or three controls per case. The reader should keep in mind that if he or she conducts a
case-control study with 1:1 matching, it is necessary that there be an equal number of cases and
controls. A common misstatement that is seen in many research proposals employing case-control
studies is, for example, "there will be 50 cases with disease and they will be matched to 20
controls without disease." If the investigator was thinking of performing individual matching,
then this statement makes no sense, as individual matching would require a constant ratio of
controls to cases. Usually, what the investigator intends is that they will select cases and
controls so that, for example, the average age (or sex distribution) of both groups is
approximately the same. However, this approach is not matching; it is simply determining how
comparable the two groups are after they have been selected. Unless one prospectively selects
controls in a deliberate way so as to match them directly to a given case, the term "matching" is
not appropriate.

When an investigator does not perform individual matching but instead wants to
ensure that the confounding variables have the same joint distributions among both cases and
controls, the method of choice is frequency matching. Frequency matching refers to the deliberate
and prospective selection of controls so that the joint distribution of the confounding
variables is approximately the same in both the case and control groups. As an example, suppose
we were performing a case-control study to determine whether maternal smoking during pregnancy
was a risk factor for premature birth. Our cases might be 100 premature infants delivered during
the past year, and our controls would be drawn from the hundreds of normal term births delivered
during the same time period. Further, we have determined that parity (i.e., nulliparous vs.
parous) and age (grouped in 3-year intervals) are confounding variables for which matching will
be performed. Suppose we have decided, based on statistical power and resources available to
conduct the study, that the number of controls will be 250. Further, suppose that in the case
group, 10% of the cases were born to nulliparous 30- to 33-year-old women. We would then identify
from our vast pool of term-delivery controls all women who are nulliparous 30- to 33-year-olds.
From this pool of candidates, we would randomly select 25 nulliparous 30- to 33-year-old women.
Selecting 25 at random would ensure that 10% of the control group (10% of 250 = 25) would be
nulliparous 30- to 33-year-olds. Likewise, suppose that 16% of the cases are parous 25- to
28-year-old women; then, in a similar way, we would identify all parous 25- to 28-year-old women
who had full-term deliveries and, from that group, randomly select 40 matching controls, as 40
would constitute 16% of the control group. If we continued in this fashion, we would obtain a
control group that had either precisely or approximately the same joint distribution of parity
and age as the case group. It is important to note that to use frequency matching, one would need
to know the distribution of the confounding variables in the case group prior to selecting the
matched controls. This certainly would be workable in a study such as this where ascertainment of
smoking status (the risk factor) could be made by chart review so that one could first constitute
the case group and then return to select the control group. Frequency matching may be
logistically more difficult to conduct in other types of case-control studies, but the concept is
still the same.

3. Propensity Matching. A recently developed method for matching cases and controls (which also
may be used for matching exposed and unexposed subjects in a cohort study) is known as propensity
scoring (Rosenbaum and Rubin [8, 9]). Briefly, this method involves predicting whether a subject
is a case or a control based on observed predictor covariates. Thus, one subject may be a case
and the other a control, but their covariate profiles are similar as reflected by their predicted
probability of being in, say, the case group. Specifically, the probability of being a case
(i.e., the propensity score) is computed for each subject in the study (both cases and controls)
using a statistical method known as multiple logistic regression (see Chap. 11). Then, cases are
matched to controls on the propensity score. So, for example, suppose that in a particular study,
the score is being computed as a function of age, sex, smoking status, family history, and
socioeconomic status. If a particular case has a score of, for example, 0.75, we would try to
match this case to a control that also has a score of 0.75. In this way, cases and controls are
matched based on a measure of their similarity. An advantage of the propensity score method is
that it allows the investigator to match cases and controls on a single criterion (the score)
that is a function of multiple confounding variables, rather than having to match on each of the
individual confounders.

Sources of Bias in Case-Control Studies

As in cohort studies, case-control studies are subject to a variety of biases. Given below
are some of the more common types that may be encountered.

Recall Bias
Recall bias occurs when one of the groups recalls exposure to the risk factor more accurately
than the other group. It is not uncommon for recall bias to manifest itself as cases remembering
exposures better than controls. As an example, suppose one were conducting a case-control study
to examine risk factors for early childhood leukemia. The cases in such a study might be parents
of children with leukemia who were diagnosed before their fourth birthday, and the controls might
be parents of children who did not have a diagnosis of leukemia. The investigator interviews both
groups of parents with respect to exposure to a variety of potential risk factors. It would not
be unlikely that the mother of a young child with leukemia would remember many household
exposures better than a mother whose child was healthy, since it is human nature to recall
antecedent events potentially leading up to a serious disease or traumatic event better than
someone who has no reason to remember those events or exposures. Another example of recall bias
might be found in a study examining antecedents of lower back pain. Subjects who experience lower
back pain probably would have better recall of events related to lifting of heavy objects that
may have preceded the diagnosis of the back pain versus those without back pain who may not have
any particular reason to remember such events.

Reporting Accuracy Bias
This term refers to lying or deception in the response to questions concerning exposure, as
frequently occurs in the setting of case-control studies where sensitive questions are being
asked of the subject. A classic example of reporting accuracy bias might be as follows: Suppose
one were to conduct a case-control study among women to determine whether the number of sexual
partners during the past year is a risk factor for contracting venereal disease (VD). One might
conduct this study at a women's health center and select as cases women with newly diagnosed VD.
Controls could be women from the same clinic who do not have a diagnosis of VD. The important
question in the epidemiologic interview would be "how many sexual partners have you had in the
past year?" The responses in the case group (those with VD) might look as follows: 1, 1, 2, 2, 2,
3, 4, 5, 5, 6, 6, 6, 8, 9, and 10. (The responses have been ordered from smallest to largest in
order to better visualize the data.) When the control group is asked to respond to the same
question, the results might be 1, 1, 1, 1, 1, 1, 1, 2, 2, and 2. Based on these responses, the
average number of sexual partners in the case group would be 4.7 versus 1.3 in the control group,
thus suggesting (subject to a formal statistical test) that increased number of sexual partners
is a risk factor for venereal disease.

Although, at face value, the interpretation of the results might be as just stated, there is a
potential reporting accuracy bias. The bias might occur because women who have VD may be more
likely to be truthful about the number of sexual partners they have had, whereas women who are
controls may not be, thus causing the average number of sexual partners to be artifactually
greater in the case group than in the control group. Why might such a bias exist? One hypothesis
is that individuals with a particular disease (in this case, VD) tend to be more candid with
their physicians about past medical history and behaviors [10]. In fact, many patients (rightly
or wrongly) believe that if they are truthful, then their physicians may be able to better treat
their disease than if they are not truthful. Assuming that this women's health center serves
women who are married or who have boyfriends, male partners, etc., those in the control group
might be less likely to be truthful about the number of sexual partners because they would
perceive that they have something to lose and nothing to gain by admitting multiple sexual
partners. Of course, the ethical conduct of such a study would require an assurance of
confidentiality with respect to responses to the epidemiologic questions, but such an assurance
does not guarantee that subjects will cooperate when confronted with a highly personal and
sensitive question.
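The averages quoted in the example above can be checked directly, and the possible effect of
under-reporting can be illustrated with a short sketch. The two reported response lists below are
those given in the text; the list of "true" control responses is purely hypothetical and is
included only to show how candid answers from controls could shrink the apparent difference.

```python
# Reported numbers of sexual partners, as listed in the text.
reported_cases = [1, 1, 2, 2, 2, 3, 4, 5, 5, 6, 6, 6, 8, 9, 10]
reported_controls = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2]

# Hypothetical "true" control responses (an assumption for illustration only).
true_controls = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


mean_cases = mean(reported_cases)
mean_controls_reported = mean(reported_controls)
mean_controls_true = mean(true_controls)

# Difference between groups as reported, and under the assumed true responses.
reported_gap = mean_cases - mean_controls_reported
true_gap = mean_cases - mean_controls_true

print(f"Cases (reported):        {mean_cases:.1f}")
print(f"Controls (reported):     {mean_controls_reported:.1f}")
print(f"Controls (assumed true): {mean_controls_true:.1f}")
print(f"Gap as reported: {reported_gap:.1f}; gap under assumption: {true_gap:.1f}")
```

The first two means reproduce the 4.7 and 1.3 given in the text. Under the assumed true control
responses, the difference between groups is much smaller, which shows how differential candor
alone could exaggerate an association.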
Selection Bias

Selection bias in case-control studies occurs when identification and/or inclusion of cases (or controls) depends, in part, on the subject's level of exposure to the risk factor under study. There are several common forms of selection bias (i.e., detection and referral bias) as discussed below.

1. Detection Bias. Detection bias occurs when subjects exposed to the risk factor are more (or less) likely to be screened for the disease. A good example can be found in a hypothetical case-control study to determine whether exogenous estrogen use is a risk factor for endometrial cancer in women. One might choose as cases women with newly diagnosed endometrial cancer and as controls those without a diagnosis of endometrial cancer (suitably matched on various confounding variables). The study would then determine what fraction of cases had been exposed to estrogen (according to some predefined criteria) and similarly for the controls. A problem (potential bias) in this type of study is that when a woman undergoes estrogen therapy, it is likely that she will visit her gynecologist more often than if she does not since she would need to be monitored more frequently for potential side effects, such as vaginal bleeding. Consequently, if the woman were to develop endometrial cancer (irrespective of whether estrogen increased the risk), then it is more likely that the gynecologist will discover it due to the increased surveillance. Thus, when one selects cases for this study, unbeknownst to the investigator, the cases may have a higher likelihood of exposure simply because of the way that they were selected to enter the case pool.

2. Referral Bias. Referral bias occurs when controls are referred into the control pool for reasons that are related to the disease under study. As an example, suppose that a case-control study was being conducted to determine whether caloric intake was a risk factor for coronary artery disease (CAD). Since the investigator works in a hospital, she would like to select her controls, for convenience, from her hospital environment. She reasons that selecting controls from the pulmonary, endocrinology, or renal clinic might create a bias because many of those patients already have heart disease (or are at risk for heart disease), so she decides to select controls from the podiatry clinic around the block. She further reasons that most of the patients visiting the podiatry clinic are presenting for a variety of foot problems unrelated to heart disease or diet. However, she does not realize that some of these patients also have been referred for foot problems related to diabetes, and diabetes, of course, is related both to heart disease and caloric intake. Therefore, using these subjects as controls (without excluding controls seen for diabetes-related problems) might weaken any true association between diet (caloric intake) and CAD.

Another type of referral bias relates to the situation where included cases are not truly representative of all cases of the disease. For example, suppose we were investigating a possible increased risk of pediatric inflammatory bowel disease (IBD) among children who were formula-fed during infancy, as opposed to having been breast-fed. If we were to select the IBD cases from a medical practice at a prominent teaching hospital that specializes in pediatric IBD, we might be seeing a disproportionately high number of severe cases since it is likely that severe, difficult-to-manage cases would be referred to this center. Furthermore, if, in fact, formula feeding is not a risk factor for development of IBD but is a risk factor for having a more severe case of IBD among those with such a diagnosis, then it is likely that these cases will have a higher percentage of formula-fed individuals than the non-IBD control group. Accordingly, we would be more likely to conclude that formula feeding is a risk factor for IBD, when it is not.

Computing and Interpreting Measures of Risk

The foregoing discussion dealt primarily with issues surrounding the design and interpretation of case-control studies. Between the design and interpretation of a case-control study is a phase
during which various calculations are carried out to quantify the relationship between the presumed risk factor and the disease under investigation. The most common measure used for drawing inferences in a case-control study is the odds ratio (OR). The calculation and interpretation of the OR can be illustrated by reference to Fig. 4.3. Here, a and c, respectively, represent the number of cases who were exposed and not exposed to the risk factor. Likewise, b and d, respectively, represent the number of controls who were exposed and not exposed. In a case-control study, one usually selects cases so that the column total of cases (a + c) is fixed at some predetermined sample size; likewise for the control column (b + d). Frequently, the cases and controls are sampled in equal numbers (so that a + c = b + d), but there are circumstances where equality may not hold, as pointed out in the section on matching.

Fig. 4.3 Computing the odds ratio

In the case group, the fraction of subjects who were exposed to the candidate risk factor is a/(a + c); the corresponding proportion exposed in the control group is b/(b + d). Typically, one might compare the two proportions to determine whether they are different since if the proportions are the same, that effectively tells us that the risk factor is not associated with the disease; on the other hand, if the proportion of exposed cases is much larger than that of the controls, that would suggest that the risk factor is associated with the disease.

For various mathematical reasons, it is more convenient to express the risk, not as a difference between proportions but as a ratio of odds. For the unfamiliar reader, the odds of an event occurring is defined as the probability that the event will occur divided by the probability that it will not occur. For example, if the probability of an event is 25%, the odds of the event occurring is 25/75 (or, as some would prefer to express it, 1:3 odds). Thus, the odds of exposure among cases is [a/(a + c)]/[c/(a + c)] whereas the odds of exposure among controls is [b/(b + d)]/[d/(b + d)]. If we denote these quantities by O1 and O2, respectively, then OR = O1/O2 = (ad)/(bc). Computation of the OR in this fashion always will result in a positive number unless one or more of the cells in the above 2 × 2 table contains a zero; in the latter instance, it is common to compute the OR by adding 0.5 to a, b, c, and d and using the same formula [5] employed for computation of the relative risk (RR) in a cohort study. Just as in the interpretation of the RR, if OR > 1, this is taken to mean that the exposure to the risk factor increases the risk of disease by that many times or by that fold increase. Thus, for example, if OR = 1.5, this means that individuals with the risk factor are 1.5 times more likely to get the disease than those without the risk factor. Conversely, if OR < 1, exposure to the risk factor is protective. Thus, if OR = 0.5, that means that those with the risk factor are half as likely to get the disease as
those without the risk factor. An OR that is close to 1.0 means the factor is not associated with risk of disease. Figure 4.4 illustrates computation of the OR for a hypothetical case-control study investigating family history of coronary artery disease (CAD) as a risk factor for myocardial infarction (MI) in men. In this example, OR = 1.56, which means that men with a family history of CAD have a 1.56 times greater risk of MI than those without such a family history.

Fig. 4.4 The odds ratio: an example

Case-Control and Cohort Designs: Advantages Versus Disadvantages

As with any scientific study design, there are distinct advantages and disadvantages to their uses. Below, we provide a concise listing of some of the important pros and cons of case-control and cohort designs, as identified by Schlesselman [7].

Cohort Studies

Advantages
• Allow complete information on the subject's exposure, including quality control of data, and experience thereafter.
• Provide a clear temporal sequence of exposure and disease.
• Afford an opportunity to study multiple outcomes related to a specific exposure.
• Permit calculation of incidence rates (absolute risk) as well as relative risk.
• Enable the study of relatively rare exposures.
• Methodology and results are easily understood by non-epidemiologists.

Disadvantages
• Not suited for the study of rare diseases because a large number of subjects is required.
• Not suitable when the time between exposure and disease manifestation is very long, although this can be overcome in historical cohort studies.
• Exposure patterns, for example, the composition of oral contraceptives, may change during the course of the study and make the results irrelevant.
• Maintaining high rates of follow-up can be difficult.
• Expensive to carry out because a large number of subjects usually is required.
• Baseline data may be sparse as the large number of subjects often required for these studies does not allow for long interviews.

Case-Control Studies

Advantages
• Permit the study of rare diseases.
• Permit the study of diseases with long latency between exposure and manifestation.
• Can be launched and conducted over relatively short time periods.
• Relatively inexpensive as compared to cohort studies.
• Can study multiple potential causes of disease.

Disadvantages
• Information on exposure and past history primarily is based on interview and may be subject to recall bias.
• Validation of information on exposure is difficult, or incomplete, or even impossible.
• By definition, concerned with one disease only.
• Cannot usually provide information on incidence rates of disease.
• Generally incomplete control of extraneous variables.
• Choice of appropriate control group may be difficult.
• Methodology may be hard to comprehend for non-epidemiologists, and correct interpretation of results may be difficult.

Cross-Sectional Studies

The question addressed by a cross-sectional study is similar to that addressed by case-control or cohort studies: Is there an association between a particular factor and a disease or other event? The essential difference is that in a cross-sectional study, both the disease and exposure are assessed at the same time, with no attention to the timing of the exposure relative to the disease in question. For example, suppose we wanted to know whether artificial sweeteners were a risk factor for diabetes (type II). We could distribute a questionnaire to some large group of subjects, perhaps by direct mail. The questionnaire would ask whether the subject has had a diagnosis of type II diabetes and whether the subject consumes artificial sweeteners. Such a study would provide an estimate of prevalence of both diabetes and of artificial sweetener consumption in the targeted population. However, if the ultimate objective is to know whether artificial sweeteners might have some causal role in diabetes, the data collected via this study design would not shed any light on this question because (given the way the study was conducted) it would not be known whether the sweetener exposure came before or after the diagnosis of diabetes. Obviously, to be implicated in a causal process, the exposure would have had to occur prior to the disease. (This would be a necessary but not sufficient condition for causality [see below].)

Thus, one of the disadvantages of a cross-sectional study is that a causal (or suggested causal) association cannot be determined. Another disadvantage is that rare diseases are difficult to study since a very large number of subjects would be needed to yield a sufficient number of diseased individuals (likewise, if the prevalence of the risk factor was rare). Despite these important drawbacks, cross-sectional designs usually are quicker and less expensive to conduct than case-control or cohort studies since no follow-up is needed. Another advantage of the cross-sectional study is that it can provide some evidence suggesting an association between exposure and disease and, thus, help in designing a more formalized cohort or case-control study.

The Question of Causality

In most studies of risk factors and the occurrence of disease, the ultimate goal is to determine if exposure (E) to the risk factor causes the disease (D) in question. In experimental studies (e.g., laboratory experiments with animals or randomized clinical trials in humans), establishing causality is more straightforward than in observational studies, such as case-control or cohort studies. This is because in the experimental situation, many confounding variables can be controlled by the experimenter or by randomization, and, therefore, it becomes a more direct method for establishing causality.

In the observational study, association between E and D can be readily established, but there is no direct method to prove causality. However, epidemiologists [7, 11] have provided a set of guidelines for determining whether there is a causal association between E and D. These guidelines state that,
in order to establish causality, all five of the following criteria must be satisfied:
1. Temporal association. If causation is to hold, then exposure must precede the disease. Sometimes, the time sequence of E and D may be difficult to determine, but this criterion of temporal association is certainly a necessary condition.
2. Consistency of association. Loosely translated, this means that different studies of the same risk factor-disease question result in similar, or consistent, results. If results among several similar studies were discordant, this would weaken the causality hypothesis.
3. Strength of association. The greater the value of the relative risk or odds ratio, the less likely the association is spurious, lending evidence toward the causality hypothesis.
4. Dose-response relationship. If it can be shown that the risk of disease increases as the dose of the risk factor increases, this makes causality more plausible.
5. Biological plausibility. While satisfaction of the above criteria is important, causality ultimately will be more believable if there is some acceptable biological explanation as to why such causal association might exist.

In summary, it is not possible to directly prove a causal hypothesis using case-control or cohort study designs. However, the causal hypothesis becomes much more tenable if the above five criteria can be established for the problem at hand.
Take-Home Points
• The use of a proper study design is essential to the investigation of risk factors for disease or other outcomes.
• Observational studies are useful in studying risk factors for disease or clinical outcomes.
• Cohort and case-control study designs are the most common strategies used in observational research, with cross-sectional studies playing a less important role.
• The choice between utilizing a cohort or case-control design depends upon several factors including disease prevalence and/or incidence, data availability and quality, and time required for follow-up.
• Confounding is a potentially serious problem that can affect the interpretation of either a cohort or a case-control study.
• Matching is a method used to reduce the effects of confounding.
• The degree of risk is quantified by the relative risk for cohort studies and the odds ratio for case-control studies.
• There are numerous sources of bias that can affect the interpretation of observational studies.
• In general, causality cannot be directly proven in observational studies, but certain criteria can suggest a causal hypothesis.
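
The odds and odds ratio calculations described in this chapter can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original analysis: the function names `odds` and `odds_ratio` are hypothetical, the cell labels follow Fig. 4.3 (a and c are the exposed and unexposed cases; b and d are the exposed and unexposed controls), and the zero-cell adjustment is assumed to add 0.5 to every cell before the same formula is applied.

```python
def odds(p):
    """Odds of an event with probability p (0 <= p < 1): p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def odds_ratio(a, b, c, d, correction=0.5):
    """Odds ratio (a*d)/(b*c) for a 2 x 2 case-control table.

    a, c: exposed and unexposed cases.
    b, d: exposed and unexposed controls.
    If any cell is zero, `correction` (assumed here to be 0.5) is added
    to all four cells before the ratio is computed.
    """
    if min(a, b, c, d) < 0:
        raise ValueError("cell counts must be non-negative")
    if 0 in (a, b, c, d):
        a, b, c, d = (x + correction for x in (a, b, c, d))
    return (a * d) / (b * c)
```

With hypothetical counts of 30 exposed and 70 unexposed cases and 20 exposed and 80 unexposed controls, `odds_ratio(30, 20, 70, 80)` evaluates (30 × 80)/(20 × 70), or about 1.71. Likewise, `odds(0.25)` gives 1/3, which corresponds to the 1:3 odds used as an example in the text.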
References

1. Manson JE, Nathan DM, Krolewsky AS, Stampfer MJ, Willett WC, Hennekens CH. A prospective study of exercise and incidence of diabetes among US male physicians. JAMA. 1992;268:63-7.
2. Colditz GA, Manson JE, Hankinson SE. The nurses' health study: contribution to the understanding of health among women. J Womens Health. 1997;6:49-62.
3. Hennekens CH, Buring JE. Epidemiology in medicine. Boston: Little, Brown; 1987.
4. Kleinbaum D, Kupper L, Morgenstern H. Epidemiologic research: principles and quantitative methods. Belmont: Lifetime Learning; 1982.
5. Agresti A. Categorical data analysis. 2nd ed. Hoboken: Wiley; 2002.
6. DeAngelis C. An introduction to clinical research. New York: Oxford University Press; 1990.
7. Schlesselman JJ. Case-control studies. New York: Oxford University Press; 1982.
8. Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat. 1985;39:33-8.
9. Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc. 1991;79:516-24.
10. Swan SH, Shaw GM, Schulman J. Reporting and selection bias in case-control studies of congenital malformations. Epidemiology. 1992;3:356-63.
11. MacMahon B, Pugh TF. Epidemiology: principles and methods. Boston: Little, Brown and Company; 1970.

5 Fundamental Issues in Evaluating the Impact of Interventions: Sources and Control of Bias

Phyllis G. Supino
P.G. Supino, EdD, Department of Medicine, College of Medicine, 450 Clarkson Avenue, 1199, Brooklyn, NY 11203, USA
e-mail: phyllissupino@aol.com

Introduction

The ability to draw valid inferences from data is the cornerstone of research and provides the basis for understanding the new knowledge that research results represent. In clinical research and, most importantly, in trials of therapy, such inferences determine whether the findings have any value in the real world. This chapter will review potential threats to validity of data-based inferences that may result from specific study design elements in assessment of purposively applied interventions and will present critical analyses of several published examples. It draws heavily on the seminal work of Donald T. Campbell and Julian C. Stanley [1] whose analysis, originally developed for the social sciences, provides a cogent theoretical framework for understanding the logical structure, strengths, and weaknesses of alternative study designs.

Potential Threats to Validity

In its broadest sense, validity is defined as the best available approximation to the truth or falsity of a proposition [2]. In scientific inquiry, validity refers to whether assertions made in a research study, including those about cause and effect, are likely to be true. Campbell and Stanley argued that two different types of validity, internal and external (described below), must be considered when evaluating the legitimacy of conclusions drawn from an interventional study. Both forms of validity are protected or jeopardized (threatened) by the choice of study design and related methodological issues.

Threats to Internal Validity

Internal validity refers to the extent to which evaluators of the research are confident that a manipulated independent variable accounts for changes in a dependent variable. It is the indispensable element for interpreting the experiment. The independent variable is the treatment (e.g., drug, surgery) that is applied to study subjects; the dependent variable is the observed outcome (or response). To draw internally valid conclusions from an interventional study, dependence of outcome on treatment must be clearly apparent; other, potentially confounding factors must not be plausibly responsible for outcomes, or their impact must be definitively determinable so that the effect of the intervention can be unambiguously assessed. In other words, demonstration of an association between intervention and outcome, as in an observational study, would be inadequate; cause and effect must be inferable. Thus, the study design must effectively control for competing explanations (i.e., rival hypotheses) for the findings. For
the clinician, this would be equivalent to the logic underlying the protocols for ruling out myocardial infarction in the setting of chest pain. Campbell and Stanley identified eight factors that may threaten the internal validity of an interventional study. They referred to these as internal validity threats because they can provide competing explanations for observed outcomes and, thus, obscure true causal linkages. It is incumbent on a good investigator to use study designs devoid of these potential internal validity threats insofar as is possible.

1. Selection Bias. Selection bias is the improper assignment (allocation) of subjects for comparison. It is one of the most commonly recognized threats to the internal validity of an interventional study. An investigator may inadvertently contribute to this bias by non-rigorous matching (or failed randomization) techniques, or by choosing subjects for the experimental treatment who are believed to be most likely to benefit from it (a form of referral bias). For example, in a trial comparing surgery with medical treatment, those with the most favorable clinical profile might be assigned (referred) to the surgical group (based on presumed benefit), while the less robust patients might be assigned to the medically treated group. This approach is almost always optimistically biased in favor of the surgical group, which is why it is so difficult to form confident conclusions from trials conducted in this manner. It is equally incorrect to allow subjects to self-select their treatment assignments because volunteers for experimental treatments have been shown in various studies [3-5] to be different from the total ambient population in terms of personality (e.g., risk tolerance, decisiveness, action orientation), severity of disease or symptoms, and race, among other variables. These characteristics could skew associated outcomes in any direction (though it is generally thought that the direction of the bias induced by self-selection bias, like referral bias, is in favor of the experimental treatment).

When groups to be compared are not equivalent initially for these or for any other reason, observed differences on outcome measures among the groups may be due to (or at least strongly influenced by) these baseline differences rather than to the intervention. Selection bias sometimes can be neutralized after data collection through statistical processes. However, the best strategy is to preclude the problem by using an appropriate study design to maximize the comparability of the compared groups prior to intervention.

2. History Effects. History effects are caused by events not related to, or anticipated by, the research protocol that occur during the study and influence outcomes. History effects potentially threaten internal validity when a study is performed in a less than isolated setting, particularly when effects on the dependent variable are assessed before and after the intervention and the temporal interval separating these assessments is relatively long. When history effects occur, measured outcomes may partially or completely reflect the outside event and not the intervention. History effects can be caused by factors such as unintended procedural or environmental changes in the experimental setting, changes in the social climate that can influence attitudes, media campaigns that can increase general knowledge, or newsworthy events relevant to the altered health concerns of subjects in the study. As an example of the latter, if an investigator was evaluating the impact of a breast cancer awareness program to promote increased use of mammography and a well-known public figure was diagnosed with breast cancer, it would be difficult to determine whether the ensuing increased use of mammography was due to the program or to the media attention surrounding the public figure's diagnosis. In the clinical setting, history effects can be induced by changes in routine care (e.g., introduction of a new medication or other treatment, alterations in patient management, variations in patient reimbursement rules) that could impact study outcomes. The effects of history are best minimized by closely monitoring
to ensure that ancillary factors not directly integral to the intervention remain equivalent for all compared groups for the duration of the study. History effects also can be minimized by using contemporaneous (parallel) control group designs, where comparators would have equal likelihood of exposure to significant external events extraneous to the experimental setting.

3. Maturation Effects. Maturation effects are due to dynamic processes within subjects that may change with time and are independent of the intervention (e.g., growing older, progression or regression of illness). Like history, maturation may threaten internal validity when analysis of outcome depends on comparison of pre- and post-intervention measures. It is a particular concern when studies extend over long periods of time (longitudinal studies) during which biological alterations naturally can be expected and, thus, may affect outcomes. The effects of maturation, like selection bias and history effects, are minimized in parallel designs by selecting comparison groups likely to have similar developmental patterns.

4. Testing Effects. Testing effects are the influences of taking a test, being measured, or otherwise being observed, on the results of subsequent testing, measurement, or observation. Testing effects may occur whenever the testing process is itself a stimulus to change, even in the absence of a treatment. Examples are the act of being weighed during a weight-reduction program, or requiring patients receiving nicotine substitutes to document and periodically report the number of cigarettes they have smoked. In these cases, assuming the subjects are aware of the results of testing, the process of being measured may cause subjects to undertake lifestyle changes that will affect outcome independently of the intervention. Testing effects are potential concerns when measurement assesses knowledge, attitudes, behaviors, and (especially) skills, because the testing itself can provide an opportunity for altering subsequent results through practice or learning. The threat to internal validity can be minimized by using alternate forms of measurement for testing before and after intervention, or by eliminating pre- and post-intervention comparisons from the data analysis plan. Of course, as is true in virtually all interventional research, the latter approach requires demonstration of equivalence of the compared groups before the intervention is applied (i.e., at baseline, the pre-intervention period, or control condition).

5. Instrumentation Effects. Instrumentation effects (also known as instrument decay or instrument drift) are caused by changing measurement instruments or observers during the course of a study, or by intra-study changes in the original instruments or observers, that may cause systematic error (bias) in measuring the outcome variable. If the error entails consistent overprediction versus baseline, the bias is said to be positive; consistent underprediction is a negative bias [6]. For example, if alternate versions of a test instrument are used before and after an intervention to reduce testing effects, any observed changes may be due to differences in difficulty level (e.g., easier posttests in studies assessing educational impact) or other systematic variations in the alternative instruments, rather than to the intervention. To avoid instrument effects when alternate forms of measurement are employed, they should be previously evaluated to assure equivalence. Parallel problems can occur when observers are changed during the course of study since new observers may use different criteria for scoring and interpreting data than the original observers. Instrumentation effects also can occur when the same instrument (or observer) is used throughout the study since instrument calibration may change with time (or observer attitudes/assessment criteria may change with experience).

Like history and maturation, instrumentation effects are a potential threat to internal validity in any longitudinal study involving serial measurements. They are of particular
concern when subjective measures (e.g., interviews or questionnaires) are used; in this situation, care must be taken to assure that instruments have demonstrated high reliability (internal consistency) to ensure stability. However, whether objective or subjective measures are used, observers may alter their interpretation of data as they grow more proficient or fatigued. Thus, instrumentation effects also can be minimized through development of standardized data collection protocols so that any fluctuations in measurement will occur randomly rather than systematically (or when comparing treatments by using the same observers across treatment conditions [counterbalancing] to avoid confounding).

6. Statistical Regression. Statistical regression is the tendency of individuals who scored extremely high or low on initial testing to score closer to the previously established population mean on subsequent retesting, independent of the intervention. This is one of the most often overlooked threats to internal validity, even among investigators who are well trained in statistics. Statistical regression results from measurement error, as extreme or highly deviant scores may arise due to chance. Such deviant scores are less likely to reappear on reevaluation. Regression effects can be minimized by avoiding the selection of a subject pool based on extreme scores, for example, very high blood pressure or low IQ scores. Another useful strategy to avoid regression effects is to obtain multiple measurements on each patient at several different appropriate times prior to intervention, or several measurements at the protocol-mandated baseline and time after intervention, which may then be averaged to optimize reliability of the estimate.

7. Experimental Mortality. Experimental mortality (or attrition bias) is caused by the loss of subjects from a study who were originally included at baseline. Because subjects who withdraw may have different attributes than those who remain, their withdrawal may bias pre- to post-intervention comparisons, especially if these attributes are related to the outcome. Experimental mortality can bias outcome even for post-interventional comparisons if dropout is due to some characteristic of an intervention that is not related to the mechanism underlying its presumed efficacy. When comparison groups are used in an experimental design, a mortality bias also is introduced if the subjects lost to follow-up differ diagnostically among these groups. For example, a psychiatrist might wish to follow two groups of psychotic patients, one of which had been given an innovative treatment (the experimental group) while the other had been managed traditionally (the control group) to determine whether the intervention decreased return visits to his/her practice. If more severely ill patients were lost to follow-up in the intervention group than in the control group, the investigator might falsely conclude that reductions in return visits among the intervention group were attributable to the innovative treatment when, in fact, they may have occurred merely as a result of differences in attrition rates due to differences in illness severity. Experimental mortality is best minimized by using large groups of subjects who are geographically stable, accessible to investigators (i.e., have working telephone numbers and valid postal or e-mail addresses), and who are interested in participating in the study, and by developing strategies to facilitate follow-up. When subjects are lost, it is prudent to compare their baseline characteristics with those who remain in study to identify potential bias, and to utilize external vital statistics databases (e.g., the National Death Index) to identify and confirm deaths that may not be known to investigators.

8. Interaction of Factors. Sometimes two or more threats to validity can exist concurrently. These may combine to further restrict validity. Two factors that might be expected to combine are selection and maturation. For example, if two groups of patients were not initially equivalent in severity of illness (a selection bias), their illnesses might
progress at different rates (a maturation bias). Thus, one of the two groups might end up sicker, or healthier, than the other, irrespective of any intervention. This threat is best controlled by procedures to minimize individual biases (e.g., randomized allocation to treatment groups).
9. Experimenter Bias. In a perfect world, an investigator involved in a quantitative study would be detached and objective, maintaining a highly circumscribed relationship with the subject. In an interventional study, his or her responsibility is to administer or allocate subjects to a treatment and to impartially measure outcomes and other variables of interest. Experimenter bias, not identified by Campbell and Stanley, occurs when the expectations of the investigator (usually unknowingly and unintentionally) influence the outcome of the study, thereby confounding the results. The profound impact of experimenter bias on internal validity was demonstrated by Rosenthal (1964) in his seminal studies of expectancy on experimenter judgment and learning outcomes conducted during the mid-1960s [7]. The experimenter's expectations typically arise from deeply seated views about his or her study hypothesis and can impact the study in a number of ways. For example, the investigator could subtly communicate expectations (cues) to participants about anticipated outcomes and influence them through the power of suggestion. The investigator could provide extra attention or care to subjects that is outside of the intervention (the latter is also termed "performance bias" when systematically done for members of only one of the comparison groups or "compensatory treatment bias" when specifically applied to controls). The investigator also can bias the study through improper ascertainment or verification of outcomes, for example, by searching more diligently for adverse events in patients with versus without hypothesized risk factors ("detection bias") or by assigning a more favorable rating on a subjective scale to subjects in the experimental versus control arm (a form of instrumentation bias). Experimenter bias is best controlled by techniques that blind both the investigator and the subject to the latter's treatment assignment, by the use of observers from whom the purpose of the study is withheld, and by standardization of the methodology of outcome assessment to ensure that subjects in the control group are evaluated as thoroughly and as frequently as those receiving the intervention.
10. Subject Expectancy Effects. The subject expectancy effect (also termed "nonspecific effects"), also not identified by Campbell and Stanley, is a cognitive bias that arises when a subject anticipates an outcome (positive or negative) from an intervention, and reports a response to the intervention that is premised on this belief. This is the basis of the "placebo effect," long recognized in clinical medicine. It occurs when a patient responds positively to an inactive intervention (e.g., a pharmacologically inert pill) and appears to improve subjectively and even, occasionally, objectively. This effect on outcome is due to the patient's belief that the intervention is curative. It may be stimulated or reinforced by suggestion of therapeutic benefit by an authority figure (e.g., physician or other investigator, as noted above under Experimenter Bias) and/or by the subject's inherent desire to please him or her. Indeed, the term "placebo" is derived from the Latin, "I will please." An opposite phenomenon is the "nocebo" (Latin for "I will harm") effect, which occurs when a subject reports negative responses to administration of an inert intervention due to his/her pessimistic expectation that it would produce harmful or unpleasant consequences. Although the magnitude of these subject expectancy effects is variable and somewhat controversial, there is general consensus that they can impact the validity of any study in which the subject is aware of receiving a treatment for which the outcome is subjective (e.g., studies involving pain control or symptom relief). As with experimenter bias, subject expectancy is best
controlled by utilizing study designs that blind the subject to his/her treatment assignment. For some types of interventions, such as those involving lifestyle changes (e.g., dietary alterations, smoking cessation) or surgical studies, subject blinding may be difficult, if not impossible. (This is also true for those conducting these interventions and other members of the investigational team.) In these instances, blinded assessment of outcomes by external adjudicators could reduce, if not eliminate, expectancy biases. However, in many biomedical studies (e.g., those evaluating the effects of pharmacological agents), subjects (and investigators) can be blinded to treatment assignments through the use of placebos. The incorporation of placebos enables determination of treatment effects above and beyond those arising from subject (or investigator) expectancy. Obviously, placebos work best when they closely approximate the physical characteristics of the active intervention. (This problem is avoided in early phase I clinical trials of therapeutics where both placebo and active drug may be administered intravenously, or when the investigational intervention does not cause characteristic physiological effects that might unmask the treatment assignment.) When the treatment assignment is known to both subject and investigator, the study is said to be "unblinded" (or "open"); when only the subject or the investigator (but not both) is unaware of the treatment assignment, the study is said to be "single blinded"; when treatment assignment is unknown both to subject and investigator, the study is said to be "double blinded"; and when it is unknown to the subject, investigator, and others analyzing or monitoring the data, the study is said to be "triple blinded."

Threats to External Validity

External validity refers to generalizability, that is, can the study findings be extrapolated to subjects, contexts, and times other than those for which the findings were obtained? Internal validity is a prerequisite for external validity. However, external validity is not assured even when internal validity has been established. In fact, the rigorous controls required to establish internal validity may inadvertently compromise a study's generalizability. The investigator must use a variety of strategies to strike a delicate balance between both concerns, if the study is to be both accurate (internally valid) and have practical utility (be externally valid). The four most common threats to external validity, identified in the seminal works of Campbell and Stanley, are given below.
1. Reactive Effects of Testing. The reactive effects of testing involve sensitization (or desensitization) of study subjects to interventions caused by the pre-intervention testing that might not be undertaken in the general, nonstudy population. This threat to external validity is most often encountered when pretests are obtrusive and/or outside of the normal experience of the subject. For example, to study the effects of a new nutrition program, an investigator might assess baseline knowledge of food groups and portion control, for the purpose of comparing pre- to post-intervention changes. If the pretest had focused attention on the intervention, any treatment effects that were observed might not be replicable if the pretest was not given. To diminish this bias, the investigator should minimize or, ideally, dispense with the use of pretests. However, as with its internal validity analog (testing effects), this approach is valid only when there is reasonable certainty that the comparison groups are equivalent at baseline. Alternatively, the investigator could opt to use the least obtrusive pre-intervention assessments to minimize reactivity. Special research designs (e.g., the Solomon four-square design), in which pretests are given to some but not all study subjects, can be used to determine the reactive effects of testing on study outcomes.
2. Interactive Effects of Selection and Treatment. Sometimes two investigators will run similar studies and obtain different findings. One possible cause of this outcome is the interactive effects of selection and treatment (or selection-treatment interaction). The interactive effects of selection and treatment are the
presumed basis of the failure of results found in an intervention study to be generalizable to other subjects to whom that intervention is applied. This failure occurs because the study was conducted on a sample that was not representative of the larger population to which results should be extrapolated. The selection-treatment interaction frequently is seen in clinical research when research subjects are scarce (a common situation) and the investigator is limited to those who present themselves and are willing to participate. In these situations, study subjects typically are selected by convenience, rather than by population-based sampling. A convenience sample includes all, or a portion, of patients who are being seen in a practice, hospital, or clinic, provided they meet the inclusion criteria of the study, and consent to participate. If the subjects selected for the study are, for example, healthier, wealthier, or wiser than the general population, or if they come from a unique geographic area, they may benefit more or less from a treatment, and it may not be possible to replicate the study, or to extrapolate its results to the larger population of interest. In theory, the interactive effects of selection and treatment are best controlled by random selection of subjects from the target population. Because this seldom is possible in clinical research (especially in randomized clinical trials [RCTs] in which strict inclusion/exclusion criteria and possibility of a subject's receiving a placebo sharply narrow the pool of study-eligible patients), the investigator should endeavor to select subjects who have characteristics similar to those to which he or she wishes to extrapolate results. Multicenter studies, drawing from diverse demographic populations, tend to suffer less than single-center studies from this external validity threat. Nonetheless, even small, single-center studies have value provided the investigator identifies and reports potential biases in his or her selection plan and is also careful to limit generalizations to appropriate populations.
3. Reactive Effects of Experimental Arrangements. This validity threat is defined as aberrant behavior exhibited by subjects that results solely as a consequence of their participation in an experiment, and that may not occur outside the experimental setting. The reactive effects of experimental arrangements are often confused with the placebo effect. Although there are cognitive components inherent in both validity threats, the primary difference is that with the reactive effects of experimental arrangements, the subject's bias is based on the idiosyncrasies of the research environment, whereas with the placebo effect, the subject's bias is based on expectations about the treatment (that may or may not be part of a research study). The reactive effects of experimental arrangements were serendipitously discovered in a series of trials evaluating the impact of the work environment on employee productivity, conducted by Harvard University researchers between 1924 and 1932 at the Hawthorne Works, a factory plant of the Western Electric Company in Cicero, Illinois. The initial studies (illumination experiments) varied the level of light intensity to which employees were exposed. When the light intensity increased, worker output (and positive affect) improved but, much to the investigators' surprise, worker performance also improved when lighting intensity was diminished. The same pattern emerged when other environmental factors were manipulated. These unintended outcomes (also known as the "Hawthorne effect") [8] led the researchers to conclude that the mere act of being studied changed the participants' behavior (i.e., brought about a pseudo-treatment effect), confounding inferences about effects of the various interventions imposed upon them. Underlying mechanisms proposed to explain these findings include unintended special attention and benefits that may have been given to subjects by observers, uncontrolled novelty due to the artificiality of the experimental arrangements, and inadvertent responses to subjects from observers leading to learning effects that positively impacted performance. While there is no consensus as to the cause, the reactive effects of experiments
currently are recognized as a potential threat both to external and internal validity in research from various disciplines (e.g., medicine, education, psychology, and management science). Their impact is potentially problematic in any situation in which there is human awareness of participation in a study and in which study outcomes can be motivated by that knowledge. A related threat to validity that is caused by experimental arrangements is known as the "John Henry effect" [9]. This may occur when subjects in the control group, being aware of their treatment assignment, view themselves as competing with subjects in the intervention group and change their behavior (i.e., try harder) in an attempt to outperform them.
Whenever possible, the investigator should take steps to reduce the reactive effects of experimental arrangements to increase the likelihood that the findings from a study will be replicated beyond the experimental context. Methodological options for achieving this objective include (1) minimizing the obtrusiveness of experimental manipulations and measurements, (2) blinding subjects to their treatment assignment (to control for John Henry effects), and (3) providing equivalent attention to intervention and control groups, especially in studies involving psychological, behavioral, and educational outcomes. To accomplish this, investigators may include a "Hawthorne control group" that receives an irrelevant intervention to equalize subject contact with project staff.
4. Multiple Treatment Interference. A fourth threat to the external validity of an intervention study is multiple treatment interference, defined as the influence of one treatment on another, which may produce results that would not be found if either were applied alone. Multiple treatment interference is a potential problem in any study in which more than one treatment (or treatment level) is given to, and formally evaluated in, the same subject. The threat applies even when the treatments are given in sequence because treatment effects may carry over and it may not be possible to eliminate the effects of the prior exposure. Under these conditions, it will be difficult to determine how much of the ultimate treatment outcome was attributable to the first treatment and how much was due to the second, thus limiting the applicability of the study findings to the real world in which patterns of treatment availability may not mirror those of the study. Multiple treatment interference is very difficult to eradicate. It is best controlled by avoiding the use of within-subject designs. Where this is not possible, the investigator must carefully counterbalance or randomly order treatments across subjects and provide appropriate washout periods.

Elements of the Research Design

In analyzing the anatomy of a study to evaluate the impact of an intervention, it can be very helpful to employ shorthand that displays the major elements of the design, the sequence of events, and certain of the constraints within the design. This shorthand, based largely on the notation developed by Campbell and Stanley, will be used in the remainder of this chapter to examine the strengths and weaknesses of ten alternative study designs.
The symbol X denotes the intervention (primary treatment or independent variable) that is applied to the subjects in the study. When more than one level of a treatment is included in a design, they are labeled X0 (control), X1, X2, and so on; XP indicates that a placebo has been given to control subjects (in designs incorporating parallel treatment arms) or during the control condition (in time-series or crossover design) to control for expectancy.
Y indicates that a secondary treatment has been coadministered, concomitant with the primary treatment. Variations in levels of the secondary treatment, if any, may be distinguished by subscripts in a similar manner as for X. Absence of Y indicates absence of co-treatment.
O is the observation (or measurement of the dependent variable) in the study. O may represent a test result, a record, or other data; when
more than one observation is involved over time, they are variously labeled as O1, O2, etc., to distinguish them.
An arrow represents the experimental order (sequence of events during the study period).
A dashed line indicates that intact groups (e.g., hospitals, clinics, or wards) have been compared (in other words, that subjects have not been allocated to treatment on a random basis).
R indicates that study subjects have been allocated to treatment groups on a random basis. (Thus, a dashed line and R generally will not appear in the same design as these represent alternative methods of subject allocation to treatment.)

Alternative Research Designs

Several alternative research designs have been used to evaluate the effects of an intervention on some specified outcome. Each of these differs according to its adequacy in ensuring that valid inferences are made about the effects and generalizability of an intervention.

Pre-experimental Research Designs

The literature regrettably includes many studies that use designs which fail to control for most threats to internal validity. These are most properly termed pre-experimental designs because they contain only a few of the essential structural elements needed to draw unambiguous inferences about the impact of an intervention. They are presented below to heighten the reader's awareness of their glaring deficiencies. The three most common are the following:
1. The one-shot case study
2. The pretest-posttest only design
3. The static-group comparison

Pre-Experimental Research Design # 1:
The One-Shot Case Study
X → O

Some studies in medicine utilize a design in which a single patient (or series of patients) is studied only once, following the administration of an intervention. No pre- to post-intervention comparisons are made, and no concurrent control groups are used. Instead, inferences about causality are predicated on expectations of what would have been observed in the absence of the intervention, usually based on implicit comparison with past information. This most rudimentary pre-experimental design is termed the "one-shot case study" and is diagrammed as follows: X for the intervention, followed by an arrow, and O for the observation. Consider an example from the literature by R.F. Visser, published in the journal Clinical Cardiology [10] (summary and design structure are given in Fig. 5.1).
Fig. 5.1 Example of a one-shot case study
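
As a brief aside on the shorthand itself, the sketch below shows one way the diagrams for the three pre-experimental designs could be composed from the symbols X and O, the arrow, and the dashed line. It is only an illustration in Python: the helper names (sequence, intact_groups), the ASCII arrow "->", and the row of hyphens standing in for the dashed line are assumptions made here, not part of the notation developed by Campbell and Stanley.

```python
ARROW = " -> "  # ASCII stand-in for the arrow in the design diagrams


def sequence(*events):
    """Join design events (e.g., "O1", "X", "O2") in experimental order."""
    return ARROW.join(events)


def intact_groups(*rows):
    """Stack one row per group, separated by a dashed line.

    The dashed line marks intact (non-randomized) comparison groups.
    Rows are right-aligned so that the final observations line up.
    """
    width = max(len(row) for row in rows)
    dashed = "-" * width
    return ("\n" + dashed + "\n").join(row.rjust(width) for row in rows)


# Hypothetical variable names for the three pre-experimental designs.
one_shot = sequence("X", "O")
pretest_posttest = sequence("O1", "X", "O2")
static_group = intact_groups(sequence("X", "O"), "O")

for name, diagram in [
    ("One-shot case study", one_shot),
    ("One-group pretest-posttest only design", pretest_posttest),
    ("Static-group comparison", static_group),
]:
    print(name)
    print(diagram)
    print()
```

Written this way, the structural gaps in each design are easy to see: the one-shot case study has no O before X and no second row, the one-group pretest-posttest design has no second row, and the static-group comparison has no pre-intervention O in either row.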

In this study, the X represents the anistreplase, and the O represents the patency of the infarct-related vessels, as measured by TIMI criteria for perfusion. The authors contend that the X probably caused O, but have they presented convincing evidence of that association and protected the internal validity of their study?
The answer is that studies such as this have almost no value for determining cause and effect because, as Campbell and Stanley have noted, securing evidence of this nature involves, at minimum, making at least one direct comparison. Although the authors allude to the results of previous studies of patency following an AMI, no explicit data are presented against which patency in this investigation is compared; the absence of such control is even more striking for reocclusion rates. Even if data from historical controls were given, there is no assurance that previous patient characteristics and ancillary medical management were equivalent; in fact, they usually are not, due to differences in the health of a given population and alterations in medical technology over time. In addition, while standardized methodology (TIMI criteria) was used to determine initial patency and reocclusion, those reading the angiograms were aware of (and may have been influenced by) the intervention. Thus, history, maturation, selection, experimental mortality, and expectancy (experimenter bias) potentially threaten the internal validity of this study because each could explain the outcome. Furthermore, the external validity of this study also is threatened by the interaction of selection and treatment (due to small numbers of highly selected patients who may not be representative of the general population of patients with AMI), as well as by multiple treatment interference (note: heparin also was given to all subjects). As noted earlier, importance usually is not attached to the generalizability of a study that cannot be shown to be internally valid.

Pre-Experimental Research Design # 2
The One-group Pretest-Posttest Only Design
O1 → X → O2

The one-group pretest-posttest only design represents a very slight improvement over the one-shot case study; it is a second pre-experimental design which also is commonly found in the medical literature. This design differs from the one-shot case study in that it collects baseline observations on study subjects that can be compared to observations made after the intervention. (The terms "pretest" and "posttest" are used generically in this chapter to refer to assessments of the dependent variable made, respectively, before and after the intervention.) Because study subjects are observed under more than one treatment condition, the one-group pretest-posttest study is considered one of the simplest versions of repeated measurement designs, described later in this chapter. Like the one-shot case study, this design contains no parallel comparison groups and is diagrammed as an O1, for the pre-intervention observation; followed by an X, for the intervention; and followed by O2, denoting the post-intervention observation, with arrows between. An example of a study employing this design was published by Wender and Reimmer in the American Journal of Psychiatry [11] (summary and design structure are given in Fig. 5.2).
In this study, O1 represents the baseline attention deficit hyperactivity disorder (ADHD) score, X is the bupropion treatment, and O2 is the post-treatment ADHD score (Fig. 5.2). In the opinion of the authors, the improvement in O2 relative to O1 is the result of X. Can the authors' primary conclusion withstand scrutiny?
Again, we first consider internal validity. As in any repeated measures design, study subjects served as their own controls, effectively eliminating the threat of selection (allocation) bias. However, this design fails completely to control for the following other factors that also could account for the results. First, history effects are a potential threat to the internal validity of this study because it is possible that patients may have experienced an event external to the intervention, and that this event, not the intervention, may have improved their ability to focus. A second potential threat is maturation because, as in any longitudinal study, the conditions of the study subjects may have changed on their own. Yet another threat to internal validity is testing, since exposure to the pretest may have improved performance on the posttest. There is also the threat of instrumentation effects as the tests may not
Fig. 5.2 Example of the one-group pretest-posttest only design
have been well standardized. (Indeed, the authors are silent about the test-retest reliability of their instruments.) Statistical regression poses another possible threat, if the study subjects had been chosen on the basis of extremely poor scores on the initial test. In the final analysis, because so many potential individual biases are uncontrolled in this study, there is also the strong likelihood that interaction of these factors could undermine its internal validity and the conclusions drawn from it. Indeed, Campbell and Stanley argued that this type of design should be used only when nothing else can be done.
The study also suffers from several threats to external validity. First, very few subjects were studied, and it is highly unlikely that they were representative of all patients being treated for ADHD (selection-treatment interaction). Second, the subjects (as well as their doctors) were unblinded, and subjects may have improved due to the effects of their participation in the study (reactive effects of experimental arrangements). These issues are noted only for completeness. As noted above, this study fails to meet criteria for internal validity; thus, its generalizability is unimportant.

Pre-Experimental Research Design # 3
The Static-Group Comparison

A third pre-experimental design also found in the literature is the static-group comparison. This design incorporates two groups: one that receives an intervention (again denoted as X) and a second that does not receive an intervention and which serves as a control (denoted by the absence of X). Groups one and two typically are observed concurrently after the intervention has been applied in one of the groups, and the observations made in these groups are denoted by the Os. This design includes no pretesting or baseline measurements. Note that both intervention and control groups are separated, schematically, by a dashed line to indicate that study subjects were assigned to treatment as intact groups, that is, they were not randomly allocated to treatment. A study, published by Bolland et al. in the Journal of the American Dietetic Association [12], employed a variant of this design which tested for effects extended over time (summary and design structure are given in Fig. 5.3).
Are these conclusions credible? A review of the structure of this design will be revealing. In this study, X represents the food quantity estimation intervention, and the O represents the post-intervention assessments of knowledge of food quantities in the experimental (trained) and control (untrained) groups, assessed at three different times among trained subjects. (The reader should note that the use of deferred assessments is not typical of the static-group comparison design but was used in this study in an attempt to define persistence of treatment effects.) The broken line
Fig. 5.3 Example of the static-group comparison design
between the experimental and control groups indicates the intact nature of the comparison groups, signifying that subject assignment to the intervention or control comparison group was not random.
The static-group comparison design represents an improvement over the one-shot case study because the inclusion of a contemporaneous control group permits comparison of the results of the trained study subjects with the other, untrained study subjects, evaluated approximately in parallel, thereby avoiding the obvious biases inherent in the use of external or historical controls (or, in the worst-case scenario, no controls). Moreover, the fact that study subjects in both groups are being evaluated in the same way during a relatively short interval decreases the potential for maturation and instrumentation effects (assuming uniform data collection). Finally, this design also represents an improvement over the one-group pretest-posttest only design because the absence of pretesting and subject selection based on extreme pretest scores obviates the threat of testing effects and statistical regression.
Nonetheless, there are two potential threats to internal validity for which this design affords absolutely no protection. The first threat is selection (or allocation) bias. The authors do not tell us how the study subjects were divided into treatment groups. Was it by instructor preference or self-selection by the study subjects? Either of these scenarios would be equally flawed because without baseline (pre-intervention) assessments, there is no way to determine whether the observed outcomes were due to the training or to pre-intervention differences in the subjects' knowledge about estimating food quantities. Even if the investigators had attempted to match the groups on other variables, such matching would be ineffective in achieving true baseline parity among trained versus untrained subjects, especially if subjects had, indeed, self-selected participation in the intervention. In addition, even though the study was relatively short in duration, the validity of the conclusions, nonetheless, is threatened by the potential for experimental mortality (attrition bias) as no information is given about whether all subjects who began this study actually completed it or whether attrition (if it did occur) differed systematically between the two groups. Thus, even if subjects were comparable on average before training, the apparent superiority of the
trained group (relative to the untrained group) on the outcome measure possibly could have been due to several of the less knowledgeable students dropping from the former group (or, conversely, due to some of the more knowledgeable students dropping from the latter group) prior to testing.
The primary threat to external validity is the interaction of selection and treatment. (After all, how representative is one class of introductory nutrition students of the larger relevant population?) However, since the internal validity of the study is severely compromised, this threat to external validity has little if any importance.

True-Experimental Research Designs

The most prominent characteristic of true-experimental designs is random allocation of study subjects, drawn from a common population, to alternative treatment conditions. When this approach is employed, participants' baseline characteristics can be expected to be equally distributed across the various comparisons according to the laws of probability, especially when sample size is large. Even when randomization does not result in perfect equivalence, most workers in the field believe that this form of treatment allocation is the best way to reduce the threat of selection bias. The theoretical underpinnings of randomized designs can be traced to Fisher and Mackenzie's agricultural experiments in the 1920s [13]; however, it was not until the late 1940s that they made their appearance in the medical literature, when the RCT was first used to demonstrate the efficacy of streptomycin in the treatment of tuberculosis [14]. Since that time, the RCT has been considered the standard to be met for clinical research, even though investigations of this type comprise only a minority of the clinical research ever conducted or published. Randomization also is important in many preclinical/basic science research protocols, though other considerations may minimize application of this approach in the nonclinical setting.
Most commonly, randomization is fixed; less commonly, it is adaptive. With fixed random allocation, each subject has an equal probability of assignment to the alternative study arms, and that probability remains constant throughout the study. The randomization process can be performed with a coin toss or a table of random numbers, or special computer software can be used. This type of randomization is known as simple randomization and works best when sample size is relatively large. However, when sample size is small, simple randomization may result in statistically unequal groupings. Under these circumstances, restrictive randomization methods (e.g., blocked randomized designs or stratified random allocation) can be employed. With blocked randomization, subjects are assigned to treatment in groups (blocks) that are similar to one another with regard to a source (or several important sources) of variability that is (are) not of primary interest to the experimenter (e.g., a potential confounding variable such as gender or geographic area). Stratified randomization is performed by conducting separate randomization procedures within each of two or more subgroups of subjects that are defined according to prespecified patient characteristics (usually important prognostic risk factors) and increases the likelihood that allocation to treatment is well balanced within each stratum. With adaptive methods (a Bayesian approach increasingly used in contemporary clinical trials) [15], the probability of allocation changes in response to accumulating information during the study about the composition of, or outcomes associated with, the alternative treatment arms. (For a comprehensive discussion of the theory and techniques of adaptive randomization, the reader is referred to Hu and Rosenberger, 2006 [16].)
As noted, the purpose of randomization is to render the comparison groups as similar as possible at study entry to permit valid inferences to be drawn about the effects of an intervention. However, during the course of the trial, some patients may not initially receive the intended intervention or, during the course of the study, may drop out or cross over to the alternate treatment for a variety of reasons. One widely used solution to circumvent these problems is intention-to-treat analysis (ITT), which defines the comparison groups according to initial assigned treatment
rather than to the treatment actually received or completed (i.e., once randomized, always
analyzed). Many workers in the field consider ITT analysis to be the gold standard method
of analysis for clinical trials [17], describing it as the least biased for drawing
inferences about trial results [17, 18], and it is considered the pivotal analysis by major
regulatory bodies in Europe and in the USA for approval of new therapeutics. However, the
reader should note that ITT analysis provides only a pragmatic estimate of the benefit of a
new treatment policy rather than an estimate of potential benefit in patients who receive
treatment exactly as planned; moreover, full application of this method is possible only
when complete outcome data are available for all randomized subjects [19]. Thus, the ITT
approach is not without its critics [20]. Some clinical trialists argue that efficacy is
best demonstrated when analysis focuses on subjects who actually received the treatment of
interest (sometimes termed efficacy subset analysis), arguing that ITT approaches provide
an overly conservative estimate of the magnitude of treatment effects principally due to
dilution of effects by nonadherence. In addition, ITT analysis creates difficulty in
interpretation of findings if numerous participants cross over to opposite treatment arms.
Finally, it is suboptimal for studies of equivalence, generally increasing the likelihood
of erroneously concluding that no difference exists between two test articles [21].
A common solution is to employ both methods of analysis in the same study, using ITT and
on-treatment approaches as primary and secondary analysis, respectively.

Four of the most common true-experimental designs found in the biomedical literature are
the following:
1. The pretest-posttest control group design
2. The posttest only control group design
3. The true-experimental 2 × 2 factorial design
4. The crossover study (two-period design)

The first two designs can be used to evaluate the impact of a single intervention (vs.
control or an alternate intervention), and the third and fourth permit the investigator to
examine the separate effects of two interventions (again, vs. control or an alternate
intervention) applied within the same study. All provide much better protection than do
pre-experimental designs against most threats to internal validity.

True Experimental Design # 1
The Pretest-Posttest Control Group Design

In the most common form of the pretest-posttest control group design, study subjects are
randomly allocated to two comparison groups or treatment arms. One group receives the
experimental intervention and the second, no intervention, a placebo, or an alternate
intervention. Both groups are observed, in parallel, before and after the intervention on
the same outcome measure(s) to determine whether change varied as a function of the
treatment. The structure of this design is represented symbolically above: R denotes that
subjects have been randomly allocated to the comparison groups; X denotes that a treatment
has been given to the first group; absence of X in the second group indicates that this is
a control group (the control group also could have been denoted by X0 [or Xp if a placebo
had been given]). O and its positioning indicate the observations made in both groups
before and after the intervention. An example of a study incorporating this design was
published by Gorbach et al. in the Journal of the American Dietetic Association [22]
(summary and design structure are given in Fig. 5.4).

Fig. 5.4 Example of the pretest-posttest control group design

The structural representation of this study is a clue to the strength of its internal
validity. Here, X represents fat reduction dietary intervention; the absence of X
represents no dietary intervention, the control group; O1 and O3 represent baseline fat
intake in the experimental and control groups; O2 and O4 represent post-intervention fat
intake in both groups; R signifies that the study is randomized.

Because study subjects have been randomly allocated to comparison groups from a common
subject pool, selection bias has been removed as a serious threat to internal validity,
assuming that the randomization was effective. Having baseline measures of the dependent
variable (and other
key variables that potentially could influence it) and comparing them between groups
permits us to confirm or reject this assumption; these comparisons typically are expressed
in tabular form in most published RCTs. History effects are controlled because if a
potentially confounding general external event had occurred, it should have affected the
comparison groups equally since they are studied in parallel; nonetheless, as noted earlier
in this chapter, the investigator must be vigilant and attempt to control for differences
between comparison groups that might occur on a more micro level (i.e., within group
variations in temperature, time of day, season, etc.). For similar reasons, the use of a
parallel design also protects against the threats of maturation, testing, and
instrumentation effects because natural variations in these factors should impact
comparison groups equally; instrumentation effects also are minimized here because all data
were collected using standardized techniques. In this study, subjects were selected on the
basis of high risk for breast cancer, not on the basis of extremes in pre-intervention fat
and energy intake. However, even if they had been chosen according to the latter criterion,
average regression effects would not confound interpretation of the results because if they
had occurred, they should have been equivalent in the comparison groups, given that the
subjects were randomly allocated from a common subject pool. Thus, this design also
protects against statistical regression. Finally, while treatment assignment could not be
fully blinded (as noted earlier, a common characteristic of studies evaluating impact of
lifestyle interventions) to entirely eliminate the threat of expectancy effects, the
investigators endeavored to reduce them by standardizing their methodology for outcome
ascertainment and by blind-coding data to ensure that subjects in the control group and
those receiving the intervention were evaluated uniformly and impartially. The one error
made in this study was the use of an incorrect test of statistical significance (i.e.,
computing two sets of t-tests, one for the experimental group and one for the control
group, rather than conducting direct statistical comparisons of the changes between the
groups). With this single exception (which Campbell identified as a wrong statistic in
common use among investigators employing
these designs [1]), the use of random allocation to parallel treatment groups afforded by
the application of the pretest-posttest parallel group design, coupled with standardized
data collection methodology, protected this study very well from most factors that could
have undermined its internal validity, thus maximizing the likelihood that the
intervention, rather than other factors, was responsible for the observed outcomes.

However, the external validity of this study is open to question. The reason is that
randomized designs, including this model, may lead to conclusions that, while internally
valid for the study, may not generalize to the reference population for the following three
reasons.

First of all, in this study, pretests were used to assess relative change in fat and energy
intake in the comparison groups. Their use may have sensitized study subjects to the
intervention, with the possibility that results might not generalize when the intervention
is applied without pretesting. This threat to external validity, known as the interactive
effect of testing and treatment and described earlier, is a potential problem for any
pretest-posttest comparison design, randomized or not, unless the testing itself is
considered a component of the intervention being studied.

Another potential threat to external validity is the interaction of selection and
treatment. Since the purpose of hypothesis testing is to make inferences about the
reference population from which study subjects are drawn, the representativeness of the
study group must be ascertained for the general population of women at high risk for breast
cancer. As noted, the majority of subjects in this study were well educated, and a quarter
had annual household incomes that were relatively high for the time (1990). It is also
relevant that patients were excluded from the study for a number of reasons including, but
not limited to, their unwillingness to sign an informed consent form, or because they were
judged by the study nutritionist to be potentially unreliable in complying with the study
protocol. Unfortunately, as is the case for many published RCTs, the authors fail to state
how many patients were excluded for these reasons, making it difficult to evaluate the
potential adverse impact of the selection-treatment interaction, which must be considered,
even though hundreds of subjects were enrolled in the trial.

A third potential threat to the external validity is the reactive effects of the
experimental arrangements. Because the intervention was not part of the routine care of
this population and informed consent was required, subjects certainly were aware of their
participation in an experiment. All subjects would have been exposed to the novelty
associated with random allocation techniques and new ways of keeping food records. Subjects
in the intervention group would have been exposed to new health-care providers (in this
study, the nutritionists) and, as a part of such intervention, may well have received more
attention from project personnel than those told to follow customary diets (i.e., the
control group), unless a Hawthorne control had been built into the study (which it had
not). Any of these factors might have led to changes that were due to reactivity to the
experiment (a possibility that is supported by changes in fat and energy consumption,
albeit of a lesser magnitude, among control group participants), raising the concern that
the effects of the intervention might not be replicated when applied nonexperimentally.

True-Experimental Design # 2
The Posttest Only Control Group Design

R   X   O1
R       O2

The next approach, called a posttest only control group design, again utilizes two groups:
each has been randomly allocated to treatment; as before, one group receives the
intervention, represented by X, and the second group either receives no intervention, an
alternate intervention, or, if it is a drug study, sometimes a placebo (designated as XP).
Both are observed after the intervention only, as shown by the positioning of O. The major
distinction between this design and the preceding one is that, here, study subjects are not
assessed on the dependent (outcome) variable at baseline. Instead, they are compared only
after the intervention. Unless knowledge of relative change on an
outcome is required, baseline assessments of the dependent variable are not necessary for
establishing comparability of the comparison groups in true-experimental designs, since
random allocation to treatment should eliminate the threat of selection bias. As noted
earlier, this is especially true if the number of study subjects is large and the
randomization strategy is properly executed. Nevertheless, baseline data on relevant
demographic and clinical variables other than study outcomes typically are collected to
permit examination of this assumption.

The posttest only control group design is especially appropriate in situations where
within-subject outcomes logically cannot be defined before application of the intervention
(e.g., in studies relating impact of the intervention on survival). A classic example was
published by the β-Blocker Heart Attack Research Group in the Journal of the American
Medical Association [23] (summary and design structure are given in Fig. 5.5).

Fig. 5.5 Example of the posttest only control group design

In this study design, X represents the experimental drug, in this case propranolol, and XP
is the placebo. O1 and O2, respectively, represent the percent mortality for the
propranolol and placebo groups. As before, the symbol R denotes the use of randomized
allocation to treatment group.

How well does this study design protect against threats to internal validity? The answer is
very well. Again, as for pretest-posttest parallel control group design, the use of random
allocation of almost 4,000 patients to treatment assignment controls for selection bias
(the comparability of the distributions of baseline clinical variables,
electrocardiographic abnormalities, age, gender, and other descriptors between the
propranolol and placebo groups noted in their manuscript illustrates this point). In
addition, the use of parallel comparison group post-intervention comparisons, rather than
sole reliance on within-group changes without controls, effectively rules out history,
maturation, testing, mortality, regression, and instrumentation effects and their
interactions as competing explanations for the outcomes. In addition, because the study was
double blinded, both subject expectancy and experimenter bias also are eliminated as
potential threats to validity.

The study also is superior to that of Gorbach et al. with regard to external validity. The
reason is that the posttest only comparison group design does not require pre-intervention
assessments as a benchmark against which to establish intervention effects. Thus, by
definition, it controls for the reactive effects of testing. Indeed, this is the primary
advantage of this design versus the pretest-posttest parallel group design. In this study,
the outcomes of the intervention were all hard events rather than behavioral or educational
outcomes, and the intervention, itself, involved medication rather than promotion of
lifestyle change. Therefore, the reactive effects of experimental arrangements, if any,
should be minimal, provided that the investigators took care to minimize the obtrusiveness
of the experimental manipulations and measurements. Nonetheless, while the study was large
and multicentered, the authors reported that 77% of those patients invited to participate
did not do so. Therefore, despite the many thousands of patients enrolled, there is still a
question of how representative the sample was of the general population after a recent MI.
Consequently, the external validity of this study potentially is threatened by the
selection-treatment interaction which, as noted earlier, is a common problem in many RCTs.

True-Experimental Design # 3
The 2 X 2 Factorial Study

The first two true-experimental designs permitted the investigator to evaluate the impact
of a primary treatment versus an alternative primary treatment or control.
True-experimental factorial designs are modifications that include a secondary treatment
administered concurrently with the primary treatment to permit examination of the
modification of the main and interactive effects of each. They can be designed with and
without pretests (as above) and with or without blinding, if the latter is not practical or
possible.

An example of these designs is diagramed above. This exemplar is termed a 2 × 2 factorial
true-experimental design and includes four concurrent parallel groups: the first two groups
receive a primary treatment, denoted by X, and the second two receive no primary treatment,
denoted by the absence of X (or, alternatively, X0) or Xp if placebos are given to the
nontreatment controls. In a variation of this design (for evaluation of comparative
effectiveness), the second group might receive an alternative primary treatment (in this
case, these treatments would be designated X1 and X2 to differentiate them). One group
receiving the primary treatment and one receiving an alternate treatment, or no primary
treatment, also receive a secondary treatment, denoted here as Y. The remaining two groups
do not receive the secondary treatment or may receive a placebo. The groups are observed in
parallel after application of the intervention, as denoted by O. A 2 × 2 true-experimental
design, published by the International Study Group in The Lancet [24], was employed to
evaluate the relative effectiveness and safety of two thrombolytic drugs administered with
or without heparin (summary and design structure are given in Fig. 5.6).

Fig. 5.6 Example of a 2 × 2 factorial true-experimental design

In this study, X1 represents streptokinase, and X2 represents alteplase. Y indicates
concomitant administration of heparin; the absence of Y indicates that no heparin was
given. O1-O4 denote the percentages of in-hospital deaths in each of the comparison groups
(Fig. 5.6).

Because this study (like those using true-experimental designs #1 and #2) employed a design
that randomly allocated subjects to four large parallel treatment arms, selection bias is
controlled as are history effects, maturation, instrumentation, testing, experimental
mortality, and regression. Unfortunately, neither patients nor investigators were blinded
to the former's treatment assignment. Thus, the study did not control for the potential
effects of expectancy. This omission is important because even though the dependent
variable clearly was an objective outcome (i.e., death) and randomization led to groups
that appeared to be well balanced at study entry, knowledge of the treatment assignment
still could have resulted in unintended differences between the treatment arms in the use
of nonprotocol-mandated co-interventions (e.g., percutaneous coronary angioplasty or
coronary bypass grafting) that, themselves, could have influenced study outcomes. This
design flaw, of course, is not a limitation of the true-experimental factorial design
(which, otherwise, controls very well for major threats to internal validity) but, as noted
earlier, is a problem associated with any open (unblinded) study. Had the study been
blinded,
this true-experimental factorial design, like the two preceding true-experimental designs,
would, in theory, have afforded full protection against most, if not all, serious threats
to internal validity.

The chief advantage of this study design for external validity (vs. the crossover study,
discussed below) lies in the fact that its structure permits a purposive and systematic
evaluation of the separate and combined (i.e., interactive) effects of concomitant
investigational therapies, thereby avoiding unplanned carryover effects and precluding the
threat of multiple treatment interference. Though this design can increase the efficiency
of interventional trials by permitting simultaneous tests of several hypotheses, the reader
should be aware that if interactions are severe, loss of statistical power is possible
[25]. A limitation to the external validity of this particular study (but not to factorial
designs in general) is the coadministration of noninvestigational drugs (i.e., β-blockade
and aspirin) among all patients without contraindications to these therapies, which
prevents us from generalizing the mortality findings to similar patients in whom these
therapies are not given.

True Experimental Design # 4
The Two-Period Crossover (Change-Over) Design

[Period A] [Period B]

In the previous example, the main and interactive effects of two treatments were evaluated.
To accomplish this, a factorial parallel (between-subjects) design was used that required
allocation of large numbers of subjects into four different treatment arms, resulting in
one protocol-mandated exposure per subject during the course of the study. In contrast, if
the study objectives were to determine only the main (isolated) effects of two treatments,
rather than their interactions, this objective could be accomplished more
efficiently (i.e., with fewer subjects producing equivalent statistical power or precision)
using the true-experimental crossover (or changeover) design. A crossover design is a type
of repeated measures design in which each subject is exposed to different treatments during
the study (but they cross or change over from one treatment to another). The order of
treatment administration (determined a priori via randomization) is termed a sequence, and
the time of the treatment administration is called a period. The statistical efficiency of
the design results from the fact that each subject acts as his or her own control, thereby
minimizing error due to (and sample size needed to overcome) the effects of between-subject
variability. Crossover designs have enjoyed popularity in many disciplines including
medicine, psychology, and agriculture. They are commonly used in the early stages of
clinical trials to assess the efficacy and safety of pharmacological agents and constitute
the preferred methodological approach for establishing bioequivalence. A variant that can
be used for these purposes is the n-of-1 study, a mini-RCT in which a single patient is
observed during exposure to randomly ordered sequences of treatment (frequently given in
varying doses) and placebo. Both the patient and clinician are blinded as to treatment
allocation, and the codes are broken after the trial. Responses, such as reported side
effects, are graphed or analyzed through a variety of parametric and nonparametric
statistical techniques. When performed in series, the n-of-1 study can provide valuable
information for subsequent parallel group trials.

A crossover study has utility for clinical research only when three conditions are
satisfied: (1) subjects must have a chronic stable disease that is not likely to progress
during the study; (2) study endpoints must be transitory, that is, must reflect temporary
physiological changes (e.g., blood pressure) or relief of pain, rather than cure (or
death); and (3) the investigational treatments must be able to deliver relatively rapid
effects that are quickly reversible after their withdrawal. The latter point is especially
critical. If the effects of the investigational interventions are permanent or more long
lasting than anticipated, their carryover effects could compromise the validity of data
obtained after the initial period (e.g., cause under- or overestimation of the efficacy of
the second treatment) and undermine the efficiency of the study.

Although crossover studies can involve multiple periods and sequences, the most common is
true-experimental design #4, the two-period crossover design, illustrated symbolically
above. When this approach is used to test the efficacy and safety of different
investigational drugs, subjects normally will undergo a run-in period during which
noninvestigational medications are discontinued and a suitably long washout interval
between the two active treatment periods, A and B (the latter guided by the bioavailability
of the drugs), so as to minimize the likelihood of carryover effects. Typically, half of
the sample initially receives the first drug, denoted by X1, and the other half initially
receives the second drug, denoted by X2. Following the washout, study subjects who received
the first drug are given the second drug, and vice versa, resulting in a fully
counterbalanced design. Observations are recorded pre- and postdrug administration in the
two treatment periods, denoted by O. The symbol R to the left of the diagram indicates that
the order of initial treatment assignment is allocated at random to counter possible order
effects. An example of a study employing a crossover design was conducted by Seabra-Gomes
et al. [26] who evaluated the relative effects of two antianginal drugs on exercise
performance in men with stable angina (summary and design structure are given in Fig. 5.7).

Fig. 5.7 Example of a true-experimental two-period crossover study

In this study, X1 denotes isosorbide-5-mononitrate and X2 stands for isosorbide dinitrate.
O1-O3 are the outcome variables measured among patients receiving X1 during period A; O4-O6
are the same variables measured during period B. O7-O12 are the outcome variables measured
among patients initially receiving X2. R indicates that the order of the initial drug
assignments was randomly allocated.

As with all other true-experimental models, internal validity is very well controlled with
this design. Selection bias is eliminated because study subjects are their own controls and
comparisons of outcomes are made within rather than between
individuals. As for true-experimental designs #1-3, the use of parallel comparison groups
studied within a relatively short time interval generally affords good control of history,
maturation, and similar effects. In addition, the use of double blinding (specific to this
study, though not necessarily to this design) eliminates the threat of expectancy on the
part of the investigator and subject.

There are, however, a number of potential threats to the external validity of any crossover
study. Most prominent are the interactive effects of testing and treatment which could
limit generalizability due to the potential sensitization (or desensitization) effects of
multiple pre-intervention assessments. Of course, here again, the less obtrusive the
measures, the less worrisome the threat. Second, a study of this nature is vulnerable to
the threat of a selection-treatment interaction. The reason is that the number of study
subjects in this study is relatively small, as is commonly the case in crossover studies
(indeed, as noted previously, this is an advantage of these studies compared with parallel
designs without crossover, which require larger numbers of subjects for equivalent power).
This reduces the number of comparisons that can be made and amplifies the
impact on outcome of the choice to participate or not to participate based on factors
extraneous to the aims of the study. The number of available comparisons is further reduced
if subjects discontinue their participation before the study has ended because failure to
complete the study precludes determination of within-subject treatment differences, the
underpinning of the crossover study. If the number of dropouts were high, the study could
be underpowered despite initial planning to avoid this. (The reader should note that in the
Seabra-Gomes study, 15% of subjects initially participating failed to complete it; their
data could not be used.) In addition, unless the experimenter took care to reduce the
obtrusiveness of the study, the inherently novel aspects of the crossover design
(alternating treatments, coupled with multiple observations) could cause reactive effects
that might not appear in a more natural setting (reactive effects of experimental
arrangements). Perhaps the greatest potential threat to the external validity of a
crossover study lies in the potential for multiple treatment interference because, as noted
above, there may be carryover effects between treatments that may not be generalized to the
single treatments under investigation. This may occur when the alternative treatments being
compared are not adequately separated in time (washed out) or, unbeknownst to the
investigator when designing the trial, lead to permanent change (e.g., cause liver or
kidney damage). Under these circumstances, the response to treatment in period B may be
importantly influenced by a residual effect from the treatment given during period A,
producing an under- or overestimation of the efficacy of the second treatment. Because of
this potential limitation, crossover studies generally are less favored than parallel
designs for definition of treatment efficacy. Indeed, as a practical matter, when such
studies are undertaken to obtain regulatory approval or labeling elements for a treatment,
investigators should consult with the appropriate regulatory body (in the United States,
the Food and Drug Administration [FDA]) as to the acceptability of the design for the
particular purpose.

Quasi-experimental Designs

If the value of an intervention study were to be judged solely on considerations of
internal validity, most investigators would opt to employ fully blinded true-experimental
designs. Yet, despite their undisputed methodological superiority for providing evidence of
cause and effect relationships, these designs are employed in only a minority of published
studies that have evaluated the impact of interventions on outcomes of interest. As noted
above, even well-constructed, true-experimental designs are subject to limitations in
external validity. They also can be difficult, if not impossible, to apply within the
constraints of many research environments. Such constraints may include the lack of
concurrently available comparison groups (commonly due to ethical problems caused by
withholding a preferred treatment from control subjects) and, especially, the inability to
randomly allocate study subjects into different treatment groups in order to minimize the
threat of selection bias (in clinical research, commonly due to physician or patient
refusal based on assumptions about outcome or to more complex psychological factors). To
compensate for these deficiencies, and to render research feasible in constrained
situations, Campbell and Stanley popularized a concept known as quasi-experimental design.
This approach can be applied to individual subjects or to populations and to evaluations
conducted in practice-based and field settings. It can help the investigator to control
some threats to internal validity that would be uncontrolled with pre-experimental designs
or externally controlled studies and can be very useful in the evaluation of therapies,
educational programs, and policy changes in many disciplines.

Like true-experimental designs, all quasi-experimental designs involve the application of
an intervention and observations of at least one outcome that is related to the
intervention. However, quasi-experimental designs typically lack the hallmark of the true
experiment, that is, random allocation to treatment group. Of these, the most widely used
for evaluating the impact of
clinical and other health-related interventions on group outcomes are the following:
1. The nonequivalent control group design
2. The time-series design
3. The multiple time-series design
The first design can be used to evaluate the impact of an intervention using a single before and after assessment of the dependent variables in two or more comparison groups. The second uses multiple assessments, conducted over time, of the dependent variable in a single group of subjects. The third (a combination of quasi-experimental designs #1 and #2) includes multiple assessments, again over time, but in two or more parallel groups. Because the observations in designs #2 and #3 are broken up by the imposition of the intervention, both also are termed interrupted time-series designs. (The reader is referred to Kazdin [27] or to Janosky et al. [28], for a detailed discussion of other quasi-experimental designs used for research with single or small groups of subjects, and to Campbell and Stanley [1], Cook and Campbell [2], and Shadish, Cook, and Campbell [29], for additional quasi-experimental designs used with larger groups or populations).

Quasi-Experimental Design # 1
The Nonequivalent Control Group Design

O1   X   O2
- - - - - - - - -
O3       O4

The nonequivalent control group design (also termed the nonequivalent comparison design) compares outcomes among two or more intact groups, at least one of which receives the intervention; another serves as the control. This design is most useful when concurrent comparison groups are available, when random allocation to treatment condition is not possible, and when pretesting of the dependent variable can be performed so that baseline similarity of the comparison groups can be evaluated. It is commonly used when comparison groups are spontaneously or previously assembled entities (e.g., different clinics, wards, schools, or geographic areas) or when logistic difficulties preclude random allocation to treatment within the same entity.
The basic structure of this design is symbolized above. It is almost identical to the pretest-posttest true-experimental control group design except that study subjects are not randomly assigned to treatment groups; therefore, the groups cannot be assumed to be equivalent before the intervention. As before, X symbolizes the intervention, O denotes the pre- and post-intervention assessments in each of the comparison groups, and the dashed line (and absence of R) indicates that intervention was applied to an intact group (i.e., allocation was not random).
Steyn et al. [30] used a nonequivalent control group design to examine the intervention effects of a community-based hypertension control program (the Coronary Risk Factor Study [CORIS]) that was introduced for 4 years among white hypertensive residents in two rural South African towns (summary and design structure are given in Fig. 5.8).
In this study, O1, O3, and O5 represent baseline systolic blood pressure and diastolic blood pressure in the intervention and control towns; O2, O4, and O6 represent post-intervention blood pressures in these towns. X1 represents the low-intensity hypertension reduction intervention, X2 represents the high-intensity intervention, and the absence of X denotes the lack of intervention (the control). The dashed line indicates intact (nonrandom) treatment assignment.
Because allocation to the intervention was not performed randomly, confounding variables may have influenced the observed outcomes. Therefore, internal validity is not as well protected as it is with true-experimental design #4 (the pretest-posttest control group design), which has a similar structure but includes random allocation. The greatest potential threat to internal validity with this design is differential selection, which could cause the comparison groups to vary on key factors related to the dependent variable; if present, selection bias could interact with other potential biases such as maturation (e.g., a sicker group could have disease that might progress more rapidly) or regression (if one of the two groups were chosen on the basis of extreme values). Selection bias can occur if the investigator evaluates the intervention in two intrinsically
dissimilar populations or uses a nonuniform subject recruitment approach (e.g., permits subjects to self-select their treatment assignment). However, if care is taken to avoid these practices, the availability of baseline measures of the dependent variable, a critical component of the nonequivalent control group design, permits the investigator to evaluate the extent and direction of a potential selection bias and to minimize it, as appropriate, through covariance analysis. Therefore, this design affords much greater control for this selection bias than pre-experimental design #3 (the static-group comparison) which also contrasts outcomes across intact groups, but which lacks critical baseline data needed to establish initial comparability. Where pre-intervention data show relative comparability between groups on relevant variables, the nonequivalent control group design generally is appropriate; when pre-intervention comparability is not present, an alternative design should be used. In the CORIS study, the authors state that the groups had similar blood pressures prior to the intervention. Thus, it is not likely (though, certainly, it is not impossible) that the differences found after the intervention were attributable to selection bias. The inclusion of baseline measures also permits the investigator to evaluate the potential threat of experimental mortality (attrition bias). If there were losses to follow-up among the comparison groups, their potential impact could be evaluated by comparing baseline characteristics of those who withdrew with those who completed the study. The authors of CORIS, who performed this analysis, found that those who withdrew were similar to those who remained with regard to age, gender, initial cholesterol levels, blood pressure, body mass index, and smoking behavior. Thus, the potential threat of experimental mortality was effectively ruled out.

Fig. 5.8 Example of a nonequivalent control group study

In the absence of differential selection and a hypothesized interaction between selection and the day-to-day experiences of the subjects, history effects are not plausible as an alternative (rival)
explanation for the observed outcomes and, thus, also can be ruled out as a major potential threat to internal validity when using the nonequivalent control group design. The reason is that, barring evidence to the contrary, external events occurring in one comparison group should be just as likely to occur in the other when subjects are evaluated in parallel. However, as with true-experimental designs, the burden remains with the investigator to ascertain the degree to which other relevant events may be occurring in the intact group settings that might also affect outcomes; this is especially important when comparators are geographically separated, as in this study. Also, because groups are studied in parallel, internal validity threats such as maturation, testing, instrumentation, and regression effects are fairly well controlled (again, assuming the groups share common baseline characteristics). Finally, any potential biases associated with expectancy are not inherently greater than those found with true-experimental designs and may be reduced, at least in part, by uniform standards for data collection and analysis (as was done in CORIS).
As with true-experimental design #2, the use of pre-intervention testing (essential with this design for establishing baseline comparability of the comparison groups) may pose a threat to external validity unless the testing itself is deemed to be part of the intervention, as it would appear to be in the CORIS study. Additionally, as with any design, a selection-treatment interaction can occur if the study subjects are not representative of all subjects who potentially could be studied. Indeed, the authors of CORIS recognized that their findings did not necessarily apply to individuals of ethnic backgrounds and socioeconomic statuses not included in CORIS. In general, however, the nonequivalent control group design places far fewer restrictions on sampling and, therefore, tends to be more generalizable than the typical randomized parallel group trial. Lastly, the reactive effects of experimental arrangements potentially could limit the external validity of studies using this design, but because they entail comparisons of interventions applied to naturally occurring groupings, they tend to be less reactive and, thus, have better external validity than most true experiments.

Quasi-Experimental Design # 2
The Time-Series Design

O1 O2 O3 O4 X O5 O6 O7 O8

The previous example compared the impact of an intervention on outcomes using several intact groups. Occasionally, an investigator planning to evaluate an intervention may be unable to identify a suitable (or any) comparison group. This might occur when patients are candidates for a treatment, the effectiveness of which is to be tested, but an alternate treatment is not available, or if available, is viewed as unacceptable by the patients or their physicians; a similar problem frequently occurs when a specific treatment cannot be withheld for what are considered ethical reasons. Thus, sometimes, interventions must be presented to entire groups, for example, all patients potentially at risk. In these cases, an investigator might opt for a pre-experimental design without a control group (e.g., the pretest-posttest only design), in which a single group of study subjects is observed on just one occasion before and after the intervention, or might compare results obtained in study subjects with external or historical controls. The literature reflects many such examples. Unfortunately, as noted earlier, pre-experimental designs provide very poor control against important threats to internal validity, and comparing results from a current treatment group with those obtained among historical controls is almost always biased in favor of the former, principally due to improvement in the general health of the population over time.
The time-series design (sometimes called an interrupted time-series) represents an improvement over both of these pre-experimental approaches. In its simplest form, multiple observations (the number depending on the stability of the data) are generated for a single group of subjects both before and after application of an intervention. The objective of any study using such a design is to provide evidence that observations made before (and sometimes after) imposition of the intervention differ in a consistent manner from
observations made during the intervention. While special autoregressive statistical procedures often are used for analysis, the hallmark of this and other types of time-series designs is visual analysis of temporal outcome changes in relation to the intervention. Shown below are examples of hypothetical data, reflecting various levels of evidence for inferring cause and effect that, theoretically, can be produced with a time-series design. The reader should note that patterns reflected by lines A and B provide the strongest evidence for inferring intervention effects (note that both show sharp increases in slope concomitant with the intervention, following a stable baseline), those reflected by lines C–E are equivocal, whereas those shown by lines F–H provide no justification for such an inference (Fig. 5.9).

Fig. 5.9 Some possible outcome patterns from the introduction of an experimental variable at point X into a time-series of measurements, O1–O8. Except for D, the O4–O5 gain is the same for all time series, while the legitimacy of inferring an effect varies widely, being strongest in A and B, and totally unjustified in F, G, and H. From Campbell and Stanley, Experimental and Quasi-Experimental Designs for Research, 1E © 1966 Wadsworth, a part of Cengage Learning, Inc. (Reproduced by permission, www.cengage.com/permissions)

Time-series designs can be used to evaluate continuous or temporary interventions and can incorporate retrospectively or prospectively acquired data. They are especially useful and appropriate for modeling temporal changes in response to programmatic interventions or health policy changes in otherwise stable populations. A time-series design was used by Delate et al. [31] to evaluate economic outcomes of a cost-containment policy for Medicaid recipients that was applied continuously throughout their study (summary and design structure are given in Fig. 5.10).
In this study, O represents the number of antisecretory drug claims and expenditures per member per month (PPIs and H2As) before and during the post-policy period (24 such outcomes were measured in total, though only eight observations are shown here for ease of interpretation). X is the prior authorization policy; the symbol accompanying it in the design structure indicates that the intervention is applied continuously. The pattern of the observed H2A data (which emulates line A of Fig. 5.9) and the obverse pattern of the PPI data are used to buttress the investigators' conclusions that the observed changes in the number of claims filed for, and expenses associated with, antisecretory drugs are due to the imposition of the policy.
An example of a time-series design evaluating a temporary intervention can be found in the work of Reding and Raphelson [32] who evaluated the impact of an addition of a psychiatrist to a mobile psychiatric crisis team on psychiatric hospitalization admission rates in Kalamazoo County, Michigan (summary and design structure are given in Fig. 5.11).
In this study, X denotes the mobile psychiatrist intervention and O, the number of state hospitalizations during each of the monthly assessments before, during, and after the intervention (again, 30 outcome assessments actually were performed, reduced to eight for ease of presentation here). The authors' conclusion that the intervention caused the changes in the pattern of
hospitalizations is based on data patterns that conform to the inverse of those shown in Fig. 5.9, line B (i.e., changes on the dependent variable contemporaneous with the intervention that return to baseline after termination).

Fig. 5.10 Example of a time-series design (continuous intervention)
Fig. 5.11 Example of a time-series design (temporary intervention)

In both of these studies, the threats of selection bias and experimental mortality are controlled, provided that the same subjects participate in each of the pre- and post-intervention assessments. Since this is rarely the case in community-based studies, the investigators must take steps to evaluate natural migratory patterns within the community to ensure that these do not confound their results. Dynamic changes within subjects or populations (i.e., maturation effects), if any, usually are well controlled with time-series designs because they (like regression effects) are unlikely to cause variations that occur only when the intervention is applied. For similar reasons, the time-series design controls for testing effects even in cases in which the measurement process is more obtrusive than that used in the Delate and Reding studies.
The chief potential threat to internal validity of studies using time-series designs is history. Because human subjects rarely are studied in a
vacuum, the investigator must be on the alert for outside influences (e.g., programs, policy changes, or even seasonal fluctuations) occurring coincident with the intervention that also might affect study outcomes. For example, to accept Delate's conclusions, one would have to believe that there were no other factors (e.g., changes in physician prescribing patterns, advertising campaigns) to which the subjects were exposed that would have caused them to use fewer PPIs during the post-program period. Similarly, the Reding conclusions are tenable only if one accepts that nothing else (such as another psychiatric intervention or availability of new treatments, etc.) occurred in Kalamazoo County specifically during the tenure of the mobile psychiatrist that also might have reduced admissions to state hospitals. If careful documentation by the investigator rules this out, then history effects become a less plausible alternative hypothesis for the observed changes. A second internal validity threat is instrumentation. If the calibration of an objective measure (or the instrument itself) changes during the study, and if this change occurs when the intervention is applied, then it is difficult to know whether the observations made after the intervention are due to it or to changes in the instrument. The same problem may occur when measurement criteria or outcome adjudicators change in parallel with the intervention, especially when the latter are aware of the study hypothesis. With administrative data, there is always a chance that the methodology used for record keeping might spuriously influence outcomes. For example, a change in the coding of diagnosis-related groups (DRGs) during an intervention might lead the investigator to conclude incorrectly that there were more (or fewer) hospitalizations for a given disease during this interval. To minimize these potential effects, the investigator should endeavor to standardize measures and educate research personnel about such issues. Finally, whenever possible, steps should be taken to blind those interpreting outcomes to knowledge of the treatment period to reduce the influence of expectancy on these assessments.
As with all designs that evaluate change over time, the use of multiple observations, if obtrusive, could compromise external validity by sensitizing subjects to their treatments. The potential for a testing-treatment interaction (or testing reactivity) is heightened with a time-series design because multiple pre-intervention assessments are required to establish the stable pre-intervention pattern against which changes in slope and/or intercept of the post-intervention assessments are compared. For this reason, studies using these designs generalize best when performed in settings in which data are collected as part of routine practice. Additionally, when based on natural experiments, like those reported by Delate and Reding, they cause few, if any, reactive effects because the interventions are experienced as part of the subjects' normal environment. As with any design, however, the ability to generalize outcomes depends on the similarity of the study group to the reference population.
Readers with clinical experience may recognize a variant of the time-series design in which an intervention is reintroduced after one or more intervals of withdrawal. In behavioral research with single subjects or with series of subjects (e.g., studies designed to extinguish inappropriate actions among children with autism or adult schizophrenics or to improve task performance in the setting of attention deficit hyperactivity disorder), this approach is termed an ABAB Design, where A and B respectively denote alternating control and intervention periods. (It is called a BABA Design when the sequence begins with the intervention, followed by its withdrawal and reintroduction, etc.) In other specialties, it is more commonly termed an equivalent time samples design or a repeated treatment design. This general approach has greater control of history and instrumentation effects than the classic time-series design because the probability of some external event or unintentional instrument or observer change tracking with (and accounting for) the effects of intermittent applications of the intervention is arguably lower than it would be when only one application of the intervention is involved. It can be particularly useful as the basis for relatively rigorous determination of the effects of pharmacological therapies (particularly adverse outcomes of chronically employed drugs), when
such effects are predictably transient or reversible in nature. For example, with age, individuals tend to perceive arthralgias and myalgias with relative frequency. Hypercholesterolemia is fairly widespread according to current epidemiological definitions, and the prescription of HMG-CoA reductase inhibitors (statins) to control cholesterol is quite common. The drugs have been well demonstrated in RCTs to reduce coronary disease events and, specifically, mortality, among patients so treated. In some patients (the minority), statins also can cause myalgias and, in fewer still, polyserositis with arthralgias. Most patients are aware of these potential problems from constant reference to them in the news media and often ascribe their symptoms to the statins because of expectancy. Thus, when patients complain of myalgias and/or arthralgias while taking statins, it is incumbent upon the physician to determine whether the association truly is cause and effect. The best approach is to employ an equivalent time samples design, beginning with a careful history of current symptoms on drug (O) followed by withdrawal of sufficient duration to allow drug effects to dissipate, another careful history, and then reinstitution (rechallenge) with the drug, with another O after some period of use. If the result is unclear, the series can be repeated. Unfortunately, in the real world, patients tend to confound outcome by interposing anti-inflammatory drug use concomitantly with cessation of the statin and often refuse the rechallenge. Nonetheless, this example illustrates the importance of understanding and applying the principles of study design in the course of clinical practice. (For further details about the pros and cons of this design as a tool for research and methods for implementing it in clinical populations, the reader again is referred to the works of Campbell and Stanley [1], Cook and Campbell [2], Kazdin [27], Janosky et al. [28], and to Haukoos et al. [33].)

Quasi-Experimental Design # 3
The Multiple Time-Series Design

The multiple time-series design combines the unique features of nonequivalent control group and time-series designs to maximize internal validity. It evaluates relative change over time on one or more dependent variables in two or more intact comparison groups (again, usually preexisting groups assembled for other purposes) at least one of which receives an intervention and one of which does not (the control). Thus, this design creates two experiments, one in which the intervention is compared against a no-intervention control and the second in which pre-intervention time-series data are compared with those obtained after the intervention, thereby increasing the amount of available evidence to buttress a claim of an intervention effect. In its most general design structure, shown above, X symbolizes the intervention (applied within one of the groups), O is the pre- and post-intervention assessment of the dependent variable(s), and the dashed line denotes the intact nature of the comparators. The design is most appropriate when it is not possible to randomly allocate subjects to an intervention, when a concurrent no-intervention group is available for comparison, and when serial data can be (or have been) generated for both groups during the pre- and post-intervention periods. As for the nonequivalent control group design, the availability of baseline data is necessary to evaluate initial comparability of the intervention and control groups. The multiple time-series design was used by Holder et al. [34] to evaluate the effects of a community-based intervention on high-risk drinking and alcohol-related injuries (summary and design structure are given in Fig. 5.12).
In this study, X represents the community-based alcohol deterrence intervention; O (made approximately monthly over a 5-year interval) denotes average (1) frequency of drinking, (2) number of alcoholic drinks consumed per drinking occasion, (3) instances of driving while intoxicated, (4) motor vehicle crashes (daytime, DUI-related, nighttime injury-associated), and proportion of (5) emergency room and (6) hospital admissions for violent assault among the
intervention versus comparison communities. The investigators' conclusion that the intervention caused reductions in high-risk drinking behavior and associated motor vehicle accidents and assaults is based on sustained differential trends in post- versus pre-intervention outcomes among the intervention versus matched control communities.

Fig. 5.12 Example of a multiple time-series design

All of the potential threats to internal validity controlled by the time-series design also are controlled by the multiple time-series design. However, with the addition of a parallel comparison group, there is better control for the potential threat of history unless, as with the nonequivalent control group design, the comparison groups are so poorly selected as to have different external experiences. Similarly, as previously noted, the use of a parallel control group generally affords good protection against threats to validity potentially caused by instrumentation, maturation, and testing because if pre- to post-intervention differences were influenced by these factors, they should be just as likely to impact both the experimental and control groups (again, assuming reasonable baseline equivalence between comparators). Indeed, when properly executed, the multiple time-series design essentially is free from the most important threats to internal validity of an intervention study and, for this reason, generally is considered to be among the most rigorous of the various quasi-experimental designs.
The threats to the external validity of a multiple time-series design are similar to those of the nonequivalent control group and time-series designs and, as for these designs, are minimized by the use of unobtrusive measures, naturalistic interventions, and careful selection of comparators.
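
The "two experiments" logic of the multiple time-series design can be illustrated numerically. What follows is only a minimal sketch under stated assumptions, not the analysis used by Holder et al.: the monthly values, the function names (`ols_line`, `level_change`), and the choice of four pre-intervention observations are all hypothetical. The sketch fits a separate least-squares line to the pre- and post-intervention segments of each series, estimates the change in level at the point of intervention, and then subtracts the control group's change from that of the intervention group. It ignores autocorrelation, which the autoregressive procedures usually applied to such data are designed to handle.

```python
from statistics import mean


def ols_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def level_change(series, n_pre):
    """Estimate the jump in level at the intervention.

    One line is fitted to the first n_pre observations and another to the
    remainder; the two fitted lines are then compared at the first
    post-intervention time point.
    """
    times = list(range(len(series)))
    a_pre, b_pre = ols_line(times[:n_pre], series[:n_pre])
    a_post, b_post = ols_line(times[n_pre:], series[n_pre:])
    t0 = n_pre  # index of the first post-intervention observation
    return (a_post + b_post * t0) - (a_pre + b_pre * t0)


# Hypothetical monthly outcome counts (O1..O8): four before X, four after.
intervention_group = [50, 51, 52, 53, 44, 45, 46, 47]
control_group = [48, 49, 50, 51, 52, 53, 54, 55]

change_intervention = level_change(intervention_group, n_pre=4)
change_control = level_change(control_group, n_pre=4)

# Net effect: change in the intervention series beyond that seen in the control.
net_effect = change_intervention - change_control
print(change_intervention, change_control, net_effect)
```

In this invented example the intervention series drops in level at X while the control series continues along its prior trend, so the net estimate is attributable to neither a shared secular trend nor an event common to both groups. Whether such an interpretation holds in a real study still depends on the baseline comparability of the groups, as discussed above.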

Summary

This chapter has reviewed a variety of alternative research designs commonly used to evaluate the impact of interventions. The examples of their application have been drawn from clinical research. However, the reader should be aware that, to achieve optimal rigor and strength of conclusions, the same design principles can and should be applied in preclinical, cellular, and molecular studies though, because of the relative homogeneity (and nonhuman characteristics) of test material in basic science studies, issues of randomization, blinding, sample sizes, etc., may be handled somewhat differently than in clinical research. Nonetheless, it should be clear from a comparison of the relative strengths and weaknesses of the various study designs reviewed in this chapter that there is no perfect study. The pre-experimental designs offer the least protection against major potential threats to internal validity, providing weakest evidence to support a claim of cause and effect. The true-experimental designs offer the best control over most internal validity threats, providing strongest evidence to support intervention effects, but their generalizability may be compromised by highly restrictive inclusion criteria, patient reluctance to participate in a randomized study, or reactivity caused by pre-intervention testing or the experimental arrangements. The quasi-experimental designs fall somewhere in the middle, providing more protection against internal validity threats than pre-experimental designs but less than that afforded by true-experimental designs. Because most quasi-experimental designs lend themselves to real-world studies of typical (rather than the ideal) subjects or populations, they also offer certain advantages in external validity. Therefore, in many situations, they represent a good compromise for the researcher, particularly when their strengths and limitations are recognized.

Take-Home Points
• The ability to draw valid inferences from data is the cornerstone of research and provides the basis for understanding the new knowledge that research results represent.
• Internal validity reflects the extent to which a manipulated variable can be shown to account for changes in a dependent variable. It is indispensable for interpreting the experiment.
• Ten common threats to internal validity include selection bias, history effects, maturation effects, testing effects, instrumentation effects, statistical regression, experimental mortality, interaction of these factors, experimenter bias, and subject expectancy effects.
• Four threats to external validity (generalizability) are reactive effects of testing, interactive effects of selection and treatment, reactive effects of experimental arrangements, and multiple treatment interference.
• A variety of research designs can be used to evaluate interventions. Each differs in its adequacy for ensuring that valid inferences are made about effects and generalizability.
• The poorest for controlling threats to internal validity are termed pre-experimental designs. These lack adequate control groups.
• The strongest are termed true-experimental designs. They incorporate control groups to which subjects have been randomly allocated but may suffer from lack of generalizability.
• Quasi-experimental designs represent a good compromise when randomization is not possible.

References

1. Campbell DT, Stanley JC. Experimental and quasi-experimental designs for research. Boston: Houghton Mifflin Company; 1963.
2. Cook TD, Campbell DT. Quasi-experimentation: design and analysis for field settings. Chicago: Rand McNally; 1979.
3. Kim SYH, Holloway RG, Frank S, Beck CA, Zimmerman C, Wilson MA, Kieburtz K. Volunteering for early phase gene transfer research in Parkinson disease. Neurology. 2006;66:1010–5.
4. Woodward SH, Stegman WK, Pavao JR, Arsenault NJ, Hartl TL, Drescher KD, Weaver C. Self-selection bias in sleep and psychophysiological studies of posttraumatic stress disorder. J Trauma Stress. 2007;20:619–23.
5. Lanfear DE, Jones PG, Cresci S, Tang F, Rathore SS, Spertus JA. Factors influencing patient willingness to participate in genetic research after a myocardial infarction. Genome Med. 2011;3:39.
6. McCuen RH. The elements of academic research. New York: ASCE Publications; 1996.
7. Rosenthal R. The effect of the experimenter on the results of psychological research. In: Maher BA, editor. Progress in experimental personality research. New York: Academic; 1964.
8. Mayo E. The human problems of an industrial civilization. New York: Macmillan; 1933.
9. Saretsky G. The OEO P.C. experiment and the John Henry effect. Phi Delta Kappan. 1972;53:579–81.
10. Visser RF. Angiographic assessment of patency and reocclusion: preliminary results of the Dutch APSAC reocclusion multicenter study (ARMS). Clin Cardiol. 1990;13:45–7.
11. Wender P, Reimherr F. Bupropion treatment of attention deficit hyperactivity disorders in adults. Am J Psychiatry. 1990;147:1018–20.
12. Bolland J, Ward J, Bolland T. Improved accuracy of estimating food quantities up to 4 weeks after treatment. J Am Diet Assoc. 1990;90:1402–7.
13. Fisher RA. The arrangement of field experiments. J Min Agric. 1926;33:503–13.
14. Medical Research Council. Streptomycin treatment of pulmonary tuberculosis. BMJ. 1948;2:769–82.
15. Berry DA. Adaptive designs: the promise and the caution. J Clin Oncol. 2011;29:606–9.
16. Hu F, Rosenberger WF. The theory of response adaptation in clinical trials. Hoboken: Wiley; 2006.
17. Heritier SR, Gebski VJ, Keech AC. Inclusion of patients in clinical trial analysis: the intention-to-treat principle. MJA. 2003;179:438–40.
21. Lewis JA, Machin D. Intention to treat – who should use ITT? Br J Cancer. 1993;68:647–50.
22. Gorbach SL, Morrill-LaBrode A, Woods MN, Dwyer JT, Selles WD, Henderson M, Insull Jr W, Goldman S, Thompson D, Clifford C. Changes in food patterns during a low-fat dietary intervention in women. J Am Diet Assoc. 1990;90:802–9.
23. β-Blocker Heart Attack Research Group. A randomized trial of propranolol in patients with acute myocardial infarction. JAMA. 1982;247:1707–14.
24. The International Study Group. In-hospital mortality and clinical course of 20,891 patients with suspected acute myocardial infarction randomized between alteplase and streptokinase with or without heparin. Lancet. 1990;336:71–4.
25. Stampfer MJ, Buring JE, Willett W, Rosner B, Eberlein K, Hennekens CH. The 2 × 2 factorial design: its application to a randomized trial of aspirin and carotene in U.S. physicians. Stat Med. 1985;4:111–6.
26. Seabra-Gomes R, Aleixo AM, Adao M, Machado FP, Mendes M, Bruges G, Palos JL. Comparison of the effects of a controlled-release formulation of isosorbide-5-mononitrate and conventional isosorbide dinitrate on exercise performance in men with stable angina pectoris. Am J Cardiol. 1990;65:1308–12.
27. Kazdin AE. Single case research designs. New York: Oxford University Press; 1982.
28. Janosky JE, Leininger SL, Hoerger MP, Libkuman TM. Single subject designs in biomedicine. New York: Springer; 2009.
29. Shadish WR, Cook TD, Campbell DT. Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin Company; 2002.
30. Steyn K, Rossouw JE, Jooste PL, Chalton DO, Jordaan ER, Jordaan PC, Steyn M, Swanepoel AS. The intervention effects of a community-based hypertension programme in two rural South African towns: the CORIS Study. S Afr Med J. 1993;83:885–91.
31. Delate T, Mager DE, Sheth J, Motheral BR. Clinical and financial outcomes associated with a proton pump inhibitor prior-authorization program in a Medicaid population. Am J Manag Care. 2005;11:29–36.
32. Reding GR, Raphelson M. Around-the-clock mobile psychiatric crisis intervention: another effective alternative to psychiatric hospitalization. Commun Ment Health J. 1995;31:179–87.
33. Haukoos JS, Hopkins E, Byyny RL, Conroy AA, Silverman M, Eisert S, Thrun M, Wilson M, Boyer B, Heffelfinger JD, Denver ED HIV Opt-Out Study Group. Design and implementation of a controlled clinical trial to evaluate the effectiveness and efficiency of routine opt-out rapid human immunodeficiency
18. Montori VM, Guyatt GH. Intention-to-treat principle. virus screening in the emergency department. Acad
CMAJ. 2001;165:133941. Emerg Med. 2009;16:8008.
19. Hollis S, Campbell F. What is meant by intention to 34. Holder HD, Gruenewald PJ, Ponicki WR, Treno AJ,
treat analysis? Survey of published randomised con- Grube JW, Saltz RF, Voas RB, Reynolds R, Davis J,
trolled trials. BMJ. 1999;319:6704. Sanchez L, Gaumont G, Roeper PR. Effect of com-
20. Sackett DL, Gent M. Controversy in counting and munity-based interventions on high risk drinking
attributing events in clinical trials. N Engl J Med. and alcohol-related injuries. JAMA. 2000;284:
1979;301:14102. 23417.
6 Protocol Development and Preparation for a Clinical Trial
Joseph A. Franciosa
J.A. Franciosa, MD
Department of Medicine, State University of New York Downstate Medical Center, Brooklyn, NY, USA
e-mail: josephafranciosa@gmail.com

Introduction

A clinical trial protocol is a written document that provides a detailed description of the rationale for the trial, the hypothesis to be tested, the overall design, and the methods to be used in carrying out the trial and in analyzing its results. The protocol represents the means by which a hypothesis will be tested. As such, it must be written in its entirety before the study is performed to help assure the credibility of the results. In addition, it must be prepared in as detailed a manner as possible in order that the elements of the trial can be subjected to constructive critique and that others can replicate it in the future with the expectation of obtaining essentially the same results.

A protocol has a structure and organization made up of elements that follow the conception, development, and conduct of a clinical trial in a chronological fashion. Although these elements vary from protocol to protocol, they typically include the following, in this suggested order: a statement of the background and rationale for the trial; a brief overview of the study design, including the purpose of the study or statement of the hypothesis being tested and the significance of its possible results; a detailed description of the study population, including patient eligibility criteria; implementation of the intervention, study-specific visits, and observations made; a plan for safety monitoring, including reporting of adverse events; ethical considerations; a description of data management plans, including methods of data generation, recording and processing; and statistical considerations, including a detailed description of the study design.

The purposes of this chapter are to briefly describe the clinical trial and to discuss, in depth, the various stepwise components of the protocol structure and organization that guide it. This chapter will focus primarily on protocols for conducting trials in human subjects or patients, with special emphasis on randomized controlled clinical trials that test specific hypotheses. Protocols for other types of clinical research (e.g., epidemiological studies) or for preclinical research (e.g., animal or laboratory bench studies) will not be specifically addressed here, though many of the principles of clinical trials generally are applicable. Though there is ample published information available about protocol development for clinical trials, much of this is dispersed throughout websites, institutional guidelines, proceedings, literature, books, and software and may be difficult to locate [1, 2].
Background, Rationale, and Overview of Study Design

Background and Rationale

The background and rationale section of a protocol is a brief but comprehensive introductory section that should provide a compelling argument to justify the proposed research. Some key components of this section are shown in Table 6.1. It should succinctly summarize what has been done by the investigators and others in the specific and related areas of research, it should highlight what deficiencies exist and what additional information is needed, and it should state how the proposed research will address those needs. It is important to stress the unique characteristics of the proposed research, which may involve new methods, unique patients, a new intervention, an innovative study design, or other new approaches that distinguish the proposed research and warrant its conduct. This should all logically flow to a concise statement of the hypothesis addressed by the proposed research and be concluded by a statement of the significance of the anticipated results, whether they confirm or fail to confirm the hypothesis.

The importance of this section cannot be overstated as it provides the first impression of the investigators to reviewers, funding agencies, and others who may have to approve or support the proposed research. It offers these others a glimpse of the investigators' thought processes, their analytic and synthetic abilities, the thoroughness of their methods, and their objectivity. Finally, it should be written in a style that is suitable to both scientific and nonscientific lay persons who may be members of reviewing and approving bodies.

Table 6.1 Components of the background and rationale section of the clinical trial protocol
General description of the disease being treated/managed and why improved treatment/intervention/management is needed
Description of current treatment/management of the disease/condition and any problems with available therapy/management
Description of known properties of the proposed treatment/management intervention that justify its use
Brief summary of relevant preclinical and clinical experience with the proposed treatment/management intervention
Rationale for the current study and its role in the overall research program
Statement of the hypothesis and objectives of the proposed research
Brief description of the significance of the study

Statement of Hypothesis

The hypothesis (described in detail in Chap. 3) must be asserted early in the protocol. Therefore, we offer a few key points here on how it should be stated in the protocol.

The introductory section should logically lead to a statement of the hypothesis of the proposed research. This section is the key to the entire protocol as it describes the purpose of the trial and guides the rest of the protocol which, subsequently, is developed to provide the details about the methodology to be used in assessing the stated hypothesis. In other words, the hypothesis addresses the primary question by providing a tentative answer. The rest of the protocol describes how the hypothesis will be tested to provide a more definitive answer.

The section stating the hypothesis or hypotheses (there may be more than one in a given study) typically begins with a broad description of the overall goal of the research within the context of the investigators' overall research program. For example, the investigators may have an interest in seeking new treatments for a given disease, and the broad purpose of the proposed study is to test a new drug for treating that disease. The broad purpose in this case is an attempt to answer a new question. In some situations, the broad purpose might be to confirm previous preliminary work in the field in a larger or different patient population. The overall purpose might also be preliminary in nature as a "proof of concept" study to assess whether a hypothesized pathogenetic mechanism plays an
important enough role in a disease such that it might be a therapeutic target.

In addition to stating the broad programmatic goal of the proposed research, the statement of the hypothesis also presents a more specific broad objective of the research followed by some more detailed specific aims of the research. For example, a broad objective might be to test the hypothesis that a new drug improves symptoms in patients with the disease of interest to the overall research program. The specific aims might be to determine whether certain of those symptoms improve by a specified amount over a specified period of time without producing major side effects. The specific aims typically include major outcomes (primary endpoint[s]) that essentially drive the study design and other outcomes of lesser importance (secondary endpoints) that provide supportive information, as will be discussed in greater detail below.

The statement of hypothesis should be succinctly phrased and should provide a basis for the overall study design being employed to test it, i.e., to determine whether the hypothesis is supported by the study results. As noted in Chap. 3, the operational restatement of the hypothesis should, at minimum, clearly identify the patient population, intervention (if any), primary endpoint, key methods, duration, and anticipated outcomes.

Significance of the Research

The Introduction should conclude with some discussion, even if largely speculative, about the significance of the proposed research and its possible outcomes. If the hypothesis is confirmed, what does that mean in terms of the initial objectives? Is it conclusive or does it indicate a direction for future research? Results which are not confirmatory may lead to outright rejection of the hypothesis or may imply a need for modification of the research approach. Finally, some findings of the study may generate new hypotheses to be addressed by future research.

Overview of Study Design Summary

It is common practice and helpful to include an overall summary or synopsis of the study design before embarking on the detailed discussion of the various protocol components that will ensue. This summary is especially useful to certain reviewers, e.g., research administrators, funding agency officials, or institutional review board (IRB) members, who may not be scientists or may not require the level of detail of the full protocol in order to perform their specific review or critique functions. Thus, this section is typically very brief and to the point, as details of everything addressed here will be provided in the sections that follow. Table 6.2 shows the key components of this summary.

Table 6.2 Components of the study design summary
Statement of study type (e.g., controlled clinical trial)
Overview of study design
  Parallel-group, crossover
  Level of blinding (e.g., open-label, single-blind, double-blind)
  Method of treatment assignment (e.g., randomization, stratification)
Statement of treatment/intervention to be used
  Investigative drug or device
  Dosage of drugs or usage of devices
  Type of control (e.g., placebo, active drug, no treatment)
Description of study population
  Planned sample size
  Source of patients
  Number of centers
  Note any unique patient characteristics (age, race, sex) required
Description of the disease or condition being studied and any characteristics of that disease/condition that might affect patient eligibility or study outcomes
  Duration
  Etiology
  Severity
  Treatment
Sequence and duration of study visits
Description of study endpoints

The summary should include a statement of the nature of the study design (e.g., whether it is controlled or uncontrolled, parallel or crossover,
blinded or unblinded, and the number and nature of treatment arms). A brief description of any randomization methods should be provided (the details of which should be given in the Statistical Considerations section). It also should indicate the number of centers involved (single or multicenter), total number of patients to be enrolled, and the geographical area included, e.g., United States, North America, Europe, China, or a region of a country. The study population should be characterized, especially any unique demographic characteristics, e.g., women only, African-Americans only, or a certain age group. In addition to patient demographics, a brief description of their underlying disease condition being studied should be mentioned along with any important information about the current status, duration, severity, and treatment of the condition that might affect patient eligibility as well as outcomes. The active intervention being tested, along with any control interventions, should be briefly described. In addition, the frequency and duration of the intervention should be stated along with the total study duration, which may be longer than the intervention period. Finally, the primary study endpoint should be described along with a statement about how it will be assessed, when it will be assessed, and how often it will be assessed. Key secondary endpoints may be simply listed.

Endpoints

It is desirable to present the study endpoints early in the protocol, as these tend to drive the rest of the study design which is developed to measure an effect on those same endpoints. Thus, the sample size, methodology, duration of study, and analytical methods are all influenced by the choice of endpoints.

The endpoints are defined as primary and secondary. The primary endpoint is usually a single one, though it may include two endpoints, or may consist of a single composite endpoint made of two or more components. It is important to strictly limit the number of primary endpoints, as attempts to address multiple primary endpoints almost invariably lead to methodological inconsistencies and difficulties, resulting in a trial that fails to achieve any meaningful result in terms of primary endpoints. The primary endpoint(s) should be specifically defined, along with an explanation of how and when it will be measured. The secondary endpoints may be more numerous than the primary ones. They may represent additional measures of efficacy or safety but also may be included for other reasons such as exploration of mechanisms, particular safety concerns, and development of data for future research. The secondary endpoints also should be specifically defined, and the timing and methodology of their measurements should be briefly stated.

Factors considered in the selection of endpoints (especially the primary endpoints), such as relevance, practicality, acceptability, validation, and experience should be discussed. Clearly, it is necessary to establish that the endpoint chosen is relevant to the patients and conditions being studied; that is, it addresses real and significant needs such as improving symptoms, survival, diagnosis, or other outcomes. In addition, the endpoints should be practical, not only by addressing real needs but by utilizing readily applied methods of objective measurement. Furthermore, the methods used must be acceptable to both investigators and patients in terms of ease of application, safety, comfort, and cost. Optimally, they should be standard methods that are appropriate for the group under study to avoid the necessity of validating them, which usually must be done in separate preliminary studies [3]. Validation involves establishing (via the literature or the investigators' own work) that the proposed methods perform as intended in both the patients and conditions being studied. The investigators must indicate that they have sufficient experience with the successful use of the proposed methods.

Finally, it is critical that there be a consensus regarding study endpoints among all investigators, study administrators, and committees before the study starts in order to avoid disputes when the final results become available [4]. Table 6.3 lists guidelines for describing the key components
of primary study endpoints; secondary endpoints should follow this same sequence, though with less detail.

Table 6.3 Primary study endpoints
State the primary study endpoint(s)
Briefly mention the appropriateness and relevance of the endpoint
Describe the methods, timing, and frequency for assessing the endpoint
As needed, describe any special personnel performing the assessment (e.g., an unblinded assessor in a double-blind study)
Additional details about collecting endpoint data may need to include:
  Details about the use of subjects' diaries
  Any instructions on timing/conditions of assessment
  Details about unusual collection, storage, or analysis of laboratory samples
Provide information about the standardization and validation of the methods to be used for endpoint measurement
Describe the investigators' experience with the methods to be used
As needed, describe any training that might be required in using the methods for endpoint measurements

It should be noted that endpoints, as discussed above, refer primarily to clinical trials. Other kinds of studies, such as nonprospective observational studies that evaluate associations or distributional characteristics (e.g., prevalences) rather than intervention effects, may not employ endpoints as described above for their study objectives. Observational studies are discussed in greater detail in Chap. 4.

Study Population

This section is a detailed description of the patients/subjects to be included in the study and should provide a broad description of the study population, the source of patients, and a comprehensive listing of the inclusion (eligibility) and exclusion criteria for study participation.

Although the terms "patients" and/or "subjects" often are used interchangeably or may be established according to convention of the sponsoring group, we prefer to use the term "patients" for those individuals with a medical diagnosis or condition that is the target of the proposed research. We reserve the term "subjects" for normal healthy individuals that typically are included in some studies as the control population but who also may represent the primary population, e.g., in studies of the clinical pharmacological properties of a new drug before it is given to patients.

General Description of the Study Population

The study population should be described in terms of its general demographics, as well as the characteristics of the disease or condition being studied that the patients should have, along with the number of such patients that will be recruited and enrolled. The demographic characteristics typically describe the sex and age group of patients and, if appropriate, their race. If any of these characteristics are particularly restrictive, the reason for that restriction should also be given. For example, if one is studying only Asian females in their 20s, the reason for focusing on that population should be presented. In many instances, this may have been addressed in the introductory sections and need not be gone into in great detail in this section. The selection of these demographic characteristics (especially age) should not be taken lightly, as they may have important effects on adverse events and study outcomes [5]. In fact, it has been suggested that these kinds of patient characteristics may impact study results more than other features of the study design itself [6]. These characteristics will be expanded upon in greater detail as needed in the list of inclusion/exclusion criteria, as discussed below. The medical condition these patients must have in order to participate in the study also should be described in terms of its diagnostic criteria, duration, etiology
(if appropriate), treatment, present status, and severity. If normal subjects are included, then operational criteria for defining the normal subject also must be presented. Subjects may be required to be completely normal, with no significant past or current medical conditions, especially if these subjects constitute the primary study population. If normal subjects are included as a control group, they may only be required to be "relatively" normal, i.e., they should not have the same disease as the other patients in the study. These disease characteristics will be expanded upon in greater detail in the list of inclusion/exclusion criteria. This section also should include a description of the number of patients to be studied. Whereas a sample size estimate typically is included in the statistical analysis section (see below), that estimate usually refers to the number of patients needed to complete the study. Since, typically, some patients fail to complete a trial for several different reasons, it is necessary to try to estimate the total number of patients that will be recruited in order to achieve the number needed to complete the trial. Depending on the disease, study population, and treatment, patients may drop out of the trial for many reasons, including death and side effects of the treatment. In addition to these reasons, which will vary, some patients withdraw consent, move, or just never return for follow-up. The investigator must make every attempt to estimate the number of expected dropouts and decide what to do about them, i.e., to replace them or not in the study. It is critical to estimate the number of patients that need to be recruited not only in order to achieve the desired number of study completers but also to properly estimate resource needs, e.g., study medications, case report forms, and laboratory supplies.

Patient Sources

The techniques to be used for recruiting patients for the trial should be discussed in detail in this section. One should describe the number and location of investigative sites that will provide patients and/or participate in the trial. Not all sites may actually have study investigators; some may serve only as sources that will identify and refer patients to an investigator's site. Methods to be used for finding patients should be described. These may include various ways of publicizing the study, ranging from notices within the local institution to advertising in various media. These techniques and the individuals responsible for implementing them should be described. It also is necessary to describe how patients, once identified, will be further screened and by whom. A detailed description of the screening process to determine eligibility should be included, listing the specific initial parameters that will be used preliminarily to identify potential eligible patients. It is common practice to identify patients who meet initial screening criteria by history, then follow them for a brief interval to determine whether they subsequently meet all study eligibility criteria. For example, in a study of treatment of hypertension, patients initially may be screened on the basis of having a history of hypertension or of having a single reading of elevated blood pressure. Typically, such patients would be followed for a limited period to see if they, in fact, do currently have hypertension.

The location of screening procedures should be specified. This could involve screening of clinic records, emergency room logs, diagnostic laboratory reports, etc., depending on the population being sought. For example, in a study of patients with documented coronary artery disease, one might screen the cardiac catheterization and intensive care unit logs. The protocol should describe who will do this, when it will be done, and how it will be done. Unlike some sections of the protocol (e.g., endpoint definitions, patient inclusion/exclusion criteria, and analytic methods to be used), the screening procedures are not "carved in stone" and may be modified as needed.

For a more detailed description of recruiting techniques and the many issues that may become involved, the reader should consult standard references and the medical literature [1, 7–11].
Inclusion/Exclusion Criteria

A list of all inclusion and exclusion criteria to be used in determining eligibility of patients for the trial must include a detailed description of all the requirements a patient or subject must meet to be eligible for enrollment in the trial, along with a detailed description of all variables that would render the patient ineligible for enrollment. Each patient enrolled must satisfy all of the inclusion criteria and none of the exclusion criteria, without exception, in order to be enrolled in the trial.

It is important that the list be very detailed, leaving no ambiguities for the study personnel who must use the list to screen for potential patients. Thus, specific criteria, along with any relevant methods needed to apply them, should be provided for each item on the inclusion/exclusion list. For example, eligibility for inclusion in a trial of antihypertensive treatment might require that the patient have a systolic blood pressure above 140 mmHg or a diastolic pressure above 90 mmHg, as determined by the average of 3 readings taken 5 min apart with the patient seated and using a standard sphygmomanometer. In addition, if the study is to include untreated patients, the exclusion criteria might state that "A patient may not be included if he/she has received any antihypertensive drugs within the past 6 months, specifically any diuretics, beta-blockers, calcium blockers, ACE-inhibitors, angiotensin-receptor blockers, or alpha-blockers. For other agents with possible antihypertensive activity, the investigator must obtain approval of the study chairperson before enrolling the patient." It is extremely important that this list be as comprehensive and detailed as possible since it will serve as a checklist for many of the personnel involved in the study, including those doing the screening, designing the case report forms, developing the database, analyzing the results, and auditing the study's conduct.

Since the inclusion/exclusion criteria are critical to defining the study population (whose characteristics, in turn, may greatly impact the study results), it is mandatory that these criteria be carefully thought through and decided upon prospectively. It is highly undesirable to change these in any way after the study has started as such post hoc changes may introduce bias, thereby impacting the results and their interpretation, and raise doubts about the validity of the study in adequately testing the original hypothesis. Occasionally, circumstances can arise that may mandate a change in patient eligibility criteria, but these are rare and usually involve ethical issues. For example, an effective new treatment may become available for some or all of the patients in the trial, making it potentially unacceptable to leave them on a placebo or on an unproven treatment. It generally is unacceptable to change eligibility criteria simply because the investigators have found it extremely difficult to recruit patients meeting the current criteria. In such instances, it may be wiser to terminate the study and start a new one with different eligibility criteria. Obviously, such decisions have important consequences and should never be taken lightly.

Table 6.4 shows some items to be considered when constructing an inclusion/exclusion criteria list. The first requirement is written informed patient consent. Without this, it would not be permissible to proceed to the subsequent criteria which require obtaining confidential medical historical information from the patient. The criteria list follows a structured progression from the general demographic characteristics to those that are more related to a specific disease, and concludes with criteria that relate to characteristics that might confound the conduct or outcomes of the study or that might impair the patient's ability to complete the study. The exclusion criteria often mirror the inclusion criteria by stating the converse of the corresponding inclusion criteria, thereby providing a means of double checking the patient's eligibility. For a more detailed discussion of how to construct an inclusion/exclusion list, the reader should consult standard references [1].
Table 6.4 Patient eligibility considerations

Category: Patient characteristics
  Inclusion criteria:
    Provision of written informed consent
    Demographics (age, sex, race)
    Body weight
    Pregnancy or childbearing potential
    Behaviors (alcohol, smoking, activity, diet)
    Mental status
  Exclusion criteria:
    Failure to provide written informed consent
    Hypersensitivity/intolerance to study interventions
    Medical history (current or preexisting conditions and treatments)
    Allergies/food intolerance
    Occupational risk/hazard
    Breast feeding

Category: Characteristics of disease being studied
  Inclusion criteria:
    Diagnostic criteria
    Duration
    Etiology
    Status/severity
    Treatment (required, permitted)
  Exclusion criteria:
    Nonpermitted treatment
    Status/severity that might bias results
    Confounding concomitant conditions/complications

Category: Screening examinations
  Inclusion criteria:
    Within limits specified
    Meets all run-in requirements (compliance, stability)
  Exclusion criteria:
    Outside of limits specified
    Fails to meet run-in criteria

Category: Other factors
  Inclusion criteria:
    Cooperative attitude
    Occupation
    Availability for all study requirements for full duration
  Exclusion criteria:
    Inability to perform study requirements/procedures
    Lack of availability
    Increased risk of lack of cooperation
    Current/recent participation in another clinical trial
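
The rule that each patient must satisfy all inclusion criteria and none of the exclusion criteria can be illustrated with a minimal sketch based on the antihypertensive example given in the text (mean of 3 seated readings, systolic above 140 mmHg or diastolic above 90 mmHg, no antihypertensive drug within the past 6 months). The class, field, and function names are hypothetical, and only these few criteria are modeled; a real protocol's checklist would be far more extensive.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    """Hypothetical screening record for one potential patient."""
    consent_given: bool
    # Three (systolic, diastolic) readings in mmHg, taken seated, 5 min apart.
    seated_readings_mmhg: List[Tuple[float, float]]
    # Months since last antihypertensive drug; None means never treated.
    months_since_antihypertensive: Optional[float]


def screen(candidate: Candidate) -> Tuple[bool, List[str]]:
    """Return (eligible, reasons for ineligibility)."""
    # Consent comes first: without it, no further information may be collected.
    if not candidate.consent_given:
        return False, ["written informed consent not provided"]

    reasons: List[str] = []

    # Inclusion criterion: mean blood pressure above threshold.
    readings = candidate.seated_readings_mmhg
    if len(readings) != 3:
        reasons.append("three seated readings are required")
    else:
        mean_systolic = mean(s for s, _ in readings)
        mean_diastolic = mean(d for _, d in readings)
        if not (mean_systolic > 140 or mean_diastolic > 90):
            reasons.append("mean blood pressure does not exceed 140/90 mmHg")

    # Exclusion criterion: recent antihypertensive treatment.
    recent = candidate.months_since_antihypertensive
    if recent is not None and recent < 6:
        reasons.append("antihypertensive drug received within the past 6 months")

    return (not reasons), reasons
```

Returning the list of reasons, rather than a bare yes/no, mirrors the use of the criteria list as a checklist by screening staff and auditors.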
Implementation of the Intervention, Study Visits, and Observations

Once the study endpoints and population have been defined, one must provide the detailed methods by which the actual study data will be generated. This section provides all interested parties with a precise description of how the patients will be entered into the study, how they will be started and followed while on the study intervention, how and when required observations will be made, and when and how the patient's participation in the study will be terminated.

Study Initiation

After eligible patients have been defined, screened, identified, and consented, they are ready to be enrolled in the study and begin their active participation. Depending on the specific study design and intervention, patients may be started immediately in the active phase of the study or may be observed during a preliminary phase (run-in period), before entering the active phase of the study.

Run-In Periods

It is common to have patients enter a run-in period between the time that they qualify for a clinical trial and the time that they begin active involvement, i.e., are started on the actual study intervention. There are several reasons for using a run-in period. Common reasons include establishing final patient eligibility, demonstrating stability, and assessing compliance. Not all inclusion/exclusion criteria may be completely available for assessment at the time of screening, especially if there is a requirement for recent laboratory or diagnostic information. A run-in period just before starting a new drug or a special procedure may be used to allow for obtaining any assessments that must be current (e.g., an echocardiogram to document presence of abnormal cardiac function) to confirm that the patient
actually has the medical condition required for study participation. A run-in period also may be used to demonstrate that a patient has the required status of the condition being studied. For example, it may be required that a patient have stable symptoms while taking all standard treatment for the condition in order to minimize difficulty in interpreting changes in the patient's condition after starting active treatment. If the patient was not stable or if other treatments were started after the study intervention, it would be extremely difficult to assess the cause of a change in the patient's condition. Another common reason for using a run-in is to assess the tolerability of the study intervention. A patient may have difficulty complying with an intervention if it produces significant side effects or is difficult to administer. Furthermore, patient compliance may be influenced by other patient conditions or behaviors, e.g., substance abuse or alcoholism. A run-in period may be useful to assess the patient's likelihood of complying with and completing all study requirements.

Treatment during run-in periods may vary. If the purpose is only to acquire final inclusion/exclusion information, no treatment may be needed. Obviously, if the purpose is to assess stability and/or compliance with an intervention such as a study drug, it would be necessary that it be given according to the same regimen that would be used in the active phase of the study. This phase usually involves either active study intervention in all patients if its purpose is primarily to assess tolerability or placebo in all patients to assess patient compliance for reasons other than tolerability of the intervention. Clearly, the patient is kept blinded to treatment if the active phase is to be double-blinded.

Finally, the duration of the run-in periods should be as short as possible, typically not more than 2–3 weeks. In general, less time is needed to obtain laboratory tests, and more time would be needed to assess tolerability or compliance. The problem with excessively long run-in periods is that patients may change during this time. In cases where a run-in period has had to be extended, it is common practice to terminate that patient from the study at that point and restart him/her as a new patient in the screening phase. Another potential risk and criticism of run-in periods is that they may introduce bias by selecting the better responders to the active study intervention [12].

Start of Study Treatment/Intervention

Once all inclusion criteria are satisfied and no exclusion criteria are met, whether at the end of screening or after a run-in period, the patient is ready to initiate study-mandated activities. At this time, the patient will be assigned his/her study treatment or intervention. If the study is not controlled, the patient will be started on the study intervention. If the study is controlled, the patient is randomized to his/her study treatment. The method of randomization, e.g., consulting a list, opening an envelope, or contacting a central randomization center, should be briefly described here. If the intervention being evaluated in the trial includes pharmacological therapy, the study drug may also be dispensed at this time or arrangements may be made for procuring it. The patient should be given any applicable instructions at this time and scheduled for the next clinic visit. Typically, the details of the randomization technique, and the administration and management of the intervention, respectively, are provided in the statistical and administrative sections of the protocol.

Schedule of Visits and Observations

The protocol must provide a schedule of patient visits with details about when these will be conducted and what information will be collected at each visit. This section is used and closely adhered to by study personnel, much as a recipe is followed by a cook. Study visits typically consist of a baseline or study initiation visit, follow-up interim visits, a final on-treatment study visit, and a post-study follow-up visit. It is important to specify the timing of these visits, with a window of plus or minus a small number of days, if possible, to allow the patient some
flexibility in scheduling appointments. Typically, the time is set relative to randomization or baseline, i.e., at some time ± a time window following the date of randomization or the baseline visit. The observations recorded at each visit often are variable, with fewer items observed at interim visits.

Baseline Visit
The baseline visit is performed at or very close to the time when the patients are randomized to study treatment/intervention, whether or not that treatment/intervention has actually been instituted. This is a critical visit as all observations recorded at this time will be the basis for comparison with all observations made while on study treatment. Thus, a complete medical history and physical examination usually are performed, along with laboratory tests. All concomitant medications are recorded with details about dose and duration of administration. In addition to this general medical examination, there is usually information collected that is specific to the status of the medical condition being studied, e.g., its duration, severity, history of complications, current symptoms and status, and current treatment. Any special tests, assessments, or procedures relating to study endpoints are carried out at this visit or are scheduled to be obtained very soon after this visit, if not already done. One cannot overemphasize the importance of all baseline determinations. They must be thorough and comprehensive, as any medical and/or laboratory findings that appear later must be ascribed in some way to study participation if they were not present at baseline. In trials evaluating experimental medications, the baseline visit is concluded by dispensing any study drugs or other required materials to the patient, scheduling the next visit, and arranging for any procedures or tests needed for the next visit.

Interim Visits
Following the baseline visit, the patient is seen at intervals specified in the protocol to occur at some set time, e.g., every 3 months ± 1 week from the date of the baseline visit. These interim visits primarily are intended to monitor the patient's progress and his/her tolerability of the study intervention. A brief medical history and physical examination are carried out, with the emphasis on looking for any adverse events or findings. Information on one or more study endpoints may be collected, but not necessarily the primary endpoint, especially if that involves a special procedure, e.g., cardiac catheterization, which might be done only at the end of the study or once during an interim visit. In trials evaluating medications, patient compliance usually is assessed, typically by having the patient bring any unused study medications with him/her and calculating the percentage of pills taken relative to those prescribed. The interim visit also is concluded by dispensing any study drugs or other required materials to the patient, scheduling the next visit, and arranging for any procedures or tests needed for the next visit.

Of course, patients may develop complications and may need to be seen between scheduled visits. All clinical trials must include provisions for patients to be seen by physicians who may be associated with the study in order to deal with clinical necessities whether or not a visit is specifically related to a protocol-based assessment. The reasons for, and findings obtained during, any unscheduled visits must be recorded as study data on appropriate forms.

Final Visit
The final visit is the last one during which the patient is still receiving the study intervention. Its observations are essentially the same as those obtained at the baseline visit and are just as critical since they represent the study results and outcomes that will be compared to those from the baseline visit. In addition, the same kind of information collected at the interim visits is obtained to cover the interval since the preceding interim visit.

Whereas a final visit is obtained routinely in all patients at the end of the study, it also may be necessary to perform a final visit if a patient terminates his/her study participation prematurely, as might happen for intolerable side effects or other
Table 6.5 Template schedule of study events in protocol no. XXXX

Screening Baseline Treatment period Follow-up
Evaluation (Day xx) Day 0 Day # Day # Day # Day # Day #
Informed consent X
Inclusion/exclusion criteria X X
Demographics X
Medical history X
Full physical examination X X X
Partial physical examination X X X X
Laboratory tests X X X
Special tests/procedures X X X
Randomization X
Vital signs and weight X X X X X X X
Study intervention administration X X X X
Adverse event assessment X X X X X X X
Concomitant medication assessment X X X X X X
Terminate study drug X
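Two of the calculations described in this section, the scheduling of each visit at a target date plus or minus a small window relative to baseline and the pill-count estimate of compliance, can be sketched in a few lines. The sketch below is illustrative only: the visit names, day offsets, 7-day window, and function names are assumptions made for the example and are not values from Table 6.5 or from any real protocol.

```python
from datetime import date, timedelta

# Hypothetical schedule: visit name -> (target day after baseline, window in days).
SCHEDULE = {
    "Interim 1": (90, 7),
    "Interim 2": (180, 7),
    "Final": (270, 7),
    "Follow-up": (300, 7),
}


def visit_windows(baseline: date, schedule=SCHEDULE):
    """Return {visit: (earliest, target, latest)} relative to the baseline date."""
    windows = {}
    for name, (offset, window) in schedule.items():
        target = baseline + timedelta(days=offset)
        windows[name] = (
            target - timedelta(days=window),
            target,
            target + timedelta(days=window),
        )
    return windows


def in_window(actual: date, window) -> bool:
    """True if the actual visit date falls inside the permitted window."""
    earliest, _, latest = window
    return earliest <= actual <= latest


def pill_compliance(dispensed: int, returned: int, prescribed: int) -> float:
    """Percentage of prescribed pills taken, inferred from a pill count."""
    if prescribed <= 0:
        raise ValueError("prescribed must be positive")
    taken = dispensed - returned
    return 100.0 * taken / prescribed


if __name__ == "__main__":
    windows = visit_windows(date(2024, 1, 1))
    for visit, (earliest, target, latest) in windows.items():
        print(f"{visit}: target {target}, allowed {earliest} to {latest}")
    print(f"Compliance: {pill_compliance(90, 18, 90):.1f}%")
```

Anchoring every window to the baseline date, rather than to the previous visit, keeps a late visit from shifting all later ones. A pill count only approximates compliance, since returned tablets do not prove that the missing ones were taken.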
reasons. In such cases, every attempt must be made to have the patient return and perform all the procedures and observations required at a regularly scheduled final visit. Without this, that patient's entire dataset may be unusable, and the patient may have to be excluded from the study analysis. In most instances, final visit data obtained even prematurely may still be analyzable and allow the patient to be included in the results.

At the end of the final visit, study drug/intervention is terminated, and the patient is scheduled for a study follow-up visit.

Post-study Follow-Up Visit
Often by regulatory requirement, but more in the interests of good clinical practice, patients should be seen at least once after completing their study participation to ensure that they are not experiencing any sequelae that might be attributed to their study involvement. Such visits usually are scheduled at 1 week to 1 month after the final on-treatment study visit, depending on the possible duration of effects of the study intervention. (As used here, the term "on treatment" means the patient is still receiving a study-mandated intervention, regardless of whether he/she is receiving active therapy or an inactive control substance [or other control condition].) In some instances, especially by regulatory requirements, it may be mandatory to attribute any side effects or complications occurring during this period to the study intervention. These post-study visits also are of value in helping to document patient status and to protect all study personnel and institutions in the event of any future allegations stemming from the patient's study involvement. It is strongly suggested that a flow chart of all scheduled visits and related procedures be included, a template of which is shown in Table 6.5.

Data Management

A clinical trial, along with its data generation and acquisition, is driven by the thoroughness and objectivity of the research protocol. The research data to be generated, collected, processed, and stored in the clinical database must support the objectives of the study, as specified in the protocol. This, in turn, relies on designing data management processes that correctly capture the required research data. All data generated by the trial must be captured and managed to ultimately yield the results of the trial. Data management has been enhanced dramatically in recent years as a result of technological advancements including computerization of databases, bioinformatics, and Internet applications to facilitate
acquisition and processing of data [13–15]. As a consequence, modern data management processes involve specialized personnel and methods which are discussed in detail in Chap. 7. For all these processes to be properly carried out, it is necessary that a detailed, comprehensive, and unambiguous protocol be developed, as the protocol drives the data management processes which tend to follow the protocol in a chronological fashion. Obviously, the tools used for data collection will be developed in accordance with protocol specifications. Ideally, data management processes should be developed in advance of data collection because post hoc changes potentially introduce a risk of bias, threatening the validity and credibility of the results, as noted above.

The data management plan closely follows the structure and sequence of the protocol. A well-written data management section will provide detailed descriptions of each data item to be collected, how it will be collected, and when it will be collected. The data management group must work very closely with the team that is preparing the actual protocol to help ensure that all the data described are readily obtainable, complete, unambiguous, objective, and easily programmable and quantifiable. Furthermore, it must be ascertained that all of the data generation methods are generally accepted and that the research team is adequately experienced in using these methods so as to help ensure reliability and validity of the data obtained.

Whereas trials generally try to limit the amount of information collected to that which is necessary to obtain valid results, it is common to collect additional information, especially at baseline, because this is the last time one can make observations before the effects of the trial interventions come into play. Just being in a clinical trial may affect patient outcomes because of the level and frequency of care provided (see also Chap. 5). It is critical that every attempt be made to capture all the required data at the times specified by the protocol, as incomplete, inaccurate, and/or missing data can undermine the reliability and credibility of results. The completeness, accuracy, and timeliness of data collection are key indicators of the quality of conduct of the study; as such, they are commonly audited after study completion to help ascertain the validity and reliability of the study conclusions.

Safety Monitoring Procedures

A complete protocol should describe all procedures that will be in place to ensure and assess the safety of study participants. Whereas much of this information already is included in different parts of the protocol, e.g., on the schedule of visits and procedures, it is recommended that a specific section be devoted to summarizing all safety monitoring procedures. It should summarize how often patients will be seen, that an interim history and physical examination will be performed, and that laboratory tests will be obtained. It is important to point out any special visits, examinations, tests, or procedures that will be conducted specifically to look for known side effects of the treatment. For example, liver function tests would be obtained in a trial of a new drug suspected of possibly producing liver toxicity, or the eyes would be examined often in a trial of an intervention that could potentially be associated with cataract formation.

In addition to describing what will be done and how often, it is important to specify who is responsible for carrying out these procedures and what will be done with the information in case something is found, i.e., instructing the investigators whom to contact, how to establish contact, and the timeframe for making contact. It is important that all study personnel know what constitutes an adverse event or serious adverse event. These are not simply clinical impressions but are specifically defined by regulations. These regulations also establish what information about the adverse event must be collected (start date, duration, severity, drug dose, concomitant drugs, action taken, outcomes, etc.) and who must be notified within the specified time frame (other investigators, IRBs, study administrators, regulatory agencies, etc.). Instruction also should be provided to the investigators regarding possible discontinuation of the study drug, premature termination of
the patient's study participation, unblinding of any study medication, etc.

It is critical that all study personnel understand that an adverse event is any undesirable sign, symptom, or medical condition that occurs after starting study participation regardless of its relationship to the study intervention, i.e., even if a cause other than the study intervention is present. Any condition that was present before starting study participation must be considered an adverse event if it worsened. Furthermore, the seriousness of an adverse event is not synonymous with its severity or potential outcomes. An adverse event is considered serious if it (1) is fatal or life-threatening, (2) requires or prolongs hospitalization, (3) is significantly or permanently disabling or incapacitating, (4) constitutes a congenital anomaly or birth defect, or (5) requires medical/surgical intervention to prevent any one of the preceding. There is no mention of severity or potential seriousness. Thus, a severe symptom or abnormal laboratory finding that does not meet one of these criteria is not considered a serious adverse event.

Above all, it is critical that adverse events be looked for, recognized, recorded, and reported as quickly as possible to the appropriate study-governing personnel to allow any necessary actions to be taken to safeguard all other study participants.

Ethical Considerations (See Also Chap. 12)

The protocol must state that all patients will provide informed consent prior to being enrolled in the study. The consent form must be written in language the patient can fully understand and must contain certain elements. These include a description of the study; what is expected of the patient; what risks are involved with any tests, procedures, and treatments; what alternative treatments are available; and assurance that the patient will be given the best available treatment for his/her condition whether or not he/she chooses to participate initially or to terminate prematurely. The patient should also be informed that his/her study records may be reviewed by the sponsor, auditors, or other regulatory authorities and that his/her study information may be used in publications. In any of these instances, the patient must be assured that his/her identity will be kept strictly confidential. The process of obtaining informed consent offers an excellent opportunity to establish good communications and rapport between the patient and the investigators and, as such, may impact the study outcome [21–23]. It is important to recognize that consent for study participation contains important elements that distinguish it from consent to a procedure, be it a routine clinical procedure or one required as part of the study; thus, consent to participate in a research study should be obtained separately from other permissions obtained in caring for a patient [24]. The informed consent form itself is considered a part of the protocol. The protocol also should contain a statement that IRB approval will be obtained and that the investigators and all study personnel will obtain all periodic re-approvals and comply with all other requirements of that review board.

In addition, the protocol often includes a description of the investigators' responsibilities regarding patient safety. This description typically points out the research policies, regulations, and requirements of governmental, international, institutional, and sponsoring bodies. The investigators are required to comply with all of these. In addition, the investigators agree to accept full responsibility for protecting the rights, safety, and welfare of patients under their care during the study. The principles of good clinical practice mandate that the investigators provide the best available care, themselves or by appropriate referral, for any medically related problems that arise during the study, regardless of their relationship to the study itself.

Statistical Considerations

All protocols should contain a section that describes trial-specific statistical evaluation plans. For randomized controlled clinical trials, such considerations typically include (but are not limited to): the specific nature of the study design
and related issues, the specifics of the randomization procedure and rationale employed, justification of sample size and associated power (see also Chap. 11), the statistical analysis planned for assessing primary and secondary outcome measures, and a statement of the null hypothesis for primary efficacy comparison. When appropriate (e.g., a randomized controlled trial evaluating high-risk patients), this section also may articulate statistically based stopping rules for premature termination of the study (e.g., early evidence of efficacy in the absence of safety problems).

Protocol Implementation and Study Conduct

Recent observations suggest that the conduct of certain types of clinical trials has decreased, raising concerns about adequacy of planning and implementation. For example, late phase clinical trials represented about 20% of all clinical trials in 1994 whereas in 2008, they accounted for only 4.4% of all clinical trials [16]. Possible reasons that may contribute to this apparent decline include inadequate organization and infrastructure, lack of coordinated research team effort, and insufficient training [16–18]. No matter how well a protocol is written, it is of little value if it cannot be implemented and carried out to completion.

Study Organization, Structure, and Administration

In addition to describing how the study will be done, protocols typically address issues which help safeguard the well-being of patients during their study participation, while ensuring the integrity and proper conduct of the study. Many of the topics discussed in this section are addressed at great length in other publications and reference materials which the reader should consult [1, 9]. We will focus here on some of these topics, especially those that are typically required for inclusion in a protocol by sponsoring institutions, funding agencies, and regulatory authorities.

Studies typically encounter unforeseen problems and questions during their conduct. In addition, some potential issues can be foreseen prior to study initiation; these need to be prospectively addressed so that solutions can be decided quickly according to plan should they, indeed, arise during the course of the study. Examples of such issues include endpoint criteria, rules for early termination of the study, need for protocol changes, etc. It is important for any study, and is mandatory for multicenter studies, that the protocol identify those individuals responsible for making decisions about the study's conduct. Thus, the protocol should specify the individuals and committees who are responsible for study leadership and charged with making the kinds of decisions mentioned above.

Multicenter studies should have a chairperson who is empowered to make and/or delegate day-to-day decisions regarding such things as deciding if a patient satisfies all inclusion/exclusion criteria or if a patient or center has violated protocol requirements, etc. In addition, there may be a steering or executive committee to address broader issues, e.g., protocol changes, and to address recommendations of any subcommittees. The subcommittees may typically include an independent data safety and monitoring board (DSMB) that periodically reviews study data to assess the need for possible premature termination of the study if a clear benefit or risk appears that makes it unethical to continue the study. Another subcommittee might analyze study endpoint outcomes, e.g., cause of death or reason for hospital admission. It is mandatory that subcommittees and committees prospectively define the rules and criteria to be used in arriving at any decisions they make and that information required to satisfy these rules be included as a part of the protocol. Subcommittees and other committees generally make recommendations to the steering or executive committee, which has responsibility for making final decisions based on those recommendations.

In summary, the leadership of the study is responsible for the general satisfactory conduct of the study in all of its aspects. This includes resource recruitment and allocation, providing
any training required, ensuring timeliness of patient recruitment, overseeing data management, and reporting of the results.

Resource Allocation and Management

Key resources include funds, manpower, and supplies. Funding may be available prior to study initiation in some settings with predetermined budgets, e.g., industry. In other settings, funding must be applied for, and its procurement often depends heavily on the quality of the research proposal and/or protocol. Once funds are secured, the study leadership must oversee their allocation, accountability, and continuing availability, as well as identify the individuals who will be responsible for these matters.

The success of the study also will depend on the availability of sufficient and qualified personnel to carry out all the required functions. For certain functions, especially those that might only be required from time to time to address specific issues that might arise, it may be preferable to use consultants. For example, if patient recruitment lags, the advice of persons specialized in recruitment techniques might be sought. It is critical that all personnel be qualified to carry out whatever responsibilities they are assigned and that the study leadership provides the proper training needed to ensure their qualifications.

Availability of all supplies needed to carry out the study is critical and may be a rate-limiting factor in starting and completing the study in a timely fashion. Obviously, the study cannot start without materials for gathering and reporting data, e.g., case report forms (see also Chap. 7). Similarly, study drugs and/or devices must be available and ready for use, i.e., properly coded and allocated for a randomized trial. Any supplies for laboratory tests and study procedures also must be available. Not only is it important that all supplies be available to start the study, but it also is necessary to assure that they will continue to be available throughout the study until its conclusion. A key responsibility of study leadership is to oversee the individuals responsible for ensuring timely availability of supplies. In addition, study leaders must be readily available to these same individuals to try to resolve any supply problems that might arise.

The protocol should contain information about study materials the patient will need, including study drugs, laboratory kits, questionnaires, diaries, etc. Information should be provided on who is responsible for procuring and dispensing these materials, how and where they will be procured, how they will be supplied (kits, bottles, etc.), how they will be labeled to correctly identify content and the study patient, and instructions for their use. There also should be a description of how the supplies will be stored. Finally, there must be an accurate inventory of all materials, with dates of receipt, dispensing, names of recipients, etc. There also must be a procedure for returning study materials and recording their receipt. All of these records are mandatory for accountability of supplies and are subject to strict regulations, especially when any controlled substances are involved. This section is critical to the study sponsor, who generally provides the materials and must be able to show that adequate instructions for their correct handling were provided to investigators.

Recruitment of Study Participants

The recruitment of eligible patients/subjects into the study in a timely fashion is one of the key rate-limiting processes that has a major impact on study results. Failure to recruit patients in a timely manner may have serious consequences by precipitating retrospective protocol changes, such as relaxing eligibility/exclusion requirements or modifying procedures and observations. Any such changes can significantly affect the study and potentially undermine its original intent and capacity to properly test the study hypothesis, thus yielding results that may not be valid and conclusive relative to the original intent. Failure to recruit patients quickly enough in sufficient numbers can lead to early termination of the study itself as well as discontinuation of its funding, thereby jeopardizing the power of the trial to
achieve its projected sample size needed to achieve statistically conclusive results.

Techniques for recruiting study subjects vary considerably and represent a specialized topic in and of itself [1, 19, 20] that is beyond the scope of this chapter. The study leadership must identify the individuals responsible for recruitment and provide them with adequate resources and training for whatever recruitment techniques are employed. The specific techniques to be used should be spelled out in detail in the protocol.

Numerous recruitment techniques are available and include screening subjects from (1) the local research site (office, clinic, hospital, etc.), (2) collaborating local sites, and (3) collaborating regional, national, and/or international sites. Within each of these sites, local areas of interest must be identified, e.g., office, laboratory, and emergency room. Screening-type trials seeking large or broad populations of subjects may establish recruitment centers in churches, schools, supermarkets, shopping centers, commercial establishments, etc., to identify appropriate patients. In addition, advertising through various media should be utilized to reach potentially eligible participants. Other sources are colleagues, bulletin board notices, direct mailings, and telephone screening [1]. The final decision regarding recruiting methods will depend on the overall number and kinds of patients/subjects needed. Importantly, the duration of active recruiting efforts commonly is specified in a protocol. These timelines should be closely monitored and adjusted as needed by the study leadership.

Study Monitoring

Implementation of the protocol should be carefully monitored. The persons assigned this task should be identified and adequately trained in the monitoring procedures to be used. These individuals should identify the personnel responsible for overseeing study conduct at the various centers and should ensure that all personnel at each center are well aware of and able to properly carry out all the investigator's responsibilities. It is important to describe the procedures that these individuals will follow to ensure (1) adherence to the protocol, (2) provision of complete and accurate data, (3) response to queries, and (4) compliance with auditing. Instructions on record keeping and record retention should also be provided.

Monitoring techniques vary and may include simple periodic telephone or e-mail contact with mailing or electronic submission of study documents between investigator sites and the monitors. Monitors may visit sites on a periodic basis to retrieve and deliver study materials as well as directly observe the site's performance. For a more detailed description of monitoring methods and procedures, the reader should consult standard references on the subject [1].

Data Acquisition and Processing

The principles of data acquisition and management are described in detail in Chap. 7. From the study conduct perspective, it is important that adequate numbers of qualified personnel are available for data processing and management. Furthermore, these individuals must have expertise or be trained in the required methods to be used for acquiring and processing data. Similarly, study leadership must ensure that all appropriate materials, especially equipment, hardware and software, are available to properly process the data.

End of Study Procedures

Once all study visits have been completed in all subjects, the study itself can be terminated. Procedures for terminating the study may include a final monitoring visit to retrieve all outstanding study materials such as case report forms and study supplies. Data processing procedures, e.g., quality control and source document verification, should be initiated and completed. Record retention procedures should be implemented.

The final results should be tabulated, analyzed, and presented in a final study report to be submitted as required to funding agencies, IRBs,
regulatory agencies, etc. Most importantly, it is strongly recommended that all final results be published. Only in this manner can the study be critically analyzed by all those with a stake in its outcome as well as be replicated if deemed desirable.

Overview of the Interventional Clinical Trial

Most of what is discussed above has derived from, and has been best defined by, interventional clinical trials, which represent the culmination of clinical research and merit special consideration because of their impact on clinical research methodology. Interventional clinical trials are designed and conducted for the primary purpose of testing a treatment or management strategy in patients with a specific disease. Such trials typically are sponsored by large research organizations, such as the United States National Institutes of Health (NIH), or by private organizations such as pharmaceutical companies or medical device manufacturers.
An interventional clinical trial is a formal experiment designed to elucidate and evaluate the relative efficacy and safety of different treatments or management strategies for patients with a specific medical condition [25]. Healthy volunteers often are used in the early phases of assessment of a new therapy primarily to assure sufficient safety of an intervention before applying it to patients with the disease targeted by the intervention. Such studies typically involve establishing the proper dosing and/or administration of the intervention along with demonstrating that the intervention is tolerated well enough to permit further studies in patients. However, healthy human volunteers provide only indirect evidence of effects on patients. Therefore, ultimately, clinical trials of putative interventions must be conducted among individuals with disease. The results obtained from this limited sample then are used to make inferences about how treatment can be applied in the diseased population in the future [25]. Most commonly, a clinical trial takes the form of a prospective study comparing the effect of an intervention, usually a new drug or device, with a comparator or control (i.e., a placebo or a treatment already available) [26]. The fundamental design of the clinical trial can be widely applied to many different disciplines or areas of clinical research. (For a comprehensive discussion of contemporary clinical trial methodology, the reader is referred to the seminal writings of Spilker [1].) Clinical trials can be employed to evaluate many forms of therapy, including surgical interventions and radiation therapy. In addition, clinical trials can be used to test other nontherapeutic approaches to patient care, such as diagnostic tests or procedures [27]. Thus, the NIH classifies clinical trials into five categories according to their purpose, i.e., treatment trials, prevention trials, diagnostic trials, screening trials, and health-related quality of life trials. These categories reflect the way in which clinical trials fit within the entirety of the clinical research spectrum, as they can be instrumental in assisting clinical efforts to improve not only the treatment of a particular disease (as is most often the case) but also its prevention and detection [27].
The clinical trial is the most widespread application of experimental study design in humans [26]. Indeed, it is the adherence of the trial to the principles of scientific experimentation, perhaps more so than a reliance on therapeutic comparison, that most aptly validates the results of the trial. In this vein, a number of general characteristics of the scientific method play a substantial role in the modern conduct of clinical trials including, most notably, the control of extraneous factors that might influence outcome variability, selection bias, or interpretation of results [28]. For example, an important feature of the randomized controlled trial, which is widely accepted as the primary standard of evidence when interventions are evaluated, is the requirement to randomly allocate patients to alternative interventions, strengthening the internal validity of the study (see also Chap. 5).
In any clinical trial, regardless of which interventions or tests are administered, investigators
128 J.A. Franciosa
must carefully follow the progress of recruited subjects, collecting data for a prespecified time interval according to the requirements of the study protocol; subsequently, statistical analyses are performed that might yield valuable conclusions relevant to predefined research objectives. Some studies might involve more tests or medical visits than are clinically necessary, while others interfere only minimally with normal patient care practices. In general, the details of the procedure, including the specific conceptual plans for observation, data capture, follow-up, and analysis, depend on what type of clinical trial is being conducted. Due to their broadening scope of applicability since the mid-1900s, clinical trials currently play a paramount role in examining the impact of interventions among human subjects. What has further cemented the clinical trial as a valuable tool for the clinical investigator has been the recognition by health-care professionals that, if insights into disease prevention and improvement to patient care are to be gained, experimental methodology should be followed as rigorously in a clinical setting as it is in basic science [28]. Proper preparation of the research protocol, therefore, is essential to the successful and ethical application of the clinical trial to modern clinical research.

Conclusions

The study protocol is the most important and critical document available to the investigator and is central to the conduct of any study. It provides the necessary guidance and serves as the main reference for all study personnel, while also providing for the welfare and safety of all study participants. It must be detailed and comprehensive and must be prospectively defined. Whereas it is not possible to foresee all things that might occur during the course of the study, it behooves the investigators to plan for all foreseeable developments in the protocol. Virtually anything that must be done post hoc has the potential to introduce bias and undermine the credibility and validity of the study. The degree to which the investigators can achieve these requirements will serve as testimony to their thoughtfulness, attention to detail, and overall quality of work. A high-quality protocol should allow others who follow it rigorously to obtain the same results. Most importantly, a high-quality protocol will likely lead to a valid and credible conclusion, whether it confirms or refutes the hypothesis, thereby reducing the likelihood of needing a costly repeat study because of a faulty protocol.

Take-Home Points

A protocol is the most critical document in a research study.
It plays a central role in the conduct of a study by describing how a hypothesis will be tested.
It provides the necessary guidance and serves as the main reference for all study personnel, while also providing for the welfare and safety of all study participants; it must be prospective, detailed, and comprehensive.
A protocol is organized in chronological divisions; the background and rationale provide the first impression of the investigators; study endpoints, especially the primary ones, drive the rest of the study design.
The study population, schedule of visits/procedures, and methods for ensuring patient safety, along with other human subjects issues, must be described in detail.
A high-quality protocol will enhance the likelihood of drawing valid conclusions, whether they confirm or refute the hypothesis, thereby reducing the likelihood of needing a costly repeat study.

References

1. Spilker B. Guide to clinical trials. New York: Raven; 1991.
2. Treweek S, McCormack K, Abalos E, Campbell M, Ramsay C, Zwarenstein M, PRACTIHC Collaboration. The trial protocol tool: the PRACTIHC software tool that supported the writing of protocols for pragmatic randomized controlled trials. J Clin Epidemiol. 2006;59:1127-33.
3. Sellier P, Chatellier G, D'Agrosa-Boiteux MC, Douard H, Dubois C, Goepfert PC, Monpère C, Saint Pierre A, Investigators of the PERISCOP study. Use of non-invasive cardiac investigations to predict clinical endpoints after coronary bypass graft surgery in coronary artery disease patients: results from the prognosis and evaluation of risk in the coronary operated patient (PERISCOP) study. Eur Heart J. 2003;24:916-26.
4. Mahaffey KW, Harrington RA, Akkerhuis M, Kleiman NS, Berdan LG, Crenshaw BS, Tardiff BE, Granger CB, DeJong I, Bhapkar M, Widimsky P, Corbalon R, Lee KL, Deckers JW, Simoons ML, Topol EJ, Califf RM, For the PURSUIT Investigators. Disagreements between central clinical events committee and site investigator assessments of myocardial infarction endpoints in an international clinical trial: review of the PURSUIT study. Curr Control Trials Cardiovasc Med. 2001;2:187-94.
5. Marang van de Mheen PJ, Hollander EJ, Kievit J. Effects of study methodology on adverse outcome occurrence and mortality. Int J Qual Health Care. 2007;19:399-406.
6. Borgsteede SD, Deliens L, Francke AL, Stalman WA, Willems DL, van Eijk JT, van der Wal G. Defining the patient population: one of the problems for palliative care research. Palliat Med. 2006;20:63-8.
7. Chin Feman SP, Nguyen LT, Quilty MT, Kerr CE, Nam BH, Conboy LA, Singer JP, Park M, Lembo AJ, Kaptchuk TJ, Davis RB. Effectiveness of recruitment in clinical trials: an analysis of methods used in a trial for irritable bowel syndrome patients. Contemp Clin Trials. 2008;29:241-51.
8. Sisk JE, Horowitz CR, Wang JJ, McLaughlin MA, Hebert PL, Tuzzio L. The success of recruiting minorities, women, and elderly into a randomized controlled effectiveness trial. Mt Sinai J Med. 2008;75:37-43.
9. Armitage J, Souhami R, Friedman L, Hilbrich L, Holland J, Muhlbaier LH, Shannon J, Van Nie A. The impact of privacy and confidentiality laws on the conduct of clinical trials. Clin Trials. 2008;5:70-4.
10. Anisimov VV, Fedorov VV. Modelling, prediction and adaptive adjustment of recruitment in multicentre trials. Stat Med. 2007;26:4958-75.
11. Abbas I, Rovira J, Casanovas J. Clinical trial optimization: Monte Carlo simulation Markov model for planning clinical trials recruitment. Contemp Clin Trials. 2007;28:220-31.
12. Franciosa JA. Commentary on the use of run-in periods in clinical trials. Am J Cardiol. 1999;83:942-4.
13. Romano P. Automation of in-silico data analysis processes through workflow management systems. Brief Bioinform. 2008;9:57-68.
14. Lacroix Z. Biological data integration: wrapping data and tools. IEEE Trans Inf Technol Biomed. 2002;6:123-8.
15. Shah AR, Singhal M, Klicker KR, Stephan EG, Wiley HS, Waters KM. Enabling high-throughput data management for systems biology: the Bioinformatics Resource Manager. Bioinformatics. 2007;23:906-9.
16. Nussenblatt RB, Meinert CL. The status of clinical trials: cause for concern. J Transl Med. 2010;8:65-8.
17. Smith A, Palmer S, Johnson DW, Navaneethan S, Valentini M, Strippoli GF. How to conduct a randomized trial. Nephrology. 2010;15:740-6.
18. Paschoale HS, Barbosa FR, Nita ME, Carrilho FJ, Ono-Nita SK. Clinical trials profile: professionals and sites. Contemp Clin Trials. 2010;31:438-42.
19. Bader JD, Robinson DS, Gilbert GH, Ritter AV, Makhija SK, Funkhouser KA, Amaechi BT, Shugars DA, Laws R. X-ACT collaborative research group. Four lessons learned while implementing a multi-site caries prevention trial. J Public Health Dent. 2010;70:171-5.
20. Treweek S, Pitkethly M, Cook J, Kjeldstrøm M, Taskila T, Johansen M, Sullivan F, Wilson S, Jackson C, Jones R, Mitchell E. Strategies to improve recruitment to randomised controlled trials. Cochrane Database Syst Rev. 2010;4:MR000013.
21. Helgesson G, Ludvigsson J, Gustafsson Stolt U. How to handle informed consent in longitudinal studies when participants have a limited understanding of the study. J Med Ethics. 2005;31:670-3.
22. Jones JW, McCullough LB, Richman BW. Informed consent: it's not just signing a form. Thorac Surg Clin. 2005;15:451-60.
23. Albrecht TL, Franks MM, Ruckdeschel JC. Communication and informed consent. Curr Opin Oncol. 2005;17:336-9.
24. del Carmen MG, Joffe S. Informed consent for medical treatment and research: a review. Oncologist. 2005;10:636-41.
25. Pocock SJ. Clinical trials: a practical approach. New York: Wiley; 1983.
26. Portney LG, Watkins MP. Foundations of clinical research: applications to practice. 2nd ed. New Jersey: Prentice-Hall; 2000.
27. Basic questions and answers about clinical trials. Rockville (MD): Food and Drug Administration (US). Last Updated: 07/16/2009. http://www.fda.gov/forconsumers/byaudience/forpatientadvocates/hivandaidsactivities/ucm121345.htm Accessed 11 Aug 2011.
28. Piantadosi S. Clinical trials: a methodologic perspective. New York: Wiley; 1997.

7 Data Collection and Management in Clinical Research

Mario Guralnik

M. Guralnik, PhD
Synergy Research Inc, 3943 Irvine Blvd #627, Irvine, CA 92602, USA
e-mail: Mario@guralnik.com

Introduction

As noted elsewhere in this volume, all successful clinical trials begin with a good study question or questions, optimally framed as one or more hypotheses, and an appropriate research design that clearly defines appropriate study endpoints as well as other key variables. As in most serious endeavors, the old adage "Failing to plan is planning to fail" applies when conducting clinical research, where poorly conceived study objectives and incompletely defined endpoints can almost guarantee that a study's conclusions will be faulty. In such cases, the best the researcher may hope for are anecdotal observations of questionable validity; at worst, they could mislead the community of patients, clinicians, and/or health policy decision makers for whom the research was conducted.
Once these elements have been rigorously defined, the next most important step is the designation of the data to be collected among the subjects to be included in the trial and the manner of data collection. Optimally, these will be detailed in procedural manuals, which outline the plans and processes for data flow, entry, and quality control and represent the essential documents for managing the conduct of the research. Not surprisingly, most data are collected to address the research study objectives. However, trial administration and compliance data also are often collected to provide evidence of the quality of the conduct of the study.
Having developed the proper study design and data definitions, the researcher next is faced with the challenge of selecting the systems to be used to collect and manage the trial data. Well-designed data management processes, collection tools, and systems will help ensure the validity and integrity of the data to be analyzed. Only data whose sources can be trusted as accurate, complete, and protected from tampering can be used to substantiate conclusions about a trial's outcomes. Also, clinical research inherently raises issues of patient privacy and data security; thus, data management processes and systems used in clinical trials must address both of these areas as well. Overall, defects and inefficiencies in methods and procedures of data identification, collection, and management translate into defects in documented evidence and waste in the conduct of the trial itself [1]. These problems may be compounded when studies are large, are long-term, or utilize multiple centers [2]. Therefore, well-designed trials and data management
methods are essential to the integrity of the findings from clinical trials and to containing the costs of conducting them.
The methods by which data are collected must be addressed during the research design step. Attention must be paid to identifying existing or creating new research documents or devices into which trial observations can be recorded. Selecting the documents/devices that provide the most reliable and valid data is a critical component of the research design process. Historically, the cornerstone of data collection has been the structured paper case report form (CRF) into which the required data are transcribed from the research documents. However, inherent inefficiencies are present in paper-based data collection due to its time- and resource-intensive nature and the error-prone aspects of data transcription and database entry. Not surprisingly, in the last decade, studies once steadfastly done on paper now routinely use electronic data capture (EDC) in an attempt to overcome these inefficiencies. Specifically, these EDC systems reduce redundancy, trap errors in real time (allowing their prompt resolution), and promote the uniform collection of data which can be analyzed and shared in a more consistent and timely manner.
Procedural manuals typically outline processes for data generation, flow, entry, and quality control. They are essential for managing the conduct of the research. Verifying, validating, and correcting data entered into a clinical research database are critical steps for quality control. Several data cleaning processes are available for this purpose.
This chapter will consider the tools and processes that support the development of accurate clinical research data and efficient trial management. These tools and processes are designed to satisfy the requirements of funding agencies, Institutional Review Boards (IRBs), and other regulatory bodies with regard to protecting human subjects, provide timely access to safety and efficacy data, and maintain patient confidentiality. Topics to be covered include the various types of data used in clinical research, basic source and research documents, data capture methods, and procedures for monitoring and securely storing data.

Data Types

The term "data" in clinical research refers to observations that are structured in such a way as to be amenable to inspection and/or analysis [3]. In other words, they represent the evidence for conclusions drawn in a trial. All data collected in biomedical research studies are either numerical or nonnumerical. Nonnumerical data typically are based on written text but also could include data from sources ranging from digital photography to voice dictation. Any individual study may collect either or both of these data types. The approaches required to analyze, summarize, and interpret each type vary, so the differences between the various approaches must be considered when designing a study and collecting the data [4].

Quantitative Data

The data collected in randomized clinical trials (RCTs), where the effectiveness and safety of new clinical treatments are evaluated, primarily are quantitative (numerical) in nature. Such data may be discrete or count-based (e.g., number of white blood cells or hospitalizations) or continuous measurements (e.g., dimensions, temperature, flow) and are collected using such methods as objective (laboratory) testing or patient response questionnaires and surveys that ask the respondent "how much" or "how many." Quantitative data may be displayed graphically or summarized and otherwise analyzed through the use of descriptive and/or inferential statistics. Descriptive statistics, including distributional characteristics of a sample (e.g., frequencies or percentages), measures of central tendency (e.g., means, medians, or modes), and measures of variability (e.g., ranges or standard deviations), provide a way by which the voluminous numerical data collected can be reduced to a manageable and more easily interpretable set of numbers. Inferential statistics provide levels of probability by which the research hypotheses can be tested and conclusions drawn (see Chap. 11 for an in-depth discussion of these methods).
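
As an illustration of how descriptive statistics reduce a set of quantitative observations to a few interpretable numbers, the following minimal Python sketch summarizes a small set of count data. The data values, the variable name hospitalizations, and the helper function describe are hypothetical and are introduced here only for illustration; they do not come from any study discussed in this chapter. The sketch uses only the Python standard library.

```python
from collections import Counter
import statistics

# Hypothetical discrete (count-based) data: number of hospitalizations
# recorded for each of ten patients during follow-up.
hospitalizations = [0, 1, 1, 2, 0, 3, 1, 0, 2, 1]


def describe(values):
    """Reduce a list of numbers to a small set of descriptive statistics."""
    counts = Counter(values)
    n = len(values)
    return {
        "n": n,
        # Distributional characteristics: frequency and percentage per value.
        "frequencies": dict(sorted(counts.items())),
        "percentages": {v: 100 * c / n for v, c in sorted(counts.items())},
        # Measures of central tendency.
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        # Measures of variability.
        "range": max(values) - min(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
    }


summary = describe(hospitalizations)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The same function could be applied to continuous measurements such as temperature; in that case the frequency table is usually less informative than the mean, median, and standard deviation.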
Qualitative Data

Exploratory trials, in which one of the purposes is to generate information for use in the planning and design of RCTs, rely heavily on nominal and other forms of nonnumeric data produced using such methods as patient free-text opinion surveys, diaries, and translations of verbal communications (e.g., interviews). The summarization and analysis of nonnumeric data typically involve the use of descriptive statistics (as is the case for quantitative study data), but additional work is required before the descriptive statistics can be calculated. Specifically, the nonnumeric data first must be translated into numeric codes based on a coding scheme preferably specified in the protocol or, at least, prior to the collection of the data. Applying a coding scheme, however, is by its very nature a subjective process which has the potential for investigator bias resulting from selective collection and recording of the data (or from interpretation based on personal perspectives). The potential bias can be minimized by having at least two researchers independently collect and record the data based on the same information and coding scheme.

Reliability and Validity

Reliability and validity are concepts that reflect the rigor of the research and the trustworthiness of the research findings [5]. Reliability describes the extent to which a particular test, procedure, or data collection method (e.g., a questionnaire) will produce similar results under different circumstances. Highly reliable data are in evidence when the research tool or method used in the collection of the data provides similar information when used by different individuals (interrater reliability) or at different times (reproducibility). Validity is a subtler concept which describes the extent to which what we believe we are measuring accurately represents what we intended to measure. Internal validity indicates the accuracy of causal inferences drawn from a study's findings. External validity indicates the extent to which a study's findings can be applied to other similar groups of people or situations. Additional information about validity and reliability can be found in Chaps. 5 and 8.

Principles of Data Identification and Collection

As previously described, the research data to be collected in any clinical trial and stored in the clinical database must support the objectives of the study and be specified in the protocol. This, in turn, relies on designing data collection instruments and computer databases that correctly capture the defined research data. To support trial administration and to document compliance with regulations and Good Clinical Practice (GCP), source documents also are expected to capture subject participation data, though such data are not necessarily included in the research database [4].
The research data represent the information that is analyzed to answer the questions being stated in the study objectives. In most protocols, addressing primary and potentially secondary objectives requires collection of both efficacy and safety endpoints. To appropriately design the data collection documents and collection methods, it is important to consider the value or weight that each study objective contributes to the overall outcome of the study. Emphasis must be placed on accurate and complete collection of the specific data points necessary to investigate the study's primary objectives, while the collection of extensive data in support of secondary objectives should never be allowed to detract from satisfying the study's primary objectives.
When considering the collection of administrative source data to help with the management of a trial, the amount of such data required depends to a large degree on the complexity of the trial structure. For example, in a small, single-institution trial, much less information typically is needed than in a large multicenter trial [4]. The specific types of data to be collected will depend on the details of the trial and could include information about transport of study
materials, monitors assigned to each site, dates of monitor visits, or drug supply levels at each site [4]. Regardless of the amount and the type of administrative data collected, it is not uncommon for trial management information to be manually and/or electronically stored separately from the clinical trial research results.
Other administrative data include personal patient information. The Study Coordinating Center must be able to link a patient to a specific institution and maintain a roster of contact details for that institution (e.g., patient name, address, telephone and fax numbers, and names, titles, and e-mail addresses for key trial personnel at that institution) [4]. However, due to privacy considerations, the patient identification information must be stored separately from the trial database, which contains uniquely assigned patient and possibly randomization numbers; these numbers can also be used to link data from multiple sources to the same patient (e.g., laboratory data, demographic information, and medical history).
Most studies also will need some documentation of compliance with regulatory requirements. The level of detail for such compliance data depends on the type and purpose of the trial. Studies to be submitted to regulatory agencies in support of New Drug or Product License Applications typically require the most complete set of compliance source data. Types of documentation can include ethics committee approvals for the protocol, original copies of patient consent forms, and personnel qualifications and training at participating sites [4].
Bottom line, in most clinical trials, a large volume of data is collected. According to a recent review of data monitoring in clinical trials [6], the more data that are collected, the more cumbersome and complicated data management becomes. Therefore, one goal in trial design should be to minimize the volume of noncritical data required so as to increase the integrity and quality of the study's results. This requires a realistic appraisal of the ability of investigators and other study personnel to manage the amount of data collected with a minimum of confusion and error.

Data Sources

During the research design step, attention must be paid to identifying existing or creating new research source documents or electronic devices into which trial observations can be recorded. Selecting the source documents/devices that provide the most reliable and valid data to investigate the research study objectives is a critical component of the research design process. Data may be extracted from research-independent sources, e.g., health insurance databases or electronic health records (EHRs), or research-dependent sources, e.g., lab reports generated from the performance of procedures conducted according to a trial protocol's schedule. Both types of sources may provide data for a research study and are described, in greater detail, below.

Source Documentation and the Concept of Original Ink

Source documents for research can be defined as all information contained in original records and certified copies of results, observations, and other aspects required for the reconstruction and evaluation of a study and its conduct [7]. Source documentation in a clinical trial includes medical or physiological, social, and psychological indicators of health that can be used to determine the effectiveness of a clinical intervention. These can involve copies of any or all of the following original confidential medical records: pharmacy dispensing records, physicians' notes, clinic and office charts, nurses' notes, clinical laboratory reports, diagnostic imaging reports, patient diaries and questionnaires, hospital admission records, hospital discharge records, emergency room reports, autopsy reports, electronic diagnostic or research test results, vital sign records, electronically captured original study data, photographs, diagrams, and sketches. Source documents also can be created or provided by a trial sponsor, by a third party (e.g., a contract research organization
[CRO] or a site management organization [SMO]) or by the investigator or site staff, and may include study case report forms (CRFs) or electronic case report forms (eCRFs) if used as the first point of data capture. A source document could even be a cafeteria napkin containing laboratory results or other observations, although a more formal data collection source document would be much preferred.
Use of the original ink concept can help to differentiate a source document from subsequent documentation. Original ink is a term that may be used to define the first-ever written documentation of an event or observation pertaining to the study subject. Thus, documents containing original ink are considered source documents for research. The US Food and Drug Administration (FDA) as well as other regulatory agencies also recognize a CRF as source documentation when it has captured the original ink of an event or observation in a clinical trial. In contrast, transcriptions or reproductions are considered subsequent documentation based on the source original ink document. With today's use of advanced computer technology, ranging from digital photography to voice dictation, we must consider other forms of original ink or, more appropriately termed, original electronic chronicles. These include voice, electronic, magnetic, photo-optical, and other source documentation and records. For further information on the FDA's position on source documentation, the reader is referred to Guidance for Industry: Electronic Source Documentation in Clinical Investigations (2010) [8].
Confusing these issues can lead to misrepresentation of clinical trial data. For example, after site staff has collected a subject's history directly on sponsor-designated CRFs, the study monitor might remind the investigator's staff that preprinted sponsor source documents exist and that they are designed to assist the site in capturing all necessary data elements. The site staff might then proceed to transcribe data from the CRF onto the sponsor's source documents. To further confuse the matter, subsequent monitoring or query resolution activities by the sponsor would then rely on the transcribed sponsor's source documents to be the accurate and overriding data points for resolution. Simply stated, erroneous data could be considered the factual representation of an event or observation. A simple but effective tool for avoiding such situations is to define in advance on a site-by-site, as well as a form-by-form, basis what is and what is not source documentation. When clarifying the definition of source documentation, an important point to keep in mind is that the study staff may habitually record original ink data in certain places. For example, a patient's temperature and pulse may be routinely taken at the bedside by the study coordinator and recorded on a copy of the CRF. If the patient's blood pressure is then taken from the physician's notes and recorded on the copy, then that copy becomes the source documentation for the first two measurements, but not for the third. Interviewing the staff prior to source document verification is an effective time-saving tool. When done early in the study initiation process, this method can very effectively clarify potential discrepancies.

Research-Independent Data Sources

A wealth of medical information is generated every day for nonresearch purposes. A significant source of such data, accessible for research purposes, is the patient medical records maintained by hospitals, clinics, and doctors' offices. Even the simplest medical records could contain important information for research purposes, such as sociodemographic data, clinical data, administrative data, economic data, and behavioral data.
Additional potential research-independent primary data sources are (a) claims data (such as those from managed care databases), (b) encounter data (such as those from a staff/group model of health maintenance organizations), (c) expert opinions, (d) results of published literature, (e) patient registries, and (f) national survey databases. Since these data sources contain historical as well as current data that are updated on an ongoing basis, these sources provide data that
are potentially useful in both retrospective studies (designed to investigate past events) and prospective studies (designed to investigate events occurring after patients have been enrolled in a study).

Research-Dependent Data Sources

Controlled evaluation of investigational products or interventions requires prospective data collection, which typically involves identifying one or more patient groups, collecting baseline data, delivering one or more products or interventions, collecting follow-up data, and comparing the changes from baseline among the different patient groups. Although there may be some research-independent sources collected in these controlled evaluations (e.g., demographic characteristics, medical history), most of the baseline data and, obviously, the follow-up data must be collected from research-dependent sources. Well-designed investigations of this nature specify, prior to the initiation of the study, the data to be collected and the collection methods to be used.

Data Collection Methods

The study design and the study data to be collected dictate the methods by which the data are to be collected. Laboratory data (e.g., hematology, urinalysis, serology) and vital signs (e.g., height, weight, blood pressure) may be required in a clinical trial to evaluate efficacy and, often, to evaluate patient safety. These data typically would be collected using standard methods for these data types and recorded in the patient's medical records, often designed specifically for the research study. Other data collected to address the research question(s) may require clinical information (e.g., events experienced by the patient, nonstudy medications used by the patient), tracking information (e.g., timing and amount of study medications received, alcohol consumption, sexual activity), or subjective information (e.g., personal opinions of medical condition or ease of treatment). These data must be obtained directly from the patient, most often through the use of a questionnaire or survey.

Questionnaires and surveys consist of a predetermined set of questions administered verbally, as a part of a structured interview, or nonverbally on paper or an electronic device. The responses to the questions may be discrete bits of data or may be grouped as measures of study outcomes (e.g., psychological scales). If the questionnaire is intended to measure study outcomes, establishing its reliability and validity and minimizing bias are essential. Administering a published questionnaire for which reliability and validity have been previously determined is recommended when possible. However, the use of some published questionnaires requires permission of their authors and may have a cost associated with their use. When the use of published questionnaires is not feasible, new questionnaires will need to be developed. Such questionnaires should be pretested systematically (i.e., piloted) with a small subgroup of the patient population in order to identify and correct ambiguities or biases in the way the questions are stated. Training interviewers who verbally administer a questionnaire will also increase the quality of the data generated from both published and newly developed data collection instruments. (See Chap. 8 for a detailed description of various item formats used in questionnaires and general rules to consider when constructing questionnaire items.)

Data Capture

Paper-Based Methods

Efficient analysis, summarization, and reporting of biomedical research data require that data be available in an electronic database, such as a spreadsheet or one of several available databases, some of which have been designed specifically for clinical research data. The manner in which the data are entered into these databases has been evolving. Historically, most data in biomedical research, particularly in RCTs, were entered from a set of paper CRFs specifically designed for the study. Figure 7.1 shows an example of a typical paper CRF used to collect data obtained from physical examination.
Fig. 7.1 Example of a paper CRF used to collect research data from a physical examination. Downloaded from the National Cancer Institute at the National Institutes of Health, Division of Cancer Prevention. http://dcp.cancer.gov/Files/clinical-trials/FINAL_DCP_CRF_Templates_Version_3.doc (Accessed 10 Nov 2011)
Electronic Systems

Despite their long-term use, paper-based systems for data collection and management have been found to be inefficient and error prone because of multiple iterations of data transcription, entry, and validation [9]. Thus, due to recent technological advances, paper-based CRFs are being replaced by eCRFs, through which data are entered directly into trial databases from source documents. Features of eCRFs are presented in Table 7.1, but they may vary depending on the computer software upon which they are based.
Table 7.1 Features that may be available for electronic CRFs depending on the clinical trial data management software used (Reproduced with permission from Brandt et al. [2])

Feature | Function
Primary electronic data entry | Data entered into CRF by interviewer or subject (rather than into a paper form first)
Context-sensitive help | Help is given in the context of the problem (immediately)
Default values set | Based upon predefined criteria, or previously entered data, values of fields may be set
Skip patterns | Disabling of questions that become inapplicable based on response to a previous question
Computed (derived) values | Certain questions may be based on values of other questions (such as body mass index (BMI), which is derived from height and weight). Computed values may also control skip patterns on a CRF. If BMI exceeds a preset threshold, questions related to high BMI may be enabled
Interactive validation | Immediate checking of the values entered into the CRF based upon predefined criteria such as ranges, other values in the CRF or study, etc.
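
Three of the features in Table 7.1 (computed values, skip patterns, and interactive validation) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the field names, the allowable ranges, and the BMI threshold below are hypothetical, and a real eCRF system would take them from the study protocol and its own configuration.

```python
# Hypothetical allowable ranges; a real study would define these in its protocol.
RANGES = {"height_cm": (50.0, 250.0), "weight_kg": (2.0, 400.0)}
BMI_THRESHOLD = 30.0  # hypothetical preset threshold that enables follow-up questions


def compute_bmi(height_cm: float, weight_kg: float) -> float:
    """Computed (derived) value: BMI = weight in kg / (height in m) squared."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def validate(entry: dict) -> list:
    """Interactive validation: one message per missing or out-of-range value."""
    problems = []
    for name, (low, high) in RANGES.items():
        value = entry.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif not (low <= value <= high):
            problems.append(f"{name}: {value} outside {low}-{high}")
    return problems


def enabled_questions(entry: dict) -> list:
    """Skip pattern: the high-BMI question is enabled only when valid height and
    weight yield a BMI above the threshold; otherwise it stays disabled."""
    questions = ["height_cm", "weight_kg"]
    if not validate(entry):
        if compute_bmi(entry["height_cm"], entry["weight_kg"]) > BMI_THRESHOLD:
            questions.append("high_bmi_followup")
    return questions
```

In this sketch, a form would call `validate` each time a value is entered and prompt the user with the returned messages, and would call `enabled_questions` to decide which items to display.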
Principles of Case Report Form Design

Regardless of whether a CRF or eCRF is used, meaningful collection of high-quality data begins with a CRF that is based on the trial protocol [8]. Hence, consistency with source documents is an essential feature of a well-designed CRF. However, an analysis of source document verification performed by the sponsors of clinical trials has identified areas of inconsistency in 70% of cases [10]. Several items were either covered in the CRF but not in the source documents (including those pertaining to patient history and informed consent) or were described in the source documents but not in the CRF (including those regarding patient history, complications, adverse events, and concomitant drugs or other therapies). Sources of such discrepancies need to be resolved before a trial begins. Although CRFs play a pivotal role in the successful conduct of a trial, the design of these forms often is neglected in the haste to launch a trial [11]. The content, format, coding, and data-entry requirement principles of good CRF design, described more than 20 years ago by Bernd [11], remain applicable today (Table 7.2).

Table 7.2 Content, format, and data-entry principles of good case report form design

CRF content principles
• Collect data that support questions (as defined in the protocol) that are to be answered by the statistical analysis.
• Define terminology and scales.
• Avoid questions that address ancillary issues.
• Ask each question only once.

CRF format principles
• Ask questions directly and unambiguously, using conventional and professional terminology.
• For long-term studies, provide a separate CRF for each visit and group of visits.
• Arrange the questions in a logical sequence (i.e., the order in which a physician would ordinarily collect the data).
• Specify how precise answers should be (i.e., whether values should be rounded off or carried to one or more decimal places).
• When possible, collect direct numerical measurements rather than broad categorical judgments.
• Use design techniques that simplify reading and completing the form:
  - Balance white space with text.
  - When possible, use check-off blocks instead of asking for a code, value, or term.
  - Block sections of the form to make them easy to locate and complete.
  - Use variations in size and boldness to show the hierarchy of headings.
  - Highlight areas of the form where entries are needed.

CRF coding and data-entry requirements
• Use consistent reference codes (e.g., if code [1] represents "no" for one question, it should not represent "yes" for another question).

To avoid the excessive costs and delays often associated with printing CRFs, sponsors that use paper-based data capturing have found alternatives to the traditional outsourcing of this task. Desktop-publishing systems and precollated no-carbon-required (NCR) paper allow printing, collating, and binding of CRFs, with multicolored two- or three-part sets [11]. Over the course of a longitudinal study, CRFs often are improved or refined, including the addition of new entries and modification or deletion of entries on previous versions [2]. Some newly requested data (such as information about the patient's history) may be obtainable later, whereas time-dependent observations (such as measurements taken at a certain clinic visit) will not. Data for new or modified questions that cannot be obtained must be treated as missing. Conversely, when a question is deleted, data for patients evaluated under the older CRF version must be archived or purged or both [2]. Regardless of the types of changes made, the FDA requires that the sponsor preserve all electronic versions for agency review and copying [12].

Electronic systems are designed to support data entry where data are entered directly from source documents, with most data validations executed in real time as the data are entered and errors promptly resolved, typically by study site staff. As will be noted below, EDC systems also support the monitoring, cleaning, storage, retrieval, and analysis of research data [2], as well as promote the uniform collection of data, which can then be more easily analyzed and shared across a variety of platforms and databases [13].

EDC systems, however, are not without their own constraints. To be useful in multicenter trials, EDC systems must allow electronic submission of data from different sites to a central data center, be easy to implement and use, and minimize disruption at the clinical sites [9]. Timing is essential to the successful implementation of an EDC system. Considerable information technology (IT) support is needed to build the eCRFs, and considerable time must be dedicated to educating the trial site staff on the proper use of the new systems. To be successful and reap the benefits of EDC systems, this effort should be undertaken prior to the initiation of any research study [14].

Although EDC systems are most often used by formally organized research centers with data management staff, many clinical investigators in private practice or in academia conduct studies without the support of qualified biomedical informatics consultants and sophisticated EDC systems [15]. Nevertheless, EDC systems are available that can be implemented without specialized software for investigators with small budgets or limited access to data management staff.

Data collection has naturally evolved alongside computer and information technology. Major milestones in this evolution include personal computers, relational databases, user-friendly interfaces for software once reserved for engineering and systems design staff, and broadened connectivity options such as computer to computer, internet networking, wireless to Ethernet, and cellular data connectivity. These advances, along with the availability now of mobile computing and electronic devices like the iPad, have a potentially huge impact on how we gather data, as well as on where data capture is heading.

The iPad is a major step forward for clinical data management. These truly remarkable devices, resting in the hands of all members of the research team, would allow quick access to tools for capturing data, in real time or otherwise. They also offer two-way connectivity along with the portability and functionality of the hardware, thereby lending them the exact adaptability needed for clinical medicine and research roles.

Newer generation iPads allow data to migrate from text-based field entry, or PDF form data entry, through to server-based relational databases. Using methods ranging from e-mail as a carrier to internet-connected applications, the data stream can be instantaneous, allowing for immediate two-way data efforts, relaying back from sponsor to investigator. Third-party communications further enhance the iPad platform. All of this has begun to evolve because the iPad platform has simplified the process of data capture and transfer via its accessible hardware and novel data management applications.
Electronic Training Manuals

Procedural and training manuals, the core documents of any clinical study, outline plans and processes for study coordination, creation of CRFs, data entry, quality control, data audits, data-entry verification, and site/data restrictions [2, 16]. Recent technological advances have not only made possible efficiencies in data collection and data processing but also made possible electronic manuals created with HTML-based content, which offer several advantages over paper-based core documents. Special software can be used to edit multiple discrete documents, organize them hierarchically, and provide hyperlinking between related topics. When a manual is produced in this way, many authors can work simultaneously on different topics that are subsequently integrated with a version-control system as the content evolves. The version-control software can also manage changes and updates to protocol documents. Other advantages of this system over simple text documents include the capability for single-source authoring with generation of multiple output formats (e.g., JavaHelp), distribution of the complete manual through a dedicated web site (with hyperlinks to support online browsing), and support for highly efficient, full-text searching, with results ranked by relevance [2].

Data Error Identification and Resolution

Verifying, validating, and correcting data entered into a clinical research database are critical steps for quality control. Several data cleaning processes are available, including the following: double computer data entry, which captures entry inconsistencies (though it cannot detect errors made by the person supplying the data without additional exploratory data analysis); random data-entry audits (which are based on a predetermined level of criticality for each data category); and electronic data validation (which identifies entry errors by their deviation from allowable and expected responses and interactively prompts for corrected data). Compared with paper-based systems, EDC systems can more efficiently clean data by reducing the number of data discrepancies and requests for clarifications, as well as lower the cost of each data query resolution by lessening the amount of manual input required.

Data-Entry Cleaning

Cleaning, the process of verifying, validating, and correcting data entered onto the CRF or into the database, is essential to quality control in a clinical trial. Double data entry, the most common way to verify data entered onto CRFs [17], begins with reentry of data from the CRF into the study database at a later point than the original entry; often, this step is performed by a person other than the operator who made the first entry. Next, the two versions are automatically compared, and any discrepancies are corrected [17]. Despite the widespread use of this method, the quality of data so corrected has been debated for many years [18]. Commenting that "the concept of typing a final report twice to check for typographical errors is almost laughable," one group questioned why "double data entry but not double everything else?" Because double data entry rests on the assumption that original records are correct and all errors are introduced during data entry, this system can never trap errors made by the person completing the form without exploratory data analysis (EDA) [19]. EDA, which challenges the plausibility of the written data on the CRF, should therefore be performed either as data entry is ongoing or as the first stage in an analysis when double data entry is used.

Random data-entry audits are another way to check the quality of data on a CRF. This method is based on a predetermined level of criticality (assigned by the data management/investigator team) for each data category, with respect to the adverse consequences of entering erroneous data.
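
As a rough sketch of how a criticality level might drive a random data-entry audit, the Python fragment below draws a random sample of CRF identifiers whose size depends on the category's criticality. The category labels and sampling fractions are hypothetical placeholders; in practice they would be assigned by the data management/investigator team.

```python
import math
import random

# Hypothetical sampling fractions per criticality level (assumptions for illustration).
AUDIT_FRACTION = {"critical": 1.0, "important": 0.5, "noncritical": 0.1}


def select_crfs_for_audit(crf_ids, criticality, seed=None):
    """Return a sorted random sample of CRF ids, sized by the criticality level.

    A seed can be supplied so that the draw is reproducible and can be documented.
    """
    ids = list(crf_ids)
    rng = random.Random(seed)
    # Round up so that even a small site contributes at least one CRF when any exist.
    n = math.ceil(len(ids) * AUDIT_FRACTION[criticality])
    return sorted(rng.sample(ids, n))
```

The sampled CRFs would then be checked by hand against their source documents; the fractions could be raised or lowered per site as discrepancy rates become known.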
For each category, a proportion of the CRFs is sampled by a random-sample-generating program, and entered data are compared with the source documents for discrepancies. For very important categories (i.e., data that are central to the study objective and must be correct), as many as 100% of CRFs may be sampled [2, 6]. Noncritical data, which should be correct but would not affect the study outcome if incorrect, would require a lower proportion of CRFs to be checked [6]. After sampling, the number of discrepancies is reported and corrective action taken. The proportion of audited CRFs for any category may be modified for a given site in light of site-specific discrepancy rates [2].

Electronic data validation identifies entry errors by their deviation from allowable and expected values or answers. These include laboratory measurements, answers that contradict answers to other questions entered elsewhere on the CRF, spelling errors, and missing values [2]. Because of their concrete nature, these errors can easily be identified.

Data Queries

To support the full process of study monitoring and auditing, the data management system should have querying tools in place [2]. After the data entry/verification process discovers an entry that requires clarification and determines that the data were accurately entered into the database, the data coordinator sends the participating institution a paper or electronic query. Examples of entries that warrant queries include missing data values, values out of range, values that fail logic checks, or data that appear to be inconsistent [20]. The query should include protocol and patient identifiers, specific descriptions of the form/data item in question and the clarification needed, and instructions on how and when to send a response. In turn, the coordinating center should have a mechanism for recording the issue and response to each query [20].

EDC systems have a proven superiority to paper-based systems with respect to the querying process. The use of eCRFs in combination with manual ad hoc queries by study monitors has been able to reduce data discrepancies and the consequent need for clarifications by more than 50%. The enhanced ability to clean and analyze data has resulted in the generation of more accurate data [21]. Moreover, compared with a paper-based system, EDC systems with built-in error checking for data quality have been shown to reduce the total number of queries and decrease the cost of each query resolution from $60 to $10 [14].

Document Retention, Security, and Storage

Retention

All clinical investigators should ensure that relevant forms such as CRFs are always accessible in an organized fashion. Informed-consent forms, CRFs, laboratory forms, medical records, and correspondence should be retained by the investigator until the end of the study and, thereafter, by the sponsor for at least 2 years after clinical development of the investigational product has been formally discontinued or 6 years after the trial has ended. Even after the completion of the study, side effects or benefits of the intervention may be present and the relevant forms may need to be retrieved. Factors to be considered are the availability of storage space and the possibility of off-site storage if there is insufficient storage space [22].

Security and Privileging

Both during and after completion of a study, investigators and their staff must prevent unauthorized access, preserve patient confidentiality, and prevent retrospective tampering/falsification of data. Under the FDA's Title 21 Code of Federal Regulations [23], access must be restricted to authorized personnel, the system must prevent malicious changes to research data through selective data locking, and an audit trail must exist [2].
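
The three requirements just listed (restricted access, data locking, and an audit trail) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of any particular EDC product or of what regulatory compliance requires: the role names, the permission sets, and the `StudyRecord` class are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical roles and their permissions (assumptions for illustration).
PERMISSIONS = {
    "coordinator": {"read", "write", "lock"},
    "data_entry": {"read", "write"},
    "statistician": {"read"},
}


class StudyRecord:
    """One subject's data with role checks, a lock flag, and an append-only log."""

    def __init__(self, values):
        self.values = dict(values)
        self.locked = False
        self.audit_trail = []  # entries: (UTC timestamp, user, action, detail)

    def _log(self, user, action, detail):
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_trail.append((stamp, user, action, detail))

    def update(self, user, role, field, new_value):
        # Access restriction: only roles with "write" may change data.
        if "write" not in PERMISSIONS.get(role, set()):
            self._log(user, "denied", f"write {field}")
            raise PermissionError(f"role {role!r} may not edit data")
        # Data locking: no changes once the record is locked.
        if self.locked:
            self._log(user, "denied", f"write {field} (locked)")
            raise PermissionError("record is locked")
        old = self.values.get(field)
        self.values[field] = new_value
        self._log(user, "update", f"{field}: {old!r} -> {new_value!r}")

    def lock(self, user, role):
        if "lock" not in PERMISSIONS.get(role, set()):
            self._log(user, "denied", "lock")
            raise PermissionError(f"role {role!r} may not lock data")
        self.locked = True
        self._log(user, "lock", "record locked")
```

Every attempted change, whether successful or refused, leaves an entry in `audit_trail`, so a reviewer can reconstruct who changed what and when.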
Consideration should be given to software that provides:

• Privileging: Study-specific role-based privileges should be assigned, with roles requiring adequate training and documentation of such training prior to system use. In the case of multisite studies, it is especially important to be able to assure investigators from each site that other sites can be restricted from altering their data or, in some cases, even seeing their data while the study is in progress. Also, different users should have different data access and editing privileges. Software should allow site restriction of data and the assignment of both role-based and functional privileges. The software should allow the level of restriction to be changed as appropriate.
• Storing of De-identified Data: For studies where breach of patient confidentiality could have serious repercussions, the software should support storing of de-identified data. It is important to note that the Health Insurance Portability and Accountability Act (HIPAA) does not prohibit the storing of patient-identifiable information: it requires only that it be secure, be made accessible strictly on a need-to-know basis, and that accesses to such information be audited. The drawback of not storing patient-identifiable information in every study is that many of a system's useful workflow-automation features, such as generation of reminders to be mailed to patients periodically, cannot function seamlessly, and personalization of reminders requires manual processes. Also, in prospective clinical studies for life-threatening conditions such as cancer, where decisions such as dose escalation are based on values of patient parameters, the storage and selective echoing of protected health information (PHI) provides an added safeguard to ensure that data are being entered, or the appropriate intervention is being performed, for the correct patient.
• Generation of De-identified Data: The software should be able to de-identify the data when required in order to share data and should utilize information about user role-based privileges as well. For example, an investigator may have privileges to view patient identifying information, but other personnel, such as biostatisticians performing analyses, may view only de-identified data.
• Data Locking: The software should allow a study coordinator to lock all the data in the system by study, subject, or CRF level when required.

All investigators, particularly those involved in any type of human subjects research, must be sure to take adequate steps to preserve the confidentiality of the data they collect. Investigators must specify who will have access to the data, how and at what point in the research personal information will be separated from other data, and how the data will be retained at the conclusion of the study.

The following guidelines for preserving patient confidentiality should be followed [24, 25]:

• In general, all information collected as part of a study is confidential: data must be stored in a secure manner and must not be shared inappropriately.
• Information should not be disclosed without the subject's consent.
• The protocol must clearly state who is entitled to see records with identifiers, both within and outside the project.
• Wherever possible, potentially eligible subjects should be contacted either by the person to whom they originally gave the information or by another person with whom they have a trust relationship.
• Information provided to prospective subjects should include descriptions of the kind of data that will be collected, the identity of the persons who will have access to the data, the safeguards that will be used to protect the data from inappropriate disclosure, and the risks that could result from disclosure of the data.
• Academic and research organizations should establish patient privacy guidelines for non-employee researchers.

Other Responsibilities and Issues

GCP guidelines mandated through the Code of Federal Regulations require that institutions (or when appropriate, an IRB) maintain records of all research proposals reviewed (including any
scientific evaluations that accompany the proposals), approved sample consent documents, progress reports submitted by investigators, and reports of injuries to subjects [25]. Institutions also must maintain adequate records on the shipment of the drug product to the trial site and its receipt there, the inventory at the site, use of the product by study participants, and the return to the sponsor of unused product and its disposition [26-28]. Because drug-accountability records must be accurate and clear, especially for an audit of the study site [29], electronically based inventory management systems have been devised. In addition to describing current inventory [20], some of these systems have "look ahead" capabilities to assess and fulfill future inventory needs [30].

Oversight of Data Management: Role of Institutional Review Boards

As will be noted in Chap. 12, IRBs have a wide range of responsibilities in the design, conduct, and oversight of clinical trials, and it is important that clinical researchers be familiar with them. IRB functions that are particularly germane to those managing data include oversight of protection of the privacy and confidentiality of human subjects (identifiers and other data), monitoring of collected data to optimize subjects' safety, and continuing review of findings during the duration of the research project [31].

Confidentiality and Privacy of Research Data

Information obtained by researchers about their subjects must not be improperly divulged. It is essential that researchers be able to offer subjects assurance of confidentiality and privacy and make explicit provisions for preventing breaches. For most clinical research studies, assuring confidentiality typically requires adherence to the following routine practices: substituting codes for patient identifiers, removing face sheets (containing such items as names and addresses) from survey instruments that contain data, properly disposing of computer sheets and other documents, limiting access to data, and storing research records in locked cabinets. Although most researchers are familiar with the routine precautions that should be taken to maintain the confidentiality of data, more elaborate precautions may be needed in studies involving sensitive matters such as sexual behavior or criminal activities to give subjects the confidence they need to participate and answer questions. When information linked to individuals will be recorded as part of the research design, IRBs require that data managers ensure that adequate precautions are in place to safeguard the confidentiality of the information; thus, numerous specialized security methods have been developed for this purpose, and IRBs typically have at least one member (or consultant) who is familiar with the strengths and weaknesses of the different systems available. Researchers should also be aware that federal officials have the right to inspect research records, including consent forms and individual medical records, to ensure compliance with the rules and standards of their programs. In the USA, FDA rules require that information regarding this authority be included in the consent forms for all research regulated by that agency.

Monitoring and Observation

One of the areas typically reviewed by the IRB is the researcher's plan for collection, storage, and analysis of data. Regular monitoring of research findings is important because preliminary data may signal the need to change the research design or the information that is presented to subjects or even to terminate the study early if deemed necessary. Thus, for an IRB to approve proposed research, the protocol must, as appropriate, include plans for monitoring the data collected to ensure the safety of subjects. Investigators sometimes misinterpret this requirement as a call for annual reports to the IRB. Instead, US Federal regulations require that, when appropriate, researchers provide the IRB with a description of
their plans for analyzing the data during the collection process. Concurrent collection and analysis enables the researcher to identify flaws in the study design early in the project. The level of monitoring in the research plan should be related to the degree of risk posed by the research. Furthermore, when the research will be performed at foreign sites, the IRB at a US institution may require different monitoring and/or more frequent reporting than that required by the foreign institution. Under normal circumstances, however, the IRB itself does not undertake data monitoring. Rather, other independent persons (e.g., members of a data safety monitoring board [DSMB]) typically are responsible for monitoring trials and for decisions about modification or discontinuation of trials. It is the IRB's responsibility, though, to ensure that these functions are carried out by an appropriate group. The review group should be required to report its findings to the IRB on an appropriate schedule.

Continuing Review

At the time of its initial review, the IRB determines how often it should reevaluate the research project and will set a date for its next review. Some IRBs set up a complaint procedure that allows subjects to indicate whether they believe that they were treated unfairly or that they were placed at greater risk than was agreed upon at the beginning of the study. A report form available to all researchers and staff may be helpful for informing the IRB of unforeseen problems or accidents. US Federal policy requires that investigators inform subjects of any important new information that might affect their willingness to continue participating in the trial. Typically, the IRB will make a determination as to whether any new findings, new knowledge, or adverse effects should be communicated to subjects, and it should receive copies of any such information conveyed to subjects. Any necessary changes to the consent document(s) and any variations in the manner of data collection must be reviewed and approved by the IRB. The IRB has the authority to observe, or have a third party observe, the consent process and the research itself. The researcher is required to keep the IRB informed of unexpected findings involving risks and to report any occurrence of serious harm to subjects. Reports of preliminary data analysis may be helpful both to the researcher and the IRB in monitoring the need to continue the study. An open and cooperative effort between the researcher and the IRB protects all concerned parties.

Summary and Conclusions

Clearly defined study endpoints combined with well-designed source documents, CRFs, and systems for capturing, monitoring, cleaning, and securely storing data are essential to the integrity of findings from clinical biomedical research trials. Because IRBs have a wide range of responsibilities in the design, conduct, and oversight of clinical trials, it is also essential that clinical investigators be familiar with their requirements.

The inexorable shift from paper-based to EDC systems in large trials promotes the efficient and uniform collection of data that can be analyzed and shared across a variety of platforms and databases. EDC systems can build quality control into the data collection process from its inception, a more productive approach than building checks onto the end [19]. Although modern software tools unquestionably improve the potential for data collection and management, systems alone are worthless without pro-active study coordinators and investigators who create and enforce policies and procedures to ensure quality [2]. Therefore, a trial's data collection system and its findings are only as sound as the commitment by individuals who formulate and carry out document design, study procedures, training, and data management plans.
Take-Home Points
• Well-designed trials and data management methods are essential to the integrity of the findings from clinical trials, and the completeness, accuracy, and timeliness of data collection are key indicators of the quality of conduct of the study.
• The research data provide the information to be analyzed in addressing the study objectives, and addressing the primary objectives is the critical driver of the study.
• Since the data management plan closely follows the structure and sequence of the protocol, the data management group and protocol development team must work closely together.
• Accurate, thorough, detailed, and complete collection of data is critical, especially at baseline as this is the last time observations can be recorded before the effects of the trial interventions come into play.
• The shift from paper-based to electronic systems promotes efficient and uniform collection of data and can build quality control into the data collection process.
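
The idea of building quality control into data collection, rather than adding checks at the end, can be illustrated with a minimal sketch of entry-time edit checks. The sketch is in Python and is not drawn from any particular EDC product; the field names (subject_id, systolic_bp, consent_date, visit_date), the numeric limits, and the helper names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass(frozen=True)
class EditCheck:
    """One entry-time rule: the field it guards, a predicate, and a query message."""
    field: str
    is_valid: Callable[[dict], bool]
    message: str


def _in_range(name: str, low: float, high: float) -> Callable[[dict], bool]:
    """Build a predicate that passes only for a numeric value within [low, high]."""
    def check(record: dict) -> bool:
        value = record.get(name)
        return isinstance(value, (int, float)) and low <= value <= high
    return check


def _visit_not_before_consent(record: dict) -> bool:
    """Cross-field consistency: the visit cannot precede informed consent."""
    consent, visit = record.get("consent_date"), record.get("visit_date")
    return isinstance(consent, date) and isinstance(visit, date) and visit >= consent


# Hypothetical rules for an illustrative baseline form; the limits are placeholders,
# not clinical recommendations.
BASELINE_CHECKS = [
    EditCheck("subject_id", lambda r: bool(r.get("subject_id")),
              "subject_id is required"),
    EditCheck("systolic_bp", _in_range("systolic_bp", 60, 260),
              "systolic_bp is missing or outside 60-260 mmHg"),
    EditCheck("visit_date", _visit_not_before_consent,
              "visit_date is missing or precedes consent_date"),
]


def run_edit_checks(record: dict, checks=BASELINE_CHECKS) -> list:
    """Return the query message for every rule the record fails (empty if clean)."""
    return [c.message for c in checks if not c.is_valid(record)]
```

Calling run_edit_checks on a record at the moment of entry returns an empty list when the record passes and otherwise a list of queries that the coordinator can resolve while the source document is still at hand. Keeping the rules in a plain list is a design choice that lets the data management group review them alongside the protocol.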
References

1. Liu EW. Clinical research the six sigma way. JALA. 2006;11:429.
2. Brandt CA, Argraves S, Money R, Ananth S, Trocky NM, Nadkarni PM. Informatics tools to improve clinical research study implementation. Contemp Clin Trials. 2006;27:11222.
3. Piantadosi S. Clinical trials. A methodologic perspective. 2nd ed. Hoboken: Wiley; 2005.
4. McFadden E. Data definition, forms, and database design. In: Management of data in clinical trials. 2nd ed. New York: Wiley; 2007.
5. Roberts P. Reliability and validity in research. Nurs Stand. 2006;20:415.
6. Williams GW. The other side of clinical trial monitoring; assuring data quality and procedural adherence. Clin Trials. 2006;3:5307.
7. Crerand WJ, Lamb J, Rulon V, Karal B, Mardekian J. Building data quality into clinical trials. J AHIMA. 2002;73:4456.
8. Guidance for industry: electronic source documentation in clinical investigations. Rockville: Food and Drug Administration (US). 2010. Office of Communication, Outreach and Development, HFM-40. Accessed 28 Oct 2011.
9. Kush R, Alschuler L, Ruggeri R, Cassells S, Gupta N, Bain L, Claise K, Shah M, Nahm M. Implementing Single Source: the STARBRITE proof-of-concept study. J Am Med Inform Assoc. 2007;14:66273.
10. Takayanagi R, Watanabe K, Nakahara A, Nakamura H, Yamada Y, Suzuki H, Arakawa Y, Omata M, Iga T. Items of concern associated with source document verification of clinical trials for new drugs. Yakugaku Zasshi. 2004;124:8992.
11. Bernd CL. Clinical case report forms design - a key to clinical trial success. Drug Inf J. 1984;18:38.
12. US Food and Drug Administration. Code of Federal Regulations: 21CFR11.10. Title 21 - food and drugs. Chapter I, subchapter A - general. Part 11 - electronic records; electronic signatures. Subpart B - electronic records. Sec. 11.10 - controls for closed systems. http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfCFR/CFRSearch.cfm?fr=11.10. Accessed 25 Feb 2008.
13. Meadows BJ. Eliciting remote data entry system requirements for the collection of cancer clinical trial data. Comput Inform Nurs. 2003;21:23440.
14. Welker JA. Implementation of electronic data capture systems: barriers and solutions. Contemp Clin Trials. 2007;28:32936.
15. Kashner TM, Hinson R, Holland GJ, Mickey DD, Hoffman K, Lind L, Johnson LD, Chang BK, Golden RM, Henley SS. A data accounting system for clinical investigators. J Am Med Inform Assoc. 2007;14:3946.
16. Argraves S, Brandt CA, Money R, Nadkarni P. Informatics tools to improve clinical research. In: Proceedings of the American Medical Informatics Association Symposium, 22-26 Oct 2005; Washington, DC.
17. Kawado M, Hinotsu S, Matsuyama Y, Yamaguchi T, Hashimoto S, Ohashi Y. A comparison of error detection rates between the reading aloud method and the double data entry method. Control Clin Trials. 2003;24:5609.
18. King DW, Lashley R. A quantifiable alternative to double data entry. Control Clin Trials. 2000;21:94102.
19. Day S, Fayers P, Harvey D. Double data entry: what value, what price? Control Clin Trials. 1998;19:1524.
20. McFadden E. Software tools for trials management. In: Management of data in clinical trials. 2nd ed. New York: Wiley; 2007.
21. Trocky NM, Fontinha M. Quality management tools: facilitating clinical research data integrity by utilizing specialized reports with electronic case report forms. In: Proceedings of the American Medical Informatics Association Symposium, 22-26 Oct 2005; Washington, DC.
22. Saw SM, Lim SG. Clinical drug trials: practical problems of phase III. Ann Acad Med Singapore. 2000;29:598605.
23. http://www.datagovernance.com/adl_FDA_21_CFR_USA.html.
24. Department of Health and Human Services. IRB Guidebook Chapter IV: Consideration of research design, 2007. http://www.hhs.gov/ohrp/irb/irb_chapter4.htm. Accessed 25 Apr 2008.
25. US Food and Drug Administration. Code of Federal Regulations: 21CFR56.115. Title 21 - Food and drugs. Chapter I, subchapter A - general. Part 56 - institutional review boards. Subpart D - records and reports. Sec. 56.115 - IRB records. http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm?fr=56.115. Accessed 24 Feb 2008.
26. US Food and Drug Administration. Code of Federal Regulations: 21CFR312.59. Title 21 - Food and drugs. Chapter I, subchapter D - drugs for human use. Part 312 - investigational new drug application. Subpart D - responsibilities of sponsors and investigators. Sec. 312.59 - disposition of unused supply of investigational drug. http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm?fr=312.59. Accessed 24 Feb 2008.
27. US Food and Drug Administration. Code of Federal Regulations: 21CFR312.61 Investigational New Drug Application. Title 21 - Food and drugs. Chapter I, subchapter D - drugs for human use. Part 312 - investigational new drug application. Subpart D - responsibilities of sponsors and investigators. Sec. 312.61 - control of the investigational drug. http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm?fr=312.61. Accessed 24 Feb 2008.
28. US Food and Drug Administration. Code of Federal Regulations: 21CFR312.62. Title 21 - Food and drugs. Chapter I, subchapter D - drugs for human use. Part 312 - investigational new drug application. Subpart D - responsibilities of sponsors and investigators. Sec. 312.62 - investigator recordkeeping and record retention. http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm?fr=312.62. Accessed 24 Feb 2008.
29. Siden R, Tankanow RM, Tamer HR. Understanding and preparing for clinical drug trial audits. Am J Health Syst Pharm 2002;59:2301,2306,2308.
30. DDOTS, Inc. IDEA Web-based software for investigational drug inventory management. http://www.ddots.com/idea_product_overview.cfm. Accessed 24 Feb 2008.
31. Department of Health and Human Services. IRB Guidebook Chapter III: Basic IRB Review. 2007. http://www.hhs.gov/ohrp/irb/irb_chapter3.htm. Accessed 25 Apr 2008.
8 Constructing and Evaluating Self-Report Measures

Peter L. Flom, Phyllis G. Supino, and N. Philip Ross
P.L. Flom, PhD
Peter Flom Consulting, LLC, 515 West End Ave, New York, NY 10024, USA
e-mail: Peterflomconsulting@mindspring.com

P.G. Supino, EdD
Department of Medicine, College of Medicine, SUNY Downstate Medical Center, 450 Clarkson Avenue, Box 1199, Brooklyn, NY 11203, USA
e-mail: phyllissupino@aol.com

N.P. Ross, BS, MS, PhD Statistics
9006 Kirkdale Road, Bethesda, MD 29817, USA
e-mail: ross@statlogic.net

A self-report measure, as the name implies, is a measure where the respondent supplies information about himself or herself. Such information may include self-reports of behaviors, physical states or emotional states, attitudes, beliefs, personality constructs, and self-judged ability, among others. A self-report may be obtained via questionnaire, interview, or related methods. Questionnaires typically are written documents that are administered without the involvement of an interviewer, whereas interviews usually (but not always) are administered orally [1]; both are sometimes termed "surveys."

Self-reports are important in medical research because while some variables can be evaluated through physiological measures, chart review, physical exam, direct observation of the respondent, or by reports by others, other variables can be assessed only from information directly furnished by the patient or other subject. Indeed, the subject often can provide valuable information about social, demographic, economic, psychological, and other factors related to the risk of disease or to adverse outcomes of disease. The choice between self-report, observational, and biophysiological measures will depend on the data that are available and the nature of the research questions and hypotheses. It is important to note that while the range of biophysiological measures is constantly increasing, and while these measures may permit objective evaluation of clinically relevant attributes, they are not perfectly reliable (i.e., free from measurement error). Even more importantly, they may fail to capture the specific quality that the investigator wishes to evaluate. For example, if an investigator is interested in blood pressure, this may be evaluated biophysiologically. However, if the aim of the investigation is to examine the effects of mood on blood pressure, mood can be evaluated only by self-report as there are no biophysiological measures of mood (though there may be biophysiological correlates, and even causes and consequences of biophysical factors). Observational data also may provide useful information, but their use has its own perils as individuals do not always accurately observe the actions of others. For these reasons, information directly reported by patients and other subjects commonly is collected by clinicians, clinical investigators, and other health-care professionals, and can be used as a tool for patient management or for research. Topics commonly examined by self-report include physical or mental symptoms, level of
pain or stress, activities of daily living, health-related quality of life, availability of social support, use and perceived effectiveness of strategies used to cope with ill-health, satisfaction with the doctor-patient interaction, and adherence to medication schedules (though the latter might, at least in theory, also be evaluated through objective testing).

Although self-report instruments are relatively easy to use, their construction and validation can be difficult. This chapter will cover fundamental aspects of, and distinctions among, questionnaires, interviews, and other methods of self-report and will indicate the circumstances under which a new self-report measure may be needed. It also will describe methods of generating and structuring responses; discuss approaches to asking about sensitive information; describe the rationale for, and processes involved in, pilot testing, evaluating, and revising a measure; review related ethical and legal aspects; and provide a general guide to the entire process.

What Is a Questionnaire?

A questionnaire is a type of self-report instrument that is designed to elicit specific information from a population of interest. Questionnaires may be standardized but often are designed (or adapted) specifically for a particular study. Depending on the objective of the study and resources, the questionnaire, like other self-report measures, may be administered to all subjects in the available sample or to a defined subsample. As noted below, the most common method of administration is direct mailing to subjects, though other methods exist. Deciding upon the sampling strategy is a complex process. It can range from a simple random sample to a very complex hierarchical design involving multiple strata and sampling procedures, as reviewed in Chap. 10. For additional information on this subject, the reader is referred to Kish (1995) [2], Groves et al. (2004) [3], and Cochran (1977) [4].

The questionnaire usually is in the form of a written document, though sometimes it may be administered by audio or with pictorial methods. Questionnaires, like tests, can produce a total score or subscores, but also can yield different types of information that can be separately analyzed. Questionnaires are almost always a necessity when direct contact with the subject is not possible. Under these circumstances, questionnaires typically are administered by mail to the respondent who, in turn, completes and returns them to the sender. In other circumstances, questionnaires may be read to the respondent over the telephone or in person as part of a structured interview, or they may be administered via the Internet in a variety of ways. A questionnaire can cover virtually any topic, although here we will emphasize those that capture information related to medical issues or health-related topics including, but not limited to, diseases, symptoms, and a patient's experiences with doctors and other health professionals. Some well-known questionnaires used in medical research are the Brief Symptom Inventory (a 53-item questionnaire covering nine dimensions of psychological health [5]); the SF-36 (a 36-item patient-centered questionnaire about general physical and mental health-related quality of life [6]); the 26-item World Health Organization Quality of Life Questionnaire (WHOQOL) [7] assessing general, physical, emotional, social, and environmental health quality; the Minnesota Living with Heart Failure Questionnaire (MLHFQ) (comprising 21 questions that measure the patient's perceived limitations due to heart failure [8]); and the Morisky Scale (a series of six questions about medication adherence [9]).

Interviews and Related Methods

There is a large variety of interview and related methods that also can be used to collect self-report data. These can be categorized along several dimensions: level of structure of the interview, number of respondents involved (one vs. two or more), and use of subject narrative (historical or anecdotal methods). In addition, these types of measures are usually qualitative (e.g., focus groups, in-depth/unstructured interviews, ethnographic interviews) as opposed to quantitative (e.g., structured interviews and questionnaires) in
nature. This chapter provides an overview of some of these qualitative methods, but the construction of these methods and the analysis of the data generated from qualitative methods are quite complicated and outside the scope of this chapter. For further information on qualitative methods and data analysis, see Strauss and Corbin (1998) [10].

Level of Structure

Unstructured interviews (also known as "in-depth" interviews [11]) contain very little organization; the developers of unstructured interviews may have only a general idea of what sort of information they need or they may wish to allow the respondents to develop their responses with minimal interference. Unstructured interviews often resemble conversations, proceeding from a very general question to more specific ones (the latter dependent upon responses to the general question). They are advantageous because they produce data that reflect an exact accounting of what the respondent has said and can elicit important information that the interviewer had not considered before the interview. However, they suffer from a number of limitations. An important one is limited reproducibility; that is, the same interview, conducted twice with the same subject, can yield quite different results due to variations in the circumstances of the interview (including, but not limited to, the influence of unintended responses by the interviewer) [1]. Other disadvantages include the potential for digressions by the respondent that can cause this type of interview to be excessively time-consuming, complexities of coding and categorization of responses, and difficulty generalizing responses to the reference population (as unstructured surveys typically are performed on relatively small numbers of subjects). An example of an unstructured interview can be found in the work of Cohen et al., who studied patients' perceptions of the psychological impact of isolation in the setting of bone marrow transplantation; the interview began with the question "What was it like to have bone marrow transplantation?" [11]. Another type of unstructured interview, often found in the anthropological literature, is termed "ethnographic." With ethnographic methods, there is even less structure than with traditional unstructured interviews, as the process begins with the interviewer simply listening. Perhaps the best known example of a medical ethnographic study can be found in the book The Spirit Catches You and You Fall Down [12], which describes the horrific experiences of a young Hmong immigrant child and her American doctors, caused by the collision of their vastly differing cultural views about illness and medical care.

Sometimes investigators may prepare a topic guide or a list of questions of interest, but the respondents are free to respond in any way they choose. Interviews of this nature are termed "semistructured" and can be useful when there is concern about imposition of bias or constraint of potential responses. Typically, in a semistructured interview, follow-up questions are simple probes, such as "tell me more," but occasionally they may be more complex. Because the questions contained in the interview are not fully articulated before the interview, interviewers using these methods must be thoroughly trained [13]. Semistructured interviews have been used in a number of biomedical and health education studies. For example, this methodology has been used to ascertain cancer patients' views about disclosing information to their families [14] and to evaluate the consumption and perceived usefulness of nutritional supplements among adolescents [15].

As the name implies, a structured interview delineates the questions in advance, usually with the aid of a written questionnaire or other instrument [11]. This approach provides more uniformity than is possible with a semistructured or unstructured interview, but it lacks some of their advantages. Probably the best-known examples of highly structured interviews are polls, where the respondent's choices are strictly limited. Although polls are most familiar in political contexts, they also can be used in medical research aimed at, for example, eliciting information about patient preferences regarding types of care or provider characteristics. A greater degree of structure generally is appropriate when specific hypotheses are involved and when
the field of study is well developed. A lesser degree of structure is more appropriate earlier in the development of a field of knowledge or when the particular research is highly exploratory.

Number of Respondents

While the traditional interview typically entails a one-on-one interaction between interviewer and an interviewee (respondent), the joint interview involves two (or sometimes several) individuals who know each other, commonly a couple or a family [16]. Joint interviews differ from focus group methods (described below), where those being interviewed may be strangers. They have value in survey research because different individuals may have very different perspectives that may be illuminated by the interaction between or among them. These different perspectives, in turn, may provide the researcher with greater insight into the problem at hand; however, to accomplish this objective, the interviewer must be able to prevent one respondent from dominating the discussion. Joint interviews have been used to study family reactions to youth suicide [17] and to study reliability of reports of pediatric adherence to HIV medication by interviewing both patients and their caregivers [18]. Note that the term "joint interview" sometimes is used when there are two interviewers, rather than two subjects. This approach can be used as a vehicle for interviewer training and for determination of inter-rater reliability, but it also can be used to provide better answers to health-care questions, as when a psychiatrist and an internist jointly interview a patient to obtain information from varying perspectives [19].

In a focus group, typically four or more individuals (usually a fairly homogeneous group) collectively discuss an issue, guided by a moderator. Focus groups are useful for exploring a particular issue in depth. However, to provide useful information, members of the focus group must be properly selected. In addition, moderators must be matched well to the subjects, they must know the subject matter very well, they must be able to elicit information from those who do not offer it spontaneously [20], and (as in the case of the joint interview) they must have sufficient skill to ensure that one member of the group does not dominate the discussion. Focus groups have been used in medical research to uncover attitudes about a particular illness or difficulty. For example, Quatromoni and colleagues used focus groups to explore the attitudes toward, and knowledge about, diabetes among Caribbean-American women [21], whereas Hicks et al. used focus groups to explore ethical problems faced by medical students [22].

Narrative Methods

Life Histories, Oral Histories, and Critical Incidents: Life histories are narrative self-disclosures about personal life experiences, typically recounted orally or in writing in chronological sequence [1]. They commonly are used as an ethnographic tool for identifying and elucidating cultural patterns, but the technique also can be of value for eliciting the experience of patterns and meanings of health care in populations of interest. Oral histories are similar to life histories, but they focus on personal recollections of thematic events rather than on individual life stories. The critical incident technique, pioneered by Flanagan [23] in the mid-1950s, is widely used in many areas of health sciences and health sciences education. More focused than life or oral history methods, the critical incident technique requires respondents to identify and judge past behaviors and related factors that have contributed to their success or failure in accomplishing some outcome of interest. The critical incident method has been used to explore such wide-ranging topics as adverse reactions to sedation among children [24], attitudes of third-year medical students toward becoming physicians [25], and reasons why physicians changed their areas of clinical practice [26].

Diaries: A diary is not technically an interview, as no one is asking questions. Nonetheless, because diaries have some similarities with interview methods, sometimes they are classified with them. A diary is a written
record kept by the respondent, usually over a fairly lengthy period of time. Diaries may have any degree of structure or content; for example, in a study of diet, a diary might include only what the respondent ate each day. On the other hand, in a study of reactions to medication, the diary might include any reactions that a patient may have experienced after taking the medication. If subjects are not literate, diaries may need to be orally recorded. Diaries have been used in clinical research to describe somnolence syndrome in patients after undergoing cranial radiotherapy [27], to measure morbidity experienced by children at home [28], and for improving heart failure recognition after intervention [29]; the methodology has been particularly useful for monitoring symptoms in individual patients in the setting of "N of 1" randomized clinical trials [30] (see Chap. 5).

Think-Aloud Methods: With think-aloud methods, respondents are asked to dictate their thoughts into a recorder while they are trying to solve a problem or make a decision. These methods produce inventories of decisions as they occur in context [1]. One fundamental aspect of think-aloud methods that differentiates them from other approaches is that they are concurrent with the process involved; that is, information is gathered while active reasoning is taking place. Think-aloud methods have been used to examine nurses' reasoning and decision-making processes [31] and have been shown to produce useful information in hospital settings [32]. For further information about this approach, the reader is referred to the seminal writings of Ericsson and Simon (1993) [33].

Making the Choice: Questionnaires Versus Interviews

This choice is, in some ways, a false one. Similar questions may be asked in interviews and questionnaires, and as noted above, interviews may be guided by written questionnaires. Either approach may be relatively structured or unstructured. There are even questionnaires that may be completed by couples or groups. Nevertheless, these methods differ in certain important respects. As noted, questionnaires tend to be more structured; some forms of interview, such as those conducted with focus groups, cannot be conducted as a questionnaire and require a trained moderator. In addition, some individuals (e.g., young children, stroke patients, nonnative speakers) may be more comfortable with spoken than with written English and may have a diminished ability to read, which would limit their ability to complete a paper and pencil questionnaire. These factors notwithstanding, some types of questions, particularly those that are relatively complex, are better suited to questionnaires, particularly when skip patterns are clear. (The skip pattern refers to the idea that some questions will be passed over appropriately depending on answers to earlier questions or when the questions do not apply to the respondent.) For example, in a questionnaire about general health, women might answer questions on topics such as menstruation and pregnancy, whereas men would not answer these questions. In addition, because it takes less time to read a question than to speak it, questionnaires can contain more items, yet be completed within the same amount of time as an interview covering fewer items. Finally, self-completed questionnaires may be viewed as less intrusive than face-to-face interviews. Thus, the choice is a complex process, and a variety of factors must be weighed.

When Is a New Self-Report Measure Needed?

Creating a new self-report measure entails considerable time and effort for item construction and for pilot testing, refinement, and validation. Before undertaking such a project, it makes sense to be sure it is necessary to do so. As noted above, answers to some questions can be obtained through biophysiological methods or through direct observation, and some cannot. Should the investigator decide that answers to a research question can be obtained only through use of a self-report measure, he or she should first
determine whether a suitable measure already exists. (The Internet site http://www.med.yale.edu/library/reference/publications/tests.html provides directories of tests and measures in medicine, psychology, and other fields; other good sources are Tests in Print [34], the Mental Measurements Yearbook [35], and the Directory of Unpublished Experimental Mental Measures [36].) Should an existing measure be selected (even if widely used and psychometrically sound in other populations), the investigator should ensure that it has been successfully employed and, optimally, validated in the population under study. If an appropriate preexisting measure cannot be identified, it may be possible to identify two (or more) measures that together may serve the needs of the study, though the investigator should be aware that combining multiple measures (or rewording items) can impact the psychometric properties of their constituent parts.

Sources of Items

The first source of items for a self-report measure is the existing literature, which, as noted, includes existing tests and measures. In some cases, there may be a strong conceptual basis for a set of questions, in which case the theoretical or discursive literature may be helpful for item generation.

An additional source of items is observation and interview. One profitable long-term research strategy is to begin with relatively qualitative methods (such as unstructured interviews or observation), administered among relatively small samples, and use the findings obtained with these methods to develop more structured forms that can be administered to significantly larger samples. On the other hand, unexpected responses to a highly structured method may provide the impetus to developing less structured surveys that can further explore those areas.

Structuring Questions: Key Points

The Respondents' Reading Level: When developing a questionnaire, the potential respondents' reading level and related characteristics must be kept in mind. How educated will they be? In which languages will they be fluent? If subjects are excluded who are not fluent in the language used in the questionnaire, how will lack of fluency bias the sample? Answers to all of these questions will vary by sample and by location. If, for example, an investigator is surveying a group of professionals (e.g., doctors or nurses) in the United States (USA), England, or in another country in which the native language is English, it probably is safe to assume that the respondents will have a reasonable command of English as well as a high level of education. On the other hand, if patients are being surveyed from among a heterogeneous population where geographic variations in language exist, it must be assumed that the patients' language proficiency in the country's primary language (and their use of alternative languages) will vary by location and that at least some may have little formal education. These assumptions can be examined by administering various tests of reading level. If reading level is low, alternative formats can be used including auditory or pictorial methods. For example, pain scales exist that use faces representing different levels of pain [37]. These can be particularly useful with young children or with illiterate respondents. (Issues regarding need for and methods of translating questionnaires are discussed below.)

Clarity: Not only must questions be readable by the target population, they also must be clearly framed to render the survey process as simple as possible for the respondent. It is very common to assume that a question that is clear to the investigator will be clear to others. However, this often is not the case. The best route to assess clarity is thorough pilot testing. Questions that are unclear may be skipped by the respondent or, worse, may be answered in unexpected ways. Unlike readability, lack of clarity affects respondents at all levels of education and language proficiency, although it may be more problematic at lower levels. Ironically, sometimes it can be more
problematic at higher levels of proficiency, as readers may overinterpret the questions. Lack of clarity can arise from the use of vague or uncommon words whose meaning is imprecise and not evident in context. However, even common words such as "assist," "require," and "sufficient" may be misunderstood [38]. The respondent's perception of clarity will depend greatly on the population being surveyed. For example, if the population comprises medical professionals, it may be clearer to use a less common word because, often, the less common word is more precise. For example, the choice between "abdomen" and "stomach" might depend on whether the survey is of medical professionals (for whom the former term is more precise) or the general population (for whom it may be obscure). Vague words often are found in the response options associated with the questions. For example, when asking about the frequency with which a subject does something, words like "regularly" and "occasionally" are vague; it would be better to specify a frequency (e.g., "three times a week"). Other common vague words are "sometimes," "often," "most of the time," and "rarely." Clarity also can be negatively impacted by ambiguity. Could a word, a sentence, or a question mean more than one thing within a given context? For example, if respondents are asked about how much money they made in the last year, is the question soliciting information about "before-tax" or "after-tax" income? Does the question imply individual income, household income, or family income? If the latter, does the term include individuals not living with the family who contribute financially or individuals living with the household who are not family members? Should unearned income, illegal income, odd jobs, capital gains, etc., be included? Complex questions such as these may be better asked as several questions [39]. Ambiguity also can arise when pronouns are used in unclear ways [40]. Consider, for example, being asked to agree or disagree with the statement: "Doctors and nurses must educate patients. Otherwise, they will be at risk." Is it the doctors, the nurses, or the patients who will be at risk? To ensure clarity, it may be helpful to operationally define terms within the survey process [39]. However, if definitions of terms are provided, they should be provided to all respondents, not only to those who ask for them. Fowler [39] provides a particularly good example of an unclear question of this nature, in which respondents were asked how often they visited doctors. Those who asked for clarification were told that "doctors" included psychiatrists, ophthalmologists, and anyone else with a medical degree, whereas those who did not seek clarification may have excluded psychiatrists and ophthalmologists, or may have included individuals without medical degrees (e.g., psychologists, nurses, individuals trained in alternative medicine who did not have MD or similar degrees), rendering interpretation of these data very difficult.

Avoiding Leading Questions: A leading question is one that guides a respondent's answers and represents a significant source of bias in any questionnaire or interview. This can be deliberate or accidental and can occur in a single question or in a series of questions. For example, "Do you smoke, even though you know it causes cancer and many other health problems?" is a leading question framed within a single question. Similarly, if the respondent is first asked questions about the many dangers of smoking, and these questions are followed with one that asks the respondents if they smoke, different answers may be obtained than if the question about the respondents' smoking history had been posed without the initial background questions. More subtle leading questions include those that start with negative wording (e.g., "Don't you agree that ...?" rather than "Do you agree or disagree that ...?") [38].

Avoiding Double-Barreled Questions: A double-barreled (or multibarreled) question is one that combines multiple questions. For example, a subject may be asked to respond to the statement "I exercise regularly and get plenty of sleep." If the respondent answers
affirmatively, it will not be clear whether he or she exercises, sleeps adequately, or does both. A negative response is similarly uninterpretable [38]. A more subtle double barrel is a question that incorporates a particular reason, for example, "I support civil rights because discrimination is a crime against God." Such a question may lead to confusion among individuals who support civil rights for other reasons [40].

Question Order: There are several universal criteria that must be met for proper ordering of questions. Below is a guide:

• Group similar questions so that respondents can remain focused on one area.
• When testing ability, arrange items from easy to difficult to build confidence.
• Arrange items from interesting to dull so that respondents do not stop answering questions.
• As noted below (see section "Asking About Sensitive Information"), if the survey includes questions that are potentially sensitive, these are best asked after relatively neutral questions to increase respondent comfort level.
• Arrange items from general to specific to avoid biasing the answers. For example, if querying patients about their general and specific experiences in a hospital, the general question should be asked first; otherwise, respondents may answer the general question as the sum of the specific questions, ignoring factors that were not included in the specific questions (even if those factors were important to the respondents).
• Ideally, all questions should apply to each respondent. When this is not possible, skip questions or conditional logic should be used to guide respondents through the survey so that they are not required to answer irrelevant items or sections. Alternatively, a "not applicable" category can be included as a response option to avoid confusion. "Not applicable" is not equivalent to "no opinion"; rather, it indicates that the question does not apply to the respondent (e.g., questions about complications during pregnancy apply neither to men nor to women who have never been pregnant).
• Translation issues: If large numbers of potential respondents are not fluent in the primary language spoken by the population to which results are to be extrapolated (e.g., English in the USA), excluding those individuals may introduce sampling bias. However, including them, but asking questions only in English, may bias their responses. Under these circumstances, the survey may need to be translated. Preparatory to this process, it will be necessary to ascertain the primary languages spoken by members of the sample. Then, for each language spoken by large numbers of the sample, questions and answer choices will need to be carefully translated. (If self-reported data are to be collected via an interview rather than by questionnaire, it will be necessary to recruit interviewers who are fluent in these various languages.) After translation, the material will need to be "back-translated" to identify potential linguistic problems. However, even these steps may not suffice. Not all words and phrases have exact equivalents in other languages, and some concepts vary strongly from culture to culture. Chang and colleagues [41] investigated premenstrual syndrome in Chinese-American women. Using a questionnaire that had been translated and back-translated, they asked bilingual women to respond to both the Chinese and English versions. While intraclass correlations indicated moderate to high levels of equivalence for total scores and scales, some questions showed very little consistency between languages.
• Asking the same question in more than one way: Rephrasing a question also can help to minimize ambiguity and avoid honest or dishonest errors. As an example, studies have found that respondents tend to provide more precise and accurate information when they are asked for birth dates compared to when they are asked to state their ages [42]. This phenomenon may be due to intentional mistruth or to poor recall. Thus, those collecting self-report data often will ask for both the respondent's
birth date and his or her age. However, it is important to be selective, as asking all questions in multiple ways not only will make for a very long survey, it will invariably irritate the respondents. Therefore, it is best to include intentionally redundant items only for key areas and under conditions where ambiguity is difficult to avoid.

Structuring Potential Responses

There are two broad types of questions that can be included in a self-report measure: open-ended (also known as "open") questions and closed-ended (also known as "closed") questions. These differ according to who (the developer of the survey or the respondent) is responsible for defining possible answers to the questions.

Open-Ended Questions: Open-ended questions are those for which the respondent supplies the answer. These are subcategorized into (1) numeric open-ended questions that may ask for responses expressed as quantities (e.g., "How much out-of-pocket money did you spend on medications during the past week?" "How much weight did you gain during the last year?" "How old were you when you had your first heart attack?") versus (2) free text questions (sometimes called "verbatims"). The latter, often seen at the end of surveys, ask about experiences or satisfaction with services (e.g., "Do you have any other comments you'd like to share?"). Open-ended questions are the question-level equivalent of unstructured surveys and share some of the same problems (in particular, they may be difficult to code). The chief advantage of open-ended questions is that they do not constrain the range of possible responses. Indeed, they permit respondents to freely respond to the question, allowing them to describe their feelings about, attitudes toward, and understanding of the topic at hand. As such, they potentially can generate more information about the topic than other formats. Open-ended responses also tend to reduce the response error associated with answers supplied by others (i.e., the survey developer). But this approach has its perils. If a survey includes a question such as "When did you move to New York?" then, given an open-ended format, respondents may name a year, a date, or may refer to a time in their lives (e.g., "right after I got married") or to the history of the area (e.g., "just before the big blackout"). For a question such as this, it is better to ask for a specific type of response (e.g., either "How old were you when you moved to New York?" or "In what year did you move to New York?") because, under these circumstances, it is unlikely that any response given would be unduly constrained.

Closed-Ended Questions: Closed-ended questions are those in which the respondent is asked to choose from a preexisting set of response options that have been generated by the individuals developing the survey. Closed-ended questions, therefore, limit the answers that the respondent can provide. Their primary advantages are that they are easier to code and analyze, provide more specific and uniform information for a given question, and generally take less time to answer than open-ended questions. Closed-ended questions can be subclassified into those calling for dichotomous responses versus polychotomous (multiple choice) responses. Dichotomous responses are those that have only two possible values, most commonly, "yes" or "no." Examples of questions that may generate such responses are legion ("Did the patient die?" "Do you have a physician?" "Have you ever had surgery?"). When items are framed as statements rather than as questions, typical dichotomous responses include "true/false" or "agree/disagree" response options. Items calling for dichotomous responses sometimes are combined into scales that can yield an aggregate score. One well-known example is Thurstone scaling. Thurstone scaling refers not to a method of soliciting responses to single unrelated items, but to a method of constructing and scaling several related items. The essential idea is to construct several dichotomous statements about a respondent's attitudes, each of which may be answered "Agree" or "Disagree." This method of scaling can be used to classify respondents with different levels of an attribute [40].
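As a rough illustration of how agree/disagree items can yield an aggregate score, the sketch below assumes a commonly described Thurstone convention: each statement carries a pre-assigned scale value, and a respondent's score is the median scale value of the statements he or she endorses. The scoring rule, item labels, and scale values are illustrative assumptions; the chapter itself does not specify them.

```python
from statistics import median

# Hypothetical scale values for four agree/disagree statements.
# In practice such values would be derived from a panel of judges;
# these numbers are invented purely for illustration.
SCALE_VALUES = {"item_1": 1.5, "item_2": 4.0, "item_3": 7.0, "item_4": 10.5}


def thurstone_score(responses, scale_values=SCALE_VALUES):
    """Return the median scale value of the endorsed statements.

    responses maps an item label to True (Agree) or False (Disagree).
    Returns None when no statement was endorsed, because no score
    can be computed in that case.
    """
    unknown = set(responses) - set(scale_values)
    if unknown:
        raise ValueError(f"No scale value for items: {sorted(unknown)}")
    endorsed = [scale_values[item] for item, agreed in responses.items() if agreed]
    return median(endorsed) if endorsed else None


respondent = {"item_1": False, "item_2": True, "item_3": True, "item_4": False}
print(thurstone_score(respondent))  # 5.5, the median of 4.0 and 7.0
```

Using the median rather than a simple count of "Agree" answers is a design choice of this sketch: it places the respondent on the same continuum as the statements themselves, which is one way such a scale could classify respondents by level of the attribute.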
For example, if the area of inquiry entailed nurses' attitudes about doctors' orders, the following series of items might be presented:

(a) A nurse must always follow every order that a doctor gives, even if he/she thinks it is wrong.
    Agree    Disagree
(b) A nurse should almost always follow a doctor's orders, but may raise questions on rare occasions.
    Agree    Disagree
(c) A nurse should generally follow a doctor's orders, but should also voice his/her opinions about those orders.
    Agree    Disagree
(d) Nurses should be equal partners in all decisions about patient care and should regard doctors' orders as advice.
    Agree    Disagree

In contrast to questions soliciting dichotomous responses, multiple choice questions include three or more response options. These, in turn, can be differentiated into questions calling for nominal-level responses and those that call for ordinal responses.

As noted in Chap. 3, nominal variables are simply names; they have no order. There are two primary types of questions that call for nominal responses. The first includes items for which the respondent can provide only one answer, as the available response options are mutually exclusive. Examples include questions about demographic characteristics (e.g., religion, gender), other characteristics such as hair color and blood type, and so on. The second type includes questions where the respondent can select more than one response (i.e., "choose all that apply" questions). The latter may provide very useful information but pose data entry and analytic challenges that need to be considered when designing the survey instrument. To counter these, special techniques are needed. For example, if one is interested in learning about why patients have gone to the hospital, it is advisable to divide the main question into two subquestions: the first asking the respondent whether he or she has been to the hospital and (if answered in the affirmative) a follow-up question asking about reasons for the hospitalization, with responses entered into separate columns of a spreadsheet.

Ordinal responses are those that have a meaningful sequence, but no fixed distances between the levels of the sequence. Questions about subjective responses are often ordinal. For example, responses to a question such as "How much pain are you in?" could range from "none," to "a little," to "some," to "a lot," to "excruciating." They are considered to be ordinal rather than interval because while they arguably proceed from least to most pain, it is not at all clear whether the difference between, for example, "none" and "a little" is larger, smaller, or the same as the difference between, for example, "a lot" and "excruciating."

As noted, ordinal response scales typically include a number of possible answers. Usually, an odd number of responses (typically five or seven) is chosen to allow the respondent a neutral or midrange option, though there is no consensus about how many choices to include. There are a variety of different ordinal response scales. The most common are given below:

Traditional Ordinal Rating Scales: These rating scales ask the respondent to evaluate an attribute such as performance by checking or circling one of several ordered choices. Rating scales often are used to measure the direction and intensity of attitude toward the target attribute. An example of a traditional rating scale is given below:

Excellent    Good    Fair    Poor    Very Poor

Likert Scales represent another traditional type of rating scale that asks the respondent to indicate his or her level of agreement with a given statement, with the center of the scale typically representing a neutral point [40]. Likert scales are most frequently used for items that measure opinion and take the general form shown below:

Strongly Disagree    Disagree    Neither Agree Nor Disagree    Agree    Strongly Agree
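The data-entry approach described earlier for "choose all that apply" questions (a yes/no screening item followed by one spreadsheet column per possible reason) might be sketched as follows. The reason categories, the column names, and the use of an empty value for "not applicable" are assumptions made for illustration only.

```python
# Hypothetical reasons for hospitalization; a real survey would define its own list.
REASONS = ["chest_pain", "injury", "surgery", "childbirth", "other"]


def encode_hospital_item(hospitalized, reasons_selected=()):
    """Turn one respondent's answers into a flat row with one column per reason.

    hospitalized: answer to the screening question (True or False).
    reasons_selected: reasons ticked on the follow-up question.
    Reason columns hold 1 or 0 for hospitalized respondents and None
    (not applicable) for everyone else.
    """
    selected = set(reasons_selected)
    unknown = selected - set(REASONS)
    if unknown:
        raise ValueError(f"Unrecognized reasons: {sorted(unknown)}")
    if not hospitalized and selected:
        raise ValueError("Reasons given by a respondent who was not hospitalized")

    row = {"hospitalized": int(hospitalized)}
    for reason in REASONS:
        row[f"reason_{reason}"] = int(reason in selected) if hospitalized else None
    return row


print(encode_hospital_item(True, ["injury", "surgery"]))
print(encode_hospital_item(False))
```

Keeping "not applicable" distinct from 0 means that respondents who were never hospitalized are not counted as having denied each reason when the columns are later tabulated.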

Semantic Differential Scales: Semantic differential scales measure the respondent's reactions to stimulus words and concepts using rating scales with contrasting adjectives at each end [43]. For example, one might ask a question where the polar extremes are "good" and "bad," with gradations between these extremes provided as response options.

Good  __  __  __  __  __  __  __  Bad
      3   2   1   0   1   2   3

The Behaviorally Anchored Rating Scale (BARS) is a complex approach to performance appraisal that combines the elements of traditional rating scales with critical incident methods. It was developed to counter concerns about subjectivity associated with traditional rating scales and, thus, to facilitate relatively more accurate ratings of target behaviors or performance versus other approaches. A BARS is constructed by compiling examples of ineffective and effective behaviors (usually based on the consensus of experts), converting these behaviors into performance dimensions, and identifying multiple incidents per dimension to form a numerical scale in which each item is associated with a particular type of behavior [44]. Respondents may rate their degree of agreement with each item by checking or circling the appropriate level of the accompanying rating scale. Shown below is a 7-point BARS that could be used to evaluate an academic faculty member's research productivity in terms of number and types of publications produced during a given period (a dimension of interest to faculty leaders). Note that each scale value (1 = extremely poor performance, 2 = very poor performance, 3 = somewhat poor performance, 4 = neither good nor poor performance, 5 = somewhat good performance, 6 = very good performance, 7 = outstanding performance) is anchored in specific behaviors related to the dimension of interest. Unlike traditional rating scales, which are presented horizontally, BARS typically is arrayed vertically, comprising between 5 and 9 scale points (values); when the number of scale values is odd, the midpoint of the scale typically represents a neutral response (as is the case in many rating scales).

During the past year, Dr. Heartly has:

7  Outstanding performance: Independently published (as sole or first author) two or more research manuscripts in top-tier journals, with others in draft
6  Very good performance: First authored one research manuscript in a well-regarded peer-reviewed journal, with minimal input from senior faculty
5  Somewhat good performance: Coauthored one or more published research manuscripts in a peer-reviewed journal, with support from senior faculty members
4  Neither good nor poor performance: Presented a first-authored abstract at a scientific meeting but has not completed the manuscript
3  Somewhat poor performance: Actively coauthored a research abstract, but provided very limited assistance in manuscript development
2  Very poor performance: Provided minimal contribution as coauthor on a research abstract but no participation in manuscript development
1  Extremely poor performance: Made no progress in developing scientific manuscripts or abstracts, due to competing priorities or interests

Visual Analog Scales: Visual analog scales (VAS) are similar to Likert scales or semantic differential scales, except, rather than checking a box or circling a predefined response option, the respondents indicate their responses by making a mark (denoted here by the "x") along a line anchored by terms describing opposite values of an attribute, as shown in the hypothetical example below:

Good ____________x______________________ Bad

VAS have two advantages: they are very sensitive and, in cases where the measure is repeated over time, the respondent will not be able to intentionally duplicate his or her previous response. However, different individuals may encode physical space differently. Thus, a mark halfway between good and bad
may not mean the same thing to all respondents. VAS have been used commonly for the clinical measurement of chronic and acute postoperative pain. In one study designed to formally assess its psychometric performance in the latter setting, DeLoach and coworkers [45] administered the VAS to 60 patients in the immediate postoperative period, using the scale anchors "no pain" and "worst imaginable pain." The authors found good correlations between the VAS and a traditional numeric measure, though individual VAS estimates tended to be relatively imprecise.

Rank Order Scales: With this form of measure, respondents are asked to rank alternatives in order, rather than rate them on a scale. For example, if members of a medical school class all had the same professors in one semester, they could be asked to grade them in relation to one another, as shown below:

Please rank each of your professors from best to worst, where 1 = best and 5 = worst:
Adams _____    Bassett _____    Cochran _____    Davis _____    Edwards _____

Advantages and Disadvantages of Categorizing Responses

Many times, responses that are fundamentally continuous in nature are transformed into categorical responses by the design of the questionnaire. Instead of asking "How old are you?" a respondent can be asked "Are you: (a) under 18, (b) 19–24, (c) 25–34, (d) 35–44, (e) 45–54, (f) 55–64 or (g) over 65?" This approach, however, has several important drawbacks. First, categorical responses cannot be reconverted into continuous responses. Second, it can limit comparisons with other questionnaires that utilize different breaks between categories. Third, breaks must be meaningful, with variations occurring only between those that have been included. Sometimes the survey developer may choose breaks that are inappropriate. For example, if, after data collection, it is determined that most respondents are over age 65, it is not possible to reverse course and redo the survey adding additional breaks for 65–74 and 75–84. Nonetheless, there can be advantages to categorical scaling. The primary advantage is that some respondents may be more willing or able to answer some questions in categorical form than in numerical form. This is particularly true of income questions, where respondents may not know their precise income, but they will know it approximately. (Ironically, self-reported age follows an opposite pattern as individuals appear to be better able and more willing to give their birthdates than their ages.)

Asking About Sensitive Information

What is sensitive information? The answer to this question depends on the respondent, because what is sensitive to one person is not to another. In general, questions about stigmatized or illegal behaviors, or unusual beliefs and opinions will be judged to be more sensitive by those who engage in those behaviors or hold those beliefs than by those who do not [39]. Highly personal questions (e.g., income, weight, some health conditions) or questions about traumatic events (e.g., rape or child abuse, or other forms of abuse) also may be viewed as sensitive. When asking about sensitive information, "warm up" questions often are used to set the respondent at ease, thereby increasing the likelihood that the sensitive questions will be answered. It also may be useful to include a "cool-down" or "cool-off" phase that can reduce the possible stress induced by the sensitive questions. Typical warm-up questions include those about nonsensitive demographics (e.g., county of residence, birth order); cool-down questions often are quite trivial (e.g., pet ownership, taste in music, food preferences, and similar items).

Sensitive questions can be uncomfortable to the respondent and may raise ethical concerns. When included within a research protocol, the investigator may need to demonstrate to his or her institutional review board (IRB) the need for such questions and provide assurances that the respondent will not be compelled to answer them. When asking highly sensitive questions, interviewer training is essential, and interviewers may need to be aware of referral services that can be
offered if the respondent reveals high-risk behavior, for example, being involved in an abusive relationship, being suicidal, or using illicit drugs. In addition, becoming aware of certain types of behavior via self-report may impose ethical responsibilities on certain classes of professionals. For example, clinical psychologists have a duty to report certain behaviors. Clinical researchers typically are obligated to report nonadherence to (or adverse outcomes associated with) treatment. More generally, anyone who is a member of a group that has licensure will need to investigate his or her own specific requirements for such disclosure.

Modes of Administration

Self-reported information can be obtained via a variety of methods. These include face-to-face interviews, mailed questionnaires, e-mail and web-based surveys, telephone surveys, computer-assisted response systems, and randomized response methods.

Face-to-Face Interviews

The chief advantages of face-to-face administration are that response rates are optimized and that it provides an opportunity for the interviewer to clarify confusing items. Disadvantages include expense (both time and money), the possibility that interviewer behaviors may influence (bias) responses, and the fact that some individuals may be reluctant to answer some questions in the presence of an interviewer due to embarrassment (especially with sensitive items) or concerns about revealing illegal behavior.

Mail (Postal) Surveys

Administering a questionnaire by mail is relatively inexpensive and helps to avoid interviewer bias. However, unless care is taken, response rates are likely to be suboptimal (i.e., <85%) [46], and respondents may not be a random sample of any particular population, precluding generalizability of conclusions. These limitations apply even to mail surveys that have been published in medical journals, where average response rates have been shown to be approximately 60% [47].

E-mail and Web-Based Surveys

E-mail and web-based surveys are less costly to administer than traditional postal mail surveys, but have several limitations. Anonymity can be difficult to ensure, response rates may be low, and responses may not be random (often, there is no way of knowing exactly who is answering the questions). Response rates with Internet surveys have been found to differ from those obtained by postal methods, depending on the group surveyed. Younger individuals tend to respond more frequently than older individuals to e-mail, whereas older individuals respond more frequently to traditional mail [48]; in one study, medical doctors were found to respond more frequently to traditional mail than to Internet-based methods [49].

Telephone Surveys

Telephone surveys are less costly than face-to-face interviews, but the telephone-based approach may lead to significant nonresponse. Assuming that the subject can be reached, the lack of personal contact between the interviewer and respondent may increase the likelihood that the latter will decline the interview. In addition, in the current era, many potential respondents lack landline telephones, and some have multiple telephones, creating difficulties in achieving a random sample. A recent study using telephone survey methodology found response rates of only 39% [50].

Computer-Assisted Interviews (CAI)

The availability of computers over the last several decades has created new methods of administering and responding to surveys. Among the most common are the Computer-Assisted Personal
Interview (CAPI), the Computer-Assisted Telephone Interview (CATI), and the Audio Computer-Assisted Self-Interview (ACASI). With CAPI, the interviewer typically uses a computer screen to read questions to respondents in the setting of a face-to-face interview. With CATI, the interviewer follows a script provided by a software application to ask questions by telephone. Depending on the system used, the respondent may have the options of interacting with a live interviewer or listening to a recorded interview and may answer questions by voice or touch phone mechanisms. CATI also provides the advantages of automating initial calls and call-backs and keeping notes on the status of the interviews. With ACASI, the respondent uses a headphone connected to a computer to listen to preprogrammed questions and enters his or her responses directly into the computer via a keyboard or keypad. If respondents have limited computer literacy, these systems can be engineered to employ a touch screen mechanism whereby the respondent simply pushes a patch of a certain color. Because absence of an interviewer protects privacy (broadly defined as control of access of oneself to others), some respondents may feel more comfortable answering sensitive questions in this format. Indeed, studies have shown that respondents are more likely to admit use of illicit drugs and to report sensitive or stigmatized sexual behaviors with ACASI than when interacting with an interviewer in person or by telephone [51, 52]. CAI methods have distinct advantages over traditional paper-and-pencil surveys. They improve turnaround time, avoid problems associated with skip patterns and branching in complex surveys, and facilitate entry validation and internal consistency checks. They also minimize (or entirely eliminate) the requirement for secondary data entry and cleaning, further improving data quality by avoiding additional keystroke errors [53]. The primary limitation of CAI methods is their relatively high initial setup cost. In medicine, computer-assisted methods have been shown to be of value for obtaining information from stroke victims [54] or others with limited ability to use a pen. They also have been used to improve patient care in the setting of HIV infection [55].

Randomized Response

Randomized response is a useful method of assessing the rate of stigmatized behaviors. In brief, respondents flip a coin (in private) or use some other randomizing device to determine whether they are about to answer an innocuous question (e.g., "Is today Monday?") or a sensitive one (e.g., "Have you ever used heroin?"). They report their answers ("yes"/"no") without the investigator being aware which question the respondent was asked, thus protecting the latter's privacy. At the conclusion of the assessment, a statistical algorithm is used to calculate the overall prevalence of the target behavior. Variations on randomized response methods also exist for ordinal and interval level variables. Randomized response methodology has been widely used for highly stigmatized behaviors such as illegal drug use [56] and homosexual sex [57] and has been found to yield more accurate data than direct surveys [58].

Methods for Boosting Response Rates

There is a large literature comprising methods for increasing response rates to surveys, some of which involve paying or providing other incentives to respondents for their participation. Their appropriateness is largely dependent on the population with which the investigator is working as well as the nature and magnitude of the inducement. For example, if participants are members of a low-income, nonprofessional group, offering modest compensation for time and effort would be ethically appropriate and could encourage participation in a survey, whereas offering large sums of money or valuable goods for such participation would be viewed as coercive. Among more advantaged respondents, offering an inducement could backfire (if the respondent viewed the inducement as insulting). For such subjects, a reasonable alternative is to offer money to a charity of the respondent's choice. Other effective methods, frequently adopted in other domains such as marketing but applicable to medical research, include making the survey interesting,
including questions that are relevant to the respondent and keeping the survey short and
simple (KISS). Strategies specific to mail surveys include the use of personalized
questionnaires and/or cover letters that orient the respondent to the purpose and importance
of the study and invite their participation. Additional strategies include the use of colored
ink, first class mail and recorded delivery, stamped return envelopes (or permitting use of
facsimile), contacting participants before sending surveys, maintaining follow-up contact
with participants, and providing nonrespondents with replacement questionnaires when the
initial questionnaires were not readily accessible [59]. In one study, the combined use of
replacement questionnaires and chocolate (the inducement) was found to significantly increase
response rates versus either method alone [60]. Strategies specific to telephone surveys
include allowing the respondent to return the call using a toll-free number and sending
alerts prior to initiation of the survey. (For more possibilities, the reader is referred to
the website www.guidestarco.com/Increasing-survey-response-rates.htm.)

Evaluating Psychometric Properties of a Self-Report Measure

Before a self-report measure can be used with confidence, it must be rigorously evaluated to
determine whether it is psychometrically sound; that is, that it measures the construct of
interest (e.g., quality of life, satisfaction, emotional state of health) accurately in the
population of interest. Such an assessment not only is essential for all newly developed
instruments, it also is important for instruments that have been validated for other
populations. By "accuracy," we mean that the quantitative or qualitative assessment provided
by the instrument should provide as true a measure of the underlying construct as possible.
Unfortunately, all measurement is accompanied by the possibility of error which is either
systematic or random as no data collection technique is perfect. Whenever we measure a
patient characteristic, be it by objective testing or by more subjective methods, the
measurement instrument provides only an estimate of the quantity of interest. By an
"estimate," we mean that the recorded value is not a direct measure of the underlying
quantity of interest or the "true value." For example, if we are measuring the blood pressure
of an individual, the observed value for the systolic pressure may be 124 mmHg. However, the
true value cannot be observed and is equal to the 124 plus or minus some value reflecting
measurement error as well as other sources of error.

Two fundamental components of accuracy, both inversely related to the error of an
observation, are validity and reliability. Physicians and others using self-report measures
for research should have a fundamental understanding of these concepts if they are to form
judgments about the quality of outcomes based on these measures or develop their own
measures. In the setting of tests and measures, validity relates to how well the instrument
measures what it purports to measure and reliability relates to how consistently the
instrument measures whatever it is that it measures. These qualities exist on a continuum
rather than as absolutes, that is, inferences drawn from an instrument are neither valid nor
invalid nor are they reliable or unreliable; rather, they are valid to a certain degree and
reliable to a certain degree for a given population and setting (i.e., are sample dependent).
Together, validity and reliability reflect the ability of the instrument to provide an
accurate quantitative estimate of the characteristic of interest to the researcher.

Validity

Validity has been defined as the degree to which conclusions drawn from the results of any
assessment are "well-grounded or justifiable, being at once relevant and meaningful" [61].
When the term "validity" is applied to measurement, it refers to the extent to which the
instrument measures the actual parameter of interest [62]. Thus, a well-built scale should,
on average, produce readings that permit a meaningful conclusion about a
person's actual weight; a well-constructed measure of clinical depression should yield data
that are useful for drawing meaningful conclusions about the presence and severity of
depressive symptoms; and a properly designed measure of health-related quality of life should
provide responses that are of value for drawing meaningful conclusions about health status or
health utility from the perspective of the patient. In each of these cases, the quality of
the instrument is judged according to the soundness of the conclusions that can be drawn from
the responses that it provides. Therefore, though the term "valid" is commonly used as a
descriptor for various tests and measures, validity, as Cook and Brown have noted, represents
a property of the inference rather than the instrument itself [63]. Because these inferences
are influenced by the circumstances under which the instrument is administered, there is no
such entity as a generically valid instrument. Indeed, all instruments should be validated
for each interpretation, including the specific populations and contexts in which they will
be used. For example, a test that measured knowledge of basic addition and subtraction might
be used to draw valid inferences about mathematics proficiency among first-grade students but
would not be useful for drawing similar inferences about college mathematics majors.
Similarly, a scale that has been validated for one disorder (e.g., depression) would need to
be re-evaluated to establish its validity in the setting of another (e.g., anxiety).
Moreover, an instrument that has been shown to permit valid inferences under research
conditions or in highly selected patients may need further evaluation before use in a general
clinical population [63].

Validation of a measurement instrument is a complex process, in part, because validity
encompasses various dimensions. The most common of these are summarized below:

Face Validity: Face validity ("validity at face value"), also known as "representation
validity," is concerned with how a measurement instrument or procedure appears to be relevant
to a construct, as judged by a potential respondent. It is the simplest type of validity to
gauge and, typically, is assessed early in the validation process. Does the assessment seem
like a reasonable way to gain the information the investigator is attempting to obtain? Does
it look as though it will measure what it is supposed to measure? Does it seem well designed?
[64] For example, the Beck Depression Inventory, which is widely used in clinical medicine,
asks questions about depression; more specifically, it asks about such attributes as sadness,
suicide, and loss of pleasure [65]. It has face validity because these (and other) items are
what most people think of as depression.

Content Validity: Content validity reflects how well the items comprising a measure cover
(sample) the subject of interest or domain. When a domain is well defined, content validity
is relatively easy to ascertain. If the domain is less well defined, ascertainment of content
validity may require having experts in the field review the measure [40]. The content
validity of a test of knowledge of women's health was called into question by comparing the
domains it covered with those covered by a set of curriculum guides [66], and the content
validity of the SF-36 health questionnaire was affirmed by comparing it with the longer
instrument from which its items were drawn [67].

Construct Validity: Construct validity is the degree to which a measure is related to other
measures or attributes, as dictated by theory. It reflects the extent to which the construct
under study (e.g., depression), even if it cannot directly be assessed, has been properly
labeled (operationalized) by the items comprising the measure. In other words, does the
instrument measure what it was designed to measure? Thus, construct validity is a key part of
validity: no instrument has any value unless it satisfies this criterion. Inferences about
construct validity can be evaluated by a variety of methods. A common approach to construct
validation entails assessment of the convergent and divergent (or discriminant) validities of
a measure. Convergent validity indicates that the measure correlates highly with other
measures of similar constructs, whereas
divergent validity indicates that it correlates poorly with measures of other constructs. For
example, we would expect a measure of depression to correlate more highly with measures of
anxiety than with measures of most physical characteristics. Similarly, we would expect
measures of post-traumatic stress disorder to correlate more highly with measures of similar
stressors than with measures of age. A related approach is "known groups" analysis, which
evaluates the extent to which scores on a measure discriminate between individuals known to
possess an attribute versus those who do not. Known groups validity analysis has been used to
provide support for the construct validity of the Pediatric Evaluation of Disability
Inventory by demonstrating different scores among individuals with different levels of
disability [68]; the method also was used to support the construct validity of the
Multidimensional Fatigue Inventory by demonstrating scores consistent with greater fatigue
among patients presenting with chronic fatigue-like symptoms or chronically unwell patients
versus healthy controls [69]. An alternative approach involves the use of factor (exploratory
or confirmatory) analysis or principal components analysis to identify clusters of related
items on a scale. Collectively, these methods are useful for (a) determining how many latent
variables or dimensions underlie a set of items (thereby helping to elucidate or confirm the
structure of the instrument), (b) condensing a relatively larger number of items into a
smaller number of variables to facilitate statistical analysis, and (c) clarifying the
meaning of these variables [39]. As examples, principal components analysis was used to
define two distinct higher-order clusters reflecting mental and physical health from among
the eight scales comprising the Medical Outcomes Study Short Form (SF) 36 [70]; exploratory
factor analysis was used to identify three subdimensions of climate (clarity, challenge,
support) in a work-group climate assessment tool for improving the performance of public
health organizations [71], and confirmatory factor analysis was used to substantiate the
single-factor structure of a mental well-being scale [72].

Criterion Validity: Criterion validity (also known as criterion-related or instrumental
validity) refers to how well the results obtained by an instrument correlate with or predict
some "real world" behavior or other attribute. It estimates the accuracy of the measure by
comparing it with some preexisting indicator that has been demonstrated to measure the same
construct (i.e., a "gold standard"). There are two primary forms of criterion validity:
concurrent and predictive. Concurrent validity is evaluated by comparing two measures in
parallel and determining whether they are concordant. For example, the concurrent validity of
a test of fitness could be defined by determining the extent to which it correlates with
maximum oxygen uptake measured at (or approximately at) the same time [73]. Predictive
validity implies that the measure forecasts an expected result. As examples, a self-report
measure of functioning in the elderly was found to predict mortality [74]; a measure of
readiness to change was used to predict change in drinking behavior in excessive drinkers
[75]; and a measure of adherence to medication instructions was affirmed by predicting blood
pressure 5 years later [9].

Responsiveness to Change: A primary goal of clinical management and target of clinical
investigation is assessment of change over time in a patient's status in response to
treatment. As Portney and Watkins have noted, the use of change scores as a basis for
assessing treatment outcomes is pervasive throughout clinical research [76]. While some
methodologists contend that the sensitivity of an instrument to change (i.e., its
responsiveness) is distinct from validity [77], others argue that responsiveness is, indeed,
an important component of validity [76, 78]. An instrument is considered to be responsive if
it can accurately detect change when (and only when) it has occurred [79]. In other words, it
should produce scores that change in proportion to the change in the patient's status, but
remain stable when the patient is unchanged [76].
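
As an illustration of this definition, the following minimal sketch computes the standardized
response mean (the mean of paired change scores divided by the standard deviation of those
change scores), one commonly used unit-free index of responsiveness. The function name and all
scores are hypothetical and were invented for illustration; they are not drawn from any study
cited in this chapter. Under these assumptions, a responsive instrument would be expected to
yield a large value in a group known to have changed and a value near zero in a group known
to be stable.

```python
from statistics import mean, stdev


def standardized_response_mean(before, after):
    """Mean of paired change scores divided by the SD of those change scores."""
    if len(before) != len(after):
        raise ValueError("before and after must contain one score per subject")
    changes = [a - b for b, a in zip(before, after)]
    sd_change = stdev(changes)  # sample SD; requires at least two subjects
    if sd_change == 0:
        raise ValueError("change scores have zero variance; SRM is undefined")
    return mean(changes) / sd_change


# Hypothetical scores on an illustrative scale (higher = better status).
# "Treated" subjects received an intervention assumed to be efficacious;
# "stable" subjects are assumed to be clinically unchanged.
treated_before = [10, 12, 9, 14, 11, 13]
treated_after = [15, 16, 15, 18, 17, 17]
stable_before = [10, 12, 9, 14, 11, 13]
stable_after = [11, 11, 9, 15, 10, 13]

srm_treated = standardized_response_mean(treated_before, treated_after)
srm_stable = standardized_response_mean(stable_before, stable_after)

print(f"SRM, treated group: {srm_treated:.2f}")
print(f"SRM, stable group:  {srm_stable:.2f}")
```

Because the index is expressed in standard deviation units rather than in scale points, it
can be compared across instruments that use different scoring ranges.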
Two forms of responsiveness are recognized: internal and external [80]. Internal
responsiveness represents the instrument's capacity to detect change from before to after
exposure to an intervention of acknowledged efficacy [81]. Typically, it is evaluated in the
setting of repeated measures designs that incorporate assessments before and after the
intervention in the same individual. These designs can involve a single group of subjects
followed over time (i.e., a treated cohort, where intrasubject change is expected) or include
two groups (including an untreated control where change is unexpected). External
responsiveness refers to the degree to which changes in a measurement correlate with other
putatively related changes in health status [81]. Both forms of responsiveness are influenced
by reliability and scale characteristics. Scales that are unreliable will produce too much
"noise" to allow for determination of meaningful change over time. Scales with too few
response categories may fail to detect all but very large changes. Scales producing "ceiling
effects" (due to restriction at the upper level of the range of possible values) may leave
little room for improvement on subsequent testing just as those producing "floor effects"
(where data cannot take on lower values) will be insensitive to clinical decline even when
there is a worsening of status or functioning. When instruments with varying scaling
characteristics (type, length, directionality, etc.) are compared to determine their relative
responsiveness, unit-free statistical approaches including standardized scores and
comparisons (e.g., effect sizes or standardized response means) must be used. (For an
excellent discussion of these techniques and their interpretation, the reader is referred to
Liang et al. [82] and Angst et al. [83].)

As noted throughout this volume, the validity of any study can be threatened by bias, which
broadly is defined as known or unknown systematic error in the design, sampling, measurement,
or other critical aspects in the conduct of an investigation that can produce distortions of
findings. Unlike a random error, described below, a systematic error consistently affects the
measurement of the variable in the same way each time that the measurement is done. It
provides an incorrect measure of the variable, and the error will be the same for every
subject.

There are several types of bias that specifically affect responses obtained in self-report
measures; some of the most common are listed below. (For a fuller list, the reader is
referred to Aiken and Mardegan [44] and Choi and Pak [38].) Although adequate quantitative
data are not available for purposes of comparison, there is general agreement that the extent
and impact of these biases vary greatly from discipline to discipline and from one population
to another.

Social Desirability Bias: Social desirability bias (sometimes termed "faking good" bias)
refers to the tendency of respondents to answer questions in ways that make them "look good,"
rather than honestly [40]. This "positive response" bias may be of two types: some
respondents may deliberately tell falsehoods in order to appear acceptable to those
conducting the survey, whereas others may have internalized the dishonest response. (The
latter occurs more commonly than generally recognized [84].) The social desirability bias can
compromise most forms of self-report, but its potential impact should be anticipated when
asking about stigmatized behaviors or attitudes (e.g., when questions involve issues of
criminality, violence, or sexual orientation), or when the respondent has reason to believe
that a socially nondesirable response could cause him or her to lose something of critical
value (e.g., a belief by a patient that nonadherence to a health-care provider's instructions
could negatively impact future interactions with that provider). Although it may not be
possible to eradicate this form of bias, the extent of its potential influence can be
examined by embedding, in the self-report measure, an item or two that ask the respondent to
answer a question such as "I have never intentionally told a lie" or "I always know the
difference between right and wrong" or through formal testing. A common test of social
desirability is the Marlowe-Crowne scale [85]; a shorter version
of this scale has been created by Strahan and Gerbasi [86].

Agreement Bias: Agreement bias (also known as "acquiescence bias") is the tendency to say
"yes" or "I agree" to every item regardless of content. It is subtly different from social
desirability bias as agreement bias includes admission to possessing socially undesirable
traits. For example, respondents manifesting agreement bias might respond affirmatively to
the question, "Have you ever used illicit drugs?" whereas those exhibiting social
desirability bias would likely provide the opposite response. The phenomenon is thought to
have multiple causes. First, it has been argued that most respondents desire to be polite and
respectful and, thus, do not wish to disagree with the questioner [87, 88]. Second,
respondents may feel that they have lower standing than the questioner and agree with
questions based on this perceived status differential [89]. Third, respondents may select an
agreeable (but not necessarily truthful) answer to complete the survey as rapidly as possible
[90]. Whatever the cause, agreement bias can be detected (and sometimes resolved) by
including a balance of positively and negatively worded items [91], though care must be taken
to minimize confusion to the respondents.

Faking Bad Bias: In contrast to social desirability (or "faking good") bias, the "faking bad"
bias occurs where failure (in the usual sense) is a goal. In the context of self-reported
information, faking bad is a negative response bias that is caused by the respondent's desire
to appear worse (e.g., manifest symptom amplification) than he or she really is either to
avoid duty or responsibility (i.e., malinger) or to qualify for goods or services [38]. If
faking bad bias is suspected, methods exist to detect it. (For a comprehensive discussion of
one such method [the "Fake Bad Scale"], the reader is referred to Nelson et al. [92].)

Halo Effect: The halo effect is a systematic bias that occurs when respondents fail to rate
individual attributes of a person, object, event, or service in isolation but instead let
overall impressions guide their ratings. It is suspected whenever respondents assign similar
ratings to each dimension measured in a survey (e.g., rate all aspects of performance as
"excellent" or all components of a course or program as "very good"). The phenomenon,
empirically confirmed by Thorndike in 1920 [93], is thought to result from a cognitive bias,
whereby one particular trait, especially a positive characteristic, influences or extends to
perception of other traits. A commonly cited example is judging an attractive person as more
intelligent. Its logical opposite is sometimes termed the "devil," "horns," or "reverse-halo"
effect whereby individuals judged to have a single undesirable trait (e.g., unattractiveness)
are subsequently judged to have other undesirable traits (e.g., lack of intelligence) based
on the evaluator's tendency to allow a single weakness to influence the totality of
impressions [94]. In the setting of a survey, a respondent's prejudices, recollections of
previous observations, and even answers to previous questions also may influence responses.
Thus, the halo (and reverse-halo) effects collectively represent an important bias that must
be recognized and, if possible, minimized to improve the accuracy of individual ratings.
Several approaches have been recommended including proper introduction of the purpose of the
survey (to emphasize the importance of the respondent's ratings), increasing the number of
attributes to be rated (bearing in mind that an excessive number of questions may cause the
respondent to abandon the survey), and/or physically arranging scales so that their favorable
and unfavorable ends alternate.

Reliability

Reliability is related to the question, "How consistent or reproducible are the scores that
an instrument produces?" Like validity, reliability technically is considered to be a
property of the measurement rather than of the instrument itself because the same instrument
administered in
different settings and to different subjects under varying conditions can yield widely
varying reliability estimates [63]. Reliability is considered to be a necessary, but
insufficient, element of validity [95, 96]. This is because valid conclusions cannot be drawn
from an instrument that yields inconsistent observations [63]. At the same time, reliability
does not imply validity because an instrument can produce consistent errors.

The concept of reliability can be illustrated using the metaphor of a bathroom scale. For
example, if you are like many people, you probably will step on your bathroom scale in the
morning, check your weight, step off, and step back on the scale to recheck the reading. You
have learned through experience that the measurement displayed by a bathroom scale the first
time you weigh yourself is not always the same as the second time you try, but usually it is
very close. A good scale might vary by half a pound or so, but if measured weight differs
significantly (e.g., more than 5 lb) at 7:00 a.m., 7:01 a.m., and 7:02 a.m., the readings
that the scale produces would have very limited reliability. Similarly, if an instrument is
designed to measure a patient's self-confidence, then it should yield approximately the same
result each time it is administered to the same subject.

Whereas validity is diminished by systematic error, reliability is reduced by random (chance)
error. There are many sources of random error in research measurement. The most common are
those caused by factors related to the subject, researcher, environment, and instrumentation.
For example, a subject who is tired, sick, hungry, angry, irritable, or confused may produce
measurements that are different than they would be if the subject were not so afflicted.
Indeed, any changing physical, emotional, or psychological state of the subject, including
the subject's awareness of the researcher's presence, can introduce error into the
measurement process. The researcher can introduce random error in measurement simply by his
or her physical appearance, demeanor, or other personal attributes or by becoming fatigued,
impatient, bored, ill, or distracted. Many factors that cause random error in measurement can
arise from perturbations of the research setting (e.g., unintended variations in temperature,
lighting, noise, or interruptions). Finally, many factors causing random error have their
source in the instrument. For example, unclear questions or directions, inadequate item
sampling, suboptimal format, or even the order in which the questions are posed are potential
sources of random error. Random error (like systematic error) must be considered in
interpreting the results of studies; the greater the error, the less we can rely on the
results of the measurement process for decision-making. In designing or selecting among
instruments, we are constantly striving to create or identify those that not only measure the
attribute of interest but which measure that attribute reliably.

Like validity, reliability can be classified according to several dimensions. These include
the stability of the measurement over time, the congruence of a measurement when defined by
different assessors (or determined by different methods), the consistency (homogeneity) of
items within a measure or scale, and the correspondence of parallel measures. These
dimensions, typically expressed as reliability coefficients, are evaluated using various
methodological approaches, as described below:

Test-Retest Reliability (Temporal Stability): Test-retest reliability is the most commonly
recognized form of reliability. It is evaluated by administering the same item, scale, or
instrument to a sample of individuals twice over a relatively short period (the period
depending on the intrinsic stability of the variable under study) and comparing the results
using Pearson's product moment correlation for interval data or Spearman's rank order
correlation for ordinal data. Test-retest correlation coefficients ranging from 0.70 to 0.80
generally are considered to be satisfactory to good (though criteria for acceptability vary
according to discipline). This measure of reliability is most appropriate for assessing
relatively enduring characteristics such as personality traits, aptitude, and chronic health
status in stable populations where subjects are willing to undergo multiple administrations
of the same measure. It is less appropriate for
estimating temporal consistency of attitudes, mood, and knowledge that can be influenced by
experience(s) or for health states that have been altered by intercurrent events between
measurements.

Interobserver (Inter-rater) Reliability: Inter-rater reliability reflects the agreement
between or among two or more assessors who independently rate the same item, scale, or
instrument administered within a sample of individuals at a single point in time. Cohen's
Kappa (κ) is a commonly used statistic for estimating agreement between two raters for binary
data (e.g., heart failure present vs. absent); a related statistic (Weighted Kappa) may be
used for ordinally ranked data such as those obtained via Likert-type scales. If the raters
are in complete agreement, then κ = 1. If there is no agreement beyond that which would be
expected by chance, then κ = 0 (values <0 signify that agreement is even less than that which
would be attributable to chance). Although there is no universal consensus, in the range of
values indicating better than chance agreement, statistics of 0.01 to 0.20 have been
interpreted as "slight agreement," 0.21 to 0.40 as "fair agreement," 0.41 to 0.60 as
"moderate agreement," 0.61 to 0.80 as "substantial agreement," and 0.81 or greater as "almost
perfect agreement" [97]. When data are at the interval level, inter-rater reliability can be
established via computation of the Pearson's correlation coefficient (r) when sample size is
relatively large and by the intraclass correlation coefficient (ICC) when sample size is
smaller (i.e., <15) [98], and is interpreted in the same manner as Kappa.

Internal Consistency: Internal consistency is an approach to reliability assessment that
estimates the homogeneity of items in a scale that are intended to measure the same
construct. The essential idea is that the various items on a scale all should correlate
highly and positively; that is, when one item is answered in a particular way, other related
items ought to be answered similarly. This approach is preferable to test-retest methods for
instruments that are highly sensitive to change and which, when evaluated as repeated
measures, can falsely create the impression of relatively low reliability [99]. Internal
consistency reliability customarily is evaluated by a variety of approaches, each of which
assesses equivalence of responses within a related group of items during a single
administration of the instrument to the same subjects. The most common are given below:

Split-Half Reliability is one of the oldest methods for evaluating internal consistency. It
is calculated by dividing a scale into two parts, computing a correlation coefficient between
those parts, and adjusting the correlation using the Spearman-Brown prophecy formula to
correct for foreshortened test length (as shorter scales tend to yield lower reliability
estimates). As a rule of thumb, coefficients between .70 and .80 indicate adequate
reliability, and .90 or greater indicates high reliability. If the two half measures are
highly correlated, this provides evidence that they are measuring the same attribute. Two
common methods for performing this analysis are to choose the first N items and the last N
items, or to choose odd numbered items and even numbered items. It is important that
split-half reliability be determined for particular scales, not for entire questionnaires
comprising different scales. For example, if a questionnaire assesses both anxiety and
depression, the split-half reliability of the two measures will need to be evaluated
separately.

The Kuder-Richardson Formula 20 (KR-20) [100]. The KR-20 can be used to provide an estimate
of internal consistency for scales calling for binary (dichotomous) responses (e.g.,
"yes/no," "true/false," "agree/disagree," "symptomatic/asymptomatic"). Unlike the split-half
method (described above), which is based only on a single splitting of items, the KR-20
computes split-half reliability based on all combinations of splittings and produces an
estimate of the mean correlation of the items comprising the measure. Values
can range from 0.00 to 1.00 (sometimes expressed as whole numbers, 1 to 100). A high KR-20
coefficient (i.e., >0.90) indicates a homogeneous measure or scale. A variant, the KR-21, is
computationally simpler (it is based only on the assessment mean, variance, and number of
items on the scale), but tends to produce lower reliability estimates.

Cronbach's Alpha [101] is the best known, and most commonly used, measure of internal
consistency. Like the KR-20, Cronbach's alpha conceptually represents the mean of all
split-half reliability estimates for a scale [76] and is computed by calculating pair-wise
correlations between items in a scale; however, Cronbach's alpha can be used with scales that
include several ordinal response options (e.g., "1 = strongly agree" through "5 = strongly
disagree" or "0 = not limited by heart failure symptoms" through "3 = severely limited by
heart failure symptoms") as well as those that include binary response options, making it
more flexible than the KR-20. Values of 0.70 or above are widely viewed as acceptable, and
values of approximately 0.90 are considered to be excellent [102]; however, extremely high
reliability estimates (i.e., ≥0.95) suggest that some of the items may be redundant,
contributing no additional information than that furnished by other items on the scale.
"Alpha if item is deleted" is a widely used index that can be useful for deleting
nonhomogeneous or redundant items during the process of scale development. Nonetheless, when
using standard-

the first assessment can influence the results of subsequent assessment by providing an
opportunity for practice or learning independent of the intervention. This threat to internal
validity ("testing effects") can be minimized (though not entirely eliminated) by using
alternate forms of measurement of the same construct or content domain before and after the
intervention. One commonly used approach to creating these alternate forms is to generate a
large pool of items, each of which addresses the construct being studied, and randomly
dividing the items to create two functionally equivalent instruments of similar difficulty
and length. Other methods include changing the wording or order of the questions in the two
instruments. (The same approach is used to discourage cheating on high stakes achievement or
aptitude tests.) After the alternate forms are created, they are administered to the same
sample, and the results are correlated. If they produce similar results for the same subjects
(i.e., they yield correlation coefficients >0.80), they are considered to be equivalent forms
and can be used interchangeably [62]. (The reader will note that the methodology for
establishing alternate form reliability, when based on division of a related item pool, is
analogous to that used for estimating split-half reliability. The primary difference is that
with split-half reliability, items within a single scale or measure are divided solely for
the purpose of determining internal consistency, whereas with the alternate form approach,
the objective is to construct
ized scales, all items (including those that two equivalent instruments that can be used
reduce alpha) should be retained to permit independently of one another.)
meaningful comparison with previous as
well as future assessments employing the
same instrument. Ethical and Legal Aspects of Survey
Alternate (Equivalent, Parallel) Form Methods
Reliability. An investigator may be con-
cerned that repeated measurement using Given below is a brief prcis of some ethical and
the same instrument might threaten the legal issues involved in survey research. Any
internal validity of an intervention study investigator should carefully review the policies
because (as noted in Chap. 5) exposure to of his or her institution to ensure compliance.
If the investigator has a professional license, that licensing body may also have relevant rules and regulations governing survey research.
1. General participation. In all cases, respondents must know that they are free not to participate, to skip questions, and to stop the survey at any time.
2. Sensitive questions. If sensitive questions are asked, provision should be made for debriefing, and respondents should be provided with information about relevant services, as appropriate. For example, if an investigator asks a subject about illicit drug use, information may need to be provided about available treatment facilities.
3. Privacy. Especially when sensitive information is discussed, substantial efforts should be made to keep identifying information private. One solution is to use code numbers rather than names and, if necessary, to store a link of code numbers to names in a separate and secure location.
4. Snowball (chain-referral) sampling. Sometimes, when a sampling characteristic is relatively rare within a population, or when a population is "concealed" from society at large, an investigator may have difficulty locating an adequate number of subjects for a survey. This can occur when the population of interest comprises individuals who exhibit illegal or otherwise stigmatized behaviors (e.g., illicit drug use or prostitution). One approach that sometimes is used to increase sample size under these conditions is to recruit a relatively small number of subjects who possess the desired sampling attribute and ask each subject to bring in additional subjects from among their acquaintances (social network) who possess the same attribute. These, in turn, may be called upon to recruit similar additional subjects for the study. Thus, the sample grows metaphorically like a "snowball." Though snowball sampling can reduce subject search costs and provide access to subjects who would otherwise be inaccessible, the investigator must take great care to adequately protect the potentially sensitive and damaging information given by respondents during the chain referral process, as disclosures from the investigator could compromise privacy of the subject and confidentiality of their data, destroy the relationships within the chain, and militate against further recruitment [103].
5. Focus groups. Focus groups pose special ethical problems, because members of the focus group share information that can, potentially, be used by one participant against another. As a hypothetical example, suppose a focus group of medical students were convened to evaluate specific academic programs and one member of the focus group identified a certain faculty member as incompetent. If another focus group member knew the identity of the participant expressing this view, that participant could be threatened or even blackmailed. As another example, if a focus group member acknowledged having HIV or some other stigmatized condition or admitted to engaging in illicit behavior (such as abuse of prescription or nonprescription drugs), similar problems could ensue.
6. Children and other special populations. Additional rules apply when conducting self-report surveys involving children and other special populations (e.g., prisoners, individuals with mental disabilities). These populations may have limited ability to supply informed consent, either due to lack of comprehension (e.g., young children and individuals with mental disabilities) or because of feelings of duress (e.g., prisoners). (A listing of these rules can be found in the Code of Federal Regulations, Title 45 Public Welfare, Department of Health and Human Services [104].)

Summary: A General Guide to Constructing a Measure

This chapter has highlighted the complexities of constructing a self-report measure. If the investigator believes that the need for a new measure outweighs the effort required to develop it, the following provides an outline of the essential steps involved, adapted from those suggested by
DeVellis [40] and Fowler [39]. (Further details of these steps can be found in their writings.)
1. Determine precisely what must be measured. It is not sufficient to have a vague idea of what is to be measured; one needs to be fairly precise. If the study is analytic, how well does the new measure facilitate testing of the research hypothesis? If the study is performed to generate a hypothesis, how well will the anticipated responses achieve this objective? Will the measure assess knowledge, attitudes, behaviors, or a combination of these areas? What areas must be covered? How will the new measure differ from existing measures? What theory will guide the development of the new measure? How specific versus general should the measure be? As is the case for all forms of research, time spent clarifying objectives at the outset will save a great deal of time later on.
2. Define the population of interest. State, as precisely as possible, whom you wish to study. Often, the choice will be a compromise between optimal and available subjects. An investigator may be interested in all humans with a disease, but it is never possible to study all such individuals. It also is very difficult, if not impossible, to obtain a random sample of such individuals from around the world. Early in the design of the study, the investigator should identify the age group and gender(s) of interest, the geographic location of potential respondents, their racial or ethnic characteristics, etc.
3. Select the type of self-report to be used. Decide whether the information being sought is best obtained via a mailed self-completed questionnaire, an in-person or telephone interview, or a computer-based method. Each approach has advantages and disadvantages, as noted above.
4. Generate the item pool. Initially, a large pool of items should be generated, covering as many different parts of the construct of interest as possible from different perspectives. Brainstorm. At this stage, the creator of the survey instrument should not fear redundancy or a long list of items; these can be narrowed later in the process. It is not uncommon for the initial pool to contain four times as many items as the number of items comprising the final scale.
5. Determine the measurement format. As previously indicated, questions and responses can be framed in numerous ways. The preferred format should be considered at the same time that the item pool is generated to maintain consistency. For example, will the survey be unstructured, semistructured, or structured? If the questions call for closed-ended responses, how many response categories will there be? What type of scaling will be used? Will the time frame to which the questions refer be specified or implied, etc.?
6. Develop validation items. Validation items are of two types: (a) those that do not directly measure the construct under study, but which may be useful for detecting flaws (biases) in the measurement process, and (b) those that assist in assessing the construct validity of the new measure. Including a social desirability scale can help to determine which items tend to be influenced by this positive bias and serve as a basis for eliminating them. The inclusion of items from a putatively related measure can be used to buttress a claim of construct validity or identify poorly performing items [40].
7. Pretest. Once a large pool of items has been defined, it can be reduced to a manageable number and screened for omissions, errors, and related problems. Independent review by content-matter experts, colleagues, and key decision makers can be helpful for establishing both the face and content validity of the preliminary instrument and for obtaining feedback regarding specific items. Reviewers can be asked:
• How relevant each item is to the construct being measured
• How clear the items are
• If there are ways to make the items more concise
• If key items are missing (there should be at least one question for every variable of interest)
• If items are superfluous or redundant
• If items are difficult to read or answer (e.g., are ambiguous or otherwise unclear)
It also is helpful to solicit review of the drafted items from individuals who are similar to the intended respondents. This can be done within a focus group or as a series of one-on-one cognitive interviews conducted among a small number of individual respondents. Both approaches allow exploration of how well the items are understood and are particularly useful when the intended respondents differ greatly from the individuals writing the survey instrument. Specific questions should be asked about how respondents interpreted the questions, how they thought the various questions differed from each other, how readable they were, and what their responses meant. At this stage, questions can be open-ended, as one of the goals of pretesting is to identify response options that may have been overlooked (a prespecified list of response options will, by force, constrain the respondent to think like the survey developer). Feedback from the pretest can be used to add, delete, and otherwise refine questions to be included in the preliminary instrument and to frame appropriate response options.
8. Pilot test. Pilot testing is crucial to development of a valid and useful scale. No matter what care is taken in developing and screening items, some will be misinterpreted by respondents. Pilot testing involves administering the preliminary questionnaire (including the cover letter and directions) to respondents who, again, are as similar as possible to members of the target population. The pilot should be performed, to the extent possible, under conditions that mirror the conditions under which the final survey will be conducted. It should ask respondents to find flaws in the survey (e.g., Were directions and skip patterns (if any) clear? Was the survey too long? Was the format appropriate? Were any of the questions confusing or otherwise unclear? Did any not apply? Were any overly intrusive? Were any redundant? Did they flow well?). Statistical methods (e.g., evaluation of distributional characteristics, examination of missing answers, item-to-item and item-to-scale correlations) can be applied to responses obtained in the pilot to detect poorly performing or redundant items and to evaluate their impact on internal consistency when retained or deleted. It is difficult to find guidance regarding the minimal number of participants to be included in a pilot. Some workers in the field have suggested 300 [105]; others [40] have recommended that for single scales comprising relatively few (e.g., 20) items, a smaller number may suffice. A cautionary note is in order. If too few respondents are chosen, it may not be possible to evaluate the items properly; if the sample is not representative, items may have different meanings to the pilot sample versus the target population, and the relationships among the items may be different as well [40].
9. Edit. Invariably, once a measure is pilot tested, revision will be required. Directions may need to be clarified. Confusing, overly intrusive, or unanswered questions will need to be deleted or reworded (though reworded items may need to be retested). If revisions are extensive, a second round of pilot testing may be required. Once poorly performing items are eliminated, the length of the instrument should be evaluated. Too short a measure will not fully explore the construct of interest. However, one that is too long may bore or frustrate the respondents.
10. Assess reliability and validity. Before an instrument can be used for formal research purposes, its reliability and validity must be assessed in the population of interest. As noted above, the most common test for reliability is Cronbach's alpha; for validity, the appropriate method depends on the degree of development of substantive knowledge and the existence of (a) other measures of the same construct, (b) measures of similar but different constructs, and (c) a gold standard.
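As a minimal sketch of the internal-consistency statistics referred to in steps 8 and 10, the Python fragment below computes Cronbach's alpha from a respondents-by-items table of pilot scores, the KR-20 for items scored 0/1, and "alpha if item is deleted." The function names and the small data set are hypothetical illustrations and are not taken from the sources cited in this chapter. Population variances are used throughout, an assumption under which alpha and the KR-20 coincide for binary items.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for rows of respondents and columns of items.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(scores[0])                       # number of items on the scale
    if k < 2:
        raise ValueError("at least two items are required")
    item_variances = [pvariance(item) for item in zip(*scores)]
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def kr20(scores):
    """KR-20 for items scored 0/1; each item's variance is p * (1 - p)."""
    k, n = len(scores[0]), len(scores)
    total_variance = pvariance([sum(row) for row in scores])
    if k < 2 or total_variance == 0:
        raise ValueError("KR-20 is undefined for these scores")
    pq = 0.0
    for item in zip(*scores):
        p = sum(item) / n                    # proportion scoring 1 on this item
        pq += p * (1 - p)
    return k / (k - 1) * (1 - pq / total_variance)


def alpha_if_item_deleted(scores):
    """Alpha recomputed with each item removed in turn (needs >= 3 items)."""
    k = len(scores[0])
    return [
        cronbach_alpha([[v for j, v in enumerate(row) if j != drop] for row in scores])
        for drop in range(k)
    ]


# Hypothetical pilot data: 5 respondents (rows) answering 3 yes/no items (1 = yes).
pilot = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
]

print(round(cronbach_alpha(pilot), 3))
print(round(kr20(pilot), 3))
print([round(a, 3) for a in alpha_if_item_deleted(pilot)])
```

Results of this kind would be read against the thresholds discussed earlier (e.g., values of 0.70 or above widely viewed as acceptable). A pilot of five respondents is far too small for real use and serves here only to show the arithmetic.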
Take-Home Points
• A self-report (a.k.a. survey) is a measure where the respondent supplies information about himself or herself.
• Self-reports are important in medical research because some variables (e.g., attitudes, beliefs, self-judged ability) can be assessed only from information directly furnished by the patient or other subject.
• A self-report is obtained by questionnaire, interview, or related methods.
• Questionnaires are written documents that can be self-completed without interviewer involvement or read aloud as part of an interview; interviews usually (but not always) are administered orally; both can be structured (comprise closed-ended questions), unstructured (comprise open-ended questions), or semistructured (comprise a mix of both question types).
• If answers to a research question can be obtained only via self-report, the investigator should first determine whether an instrument already exists that is reliable, valid, and otherwise suitable for the population of interest.
• In situations where a new instrument must be developed, the investigator must clearly define the question(s) of interest; identify the population to be surveyed; select the preferred type of self-report/format of measurement; consider inclusion of validation questions; pretest, pilot test and edit the measure; and test the final battery of questions for reliability and validity.
• When developing or implementing a survey, the investigator must be certain to observe all ethical and legal aspects of survey methodology.
References

1. Polit DF, Beck CT. Nursing research: principles and methods. 7th ed. Philadelphia: Lippincott, Williams and Wilkins; 2004.
2. Kish L. Survey sampling. New York: Wiley; 1995.
3. Groves RM, Fowler FJ, Couper MP, Lepkowski JM, Singer E. Survey methodology. New York: Wiley; 2004.
4. Cochran WG. Sampling techniques. 3rd ed. New York: Wiley; 1977.
5. Derogatis LR. BSI: Brief Symptom Inventory: administration, scoring and procedures manual. Minneapolis: National Computer Systems; 1993.
6. Ware JE, Snow KK, Kosinski M, Gandek B. SF-36 health survey: manual and interpretation guide. Lincoln, RI: QualityMetric, Inc.; 2000.
7. Skevington SM, Bradshaw J, Saxena S. Selecting national items for the WHOQOL: conceptual and psychometric considerations. Soc Sci Med. 1999;48:473–487.
8. Rector TS, Cohn JN. Assessment of patient outcome with the Minnesota Living with Heart Failure Questionnaire: reliability and validity during a randomized, double-blind, placebo-controlled trial of pimobendan. Pimobendan Multicenter Research Group. Am Heart J. 1992;124:1017–25.
9. Morisky DE, Green LW, Levine DM. Concurrent and predictive validity of a self-reported measure of medicine adherence. Med Care. 1986;24:67–72.
10. Strauss AL, Corbin JM. Basics of qualitative research: techniques and procedures for developing grounded theory. 2nd ed. Thousand Oaks: Sage; 1998.
11. Cohen MZ, Ley C, Tarzian AJ. Isolation in blood and marrow transplantation. West J Nurs Res. 2001;25:37–48.
12. Fadiman A. The spirit catches you and you fall down: a Hmong child, her American doctors, and the collision of two cultures. New York: Farrar, Straus and Giroux; 1998.
13. Drever E. Using semi-structured interviews in small-scale research, a teacher's guide. ERIC. Edinburgh: SCRE; 1995.
14. Benson J, Britten N. Respecting the autonomy of cancer patients when talking with their families: qualitative analysis of semistructured interviews with patients. BMJ. 1996;313:729–731.
15. O'Dea JA. Consumption of nutritional supplements among adolescents: usage and perceived benefits. Health Educ Res. 2003;18:98–107.
16. Allan G. A note on interviewing spouses together. J Marriage Fam. 1980;42:205–210.
17. Kalischuk RG, Davies B. A theory of healing in the aftermath of youth suicide. J Holist Nurs. 2001;19:163–186.
18. Dolezal C, Mellins C, Brackis-Cott E, Abrams EJ. The reliability of reports of medical adherence from children with HIV and their adult care givers. J Pediatr Psychol. 2003;28:355–361.
19. Dym B, Berman S. The primary health care team: family physician and family therapist in joint practice. Fam Syst Med. 1986;4:9–21.
20. Morrison-Beedy D, Côté-Arsenault D, Feinstein NF. Maximizing results with focus groups: moderator and analysis issues. Appl Nurs Res. 2001;14:48–53.
21. Quatromoni PA, Milbauer M, Posner BM, Carballeira NP, Brunt M, Chipkin SR. Use of focus groups to explore nutrition practices and health beliefs of urban Caribbean Latinos with diabetes. Diabetes Care. 1994;17:869–873.
22. Hicks LK, Lin Y, Robertson DW, Robinson DL, Woodrow SI. Understanding the clinical dilemmas that shape medical students' ethical development: questionnaire survey and focus group study. BMJ. 2001;322:709–710.
23. Flanagan JC. The critical incident technique. Psychol Bull. 1954;51:327–358.
24. Coté CJ, Notterman DA, Karl HW, Weinberg JA, McClosky C. Adverse sedation events in pediatrics: a critical incident analysis of contributing factors. Pediatrics. 2000;105:805–14.
25. Branch W, Pels RJ, Arky R. Becoming a doctor. Critical-incident reports from third-year medical students. N Engl J Med. 1993;329:1130–1132.
26. Allery LA, Owen PA, Robling MR. Why general practitioners and consultants change their clinical practice: a critical incident study. BMJ. 1997;314:870–874.
27. Faithfull S. The diary method for nursing research. Eur J Cancer Care. 2007;1:13–18.
28. Bruijnzeels NA, Foets M, van der Wooden JC, Prins A, van den Houvel WJ. Measuring morbidity of children in the community: a comparison of interview and diary data. Int J Epidemiol. 1998;27:96–100.
29. White MM, Howie-Esquivel J, Caldwell MA. Improving heart failure symptom recognition: a diary analysis. Cardiovasc Nurs. 2010;25:7–12.
30. Woodfield R, Goodyear-Smith F, Arroll B. N-of-1 trials of quinine efficacy in skeletal muscle cramps of the leg. Br J Gen Pract. 2005;55(512):181–185.
31. Aitken L, Mardegan KJ. Thinking aloud: data collection in the natural setting. Western J Nurs Res. 2000;22:841–853.
32. Fonetyn M, Fisher A. Use of think aloud method to study nurses' reasoning and decision making in clinical practice settings. J Neurosci Nurs. 1995;27:124–128.
33. Ericsson K, Simon H. Protocol analysis: verbal reports as data. London: MIT Press; 1993.
34. Murphy LL, Spies RA, Plake BS, editors. Tests in print VII. Lincoln: Buros Institute of Mental Measurements; 2006.
35. Geisinger KF, Spies RA, Carlson JF, Plake BS, editors. The seventeenth mental measurements yearbook. Lincoln: Buros Institute of Mental Measurements; 2007.
36. Goldman BA, Mitchell DF, Egelson PE, editors. Directory of unpublished experimental mental measures. Washington, DC: American Psychological Association; 2007.
37. Bieri D, Reeve R, Champion GD, Addicoat L, Ziegler JB. The Faces Pain Scale for the self-assessment of the severity of pain experienced by children: development, initial validation and preliminary investigation for ratio scale properties. Pain. 1990;41:139–150.
38. Choi BCK, Pak AWP. A catalog of biases in questionnaires. Prev Chronic Dis. 2005;2:1–13.
39. Fowler FJ. Improving survey questions. Thousand Oaks: Sage; 1995.
40. DeVellis RF. Scale development: theory and applications. Newbury Park: Sage; 1991.
41. Chang AM, Chau JPC, Holroyd E. Translation of questionnaires and issues of equivalence. J Adv Nurs. 2010;29:316–322.
42. Healey B, Gendall P. Asking the age question in mail and online surveys. Austral and New Zeal Marketing Acad (ANZMAC) Conference 2007. Dunedin; 2007.
43. Heise DR. The semantic differential and attitude research. In: Summers GF, editor. Attitude measurement. Chicago: Rand McNally; 1970.
44. Aiken LR. Rating scales and checklists. New York: Wiley; 1996.
45. DeLoach LJ, Higgins MS, Caplan AB, Stiff JL. The visual analog scale in the immediate postoperative period: intrasubject variability and correlation with a numeric scale. Anesth Analg. 1998;86:102–106.
46. Brealey SD, Atwell C, Bryan S, Coulton S, Cox H, Cross B, Fylan F, Garratt A, Gilbert FG, Gillan MGC, Hendry M, Hood K, Houston H, King D, Morton V, Orchard J, Robling M, Russell IT, Torgerson D, Wadsworth V, Wilkinson C. Improving response rates using a monetary incentive for patient completion of questionnaires: an observational study. BMC Med Res Methodol. 2007;7:12–16.
47. Asch D, Jedrziewski MK, Christakis N. Response rates to mail surveys published in medical journals. J Clin Epidemiol. 1997;50:1129–1136.
48. Diment K, Garrett-Jones S. How demographic characteristics affect mode preference in a postal/web mixed survey of Australian researchers. Soc Sci Comput Rev. 2007;25:410–417.
49. Shih TH. Comparing response rates from web and mail surveys: a meta-analysis. Field Methods. 2008;20:249–271.
50. O'Toole J, Sinclair M, Leder K. Maximising response rates in household telephone surveys. BMC Med Res Methodol. 2008;8:71.
51. Tourangeau R, Smith TW. Asking sensitive questions: the impact of data collection mode, question format and question context. Public Opin Q. 1996;60:275–304.
52. Turner CF, Al-Tayyib AA, Rogers SM, Eggleston MA, Villarroel MA, Roman AM, Chromy JR, Cooley PC. Improving epidemiological surveys of sexual behavior conducted by telephone. Int J Epidemiol. 2009;38:1118–1127.
53. Couper MP, Nicholls II WL. The history and development of computer assisted survey information collection methods. In: Couper MP et al., editors. Computer assisted survey information collection. New York: Wiley; 1998.
54. Vataja R, Pohjasvaara T, Leppävuori A, Mäntylä R, Aronen HJ, Salonen O, Kaste M, Erkinjuntti T. Magnetic resonance imaging correlates of depression after ischemic stroke. Arch Gen Psychiatry. 2001;58:925–31.
55. Schackman BR, Dastur Z, Rubin DS, Berger J, Camhi E, Netherland J, Ni Q, Finkelstein R. Feasibility of using audio computer-assisted self-interview (ACASI) screening in routine HIV care. AIDS Care. 2009;21:992–999.
56. Oetting ER, Beauvais F. Adolescent drug use. J Consult Clin Psychol. 1990;58:385–394.
57. Fidler DS, Kleinknec RE. Randomized response versus direct questioning: two data-collection methods for sensitive information. Psychol Bull. 1977;84:1045–1049.
58. Lensvelt-Mulders GJLM, Hox JJ, van der Heijden PGM, Maas CJM. Meta-analysis of randomized response research. Sociol Method Res. 2005;33:319–348.
59. Edwards P, Roberts I, Clarke M, DiGuisseppi C, Pratap S, Wentz R, Kwan I. Increasing response rates to postal questionnaires. BMJ. 2002;324:1183–91.
60. Brennan M, Charbonnau J. Improving mail survey response rate using chocolate and replacement questionnaires. Public Opin Q. 2009;73:368–378.
61. Merriam-Webster Online. Available at http://www.m-w.com/. Accessed 27 July 2010.
62. Waltz CF, Strickland OL, Lenz ER. Measurement in nursing and research. New York: Springer Publishing Inc; 2005.
63. Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: theory and application. Am J Med. 2006;119(2):166.e7–166.e16.
64. Litwin MS. How to measure survey reliability and validity. Thousand Oaks: Sage; 1995.
65. Beck AT, Steer R, Brown GK. Manual for the Beck Depression Inventory-II. San Antonio: Psychological Corporation; 1996.
66. Williams RA. Women's health content validity of the family medicine in-training exam. Fam Med. 2007;39:572–577.
67. Ware JE, Sherbourne CD. The MOS 36 item short form health survey. Med Care. 1992;30:473–483.
68. Feldman AB, Haley SM, Coryell J. Concurrent and construct validity of the pediatric evaluation of disability inventory. Phys Ther. 1990;70:602–610.
69. Lin JM, Brimmer DJ, Maloney EM, Nyarko E, BeLue R, Reeves WC. Further validation of the Multidimensional Fatigue Inventory in a US adult population sample. Popul Health Metr. 2009;7:18. doi:10.1186/1478-7954-7-18.
70. McHorney CA, Ware Jr JE, Raczek AE. The MOS 36-item Short-Form Health Survey (SF-36): II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med Care. 1993;31:247–263.
71. Management Sciences for Health. Creating a climate that motivates staff and improves performance. The Manager. 2003;11:1–22.
72. Tennant R, Hiller L, Fishwick R, Platt S, Joseph S, Parkinson J, Secker J, Stewart-Brown S. The Warwick-Edinburgh Mental Well-Being Scale (WEMWBS): development and UK validation. Health and Quality of Life Outcomes. 2007;5:63. doi:10.1186/1477-7525-5-63.
73. Cooper SM, Baker JS, Tong RJ, Roberts E, Hanford M. The repeatability and criterion related validity of the 20 m Multistage Fitness Test as a predictor of maximal oxygen uptake in active young men. Br J Sports Med. 2005;39:e19.
74. Reuben DB, Siu AL, Kimpau S. The predictive validity of self-report and performance-based measures of function and health. J Gerontol. 1991;47:M106–M110.
75. Heather N, Rollnick S, Bell A. Predictive validity of the readiness to change questionnaire. Addiction. 1993;88:1667–1677.
76. Portney LG, Watkins MP. Foundations of clinical research. Applications to practice. Upper Saddle River: Prentice Hall Health; 2000.
77. Guyatt G, Walter S, Norman G. Measuring change over time: assessing the usefulness of evaluative instruments. J Chronic Dis. 1987;40:171–178.
78. Hays RD, Hadorn D. Responsiveness to change: an aspect of validity, not a separate dimension. Qual Life Res. 1992;1:73–75.
79. Beaton DE, Bombadier C, Katz JN, Wright JG. A taxonomy for responsiveness. J Clin Epidemiol. 2001;54:1204–1217.
80. Husted JA, Cook RJ, Farewell VT, Gladman DD. Methods for assessing responsiveness: a critical review and recommendations. J Clin Epidemiol. 2000;53:459–468.
81. Roach KE. Measurement of health outcomes: reliability, validity and responsiveness. JPO. 2006;18:8–12.
82. Liang MH, Fossel AH, Larson MG. Comparison of five health status instruments for orthopedic evaluation. Med Care. 1990;28:632–642.
83. Angst F, Verra ML, Lehmann S, Aeschlimann A. Responsiveness of five condition-specific and generic outcome assessment instruments for chronic
pain. BMC Med Res Methodol 2008;8:26 (published online 2008 April 25 doi:10.1186/1471-2288-8-26).
84. Tavris C, Aronson E. Mistakes were made, but not by me. Orlando: Harcourt Books; 2008.
85. Crowne DP, Marlowe D. A new scale of social desirability independent of psychopathology. J Consult Psychol. 1960;24:349–354.
86. Strahan R, Kerbasi K. Short homogenous version of the Marlowe-Crowne Social Desirability Scale. J Clin Psychol. 1972;28:191–193.
87. Furnham A, Henderson M. The good, the bad and the mad: Response bias in self-report measures. Pers Indiv Differ. 1982;3:311–320.
88. Leary MR, Kowalski RM. Impression management: a literature review and two-component model. Psychol Bull. 1990;107:34–47.
89. Lenski GE, Leggett JC. Caste, class, and deference in the research interview. Am J Sociol. 1960;65:463–467.
90. Krosnick JA, Alwin DF. An evaluation of cognitive theory of response order effects in survey measurement. Public Opin Q. 1987;51:201–219.
91. Toner B. Impact of agreement bias on the rating of questionnaire response. J Soc Psychol. 1987;127:221–222.
92. Nelson NW, Parsons TD, Grote CL, Smith CA, Sisung II JR. The MMPI-2 Fake Bad Scale: concordance and specificity of true and estimate scores. J Clin Exp Neuropsychol. 2006;28:1–12.
93. Thorndike EL. A constant error in psychological rating. J Appl Psychol. 1920;4:25–29.
94. Roeckelein J. Elsevier's dictionary of psychological theories. Amsterdam: Elsevier BV; 2006.
95. Feldt LS, Brennan RL. Reliability. In: Linn RL, editor. Educational measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
96. Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ. 2003;37:830–837.
97. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174.
98. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420–428.
99. McDowell I, Newell C. Measuring health. A guide to rating scales and questionnaires. 2nd ed. New York: Oxford University Press; 1996.
100. Kuder GF, Richardson MW. The theory of the estimation of test reliability. Psychometrika. 1937;2:151–60.
101. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16:297–334.
102. George D, Mallery P. SPSS for Windows step by step. Boston: Allyn & Bacon; 2003.
103. Faugier J, Sargeant M. Sampling hard to reach populations. J Adv Nurs. 1997;26:790–797.
104. Code of Federal Regulations, Title 45 Public Welfare, Department of Health and Human Services, Revised 15 Jan 2009 (Effective 14 July 2009).
105. Nunnally JC. Psychometric theory. New York: McGraw-Hill; 1978.
9 Selecting and Evaluating Secondary Data: The Role of Systematic Reviews and Meta-analysis
Lorenzo Paladino and Richard H. Sinert
Sorting through the body of available literature is a daunting task. MEDLINE, only one of many databases, indexed 902,346 articles in 2010. This number reflects a continuing increase over 2009 (854,506) and 2008 (821,834). How can clinicians have any chance of keeping up with the literature or use it for guiding research or for formulating clinical practice decisions if their primary sources are restricted to individual studies? The answer is that it is difficult, if not increasingly impractical. Reliance on individual studies is further complicated when current beliefs and standards of practice are challenged by new studies. For clinicians to make informed decisions, they must analyze multiple studies for both their quality and relevance to the patient population of interest. This is a principal reason for the long lag time before clinical research is incorporated into standard practice. A representative example is the 20-year delay between initial reports suggesting the utility of thrombolytic therapy for myocardial infarctions in the late 1970s and its adoption in the 1990s [1]. For these reasons, secondary sources such as narrative reviews, systematic reviews, and meta-analyses are an important means for physicians to translate clinical research into standard practice and help reconcile conflicting studies in the literature.

Difference Between a Narrative Review, Systematic Review, and Meta-analysis

A narrative review (sometimes termed a traditional literature review) is a summary of primary published studies in which conclusions are drawn by the reviewer, guided by his or her own interpretations of the studies, rather than by external criteria. Narrative reviews are well suited for general topics or broad coverage of a field as they usually cover a wide range of issues within a given topic [2], e.g., "Update on Multiple Sclerosis." Typically, they are written by experts in the specific field of study rather than by experts on research methodology. As such, narrative reviews do not necessarily explicitly state or follow the rules of evidence-based search strategies (including selection criteria for articles and abstracts found) or assess the quality or validity of the included studies. This deficit leads to lack of transparency and reproducibility and is likely
L. Paladino, MD R.H. Sinert, DO () to reect a biased selection of the total evidence
Department of Emergency Medicine, available (selection bias). A common bias in nar-
SUNY Downstate Medical Center, rative reviews is failure to include research that
450 Clarkson Avenue, 1228, Brooklyn,
NY 11203, USA
conicts with the beliefs or opinions of the expert.
e-mail: Lorenzopaladino@yahoo.com; Nonetheless, the majority of published reviews
Richard.sinert@downstate.edu are narrative rather than systematic.
In contrast, systematic reviews (in medicine, written most commonly about treatment or diagnostic research) focus on a specific question within a topic (e.g., "Are steroids effective in controlling flares of multiple sclerosis?" "Does positron emission tomography have strong positive predictive value for breast cancer?"), rendering them amenable to an explicit search strategy. This characteristic makes them excellent tools to explore clinically relevant topics. Systematic reviews identify the databases searched and, thus, present clear and reproducible search strategies. A comprehensive literature search is conducted, and all identified studies are assessed for relevance and methodology. Selection is based on predefined inclusion and exclusion criteria, quality is assessed, and data are abstracted in a standardized format. By explicitly stating how the evidence was found, how it was appraised or validated, and which studies were excluded (and why), systematic reviews eliminate many of the biases inherent in narrative reviews.

A meta-analysis (sometimes termed a quantitative review) often, but not always, is included as a component of a systematic review. First used for medicine in 1904 by renowned statistician Karl Pearson to examine the preventive effect of serum inoculations against enteric fever [3] and later formalized by contemporary statistician and educational researcher, Gene V. Glass (who coined the term in 1976) [4], meta-analysis currently is employed in many disciplines as a statistical methodology to combine the results of several studies about a topic as if they were from one large study. In studies of treatment (the most common focus of meta-analysis in clinical medicine), its principal purposes are to enable detection of overall and subgroup effects (as statistical power may be suboptimal due to limitations in sample size of individual trials), to improve estimates of the magnitude of these effects, and to aid in the resolution of uncertainty due to inconsistent findings (i.e., interstudy differences) [5]. The studies included in a meta-analysis should be found using the same rigorous search methodology as that used for systematic reviews.

Well-constructed systematic reviews and meta-analyses have many of the characteristics of effective research described by Tuckman [6] and reviewed in Chap. 1. They are systematic because information gathering is done in a structured and rigorous way and the data contained within them are interpreted. They are logical in that their methodologies employ tools for assessing the studies' bias (internal validity) and procedures to discern the effects of varying populations on study results (external validity). They are replicable both because they demonstrate whether the results of individual studies are congruent and also because the methodology employed in the review, if properly performed and reported, is sufficiently explicit to permit reproduction. They are transmittable because, by digesting available information and coming to a conclusion, they effectively summarize what is currently known on a specific topic and, when published, enable clinicians to learn about the conclusions of research. In addition, meta-analyses, specifically, gather, compare, and pool the empirical products (data) of the studies collected and are reductive to a clinical conclusion. As noted above, meta-analyses increase sample size by pooling the subjects of smaller studies when appropriate. This larger N increases the generalizability of the results. When the results cannot be pooled, they often shed light on reasons why results may not be generalizable.

Searching for a Systematic Review or Meta-analysis

Almost all of the databases described in Chap. 2 can be used to search for meta-analyses and systematic reviews. The "Clinical Queries" link on the PubMed interface for MEDLINE can be used to apply search filters to focus on systematic reviews [7]. A variety of databases also are available that specialize in systematic reviews and meta-analyses. The Cochrane Library (www.thecochranelibrary.com), developed under the auspices of the Cochrane Collaboration (an international network dedicated to promoting well-informed health-care decision-making), maintains an online collection of systematic reviews on intervention and treatment. The Database of
Promoting Health Effectiveness Reviews (DoPHER) is a registry of systematic and nonsystematic reviews of public health. BestBETs (www.bestbets.org), ACP Journal Club (www.acpjc.org), and the TRIP Database (www.tripdatabase.com/index.html) are other sources of systematic reviews for clinical questions. The Database of Abstracts of Reviews of Effectiveness (DARE) (www.crd.york.ac.uk/CMS2Web) contains abstracts of systematic reviews that have been assessed for their quality.

Steps in Writing a Systematic Review

There are several steps in writing a systematic review. Below is a brief outline that may serve as an overview (discussion of these steps is provided below):
1. Formulate the question.
2. Define the literature searching strategy.
3. Select the studies to be included.
4. Summarize results across studies.
5. Assess heterogeneity.
6. Consider appropriateness of pooling results for meta-analysis.

Formulating the Question

As is true for all well-designed primary studies, the first step in conducting a systematic review is definition of a clear searchable question. The importance of this initial step often is underestimated, leading to frustrating and unsuccessful searches. The process is best guided by the often-used four-part PICO method, originally defined by the McMaster University Centre for Evidence-Based Medicine (Hamilton, Ontario) 1992 recommendations for asking focused clinical questions [8]. The PICO method can help translate a question into terms that will allow whichever search engine is selected to retrieve the most appropriate literature. Its components are described below:

"P" denotes the patient population or problem. The reviewer needs to carefully define the population from among many available options. What is the age group of interest? Should the search be focused on males or females only? For which specific diseases is information sought (e.g., diabetes or acute myocardial infarction or acute myocardial infarctions in diabetics)? An overly broad search typically will yield an excessive quantity of information, whereas an overly narrow search (e.g., females 30–35 years of age) will result in too few or no results.

"I" denotes the intervention. In the setting of clinical medicine, the term "intervention" commonly is considered to be therapy (e.g., medical or surgical treatment or a risk-reduction initiative such as a smoking-cessation or weight-reduction program). However, this component of the PICO is somewhat of a misnomer as it also can pertain to diagnostic testing. When the PICO method is applied to analyze questions about progression of disease, the intervention (more appropriately termed "factor of interest" as it is not purposively applied) would be presence of a prognostic factor such as age, gender, morbidity, lifestyle, or family.

"C" denotes the comparator, that is, to what the intervention in question will be compared. A clinician might argue, "I don't want to compare two drugs, I just want to know if giving aspirin is beneficial to my patients." This question, however, by its very nature, must include a comparator, that is, giving aspirin versus giving nothing, in which case the target of the search likely will include studies that involve administering a placebo as a comparison. In diagnostic questions such as "Is an ultrasound a good study for detection of common bile duct stones?" the comparator optimally is the best available or "gold standard" test (i.e., endoscopic retrograde cholangiopancreatography [ERCP]). (In questions about prevention and prognosis, the optimal comparators are, respectively, absence of a preventive initiative or a given prognostic factor or factors.)

The "O" (outcome) denotes the component that often spurs the research question. For example, will this therapy decrease morbidity or mortality? This element of the PICO typically requires refinement (consider the
Fig. 9.1 MeSH for mortality on PubMed. Available at http://www.ncbi.nlm.nih.gov/mesh?term=mortality
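As a hedged illustration of how a PICO question might be turned into a search string, the sketch below joins the synonyms for each component with OR and then joins the components with AND, using the blood culture and pneumonia example discussed in this chapter. The function names (`quote`, `any_of`, `all_of`) are hypothetical, and the output is a generic Boolean string rather than the syntax of any particular database; field tags or MeSH qualifiers would still have to be added for a real search.

```python
def quote(term: str) -> str:
    """Wrap multi-word phrases in double quotes so they are searched as a unit."""
    return f'"{term}"' if " " in term else term


def any_of(terms: list[str]) -> str:
    """Join synonyms for one PICO component with OR (broadens the search)."""
    if not terms:
        raise ValueError("at least one term is required")
    quoted = [quote(t) for t in terms]
    return quoted[0] if len(quoted) == 1 else "(" + " OR ".join(quoted) + ")"


def all_of(groups: list[str]) -> str:
    """Join the component groups with AND (narrows the search)."""
    return " AND ".join(groups)


# Synonyms taken from the chapter's worked example (illustrative, not exhaustive)
intervention = any_of(
    ["blood culture", "microbiological culture", "microbial culture", "microbial testing"]
)
population = any_of(["pneumonia", "lung infection", "respiratory infection"])
outcome = any_of(["mortality", "death", "survival"])

query = all_of([intervention, population, outcome])
print(query)
```

Starting with only the OR groups and adding AND clauses one at a time mirrors the advice to search broadly first and then "cone down."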
concept of mortality reduction: what period of time is clinically meaningful? 30 days? 6 weeks? 6 months? 1 year?).

Defining the Literature Search Strategy: Keywords, MeSH, and Boolean Operators

An organized literature search will increase the likelihood of finding answers to the question of interest. The PICO question described above can be subdivided into its four components for entry into the database's search engine. We recommend that the reviewer search broadly at first and then search more narrowly ("cone down"). The more limited the initial search, the more likely it will miss relevant articles. Each component of the question should be searched by keywords, probable synonyms, and, if using PubMed, its MeSH (medical subject headings) terms (also called descriptors). MeSH is the US National Library of Medicine's (NLM) controlled vocabulary thesaurus that is used for indexing articles; it is hierarchically arrayed to facilitate searching at varying levels of specificity [9]. Use of all of these tools invariably will yield a more inclusive search.

Consider the example: "Does drawing blood cultures (intervention) change mortality (outcome) in adult patients with pneumonia (population)?" (The comparison implied by the question is not drawing blood cultures.) In some literature, blood culture may be classified as "microbiological culture," "microbial culture," or "microbial testing"; pneumonia as "lung infection" or "respiratory infection"; and mortality as "death" or "survival." MeSH terms can help expand the search by including many or all of these synonyms under one umbrella (Fig. 9.1). However, they should not be solely relied upon because inclusion or exclusion of an item under a specific MeSH is determined subjectively by those performing the NLM indexing.

During the search, the selected terms are connected by the Boolean operators "AND," "OR," and "NOT" (see Venn Diagram,
Fig. 9.2). The meaning of these operators is self-explanatory; however, the implications of their addition to a search deserve outlining. The "OR" operator expands the search to include any of the selected terms, whereas "AND" limits it to those that contain all selected terms.

To start a search broadly, the keywords in the query should be connected by the "OR" operator (e.g., mortality OR survival). This strategy provides the sum of all words as if they were searched individually. By adding "AND pneumonia," the search will yield articles only about both mortality (OR survival) and pneumonia. This concept is illustrated by the Venn diagram given in Fig. 9.3.

Fig. 9.2 Boolean terms "OR," "AND," and "NOT"

Fig. 9.3 (Mortality OR survival) AND pneumonia

Though, as noted, the Boolean "NOT" operator is available, to optimize inclusiveness, it is better to search positively (i.e., to join desired concepts) rather than to search by exclusion.

An inclusive search should not miss any relevant information. Unfortunately, the literature is not centralized, and many databases (e.g., MEDLINE, EMBASE, and others listed in Chap. 2) must be queried to assure a complete search. The bibliographies of relevant papers should be checked for articles missed by the initial search, a methodology often referred to as "snowballing." Repeating this process on the additional papers can lead to greater retrieval. Citation searches using the Web of Science or SciVerse-Scopus also may yield additional papers. New keywords found on these papers can be added to augment the original search terms. Consulting a research librarian to perform expert searches also should be done for completeness. Unpublished studies can be found by searching clinical trials registries and by contacting experts and individual authors in the field. The Cochrane Library maintains a registry of controlled clinical trials, the Cochrane Central Register of Controlled Trials (CENTRAL), as does ClinicalTrials.gov. These important steps help to prevent the reviewer from missing relevant yet unpublished research, common with negative studies (see below: "Detecting Publication Bias").

Selecting Articles

Having formed the search question, the next step in constructing the systematic review is consideration of the types of literature available to answer the question. Selection should be based on several key factors, the most important of which are listed below.

Levels of Evidence

Medline and other databases contain literature that is very heterogeneous with regard to the strength of evidence provided. The varying types of studies contained within the literature are represented here as a pyramid (Fig. 9.4), with the weakest evidence for answering clinical
Fig. 9.4 Pyramid of evidence
questions shown at the bottom and the strongest evidence shown at the top. Bias decreases as we move up the pyramid, in direct contrast to the amount of literature available on a given topic.

In vitro and animal studies, although important for hypothesis generation, cannot be applied directly for clinical care or provide a direct answer to a clinical research question, as can case reports, series, case-control, cohorts, and randomized controlled clinical trials (RCTs). As noted in previous chapters, a case report describes the presentation and/or treatment of an individual patient, whereas a case series consists of a collection of reports on several individual patients. Because they do not have control groups with which to compare outcomes, neither has validity for drawing conclusions about associations or cause and effect. Case-control studies are always retrospective studies in which subjects who already have a specific condition are compared with those who do not. These studies are well suited to test associations between risks or toxic exposures and diseases, especially when the latter are relatively rare. Data collection typically is based on the medical record and/or patient recall. Though case-control studies provide stronger evidence for association than case reports or case series, caution must be exercised in interpretation of results because demonstration of a statistical relationship does not provide proof of causality. Cohort studies follow individuals with specific risk factors or exposures over time and compare them with comparable individuals who do not have the risk factor or exposure being studied to evaluate differences in outcomes. Though cohort studies (particularly those that are prospective in nature) provide better evidence of association than case-control studies, they (like case-control studies) are observational and, as such, are subject to more bias than studies in which an intervention has been purposely applied; their greatest utility in clinical epidemiology is for defining prognosis of a disease. Quasi-experimental studies contain some of the elements of true experiments (parallel control groups and/or repeated assessments) but (as noted in Chap. 5), due to lack of random allocation to treatment group, are not fully protected from all threats to internal validity. In contrast, randomized controlled clinical trials (RCTs) study the effects of a
Table 9.1 Criteria for calculating the Jadad score (Reprinted with permission from Jadad et al. [12])
Criteria (Yes = 1 point; No = 0 points)
1. Was the study described as randomized?
2. Was the randomization process described, and was it appropriate?
3. Was the study described as double blind?
4. Was the method for double blinding appropriate?
5. Were the withdrawals and dropouts of the study enumerated?
Interpretation
Score 0–2: Low-quality study
Score 3–5: High-quality study
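The scoring in Table 9.1 can be sketched in a few lines of code. This is a minimal illustration that assumes the simple one-point-per-"yes" scheme exactly as tabulated above; the class and function names (`JadadAnswers`, `jadad_score`, `jadad_quality`) are hypothetical and are not part of any published tool.

```python
from dataclasses import dataclass, astuple


@dataclass
class JadadAnswers:
    """Yes/no answers to the five criteria, in the order listed in Table 9.1."""
    described_as_randomized: bool
    randomization_described_and_appropriate: bool
    described_as_double_blind: bool
    double_blinding_appropriate: bool
    withdrawals_enumerated: bool


def jadad_score(answers: JadadAnswers) -> int:
    """One point for each 'yes'; the total ranges from 0 to 5."""
    return sum(1 for answer in astuple(answers) if answer)


def jadad_quality(score: int) -> str:
    """Interpretation row of Table 9.1: 0-2 low quality, 3-5 high quality."""
    if not 0 <= score <= 5:
        raise ValueError("score must be between 0 and 5")
    return "high" if score >= 3 else "low"


# Hypothetical trial: randomized with a described method, not blinded,
# withdrawals reported
trial = JadadAnswers(True, True, False, False, True)
score = jadad_score(trial)
print(score, jadad_quality(score))
```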
purposively applied therapy by comparing an intervention group and control group to which subjects have been randomly allocated. They also incorporate additional methodologies such as blinding ("masking") and analysis by intention-to-treat that reduce the potential for a variety of threats to internal validity, though they may suffer from limitations in generalizability (external validity). In theory, as syntheses of prior research, systematic reviews and meta-analyses, though relatively few in number, are at the top of the pyramid, providing the strongest evidence for associations or cause-and-effect relationships. However, for this to be true, both must meet stringent methodological quality criteria (described below) and the elements of the meta-analysis (i.e., the included studies), specifically, must have sufficiently similar study design characteristics to permit pooling of results, a criterion that is not always met in practice. When it is not, a meta-analysis, if performed, will be more useful for hypothesis generation than for hypothesis testing [10].

Standardizing Selection of Articles

The list of abstracts generated from the PICO search query is next screened for selection of relevant articles. Although inclusion criteria (e.g., nature of the patient population, specific outcomes and summary measures) optimally are predefined, the process is not immune to subjectivity and bias. The list should be screened independently by two reviewers to minimize subjectivity. Any discrepancies should be compared and discussed to reach a consensus. The reviewers' interrater reliability should be measured and reported. The predefined inclusion and exclusion criteria should be reported in the methods section and the search strategy in the appendix, to facilitate replication of results.

Assessing the Quality of Primary Studies

Assessment of bias in the methodology of the individual studies is a core component of a systematic review; therefore, tools for appraising the quality of the individual studies should be integrated. Unfortunately, no gold standard exists to evaluate the methodology of therapeutic trials or assessments of diagnostic test performance even though their quality and methods for synthesis are thought by some to be superior to that of other forms of clinical research (e.g., prognostic studies) [11]. Consensus and working groups continually reevaluate and improve upon assessment tools; thus, the preferred methods or systems change over time. Below is a listing and brief discussion of a cross section of tools for detecting bias in these types of studies. We present these to introduce the topic rather than to advocate a specific scoring system. (For the author of a primary study, they can be used as a check list to ensure a sufficiently comprehensive methods section.)

Therapeutic Testing Articles Appraisal

A variety of assessment tools for therapeutic articles exist such as the Jadad scale [12], shown in Table 9.1. Common to all is evaluation of key areas prone to bias. Inclusion and exclusion criteria should be reviewed to decide whether the patients included in the identified study meet the requirements of the "P" of the PICO. As indicated earlier, the highest quality studies optimally
will use randomized treatment assignment with concealment of allocation, double blinding, and intention-to-treat analyses. Follow-up should be complete and transparent. In addition, readers should look for an explanation as to why participants may have dropped out of an investigation, as differential attrition from a study may impact conclusions regarding the effectiveness of the investigational treatment (e.g., if the sickest patients dropped out of the treatment arm receiving an investigational new drug, the drug might appear to be more effective than it is).

Studies about treatment, optimally, will express the impact of therapy quantitatively as the number needed to treat (NNT) or the number needed to harm (NNH). The NNT is the number of patients that need to be given the intervention for one patient to benefit, thus expressing the effectiveness of an intervention in a clinically meaningful manner. It is calculated as the reciprocal of the difference in outcomes of the intervention and control groups (absolute risk reduction) derived from a therapeutic trial. The closer the NNT is to 1, the greater the efficacy of the intervention; the further from 1, the lesser its efficacy. As an example, in the landmark study ISIS-2 [13], the efficacy of (1) 1 h of IV infusion of 1.5 MU streptokinase (SK), (2) 160 mg of enteric-coated aspirin (ASA) taken daily for 1 month, and (3) both active agents versus placebo was evaluated through 35 days after a suspected acute myocardial infarction (AMI) among 17,187 patients. Analysis revealed that the absolute reductions in risk of vascular mortality associated with SK and ASA and their combination versus placebo, respectively, were 2.8%, 2.4%, and 5.2%, yielding NNTs of 36 (SK), 42 (ASA), and 19 (SK + ASA). These NNTs (not calculated in the original study) indicated that 36 patients would need to be treated with SK and 42 patients with ASA to prevent one vascular death, whereas the same result could be achieved with combination therapy in 19 patients. A closely related parameter is the number needed to harm (NNH), calculated as the inverse of the absolute risk increase (again expressed as a proportion) and interpreted as the number of patients one would need to treat to expect an adverse outcome. The NNT must be weighed with the baseline risk, NNH, benefit magnitude and/or cost to have comprehensive meaning to the clinician. It may be more acceptable in clinical practice to apply a treatment that is inexpensive, easy to use, and of almost no adverse risk but has higher NNT than one that has a lower NNT but is expensive, dangerous, and has only a marginal clinical benefit. For example, while the NNT was relatively higher with aspirin than with SK in ISIS-2, there was no reported bleeding requiring transfusion or confirmed cerebral hemorrhage associated with aspirin (a very low cost, easy-to-manage intervention), whereas there was a very small (though statistically significant) excess occurrence of these events with SK (0.5% vs. 0.2% with placebo [major bleeds], equivalent to a NNH = 333; 0.1% (SK) vs. 0.0% with placebo [cerebrovascular hemorrhage], equivalent to a NNH = 1,000).

Diagnostic Testing Articles Appraisal

Diagnostic accuracy studies investigate how well the results from an index test (test being evaluated) agree with the results of the reference standard. (As noted above, the reference standard or "gold standard" is considered the best available method to determine the presence or absence of a condition.) Diagnostic studies have unique design features which differ from therapeutic testing; therefore, different methods exist for detecting bias and variability.

The Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool [14] is one such method. The tool comprises 14 items, defined by expert consensus, that examine a variety of important biases and other methodological concerns specific to the evaluation of diagnostic tests (Table 9.2), though it does not address the issue of intra- or interobserver reliability. Responses are framed as binary "yes/no" questions, or if not enough information is supplied, "unclear." The Cochrane Collaboration offers similar tools for assessing diagnostic studies [15].

In the past, calculations of the sensitivity, specificity, and predictive values of a diagnostic test were considered sufficient for evaluation of its utility. In this era, a high-quality diagnostic
Table 9.2 The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews (Reproduced with permission from Whiting et al. [14])
Item   Yes   No   Unclear
1. Was the spectrum of patients representative of the patients who will receive the test in practice?   ( )   ( )   ( )
2. Were selection criteria clearly described?   ( )   ( )   ( )
3. Is the reference standard likely to correctly classify the target condition?   ( )   ( )   ( )
4. Is the time period between reference standard and index test short enough to be reasonably sure that the target condition did not change between the two tests?   ( )   ( )   ( )
5. Did the whole sample or a random selection of the sample, receive verification using a reference standard of diagnosis?   ( )   ( )   ( )
6. Did patients receive the same reference standard regardless of the index test result?   ( )   ( )   ( )
7. Was the reference standard independent of the index test (i.e. the index test did not form part of the reference standard)?   ( )   ( )   ( )
8. Was the execution of the index test described in sufficient detail to permit replication of the test?   ( )   ( )   ( )
9. Was the execution of the reference standard described in sufficient detail to permit its replication?   ( )   ( )   ( )
10. Were the index test results interpreted without knowledge of the results of the reference standard?   ( )   ( )   ( )
11. Were the reference standard results interpreted without knowledge of the results of the index test?   ( )   ( )   ( )
12. Were the same clinical data available when test results were interpreted as would be available when the test is used in practice?   ( )   ( )   ( )
13. Were uninterpretable/intermediate test results reported?   ( )   ( )   ( )
14. Were withdrawals from the study explained?   ( )   ( )   ( )
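The NNT and NNH figures quoted for ISIS-2 in the discussion of therapeutic trials can be checked with a short calculation. The sketch below takes the reciprocal of an absolute risk difference expressed as a proportion. The function name `number_needed` is hypothetical, and rounding to the nearest whole patient is an assumption chosen to reproduce the figures quoted in this chapter; many sources round NNT upward instead.

```python
def number_needed(absolute_risk_difference: float) -> int:
    """Reciprocal of an absolute risk difference given as a proportion
    (e.g., 0.028 for 2.8%), rounded to the nearest whole patient."""
    if absolute_risk_difference <= 0:
        raise ValueError("the risk difference must be a positive proportion")
    return round(1.0 / absolute_risk_difference)


# Absolute risk reductions in vascular mortality quoted for ISIS-2
absolute_risk_reduction = {"SK": 0.028, "ASA": 0.024, "SK + ASA": 0.052}
nnt = {arm: number_needed(arr) for arm, arr in absolute_risk_reduction.items()}

# Absolute risk increases with SK versus placebo quoted for ISIS-2
nnh_major_bleed = number_needed(0.005 - 0.002)
nnh_cerebral_hemorrhage = number_needed(0.001 - 0.000)

print(nnt, nnh_major_bleed, nnh_cerebral_hemorrhage)
```

The same function serves for both quantities because NNT and NNH differ only in whether the risk difference describes a benefit or a harm.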
study also will define threshold values for their diagnostic test using receiver operator characteristic (ROC) curves which are plots of the true positive rate (sensitivity) versus the false positive rate (1 − specificity) (Fig. 9.5). The area under the curve reflects the relationship between sensitivity and specificity for a given test. As a curve asymptotically approaches the upper left-hand corner, the area under the curve approaches 1 (100% sensitivity and specificity). A random guess would generate a point along the diagonal bisecting the graph, also called the line of no discrimination. Points above the diagonal represent better results (greater diagnostic accuracy), while points below the line are poor (lower diagnostic accuracy). (For further discussion of the use of ROC curves for determination of diagnostic accuracy, the reader is referred to Chap. 11.)

Once thresholds for a positive and negative diagnostic test are defined by ROC curves, then an evidence-based operating characteristic of the test can be defined by its likelihood ratios (LR). The LR is the probability that a given test result would be expected in a patient with the target disorder divided by the probability (P) that that same result would be expected in a patient without the target disorder. LRs can be calculated both for positive (LR+) and negative (LR−) test results, as shown below.

$$\mathrm{LR}^{+} = \frac{\text{sensitivity}}{1 - \text{specificity}} = \frac{P(\text{Test}^{+} \mid \text{Disease}^{+})}{P(\text{Test}^{+} \mid \text{Disease}^{-})}$$

$$\mathrm{LR}^{-} = \frac{1 - \text{sensitivity}}{\text{specificity}} = \frac{P(\text{Test}^{-} \mid \text{Disease}^{+})}{P(\text{Test}^{-} \mid \text{Disease}^{-})}$$

High LR+ values (LR+ > 10) significantly increase the probability of disease and low LR− values (LR− < 0.1) significantly decrease the probability of disease. The extent to which the results of a diagnostic test change the probability that the patient has a disease (posttest probability) can be estimated using a graphical tool known as the Fagan nomogram [16] by
Fig. 9.5 Receiver operator characteristic curve
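The likelihood ratio definitions, and the conversion from pretest to posttest probability that the Fagan nomogram performs graphically, can also be written out as arithmetic. The sketch below is a minimal illustration: the function names and the example sensitivity, specificity, and pretest probability are hypothetical values chosen for demonstration, and the posttest calculation uses the odds form of Bayes' theorem (posttest odds = pretest odds × LR).

```python
import math


def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (LR+, LR-) from sensitivity and specificity given as proportions."""
    if not (0.0 <= sensitivity <= 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("sensitivity must be in [0, 1] and specificity in (0, 1)")
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


def posttest_probability(pretest_probability: float, lr: float) -> float:
    """Convert a pretest probability to a posttest probability via odds."""
    if not 0.0 < pretest_probability < 1.0 or lr < 0.0:
        raise ValueError("pretest probability must be in (0, 1) and LR non-negative")
    pretest_odds = pretest_probability / (1.0 - pretest_probability)
    posttest_odds = pretest_odds * lr
    return posttest_odds / (1.0 + posttest_odds)


# Hypothetical test: 90% sensitivity, 80% specificity, 20% pretest probability
lr_pos, lr_neg = likelihood_ratios(0.90, 0.80)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}")
print(f"posttest probability after a positive result: {posttest_probability(0.20, lr_pos):.2f}")
print(f"posttest probability after a negative result: {posttest_probability(0.20, lr_neg):.2f}")
```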
using a straight edge to draw a line from the pretest probability through the calculated LR (Fig. 9.6).

Summarizing the Results: The Role of Meta-analysis

As noted earlier, sometimes the size of an individual clinical trial may be too small to detect a treatment effect or to estimate its magnitude reliably. Meta-analysis is a method to increase the power of statistical analyses and precision of estimates by pooling the results of related trials (i.e., those that address a similar hypothesis) to obtain a quantified synthesis. Not all systematic reviews lead to a meta-analysis. The trials may be so varied in their methodology, end points, or results that combining them may not be appropriate.

In a conventional meta-analysis (sometimes known as aggregate-level meta-analysis), a summary statistic (e.g., a risk ratio, a difference between outcome means) for the observed effect is abstracted or recalculated from each included study. (A less common approach, not reviewed in this chapter, combines original or patient-level data from prior studies; for an excellent discussion of the pros and cons of this method, known as Individual Patient Data [IPD] meta-analysis, the reader is referred to Stewart and Tierney 2002 [17].) Next, a pooled effect estimate is calculated as a weighted average by sample size of the intervention effects reported in the individual studies. By pooling results, the standard error of the weighted average effect size of the included studies and its associated confidence interval are reduced, typically affording greater statistical power to detect an effect than would be possible from any one constituent study. Reduction of the confidence intervals also increases precision of the estimated population effect size [18]. In assigning weights for generating the pooled
estimate, the evaluator needs to consider whether it is more appropriate to use a fixed versus a random effects model as these make different assumptions about the nature of the included studies. A fixed effects model assumes that all of the studies contained within the meta-analysis have attempted to measure a single true effect value and that variation in observed effect sizes is due only to chance. An assumption underlying such a model is that all of the studies have been conducted under similar conditions with similar subjects, differing only in their power to detect

Fig. 9.6 The Fagan nomogram (Reproduced with permission from Fagan [16])

timing, and measurable differences other than sampling variability (see also assessment of heterogeneity below). Although more data are required for random effects models to achieve the same statistical power as with fixed effects models, the former represents a more conservative assumption. Unless the author of a meta-analysis has guidance from a statistician indicating that a fixed model is appropriate, a random effects model typically is preferable.

Most meta-analyses summarize their findings graphically using a forest plot [19]. A forest plot illustrates the relative effects of multiple studies addressing the same question or hypothesis. The studies are listed in the left-hand column, typically in chronological order. The measured effect for each of these studies is represented by a square, whose area is related to the relative sample size of the individual study. The effect may be an odds ratio, risk difference, or a correlation coefficient. The confidence intervals (CI) are represented by horizontal lines bisecting the square. The width of the CI is related to the power and variability of the study. The combined results of the meta-analysis usually are represented by a diamond, the width of which is the CI for the pooled data. A vertical line representing no effect is placed at 1 for ratios (odds or risks) or at 0 for differences and correlation coefficients. If the CI of an individual study or the pooled data crosses this line, the null hypothesis of no effect cannot be rejected. Figure 9.7 illustrates a forest plot used in a meta-analysis of the effects of administration of beta blockade on in-hospital mortality rates among patients with acute coronary syndrome [20].

Assessment of Heterogeneity: Methods of Investigation

Heterogeneity in meta-analysis refers to the variation in outcomes among included studies. As noted above, a certain degree of variability should
the outcome of interest. (This rarely, if ever, is the be expected when comparing multiple studies
case.) In contrast, a random effects model (hence, the rationale for suggesting the more con-
assumes that the true effect size can vary from servative random effects model for pooling data).
study to study along a distribution due to differ- Clinical variability occurs when there are dif-
ences in the nature of the populations, dosing, ferences in the study population, interventions,
Fig. 9.7 The forest plot (Reproduced with permission from: Brandler et al. [20])
or outcomes measured. Methodological vari- intervention or may be too far along the disease
ability occurs when there are differences in process to show any efcacy. Sometimes, the
study design. Not suprisingly, clinical or method- interventions themselves may be dissimilar.
ological differences will cause variations in the For example, a review of antibiotics in sepsis
effect measured. Heterogeneity refers to this may include studies that used different classes of
difference in effect size (or direction) between antibiotics. Dosing size may have an impact
studies. Of course, like all statistical tests, the on heterogeneity as well. The effects, benecial
heterogeneity of the effect size in pooled studies or harmful, may increase with increased dose
may occur by chance. and with the duration or frequency of the
Assessment of clinical and methodological het- intervention.
erogeneity includes both qualitative and quantita- Clearly, outcome measures also must be simi-
tive elements. One begins by comparing the study lar to permit appropriate comparison. Thus,
populations. Are the studies similar in age, sex, or 6-month mortality after cardiac intervention in
even type of disease? If not, is it appropriate to one study should not be compared to left ventric-
pool them together? Are the interventions the ular ejection fraction at 6 months in another.
same? Some studies may include co-interventions Length of follow-up of a trial may have an
which may be a source of confounding. Studies inuence on the estimate of treatment effect. Like
also may exhibit variability in terms of the timing applying the intervention at disparate times, fol-
of the intervention; thus, imposition of an inter- low-up at different stages of the disease likely
vention at different stages during the disease pro- will impact outcomes. This issue should have be
cess may cause differences in degree of efcacy. resolved during the study selection stage of a
For example, a study on the impact of oncologic review so that studies lacking the desired out-
surgery would likely exhibit differences in come measure were excluded. One should also
efcacy if conducted early after cancer detection be critical of surrogate marker use as an outcome
as opposed to after metastases had developed. measure, especially when being compared to a
The question of timing overlaps the issue of pop- direct outcome. Different study methods will
ulation differences as patients may be sicker at have different degrees of bias. Those conducting
one stage of the disease than another. This can meta-analyses should consider whether it is
magnify the effects or negate them. An ill popu- appropriate to compare RCTs with blinding and
lation may exaggerate the benecial effects of an concealment to unblinded cohort studies.
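
The pooling step described earlier in this section, and the choice between fixed and random effects models, can be made concrete with a short sketch. Several things here are assumptions rather than anything specified in this chapter: the studies and their numbers are invented, the effects are taken to be log odds ratios, the weights are inverse variances (a common choice that tracks sample size closely, whereas the text describes weighting by sample size), and the between-study variance is estimated with the DerSimonian-Laird method, which is only one of several available estimators.

```python
import math

def pool_fixed(effects, variances):
    """Inverse-variance weighted average and its standard error."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    return pooled, math.sqrt(1.0 / total)

def tau_squared_dl(effects, variances):
    """DerSimonian-Laird estimate of the between-study variance (assumed method)."""
    weights = [1.0 / v for v in variances]
    pooled, _ = pool_fixed(effects, variances)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    return max(0.0, (q - df) / c)

def pool_random(effects, variances):
    """Random effects pooling: add tau^2 to each study's variance, then pool."""
    tau2 = tau_squared_dl(effects, variances)
    return pool_fixed(effects, [v + tau2 for v in variances])

def ci95(estimate, se):
    """Approximate 95% confidence interval on the analysis scale."""
    return estimate - 1.96 * se, estimate + 1.96 * se

# Hypothetical log odds ratios and their variances for five trials.
log_or = [-0.40, -0.10, -0.65, 0.05, -0.30]
var = [0.04, 0.09, 0.12, 0.02, 0.06]

fixed, fixed_se = pool_fixed(log_or, var)
rand, rand_se = pool_random(log_or, var)

for label, est, se in (("fixed", fixed, fixed_se), ("random", rand, rand_se)):
    lo, hi = ci95(est, se)
    print(f"{label:>6}: OR {math.exp(est):.2f} "
          f"(95% CI {math.exp(lo):.2f} to {math.exp(hi):.2f})")
```

In this sketch the pooled standard error is smaller than that of any single study, which is the gain in precision the text describes, and the random effects interval is at least as wide as the fixed effects interval, which is why the random effects model is called the more conservative choice.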
Table 9.3 Assessing heterogeneity with the I² statistic

I² (as a proportion)    Degree of heterogeneity
<0.25                   Low
0.25 to 0.50            Moderate
>0.50                   High
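
Table 9.3 grades the I² statistic, which this section derives from the Cochran Q. As a minimal sketch of that arithmetic, the code below computes Q, the ratio Q/df, and I² = 100% × (Q − df)/Q, then applies the Table 9.3 cut-offs. The study data are invented, and the use of inverse-variance weights to compute Q is an assumption, since the chapter does not spell out the Q formula. The p value for Q is omitted to keep the sketch within the standard library; it would come from a chi-square distribution with N − 1 degrees of freedom.

```python
def cochran_q(effects, variances):
    """Cochran Q and its degrees of freedom (inverse-variance weights assumed)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    return q, len(effects) - 1

def i_squared(q, df):
    """I^2 as a percentage; negative values are treated as 0%."""
    if q <= 0:
        return 0.0
    return max(0.0, 100.0 * (q - df) / q)

def grade(i2_percent):
    """Degree of heterogeneity using the Table 9.3 cut-offs (0.25 and 0.50)."""
    proportion = i2_percent / 100.0
    if proportion < 0.25:
        return "Low"
    if proportion <= 0.50:
        return "Moderate"
    return "High"

# Hypothetical log odds ratios and their variances for five trials.
log_or = [-0.40, -0.10, -0.65, 0.05, -0.30]
var = [0.04, 0.09, 0.12, 0.02, 0.06]

q_stat, df = cochran_q(log_or, var)
i2 = i_squared(q_stat, df)
print(f"Q = {q_stat:.2f} on {df} df, Q/df = {q_stat / df:.2f}")
print(f"I^2 = {i2:.1f}% ({grade(i2)} heterogeneity)")
```

Because I² is a ratio built from Q, it inherits the uncertainty of Q when there are few studies, so the grade is best read alongside the forest plot rather than on its own.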
Fig. 9.8 The L'Abbé plot

Heterogeneity of the effect size can be analyzed graphically or statistically. The following are some of the commonly accepted methods:

• The forest plot (described above) can be visually analyzed to determine whether the effects of the individual studies are scattered about on both sides of the "no difference/association" line or whether they are grouped together (i.e., are on one side or another of this line). If there is very little or no overlap of the confidence intervals, then significant heterogeneity exists and pooling of the results may not be appropriate. If a meta-analysis is carried out, the authors should address the cause of the heterogeneity, whether clinical, methodological, or both, and provide a justification for continuation.

• The L'Abbé plot (Fig. 9.8) also can be used to explore the heterogeneity of effect estimates [21]. The proportion of events in the intervention group (y-axis) is plotted against that in the control group (x-axis). The "no effects" line runs between them at 45°. The symbol size is proportional to sample size.

• The Cochran chi-square (Cochran Q) is a common test for detecting heterogeneity in meta-analyses. It assumes the null hypothesis that all the variability among the individual study results is due to chance. The Cochran Q test generates a p value, based on a chi-square distribution with N − 1 degrees of freedom (df), where N is the number of studies, that indicates whether the individual effects are farther away from the common effect, beyond what would be expected by chance. A p value < 0.10 indicates significant heterogeneity. (The level of significance for Cochran Q often is set at 0.1 due to the low power of the test to detect heterogeneity.) If the Cochran Q is not statistically significant, but the ratio of Cochran Q and the degrees of freedom (Q/df) is >1, the result is interpreted to indicate possible heterogeneity. If the Cochran Q is not statistically significant and Q/df is <1, then heterogeneity is much less likely. A limitation of the Cochran Q test is that it is underpowered to detect heterogeneity if there are few studies in the meta-analysis. Conversely, it is overpowered (i.e., may detect negligible variability) when the number of studies is large. An additional limitation is that the Cochran Q test evaluates only the presence or absence of heterogeneity rather than its magnitude.

• The I² statistic represents the percentage of variation across studies due to heterogeneity. I² is an index that quantifies the degree of heterogeneity in a meta-analysis and can be used as a complement to the Cochran Q test. I² is calculated from the Cochran Q according to the formula: I² = 100% × (Q − df)/Q, where df = degrees of freedom. Values may range from 0% to 100%, with a value of 0% indicating no observed heterogeneity (Table 9.3). Although negative values are possible from the equation, they are treated as equivalent in meaning to 0%.

• Sensitivity analysis. A sensitivity analysis tests whether the results of the meta-analysis are affected by restrictions and alterations in the included studies. Examples include removing an outlier (i.e., the study with the largest effect size in either direction) or removing the largest study to test if this changes the magnitude or direction of the pooled effect size or its statistical significance. This analysis helps to determine whether the pooled result is influenced heavily by a particular trial. Other permutations include using only blinded, higher quality trials (or excluding lower quality trials) or performing the analysis under fixed and random effects assumptions. If the results are consistent, the sensitivity analysis provides stronger evidence of an effect and of generalizability.

Pooling Results for Meta-analysis: Considerations

Heterogeneity (whether defined graphically or statistically) should be considered alongside a qualitative assessment of the combinability of studies. When significant methodological differences and heterogeneity are detected, a meta-analysis probably should not be performed as it may be misleading. Under these circumstances, the systematic review should report the results descriptively using text and tables and not pool the data. However, if effect sizes are similar despite clinical and methodological variability, the results probably are robust and generalizable. A cost-free program for producing the tables and graphs and performing the statistics for a meta-analysis is available from the Cochrane group, RevMan 5 (Review Manager, Version 5.0, The Cochrane Collaboration, Copenhagen, Denmark).

Detecting Publication Bias

The literature tends to be biased toward positive findings, a phenomenon known as publication bias [22]. Studies with large sample sizes have a greater probability of achieving statistical significance and, therefore, achieving publication. This holds true for studies demonstrating large treatment effects as well, even if the sample size is small. Indeed, many smaller or negative trials are never published. Publication bias produces a positive relationship between sample or effect size and publication, with potentially adverse consequences (i.e., type I error or inappropriate rejection of the null hypothesis in favor of the alternative hypothesis, further discussed in Chaps. 10, 11). Fortunately, a variety of graphical and statistical methods are available to help detect it. The most widely used of these are described below:

• Funnel plots. The funnel plot [23] is a graphic display, used to detect publication bias, of the sample size or precision (1/standard error) on the y-axis versus the effect estimate on the x-axis. Ideally, the results from small studies will scatter widely at the bottom of the graph, forming the base of the triangle or funnel because they have less precision, with the spread narrowing around the summary effects line at the apex for larger studies. This pattern occurs when publication bias is absent or unlikely. Asymmetry may indicate systematic differences, errors of measurement, or publication bias; as noted, small studies with positive results are more likely to be published, whereas negative studies of similar size are not and, therefore, are not found during execution of the search strategy. The absence of these balancing studies is made visually obvious in the asymmetry of the plot (Fig. 9.9). Although funnel plots usually are employed to test for publication bias, there are other causes of asymmetry such as systematic differences and errors of measurement. When found, the causes of the asymmetry should be investigated and explained to justify the continued grouping of these studies for meta-analysis.

• Fail-safe N. The inability to locate every unpublished study about a subject might be unnerving to authors of a meta-analysis. As a method of compensation for what may be unknown, Rosenthal [24] developed formulae based on the desired level of significance (p value), later named the fail-safe N by Cooper [25]. Orwin [26] adapted the fail-safe N to adjust for small (d = 0.2), medium (d = 0.5), or large (d = 0.8) effect sizes [27]. The formula calculates the number of studies that would be needed to confirm the null hypothesis and, thereby, reverse a conclusion that a significant relationship exists. The formula for Orwin's fail-safe N [26] is given below:

$$N_{fs} = \frac{N\,(d - d_c)}{d_c}$$

where N = the number of studies in the meta-analysis, d = the average effect size for the studies synthesized, and d_c = the criterion value selected that d would equal when some knowable number of hypothetical studies (N_fs) were added to the meta-analysis. If the fail-safe N is sufficiently high, it may provide reassurance that a few missing studies would not alter the conclusion.

Fig. 9.9 The funnel plot

Assessing Quality of Systematic Reviews and Meta-analyses

Systematic reviews and meta-analyses are powerful informational tools. However, unless properly conducted and reported, they can produce erroneous conclusions that potentially could impact the public health [28, 29]. Thus, just as there are tools for assessing the quality of individual trials, there also are guidelines for assessing the quality of systematic reviews and meta-analyses. In 1996 (published in 1999) [30], the QUOROM (quality of reporting of meta-analyses) statement was issued to address standards for improving the quality of reporting of meta-analyses of clinical randomized controlled trials. Since that time, many additions, updates, and expansions of this statement for broader applicability have led to the development of the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-analyses) statement, which provides guidelines designed to reduce the risk of flawed reporting of systematic reviews and improve the clarity and transparency in how reviews are conducted [31]. Included are a 27-item checklist (Table 9.4) and a 4-phase flowchart (Fig. 9.10) [32]. Though not part of current checklists, conflicts of interest such as financial funding of individual trials should be reported in the systematic review or meta-analysis.

Limitations of Systematic Reviews and Meta-analyses

The major limitations of narrative reviews have been described above. The reader should be aware that caution also must be exercised when conducting or interpreting a systematic review or meta-analysis. Of note, an evaluation of 300 systematic reviews conducted by Moher et al. in 2007 found that the quality of these reviews was inconsistent [33], a finding that led to the above-mentioned 2009 PRISMA statement. Other criticisms are based on poor methodology including
Table 9.4 PRISMA checklist for reporting systematic reviews with or without meta-analyses (Reproduced with permission from Moher et al. [32])

Section/topic | Item No. | Checklist item | Reported on page No.

Title
Title | 1 | Identify the report as a systematic review, meta-analysis, or both |

Abstract
Structured summary | 2 | Provide a structured summary including, as applicable, background, objectives, data sources, study eligibility criteria, participants, interventions, study appraisal and synthesis methods, results, limitations, conclusions and implications of key findings, systematic review registration number |

Introduction
Rationale | 3 | Describe the rationale for the review in the context of what is already known |
Objectives | 4 | Provide an explicit statement of questions being addressed with reference to participants, interventions, comparisons, outcomes, and study design (PICOS) |

Methods
Protocol and registration | 5 | Indicate if a review protocol exists, if and where it can be accessed (such as web address), and, if available, provide registration information including registration number |
Eligibility criteria | 6 | Specify study characteristics (such as PICOS, length of follow-up) and report characteristics (such as years considered, language, publication status) used as criteria for eligibility, giving rationale |
Information sources | 7 | Describe all information sources (such as databases with dates of coverage, contact with study authors to identify additional studies) in the search and date last searched |
Search | 8 | Present full electronic search strategy for at least one database, including any limits used, such that it could be repeated |
Study selection | 9 | State the process for selecting studies (that is, screening, eligibility, included in systematic review, and, if applicable, included in the meta-analysis) |
Data collection process | 10 | Describe method of data extraction from reports (such as piloted forms, independently, in duplicate) and any processes for obtaining and confirming data from investigators |
Data items | 11 | List and define all variables for which data were sought (such as PICOS, funding sources) and any assumptions and simplifications made |
Risk of bias in individual studies | 12 | Describe methods used for assessing risk of bias of individual studies (including specification of whether this was done at the study or outcome level), and how this information is to be used in any data synthesis |
Summary measures | 13 | State the principal summary measures (such as risk ratio, difference in means) |
Synthesis of results | 14 | Describe the methods of handling data and combining results of studies, if done, including measures of consistency (such as I² statistic) for each meta-analysis |
Risk of bias across studies | 15 | Specify any assessment of risk of bias that may affect the cumulative evidence (such as publication bias, selective reporting within studies) |
Additional analyses | 16 | Describe methods of additional analyses (such as sensitivity or subgroup analyses, meta-regression), if done, indicating which were pre-specified |

Results
Study selection | 17 | Give numbers of studies screened, assessed for eligibility, and included in the review, with reasons for exclusions at each stage, ideally with a flow diagram |
Study characteristics | 18 | For each study, present characteristics for which data were extracted (such as study size, PICOS, follow-up period) and provide the citations |
Risk of bias within studies | 19 | Present data on risk of bias of each study and, if available, any outcome-level assessment (see item 12) |
Results of individual studies | 20 | For all outcomes considered (benefits or harms), present for each study (a) simple summary data for each intervention group and (b) effect estimates and confidence intervals, ideally with a forest plot |
Synthesis of results | 21 | Present results of each meta-analysis done, including confidence intervals and measures of consistency |
Risk of bias across studies | 22 | Present results of any assessment of risk of bias across studies (see item 15) |
Additional analysis | 23 | Give results of additional analyses, if done (such as sensitivity or subgroup analyses, meta-regression) (see item 16) |

Discussion
Summary of evidence | 24 | Summarize the main findings including the strength of evidence for each main outcome; consider their relevance to key groups (such as health care providers, users, and policy makers) |
Limitations | 25 | Discuss limitations at study and outcome level (such as risk of bias), and at review level (such as incomplete retrieval of identified research, reporting bias) |
Conclusions | 26 | Provide a general interpretation of the results in the context of other evidence, and implications for future research |

Funding
Funding | 27 | Describe sources of funding for the systematic review and other support (such as supply of data) and role of funders for the systematic review |

Fig. 9.10 PRISMA four-phase flow diagram (Reproduced with permission from Moher et al. [32])
nonadherence to proper searching strategies, lack of statistical rigor, and introduction of bias (intentional or unintentional) in which studies were "cherry picked" to suit the personal agenda of the reviewer/analyst. Unfortunately, not all of the limitations can be minimized by strict methodology. A fundamental limitation of meta-analysis, specifically, is that it comprises studies performed under different protocols and at different times; for purposes of the analysis, it is assumed that the differences in protocol and study design of the elements are obviated by the large number of observations ultimately available. This assumption is highly questionable. As noted above, if clinical and methodological diversity across studies is such that substantial heterogeneity is determined, it may be better not to combine them in a meta-analysis (if a meta-analysis is performed under these circumstances, as noted earlier, it should be considered for hypothesis generation only). In addition, the increased power gained by pooling the results of individual studies that is advantageous for decreasing type II errors also may allow small biases to be interpreted erroneously as an effect, increasing type I errors. (Again, see Chaps. 10 and 11 for further elaboration of these fundamental concepts.) On occasion, the same dataset may be published multiple times, making the results not independent. If this is not recognized, the dataset will be weighted more than once in the analysis, artificially inflating the results. Finally, the results and conclusions of a systematic review or meta-analysis are only as reliable as the methods used in each of the primary studies. The methodology used for their qualitative or quantitative synthesis does not compensate for flaws or errors in the individual primary studies.
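
The warning about duplicate publication can be illustrated numerically. The sketch below pools three invented studies with an inverse-variance fixed effects model (an assumed weighting scheme, chosen only for simplicity) and then repeats the calculation with the first study entered twice, as would happen if a second report of the same dataset went unrecognized. All study values are hypothetical.

```python
import math

def pool_fixed(effects, variances):
    """Inverse-variance fixed effects pooled estimate and standard error."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    return pooled, math.sqrt(1.0 / total)

def ci95(estimate, se):
    """Approximate 95% confidence interval."""
    return estimate - 1.96 * se, estimate + 1.96 * se

# Hypothetical log odds ratios and variances for three independent trials.
effects = [-0.50, -0.10, 0.05]
variances = [0.04, 0.09, 0.02]

# The same data with the first trial mistakenly counted twice.
effects_dup = effects + [effects[0]]
variances_dup = variances + [variances[0]]

est_u, se_u = pool_fixed(effects, variances)
est_d, se_d = pool_fixed(effects_dup, variances_dup)
lo_u, hi_u = ci95(est_u, se_u)
lo_d, hi_d = ci95(est_d, se_d)

print(f"unique studies : {est_u:.3f} (95% CI {lo_u:.3f} to {hi_u:.3f})")
print(f"with duplicate : {est_d:.3f} (95% CI {lo_d:.3f} to {hi_d:.3f})")
```

With these made-up numbers, double counting pulls the pooled estimate toward the duplicated trial and narrows the interval enough that it no longer includes zero, even though no new patients were studied. This is the sense in which an unrecognized duplicate artificially inflates the results.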
Take-Home Points

• For clinicians to make informed decisions for patient management and research, they must analyze multiple studies for quality and relevance to the population of interest.
• Secondary sources of information (especially systematic reviews and meta-analyses) help to summarize and reconcile conflicting studies in the literature.
• By explicitly stating how evidence was found, selected, and evaluated, systematic reviews eliminate many of the biases inherent in narrative reviews.
• Meta-analysis uses statistical methodology to combine results of several related studies, which affords greater statistical power than that of individual studies.
• Although systematic reviews and meta-analyses are retrievable via traditional online literature search engines, a variety of databases are available that specialize in them.
• To construct a quality systematic review, one should formulate a clear question, define a comprehensive yet efficient literature searching strategy, include all appropriate studies, summarize results, assess heterogeneity, and consider the appropriateness of pooling results of individual studies for meta-analysis.
• Caution must be exercised when conducting or interpreting a systematic review or meta-analysis to ensure inclusiveness of literature searching, optimization of statistical rigor, minimization of bias, and avoidance of inclusion of multiple publications of the same dataset.
• The results and conclusions of a systematic review or meta-analysis are only as reliable as the methods used in each of the primary studies; their synthesis does not compensate for errors of methodology in the individual primary studies.
• Meta-analyses, constructed as they are of multiple nonidentical studies, must be viewed as a hypothesis-generating rather than a hypothesis-testing tool, especially if major methodological differences or heterogeneity among their components is detected.
References

1. Lau J, Antman EM, Jimenez-Silva J, Kupelnick B, Mosteller F, Chalmers TC. Cumulative meta-analysis of therapeutic trials for myocardial infarction. N Engl J Med. 1992;327:248–54.
2. Collins JA, Fauser BCJM. Balancing the strengths of systematic and narrative reviews. Hum Reprod Update. 2005;11:103–4.
3. Pearson K. Report on certain enteric fever inoculation statistics. Br Med J. 1904;3:1243–6.
4. Glass GV. Primary, secondary, and meta-analysis of research. Educ Res. 1976;5:3–8.
5. Sacks HS, Berrier J, Reitman D, Ancona-Berk VA, Chalmers TC. Meta-analyses of randomized controlled trials. N Engl J Med. 1987;316:450–5.
6. Tuckman BW. Conducting educational research. 3rd ed. New York: Harcourt Brace Jovanovich; 1972.
7. Wilczynski NL, Haynes RB, Lavis JN, Ramkissoonsingh R, Arnold-Oatley A, HSR Hedges Team. Optimal search strategies for detecting health services research studies in MEDLINE. CMAJ. 2004;171:1179–85.
8. Oxman AD, Sackett DL, Guyatt GH. Users' guides to the medical literature. I. How to get started. The Evidence-Based Medicine Working Group. JAMA. 1993;270:2093–5.
9. Fact Sheet. Medical Subject Headings (MeSH). U.S. National Library of Medicine. http://www.nlm.nih.gov/pubs/factsheets/mesh.html. Accessed 16 Aug 2011.
10. Yu CH, Beattie WS. The effects of volatile anesthetics on cardiac ischemic complications and mortality in CABG: a meta-analysis. Can J Anaesth. 2006;53:906–18.
11. Centre for Reviews and Dissemination. Systematic reviews: CRD's guidance for undertaking reviews in healthcare. York: University of York NHS Centre for Reviews & Dissemination; 2009.
12. Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJM, Gavaghan DJ, McQuay HJ. Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials. 1996;17:1–12.
13. ISIS-2 (Second International Study of Infarct Survival) Collaborative Group. Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17,187 cases of suspected acute myocardial infarction: ISIS-2. Lancet. 1988;2:349–60.
14. Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 10 Nov 2003;3:25. http://www.biomedcentral.com/1471-2288/3/25. Accessed 16 Sep 2011.
15. Deeks JJ, Bossuyt PM, Gatsonis C, editors. Cochrane handbook for systematic reviews of diagnostic test accuracy version 1.0. The Cochrane Collaboration, 2010. http://srdta.cochrane.org/. Accessed 16 Sep 2011.
16. Fagan TJ. Letter: nomogram for Bayes' theorem. N Engl J Med. 1975;293:257.
17. Stewart LA, Tierney JF. To IPD or not to IPD? Advantages and disadvantages of systematic reviews using individual patient data. Eval Health Prof. 2002;25:76–97.
18. Matt GE, Cook TD. Threats to the validity of research synthesis. In: Cooper H, Hedges LV, editors. The handbook of research synthesis. New York: Russell Sage; 1994.
19. Lewis S, Clarke C. Forest plots: trying to see the wood and the trees. BMJ. 2001;322:1479–80.
20. Brandler E, Paladino L, Sinert R. Does the early administration of beta-blockers improve the in-hospital mortality rate of patients admitted with acute coronary syndrome? Acad Emerg Med. 2010;17:1–10.
21. L'Abbé KL, Detsky AS, O'Rourke K. Meta-analysis in clinical research. Ann Intern Med. 1987;107:224–33.
22. Easterbrook PJ, Berlin JA, Gopalan R, Matthews DR. Publication bias in clinical research. Lancet. 1991;337:867–72.
23. Egger M, Smith DG, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997;315:629–34.
24. Rosenthal R. The file drawer problem and tolerance for null results. Psychol Bull. 1979;86:638–41.
25. Cooper HM. Statistically combining independent studies: a meta-analysis of sex differences in conformity research. J Pers Soc Psychol. 1979;37:131–46.
26. Orwin RG. A fail-safe N for effect size in meta-analysis. J Educ Stat. 1983;8:157–9.
27. Cohen J. Statistical power analysis for the behavioral sciences. New York: Academic; 1969.
28. Monami M, Bigiarini M, Rotella CM, Mannucci E. Inaccuracy in meta-analysis on rosiglitazone and myocardial infarction. Nutr Metab Cardiovasc Dis. 2011;21:e78. Epub 2010 Dec 25.
29. Claggett B, Wei JL. Analytical issues regarding rosiglitazone meta-analysis. Arch Intern Med. 2011;171:179–80; author reply 180.
30. Moher D, Cook DJ, Eastwood S, Olkin I, Rennie D, Stroup DF. Improving the quality of reports of meta-analyses of randomized controlled trials: the QUOROM statement. Lancet. 1999;354:1896–900.
31. Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JP, Clarke M, Devereaux PJ, Kleijnen J, Moher D. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. Ann Intern Med. 2009;151:W65–94.
32. Moher D, Liberati A, Tetzlaff J, Altman DG, The PRISMA Group. 2009. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.1000097. Accessed 14 Sep 2011.
33. Moher D, Tetzlaff J, Tricco AC, Sampson M, Altman DG. Epidemiology and reporting characteristics of systematic reviews. PLoS Med. 2007;4:e78. doi:10.1371/journal.pmed.0040078.
Sampling Methodology:
Implications for Drawing 10
Conclusions from Clinical
Research Findings
Richard C. Zink
study sample is important to reduce bias, the

Introduction difference between what our sample tells us about
the population and the truth [1]. This chapter will
It is our inherent curiosity that drives us to under- address these topics and provide insight into the
stand the world around us. We design experi- concept of sampling.
ments, observe and collect data, and perform Throughout this chapter, we will illustrate
analyses in the hopes that our ndings will pro- many concepts using the clinical development
vide insight into the problem at hand. Perhaps the program for entecavir. Entecavir is an antiviral
most difficult subject to understand is the human being. Our bodies are extraordinarily complex, affected by our environment, diet, physical conditioning, and genetic background. Further, research involving human beings is guided by ethical principles that limit our means of investigation. We also live complicated lives, making it difficult to add study participation to our busy schedules. Finally, people are at times unreliable, forgetting to take medications, not exercising consistently, or failing to attend scheduled study visits. Ideally, all of these factors should be considered in the design and reporting of any clinical investigation.

In clinical research, we obtain conclusions that (hopefully) address the study hypothesis. These findings certainly apply to the sample of individuals under study. However, we generally wish to extend our conclusions beyond our study to the larger population. Careful selection of the

R.C. Zink, PhD
JMP Life Sciences, SAS Institute, Inc, SAS Campus Drive, Cary, NC 27513, USA
e-mail: richard.zink@jmp.com

agent indicated for the treatment of chronic hepatitis B infection, a disease that ultimately can lead to liver cirrhosis and hepatocellular carcinoma. The goal of any effective medication against the hepatitis B virus (HBV) is to reduce and suppress viral load, while simultaneously limiting the possibility of viral mutation which can lead to the reemergence of the virus due to drug resistance [2–4]. We refer to two phase III clinical trials by Chang et al. [5] and Lai et al. [6] comparing entecavir to lamivudine, the standard of care at the time, in the treatment of antiviral-naïve subjects with one of two subtypes of HBV, hepatitis B e antigen-positive (HBeAg+) or hepatitis B e antigen-negative (HBeAg−) disease. Though the primary endpoint of both trials was histologic improvement, reduction of viral load was a key secondary endpoint and an important factor for long-term liver health [4]. We will refer to viral load, the count of the number of viral particles per milliliter of blood, throughout the remainder of this chapter.

A second example is employed to illustrate some of the more complex sampling designs. Data on diarrheal disease was examined from four prevalence surveys in Africa and Asia to
determine the levels of household and village clustering [7]. The goal of the research was to identify the risk factors and understand the patterns of disease transmission. Careful study of factors at the household and village level ultimately could lead to an optimal strategy for intervention. Within each survey, villages were randomly selected for inclusion into the study, and all households within each village that had at least one child within the appropriate age range were included.

Populations and Samples

As noted in Chap. 2, all research begins with a question. For example, in developing a new antiviral for chronic hepatitis B infection, we could ask whether entecavir is more efficacious than lamivudine. Here, we might assume that our population of interest is all individuals with chronic HBV infection. However, most clinical trial protocols are written with a number of inclusion or exclusion criteria in order for subjects to participate. For example, subjects generally need to be of a certain age with a well-defined and specific disease severity, and we may wish to focus on a particular subtype of the disease, such as those positive or negative for HBeAg [5, 6]. Further, subjects with other coexisting diseases or medications that may interfere or complicate interpretation of the results, or that could pose an unreasonable safety risk, would be excluded from participation in the trial. Therefore, it is more appropriate to define our population as all individuals with chronic hepatitis B infection meeting the inclusion and exclusion criteria of the study. We can refer to the larger population and those eligible for the study as the "population with the condition" and the "study population," respectively [8].

For reasons of time and money, it is generally impractical to consider the entire study population to address the research hypothesis. Money is an obvious limitation, but time can be an important factor as well, as the disease under study may naturally change over time. For example, antiviral treatment can lead to mutations that enable resistance to therapy [2, 3]. Minimizing heterogeneity of the disease is important for careful study and can be accomplished using the study inclusion/exclusion criteria and designing studies of appropriate sample size and duration. Due to the above limitations, a sample of individuals is selected from the study population. Data are collected from this sample; summary statistics, confidence intervals, and statistical tests are computed, and conclusions are generated. Inferences about the study population are made from the sample findings, and the quality of this inference is related to how representative the sample is to the study population.

Suppose, for example, that there was interest in estimating the average viral load for subjects chronically infected with HBV meeting study entry criteria (our study population). Typically, viral load is measured on the $\log_{10}$ scale since values often are skewed to the right (i.e., there is a long tail of large viral load counts). The $\log_{10}$ transformation is applied to make the viral loads appear more normally distributed. For the study population, the average $\log_{10}$(viral load) is denoted by $\mu$, and the spread of $\log_{10}$(viral load) from this average value is represented by $\sigma$, so that roughly 95% of the values are within $\mu \pm 2\sigma$. The unknown parameters $\mu$ and $\sigma$ are referred to as the mean and standard deviation of $\log_{10}$(viral load) in the study population, and if normally distributed, we can describe the distribution of $\log_{10}$(viral load) values as $N(\mu, \sigma^2)$.

If we select a sample of size $n$ from the study population, we can use the sample mean $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ and sample variance $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ as estimates for $\mu$ and $\sigma^2$, respectively. The sample mean $\bar{x}$ will be distributed $N(\mu, \sigma^2/n)$, and we can use this fact to compute confidence intervals and hypothesis tests to generate inference for the population mean $\mu$. Figure 10.1 plots several normal distributions for the sample mean of $\log_{10}$(viral load) for varying sample sizes with $\mu = 9.6$ and $\sigma = 2$ (similar to summary statistics from Lai et al. [6]). Note that as the sample size $n$ increases, the
distribution of $\bar{x}$ becomes narrower, allowing for more precise inference for describing $\mu$, the average $\log_{10}$(viral load) of the study population. We will return to the discussion of sample size later.

Fig. 10.1 Plot of normal distributions

How then does one choose a sample? Samples can be selected by probabilistic or nonprobabilistic means. Probabilistic or random sampling assigns each individual in the study population a chance of being selected into the sample. Based on some random process, which typically is performed using computer software or from tables of random numbers, individuals are chosen for participation in the sample. Of course, participation in the study is always a choice left to the individual, and nonresponse should be considered when interpreting the study results since if nonresponders vary systematically from those who participate in the study, we will obtain a biased estimate of the population parameters. A random sample generally will provide a representative snapshot of the characteristics known to be important or influence the outcome, so that estimates from this sample are representative of the study population. An additional benefit of random sampling is that the sample should achieve an accurate representation of factors that may be unknown to have an impact on the outcome or those factors that are not being measured for the current study. However, one difficulty of random sampling is enumerating all of the individuals in the study population, the sampling frame.

Nonprobability or nonrandom sampling, as the name implies, does not have this element of randomness in selecting the study sample. Individuals may be selected based on convenience or based on certain characteristics they exhibit. Nonprobability sampling usually is employed when it is not possible or practical to identify every individual within the study population to assign them a chance of entry into the sample. Though straightforward to apply, nonprobability sampling does raise a concern, namely, that important characteristics of the study population may not be represented in the sample. Therefore, any conclusions reached in the sample may provide a biased or misleading representation of the study population. Returning to our example above, if a physician was interested in the average viral load for subjects with HBV and selected only the sickest individuals for study
with no mention of this in the protocol, the results would be an overestimate of the average viral load for the study population.

Despite the benefits of randomness, random sampling provides no guarantee of correct inference from the sample to the study population. It is entirely possible to generate a sample from the study population consisting of extreme values that are not a reflection of the typical response. As noted in Chap. 11, two types of errors in inference can occur in computing a confidence interval or performing a hypothesis test on a sample of data. In testing the null hypothesis of no difference in mean $\log_{10}$(viral load) between the HBV study treatments, $H_0: \mu_E = \mu_L$, versus the alternative hypothesis that a treatment difference exists, $H_A: \mu_E \neq \mu_L$, a type I error occurs if we reject the null hypothesis based on the sample data when $\mu_E = \mu_L$ is true for the population. A type II error occurs when we fail to reject the null hypothesis when the null hypothesis is false. In the context of our clinical trial example, a type I error could lead one to conclude that entecavir had better efficacy than lamivudine, when in actuality there is no difference between the two antivirals. A type II error would have the sponsor conclude that the two antivirals have similar efficacy, when entecavir is the more potent drug.

Sample Size

As further discussed in Chap. 11, the probability of making a type I or type II error is referred to as $\alpha$ and $\beta$, respectively. Typically, the sample size for a clinical trial is chosen to minimize the probability of these errors occurring, subject to available resources. Appropriate values for $\alpha$ and $\beta$ depend on the scientific question at hand, but typical practice in clinical trials has $\alpha$ fixed at 0.05, with $\beta$ chosen between 0.1 and 0.2. Alternatively, we could choose sample size to maximize the probability $1 - \beta$, which is called power. Power is the probability of rejecting the null hypothesis, given the null hypothesis is false, and powering a study means allocating sufficient sample size to have a high likelihood of rejecting the null hypothesis in favor of the specified alternative. Formulae exist for many types of sample size calculations, and though we do not present any here, entire books have been devoted to the subject [9].

Maximizing power is one way of choosing the size of a sample, but it is by no means the only method. Sample size can be chosen to achieve a certain level of precision in the parameter estimates. This particular type of sample size calculation is often used in oversampling, when we purposefully select a higher proportion of a particular kind of subject in the sample than exists in the population. For example, the two phase III trials discussed in this chapter are predominately male (approximately 75%). Suppose these gender rates are reflective of the true population of subjects in the study population. If we wanted to estimate a treatment effect between these two antivirals with a particular precision for females, we could include a higher proportion of women in our study sample. In this scenario, our overall treatment effect could be biased if gender has an important impact on the characteristics of the disease. To obtain an unbiased estimate, we could employ survey weights to downplay the contribution of females to obtain overall estimates for the various endpoints that are reflective of the study population.

A final comment on sample size is worth mentioning in the conduct of clinical trials. While it is important to have sufficient sample size to have a representative sample and achieve high levels of power for testing the null hypothesis, the trial designer should realize that every subject enrolled in the trial potentially is exposed to some unknown safety risk attributable to the medications under investigation. Therefore, it is of paramount importance that the trial designers study enough subjects to achieve their goals, without exposing unnecessary additional individuals to an experimental treatment with an unknown or limited safety profile.

Probability Sampling

As alluded to above, probability sampling identifies the individuals within the study population and assigns every subject a chance of being selected into the sample. The easiest method of selecting a sample assigns every individual the
same chance of being selected into the study. This method of sampling subjects is referred to as simple random sampling, and within clinical research, it generally is performed without replacement. Without replacement sampling implies that once a particular subject is selected for inclusion into the study, the subject is not returned to the pool for further sampling. The practical implication of this approach is that each subject is counted exactly once within a single clinical investigation. In contrast, with replacement sampling returns the sampled observation to the study population for further sampling.

While simple random sampling is straightforward to apply, it does have some disadvantages. First, sampling from particularly large populations can be cumbersome since it requires enumeration of all possible subjects to define the sampling frame. Such data may not exist or could be expensive to generate. Second, while we expect the average sample to be representative of the population, it is possible to generate a sample where important characteristics related to the study outcome are under- or overrepresented by random chance. These deficiencies can be addressed using methods described below.

In stratified random sampling, mutually exclusive subcategories (strata) of the study population are defined prior to sampling. Then, within each stratum, a separate random sample is selected. By defining the sampling scheme in this manner, it is possible to maintain the appropriate proportions of important disease characteristics within the study sample. Suppose, in lieu of the two separate clinical trials for HBeAg+ and HBeAg− subjects described above, sufficient funding was available for only a single study to obtain an overall estimate of $\log_{10}$(viral load) for the study population of HBV subjects. Further, suppose that HBeAg− disease accounts for roughly one-third of all HBV infection [6]. If we applied simple random sampling to select subjects from the study population, we could by random chance obtain a sample where the proportion of HBeAg− subjects differs substantially from 33%. Since the $\log_{10}$(viral load) of HBeAg− subjects tends to be lower than HBeAg+ subjects [6], the overall estimate of viral load would be biased for the study population, and the magnitude of this bias would depend on how far the proportion of HBeAg− subjects in the sample is from one-third of the total sample. However, employing a stratified random sampling scheme, we can select separate samples from each stratum such that 67% and 33% of the total sample size come from HBeAg-positive and HBeAg-negative subjects, respectively. Thus, we maintain the appropriate proportions of this important disease characteristic within our sample.

Stratified random sampling has a number of additional benefits. First, stratified sampling can lead to more efficient statistical testing through a reduction in the variability of the sample estimates. Second, distinct methods for sampling can be employed within each of the strata. For example, individuals located in more populated areas may be sampled at the individual level, while subjects in more remote areas might be sampled as part of a cluster (described below) [10]. While the aforementioned example could have financial benefits in sampling distant individuals, it actually may be necessary due to the nature of the information available to define the sampling frame within each stratum.

Though stratified random sampling is advantageous, there are a number of difficulties associated with its use. First, it is possible to stratify only for characteristics known to influence the disease in question, and the ability to identify these characteristics quickly and easily is important for generating the sample. Second, if there are multiple endpoints under investigation, it may be difficult to select strata that are beneficial for every endpoint. Stratification can result in efficient statistical testing when the strata are correlated with the outcome of interest (such as HBeAg status and viral load). However, strata that do not have this property may contribute to additional complexity and cost in the study design.

If it is possible to order a sampling frame, a systematic random sample can be generated by selecting every $k$th value in the list after randomly selecting a starting observation. Sampling proceeds in this manner until the required sample size is obtained. One benefit of systematic random sampling is that it can naturally account for the presence of strata, by sorting the frame by the stratification variables. However, an important drawback of systematic sampling can occur if the
sampling frame has periodicity present. For example, suppose we attempt to replace our two phase III trials of HBeAg-positive and HBeAg-negative subjects with a single study. Further, suppose that the sampling frame is ordered such that positive and negative subjects alternate within the list. Choosing an even value for $k$ would result in a sample that was either entirely HBeAg-positive or entirely HBeAg-negative. Though this is an extreme example, it illustrates the importance of understanding how the sampling frame is ordered prior to sampling.

The methods described above assume that a sampling frame exists for the selection of individual subjects. However, it is often difficult or expensive to generate such lists, or such information may not readily be available. One alternative to selecting at the subject level is to randomly select groups or clusters of observations for study, an approach known as cluster random sampling. For example, as described in a study of diarrheal disease in Africa and Asia [7], villages were randomly selected for inclusion into the samples of four separate population surveys. Once a village was selected, all households within the village that met the study criteria were included in the sample. A benefit of sampling clusters of observations is that it can simplify the data collection process. For example, in a situation where hundreds of villages may exist, randomly selecting villages reduces the number of villages to which it may be necessary to travel. A simple random sample of subjects across all villages may require traveling to a majority of the villages to collect the necessary information. However, one downside of cluster sampling is that it typically requires a larger sample size than a simple random sample to obtain the same power or precision of sample estimates. This is because individuals within clusters tend to be more alike than individuals across clusters, and this often leads to an increase in the variability of the estimated parameters. In the example above, the reduced travel costs may more than make up for the additional subjects needed for study.

By employing the selection of clusters of observations, it is possible to refine the above design for diarrheal disease into a multistage sampling design. Suppose that many of the villages or townships were large and that sufficient information was available to describe each household within each village. In the first stage, we could randomly select villages. In the second stage, we could select a random sample of households from within each sampled village and include every individual within the chosen households meeting study criteria. Another option would be to apply a simple random sample of individuals within each of the randomly selected villages, but this approach would rely on each village having a list of all its citizens. To further complicate the design, stratification could be applied to allow for different sampling schemes within each stratum (the four countries described in the manuscript could be considered strata). Ultimately, there is no one-size-fits-all solution to define an appropriate sampling scheme. Based on the available information, study design is a careful balance of costs, statistical efficiency, and operational complexity.

An important note about selecting clusters: Cluster sizes may vary greatly, and as noted above, it is quite reasonable to expect that individuals are more similar within the cluster than between clusters. Because of this, it may be more appropriate to select clusters with probability proportional to the size of the cluster. For example, suppose that our population consisted of five villages with 100, 150, 200, 250, and 300 inhabitants and that we select one village to generate an estimate for the subjects of all villages. If we sample the smaller village of 100 inhabitants, the estimate of our endpoint may not fully reflect the individuals within the larger villages. Rather than give each village an equal likelihood of being in the sample (in this case, 1/5 or 20%), we can define the selection probability for each cluster as equal to the total size of the village divided by the total population of all villages combined. For our example, the villages would be selected with probabilities 100/1,000, 150/1,000, 200/1,000, 250/1,000, and 300/1,000 or 10%, 15%, 20%, 25%, and 30%, respectively. By sampling in this manner, we give larger clusters a greater chance of being selected into the sample, though this choice also increases the expected sample size of the study.
Nonprobability Sampling

Recall, nonprobability sampling does not employ random selection; it often is chosen when it is not practical to identify every individual within the study population to assign them a chance of entry into the sample. Though these sampling methodologies are frequently employed, particularly in the preliminary stages of research, great care should be used before applying the study findings to other related groups of individuals.

A commonly used method is convenience sampling. In short, the researcher enrolls subjects that are readily available. Perhaps, for the purposes of a graduate research project, the researcher will sample subjects from classes he/she attends, or from a frequently visited location, such as the student center, to describe a larger population of students. The benefit of convenience sampling is that it is cost-effective and that subjects are typically plentiful and readily available. Should the researcher have entry criteria that perhaps limit the number of subjects from whom data can be collected, snowball or chain sampling may be employed (see also Chap. 8). In this particular sampling design, additional subjects are recruited from the friends, family, and acquaintances of the individuals the researcher identifies. In this way, the researcher is able to readily identify additional people with the necessary characteristics for inclusion into the study, often with additional aid from those currently under study.

If it is necessary to divide the population into mutually exclusive groups (like strata) and recruit a specified number of each type, this is referred to as quota sampling. A typical example is recruiting sufficient males or females to evaluate gender-related differences. Unlike the examples above, the investigator has identified a characteristic that may be important to the endpoints of interest. Judgment sampling chooses individuals based on the knowledge or expertise of an expert, while extreme or deviant case sampling selects individuals that are particularly notable to learn about a particular topic or phenomenon.

Some limitations of nonprobability sampling are evident. Certain methods are purposefully biased, such as deviant case sampling. While such a method may provide interesting examples for the purposes of illustration, the sampled subjects should not be interpreted as representing the average subject in the study population. The potential bias of other methods may be less obvious. If the goal of the researcher is to describe characteristics of the student body, selecting individuals from her classes may not provide an accurate representation of individuals with majors that differ substantially from the researcher. Sampling from the student center may not include students whose courses are typically located at a distance, or it may underrepresent students who live off campus or attend part-time. One could argue that the appropriate study population consists of students who visit the student center. Even so, other important factors could arise based on how sampling is performed. For example, if the researcher obtains samples every morning prior to classes, she could miss an entire group of individuals that visit the student center later in the day. These concerns may or may not be important to the final conclusions of the study; however, it is important to consider such factors when designing a study and interpreting results.

Returning to Our Clinical Trial Example

After trumpeting the importance of random sampling, it may be somewhat surprising to the reader to hear that clinical research often is conducted using nonprobability sampling methods. For our clinical trial example, we suggested that the study population be defined as subjects with the appropriate disease characteristics meeting the eligibility requirements of the trial. Ideally, to describe this study population, a random sample of subjects would be selected for participation into the study. However, this typically is not the manner in which subjects are enrolled in a clinical trial. In general, a pharmaceutical company identifies clinicians who are interested in participating in research and may have access to patients appropriate for the study. Of particular importance is identifying not only clinicians with the expertise to aid in the design of the
trial (e.g., which tests to perform, what endpoints to measure, and the appropriate disease characteristics for inclusion and exclusion criteria) but also those influential persons whose participation may entice other physicians to become involved (and eventually write prescriptions).

The clinician may have a number of patients who regularly attend his or her practice for disease management who could be included in the study, and in the course of his or her day-to-day job, he or she may gauge these individuals' interest in participating in a trial for a new medical therapy. Additionally, the clinician may choose to advertise the clinical trial to attract additional patients from medical practices not participating in the trial, or those individuals who may not, for one reason or another, have routine exams with their doctor. Should the patients meet eligibility criteria and consent to study procedures, they would be randomized to one of the available treatment arms of the study.

However, as the statistician Senn points out, clinical trials are concerned with the comparative inference of the drugs under study among the subjects under study; rarely are they concerned with being representative of the study population [11]. In other words, the primary goal of the trial is to illustrate the effectiveness of a new medication against concurrent and comparable controls. Subjects are randomized to treatments to minimize bias in measuring the treatment effect since on average over all randomizations, the treatment groups would be considered equal at baseline. This is not to say that representative samples are never used within clinical research, but how a researcher samples an individual from a population should be tied to the ultimate goals of the study. This raises important questions: How can one safely apply the results of a clinical investigation to other subjects? To what study population do the subjects ultimately belong?

Perhaps the most straightforward way of identifying the individuals to whom these results may apply is to review the table of summary statistics for baseline demographic and disease characteristics and the eligibility criteria from the study manuscript or drug label. Subjects who are quite characteristically different than described should have the study results applied cautiously. Additionally, in reviewing the study materials, consider these additional questions: Are important geographical considerations overlooked? How are subjects who did not consent to study procedures different from those who did? Are subjects who do not seek routine care sicker than those who do? What important disease features differ in subjects who cannot stop taking medications that are prohibited within the study? How are subjects who are not local to participating clinicians different than those who are? How might subjects who took the drug at the time of the trial be different from those who will take it when it becomes approved? These questions may never be answered satisfactorily, and ultimately a leap of faith may be needed to apply sample findings to the population with the condition [8].

Conclusions

We are often in such a hurry to collect and analyze data that we neglect the importance of careful study design, and how we select individuals for study is a critical feature. Through many of the examples described above, we have learned that whom we select, where and how we select them, and even at what time they are selected may have serious implications on the study findings and how they may be interpreted or applied. This is true for samples of convenience as well as for any complex probabilistic survey sample. Without knowing the characteristics of the subjects under study and how they were chosen, the researcher has an incomplete grasp of the conclusions of the study. In fact, there may be a number of shortcomings that become obvious only once the sampling scheme is understood. Finally, it is important to remember that not all studies are designed to comprehensively reflect the population with the condition. Particularly for clinical researchers and physicians prescribing new medications, it is important to first understand the subjects under investigation and how they may differ from other populations available for treatment. Understanding these key points ensures a more successful application of new knowledge.
Take-Home Points

• Generating a random sample from a population is important to minimize the bias of sample estimates in describing the population parameters.
• Despite the benefits of random sampling for generating appropriate inference, clinical research often relies instead on samples of convenience.
• Baseline characteristics and study inclusion and exclusion criteria can help identify the study population from which the sample was drawn.
• It is important to understand the factors that differ between the study sample and the larger population and the potential impact these differences may have on the conclusions of the study and how appropriate it is to apply study results to the larger population.
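The first take-home point can be checked with a small simulation. The Python sketch below is only an illustration under stated assumptions: the study population is simulated as normally distributed log10(viral load) values with mean 9.6 and standard deviation 2 (the values quoted for Fig. 10.1), and the population size, sample size, number of repetitions, seed, and function names are all hypothetical choices rather than details of the cited trials.

```python
import random
import statistics

rng = random.Random(10)  # arbitrary seed, for reproducibility only

# Hypothetical study population of log10(viral load) values, N(9.6, 2^2).
population = [rng.gauss(9.6, 2.0) for _ in range(20_000)]
mu = statistics.fmean(population)


def srs_mean(pop, n, rng):
    """Mean of a simple random sample of size n drawn without replacement."""
    return statistics.fmean(rng.sample(pop, n))


def sickest_mean(pop, n):
    """Mean of the n subjects with the highest viral load (nonrandom selection)."""
    return statistics.fmean(sorted(pop, reverse=True)[:n])


n = 100
# Average the simple-random-sample estimate over many repeated samples.
srs_estimates = [srs_mean(population, n, rng) for _ in range(500)]
avg_srs = statistics.fmean(srs_estimates)
sickest = sickest_mean(population, n)

print("population mean:          ", round(mu, 2))
print("average of SRS estimates: ", round(avg_srs, 2))
print("mean of sickest subjects: ", round(sickest, 2))
```

In this sketch the simple random sample estimates center on the population mean, whereas selecting only the sickest subjects overestimates it, as described earlier for the physician who enrolls only the sickest individuals.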
References

1. Durham TA, Turner JR. Introduction to statistics in pharmaceutical clinical trials. London: Pharmaceutical Press; 2008.
2. Allen MI, Deslauriers M, Andrews CW, Tipples GA, Walters KA, Tyrrell DL, Brown N, Condreay LD. Identification and characterization of mutations in hepatitis B virus resistant to lamivudine. Hepatology. 1998;27:1670–7.
3. Liaw YF, Chien RN, Yeh CT, Tsai SL, Chu CM. Acute exacerbation and hepatitis B virus clearance after emergence of YMDD motif mutation during lamivudine therapy. Hepatology. 1999;30:567–72.
4. Liaw YF, Sung JJY, Chow WC, Farrell G, Lee CZ, Yuen H, Tanwandee T, Tao QM, Shue K, Keene ON, Dixon JS, Gray F, Sabbat J; Cirrhosis Asian Lamivudine Multicentre Study Group. Lamivudine for patients with chronic hepatitis B and advanced liver disease. N Engl J Med. 2004;351:1521–31.
5. Chang TT, Gish RG, de Man R, Gadano A, Sollano J, Chao YC, Lok AS, Han KH, Goodman Z, Zhu J, Cross A, DeHertogh D, Wilber R, Colonno R, Apelian D. A comparison of entecavir and lamivudine for HBeAg-positive chronic hepatitis B. N Engl J Med. 2006;354:1001–10.
6. Lai CL, Shouval D, Lok AS, Chang TT, Cheinquer H, Goodman Z, DeHertogh D, Wilber R, Zink RC, Cross A, Colonno R, Fernandes L. Entecavir versus lamivudine for patients with HBeAg-negative chronic hepatitis B. N Engl J Med. 2006;354:1011–20.
7. Katz J, Carey VJ, Zeger SL, Sommer A. Estimation of design effects and diarrhea clustering within households and villages. Am J Epidemiol. 1993;138:994–1006.
8. Friedman LM, Furberg CD, DeMets DL. Fundamentals of clinical trials. 4th ed. New York: Springer; 2010.
9. Chow SC, Wang H, Shao J, editors. Sample size calculations in clinical research. 2nd ed. Boca Raton: Chapman & Hall/CRC; 2008.
10. Kish L. Survey sampling. New York: Wiley; 1995.
11. Senn S. Statistical issues in drug development. 2nd ed. Chichester: Wiley; 2007.
Introductory Statistics in Medical
Research 11
Todd A. Durham, Gary G. Koch,
and Lisa M. LaVange
T.A. Durham, MS
Axio Research, LLC, 2601 4th Avenue, Suite 200, Seattle, WA 98121, USA
e-mail: todd.a.durham@gmail.com

G.G. Koch, PhD • L.M. LaVange, PhD
Department of Biostatistics, University of North Carolina at Chapel Hill Gillings School of Global Public Health, Chapel Hill, NC, USA
e-mail: bcl@bios.unc.edu; Lisalavange@yahoo.com

Introduction

Statistics is a discipline concerned with the collection, analysis, and interpretation of quantitative information. As discussed by Senn (2003) [1], modern statistics and probability in medical research date back to the 1700s and include investigations into the sex ratio at birth, causes of death, and an early clinical trial for scurvy. Since the early 1900s, there have been significant developments in statistical methodology and an increasing number of applications for statistics in medical research. Statistical methods are applicable to a wide range of medical investigations including case-control studies, cohort studies, and therapeutic clinical trials. As discussed in other chapters, each of these types of studies has its own strengths and limitations, which should be considered when interpreting the results obtained from them.

The objective of this chapter is to introduce readers to some common statistical methods encountered in medical research, with selected examples for illustration. The presentation of technical details has purposely been kept to a minimum. It is an ambitious objective to cover a wide range of statistical topics in a single chapter. Given the brief coverage of this material, suggested reading is provided for readers to delve into particular topics of interest. General references that cover similar material in this chapter include Campbell et al. (2007) [2], Durham and Turner (2008) [3], Bowers, House and Owens (2006) [4], Schork and Remington (2000) [5], and Woolson and Clarke (2002) [6].

Descriptive Statistics and Exploratory Data Analysis

All studies are limited in size for reasons of time, money, or logistics and are conducted on a sample of subjects, a subset of a larger population of subjects of interest. In accordance with a study's objectives, a number of characteristics are measured for individual subjects that comprise the study sample, and these characteristics typically vary from subject to subject (e.g., age or weight). In statistics, any characteristic which varies among subjects is a variable, and it becomes a random variable through its representation in a sample. Random variables may be quantitative (e.g., height in inches) or qualitative (e.g., gender) and are symbolically represented as X. Realized values of a random variable are called observations. There are a number of ways for researchers to characterize a group of individual
observations (e.g., age values of participants in a medical study), such as the relative frequency of each value, the typical value in the group, and the extent to which individual observations vary from subject to subject. When statistical techniques are used simply to summarize random variables from the sample, the results obtained from them are said to be descriptive statistics.

The number of observations which comprise a sample is referred to as the sample size, denoted by the symbol $n$. One way to describe a sample of observations with respect to a quantitative random variable is to display the frequency of values of the random variable graphically. One type of graphical display is the frequency histogram. A histogram is constructed by:
• Defining 3–10 mutually exclusive (nonoverlapping) categories of equal width for the variable of interest.
• Tabulating the number of observations that fall into each category.
• Calculating the relative frequency of observations in each category as the count of observations in each category divided by the sample size.
• Displaying a bar for each category contiguously on the x-axis with a bar height equal to the relative frequency of each category on the y-axis. Bars are typically centered about the midpoint of the interval.

A sample histogram of 100 ages from a clinical trial is provided in Fig. 11.1. By examining a frequency histogram, one can see which values of the variable are more or less common. A graphical representation or mathematical expression of the relative frequency of values of a random variable is referred to as a distribution. For the construction of a histogram, the width of each category should be the same for all categories. However, it is important to note that the number of categories used can affect the shape of the distribution. Care should be taken so that valuable information is not lost through the use of too few categories. If the overall sample size is large, more than 10 categories may be used.

Fig. 11.1 Histogram of the relative frequency of 100 age values

Histograms are useful since they enable one to inspect the shape of the distribution. Distributions which have more values in the middle and fewer values on the extremes are said to be unimodal, and they are symmetric when the extremes have similar representation. Distributions which have more values at one extreme than at the middle or the opposite extreme are said to be asymmetric or skewed. As evident in Fig. 11.1, the most common age values are in the category of 60–69 (midpoint of 64.5). There were very few
observations with ages less than 50 or more than 79. For Fig. 11.1, the shape of the distribution of age values in the sample is reasonably symmetric.

Identification of a typical value or a measure of central tendency from the sample is frequently of interest. There are a number of measures of central tendency, some of which include the arithmetic mean, the median, and the mode. The arithmetic mean, typically referred to simply as the mean, is calculated as the sum of the individual values (indexed by the subscript $i$ below) divided by the sample size:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

The mean is always defined for a sample of numeric values. If the relative frequency of values or the distribution is at least somewhat symmetric, the sample mean is a reasonable choice as a measure of central tendency. One disadvantage of the mean is sensitivity to extreme values. For example, heart rate values of 60, 61, 63, 58, and 98 beats per minute have a mean of 68, which poorly represents a typical value.

When there are a few extreme values or a skewed distribution, the median can be a more appropriate measure of central tendency. The median is the middle value after all observations have been sorted from lowest to highest. If the sample size is odd, the median is the ((n + 1)/2)th value after sorting (e.g., the third largest value from a sample of 5). If the sample size is even, the median is calculated as the mean of the two middle values, the (n/2)th value and the ((n/2) + 1)th value. For example, if there are 20 observations in a sample, the median is calculated as the mean of the 10th and 11th values. The median is always defined for a sample of numeric values.

Another measure of central tendency is the mode, defined as the most common value. The mode is 1 among the following rating scores: 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, and 3. If all values of the random variable are unique, the mode is not defined. However, if there are multiple values which are equally as common, there is not one mode, but multiple modes. The modes are 43 and 81 for the following ages of participants in a clinical trial: 37, 39, 43, 43, 56, 57, 63, 81, 81, and 85.

For quantitative variables, the mean is the preferred measure of central tendency if the distribution is relatively symmetric. If the distribution is asymmetric, the median and mode are appropriate measures. For qualitative variables (e.g., gender), the mode is the most appropriate measure of central tendency.

In addition to measures of central tendency, the extent to which values of a characteristic vary from observation to observation, i.e., the dispersion or variety of values, is also of interest. If two groups from a study have similar mean numbers of lesions, but one group has more variation in the number of lesions across subjects, one may suspect that the two groups are different in some way. There are a number of measures of dispersion, and the appropriate choice among them depends on what one would like to say about the variation and, to some extent, on the shape of the distribution (i.e., symmetric vs. skewed). All measures of dispersion are nonnegative, and dispersion of zero indicates no variation in the random variable from observation to observation.

The simplest measure of dispersion is the range, defined as the difference between the maximum and minimum values. Quartiles can also be used to describe dispersion. Just as the median represents the middle value, through the value below and above which approximately 50% of the values lie, the 25th percentile (or first quartile) is the value below which approximately 25% of the values lie. Similarly, the 75th percentile (or third quartile) is the value below which approximately 75% of the values lie. The interquartile range, another measure of dispersion, is defined as the difference between the third and first quartiles. It also represents the dispersion of values that encompasses the middle 50% of values.

A graphical display which features a number of measures of central tendency and dispersion is a box plot. There are a number of different types of box plots, but typically box plots are used to display the values of the mean, median, 25th percentile, and 75th percentiles. Extreme values may also be plotted and often include the
minimum and maximum values. Box plots can be displayed side by side for the comparison of characteristics of distributions among levels of the experimental factor (e.g., cases and controls, treatments in a clinical trial, or time points in an observational study). Box plots of age values for males and females in a cohort study are displayed in Fig. 11.2. In this figure, the 25th percentile and 75th percentile are represented by the box, the median by the line bisecting the box, the mean by the crosses, and the minimum and maximum value by the lines extending from the box.

Fig. 11.2 Box plot of age by gender

A measure of dispersion of values about the mean is the variance. The sample variance, denoted by the symbol $s^2$, is calculated by summing squared differences between each value and the mean (to obtain a positive value), and dividing the result by $n - 1$:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

It is difficult to interpret the variance since it is expressed in squared units (e.g., age²). As a result, the square root of the variance is taken to obtain the standard deviation, which is often denoted by the symbol $s$. A small value of the standard deviation indicates that most values are close to the sample mean. A large value of the standard deviation indicates many values are far from the sample mean. Other words which are used to convey the concept of the standard deviation are spread and scale.

The coefficient of variation is a unit-less measure of dispersion, defined as the standard deviation divided by the mean:

$$CV = \frac{s}{\bar{x}}$$

The coefficient of variation is helpful when used to compare two or more random variables with regard to their dispersion. It is used as a measure of precision in assay development, but may also be used to compare dispersion between two unrelated random variables each with different scales (e.g., dispersion of heart rate vs. systolic blood pressure).

Descriptive analyses such as those described above may be used as part of a preplanned analysis (e.g., as prescribed in a study protocol or analysis plan) or for exploratory purposes. Results from exploratory analyses often generate new hypotheses to test in future research. Descriptive statistical analyses provide insight into the nature of the data, as well as provide a rationale for the statistical methods used to make inferences about the population from which the sample arose.

Estimation, Confidence Limits, and Hypothesis Testing

One important goal of statistics is to use data from a sample (e.g., a limited number of participants in a clinical trial) to draw a conclusion about a larger set of subjects, a population.
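
Before turning to inference, the descriptive measures defined in the preceding section (mean, median, mode, range, variance, standard deviation, and coefficient of variation) can be checked numerically. The following is a minimal sketch and is not part of the original chapter: it assumes Python 3.8 or later with only the standard `statistics` module, the helper name `describe` is hypothetical, and the data are the heart rate, rating score, and age values quoted in the text.

```python
import statistics


def describe(values):
    """Summarize a numeric sample with the measures defined in the text.

    `describe` is a hypothetical helper introduced for illustration only.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # square root of the sample variance
    return {
        "n": len(values),
        "mean": mean,
        "median": statistics.median(values),
        "modes": statistics.multimode(values),    # all equally common values
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # divides by n - 1
        "sd": sd,
        "cv": sd / mean,                          # unit-less
    }


# Values quoted in the text.
heart_rates = [60, 61, 63, 58, 98]
ratings = [0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3]
ages = [37, 39, 43, 43, 56, 57, 63, 81, 81, 85]

hr = describe(heart_rates)
print(hr["mean"], hr["median"])    # the value 98 pulls the mean above the median
print(describe(ratings)["modes"])  # a single mode
print(describe(ages)["modes"])     # two modes
```

Two caveats apply to this sketch. `statistics.multimode` returns every value when all values are unique, whereas the text treats the mode as undefined in that case. Quartiles and the interquartile range are omitted because software packages use several slightly different conventions for computing percentiles.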
Statistical procedures for which the aim is to make inferences about a relevant population are called inferential statistical methods. A population of interest in a clinical trial may be all patients who will ever be diagnosed with a particular viral infection. A population of interest in a cohort study may be all Americans exposed to a carcinogen in the environment. A population from a case-control study may be adults who have and have not been diagnosed with coronary heart disease. Statistical inferences about these populations are necessary since they can be used to justify important policy decisions, such as making a new medical therapy available for use or revising educational material about lifestyle changes that reduce the risk of adverse health events.

However, as noted in Chap. 10, it is not feasible to study every person who may be a member of the population. As a result, research is conducted on a small number of them, a sample, and statistical methods are used to make inferences about the population of interest. One note of caution is that the validity of the statistical inference depends not only on the appropriate use of statistics but also on the selection of an appropriate sample on which the inference will be based. A general conceptual description of inferential statistics is provided in this section.

A parameter is a quantitative characteristic from a population, the value of which is considered fixed but unknown. For a case-control study, one may be interested in the value of the population odds ratio, an estimate of the relative risk of an event. An example of a relevant population parameter from a randomized clinical trial is the difference in population mean response. Summary statistics (e.g., proportions of subjects exposed to some risk factor among cases and controls or the difference in sample means between the treated and control groups in a clinical trial) are calculated from the sample as estimates of the unknown population parameter of interest. The purpose of statistical inference is to evaluate how well a sample statistic estimates an unknown population parameter. The general process of making statistical inferences is as follows:
• Select a sample from a population of interest.
• Collect data from the sample.
• Calculate appropriate sample statistics as estimates of the population parameter.
• Make a statistical inference about the population parameter.
• Make a conclusion about the population itself.

The value of the summary statistic from a sample is called a point estimate, and it represents the estimate of the population parameter that is reasonably well supported by the sample data. If one were to repeat an experiment or study with a new sample of the same size from the same population, a different point estimate would be obtained. When each sample has an equal chance of being selected from the population, the sample is called a simple random sample.

The extent to which point estimates vary from sample to sample (of the same size) represents sampling variability and can be quantified. If one were to select a sample of size $n$ from the population of interest, calculate a sample statistic or point estimate (e.g., the sample mean), record it, and repeat the process a large number of times, the relative frequency of values of the sample statistic over all samples of size $n$ would constitute the sampling distribution of the sample statistic.

In the previous section, the term standard deviation was defined and represented the typical spread of observations about the sample mean. If one were to calculate the standard deviation of values of the sample statistics (i.e., the standard deviation of the sampling distribution), the result represents the typical spread of sample statistics about the population parameter. This quantity is known as the standard error. The standard error of an estimate is a measure of how precisely the sample statistic has estimated the population parameter or, stated another way, the extent to which use of the sample has misestimated the true population parameter. The larger the sample is, the smaller the standard error will be, indicative of less uncertainty about the population parameter. It is important to note that there is not just one standard error. For every estimator, or mathematical rule used to calculate a sample statistic, there is a standard error. The remainder of this section will
define the standard error for one estimator, the sample mean. Later sections mention standard errors for other estimators, but details of their derivation will not be included. As will be seen, the standard error of an estimate can be calculated from the sample.

Since a single point estimate from a sample will likely vary from sample to sample, a more useful way of estimating the population parameter of interest is an interval estimate, with a lower limit (LL) and an upper limit (UL). The general conceptual approach with interval estimation is to define an interval so that the proportion of random samples that enclose the parameter $\theta$ within the lower and upper limits is $(1 - \alpha)$. Using some notational shorthand, we would like to estimate values, LL and UL, such that $P(LL < \theta < UL) = 1 - \alpha$, where $P$ expresses the proportion of random samples. The lower and upper limits are random variables, the values of which depend on the point estimate, the standard error of the estimate (a measure of the error attributed to sampling), and a precision coefficient.

A precision coefficient is a measure of how consistently a sample statistic estimates the population parameter, and it is obtained from well-defined distributions of standardized random variables. To illustrate, consider a random variable that has a particular distribution known as the normal distribution. Normal distributions are symmetric about their means with a bell shape, the downward slope determined by the standard deviation. For any random variable, $X$, that has a normal distribution (mean $\mu$ and standard deviation $\sigma$), the following can be said about the probability of observing certain values of the random variable:

$$P(\mu - 1.04\sigma < X < \mu + 1.04\sigma) = 0.70$$
$$P(\mu - 1.96\sigma < X < \mu + 1.96\sigma) = 0.95$$
$$P(\mu - 2.58\sigma < X < \mu + 2.58\sigma) = 0.99$$

In other words, 70% of values are within 1.04 standard deviations of the mean; 95% of values are within 1.96 standard deviations of the mean; and 99% of values are within 2.58 standard deviations of the mean. It is possible to standardize any normally distributed random variable by subtracting the population mean from each value and dividing by the standard deviation:

$$Z = \frac{X - \mu}{\sigma}$$

The resulting random variable has a standard normal distribution with mean 0 and standard deviation 1. A random variable that follows the standard normal distribution is often denoted by $Z$ and called a Z score. Using the expressions above:

$$P(-1.04 < Z < 1.04) = 0.70$$
$$P(-1.96 < Z < 1.96) = 0.95$$
$$P(-2.58 < Z < 2.58) = 0.99$$

The precision coefficient is specific to the parameter being estimated. Precision coefficients can be obtained from tabled values or from statistical software. If a random variable follows a standard normal distribution, one can use the known distribution of Z scores to state that 95% of all Z scores lie between −1.96 and 1.96. The value 1.96 is the precision coefficient needed for an interval estimate with 95% confidence. Stated simply, a precision coefficient is the number of standard deviations within which $100(1 - \alpha)\%$ of the values of the random variable fall from the population parameter. The symbol $\alpha$ represents one's willingness to estimate the underlying population parameter incorrectly. In most fields of research, an $\alpha$ level of 0.05 is considered reasonable, but there may be times when a higher or lower $\alpha$ level is acceptable.

In general, the standard error gets smaller as the sample size increases. The standard error for the sample mean is defined as the standard deviation divided by the square root of the sample size:

$$SE(\bar{x}) = \frac{s}{\sqrt{n}}$$

Greater confidence for an interval estimate requires a larger precision coefficient. These observations hold for standard errors of other estimators and for other distributions used in the construction of intervals. The construction of a confidence interval follows a general form of:

Point estimate ± (precision factor) × (standard error of the estimate).
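
The three ingredients of the general form (point estimate, precision coefficient, and standard error) can be illustrated with a short sketch. This example is not part of the original chapter: it assumes Python 3.8 or later and uses `statistics.NormalDist` from the standard library to look up standard normal precision coefficients instead of a printed table. The function names and the numeric inputs (a standard deviation of 10 and a point estimate of 50) are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def precision_coefficient(confidence):
    """Standard normal value z such that P(-z < Z < z) = confidence."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def standard_error_of_mean(sd, n):
    """Standard deviation divided by the square root of the sample size."""
    return sd / sqrt(n)


def interval_estimate(point_estimate, precision, standard_error):
    """Point estimate +/- (precision factor) x (standard error)."""
    margin = precision * standard_error
    return point_estimate - margin, point_estimate + margin


# Precision coefficients for the three confidence levels used in the text.
for confidence in (0.70, 0.95, 0.99):
    print(confidence, round(precision_coefficient(confidence), 2))

# Hypothetical standard deviation of 10: quadrupling n halves the standard error.
for n in (25, 100, 400):
    print(n, standard_error_of_mean(10, n))

# Hypothetical point estimate of 50 with n = 25 and 95% confidence.
print(interval_estimate(50, 1.96, standard_error_of_mean(10, 25)))
```

Rounded to two decimals, the three precision coefficients reproduce the values 1.04, 1.96, and 2.58 quoted above. The sketch relies on the standard normal distribution throughout; when the population standard deviation is unknown and the sample is small, a different reference distribution supplies the precision coefficient.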
Given this general form, the following observations are worth noting. All other things being equal, confidence intervals are:
• Narrower with larger sample sizes than smaller sample sizes
• Wider when more confidence is required than when less confidence is required

Although somewhat of a simplification, confidence intervals represent a plausible range of values of the population parameter given the sample estimate and uncertainty attributed to the sampling process.

Since population parameters (e.g., the population standard deviation) are not known, there are related standardized scores which utilize only data from the sample.

The t-statistic is perhaps the best known of these, and it will be used as an example of a confidence interval for a population mean. When the sample size is small (particularly <30) and the population standard deviation is unknown, the ratio of a standard normal random variable to its standard error has a t distribution for which the shape is determined by the number of degrees of freedom $(n - 1)$. The statistic,

$$T = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

follows a t distribution (Student's t). The t distribution is symmetric about its mean (zero) and looks like a normal distribution with, in cases of sample sizes less than 200, heavier tails. As was the case with the normal distribution, the shape of the t distribution can be used to find two values which define a central area under the density curve of size $(1 - \alpha)$. It can be shown that once a value of $T$ associated with an area of interest (translated as a probability) is determined, the sample mean $\bar{x}$ is within $T(s/\sqrt{n})$ of the population mean, $\mu$. This enables one to calculate a confidence interval for the population mean when the sample size is small and the population variance is unknown.

The interval estimate of the population mean, the two-sided $100(1 - \alpha)\%$ confidence interval for the population mean, is

$$\bar{x} \pm t_{1-\alpha/2,\,n-1}\,(s/\sqrt{n})$$

Note that the precision coefficient, $t_{1-\alpha/2,\,n-1}$, is the $100(1 - \alpha/2)$th percentile of the t distribution with $n - 1$ degrees of freedom. Since the t distribution is symmetric about zero, $t_{1-\alpha/2}$ is the precision coefficient that defines a central area of $(1 - \alpha)$.

As an example, consider an observational study of 25 patients with primary biliary cirrhosis (PBC). Among these 25 patients, the mean alkaline phosphatase value was 1,983 U/L and the standard deviation was 2,140 U/L. Researchers are interested in a 95% confidence interval for the population mean alkaline phosphatase. In this case, the precision factor is the value from the t distribution with 24 degrees of freedom that defines a central area of 95%. From a table of values, one finds this value to be 2.06. The 95% confidence interval is then

$$1983 \pm 2.06\,(2140/5) = 1983 \pm 881.7 = (1101.3,\ 2864.7)$$

The statistical interpretation of this result is that we are 95% confident that the interval (1,101.3, 2,864.7 U/L) includes the population mean alkaline phosphatase value among patients with PBC.

A statistical concept that is closely related to the construction of confidence intervals is hypothesis testing. Hypothesis testing involves the following steps:
• Posing a null hypothesis about the value of the population parameter of interest
• Stating the alternative hypothesis about the value of the population parameter
• Identifying an appropriate test statistic against which the null hypothesis will be evaluated
• Describing the distribution of the test statistic when the null hypothesis is true; identifying values of the test statistic that occur less than $100\alpha\%$ of the time under the null hypothesis (the rejection or critical region)
• Calculating the test statistic
• Making a conclusion about the null and alternative hypotheses on the basis of the test statistic compared to the rejection region

Since the inference is being made from a sample, the hypothesis test can result in two types of
errors: rejecting the null hypothesis when it should not have been rejected (a type I error) or failing to reject the null hypothesis when it should have been rejected (a type II error). Making an erroneous conclusion at the end of a study is undesirable. Hence, studies are designed to limit the probability of either of these errors occurring. The probability of a type I error is denoted by alpha ($\alpha$), previously referred to as the significance level, and that for a type II error is denoted by beta ($\beta$). Its complement, $(1 - \beta)$, is called the power of a test and is the probability of correctly rejecting the null hypothesis. For clinical trials, study design considerations include specification of $\alpha$ and $(1 - \beta)$ since these affect the ability of a study sponsor to address study objectives (e.g., to claim an effect of an investigational drug).

The process of hypothesis testing can be illustrated with data from the previous example. Suppose researchers would like to know if, as they suspect, the mean alkaline phosphatase value among patients with primary biliary cirrhosis is different from otherwise normal subjects. The mean among normal volunteers is around 80 U/L. The null hypothesis is that the mean alkaline phosphatase among patients with PBC is 80 U/L. If there is sufficient evidence to reject the null hypothesis, the following alternative hypothesis will be concluded: the mean alkaline phosphatase among patients with PBC is not 80 U/L. Using statistical notation:

$$H_0: \mu = 80 \quad \text{versus} \quad H_A: \mu \neq 80$$

Note that rejection of the null hypothesis could occur because the mean PBC level was less than or greater than the hypothesized value. Since there are two sides to the alternative hypothesis, the test is considered two-sided. In advance, the researchers will have decided upon a value of $\alpha$, the size of the test, which represents the probability that they will reject the null hypothesis erroneously. The choice of the size of the test depends on one's willingness to commit a type I error. For example, if the implication of committing a type I error is not very important as in early studies in drug development, a researcher may be satisfied with an $\alpha$ level of 0.10 or 0.20. A common value for $\alpha$ in confirmatory research settings is 0.05, but there may be instances in which smaller values are desirable. The test statistic is calculated as the difference between the sample mean and the hypothesized value divided by the standard error of the mean:

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

If this value is close to zero, there will be insufficient evidence to reject the null hypothesis. If the value is far from zero, the evidence is considered sufficient to reject the null hypothesis. The rejection region is represented by those values of the test statistic that occur with probability $\alpha$ or less when the null hypothesis is true. If the null hypothesis is rejected, either the population mean alkaline phosphatase is not 80 U/L or a type I error has occurred.

For the results obtained, the calculated value of the test statistic is

$$t = \frac{1983 - 80}{2140/\sqrt{25}} = \frac{1903}{428} = 4.45$$

Using tabled values of the t distribution with 24 degrees of freedom, one obtains a rejection region of $t < -2.06$ or $t > 2.06$. Since the test statistic is in the rejection region, the null hypothesis is rejected at the $\alpha = 0.05$ level. The conclusion from the hypothesis test is that the difference between the sample estimate and the hypothesized value is greater than would be expected by chance (due to sampling) alone. The population mean alkaline phosphatase value for patients with PBC is different from 80 U/L.

It is possible that two people would not agree on the appropriate value of $\alpha$, so another probability, the p value, is often used to reflect the extremeness of the value of the test statistic. A p value is the probability of observing the actual value of the test statistic or one more extreme (i.e., favoring the alternative hypothesis) when the null hypothesis is true. If a p value is $\le \alpha$, one rejects the null hypothesis. A p value of 0.02 means that the value of the observed test statistic and all other values more extreme (i.e., contradictory of the null hypothesis) occurs with probability of 0.02 under the null hypothesis. One major drawback of p values as a measure of evidence is that they are highly dependent on the sample size
as it relates to the standard error of the estimator and, thereby, the power to contradict the null hypothesis. The sample size for a study typically is estimated in advance to ensure there is adequate power to detect an effect of interest.

The sample size required to provide power of $(1 - \beta)$ to reject the null hypothesis that the mean $\mu$ is not different from a specified value $\mu_0$ while maintaining a type I error of $\alpha$ is

$$n = \frac{(Z_\alpha + Z_\beta)^2\,\sigma^2}{\delta^2}$$

where $\delta = (\mu - \mu_0)$ is the applicable difference and $Z_\alpha$ and $Z_\beta$ are precision coefficients for $(1 - \alpha)$ and $(1 - \beta)$ of the standard normal distribution. Note that the variance, $\sigma^2$, and the true difference, $\delta$, must be defined specifically for the type of outcome (e.g., means, proportions) being evaluated. The true difference is often regarded as a clinically meaningful difference or the difference one would like to detect. Estimates of the variance are typically obtained from previous studies. The sample size expression can be rewritten using algebra as

$$\frac{\sigma^2}{n} = \left(\frac{\delta}{Z_\alpha + Z_\beta}\right)^2$$

The quantity $\sigma^2/n$ is the required squared standard error for $(1 - \beta)$ power to contradict $\mu_0$ as the null hypothesis with type I error of $\alpha$ when the true mean $\mu$ differs from $\mu_0$ by $\delta$.

Another algebraic manipulation of the sample size expression yields the following:

$$n = \frac{(Z_\alpha + Z_\beta)^2}{(\delta/\sigma)^2}$$

The expression $(\delta/\sigma)$ is called the effect size. In the case where $\delta$ is defined as the difference between two means with a common standard deviation, $\sigma$, Cohen (1992) [7] has characterized effect sizes around 0.2 as small, around 0.5 as moderate, and 0.8 or greater as large. Another use of effect sizes is that they may be compared across studies for comparative purposes, or when appropriate, combined across similarly designed studies as a meta-analysis, as described in Chap. 9.

Confidence intervals and hypothesis tests are closely related inferential procedures. In the case of a two-sided $100(1 - \alpha)\%$ confidence interval, the lower and upper limits represent the range of plausible values of the unknown population parameter. A hypothesis test may be carried out by positing a number of values of the population parameter in the null hypothesis. All values outside of the limits of the two-sided $100(1 - \alpha)\%$ confidence interval would be rejected by a two-sided hypothesis test of size $\alpha$. Conversely, values within the two-sided $100(1 - \alpha)\%$ confidence interval would not be rejected. Thus, a confidence interval can be used to test a number of values of the population parameter.

Fig. 11.3 Relationship between confidence intervals and hypothesis tests

A graphical representation of the relationship between a two-sided $100(1 - \alpha)\%$ confidence interval about the population parameter, $\theta$, and a two-sided hypothesis test of the null hypothesis, $H_0: \theta = \theta_0$, is presented in Fig. 11.3. In Fig. 11.3,
three hypothetical confidence intervals (with the control group is much larger than the mean
lower and upper limits indicated by the brackets) for the test group) or the positive direction (i.e.,
are displayed with the corresponding statistical mean for the test group is much larger than the
conclusion regarding the null hypothesis. The mean for the control group). In many instances,
first interval lies entirely to the left of the hypoth- sponsors of clinical trials are only interested in
esized value of the population parameter, indicat- one direction of the alternative hypothesis,
ing that the plausible values of $\theta$ are less than $\theta_0$. namely, the direction that corresponds to a benefit
Therefore, the null hypothesis is rejected. The of the test treatment. However, the null hypothe-
second interval encloses the hypothesized value sis is tested using a two-sided test of size $\alpha$.
of the population parameter, indicating that $\theta_0$ is Hence, if it is rejected, the probability of errone-
among the plausible values of $\theta$. Therefore, the ously claiming a benefit of the treatment is $\alpha/2$
null hypothesis is not rejected. The third interval and the probability of erroneously detecting a
lies entirely to the right of the hypothesized value harm of the treatment is $\alpha/2$.
of the population parameter, indicating plausible Test products which are intended to be similar
values of $\theta$ are greater than $\theta_0$. Hence, the third to an existing product in terms of the clinical
confidence interval is also consistent with rejec- response are evaluated in equivalence trials. The
tion of the null hypothesis. objective of an equivalence trial is to demonstrate
The formulation of the confidence interval that the difference in response between the test
depends on the population parameter being treatment and the active control does not exceed
estimated, which depends on the null hypothesis, an acceptable margin. New pharmaceutical prod-
which in turn depends on the research question of ucts which are shown to be equivalent to an active
interest. Medical research includes observational control may have other advantages to justify their
studies (prospective and retrospective) and use such as better safety, more convenient dosing,
clinical trials which are intended to evaluate the or lower cost. Bioequivalence studies are intended
effects of a medical intervention, such as a phar- to demonstrate that the pharmacokinetic proper-
maceutical agent, a surgical procedure, use of a ties of two formulations of a treatment are
device, or implementation of an educational or equivalent.
counseling program. Pharmaceutical agents are The following statistical hypotheses corre-
evaluated for their usefulness, among other spond to a clinical trial intended to demonstrate
things, on the basis of their efcacy and safety [3]. the equivalence of a test treatment to an active
In the context of pharmaceutical development, control with respect to the difference in popula-
the objective of a clinical trial can be to demon- tion means of a continuous outcome:
strate that a test treatment is:
Superior to an inactive or active control H 0 : m test m control d equivalence versus
Not unacceptably worse than (not inferior to) H A : m test m control < d equivalence
an active control
Equivalent to an active control The quantity d equivalence is called the equiva-
The following statistical hypotheses corre- lence margin, and it must be specically dened
spond to a clinical trial intended to demonstrate in advance of the study analysis. In the case of
the superiority of a test treatment compared to a pharmaceutical studies, the equivalence margin
control with respect to a continuous outcome: must be agreed upon by regulatory authorities if
the study is to be used for registration purposes.
H 0 : m test m control = 0 versus The null hypothesis in equivalence trials is
H A : m test m control 0 typically tested using a condence interval about
the difference in population parameters (e.g.,
The null hypothesis of interest could be means or proportions). If the condence interval
rejected if the difference in mean response is far excludes the equivalence margin (by being
from zero in the negative direction (i.e., mean for entirely within it), the null hypothesis is rejected.
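The confidence-interval decision rule for an equivalence trial can be sketched in a few lines. This is a minimal illustration, not a prescribed analysis: it assumes a large-sample normal approximation for the interval (a t quantile would be used for small samples), the confidence level is left as a parameter because the text does not fix one, and the function names and numbers are hypothetical.

```python
from statistics import NormalDist


def two_sided_ci(diff, se, alpha=0.05):
    """Two-sided 100(1 - alpha)% interval for a difference (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff - z * se, diff + z * se


def is_equivalent(diff, se, margin, alpha=0.05):
    """Reject the null hypothesis of non-equivalence only if the whole
    interval lies strictly inside (-margin, +margin)."""
    lower, upper = two_sided_ci(diff, se, alpha)
    return -margin < lower and upper < margin


# Hypothetical estimates, chosen only for illustration.
print(is_equivalent(diff=1.0, se=1.0, margin=5.0))  # interval is about (-0.96, 2.96)
print(is_equivalent(diff=4.0, se=1.0, margin=5.0))  # interval is about (2.04, 5.96)
```

In the first call the interval sits entirely within the margin, so equivalence is concluded; in the second the upper limit exceeds the margin, so the null hypothesis is not rejected.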
An important consideration in equivalence trials is that rejection of the null hypothesis can be interpreted as meaning that the test and control are both efficacious or neither is. The credibility of such a result depends on the ability to demonstrate that the active control would have been efficacious if an inactive control were used in the study. The ability of a study to differentiate an efficacious treatment from an inefficacious treatment is called assay sensitivity. One way assay sensitivity can be established is by the use of historical data for the inactive control to demonstrate that the active control would have been superior to the inactive control if it had been studied. Another way to establish assay sensitivity is to include an inactive control group in addition to the active control, although such a design may not be ethical. Interested readers may refer to Chow and Liu (2004) [8] for further information on equivalence and noninferiority clinical trials.

Another objective of some clinical trials is to demonstrate that a test treatment is not unacceptably inferior to the control. Studies with such an objective are called noninferiority studies, and they may be used when it is unethical or logistically difficult to use an inactive control. If the test treatment is considered not unacceptably worse than the active control, it may have other advantages such as better safety or greater convenience. The following statistical hypotheses correspond to a clinical trial intended to demonstrate the noninferiority of a test treatment to an active control with respect to the difference in population means of a continuous outcome. In this formulation of the hypotheses, a larger value of the mean is favorable:

$$H_0: \mu_{\text{test}} - \mu_{\text{control}} \leq -\delta_{\text{non-inferiority}} \quad \text{versus} \quad H_A: \mu_{\text{test}} - \mu_{\text{control}} > -\delta_{\text{non-inferiority}}$$

As with equivalence trials, the noninferiority margin must be specified in advance. The null hypothesis is tested using a confidence interval. If the noninferiority margin is enclosed within the confidence interval, the null hypothesis is not rejected. If the noninferiority margin is below the lower limit of the confidence interval, the null hypothesis is rejected, and the conclusion is that the test treatment is not inferior to the active control. Similar to equivalence trials, interpreting this statistical conclusion also depends on the ability of the study to establish assay sensitivity.

As seen in this section, both hypothesis tests and confidence intervals are used to draw conclusions about a quantitative characteristic of a population. In the remaining sections, specific statistical methods are described.

Differences Between Means and Proportions

A common statistical analysis involves making an inference about the equality of two means when the observations are independent, meaning the value of one observation does not depend on another. In many medical studies, observations can be considered independent because the observations are single values from different study subjects. However, medical studies frequently involve repeated tests for the same individual (e.g., heart rate taken at a number of times for the same individual) or related tests within the same individual (e.g., presence of a characteristic in more than one skin location within an individual study subject). Such observations are considered dependent.

In the case of independent observations, the hypothesis tested for the equality of two population means is

$$H_0: \mu_1 - \mu_2 = 0$$

If this null hypothesis is rejected, the following alternative hypothesis will be favored:

$$H_A: \mu_1 - \mu_2 \neq 0$$

The test statistic to test the null hypothesis is

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

where the numerator is the difference in sample means, an estimate of the difference in
population means, and the denominator is the standard error of the difference in sample means. The quantity

$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

is the pooled standard deviation and represents the weighted average of the standard deviation across the two samples with sample sizes of $n_1$ and $n_2$. This test is called Student's t test or the independent groups t test because the test statistic follows a t distribution under the null hypothesis.

The assumptions required for the use of the two-sample t test are that the distribution of the random variable is approximately normal, the two groups represent simple random samples from the two populations of interest, and the population variances are equal (although likely unknown). Under the null hypothesis (i.e., assuming the two population means are equal), the test statistic follows a t distribution with $n_1 + n_2 - 2$ degrees of freedom. For a two-sided hypothesis test of size $\alpha$, the rejection region is defined as any value of the test statistic $t > t_{1-\alpha/2,\, n_1+n_2-2}$ or $t < t_{\alpha/2,\, n_1+n_2-2}$, i.e., the values from the t distribution with $n_1 + n_2 - 2$ degrees of freedom that lie outside of a central area of $(1-\alpha)$. Note that since the t distribution is symmetric, $t_{\alpha/2,\, n_1+n_2-2} = -t_{1-\alpha/2,\, n_1+n_2-2}$.

Consider the following example. One may be interested in whether or not there is a difference in the mean LDL cholesterol between adults with coronary heart disease (CHD) and adults without CHD. To answer such a research question, two samples corresponding to the populations of interest (i.e., adults with a diagnosis of CHD and adults with no diagnosis of CHD) would be studied. LDL cholesterol levels were ascertained for 25 subjects from each group. The sample means were 134 mg/dL and 118 mg/dL for the CHD and non-CHD subjects, respectively. The sample standard deviations were 14 mg/dL and 12 mg/dL, respectively. The statistical hypotheses are

$$H_0: \mu_{\text{CHD}} - \mu_{\text{non-CHD}} = 0 \quad \text{versus} \quad H_A: \mu_{\text{CHD}} - \mu_{\text{non-CHD}} \neq 0$$

The size of the test will be $\alpha = 0.05$. Given the sample sizes in each group, the rejection region is any value of the test statistic $t < -2.01$ or $t > 2.01$, which corresponds to values of the t distribution with 48 degrees of freedom which define areas in the left-hand tail of 0.025 and in the right-hand tail of 0.025, respectively. The values that define the rejection region can be obtained from tables of the distribution or from statistical software.

The pooled standard deviation is calculated as

$$s_p = \sqrt{\frac{(25 - 1)14^2 + (25 - 1)12^2}{25 + 25 - 2}} = 13$$

The test statistic is calculated as

$$t = \frac{134 - 118}{13\sqrt{\dfrac{1}{25} + \dfrac{1}{25}}} = 4.35$$

Since the value of the test statistic, 4.35, is in the rejection region ($t > 2.01$), the null hypothesis is rejected. The evidence from the study suggests that the population mean LDL is different between adults with CHD and those without CHD. Since the sample statistic differs from the hypothesized value of the population parameter by much more than what would be expected by chance alone, such a difference is often called statistically significant. A corresponding confidence interval for the difference between two group means can be written as

$$(\bar{x}_1 - \bar{x}_2) \pm t_{1-\alpha/2,\, n_1+n_2-2}\; s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

Note that this confidence interval follows the general form described previously. In this case, the quantity $s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$ is the standard error of the difference in sample means.

For this particular example, the corresponding 95% confidence interval is
$$(134 - 118) \pm 2.01\,(13)\sqrt{\frac{1}{25} + \frac{1}{25}} = 16 \pm 7.4 = (8.6,\ 23.4)$$

The interpretation of this result is that we are 95% confident that the interval (8.6, 23.4) encloses the true difference in population mean LDL. Note that the hypothesized value of the difference, zero, is outside of the calculated interval, which is consistent with the rejection of the null hypothesis of no difference.

A similar method can be used when the difference in population means represents dependent observations, such as a subject's systolic blood pressure (SBP) before and after treatment with an antihypertensive. The null and alternative hypotheses can be specified as

$$H_0: \mu_{\text{Post}} - \mu_{\text{Pre}} = 0 \quad \text{versus} \quad H_A: \mu_{\text{Post}} - \mu_{\text{Pre}} \neq 0$$

In this case, the test is called a paired t test and the test is carried out by calculating the difference in paired observations (e.g., SBP post minus SBP pre) and forming the test statistic using the sample mean difference divided by the standard error of the difference in paired values:

$$t = \frac{\bar{x}_d}{s_d / \sqrt{n}}$$

If there is no difference between the paired observations, the test statistic will be close to zero. The rejection region is defined using a t distribution with $n - 1$ degrees of freedom, where $n$ is the number of subjects with paired observations.

When the independent groups t test cannot be used, such as when the data are not approximately normally distributed, a nonparametric test (which does not assume any shape to the underlying distribution) may be appropriate. A nonparametric analog to the independent groups t test is the Wilcoxon rank sum test which addresses the equality of mean ranks. The Wilcoxon rank sum test is carried out by ranking all individual observations across the two groups from lowest to highest, calculating the sum of ranks for one of the groups, and comparing this sum to tabled values (representing percentiles of the distribution) in order to reject or fail to reject the null hypothesis. The rejection region is defined according to tabled values of a distribution defined just for this particular test. When ranking the observations, ties are managed by assigning the mean of the ranks that would have been assigned if the observations had not been tied. For example, if the third, fourth, and fifth smallest observations are all tied, the assigned rank for each of these observations will be 4. The Wilcoxon rank sum test assumes that the two samples came from the same population and therefore have the same variance under the null hypothesis. If the null hypothesis of a common population distribution is rejected, the interpretation is that the distribution of one population is shifted away from the other. It is worth noting that the nonparametric analog to the paired t test is the Wilcoxon signed-rank test for pairs, although the details of this method are not discussed. Interested readers are referred to LaVange and Koch (2006) [9] for additional information on the Wilcoxon rank sum test and the Wilcoxon signed-rank test.

In studies involving comparisons of the population mean among more than two populations, an appropriate statistical method is called analysis of variance (ANOVA). If interest is in the equality of $k$ population means, the null hypothesis tested in ANOVA is stated as

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_{k-1} = \mu_k$$

If the null hypothesis is rejected, the alternative hypothesis will be favored:

$H_A$: at least one pair of the population means is unequal. That is, at least one of the following inequalities is true: $\mu_1 \neq \mu_2, \ldots, \mu_1 \neq \mu_{k-1}$, and $\mu_{k-1} \neq \mu_k$.

The assumptions required for use of ANOVA are that the samples represent simple random samples from the populations of interest, the random variable is normally (or approximately normally) distributed in the populations, and the population variance is equal among the populations. The overall variation of the individual
responses is partitioned into within-group variability (the inherent variability within each sample) and among-group variability (the variability of the sample means relative to the overall mean). The test statistic F is calculated as the ratio of the variability among samples (e.g., treatment groups) to the variability within samples:

$$F = \frac{V_{\text{Among}}}{V_{\text{Within}}}$$

That is, if the sample means vary more than the inherent variability, the ratio will be greater than one, and the evidence will suggest that the sample means did not arise from populations with a common population mean. Results from an analysis of variance are often displayed in an ANOVA table, such as the results displayed in Table 11.1 from an analysis of a clinical trial of three doses of an analgesic. The response of interest is the mean pain score recorded using a visual analog scale (VAS). The mean square for drug (49.95) represents the average variability of means relative to the grand mean response. The mean square error (7.96) represents the variability of responses within each treatment group. The ratio of these two is the test statistic and can be interpreted as the extent to which the variability in mean responses across groups exceeds the inherent variability in response.

Table 11.1 ANOVA for mean VAS pain score from three dose groups
Source   Sum of squares   df   Mean square   F
Drug          99.89459     2     49.947295   6.28
Error        238.67896    30      7.955965
Total        338.57355    32

The test statistic calculated in such a manner follows an F distribution, for which the shape (and therefore the critical region) is determined by two parameters: the numerator degrees of freedom, defined as the number of degrees of freedom required to estimate the variability among sample means ($k - 1$), and the denominator degrees of freedom, defined as the number of degrees of freedom for estimating variability within samples ($N - k$), where $N$ represents the total sample size across the $k$ samples:

$$N = \sum_{i=1}^{k} n_i$$

If the null hypothesis of equal means is rejected, one would like to know which pairs of the population means are unequal.

Following a significant test result from the F test, one can compare the population means among samples (e.g., treatment groups in a clinical trial) using numerous methods that appropriately control the overall type I error rate. This is important since one could test each of $c = k(k - 1)/2$ pairs of population means using an independent groups t test with $\alpha = 0.05$, but the type I error rate is only controlled at $\alpha = 0.05$ with this method when $k = 3$. When $k > 3$, the probability of incorrectly rejecting at least one hypothesis increases with the number of individual hypotheses tested. In general, if $c$ null hypotheses each have independent tests at the $\alpha$ level, the probability of rejecting at least one by chance alone is

$$P(\text{rejecting at least one of } c \text{ hypotheses}) = 1 - (1 - \alpha)^c$$

For example, if five such independent comparisons of treatment groups are tested at $\alpha = 0.05$, the probability of rejecting at least one by chance alone could be as large as 0.226.

One appropriate method for controlling the experimentwise error rate is the Bonferroni test which involves testing each of the $c$ pairs of means using a t test with $\alpha_B = \alpha/c$. For example, if a study with four groups was conducted and the F test was rejected ($\alpha = 0.05$), then the six comparisons of means could subsequently be tested using $\alpha_B = 0.05/6 = 0.0083$. This method controls the experimentwise error rate since the probability of incorrectly rejecting at least one null hypothesis is bounded by $6(0.05/6) = 0.05$.

Another method which can be used to compare pairs of means is Tukey's Honestly Significant Difference test. This test requires the use of additional tabled values to determine the minimum absolute difference in means that would lead to rejection of the null hypothesis of the equality of two means. However, Tukey's
method is more powerful than the Bonferroni test, meaning the absolute difference in means leading to rejection is smaller than that required for the Bonferroni test. Another method which may be used when comparing a number of group means to a common control is Dunnett's test. Additional details about methods used to control the experimentwise error rate in the setting of multiple tests can be found in Schork and Remington (2000) [5] and, on a more advanced level, in Westfall et al. (1999) [10].

If the shape of the underlying distribution cannot be assumed to be normal, a nonparametric approach may be used. The Kruskal-Wallis test is to the ANOVA as the Wilcoxon rank sum test is to the independent groups t test. That is, for the Kruskal-Wallis test, the original random variable is ranked across all $k$ groups. The test statistic is the ratio of the variability in ranks among groups to the variability in ranks within groups. The null hypothesis for the Kruskal-Wallis test is that the groups have the same population distribution. If the null hypothesis is rejected, one would conclude the alternative hypothesis is true, that the population distributions are different, particularly for their location. The assumptions required for the use of the Kruskal-Wallis test are that the observations are independent, the samples are simple random samples from the populations of interest, and the variance is equal among the populations under the null hypothesis. The test statistic is evaluated using a chi-squared distribution, for which the shape (and therefore the critical region) is determined by the number of degrees of freedom ($k - 1$).

Many medical studies examine the proportion of subjects with a particular response, such as deaths, myocardial infarctions, or some risk factor for disease as the outcome of interest. The difference in proportions between two groups (e.g., represented by cases or controls in an observational study or treatment groups in a clinical trial) can be expressed as one proportion minus another, $p_1 - p_2$, or as a ratio of the two, $p_1/p_2$, a quantity called the relative risk. Data from studies with these kinds of outcomes are usually presented in the form of a table displaying the counts of subjects with and without the outcome of interest for each group.

Data from a hypothetical cohort study are displayed in Table 11.2. In this study, patients with a confirmed diagnosis of a particular neurological condition (exposed) and age- and sex-matched controls (not exposed) were followed for a period of 1 year to ascertain the occurrence of workplace injuries.

Table 11.2 Number of events for subjects exposed and not exposed
                          Exposed   Not exposed   Total
Workplace injury?   Yes      30           8          38
                    No       70         132         202
Total                       100         140         240

Note that if the proportions of subjects between the groups were equal, the observed counts of subjects with each outcome (yes or no) would be distributed in equal proportion among the groups (exposed or not exposed). One method that can be used to test the hypothesis of equal proportions between two populations is the chi-squared test of homogeneity. The chi-squared test is an example of a goodness-of-fit test, for which the observed counts of subjects with and without the event are compared to the expected number of subjects with and without the event when no difference exists (or under the null hypothesis). For goodness-of-fit tests, the expected counts are obtained on the basis of an assumed model. In the case of the test of equal proportions, the expected counts would be obtained by applying the overall (across groups) proportion with response to each group's sample size. The test statistic for a chi-squared test is expressed as the ratio of the squared difference of the observed and expected counts (denoted by $O$ and $E$, respectively) to the expected count for each cell and summed over all four cells (indexed below by $j$) of the table:

$$\chi^2 = \sum_{j=1}^{4} \frac{(O_j - E_j)^2}{E_j}$$

Squaring the deviations of observed counts from the expected ensures the difference is positive, which is required for a random variable from a chi-squared distribution. An alternative, mathematically equivalent, form of the test statistic is
$$\chi^2 = \frac{(\hat{p}_1 - \hat{p}_2)^2}{\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)\bar{p}\,(1 - \bar{p})}$$

where $\hat{p}_1$ represents the sample proportion from group 1, $\hat{p}_2$ represents the sample proportion from group 2, and $\bar{p}$ is the overall proportion across the two groups.

In the case of a hypothesis test for two proportions, the null and alternative statistical hypotheses can be stated as follows:

$$H_0: p_1 - p_2 = 0, \quad H_A: p_1 - p_2 \neq 0$$

where the population proportions for each of two independent groups are represented by $p_1$ and $p_2$.

Under the null hypothesis, the test statistic has a chi-squared distribution with 1 degree of freedom. Therefore, the null hypothesis will be rejected if the test statistic is in the rejection region defined by $\chi^2 > \chi^2_{1-\alpha,\,1}$. Note that only large values of the test statistic contradict the null hypothesis. Therefore, the rejection region for the chi-squared test is represented by only the upper tail of the distribution. The chi-squared test is appropriate when the groups are independent, the outcomes are mutually exclusive, and most of the expected cell counts are at least five. The use of the chi-squared test is illustrated with the data from Table 11.2.

The null and alternative hypotheses concerning the proportion of subjects exposed and unexposed with workplace injuries are

$$H_0: p_{\text{Exposed}} - p_{\text{Not Exposed}} = 0$$
$$H_A: p_{\text{Exposed}} - p_{\text{Not Exposed}} \neq 0$$

If the null hypothesis is true, the proportions of subjects with injuries would be equal. Therefore, the expected counts of subjects with events are calculated by applying the overall proportion of events to the sample size in each group. The expected number of events among exposed subjects is $\frac{38}{240} \times 100 = 15.83$. Similarly, the expected number of events among unexposed subjects is $\frac{38}{240} \times 140 = 22.17$. The expected numbers of subjects exposed and not exposed without the event are obtained by applying the overall proportion without event to the sample size in each group. The test statistic is then obtained as

$$\chi^2 = \frac{(30 - 15.83)^2}{15.83} + \frac{(8 - 22.17)^2}{22.17} + \frac{(70 - 84.17)^2}{84.17} + \frac{(132 - 117.83)^2}{117.83} = 25.817$$

For a test with $\alpha = 0.05$, the rejection region is defined as any value of the test statistic $> 3.84$ (chi-squared distribution with 1 degree of freedom). Therefore, the null hypothesis is rejected with a conclusion that the proportion of exposed subjects with workplace injuries is greater than the proportion of age- and sex-matched unexposed subjects.

A confidence interval can also be constructed for the difference in two proportions. A two-sided $100(1-\alpha)\%$ confidence interval for the difference in sample proportions, $\hat{p}_1 - \hat{p}_2$, is given by

$$(\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2}\, SE(\hat{p}_1 - \hat{p}_2), \quad \text{where} \quad SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$

For this particular example, the corresponding 95% confidence interval is calculated as

$$(0.300 - 0.057) \pm 1.96\sqrt{\frac{(0.300)(0.700)}{100} + \frac{(0.057)(0.943)}{140}} = 0.243 \pm 1.96(0.050) = (0.15,\ 0.34)$$

Since the hypothesized value of the difference in population proportions, zero, is outside of the calculated interval, this result is consistent with rejection of the null hypothesis.

The chi-squared test can be used in the instance when there are more than two groups ($k > 2$) for
which the proportions are compared. The test statistic is computed in the same manner as for the two groups case, except the test statistic is computed by summing over all $2k$ cells. In the more general case, under the null hypothesis of equal proportions across the $k$ groups, the test statistic has a chi-squared distribution with $k - 1$ degrees of freedom.

When the sample size requirements for the chi-squared test cannot be met due to small expected cell counts, an exact test is more appropriate. The fundamental concept of Fisher's exact test is that the margins of the table are considered fixed (e.g., count of subjects with and without events over all groups and the count of subjects in each group). Given the fixed margins, it is possible to specify all possible patterns of event counts. Then the exact probability of each pattern of counts of events is calculated using the hypergeometric distribution. The p value corresponding to the test is calculated exactly by summing the probabilities associated with all tables which have probabilities as small as, or smaller than, that for the observed table.

Medical studies involving assessment of the presence or absence of a characteristic in the same subjects before and after an intervention (e.g., negative or positive for a symptom before and after treatment) yield counts of paired observations, as shown in Table 11.3.

Table 11.3 Number of subjects with and without symptom pre- and postintervention
                   Post
                Yes      No
Prior   Yes      a        b      a + b
        No       c        d      c + d
               a + c    b + d      n

For assessment of whether the intervention had an effect on the response, McNemar's test can be used. The null and alternative hypotheses from such a study are

$$H_0: p_{\text{Pre}} - p_{\text{Post}} = 0, \quad H_A: p_{\text{Pre}} - p_{\text{Post}} \neq 0$$

If the intervention had no effect, the proportion with response would be the same prior to and postintervention, and therefore the marginal counts, $(a + b)$ and $(a + c)$, should be about equal. Therefore, the test statistic is calculated as

$$\chi^2 = \frac{(b - c)^2}{b + c}$$

and has a chi-squared distribution with 1 degree of freedom under the null hypothesis. A useful general reference that includes additional details about this test is Stokes, Davis, and Koch (2000) [11].

Statistical Issues in Diagnostic Testing and Screening

Tests which are used as an aid to diagnosing a disease are called diagnostic tests. An ideal diagnostic test would not identify a patient as positive for disease if she or he did not have it. Nor would an ideal diagnostic test identify a patient as negative for disease if she or he did have it. The diagnostic accuracy of a new test is often compared to an existing gold standard test. For such studies, two samples of patients are selected: those who test positive for the disease using the gold standard test and those who test negative for the disease using the gold standard. All participants from both groups are subjected to the new test and the outcome, either test positive or test negative, is noted.

Two measures of diagnostic accuracy are sensitivity and specificity. Sensitivity is the probability that a subject who has the disease will test positive. Specificity is the probability that a subject who does not have the disease will test negative. If a diagnostic test does not have high sensitivity or specificity, it will be of limited use as important diagnoses will be missed in the former and unnecessary medical follow-up may result from the latter.

Many assays produce a quantitative result which must be interpreted as either negative or positive. Each cutoff value chosen for the result yields its own sensitivity and specificity. Consider the use of the prostate-specific antigen (PSA) test as a diagnostic for prostate cancer. Higher values of PSA level (ng/mL) are more indicative of cancer. One may be interested in what specific value of PSA should be used to
indicate a positive test result for cancer. To address this question, sensitivity and specificity are calculated for all possible cut points or thresholds. For example, a PSA $\geq 2$ can be interpreted as a positive test, and PSA $< 2$ can be interpreted as a negative test. Using this criterion yields an estimate of sensitivity and specificity. When this is repeated for all possible cutoff values for PSA, it becomes evident that there is a tradeoff between sensitivity and specificity, as shown in Table 11.4. As seen in Table 11.4, nearly all patients with cancer have PSA $\geq 2$ (sensitivity of 1), but only three-fourths of patients without cancer have PSA $< 2$ (specificity of 0.72).

Table 11.4 Sensitivity and specificity of PSA as a diagnostic for prostate cancer
PSA (ng/mL)   Sensitivity   Specificity
 1.0          1.0           0.46
 2.0          1.0           0.72
 3.0          0.98          0.82
 4.0          0.95          0.88
 5.0          0.81          0.92
 6.0          0.54          0.95
 7.0          0.35          0.96
 8.0          0.22          0.97
 9.0          0.13          0.98
10.0          0.09          0.98
11.0          0.06          0.98
12.0          0.03          0.99
13.0          0.01          0.99
14.0          0.01          0.99
15.0          0.01          0.99

The results obtained for multiple cutoff values can be plotted in a receiver operating characteristic (ROC) curve. For each cutoff, the value of sensitivity is plotted on the y-axis and the value of (1 - specificity) is plotted on the x-axis. The value of the cutoff that is closest to the upper left quadrant (sensitivity of 1 and specificity of 1) is the cutoff that provides the greatest diagnostic accuracy. A ROC curve is displayed in Fig. 11.4 for the PSA data.

For this data set, a PSA value of 4.0 ng/mL is the cutoff that optimizes both sensitivity and specificity. Sensitivity and specificity can be interpreted as sample proportions for which confidence intervals can be constructed to estimate the precision of the sample estimate relative to the underlying population proportion. A two-sided $100(1-\alpha)\%$ confidence interval for a sample proportion $\hat{p}$ is

$$\hat{p} \pm z_{1-\alpha/2}\, SE(\hat{p}), \quad \text{where} \quad SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

For example, an estimate of sensitivity of 0.95 among 100 study subjects would result in a 90% confidence interval for the population sensitivity of

$$0.95 \pm 1.64\sqrt{\frac{(0.95)(0.05)}{100}} = 0.95 \pm 0.036 = (0.91,\ 0.99)$$

Apart from a test's accuracy relative to a gold standard diagnostic protocol, its ability to accurately screen for disease is of interest. The probability that a patient who tests positive for disease actually has the disease is called the positive predictive value. Similarly, the probability that a patient who tests negative for disease does not have disease is called the negative predictive value. Through the use of a mathematical expression called Bayes' theorem, it can be shown that the positive predictive value is a function of the sensitivity and specificity of the test and the underlying prevalence (expressed as a proportion) of disease in the population of interest:
(sensitivity)(prevalence)
positive predictive value =
(sensitivity)(prevalence) + (1 specificity)(1 prevalence)
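As a rough numerical companion to the confidence interval and positive predictive value formulas above, the following Python sketch evaluates both. The function names `proportion_ci` and `positive_predictive_value`, and the list of prevalences, are illustrative assumptions rather than part of the original analysis; z = 1.64 is the 90% critical value used in the worked example.

```python
import math


def proportion_ci(p, n, z):
    """Normal-approximation interval p ± z * sqrt(p(1 - p) / n)."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se


def positive_predictive_value(sensitivity, specificity, prevalence):
    """Bayes' theorem: P(disease | positive test)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# 90% interval for a sensitivity of 0.95 estimated from 100 subjects.
low, high = proportion_ci(0.95, 100, 1.64)
print(round(low, 2), round(high, 2))  # 0.91 0.99

# Positive predictive value for a test with sensitivity = specificity = 0.95
# at several assumed prevalences (illustrative values).
for prevalence in (0.1, 0.05, 0.01, 0.001):
    ppv = positive_predictive_value(0.95, 0.95, prevalence)
    print(prevalence, round(ppv, 2))
```

With the test characteristics held fixed, the positive predictive value falls steeply as the assumed prevalence decreases, which is why the prevalence estimate matters as much as the test's accuracy.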
The negative predictive value is also a function of these quantities. It is important to note that it is usually not appropriate to estimate the prevalence of disease from the same study used to define the diagnostic accuracy of the test since the two groups sampled (those who test positive and those who test negative) are typically not chosen at random from the population of interest. Estimates of the prevalence of disease are more appropriately obtained from epidemiologic studies.

Fig. 11.4 Receiver operating characteristic curve for PSA levels

As an example, consider a test with sensitivity and specificity of 0.95 each. If the prevalence of the disease is 0.1 (a common disease), the positive predictive value is 0.68. However, if the prevalence of the disease is 0.05, 0.01, or 0.001, the positive predictive value is 0.5, 0.16, and 0.02, respectively. These results suggest that, despite high diagnostic accuracy, a diagnostic test may not be informative when the disease is rare. This example also illustrates how Bayes' theorem is used to combine prior information (in this case, the prevalence of disease) with newly collected data (a result from a diagnostic test with certain accuracy) to estimate the probability of disease given the test result. Confidence intervals about the positive predictive value and negative predictive value also can be calculated to assess the precision of the sample estimates.

Some methodological issues are worth mentioning for diagnostic studies. As described by Ransohoff and Feinstein (1978) [12], there are a number of biases that may be introduced that affect the results of assessments for sensitivity and specificity. When carrying out a study to assess the diagnostic accuracy of a test, it is important to select participants carefully, so they are similar to a population of patients for whom the test may ultimately be used. Failure to include an appropriately broad group of participants may lead to the so-called spectrum bias. Further, it is important to use the same gold standard diagnostic test among all participants. Lastly, carrying out the gold standard and test diagnoses separately can eliminate the possibility that one influences the other, thereby artificially inflating the estimates of diagnostic accuracy.

When the standard diagnostic test cannot be considered a gold standard (i.e., results from it
cannot be considered the truth), sensitivity and specificity are not meaningful quantities. In this case, one would be more interested in the extent to which results from the new test were in agreement with the standard test. A new test could be helpful if its diagnostic accuracy was similar to the standard, but was advantageous for some other reason (e.g., less expensive, easier to administer, or safer than the standard test). One measure of agreement is the kappa statistic, which typically ranges from 0 (indicative of agreement likely due to chance alone) to 1 (indicative of perfect agreement). Interested readers are referred to Woolson and Clarke (2002) [6] and Landis and Koch (1977) [13] for additional details on this statistic.

Correlation and Regression

Describing how two random variables vary together can lend insight into their association. A measure of the extent to which one variable is linearly related to (or associated with) another is a correlation coefficient. Correlation coefficients can range from −1 to 1. Negative correlation coefficients imply that as values of one variable (e.g., displayed on the x-axis) increase, values of the second variable (displayed on the y-axis) decrease. Similarly, positive correlation coefficients mean that as values of one variable increase, values of the second variable also increase. Correlations of −1 or 1 imply perfectly linear relationships. A correlation of 0 implies that there is no linear relationship between the two random variables. One significant limitation of correlation coefficients is that one random variable may be related mathematically to another but have a small correlation coefficient because the relationship is not linear (e.g., a quadratic function).

The Pearson correlation coefficient, for which the sample estimate is denoted by the symbol $r$, is appropriate when the random variables are continuous and approximately normally distributed. The Pearson correlation coefficient is a function of the extent to which the two random variables vary jointly (the covariance) as well as the variance of each random variable. The coefficient is defined as

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

It is possible to test the hypothesis that there is a significant linear relationship between the two random variables by testing the value of the population correlation coefficient, $\rho$. An assumption for this test is that the random variables are normally distributed and they have a joint distribution called the bivariate normal distribution. The null and alternative hypotheses are

$$H_0: \rho = 0, \quad H_A: \rho \neq 0$$

The test statistic is

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

which has a t distribution with $n-2$ degrees of freedom when the null hypothesis is true. If the null hypothesis is rejected, it is in favor of the alternative hypothesis that the correlation coefficient is not equal to zero, meaning there is a significant linear relationship between the two random variables. Confidence intervals for $\rho$ are useful and can be obtained from statistical software. A note of caution is that cause and effect cannot be established solely on the basis of a statistical association.

When at least one of the random variables is not intervally scaled, but at least ordered (e.g., a rank or count variable), a nonparametric correlation coefficient is more appropriate. The Spearman rank correlation is computed by ranking both of the random variables and calculating the correlation coefficient on the ranks. For large sample sizes (n > 30), the hypothesis test of the Spearman rank correlation is based on a test statistic similar to that for the Pearson correlation coefficient.

A statistical method used to describe the relationship between an outcome (or dependent
variable) and one or more independent or explanatory variables (considered fixed) is called regression. Regression techniques use observed data to estimate model coefficients for the explanatory variables that account for the variability in the response. The simplest example is linear regression for which the dependent variable, often denoted for regression as Y, is expressed as a linear function of one or more explanatory variables, denoted as X or X1, X2, etc.

A linear regression model with a single explanatory variable, called simple linear regression, is $y = \beta_0 + \beta_1 x + \varepsilon$. One can obtain estimates of the model parameters for the y-intercept ($\beta_0$) and the slope ($\beta_1$) by fitting a line to a set of observed data points (paired values of x and y for all subjects in the study). The assumptions required for the use of linear regression are that for fixed values of X, the distribution of Y is normal (with potentially different means across X) and the variance of Y is equal for all values of X. The estimates of the model parameters, $\hat{\beta}_0$ and $\hat{\beta}_1$, are used to predict values of y for given values of x. The resulting prediction equation is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$. The interpretation of the slope coefficient is that for every unit change in x, the change in y is $\hat{\beta}_1$. For values of x, the difference between the actual and predicted values, $y - \hat{y}$, is called the residual because this difference represents the variability in the response that is remaining after fitting the model. The best fitting line is the one with the smallest sum of squared deviations between the observed and predicted values (i.e., smallest sum of squared residuals). Hence, the usual method to obtain the model estimates is called the method of least squares.

A hypothesis test may be used to test whether the value of the slope coefficient is different from zero. The corresponding hypotheses are

$$H_0: \beta_1 = 0, \quad H_A: \beta_1 \neq 0$$

If the null hypothesis is rejected, the appropriate conclusion is that there is a significant linear relationship between the independent variable and the dependent variable. The test statistic (and a corresponding confidence interval) for the slope coefficient use the standard error of the sample estimate, $se(\hat{\beta}_1)$. For the sake of brevity, the exact formulation of the standard error is not included in this chapter. The test statistic is defined as

$$t = \frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$$

which follows a t distribution with $n-2$ degrees of freedom under the null hypothesis. Other references for this topic note that this t-statistic is identical to that used for testing the null hypothesis, $H_0: \rho = 0$. Likewise, a $100(1-\alpha)\%$ confidence interval can be constructed as

$$\hat{\beta}_1 \pm t_{1-\alpha/2,\,n-2}\; se(\hat{\beta}_1)$$

Interested readers can find additional details in Schork and Remington (2000) [5].

In a prospective observational study of 202 adults between the ages of 20 and 60, triglycerides and other lipoproteins were tested over a period of several weeks. A linear regression model was fitted to the triglyceride levels as a function of age. The least squares estimates of the y-intercept and the slope yielded the following prediction equation:

$$\text{triglycerides} = 411.2 - 1.80 \times \text{age}$$

So for every year increase in age, triglycerides were lower on average by 1.80 mg/dL. Likewise, for every 10-year increase in age, triglycerides were lower on average by 18 mg/dL. A test of the slope coefficient for age based on the t distribution rejects the null hypothesis at the $\alpha = 0.05$ level, indicating a significant negative linear relationship between triglycerides and age. Kleinbaum et al. (1998) [14] have written a helpful reference for linear regression.

Survival Analysis and Logistic Regression

In many studies, subjects do not participate for the planned length of observation. When researchers are interested in the occurrence of a particular event or not (e.g., death, occurrence of a disease or condition, or onset of a symptom), the outcome
may or may not occur during the period of observation. It is often desirable to utilize the experience of subjects for the time they were under investigation, whether or not they had the outcome of interest. Consider a study of subjects who were newly diagnosed with a fatal disease. One may be interested in the death rate for the 5 years following diagnosis. Some subjects who enter such a study will die while under observation, some will survive the 5-year observation period, and some will withdraw from the study during the middle of the observation period with a last known status of alive. Among subjects who will eventually have the outcome of interest, the occurrence may not be during the period of observation. These subjects are said to be censored at the last known observation time.

Survival analyses are used when the outcome of interest is a binary outcome (event or not), and it is desirable to account for the time subjects are at risk for the event. The survival function, denoted S(t), describes the probability that a subject in the study will survive without having the event past a time, t. For example, S(1 year) is the probability subjects will survive past year 1 without the event. There are a number of statistical techniques used to describe and make inferences about the survival distribution.

One common method is Kaplan-Meier estimation of the survival function. The Kaplan-Meier estimate is constructed by calculating the conditional probability of subjects surviving a time interval (e.g., year 1–2) conditional on surviving all previous time intervals (e.g., year 0–1). Subjects who have the event or drop out prior to the time interval are not included in the risk set (i.e., they are no longer at risk) for that time interval and subsequent time intervals. The probability of surviving past a given time is calculated as the product of the probability of surviving the interval among those at risk and the probability of surviving all other previous time intervals. The survival function from the Kaplan-Meier estimate is often depicted graphically as shown in Fig. 11.5 for an unfavorable outcome.

Fig. 11.5 Kaplan-Meier estimate of the survival distribution for unfavorable outcome from a clinical trial

The Kaplan-Meier estimate is a step function whose steps follow the shape of the survival distribution. One can read off values of the survival function for a given time on the x-axis, as follows. In this figure, the survival distribution is plotted against time (days since start of treatment in a clinical trial). In the placebo group on Day 1, the estimate is 0.9, and then it drops down to 0.8 on Day 2. An important property of the step function defined using discrete event times is that it is discontinuous at the event times: the estimate stays constant between event times and drops at each one.
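The product-of-conditional-probabilities calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis behind Fig. 11.5: the ten follow-up times and event indicators are hypothetical, and the function names `kaplan_meier`, `survival_at`, and `median_survival` are assumptions introduced here.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times:  follow-up time for each subject
    events: 1 if the event was observed at that time, 0 if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        n_leaving = sum(1 for ti in times if ti == t)
        if n_events > 0:
            # Conditional probability of surviving this time among those at risk,
            # multiplied by the probability of surviving all earlier times.
            survival *= (at_risk - n_events) / at_risk
            curve.append((t, survival))
        # Subjects with an event or censored at t leave the risk set.
        at_risk -= n_leaving
    return curve


def survival_at(curve, t):
    """Step-function lookup: the estimate stays constant until the next event time."""
    s = 1.0
    for event_time, value in curve:
        if event_time <= t:
            s = value
    return s


def median_survival(curve):
    """Earliest event time at which the estimate is at or below 0.5, if any."""
    for event_time, value in curve:
        if value <= 0.5:
            return event_time
    return None


# Hypothetical data: 10 subjects, times in days, 0 marks a censored observation.
times = [1, 2, 2, 3, 4, 4, 5, 6, 6, 7]
events = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]

curve = kaplan_meier(times, events)
print([(t, round(s, 3)) for t, s in curve])
# [(1, 0.9), (2, 0.8), (3, 0.686), (4, 0.571), (5, 0.429), (6, 0.286)]
print(median_survival(curve))  # 5
```

Censored subjects contribute to the risk set up to their last observation time but never trigger a drop in the curve, which is how the method uses partial follow-up.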
For example, the survival distribution function for the placebo group equals 0.46 on Days 6, 7, and 8 (no events occurred), and then at Day 9, the estimate is 0.35. Looking at the Kaplan-Meier curve for the placebo group, one could read Day 9 as having an estimate of 0.35 or 0.46, but it is appropriate to remember that the outside edge of the step (right at Day 9) is discontinuous, and thus the estimated probability of survival for Day 9 or later is 0.35.

A commonly cited measure of central tendency from the Kaplan-Meier estimate is the median survival time. The median survival time is the value of t beyond which approximately 50% of subjects survive without the event, i.e., S(median time t) = 0.5. Using this guideline, one can read off the median survival times by drawing a reference line across the figure at S(t) = 0.50 and finding the earliest value of time on the curve below the reference line. The median times are 6 and 16 days for the placebo and active groups, respectively.

Cohort studies or clinical trials may have the comparison of survival distributions between two or more groups as an objective. This can be accomplished through the use of the logrank test. The logrank test is carried out by treating each distinct event time as a stratum, calculating contributions to a chi-squared test statistic within each stratum, and combining over the strata. The null hypothesis is that the survival distributions are the same. Under the null hypothesis, the expected counts of events would be similar to the observed counts across the groups being compared. Therefore, large deviations between the observed and expected counts at a number of event times will lead to a large value of the logrank test statistic. When two groups are compared, the resulting test statistic is distributed as a chi-squared statistic with 1 degree of freedom. If the null hypothesis is rejected, the conclusion is that the survival distributions did not arise from the same population.

A second analysis approach used with censored data is called Cox regression. For Cox regression, the outcome is not the probability of survival but the hazard, defined loosely as the risk of the event in a small interval of time. The hazard is modeled as a function of one or more explanatory variables (e.g., age, treatment in a clinical trial, baseline severity status). The contrast between the simple linear regression model and the Cox regression model is important to understand, as the model coefficients are interpreted differently in the two cases. The Cox regression models the hazard (y) as a function of a single explanatory variable x and is given by

$$y = \beta_0 e^{\beta_1 x}$$

The term $\beta_0$ can be thought of as the baseline hazard for a reference group represented by X = 0. If the explanatory variable X is dichotomous (e.g., 1 = hypertensive vs. 0 = normotensive), the baseline hazard represents the hazard for normotensive patients. That is, X = 0 implies $y = \beta_0$. Note that when X = 1, $y = \beta_0 e^{\beta_1}$. The ratio of these two, $e^{\beta_1}$, is called the hazard ratio, which can be thought of as the relative risk of the event for hypertensive patients compared to normotensive patients. When the explanatory variable, X, is continuous, the hazard ratio corresponds to the multiplicative increase in hazard associated with a one-unit change in X. Since the hazard for many events does not change appreciably with small increments in the explanatory variable, it is often helpful to recode or rescale the explanatory variable. The Cox regression model can be extended to include multiple explanatory variables. An important assumption for the model is that the contribution of the explanatory variable(s) has a constant multiplicative effect on the hazard over time. This is often referred to as the proportional hazards assumption.

As with simple linear regression, fitting the model results in estimates of each of the coefficients and corresponding standard errors. Exponentiation of the coefficient estimates for an explanatory variable results in a hazard ratio expressing the increase in risk for one value of the explanatory variable compared to another while adjusting for other explanatory variables in the model. Confidence intervals and tests for the coefficients can be constructed which can be transformed to confidence intervals and tests for

the hazard ratio. Care must be taken to code the model correctly so the interpretation can be made with respect to a meaningful reference group (or baseline hazard group) and not an arbitrary one. Cox models can be particularly helpful in observational studies since the primary interest is typically in one experimental factor (exposed or not) while controlling for other potential explanatory effects for the response. An introduction to this topic can be found in a text by Woolson and Clarke (2002) [6]. A reference at a more advanced level has been written by Lee (1992) [15].

A technique called logistic regression is helpful when the outcome of interest is dichotomous (e.g., death, seroconversion), and the research objective is to describe how the probability of the outcome is related to one or more explanatory variables without accounting for the time at risk. Instead of modeling the probability of outcome as a linear function of explanatory variables, the log odds of the outcome is the dependent variable, where the odds is defined as $\frac{p}{1-p}$, the probability of outcome divided by the probability of no outcome. The reason for this choice is that a probability is bounded by 0 and 1, whereas the log odds or logit, $\ln\left(\frac{p}{1-p}\right)$, is continuous on the scale from negative to positive infinity. The logistic model with a single independent variable is specified as

$$\ln\left(\frac{p}{1-p}\right) = y = \beta_0 + \beta_1 x$$

Model estimates can be interpreted in a manner similar to that described for the Cox regression model. If X is a dichotomous variable (e.g., gender), the predicted value $\hat{y} = \hat{\beta}_0$ is the log odds of the event for a reference group with x = 0. When x = 1, the predicted value $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1$ is the log odds of the event for the group with x = 1. The difference of these two is the log odds ratio, $\hat{\beta}_1$. Exponentiation of the log odds ratio results in the odds ratio, $e^{\hat{\beta}_1}$, which is an estimate of the relative risk of the event for subjects with x = 1 compared to those with x = 0 (the approximation is closest when the event is rare). Logistic regression models can be extended to multiple explanatory variables (either categorical or continuous). When an explanatory variable is continuous, the odds ratio is interpreted relative to a unit change in x. For example, in a logistic model of coronary heart disease (CHD) as a function of LDL cholesterol, an odds ratio of 1.02 means that a patient with LDL of 130 has $1.02^{10}$ times greater odds, and thereby greater risk, of developing CHD than a patient with LDL of 120. Standard errors for the estimates may be used in the construction of confidence intervals for the odds ratio. Thus, the precision of the sample estimate can be evaluated, and tests of the hypotheses can be carried out to determine if an explanatory variable is significantly associated with increased risk of the outcome or event. Excellent references for logistic regression include Kleinbaum and Klein (2002) [16], Stokes et al. (2000) [11], and Hosmer and Lemeshow (2000) [17].

Summary

This chapter has served as an introduction to statistical methods in medical research. Descriptive statistics were discussed and are commonly used to characterize the experience of study subjects and their background characteristics. Inferential statistical methods, such as confidence intervals and hypothesis testing, are frequently used to evaluate observed associations relative to chance variation in the sampling process. The research process begins with a research question that motivates a study designed to answer the question for which relevant data are collected. The involvement of statistics ideally begins at the start of the research process and concludes with the final interpretation of the analyses. Further study of these topics is encouraged so that readers may enhance their abilities to interpret results of published medical literature.
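As a numerical footnote to the logistic regression example in this chapter (an odds ratio of 1.02 per unit of LDL cholesterol), the Python sketch below rescales a per-unit odds ratio to a 10-unit difference and contrasts it with the corresponding ratio of probabilities. The function names and the baseline probability of 0.10 are hypothetical choices made only for illustration.

```python
import math


def odds_ratio_for_difference(or_per_unit, delta):
    """Odds ratio for a difference of `delta` units: exp(beta * delta)."""
    beta = math.log(or_per_unit)  # log odds ratio per unit
    return math.exp(beta * delta)


def probability_from_odds(odds):
    """Invert odds = p / (1 - p)."""
    return odds / (1 + odds)


# Odds ratio comparing LDL of 130 with LDL of 120 (a 10-unit difference).
or_10 = odds_ratio_for_difference(1.02, 10)
print(round(or_10, 3))  # 1.219

# Hypothetical baseline probability of the event at LDL of 120.
p_120 = 0.10
odds_120 = p_120 / (1 - p_120)
p_130 = probability_from_odds(odds_120 * or_10)

# Ratio of probabilities (relative risk) implied by the same odds ratio.
risk_ratio = p_130 / p_120
print(round(risk_ratio, 2))  # 1.19
```

Under this assumed baseline the ratio of probabilities is somewhat smaller than the odds ratio, and the two move closer together as the baseline probability shrinks.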
Take-Home Points
• Descriptive statistics are used to summarize individual observations from a study and estimate a typical value (measures of central tendency) and the spread of values (measures of dispersion). Measures of central tendency include the mean and median. Measures of dispersion include the standard deviation and the range.
• Hypothesis tests and confidence intervals are two general forms of inferential statistical methods, for which the aim is to make an inference from a sample of subjects to a relevant population.
• Confidence intervals represent a plausible range of values for a population parameter, such as the difference in mean response, the difference in proportions, or the relative risk.
• p values are reported from hypothesis tests. Small p values (e.g., <0.05) suggest that the observed result was unlikely to have occurred by chance alone.
• There are many statistical methods which may be appropriate for any given research study. The most appropriate statistical approaches must consider the research question and the study design.
References

1. Senn S. Dicing with death: chance, risk, and health. Cambridge: Cambridge University Press; 2003.
2. Campbell MJ, Machin D, Walters SJ. Medical statistics: a textbook for the health sciences. 4th ed. Chichester: Wiley; 2007.
3. Durham TA, Turner JR. Introduction to statistics in pharmaceutical clinical trials. London: Pharmaceutical Press; 2008.
4. Bowers D, House A, Owens D. Understanding clinical papers. 2nd ed. Chichester: Wiley; 2006.
5. Schork MA, Remington RD. Statistics with applications to the biological and health sciences. Upper Saddle River: Prentice-Hall; 2000.
6. Woolson RF, Clarke WR. Statistical methods for the analysis of biomedical data. 2nd ed. New York: Wiley; 2002.
7. Cohen J. A power primer. Psychological Bulletin. 1992;112:155–9.
8. Chow S-C, Liu J-P. Design and analysis of clinical trials. 2nd ed. Hoboken: Wiley; 2004.
9. LaVange LM, Koch GG. Rank score tests. Circulation. 2006;114:2528–33.
10. Westfall PH, Tobias RD, Rom D, Wolfinger RD, Hochberg Y. Multiple comparisons and multiple tests using SAS. Cary: SAS Institute, Inc.; 1999.
11. Stokes ME, Davis CS, Koch GG. Categorical data analysis using the SAS system. 2nd ed. Cary: SAS Institute Inc.; 2009.
12. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med. 1978;299:926–30.
13. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–74.
14. Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Applied regression analysis and other multivariable methods. 3rd ed. Pacific Grove: Duxbury Press; 1998.
15. Lee ET. Statistical methods for survival data analysis. 2nd ed. New York: Wiley; 1992.
16. Kleinbaum DG, Klein M. Logistic regression: a self-learning text. 2nd ed. New York: Springer; 2002.
17. Hosmer DW, Lemeshow S. Applied logistic regression. 2nd ed. New York: Wiley; 2000.
Ethical Issues in Clinical Research
12
Eli A. Friedman
Introduction

For more than four millennia, professionals involved in human medical examination and research have pondered the boundaries for responsible patient care, and many have questioned the ethical limits of human subjects research. The goals of this chapter are to:
• Briefly review the historical background of contemporary research ethics
• Elucidate the ethical issues to be considered when constructing human research studies
• Provide workable definitions and guidelines governing benefits and risks involved in human subjects research.

A study benefit is a positive effect gained from participating in a research study that might accrue to an individual, such as obtaining a better therapeutic outcome (e.g., life extension, morbidity reduction) or a less tangible advantage such as learning that a medical treatment might reduce the need for a more invasive procedure or improve quality of life in a specific patient. The risks of a study, particularly one examining the impact of a purposively applied intervention, may range from a minor skin rash to a major complication such as liver failure, stroke, and even death. Even with purely observational studies, risks to patient privacy (especially the potential loss of that privacy) must always be carefully considered while preparing and subsequently performing an investigation. Risks to the subject are always weighed relative to the probability of benefit (both to the subject and to society in general).

For any research study, the benefits and risks can never be known ahead of time, nor can the effects be fully determined before investigation is finished. (Indeed, if they could be known beforehand, there would be no need for a study to take place.) While seemingly obvious, this precept should be a conscious pragmatic reality for all clinical researchers so that necessary discretion is followed prior to drawing important conclusions. Even when preliminary anecdotal information is suggestive or prior animal studies appear to support a hypothesis related to disease processes or treatment outcomes in humans, clinical circumstance or results obtained in an animal model of a disease may not translate into solid evidence for human clinical practice. As an illustration of the limits of preclinical investigation, in a study conducted by this author in the late 1990s [1], treatment with agents that block the enzyme aldose reductase effectively halted progression of diabetic retinopathy and nephropathy in induced diabetic rats but was of minimal to no value when tested in diabetic humans.

E.A. Friedman, MD
Department of Medicine, State University of New York, Downstate Medical Center, Brooklyn, NY 11203, USA
e-mail: elifriedmn@aol.com
Background: Recognition and Introduction of an Ethics Base in Medicine

For the purposes of this chapter, the following discussion on medical ethics will focus primarily on clinical research involving human subjects. With that said, the contemporary notion that the individual is inherently valuable, and as such should be protected by policies that regulate human investigation, is derived from a sense of humanism evidently valued long before the enactment of twentieth century legislation. Thus, in order to provide context for the contemporary state of ethics in human investigation, this section will provide a brief introduction to some historical figures and influential schools of thought that helped to shape the vast field of modern-day health care. We find evidence of attention to medical ethics and the nature of the relationship between practitioner and patient in prominent ancient civilizations, such as early Chinese and Greek cultures.

Many of the principles of medical ethics that arose from these two cultures can be traced to the teachings of two renowned historical figures: Confucius and Hippocrates. Confucius was a Chinese thinker and educator of the fifth century B.C., while Hippocrates lived in ancient Greece in the fourth century B.C., practicing as a physician and providing instruction on the art of medicine [2].

Central to Confucianism, the ideology based on the teachings of Confucius, is the chun-tze, the morally ideal person. Confucius' concept of persons, according to his theories of chun-tze, gives a two-dimensional approach to life: the "autonomous person" and the "relational person" [3]. Rather than promulgating a universal code of behavioral guidelines, Confucius proposed that each person subject himself to self-examination. Accordingly, physicians following Confucian teachings practiced self-cultivation through self-examination, self-criticism, and self-restriction. At the same time, a significant aspect of personhood is based on the individual's interpersonal relationships. This is the "relational" aspect of personhood, from which it follows that the humaneness (jen) one must attain in striving toward chun-tze can only be achieved through interaction with other individuals [3]. Thus, according to Confucianism, physicians are in the position of striving for chun-tze and humaneness on a personal level but, as doctors, they help others to maintain balance in the autonomous and relational aspects of their lives.

Daniel Fu-Chang Tsai [4] evaluated the vast teachings of ancient Chinese medical ethics that Confucianism has engendered. The principles that he identified as common threads throughout the various teachings and texts from this school of thought are given in Table 12.1.

Table 12.1 Central principles of ancient Chinese medical ethics
• To appreciate the value of life and practise medicine with a heart of compassion and humaneness
• To master Confucianism prior to learning medicine
• To master medical knowledge by studying reliable sources diligently and extensively
• To improve clinical skill and maintain a high professional standard
• To be frugal, not to be greedy for wealth and fame
• To treat patients equally, and as if they were your family
• To be sincere, decorous, devoted, absorbed and selfless in treating patients
• To treat female patients only in the presence of an attendant; respecting their confidentiality, and not being lustful
• To be modest and prudent toward other physicians, not to belittle and criticize one's colleagues [4]

Hippocrates (460 B.C.) became one of the most well-known ancient scientists for formally favoring constraints on physician behavior [5]. His credo for neophyte physicians, known as The Hippocratic Oath, requires practitioners to follow a system of guidelines that ultimately benefits patients while abstaining from any actions that are mischievous or would not be in the patient's best interest; various versions of this oath currently exist [6, 7], and to this day, one version or another is recited throughout the world by new physicians. Conflicting interpretations of this oath are evident when physicians defend their
Table 12.2 Early unregulated human research efforts: An incomplete chronology

1845–1849: Dr. J. Marion Sims performs a series of experimental gynecological operations without anesthesia on enslaved African-American women [15]
1874: Dr. Roberts Bartholow inserts needle electrodes into the exposed brain of a feeble-minded servant woman as part of a series of experiments in cerebral localization [16]
1895: Dr. Henry Heiman infects two idiot boys with gonorrhea to investigate the causative agents of the disease [17]
1896: Dr. Arthur Wentworth withdraws spinal fluid from 29 hospitalized children to determine the effectiveness of spinal tapping [18]
1906: Dr. Richard P. Strong, researching vaccines for tropical diseases in the Philippines, injects inmates at a Manila prison with cholera, 13 of whom later die [19, 20]
1908: Three Philadelphia physicians infect children at the St. Vincent's Home orphanage with tuberculin in order to compare the effectiveness of several diagnostic tests [18]
1918–1922: Dr. Leo Stanley subcutaneously injects over 600 inmates at San Quentin prison with animal testicular tissue while researching a cure for criminality [19, 20]
1914: Dr. Joseph Goldberger, in an effort to prove that pellagra is caused by nutritional deficiencies, induces the disease in a dozen Mississippi inmates, denying their requests to be removed from the study [19, 20]
1921: Dr. Alfred F. Hess, studying the effect of varying dietary factors on the development of disease, withholds orange juice from infants until they show the characteristic hemorrhages of scurvy [18]
1931: Dr. Cornelius Rhoads, studying hookworm and tropical sprue anemia in Puerto Rico, transplants cancer in several human subjects (killing eight) after writing in a confidential note to a colleague that the entire population should be exterminated [21]
treatment decisions on the basis of perceived ethical obligation to never cause the death of any
patient. Contrary to this view is the belief that a physician holds an ethical obligation to relieve
pain even if the patient dies as a consequence of the advocated treatment. (One of the most widely
known modern examples includes the views of Dr. Jack Kevorkian [8–11]).

It was Celsus, a Roman encyclopedist, who is thought to have been the first to consider the
rights of subjects under experimentation [12]. He spoke strongly against procedures such as
vivisection on condemned criminals in Egypt, calling physicians who performed them "assassinating
medical practitioners" [13]. Though it certainly is the case that both the ethics regarding human
subjects research and regulations for such research have evolved substantially since the time
of Celsus, his belief that medical practice should be a work of mercy as opposed to one of dire
cruelty laid the ethical foundation for human subjects research long ago, eventually becoming
the moral standard by which such research is judged today.

One would be hard-pressed to challenge the influence of Confucius and Hippocrates on
modern-day medicine. More than 2,000 years have passed since their deaths, dynasties have
risen and fallen, and religious figures, revolutions, and explorations have led to vast changes
in virtually every aspect of civilization. Yet, as noted above, to this day, most graduating
medical students in the United States of America (USA), Canada, and in certain other parts of the
world recite some form of the Hippocratic Oath, and current US federal legislation incorporates
principles identified by Confucianism as central to the practice of medicine. Thus, society
continues to acknowledge the importance and relevance of the ancient Greek and ancient Chinese
teachings on medical ethics, both of which championed one particular concept above all others:
the veneration of human life, today termed "benevolence." From Hippocrates forward, all
guides to the ethical practice of medicine included this concept [14]. Although benevolence
in medicine implies that physicians should do everything in their power to ensure no harm is
done to the patient, hundreds of incidents, as sampled in Table 12.2 [15–21], reflect efforts to
exploit availability of prisoners, slaves, impoverished adults, and even children in sometimes
236 E.A. Friedman
lethal medical investigations prior to the establishment of regulatory boundaries for
human subjects studies.

For the majority of cases referenced in Table 12.2, informed consent was obtained neither
from the adult subjects nor the parents of the children. Furthermore, for the cases in which
consent was obtained, the subjects were either mentally limited individuals without a proxy or
prisoners promised pardon for participation in research. As reflected by the chronology of the
table, abuses of patients by their doctors existed long before Germany invaded Poland,
initiating World War II in 1939. In fact, it was German governmental regulations in 1931 that
first promulgated a code for conducting human investigation in what was termed the Reich Health
Council Regulations of 1931 [22]. This document (obviously ignored by Adolf Hitler throughout
his 12-year Third Reich) consisted of 14 points demanding complete responsibility of the
medical profession, including informed consent and risk-benefit analysis for human medical research
experimentation. Included were technical and ethical standards for maintaining written records
describing the justification for studying vulnerable populations.

The Post-WWII Evolution of Ethical Policies for Human Subjects Research

Many scholars trace modern concerns about the unethical treatment of patients to the findings
of the Nuremberg Military Tribunal (also known as the Doctors' Trial), which followed sadistic,
unscientific research involving forced human exposure to the effects of freezing, incendiary
devices, mustard gas, and other experimental atrocities performed under Nazi Germany during
World War II. Of 23 Nazi doctors and scientists tried for the murder of concentration camp
inmates who were used as research subjects, 15 were convicted (7 were condemned to death by
hanging, while 8 received prison sentences from 10 years to life [23]). An outgrowth of the
judgment and sentences handed down at the trial was an outline of required elements for conducting
research with humans, collectively known as the Nuremberg Code [24, 25] (Table 12.3), which
currently is recognized as the most important document in contemporary human subjects research
ethics. Table 12.3 lists the elements of the code.

As the eyes of the world focused on the activities in Nuremberg in the 1940s, events
Table 12.3 The Nuremberg Code of 1947

1. The voluntary consent of the human subject is absolutely essential.
2. The experiment should be such as to yield fruitful results for the good of society, unprocurable by other methods
or means of study, and not random and unnecessary in nature.
3. The experiments should be so designed and based on the results of animal experimentation and knowledge of the
natural history of the disease or other problem under study that the anticipated results will justify the performance
of the experiment.
4. The experiment should be so conducted as to avoid all unnecessary physical and mental suffering and injury.
5. No experiment should be conducted where there is an a priori reason to believe that death or disabling injury will
occur, except perhaps, in those experiments where the experimental physicians also serve as subjects.
6. The degree of risk to be taken should never exceed that determined by the humanitarian importance of the
problem to be solved by the experiment.
7. Proper preparations should be made and adequate facilities provided to protect the experimental subject against
even remote possibilities of injury, disability or death.
8. The experiment should be conducted only by scientifically qualified persons. The highest degree of skill and care
should be required through all stages of the experiment of those who conduct or engage in the experiment.
9. During the course of the experiment the human subject should be at liberty to bring the experiment to an end if he
has reached the physical or mental state where continuation of the experiment seems to him to be impossible.
10. During the course of the experiment the scientist in charge must be prepared to terminate the experiment at any
stage, if he has probable cause to believe, in the exercise of the good faith, superior skill, and careful judgment
required of him, that a continuation of the experiment is likely to result in injury, disability, or death to the
experimental subject [25].
concurrently unfolding and earning publicity in the USA were slowly beginning to draw audible
concern from the American public, government officials, and professionals in various fields
regarding the ethical nature of human subjects research being conducted domestically. Criticism
followed a paper published in 1936 on the "Tuskegee Study of Untreated Syphilis in
the Negro Male," a research project initiated by the US Public Health Service in conjunction
with the Tuskegee Institute in 1932. The purpose of the study was to investigate the effects of
untreated syphilis [26]. Six hundred black males in Macon County, Alabama, approximately two
thirds with syphilis, were enrolled in the program under the premise that they were to be treated
for what was at the time colloquially termed "bad blood" [27]. This vague term was used to refer to
a number of ailments, including syphilis. As critics charged, treatment was withheld from the
men in the study even after treatment with penicillin was accepted as the standard of care for
syphilis in 1945; unwitting subjects were led to believe they were being treated. Despite eliciting
concern as early as 1936, the study remained in progress until 1972. In fact, it took 30 years from
that first Tuskegee publication for the movement toward evaluating ethical practices in human
subjects research to gain any truly sustained momentum on a national level.

Internationally, discussions on the ethics of human experimentation continued following the
Nuremberg proceedings. The World Medical Association (WMA) prominently adopted the
Declaration of Helsinki in 1964 to serve as a guide for regulating human subjects research.
Though ratified by multiple WMA General Assemblies, most recently in October 2008, the
wide-ranging principles and policies of the Declaration of Helsinki (spanning fundamentals
of ethical recruitment of study subjects, to principles of good study design, to essential elements
of a research protocol, to ethical considerations in publication of the results of the research)
remain active to this day [28].

The worldwide presentation of these principles may have helped prominent Harvard-trained
anesthesiologist Henry K. Beecher, M.D., to capture the attention of members of the medical
and science communities in the USA to whom he had attempted to voice his concerns about the
ethical nature of human subjects research since the late 1950s. Beecher, at one point
the anesthesiologist-in-chief at Massachusetts General Hospital, was a renowned researcher in
his own right, a significant factor contributing to his concern about the ethical quality of research
practices [29]. As David J. Rothman described, Beecher's sharpest fear was that research of
dubious ethicality might impugn the legitimacy of experimentation, discrediting the prime force
bringing progress to medicine [30].

Though Beecher's 1959 JAMA publication, "Experimentation in Man," did not achieve the
reverberating impact for which he had hoped, a 1965 speech that he gave to an audience of
journalists invited by the Upjohn Pharmaceutical Company had a far more tangible influence on
ethical human subjects research discourse in the USA. Beecher's revelations of ethical
misconduct and his assertion that such questionable activities were being conducted at leading
medical schools, medical centers, even in the military, caused enough of a stir among his audience
to spark dramatic headlines throughout the nation, such as the Boston Globe's "Are humans
used as guinea pigs not told?" As one might expect, Beecher faced a strong backlash from
colleagues who felt as though he had violated professional etiquette [29] by discussing his
concerns and in such a public manner. Despite the hostile response, Beecher continued to push
the issue.

In 1966, the New England Journal of Medicine published a paper, "Ethics and Clinical Research,"
by Beecher in which he reported his survey of 22 published medical studies documenting exposure
of subjects to substantive risks without their knowledge or approval [31]. Of note, these
studies were conducted at some of this country's most prestigious institutions, gaining publication
in highly prestigious journals. Examples of investigator misdirection and/or abuse of their study
patients included:
• Performing heart catheterizations in patients who believed that they were to have
  bronchoscopy
• Assigning patients with life-threatening diseases to placebo control groups, where
  effective treatments were known to be available
• Randomizing US soldiers suffering from streptococcal pharyngitis to penicillin versus
  treatments known to be ineffective.

While Beecher's article drew attention in its own right, his crusade gained remarkable steam
through the publicity generated by a 1972 New York Times article. Whistleblower Peter Buxtun
revealed the shocking truths behind the Tuskegee study to the paper, which subsequently published
"Syphilis Victims in US Study Went Untreated for 40 Years" as its front-page headline on July
26, 1972 [32]. When the study was terminated in 1972, congressional hearings were held to
address the matter of ethical conduct in human investigation.

The National Research Act of 1974 was passed in the USA as a direct response to these
above-mentioned ethical abuses (especially the revelation of the Tuskegee experiment) [33]. Through
the act, Congress called for the establishment of the National Commission for the Protection of
Human Subjects of Biomedical and Behavioral Research [34], which was charged with the tasks
of identifying key ethical issues to be addressed by researchers and injecting clear ethical
practices into human subjects research that would help assure the public of the safety of medical
research and avoid future atrocities. Following Beecher's disturbing portrayal of extreme
overriding of patient rights in medical investigation by US investigators and the rules established by
the 1974 Research Act, additional reports were published that recounted instances of exposing
subjects, without their consent, to radiation, infectious agents, or injection of cancer cells. Of
the responses generated, perhaps the single most important resource used as a basis for governing
both the practice of medicine and conduct of research involving human subjects was The
Belmont Report [35], released by the commission in 1979, which established:
• Boundaries between clinical practice and otherwise unneeded research
• Basic ethical principles to be preserved during all research studies (respect for persons,
  beneficence, and justice)
• Fundamental applications (guidelines for informed consent, assessment of risk and
  benefits, and selection of subjects).

Notably, we find that The Belmont Report [35], a document created late in the twentieth
century in a highly developed Western nation, presents morality-driven guidelines similar to
those of ancient Confucian ideology and Hippocrates. In "Part B: Basic Ethical Principles"
of the report, "respect for persons" asserts the importance of respecting an individual's
autonomy and protecting those persons with diminished autonomy, "beneficence" requires that
actions do not cause harm and that treatments aim to maximize potential benefit while
minimizing risks, and "justice" entails considering various factors in determining the fairness in
distribution with regard to the benefits and risks of human subjects research.

In the decades both leading up to and following the release of the Belmont Report, the USA
undertook a substantial review and overhaul of federal regulations in human subjects research.
A chronology of key events is provided in Table 12.4.

The Genesis of Institutional Review Boards in the USA and Their Regulatory Role

With the guidance of The Belmont Report, the US Department of Health, Education, and Welfare
(now the Department of Health and Human Services [HHS]) established requirements for the
development of Institutional Review Boards or IRBs [36]. (IRB is a generic term used by
governmental agencies, but each institution that establishes an IRB may maintain any name to
describe such a board.) As a general rule, the role of the IRB is to regulate human subjects research
by advocating, upholding, and maintaining the
Table 12.4 Post-World War II developments aimed at protecting human subjects in research
1947: Nuremberg Code defines subject-centered principles for ethical human subjects research in response to
unethical medical experimentation by the Nazis during WWII [25].
1964: World Medical Association adopts the Declaration of Helsinki, defining new guidelines for human subjects
research (last revised in October 2008) [28].
1965: A speech addressing problems in clinical research is given by Henry Beecher, M.D., to journalists
assembled by the Upjohn Pharmaceutical Company and draws attention nationwide through prominent
media outlets [29].
1966: Henry Beecher, M.D., publishes "Ethics and clinical research" in The New England Journal of Medicine,
expressing concern over the potentially vast impact of unethical procedures in clinical research, and
referencing 22 studies without explicitly identifying the studies or investigators [30].
1972: Tuskegee whistleblower Peter Buxtun contacts the Associated Press with information on the study, leading
to The New York Times' July 26, 1972, article, "Syphilis Victims in U.S. Study Went Untreated for 40 years;
Syphilis Victims Got No Therapy"; the study is terminated that same year [32].
1973: Congressional hearings are held to address human experimentation primarily in response to the Tuskegee
revelations [33].
1974: The National Research Act is created, establishing the National Commission for the Protection of Human
Subjects of Biomedical and Behavioral Research [34].
1979: The Commission releases The Belmont Report, identifying relevant ethical principles and guidelines for
human subjects research [35].
1981: Human subject regulations are amended to provide a common framework within which Institutional
Review Boards (IRBs) can review human subjects research [36].
1991: Regulations for the protection of human subjects are codified under Title 45, Part 46 of the Code of Federal
Regulations; Subpart A is accepted by 17 US Federal Agencies as the Common Rule [38].
rights and welfare of humans participating in research. IRBs are universally engaged in all
health and social science studies funded by the National Institutes of Health (NIH) and HHS.
Such studies include, but are not limited to, clinical trials of new, novel, or repurposed devices or
drugs regulated by the Food and Drug Administration (FDA); investigations of behavior,
opinions, and attitudes; or studies on healthcare management.

In 1991, the US Federal Policy for the Protection of Human Subjects was published in
the Federal Register (56 FR 28003) and incorporated into the regulating codes of 17 Federal
departments [37]. The policy, known as the "Common Rule," provides specific direction for
the operations and regulation of IRBs, outlines requirements for obtaining informed consent, and
requires written assurance of institutional compliance with federal research regulations. The
policy was codified by HHS as Title 45 Code of Federal Regulations [CFR] Part 46 Subpart A,
"Basic HHS Policy for Protection of Human Research Subjects" [38]. It later was codified by
the FDA in 21 CFR 56.107, which covers FDA oversight of drugs and medical devices.

The Common Rule requires that IRBs approve and oversee all human research supported directly
or indirectly by what is today known as HHS. It is within the purview of the Office for Human
Research Protections (OHRP) within HHS to regulate all IRBs, but today, all IRBs also are
subject to additional governmental organization (e.g., FDA) regulations. Similar regulatory boards,
such as the Institutional Animal Care and Use Committees, have been in place for animal research
since the enactment of the Laboratory Animal Welfare Act of 1966 and may be considered
"IRBs" for nonhuman research subjects. For a broader discussion of ethical issues considered
in preclinical research, which is beyond the scope of this chapter, the reader is referred to
Animal Experimentation: The Moral Issues (Baird and Rosenbaum, 1991) [39] or Animal
Experimentation: A Guide to the Issues (Monamy, 2000) [40].

Historically, academic institutions and medical facilities created their own IRBs to oversee
human subjects research, specifically to avoid or limit ethical problems in such research. In this
era, there are additional for-profit independent or commercial IRBs that institutions may choose to
contract to monitor their research; their role, accountability, and composition are no different
from those of traditional IRBs. In brief, all IRBs must contain at least five members, chosen in a
nondiscriminatory fashion, with sufficient expertise to judge the scientific merit of each proposed
protocol and to assess whether the rights of the subjects are properly safeguarded. In the early
days of IRBs, concerns were raised over the relatively homogeneous composition of IRB membership.
In response, the HHS Common Rule provided regulations in 45 CFR 46.107 designed to ensure
satisfactory and unbiased review of clinical research projects by requiring diversity of IRB
members with regard to their field of expertise, affiliations, experience, gender, race, and cultural
background [38]. A majority of the members must be present for voting to take place, at least
one of whom is a nonscientist, and IRB members may not vote on their own projects [38, 41]. As
necessary, IRBs can invite nonvoting content experts to assist in the review process [38].

IRBs must review all research protocols and related materials (e.g., informed consent
documents and promotional fliers) to ensure that proposed investigations are ethically conducted. For
example, they must determine that patients are properly selected, that the proposed protocol is
designed so that valid inferences can be drawn, that subjects are fully informed about the risks
and benefits of the study, and that their participation is entirely voluntary (or, for special patient
populations [e.g., those with dementia, mental retardation, severe neuropsychiatric disorders],
that informed permission is appropriately obtained by proxy). Their role is to maximize safety in the
delicate balance between risk and benefit for subjects once they are enrolled in research.

Each IRB must advocate and uphold the interests of all research subjects. Such advocacy
includes protection of the future interests of subjects, especially in situations involving tissue
storage. Clearly, future technologies may arise that can yield potentially valuable new data.
However, these findings may pose considerable unexpected risk to subjects, especially if that
information were later revealed and linked back to the subject.

The underlying concern for governing bodies regulating clinical research is the level of risk
posed to human subjects. As such, the cornerstone for virtually all IRB operations is the
evaluation of risks to study subjects. Beginning at the earliest stages of application for study approval,
the nature of identified risks to human subjects in a research study directs the procedures for IRB
review and approval. For example, the level of risk (a concept to be further described
momentarily) is a significant factor in determining whether a research study qualifies for exempt
status or expedited review or requires full-committee review. It should be noted that certain
types of research protocols (as mentioned below) may qualify for "exempt" status with regard to
IRB review. Clinical research studies considered exempt by the IRB are additionally absolved
from standard informed consent requirements unless the research involves protected health
information, in which case patient authorization or IRB waivers of authorization must be obtained
for each subject [42].

For the purposes of IRB review, there are three levels of risk to which subjects can be exposed
in any given research study: less than minimal risk, minimal risk, and greater than minimal risk
[43]. Studies that involve "less than minimal risk" include those that pose no known physical,
emotional, psychological, or economic risk to subjects. Such studies may be deemed exempt
from IRB review and, therefore, would not require review by an IRB committee member.

As stipulated in 45 CFR 46.101(b), a proposed investigation may be classified as exempt
(unless otherwise mandated by a department or agency head) if it limits involvement of human
subjects to one or more of the following categories: (1) educational practices and assessments
(e.g., comparing two or more teaching methods), (2) interviews or observations of public behavior,
and (3) studies of public data or specimens without accompanying information that might
permit subject identification [38]. Also exempt is
research examining public benefit or service
programs, procedures for obtaining benefits or
services under those programs, possible changes
in (or alternatives to) those programs or proce-
dures, or modification of payment for benefits or
services in these programs. Other exemptions
include dietary studies of nontoxic food deemed
to be safe for human consumption by the FDA,
Environmental Protection Agency, or the Food
Safety and Inspection Service of the US
Department of Agriculture [38]. The participa-
tion of certain populations (e.g., minors, prison-
ers, pregnant women) generally excludes studies
that otherwise may be viewed as posing less
than minimal risk from qualifying for exempt
status.
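
The exemption screen described above can be sketched as a short decision rule. The following Python fragment is only an illustration under stated assumptions: the category labels, the Study record, and the function name are hypothetical and are not regulatory terms, and the lists simply restate the categories and populations named in the text. Any real determination rests with the IRB, not with a checklist of this kind.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the exemption categories summarized in the text
# (45 CFR 46.101(b)); the names are illustrative, not regulatory language.
EXEMPT_CATEGORIES = {
    "educational_practices",
    "public_behavior_interview_or_observation",
    "public_data_or_unidentifiable_specimens",
    "public_benefit_program_evaluation",
    "safe_food_dietary_study",
}

# Populations that, per the text, generally rule out exempt status.
EXCLUDING_POPULATIONS = {"minors", "prisoners", "pregnant_women"}


@dataclass
class Study:
    """Hypothetical record of what a proposed study involves."""
    activities: set
    populations: set = field(default_factory=set)


def may_qualify_for_exemption(study):
    """Return True only if every activity falls in an exempt category
    and no generally excluding population is enrolled."""
    if not study.activities:
        return False  # nothing to classify
    if study.populations & EXCLUDING_POPULATIONS:
        return False  # e.g., minors, prisoners, pregnant women
    # Involvement of subjects must be limited to the listed categories.
    return study.activities <= EXEMPT_CATEGORIES


if __name__ == "__main__":
    survey = Study({"public_behavior_interview_or_observation"})
    print(may_qualify_for_exemption(survey))
```

A study that mixes an exempt activity with a non-exempt one (for example, an educational assessment combined with an experimental drug) fails the screen, because involvement of subjects is no longer limited to the listed categories.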
Studies that involve "minimal risk," as defined by 45 CFR 46.102(i) [38], are those in which
"the probability and magnitude of harm or discomfort anticipated in the research are not greater
in and of themselves than those ordinarily encountered in daily life or during the performance
of routine physical or psychological examinations or tests." Minimal risk studies include, but
are not limited to, observational investigations which involve the collection of medical test data
that are ordered for routine clinical purposes. Studies that involve medical chart review would be
considered to pose no more than minimal risk, provided that no unique identifiers are included in
the records. Often subsumed in the category of "minimal risk" are studies that use questionnaires
or surveys, provided that no unique identifiers are included and that it is unlikely that the
questions would cause emotional distress to the participant. Thus, "minimal risk" studies
frequently are eligible for expedited review by select members (e.g., the IRB chair or a designated
board member) [43].

Studies are considered to pose "greater than minimal risk" to subjects if they include
risk beyond that ordinarily encountered by subjects [43]. Research procedures that require
subjects to take experimental drugs, mandate implantation of medical devices, or involve
surgical procedures are among the more obvious types of such studies [43]; however, there are less
evident factors that can elevate the level of risk. For example, an activity such as walking, which
is considered to be a normal daily activity for the majority of the population, may pose a greater
than minimal risk to certain subjects (e.g., individuals suffering from moderate to severe
angina). Thus, a crucial factor to consider is how the subject interprets or responds to the
perceived risk and what the individual considers "minimal risk" to be in the context of his or her
life [44]. All research studies posing greater than minimal risk require a full-committee review by
the IRB.

Fig. 12.1 Regulatory questions to be considered prior to study initiation

Figure 12.1 provides a summary of the regulatory questions that must be considered before
initiating a clinical research study. In brief, the Principal Investigator (PI) should first determine
whether the proposed project qualifies as human subjects research (as defined by 45 CFR 46.101
and 45 CFR 46.102). Even if the criteria for IRB-exempt status appear to have been satisfied
under 45 CFR 46.101(b), most institutions
recommend that the PI submit a formal application to his or her local governing body for review
by an IRB chair or other senior IRB administrator in order to permit their assessment. (Research
studies that have been determined to be exempt from IRB review will retain this status unless the
conditions of the study have changed, at which time the study should be resubmitted to
determine whether such changes affect risk to subjects and level of required review.) If the research
study does not qualify as exempt, the IRB, based on information furnished by the PI, must determine
whether the study poses minimal risk or whether it poses more than minimal risk. Minimal risk
studies may qualify for expedited review if they fall under one of the categories previously
described [38]. If the study does not qualify for expedited review or if it is determined to pose
greater than minimal risk to potential subjects, the protocol must undergo full-committee review.
Though the PI may participate in the process, the final decision on category of risk and level of
review ultimately is governed by the IRB chair or his or her designee(s).

The IRB considers a number of complex issues in its review of proposed protocols. The
impact of the study design on human subjects is evaluated with careful attention paid to any
protocol implementing deception or withholding of information. "Deception" is a particularly
complex issue in human subjects research due to the extensive federal regulations regarding informed
consent and disclosure of information. The IRB conducts an extensive assessment of risks and
benefits and may require additional safeguards to be implemented. It also may examine the
selection of subjects, evaluating both inclusion and exclusion criteria and ensuring that the
process is free of coercion. The IRB considers the planned methods for identification of research
participants and associated procedures in place to protect the privacy of study subjects. Its
members evaluate the process for obtaining informed consent and thoroughly review the
informed consent forms as well as any other documents or devices that will be introduced to
study subjects or will be used in recruitment. The qualifications of all investigators on the
protocol are to be considered, as are potential conflicts of interest. Finally, the IRB will
determine if the study warrants additional reviews during a one-year period. The protocols for all
ongoing research studies are considered to be undergoing "continuing review" and are required
to be reviewed at least annually [45].

In addition to previously cited requirements, all IRBs must review any amendments, including
updates to any research-related forms, along with any other documents that the IRB deems
necessary to protect potential human study subjects. (There may be local variation in the order in
which an IRB verifies the propriety of proposed and/or ongoing research of human subjects.)

In 1996, the International Conference on Harmonisation of Technical Requirements for
Registration of Pharmaceuticals for Human Use Guideline for Good Clinical Practice (ICH-GCP)
[46] established additional guidelines for IRB oversight of clinical trials, which later were
adopted by the FDA [47]. In this context, "clinical trials" are defined as studies that involve
investigational products. In addition to the general procedures discussed previously for human
subjects research studies, the requirements set forth in the Guideline for Good Clinical Practice
mandate that ongoing IRB reviews of clinical trials must include proposed drug and device safety
documentation.

The role of the IRB comprises far more than extensive document review. The Belmont Report
cited above states that research on human subjects must ethically address "beneficence,"
"respect for persons," and "justice" [35]. This edict can be fulfilled only when the IRB approves
research that fully informs subjects (or when necessary, their proxies) about the risks of the study
before they provide consent for participation in the research.

IRBs are required to provide special attention to proposed studies of persons with diminished
comprehension, pregnant women, prisoners, the elderly, or children. In his well-focused Lancet
review of David Wendler's book on The Ethics of Pediatric Research [48] (provoked by intended
use of children as the subjects of investigation), Peter Singer poses the daunting question: Is it
ever ethical to do research on human subjects without their consent? [49]. Arguing that research
with children is justifiable with parental consent, Singer bases this inference on his belief in
the subject's inherent altruistic desire to benefit others, an objective that, in the broadest
interpretation, implies that contributing to a significant research project is an accomplishment
of ultimate value to the contributor. While the question remains hotly debated, Singer suggests
that parents should be able to give consent for their child to enroll in a well-designed study
of an important question despite the fact that doing so involves momentary pain and, in good
medical practice, a risk that is greater than zero, but still extremely small [49]. In sum,
current ethical standards in the USA permit parent-approved research on children when the risk
of harm to the child is minor and potential benefit to others is likely or in situations when no
alternative mechanism exists to attain those benefits.

IRBs have not been free of criticism, even in this new millennium. As late as 2010, Hall,
Friedman, King et al. noted that academic medical center IRBs and conflict of interest
committees usually are not involved in reviewing research budgets to determine whether per
capita payments are excessive [50] (italics added); in certain circumstances (to be described
later), excessive payments may be seen as undue inducement for participation in the study. In
addition, due to what is perceived to be misunderstanding of specific social science research
methods (e.g., ethnography, oral histories) by many IRB members, some social scientists have
argued that current regulation of social science research is insufficiently flexible; they
believe that current regulatory requirements (e.g., lengthy and/or complicated consent forms)
are overly burdensome in light of the fact that social science studies generally pose only
limited risk to subjects. In an attempt to address these concerns, the OHRP, in conjunction with
the Oral History Association (OHA) and the American Historical Association (AHA), stated in 2003
that investigative procedures (e.g., oral histories, collection of anecdotes, unstructured
interviews, and other related methods) often do not constitute human subjects research as
commonly defined [51]. A year later, problems persisted, causing the representatives from the
OHA and AHA to issue a reaffirmation of their 2003 statement [52].

HIPAA, the Privacy Rule, and Preparatory to Research Activities

In the US, IRBs took on additional tasks following the enactment of the Health Insurance
Portability and Accountability Act (HIPAA) of 1996 [53], passed by Congress to improve
portability and continuity of health insurance coverage in the group and individual markets, to
combat waste, fraud, and abuse in health insurance and health care delivery, to promote the use
of medical savings accounts, to improve access to long-term care services and coverage, to
simplify the administration of health insurance, and for other purposes [53]. To accomplish
these tasks, the act called for a vast overhaul of the methods used to transmit medical
information, including a shift toward standardized electronic transmissions. HIPAA has been
modified a number of times since its enactment in 1996 [54-56], and though initially the act
most evidently applied to health-care providers and health-care plan providers, extensions of
the act have had a significant impact on clinical research. Most notably, a provision was made
requiring compliance with the HHS-issued Standards for Privacy of Individually Identifiable
Health Information, known as the HIPAA Privacy Rule, by most covered entities as of April 2003.
HHS provides the following statement on covered entities with regard to research:

    Covered entities are health plans, health care clearinghouses, and health care providers
    that transmit health information electronically in connection with certain defined HIPAA
    transactions, such as claims or eligibility inquiries. Researchers are not themselves
    covered entities, unless they are also health care providers and engage in any of the
    covered electronic transactions. If, however, researchers are employees or other workforce
    members of a covered entity (e.g., a hospital or health insurer), they may have to comply
    with that entity's HIPAA privacy policies and procedures. [57]
244 E.A. Friedman
Table 12.5 Information that must be removed for deidentification

1. Names
2. Contact information (e.g., phone or fax #, website or internet protocol [IP] or electronic mail addresses, geographic
address smaller than State, except first three digits of zip code)
3. Identifying dates (more detailed than year, e.g., birth, death, admission, discharge)
4. Age over 89 years (unless listed as 90 or older)
5. Social security, medical record, insurance identification number
6. Vehicle identification numbers (e.g., serial numbers, license plate numbers)
7. Device identification or serial numbers
8. Certificate or license numbers
9. Biometric identity (e.g., voice or retinal print, fingerprint, full face image)
10. Any other unique account numbers or material [58]
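
The categories in Table 12.5 can be illustrated with a minimal sketch of how a single study record might be stripped of identifiers before analysis. Everything specific here is an assumption made for illustration: the field names, the `deidentify` function, and the record layout are hypothetical, and the sketch follows only the simplified wording of Table 12.5, not the full regulatory text. It is not a validated de-identification procedure; in particular, fields it does not recognize (such as free-text notes) are passed through unchanged and would need separate review.

```python
from datetime import date

# Hypothetical field names; a real dataset needs its own mapping to Table 12.5.
DROP_FIELDS = frozenset({
    "name", "phone", "fax", "email", "website", "ip_address",    # items 1-2
    "street_address", "city",                                     # item 2
    "ssn", "medical_record_number", "insurance_id",               # item 5
    "vehicle_id", "license_plate",                                # item 6
    "device_serial",                                              # item 7
    "certificate_number", "license_number",                       # item 8
    "voice_print", "retinal_print", "fingerprint", "face_photo",  # item 9
    "account_number",                                             # item 10
})
DATE_FIELDS = frozenset(
    {"birth_date", "death_date", "admission_date", "discharge_date"}  # item 3
)


def deidentify(record):
    """Return a copy of `record` with Table 12.5 identifiers removed or coarsened."""
    cleaned = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue  # direct identifiers are dropped outright
        if field in DATE_FIELDS:
            # Item 3: keep nothing more detailed than the year.
            cleaned[field[: -len("_date")] + "_year"] = value.year
        elif field == "zip_code":
            # Item 2: only the first three digits of the zip code are kept.
            cleaned["zip3"] = str(value)[:3]
        elif field == "age":
            # Item 4: ages over 89 are reported only as "90 or older".
            cleaned["age"] = "90 or older" if value > 89 else value
        else:
            cleaned[field] = value  # assumed non-identifying; review separately
    return cleaned


if __name__ == "__main__":
    example = {
        "name": "Jane Doe",
        "zip_code": "11203",
        "admission_date": date(2011, 3, 14),
        "age": 93,
        "diagnosis": "aortic stenosis",
    }
    print(deidentify(example))
```

Dropping fields by an explicit list keeps the sketch short, but an allow-list of fields known to be safe would likely be the more cautious design for real data.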
The purpose of the HIPAA Privacy Rule is to regulate the use and disclosure of certain
individually identifiable health information, termed protected health information (PHI). An
individual's PHI includes information pertaining to (1) his or her past, present, or future
physical or mental health or condition, (2) the provision of health care to the individual, and
(3) the past, present, or future payment for the provision of health care to the individual
[58]. An individual's genetic information also is considered to be PHI. PHI is protected under
the Privacy Rule when it contains information that possibly could be used to determine the
identity of the individual. It is possible to deidentify PHI by removing certain information
pertaining to the individual. The information that must be removed in order to deidentify an
individual's PHI is listed in Table 12.5.

Just as medical professionals must maintain the security and privacy of their patients' PHI, so
too must clinical researchers who are covered entities, work for covered entities, or who obtain
data from covered entities (as defined above). The legislation established to regulate the
transmission of PHI significantly impacts the clinical researcher in two specific ways: (1) a
subject's PHI must be obtained and used in a manner deemed permissible by the Privacy Rule, and
(2) activities that are considered to be preparatory to research and that involve the review of
PHI must be carried out in accordance with specific guidelines [57]. The ways by which a covered
entity may use or disclose an individual's PHI for research purposes are outlined in Table 12.6.

Table 12.6 Conditions permitting the use or disclosure of PHI for research by covered entities

- If the subject of the PHI has granted specific written permission through an Authorization
  that satisfies section 164.508
- For reviews preparatory to research with representations obtained from the researcher that
  satisfy section 164.512(i)(1)(iii) of the Privacy Rule
- For research solely on decedents' information with certain representations and, if requested,
  documentation obtained from the researcher that satisfies section 164.512(i)(1)(iii) of the
  Privacy Rule
- If the covered entity receives appropriate documentation that an IRB or Privacy Board has
  granted a waiver of the Authorization requirement that satisfies section 164.512(i)
- If the covered entity obtains documentation of an IRB or Privacy Board's alteration of the
  Authorization requirement as well as the altered Authorization from the individual
- If the PHI has been de-identified in accordance with the standards set by the Privacy Rule at
  section 164.514(a)-(c) (in which case, the health information is no longer PHI)
- If the information is released in the form of a limited data set, with certain identifiers
  removed and with a data use agreement between the researcher and the covered entity, as
  specified under section 164.514(e)
- Under a grandfathered informed consent of the individual to participate in the research, an
  IRB waiver of such informed consent, or Authorization or other express legal permission to use
  or disclose the information for research as specified under the transition provisions of the
  Privacy Rule at section 164.532(c) [57]

As mentioned, the Privacy Rule has had a substantial impact on activities preparatory to
research. Activities preparatory to research include reviews of data that enable researchers to
determine whether or not it would be purposeful or reasonable to pursue a particular research
study. This may include reviewing medical records to determine whether or not there are enough
potential subjects to be able to carry out the study. Such activities also may be used to allow
the researcher to identify potential research participants for recruitment purposes and to
contact potential study participants. Each of these activities must be carried out in accordance
with particular requirements. For example, a covered entity may allow a researcher to review
PHI, but they may not permit the researcher to remove any PHI from the covered entity.
Additionally, the researcher would not be permitted to contact a potential study participant
based on the PHI reviewed without the researcher being a workforce member of the covered entity
or without the researcher securing proper documentation of a waiver of authorization from the
IRB or Privacy Board [59].

The regulations previously discussed specifically address the researcher's ability to obtain and
utilize a subject's PHI; however, there are additional directives under the Privacy Rule that
stipulate the handling of PHI beyond the permissibility of transmission. The rules for
maintaining privacy and security include written privacy procedures in which a privacy officer,
who is responsible for upholding such procedures, is designated. It must be clearly stated who
has access to specific private health information and how to modify levels of accessibility.
Appropriate training must occur on a scheduled and ongoing basis for all persons with access to
PHI. Research information must be securely backed up in case the original information is lost or
corrupted in an emergency.

A key guideline for ensuring the privacy of PHI is to transmit only the minimal amount of
information necessary. Any equipment used for research or patient management that contains PHI
must be monitored and protected from unauthorized access. With the growth of digital information
systems, any PHI that is sent over an open network must have adequate encryption, but there is
some leeway with regard to PHI sent via closed networks. In the case of closed networks,
encryption is optional and the existing network access controls are considered sufficient.
Safeguards must be in place for any third parties to uphold the same level of security and
privacy with regard to PHI. Plans for audits of these procedures to make sure problems are
clearly identified and rectified are required.

Additionally, plans must be in place for responding to breaches of private information. Breaches
of PHI generally are defined as the unauthorized acquisition, access, use, or disclosure of
protected health information which compromises the security or privacy of such information [56].
In the event that a breach occurs, all individuals whose PHI may have been inappropriately
disclosed (or their next of kin, if the individual is deceased) must be informed. Furthermore, a
notice of breach is to be listed on the affiliated institution's website or disseminated through
a major media outlet. Cases in which a large number of individuals (500 or more) have been
affected require that the Secretary of HHS also be notified. Affected parties should be informed
as to what they can do to further protect themselves after a breach occurs [56].

Human Research Requires Informed Consent

As mentioned previously, the informed consent process for human subjects is a cornerstone of
ethical standards in human research. It is important to note the distinction between
authorization, as discussed in relation to the HIPAA Privacy Rule, and informed consent.
Authorization is written permission from an individual permitting the disclosure and/or use of
his or her PHI for research. Informed consent is an individual's permission to participate in
research.

To the extent possible, one must receive clearly stated information explaining the study's
purpose, methods, risks, benefits, and alternatives to research in order to be considered an
informed subject [60]. However, violations of this precept occasionally occur even in developed
societies. A particularly horrific example was given in a 2005 paper [61] that described how a
mother learned, after the death of her baby, that the child
had been buried without its heart by the staff of the Bristol Royal Infirmary in the United
Kingdom (UK). This was done without her knowledge and consent so that tissue samples could be
used for future research by investigators.

After the subject has been adequately informed, it is up to him or her to decide whether to
participate in the study. If the research undertaken is to be considered ethical, it is
imperative that this decision be completely voluntary. The subject must be able to freely decide
not only whether to begin participation at the outset but also whether to continue participation
after the study has commenced. An important point often overlooked during this process is that
the subject must fully understand the conveyed information. Otherwise, the decision made may not
reflect the true wishes or interests of the individual. However, in cases where the potential
subject is a child, an unconscious adult, or an individual of otherwise limited mental capacity,
informed consent from the individual is not required. In these instances, consent is obtained
instead through a proxy (a decision maker who is empowered to ensure that the subject's
involvement in the study is consistent with his or her values, beliefs, and interests). In this
way, the decision that is ultimately made will most closely represent what the subject would
have willfully done if he or she had been able to render a decision [60].

Fundamental to the process of informed consent is the concept of respect for potential and
enrolled subjects. It is important that enrolled subjects be treated with respect from the time
they are approached to be in the study to the time their participation has ended. Likewise,
individuals who decline to participate nevertheless should be treated with respect throughout
the entire recruitment process. Respect for subjects entails not only respecting their decisions
and keeping private information confidential but also disclosing new information (e.g., novel
risks and benefits that might emerge during the course of the study and affect their willingness
to participate), monitoring their well-being to prevent and treat adverse effects, and informing
them about what was learned from the research [60].

The responsibility to maintain the integrity of the processes of communicating details of a
study to participants rests with the PI and all associate investigators who personally interact
with the subject. This is to ensure that the subject (or his or her proxy) understands what is
being proposed and comprehends any and all known potential adverse consequences that could arise
from his or her participation. In other words, responsibility for obtaining consent should not
be delegated to subordinates.

Consent must be obtained in a noncoercive and fully voluntary manner, avoiding the fraud of
Tuskegee (cited previously) and the horrors of Nazi experimentation as a prelude to murder. As
it is always the ultimate responsibility of the investigators to ensure that their research is
properly conducted, they must remain alert (even if, as noted above, IRBs are not) to the
reality that excessive payment to research subjects might be coercive. While compensation to
subjects is generally viewed as an acceptable way of covering their expenses and rewarding them
for their time and effort related to the study, the use of relatively large incentives to
facilitate recruitment may comprise, in certain circumstances, a form of undue influence by
inducing the individual to accept seemingly irresistible offers against his or her better
judgment [62].

A striking example is the series of experiments conducted at the Willowbrook State School, in
which parents were asked to enlist their retarded children in a research project that required
them to be infected with hepatitis [62]. As incentive, the child was offered a place in a
residential treatment facility that otherwise would have been difficult to secure. It is not
hard to see that such an incentive, as an attempt to induce parents to overcome their hesitation
about the study by appealing to their concern for their child's treatment, is ethically unsound.

Those in favor of subject compensation argue that compensating subjects for participating in
research is no different than paying people for working. As McNeill has noted, however, unlike
work, experimentation on human subjects inherently exposes people to unnecessary risks of harm,
risks that cannot be known in advance [63]. Therefore, while a completion bonus for a relatively
harmless research study usually poses no ethical problems and is, in fact,
a commonly employed method for emphasizing the importance of full commitment to the study,
caution should be exercised when the research might be painful or distressing for the subject;
in cases such as these, compensation may be seen as undue influence, seductively pressuring the
subject to accept conditions they would otherwise deem unreasonable or aversive.

Investigators should always bear in mind that inequalities in authority between investigator and
subject persist even after informed consent is given, creating potential threats to autonomy
[64]. Certain strategies customarily are employed to minimize the impact of such potential
vulnerability. For example, while consent for participation in a clinical research study may
include agreement to certain pre- or postintervention procedures, subjects still retain the
right to discontinue their participation at any time, even when their treating physician or a
consulting physician for the study believes it may be life threatening for the subject to
withdraw from the study [65].

Self-Experimentation Guidelines

Defined as the special case of single-subject scientific experimentation in which the
experimenter conducts experiments on himself or herself, self-experimentation usually means that
the designer, operator, subject, analyst, and ultimate user of resulting information are all the
same person. Lawrence K. Altman has catalogued numerous instances of physician investigators who
opted to first expose themselves to the risks of a new technique or therapy [66]. Included is
Karl Landsteiner, whose pursuit of what would be named the ABO blood groups repeatedly depended
on blood samples drawn from himself and five members of his staff. Similarly, invasive
cardiology was pioneered in Germany by Werner Forssmann, who would eventually receive the Nobel
Prize in Physiology or Medicine following years of self-experimentation he performed by
catheterizing his heart numerous times [66]. Another significant example of self-experimentation
was an experiment conducted by Barry J. Marshall [67]. In order to confirm that Helicobacter
pylori (H. pylori) caused gastritis and predisposed to peptic ulceration even in patients with a
healthy mucus lining, Marshall volunteered to ingest a sample of H. pylori. After he developed
the characteristic symptoms of gastritis, it was shown that ingested H. pylori is able to
colonize completely normal gastric mucosa and lead to the acute inflammatory changes
collectively referred to as acute H. pylori gastritis [67].

Current federal regulations, however, do not distinguish between self-experimentation and
experimentation on subjects recruited for a specific project. Clinicians may feel that if they
are experimenting with their own bodies, then as doctors, they are cognizant of all the risks
and may consider circumventing the IRB approval process altogether. However, as a general rule,
IRBs require prior submission and approval of an application detailing all aspects of any study
incorporating self-experimentation before it starts. The rationale for IRB approval is the
concern that overly zealous investigators may subject themselves to inappropriate, unnecessary,
and unforeseen risk without the IRB's oversight. As an example, proper IRB oversight would
protect an investigator, with early signs of Huntington's disease, from self-experimenting with
a promising drug undergoing early animal trials for safety and efficacy that ultimately may
cause more deaths than standard-of-care treatment. Control of self-experimentation is a delicate
issue since respect for each individual's right of autonomy is a key feature of federal
governance via IRBs.

Scientific research is, of course, not the only context in which people are likely to expose
themselves to potentially harmful situations. In a free society, individuals can daily engage in
a wide range of risky behaviors at their own discretion. For example, individuals may willingly
have unprotected sex, maintain an unhealthy diet, consume alcohol in excessive amounts, or ride
a motorcycle without wearing a helmet for protection. However, if a research study requires the
individual to engage in a risky activity due to the research, it obligates the investigators
(with IRB oversight) to, truthfully and without restriction, fully inform each potential
research subject of all aspects of an intended study, including risks, which the candidate would
not have assumed had the research not been performed.
This task is especially daunting in the setting of self-experimentation because investigators
may not be objective about risks to their own health and safety, especially (as noted above)
when the likelihood of risk for potential major adverse effects may not already be known.

Evaluating a New Human Research Study: Scientific Value of Research Methods and Reporting

As best elucidated by Emanuel, Wendler, and Grady [68] and summarized by the NIH for their own
recommendations [60], the seven main principles presently guiding the conduct of ethical
research are social and clinical value, scientific validity, fair subject selection, favorable
risk-benefit ratio, independent review, informed consent, and respect for potential and enrolled
subjects [68]. Fellows and junior faculty preparing to initiate or join ongoing human research
involving possible injury to the subject (as may follow organ or tissue biopsy or penetration
for measurement of fluid pressures in pulmonary, renal, or cardiac vasculature) can test whether
their protocol addresses, and is responsive to, all seven principles. Below is a brief
explanation of how the first five principles help guide the ethical review process. Informed
consent and respect for potential and enrolled subjects have been described in detail previously
(see Human Research Requires Informed Consent).

Clinical and Social Value

An overriding concern in research is the question of whether the proposed study explores
questions that, if answered, will provide new information of significant value for present or
future patients with a specified illness or for society in general: If the new information
pursued is deemed to be important, are the risks inherent in the study sufficiently reasonable
to justify exposure and inconvenience of the research subjects? Is it anticipated that answers
to the research question will contribute to scientific understanding of health or improve our
disease management?

Scientific Validity

Research that leads to invalid conclusions is unethical because it wastes time and resources
while needlessly exposing subjects to risk. For this reason, IRBs consider the scientific
credibility of the study to be an important ethical consideration. For example, are the
questions addressed by the study likely to be answered by the techniques and methods to be
utilized? Are the questions investigators are asking answerable and are the research methods
valid and feasible for this purpose? Has the study been designed with a clear scientific
objective using accepted principles, methods, and reliable practices? Does the sample size
detailed in the statistical plan provide good precision for estimation of population parameters
or sufficient power to adequately test the research hypothesis?

Fair Subject Selection

According to the NIH, those accepting the risks and burdens of the research also should be in a
position to enjoy its benefits, and those who may benefit should share some of the risks and
burdens [60]. Therefore, researchers should carefully assess who is to be included in the study
such that the issues being investigated may be addressed appropriately. In other words, has
study recruitment been based on the weighing of scientific goals against subject vulnerability,
privilege, or other factors unrelated to the purposes of the study? For the purposes of
fairness, specific subgroups (e.g., minorities, women, children, and the elderly) cannot be
excluded from research unless a good scientific reason or a particular susceptibility to risk
exists [60].

Favorable Risk-Benefit Ratio

A fundamental principle that was stressed at the beginning of this chapter was that the risks
and benefits associated with a given research project or experiment can never be determined
before the actual study has been conducted. In fact, the very definition of research implies
uncertainty regarding the effects of whatever drug, device, or
therapy is being tested. Because it is impossible to predict if a given risk (whether physical,
psychological, economic, or social) will be trivial or serious, transient or long term, it is of
the utmost importance that clinical researchers strive to achieve a favorable risk-benefit ratio
by minimizing all potential risks to subjects while maximizing all potential benefits.
Furthermore, it must be ascertained that the study's potential benefits to other individuals
outweigh the risks to its subjects. Only with these measures can the uncertainty inherent in
every research pursuit be approached safely and sensibly.

Independent Review

The ultimate question to be asked, of course, is whether local IRBs have reviewed the study and
deemed it to be ethically acceptable before it starts. As is inferable from the preceding
discussion, the IRB is usually the main body in the USA that will determine whether the
investigators conducting the trial are sufficiently free of bias, whether adequate protection
has been afforded to research volunteers, and whether the trial has been ethically designed with
an acceptable risk-benefit ratio.

Ethically sensitive issues (often relating to the seven above-mentioned guiding principles) also
can arise in disciplines such as interventional nephrology or cardiology, especially when
proposing invasive bodily research. Local circumstances predominantly take precedence over a
simple resolution based on what appears ethically correct. For example, some institutions will
not perform a kidney transplant for patients older than age 70; thus, consideration of this
procedure for an intensive care unit patient above this age at one of these facilities is moot
and such a patient could not be eligible for a kidney transplant research project even if, as
may be the case in multicenter studies, the overarching protocol would allow inclusion of such a
patient. Similarly, criteria for acceptability of HIV-positive patients may have been
established by a hospital IRB.

Ethical Misconduct and Consequences

With potentially decades of work, reputations, and financial and professional interests at
stake, clinical research certainly is vulnerable to ethical misconduct. The Office of Research
Integrity, maintained by HHS, provides the following definition of research misconduct:

    Research misconduct means fabrication, falsification, or plagiarism in proposing,
    performing, or reviewing research, or in reporting research results.
    (a) Fabrication is making up data or results and recording or reporting them.
    (b) Falsification is manipulating research materials, equipment, or processes, or changing
        or omitting data or results such that the research is not accurately represented in the
        research record.
    (c) Plagiarism is the appropriation of another person's ideas, processes, results, or words
        without giving appropriate credit.
    (d) Research misconduct does not include honest error or differences of opinion. [69]

Ethical misconduct relating to data tampering and related abuses is well documented, and
penalties for such misconduct can be quite severe. In a notable example of research fraud, a
1998 publication in the Lancet alleged the identification of a new brain-bowel syndrome and a
link between that syndrome and the measles, mumps, and rubella (MMR) vaccine, based on a
research study conducted in the UK in the 1990s by Andrew Wakefield, M.D., and colleagues [70].
Wakefield et al. claimed that the onset of behavioral symptoms in eight of the 12 children
involved in the study was directly associated with receiving the MMR vaccine. Their paper also
cited a high correlation between regressive autism and nonspecific colitis to lend support to
the claims of a new brain-bowel syndrome. These reported findings had a substantial adverse
impact on adherence to recommended vaccination regimens here in the USA, leading to a rise in
previously controlled childhood diseases such as measles, mumps, and rubella.
Following more than a decade of contentious debate over the validity of the study (during which countless parents considered the much discussed link between the MMR vaccine and autism when deciding whether or not to vaccinate their children), the paper was retracted in February 2010. Revelations of ethical misconduct in Wakefield's study included (1) nine of the children were reported as having regressive autism, but a third lacked any autism diagnosis, and only one child actually showed clear signs of the condition; (2) five of the 12 children described as being previously normal actually had documented preexisting developmental concerns; (3) the immediacy of the onset of symptoms following MMR vaccination was greatly exaggerated in some instances; (4) following a medical school research review, the diagnosis for nine of the children was changed from unremarkable to nonspecific colitis; (5) while 11 families actually alleged the MMR vaccine caused their children's symptoms, three late cases were intentionally omitted in order to create the false impression of a 14-day window between vaccine exposure and symptom onset; and (6) recruitment and funding aspects of the study correlated closely to anti-MMR programs, accounting for substantial grounds for conflict of interest claims [71]. It also was revealed that Wakefield profited from a future lawsuit against the patent holders of current vaccines. Wakefield and John Walker-Smith, the senior clinician involved in the study, were subjected to the UK's longest General Medical Council Fitness to Practice Hearing and were eventually struck off the medical register [71].
In 2009, Scott Reuben, M.D., previously a renowned anesthesiologist and pain management investigator, published flagrantly fraudulent findings from studies that he performed without the approval of his own institution's IRB, going so far as to fabricate patient data and to forge the name of a colleague in order to list him as a coauthor on a publication [72-74]. In the aftermath of that scandal, Dr. Reuben lost all credibility in his field, served jail time for health-care fraud, and had a large fine levied against him by a US federal court [75].
In his article published in the Cleveland Clinic Journal of Medicine, James G. Sheehan cited four additional cases of ethical misconduct in research throughout the past decade [76]. Included was the case of Dr. Eric Poehlman, who was sentenced to one year in prison in 2006 for falsifying and fabricating research data for a study on menopause and metabolism. Also in 2006, Elizabeth Goodwin, a University of Wisconsin professor, resigned following the revelation that she made false statements in her genetics research. Dr. Gary Kammer resigned from Wake Forest University in 2005 when it was discovered that he had fabricated families in his NIH grant application, this a year after Harvard professor Ali Sultan resigned due to false information in his own grant application [76].
The previous examples are just a few selected cases of misconduct, with many more cases reported in the literature about falsification of data, plagiarism, research conducted without proper consent, undisclosed conflicts of interest, and much more [77]. It is difficult to calculate how much research funding has been squandered and how much harm has been caused to the public health by generating and advancing fraudulent findings.

Final Thoughts and Closing Unanswered Moral Research Dilemmas

The vast regulations, protocols, and governing bodies developed over the course of history to protect the ethical integrity of clinical research are evidence that the issue is a cornerstone of human subjects investigation. While current legislation provides answers to many of the questions that may be posed today regarding the ethicality of research activities, it is important to keep two considerations in mind: (1) It is often the case that as societies evolve, so too do the standards of appropriateness governing the nature of principles, and (2) the passage of time will inevitably force workers in the field of clinical investigation to take into consideration issues or concerns that simply could not be projected as possibilities at an earlier time. Mentioned below are some questions that can, and should, be asked by clinical researchers in this era.
Is investigation of one's self ethically appropriate? Is any age too old for subjects in an invasive biopsy study such as kidney, lung, or major vessel transplantation or replaceable device? For an organ transplant study, should young candidates be selected before geriatric candidates? In studies allocating an expensive and limited therapy (e.g., bone marrow, heart, or kidney transplant) should individuals in advantaged positions be accepted into a research protocol ahead of people not so advantaged? Must undocumented noncitizens be excluded from innovative, experimental, or potentially life-sustaining therapy that may be scarce or expensive? Are women to be approached for research on an equal basis with men? Is it reasonable to include race and religion as inclusion/exclusion criteria for study candidates? Is HIV infection a reasonable exclusion criterion for a study of an experimental surgical procedure? Should absence of insurance coverage or being impoverished (and thus, in both cases, inability to pay for standard care that may not be covered by a research grant) dictate exclusion from a research protocol that might provide beneficial therapy? Should prisoners be excluded from recruitment? In an experimental life-sustaining device (e.g., aortic balloon pump or a hypothermia catheter) study of coma patients after resuscitated cardiac arrest, if the study subject fails to respond to the experimental device, who decides to discontinue use of the device (e.g., the patient, family/proxy decision maker) and when should that decision be made? How should a subject's nonadherence to a protocol [78], hostility to staff, or criminality [79] be managed? (e.g., is it ethical to withdraw therapy or to consult with psychiatry, social services, administration, lawyers, clergy, family members or friends, or members of the Ethics Committee under these circumstances?) Sensitivity to the need for respect, autonomy, and dignity of individuals subjected to investigation in these types of situations allows researchers to detect and correct deviations from appropriate conduct in modern human research.

Take-Home Points
• From the earliest prebiblical writings to modern day, concern for and debate on the appropriate conduct by caregivers toward patients has been a central theme of appropriate (ethical) medical practice.
• Resulting from awareness of World War II German atrocities performed on prisoners, the mentally deficient, and defenseless civilians, the Nuremberg Code and Belmont Report were devised to protect patients and society from inappropriate assault on their body and psyche, later to be followed by regulations regarding the importance of patient privacy.
• Central to acceptable ethical behavior in human research are three main principles: respect for persons, beneficence, and justice.
• When possible, a fully informed written consent based on protocol comprehension must be obtained and preserved from each subject.
• With reservation and caution, parental consent may be sufficient for child participation in a study of low risk but potential importance to society.
• Currently, international guidelines for ethical human research require prior approval of research protocols by an Institutional Review Board (IRB). The IRB must document its views in writing, clearly identifying the trial being assessed, which documents were reviewed, and the dates of its reaching decisions for approval, disapproval, or need for restructuring.
• The US National Institutes of Health (NIH) names the principles governing acceptable human research: social and clinical value, scientific validity, fair subject selection, favorable risk-benefit ratio, independent review, informed consent, and respect for potential and enrolled subjects.

References

1. Friedman EA. Diabetic nephropathy: fresh perspectives. Facta Univ. 1999;6:3147.
2. Carrick P. Medical ethics in the ancient world. Washington, DC: Georgetown University Press; 2001.
3. Tsai DF. How should doctors approach patients? A Confucian reflection on personhood. J Med Ethics. 2001;27:44-50.
4. Tsai DF. Ancient Chinese medical ethics and the four principles of biomedical ethics. J Med Ethics. 1999;25:315-21.
5. Chadwick J, Mann WN. Hippocratic writings. London: Penguin; 1950.
6. Owsei T, Temkin C. Ancient medicine. Selected papers of Ludwig Edelstein. Baltimore: Johns Hopkins University Press; 1987.
7. The Hippocratic Oath: Today. Doctors' Diaries. WGBH Educational Foundation. 1964. http://www.pbs.org/wgbh/nova/body/hippocratic-oath-today.html. Accessed 15 Sept 2011.
8. Betzold M. Appointment with doctor death. Troy: Momentum Books; 1993.
9. Kevorkian J. Medicine, ethics, and execution by lethal injection. Med Law. 1985;4:307-13.
10. Kevorkian J. A comprehensive bioethical code for medical exploitation of humans facing imminent and unavoidable death. Med Law. 1986;5:81197.
11. Kevorkian J. The long overdue medical specialty: bioethiatrics. J Natl Med Assoc. 1986;78:1057-60.
12. Jonsen AR. The birth of bioethics. Experiments perilous: the ethics of research with human subjects. New York: Oxford University Press; 1998.
13. Spencer WG. Celsus De Medicina: a learned and experienced practitioner upon what the art of medicine could then accomplish. Proc R Soc Med. 1926;19(Sect Hist Med):129-39.
14. Friedman EA. Stressful ethical issues in uremia therapy. Kidney Int. 2010;78(Suppl 117):S22-32.
15. Ojanuga D. The medical ethics of the father of gynaecology, Dr J Marion Sims. J Med Ethics. 1993;19:28-31.
16. Lederer SE. Subjected to science: human experimentation in America before the Second World War. Baltimore: Johns Hopkins University Press; 1997.
17. Grodin MA. Children as research subjects: science ethics and law. New York: Oxford University Press; 1994.
18. Lederer SE. Orphans as guinea pigs: American children and medical experimenters, 1890-1930. In: Cooter R, editor. The name of the child: health and welfare, 1880-1940. New York: Routledge; 1992.
19. Caelleigh AS. Prisoners. Acad Med. 2000;75:999.
20. Hornblum AM. They were cheap and available: prisoners as research subjects in twentieth century America. BMJ. 1997;315:1437-41.
21. Rosenthal ET. The Rhoads not given: the tainting of the Cornelius P. Rhoads Memorial Award. Oncol Times. 2003;25:19-20.
22. Sass HM. Reichsrundschreiben 1931: pre-Nuremberg regulations concerning new therapy and human experimentation. J Med Philos. 1983;8:99-111 (reprint of German original and English translation).
23. Mitscherlich A, Mielke F. Epilogue: seven were hanged. In: Annas GJ, Grodin MA, editors. The Nazi doctors and the Nuremberg Code: human rights in human experimentation. New York: Oxford University Press; 1992.
24. Nuremberg Code [from Trials of War Criminals before the Nuremberg Military Tribunals under Control Council Law No. 10. Nuremberg, October 1946-April 1949. Washington, DC: U.S. G.P.O, 1949-1953].
25. Shuster E. Fifty years later: the significance of the Nuremberg Code. N Engl J Med. 1997;337:1436-40.
26. Jones JH. Bad blood: the Tuskegee Syphilis Experiment. New York: The Free Press; 1993.
27. Shamoo AE, Resnick DB. The use of human subjects in research. Responsible conduct of research. New York: Oxford University Press; 2003.
28. Declaration of Helsinki: Ethical Principles for Medical Research Involving Human Subjects. Adopted by the 18th WMA General Assembly Helsinki, Finland, June 1964, and amended by the 29th WMA General Assembly, Tokyo, Japan, October 1975; 35th WMA General Assembly, Venice, Italy, October 1983; 41st WMA General Assembly, Hong Kong, September 1989; 48th WMA General Assembly, Somerset West, Republic of South Africa, October 1996; and the 52nd WMA General Assembly, Edinburgh, Scotland, October 2000. http://www.wma.net/en/30publications/10policies/b3/. Accessed 15 Sept 2011.
29. Harkness J, Lederer SE, Wikler D. Laying ethical foundations for clinical research. Bull World Health Organ. 2001;79:365-6. Epub 2 Jul 2003.
30. Rothman DJ. The doctor as whistle-blower. Strangers at the bedside. New York: Basic Books; 1991.
31. Beecher HK. Ethics and clinical research. N Engl J Med. 1966;274:1354-60.
32. Heller J. Syphilis victims in U.S. study went untreated for 40 years. New York Times (New York) 26 July 1972;1,8.
33. Cobb WM. The Tuskegee Syphilis Study. J Natl Med Assoc. 1973;65:345-8.
34. Brody B. The ethics of biomedical research. New York: Oxford University Press; 1998.
35. The Belmont Report. Ethical principles and guidelines for the protection of human subjects of research. The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. 18 Apr 1979. http://ohsr.od.nih.gov/guidelines/belmont.html. Accessed 13 May 2011.
36. US Food and Drug Administration. The genesis of human subjects protections regulations and biomedical research in the 21st century. CDER small business assistance training clinical trial workshop, September 2011. http://www.fda.gov/downloads/Drugs/NewsEvents/UCM275441.pdf. Accessed 20 Nov 2011.
37. Federal Policy for the Protection of Human Subjects; Notices and Rules. Federal Register. 1991;56(117). http://www.hhs.gov/ohrp/policy/frcomrul.pdf. Accessed 15 Sept 2011.
38. Public Welfare, Protection of Human Subjects, Basic HHS Policy for Protection of Human Research Subjects, Title 45 CFR Part 46, Subpart A. 2005. http://ohsr.od.nih.gov/guidelines/45cfr46.html. Accessed 15 Sept 2011.
39. Baird RM, Rosenbaum SE. Animal experimentation. The moral issues. Buffalo: Prometheus Books; 1991.
40. Monamy V. Animal experimentation. A guide to the issues. Cambridge, UK: Cambridge University Press; 2000.
41. Brown, JG. Department of Health and Human Services. Office of Inspector General. Institutional review boards: their role in reviewing approved research. Office of Evaluations and Inspections, June 1998.
42. UVA IRB-HSR Research Guidance. Informed consent. University of Virginia, 2010. http://www.virginia.edu/vpr/irb/HSR_docs/Guidance/UVAInvestigatorGuide5-1-08.doc. Accessed 15 Sept 2011.
43. Assessing Level of Risk and Type of IRB Review. Research compliance news. University of South Alabama. 2008. www.southalabama.edu/research-compliance/pdf/compliancenews0908.pdf. Accessed 15 Sept 2011.
44. Mazur DJ. Evaluating the science and ethics of research on humans: a guide for IRB members. Baltimore: The Johns Hopkins University Press; 2007.
45. Protocol Review Process. Institutional review board for health sciences research. University of Virginia IRB-HSR. 2008. http://www.virginia.edu/vpr/irb/hsr/reviewprocess_background.html#issues. Accessed 10 Sept 2011.
46. Guideline for Good Clinical Practice E6 Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. 1996. http://www.ich.org/fileadmin/Public_Web_Site/ICH_Products/Guidelines/Efficacy/E6_R1/Step4/E6_R1__Guideline.pdf p10. Accessed 5 Apr 2011.
47. Guidance for Industry E6 Good Clinical Practice: Consolidated Guidance. U.S. Department of Health and Human Services. 1996. http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm073122.pdf p10. Accessed 5 Apr 2011.
48. Wendler D. The ethics of pediatric research. New York: Oxford University Press; 2010.
49. Singer P. When is research on children ethical? Lancet. 2011;377:115-6.
50. Hall MA, Friedman JY, King NM, Weinfurt KP, Schulman KA, Sugarman J. Commentary: per capita payments in clinical trials: reasonable costs versus bounty hunting. Acad Med. 2010;85:1554-6.
51. Ritchie D, Shopes L. Oral history excluded from IRB review: application of the Department of Health and Human Services regulations for the protection of human subjects at 45 CFR part 46, subpart A to oral history interviewing. Oral History Association. 2003.
52. Shopes L, Ritchie D. An update on the exclusion of oral history from IRB review. Oral History Association. 2004. http://classicweb.archive.org/web/20080115224655/alpha.dickinson.edu/oha/org_irbupdate.html. Accessed 6 Apr 2011.
53. Health Insurance Portability and Accountability Act of 1996. Public Law 104-191. 104th Congress. 1996. http://www.gpo.gov/fdsys/pkg/PLAW-104publ191/pdf/PLAW-104publ191.pdf. Accessed 15 Sept 2011.
54. HHS, Health Insurance Reform: Security Standards; Final Rule. 45 CFR parts 160, 162 and 164. 2003. Federal Register. http://aspe.hhs.gov/admnsimp/final/fr03-8334.pdf. Accessed 12 May 2011.
55. HHS Strengthens HIPAA Enforcement. 2009. http://www.hhs.gov/news/press/2009pres/10/20091030a.html. Accessed 12 May 2011.
56. Guidance for Securing Protected Health Information. 2009. http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/hitechrfi.pdf. Accessed 12 May 2011.
57. Clinical Research and the HIPAA Privacy Rule. NIH Publication Number 04-5495. 2004. http://privacyruleandresearch.nih.gov/clin_research.asp. Accessed 15 Sept 2011.
58. HHS, Summary of the HIPAA Privacy Rule. http://www.hhs.gov/ocr/privacy/hipaa/understanding/summary/privacysummary.pdf. Accessed 5 Sept 2011.
59. Institutional Review Boards and the HIPAA Privacy Rule. NIH Publication Number 03-5428. 2003. http://privacyruleandresearch.nih.gov/irbandprivacyrule. Accessed 15 Sept 2011.
60. NIH & clinical research. Ethics in clinical research. http://clinicalresearch.nih.gov/ethics_guides.html. Accessed 5 Apr 2011.
61. Diamond B. Removal, retention and storage of organ and tissue in the UK. Br J Nurs. 2005;14:1078.
62. Grant RW, Sugarman J. Ethics in human subjects research: do incentives matter? J Med Philos. 2004;29:717-38.
63. McNeill P. A response to Wilkinson and Moore. Paying people to participate in research: why not? Bioethics. 1997;11:390-6.
64. Litton P, Miller FG. What physician-investigators owe patients who participate in research. JAMA. 2010;304:1491-2.
65. Schwarze ML, Bradley CT, Brasel KJ. Surgical buy-in: the contractual relationship between surgeons and patients that influences decisions regarding life-supporting therapy. Crit Care Med. 2010;38:843-8.
66. Altman LK. Who goes first? The story of self-experimentation in medicine. New York: Random House; 1998.
67. Marshall BJ, Armstrong JA, McGechie DB, Glancy RJ. Attempt to fulfil Koch's postulates for pyloric campylobacter. Med J Aust. 1985;142:436-9.
68. Emanuel EJ, Wendler D, Grady C. What makes clinical research ethical? JAMA. 2000;283:2701-11.
69. Definition of Research Misconduct. HHS, Office of Research Integrity. http://ori.hhs.gov/misconduct/definition_misconduct.shtml. Accessed 15 Sept 2011.
70. Wakefield AJ, Murch SH, Anthony A, Linnell J, Casson DM, Malik M, Berelowitz M, Dhillon AP, Thomson MA, Harvey P, Valentine A, Davies SE, Walker-Smith JA. Ileal lymphoid nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children [retracted]. Lancet. 1998;351:637-41.
71. Deer B. How the case against the MMR vaccine was fixed. BMJ. 2011;342(C5347):77-82.
72. Borrell BA. Medical Madoff: anesthesiologist faked data in 21 studies. Scientific American, 10 Mar 2009.
73. Harris G. Doctor admits pain studies were frauds, hospital says. New York Times, 11 Mar 2009.
74. Marret E, Elia N, Dahl JB, McQuay HJ, Miniche S, Moore RA, Straube S, Tramèr MR. Susceptibility to fraud in systematic reviews: lessons from the Reuben case. Anesthesiology. 2009;111:1279-89.
75. Johnson P. Scott Reuben, a former Baystate doctor who faked research, sentenced to 6 months for health care fraud. MassLive.Com, 24 June 2010.
76. Sheehan JG. Fraud, conflict of interest, and other enforcement issues in clinical research. Cleve Clin J Med. 2007;74(Suppl 2):S63-7. discussion S68-S9.
77. Wells JA. Final report: observing and reporting suspected misconduct in biomedical research. 2008. http://ori.dhhs.gov/research/intra/documents/gallup_finalreport.pdf. Accessed 13 May 2011.
78. Stewart DO, DeMarco JP. Rational noncompliance with prescribed medical treatment. Kennedy Inst Ethics J. 2010;20:277-90.
79. Cleaveland C. We are not criminals: social work advocacy and unauthorized migrants. Soc Work. 2010;55:74-81.

How to Prepare a Scientific Paper
13
Jeffrey S. Borer

J.S. Borer, MD
Department of Medicine, Division of Cardiovascular Diseases, Howard Gilman Institute for Valvular Heart Diseases, and Cardiovascular Translational Research Institute, State University of New York (SUNY) Downstate Medical Center, 450 Clarkson Avenue, 50, Brooklyn, NY 11203, USA
e-mail: canadad45@aol.com

The Purpose of the Research Paper

It has long been a commonly accepted precept that the process of research is complete only when the research has been reported in appropriate form to the scientific community. Currently, that form is the scientific paper. Without the scientific paper, the research cannot be replicated, clinicians and researchers cannot evaluate it and act upon it, and society cannot benefit from it.
To best understand the purpose and scope of the scientific paper, one must understand the purpose and scope of research. These characteristics, considered in the opening chapter, are also clearly delineated in a monograph, entitled Clinical Judgment, by the late Alvan Feinstein, which provides illuminating insight into the relation of science and medicine [1]. Dr. Feinstein's thesis was that clinical judgment must be based on application of the scientific method. As indicated earlier in this book, the scientific method is an intellectual concept referring to the development of a hypothesis, testing of the hypothesis by observations employing relevant methodology, appraisal and analysis of the resulting data, and the development of conclusions by interpretation of these data. In Feinstein's words, the goal of this process is "to answer the original questions [on which the research was based], and to establish knowledge that may clarify the past, illuminate the present, and anticipate the future" [1]. Thus, the fundamental goal of medical research is the creation of new knowledge.
Dr. Feinstein argued that the same method, the scientific method, underlies the reasoning of a physician in selecting a management strategy for a single patient, the work of a clinical researcher studying large groups of patients, and the efforts of the laboratory scientist observing and experimenting with animals, cells, or molecules. These activities differ only in the procedures employed for making observations and the precision and representativeness of the resulting data. All these activities can create new knowledge which, either directly or ultimately, may carry forth the goals of the physician-scientist: relief of suffering and improvement in quality and, perhaps, length of life. Thus, Feinstein suggests that all activities of the physician, both in the laboratory and at the bedside, are the product of the same problem-solving methods, including application of Boolean algebra and its associated logic.
The scientific paper reports and describes the problem that was studied, the methods employed, the results of the research, and the interpretation drawn from these results by the investigator. Consistent with the possible scope of the research, discussed above, in medicine the scientific paper
can range from a report of a well-studied single clinical experience (case report) to a highly complex, controlled, and carefully blinded study of the impact of a transfected gene on myocardial protein degradation in tissue culture. The term "scientific paper" may seem relatively nonspecific. However, given the explosion of biomedical literature during the past generation, the concomitant recruitment of highly talented and experienced journal editors, and the relative paucity of costly journal publication space, it is not surprising that a fairly rigorous definition for the term can be found.
The definition of a scientific paper is comprehensively developed and discussed by Robert A. Day, professor emeritus of English at the University of Delaware and past president of the Society for Scholarly Publishing and of the Council of Biology Editors, in his definitive book, How to Write and Publish a Scientific Paper [2]. As stated by Professor Day, a scientific paper is "a written and published report describing original research results." However, it must be written and published as defined by "[three centuries of developing] tradition, editorial practice, scientific ethics, and the interplay of printing and publishing procedures." Professor Day quotes the definition of an acceptable primary scientific publication developed by the Council of Biology Editors: it "must be the first disclosure containing sufficient information to enable peers (1) to assess observations, (2) to repeat experiments and (3) to evaluate intellectual processes; moreover, it must be susceptible to sensory perception, essentially permanent, available to the scientific community without restriction, and available for regular screening by one or more of the major recognized services (e.g., currently, Biological Abstracts, Chemical Abstracts, Index Medicus, Excerpta Medica, Bibliography of Agriculture, etc., in the United States and similar services in other countries)" [2].
Today, considerable publication is performed via electronic media and the Internet, and may never appear in an edition printed on paper. The definition of the scientific paper is not altered by the use of electronic media. As indicated by the American Association for the Advancement of Science (AAAS), the essential elements of the electronically published scientific paper are that the final published version of an article after peer review (or any future peer-review equivalent), which AAAS denotes as the "Definitive Publication," needs to be clearly identified as such and must be publicly available, the relevant community must be made aware of its existence, a system for long-term access and retrieval must be in place ... it must not be changed (technical protection and/or certification are desirable), it must not be removed (unless legally unavoidable), it must be unambiguously identified ... it must have a bibliographic record ... containing certain minimal information, [and] archiving and long-term preservation must be provided for [3].
As indicated by the AAAS criteria, the definition of the scientific paper, either printed on paper or in electronic media, encompasses the concept of prepublication peer review. Peer review is the process by which other professionals, understood on the basis of their own publications or other credentials to have expertise in the paper's area of focus, evaluate the paper and grade it as to priority for publication. Most journals employ a system of peer review to select manuscripts to be published from within the larger pool of those submitted. The number of peer reviewers for most publications usually is two, though more or fewer may be employed in any instance. The criteria for judgment generally include the intrinsic importance of the subject about which the paper is written (hypothesis to be tested, research problem, etc.), the adequacy of the methodology for the stated purpose, the credibility of the results and the adequacy of the data analysis, the reasonableness and fairness of the conclusions/interpretations, the adequacy of the bibliography, and the adequacy of the formal presentation (i.e., is the reader likely to be able to understand the material as it is presented). In addition, it is hoped that peer reviewers will help to identify submissions that already are in review by more than one venue or that present data already published (both findings indicate transgression of copyright laws and general standards for publication). Peer reviewers also are expected to have some sense of the likelihood that the data are real and not
fraudulent, though the latter is largely impossible for a reviewer to verify. It is essential for authors to recognize the characteristics by which peer reviewers will judge a manuscript (and, subsequently, to respond courteously and appropriately to suggestions for additions, clarification, or other alterations to the manuscript) if the work is to be accepted for publication.
The definition of the scientific paper also implies a certain amount of detail in reporting methodology and results; the degree of such detail ultimately is the product of a complex interplay of intellectual and moral/ethical considerations and may vary with the mores of the era and the context within which the publication is conceived and written. Centuries ago, a scientific treatise did not necessarily conform to the rigorous research standards that prevail today, with the necessity for substantiating data. The concept was paramount; data reporting was less rigorous and often relatively inaccurate. Scientific thought was evolving, but scientists did not have the luxury of the technological resources available today that mark the often exquisite details of current research.
The degree of detail that is required depends in part on the familiarity of the intended audience with the methods employed. In many instances, techniques that are widely used and generally accepted as standard (e.g., electrocardiography) require no more than recitation, with no supporting bibliographic reference. On the other hand, other aspects of methodology, and particularly elements of study design, may be so critical to interpretation of the results by the reader that considerable descriptive detail may be necessary.
The need for a scientific paper to enable the reader to "evaluate the intellectual process" requires either direct discussion of that process in the manuscript or, more commonly, organization of the manuscript such that relevant inferences can be drawn. As noted by Feinstein, the latter has resulted in the complaint by Peter Medawar in the Saturday Review that scientific writing is often intellectually fraudulent because the careful organization given to the published material "do[es] not reflect the way things happened. After conquering his ignorance, the scientist presenting his new ideas in print may be reluctant to discuss how much ignorance he had to overcome" [4]. Fortunately, however, given the limited journal publication space available, the capacity to evaluate the logic underlying a given piece of research far outweighs the need to scrutinize the specific and often circuitous path by which that logic was revealed, Dr. Medawar and the Saturday Review notwithstanding!
Since a scientific paper must communicate several aspects of a research project, a logical, standardized reporting format is preferred. Currently, the most commonly used format is known by the acronym, IMRAD: introduction, methods, results, and discussion. This probably should be changed to AIMRAD to reflect the almost universal placement of an abstract at the head of the scientific paper, a relatively recent development. The abstract is important since it may alter the information content required of the introduction. The IMRAD format (or AIMRAD, or TAAIMRAD, if the title and authors are considered, since they, too, can convey important information) indicates sequentially what problem was studied, why it was studied, what was found, and how these findings should be interpreted, particularly within the context of related work in the field.
The best aid to crafting a useful scientific paper is a well-organized, well-planned, and clearly written research proposal or protocol. The well-crafted proposal will (1) clearly state the specific aims of the research, including hypotheses to be tested (if any); (2) provide a context and justification for the study with reference to the literature; and (3) define precisely the methods to be employed, including the research design, measurement techniques and approach to statistical analysis, the principal results expected, and the conclusions that might be suggested by them. In other words, the protocol provides the basis for the introduction and methods sections of the paper. However, since the best laid plans often go somewhat astray, the proposal must be supplemented by consideration of the procedures actually employed and data truly collected before the scientific paper can be written. As Turato et al. have noted, "Investigative studies without explicit
hypotheses give rise to the supposition that these enterprises have a merely mechanical course. That is, they uncritically repeat the dominant group's methodological models in the world of academic medicine. Failure to present hypotheses, before enumerating the objectives, usually represented a failure to respect the logical sequence of stages, which are understood as occurring naturally in the mind of the thinker [5]. As Knottnerus also has observed, "We should not forget that mathematical indices are just ways to summarise collected research data. For the quality of research, defining the research question, and methodological challenges in study design, are far more important" [6].
A summary of some specific characteristics of the components of the scientific paper follows, organized as per the TAAIMRAD format. This summary owes a considerable debt to the published comments of Professor Day, as well as to personal experiences in applying the generally accepted precepts.

The Title

The title is the first and, often, only contact of the reader with the paper. Therefore, it must convey considerable information with an economy of words. The primary consideration in crafting a title is clarity. Jargon should be avoided, and the relevant rules of grammar should be followed. Equally importantly, a title should be specific and focused. Thus, the title must refer specifically to the subject of the research, rather than merely to the field within which the research is undertaken. (Of course, the operating definition of "subject" and "field" can vary with the research.) For example, in a prospective study employing radionuclide cineangiography and echocardiography to develop prognostic indices for survival in patients with mitral regurgitation who had not undergone valve replacement or repair, the title "Prediction of Survival in Patients with Mitral Regurgitation by Use of Noninvasively Defined Indices of Left and Right Ventricular Performance" would be preferable to, for example, "Prediction of Survival in Mitral Regurgitation." While the latter indicates the general subject of the study, the former also indicates the methodological approach, including the variables measured. Although more verbose, the longer title helps to define the scope of the study and to distinguish it from others in the field. If no study of prognostication in mitral regurgitation had been performed previously, the lengthier title would be less essential. However, since other studies have been performed, the additional verbiage is useful, providing the knowledgeable reader with some indication of the uniqueness of the paper and its relevance for his or her work.
Other important considerations, suggested above, include the desirability of conveying more of the IMRAD information than merely the subject of the study and the desirability of brevity. The criterion for acceptable brevity varies with fashion (e.g., Darwin's title for his treatise, On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life [7], acceptable in 1859 but not in a medical journal in 2010!).
In summary, in crafting a title, effort is well spent attempting to minimize words while maximizing clarity, focus, and information content.

Authorship

Different criteria exist for inclusion in an authors' list and for the order of listing. When this author worked at the National Institutes of Health (NIH), a simple rule of thumb was proposed: listed authors should have made an important contribution to the research and should be able to present and defend the paper at a scientific meeting. This definition implies that an author has acquired a body of knowledge which can serve as a context for the reported research and that he or she is intimately familiar with the intricacies of the methodology employed in the research as well as with the results. However, with the rapid increase in technological and biological information in recent years, it has become increasingly necessary for projects to be carried out by teams comprising collaborators with different, and
often widely disparate, areas of expertise. For example, in the randomized Collaborative Study of Coronary Artery Surgery (CASS) [8], a trial designed to assess the effects of coronary artery bypass grafting plus standard (ad hoc) pharmacological/dietary therapy compared with standard (ad hoc) pharmacological/dietary approaches alone on natural history of patients with coronary artery occlusive disease, public health specialists/epidemiologists and statisticians were critical to the study design and analysis. In fact, an epidemiologist and a statistician were the first and second authors of one of the most important papers resulting from the trial.
However, surgeons and cardiologists participated in the trial, and the cardiologists included those who performed catheterizations and those who did not. It is likely that representatives of all these groups, and more, participated in the conceptualization of the study, that all but the statistician participated in primary data collection, and that many participated in interpretation of the results. However, it would be excessive to expect the epidemiologist or statistician to understand the methodological pitfalls of the catheterization (much less to identify them when they occurred), or to expect the cardiologist, the epidemiologist, or the statistician to fully understand and identify methodological problems associated with surgical procedures, or for the cardiologist or the surgeon to understand fully or to be able to defend the procedures employed by the statistician. Therefore, Day's definition is now most appropriate: an author of a paper should be defined as one who takes intellectual responsibility for the research results being reported [2]. Thus, authors should include those who actively or substantially contributed to the conceptualization, design, and performance of the research. It is sometimes true that individuals intimately involved in conceptualization and design of research and in analysis and/or interpretation of results have little or no responsibility for primary data collection and that individuals involved in primary data collection have little or no involvement in the other processes. The latter is particularly true of technicians or research assistants, who in most circumstances do not provide intellectual input into the process (though many exceptions exist). Problems can also arise regarding the inclusion of senior scientists in whose area of responsibility the research occurred but who may have had little direct input into the specific project. Clearly, the distinction between those whose intellectual responsibility is sufficient to warrant authorship and those whose responsibility is not is difficult to make with precision. Ultimately, this determination probably depends on a consensus of the involved investigators. However, those who allow their names to be listed as authors incur another responsibility, specifically for the veracity of the reported data. In several celebrated cases of research fraud three decades ago, some renowned senior scientists, not associated with collecting or analyzing data but involved (sometimes distantly) in project conceptualization, were listed among the authors of papers found to be fraudulent; though none of them was aware of the fraudulence of the reported data, they were perceived as irresponsible in allowing their names to be used without adequately assessing the reported projects.
Regarding the order of authorship, again per Day, authors should normally be listed in order of importance to the experiments [2]. Sometimes a senior investigator or group leader chooses to move out of such ordering into the last position on the list, from which his or her senior status can be inferred and which provides added recognition to junior authors by moving them up the list. In some cultures, authors are listed alphabetically. No universally accepted rules exist for ordering the authors' list; the ultimate test of the appropriateness of the list is consensus of the individuals involved.

The Abstract

The Abstract represents a brief summary of the paper. As such, it should contain a concise statement of the research problem, sufficient methodological information to orient the reader, a summary of the results of primary importance, and the authors' principal conclusions.
The length of the Abstract and, often, its format are governed by the policy of the publication to which the paper is submitted.
Important considerations in Abstract writing include (1) avoidance of abbreviations whenever possible and, when they are needed, limitation to those which are generally recognized; (2) minimization of words without disregard for grammar and syntax; and (3) avoidance of reference to data or methods not reported in the paper. The latter requires careful final editing since substudies or subanalyses sometimes are eliminated from the final edition of a paper because of considerations of relevance or space, but still may appear in the previously written Abstract.

Introduction

The Introduction is a tool for communication and is critical to the success of the paper. It serves several functions. These include, but are not limited to, engaging the reader's interest sufficiently to justify proceeding into the details of methodology and results, suggesting the logic of the methods, and providing a framework for assimilating and interpreting the results.
To serve these purposes, the Introduction must (1) clearly state the problem or problems (hypotheses, research questions, specific aims) under study; if more than one problem has been studied, the relation of the problems, and the reason for studying them together, should be elucidated; (2) provide a basis, usually from the literature, for choosing to study the problem(s); (3) outline the approach to the problem indicating, when appropriate, why this approach, rather than others, was chosen; and (4) indicate the importance or uniqueness of the paper, i.e., justify the performance of this particular study. Though some writers choose to briefly describe results and conclusions in the Introduction, most do not. These are available in the Abstract and are redundant when more complete exposition of these aspects of the research will follow. In stating the problem, the writer should avoid distracting irrelevancies. For example, if one has studied the effect of alcohol consumption on left ventricular ejection fraction, it would be inappropriate to include in the Introduction a paean to the value of ejection fraction as an index of prognosis in heart disease. Though ejection fraction is a useful prognostic index, the problem under study has nothing to do with the use of ejection fraction for prognosis. The mention of this property of ejection fraction may suggest to the reader that prognostication strategies have been studied. The resulting confusion may preclude clear assimilation of the data actually presented. If the reader is performing prepublication peer review for a journal, this confusion may be translated into rejection for an otherwise worthy effort.
As we have emphasized in Chap. 2 of this book, the statement of the problem must be sharply focused. Many authors have documented a relation between alcohol consumption, acute or chronic, and deterioration of left ventricular performance. Few have defined the quantitative relation between alcohol consumption and ejection fraction change. If the study in question was designed to provide such information, and the relevant data were collected, then the statement of the problem should focus on the effort to quantify the relation between the intervention and the parameter employed.
The author should not promise something, directly or by implication, that he or she does not deliver. Thus, for example, in justifying the study of the effect of alcohol consumption on ejection fraction, it would be best to avoid suggesting that the study was performed because it might help to guide therapy unless (a) the results include data on the effects of therapy in this condition and (b) the relationship of the effects of therapy to ejection fraction is described. (In certain situations, this speculation might be appropriate in the Discussion.)
It is important to inform the reader if multiple problems have been assessed. All but the most compulsive readers generally will remember no more than one fact or concept after reading a paper. If multiple concepts or types of results have been generated in a study, a well-constructed Introduction may improve the likelihood of their recognition and retention. A negative example may illustrate the point. In 1979, this author and
colleagues assessed response of left ventricular volume and function to exercise in patients with aortic regurgitation [9]. In a brief, two-paragraph Introduction, only the study of function (measured as ejection fraction) was mentioned. The assessment of volume change received no comment. In the many subsequent references to this frequently cited paper, the citation invariably has been to the effect of exercise on ejection fraction. To this author's knowledge, no one ever has mentioned our finding of marked reduction in left ventricular end diastolic filling during exercise, which was reported in this paper. Other authors subsequently reported studies of volume changes during exercise in aortic regurgitation, without reference to these data. This oversight is likely related in large part to an incomplete Introduction to the paper. As a result, other investigators could not benefit from these findings in designing their studies.
A brief description of the methodological approach employed in the study will permit the knowledgeable reader to place the study in an appropriate context for interpretation while other sections of the paper are being read. If methodology somehow was unique, this should be indicated, together with the reason for use of the new method. For example, in 1977, this author and colleagues reported a study of the effect of exercise on regional and global left ventricular function/performance in 11 patients with coronary disease who had normal performance descriptors at rest [10]. In this instance, the method employed to study performance during exercise was of greater interest than the effect of exercise itself. Application of radionuclide cineangiography during exercise had not been previously reported in a scientific paper. Therefore, the Introduction included a paragraph explaining the theoretical importance of studying the effect of exercise in coronary disease and another paragraph describing the relevance of radionuclide cineangiography in permitting such study.
The Introduction should be organized according to journalistic precepts: the most important concept should be presented first, and subsidiary concepts should be presented thereafter. Neither the Introduction nor the scientific paper as a whole should be treated as a guessing game or as a finely wrought mystery-drama. The busy reader should be engaged early by references to material which the author considers most important.
Finally, the Introduction should be brief. Detailed review of collateral or supporting literature is appropriate for the Discussion, but not for the Introduction. Generally, the Introduction should be limited to one double-spaced typed page (approximately 250 words). If the Introduction substantially exceeds this limit, the author must consider the possibility that he or she has not clearly identified the key concepts in his or her own mind.

Methods

As Day has noted, the primary purpose of this section is to describe and (if necessary) defend the experimental design and then provide enough detail that a competent worker can repeat the experiments [2].
Clear and accurate description of methods is critically important. The careful reader cannot properly interpret the results or evaluate the conclusions without a fundamental understanding of the methods employed in making the observations. As a corollary, the limitations of the methods should be understood. This may require a specific statement by the author if he or she believes that the interpretation or generalizability of results is importantly mitigated by some aspect of the methodology or, conversely, if the author believes that an apparent methodological limitation can be explained in a manner that minimizes circumscription of conclusions.
In general, the Methods section should begin with a detailed statement of the subjects employed (physical models or devices, cells, tissues, or animals if the study is nonclinical) or humans studied (if the study is clinical). This statement should include generally accepted group descriptors (i.e., demographic data in clinical studies), criteria for acceptance and/or exclusion of subjects from the study population, and a description of any special procedures employed to determine fitness for acceptance. If rabbit hearts have been
homogenized for analysis of protein content, the weight, age, and breed of rabbit should be noted, as well as the total number of rabbits instrumented for study and reasons for any discrepancy between this number and the number whose hearts actually were homogenized and analyzed. This information helps the reader to evaluate possible interactive effects of selection bias, albeit unintentional, that might alter extrapolability of results. Similarly, in the previously noted example of the study to develop prognostic strategies in mitral regurgitation, in addition to age, sex, and, perhaps, other demographic descriptors if deemed relevant to interpretation of results, the author should define the basis for determining the diagnosis of mitral regurgitation and its severity (physical examination, echocardiography, catheterization, etc.), including the specific criteria employed for classification with the method[s] chosen. If the study were designed to develop prognostic strategies in systemic arterial hypertension, rather than in mitral regurgitation, then, in addition to age and sex, race, weight, and height might be important demographic descriptors since the pathophysiology of hypertension is known to vary with race and, to a lesser extent, with obesity.
Special note should be made of sample size estimates (see detailed discussion in Chap. 11). Sample size should be planned in the study protocol. It may be appropriate to relate the protocol-mandated plan in the Methods and the reasoning on which the plan was based. This is particularly true when the primary results, or some important secondaries, are negative, i.e., the expected relationships are not found. Lack of statistical significance is not equivalent to true lack of relationships. Sample size estimates are based on the expected outcome, the expected variability of the measurement methods, the accepted probability of declaring an effect that is actually due to chance alone (the alpha level, selected before the study by the investigators), and the likelihood of finding the expected outcome if it really exists (also chosen before the study by the investigators). The latter is known as the power to find the expected results and is expressed as a percentage. The hazards involved in not reporting the basis of sample size selection are best illustrated with reference to studies of therapeutic interventions, usually evaluated by comparing a new treatment modality with an established therapy. For such studies, the expected outcome event rate with the established therapy may be estimated from earlier studies; the difference to be sought between the new therapy and the comparator may be selected by the investigators based on their judgment of the magnitude of difference that may be clinically useful. However, if the event rate with established therapy is found to differ importantly from historical standards (particularly if it is lower), the calculated sample size may provide far less than the anticipated power to detect superiority of the new therapy, even if it exists. Presentation of the basis for selection of the sample size in the methods may help the reader to avoid erroneous (negative) interpretation of the data.
As these examples suggest, the specific parameters described in a methods section will vary from study to study. Nonetheless, each aspect of the methodology must be defined rigorously and precisely. On the other hand, excessive detail which does not affect data interpretation (e.g., hair color, shoe size, and telephone numbers of the patients with mitral regurgitation) can be confusing, misleading, and inappropriate. One caveat: some journals may have specific requirements regarding identification of materials or methods. These will be related in Instructions to Authors in the journal and must be followed.
After describing the subjects/items on which studies were performed and, if appropriate, explaining the basis of sample size selection, the author should detail the materials employed in processing and testing the subjects, as well as the procedures used to make observations. Again, detail should be sufficient to permit interpretation and/or replication of results. If procedures have been well described in the literature and were performed without substantial change from those published, a general statement with a literature citation may suffice. As a hypothetical example, assume that the authors of a study state that equilibrium radionuclide cineangiography was performed at rest and during symptom-limited supine bicycle ergometry according to methods
analogous to those we have previously described and employing in vivo labeling of red cells with Tc99m. If substantial changes have occurred since the previous publication of the method, these should be described and, if necessary, justified or defended (e.g., radionuclide cineangiography was performed using a recently developed image rendering method to precisely define left ventricular borders. This method involves ... . It was employed because cardiac function indices were significantly better correlated with independent standards than were older methods). Appropriate references also should be supplied.
The research design also should be specified. If interventions are employed in some study subjects but not in others, the basis for allocation of subjects to treatment groups should be defined (e.g., randomization, stratification) as should other design elements that reduce bias (e.g., blinding in processing/evaluating primary data). The temporal sequencing of the observations relative to the intervention should be described. Statistical methods employed to analyze data must be presented, including criteria for accepting or rejecting the null hypothesis (i.e., the p value below which a result will be declared statistically significant). Most physicians are relatively unfamiliar with the details of statistics and with the criteria for selecting specific tests of significance in certain situations. However, the ready availability of statistical computer packages has led to widespread performance of statistical tests by nonstatisticians. While many of these procedures undoubtedly are correctly selected and performed, some probably are not. The best remedy for this problem is to consult a statistician in the design of the research protocol and in statistical analysis of results and to ask the statistician to write the appropriate portion of the methods section, explaining it conceptually to the other authors. However, if this is not done, the statistical methods employed should be carefully cited so that the statistically literate reader (and the peer reviewers) can evaluate the appropriateness of the analysis and resulting conclusions. During a study performed some years ago by this author's group, one nonstatistician spent considerable time familiarizing himself with statistical methodology and performed a multiple logistic regression analysis with some of the data. An astute peer reviewer noted that, given the size of the patient population, an excessive number of parameters had been tested for independent significance in the regression model. The description of the statistical methodology permitted detection and correction of this error.
In summary, description of methods requires judgment as to the appropriate degree of detail. When in doubt, it is usually better to include more rather than less, though much detail may be removed by editorial suggestion after peer review and before publication. The guiding principle should be that sufficient information is transmitted so that, in the view of the authors and journal editor, the results can be accurately interpreted.

Results

In the Results section, the author presents the observations which will permit assessment of his or her original hypotheses and specific aims. In a sense, the results represent the new knowledge which has been created by the research.

Narrative

In general, and particularly when complex mathematical analyses and subanalyses have been performed, it is useful to present the results in narrative form, supplemented by tables and figures. The narrative should indicate as clearly as possible the flow and thrust of the data, i.e., the overall sense of the findings. Interpolation of numbers into this narrative should be done with care and caution, preferably when they do not impede the flow. However, the narrative may be strengthened by judicious interpolation of evidence of the statistical significance of the findings (p values). This latter approach necessitates clarity and comprehensiveness in the design of tables and figures in which the data are presented quantitatively, since the narrative must be consistent with the numbers. Moreover, the narrative should present only the results and not the
conclusions. A well-designed narrative plus graphics may lead obviously to certain conclusions, but statement of these should await the next section.
As with the methods, some judgment must be employed in deciding which results require presentation. Intensive analysis may reveal many relationships unsuspected in the planning of the study. Concern about the chance finding of a statistically significant relationship on the basis of overanalysis of data probably is well-founded. Therefore, unexpected relationships, particularly those derived from post hoc analyses, should be evaluated with caution. Nonetheless, some of these may be important in drawing conclusions from the research and certainly can be hypothesis-generating for future studies. Some may be irrelevant. The latter generally do not require presentation. Negative results often are important though these, too, must not be overinterpreted. A negative finding may have resulted from measurement error or from sample size that is inadequate to properly assess the relationship under study. In these instances, a positive, i.e., statistically significant, result would have been unlikely even if, in fact, the sought-after relationship actually exists. Such limitations in the extrapolability of the data generally should be noted in the discussion section.
Tables and figures need not be limited to the Results, but this is the section in which they are generally most appropriate and useful. Tables and figures can be employed in the Introduction or Discussion to summarize work done by others into which context the newly reported results must be integrated, or to diagram relationships (often pathophysiological relationships) believed to underlie the results that are being reported. In general, these strategies are best reserved for review articles and should be avoided in scientific papers (original research reports) because the use of space for this purpose is seldom justified by any gain in comprehension by the reader. Indeed, in order to make such tables comprehensible, an expanded explanatory text often is required (frequently drawing upon data not generated within the report being presented by the author), potentially increasing the size of the printed article beyond the limit allowed by the journal and diminishing space needed to present the new knowledge. In the current era in which Internet publication, with supplements and appendices, often is undertaken or accompanies printed versions of scientific papers, the space limitation may be overcome by adding tables (and figures) in electronic appendices, surmounting the proscription on such additions. However, the author must always remember that the primary purpose of publication is communication and that the accretion of additional material may obscure rather than clarify the focus and conclusions of the research. In the Results, however, tables and figures are invaluable and space-saving devices that often help to clarify complex results by removing them from the narrative, enabling comprehensible summary presentations supplemented by the data from which they are derived.
In the following summary of considerations in the use and configuration of tables and figures, much has been gained from review of the chapters on these subjects in the monograph by Edward J. Huth (How to Write and Publish Papers in the Medical Sciences) to which the reader is referred for greater detail [11].

Tables

Multiple well-focused tables are preferable to one massive compendium of all relevant data. However, the number of tables that can be employed often is defined or limited by the editorial policy of the individual journal and must be known when planning use of these devices. More importantly, the author must consider which tables would best further the communication at which the paper is aimed. Information involving few data that might be effectively displayed in tabular form for an oral presentation probably can be communicated more appropriately by narrative summary in a paper. If tables are employed, however, they must be cited and sequenced within the text so that their relation to the narrative results is easily discernable [11]. In any table, the title should clearly define the focus and nature of the data or relationships to be presented, and column and row headings should be simple and easily understood. If abbreviations or technical
terms are employed for the sake of the esthetic/clarity of the layout, these should be precisely defined in a legend. The legend also may include a summary statement amplifying or totally replacing the table title to clarify the specific purpose of the table. It is critically important to define the units of measurement for any numerical data in the table [11]. In some tables, absolute numerical results are followed by parenthetical presentations of percentages of the data set represented by these absolute values. Unless the antecedent data set is precisely defined and obviously visible, such formats can lead to reader confusion and deterioration of communication. If statistical comparisons among elements of the table are presented, it must be made absolutely clear which elements are being compared and what type of comparison has been performed. For example, a p value for noninferiority between two data sets may indicate the high likelihood that one set is noninferior to the other, but unless the type of comparison has been explicitly stated and there is a numerical difference between the sets, the reader may assume the p value refers to superiority, an erroneous conclusion that could preclude comprehension and subsequent application of the results.

Figures

To be optimally effective, figures should be relatively uncluttered. In general, one fact or relationship should be illustrated by each figure, though many observations in the narrative may be supported by figures. It can be very confusing to decipher three-dimensional plots, or single figures with two or three different ordinate or abscissa scales, each referring to a different line identifiable with reference to black or white polygons, all within the same coordinate axes. Examples of figures that can be very useful in clarifying or amplifying (or replacing) text include graphic presentations of complex study designs, flow charts indicating reductions in population size as exclusions or other factors impact on the population studied, quantitative relations between important independent (input) variables and primary dependent (outcome) variables (particularly when the relation follows a clear pattern), etc. However, the latter figures only should be employed when they provide clear support for an author's subsequent conclusions [11]. It is not necessary, and, in my view, it is inappropriate to provide examples of individual data (e.g., a photograph of a histological sample of a degenerated myocyte from an organism with heart failure) unless some unique characteristic of the photograph supports the existence of a previously unsuspected process. It is not necessary to present illustrations to prove that certain analyses were performed: there is general agreement among researchers that statements of fact presented in the Results are true; it is the interpretations that may differ. Figures are most useful when they support interpretations.
It should be intuitively obvious that any figure employed in a publication must be clean, technically well reproduced, and easy to read. In addition, however, considerable attention should be paid to labeling. In displays of coordinate axes, the ordinate and abscissa must be clearly labeled with units of measurement, amplified if necessary by statements in the legend. Similarly, interventions, time intervals, etc., must be precisely laid out in flow charts and study design diagrams. Idiosyncratic abbreviations in labels should be avoided when possible. Ultimately, as for tables, the use of figures should be undertaken only when they are clearly useful in potentiating comprehension of results and conclusions presented in the Discussion. It is an error, likely to be cited and extirpated by peer reviewers and editors, to present the same data both in tabular and graphic format; if the data require amplification beyond the narrative, select one format or the other, not both. Remember that the goal of the presentation is clear communication.

Discussion

The purpose of the Discussion is to present conclusions based on the results of the research. Thus, the Discussion is the authors' opportunity to interpret and identify the importance of their
work and, as Day has noted, to present the principles, relationships, and generalizations
shown by the Results [2]. Certain principles should be observed in writing a Discussion. If
they are not, most editors and many reviewers will call the author to task and may even reject
an otherwise laudable report.

Less generally is more. Lengthy discussions, extrapolating from every conceivable aspect of
the data, often are not well received. Moreover, they can detract from the importance and
originality of the primary observations by overwhelming and distracting the reader. As a
corollary, summarization of the results is redundant and inappropriate in the Discussion and
usually is not tolerated by editors jealous of their limited publication space.

Conclusions should be clearly and closely related to the data obtained in the study.
Far-reaching speculations generally should be avoided. Fairness and balance are necessary in
interpreting results. Excessive emphasis on a pet theory should be avoided, particularly if
alternatives exist that may be credible. Therefore, the relation of the results to those of
other parallel or similar studies should be discussed. If possible, some explanation should be
provided for apparent differences. Often, these may be ascribable to differences in
methodology, so that careful review of the methodology of collateral references can be very
helpful. Claims of priority are appropriate if correct (e.g., "This study represents the first
demonstration of parthenogenesis in the Syrian hamster"), but check the literature carefully
to be certain of the claim (see below).

Support for or refutation of conclusions should be cited from the published literature and may
require additional discussion. It is the responsibility of the authors to undertake a
reasonable literature search to find appropriate references. As discussed in Chaps. 2 and 9,
the explosion of scientific literature has made this a difficult and time-consuming
undertaking. However, several computer-based literature search services can be helpful,
including those readily available via the National Library of Medicine. It is true that the
scientific paper reports the findings of the authors' project and that lack of placement of
these findings in the appropriate literary context does not alter their intrinsic validity or
value; nonetheless, lack of adequate literary references may lead a reader to under- or
overvalue or otherwise misunderstand the importance and implications of the reported research.
Also, lack of appropriate referencing is unfair to the work and workers thus disregarded. Even
if the intrinsic moral issue here is uninteresting to an author, its practical consequences
often are not. It is almost a truism that the author of a study you neglect will be a
prepublication peer reviewer and may resent what is perceived as an inappropriate claim of
priority.

Limitations of the work, in terms of methodology employed, inconsistencies in results, etc.,
should be discussed. Interpretation in light of these limitations should be defended when
necessary. Readers and reviewers will be aware of these limitations, and failure to deal with
them in the Discussion may detract from the credibility of otherwise excellent work.

Theoretical or abstract conclusions are appropriate when logically drawn from data,
circumscribed in their scope, and supported by appropriate references to parallel work in the
field. As stated by Howard Haggard in The Doctor in History, "a theory affords an explanation
for known facts. Theories, when correct serve as guides in the search for new facts. But when
incorrect, they obscure the truth" [12]. Whether or not truth is obscured, wide-ranging
theorizing, only tenuously related to the data, often raises the ire of peer reviewers, with
unfortunate consequences for the scientific paper.

Finally, as in the Introduction, the journalistic approach is useful: discuss primary
conclusions first and secondary or subsidiary extrapolations later. Thus, in the study of
prognostic strategies in mitral regurgitation, suppose that right ventricular ejection
fraction less than 30% at study entry was associated with poor two-year survival and that, as
an unexpected ancillary finding, an association exists between prior rheumatic fever and left
ventricular ejection fraction less than 50% at rest. The Discussion might begin, "These data
indicate that right ventricular ejection fraction at rest is closely related to survival in
the absence of valve replacement." While the author alternatively might choose to highlight
the rheumatic fever association first ("These data indicate a significant association between
a history of rheumatic fever and chronic depression of left ventricular performance"), this
point is not germane to the primary focus of the study or the paper. Beginning the Discussion
in this way probably would confuse the reader and detract from the impact of the study. It is
useful to outline a Discussion prior to writing it. This approach permits a review of the
logic and flow of the discussion and of the appropriateness of placement of collateral or
supporting references from the literature. The need to check the logic of the conclusions
cannot be overstressed. The basis for each conclusion must be clearly presented. If a
Discussion is logically deficient, then the Results, and the relevant literature, should be
searched for the missing puzzle piece. If the link remains unapparent, then the authors'
conclusions require reappraisal.

Afterthoughts

The foregoing represents some considerations regarding the author's personal approach to
scientific paper writing, supplemented by the published views of a professional who has
devoted much of his professional life specifically to this area (Robert Day) and other authors
who have presented ideas that have been influential. Many subjects (acknowledgements, concerns
regarding grammar and usage, how to respond to reviewers, etc.) have not been covered and can
be sought in texts devoted to medical writing, which also may provide more comprehensive
comments regarding the areas discussed. Ultimately, however, the decision on how to write a
scientific paper rests with the author, modified by the policies of the editor and
prepublication reviewers. If the author remains always cognizant that the scientific paper is
a tool for communication, a critical part of the research process by which new knowledge is
made available for the benefit of others, then he or she will successfully accomplish the task.

Take-Home Points
The scientific paper is the vehicle that reports what research problem was studied, why it
was studied, what was found, and how these findings should be interpreted, particularly
within the context of related work in the field. Its publication, making the data available to
the scientific community, is the final step in the research process.
The scientific paper is a communications tool. Clarity and precision of expression are
critically important.
The best aid to crafting a useful scientific paper is a well-organized, well-planned, and
clearly written research proposal or protocol.
The results (not the discussion or author's interpretation) are the new knowledge; their
evaluation by the reader requires clear exposition of the methods. The discussion is not
a mystery novel; state the conclusions in order of their importance. Remember that less
usually is more.
References
1. Feinstein AR. Clinical judgment. Baltimore: Williams and Wilkins; 1967.
2. Day RA, Gastel B. How to write and publish a scientific paper. 6th ed. Westport:
Greenwood; 2006.
3. American Association for the Advancement of Science (AAAS), Science & Policy, Electronic
Publishing in Science, Defining and Certifying Electronic Publication in Science, A Proposal
to the International Association of STM Publishers Originally Drafted October 1999; Revised
March and June/July 2000. Learned Publishing. 2000;13:251-258.
4. Medawar PB. Is the scientific paper fraudulent? Yes; it misrepresents scientific thought.
Saturday Review, 1 Aug 1964; 42-43.
5. Turato ER, Machado AC, Silva DF, de Carvalho GM, Verderosi NR, de Souza TF. Research
publications in the field of health: omission of hypotheses and presentation of common-sense
conclusions. Sao Paulo Med J. 2006;124:228-33.
6. Knottnerus JA. Challenges in dia-prognostic research. J Epidemiol Community Health.
2002;56:340-1.
7. Darwin C. On the origin of species by means of natural selection, or the preservation of
favoured races in the struggle for life. 1st ed. London: John Murray; 1859.
8. Alderman EL, Fisher LD, Litwin P, Kaiser GC, Myers WO, Maynard C, Levine F, Schloss M.
Results of coronary artery surgery in patients with poor left ventricular function (CASS).
Circulation. 1983;68:785-95.
9. Borer JS, Bacharach SL, Green MV, Kent KM, Henry WL, Rosing DR, Seides SF, Johnston GS,
Epstein SE. Exercise-induced left ventricular dysfunction in symptomatic and asymptomatic
patients with aortic regurgitation: assessment by radionuclide cineangiography. Am J Cardiol.
1978;42:351-7.
10. Borer JS, Bacharach SL, Green MV, Kent KM, Epstein SE, Johnston GS. Real-time radionuclide
cineangiography in the noninvasive evaluation of global and regional left ventricular function
at rest and during exercise in patients with coronary-artery disease. N Engl J Med.
1977;296:839-44.
11. Huth EJ. How to write and publish papers in the medical sciences. 2nd ed. Baltimore:
Williams & Wilkins; 1990.
12. Haggard HW. The doctor in history. New Haven: Yale University Press; 1934.
About the Editors
Phyllis G. Supino, EdD

Dr. Supino is an internationally recognized expert in research methodology,
cardiovascular epidemiology, and medical education who has spearheaded
multiple innovative educational programs on research methods for clinical
audiences, upon which this book is largely based. She has authored more
than 140 publications on research training, program evaluation, psychomet-
rics, evidence-based medicine, and various topics in clinical medicine includ-
ing valvular and coronary heart diseases, geriatric screening and death and
dying. She is Professor of Medicine at the State University of New York
(SUNY) Downstate Medical College, Professor of Public Health at the
SUNY Downstate School of Public Health, Adjunct Research Professor of
Public Health at Weill Medical College of Cornell University, and Director of
Clinical Epidemiology and Clinical Research in the SUNY Downstate
Division of Cardiovascular Medicine. She has primary responsibility for
leading and mentoring clinical and epidemiological research in cardiovascu-
lar medicine and for teaching principles of clinical research methodology to
medical students, postgraduate medical trainees, attending physicians, and
other health professionals at SUNY Downstate. Formerly, Dr. Supino served
as a full-time faculty member at Weill Cornell Medical College and at other
academic medical institutions in the greater New York area, where she
directed clinical research and designed, introduced, and taught new courses
on research methodology and hypothesis and protocol design for clinicians
and allied health professionals. Several of these courses have been published
as curriculum models in the international medical literature. She has taught
biostatistics, epidemiology, and evidence-based medicine to medical under-
graduates, mentored hundreds of medical students and physicians on clinical
research and epidemiological methods, and created novel programs to intro-
duce area college students to clinical investigation. Dr. Supino received her
BA in biological sciences from the City College of the City University of
New York. She holds an earned doctorate, with research distinction, in science
education from Rutgers, the State University of New Jersey, and has
conducted seminal postdoctoral research on learning theory and teaching
methods at Princeton University, where she served as a member of faculty,
and at the Personality and Social Behavior Research Group of the Educational
Testing Service, both in Princeton, New Jersey. She has won numerous
awards for excellence in research, teaching, and mentoring, chaired several
committees on research education in medicine, served on a variety of editorial
boards and scientific advisory committees, and has chaired various
scientific sessions in cardiovascular medicine. Dr. Supino is a Fellow of the
New York Academy of Medicine and has been included in Who's Who in
America, Who's Who in the World, Who's Who in Medicine in Healthcare,
Who's Who in Science and Engineering, and Who's Who among American
Women.
Jeffrey S. Borer, MD
Jeffrey S. Borer, MD is Professor of Medicine, Cell Biology, Radiology, and
Surgery at SUNY Downstate Medical Center and College of Medicine in
New York City. He is Chairman, Department of Medicine and Chief, Division
of Cardiovascular Medicine, and Director of the Howard Gilman Institute for
Heart Valve Disease and of the Cardiovascular Translational Research
Institute at SUNY Downstate. Dr. Borer received his BA from Harvard, his
MD from Cornell and trained at the Massachusetts General Hospital. He
spent 7 years in the Cardiology Branch of the NHLBI at the NIH and a year
at Guy's Hospital in London as a Senior Fulbright Hays Scholar, where he
completed the first clinical demonstration of the utility of nitroglycerin in
acute myocardial infarction. Upon returning to the NIH, he developed stress
radionuclide cineangiography, for the first time allowing non-invasive assessment
of cardiac function with exercise. He then returned to Cornell for
30 years, where he was the Gladys and Roland Harriman Professor of
Cardiovascular Medicine and Chief of the Division of Cardiovascular
Pathophysiology. At Cornell, his primary research involved developing prog-
nostic standards for regurgitant valve diseases and exploring the cellular and
molecular biology of myocardial dysfunction in valve diseases, now contin-
ued at SUNY Downstate. He has been an Advisor to the USFDA for 33 years,
chairing the CardioRenal Advisory Committee for three terms and the
Cardiovascular Devices Advisory Committee for one, and Advisor to NASA
for 24 years. He has served as President of the American College of Cardiology
(ACC), New York State Chapter, and member of the Board of Governors of
the national ACC, as well as on the Boards of Governors or Trustees of mul-
tiple other national professional societies. Currently, he is President of the
Heart Valve Society of America and a member of the ISO US Valve Experts
Committee. Dr. Borer has published 400 scientific papers and four books,
edits the journal, Cardiology, and has received several awards and other
recognitions, including the Public Service Medal of NASA. He has been
extensively involved in the training of medical students, residents, fellows,
and translational scientists. Since 1990, he has closely collaborated with
Dr. Supino on a variety of didactic teaching programs on research methodology
for clinicians and other members of the academic communities of Weill
Medical College and SUNY Downstate Medical Center.
Index
A instrumentation bias, 83
Abstract, scientic paper, 256, 259260 loss to follow-up bias, 7, 16, 60, 61
ACP Journal Club, 179 maturation bias, 83
Alternate form reliability, 168 nonparticipation bias, 60, 61
American Association for the Advancement of Science publication bias, 12, 44, 181, 190192
(AAAS), 256 recall bias, 7, 71, 75
Analysis of variance (ANOVA), 47, 219221 referral bias, 72, 80
Analytic research, 9 sampling bias, 154
Association consistency, 76 selection bias, 16, 66, 72, 8082, 91, 92, 95, 96, 98,
Audio computer-assisted self-interview (ACASI), 160 100102, 105, 109, 127, 177, 262
Authors list, scientic paper, 258, 259 social desirability bias, 164165
sources of, 6062, 7071, 76
testing bias, 81, 84, 91, 105, 109, 168
B Biological plausibility, 42, 76
Bacon, Sir Francis, 33 BIOSIS Previews, 23, 24
Bartholow, Roberts, 235 Bonferroni test, 220, 221
Basic research, denition, 34 Boolean operators, 180181
Bayes theorem, 34, 224, 225 Box plots, 209, 210
Beck Depression Inventory, 162 Brief Symptom Inventory (BSI), 148
Beecher, Henry K., 237239 Buxton, Peter, 238
Behaviorally anchored rating scale (BARS), 157
Belmont report, 3, 238, 239, 242, 251
Benecence, 238, 242, 251 C
Bernard, Claude, 31, 36 Case-control study
BestBETs, 179 advantages and disadvantages, 7475
Bias case
accuracy, 61, 71 denition, 64
agreement bias, 165 selection, 64
allocation bias, 88, 90 vs. cohort study, 5662, 74
denition, 95, 100, 107, 116, 153, 249 controls
detection bias, 72, 83 denition, 6567
devil bias, 165 selection, 65
expectancy bias, 84 odds ratio calculation, 67, 73, 74, 76
experimental mortality (attrition), 82, 88, 90, prevalent vs. incident case, 6465
96, 102, 105, 109 Case report form (CRF), 132, 135142, 144
experimenter bias, 83, 88, 95, 109 Case series, 7, 9, 182
exposure misclassication, 6061 Case study, 7, 9, 8788, 90
faking bad bias, 165 Categorical responses, 158
history bias, 8081, 88, 93, 95, 96, 99, 102, Ceiling effect, 164
105, 106, 109 Central tendency, measures of, 209, 229
horns bias, 165 Chi-squared/Chi-square test, 221223, 229
DOI 10.1007/978-1-4614-3360-6, Phyllis G. Supino and Jeffrey S. Borer 2012
272 Index
Closed-ended questions data sources, 134, 135

dichotomous responses vs. polychotomous data types, 132, 134, 136
responses, 155, 156 document retention, 141
nominal-level responses, 156 electronic data capture (EDC)/electronic systems,
ordinal responses, 156 132, 137141, 144
Clustering illusion, 5 GCP guideline, 133, 142143, 242
Cochran chi-square, 189 original ink concept, 135
Cochrane collaboration, 25, 178, 184, 190 patient privacy, 131, 142
Coefcient of variation, 210 source documentation, 134135
Cohort study Data safety and monitoring board (DSMB), 124, 144
advantages and disadvantages, 58, 59 Descriptive research, 210, 230
basic notation, 56 Diagnostic testing, 179, 184186, 223226
exposure, denition, 57 Diaries, 115, 125, 133, 134, 150151
exposure information, 5758 Directory of Unpublished Experimental Mental
exposure misclassication Measures, 152
outcome information, 5859 Discussion, scientic paper, 257, 261, 265267
prospective vs. retrospective cohort design, 63 Dispersion, measures of, 9, 46, 209, 210
relative risk calculation, 6263
Common Rule, 239, 240
Comparative research, 9 E
Computer-assisted interview (CAI), 5960 EBSCOhost family, 23, 25
Computer-assisted personal interview (CAPI), 159160 Effect size, 164, 186191, 215
Computer-assisted telephone interview (CATI), 160 Electronic data capture (EDC), 132, 139141, 144
Condence interval, 186, 187, 189, 193, 198, 200, 212, Electronic health records (EHRs), 134
213, 215218, 222, 224227, 229, 230 E-mail and Web-based surveys, 159
Confounding, denition (criteria), 5960 EMBASE, 2325, 181
Control group, 10, 58, 66, 67, 7073, 75, 8183, 86, Equivalence trials, 216, 217
87, 89, 90, 9295, 101103, 107, 108, 116, Estimation, techniques for, 210217
182184, 189, 211, 216, 217, 238 Ethics (in research)
Controlled trials, 26, 27, 124, 127, 191 basic ethical principles
Correlation analysis, 9, 46, 167 benecence, 238
Council of Biology Editors, 256 benecence and justice, 238
Cox regression model, 229, 230 benevolence, 235
Critical incident method, 150, 157 respect for persons, 238
Cronbachs alpha, 168, 171 clinical and social value, 248
Crossover design, 86, 98, 100 deception, 242
Cross-sectional study, 56, 75 fair subject selection, 248
Cumulated Index to Nursing and Allied Health HIPAA, 243245
Literature (CINAHL), 24, 25 informed consent, 245247
Institutional Review Board (IRB), 238250
misconduct and consequences, 249250
D Nuremberg Code of 1947, 236
Data analysis, 2, 11, 33, 51, 144, 149, 256 Nuremberg Military Tribunal, 236
Database of Abstracts of Reviews of Effectiveness plagiarism, 249, 250
(DARE), 179 research subjects, 233250
Database of Promoting Health Effectiveness Reviews research with children, 243
(DoPHER), 179 risk-benet ratio, 236, 248249
Data collection and management self-experimentation guidelines, 247248
administrative data, 133135 Tuskegee study, 237, 238
anecdotal observation, 131 withholding of information, 242
case-report form (CRF), design, 132, 135, 138139 Ethnographic methods, 149
clinical database, 121, 133 Experimental Research, 910
condentiality and privacy, 143 Exploratory data analysis (EDA), 33, 140, 207210
data collection instrument, 133, 136
data collection plan, 11
data-entry cleaning, 140141 F
data error identication and resolution, 140 Face-to-face interview, 151, 159, 160
data monitoring, 134, 143, 144 Factorial design, 92, 96, 97
data queries, 141 Fagan nomogram, 185187
data security, privileging, 131, 141142 Fail-safe N, 190191
Index 273
Feasibility study, 19 Interobserver (inter-rater) reliability, 167, 184

Figures, 265 Interventional research, study designs for
Fishers exact test, 223 clinical trial, 84, 85, 91, 92, 98 (see also Randomized
Floor effect, 12 clinical/controlled trial (RCT))
Focus groups, 10, 148, 150, 151, 169, 171 crossover design, 86, 98, 100
Funnel plots, 190191 factorial design, 92, 96, 97
factorial study, 96
multiple time-series design, 101, 107, 108
G n-of-1 study, 98
Goldberger, Joseph, 235 non-equivalent control group design, 101103,
Gold standard diagnostic test, 225 107, 108
Goodwin, Elizabeth, 250 notation, 86
Graphical display, 208, 209 one-group pretest-posttest only design, 8890
one-shot case study, 87, 88, 90
parallel comparison groups, 88, 99
H posttest only control group design, 92, 94, 95
Halo effect, 165 pretest-posttest control group design, 92, 93, 101
Hawthorne effect, 85 static-group comparison, 87, 89, 90, 102
Health Insurance Portability and Accountability Act time-series design, 101, 103108
(HIPAA), 142, 243245 true-experimental crossover design, 98
individually identiable health information, 243, 244 two-period crossover design, 98
protected health information (PHI), 142, 244 Introduction, scientic paper, 260261
waiver authorization, 244, 245 IRB. See Institutional Review Board (IRB)
Heiman, Henry, 235 ISI Web of Knowledge, 23
Hess, Alfred F., 235 I2 statistic, 189, 192
HIPAA. See Health Insurance Portability and ITT. See Intention-to-treat analysis (ITT)
Accountability Act (HIPAA)
Histogram, 208
Hypothesis J
abduction/abductive reasoning, 3436 Jadad scale, 183
alternative hypothesis, 4244 Jenner, Edward, 34
association hypothesis, 33, 34, 37, 38, 42, John Henry effect, 86
44, 46, 49 Joint interview, 150
bivariable hypotheses, 42 Judgment sampling, 203. See also Sampling
conceptual vs. operational hypothesis, 40, 41
deduction/deductive reasoning, 33
directional hypotheses, 42, 43 K
falsication vs. veriability, 39, 40, 43 Kaplan-Meier estimation, 228, 229
induction/inductive reasoning, 3334 Known groups validity analysis, 163
mechanistic hypothesis, 4243 Kruskal-Wallis test, 221
nonmechanistic, 4243 Kuder-Richardson Formula 20 (KR-20), 167168
null hypothesis, 43
operational hypothesis, 40
Hypothesis-generating research, 4, 10, 33 L
Hypothesis-testing study, 6 LAbb plot, 189
Hypothetico-deductive approach, 4, 10 Life histories, 150
Hypothetico-deductive model, 39 Likelihood ratios (LR), 185
Likert scales, 156157. See also Scaling
Literature searching (selecting articles)
I BIOSIS, 23, 24
IMRAD format, 257 CINAHL, 24, 25
Index Medicus, 256 Cochrane Library, 24, 25, 178, 181
Individual Patient Data (IPD), 186 EMBASE, 2325, 181
Institutional Review Board (IRB), 11, 12, 20, 21, 113, keywords, 25, 180181
121, 123, 126, 132, 142144, 158, 238250 MEDLINE, 2325, 177, 178, 181
Intention-to-treat analysis (ITT), 91, 92 MeSH, 23, 25, 180181
Interclass correlation coefcient (ICC), 167 PsycInfo, 24, 25
Interim analysis, 12 Pubmed, 23, 24, 178, 180, 181
Internal consistency, 82, 160, 167, 168, 171 search strategies, 177, 178
274 Index
Literature searching (cont.) O

Social Science Citation Index, 24, 25 Observational study
Web of Science, 2325, 181 association consistency, 76
Logical positivism, 38 association strength, 76
Logrank test, 229 biological plausibility, 42, 76
Longitudinal study, 81, 88, 139 case-control study, 56, 75
cohort study, 56, 61, 76, 210, 221
cross-sectional study, 56, 75
M dose-response relationship, 76
Mail (postal) surveys, 159, 161 temporal association, 76
MarloweCrowne scale, 164165 Occams razor, 38
Matching Odds ratio (OR), 67, 73, 74, 76, 187, 211, 230
group vs. calipers, 6869 One-group pretest-posttest only design, 8890
individual vs. frequency, 6970 One-shot case study, 87, 88, 90
propensity matching, 70 Open-ended questions, 155
McNemars test, 223 Operational denition, 11, 18, 4951
Medawar, Peter, 257 Operational hypothesis, 40, 50
Medical Outcomes Study Short Form (SF), 163 Oral histories, 150, 243
Medical subject headings (MeSH), 23, 25, 180181 Original ink concept, 135
MEDLINE, 2325, 177, 178, 181
Mental Measurements Yearbook, 152
Meta-analysis P
Cochran chi-square, 189 Parallel comparison design, 88, 95, 99, 108
Fagan nomogram, 185187 Pediatric Evaluation of Disability Inventory, 163
fail-safe N, 190191 Peer-review process, 17, 29
forest plot, 187189, 193 Phase III trials, 20, 200, 202
funnel plot, 190, 191 PICO method, 179
heterogeneity, 179, 187190, 194 Pierce, Charles Sanders, 35
interstudy differences, 178 Pilot testing, 148, 151, 152, 171
IPD, 186 Plagiarism, 249, 250
I2 statistic, 189, 192 Poehlman, Eric, 250
limitations of, 191194 Pooled effect estimate, 186
PRISMA, 191193 Popper, Karl, 33, 3840
sensitivity analysis, 189190 Population, denition, 3, 117, 179, 198
statistical methodology of, 178 Positive predictive value, 178, 224, 225
Methods, scientic paper, 255257, 261263 Posttest only control group design, 92, 94, 95
Minnesota Living with Heart Failure Power, statistical, 11, 64, 69, 70, 97, 98, 178, 186, 187
Questionnaire (MLHFQ), 148 Preferred Reporting Items for Systematic Reviews
Morisky scale, 148 and Meta-analyses (PRISMA), 191194
Multibarreled question, 153 Primary endpoint, 113, 114, 120, 197
Multidimensional Fatigue Inventory, 163 Principal investigator (PI), 11, 12, 241, 242, 246
Multiple time-series design, 101, 107, 108 Privacy, 131, 134, 142, 143, 160, 169, 233, 242245
Problem statement, 2, 11, 2629
Proof of concept study, 112113
N Proportionate risk, 21
National Center for Biotechnology Prospective research, 68
Information (NCBI), 23 Protected health information (PHI), 142, 240, 244, 245
National Death Index, 82 Protocol development
National Institutes of Health (NIH), 4, 127, 137, background and rationale, 112118
239, 248, 250, 258 elements of, 111
National Library of Medicine (NLM), 23, 24, 180, 266 endpoint selection factors, 114115
National Research Act of 1974, 238, 239 ethical considerations, 111, 123
Negative predictive value, 22, 224, 225 implementation, 111, 118124, 126
New York Heart Association (NYHA), 45, 51 investigators responsibilities, 123, 126
N-of-1 study, 98 resource allocation and management, 125
Non-equivalent control group design, 101103, safety monitoring procedure, 122123
107, 108 statement of goals and objectives, 112
Number needed to harm (NNH), 184 statistical considerations, 111, 114, 123124
Number needed to treat (NNT), 184 Pyramid of evidence, 182
Index 275
Q extreme/deviant case sampling, 203

Qualitative research, 3, 10, 29 frame, 199, 201, 202
Quality Assessment of Diagnostic Accuracy Studies multiphase sampling, 202
(QUADAS), 184, 185 nonprobability sampling, 199, 203
Quality of reporting of meta-analyses (QUOROM), 191 oversampling, 200
Quantitative research, 10, 29 probability sampling, 200202
Quasi-experimental studies, design, 10, 182 sampling bias, 154
Questionnaire, 10, 11, 57, 58, 75, 82, 125, 132134, simple random sampling, 201
136, 147149, 151154, 158, 159, 161, 162, snowball (chain-referral) sampling, 169
167, 170, 171, 241 stratied random sampling, 91, 201
Quota sampling, 203 Scaling, 45, 155, 158, 164, 170
Science Citation Index, 2325
R
Randomization, 75, 80, 91, 92, 95, 96, 98, 109, 113, 114, 119–121, 124, 134, 183, 204, 263
Randomized clinical/controlled trial (RCT), 6, 7, 24, 25, 75, 85, 91, 93, 94, 96, 98, 107, 124, 127, 132, 133, 136, 151, 182–183, 188, 191
Rank order scales, 158
Receiver operator characteristic (ROC) curves, 185, 186
Regression analysis, 263
Relative risk, 59, 62–63, 73, 74, 76, 211, 221, 229, 230
Reliability
  definition of, 165–166
  types, 166–168
Repeated measures design, 20, 88, 98, 164
Research
  characteristics of, 13, 37–40
  definition, 1
  design, 2, 10–11, 15, 29, 40, 46, 50, 51, 55, 60, 65, 84, 86–100, 109, 131, 132, 134, 143, 257, 263
  planning, 11, 19
  problem, statement of
  steps, 10–13, 29
  types, 2, 240
Researchable question, 18
Response rates, 159–161
Results, scientific paper, 263
Retrospective research, 7
Reverse-halo bias, 165
Rhoads, Cornelius, 235
Risk factor, 7–9, 19, 37, 42, 45–47, 55–59, 61–64, 67, 68, 70–76, 83, 91, 101, 182, 198, 211, 221
Rival hypothesis, 79, 102–103
Run-in periods, 98, 118–119

S
Safety monitoring procedure, 122–123
Sample, definition, 11, 57, 64, 68, 107
Sample size, 7, 11, 16, 20, 22, 44, 48, 60, 62, 64, 68, 73, 91, 98, 109, 113, 114, 116, 124, 126, 167, 169, 178, 186, 187, 189, 190, 198–202, 208, 209, 212–215, 218, 220–223, 226, 248, 262, 264
Sampling
  cluster random sampling, 202
  convenience sampling, 203
Scientific method, 1, 41, 127, 255
Screening, 16, 22, 66, 116–119, 121, 126, 127, 171, 192, 223–226, 256
Secondary endpoint, 113–115, 197
Self plagiarism, 249, 250
Self report
  definition of, 65
  interview and related methods, 148–151
  pilot testing, 148, 151, 152, 171
  psychometric properties, 152, 161–168
  randomized response, 159, 160
  rank order scales, 158
  reliability and validity assessments, 165, 166
  respondents, 148–160, 164, 165, 169–171
  skip patterns, 151, 160, 171
  sources of items, 152
  structuring questions, 152–155
  types of, 155–156
Semantic differential scales, 157
Semistructured interview, 149
Sensitive information, 148, 154, 158–159, 169
Sensitive questions, 71, 158–160, 169
Sensitivity analysis, 189–190
SF-36, 32, 148, 162
Sims, Marion, 235
Social desirability bias, 164–165
Special populations, 169
Specific research aim, 149
Split-half reliability, 167, 168
Standard deviation, 46, 132, 198, 210–213, 215, 218
Standard error of the mean, 46, 214
Standardized reporting format, 257
Stanley, Leo, 235
Statement of purpose, 27, 28
Statistical methods, 45, 55, 171, 190, 207, 210, 211, 217, 230, 263
Stratified random sampling, 91, 201
Strong, Richard P., 235
Structured interview, 136, 148–149
Student's t test, 213, 218
Study monitoring, 126, 141
Study termination, 105, 122–125
Survival analysis, 227–230
Systematic error, 81, 164, 166
Systematic random sampling, 201
Systematic review, 24–26, 177–194
T
Tables, 190, 199, 218, 223, 263–265
Telephone survey, 159, 161
Test-retest reliability (temporal stability), 89, 166–167
Tests in Print, 152
Texas sharpshooter fallacy, 5
Think-aloud methods, 151
Thurstone scaling, 155
Time-series design, 101, 103–108
Title, scientific paper, 257, 258, 264, 265
Traditional ordinal rating scales, 156
Translational research, 34
Translation issues, 154
TRIP Database, 179
True experimental (randomized) design, 91–101, 103, 109
Tukey's Honestly Significant Difference test, 220
Type I error, 190, 194, 200, 214, 215, 220
Type II error, 194, 200, 214

U
Unstructured interview, 10, 148, 149, 152, 243

V
Validation, 3, 12, 75, 114, 115, 137–141, 148, 151, 160, 162, 170
Validation items, 170
Validity
  construct validity, 162–163, 170
  content validity, 162, 170
  convergent validity, 162–163
  criterion validity, 163
  divergent/discriminant validity, 162, 163
  external validity threat
    multiple treatment interference, 86, 88, 97, 100
    reactive effects, 84–86, 89, 94–96, 100, 103, 106
    selection and treatment, interactive effects, 84, 85, 89, 94, 96, 99, 103
    testing reactive effects, 84–86, 89, 94–96, 100, 103, 106
  face validity, 162
  internal validity threat
    experimental mortality, 82, 88, 90, 96, 102, 105
    experimenter bias, 83–84, 88, 95
    factors interaction, 82–83, 89
    history effects, 80–81, 88, 93, 96, 102–103, 106
    instrument decay, 81
    interactive effect, 84–85, 94, 96, 97, 99
    maturation effects, 81, 96, 105
    regression effects, 82, 93, 103, 105
    rival hypotheses, 79
    selection bias, 80–83, 91, 92, 95, 96, 98–102, 105
    statistical regression, 60, 82, 89, 90, 93
    subject expectancy effects, 83–84
    testing effects, 81, 84, 90, 105
Variables
  control variable, 48–50
  dependent variable, 47–50, 79, 80, 86, 88, 92–93, 95, 96, 101, 102, 105, 107, 227, 230
  independent variable, 46–50, 79, 86, 227, 230
  interval variable, 45–46
  intervening variable, 2, 46, 49–50
  moderator variable, 47–50
  nominal variable, 45, 156
  ordinal variable, 45, 46
  quantitative variable, 46, 209
  random variable, 207–210, 212, 213, 218, 221, 226
  ratio variable, 45, 46
Visual analog scales (VAS), 157–158, 220

W
Wakefield, Andrew, 249
Weighted Kappa, 167
Wentworth, Arthur, 235
Wilcoxon rank sum test, 219, 221
World Health Organization Quality of Life Questionnaire (WHOQOL), 148

