
BASICS IN

EPIDEMIOLOGY AND BIOSTATISTICS

Waqar H Kazmi MD MS (Tufts, Boston)

Principal, Professor of Nephrology and Director Research


Karachi Medical and Dental College/Abbasi Shaheed Hospital
Karachi, Pakistan

Farida Habib Khan DCH MPH MCPS FCPS


Professor of Community Medicine
Princess Nora Bint Abdulrahman University
Riyadh, Kingdom of Saudi Arabia

Foreword
Waris Qidwai

The Health Sciences Publisher


New Delhi | London | Philadelphia | Panama

Jaypee Brothers Medical Publishers (P) Ltd.


Headquarters
Jaypee Brothers Medical Publishers (P) Ltd.
4838/24, Ansari Road, Daryaganj
New Delhi 110 002, India
Phone: +91-11-43574357
Fax: +91-11-43574314
E-mail: jaypee@jaypeebrothers.com
Overseas Offices
J.P. Medical Ltd.
83, Victoria Street, London
SW1H 0HW (UK)
Phone: +44-20 3170 8910
Fax: +44(0)20 3008 6180
E-mail: info@jpmedpub.com

Jaypee-Highlights Medical Publishers Inc.


City of Knowledge, Bld. 237, Clayton
Panama City, Panama
Phone: +1 507-301-0496
Fax: +1 507-301-0499
E-mail: cservice@jphmedical.com

Jaypee Medical Inc.


The Bourse
111, South Independence Mall East
Suite 835, Philadelphia, PA 19106, USA
Phone: +1 267-519-9789
E-mail: jpmed.us@gmail.com

Jaypee Brothers Medical Publishers (P) Ltd.


17/1-B, Babar Road, Block-B, Shaymali
Mohammadpur, Dhaka-1207
Bangladesh
Mobile: +08801912003485
E-mail: jaypeedhaka@gmail.com

Jaypee Brothers Medical Publishers (P) Ltd.


Bhotahity, Kathmandu, Nepal
Phone: +977-9741283608
E-mail: kathmandu@jaypeebrothers.com
Website: www.jaypeebrothers.com
Website: www.jaypeedigital.com
© 2015, Jaypee Brothers Medical Publishers

The views and opinions expressed in this book are solely those of the original contributor(s)/author(s)
and do not necessarily represent those of editor(s) of the book.
All rights reserved. No part of this publication may be reproduced, stored or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior
permission in writing of the publishers.
All brand names and product names used in this book are trade names, service marks, trademarks
or registered trademarks of their respective owners. The publisher is not associated with any product
or vendor mentioned in this book.
Medical knowledge and practice change constantly. This book is designed to provide accurate,
authoritative information about the subject matter in question. However, readers are advised to
check the most current information available on procedures included and check information from the
manufacturer of each product to be administered, to verify the recommended dose, formula, method
and duration of administration, adverse effects and contraindications. It is the responsibility of the
practitioner to take all appropriate safety precautions. Neither the publisher nor the author(s)/editor(s)
assume any liability for any injury and/or damage to persons or property arising from or related to use
of material in this book.
This book is sold on the understanding that the publisher is not engaged in providing professional
medical services. If such advice or services are required, the services of a competent medical
professional should be sought.
Every effort has been made where necessary to contact holders of copyright to obtain permission to
reproduce copyright material. If any have been inadvertently overlooked, the publisher will be pleased
to make the necessary arrangements at the first opportunity.
Inquiries for bulk sales may be solicited at: jaypee@jaypeebrothers.com

Basics in Epidemiology and Biostatistics


First Edition: 2015
ISBN: 978-93-5152-631-5
Printed at

Dedicated to

Medical and Dental Students


and
Young Researchers


Foreword

It gives me immense pleasure to write a foreword for Basics in Epidemiology
and Biostatistics, written by the highly eminent and respected scholars
Professor Waqar H Kazmi and Professor Farida Habib Khan. Prof Kazmi is
considered an authority on this subject and has the skill to present
challenging concepts in epidemiology and biostatistics in
easy-to-understand language. He obtained his Masters in Epidemiology from
Tufts University, Boston, USA, and has a strong clinical background as a
Professor of Nephrology. Farida Habib Khan is a Professor of Community
Medicine; she has served the College of Physicians and Surgeons as a
regular facilitator of the Workshops on Research Methodology and
dissertation writing, and has served two medical journals as an Associate
Editor.
The book fills a great need for accessible books on this important yet
neglected subject. Epidemiology and biostatistics have been neglected in
the medical education curriculum and, as a result, healthcare providers
lack expertise in this important area. The book will go a long way towards
providing an easy-to-understand guide for healthcare providers and others
who need to understand and apply the concepts of epidemiology and
biostatistics in their work. Its simple language and practical approach
make it indispensable for those involved in research work as well as those
teaching epidemiology and biostatistics. It will be useful for
undergraduate and postgraduate students in various disciplines of
healthcare, as well as for those practicing medicine, teachers and
researchers.

Waris Qidwai

Chair, Working Party on Research


World Organization of Family Doctors (WONCA)
Former Chair
International Federation of Primary Care
Research Networks (IFPCRN)
Professor and Chairman
Department of Family Medicine
Aga Khan University
Karachi, Pakistan

Preface
Basics in Epidemiology and Biostatistics introduces medical and dental
students, postgraduates, researchers and clinicians to the study of
statistics applied to medicine. We have drawn on our experience in medicine
and statistics to develop a comprehensive text covering the traditional
topics of biostatistics and epidemiology. Particular emphasis is given to
study design and the interpretation of the results of medical research.
For more than a decade we have been giving lectures at various
undergraduate and postgraduate institutes. Students find these lectures
worthwhile for understanding the basic concepts of biostatistics and
epidemiology. We realized that by writing a book we could reach a large
number of students and faculty members in remote areas who were not
otherwise accessible to us. We hope that anyone interested in research
will find the book helpful.
We have tried to explain all statistical concepts in simple terms. No
special background knowledge is required to understand the text. An effort
has been made to cover all the fundamental concepts and important terms.

The book contains the following features:

Simple Text
The book is written in a simple, easy-to-understand manner. The
information given is relevant to the needs of junior and early-stage
researchers, and it is presented in a schematic, sequenced pattern. This is
necessary because a learner must understand the prerequisite information
before moving on to the more advanced concepts in basic epidemiology
and biostatistics.

Pictorial and Tabular Display of Information


Different learners have different learning styles. Some find textual
information easy to understand, while others are more at ease with a
pictorial or tabular display of information. Thus, all relevant text has
also been presented in pictorial and tabular form. We hope that a large
number of readers will be able to grasp the important and useful
information by having a good look at the pictures and tables.
Relevant Examples
We have used multiple clinical and nonclinical examples so that the reader
will understand the basic concepts of epidemiology and biostatistics. Simple
interesting examples have also been used for the purpose.

Software Relevant to Use in Research
A number of software packages are relevant to research. In this book,
several software packages have been used to compute sample size. The
reader will find the book useful for understanding how to use the
relevant software for sample size calculation.

Waqar H Kazmi
Farida Habib Khan

Acknowledgments

We are extremely grateful to Muhammad Abdul Samad, Lecturer, Research


Department, Karachi Medical and Dental College, Karachi, Pakistan, for his
invaluable support and efforts in every stage of writing the book.
We express our gratitude to Mrs Huma Khan, Research Co-ordinator,
Universal Research Group, Pakistan, for her support regarding proofreading of
the book.
We are thankful to Asma Kazmi, Assistant Professor, California Institute of
Fine Arts, Los Angeles, USA, for designing the Cover Page.
Our special thanks to M/s Jaypee Brothers Medical Publishers (P) Ltd, New
Delhi, India, for their active co-operation in publishing this book.


Contents

1. Introduction to Research
  What is Research?
  Types of Research
  Steps to Conduct Research
  Selection of Research Topic
  Scale for Rating Research Topics
  Resources of Literature Search

2. Study Designs
  Definition
  Types of Epidemiological Study Designs
  Descriptive Observational Studies
  Analytical or Comparative Studies
  Analytical Observational Studies
  Registries
  Interventional/Experimental Studies
  Blinding
  Consent Form
  Intent to Treat Analysis
  Quasi-experimental Studies
  Clinical Trials and their Phases
  Research Questions and Study Types
  Meta-analysis

3. Sampling Procedure
  Population
  Reasons for Sampling
  Sampling Techniques

4. Variables, Data and its Presentation
  Variables and their Types
  Data and its Types
  Tabulation and Graphical Presentation of Data

5. Biostatistics: Basic
  Measures of Central Tendency
  Measures of Variation
  Standard Error of Mean
  Normal Distribution

6. Estimation and Hypothesis Testing
  Point Estimate
  Interval Estimate
  Hypothesis Testing
  Introduction to the Scale of Probability
  Test of Hypothesis
  Decision Errors

7. Measures of Disease Frequency
  Ratio, Proportion and Rate
  Prevalence and Incidence
  Special Types of Incidence Rates

8. Measures of Association
  Association between Two Continuous Variables
  Relative Risk and Odds Ratio

9. Factors Affecting Study Outcomes
  Introduction
  Bias
  Control of Bias
  Confounding
  Effect Modifiers

10. Sample Size Estimation
  Sample Size
  Sample Size for Single Proportion
  Sample Size for Single Group Mean
  Sample Size for Two Proportions
  Sample Size for Two Group Means
  Sample Size for Sensitivity and Specificity
  Suggested Websites for Sample Size Calculator

11. Screening
  Reliability and Validity of a Screening Test
  Sensitivity and Specificity
  Predictive Values

12. Basic Statistical Tests
  Unpaired Samples
  Paired Samples
  What are Validity and Reliability in Research Findings?

13. Overview of Data Collection Techniques
  Different Data Collection Techniques

14. Data Analysis Plan
  Importance of Data Analysis Plan
  What Should the Plan Include?

15. Synopsis Writing
  Methodology
  Plan for Analysis of Results
  Title/Topic
  Introduction

16. Dissertation Writing
  Steps in Writing a Dissertation
  Title
  Table of Content
  Title Page
  Abstract
  Introduction
  Hypothesis
  Study Objective
  Subjects/Material and Methods
  Results
  Discussion
  Optional Components
  References
  Annexes
  The Whole Manuscript/Dissertation Should be in Past Tense
  Sample of Title Page

17. Reference Writing
  Citing a Journal Article
  Title of Journal Article
  Journals Title
  Citing a Book Reference
  Other Authors
  Dissertation Reference
  Citing Internet and other Electronic Sources

18. Guidelines for Consent Writing
  General Ethical Principles
  Guidelines for Drafting an Informed Consent Form
  Important Notes

19. Consent to Participate in Research (Sample)
  Title or Paraphrased Title of the Study
  Purpose of the Study
  Procedures
  Potential Risks and Discomforts
  Potential Benefits to Subjects and/or to Society
  For Biomedical Studies only, Add the Following Section Here
  Identification of Investigators
  Rights of Research Subjects

Index

CHAPTER 1

Introduction to Research

WHAT IS RESEARCH?

Research is a systematic process of collecting and analyzing data and
then interpreting them, so as to find solutions to a problem or explain
any event around us (Fig. 1.1).

TYPES OF RESEARCH

Basically, research is of two types, i.e. empirical and theoretical
(see Flow chart 1.1 for the classification of research). The empirical
approach is based upon observation and experience, while the theoretical
approach is based upon theory and abstraction. Empirical and theoretical
research complement each other in developing an understanding of a
phenomenon, predicting future events and preventing harmful events for
the general welfare of the population of interest.
Empirical research is further divided into qualitative and
quantitative research.

Qualitative Research
This type of research is context based. It is an inquiry whose goal is
to understand a social or human problem so as to build up a complex and
holistic picture of the phenomenon of interest. The researcher interprets
the perspectives or information obtained from the subjects.

Figure 1.1 Research as a systematic process

Flow chart 1.1 Classification of research

In logic, we often refer to the two broad methods of reasoning as the
deductive and inductive approaches.
Deductive reasoning works from the more general to the more specific.
Informally, this is sometimes called a top-down approach. Inductive
reasoning works the other way, moving from specific observations to
broader generalizations and theories; this is called a bottom-up
approach. Qualitative research follows the inductive form.
There are three types of qualitative research, i.e. case studies,
ethnographic studies and phenomenological studies.
1. Case study is a descriptive study of a single entity with respect to
time and entity.
2. Ethnographic study is a study of a cultural group in a natural
setting. A cultural group could be a group of people who share a
common location or a common social experience, e.g. prisoners
in a jail or a cultural group of Muslims.
3. Phenomenological study is a study of the human experience of a small
group of people over a long period of time.


Quantitative Research

In quantitative research, reality is studied objectively by the
researcher. A theory or hypothesis is tested using numbers, which are
analyzed by statistical methods. This type of research is based on the
deductive form of logic. Ultimately, the researcher develops
generalizations and contributes to theory.
The three types of quantitative research are experimental,
quasi-experimental and survey research.
1. In experimental research, subjects are randomly assigned to
experimental conditions. The results are compared with those of
controls.
2. Quasi-experimental studies are similar to experimental studies,
except that subjects are assigned to the experimental conditions
without randomization.
3. Surveys are cross-sectional studies using questionnaires or
interviews with the intent of estimating the characteristics of a
larger population based on a smaller group from that population.
Health science research mostly uses the quantitative approach.

STEPS TO CONDUCT RESEARCH

Research is a systematic process that starts with the selection of a
research topic and ends with reporting the research findings in local or
international journals or at scientific meetings. Table 1.1 gives details
of the various steps in conducting research and their purposes.

SELECTION OF RESEARCH TOPIC

Main Criteria for Selecting a Research Topic

There are seven criteria for selecting a research topic.
1. Relevance: Consider the prevalence of the problem in which
you are interested. In other words, how big is the problem?
2. Innovation: It is good to look into a new problem, but it is not
always possible to search for or work on new problems as you may
have limited resources. In that case, you can work on an old problem
from a different perspective.
3. Feasibility: This means the availability of the resources that you
may need to carry out the research project, including manpower,
money, material, machinery, skills and time.

Table 1.1: Steps to conduct research

Steps | Purpose
Selecting a research topic and formulating objective(s) | To assess what questions the study will address and what it will measure
Undertaking literature review | To establish why the question is important, what is already known about it, and what new this study will assess
Selecting a study design | To ensure that the research design matches the objectives set
Selecting the subjects | To ensure generalizability and validity
Identifying study variables | To be clear in context to: predictor variables, outcome variables, confounding variables
Collection of data | To ensure collection of data aligned to the objective(s) in a reliable and nonbiased manner
Analyzing data | To present quantifiable results and assess validity

4. Acceptability: It is important to consider whether your proposal
will be supported by the local authorities. This also includes
the acceptability of the procedure or method that you are
going to apply in the community, as certain communities have
social boundaries that may hamper your research procedure.
5. Cost-effectiveness: Consider whether the resources you
are spending are worthwhile, for example, in terms of a decline in
morbidity/mortality rates or in the length of hospital stay.
6. Ethical consideration: This includes informed consent, beneficence,
nonmaleficence (do no harm), confidentiality of the information
obtained, etc.
7. Applicability of possible results and recommendations: Is it likely
that the recommendations from the study will be applied? This
depends not only on the blessing of the authorities but also on the
availability of resources for implementing the recommendations.
The opinions of the relevant stakeholders (i.e. potential clients
and the responsible staff) will influence the implementation of the
recommendations as well.

SCALE FOR RATING RESEARCH TOPICS


Each of the criteria mentioned above is graded from 1 to 3,
where 1 is low, 2 is medium and 3 is high (Table 1.2).
Hence, the maximum possible score for any topic is 21 (seven criteria,
each scored up to 3). The topic with the highest score should be chosen.
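
The rating scale can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above: the topic names and ratings are hypothetical, and the function names are our own.

```python
# The seven criteria from this chapter; each is rated 1 (low), 2 (medium) or 3 (high).
CRITERIA = [
    "relevance", "innovation", "feasibility", "acceptability",
    "cost_effectiveness", "ethical_consideration", "applicability",
]

def total_score(ratings):
    """Sum the seven ratings after checking that each one is 1, 2 or 3."""
    for criterion in CRITERIA:
        if ratings.get(criterion) not in (1, 2, 3):
            raise ValueError(f"{criterion} must be rated 1, 2 or 3")
    return sum(ratings[criterion] for criterion in CRITERIA)

def best_topic(candidates):
    """Return the name of the candidate topic with the highest total score."""
    return max(candidates, key=lambda name: total_score(candidates[name]))

# Hypothetical ratings for two candidate topics (illustrative values only).
candidates = {
    "topic_a": dict(zip(CRITERIA, [3, 2, 3, 3, 2, 3, 2])),
    "topic_b": dict(zip(CRITERIA, [2, 3, 1, 2, 2, 3, 1])),
}

for name, ratings in candidates.items():
    print(name, total_score(ratings))
print("Chosen topic:", best_topic(candidates))
```

If two topics tie, `max` simply returns the first one it meets; in practice a tie would call for a closer look at the individual criteria.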

RESOURCES OF LITERATURE SEARCH


Relevant scientific literature can be searched through the internet,
medical journals, conference literature, newspapers, or documents
of governmental or nongovernmental organizations. The internet is usually
used as the process is quick, reliable and freely accessible.
Through the internet one can link to library catalogues, online
databases such as MEDLINE, and a number of biomedical journals.
Researchers should give adequate time to the literature search,
as this will help in writing a good-quality synopsis and
dissertation.
Before using the internet for a literature search, the researcher should
set the keywords for the topic of interest.
Suppose a researcher wants to work on nephropathy as a complication
among diabetic patients who are hypertensive. The
keywords are diabetes, hypertension and nephropathy.
Table 1.2: Scale for rating research topics

Criterion | Low (1) | Medium (2) | High (3)
Relevance | | |
Innovation | | |
Feasibility | | |
Acceptability | | |
Cost-effectiveness | | |
Ethical consideration | | |
Applicability | | |

After opening the PubMed window by directly entering www.pubmed.com
or http://www.ncbi.nlm.nih.gov/PubMed/, the first keyword (diabetes)
is entered in the search bar (for). Approximately 160000 research
papers will be displayed, which is not a manageable number
(Fig. 1.2).

Figure 1.2 PubMed window after entering the first keyword: diabetes

Figure 1.3 PubMed window after entering the second keyword: hypertension

Figure 1.4 PubMed window after entering the third keyword: nephropathy

After entering the second keyword (hypertension), the number of
articles narrows down to 16057, but this is still a very large
figure (Fig. 1.3).
After entering the third keyword (nephropathy), the number
of articles narrows down to just 3010, which is manageable
(Fig. 1.4).
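
The narrowing effect of adding keywords can also be expressed as a single combined query. The sketch below joins the three keywords with the Boolean operator AND and builds a search address. The use of AND, the base address https://pubmed.ncbi.nlm.nih.gov/ and the `term` parameter are assumptions made for illustration; they are not taken from the figures above, and the function names are our own.

```python
from urllib.parse import urlencode

def build_pubmed_query(keywords):
    """Join keywords with AND so that every term must appear in a matching record."""
    return " AND ".join(keywords)

def pubmed_search_url(keywords):
    """Build a search address; the base address and 'term' parameter are assumptions."""
    base = "https://pubmed.ncbi.nlm.nih.gov/"
    return base + "?" + urlencode({"term": build_pubmed_query(keywords)})

# Keywords from the worked example in the text.
keywords = ["diabetes", "hypertension", "nephropathy"]
print(build_pubmed_query(keywords))
print(pubmed_search_url(keywords))
```

Each additional keyword joined with AND can only keep or reduce the number of matching articles, which is why the counts in the example fall as keywords are added.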


CHAPTER 2

Study Designs

DEFINITION

A study design is a plan for conducting a study that allows the
researcher to translate a conceptual hypothesis into an operational
one. It is the method of data collection with respect to time, exposure
and outcome (Fig. 2.1).
The selection of a study design depends upon the research
objective and hypothesis. The researcher should know and use the
study design that best matches the objective.

TYPES OF EPIDEMIOLOGICAL STUDY DESIGNS

Epidemiological study designs are classified as follows (Flow chart 2.1):


Descriptive or observational designs for generating hypothesis:
Case report
Case series
Cross-sectional studies.

Figure 2.1 Study designs with respect to time

Flow chart 2.1 Types of epidemiological study designs

Analytical or observational designs for generating/testing hypothesis:
Case control studies
Cohort studies.
Analytical or experimental designs for testing hypothesis:
Randomized controlled clinical trials (gold standard)
Quasi-experimental design.
The difference between hypothesis testing and hypothesis generation is
that in a hypothesis-generating study only an association between an
exposure and an outcome can be established, while on the basis of a
hypothesis-testing study one can say with confidence that a certain
exposure causes a certain outcome. The experimental studies (randomized
controlled clinical trials) are the most robust studies and the only
hypothesis-testing studies; hence they are considered the gold standard.
The observational studies are weaker and can only generate a hypothesis.

Epidemiological study designs are broadly divided into two main types,
i.e. descriptive and analytical. In descriptive studies, the researcher
quantifies (as a percentage or as mean ± SD) the distribution of certain
variables in a study population at a point in time (Table 2.1), while
in analytical studies (observational or experimental), the researcher
tests a previously stated hypothesis.
In observational studies, the researcher merely observes what
is happening or what has happened in the past and tries to draw
conclusions based on these observations. In experimental studies,
the investigator assigns an intervention to one of the groups. Another
distinguishing feature of the experimental study is the process of
randomization.
The basic variable that defines a study design is time (Flow
chart 2.1). If both the exposure and the outcome are determined at
one point in time, it is a cross-sectional (descriptive) study. If the
outcome has already occurred and the researcher goes back from the
outcome towards the exposure, it is a case control study, while if
patients are followed from the exposure towards the outcome, it is a
cohort study or an experimental study.
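
As an illustration of how a descriptive study quantifies variables, the sketch below summarizes one continuous variable as mean ± SD and one categorical variable as a percentage. The patient values are hypothetical and the function names are our own; the standard deviation used is the sample standard deviation.

```python
import statistics

def mean_sd(values):
    """Return the mean and the sample standard deviation of a continuous variable."""
    return statistics.mean(values), statistics.stdev(values)

def percentage(flags):
    """Return the percentage of subjects for whom a yes/no characteristic is true."""
    return 100 * sum(flags) / len(flags)

# Hypothetical data for six patients (illustrative values only).
ages = [52, 61, 47, 58, 66, 49]
is_male = [True, False, True, True, False, True]

mean_age, sd_age = mean_sd(ages)
print(f"Age (years): {mean_age:.1f} ± {sd_age:.1f}")
print(f"Male gender: {percentage(is_male):.1f}%")
```

Continuous variables such as age or serum creatinine are reported in the first form, and categorical variables such as gender or race in the second.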

DESCRIPTIVE OBSERVATIONAL STUDIES

These studies are usually carried out in one patient or one group. They
describe an event or a problem with respect to time, place and person.
The researcher usually does not have a hypothesis at the beginning of
the study, though one can be formulated on the basis of the study's
conclusions.
The three types of descriptive observational study
designs are case report, case series and cross-sectional studies.

Case Report
It is a report of a single case of disease, usually with an unexpected
presentation, which typically describes the findings, clinical course
and prognosis of the case. Writing a case report is like writing a
good clinical history of a patient: it includes the presenting features,
clinical signs, laboratory investigations, and the diagnosis reached
after excluding a list of differential diagnoses. A classical example of
a case report from history is that of a congenital anomaly affecting
limbs and digits, reported from Germany in late 1959 (the thalidomide
tragedy). The world had never before heard of or seen such a congenital
anomaly. Cases of this type should be presented as a case report.

Table 2.1: Baseline characteristics of patients with chronic kidney disease
(hypothetical table of a descriptive study design)

Patient characteristics | Mean ± SD or %
Age (years) |
Male gender |
Race |
  Caucasians |
  African-American |
  Asians |
  Others |
Insurance |
  Private |
  HMO |
  Medicare |
  Medical aid |
  None |
Comorbidity index |
  Zero |
  One |
  Two |
  Three |
Cause of CRI |
  Diabetes mellitus |
  Hypertension |
  GN/PKD/IN |
  Other |
Laboratory values |
  Serum creatinine (mg/dL) |
  GFR (mL/min/1.73 m2) |
  BUN (mg/dL) |
  Serum albumin (g/dL) |
  Hct (%) |

Case Series

When several unusual cases with similar conditions are described
in a published report, it is called a case series. A case series does
not include a control group. After the first case report of the
thalidomide tragedy, a case series was published in 1961. Thalidomide
was used for nausea and vomiting in pregnancy in that era; hence more
such maldeveloped children were soon identified, becoming the basis for
a case series.
It was now quite easy to identify thalidomide as the exposure, because
all mothers of children with the outcome (maldevelopment) had used
this drug.

Cross-sectional Studies

In a cross-sectional study, the data are collected at one point in time.
The hallmark of such studies is that there is no follow-up. These
studies are also called prevalence studies as they determine the
burden of disease in a population, e.g. the National Health Survey of
Pakistan on the prevalence of hypertension in Pakistan, or the Pakistan
National Diabetic Survey on the prevalence of diabetes mellitus in
Pakistan.
A survey is a classical example of a cross-sectional study. These
days surveys are also carried out by people other than health
professionals, for example, the media.
In a cross-sectional study, data on both the exposure and the
outcome are determined at the same time. Hence, four groups are formed
in this type of study: those exposed who have the outcome, those exposed
who do not have the outcome, those unexposed who have the outcome, and
those unexposed who do not have the outcome (Flow chart 2.2). Exposure
rates are calculated in each group, so a 2 × 2 table can be constructed,
and these exposure rates are compared.
If a cross-sectional study covers the whole population, it is called
a census.
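
The four groups of a cross-sectional study can be tallied into a 2 × 2 table as sketched below. The survey records are hypothetical, the field names `exposed` and `outcome` and the function names are our own, and the prevalence shown is simply the proportion of all surveyed subjects who have the outcome.

```python
from collections import Counter

def two_by_two(records):
    """Count subjects in each exposure/outcome cell of a 2 x 2 table."""
    counts = Counter((r["exposed"], r["outcome"]) for r in records)
    return {
        "exposed_with_outcome": counts[(True, True)],
        "exposed_without_outcome": counts[(True, False)],
        "unexposed_with_outcome": counts[(False, True)],
        "unexposed_without_outcome": counts[(False, False)],
    }

def prevalence(table):
    """Proportion of all surveyed subjects who have the outcome."""
    with_outcome = table["exposed_with_outcome"] + table["unexposed_with_outcome"]
    return with_outcome / sum(table.values())

# Hypothetical survey of 200 subjects (illustrative counts only).
records = (
    [{"exposed": True, "outcome": True}] * 30
    + [{"exposed": True, "outcome": False}] * 70
    + [{"exposed": False, "outcome": True}] * 10
    + [{"exposed": False, "outcome": False}] * 90
)

table = two_by_two(records)
for cell, count in table.items():
    print(cell, count)
print("Prevalence of outcome:", prevalence(table))
```

Because exposure and outcome are recorded at the same moment, such a table shows how the two are distributed together but says nothing about which came first.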
Flow chart 2.2 Design of a cross-sectional study

A cross-sectional design is not suitable for studying the association
between an exposure and an outcome. With this design it is difficult
for the researcher to establish whether the exposure preceded the
outcome. Ideally, the exposure should always precede the outcome. For
example, suppose the researcher is studying the association between uric
acid level and hypertension, and on analysis finds that most of the
hypertensive patients have hyperuricemia as well. The researcher cannot
say with confidence that hyperuricemia is really an exposure/risk factor
for hypertension (the outcome), because hyperuricemia can cause
hypertension and hypertension is also a risk factor for hyperuricemia.
Hence, temporal association cannot be established in such studies.
Temporal association is one of the first of Hill's criteria for
confirming an association between an exposure and an outcome. It simply
means that there has to be a time period between the exposure and the
outcome, and that the exposure should always precede the outcome. In the
above example, it would have to be shown that a person initially had
hyperuricemia and then, after a period of time, developed hypertension.
Unfortunately, in a cross-sectional study the data on hyperuricemia and
hypertension are collected at the same time, so the study cannot
establish which came first: the chicken-and-egg phenomenon.
Hence, cross-sectional studies are useful for determining the
prevalence of a disease, but are not recommended if the researcher
wants to study an association between an exposure and an outcome.

Advantages

Easy to perform
Prevalence/frequency of the disease can be calculated
Inexpensive as compared to analytical studies
Useful for evaluating diagnostic procedures, e.g. comparing two diagnostic or treatment modalities, or the usefulness of a new diagnostic procedure
Useful for measuring current health status and planning for some health services
Takes less time as compared to analytical studies
The researcher can generate a hypothesis.

Disadvantages

The data on both the exposure to risk factors and the presence or absence of disease are collected simultaneously; hence it is difficult to determine the temporal relationship between a presumed cause and its effect.
Nonresponder bias (in surveys): it is difficult to obtain sufficiently large response rates, as some people are too busy or reluctant to participate.
A hypothesis can be generated, but it is a weak hypothesis that needs to be tested by conducting a further analytical study.

ANALYTICAL OR COMPARATIVE STUDIES

The hallmark of these study designs is that the researcher has at least
two groups (formed on the basis of either exposure or outcome) at the
beginning of the study, together with a follow-up. Such studies are also
called longitudinal studies. Hence, the association between an exposure
and an outcome can be established.
They include:
Observational studies, e.g. case control and cohort study designs
Interventional or experimental studies.

ANALYTICAL OBSERVATIONAL STUDIES

Analytical observational studies include:

Case control study design
Cohort study design (prospective, retrospective, and a combination of retrospective and prospective cohort study).
Such study designs are useful for testing etiological hypotheses. In
each of these studies, the data are analyzed to find out:
Whether any association exists between the exposure/risk factor and the outcome/disease (by calculating the odds ratio in a case control study and the relative risk in a cohort study).
If so, what is the strength of the association between the exposure/risk factor and the outcome/disease under study?
Whether the association between the exposure and the outcome could be due to chance. This is assessed with a test of significance, which yields the p-value.

Case Control Study


Here the two groups are recruited on the basis of their outcome.
The patients who have the outcome in which the researcher is
interested are called cases, while the people who do not have that
outcome are called controls (Flow chart 2.3).
For example, a pediatric researcher wants to study the
association between the use of tap water for drinking and diarrhea.
The hypothesis is that children who drink tap water are more
likely to suffer from diarrhea than those who drink mineral
water. In this example, children who are suffering from diarrhea
will be cases, while those not having diarrhea will be controls. The
exposure in this study is the use of tap water for drinking, while the
outcome is diarrhea.
Cases and controls are questioned, or their medical records are
consulted, regarding past exposure to risk factors. The measure
of association is then calculated, which for a case control study is
the odds ratio.
Flow chart 2.3 Case control study design
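
The odds ratio for the tap water example can be sketched in a few lines of Python. The counts below are hypothetical and chosen only for illustration; they do not come from any real study, and the function name `odds_ratio` is an assumption of this sketch.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio from the four cells of a 2x2 case control table."""
    odds_of_exposure_in_cases = exposed_cases / unexposed_cases
    odds_of_exposure_in_controls = exposed_controls / unexposed_controls
    return odds_of_exposure_in_cases / odds_of_exposure_in_controls


# Hypothetical counts, not taken from any real study.
cases_tap_water, cases_mineral_water = 60, 40        # children with diarrhea
controls_tap_water, controls_mineral_water = 30, 70  # children without diarrhea

or_tap_water = odds_ratio(cases_tap_water, cases_mineral_water,
                          controls_tap_water, controls_mineral_water)
print(f"Odds ratio = {or_tap_water:.2f}")
```

With these made-up counts the odds of having used tap water are 3.5 times higher among cases than among controls. An odds ratio of 1 would indicate no association.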

Advantages

Multiple exposures for a single outcome can be studied
Inexpensive as compared to other analytical study designs
No need for follow-up
Takes less time than other analytical study designs
Recommended for problems that have a long incubation period, such as cancers
Recommended for studies on rare diseases
Recommended for investigating a preliminary hypothesis.

Disadvantages
Recall bias is the main problem, as cases are more likely than controls to recall past exposure. Similarly, if the researcher is working with geriatric patients, recall can be a problem in both cases and controls, as respondents may have poor memory due to old age. For example, in a study of the association between smoking cigarettes for ten years and the development of lung cancer, some participants may have difficulty recalling whether or not they have smoked for ten years.
Selection bias is another problem if the cases and controls are not properly selected. Here are two examples of selection bias, from studies carried out at leading tertiary care centers by two very eminent researchers of their time.

Study 1
In 1929, Raymond Pearl at Johns Hopkins, Baltimore conducted a
study to test the hypothesis that tuberculosis (TB) protected against
cancer. He selected 816 cases of cancer from 7500 consecutive
autopsies. He also selected 816 controls from among the others on whom
autopsies had been carried out at Johns Hopkins. Of the 816 cases
(with cancer), 6.6% had TB. Of the 816 controls (without cancer), 16.3%
had TB. From the finding that the prevalence of TB was considerably
higher in the control group, Pearl concluded that TB was protective
against cancer. At the time of this study, however, TB was one of the
major reasons for hospitalization at Johns Hopkins Hospital. Pearl
thought that the control group's rate of TB would represent the level
of TB in the general population; but because of the way he selected
the controls, they came from a pool that was heavily weighted with
TB. He should have compared the patients with cancer to a group of
patients admitted for some specific diagnosis other than cancer. The
way the controls are selected is a major determinant of whether a
conclusion is valid or not.

Study 2
Coffee-drinking and Cancer of the Pancreas in Women. The cases
(patients with cancer of the pancreas) were white cancer patients
from 11 Boston and Rhode Island hospitals. The controls were
recruited from the gastrointestinal clinics of the same hospitals.
MacMahon found that coffee consumption was greater in cases
than in controls. The controls, however, were patients who had reduced
their coffee consumption on their physicians' advice, so the controls'
level of coffee consumption was not representative of the general
population. When a difference in exposure is observed between
cases and controls we must ask: "Is the level of exposure observed
in the controls really the expected level in the general population?"
In both studies (1 and 2), the researchers drew erroneous conclusions
about the association between an exposure and an outcome because of
improper selection of controls.

Cohort Studies
Cohort means a group of people sharing the same attribute, e.g. all
those who are exposed to the use of tobacco as compared to those
not exposed to the use of tobacco.
In a cohort study design, the two groups are made on the basis of
exposure (i.e. smokers and nonsmokers). These groups are followed
for a specific period of time for the outcome of interest. This study
design is preferred if the researcher aims to determine the incidence
and the risk factors associated with the disease.
There are two types of cohort studies:
1. Prospective Cohort Study or Concurrent Cohort Study
2. Retrospective Cohort Study or Historical Cohort Study

Prospective Cohort Study


In prospective cohort studies the investigators conceive and design
the study, recruit subjects, and collect baseline exposure data from
all subjects, before any of the subjects have developed an outcome
of interest. The subjects are then followed into the future in order
to record the development of an outcome of interest. The follow-up
can be conducted by mail questionnaires, by phone interviews, via
the Internet, or in person with interviews, physical examinations,
and laboratory or imaging tests. For example, if a researcher
investigating the association between cigarette smoking for ten years
or more and lung cancer chooses a prospective cohort design, a study
starting in the year 2013 would end in 2023 (Flow chart 2.4).
The Framingham Heart Study is a good example of a large prospective
cohort study. It is still in progress and aims to identify the risk
factors associated with heart disease.

Flow chart 2.4 Prospective cohort study

Advantages
Multiple outcomes of a single exposure can be studied
Incidence rates can be calculated
It helps in calculating the relative risk and the attributable risk
Temporal association is best studied in a prospective cohort study
It allows the assessment of a dose-response relationship
It helps to accept or to refute the hypothesis with a high degree of validity
Complete control over the data.
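
The incidence, relative risk and attributable risk mentioned above can be illustrated with a short Python sketch. The cohort sizes and case counts are hypothetical. Attributable risk is taken here as the simple risk difference (incidence in the exposed minus incidence in the unexposed); some textbooks instead express it as a percentage of the incidence in the exposed, so this definition is an assumption of the sketch.

```python
def cohort_measures(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Incidence in each group, relative risk and risk difference for a cohort study."""
    incidence_exposed = exposed_cases / exposed_total
    incidence_unexposed = unexposed_cases / unexposed_total
    relative_risk = incidence_exposed / incidence_unexposed
    # Assumption: attributable risk expressed as the risk difference.
    attributable_risk = incidence_exposed - incidence_unexposed
    return incidence_exposed, incidence_unexposed, relative_risk, attributable_risk


# Hypothetical cohort: 1000 smokers (30 develop lung cancer) and
# 2000 nonsmokers (6 develop lung cancer) over the follow-up period.
ie, iu, rr, ar = cohort_measures(30, 1000, 6, 2000)
print(f"Incidence in exposed   = {ie * 1000:.0f} per 1000")
print(f"Incidence in unexposed = {iu * 1000:.0f} per 1000")
print(f"Relative risk          = {rr:.1f}")
print(f"Attributable risk      = {ar * 1000:.0f} per 1000")
```

In this made-up cohort the smokers have ten times the incidence of the nonsmokers, and 27 of every 1000 smokers develop the disease in excess of what would be expected without the exposure.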

Disadvantages

Expensive
Time consuming
Strict follow-up is required
Not suitable for diseases that have a long incubation period
Not suitable for rare diseases
Attrition (loss to follow-up) due to migration or death of the respondents.

Retrospective Cohort Study


Retrospective cohort studies are also called historical cohort studies.
In a prospective cohort study with a long interval between exposure
and outcome, such as the study of cigarette smoking for ten years and
lung cancer, loss to follow-up, the long wait for completion of the
study and finding a funding source are all problems. A retrospective
cohort study is a good way to save money and to complete the study in
a shorter time (Flow chart 2.5).

Flow chart 2.5 Retrospective cohort study

Advantages
Less expensive
Less time consuming
Follow-up data are obtained from records, so follow-up time is saved
The other advantages of cohort studies also apply.

Disadvantages

There is no control over the data: only the variables that were recorded are available, and nothing can be done about missing data.
Sometimes information on a variable of interest is not available.
In a prospective cohort study, the investigators are typically
present from the beginning to the end of the observation period.
However, it is possible to retain the advantages of the cohort
study without the continuous presence of the investigator, and
without a long wait to collect the necessary data, by using a
retrospective cohort study. Although the investigator was not
present when the exposure was first identified, he reconstructs the
exposed and unexposed populations from records, and then proceeds
as though he had been present throughout the study. For example,
if the study of ten years of cigarette smoking and lung cancer were
being done today (year 2013) using a retrospective cohort design,
the investigator would look into records and identify the people
who were smokers in the year 2003. In this manner, he has selected
a cohort who have been exposed to cigarette smoking for ten years.
He would now determine the outcome of lung cancer today (year 2013).
In this way, the retrospective cohort design allows him to complete
in a few months a study that would otherwise have taken ten years.
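
The reconstruction of a cohort from records can be sketched as follows. The record layout (`PatientRecord` and its fields) and the ten records are hypothetical, made up only to show the logic: exposure status is read from the 2003 records, the outcome is read as of 2013, and the two groups are then compared as in any cohort study.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:  # hypothetical record layout
    patient_id: int
    smoker_in_2003: bool        # exposure, taken from old records
    lung_cancer_by_2013: bool   # outcome, determined today


# Hypothetical records, for illustration only.
records = [
    PatientRecord(1, True, True),
    PatientRecord(2, True, False),
    PatientRecord(3, True, False),
    PatientRecord(4, True, True),
    PatientRecord(5, False, False),
    PatientRecord(6, False, False),
    PatientRecord(7, False, True),
    PatientRecord(8, False, False),
    PatientRecord(9, False, False),
    PatientRecord(10, False, False),
]

# Rebuild the exposed and unexposed groups from the 2003 exposure status.
exposed = [r for r in records if r.smoker_in_2003]
unexposed = [r for r in records if not r.smoker_in_2003]


def incidence(cohort):
    """Proportion of the cohort that developed the outcome by 2013."""
    return sum(r.lung_cancer_by_2013 for r in cohort) / len(cohort)


relative_risk = incidence(exposed) / incidence(unexposed)
print(f"Relative risk = {relative_risk:.1f}")
```

A record with a missing exposure or outcome field could not be used in this calculation, which is the lack of control over the data noted among the disadvantages.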

REGISTRIES
In the developed world, researchers have collected data pertaining to
specific diseases in registries, such as the United States Renal Data
System (USRDS) for patients with end-stage renal disease (ESRD). The
USRDS has data on all dialysis patients being dialyzed in any of the 50
states in the US. Any patient who initiates dialysis is immediately
registered in this database, and subsequently the entire follow-up,
including clinical characteristics, laboratory results and medicines,
is recorded continuously for as long as the patient is alive, until the
patient dies or receives a kidney transplant. A researcher interested
in the risk factors associated with ESRD may wish to study patients
who initiated dialysis from 2001 to 2006. Data from this registry may
be used to conduct such a retrospective cohort study.
Data from registries are ideal for retrospective cohort studies.
Clinicians of every specialty should be encouraged to conduct chart
audits to collect data retrospectively on diseases of interest to them.
Unfortunately, hospital records are not well maintained in low
resource settings and, hence, it is difficult to create registries. In the
developed world, many of the studies done are retrospective
cohort studies using registries. We can follow in these footsteps by
improving our inpatient record systems.

INTERVENTIONAL/EXPERIMENTAL STUDIES
Here an intervention or some action is involved, such as the deliberate
administration of a drug in the experimental (study) group and
no intervention in the control group. Later, the outcome of the
experiment is compared between the two groups (Flow chart 2.6).
It thus differs from the observational analytical study designs
in that the experiment is directly under the control of the
investigator, whereas in observational analytical studies the
investigator takes no action and just observes.
There are three key components of an experimental study design:
(1) a pre-post test design, (2) a treatment group and a control group,
and (3) random assignment of study participants.
A pre-post test design requires the collection of data on study
participants' level of performance before the intervention is given
(pre-), and collection of the same data on the same participants
after the intervention has been given (post). Comparing the two sets
of measurements helps to show whether the intervention had a causal
effect.
Flow chart 2.6 Sketch of experimental study design

* Pretest: characteristics measured at baseline.
** Post-test: characteristics measured at the end point of the trial.

To get the true effects of the program or intervention, it is necessary
to have both a treatment group and a control group. As the name
suggests, the treatment group receives the intervention while the
control group does not. It is also important that
both the treatment group and the control group are of adequate size
to be able to determine whether an effect took place or not. While
the size of the sample ought to be determined by specific scientific
methods, a general rule of thumb is that each group ought to have at
least 30 participants.
Finally, it is important to make sure that the treatment group
and the control group are statistically similar. While no two groups
will ever be exactly alike, the best way to ensure that two groups
are comparable is to assign the participants randomly to the
treatment group and the control group. Such random allocation ensures
that any baseline difference between the treatment group and the
control group is due to chance alone, and not to selection bias (Table 2.2).
Randomization is the heart of the clinical trial: every individual
from the reference population has an equal chance of being allocated
to either the study group or the control group.
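
Random assignment can be sketched in Python as below. This is a minimal illustration of simple randomization into two equal groups, assuming 60 hypothetical participant IDs; the function name `randomize` and the fixed seed (used only so that the allocation can be reproduced) are assumptions of the sketch, and real trials often use more elaborate schemes.

```python
import random


def randomize(participant_ids, seed=None):
    """Randomly split participants into treatment and control groups of equal size."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)          # every ordering is equally likely
    half = len(shuffled) // 2
    return {
        "treatment": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }


# Hypothetical participant IDs 1 to 60, giving 30 per group.
groups = randomize(range(1, 61), seed=2013)
print(len(groups["treatment"]), len(groups["control"]))
```

Because the allocation depends only on the shuffle, neither the investigator nor the participant can influence who ends up in which group.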

Table 2.2: Baseline characteristics of coronary artery disease patients treated by medical/surgical therapy

                                        Surgical       Medical
                                        therapy group  therapy group
Characteristics                         (N = 1140)     (N = 1130)     p-value
Age (year)                              61.4 ± 10.0    61.7 ± 9.6     0.54
Sex, no (%)                                                           0.95
  Male                                  974 (85.4)     964 (85.3)
  Female                                165 (14.5)     165 (14.6)
Race or ethnic group, no (%)                                          0.64
  White                                 984 (86.3)     972 (86.0)
  Black                                 55 (4.8)       55 (4.9)
  Hispanic                              66 (5.8)       56 (5.0)
  Others                                34 (3.0)       46 (4.1)
Clinical
Angina (CCS class), no (%)                                            0.24
  0                                     132 (11.6)     146 (12.9)
  I                                     338 (29.6)     339 (30.0)
  II                                    407 (35.7)     423 (37.4)
  III                                   259 (22.7)     219 (19.4)
  Missing data                          3 (<1)         2 (<1)
Duration of angina (months)
  Median                                                              0.53
Episodes/week with exertion or at rest within last month
  Median                                                              0.83
History, no (%)
  Diabetes                              365 (32.0)     395 (35.0)     0.12
  Hypertension                          755 (66.2)     763 (67.5)     0.53
  Congestive heart failure              56 (4.9)       51 (4.5)       0.59
  Cerebrovascular disease               99 (8.7)       100 (8.8)      0.83
  Myocardial infarction                 435 (38.2)     437 (38.7)     0.80
  Previous PCI*                         173 (15.2)     183 (16.2)     0.49
  Coronary artery bypass graft (CABG)   124 (10.9)     124 (11.0)     0.94
Stress test
  Total patients, no (%)                968 (84.9)     974 (86.2)     0.84
  Treadmill tests, no (%)               552 (57.0)     550 (56.5)
  Duration of treadmill test (minute)   6.9 ± 2.6      6.8 ± 2.2      0.43
  Pharmacologic stress, no (%)          415 (42.9)     425 (43.6)
  Echocardiography, no (%)              61 (5.4)       52 (4.6)
  Nuclear imaging, no (%)               683 (70.6)     705 (72.2)     0.59
  Single reversible defects             152 (22.2)     159 (22.6)     0.09
  Multiple reversible defects           441 (66.0)     481 (68.2)     0.09

* PCI: percutaneous coronary intervention

The aim of experimental study designs is to provide scientific
proof of etiological factors/risk factors.
There are three main types of experimental study designs:
1. Clinical Trial or Randomized Controlled Trial, with patients as the
unit of study.
2. Field Trial or Community Intervention Study, with healthy
people as the unit of study.
3. Community Trial, with the entire community as the unit of study.
Table 2.2 is a good example of how evenly balanced the
characteristics of the two groups are in a Randomized Controlled Clinical
Trial (RCCT). All the p-values are statistically nonsignificant, which
means that the characteristics of the two groups are comparable.
This is important in a clinical trial because, unless the two groups
are comparable, their outcomes cannot be compared fairly. If the two
groups are not comparable, as often happens in
an observational study, then the study is comparing
apples with oranges.
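
As an illustration of where such p-values come from, the sex distribution in Table 2.2 (974 males and 165 females in the surgical group; 964 males and 165 females in the medical group) can be tested with a chi-square test. The source does not say which test was used for the table, so the Pearson chi-square test without continuity correction in this sketch is an assumption, and its result need not match the published value exactly.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: surgical and medical therapy groups; columns: male and female (Table 2.2).
chi2, p = chi_square_2x2(974, 165, 964, 165)
print(f"chi-square = {chi2:.3f}, p-value = {p:.2f}")
```

The p-value is far above 0.05, in line with the nonsignificant value reported in the table: the proportion of males is practically the same in the two groups.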

BLINDING
Blinding represents an important, distinct aspect of randomized
controlled trials. The term blinding refers to keeping trial participants,
investigators or assessors (those collecting outcome data) unaware
of an assigned intervention. Blinding is of three types:

Single-blind
Here the participants do not know whether they are assigned to the
study group or the control group. It means that they do not know whether
they are getting the new drug which is under investigation or the
old conventional drug. The investigator, however, knows who is
getting which drug. This type of trial helps to overcome subject variation.

Double-blind
Here neither the investigator (doctor) nor the participant (patient)
knows the group allocation and treatment received. However, the
statistician knows it. The drug is coded before being handed over to the
doctor. This is the type of blinding usually used in practice.

Triple-blind
It goes one step further. The participants, the doctor and the
statistician are all unaware (blind) of the group allocation. Only the
principal investigator is aware of the group allocation and the
treatment allocation.

CONSENT FORM
Since these studies involve human subjects, there are always
ethical issues which cannot be overlooked. Approval from an Ethical
Review Board (ERB) is mandatory. Consent forms are always
required and are scrutinized in detail by the ERB.

INTENT TO TREAT ANALYSIS


In clinical trials, once a patient is randomized to a particular group
he/she will always be analyzed in that group. For example, in a
study on coronary artery disease comparing the outcome (mortality)
between patients who receive medical therapy and those who receive
surgical therapy, a patient who is randomized to the medical therapy
group will be analyzed in this group. If during the trial he has an
acute myocardial infarction and subsequently undergoes CABG surgery,
he will still not be counted in the surgery group, despite the fact
that he has undergone surgical treatment.
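
The difference between analyzing patients by the group they were randomized to and by the treatment they actually received can be shown with a small Python sketch. The eight patient records and the field names are hypothetical; patient 2 represents the crossover described above.

```python
from dataclasses import dataclass


@dataclass
class Patient:  # hypothetical trial record
    patient_id: int
    assigned: str   # group assigned at randomization
    received: str   # treatment actually received
    died: bool


# Hypothetical data, for illustration only.
patients = [
    Patient(1, "medical", "medical", False),
    Patient(2, "medical", "surgical", True),   # crossed over to CABG after an MI
    Patient(3, "medical", "medical", False),
    Patient(4, "medical", "medical", True),
    Patient(5, "surgical", "surgical", False),
    Patient(6, "surgical", "surgical", True),
    Patient(7, "surgical", "surgical", False),
    Patient(8, "surgical", "surgical", False),
]


def mortality_by(patients, key):
    """Mortality in each group, where groups are defined by the given attribute."""
    groups = {}
    for p in patients:
        groups.setdefault(getattr(p, key), []).append(p.died)
    return {group: sum(deaths) / len(deaths) for group, deaths in groups.items()}


intent_to_treat = mortality_by(patients, "assigned")   # the analysis described above
as_treated = mortality_by(patients, "received")        # shown only for contrast
print("Intent to treat:", intent_to_treat)
print("As treated:     ", as_treated)
```

Only the grouping by `assigned` keeps the two groups as they were formed by randomization; regrouping by `received` moves the crossover patient and changes the mortality of both groups.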

QUASI-EXPERIMENTAL STUDIES
In a quasi-experimental study, one characteristic of a true
experiment is missing, either randomization or the use of a separate
control group. A quasi-experimental study, however, always
includes the manipulation of an independent variable, which is the
intervention.
One of the most common quasi-experimental designs uses two
(or more) groups, one of which serves as a control group. Both groups
are observed before as well as after the intervention, to test whether
the intervention has made any difference.

CLINICAL TRIALS AND THEIR PHASES


There are five phases of clinical trials:

Preclinical Phase
The drug is developed and evaluated in cells and animals to assess its
potential effects on the human body.

Phase I Trial
These trials are conducted to determine the recommended dose, the side
effects and the manner in which the drug is processed by the body. Here
just 10 to 20 healthy volunteers are recruited.

Phase II Trial
These are controlled clinical studies in which the drug or treatment
is given to a larger group of people (100 to 300) to evaluate its
effectiveness. These trials further evaluate its safety
and determine the common short-term side effects and risks.

Phase III Trial


These trials are used as a basis for regulatory approval of a new
drug/device, or for a new indication for a marketed product. These
are expanded controlled and uncontrolled trials carried out after
preliminary evidence suggesting effectiveness of the drug has been obtained.
The study drug or treatment is given to large groups of people
(1,000 to 3,000) to confirm its effectiveness, monitor side effects,
compare it to commonly used treatments, and collect information
that will allow the drug or treatment to be used safely. These trials
are intended to gather additional information to evaluate the overall
benefit-risk relationship of the drug and provide an adequate basis for
physician prescription.

Phase IV Trial
This includes post-marketing studies to delineate additional
information including the drug's risks, benefits, optimal use and
long-term side effects.

Post-marketing Surveillance
These involve observational studies such as case reports, cohort
studies or case control studies. Their purpose is to assess drug safety
under the conditions of use in general practice, as opposed to the
conditions under which the drug was tested in phase III trials.

Post-marketing Clinical Trials

Here uncontrolled clinical trials are designed to gain more experience
with efficacy and safety, and to promote use of the drug or device.
This category also includes controlled clinical trials designed to obtain
regulatory approval for a new indication (Phase IIIB).

RESEARCH QUESTIONS AND STUDY TYPES


See Table 2.3.

META-ANALYSIS
A meta-analysis is a particular type of systematic review that
focuses on the numerical results. The main aim of meta-analysis
is to combine the results from individual studies to produce, if
appropriate, an estimate of the overall or average effect of interest
(e.g., the relative risk). The direction and magnitude of this average
effect, together with a consideration of the associated confidence
interval and hypothesis test result, can be used to make decisions
about the therapy under investigation and the management of
patients.
Figure 2.2 shows a hypothetical meta-analysis comparing two
interventions for a certain outcome. Studies A [RR = 0.65 (CI = 0.1 to
0.7); p-value = 0.01] and E [RR = 0.7 (CI = 0.1 to 0.4); p-value = 0.0001]
show that group A is better, while study H [RR = 1.5 (CI = 1.2 to 2.0);
p-value = 0.001] shows that group B is better. The overall effect size is
not significant [RR = 0.75 (95% CI = 0.3 to 1.1); p-value = 0.32], as its
confidence interval includes 1.

Figure 2.2 Hypothetical meta-analysis figure to compare two interventions (group A and group B) for a certain outcome

Table 2.3: Research questions and study types

State of knowledge of the problem: Knowing that a problem exists but knowing little about its characteristics or possible causes
Type of research question: What is the nature/magnitude of the problem? Who is affected? How do the affected people behave? What do they know, believe, and think about the problem and its causes?
Type of study: Exploratory studies, or descriptive studies: descriptive case studies; cross sectional studies

State of knowledge of the problem: Suspecting that certain factors contribute to the problem
Type of research question: Are certain factors indeed associated with the problem? (e.g. Is lack of preschool education related to low school performance? Is low fiber diet related to carcinoma of the large intestine?)
Type of study: Analytical (comparative) studies: cross sectional comparative studies; case control studies; cohort studies

State of knowledge of the problem: Having established that certain factors are associated with the problem: desiring to establish the extent to which a particular factor causes or contributes to the problem
Type of research question: What is the cause of the problem? Will the removal of a particular factor prevent or reduce the problem? (e.g. stopping smoking, providing safe water)
Type of study: Cohort studies; experimental or quasi-experimental studies

State of knowledge of the problem: Having sufficient knowledge about cause(s) to develop and assess an intervention that would prevent, control or solve the problem
Type of research question: What is the effect of a particular intervention/strategy? (e.g. treating with a particular drug, or being exposed to a certain type of health education). Which of two alternate strategies gives better results? Which strategy is most cost effective?
Type of study: Experimental or quasi-experimental studies

Statistical Approach
We decide on the effect of interest and, if the raw data is available,
evaluate it for each study. However, in practice, we may have to
extract these effects from published results. For example, if the
outcome in a clinical trial comparing two treatments is numerical
the effect may be the difference in treatment means. A zero difference
implies no treatment effect. Similarly, if the outcome is binary (e.g.
died/survived) we consider the risks of the outcome (e.g. death) in
the treatment groups. The effect may be the difference in the risks
or their ratio, the RR. If the difference in risks equals zero or RR = 1,
then there is no treatment effect.
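
The effect measures for a binary outcome, and one common way of combining them across studies, can be sketched in Python. The text does not specify a pooling method, so the fixed-effect inverse-variance method on the log relative risk used here is an assumption, and the three studies are hypothetical counts rather than the studies of Figure 2.2.

```python
import math


def risk_measures(events_a, total_a, events_b, total_b):
    """Risk difference and relative risk of group A compared with group B."""
    risk_a = events_a / total_a
    risk_b = events_b / total_b
    return risk_a - risk_b, risk_a / risk_b


def pooled_relative_risk(studies):
    """Fixed-effect (inverse-variance) pooled RR with an approximate 95% CI.

    Each study is a tuple (events_a, total_a, events_b, total_b).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for events_a, total_a, events_b, total_b in studies:
        log_rr = math.log((events_a / total_a) / (events_b / total_b))
        # Approximate variance of the log relative risk.
        variance = 1 / events_a - 1 / total_a + 1 / events_b - 1 / total_b
        weight = 1 / variance            # more precise studies get more weight
        weighted_sum += weight * log_rr
        weight_total += weight
    pooled_log_rr = weighted_sum / weight_total
    standard_error = math.sqrt(1 / weight_total)
    lower = math.exp(pooled_log_rr - 1.96 * standard_error)
    upper = math.exp(pooled_log_rr + 1.96 * standard_error)
    return math.exp(pooled_log_rr), lower, upper


# Hypothetical studies: (deaths in A, patients in A, deaths in B, patients in B).
studies = [(15, 100, 25, 100), (30, 200, 40, 200), (12, 150, 18, 150)]

rd, rr = risk_measures(*studies[0])
pooled, lower, upper = pooled_relative_risk(studies)
print(f"Study 1: risk difference = {rd:.2f}, RR = {rr:.2f}")
print(f"Pooled RR = {pooled:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

If the pooled confidence interval includes 1, the overall effect is not statistically significant, which is how the overall result in Figure 2.2 is read.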


CHAPTER 3

Sampling Procedure

POPULATION

A major purpose of research is to infer, or generalize, findings from
a sample to a target population. Population is the term statisticians
use to describe a large set or collection of items that have something
in common (e.g. all pregnant women, all pregnant women in the third
trimester, all anemic pregnant women in the third trimester, etc.).
In medicine, population generally refers to patients
or other living organisms, but the term can also be used to denote
collections of inanimate objects, such as autopsy reports, X-ray
reports, or birth certificates.
Figure 3.1 shows the relationship among target population, study
population and sample. The target population is the population of ultimate
clinical interest, about which the researcher aims to draw a conclusion.
On account of cost and other practical issues, the entire target
population cannot be studied. The study population is a subset of the
target population that can be studied. Samples are subsets of study
populations investigated in clinical research because often not every
individual in a study population can be measured.
A sample is a subset of the population that should share its inherent
qualities. Studies are conducted on samples, but inference is made about
the target population. That is why it is important that the sample should
be truly representative of the target population. Hence, the selected
elements should be properly approached, recruited into the study and
interviewed. Selection of the sample is critical; otherwise, the
research findings might not be valid.
It is vital to have a clear understanding of the terms population
and sample; these two terms must not be used interchangeably.

Figure 3.1 Relationship among target population, study population and sample

REASONS FOR SAMPLING


It is reasonable and practical to collect information from a sample
rather than from the whole population. The reasons for sampling are
listed below:
Samples can be studied more quickly than populations
A study of a sample is less expensive than that of an entire population
A study of an entire population is impossible in most situations
Results based on a sample are often more accurate than results based on a whole population, because more time and resources can be devoted to collecting each measurement carefully
If samples are properly selected, probability methods can be used to estimate the error in the resulting statistics
Samples can be selected to reduce heterogeneity.

SAMPLING TECHNIQUES
Broadly, there are two types of sampling techniques (Table 3.1):
1. Probability sampling techniques.
2. Nonprobability sampling techniques.
Table 3.1: Different sampling techniques

Probability sampling              Nonprobability sampling
1. Simple random sampling         1. Consecutive sampling
2. Systematic random sampling     2. Convenience sampling
3. Cluster sampling               3. Purposive sampling
4. Stratified random sampling     4. Quota sampling
                                  5. Snowball sampling

In a probability sampling technique, each participant in a study
population has an equal (or at least a known) chance of being
selected. The method protects the research from selection bias and
makes it likely that the sample is representative of the population.
Importantly, it allows the researcher to make meaningful statistical
estimates while analyzing the results of the research. In a
nonprobability technique, participants do not have an equal or known
chance of being selected.

Probability Sampling Techniques


Simple Random Sampling
Simple random sampling is the simplest method of probability
sampling. In this type of sampling technique each individual within
the sampling frame has an equal chance of inclusion in the sample. A
common example, sometimes called the lottery method, is
illustrated in Figure 3.2.

Figure 3.2 Lottery sampling technique

For example, suppose 100 participants are available for a study and
25 of them have to be selected (the sample size). The participants to
be recruited are selected randomly by drawing chits bearing the
names/ID numbers of the 100 individuals. At each draw, every
individual still in the sampling frame has the same probability of
being selected: 1/100 for the first draw, 1/99 for the second, 1/98
for the third, and so on. Overall, each participant therefore has the
same probability, 25/100, of being included in the sample.
The recommended way to select a simple random sample is to
use a table of random numbers, or a computer-generated list of
random numbers. For this approach each participant should have
an identification number (ID); the list of ID numbers is called the
sampling frame.
The steps of simple random sampling are as follows:
Prepare the sampling frame (assign a number to each element) of the whole population [participants are numbered from 1 to 100].
Determine the sample size [estimated sample size is 25].
Randomly select the elements [any 25 numbers are picked from 1 to 100].
OR
If using a computer-generated list to randomly select the participants:
Enter the lowest ID number (i.e. in this case 001)
Enter the highest ID number (i.e. in this case 100)
Enter the estimated sample size as 25
The randomization software will generate a table of randomly selected participants/ID numbers (Fig. 3.3).
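
The computer-generated selection described above can be sketched with Python's standard `random` module. The fixed seed is an assumption made only so that the same list can be reproduced; in practice any randomization software that draws without replacement would serve.

```python
import random

lowest_id, highest_id = 1, 100   # IDs 001 to 100 form the sampling frame
sample_size = 25

sampling_frame = list(range(lowest_id, highest_id + 1))

# random.sample draws without replacement, so no ID can be selected twice
# and every ID has the same chance (25/100) of being included.
rng = random.Random(42)
selected_ids = sorted(rng.sample(sampling_frame, sample_size))

print([f"{participant_id:03d}" for participant_id in selected_ids])
```

The printed list plays the role of the table of randomly selected ID numbers.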

Systematic Random Sampling


In the systematic sampling technique, study participants are selected at regular intervals from a sampling frame (Fig. 3.4).
First estimate the population size (N) and calculate the required sample size (n).
Now divide the population size by the sample size, i.e. N/n. This gives the sampling interval (k). In the above study example, the number of individuals was 100 and the required sample size was 25; hence 100/25 is 4, and so every 4th X-ray should be selected.

Figure 3.3 Random selection of 25 participants represented by bold

Figure 3.4 Systematic random sampling (Every 3rd selected)

The first element is selected randomly from the 1st to the kth element (i.e. in the above example from 1 to 4). Then every kth element is selected until the researcher achieves the required sample size. For example, in Figure 3.5 the second individual in the study population is selected at random and then every fourth individual is selected (i.e. 6th, 10th, 14th, etc.).

Figure 3.5 Systematic random sampling (Every 4th selected)
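The systematic procedure can be illustrated with a short Python sketch. It assumes, as in the example, that the population size is an exact multiple of the sample size; the function name `systematic_sample` and its `start` and `seed` arguments are hypothetical names introduced here for illustration.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select every kth ID after a random start between 1 and k."""
    k = population_size // sample_size  # sampling interval, N/n
    if start is None:
        # Random start among the first k elements of the sampling frame.
        start = random.Random(seed).randint(1, k)
    return [start + i * k for i in range(sample_size)]


# Example from the text: N = 100, n = 25, so k = 4.
# With a start at the 2nd individual the selection is 2, 6, 10, 14, ...
sample_from_second = systematic_sample(100, 25, start=2)
print(sample_from_second[:4])

# With a random start the first ID is any of 1, 2, 3 or 4.
random_start_sample = systematic_sample(100, 25, seed=7)
print(random_start_sample)
```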

Stratified Random Sampling


Stratified random sampling is a technique that divides the population into sub-groups (strata) based on, for example, gender, age group or ethnicity (Fig. 3.6); any of the random sampling techniques is then employed to randomly select participants from each group (Fig. 3.7). Suppose a population consists of more females than males. Whatever random technique is employed, females are likely to constitute a greater proportion of the sample than males. This problem can be overcome by utilizing stratified random sampling.
For example, a population consists of 60 individuals and the researcher wants equal representation of all the strata based on ethnicity. Firstly, the population is stratified according to ethnicity (i.e. Caucasian, African-American and Hispanic-American). There are 30 Caucasians, 20 African-Americans and 10 Hispanic-Americans. As the researcher wants to select 15 participants, each stratum must contribute 5 participants. Finally, 5 participants are randomly selected from each stratum.
One of the main purposes of stratified sampling is to compare different strata (e.g. males with females, different age groups, etc.), which may not be possible with simple random sampling alone.
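The ethnicity example can be sketched in Python as follows. The ID ranges assigned to each stratum and the function name `stratified_sample` are assumptions made purely for illustration; only the stratum sizes (30, 20 and 10) and the 5 participants per stratum come from the text.

```python
import random


def stratified_sample(strata, per_stratum, seed=None):
    """Draw the same number of participants at random from every stratum."""
    rng = random.Random(seed)
    return {
        name: sorted(rng.sample(members, per_stratum))
        for name, members in strata.items()
    }


# Hypothetical ID numbers for the 60 individuals, grouped by ethnicity.
strata = {
    "Caucasian": list(range(1, 31)),           # 30 individuals
    "African-American": list(range(31, 51)),   # 20 individuals
    "Hispanic-American": list(range(51, 61)),  # 10 individuals
}

# 15 participants in total, i.e. 5 from each stratum.
stratified = stratified_sample(strata, per_stratum=5, seed=1)
for name, ids in stratified.items():
    print(name, ids)
```

Because 5 participants are drawn from each stratum regardless of its size, the smallest group is represented as well as the largest, which simple random sampling would not guarantee.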

Figure 3.6 Stratified random sampling technique (Individuals in each strata)

Figure 3.7 Stratified random sampling technique (Participants selected from each strata represented by bold headed stickman)

Cluster Sampling
In the cluster sampling technique, a sub-group of the population is used as the sampling unit instead of individuals. It is a probability sampling technique, employed when the researcher aims to select participants from a large geographical area, i.e. a country, province, state or city (Flow chart 3.1). Suppose the city of Karachi consists of 18 towns and each town consists of 10 union councils. Initially, 5 towns are selected by any of the random sampling methods. Later, 4 union councils are randomly selected from each town. Finally, houses are randomly selected from the resulting 20 union councils. Thus, in this type of sampling method, households are the sampling unit instead of individual residents.

Flow chart 3.1 Cluster random sampling technique
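The first two stages of the Karachi example can be sketched in Python. The town and union council numbers are hypothetical labels, and `two_stage_cluster_sample` is a name introduced here for illustration; the final stage (random selection of houses within each union council) would follow the same pattern once a list of houses is available.

```python
import random


def two_stage_cluster_sample(n_towns=18, councils_per_town=10,
                             towns_to_pick=5, councils_to_pick=4, seed=None):
    """Randomly pick towns, then union councils within each picked town."""
    rng = random.Random(seed)
    # Stage 1: select towns at random.
    towns = sorted(rng.sample(range(1, n_towns + 1), towns_to_pick))
    # Stage 2: within each selected town, select union councils at random.
    return {
        town: sorted(rng.sample(range(1, councils_per_town + 1), councils_to_pick))
        for town in towns
    }


clusters = two_stage_cluster_sample(seed=3)
for town, councils in clusters.items():
    print(f"Town {town}: union councils {councils}")

total_councils = sum(len(councils) for councils in clusters.values())
print(total_councils)  # 5 towns x 4 union councils = 20
```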

Nonprobability Sampling Techniques


Consecutive Sampling
It involves the sequential selection of all accessible eligible participants who meet the selection criteria. Participants selected in a consecutive manner are likely to resemble the wider group of eligible participants who meet the inclusion and exclusion criteria for the study. Suppose a strategy is devised to recruit 100 patients (the estimated sample size) who satisfy the selection criteria and are seen in a Nephrology clinic from Monday to Friday between 9.00 am and 12.00 pm. The first 100 patients who meet the eligibility criteria and attend the outpatient clinic during these days and timings will be recruited in the study. This method is the best among the nonprobability sampling techniques, as it minimizes selection bias by recruiting the complete accessible population within the limits of the estimated sample size and the selection criteria.

Convenience Sampling
Convenience sampling is presumed to be the most commonly used technique in clinical research. It involves the selection of subjects that are conveniently accessible to the researcher. Suppose a researcher working as a professor of nephrology aims to assess the communication skills of postgraduate trainees. The sample of 20 postgraduate trainees will most likely be 20 postgraduate trainees in the nephrology ward who volunteered for the study. These participants are selected because it is feasible for the investigator, who works in the nephrology ward, to recruit them. The method is easy, fast and less expensive, but the sample is not representative of the larger overall population, thus introducing selection bias into the research.

Purposive Sampling

Purposive sampling is also called judgmental sampling. The technique is criticized for introducing selection bias into the research, as the researcher recruits participants based on a pre-existing belief that certain subjects will be more likely to benefit, comply or respond in a certain way. Thus, the researcher selects study participants with a particular purpose in mind.
For example, suppose the researcher wants to test the hypothesis that Pakistani females have better knowledge regarding medical research than American females. Pakistani female medical students (a group that has a better understanding of medical research than other women) and American females who came to the market for shopping are selected. As the two groups are noncomparable, the Pakistani females will evidently display better knowledge regarding medical research, which might not be the case in the general population. Such deviation from the truth is on account of purposive sampling.
Similarly, in a knowledge survey on the mode of transmission of HIV, participants who are relatives of AIDS patients will demonstrate excellent knowledge regarding the transmission modes of HIV. Evidently, the selection of study participants is biased, as the sample is not truly representative of the target population.

Snowball Sampling
The snowball sampling method is employed when study participants are difficult to identify, access or locate. The method is commonly employed to recruit participants from hard-to-reach groups (i.e. sex workers, IV drug users, etc.). The sample is built through chain referrals. Suppose you are investigating the knowledge about contraception among female sex workers. Female sex workers are hard to identify as they are not registered in Pakistan. Thus, one female sex worker will be identified and recruited in the study. Later, the participant will be requested to recommend more sex workers. Each of these will recommend more sex workers. In this way, a sizeable sample may be obtained even for a hard-to-reach group (Flow chart 3.2).

Flow chart 3.2 Snowball sampling technique

Quota Sampling
Quota sampling is a nonprobability sampling method that ensures that a certain number of study participants from different subgroups constitute the sample, so that all these characteristics are represented. Suppose you aim to assess the quality of life among dialysis patients, but you think that socioeconomic status has a strong effect on quality of life in these patients. Thus, you decide to include 25% of respondents from each socioeconomic group (i.e. upper, middle, lower middle and lower). If the estimated sample size is 200, each socioeconomic group will include 50 participants. Thus, the population is first divided into different strata and then any nonprobability sampling technique is applied to select participants from each stratum.

CHAPTER

Variables, Data and its Presentation

VARIABLES AND THEIR TYPES


Variable
A variable is a measurable characteristic of a person, object or phenomenon that can take on different values. A simple example of a variable is a person's age. The variable age can take on different values because a person can be 20 years old, 35 years old, and so on.

Type of Variables
Dependent and Independent Variables
Because health systems research often looks for causal explanations, it is important to distinguish between dependent and independent variables.
The variable that is used to describe or measure the problem under study is called the dependent variable. It represents the output or effect, or is tested to see if it is the effect. A dependent variable is also known as a response variable, outcome variable, or output variable.
The variables that are used to describe or explain the differences in the dependent variable, or to cause changes in the dependent variable, are called the independent (exposure) variables. They represent the inputs or causes, or are tested to see if they are the cause. An independent variable is also known as a predictor variable, explanatory variable, or exposure variable.
For example, in a study of the relationship between smoking and lung cancer, suffering from lung cancer (with the values yes or no) would be the dependent variable and smoking (varying from not smoking to smoking more than three packets a day) would be the independent variable.
Whether a variable is dependent or independent, is determined
by the statement of the problem and the objectives of the study. It is,
therefore, important when designing an analytical study to clearly
indicate which variable is the dependent and which the independent.
If a researcher investigates why people smoke; smoking is the
dependent variable and pressure from peers to smoke could be an
independent variable. In the lung cancer study smoking was the
independent variable, and lung cancer was the outcome.

DATA AND ITS TYPES


Data are values of the observation recorded for variables (e.g. age,
weight, sex).

Types of Data

Data are classified as either qualitative or quantitative (Flow chart 4.1):
1. Qualitative or categorical data.
2. Quantitative or numerical data.

Flow chart 4.1 Classification of data types

* Mutually exclusive means both events cannot occur at the same time (i.e. tossing a coin will result in either head or tail).

Qualitative or Categorical Data


Qualitative or categorical data comprise characteristics which cannot be expressed numerically, like gender, ethnicity, healing, etc.
They are divided into three types:
1. Binary
2. Nominal
3. Ordinal.
Binary data: In binary data, the variables are divided into two
mutually exclusive categories.
Example: Binary data

Variable              Categories
Gender                Male, female

Nominal data: In nominal data, the variables are divided into more
than two mutually exclusive categories. These categories however,
cannot be ordered one above another (as they are not greater or less
than each other).
Example: Nominal data

Variable              Categories
Marital status        Single, married, widowed, separated and divorced
Employment status     Unemployed, self-employed, public employee and Govt. employee
Ordinal data: In ordinal data, the variables are also divided into more
than two mutually exclusive categories, but they can be ordered one
above another, from lowest to highest or vice versa.
Example: Ordinal data

Variable                   Categories
Level of knowledge         Good, average, poor
Level of blood pressure    High, moderate, low

Quantitative or Numerical Data


These are the characteristics which can be expressed numerically
like age, height, weight, blood pressure, hemoglobin, temperature
and number of children in a family.
Discrete and continuous: Quantitative variables can be classified as discrete or continuous.
A discrete variable is one whose values can only be whole numbers. If we are studying the number of children in a family, each child is equal with respect to providing one counting unit. There are no intermediate values between each number.
A continuous variable is one in which there are no gaps in the values of the variable: there is an unlimited number of possible values between any two values on the scale. Thus, if the variable is height measured in inches, then between 4 and 5 inches there can be an infinite number of intermediate values, such as 4.5 and 4.7 inches. Variables such as these are known as continuous variables (their values can occur in fractions or decimals).

TABULATION AND GRAPHICAL PRESENTATION OF DATA

Data once collected should be presented in such a way as to be easily understood. The style of presentation depends, of course, on the type of data. Data can be presented as frequency tables, charts, graphs, etc. Here, we discuss some of the important means of presentation.

Frequency Tables

The most common way of presenting data is to arrange them in the form of tables. A frequency table gives the frequency with which (or the number of times) a particular value appears in the data.
The basic principles of tabulation of data are:
1. The information should be presented in a simple and orderly manner.
2. The table should have a title which must be brief and comprehensive.
3. Rows and columns must have their own captions.
4. The titles of the rows must be entered on the left side of the table while the titles of the columns are on the top row. The rest of the table, constituting the body, contains the numerical values in actual numbers, in percentages or in both forms.
5. The class intervals are usually taken at equal intervals.
6. Standard codes or symbols, if used, should be explained in the footnote.
In frequency Tables 4.1 and 4.2, data are presented in tabular form.

Table 4.1: Systolic blood pressure of 100 patients coming to a tertiary care hospital

Systolic blood pressure (mm Hg)   Frequency (n = 100)   Relative frequency   Cumulative relative frequency
Below 100                         15                    0.15                 0.15
100-120                           25                    0.25                 0.40
121-140                           20                    0.20                 0.60
141-160                           30                    0.30                 0.90
Above 160                         10                    0.10                 1.00
Total                             100                   1.00
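
The relative and cumulative relative frequency columns of Table 4.1 can be derived from the raw frequencies alone. The following Python sketch shows one way to do this; the variable names are illustrative and the class labels are simply copied from the table.

```python
from itertools import accumulate

# Class intervals and frequencies copied from Table 4.1.
classes = ["Below 100", "100-120", "121-140", "141-160", "Above 160"]
frequencies = [15, 25, 20, 30, 10]

total = sum(frequencies)

# Relative frequency = class frequency / total number of observations.
relative = [f / total for f in frequencies]

# Cumulative relative frequency = running total of frequencies / total.
cumulative = [running / total for running in accumulate(frequencies)]

for label, f, r, c in zip(classes, frequencies, relative, cumulative):
    print(f"{label:<10} {f:>4} {r:>6.2f} {c:>6.2f}")
```

The last cumulative relative frequency is always 1.00, which is a quick check that no class has been left out.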

Table 4.2: Clinical presentation of patients coming to medical OPD

Clinical presentation   Frequency   Percentage
Vomiting                30          30.0%
Fever                   25          25.0%
Dyspepsia               20          20.0%
Nausea                  15          15.0%
Headache                10          10.0%
Total                   100         100.0%

Graphs
Another way to summarize and display data is through the use of graphs or pictorial representations, which make the data easier to interpret. Graphs should be designed so that they convey at a single glance the general patterns in a set of data.

Types of Graphs




Bar charts
Pie charts
Histograms
Line graphs
Scatter plots


Figure 4.1 Marital status of respondents

Bar Charts
Bar charts are used for binary, nominal and ordinal (categorical) data and consist of nonadjacent bars. The bars can be vertical or horizontal.

Example: The marital status of 200 respondents who participated in a knowledge, attitude and practice survey regarding dengue fever was as follows: Single 60 (30%), Married 120 (60%) and Divorced 20 (10%). The bar graph is shown in Figure 4.1.
Y axis = Percentage of respondents
X axis = Marital status of respondents.

Pie Charts
Pie charts can also be used to display binary, nominal and ordinal (categorical) data. A pie chart consists of a circular region partitioned into sections, with each section representing a category's share (percentage) of the whole.
Example: Data regarding knowledge of research ethics were collected from 150 postgraduate trainees. The survey showed that 60 (40%) of the respondents were male and 90 (60%) were female. The data are represented in Figure 4.2.

Figure 4.2 Gender distribution of respondents

Figure 4.3 Histogram with normal curve showing distribution of age (in years)

Histograms
A histogram depicts a frequency distribution for quantitative data; it consists of a series of adjacent bars (Fig. 4.3).
Histograms are constructed to represent continuous or quantitative data. Many statistical methods assume that a quantitative variable is approximately normally distributed (bell-shaped curve), and a histogram is a simple way to check this.

Line Graphs
A line graph (also called a time series plot) is appropriate for representing data that vary continuously. It shows the trend of a variable over time. To construct a time series plot, time is placed on the horizontal axis and the variable being measured on the vertical axis, with the points connected by line segments (Fig. 4.4).
Example: The population statistics of the US for the years 1860 to 1950 are given in Table 4.3:

Table 4.3: Population statistics of US population

Year    Population (in millions)
1860    31.4
1870    39.8
1880    50.2
1890    62.9
1900    76.0
1910    92.0
1920    105.7
1930    122.8
1940    131.7
1950    151.1

Figure 4.4 Line graph of US population data


Scatter Plots
A scatter plot represents the relationship between two continuous variables.
Example: Suppose a researcher wishes to identify whether studying for longer hours leads to better scores. A collection of data is given in Table 4.4.
Based on these data, a scatter plot has been constructed as shown in Figure 4.5. (Note: When constructing a scatter plot, do not connect the dots.)
Table 4.4: Data on studying hours and corresponding scores

Participant No.   Study hours   Score
1.                3             80
2.                5             90
3.                2             75
4.                6             80
5.                7             90
6.                1             50
7.                2             65
8.                7             85
9.                1             40
10.               7             100

Figure 4.5 Scatter plot of students' test scores and hours of study


CHAPTER

Biostatistics: Basic

MEASURES OF CENTRAL TENDENCY


Measures of central tendency refer to the summary measures used to describe the most typical value in a set of values. The three most common measures of central tendency are the mean, median and mode.
Mean: The most popular measure of central tendency for a quantitative data set is the arithmetic mean, or simply the mean, of the data set. It is also known as the average. It is calculated by adding all the observations and dividing by the total number of observations. The sample mean is denoted by $\bar{x}$ (pronounced x bar) and the population mean is denoted by $\mu$ (the Greek letter mu). Note that the mean can only be calculated for quantitative data.
Median: The median is an important measure of central tendency.
It is the value that divides a distribution into two equal halves. We
arrange the observations in order from smallest to largest value
or vice versa. If there are an odd number of total observations,
the median is the middle value. If there is an even number of total
observations, the median is the average of the two middle values.
The median is useful when some measurements are much bigger
or much smaller than the rest. The mean of such data will be biased
toward these extreme values while the median is not influenced by
extreme values.
Mode: The mode is the most frequently occurring value in a set of
observations.

Example of Mean, Median and Mode


Suppose we draw a sample of five women and measure their weights. They weigh 110 pounds, 110 pounds, 140 pounds, 150 pounds, and 160 pounds.
The mean weight would equal (110 + 110 + 140 + 150 + 160)/5 = 670/5 = 134 pounds.
The median value would be 140 pounds, since 140 pounds is the middle weight.
The mode of the data set is 110 pounds, since it is the most frequent value (occurring twice).
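The three measures for this sample can be checked with Python's standard `statistics` module. This is a small illustrative sketch; the variable names are chosen here and are not part of the original example.

```python
import statistics

# Weights (in pounds) of the five women in the example.
weights = [110, 110, 140, 150, 160]

mean_weight = statistics.mean(weights)      # sum of observations / number of observations
median_weight = statistics.median(weights)  # middle value of the ordered observations
mode_weight = statistics.mode(weights)      # most frequently occurring value

print(mean_weight, median_weight, mode_weight)
```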

Mean Versus Median

The median may be a better indicator of the most typical value if a set of scores has an outlier. An outlier is an extreme value that differs greatly from the other values, i.e. a score that is much above or below the mean.
For example, suppose that in the above-mentioned data one individual has a weight of 250 lbs (the weight of 160 lbs is replaced by 250 lbs). This will be an extreme value, i.e. an outlier, and will affect the mean value.
Mean = (110 + 110 + 140 + 150 + 250)/5 = 760/5
Mean = 152 pounds
On account of the 250 pound value, the mean is much higher than most readings in the data set. Hence, in such cases the median, which continues to be 140, should be reported.
However, when the sample size is large and does not include outliers, the mean usually provides a better measure of central tendency.
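The effect of the single outlier on the mean, and its lack of effect on the median, can be verified with a short Python sketch (variable names are illustrative only):

```python
import statistics

original = [110, 110, 140, 150, 160]
with_outlier = [110, 110, 140, 150, 250]  # 160 lbs replaced by 250 lbs

mean_before = statistics.mean(original)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(original)
median_after = statistics.median(with_outlier)

# The mean is pulled toward the extreme value; the median is unchanged.
print(mean_before, mean_after)
print(median_before, median_after)
```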

MEASURES OF VARIATION

These are the measures that describe the amount of variability or spread in a set of data. The most common measures of variability are the range, variance, and standard deviation.
Range is the simplest measure of variability. It is defined as the difference in value between the highest (maximum) and the lowest (minimum) observation in the data set. For example, consider the following women's weights in the data set: 110 pounds, 110 pounds, 140 pounds, 150 pounds, and 160 pounds. The range would be 160 - 110 = 50 lbs.
Variance quantifies the amount of variability or spread about the mean of the sample.

For instance, the women's weights in the above example were 110, 110, 140, 150 and 160 pounds, and the mean weight was 134 pounds.
Variance ($s^2$) $= \sum (x_i - \bar{x})^2 / (n - 1)$
Where $x_i$ = individual sample observation
$\bar{x}$ = sample mean
$n$ = total sample size
$\sum (x_i - \bar{x})^2$ = sum of the squared differences between each individual sample observation and the sample mean
Example:
$s^2 = [(110 - 134)^2 + (110 - 134)^2 + (140 - 134)^2 + (150 - 134)^2 + (160 - 134)^2] / (5 - 1)$
$s^2 = [(-24)^2 + (-24)^2 + (6)^2 + (16)^2 + (26)^2] / 4$
$s^2 = [576 + 576 + 36 + 256 + 676] / 4 = 2120 / 4$
$s^2 = 530$
Standard deviation is the square root of the variance. The standard deviation is a measure which describes how much individual measurements differ, on average, from the mean.
$SD = \sqrt{s^2}$
$SD = \sqrt{530} = 23.02$
The same results can easily be obtained by SPSS (statistical
package).
Below is the SPSS output showing central tendency and variation
of above data set.
Statistic                     Value
N (Number of observations)    5
Mean                          134
Median                        140.00
Mode                          110.00
Standard deviation            23.02
Variance                      530.00
Range                         50.00

A large standard deviation reflects a wide scatter of measured values around the mean, while a small standard deviation reflects that the individual values are concentrated around the mean with little variation among them.
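
For readers without access to SPSS, the range, variance and standard deviation of the five weights can also be reproduced with a few lines of Python. This is a minimal sketch that follows the sample variance formula with the (n - 1) denominator; the variable names are illustrative.

```python
import math

# Weights (in pounds) of the five women in the example.
weights = [110, 110, 140, 150, 160]

n = len(weights)
mean = sum(weights) / n

# Sum of squared differences between each observation and the sample mean.
sum_of_squares = sum((x - mean) ** 2 for x in weights)

variance = sum_of_squares / (n - 1)           # sample variance
standard_deviation = math.sqrt(variance)      # square root of the variance
value_range = max(weights) - min(weights)     # maximum minus minimum

print(f"Variance = {variance:.2f}")
print(f"Standard deviation = {standard_deviation:.2f}")
print(f"Range = {value_range}")
```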


STANDARD ERROR OF MEAN

When we draw a sample from the study population and compute its sample mean, it is not likely to be identical to the population mean. If we draw another sample from the same population and compute its sample mean, this may also not be identical to the first sample mean. It might also differ from the true mean of the total population from which the sample was drawn; this phenomenon is called sampling variation.
The standard error of the mean gives an estimate of the degree to which the sample mean ($\bar{x}$) varies from the population mean ($\mu$). This measure is used to calculate the confidence interval (CI), which is discussed in the next chapter.
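The text does not give a formula at this point; the standard estimate of the standard error of the mean is the sample standard deviation divided by the square root of the sample size. The Python sketch below applies that formula to the five weights used earlier in the chapter, purely as an illustration.

```python
import math
import statistics

# Weights (in pounds) of the five women used earlier in the chapter.
weights = [110, 110, 140, 150, 160]

n = len(weights)
sample_sd = statistics.stdev(weights)      # sample standard deviation (n - 1 denominator)
standard_error = sample_sd / math.sqrt(n)  # standard error of the mean

print(f"SD = {sample_sd:.2f}, n = {n}, SE of mean = {standard_error:.2f}")
```

Because the sample size appears in the denominator, a larger sample with the same standard deviation gives a smaller standard error, i.e. a sample mean that is expected to lie closer to the population mean.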

NORMAL DISTRIBUTION

A normal distribution, such as the distribution shown in Figures 5.1A and B, is classically a bell-shaped curve. Most of the values are clustered near the mean and a few values are near the tails. The normal distribution is symmetrical around the mean. If the variable is normally distributed, then the mean, median and mode will be approximately equal.
An important characteristic of a normally distributed variable is that 95% of the measurements have values which are approximately within 2 standard deviations (SD) of the mean (Fig. 5.1B).
When the area under the normal curve is divided into sections by standard deviations above and below the mean, the area in each section is a known quantity. For example, 34 percent of all the values of a normally distributed variable lie between the mean and one standard deviation above it. It also means that there is a 0.34 chance that a value drawn at random from the distribution will lie between these two points. Similarly, 34 percent of all the values lie between the mean and one standard deviation below it, with the same 0.34 chance for a randomly drawn value. Consequently, 68 percent of all the values of a normally distributed variable lie within one standard deviation on either side of the mean.
Sections of the curve above and below the mean may be added together to find the probability of obtaining a value within (plus or minus) a given number of standard deviations of the mean. For example, the amount of curve area between one standard deviation above the mean and one standard deviation below is 0.34 + 0.34 = 0.68, which means that approximately 68 percent of the values lie in that range. Similarly, about 95 percent of the values lie within two standard deviations, while 99.7 percent of the values lie within three standard deviations of the mean (Fig. 5.1B).

Figures 5.1A and B Proportion of cases under portion of the normal curve

Example: Suppose that, for a study on 300 chronic kidney disease (CKD) patients, the hemoglobin levels were obtained and plotted. The data are normally distributed, with the mean Hb and standard deviation calculated as 7 g/dL and 1.0 g/dL, respectively.

Figure 5.2 Hemoglobin level of 300 CKD patients

Thus, out of 300 CKD patients (Fig. 5.2):
68% (204) will have hemoglobin within the range of 6.0 g/dL to 8.0 g/dL (within one standard deviation of the mean).
95% (285) will have hemoglobin within the range of 5.0 g/dL to 9.0 g/dL (within two standard deviations of the mean).
99.7% (299) will have hemoglobin within the range of 4.0 g/dL to 10.0 g/dL (within three standard deviations of the mean).
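
The 68, 95 and 99.7 percent figures are rounded values of the exact areas under the normal curve. The Python sketch below computes the exact proportions for the hemoglobin example with the standard library's `NormalDist` class; the variable names are illustrative, and the expected patient counts it prints differ slightly from those in the text because the text uses the rounded percentages.

```python
from statistics import NormalDist

# Hemoglobin example: mean 7 and standard deviation 1 (units as in the text).
hemoglobin = NormalDist(mu=7.0, sigma=1.0)
n_patients = 300

proportions = {}
for k in (1, 2, 3):
    lower = hemoglobin.mean - k * hemoglobin.stdev
    upper = hemoglobin.mean + k * hemoglobin.stdev
    # Area under the curve between the two limits.
    proportions[k] = hemoglobin.cdf(upper) - hemoglobin.cdf(lower)
    expected = proportions[k] * n_patients
    print(f"Within {k} SD ({lower:.1f} to {upper:.1f}): "
          f"{proportions[k]:.4f} of values, about {expected:.0f} patients")
```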


CHAPTER

Estimation and Hypothesis Testing

Estimation refers to the process by which one makes inferences about a population, based on information obtained from a sample.

POINT ESTIMATE

A point estimate is a specific numerical value used as an estimate of a parameter.
The best point estimate of the population mean ($\mu$) is the sample mean.
But how good is a point estimate?
There is no way of knowing how close the point estimate is to the population mean. Statisticians therefore prefer another type of estimate called an interval estimate.


INTERVAL ESTIMATE

An interval estimate of a parameter is an interval or a range of values used to estimate the parameter (confidence interval).
The confidence level of an interval estimate of a parameter is the probability that the interval estimate will contain the parameter.
Two commonly used confidence levels are 95 percent and 99 percent.
If one desires to be more confident without widening the interval, then the sample size must be large enough.
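
As a rough illustration of these ideas, the Python sketch below computes interval estimates for a mean using the large-sample normal approximation (sample mean plus or minus z times the standard error). The formula is a standard one that is not stated in this section, and the input values (mean 7.0, standard deviation 1.0, n = 300) are assumed here only for the sake of the example; `confidence_interval` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Large-sample interval estimate: mean +/- z * (SD / sqrt(n))."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Assumed illustrative values: mean 7.0, SD 1.0, n = 300.
low95, high95 = confidence_interval(7.0, 1.0, 300, confidence=0.95)
low99, high99 = confidence_interval(7.0, 1.0, 300, confidence=0.99)
print(f"95% CI: {low95:.2f} to {high95:.2f}")
print(f"99% CI: {low99:.2f} to {high99:.2f}")

# Quadrupling the sample size halves the width of the interval.
low_big, high_big = confidence_interval(7.0, 1.0, 1200, confidence=0.95)
print(f"95% CI with n = 1200: {low_big:.2f} to {high_big:.2f}")
```

The 99 percent interval is wider than the 95 percent interval for the same data, and a larger sample narrows the interval at a given confidence level.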


HYPOTHESIS TESTING

What is a Hypothesis?
A hypothesis is a testable theory. Hypothesis testing is the method of testing whether claims or hypotheses regarding a population are likely to be true. For example, such claims could concern the prevalence of an outcome of interest, the mean of an outcome of interest, or the association between a dependent and an independent variable. For example, an investigator can have a hypothesis that the mean Hb of CKD patients on dialysis is 7.5 g/dL. In epidemiological or clinical studies, a hypothesis is an expected or anticipated association between an independent variable and a dependent variable. For example, an investigator aims to look at the association between cigarette smoking (independent variable) and lung cancer (dependent variable). His hypothesis would be that cigarette smokers are more likely to develop lung cancer than nonsmokers.
A hypothesis is thus an assumption formulated regarding a population parameter of interest (i.e. mean or proportion) before the research is conducted. During the course of the research, the researcher endeavors to test the formulated hypothesis.

What is the Need to Test a Hypothesis in Research?

Most often we are interested in finding out the difference between two groups or an association between two variables. An observed difference or association between two groups or variables can be a real difference, or can be attributed to chance.
Hypothesis testing is done to determine whether the observed difference or association is due to chance. If a researcher has been meticulous in controlling bias and confounding, then after hypothesis testing he or she can ascertain the reality of the observed difference or association with a certain degree of confidence. This confidence is expressed as a percentage, which is derived from the criterion at which the hypothesis has been tested.

INTRODUCTION TO THE SCALE OF PROBABILITY

It is worthwhile to understand certain basic concepts related to probability before discussing the steps of hypothesis testing. Probability is defined as the chance of occurrence of an event. It is a common everyday concept which relates to the chances of a particular incident happening, e.g. it is usual to talk about the probability of rain on a given day. The chances of a formulated hypothesis being true or otherwise are also determined in terms of probability.
Probability is measured on a scale of 0 to 1 (zero to one). A probability of zero means that there is absolute certainty of an event not happening or an outcome not appearing, whereas a probability of one means absolute certainty of the occurrence of an event or the appearance of an outcome. Between the absolute values of zero and one there is a whole range of probabilities. The application of the scale of probability to the concept of hypothesis testing will be elaborated subsequently (Fig. 6.1).

Figure 6.1 Probability scale: Range (Zero to One)

TEST OF HYPOTHESIS

Suppose a study is being conducted to answer questions about
differences between two regimens for the management of
diarrhea in children: the sugar-based modern ORS, and the
time-tested indigenous herbal solution made from locally available
herbs.
One question that could be asked is:
In the population is there a difference in overall improvement
(after three days of treatment) between the ORS and the herbal
solution?
There could be only two answers to this question:
1. Yes
2. No


Null Hypothesis (H0)

There is no difference between the 2 regimens in terms of
improvement (null hypothesis). A null hypothesis is usually a statement
that there is no difference between groups or that one factor is not
dependent on another, and corresponds to the No answer.

Alternative Hypothesis (HA)


There is a difference between the improvement achieved by three days
of treatment with the ORS and that achieved with the herbal solution
(alternative hypothesis).



Associated with the null hypothesis there is always another
hypothesis or implied statement concerning the true relationship
among the variables or conditions under study if No is an implausible
answer. This statement is called the alternative hypothesis and
corresponds to the Yes answer.

Types of Alternate Hypothesis

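• Directional
• Nondirectional
A directional hypothesis is one in which the researcher is able to
state explicitly the direction of the relation between the populations,
e.g. that improvement is greater with the ORS than with the herbal
solution. A nondirectional hypothesis states only that a difference
exists, without specifying its direction.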

Steps in Hypothesis Testing

1. State the hypotheses: Every hypothesis test requires the researcher
to state a null hypothesis (H0) and an alternative hypothesis (HA). The
hypotheses are stated in such a way that they are mutually exclusive.
That is, if one is true, the other must be false; and vice versa.
2. Formulate an analysis plan: The analysis plan describes how to
use sample data to accept or reject the null hypothesis. It should
specify the following elements:
• Selection of significance level (α): Often, researchers choose a
significance level equal to 0.01 (1 in 100) or 0.05 (1 in 20). The
significance level is the risk we are willing to take that a sample
which showed a difference was misleading. A 5 percent
significance level means that we are ready to take a 5 percent
chance of wrongly concluding that a difference exists. The
significance level is set prior to the actual testing of the null
hypothesis. If alpha is set at 0.01, then the researcher desires to
be 99 percent confident before rejecting the null hypothesis.
• Choosing a test statistic: t-test or z-test for continuous data,
chi-square test for proportions, etc. The test statistic is computed
from the sample data and is used to determine whether the
null hypothesis should be rejected or retained. The test statistic
generates a p-value.
3. Analyze sample data: Using sample data, perform the computations
called for in the analysis plan.
• p-value: Indicates the probability or likelihood of obtaining a
result at least as extreme as that observed in a study by chance
alone, assuming that there is truly no association between
exposure and the outcome under consideration.
• The final decision to either reject the null hypothesis or not
depends on the p-value.
• By convention the significance level (α) is set at 0.05. Thus, any
value of p less than or equal to 0.05 indicates that there is at
most a 5 percent probability of observing an association
as large as or larger than that found in the study due to chance
alone, given that there is no association between exposure and
outcome. If the p-value is higher than the set value of alpha,
e.g. p-value > 0.05, then we do not reject the null hypothesis
(Fig. 6.2).
4. Interpret the results: If the sample findings are unlikely, given
the null hypothesis, the researcher rejects the null hypothesis.
Typically, this involves comparing the p-value to the significance
level, and rejecting the null hypothesis when the p-value is less
than the significance level.
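
The decision rule in steps 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration, assuming a test statistic that follows the standard normal distribution (a z-test) and a two-sided alternative; the function names and the value z = 2.2 are hypothetical and do not come from the text.

```python
import math

def two_sided_p_value(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic z."""
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

def decide(p_value: float, alpha: float = 0.05) -> str:
    """Compare the p-value with the significance level chosen in step 2."""
    return "reject H0" if p_value <= alpha else "do not reject H0"

z = 2.2  # hypothetical test statistic, for illustration only
p = two_sided_p_value(z)
print(f"z = {z}, p = {p:.4f}, decision: {decide(p)}")
```

The significance level is passed in as a parameter because it has to be fixed before the data are analyzed; the p-value is then the only quantity that depends on the sample.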


Figure 6.2 Level of significance (α = 0.05) for hypothesis testing


DECISION ERRORS


Two types of errors can result from a hypothesis test.


1. Type I error: A type I error occurs when the researcher rejects a
null hypothesis that is true. The probability of committing a
type I error is called the significance level. This probability is also
called alpha, and is often denoted by α. Thus, if the α of a study is
lowered from 0.05 to 0.01, the maximum chance of committing a
type I error also reduces from 5 to 1 percent.
Suppose a researcher wants to compare the mean ages of males
and females in a class of final year students. The null hypothesis for
this research is that there is no difference in the mean age of males
and females in this class. For some reason (e.g. a small sample
size, an inappropriate statistical analysis technique, etc.) the
p-value is calculated as 0.01 (less than 0.05 and thus significant at
the 5 percent significance level). As a result, the researcher rejects
a true null hypothesis and thus makes a type I error.
In this example, there was no true difference in the mean ages of
males and females, as they are students in the same class, and the
null hypothesis was true.
Similarly, this type of error can happen in a court of law: if a
judge trying a case sends an innocent person behind bars, he has
committed a type I error. The type I error is regarded as the more
serious of the two errors; both researcher and judge make every
effort not to commit it, which is why α is kept low
(α = 0.05).
2. Type II error: A type II error occurs when the researcher accepts
(fails to reject) a null hypothesis that is false. The probability of
committing a type II error is called beta, and is often denoted by β.
The probability of not committing a type II error is called the power
of the test. The power is generally kept at 80 percent and is given
by 1 − β. The level of significance and the power of a study play a
crucial role in sample size determination.
Suppose a researcher aims to compare the mean Hb of chronic
kidney disease (CKD) patients with that of the normal population.
The null hypothesis is that there is no difference in the mean Hb of
CKD patients and the normal population. Suppose that the mean
hemoglobin of the sample of CKD patients was 7 g/dL, and that
of the normal population was 12 g/dL. If in this study the sample
size is inadequate, or because of some inappropriate statistical
technique, the p-value is calculated as 0.15 (greater than 0.05 and
thus nonsignificant at the 5 percent significance level). As a result,
the researcher fails to reject a null hypothesis which was false, thus
making a type II error. In this example, there is a big difference
in hemoglobin levels between the CKD sample and
the normal population. There was a true
difference between the mean hemoglobin of CKD patients and the
normal population and the null hypothesis was false, but because
the sample size was small, the analysis failed to detect a significant
difference. In order to avoid this error, sample size calculation is
carried out at the synopsis stage.
Similarly, this type of error can happen in a court of law: if a
judge trying a case declares a guilty person
innocent and frees him or her, he has committed a type II error. This
can happen when a person who is thought
to be guilty escapes punishment because the court does
not have enough evidence against him. So we can see that in a
court of law having enough evidence is a must to make a decision,
whereas in research the evidence is a large and adequate sample
size. Type II error is considered less important than type I error, but
it should also be avoided by having an adequate sample size with a
minimum power of 80 percent (Table 6.1).
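
The link between sample size, β and power can be illustrated numerically. The sketch below assumes a two-sided one-sample z-test with a known standard deviation; that test, the function name `z_test_power` and the example numbers (a true difference of 1 g/dL, SD of 2 g/dL) are assumptions made for illustration and are not taken from the text.

```python
from statistics import NormalDist

def z_test_power(delta: float, sd: float, n: int, alpha: float = 0.05) -> float:
    """Power (1 - beta) of a two-sided one-sample z-test.

    delta: true difference between the sample and population means
    sd:    known standard deviation
    n:     sample size
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    shift = delta * n ** 0.5 / sd                # true difference in standard-error units
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)

for n in (5, 10, 20, 32, 50):
    power = z_test_power(delta=1.0, sd=2.0, n=n)
    print(f"n = {n:2d}: power = {power:.2f}, beta = {1 - power:.2f}")
```

With no true difference (delta = 0) the function returns α, the type I error rate; as n grows the power rises and β falls, which is why an inadequate sample makes a type II error more likely.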


Table 6.1: α- and β-errors

| Truth in the population (null hypothesis) | Decision: retain the null hypothesis | Decision: reject the null hypothesis |
|---|---|---|
| True | Correct (1 − α) | Type I error (α) |
| False | Type II error (β) | Correct (1 − β; power) |

Simple Explanation of p-value and 95 Percent Confidence Interval

Hypothesis testing is all about the confidence of the researcher in his
results. Having completed his study he is faced with two questions:
1. Are you 100 percent sure about your results?
2. Are you sure the results are not by chance?
The researcher takes the help of the p-value and the 95 percent
confidence interval to show the confidence he has in his result.
He answers the first question by saying that he
cannot be 100 percent sure about his results, but that he is 95 percent
confident that the true value lies within the reported range (95%
confidence interval). Regarding question number two, he seeks help
from the p-value. A p-value of less than 0.05 means that, if there were
truly no association, there would be less than a 5 percent probability
of obtaining results at least as extreme as his by chance alone. The
smaller the p-value, e.g. 0.001, the less likely it is that the results
arose by chance. So p-values and 95 percent
confidence intervals are all about the confidence the researcher has
in his results.
Table 6.2 is a multivariate regression analysis of factors associated
with mortality in dialysis patients.

Table 6.2: Multivariate regression analysis of factors associated with
mortality among end stage renal disease patients

| Variable | Relative risk (RR) | 95% CI | p-value |
|---|---|---|---|
| Age (per year increase) | 1.03 | 1.02, 1.04 | <0.0001 |
| Female (ref = male) | 1.12 | 0.97, 1.31 | 0.14 |
| White (ref = non-white) | 1.51 | 1.27, 1.78 | <0.0001 |
| GFR at initiation of dialysis (per mL/min/1.73 m²) | 1.04 | 1.03, 1.05 | <0.0001 |
| Angina (ref = no) | 1.11 | 1.03, 1.18 | 0.02 |
| Congestive heart failure (ref = no) | 1.12 | 1.04, 1.19 | 0.01 |
| Ambulate independently (ref = no) | 0.48 | 0.39, 0.58 | <0.0001 |
| LR (ref = ER) | 1.66 | 1.30, 2.07 | <0.0001 |

Outcome (dependent variable) for this regression analysis was mortality at
1 year after initiation of dialysis.
Abbreviations: LR: Late referral; ER: Early referral

The interpretation of the independent variable age is that for every
one year of increase in age the risk of mortality increases by 3 percent
(RR = 1.03). The 95 percent confidence interval for the same variable
is 1.02 to 1.04. This means that if the study were repeated many times,
about 95 percent of the confidence intervals so obtained would contain
the true relative risk; here the plausible values range from 1.02 to
1.04. Also, for this variable the limits of the 95 percent confidence
interval are extremely close to the relative risk. Such intervals are
called tight confidence intervals, and they mean that the estimate is
very precise. The p-value for the variable age is less than 0.0001.
This means that the probability of obtaining such a result by chance
alone, if age had no association with mortality, is almost negligible.
So we can see that the researcher takes the aid of the 95 percent
confidence interval and the p-value to be sure that his results are
valid. The next variable in this multivariate regression analysis is
gender. The relative risk of mortality among females compared to males
is 1.12, i.e. 12 percent higher. However, when we look at the p-value
(0.14) and the 95 percent confidence interval (0.97 to 1.31, which
includes 1) for this variable, both are statistically nonsignificant.
The interpretation for the variable gender in this study, considering
the p-value and the 95 percent confidence interval, would be that there
is no statistically significant difference in mortality between males
and females.
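
The reading of Table 6.2 described above can be sketched in code. The relative risks and confidence limits below are copied from Table 6.2; the helper name `excludes_no_effect` is hypothetical, and the rule it applies (a 95 percent confidence interval for a relative risk is statistically significant at the 5 percent level when it does not include 1) is the usual convention rather than something spelled out in the text.

```python
# (variable, relative risk, lower 95% limit, upper 95% limit), from Table 6.2
rows = [
    ("Age (per year increase)", 1.03, 1.02, 1.04),
    ("Female (ref = male)", 1.12, 0.97, 1.31),
    ("White (ref = non-white)", 1.51, 1.27, 1.78),
    ("Ambulate independently (ref = no)", 0.48, 0.39, 0.58),
    ("LR (ref = ER)", 1.66, 1.30, 2.07),
]

def excludes_no_effect(lower: float, upper: float, null_value: float = 1.0) -> bool:
    """True when the interval does not contain the 'no effect' value (RR = 1)."""
    return not (lower <= null_value <= upper)

for name, rr, lower, upper in rows:
    change = (rr - 1) * 100  # percent change in risk relative to the reference
    verdict = "significant" if excludes_no_effect(lower, upper) else "not significant"
    print(f"{name}: RR = {rr} ({change:+.0f}%), 95% CI {lower} to {upper}: {verdict}")
```

Only the row for gender has an interval that straddles 1, which agrees with its p-value of 0.14.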


Solving Hypothesis Testing Problems

The six steps for solving hypothesis testing problems are as
follows:
1. State the hypothesis and identify the claim
2. Choose a significance level α
3. Find the critical value(s)
4. Compute the test value
5. Make the decision to reject or not to reject the null hypothesis
6. State the appropriate conclusion.


Example: The population has a mean Hb level (μ) of 12 g/dL, and an
SD of 2 g/dL. A sample of the population has a mean Hb (x̄) of 7 g/dL.
Is the Hb level of the sample representative of the population mean?

Solution

Step 1: State the hypothesis and identify the claim:
Null: The mean Hb level of the sample equals the population mean, x̄ = μ
Alternate: The mean Hb level of the sample differs from the population mean, x̄ ≠ μ

Step 2: Choose a significance level α:
Alpha = 0.05
The researcher is willing to accept a < 5 percent chance of committing
a type I error (of rejecting a true null hypothesis, by chance).

Step 3: Find the critical value(s)
Two-tailed hypothesis (the alternate hypothesis is nondirectional)
Alpha = 0.05, so the critical values are z = −1.96 and z = +1.96

Step 4: Compute the test value

$$Z = \frac{\bar{x} - \mu}{\text{SD}} = \frac{7 - 12}{2} = -\frac{5}{2} = -2.5$$

Step 5: Make the decision to reject or not to reject the null
hypothesis
Since z = −2.5
And −2.5 is less than −1.96, we reject the null hypothesis

Step 6: State the appropriate conclusion
We reject the null hypothesis of no difference, and conclude that the
mean Hb level of the sample is different from the population mean
(Fig. 6.3).

Interpretation
The z-scores calculated by statisticians for the 2 standard deviation
cut-off points are −1.96 and +1.96. Any z-score less than −1.96 and/or
greater than +1.96 will fall in the region of rejection. In the above
study, −2.5 is smaller than −1.96, and we can see in Figure 6.4 that the
CKD sample mean falls within the region of rejection of the population
mean. Hence, we reject the null hypothesis.

Figure 6.3 Critical regions (the two tails) for rejecting the null hypothesis
(α/2 = 0.025 in each tail)

Figure 6.4 Region of rejection and region of acceptance
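
The worked example can be checked with a short script. This is a sketch that follows the text in dividing by the SD of 2 g/dL and in using a two-tailed test at α = 0.05; the variable names are illustrative, and the critical value is taken from Python's standard normal distribution rather than from a printed table.

```python
from statistics import NormalDist

population_mean = 12.0  # g/dL
population_sd = 2.0     # g/dL
sample_mean = 7.0       # g/dL
alpha = 0.05

# Step 3: two-tailed critical value, cutting off alpha/2 in each tail
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

# Step 4: test value, as in the formula Z = (sample mean - population mean) / SD
z = (sample_mean - population_mean) / population_sd

# Step 5: reject when the test value falls in either tail
reject_null = abs(z) > z_critical

print(f"z = {z}, critical values = -{z_critical:.2f} and +{z_critical:.2f}")
print("Reject the null hypothesis" if reject_null else "Do not reject the null hypothesis")
```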


Level of Significance (α Level)

The level of significance is the maximum probability of committing
a type I error. Statisticians generally agree on using 3 arbitrary
significance levels, i.e. 0.10, 0.05 and 0.01. If the significance level
is 0.01, there is a 1 percent probability of committing a mistake (and
accepting results that are not true). If the significance level is 0.05,
there is a 5 percent probability of committing a mistake, while if the
significance level is 0.10, there is a 10 percent probability of
committing a mistake. If the significance level is set at 1 percent, a
larger sample size is required, which may be difficult to
achieve, while if the significance level is set at 10 percent the sample
size required will be smaller, but the validity of the results will
become questionable. An increase in sample size makes the findings more
valid, while a decrease in sample size adversely affects the validity
of the results. Thus it is the investigator's choice and decision to set
the significance level at an appropriate level so that the findings are
valid at an affordable sample size.
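
The trade-off between significance level and sample size can be made concrete with a small calculation. The sketch below assumes the standard sample size formula for a two-sided one-sample z-test, n = ((z for 1 − α/2 + z for power) × SD / difference)², which is not given in the text; the function name and the example values (a difference of 1 g/dL, an SD of 2 g/dL, 80 percent power) are likewise hypothetical.

```python
import math
from statistics import NormalDist

def sample_size(delta: float, sd: float, alpha: float, power: float = 0.80) -> int:
    """Approximate n for a two-sided one-sample z-test (assumed formula)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_power = std_normal.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(((z_alpha + z_power) * sd / delta) ** 2)

for alpha in (0.10, 0.05, 0.01):
    n = sample_size(delta=1.0, sd=2.0, alpha=alpha)
    print(f"alpha = {alpha:.2f}: required n = {n}")
```

Under these assumptions, lowering α from 0.10 to 0.01 raises the required sample size, which is the cost the investigator weighs against the smaller risk of a type I error.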


CHAPTER

Measures of Disease Frequency


For epidemiological purposes, the occurrence of cases of disease
must be related to the population at risk giving rise to the cases.
Several measures of disease frequency are in common use. There are
three general classes of mathematical parameters used to relate the
number of cases of a disease or outcome to the size of the source
population.


RATIO, PROPORTION AND RATE


Ratio

It is obtained by simply dividing one quantity by another without
implying any specific relationship between the numerator and
denominator, such as gender ratio, i.e. females : males. In a ratio,
the numerator and denominator are mutually exclusive.
For example, the female to male ratio of postgraduate trainees in
Abbasi Shaheed Hospital is:

$$\text{Gender ratio} = \frac{\text{Number of female trainees in Abbasi Shaheed Hospital}}{\text{Number of male trainees in Abbasi Shaheed Hospital}} = \frac{150}{50}$$

Female : Male = 3:1

Proportion
It is a type of ratio in which those who are included in the numerator
must also be included in the denominator.

For example: the proportion of postgraduate trainees who have
passed the FCPS part 2 examinations.
Total number of postgraduate trainees who appeared in the FCPS
part 2 examinations = 1500
Total number of postgraduate trainees who cleared the FCPS part 2
examinations = 150

$$\text{Proportion} = \frac{\text{Total number of postgraduate trainees who cleared FCPS part 2}}{\text{Total number of postgraduate trainees who appeared in FCPS part 2}} = \frac{150}{1500} = \frac{1}{10}$$

Rate

A rate is a proportion with specification of time. There is a distinct
relationship between the numerator and denominator with a
measure of time being an intrinsic part of the denominator.
For example, the number of newly diagnosed cases of cervical
cancer per 100,000 women during a given year.
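
The three measures can be compared side by side in a short sketch. The trainee counts are those used in the examples above; the function names are illustrative, and the cervical cancer figures (30 new cases among 200,000 women in one year) are hypothetical numbers chosen only to show the arithmetic.

```python
from fractions import Fraction

def ratio(a: int, b: int) -> Fraction:
    """One quantity divided by another; numerator and denominator are separate groups."""
    return Fraction(a, b)

def proportion(part: int, whole: int) -> Fraction:
    """A ratio in which the numerator is included in the denominator."""
    if part > whole:
        raise ValueError("the numerator must be part of the denominator")
    return Fraction(part, whole)

def rate(events: int, population: int, per: int = 100_000) -> float:
    """Events per `per` persons during the stated time period."""
    return events * per / population

print("Female : male ratio =", ratio(150, 50))                  # trainees example
print("Proportion passing FCPS part 2 =", proportion(150, 1500))
print("Rate per 100,000 women per year =", rate(30, 200_000))   # hypothetical figures
```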

Important Point

It is necessary to be very specific about what constitutes both the
numerator and the denominator. In some circumstances, it is
important to make a clear distinction whether the measure represents
the number of events or the number of individuals.
For example, the frequency of myopia among a population of
school children could represent the number of affected eyes in
relation to total eyes (measure represents the event), or the number
of children affected in one or both eyes relative to all students
(measure represents the number of individuals).

PREVALENCE AND INCIDENCE


The frequency of disease is commonly measured in epidemiological
studies and broadly categorized as incidence and prevalence.


Prevalence

Prevalence quantifies the proportion of individuals in a population
who have the disease at a specific instant and provides an estimate
of the probability that an individual will be ill at a point in time.
Prevalence is a proportion, so it has no unit.
Prevalence (P) can be calculated as:

$$P = \frac{\text{Number of existing cases (both old and new) of a disease}}{\text{Total population at a given point in time}}$$

Point Prevalence

Point prevalence measures the frequency of the disease of interest in a
defined population at a single point in time.

$$\text{Point prevalence} = \frac{\text{Number of cases (diseased) in a defined population at one point in time}}{\text{Number of persons in a defined population at the same point in time}}$$

For example: Of 25,000 male residents in Steel Town on 1st March,
2013, 2,500 have diabetes. The prevalence of diabetes among men in
Steel Town on 1st March, 2013 is calculated as:
Prevalence (P) = 2,500/25,000
= 1/10 or 0.1
Prevalence can also be expressed as a percentage (cases per 100)
by multiplying P by 100. Thus the prevalence (%) of diabetes among
men in Steel Town on 1st March, 2013 is calculated as:
= 0.1 × 100
= 10%


Period Prevalence
Period prevalence is the total number of cases (diseased) at any
point during a specified period of time divided by the population at
risk midway through the period.


$$\text{Period prevalence} = \frac{\text{Cases present at the start of the time period} + \text{New cases developed during this time period}}{\text{Population at risk midway during that time period}}$$

For example: A study was conducted in Gulberg Town, Karachi from
January 1st 2011 to December 31st 2012 to determine the period
prevalence of hypertension in women older than 45 years during
that time period. The period prevalence based on the data below is
as follows:
Number of hypertensive women older than 45 years residing in
Gulberg Town as on 1st January 2011 = 2,500
Number of women older than 45 years residing in Gulberg Town
who developed hypertension from 1st Jan 2011 to 31st Dec 2012 = 500
Population at risk midway (as on 31st Dec, 2011) = 60,000

$$\text{Period prevalence} = \frac{2{,}500 \text{ (old cases)} + 500 \text{ (new cases)}}{60{,}000 \text{ (midway population)}} = \frac{3{,}000}{60{,}000} = \frac{1}{20} = 0.05$$

Period prevalence when expressed in percentage would be:
= 0.05 × 100
= 5%
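
Both prevalence calculations can be reproduced with two small functions. This is a sketch using the Steel Town and Gulberg Town figures from the examples; the function and parameter names are illustrative only.

```python
def point_prevalence(existing_cases: int, population: int) -> float:
    """Existing cases divided by the population at the same point in time."""
    return existing_cases / population

def period_prevalence(cases_at_start: int, new_cases: int, midperiod_population: int) -> float:
    """Old plus new cases during the period, divided by the mid-period population."""
    return (cases_at_start + new_cases) / midperiod_population

steel_town = point_prevalence(existing_cases=2_500, population=25_000)
gulberg_town = period_prevalence(cases_at_start=2_500, new_cases=500,
                                 midperiod_population=60_000)

print(f"Point prevalence of diabetes, Steel Town: {steel_town:.2f} ({steel_town:.0%})")
print(f"Period prevalence of hypertension, Gulberg Town: {gulberg_town:.2f} ({gulberg_town:.0%})")
```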

Factors Influencing Prevalence


Prevalence is a useful measure for quantifying the burden of disease
in a population at a given point in time, and is thus beneficial in
planning health services. However, as it is influenced by a number of
factors (Table 7.1), it is not a useful measure for establishing the
determinants of disease (causality) in a population.

Incidence
Incidence quantifies the number of new events or cases of disease
that develop in a population of individuals at risk during a specified
time interval.

$$\text{Incidence} = \frac{\text{Number of new cases of a disease}}{\text{Total population at risk}}$$

Table 7.1: Factors influencing prevalence

| Increased by | Decreased by |
|---|---|
| Longer duration of disease | Shorter duration of disease |
| Prolongation of life of patients without cure | High case fatality rate from disease |
| Increase in new cases (increase in incidence) | Decrease in new cases (decrease in incidence) |
| In-migration of cases | In-migration of healthy people |
| Out-migration of healthy people | Out-migration of cases |
| In-migration of susceptible people | Increased cure rate of cases |
| Improved diagnostic facilities (better reporting) | Worsening diagnostic facilities (poor reporting) |


Issues in the Calculation of Measures of Incidence

For any measure of disease frequency, precise definition of the
denominator is essential for both accuracy and clarity. This is of
particular concern in the calculation of incidence. The denominator
of a measure of incidence should include only those who are
considered at risk of developing the disease, that is, the total
population from which the new cases could arise.
Consequently, those who currently have or have already had the
disease under study or persons who cannot develop the disease for
reasons such as age, immunization, or prior removal of the involved
organ should be excluded from the denominator.


SPECIAL TYPES OF INCIDENCE RATES

• Cumulative incidence rate or incidence risk
• Incidence density rate.

Cumulative Incidence Rate


Cumulative incidence (CI) is a simpler measure of disease occurrence.
It is the proportion of people who become diseased during a specified
period of time. It provides an estimate of the probability, or
incidence risk, that an individual will develop a disease during a
specified period of time. Hence, the characteristics of CI are:
• A population is identified and screened for the disease at baseline.
• Those who do not have the disease are followed for a specified
period (e.g. a year) and then rescreened.
• Any cases that develop in this period are new cases.
• It measures the denominator at only one point in time (the
beginning of the specified period).

$$\text{CI} = \frac{\text{Number of new cases of a disease in a specified period of time}}{\text{Number of disease-free persons at the beginning of that time period}}$$
It is important to note that the denominator is the total number
of people who were free of the disease at the beginning of the study
period; defined as the population at risk. The cumulative incidence
assumes that the entire population at risk at the beginning of the
study period has been followed for the specified time period for the
development of the outcome under investigation. This is called a
closed population.
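
A cumulative incidence calculation might look like the sketch below. All of the numbers are hypothetical (1,000 people screened, 100 already diseased at baseline, 45 new cases during one year of follow-up), and the function name is illustrative; the point is that prevalent cases are removed from the denominator before the risk is computed.

```python
def cumulative_incidence(new_cases: int, disease_free_at_start: int) -> float:
    """New cases during the period divided by those disease-free at its start."""
    if new_cases > disease_free_at_start:
        raise ValueError("new cases cannot exceed the population at risk")
    return new_cases / disease_free_at_start

screened = 1_000            # hypothetical population screened at baseline
prevalent_at_baseline = 100  # already diseased, so not at risk
at_risk = screened - prevalent_at_baseline
new_cases = 45               # hypothetical cases found at rescreening after one year

ci = cumulative_incidence(new_cases, at_risk)
print(f"Population at risk: {at_risk}")
print(f"One-year cumulative incidence: {ci:.3f} ({ci:.1%})")
```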

Incidence Density (ID) Rate

Incidence density measures the rate (speed) at which new cases
of disease occur in a population. It is a more precise measure of
incidence as it takes into account varying periods of follow-up,
which arise for reasons such as refusal to continue participating in
the study, migration, death, and new participants entering the study
some time after it starts.
As not every individual in the denominator is followed up for
the full time, commonly due to loss to follow-up, and different
individuals may be observed for different lengths of time, the
incidence density rate is calculated to give a more precise estimate
of incidence.

$$\text{Incidence density} = \frac{\text{Number of new cases of disease during a given period of time}}{\text{Total person-time at risk during the follow-up period}}$$

Table 7.2: Person-time (years) at risk for 5 individuals in a hypothetical
cohort study between 2008-2012

| Person | Years at risk |
|---|---|
| 1 | 5 years |
| 2 | 4.5 years |
| 3 | 2.5 years |
| 4 | 3.5 years |
| 5 | 2.5 years |
| Total | 18 years |

----------- = Time at risk
x = Developed disease (three persons)
L = Person lost to follow-up (one person)
Calculating Person-time at Risk


The denominator in incidence density is person-time at risk, which
is the sum of each individual's time at risk (i.e. the length of time
study participants were followed in the study). It is commonly expressed
as person-years at risk. When a study subject develops the disease,
dies or leaves the study, they are no longer at risk and no longer
contribute person-time units at risk (Table 7.2).
Thus, in the above example, the incidence density (per 100 person-years
at risk) for the disease (x) is calculated as:

$$\text{Incidence density} = \frac{3 \text{ cases}}{18 \text{ person-years}} \times 100 = 16.7 \text{ cases per 100 person-years at risk}$$
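
The person-time arithmetic can be verified directly. The sketch below sums the years at risk listed in Table 7.2 and divides the 3 cases by that total; the variable and function names are illustrative.

```python
# Years at risk contributed by each of the 5 individuals in Table 7.2
years_at_risk = [5.0, 4.5, 2.5, 3.5, 2.5]
new_cases = 3  # individuals who developed the disease during follow-up

def incidence_density(cases: int, person_time: float, per: int = 100) -> float:
    """New cases per `per` units of person-time at risk."""
    return cases / person_time * per

total_person_years = sum(years_at_risk)
density = incidence_density(new_cases, total_person_years)

print(f"Total person-time at risk: {total_person_years} person-years")
print(f"Incidence density: {density:.1f} cases per 100 person-years at risk")
```

Dividing 3 cases by 18 person-years gives about 0.167 cases per person-year, i.e. roughly 16.7 cases per 100 person-years at risk.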

Morbidity Rate
It is the incidence rate of nonfatal cases in the total population at risk
during a specified period of time. For example, the morbidity rate
of tuberculosis (TB) in the US in 1982 can be calculated by dividing
the number of nonfatal cases newly reported during that year by the
total US mid-year population.

$$\text{Morbidity rate} = \frac{\text{Total number of nonfatal cases of TB in the population at risk}}{\text{Mid-year population}} = \frac{25{,}520}{231{,}534{,}000} = 11.0 \text{ per 100,000 population}$$

Mortality Rate
It expresses the incidence of deaths in a particular population during
a period of time. It is calculated by dividing the number of fatalities
during that period by the total population. This can be further
divided into cause specific mortality rate, age specific mortality rate
or sex specific mortality rate, etc.


CHAPTER

Measures of Association

In order to describe the strength of the relationship between an
exposure (independent variable) and an outcome (dependent
variable), measures of association are used. The type of measure
used to define the association between an exposure and an outcome
depends upon the type of data.

ASSOCIATION BETWEEN TWO CONTINUOUS VARIABLES

Correlation
Correlation measures the strength of the linear association between
two continuous variables. The correlation coefficient r is a measure
of the degree (magnitude) to which two continuous variables are
associated with each other. It is always a number between −1 and +1
(Table 8.1). The sign of r indicates whether the correlation is positive
or negative. The magnitude (absolute value) of r indicates the
strength of the correlation, or how close the array of data points is to
a straight line.

Table 8.1: Approximate degrees of association
corresponding to level of r

| Correlation coefficient (r) | Degree of association |
|---|---|
| 1.0 | Perfect |
| 0.7 to 1.0 | Strong |
| 0.4 to 0.7 | Moderate |
| 0.2 to 0.4 | Weak |
| 0.01 to 0.2 | Negligible |
| 0.0 | No association |
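
Table 8.1 can be turned into a small lookup, sketched below. The handling of the band edges is an assumption: the table's ranges share their end points (e.g. 0.7 appears in two rows), so this sketch assigns each shared boundary to the stronger category and treats any nonzero value below 0.2 as negligible. The function name is illustrative.

```python
def degree_of_association(r: float) -> str:
    """Map a correlation coefficient to the approximate labels of Table 8.1."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    strength = abs(r)  # the sign gives direction only; strength is the magnitude
    if strength == 1.0:
        return "perfect"
    if strength >= 0.7:
        return "strong"
    if strength >= 0.4:
        return "moderate"
    if strength >= 0.2:
        return "weak"
    if strength > 0.0:
        return "negligible"
    return "no association"

for r in (-0.85, 0.5, 0.25, 0.05, 0.0):
    direction = "negative" if r < 0 else "positive" if r > 0 else "none"
    print(f"r = {r:+.2f}: {degree_of_association(r)} (direction: {direction})")
```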



The different types of correlation coefficient are:
• Pearson correlation coefficient
• Spearman rank correlation coefficient.
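
Both coefficients can be computed from first principles, as in the sketch below. The data are hypothetical (ten made-up pairs of months owned and hours exercised, not the values from the book's examples), the function names are illustrative, and the simple ranking helper assumes there are no tied values.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank from 1 (smallest) upward; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# Hypothetical data for ten people (illustration only)
months_owned = [1, 2, 4, 5, 6, 8, 9, 10, 11, 12]
hours_exercised = [10, 9.5, 8, 7.5, 6, 5, 4.5, 3, 2, 1]

print(f"Pearson r    = {pearson_r(months_owned, hours_exercised):.3f}")
print(f"Spearman rho = {spearman_rho(months_owned, hours_exercised):.3f}")
```

Pearson's coefficient works on the measured values and so reflects how close the points lie to a straight line, whereas Spearman's works on ranks and only asks whether one variable consistently rises or falls with the other.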
Example 1: Consider the data table below which contains measurements
on two variables for ten people: the number of months the
person has owned an exercise machine and the number of hours the
person spent on exercise in the past week.

Data table: Person (1 to 10) | Machine owned (in months) | Hours exercised

If you display these data pairs as points in a scatter plot (Fig. 8.1),
then you can see a definite trend. The points appear to form a line
that slants from the upper left to the lower right. As you move along
that line from left to right, the values on the vertical axis (hours of
exercise) decrease, while the values on the horizontal axis (months
owned) increase. Another way to express this is to say that the two
variables are inversely related: the more months the machine was
owned, the less the person tends to exercise. Thus, there seems to be
a correlation between these two continuous variables, but the two
variables are correlated negatively.

Figure 8.1 Scatter plot of two continuous variables (Months exercise
machine owned and hours of exercise) showing negative correlation

Example 2: Now consider the data table below which contains
measurements on two continuous variables for ten people: the
number of months the person has owned an exercise machine and
their cardiovascular fitness (measured on a scale from 1 to 12, higher
scores showing better cardiovascular fitness).

Data table: Person (1 to 10) | Machine owned (in months) | Cardiovascular fitness (score from 1 to 12)

Thus in Figure 8.2 the months owned are plotted against
cardiovascular fitness on a scatter plot. The pattern of these data
points suggests a line that slants from lower left to upper right, which
is the opposite of the direction of slant in the first example. Thus,
the figure shows that the longer the person has owned the exercise
machine, the better his or her cardiovascular fitness tends to be; this
is an example of a positive correlation (Fig. 8.2).

Figure 8.2 Scatter plot of two continuous variables (Months exercise machine
owned and cardiovascular fitness score) showing positive correlation


If two variables are positively correlated, as the value of one
increases, so does the value of the other. If they are negatively (or
inversely) correlated, as the value of one increases, the value of the
other decreases.
A third possibility remains: that as the value of one variable
increases, the value of the other neither increases nor decreases.
Example 3: Now consider the data table below which contains
measurements on two variables for ten people; the number of
months the person has owned an exercise machine and their height.
Person

10

Machine owned
(in months)

10

12

Height (meters)

1.3

1.8

1.5

1.9

1.3

1.9 1.4

1.8

1.5

Figure 8.3 is a scatter plot of months exercise machine owned
(horizontal axis) against the person's height (vertical axis). No linear trend can
be seen in the plot. These two variables appear to be uncorrelated.
You can go even further in expressing the relationship between
variables. Compare the two scatter plots in Figures 8.4 and 8.5. Both
plots show a positive correlation because as the values on one axis
increase, so do the values on the other. But the data points in
Figure 8.5 are more closely packed than the points in Figure
8.4, which are more spread out. If a line were drawn through the
middle of the trend, the points in Figure 8.5 would be closer to the
line than the points in Figure 8.4. In addition to direction (positive or
negative), correlations also have strength, which is a reflection of

Figure 8.3 Scatter plot of two continuous variables (months exercise
machine owned and height) showing no correlation


Figure 8.4 Weak/low correlation

Figure 8.5 Strong/perfect correlation

the closeness of the data points to the perfect line. Figure 8.5 shows
a stronger correlation than Figure 8.4.
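
The direction and strength described above can be summarized in a single number, the Pearson correlation coefficient. The following is a minimal sketch, not the book's own procedure: the function name `pearson_r` and the six data pairs are hypothetical and only illustrate an inverse relationship like the one in Example 1.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical illustration data (NOT the book's table):
months_owned = [1, 2, 3, 4, 5, 6]
hours_exercised = [9, 8, 8, 6, 5, 3]

r = pearson_r(months_owned, hours_exercised)
print(f"r = {r:.2f}")  # a negative value indicates an inverse relationship
```

A value near -1 or +1 corresponds to points lying close to a straight line (as in Figure 8.5), while a value near 0 corresponds to no linear trend (as in Figure 8.3).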

Simple Linear Regression


Correlation is not concerned with causation in relationships among
variables. A related statistical procedure, regression, describes how
one variable changes with another; on its own it does not establish
causality, which depends on the study design. Regression is used to assess the contribution
of one or more predictor/explanatory variables (called independent
variables) to one dependent variable. It can also be used to predict
the value of one variable from the values of another variable. When
there is only one independent variable and when the relationship


Figure 8.6 Simple linear regression

can be expressed as a straight line, the procedure is called simple
linear regression. Any straight line (Fig. 8.6) in two-dimensional
space can be represented by the equation:
Y = a + bX
where
Y is the variable on the vertical axis.
X is the variable on the horizontal axis.
a is the value of Y where the line crosses the vertical axis (often called
the intercept).
b is the amount of change in Y corresponding to a one-unit increase
in X (often called the slope).

Example: In a cross-sectional survey, data were collected from 77
patients on maintenance hemodialysis. The variables recorded were
the number of months on dialysis and the Beck
depression score (a validated tool to identify the presence and severity
of depression).
To predict the Beck depression score (dependent variable) from
months on dialysis (independent variable), a linear regression
analysis was performed in SPSS. SPSS generates several tables of
output, but the most important inferential output is displayed below
(Table 8.2).

Table 8.2: SPSS output (labeled "Coefficients") for linear regression

Model                           Unstandardized coefficients   Standardized   t        Significance   95% confidence interval for B
                                B         Standard error      coefficients                           Lower bound   Upper bound
                                                              (Beta)
1 (Constant)                    19.932    2.07                               9.61     <0.001         15.801        24.064
  Dialysis duration in months   -0.061    0.023               -0.300         -2.728   0.008          -0.106        -0.017

Dependent variable: Beck depression inventory (scoring)

The output labeled "Coefficients" is the most important table, as
the values in this table are used to generate the equation of
the regression line.
The standardized coefficient (Beta), -0.300, is the
correlation between Beck depression score and dialysis duration
in months.
As the p-value (0.008) is less than 0.05, the correlation is statistically
significant.
The dialysis duration in months row under the unstandardized B column
gives the slope of the regression line, which is -0.061.
The slope (-0.061) means that with each additional
month of dialysis, the Beck depression score is predicted to
decrease by 0.061.
The Constant row under the unstandardized B column gives the intercept,
which is 19.932.
The constant gives the value of the dependent variable when the
explanatory variable is 0. Thus, if the dialysis duration in months is
0, the predicted Beck depression score is 19.932.
Statistical Package for the Social Sciences (SPSS) does not generate
the equation of the regression line.
Thus, these two coefficients are used to construct the regression
equation, that is
Y = a + b (X)


where
Y is the predicted value of the dependent variable
a is the intercept (in this case it is 19.932)
b is the slope or the gradient of the regression line (in this case, it
is -0.061)
X is the independent or explanatory variable

Thus, the equation of the regression line will be
Y = 19.932 + (-0.061)X
Y = 19.932 - 0.061X
Task: Predict the Beck depression score for maintenance dialysis
patients who have been on dialysis for 16 months and for 17 months.

Dialysis duration 16 months
Y = 19.932 - 0.061 X
Y = 19.932 - 0.061 (16)
Y = 19.932 - 0.976
Y = 18.956

Dialysis duration 17 months
Y = 19.932 - 0.061 X
Y = 19.932 - 0.061 (17)
Y = 19.932 - 1.037
Y = 18.895

Thus, with a one-unit increase in dialysis duration (from 16
to 17 months), the predicted Beck depression score decreases from 18.956 to
18.895, a decrease of 0.061. This equals the slope reported in the
SPSS output (Table 8.2).
The intercept, 19.932, is the predicted Beck depression score at time 0 of
dialysis. This is the value at which the best-fit line meets
the Y axis (Fig. 8.7).
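
The prediction step can be sketched in a few lines of code. This is only an illustration of the arithmetic above: the intercept and slope are taken from Table 8.2, while the function name `predict_beck_score` is a hypothetical helper introduced here.

```python
# Coefficients read from the SPSS output (Table 8.2)
INTERCEPT = 19.932   # predicted Beck depression score at 0 months of dialysis
SLOPE = -0.061       # predicted change in score per additional month of dialysis


def predict_beck_score(months_on_dialysis):
    """Predicted Beck depression score from the fitted line Y = a + bX."""
    return INTERCEPT + SLOPE * months_on_dialysis


score_16 = predict_beck_score(16)
score_17 = predict_beck_score(17)

print(f"16 months: {score_16:.3f}")
print(f"17 months: {score_17:.3f}")
print(f"change for one extra month: {score_17 - score_16:.3f}")
```

The difference between any two predictions one month apart is always the slope, which is why the slope is read as the change in Y per one-unit increase in X.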

RELATIVE RISK AND ODDS RATIO


The relative risk (RR) and odds ratio (OR) are commonly used to
describe the relationship between an exposure and an outcome. The
RR is used in cohort studies, whereas the OR is used in case control
studies.

Relative risk
The relative risk (or risk ratio) is defined as the ratio of the incidence of
disease in the exposed group divided by the corresponding incidence
of disease in the nonexposed group. Relative risk can be calculated in
cohort studies such as the Framingham Heart Study where subjects


Figure 8.7 Linear regression showing an association between duration of
dialysis (months) and Beck depression scores

with certain exposures (e.g. hypertension, hyperlipidemia, smoking)
were followed prospectively for cardiovascular outcomes. The
incidence of cardiac events in subjects with and without each exposure
was then used to calculate the relative risk and determine whether
the exposures were cardiac risk factors.
Relative risk = Incidence in exposed / Incidence in nonexposed

Risk          Diseased   Nondiseased   Total
Exposed       a          b             a + b
Nonexposed    c          d             c + d

Relative risk (RR) = [a / (a + b)] / [c / (c + d)]



Incidence in exposed individuals = a/(a + b), the proportion of exposed
people who developed the disease. Incidence in nonexposed
individuals = c/(c + d), the proportion of nonexposed people who
developed the disease.

Risk factor   Disease status                Total
              CHD present    CHD absent
Smoker        112 (a)        176 (b)        288 (a + b)
Nonsmoker     88 (c)         224 (d)        312 (c + d)

Incidence in exposed = a/(a + b) = 112/288 = 0.389
Incidence in nonexposed = c/(c + d) = 88/312 = 0.282
RR = 0.389/0.282 = 1.38

Interpretation of RR
Compared to nonsmokers, smokers have 1.38 times the
risk of developing CHD.
Alternative explanation: Compared to nonsmokers, smokers
have a 38 percent greater risk of developing CHD.

Interpretation of RR if the RR is < 1.0: Suppose the RR in the above
study was 0.68.
Then the interpretation would be that, compared to nonsmokers,
smokers have a 32 percent lower risk of developing CHD. In
research this is called a protective effect of the exposure. In other words,
the exposure is beneficial.
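
The relative risk calculation can be sketched as a short function. This is a minimal illustration using the smoking and CHD counts given above; the function name `relative_risk` is a hypothetical helper, and the cell labels a, b, c, d follow the two-by-two table convention used in this section.

```python
def relative_risk(a, b, c, d):
    """Relative risk from a cohort 2x2 table.

    a = exposed with disease,     b = exposed without disease,
    c = nonexposed with disease,  d = nonexposed without disease.
    """
    incidence_exposed = a / (a + b)
    incidence_nonexposed = c / (c + d)
    return incidence_exposed / incidence_nonexposed


# Smokers: 112 with CHD, 176 without; nonsmokers: 88 with CHD, 224 without
rr = relative_risk(112, 176, 88, 224)
excess_percent = (rr - 1) * 100

print(f"RR = {rr:.2f}")
print(f"Smokers have about {excess_percent:.0f} percent greater risk than nonsmokers")
```

Working from the unrounded incidences avoids the small rounding error that appears if the two incidences are rounded to two decimals before dividing.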

Odds Ratio

The odds ratio is defined as the odds of exposure in the group
with disease divided by the odds of exposure in the control group.
Because subjects in case control studies are selected on the basis of
disease status, it is not possible to calculate the rate of
development of disease (the incidence).

In research, the word risk is used for the development of a
disease or outcome, e.g. the risk of developing CHD. In a case control
study, the cases and controls are defined on the basis of the
outcome/disease, i.e. those who have CHD are the cases, and those
who do not have CHD are the controls. Since the study starts with the
disease/outcome, researchers use a different word
when comparing the frequency of the exposure in those who had the
disease versus those who did not have the disease: they
speak of the odds of an exposure rather than the risk.
Odds ratio = Odds of exposure in the cases / Odds of exposure in the controls

Calculating Odds Ratio (Case Control Studies)

              Cases   Controls   Total
Exposed       a       b          a + b
Nonexposed    c       d          c + d

Odds ratio = (a/c) / (b/d)

Odds ratio = ad/bc

Oral Contraceptives and Breast Cancer

Exposure                               Breast cancer        Total
                                       Yes        No
Exposed (oral contraceptive users)     140 (a)    370 (b)   510
Nonexposed                             40 (c)     234 (d)   274

Odds of exposure in cases = a/c = 140/40 = 3.5
Odds of exposure in controls = b/d = 370/234 = 1.58
OR = 3.5/1.58 = 2.2

Interpretation of OR
Compared to the controls (those who did not have Ca breast), the
odds of being an oral contraceptive user were 2.2 times greater in those
who had Ca breast (cases).
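
The odds ratio calculation can be sketched in the same way. This is a minimal illustration with the oral contraceptive and breast cancer counts given above; the function name `odds_ratio` is a hypothetical helper.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a case control 2x2 table.

    a = exposed cases,     b = exposed controls,
    c = nonexposed cases,  d = nonexposed controls.
    """
    odds_exposure_cases = a / c
    odds_exposure_controls = b / d
    return odds_exposure_cases / odds_exposure_controls


# OC users: 140 cases, 370 controls; nonusers: 40 cases, 234 controls
or_value = odds_ratio(140, 370, 40, 234)
cross_product = (140 * 234) / (370 * 40)   # the ad/bc shortcut

print(f"OR = {or_value:.1f}")
print(f"ad/bc = {cross_product:.1f}")
```

Both routes give the same number, which is why the odds ratio is often called the cross-product ratio.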


CHAPTER

Factors Affecting
Study Outcomes

INTRODUCTION
Results of an epidemiological study may reflect the true effect of an
exposure on the development of the outcome under investigation,
but it must always be considered that the results may in fact be due to
alternative explanations. Such alternative explanations may be the
effects of chance (random error), bias or confounding,
which may produce spurious results, leading the researcher to
believe in a valid statistical association when one does
not exist or, alternatively, in the absence of an association when one is
truly present.
Observational studies are more susceptible to the effects of chance,
bias and confounding, so appropriate steps must be taken at both
the design and analysis stages so that their effects are minimized.

BIAS
Any systematic error that results in an incorrect estimate of the
association between an exposure and the disease/outcome is
called a bias. It is usually introduced by the researcher due to
nonstandardized measuring techniques.

Types of Bias
More than 50 types of bias are identified in epidemiological studies,
but for simplicity, they are broadly grouped into two categories:
1. Selection bias
2. Information bias


Selection Bias

It occurs when the inclusion of subjects in a study depends in some
way on the outcome of interest. It occurs mainly in case control and
retrospective cohort studies and not in prospective cohort studies, as
the outcome of interest has not yet occurred at enrollment. Selection bias can occur
due to improper means or source of selection of study subjects.
A classical example of selection bias is a study conducted to see
the association between oral contraceptives (OC) and thromboembolism.
There was a concern in this study that, as physicians
were already aware of the possible relationship between OC and
thromboembolism, the women hospitalized for evaluation of
thromboembolism were disproportionately current
users of OCs. So any increased frequency of thromboembolism
in oral contraceptive users could be in part due to the fact that
hospitalization and the determination of the diagnosis were both
influenced by a history of OC use.
Another means of selection bias could be due to inappropriate
source of selection, e.g. cases selected from hospitals and controls
from household surveys. In this case it is possible that a number of
demographic and lifestyle variables could be different amongst the
cases and controls leading to noncomparability between groups and
incorrect results with respect to association between exposures and
outcome.
In a clinical trial, selection bias can occur if there is no
randomization. Suppose that the principal investigator decides
which patients are included in the
standard drug group and which are included in
the new drug group. If the principal investigator is allowed to do so,
he might include all the healthy patients in the new drug group and
all patients who are sick (and have multiple comorbid conditions) in
the standard drug group. Thus, he could show better outcomes in
the new drug group (healthy patients) than in the
standard treatment group (sicker patients) and present results which
are not true. Randomization guards against this selection bias,
provided that the principal investigator and
his team members are kept away from the randomization
process.


Observation or Information Bias


It includes any systematic error in the measurement of information
on exposure or outcome. It is further classified into different
categories on the basis of the source of noncomparability:
Recall bias: It occurs when individuals with previous adverse
health outcomes remember and report their previous exposure
differently, or with a different degree of completeness and accuracy,
than those who are unaffected. It can lead to an over-
or underestimate of the association between exposure and disease,
depending on whether the cases recall their exposure to a greater or
lesser extent than the controls.
For example, in a case-control study mothers whose recent
pregnancies had ended in fetal death (cases) may report their
exposure experience (drug history) differently from a matched group
of mothers whose pregnancies had ended normally (controls). That is,
cases may have better recall of past exposure than controls. Recall
bias can be reduced by:
Collecting exposure data from work or medical records
Blinding participants to the study hypothesis.
Interviewer bias: It refers to any systematic difference in
the soliciting, recording or interpretation of information by
the interviewer from study participants and can affect every type
of epidemiologic study.
Lost to follow-up bias: It is a major concern in a cohort or
any prospective study. When persons lost to follow-up differ
from those who remain in the study with respect to both the
exposure and the outcome, any observed association will be
biased. Even a very small loss to follow-up can introduce
bias if the loss is related to both exposure and disease.
Misclassification bias: It occurs when the sensitivity and/
or specificity of the procedure/tool used to detect exposure and/
or outcome is not perfect, that is, exposed/diseased subjects
can be classified as nonexposed/nondiseased and vice versa,
because the means of determination is unclear
or not standardized. Some misclassification is inevitable in every study, so it is always
a potential concern and should be carefully
evaluated.


CONTROL OF BIAS
Control of bias is mostly done at the design phase of the study.
Following are some means to ensure the same.

For Control of Selection Bias


Correct choice of study population (sampling procedure)
Randomization.

For Control of Information Bias


Correct training of interviewers and use of clearly written protocols
ensuring uniform methodology of obtaining information.
Use of standardized, tested instruments for data collection, and
utilizing uniform source of data on all study subjects.
Maintaining of complete records and having definite means of
contact with respondents to prevent loss to follow-up.
Use of clearly defined means of determination of both exposure
and outcome variables.
Blinding of interviewees and interviewers to study objectives.

CONFOUNDING
The concept of confounding is a central one in the interpretation
of any epidemiological study. It can be thought of as a mixing of the
effect of the exposure under study on the outcome with that of an
extraneous factor, the confounder. To be deemed a confounder, this external factor or variable
must be associated with the exposure and, independently of the
exposure, must be a risk factor for the disease. Confounding can lead to an over- or an underestimation
of the true association between exposure and outcome.
Example 1: In a study conducted to determine the association
between smoking and myocardial infarction (MI), age can be a
confounder as it is associated with both exposure and outcome
independently.



Table 9.1: Relation of myocardial infarction (MI) to recent oral contraceptive
(OC) use

Oral contraceptive (OC)   MI +   MI -
Yes                       29     135
No                        205    1607
Total                     234    1742

Estimated relative risk = 1.68

Table 9.2: Age-specific relation of myocardial infarction (MI) to recent oral
contraceptive (OC) use

Age (years)   Recent OC use   MI +   MI -   Estimated age-specific relative risk
25-29         Yes             4      62     7.2
              No              2      224
30-34         Yes             9      33     8.9
              No              12     390
35-39         Yes             4      26     1.5
              No              33     330
40-44         Yes             6      9      3.7
              No              65     362
45-49         Yes             6      5      3.9
              No              93     301
Total                         234    1742

Example 2: A study was conducted to assess the association between
recent oral contraceptive use and MI; the results are shown in Table 9.1.
However, the data were confounded by age, which led to an
underestimation of the true effect, as can be seen from the age-specific results (Table 9.2).
Confounding can be controlled in study design by restriction,
matching and randomization. In analysis, it can be controlled
through stratification and multivariate analysis.
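
Stratification can be sketched with the counts from Tables 9.1 and 9.2. This is an illustration under stated assumptions rather than the authors' analysis: the estimated relative risks in the tables are reproduced here as cross-product (odds) ratios, the MI + count of 4 for OC users aged 25-29 is inferred from the column total of 234, and the Mantel-Haenszel summary is an added technique that the text does not name. The helper names are hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Cross-product ratio ad/bc for one 2x2 table."""
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Mantel-Haenszel summary odds ratio pooled over strata (added technique)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# Each tuple: (OC users with MI, OC users without MI,
#              nonusers with MI, nonusers without MI)
crude_table = (29, 135, 205, 1607)          # Table 9.1
age_strata = {                              # Table 9.2
    "25-29": (4, 62, 2, 224),               # the 4 is inferred from the total
    "30-34": (9, 33, 12, 390),
    "35-39": (4, 26, 33, 330),
    "40-44": (6, 9, 65, 362),
    "45-49": (6, 5, 93, 301),
}

crude = odds_ratio(*crude_table)
stratum_specific = {age: odds_ratio(*t) for age, t in age_strata.items()}
adjusted = mantel_haenszel_or(age_strata.values())

print(f"crude estimate: {crude:.2f}")
for age, value in stratum_specific.items():
    print(f"age {age}: {value:.1f}")
print(f"age-adjusted (Mantel-Haenszel) estimate: {adjusted:.2f}")
```

The age-adjusted estimate is larger than the crude one, which is the underestimation by confounding that the text describes.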

EFFECT MODIFIERS
Effect modifiers are variables that bring about a change in the
magnitude of an effect. Unlike a confounder, an effect modifier does not
need to be related to both the exposure and the outcome variable. For
example, if we want to determine the incidence of coronary heart
disease (CHD) amongst smokers, the outcome will be affected by
age. Hence, in such cases age is an effect modifier. Its impact has
to be reported through stratification.
It is important to bear in mind the role of bias, confounding and
chance, and to evaluate/control for them so as to ensure that the
results are valid and generalizable.


CHAPTER

10

Sample Size Estimation

SAMPLE SIZE
The sample size calculation depends on:
Type of study
Magnitude of the outcome of interest derived from previous
studies
Type of statistical analysis required (comparing means or
proportions)
Level of significance/power.

SAMPLE SIZE FOR SINGLE PROPORTION


Sample size for single proportion depends on:
The prevalence of the condition/attribute of interest
Level of confidence
Margin of error.

Example of Sample Size Calculation


for a Single Proportion
A researcher aims to estimate the prevalence of chronic kidney
disease (CKD) among adults older than 18 years of age in a locality.
How many adults should be included in the sample so that the
prevalence may be estimated within 5 percentage points of the true value
with 95 percent confidence, if it is known that the true rate is unlikely
to exceed 40 percent?
Values to be entered into the WHO sample size calculator:
Confidence level: 95 percent
Anticipated prevalence or population proportion for CKD: 40 percent
Absolute precision required (based on researcher judgment): 5 percent


Figure 10.1 Sample size calculation and formula for single proportion

When the above values are entered into the WHO sample size
calculator, the estimated sample size is calculated (Fig. 10.1).
The estimated sample size is 369. Thus, at least
369 participants must be recruited in the study to estimate the
prevalence of CKD at a confidence level of 95 percent, with a
precision of 5 percentage points.
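
The figure of 369 can be reproduced with the standard formula for estimating a single proportion, n = z² P(1 - P) / d². The sketch below assumes this is the formula behind the calculator (the figure showing it is not reproduced in this text); the function name `sample_size_single_proportion` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_single_proportion(p, d, confidence=0.95):
    """Sample size to estimate a proportion p within +/- d (absolute precision)."""
    # Two-sided critical value, e.g. about 1.96 for 95 percent confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / d ** 2)


# Anticipated CKD prevalence 40 percent, precision 5 percentage points
n_ckd = sample_size_single_proportion(p=0.40, d=0.05, confidence=0.95)
print(n_ckd)
```

The result is rounded up because recruiting a fraction of a participant is not possible and rounding down would fall short of the required precision.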

SAMPLE SIZE FOR SINGLE GROUP MEAN


Sample size for single group mean depends on:
The mean of the condition of interest
Level of confidence
Margin of error.

Example of Sample Size Calculation


for Single Group Mean
A researcher aims to estimate the mean hemoglobin level among
pregnant women admitted to a tertiary care hospital. A previous
study of pregnant women showed average hemoglobin level



8.2 g/dL and standard deviation of 4.2 g/dL. How many pregnant
women must be studied if the researcher wants the estimate to fall within
1 g/dL of the true mean with 95 percent confidence?
Values needed to be entered into the WHO Sample Size Calculator:
Confidence interval: 95 percent
Population mean (Average hemoglobin of pregnant women identified
from previous study): 8.2 g/dL
Population standard deviation: 4.2 g/dL
Absolute precision required: 1 g/dL
When the above values are entered into the WHO sample size
calculator, the estimated sample size is calculated (Fig. 10.2).
Where ε = d/μ
ε = Relative precision
d = Absolute precision
μ = Population mean
The estimated sample size is 68. Thus, at least 68
participants must be recruited in the study to estimate the mean
hemoglobin level among pregnant women at a confidence level of
95 percent, with a precision of 1 g/dL.
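
The figure of 68 can be reproduced with the standard formula for estimating a single mean with absolute precision d, n = z² σ² / d². The sketch assumes this is the formula behind the calculator; the function name `sample_size_single_mean` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_single_mean(sd, d, confidence=0.95):
    """Sample size to estimate a mean within +/- d, given the standard deviation."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sd / d) ** 2)


# Standard deviation 4.2 g/dL, absolute precision 1 g/dL
n_hemoglobin = sample_size_single_mean(sd=4.2, d=1.0, confidence=0.95)
print(n_hemoglobin)
```

Note that the anticipated mean (8.2 g/dL) does not enter the calculation when the precision is stated in absolute terms; it is needed only to convert between absolute and relative precision.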

Figure 10.2 Sample size calculation and formula for single group mean


SAMPLE SIZE FOR TWO PROPORTIONS


The sample size for two proportions depends on:
The prevalence of the condition/attribute of interest for both
groups
Level of confidence
Power of the test.

Example of Sample Size Calculation


for Two Proportions
It is believed that the proportion of patients who develop depression
on one type of dialysis modality (peritoneal dialysis) is 5 percent
while the proportion of the patients who develop depression on
other type of dialysis modality (hemodialysis) is 15 percent. How
large should be the sample size in each of the two groups of patients
if an investigator wishes to detect with a power of 90 percent, whether
the second dialysis modality (hemodialysis) has depression rate
significantly higher than the first at 5 percent level of significance?
Values needed to be entered into the WHO sample size calculator:
Confidence interval: 95 percent
Anticipated prevalence of depression in peritoneal dialysis patients:
5 percent
Anticipated prevalence of depression in hemodialysis patients:
15 percent
Power of test: 90 percent
When the above values are entered into WHO sample size
calculator, the estimated sample size will be calculated (Fig. 10.3).
The estimated sample size is 153 per group. Thus, at least 153
participants must be recruited in each group to detect a significant
difference in the proportion with depression between the two dialysis
modalities with a power of 90 percent.
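
The figure of 153 per group can be reproduced with a common formula for a one-sided test comparing two proportions, n = [z(1-α) √(2P̄(1-P̄)) + z(1-β) √(P1(1-P1) + P2(1-P2))]² / (P1 - P2)², where P̄ is the average of P1 and P2. The sketch assumes this is the formula behind the calculator and that the test is one-sided, as the wording "significantly higher" suggests; the function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.90):
    """Per-group sample size for a one-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Depression: 5 percent on peritoneal dialysis, 15 percent on hemodialysis
n_per_group = sample_size_two_proportions(p1=0.05, p2=0.15, alpha=0.05, power=0.90)
print(n_per_group, "per group,", 2 * n_per_group, "in total")
```

A two-sided test at the same significance level would need a larger sample, so the sidedness should be fixed before the calculation is run.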

SAMPLE SIZE FOR TWO GROUP MEANS


The sample size for group means depends on:
The means and variance of both groups
Level of confidence
Power of the test


Figure 10.3 Sample size calculation and formula for two proportions

Example of Sample Size Calculation


for Two Group Means
Suppose the true mean systolic blood pressure (SBP) of
35-39-year-old oral contraceptive (OC) users is 135 mm Hg with a
standard deviation of 16 mm Hg. Similarly, for non-OC users, the
mean SBP is 130 mm Hg with a standard deviation of 17 mm Hg.
If we wish to detect the difference between two groups of equal
size, what is the minimum sample size required with a power
of 80 percent at the 95 percent confidence level?
Values to be Entered into the OpenEpi Software
The following values should be entered in the OpenEpi sample size
calculator (Table 10.1).
Confidence interval: 95 percent
Power: 80 percent
Mean systolic BP of oral contraceptive users: 135 mm Hg
Standard deviation of oral contraceptives users: 16 mm Hg
Mean systolic BP of nonoral contraceptive users: 130 mm Hg
Standard deviation of nonoral contraceptive users: 17 mm Hg


Table 10.1: Sample size calculation for comparing two means

Confidence interval % (two-sided)        95    Enter a value between 0 and 100, usually 95%
Power                                    80    Enter a value between 0 and 100, usually 80%
Ratio of sample size (Group 2/Group 1)   1

                     Group 1   Group 2
Mean                 135       130       Enter mean values of each group
Standard deviation   16        17        Enter standard deviation or variance of
Variance                                 each individual group

Table 10.2: Sample size calculation result

Input data
Confidence interval (2-sided)            95%
Power                                    80%
Ratio of sample size (Group 2/Group 1)   1

                         Group 1   Group 2   Mean difference*
Mean                     135       130       (135 - 130) = 5
Standard deviation       16        17
Variance                 256       289
Sample size of group 1   172
Sample size of group 2             172
Total sample size        344

* Mean difference of Systolic BP of Group 1 and Group 2

The minimum sample size required to compare the mean SBP of OC
and non-OC users is 344. Thus, 344 participants (172 in each group)
must be recruited in the study to detect the difference between the two
groups at the 95 percent confidence level with a power of 80 percent
(Table 10.2).
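
The figure of 172 per group can be reproduced with the standard formula for comparing two means with equal group sizes, n = (σ1² + σ2²)(z(1-α/2) + z(1-β))² / Δ², where Δ is the difference between the means. The sketch assumes this is the formula OpenEpi applies for a ratio of 1; the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_means(mean1, sd1, mean2, sd2, confidence=0.95, power=0.80):
    """Per-group sample size to compare two means (equal group sizes, two-sided)."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    difference = mean1 - mean2
    return ceil((sd1 ** 2 + sd2 ** 2) * (z_alpha + z_beta) ** 2 / difference ** 2)


# OC users: mean SBP 135, SD 16; non-OC users: mean SBP 130, SD 17
n_group = sample_size_two_means(135, 16, 130, 17, confidence=0.95, power=0.80)
print(n_group, "per group,", 2 * n_group, "in total")
```

Because the difference enters the formula squared, halving the difference to be detected roughly quadruples the required sample size.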


SAMPLE SIZE FOR SENSITIVITY AND SPECIFICITY


The sample size for sensitivity and specificity depends on:
The prevalence of the condition/attribute of interest
Estimated sensitivity
Estimated specificity
Level of significance
Margin of error.

Example of Sample Size Calculation


for Sensitivity and Specificity
Suppose we want to determine the sensitivity and specificity of ELISA
in the diagnosis of HIV against the gold standard, Western Blot. How many
patients should be included in the sample? The prevalence of HIV
is 15 percent, the estimated sensitivity of ELISA
is 97 percent and the estimated specificity is 94 percent. With 95 percent
confidence and a margin of error of 5 percent, how many
patients should be invited to participate in this study? A total of 347 patients will
be required for the sensitivity and specificity analysis in this study.
Sample size calculation for sensitivity and specificity
studies: values to be entered into the WHO software.
Prevalence of HIV: 15 percent
Sensitivity: 97 percent
Specificity: 94 percent
Confidence level: 95 percent
Margin of error: 5 percent

Expected Sensitivity   0.97   From literature or pilot study
Expected Specificity   0.94   From literature or pilot study
Expected Prevalence    0.15   From literature or pilot study
Desired Precision      0.05   Researcher's judgment
Confidence level       95%    95% is recommended

To achieve the precision of 0.05 for Sensitivity, we need the total sample size of = 347
This is preferable as it will give precision of 0.05 or less for both sensitivity and specificity
With this sample size, the precision for Specificity will be = 0.027


SUGGESTED WEBSITES FOR


SAMPLE SIZE CALCULATOR
1. http://www.raosoft.com/samplesize.html
2. http://www.quantitativeskills.com/sisa/calculations/samsize.htm
3. http://www.openepi.com/Menu/OpenEpiMenu.htm

BIBLIOGRAPHY

1. Calkins KG. Power and sample size: an appropriate sample size is


crucial to any well-planned research investigation. [Online]. 2005 [cited
19 Sep. 2008]; Available from: URL: http://www.andrews.edu/~calkins/
math/edrm611/edrm11.htm.
2. Naing L, Winn T, Rusli BN. Practical issues in calculating the sample
size for prevalence studies. Arch Orofac Sci. 2006;1:9-14.
3. Naing L. Sample size calculation for sensitivity and specificity studies.
[Online]. 2004 [cited 10 Aug. 2008]; Available from: URL: http://www.
kck.usm.my/ppsg/statistical_resources/samplesize_forsensitivity_
specificitystudiesLinNaing.xls.
4. OpenEpi Version 2.2.1: open source epidemiologic statistics for public
health. [Online]. 2008 [cited 10 Oct. 2008]; Available from: URL: http://
www.openepi.com/Menu/OpenEpiMenu.htm.
5. Sample size calculations: statistics guide for research grant applicants.
[Online]. [2001?] [cited 14 Oct. 2008]; Available form: URL: http://
www.sgul.ac.uk/index.cfm?D7DEB028-B5BE-7536-BD9D2EC800CE3789CAB35E63-88E4-4358-889C-043A012DF815.


CHAPTER

11

Screening

The active search for disease among apparently healthy people is a
fundamental aspect of prevention. This is embodied in screening,
which has been defined as the search for unrecognized disease
or defect by means of rapidly applied tests, examinations or other
procedures in apparently healthy individuals.
Screening is a way of improving patient outcomes by detecting
the disease in apparently healthy individuals at an earlier stage,
which is usually a treatable stage. For this purpose, there are tests
such as physical examination, biochemical assay of blood, urine
and other body fluids, radiography, ultrasonography, cytology and
histopathology. One question that needs to be answered in the context of
screening is how good these tests are at distinguishing individuals
with the disease in question from those without it.
A screening program is most effective and beneficial if it is directed
to a high-risk target population. Screening a total population for a
relatively infrequent disease can be very wasteful of resources and
may yield very few previously undetected cases.

RELIABILITY AND VALIDITY OF A SCREENING TEST


An effective screening program will use tests which are ideally
inexpensive, easy to administer, impose minimal discomfort on
those in whom they are administered, reliable (measure a variable
consistently and free of random error) and are valid (able to
differentiate between individuals with a disease or its precursor, and
those without).

Validity (Accuracy)
The term validity refers to the extent to which the test accurately measures
what it intends to measure. In other words, validity expresses the
ability of a test to separate or distinguish those who have the disease
from those who do not. For example, glycosuria is a useful screening
test for diabetes, but a more valid or accurate test is the glucose
tolerance test. Accuracy refers to the closeness with which measured
values agree with the true values.
Assessment of test performance is presented in a two by two
table (Table 11.1). The disease status (as assessed through the Gold
Standard) is conventionally put in the top row while the screening
test result in the first column.
In Table 11.1, a is the number of subjects who have the disease
and are found positive by the test (true positives), b is the number of
subjects who do not have the disease and are found positive by the
test (false positives), c is the number of subjects who have the disease
but are found negative by the test (false negatives), and d is the number
of subjects who do not have the disease and are found negative by
the test (true negatives).
Validity has two components: sensitivity and specificity.

SENSITIVITY AND SPECIFICITY


Table 11.1: A two-by-two table for screening test

                 Disease status as determined by Gold Standard
                 Disease present       No disease              Total
Test positive    True positives* (a)   False positives# (b)    Total test positive (a + b)
Test negative    False negatives (c)   True negatives~ (d)     Total test negative (c + d)
Total            Total with disease    Total without disease   Total screened
                 (a + c)               (b + d)                 (a + b + c + d)

*True positives = number of individuals with disease and a positive screening test
(a); #False positives = number of individuals without disease but with a positive
screening test (b); False negatives = number of individuals with disease but with a
negative screening test (c); ~True negatives = number of individuals without disease
and a negative screening test (d)

Sensitivity is defined as the ability of a test or procedure to identify
correctly all those who have the disease, that is, the true positives in the
screened population. Sixty percent sensitivity means that 60 percent
of the diseased people screened by the test will give a true-positive
result and the remaining 40 percent a false-negative result. Thus,
sensitivity is expressed as the proportion of those with disease who are
correctly identified by a positive screening test result.
$$\text{Sensitivity} = \frac{\text{Number of true positives}}{\text{Total with disease}} = \frac{a}{a + c}$$

When expressed in percent:

$$\text{Sensitivity} = \frac{a}{a + c} \times 100$$

Specificity is the ability of the test or procedure to identify correctly
all those who do not have the disease, that is, the true negatives in
the screened population. Thirty percent specificity means that
30 percent of the nondiseased persons will give a true-negative result,
while 70 percent of the nondiseased persons screened by the test
will be incorrectly classified as diseased when they are not. Thus,
specificity is expressed as the proportion of those without disease who
are correctly identified by a negative screening test result.
$$\text{Specificity} = \frac{\text{Number of true negatives}}{\text{Total without disease}} = \frac{d}{b + d}$$

When expressed in percent:

$$\text{Specificity} = \frac{d}{b + d} \times 100$$

PREDICTIVE VALUES

Predictive value reflects the diagnostic power of the test. The
predictive accuracy depends upon sensitivity, specificity and disease
prevalence.
Positive predictive value describes the probability of having
the disease given a positive screening test result in the screened
population. Thus, it is expressed as the proportion of those with disease
among all screening test positives. The positive predictive value of
mammography, for example, will tell a woman how likely it is that
she has breast cancer after a positive mammogram.

$$\text{Positive predictive value (PPV)} = \frac{\text{Number of true positives}}{\text{Total test positives}} = \frac{a}{a + b}$$

When expressed in percent:

$$\text{PPV} = \frac{a}{a + b} \times 100$$

Negative predictive value describes the probability of not having
the disease given a negative screening test result in the screened
population. Thus, it is expressed as the proportion of those without
disease among all screening test negatives. The negative predictive
value of mammography, for example, will tell a woman the probability
that she truly does not have breast cancer if the mammogram is
negative.
$$\text{Negative predictive value (NPV)} = \frac{\text{Number of true negatives}}{\text{Total test negatives}} = \frac{d}{c + d}$$

When expressed in percent:

$$\text{NPV} = \frac{d}{c + d} \times 100$$

Example
A new ELISA (antibody test) is developed to diagnose HIV infection.
Serum from 80 patients who were positive by Western Blot (the Gold
Standard assay) was tested, and 60 were found to be positive by the
new ELISA screening test. The manufacturers then used the new
ELISA to test serum from 120 study participants who were negative
by Western Blot (the Gold Standard assay); 70 were found to be
negative by the new test.

               HIV status (Western Blot)
ELISA test     Infected         Non-infected         Total
Positive       60 (a = TP)      50 (b = FP)          a + b = 110 (total test positive)
Negative       20 (c = FN)      70 (d = TN)          c + d = 90 (total test negative)
Total          80 (a + c)       120 (b + d)          a + b + c + d = 200
               Total infected   Total not infected   Total screened

$$\text{Sensitivity} = \frac{a}{a + c} \times 100 = \frac{60}{80} \times 100 = 75\%$$

i.e. the new ELISA test is 75 percent sensitive in correctly identifying
HIV infection.

$$\text{Specificity} = \frac{d}{b + d} \times 100 = \frac{70}{120} \times 100 \approx 58\%$$

i.e. the new ELISA test is 58 percent specific in detecting persons who
are not infected with HIV.

Positive Predictive Value (PPV)

$$\text{PPV} = \frac{a}{a + b} \times 100 = \frac{60}{110} \times 100 \approx 55\%$$

i.e. with ELISA, the new screening technique for HIV, 55 percent of the
persons who test positive are actually infected with HIV.

Negative Predictive Value (NPV)

$$\text{NPV} = \frac{d}{c + d} \times 100 = \frac{70}{90} \times 100 \approx 78\%$$

i.e. with ELISA, the new screening technique for HIV, 78 percent of the
persons who test negative are actually free from HIV.
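
The four measures in this example can be checked with a few lines of code. The following is a minimal sketch in Python; the function name `screening_metrics` is illustrative and not part of the text, and the cell labels a, b, c and d follow Table 11.1.

```python
def screening_metrics(a, b, c, d):
    """Return sensitivity, specificity, PPV and NPV in percent.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives (notation of Table 11.1).
    """
    return {
        "sensitivity": 100 * a / (a + c),
        "specificity": 100 * d / (b + d),
        "ppv": 100 * a / (a + b),
        "npv": 100 * d / (c + d),
    }


# Counts from the ELISA example: 60 TP, 50 FP, 20 FN, 70 TN
elisa = screening_metrics(a=60, b=50, c=20, d=70)
for name, value in elisa.items():
    print(f"{name}: {value:.1f}%")
```

Rounded to whole percentages, the results agree with the values worked out above (75, 58, 55 and 78 percent).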


Relationship between Sensitivity, Specificity, PPV and NPV

Sensitivity and NPV

Sensitivity and negative predictive value are positively correlated
(an increase in one will increase the other). If the test is more sensitive, it
is less likely that an individual with a negative result will have the
disease, so the negative predictive value is greater.

Specificity and PPV

Specificity and positive predictive value are directly correlated (i.e. an
increase in one will increase the other). If the test is more specific, it
is less likely that an individual with a positive test will be free from
disease, so the positive predictive value is greater.


The Effect of Disease Prevalence

Sensitivity and specificity are independent of the prevalence of disease
because they are test specific (they describe how well the screening test
performs against the gold standard).
Positive predictive value (PPV) and negative predictive
value (NPV) depend on disease prevalence because they are
population specific. Both PPV and NPV give information on how
well a screening test will perform in a given population with known
prevalence. Prevalence is directly related to PPV and inversely related to
NPV; thus a higher prevalence will increase the PPV and decrease
the NPV.
Example 1a: In a population of 10,000 with a disease prevalence of
1%, test A has sensitivity = 99% and specificity = 95%.

Disease prevalence 1%   Disease positive   Disease negative   Total
Test (Positive)         99                 495                594
Test (Negative)         1                  9405               9406
Total                   100                9900               10,000

$$\text{PPV} = \frac{99}{594} \times 100 = 17\%; \quad \text{NPV} = \frac{9405}{9406} \times 100 = 99.99\%$$

With the same sensitivity, specificity and population size, what is
the effect on the test's positive predictive value (PPV) if the
prevalence changes? See Example 1b.
Example 1b: In a population of 10,000 with a disease prevalence of
5%, test A has sensitivity = 99% and specificity = 95%.

Disease prevalence 5%   Disease positive   Disease negative   Total
Test (Positive)         495                475                970
Test (Negative)         5                  9025               9030
Total                   500                9500               10,000

$$\text{PPV} = \frac{495}{970} \times 100 = 51.03\%; \quad \text{NPV} = \frac{9025}{9030} \times 100 = 99.94\%$$



Thus an increase in prevalence from 1 to 5 percent, with the same
level of sensitivity and specificity, has increased the test's positive
predictive value (PPV) from 17 to 51.03 percent and decreased the
negative predictive value (NPV) from 99.99 to 99.94 percent.
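
The dependence of the predictive values on prevalence can be reproduced numerically. The sketch below, in Python, rebuilds the two-by-two table from sensitivity, specificity and prevalence for a population of 10,000, as in Examples 1a and 1b; the function name `predictive_values` is an illustrative choice and not part of the text.

```python
def predictive_values(sensitivity, specificity, prevalence, population=10_000):
    """Return (PPV, NPV) in percent for a test applied to a population.

    sensitivity, specificity and prevalence are given as proportions (0 to 1).
    """
    diseased = population * prevalence
    healthy = population - diseased
    true_pos = sensitivity * diseased
    false_neg = diseased - true_pos
    true_neg = specificity * healthy
    false_pos = healthy - true_neg
    ppv = 100 * true_pos / (true_pos + false_pos)
    npv = 100 * true_neg / (true_neg + false_neg)
    return ppv, npv


# Test A (sensitivity 99%, specificity 95%) at 1% and 5% prevalence
for prevalence in (0.01, 0.05):
    ppv, npv = predictive_values(0.99, 0.95, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {ppv:.2f}%, NPV = {npv:.2f}%")
```

Trying other prevalence values shows the same pattern: with the test characteristics held fixed, PPV rises and NPV falls as the disease becomes more common.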




CHAPTER

12

Basic Statistical Tests

Selection of the correct test is vital when running an analysis in SPSS. The
selection of a test depends on whether:
• Data are qualitative (categorical) or quantitative (continuous)
• Data are unpaired (independent groups) or paired (repeated
  measures)
• The distribution is normal or skewed.

UNPAIRED SAMPLES
In unpaired samples, there is no relation between subjects in group
1 and subjects in group 2 (two independent groups). Suppose data
are collected on ICT skills to compare medical and engineering
students. These are two independent groups. Whenever you are
comparing the mean of a continuous variable in two independent groups
(e.g. medical students and engineering students), an independent
sample t-test is applied.

PAIRED SAMPLES

In paired samples, repeated measures (pre-post test) are taken on the
same subject. For example, if you wanted to determine how much a
student learned in a statistics class, you would do a pre (before) and
post (after) test to determine the impact of the intervention (the
statistics class) on the score.
Whenever a categorical variable (qualitative data) is compared
between two groups, a Chi-square test is used.
Comparing a continuous variable (quantitative data) between
two groups is called comparing two means, and a t-test is applied for
this purpose. For two independent groups, an independent sample
t-test is used.
When a continuous variable (quantitative data) is compared
between two paired groups (pre-post), a paired t-test is applied.

Flow charts 12.1 and 12.2 give different choices of tests for
qualitative and quantitative data.

Flow chart 12.1 Selection of statistical test for qualitative data

Flow chart 12.2 Selection of parametric statistical test for
quantitative data to compare means


Nonparametric Tests

When the assumptions of the parametric tests are not satisfied, i.e. the
data are not normally distributed or the data are collected on fewer
than 30 participants, a nonparametric test is applied (Flow chart
12.3). Nonparametric tests are an alternative to parametric tests.
Chi-square is the most frequently used nonparametric test. Other
nonparametric tests are described below.
The Wilcoxon rank sum test, or Mann-Whitney U test, is the
nonparametric version of the independent sample t-test and can
be used when the assumptions of the parametric test are not satisfied.
Thus, the Mann-Whitney U test is used to compare the medians of two
independent samples when the data are either:
• On an interval scale; or
• On a ranked (ordered) scale.
The test is used to test the hypothesis that two population
distributions do not differ in median (e.g. a null hypothesis comparing
median bicep skinfold thickness of patients with celiac disease and
Crohn's disease would state that the two medians are equal).

Flow chart 12.3 Selection of nonparametric statistical
test for quantitative data to compare means


The Wilcoxon signed rank test, also called the Wilcoxon matched pairs
test, is the nonparametric version of the paired sample t-test and is
used when the data are either:
• On an interval scale; or
• On a ranked scale.
The test is based on the rank of the absolute difference, rather than
the numerical value of the difference (Table 12.1).
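
The ranking step of the signed rank test can be illustrated in code. The sketch below, in Python, ranks the absolute paired differences and sums the ranks of the positive and negative differences separately. It follows the usual conventions of dropping zero differences and giving tied differences their average rank, which are assumptions not spelled out in the text. The function name and the data are invented for illustration.

```python
def signed_rank_sums(first, second):
    """Return (W_plus, W_minus): rank sums of positive and negative differences."""
    # Paired differences; pairs with no difference are dropped.
    diffs = [x - y for x, y in zip(first, second) if x != y]
    ordered = sorted(diffs, key=abs)  # rank by absolute difference
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Find the run of tied absolute differences starting at position i.
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical placebo and drug scores for six participants
placebo = [13, 19, 14, 18, 20, 15]
drug = [16, 11, 12, 13, 19, 15]
w_plus, w_minus = signed_rank_sums(placebo, drug)
print(f"W+ = {w_plus}, W- = {w_minus}")
```

The smaller of the two rank sums is normally taken as the test statistic and compared with tabulated critical values or evaluated by statistical software.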
The Kruskal-Wallis test is the nonparametric version of ANOVA and is
used when the assumptions of the parametric test are not satisfied.
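
The selection rules described in this chapter can be summarized as a small decision function. The sketch below, in Python, is one reading of the text and not a reproduction of the flow charts. The function name and arguments are illustrative, and the use of ANOVA for more than two independent groups is inferred from the statement that Kruskal-Wallis is its nonparametric counterpart.

```python
def choose_test(data_type, paired=False, normal=True, groups=2):
    """Suggest a statistical test following the rules described in the chapter.

    data_type: "qualitative" or "quantitative"
    paired:    True for repeated measures on the same subjects
    normal:    True if the parametric assumptions are satisfied
    groups:    number of groups being compared
    """
    if data_type == "qualitative":
        return "Chi-square test"
    if data_type != "quantitative":
        raise ValueError("data_type must be 'qualitative' or 'quantitative'")
    if groups > 2:
        if paired:
            # Repeated measures on more than two groups are not covered here.
            raise ValueError("paired designs with more than two groups are not covered")
        return "ANOVA" if normal else "Kruskal-Wallis test"
    if paired:
        return "Paired t-test" if normal else "Wilcoxon signed rank test"
    return "Independent sample t-test" if normal else "Mann-Whitney U test"


print(choose_test("quantitative", paired=False, normal=True))
print(choose_test("quantitative", paired=True, normal=False))
print(choose_test("quantitative", normal=False, groups=3))
```

A function like this is only a memory aid; the final choice should still be checked against the study design and the distribution of the data.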

WHAT ARE VALIDITY AND RELIABILITY IN RESEARCH FINDINGS?

Validity and reliability are illustrated in Figures 12.1A to D.
Validity means that your scientific observations actually measure
what they intend to measure (your conclusions are true).
Reliability means that someone else using the same method in
the same circumstances should be able to obtain the same findings
(your findings are repeatable).
Reliability (repeatability) refers to the possibility of replicating
(repeating) the observations and is related to the precision of the
instrument used for scientific observations. Validity refers to the
soundness of the observations and to the accuracy of the data
collected by the research method/instrument.


Table 12.1: Wilcoxon signed rank test

Participant ID   Placebo   Drug   Difference (Placebo - Drug)
                 13        16     -3
                 19        11


Figures 12.1A to D (A) Neither valid nor reliable. The research method does
not measure the research outcome (not valid) and repeated attempts are unfocused; (B) Reliable but not valid. The research method does not measure
the research outcome (not valid), but repeated attempts get almost the same
(wrong) results; (C) Fairly valid but not very reliable. The research method
measures the research outcomes fairly closely, but repeated attempts have
very scattered results (not reliable); (D) Valid and reliable. The research
method precisely measures the research outcomes, and repeated attempts
produce similar results



CHAPTER

13

Overview of Data
Collection Techniques


Data collection techniques allow us to systematically collect
information about our objects of study (people, objects, phenomena)
and about the settings in which they occur.


DIFFERENT DATA COLLECTION TECHNIQUES

• Using available information
• Observing
• Interviewing (face-to-face)
• Administering written questionnaire
• Focus group discussion
• Projective techniques
• Mapping and scaling.

Using Available Information

Usually, there is a large amount of information/data that has
been collected by some other source but has not been analyzed or
published. For example, information collected at a Primary Health
Care Center can be analyzed to determine the proportion of different
diseases in an area and the age groups affected by them. The
advantage of using existing information is that it is a very inexpensive
method; however, the data may be incomplete or too disorganized.

Observing
Observation is a technique that involves systematically selecting,
watching and recording the behavior and characteristics of living
beings, objects or phenomena. Observations can be open (e.g. observing
a health worker during his/her routine activities) or concealed (e.g.
mystery clients trying to obtain antibiotics without medical prescription).
This method gives additional, more accurate information on the
behavior of people than interviews or questionnaires. It can also be used
to check the information collected through interviews, especially on
sensitive topics such as alcohol use or the behavior of people towards
patients with a stigmatizing disease.

Interviewing
Interviewing involves oral questioning of respondents. Answers to the
questions posed during an interview are either written down or
recorded by a tape recorder, or both techniques may be used.
An unstructured method of asking questions is frequently used in
exploratory studies where the investigator has, as yet, little
understanding of the problem, or if the topic is sensitive.

Questionnaire

A written questionnaire, also known as a self-administered
questionnaire, is a data collection tool in which questions are presented
that are to be answered by the respondents themselves in written form.
A questionnaire comprises a formal, written set of closed-ended/
open-ended questions that are asked of every respondent in the
study. It provides an objective means of collecting information (data)
related to the exposure/outcome of interest as well as on confounders or
effect modifiers.

Types of Questions

• Open-ended questions: Open-ended questions are those questions that
  solicit additional information from the respondent. They are also called
  infinite response or unsaturated type questions. By definition, they are
  broad and require more than one or two word responses. These
  types of questions are of use in the conduct of qualitative research.
• Closed-ended questions: Closed-ended questions are those
  questions which can be answered finitely, for example by either "yes"
  or "no". They are also called dichotomous or saturated type questions.
  In quantitative research, closed-ended questions are used as much
  as possible.


Ways of Administration of Questionnaire

• Mail
• Telephone
• Via computer
• Interviewer.

Important Points in Designing a Questionnaire


• The information obtained by each question should be specific to
  the information you need in your analysis. Therefore,
  before you compose any question, think through your research
  questions/objectives and also think about how you will conduct your
  analysis.
• Ensure that the format of the questionnaire is attractive and easy
  for the respondents to fill in; overcrowding or cluttering of
  questions should be avoided. All pages and questions
  should be clearly numbered.
• The questionnaire should never be too long. In general, questions
  should be short and to the point (around 12 words or less).
• Only information relevant to the objective should be solicited; the
  proforma/questionnaire should not resemble a history sheet.
• Be careful about responses of "neutral" or "no opinion" versus "do
  not know".
• Questions concerning major areas should be grouped together.
• Simple questions about age, birth date, etc. should be put at the
  beginning to warm up the respondent.
• Each question should ask for only one piece of information.
• Question wording should ensure that every respondent will
  be answering the same thing, so avoid ambiguous wording or
  wording that means different things to different respondents. Also
  avoid terms for which the definition can vary (if this is unavoidable,
  provide the respondent with a definition).
• Questions should preferably be closed-ended; possible answers
  to closed-ended questions should be lined up vertically, preceded by
  boxes, brackets or numbers.
  Example: How many different medicines do you take daily (check
  one)?
  [ ] None
  [ ] 1-2
  [ ] 3-4
  [ ] 5-6
  [ ] 7 or more
• If more details are required pertaining to a question, the
  filter/skip technique should be used to save time and allow
  respondents to avoid irrelevant questions.
  Example: Have you ever been told that you have hypertension?
  Yes
  No
  If yes, proceed to the next question:
  How long ago were you told that you have hypertension?
• Always choose an appropriate means of measurement, e.g. scores/
  scales.
  Example: Two words that are often used inappropriately are
  "frequently" and "regularly". A poorly designed question might read,
  "I frequently engage in exercise", and offer a Likert scale giving
  responses from "strongly agree" through to "strongly disagree".
  But "frequently" implies frequency, so a frequency-based rating
  scale (with options such as "at least once a day", "twice a week", and
  so on) would be more appropriate.
• Sensitive questions should be left for the end.
• Using a previously validated and published questionnaire will
  save you time and resources, so if similar research instruments
  are available it may be a good idea to review them and borrow questions.
• If questions are to be asked in any language other than English,
  always try to ensure that they are also written in that language.


Focus Group Discussion


Focus group discussion allows a group of 8-10 informants to freely
discuss a certain subject with the guidance of a facilitator or reporter.

Projective Techniques
When a researcher uses projective techniques, the researcher asks an
informant to react to some kind of visual or verbal stimulus.
An example is the presentation of a hypothetical question, an
incomplete sentence or a case study (a story with a gap) to an
informant. The researcher then asks the informant to complete the
sentence in writing, such as:
• If I were to discover that my neighbor had tuberculosis, I would
  suggest to him ----------
• If my wife were having labor pains, I would do ----------
• If my child had diarrhea, I would give him ----------

Mapping and Scaling


Mapping is a valuable technique for displaying relationships and
resources. In a water supply project, for example, mapping is
invaluable. It can be used to present the placement of wells, the
distance of the homes from the wells, other water systems, etc. It
gives researchers a good overview of the physical situation and may
help to highlight relationships hitherto unrecognized.
Scaling is a technique that allows researchers, through their
respondents, to categorize certain variables that they would not be
able to rank themselves. For example, they may ask their informants
to bring certain types of herbal medicine and to arrange
these into piles according to their usefulness. The informants would
then be asked to explain the logic of their ranking.
Mapping and scaling are used as techniques in rapid appraisals
or situation analysis. Rapid appraisal is an approach often
used in health systems research.




CHAPTER

14

Data Analysis Plan

Development of a research process is a cyclical process. The double-headed arrows indicate
that the process is never linear.


IMPORTANCE OF DATA ANALYSIS PLAN


Preparation of a plan for data processing and analysis will provide
you with better insight into the feasibility of the analysis to be
performed as well as the resources that are required. It also provides
an important review of the appropriateness of the data collection
tools for collecting the data you need. That is why you have to plan for
data analysis before the pretest. When you process and analyze the
data you collect during the pretest you will spot gaps and overlaps
which require changes in the data collection tools before it is too late!


WHAT SHOULD THE PLAN INCLUDE?


When making a plan for data processing and analysis the following
issues should be considered:
• Sorting data
• Performing quality-control checks
• Data processing
• Data analysis.

Sorting Data

An appropriate system for sorting the data is important for facilitating
subsequent processing and analysis.
If you have different study populations (for example, doctors,
paramedical staff and medical students), you would number the
questionnaires for each population separately.
In a comparative study, it is best to sort the data right after
collection into the two or three groups that you will be comparing
during data analysis. For example, in a study of the use of sedatives
by doctors, users and nonusers would be the two basic categories.
In a study of the reasons why doctors object to being posted in
rural areas, rural and urban doctors would be the basic categories.
In a case-control study, the cases are to be compared with the
controls. It is useful to number the questionnaires belonging to each
of these categories separately, right after they are sorted.


Performing Quality-Control Checks


Usually the data have already been checked in the field to ensure
that all the information has been properly collected and recorded.
Before and during data processing, however, the information should
be checked again for completeness and internal consistency.
If a questionnaire has not been filled in completely, you will have
missing data for some of your variables. If many data are missing
in a particular questionnaire, you may decide to exclude the
whole questionnaire from further analysis.

Data Processing: Quantitative Data

The data from questionnaires can be processed and analyzed either:
• Manually, using data master sheets or manual compilation of the
  questionnaires, or
• By computer, for example, using a microcomputer and existing
  software or self-written programs for data analysis.
Data processing in both cases involves:
• Categorizing the data
• Coding
• Summarizing the data in data master sheets, manual compilation
  without master sheets, or data entry and verification by computer.
Answers that are difficult or impossible to categorize may be
put in a separate residual category called "others", but this category
should not contain more than 5 percent of the answers obtained.
If the data will be entered in a computer for subsequent processing
and analysis, it is essential to develop a coding system.
For computer analysis, each category of a variable can be coded
with a letter, group of letters or word, or be given a number. For
example, the answer "yes" may be coded as Y or 1; "no" as N or 2
and "no response" or "unknown" as U or 9.
The codes should be entered on the questionnaires (or checklists)
themselves. When finalizing your questionnaire, for each question
you should insert a box for the code in the right margin of the page.
These boxes should not be used by the interviewer. They are only
filled in afterwards during data processing. Take care that you have
as many boxes as the number of digits in each code.

For example:
Yes (or positive response)     code-Y or 1
No (or negative response)      code-N or 2
Do not know                    code-D or 8
No response/unknown            code-U or 9
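
A coding system of this kind is easy to apply consistently by computer. The sketch below, in Python, maps raw answers to the numeric codes listed above. The dictionary `CODES`, the function `code_answer`, and the choice to treat blank answers as "no response" are illustrative assumptions.

```python
# Numeric codes from the example above
CODES = {
    "yes": 1,
    "no": 2,
    "do not know": 8,
    "no response": 9,
}


def code_answer(answer):
    """Map a raw answer to its numeric code; blank answers become 'no response'."""
    key = (answer or "").strip().lower()
    if not key:
        return CODES["no response"]
    if key not in CODES:
        # Unexpected answers are flagged rather than silently coded.
        raise ValueError(f"unexpected answer: {answer!r}")
    return CODES[key]


raw_answers = ["Yes", "no", "", "Do not know", None]
coded = [code_answer(answer) for answer in raw_answers]
print(coded)
```

Rejecting unexpected answers, instead of guessing a code for them, doubles as a simple quality-control check during data entry.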


A number of computer programs are available on the market that
can be used to process and analyze research data. The most widely
used programs are:
• Epi Info (version 6), a very user-friendly program for data
  entry and analysis, which also has a word processing function
  for creating questionnaires (developed by the Centers for Disease
  Control and Prevention, Atlanta, USA and the World Health
  Organization, Geneva),
• LOTUS 1-2-3, a spreadsheet program (from the Lotus
  Development Corporation),
• dBase (version III plus or IV), a data-management program (from
  Ashton-Tate), and
• SPSS, which is a quite advanced Statistical Package for Social
  Sciences (SPSS Inc.).
If you intend to use a computer, you may ask advice from
an experienced person concerning which program is the most
appropriate for your type of data. Note that Epi Info may be freely
used and copied. All the other programs are copyrighted.


Data Analysis: Quantitative Data


Analysis of quantitative data involves the production and
interpretation of frequencies, tables, graphs, etc., that describe the
data.
After deciding on a data entry format, the information on the
data collection instrument will have to be coded (e.g., Male: M or 1,
Female: F or 2). During data entry, the information relating to each
subject in the study is keyed into the computer in the form of the
relevant code (e.g., if the first subject, identified as 001, is a male
(code 1) aged 25, the data could be keyed in as 001125).
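
The keyed-in record in this example is a fixed-width string: three digits for the subject number, one for the sex code and two for the age. The sketch below, in Python, encodes and decodes such a record. The field widths are inferred from the single example 001125, and the function names are illustrative.

```python
def encode_record(subject_id, sex_code, age):
    """Build a fixed-width record: 3-digit ID, 1-digit sex code, 2-digit age."""
    if not (0 <= subject_id <= 999 and 0 <= sex_code <= 9 and 0 <= age <= 99):
        raise ValueError("value does not fit the fixed-width layout")
    return f"{subject_id:03d}{sex_code:d}{age:02d}"


def decode_record(record):
    """Split a six-character record back into its fields."""
    if len(record) != 6 or not record.isdigit():
        raise ValueError("record must consist of exactly six digits")
    return {
        "subject_id": int(record[0:3]),
        "sex_code": int(record[3]),
        "age": int(record[4:6]),
    }


record = encode_record(subject_id=1, sex_code=1, age=25)
print(record)
print(decode_record(record))
```

A two-digit age field cannot hold ages of 100 or more, which is one reason to fix the field widths before data entry begins.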
The computer can do all kinds of analysis and the results can be
printed. It is important to decide whether each of the tables, graphs,
and statistical tests that can be produced makes sense and should
be used in your report. That is why we plan the data analysis
beforehand!
• Frequency counts: From the data master sheets, simple tables can
  be made with frequency counts for each variable. A frequency
  count is an enumeration of how often a certain measurement or a
  certain answer to a specific question occurs.



  For example:

  Smokers       51
  Nonsmokers    93
  Total         144

  If numbers are large enough it is better to calculate the frequency
  distribution in percentages (relative frequencies): 51/144 × 100 =
  35 percent are smokers and 93/144 × 100 = 65 percent are nonsmokers.
  This makes it easier to compare groups than when only absolute
  numbers are given. In other words, percentages standardize the
  data.
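
A frequency count with relative frequencies takes only a few lines of code. The sketch below, in Python, uses the smoking figures from the example; the function name `frequency_table` is illustrative.

```python
from collections import Counter


def frequency_table(values):
    """Return {category: (count, percent)} for a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (n, 100 * n / total) for category, n in counts.items()}


# 51 smokers and 93 nonsmokers, as in the example above
answers = ["smoker"] * 51 + ["nonsmoker"] * 93
table = frequency_table(answers)
for category, (n, percent) in table.items():
    print(f"{category}: {n} ({percent:.0f}%)")
```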
• To group numerical data, divide the range into three to five
  categories. You can either aim
  at having a reasonable number of observations in each category (e.g.
  0-2 km, 3-4 km, 5-9 km, 10+ km for home-clinic distance) or you can
  define the categories in such a way that they are each equal in size
  (e.g. 20-29 years, 30-39 years, 40-49 years, etc.).
• Construct a table indicating how data are grouped and count the
  number of observations in each group.
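
Grouping a numerical variable into categories and counting the observations in each group can be sketched as follows in Python. The category limits are those of the home-clinic distance example. The distances themselves are invented, and distances are assumed to be recorded in whole kilometers.

```python
from bisect import bisect_right
from collections import Counter

# Lower limits and labels of the distance categories in the example
LOWER_LIMITS = [0, 3, 5, 10]
LABELS = ["0-2 km", "3-4 km", "5-9 km", "10+ km"]


def distance_category(km):
    """Return the category label for a distance given in whole kilometers."""
    if km < 0:
        raise ValueError("distance cannot be negative")
    return LABELS[bisect_right(LOWER_LIMITS, km) - 1]


# Hypothetical home-clinic distances for eight respondents
distances = [1, 4, 7, 12, 0, 3, 9, 2]
grouped = Counter(distance_category(km) for km in distances)
for label in LABELS:
    print(f"{label}: {grouped[label]}")
```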
• Cross-tabulations: Further analysis of the data usually requires
  the combination of information on two or more variables in order
  to describe the problem or to arrive at possible explanations for it.
  For this purpose it is necessary to design cross-tabulations.
Depending on the objectives and the type of study, two major
kinds of cross-tabulations may be required:
1. Descriptive cross-tabulations that aim at describing the
   problem under study.
2. Analytic cross-tabulations in which groups are compared in
   order to determine differences, or which focus on exploring
   relationships between variables.
A descriptive cross-tabulation (Table 14.1) would, for example,
relate smoking behavior to sex or occupational background.
The males appear to be smoking more (43%) than the females (28%).
Table 14.1: Smoking by sex

Sex        Smoking     Not smoking   Total
Males      31 (43%)    41 (57%)      72 (100%)
Females    20 (28%)    52 (72%)      72 (100%)
Total      51 (35%)    93 (65%)      144 (100%)
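
A cross-tabulation such as Table 14.1 can be built from individual records by counting pairs of values. The sketch below, in Python, reconstructs the cell counts and row percentages of the table; the record layout and the function name `cross_tabulate` are illustrative assumptions.

```python
from collections import Counter


def cross_tabulate(records):
    """Count (row, column) pairs and compute percentages within each row."""
    counts = Counter(records)
    row_totals = Counter()
    for (row, _column), n in counts.items():
        row_totals[row] += n
    percentages = {
        (row, column): 100 * n / row_totals[row]
        for (row, column), n in counts.items()
    }
    return counts, percentages


# One (sex, smoking status) pair per respondent, matching Table 14.1
records = (
    [("male", "smoking")] * 31
    + [("male", "not smoking")] * 41
    + [("female", "smoking")] * 20
    + [("female", "not smoking")] * 52
)
counts, percentages = cross_tabulate(records)
for cell in sorted(counts):
    print(cell, counts[cell], f"{percentages[cell]:.0f}%")
```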



An analytic cross-tabulation serves to investigate whether there is a
relationship between smoking (independent variable) and persistent
cough or chest complaints (dependent variables/problems).
Of the informants with a cough, the majority (77%) smoke,
whereas among those without a cough, fewer than one-third (31%) are
smokers. The expected relationship between smoking and chest
problems seems confirmed.
When the plan for data analysis is being developed, the data are, of
course, not yet available. However, in order to visualize how the
data can be organized and summarized it is useful at this stage to
construct so-called dummy cross-tabulations.
A dummy table contains all the elements of a real table, except that
the cells are still empty. In a research proposal, dummy tables should
be prepared to describe the study population and to show the
crucial relationships between variables.
For the study exploring the relationship between smoking
behavior and persistent cough, a table should be constructed with the
layout of Table 14.2.
Some practical hints when constructing tables:
• If a dependent and an independent variable are cross-tabulated,
  the headings of the dependent variable are usually placed
  horizontally (Table 14.2: cough and no cough), and the
  headings of the independent variable vertically (smoking and
  not smoking in the same table).
• All tables should have a clear title and clear headings for all rows
  and columns.
• All tables should have a separate row and a separate column for
  totals to enable you to check if your totals are the same for all
  variables and to make further analysis easier.
• All tables related to a certain objective should be numbered and
  kept together so the work can be easily organized and the writing
  of the final report will be simplified.

Table 14.2: Smoking in relation to persistent cough over the past 2 weeks

Smoking behavior   Cough       No cough     Total
Smoking            10 (77%)    41 (31%)     51 (35%)
Not smoking        3 (23%)     90 (69%)     93 (65%)
Total              13 (100%)   131 (100%)   144 (100%)

To further analyze and interpret the data, certain calculations or
statistical procedures must usually be completed. Especially in large
cross-sectional surveys and in comparative studies, statistical
procedures are necessary if the data is to be adequately interpreted.
Statistical tests should, for example, indicate whether the gender
differences in smoking behavior are true differences or due to chance.
When conducting such studies it is advisable to consult a person with
statistical knowledge right from the start.
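As an illustration of such a test, the sketch below applies a Pearson chi-square test (without continuity correction) to the counts in Table 14.1. The choice of this particular test is an assumption made for illustration; the text does not prescribe one. The function name `chi_square_2x2` is hypothetical, and the p-value uses the exact relation for one degree of freedom.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Table 14.1: males 31 smoking / 41 not smoking; females 20 smoking / 52 not smoking.
chi2, p_value = chi_square_2x2(31, 41, 20, 52)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")
```

For these counts the statistic is about 3.67, just below the 5 percent critical value of 3.84 for one degree of freedom, so under this test the observed difference between males and females would not quite reach conventional significance. This is the kind of judgment for which statistical advice is useful.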

Processing and Analysis of Qualitative Data

Qualitative data may be collected through open-ended questions in
self-administered questionnaires, in individual interviews or focus
group discussions or through observations during fieldwork. For a
detailed description of the analysis of qualitative data see Module
10C and in particular Module 23, which specify the methods most
often used. Here we will concentrate on the analysis of responses
obtained from open-ended questions in interviews or
self-administered questionnaires.
Commonly solicited data in open-ended questions include:
Opinions of respondents on a certain issue;
Reasons for a certain behavior; and
Descriptions of certain procedures, practices or perceptions with
which the researcher is not familiar.
The data can be analyzed in seven steps:
Step 1: Take a sample of (say 20) questionnaires and list all answers for
a particular question. Take care to include the source of each answer
you list (in the case of questionnaires you can use the questionnaire
number), so that you can place each answer in its original context, if
required.
Step 2: To establish your categories, you first read carefully through
the whole list of answers. Then you start giving codes (A, B, C, for
example, or keywords) for the answers that you think belong together
in one category, and write these codes in the left margin. Use a pencil
so that it is easy to change the categories if you change your mind.

Step 3: List the answers again, grouping those with the same code
together.
Step 4: Then interpret each category of answers and try to give it a label
that covers the content of all answers. In the case of data on opinions,
for example, there may be only a limited number of possibilities,
which may range from (very) positive, neutral, to (very) negative.
Data on reasons may require different categories depending on
the topic and the purpose of your question. In the exercise below
you will be asked to categorize the reasons why people smoke by
grouping them in such a way that it is easy to find entry points for
health education aimed at reducing smoking.
After some shuffling you usually end up with 5 to 7 categories.
Step 5: Now try the next batch of 20 questionnaires and check if the
labels work. Adjust the categories and labels, if necessary.
Step 6: Make a final list of labels for each category and give each label
a code (keyword, letter or number).
Step 7: Code all your data, including what you have already coded,
and enter these codes in your master sheet or in the computer.
Note again that you may include a category "others", but that it
should be as small as possible, preferably used for less than 5 percent
of the total answers. If you categorize your responses to open-ended
questions in this way you can:
• Analyze the content of each answer given in particular categories,
  for example, in order to plan what actions should be taken (e.g.,
  for health education). Gaining insight into a problem, or into
  possible interventions for a problem, is the most important function
  of qualitative data.
• Report the number and percentage of respondents that fall into
  each category, so that you gain insight into the relative weight of
  different opinions or reasons.
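Once every answer has been coded (Step 7), the number and percentage of respondents per category can be tallied as in the sketch below. The codes and their distribution are invented for illustration, and `OTHER` is an assumed label for the residual category.

```python
from collections import Counter

# Hypothetical codes assigned to open-ended answers (one code per respondent).
coded_answers = ["A", "B", "A", "C", "A", "B", "OTHER", "C", "A", "B",
                 "A", "C", "B", "A", "A", "B", "C", "A", "B", "A"]

def summarize(codes):
    """Return {code: (count, percentage)} ordered by descending count."""
    total = len(codes)
    tally = Counter(codes)
    return {code: (n, 100 * n / total) for code, n in tally.most_common()}

summary = summarize(coded_answers)
for code, (n, pct) in summary.items():
    print(f"{code}: {n} ({pct:.0f}%)")

# Flag the residual category if it reaches the 5 percent guideline.
others_share = summary.get("OTHER", (0, 0.0))[1]
if others_share >= 5:
    print("Consider splitting 'OTHER' into more specific categories.")
```

The final check mirrors the advice that the residual category should stay below 5 percent of all answers; if it does not, the categories from Steps 2 to 5 probably need another revision.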
Questions that ask for descriptions of procedures, practices, or
beliefs usually do not provide quantifiable answers (though you may
quantify certain aspects of them). The answers rather form part of a
jigsaw puzzle that you have to put together in order to obtain insight
into the problem/topic under study.


CHAPTER 15

Synopsis Writing

A research proposal/synopsis and research protocol are synonymous
terms and can be used interchangeably. Development of a research
proposal is the first step taken prior to initiation of any research
project. It describes precisely the importance of the area of
research, the research questions/hypothesis behind the research and
how it will be carried out.
A research proposal is basically a research plan with well-defined,
measurable outcomes that an investigator aims to follow to achieve
the research objectives. A good research proposal is vital for
successful research. All research must begin with a clearly focused
research proposal. In recent years there has been an enormous
dissemination of research culture; thus formulation of an excellent
research proposal has become necessary, not only for ensuring a high
quality of research but also for reasons like attracting a research
grant. A research proposal must be precise and convincing, with
feasibility of execution (whether the researcher can actually carry
out the research) as the ultimate test.
Importantly, a research proposal must incorporate a properly
formulated hypothesis and a good analytical plan.
The components of a research proposal are outlined as follows:
• Title of the research project
• Project summary
• Statement of the problem
• Justification and use of the results
• Theoretical framework
• Research objectives (general and specific)

METHODOLOGY
• Operational definitions
• Type of study and general design
• Universe of study, sample selection and size, unit of analysis and
  observation, selection criteria
• Proposed intervention
• Data collection procedures, instruments used and methods for
  data quality control
• Procedures to ensure ethical considerations in research with
  human subjects

PLAN FOR ANALYSIS OF RESULTS
• Methods and models of data analysis
• Programs to be used for data analysis (e.g. SPSS, SAS)
TITLE/TOPIC

The title of a synopsis/research protocol should reflect the
objective(s) of the proposed study concisely and clearly. The title
must provide the keywords for classification and indexing of the
research project. If your study is a clinical trial being carried out
on children with ear infections using an antibiotic X, then these key
features should be reflected in the title. Your title should say "Role
of antibiotic X in children with ear infections: a clinical trial".
It is important to include the keywords in the title, as this helps
readers to identify whether the article is relevant to them. Because
articles are searched through keywords, these keywords help the
reader to locate articles of interest. For example, if the reader
enters the keywords (antibiotic X, children, ear infection), the
search engine of an electronic repository (e.g. PubMed) will yield
all articles containing these keywords.

INTRODUCTION
An introduction is the most important part of the research protocol
and it should come across strongly, like thunder, to grab the
reader's attention. It is here that the researcher lets the reviewer
know how the proposed research differs from what other people have
done. In a research protocol the onus is on the researcher to tell the
reviewer how important the study is going to be. If the reviewer comes
from a specialty different from that of the researcher, the reviewer
might not know the relevant details of the topic of interest. It is
the responsibility of the researcher to give the precise facts about
the subject, and this is best done with statistics. For example, in a
research proposal from a nephrologist on end-stage renal disease it
would be a good idea to state how many thousands of patients are
suffering from end-stage renal disease and how many billions of
dollars are being spent on it each year. To a non-nephrologist
reviewer, these statistics on the burden of disease and cost of
illness would be enough to convey the importance and significance of
the subject. The first paragraph, which is about how big the problem
is, should contain these statistics so as to make a loud, thunderous
introduction.
An introduction should ideally comprise four paragraphs, addressing
what is known about the problem, what is not known about the problem,
the existing gaps in scientific knowledge and how the study will help
to fill them, and the strengths of the study planned. It is important
to present some statistics from local and international populations
to convey the seriousness and magnitude of the problem of interest.
An example is given below.

A Template of a First Paragraph


End-stage renal disease (ESRD) is a significant clinical and public
health problem. In 1999, the prevalent ESRD population approached
350,000. The total annual cost of care of ESRD in the US was
estimated to be $17.9 billion in 1999. Despite this high cost, there
were 68,000 ESRD deaths reported in 1999.
The second paragraph should include what is not known about
the problem. In this paragraph one should focus on novel ideas, gray
areas, or controversial subjects in that area, because these are the
areas that people would like to know more about. Hence, in this
paragraph you would choose one of the three things mentioned above,
that is, a novel idea, a gray area, or a controversial area. This
will convince the reviewer that your study is going to be a useful
addition to whatever is available in the literature.

A Template of a Second Paragraph


In a study of patients with end-stage renal disease (ESRD), the
researcher wanted to investigate further the issue of late referral
(LR) to a nephrologist and its associated subsequent outcome,
mortality, which was a controversial one. Some previous studies had
shown LR to have worse outcomes compared to early referral (ER),
while others had not.
As shown in the above example, the researcher has identified a
controversial subject among ESRD patients: the issue of late versus
early referral. In the second paragraph he has highlighted the
controversy and given a justification of why his study is still
important. This matters because the reviewer is always preoccupied
with the question "Why is your study important?"

Third Paragraph

The third paragraph should point out the existing gaps in scientific
knowledge and how the present study will contribute to filling
them.
A template of a third paragraph: Using the same study as an example,
this is how the researchers made their point. The outcome, mortality,
has been a controversial issue in LR vs ER among end-stage renal
disease (ESRD) patients. The previous studies on this subject have
been single-center studies with a sample size of a few hundred
patients only.
Our study will be carried out on a generalizable United States
population of about 350,000 dialysis patients recruited from all
states of the US. Our study will also be using a novel statistical
technique called propensity score analysis (PS analysis). PS analysis
is a proxy for randomization. Thus using a PS analysis will make the
study as good as a randomized controlled clinical trial. Hence, our
study is going to make an effort to settle this controversy regarding
LR vs ER in a robust fashion.

Fourth Paragraph

The fourth paragraph should give details about the rationale of the
study planned. Thus a clear emphasis must be placed on why this study
is important.
A template of a fourth paragraph: The outcome (i.e. mortality)
associated with late vs early referral has been a controversial subject
and has generated immense debate among researchers. There
is a lack of consensus among researchers on whether late referral is
associated with worse outcomes than early referral. Thus a more valid
study is required, with a well-developed research plan and a robust
statistical analysis technique on a large dataset, to correctly
answer this controversial issue.
We feel that our study will be a unique study, different from the
other studies on this subject because we will use the novel technique
of PS analysis in a nationally representative sample of ESRD patients
to examine this issue in a more robust fashion. The PS analysis is a
proxy for randomization in observational studies and will be used to
balance the covariates in ER and LR groups.

Research Objectives
A research objective is a statement that clearly depicts the goal to be
achieved by a research project. In other words, the objectives of a
research project summarize what a study plans to achieve.
The formulation of objectives will help you to:
Focus the study (narrowing it down to essentials)
Avoid the collection of data which are not strictly necessary for
understanding and solving the problem you have identified (to
establish the limits of the study)
Organize the study in clearly defined parts or phases.
Properly formulated, specific objectives will facilitate the
development of your research methodology and will help to orient
the collection, analysis, interpretation and utilization of data.
Objectives should be stated using action verbs that are specific
enough to be measured:
Examples: To determine, To compare, To verify, To calculate,
To describe, etc.
Do not use vague non-action verbs such as:
To appreciate, To understand, To believe.
An objective is a statement of what the researcher wants to determine
and should be stated in clear, measurable terms. While developing
a research protocol, a researcher must ensure that each research
objective matches the hypothesis and the data analysis plan.
Moreover, a researcher can have as many objectives as the study can
feasibly achieve.
Table 15.1 gives an example of specific aims/objectives for a study
looking at the impact of socioeconomic factors on outcomes among
kidney transplant recipients, submitted to the National Institutes of
Health. The point worth noticing in this synopsis is that the
objectives match well with the hypothesis and the analytical plan
for each objective.
Material and methods: The methodology explains the procedures
that will be used to achieve the objectives. The methodology of a
research project is the core of the study. Components of a research
design that should be addressed in the methodology section of a
research proposal are:
Operational definition
Hypothesis
Variables
Research methods or techniques
Sampling method
Plan for data collection
Plan for analysis of data and interpretation of the results
Staffing, supplies and equipment (covered in detail in Budget
and plan for data collection and analysis section)
Ethical considerations.

Operational Definition

It is the definition of the exposure and outcome variables of interest
in the context of the objectives of a particular study, together with
their means of measurement/determination.
Consider a researcher who wishes to do a study on anemia in patients
with chronic kidney disease (CKD). He has to give an operational
definition of anemia in his study. This should not be a textbook
definition of anemia; rather, it should state what anemia means in
this particular study. For example, he could state that anemia in
this study is defined as hemoglobin less than 11 g/dL. This cut-off
of 11 g/dL should ideally come from an internationally recognized
body like the WHO or the National Kidney Foundation.
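An operational definition is precise enough to be written as a rule that classifies every participant in the same way. The sketch below encodes the example cut-off of 11 g/dL; the function name `is_anemic` and the hemoglobin values are hypothetical.

```python
ANEMIA_HB_CUTOFF_G_DL = 11.0  # cut-off quoted in the text for this example study

def is_anemic(hemoglobin_g_dl, cutoff=ANEMIA_HB_CUTOFF_G_DL):
    """Apply the study's operational definition: hemoglobin below the cut-off."""
    return hemoglobin_g_dl < cutoff

# Hypothetical hemoglobin values (g/dL) for five patients.
hemoglobin_values = [9.8, 11.0, 12.4, 10.9, 13.1]
flags = [is_anemic(hb) for hb in hemoglobin_values]
prevalence = 100 * sum(flags) / len(flags)
print(flags, f"prevalence = {prevalence:.0f}%")
```

Note that a value of exactly 11.0 g/dL is not classified as anemia under a "less than 11 g/dL" definition; stating the boundary explicitly avoids ambiguity during data analysis.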
Take another example: a study to compare the effectiveness of
dressing A and dressing B in patients presenting with infected wounds
of the foot. An outcome variable should be easily measurable. From
the objective alone it is not clear what will be deemed effective and
how effectiveness will be measured, so effectiveness should be
defined in clear, measurable terms. A dressing could be defined as
effective if granulation tissue is present on clinical examination
on the 7th postoperative day.
Hypothesis: Statistical hypotheses (null and alternative), where
required, should be appropriately framed in terms of the objectives
(please see the hypotheses for specific aim 1 and specific aim 2 in
Table 15.1).
Table 15.1: Formulation of specific aims, hypotheses, and statistical
analysis plan for synopsis writing: A template

Title of the study
Impact of socioeconomic factors on outcomes among kidney transplant
recipients (KTR)

Specific Aim 1
Evaluate the prevalence of complications of chronic kidney disease
(CKD) (anemia, malnutrition, hyperlipidemia, abnormal
calcium-phosphorus metabolism), and comorbid conditions (hypertension,
diabetes, cardiovascular) among kidney transplant recipients

Hypothesis/Rationale (for Specific Aim 1)
The prevalence of complications of CKD (anemia, malnutrition,
hyperlipidemia, abnormal calcium-phosphorus metabolism) and comorbid
conditions (diabetes, hypertension, cardiovascular disease) is high
among kidney transplant recipients

Statistical analysis (for Specific Aim 1)
• Descriptive statistics and/or frequency distributions of continuous
  variables and of categorical variables will be obtained. The
  prevalence of malnutrition, various levels of anemia, dyslipidemia,
  abnormal calcium-phosphorus metabolism, presence of diabetes
  mellitus, presence of cardiac comorbidity, and hypertension will be
  determined at baseline and yearly intervals post-kidney transplant
• The prevalence will be determined separately by marital status,
  education level, employment status, and race

Specific Aim 2
Determine the influence of complications of CKD (anemia,
malnutrition, hyperlipidemia, abnormal calcium-phosphorus
metabolism), comorbid conditions (hypertension, diabetes,
cardiovascular), and socioeconomic factors (decreased access to care)
on mortality among KTR

Hypothesis/Rationale (for Specific Aim 2)
Complications of CKD (anemia, malnutrition, hyperlipidemia, abnormal
calcium-phosphorus metabolism), comorbid conditions (hypertension,
diabetes, cardiovascular), and decreased access to care are
associated with increased mortality among KTR

Statistical analysis (for Specific Aim 2)
• Descriptive statistics and/or the proportion of deaths among kidney
  transplant recipients will be determined for overall deaths and by
  specific causes
• The proportion of deaths in year 1 post-transplant, and then at each
  post-transplant year will be determined
• The proportion of deaths will be determined separately by marital
  status, education level, employment status, and race
• Analytic technique: Time at risk will be calculated as the time in
  days from the date of transplant to the earliest of return to
  dialysis, death, transfer, loss to follow-up or end of study
  (12/31/05)
• Cox proportional hazards regression (CPHR) model: CPHR models will
  be used to examine the independent contributions of anemia,
  hypoalbuminemia and hyperlipidemia, comorbid conditions, and
  socioeconomic characteristics to mortality among KTR
• Survival analysis: Kaplan-Meier curves will be developed to examine
  the differences by marital status, education level, employment
  status, and race as well as by different hematocrit (Hct) levels
  (<30, 30–32.9, 33–35.9, >36), albumin levels (<3.5, >3.5), stages
  of CKD, cholesterol levels (<240, >240), and SF-36 scores (< and >
  median or mean score of study population), and will be compared
  using the log-rank test
• Dependent variable/outcome: All-cause mortality
• Independent variables: Age, gender, race, presence of DM,
  socioeconomic factors (type of insurance, employment status,
  marital status, education level, language), comorbidity
  (hypertension, cardiovascular disease), serum albumin, hematocrit
  (as a continuous variable; categorical <30, 30-<33, 33-<36, >36),
  hyperlipidemia (yes vs no), abnormal calcium-phosphorus metabolism
  (yes vs no), ACE-Inhibitors (yes vs no), antihypertensives (yes vs
  no), lipid-lowering drugs (yes vs no), rHuEPO use (yes vs no),
  calcineurin inhibitor (yes vs no), cyclosporine vs tacrolimus,
  antimetabolite (MMF vs AZA), HLA matching, type of transplant
  (living vs cadaveric), delayed graft function (yes vs no), number
  of rejection episodes in year 1 post-transplant
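Two of the derived variables named in the analysis plan of Table 15.1, time at risk and hematocrit category, can be sketched as follows. This is an illustrative sketch, not the investigators' code: the function names and dates are hypothetical, a hematocrit of exactly 36 is assumed to belong to the top category, and a patient is counted as having died only when death is the event that ends follow-up.

```python
from datetime import date

END_OF_STUDY = date(2005, 12, 31)  # end-of-study date given in Table 15.1

def time_at_risk(transplant, death=None, return_to_dialysis=None,
                 transfer=None, lost_to_follow_up=None,
                 end_of_study=END_OF_STUDY):
    """Return (days at risk, died); follow-up stops at the earliest event."""
    candidates = [d for d in (death, return_to_dialysis, transfer,
                              lost_to_follow_up, end_of_study)
                  if d is not None]
    stop = min(candidates)
    died = death is not None and death == stop
    return (stop - transplant).days, died

def hematocrit_category(hct):
    """Assign the hematocrit category used in the analysis plan."""
    if hct < 30:
        return "<30"
    if hct < 33:
        return "30-<33"
    if hct < 36:
        return "33-<36"
    return ">=36"  # assumption: a value of exactly 36 falls in the top group

# Hypothetical patient: transplanted 1 Jan 2004, returned to dialysis 1 Jul 2004.
days, died = time_at_risk(date(2004, 1, 1),
                          return_to_dialysis=date(2004, 7, 1),
                          death=date(2005, 3, 1))
print(days, died, hematocrit_category(32.9))
```

In this example the patient is censored at return to dialysis, so the later death does not count as an event within the time at risk.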

Variables: A variable is a measurable characteristic of a person,
object or phenomenon that can take on different values. A simple
example of a variable is a person's age. The variable age can take on
different values because a person can be 20 years old, 35 years old,
and so on.
The variable that is used to describe or measure the problem
under study is called the dependent variable. It represents the
output or effect, or is tested to see if there is an effect. A dependent
variable is also known as a response variable, outcome variable,
or output variable.
The variables that are used to describe or explain the differences
in the dependent variable, or that cause changes in the dependent
variable, are called the independent (exposure) variables. They
represent the inputs or causes, or are tested to see if they are the
cause. An independent variable is also known as a predictor variable,
explanatory variable, or exposure variable.

Numerical and Categorical Variables

The values of some variables (e.g. age, number of children, monthly
income) are expressed in numbers; we call them numerical variables.
Some variables are expressed in categories. For example, the
variable gender has two distinct categories, male and female. Since
these variables are expressed in categories, we call them categorical
variables.
Study design: The selection of an appropriate study design is essential
for any study. The study design should match the specific aims of the
study. It is the foundation or cornerstone of any research project. If
the study design is not appropriate, the study will not be able to
yield valid and reliable results. The type of study design chosen
depends on the:
Type of problem under investigation
Knowledge already available about the problem
Resources available for the study.

Sampling Method

Sampling is the process of selecting a finite number of elements
from a given population of interest for purposes of inquiry. A
researcher can use either a probability or a nonprobability
sampling technique after considering the cost, resources available
and practicability.
Large-scale descriptive studies almost always use probability
sampling techniques. Intervention studies sometimes use probability
sampling but also frequently use nonprobability sampling.
Qualitative studies almost always use nonprobability samples.
Probability sampling techniques are preferred by researchers
as they maximize the external validity or generalizability of the
results of the study, while nonprobability sampling techniques may
introduce selection bias into the research.

Inclusion and Exclusion Criteria


It is important for a researcher to have predefined inclusion and
exclusion criteria regarding which participants will be included
in the study. Including subjects who should not be included, or
excluding participants who should have been included, would make
the findings less valid. It is also important to mention where
the participants will be enrolled (e.g. private clinics, tertiary care
hospitals, rural settings, etc.), that is, the study setting.
For example, a study was planned to evaluate the efficacy of
parenteral iron. As parenteral iron has a teratogenic effect in the
first and second trimesters, all pregnant women in the first and
second trimesters were excluded. Moreover, parenteral iron is
indicated when the Hb level falls below 8 g/dL, thus only those
pregnant women in the third trimester whose Hb level was below
7 g/dL were included.
Note: A word of caution for junior researchers, who often tend
to write a few inclusion criteria and then write the opposite in the
exclusion criteria. For example, in a study of an adult population
they would write "all participants who are aged 18 and above will
be included" and then, in the exclusion criteria, "children will
be excluded". This is bad practice and must be avoided.

Duration of Study
It is also important to state the time period during which the
data will be collected. For example, all participants who attend the
outpatient diabetic clinics of XYZ hospital from 1st January 2012 to
31st December 2013 will be included in the study.

Sample Size Calculations


Estimating an appropriate sample size is an important determinant of
the accuracy of the results and is vital to avoid a type II error.
Calculation of sample size depends on the level of significance
(normally 0.05), the power (should be at least 80%), and estimates
taken from reference studies. For detailed reading on sample size
calculation, see Chapter 10.
Example: This is an example of the sample size calculation done
for a study looking at the association of betel nut and oral cancer in
Pakistan (Fig. 15.1).

Software
The sample size calculation was done using the WHO software for
Sample Size Calculation edited by Lemeshow S and Lwanga SK.

Reference Study

The reference study used for this sample size calculation is:
Charité, Virchow Klinikum et al. Betel quid chewing, oral cancer
and other oral mucosal diseases in Vietnam. J Oral Pathol Med.
2008 Oct;37(9):511-4. Epub 2008 Jul 8. The values obtained from the
reference study are P1 = 0.30 (30% of the controls in the reference
study were consuming betel quid, a product similar to ghutka) and
P2 = 0.70 (70% of the cases in the reference study were consuming
betel quid). These two numbers, 30 percent and 70 percent, were
entered into the WHO sample size software.
According to the proportions of exposure in cases and controls
in the above study, the calculated sample size is 38 cases and 38
controls (Fig. 15.1). The adequacy of this sample size was confirmed
using the WHO software for sample size calculation.

Figure 15.1 Sample size calculation for two population proportions
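For readers without access to the WHO software, the widely used textbook formula for comparing two proportions with equal group sizes can be sketched as below. This is not the WHO program itself: the function name is hypothetical, and the significance level and power entered in Fig. 15.1 are not stated in the text, so the values used here are assumptions.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided test of H0: p1 == p2 (equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# P1 = 0.30 (controls) and P2 = 0.70 (cases) come from the reference study;
# the power values below are assumed for illustration.
n_80 = sample_size_two_proportions(0.30, 0.70, power=0.80)
n_95 = sample_size_two_proportions(0.30, 0.70, power=0.95)
print(n_80, n_95)
```

With a two-sided significance level of 0.05, this formula gives 24 per group at 80 percent power and 38 per group at 95 percent power. The latter agrees with the figure of 38 reported above, although that agreement alone does not establish which settings were actually used.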

In many studies the sample size is inflated by 5 to 10 percent
to account for nonresponse or loss to follow-up of subjects. Thus in
this study, the sample size was calculated as 76 (38 cases and 38
controls), but to account for nonresponse it was inflated by
10 percent to 84 (42 cases and 42 controls).
Also, when a multivariate regression analysis is planned (to look at
factors associated with a certain outcome), the sample size should
take into account the number of independent variables (risk factors)
being studied. For example, in the previous study, factors associated
with the risk of developing oral cancer were examined using
multivariate regression analysis. The sample size calculated was 42
cases and 42 controls. If this study had 8 independent variables, the
sample size should have been at least 80 cases and 80 controls (10
cases and 10 controls for each independent variable).
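The two adjustments described above are simple arithmetic and can be checked with a short sketch. The function names are hypothetical; the inflation is applied per group and rounded up, so 38 × 1.10 = 41.8 becomes 42 per group (84 in total).

```python
import math

def inflate_for_nonresponse(n_per_group, percent):
    """Increase the per-group sample size by the given whole-number percentage."""
    return math.ceil(n_per_group * (100 + percent) / 100)

def minimum_for_regression(n_variables, per_variable=10):
    """Rule of thumb from the text: 10 cases and 10 controls per variable."""
    return n_variables * per_variable

base_per_group = 38                                     # from the calculation above
inflated = inflate_for_nonresponse(base_per_group, 10)  # 41.8 rounded up
regression_minimum = minimum_for_regression(8)          # 8 independent variables
required_per_group = max(inflated, regression_minimum)
print(inflated, regression_minimum, required_per_group)
```

Taking the larger of the two figures reflects the point made in the text: a sample that is adequate for the main comparison may still be too small for a regression model with many independent variables.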
Data collection: Data collection is the most important part of any
research. Initially, the researcher must make it clear whether
primary data will be collected or secondary data will be used for
the research.
Primary data is first-hand information collected from the study
participants, usually through a data collection form or proforma.
In the data collection portion of the research protocol, an
investigator must make clear the variables on which data will be
collected, including demographic, socioeconomic, laboratory and
outcome variables. The demographic variables include age, gender,
race, ethnicity, marital status, etc. Moreover, the researcher should
state which outcome variable or variables (e.g. mortality,
hospitalization, length of stay, quality of life, etc.) will be
collected. It is also important to mention whether a validated tool
(e.g. SF-36 for quality of life) will be used or not. In many cases a
translated version of a validated tool is used, which must also be
mentioned. In a cross-sectional study the data is collected at one
point in time, whereas in a follow-up study such as a cohort study
the data is collected on multiple occasions. Thus the investigator
must specify at what time points (e.g. week 1, 4, 8, etc.) the data
will be collected. A detailed proforma must be attached, preferably
as an appendix because of the limited word count or number of pages
in a synopsis.
In the case of secondary data, the investigator must make clear from
which databases or registry the data has been extracted. The original
database contains a large number of variables, and not all of them
may be of use to the researcher, so the researcher extracts the
variables of interest from the database. Moreover, large databases
are often divided into various sections. The researcher must also
specify which variable was obtained from which section of the
database.
For example, a study looking at the factors associated with
mortality among dialysis patients used a database, the United States
Renal Data System (USRDS). This is a large registry of dialysis patients
being dialyzed in all states of the US. The information on patients'
demographics, labs and comorbid conditions was obtained from
a patient questionnaire (the DMMS Wave II), while the data on
mortality was obtained from the patients file.

Ethical Concerns
Ethical concerns are of paramount importance for any research.
The researcher must obtain informed consent in the local
language from all the participants. The purpose of the research,
intervention to be given, potential benefits and harms, voluntary
participation, healthcare cost, etc. must be explained in detail to
all study participants. It is also important to protect the rights of
vulnerable groups (e.g. children, mentally ill people, etc.). If
children are to be included in the study, consent from a guardian is
essential. A translated version of the informed consent form must be
attached as an appendix. It is the duty of the researcher to ensure
that the anonymity of the participants will be maintained throughout
the research. Moreover, the confidentiality of participants'
responses must also be maintained during research. The researcher
must make sure that appropriate data protection policies are adopted,
so that no unauthorized person has access to confidential data
collected from study participants. Finally, the researcher must
ensure that the study is conducted in accordance with the guidelines
of the Declaration of Helsinki, and if deemed necessary an approval
from the local ethical review board should be obtained. All these
details must be included in the ethical consideration portion of the
methodology.


Data Analysis
Descriptive Analysis
The data analysis usually begins with the descriptive analysis, which
describes the characteristics of the population/sample being studied.
It is usually presented in research studies as shown in Table 15.2.
A widely accepted wording for the descriptive analysis, when the
study describes one sample/population, is:
A descriptive statistical analysis of continuous and categorical
variables will be performed. Data on continuous variables will be
presented as mean ± SD and data on categorical variables will be
presented as proportions.
Please note that there is no p-value column in Table 15.2, as no
comparison is being made.
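The descriptive statement above can be illustrated with a minimal Python sketch. The patient records, the variable names and the helper functions `mean_sd` and `proportions` are all invented for illustration; they are not taken from any study, and a real analysis would normally be run in a statistical package.

```python
import statistics

# Hypothetical sample records (values invented for illustration only)
patients = [
    {"age": 54, "gender": "M"},
    {"age": 61, "gender": "F"},
    {"age": 47, "gender": "M"},
    {"age": 70, "gender": "F"},
    {"age": 58, "gender": "M"},
]

def mean_sd(values):
    """Return (mean, sample standard deviation) of a continuous variable."""
    return statistics.mean(values), statistics.stdev(values)

def proportions(values):
    """Return {category: percentage} for a categorical variable."""
    n = len(values)
    return {cat: 100 * values.count(cat) / n for cat in sorted(set(values))}

age_mean, age_sd = mean_sd([p["age"] for p in patients])
gender_pct = proportions([p["gender"] for p in patients])

# Continuous variable reported as mean ± SD, categorical as percentages
print(f"Age (in years): {age_mean:.1f} ± {age_sd:.1f}")
for category, pct in gender_pct.items():
    print(f"Gender {category}: {pct:.0f}%")
```

Each continuous row of a table such as Table 15.2 would be filled by a call like `mean_sd`, and each categorical row by `proportions`.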
If a comparison is to be made between two groups, the values of
each variable must be calculated for both groups, with a p-value
indicating whether the groups differ (Table 15.3).
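As a hedged sketch of how the p-values in a two-group comparison might be obtained, the Python code below compares a continuous variable with a Welch t statistic and a categorical variable with a Pearson chi-square test on a 2x2 table. Several things here are assumptions: the data are invented, the function names are hypothetical, and the p-value for the t statistic uses a normal approximation that is reasonable only for large samples (a statistical package would use the t distribution). The chi-square p-value is exact for 1 degree of freedom.

```python
import math
import statistics

def welch_t(group1, group2):
    """Welch t statistic with a large-sample (normal) two-sided p-value."""
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    se = math.sqrt(v1 / len(group1) + v2 / len(group2))
    t = (m1 - m2) / se
    # Normal approximation (assumption): 2 * (1 - Phi(|t|))
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of chi-square with 1 degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Invented serum albumin values (g/dL); tiny samples only for readability
albumin_anemia = [3.1, 3.4, 3.0, 3.6, 3.2]
albumin_no_anemia = [3.9, 4.1, 3.8, 4.2, 4.0]
t_stat, p_albumin = welch_t(albumin_anemia, albumin_no_anemia)

# Invented counts: males/females among 50 anemic and 50 non-anemic patients
chi2, p_gender = chi_square_2x2(30, 20, 20, 30)

print(f"Albumin: t = {t_stat:.2f}, p = {p_albumin:.3g}")
print(f"Gender: chi-square = {chi2:.2f}, p = {p_gender:.3f}")
```

The two p-values would go in the last column of a table laid out like Table 15.3.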
Ideally, a statistical analysis should include various types of
analyses, such as cross-tabulations, linear regression, multivariate
regression analysis and survival analysis. New researchers are
strongly encouraged to include these analyses where they suit the
research question, as they add depth to the research. Examples of
some of the analyses mentioned above are given here.

Association Between Two Variables


An association between two variables that appear to be related can
be studied in two ways: cross-tabulation and linear regression.
Cross-tabulation is used when the researcher wants to determine
the association between two categorical variables, while linear
regression is used when the researcher wants to determine the
association between two continuous variables (Tables 15.4 and 15.5).
Table 15.4 is a hypothetical cross-tabulation table. The researcher
would like to study the proportion of patients with various Hb levels
in each creatinine category. Hb has been categorized as <10, 10-11,
11-12 and >12 g/dL. Patients in the <10 and 10-11 categories are
anemic, while patients in the 11-12 and >12 categories are not
anemic.


Table 15.2: Baseline characteristics of patients with chronic kidney
disease (CKD): hypothetical table

Patient characteristics                          Mean ± SD or %
Age (in years)
Gender
  Male
Race
  Caucasians
  African-American
  Asian
  Other
Insurance
  Private
  Health Maintenance Organization (HMO)
  Medicare
  Medicaid
  None
Comorbidity index
  Zero
  One
  Two
  Three
Cause of CRI
  Diabetes mellitus
  Hypertension
  GN/PKD/IN
  Other
Laboratory values
  Serum creatinine (mg/dL)
  GFR (mL/min/1.73 m²)
  BUN (mg/dL)
  Serum albumin (g/dL)
  Hematocrit (Hct) (%)



Table 15.3: Comparison of characteristics of patients of CKD with and
without anemia: A hypothetical table of descriptive comparative analysis

Variable                     CKD patients        CKD patients        p-value
                             with anemia         without anemia
                             (Mean ± SD or %)    (Mean ± SD or %)
Age (in years)
Gender
  Male
Race
  Caucasians
  African-American
  Asian
  Other
Insurance
  Private
  Health Maintenance Organization (HMO)
  Medicare
  Medicaid
  None
Comorbidity index
  Zero
  One
  Two
  Three
Cause of CRI
  Diabetes mellitus
  Hypertension
  GN/PKD/IN
  Other
Laboratory values
  Serum creatinine (mg/dL)
  GFR (mL/min/1.73 m²)
  BUN (mg/dL)
  Serum albumin (g/dL)
  Hematocrit (Hct) (%)


Table 15.4: The prevalence of various levels of hemoglobin (Hb) at
different serum creatinine levels: A hypothetical table of a
cross-tabulation

                   Cr <2     Cr 2-3    Cr 3-4    Cr 4-5    Cr >5
                   mg/dL     mg/dL     mg/dL     mg/dL     mg/dL
Hb >12 g/dL
Hb 11-12 g/dL
Hb 10-11 g/dL
Hb <10 g/dL

Table 15.5: Factors associated with anemia in CKD patients:
Multivariate analysis (a hypothetical table)

Characteristics                              OR (95% CI)    p-value
Age (per 1 year increase)
Male (ref = females)
Whites (ref = non-Whites)
Diabetes (ref = no)
Hypertension (ref = no)
GFR (per 1 mL/min increase)
Serum creatinine (per 1 mg/dL increase)
Serum albumin (per 1 g/dL increase)

The creatinine categories are Cr <2, Cr 2-3, Cr 3-4, Cr 4-5 and
Cr >5 mg/dL. Patients in the <2 and 2-3 categories have mild CKD,
while patients in the 4-5 and >5 categories have moderate to severe
CKD.
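The categorization and cross-tabulation described for Table 15.4 can be sketched in Python as follows. The patient values are invented, the function names are hypothetical, and the handling of boundary values (for example, whether an Hb of exactly 11 g/dL falls in 10-11 or 11-12) is an assumption, because the text does not specify it.

```python
from collections import Counter

def hb_category(hb):
    """Hb band in g/dL; the treatment of exact boundaries is an assumption."""
    if hb < 10:
        return "Hb <10"
    if hb < 11:
        return "Hb 10-11"
    if hb <= 12:
        return "Hb 11-12"
    return "Hb >12"

def cr_category(cr):
    """Creatinine band in mg/dL; boundary handling is an assumption."""
    if cr < 2:
        return "Cr <2"
    if cr < 3:
        return "Cr 2-3"
    if cr < 4:
        return "Cr 3-4"
    if cr <= 5:
        return "Cr 4-5"
    return "Cr >5"

# Invented (Hb g/dL, creatinine mg/dL) pairs, for illustration only
patients = [(13.1, 1.2), (12.4, 1.6), (11.5, 1.9), (11.8, 2.4), (10.6, 2.8),
            (10.2, 3.5), (9.4, 4.3), (9.1, 5.6), (8.7, 6.2), (12.6, 1.4)]

# Count patients in each (Hb band, creatinine band) cell
crosstab = Counter((hb_category(hb), cr_category(cr)) for hb, cr in patients)

def column_percent(hb_cat, cr_cat):
    """Percentage of patients in a creatinine column who fall in hb_cat."""
    column_total = sum(n for (_, c), n in crosstab.items() if c == cr_cat)
    return 100 * crosstab[(hb_cat, cr_cat)] / column_total

print(crosstab[("Hb >12", "Cr <2")], "patients with Hb >12 and Cr <2")
print(f"{column_percent('Hb >12', 'Cr <2'):.0f}% of the Cr <2 column has Hb >12")
```

Column percentages are used here because the stated aim is the proportion of patients at each Hb level within each creatinine category.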
Figure 15.2 shows a cross-tabulation between two variables,
hematocrit and creatinine, published in the American Journal of
Kidney Diseases. The X-axis shows the creatinine categories and the
Y-axis shows the hematocrit categories. It can be seen in Figure 15.2
that the category with creatinine less than 2 (mild CKD) has a high
proportion of patients with good hematocrit values, while in the
category with creatinine greater than 5 more patients have low
hematocrit values and only a few have good hematocrit values.


Figure 15.2 Association between hematocrit and creatinine
(Source: Kazmi WH et al. Am J Kidney Dis. 2001;38:803-12)

Note: The hypothetical table for this cross-tabulation was conceived
well before the analysis was carried out and can be seen in
Table 15.4.
In the above example of cross-tabulation, two continuous variables,
hematocrit and creatinine, were first stratified into categories.
Two continuous variables that seem to be associated can also be
studied with a linear regression (Figs 15.3A and B). In Figure 15.3A,
hematocrit and GFR have been studied, while in Figure 15.3B,
hematocrit and creatinine have been studied. Figure 15.3A shows a
positive correlation between hematocrit and glomerular filtration
rate (GFR): higher GFR values go with higher hematocrit values
(Fig. 15.4). Figure 15.3B shows a negative correlation between
hematocrit and creatinine: higher creatinine values go with lower
hematocrit values.
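A minimal Python sketch of a simple linear regression between two continuous variables is given below. The GFR, creatinine and hematocrit values are invented to mimic the expected directions only; they are not the published data, and the function name is hypothetical.

```python
import math

def linear_regression(x, y):
    """Least-squares slope, intercept and Pearson r for paired values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Invented values, for illustration only
gfr = [15, 25, 35, 45, 55, 65]            # mL/min/1.73 m²
hct_by_gfr = [28, 31, 33, 36, 38, 41]     # hematocrit (%)
creatinine = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # mg/dL
hct_by_cr = [41, 38, 36, 33, 31, 28]      # hematocrit (%)

slope_gfr, intercept_gfr, r_gfr = linear_regression(gfr, hct_by_gfr)
slope_cr, intercept_cr, r_cr = linear_regression(creatinine, hct_by_cr)

print(f"Hct vs GFR: slope = {slope_gfr:.2f}, r = {r_gfr:.2f}")
print(f"Hct vs creatinine: slope = {slope_cr:.2f}, r = {r_cr:.2f}")
```

A positive slope and r correspond to the direction anticipated for hematocrit and GFR, and a negative slope and r to the direction anticipated for hematocrit and creatinine.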
Note: Figures 15.4 and 15.5 are hypothetical figures of linear
regression analysis between hematocrit and GFR, and between
hematocrit and creatinine. The direction of each plot is based on
the anticipation of the researcher. These hypothetical figures were
conceived at the synopsis stage. At that stage the researcher does
not yet have the data, but has in mind what the association between
the two continuous variables should look like (Figs 15.4 and 15.5).
The actual associations between hematocrit and GFR, and between
hematocrit and creatinine, can be seen in Figures 15.3A and B, which
are from a published study by Kazmi et al.

Figures 15.3A and B Relationship of hematocrit to renal function:
Linear regression between hematocrit and (A) GFR and (B) creatinine

Figure 15.4 Hypothetical figure showing expected association between
hematocrit and GFR

Figure 15.5 Hypothetical figure looking at the expected association
between hematocrit and creatinine

Multivariate Regression Analysis


This type of analysis is performed to determine the risk factors
associated with a certain disease/outcome. Such analyses are done in
analytical studies (such as case-control and cohort studies). The
outcome variable (dependent variable) and the independent variables
(risk factors) should be precisely identified. Table 15.5 is a
hypothetical table from a study looking at the factors associated
with the outcome (anemia) among patients with chronic kidney disease
(CKD).
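For a binary outcome such as anemia, the adjusted odds ratios in a table like Table 15.5 are commonly obtained from a multivariable logistic regression. The sketch below fits one by plain gradient ascent, using only the Python standard library. It is an illustration under stated assumptions, not the method of any particular study: the counts are invented (and chosen so that the true adjusted odds ratios are exactly 4 for diabetes and 2 for male sex), the function name, learning rate and number of iterations are arbitrary choices, and confidence intervals and p-values are not computed. In practice a statistical package would be used.

```python
import math

def fit_logistic(rows, outcomes, learning_rate=0.5, n_iter=3000):
    """Fit logistic regression by gradient ascent on the mean log-likelihood.

    rows: list of predictor tuples; outcomes: list of 0/1 values.
    Returns [intercept, beta_1, ..., beta_k].
    """
    k = len(rows[0])
    n = len(rows)
    beta = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for x, y in zip(rows, outcomes):
            z = beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability
            grad[0] += y - p
            for j, xj in enumerate(x):
                grad[j + 1] += (y - p) * xj
        beta = [b + learning_rate * g / n for b, g in zip(beta, grad)]
    return beta

# Invented counts: (diabetes, male) -> (number anemic, total patients)
cells = {(0, 0): (10, 30), (1, 0): (20, 30), (0, 1): (15, 30), (1, 1): (24, 30)}

# Expand the counts into one row per patient
rows, outcomes = [], []
for predictors, (anemic, total) in cells.items():
    rows += [predictors] * total
    outcomes += [1] * anemic + [0] * (total - anemic)

beta = fit_logistic(rows, outcomes)
# exp(coefficient) is the odds ratio adjusted for the other predictor
odds_ratios = {name: math.exp(b)
               for name, b in zip(["diabetes", "male"], beta[1:])}

for name, value in odds_ratios.items():
    print(f"{name}: adjusted OR = {value:.2f}")
```

Each exponentiated coefficient corresponds to one row of the OR column in Table 15.5.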
References: All resources used should be referenced appropriately.
The recommended referencing styles are Vancouver and Harvard.
All references should be verified.
Work schedule or timeline: A work schedule is a table that summarizes
the tasks to be performed in the research project and the duration
of each activity (Fig. 15.6).


Figure 15.6 Work schedule and timeline for researcher

Appendix: The appendix must include:
CV of researchers
CV of supervisor
Previously published articles
Information on institutional affiliations of researchers
Sample of data collection instrument
Informed consent form
Letters of endorsement for the study.
Logistics:
A description of the resources and facilities available for the study
Any anticipated difficulty
A brief management plan
A realistic budget.


CHAPTER 16

Dissertation Writing

A dissertation is a detailed discourse on a subject, especially one
submitted for a higher degree at a university [Oxford].

STEPS IN WRITING A DISSERTATION

Format of Dissertation
Part-1
  Title page
  Supervisor's certificate
  Dedication
  Acknowledgement
  Table of contents
  List of tables
  List of figures, graphs, illustrations
  List of abbreviations
Part-2 (about 70-100 pages)
  Abstract
  Introduction
  Review of literature
  Objective(s) of study
  Operational definitions
  Hypothesis
  Material and Methods
  Result
  Discussion
  Conclusion(s)
  References (Bibliography)
  Annexures (Proforma, etc.)


TITLE
It should highlight the key features of the study.

TABLE OF CONTENTS
Include the headings and subheadings with their page numbers.

TITLE PAGE
It includes the complete title of the manuscript, the names of the
authors with their highest qualifications, the department or
institution to which they are attached, and the address for
correspondence with telephone and fax numbers, if possible.

ABSTRACT
Structured: All original articles should have a structured abstract,
usually limited to between 150 and 250 words. It should have the
headings objective, study design, setting, subjects, interventions
(if applicable), main outcome measures, results and conclusions.
Keywords: Below the abstract give a few keywords, not more than ten.
These keywords are used in cross-indexing the article and are usually
published with the abstract. Use terms from the Medical Subject
Headings (MeSH) list of Index Medicus, e.g. glomerulonephritis,
paraplegia, infertility. If MeSH terms are not yet available for
recently introduced terms, the terms in current use may be used.
Keywords are included with the structured abstract.

INTRODUCTION
It includes:
Importance of the subject (what is known).
Limitation of previous studies/gray areas/controversies (what is
unknown).
Justification of your study/rationale (based on the above aspects
e.g., gaps in knowledge).
Any special strength of your study.

REVIEW OF LITERATURE
A collective review and critique of the literature should be written
in the candidate's own words (not copied). References should
preferably be from the last 5 years, although older relevant and
historical references can be used. A review of the local as well as
the international literature must be included. The literature cited
must belong to MedLine, ExtraMed or journals approved by the
Pakistan Medical and Dental Council (PMDC).

HYPOTHESIS
It is an expected relationship between the exposure and the outcome.

STUDY OBJECTIVE
Formulate your objective(s) clearly. Remember: "Quality thoughts
precede quality results."

SUBJECTS/MATERIAL AND METHODS

Subjects: These are the patients or persons on whom the study was
done. Their sex, mean age with standard deviation, and other relevant
characteristics should be given. The term subject is replaced by the
term material if the data are noted down directly from laboratory
reports, devices/machinery, or any inanimate object.
Apparatus: It refers to the main device used to make the
observations. This may be laboratory equipment, a surgical procedure,
a questionnaire, or a clinical method: for example, a laboratory
instrument for hemoglobin estimation, a procedure to remove a stone
from the bile duct, a questionnaire developed to assess the effect of
poverty on nutritional status, or clinical criteria to assess the
severity of pain.
Method: It is the procedure of data collection. Mention the study
design, the setting (place) where the study was conducted, and the
procedure of data collection.
Mention the study variables, such as predictor variables, outcome
variables, confounding variables, etc.
Mention the names of the statistical tests and the software program
applied.

RESULTS
Firstly, the demographic profile is shown (e.g. if the study is done
on human subjects, show the different age groups, common areas of
belonging, gender, educational level, different professional cadres,
etc.). Quantitative variables are presented as mean ± standard
deviation. Qualitative variables are shown as proportions (or %).
For graphic representation, show qualitative variables by using
either bar graphs or pie charts, while for quantitative variables a
histogram is appropriate.
Cross-tabulation could also be done. For any study design with a
hypothesis, the variables (exposure and outcome) are compared by
applying either the Chi-square test or Student's t-test. The
Chi-square test is applied for qualitative variables, while the
t-test is applied for quantitative variables. The level of
significance is usually set at 0.05. The odds ratio and the relative
risk (with confidence interval) are calculated if the study design is
case control or cohort, respectively. Further analytical tests such
as correlation, regression and multivariate analysis are applied
where required. (For further detail regarding statistical analysis,
read Chapter 14, Data Analysis Plan, pages 102-111.)
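The odds ratio and relative risk mentioned above can be computed from a 2x2 table, as in the following Python sketch. The counts are invented, the function names are hypothetical, and the 95% confidence intervals use the common log-based (Woolf-type) approximation, which is an assumption about the method and is suitable only when no cell count is small.

```python
import math

def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio with an approximate 95% CI.

    a, b: exposed with / without the outcome
    c, d: unexposed with / without the outcome
    """
    estimate = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of log(OR)
    lower = math.exp(math.log(estimate) - z * se)
    upper = math.exp(math.log(estimate) + z * se)
    return estimate, lower, upper

def relative_risk(a, b, c, d, z=1.96):
    """Relative risk with an approximate 95% CI (same table layout)."""
    estimate = (a / (a + b)) / (c / (c + d))
    se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))  # SE of log(RR)
    lower = math.exp(math.log(estimate) - z * se)
    upper = math.exp(math.log(estimate) + z * se)
    return estimate, lower, upper

# Invented counts: 20 of 100 exposed and 10 of 100 unexposed have the outcome
or_est, or_low, or_high = odds_ratio(20, 80, 10, 90)
rr_est, rr_low, rr_high = relative_risk(20, 80, 10, 90)

print(f"OR = {or_est:.2f} (95% CI {or_low:.2f} to {or_high:.2f})")
print(f"RR = {rr_est:.2f} (95% CI {rr_low:.2f} to {rr_high:.2f})")
```

A confidence interval that includes 1 means that the association is not statistically significant at the 0.05 level.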

DISCUSSION
It should emphasize the salient features of the present findings.
Comparisons should be made, with references, of the variations from
or similarities with the results of previous similar studies, both
national and international. The detailed data should not be repeated
in the discussion. It must be mentioned whether the hypothesis of the
study was rejected or could not be rejected. It is important to
remember to discuss only the points you have highlighted in the
results. The second last paragraph highlights the limitations of your
study. It is a good idea to mention your limitations before they are
pointed out to you by the reviewer. The conclusions of your study
must be based on what you have observed in your results.

OPTIONAL COMPONENTS

They are added only where applicable. These are as follows:
Acknowledgement: if desired, it should be included after the
discussion and before the references.
Letter of undertaking: signed by the main author, it must accompany
all manuscripts.
A sample letter of undertaking is as follows:
This is to confirm that the original/review article/case report
titled ------ written by ------ submitted for publication has not
been published in any other journal and, if accepted for publication
in the requested journal, will not be published in any other medical
journal in Pakistan or overseas.

REFERENCES
Citations in the text should be serially numbered. List the
references in Vancouver style.

ANNEXES
Annexes should be added if they increase the understanding or
evaluation of the study. All annexes should be serially numbered and
referred to at appropriate places in the body of the dissertation.

THE WHOLE MANUSCRIPT/DISSERTATION SHOULD BE IN PAST TENSE

SAMPLE OF TITLE PAGE
Cost incurred on Directly Observed Therapy-Short Course (in terms
of time and money) by tuberculous patients
Dr XYZ
FCPS Student
(2008-2009)
Supervisor:
Dr ABC
Institute
Department
Name of Institution



Supervisor's Certificate (Sample)
I hereby certify that Dr ______________________, having Enrolment
Number ________________________ and RTMC Registration
Number ______________, has been working under my direct
supervision with effect from (date) ____________________________
to (date) __________ in the Department: _________________________
Unit: ________________________________________________________
of Training institution: ________________________________________
in the city of: _________________________________
The enclosed Dissertation titled: _______________________________
____________ was prepared according to the FCPS Dissertation
Guidelines under my direct supervision. I have read the Dissertation
and have found it satisfactory for the FCPS Part II examination in
the subject.
Signature of the Supervisor: ___________________________________
Name of the Supervisor: ______________________________________
Designation: ______________________ Date: _____________________
Official stamp:


CHAPTER 17

Reference Writing

Reference writing is a standardized method of acknowledging the
sources of information and ideas used in a research article,
synopsis, dissertation, assignment, etc. in a way that uniquely
identifies each source. Direct quotations, facts and figures, as well
as ideas and theories, from both published and unpublished works,
must be referenced.
There are many acceptable forms of quoting references. This chapter
provides a brief guide to the Vancouver style only, which is
predominantly used in the medical field. The Vancouver style was
first published by the Vancouver group, which expanded over time and
evolved into the International Committee of Medical Journal Editors
(ICMJE).
It is very important to use the right punctuation and order of
details in a reference. In this style, the journal titles used in the
references are abbreviated according to an authoritative list.
A reference list at the end of the chapter contains the full details
of all the in-text citations. References are necessary to avoid
plagiarism, to verify quotations, and to enable readers to follow up
and read the cited authors' arguments in more detail.

CITING A JOURNAL ARTICLE

Name(s) of Author(s)
Name(s) of author(s) of the article
Where there are six or fewer authors, list all the authors.
Where there are more than six authors, list only the first six and
add et al. (et al. means and others).
Put a comma and 1 space between each name. The last author must have
a full-stop after the initial(s).
Format name(s) of author(s): Surname (1 space) initial(s) (no
spaces or punctuation between initials) (full-stop OR, if further
names follow, comma, 1 space).
Example: Halpern SD, Ubel PA, Caplan AL. Solid-organ
transplantation in HIV-infected patients. N Engl J Med.
2002;347(4):284-7.
As an option, if a journal carries continuous pagination throughout
a volume (as many medical journals do), the month and issue number
may be omitted.
Example: Halpern SD, Ubel PA, Caplan AL. Solid-organ
transplantation in HIV-infected patients. N Engl J Med.
2002;347:284-7.
More than six authors
Example: Rose ME, Huerbin MB, Melick J, Marion DW, Palmer
AM, Schiding JK, et al. Regulation of interstitial excitatory
amino acid concentrations after cortical contusion injury.
Brain Res. 2002;935(1-2):40-6.
Organization as author
Example: Diabetes Prevention Program Research Group.
Hypertension, insulin, and proinsulin in participants with
impaired glucose tolerance. Hypertension. 2002;40(5):679-86.

TITLE OF JOURNAL ARTICLE

Do not use italics or underlining.
Only the first word of the article title (and words that normally
begin with a capital letter) is capitalized.
Format of article title: Title (full-stop, 1 space).
Example: Clinical results in pediatric cochlear implantation.
Format of a title with subtitle: Title (colon, 1 space) subtitle.
Example: Cochlear implantation after meningitis: Does the
post-meningitic deafness etiology influence worse speech
rehabilitation progress?

JOURNAL TITLE

Title of journal (abbreviated)
Abbreviate the title according to the style used in Medline. A list
of abbreviations can be found at: http://www.nlm.nih.gov
Note: No punctuation is used within the abbreviated journal name.
Format title of journal: Journal title abbreviation (1 space).
Examples: J Coll Physicians Surg Pak; Ann King Edward Med Coll.

Year (and Month/Day, if Necessary) of Publication
Abbreviate the month to its first 3 letters.
If the journal has continuous page numbering throughout a volume,
the month/day and issue information can be omitted.
Format year of publication: Year (1 space) month (1 space) day
(semicolon, no space) OR year (semicolon, no space).
Example: 2003 Sep;

Volume Number
If the journal has continuous page numbering through volume,
the month/day and issue information can be omitted.
Format volume of publication: Volume number (no space) issue
number in brackets (colon, no space) OR volume number (colon,
no space).
Example: 4(3):

Page Numbers
Format of page number: Page numbers (full-stop).
Example: 122-9.
Example: 1129-57.
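The formatting rules for a journal article reference can be collected into a short Python sketch. The function names and the argument order are hypothetical, and the sketch covers only the simple cases described above (for instance, it does not handle titles that already end in a question mark or references that include a month).

```python
def format_authors(authors, max_listed=6):
    """Join 'Surname Initials' names; beyond six authors add 'et al.'"""
    if len(authors) > max_listed:
        return ", ".join(authors[:max_listed]) + ", et al."
    return ", ".join(authors) + "."

def abbreviate_pages(first, last):
    """Drop the leading digits of the last page that repeat the first page."""
    first_s, last_s = str(first), str(last)
    if len(first_s) == len(last_s):
        i = 0
        # Always keep at least the final digit of the last page
        while i < len(first_s) - 1 and first_s[i] == last_s[i]:
            i += 1
        last_s = last_s[i:]
    return f"{first_s}-{last_s}"

def format_journal_reference(authors, title, journal, year, volume,
                             first_page, last_page, issue=None):
    """Assemble: Authors. Title. Journal. Year;Volume(Issue):Pages."""
    issue_part = f"({issue})" if issue is not None else ""
    pages = abbreviate_pages(first_page, last_page)
    return (f"{format_authors(authors)} {title}. {journal}. "
            f"{year};{volume}{issue_part}:{pages}.")

reference = format_journal_reference(
    ["Halpern SD", "Ubel PA", "Caplan AL"],
    "Solid-organ transplantation in HIV-infected patients",
    "N Engl J Med", 2002, 347, 284, 287, issue=4)
print(reference)
```

Leaving out the `issue` argument gives the shorter form allowed for journals with continuous pagination throughout a volume.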

CITING A BOOK REFERENCE

Name(s) of author(s), editor(s), compiler(s) or the institution
responsible
Where there are six or fewer authors, list all the authors.
Where there are seven or more authors, list only the first six and
add et al. (et al. means and others).
Put a comma and 1 space between each name. The last author must have
a full-stop after the initial(s).
Format of author(s): Surname (1 space) initial(s) (no spaces or
punctuation between initials) (full-stop OR, if further names
follow, comma, 1 space).


Title of publication and subtitle, if any
Do not use italics or underlining.
Only the first word of journal article or book titles (and words
that normally begin with a capital letter) is capitalized.
Format title of publication: Title (full-stop, 1 space).
Example: Harrison's principles of internal medicine.
Format subtitle of publication: Title (colon, 1 space) subtitle.
Example: Physical pharmacy: Physical chemical principles in
the pharmaceutical.
Edition, if other than first edition
Abbreviate the word edition to edn (do not confuse with editor).
Format of edition: Edition statement (full-stop, 1 space).
Example: 3rd edn.
Place of publication
If the publishers are located in more than one city, cite the
name of the city that is printed first.
Write the place name in full.
If the place name is not well known, add a comma, 1 space
and the state or the country for clarification. For places in the
USA, add after the place name the 2-letter postal code for the
state. This must be in upper case, e.g. Hartford, CT (where
CT = Connecticut).
Format place of publication: Place of publication (colon, 1 space).
Example: New York:
Publisher
The publisher's name should be spelt out in full.
Format name of publisher: Publisher (semicolon, 1 space).
Example: Williams and Wilkins;
Year of publication
Format year of publication: Year (full-stop, add 1 space if page
numbers follow).
Example: 1999.
Example: 2000. p. 12-5.
Page numbers (if applicable)
Abbreviate the word page to p.
Note: Do not repeat digits unnecessarily; abbreviate the last page.
Format of page number: p (full-stop, 1 space) page numbers
(full-stop).
Example: p. 122-9.
Example: p. 1129-57.


OTHER AUTHORS
More than six authors: Give the first six names in full and add et
al. The authors are listed in the order in which they appear on the
title page.
Editor(s): Follow the same method used for authors but add the
word editor or editors in full after the name(s). The word editor
or editors must be in lower case. (Do not confuse with edn used
for edition.)
Example: Millares M, editor. Applied drug information:
strategies for information management. Vancouver, WA:
Applied Therapeutics, Inc.; 1998.
Sponsored by an institution, corporation or other organization
(including pamphlets)
Example: Australian Pharmaceutical Advisory Council.
Integrated best practice model for medication management
in residential aged care facilities. Canberra: Australian
Government Publishing Service; 1997.
Chapter or part of a book to which a number of authors have
contributed
Format of book chapter: Author(s)/editor(s) of chapter. Title
of chapter. In: author(s)/editor(s) of book. Title of book. City of
publication (State or country of publication): Publisher; year.
pages of book chapter.
Example: Porter RJ, Meldrum BS. Antiepileptic drugs. In:
Katzung BG, editor. Basic and clinical pharmacology. Norwalk,
CT: Appleton and Lange; 1995. p. 361-80.

DISSERTATION REFERENCE
Example: Borkowski MM. Infant sleep and feeding: a telephone
survey of Hispanic Americans [dissertation]. Mount Pleasant (MI):
Central Michigan University; 2002.

CITING INTERNET AND OTHER ELECTRONIC SOURCES
This includes software and Internet sources such as websites,
electronic journals and databases. These sources are proliferating,
and the guidelines for citing them are still developing and subject
to change. The following information is based on the recommendations
of the National Library of Medicine.

Journal on the Internet
Format: Author(s) (full-stop after last author, 1 space) Title of
article (full-stop, 1 space) Abbreviated title of electronic journal
(1 space) [serial on the Internet] Publication year (month if
applicable) [cited year month date] (full-stop, 1 space) Volume
number (no space) (Issue number in round brackets if applicable)
(colon, no space) [Page number in square brackets] (full-stop, 1
space) Available from (colon, 1 space) URL address.
Example: Abood S. Quality improvement initiative in nursing
homes: the ANA acts in an advisory role. Am J Nurs [serial on
the Internet]. 2002 Jun [cited 2002 Aug 12]; 102(6):[about 3
p.]. Available from: http://www.nursingworld.org/AJN/2002/june/Wawatch.htm
(If the author is not documented, the title becomes the first
element of the reference.)
Format: Organization name (1 space) [homepage on the Internet]
(full-stop, 1 space) place of publication (colon, 1 space) publisher
of the website (semicolon) published year (1 space) [updated
year month date; cited year month date]. Available from (colon, 1
space) URL address.
Example: Cancer-Pain.org [homepage on the Internet]. New
York: Association of Cancer Online Resources, Inc.; 2000-01
[updated 2002 May 16; cited 2002 Jul 9]. Available from:
http://www.cancer-pain.org/.
In the Vancouver style, a consecutive number is allocated to each
reference as it is cited for the first time in the text. This number
becomes the unique identifier of that source, and if the source is
cited again the same number is repeated. Numbers are inserted to the
right of commas and full-stops, and to the left of colons and
semicolons. Multiple sources can be listed at a single reference
point. The numbers are then separated by commas, and consecutive
numbers are joined with a hyphen, e.g. 2-7. Vancouver uses
superscript numbers, or standard numbers in brackets, in the text,
e.g. 1-4,10,12 as a superscript or (1-4,10,12). Superscript numbers
are preferred in the text.
The references are listed at the end of your dissertation or
synopsis in the same numerical order as cited in the text.
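The numbering rules above can be sketched in Python as follows. The citation keys (such as "kazmi2001") and the function names are invented for illustration; only the numbering logic follows the description in the text.

```python
def number_citations(cited_keys):
    """Assign consecutive numbers in order of first citation.

    A source that is cited again keeps the number it received first.
    """
    numbers = {}
    for key in cited_keys:
        if key not in numbers:
            numbers[key] = len(numbers) + 1
    return numbers

def compress(numbers):
    """Render reference numbers for one citation point.

    Consecutive numbers are joined with a hyphen and the rest are
    separated by commas, e.g. [1, 2, 3, 4, 10, 12] gives '1-4,10,12'.
    """
    nums = sorted(set(numbers))
    if not nums:
        return ""
    parts = []
    start = prev = nums[0]
    for n in nums[1:] + [None]:          # None marks the end of the list
        if n is not None and n == prev + 1:
            prev = n
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = n
    return ",".join(parts)

# Hypothetical order in which sources are cited in a text
order = ["kazmi2001", "halpern2002", "kazmi2001", "rose2002"]
assigned = number_citations(order)

print(assigned)
print(compress([1, 2, 3, 4, 10, 12]))
```

The reference list is then written out in ascending order of the assigned numbers.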


CHAPTER 18

Guidelines for Consent Writing

Informed consent has been recognized as an important component
of research protocols. Procedures of disclosure and consent in
collaborative research have been criticized, as they may not be in
keeping with the cultural norms of developing countries.
The Nuremberg Doctors' Trial (the so-called Medical Case)
following World War II heightened international concern with the
ethical issues surrounding human experimentation. These
proceedings judged medical experiments conducted by the Nazis on
concentration camp prisoners. In 1947, the Nuremberg Code,
the first international code of ethics for research involving human
subjects, was issued. The Nuremberg Code emphasized a strong
commitment to the informed and voluntary consent of research
participants. The World Medical Association's Declaration of
Helsinki, adopted in 1964 and revised several times since,
reiterated concerns for voluntary and informed consent for research.
In 1982, the Council for International Organizations of Medical
Sciences (CIOMS) and World Health Organization (WHO) published
Proposed International Guidelines for Biomedical Research. These
guidelines were developed in response to concerns raised about
the particular circumstances surrounding the implementation of
scientific research in developing countries.

GENERAL ETHICAL PRINCIPLES

All research involving human subjects should be conducted in
accordance with three basic ethical principles, namely respect for
persons, beneficence and justice. It is generally agreed that these
principles, which in the abstract have equal moral force, guide the
conscientious preparation of proposals for scientific studies. In
varying circumstances they may be expressed differently and given
different moral weight, and their application may lead to different
decisions or courses of action. The present guidelines are directed
at the application of these principles to research involving human
subjects.
Respect for persons incorporates at least two fundamental ethical
considerations, namely:
1. Respect for autonomy, which requires that those who are
capable of deliberation about their personal choices should be
treated with respect for their capacity for self-determination.
2. Protection of persons with impaired or diminished autonomy
(vulnerable groups e.g. children/minors, subjects with
psychiatric illness, etc.), which requires that those who are
dependent or vulnerable be afforded security against harm or
abuse.
Beneficence refers to the ethical obligation to maximize benefits
and to minimize harms. This principle gives rise to norms
requiring that the risks of research be reasonable in the light of the
expected benefits, that the research design should be sound, and
that the investigators must be competent to conduct the research
and to safeguard the welfare of the research subjects. Beneficence
further proscribes the deliberate infliction of harm on persons;
this aspect of beneficence is sometimes expressed as a separate
principle, nonmaleficence (do no harm).
Justice refers to the ethical obligation to treat each person in
accordance with what is morally right and proper, to give each
person what is due to him or her. In the ethics of research involving
human subjects the principle refers primarily to distributive
justice, which requires the equitable distribution of both the
burdens and the benefits of participation in research. Differences
in distribution of burdens and benefits are justifiable only if they
are based on morally relevant distinctions between persons;
one such distinction is vulnerability. Vulnerability refers to a
substantial incapacity to protect one's own interests owing to
such impediments as lack of capability to give informed consent,
lack of alternative means of obtaining medical care or other
expensive necessities, or being a junior or subordinate member
of a hierarchical group. Accordingly, special provision must be
made for the protection of the rights and welfare of vulnerable
persons.


Sponsors of research or investigators cannot, in general, be held
accountable for unjust conditions where the research is conducted,
but they must refrain from practices that are likely to worsen unjust
conditions or contribute to new inequities. Neither should they
take advantage of the relative inability of low-resource countries or
vulnerable populations to protect their own interests, by conducting
research inexpensively and avoiding complex regulatory systems of
industrialized countries in order to develop products for the lucrative
markets of those countries.
In general, the research project should leave low-resource
countries or communities better off than previously or, at least, no
worse off. It should be responsive to their health needs and priorities
in that any product developed is made reasonably available to them,
and as far as possible leave the population in a better position to
obtain effective healthcare and protect its own health.
Justice requires also that the research be responsive to the health
conditions or needs of vulnerable subjects. The subjects selected
should be the least vulnerable necessary to accomplish the purposes
of the research. Risk to vulnerable subjects is most easily justified
when it arises from interventions or procedures that hold out for
them the prospect of direct health-related benefit. Risk that does
not hold out such prospect must be justified by the anticipated
benefit to the population of which the individual research subject is
representative.

GUIDELINES FOR DRAFTING AN INFORMED CONSENT FORM
These guidelines are intended to help researchers draft a proper,
acceptable consent form.
All studies involving human subjects should have a properly
drafted consent form. No study should be done on human subjects
without informed consent. Consent should be obtained sufficiently
before the start of the study and at an appropriate time, not when
the subject is under stress (such as around a surgical procedure)
and unable to understand the study.
Consent may be written, verbal or telephonic. Unwritten consent
should be documented and signed by the person taking consent,
and witnessed by a second person.


In the case of children, an assent form from the child and consent
from the parents/guardian are needed.
In the case of a mentally or physically incapacitated subject, consent
should be obtained from the immediate guardian or a relative, such as
a wife or husband, father or mother, brother or sister, etc.
In community studies, community leaders, elders,
local political leaders, religious leaders (in certain cases), and
government officials should be taken into confidence, and
written consent should be obtained.
When a study is done at other locations, such as other hospitals
and clinics, permission from the appropriate authority or physicians
should also be obtained.
The consent form should be in English, Urdu or another local
language as needed. The versions should be equivalent, so that
translating one into another gives the same meaning. The language
should be simple enough to be understood by study subjects, including
those with no schooling or only primary education. Technical terms
should be avoided.
A properly drafted consent form should contain the following
important points:
Information sheet. There should be one paragraph or page
giving information about the nature of the study, its purpose
and need, possible benefits of the study, and procedures to be
carried out on the study subjects.
Possible risks and benefits to the study subjects.
Availability of alternate treatment in case of therapeutic trials.
Voluntary participation without any compulsion, moral or
otherwise, and without any financial incentive or coercion.
However, reimbursement for time and travel may be provided
to study subjects; it should be commensurate with the time
spent and should not be too high.
Right to withdraw from the study at any time without affecting
their rights and treatment.
Confidentiality.
If any specimen is to be stored, the duration of storage and
permission to use it in further research.
Name and contact number of the investigator in case the study
subject wants further clarification or information about the study.
Authorization from study subjects with their signature, thumb
impression, signature of witness, etc.


IMPORTANT NOTES
Studies should not be done at patients' expense.
If any new or additional tests are required by the
study, their cost should be borne by the study.
If a new treatment is compared with an existing, established one,
or two treatment modalities are being evaluated and compared,
the cost of treatment or the difference in cost of treatment should be
borne by the study. In addition, the cost of any expected or unexpected
complication arising from the new treatment should also be
borne by the study.
Studies that are unlikely to produce any significant results
because of faulty design are often considered unethical, as
they waste time and resources. These should
be avoided unless there is a strong justification.

BIBLIOGRAPHY
1. Agard E, Finkelstein D, Wallach E. Cultural Diversity and Informed
Consent. The Journal of Clinical Ethics. 1998;9(2):173-6.
2. Sugarman J, Popkin B, Fortney J, Rivera R. International Perspectives
on Protecting Human Research Subjects. Crystal City, VA: National
Bioethics Advisory Commission Draft, 2000.
3. World Health Organization and Council for International Organizations
of Medical Sciences (WHO-CIOMS). International Ethical Guidelines
for Biomedical Research Involving Human Subjects. Author, Geneva,
1993.


CHAPTER 19

Consent to Participate in Research (Sample)


TITLE OR PARAPHRASED TITLE OF THE STUDY


You are asked to participate in a research study conducted by [names
of PI (and faculty sponsor if the PI is a student)], from the [departmental
affiliation] at Michigan Technological University. [If student, indicate
whether the study is being conducted as part of an undergraduate project,
graduate student project, thesis, or dissertation.] Your participation
in this study is entirely voluntary. Please read the information below
and ask questions about anything you do not understand, before
deciding whether or not to participate.


Optional: You have been asked to participate in this study because
[explain succinctly and simply why the prospective subject is eligible
to participate]. If appropriate, state the approximate number of
subjects involved in the study. State whether there are inclusion
or exclusion criteria for participation (e.g. medical conditions that
would include or exclude a person).


PURPOSE OF THE STUDY

Briefly state what the study is designed to examine, assess, or
establish.

PROCEDURES

If you volunteer to participate in this study, you will be asked to do
the following things:
Describe the procedures chronologically using simple language,
short sentences, and short paragraphs. If there are several procedures
or if they are complex, then use of subheadings may help organize
this section and increase readability.


Define and explain scientific or discipline-specific terms. Use
language appropriate to the population.
If applicable, specify the subject's assignment to study groups,
length of time for participation in each procedure or study activity,
the total length of time for participation, frequency of procedures
and location of the procedures to be done.
If subjects will be recorded (audiotaped, videotaped, digitally),
describe the procedures to be used.
If any study procedures are experimental, clearly identify which
ones.

POTENTIAL RISKS AND DISCOMFORTS

Describe any reasonably foreseeable risks or discomforts, including
physical inconveniences and their likelihood, and explain how these
will be managed. In addition to physiological risks/discomforts,
describe any reasonably foreseeable psychological, social, legal, or
financial risks or harms that might result from participating in the
research.
If there are circumstances in which the researcher may terminate
the study, describe them. (This refers to situations in which the study
itself may be terminated. It is not the same thing as circumstances
in which a specific subject may be withdrawn; this issue is to be
discussed below, if relevant).
In the event of physical and/or mental injury resulting from
participation in this research project, Michigan Technological
University does not provide any medical, hospitalization or other
insurance for participants in this research study, nor will Michigan
Technological University provide any medical treatment or
compensation for any injury sustained as a result of participation in
this research study, except as required by law.

POTENTIAL BENEFITS TO SUBJECTS AND/OR TO SOCIETY

Describe benefits to subjects expected from the research. If the
subject will not benefit directly from participation, clearly state this
fact.
State the potential benefits, if any, to science or society expected
from the research.



Note: Payment or other compensation for participation (e.g. a gift
certificate, extra credit) is not a benefit and is not to be discussed in
this section.

For Biomedical Studies Only: Include the Following Paragraph, if Relevant


Based on experience with this [drug, procedure, device, etc.] in
[animals, patients with similar disorders], researchers believe it may
be of benefit to subjects with your condition [or, it may be as good
as standard therapy but with fewer side effects]. Of course, because
individuals respond differently to therapy, no one can know in
advance if it will be helpful in your particular case. The potential
benefits may include: [describe the anticipated benefits to subjects
resulting from their participation in the research].
If there is no likelihood that participants will benefit directly from
their participation in the research, state this in clear terms. For example:
"You should not expect your condition to improve as a result of
participating in this research" or "This study is not being conducted
to improve your condition or health. You have the right to refuse to
participate in this study."


Payment for Participation (Optional)


State whether the subject will receive payment. If not, delete
this section. If the subject will receive compensation (e.g. money,
extra credit, gift certificate), describe its type and amount, when it
is scheduled, and the proration schedule, if any, should
the subject decide to withdraw or be withdrawn by the investigator.


Confidentiality

Any information that is obtained in connection with this study
and that can be identified with you will remain confidential and
will be disclosed only with your permission or as required by law.
Confidentiality will be maintained by means of [describe coding
procedures and plans to safeguard data, including where data will
be kept, who will have access to it, etc.].
If information will be released to any other party for any reason,
then state the person or agency to whom the information will
be furnished, the nature of the information, the purpose of the
disclosure, and the conditions under which it will be released.
If activities are to be audio- or videotaped or digitally recorded,
describe who will have access, if the tapes/files will be used for
educational purposes, and when they will be erased or destroyed.

Participation and Withdrawal


You can choose whether or not to be in this study. If you volunteer to
be in this study, you may withdraw at any time without consequences
of any kind or loss of benefits to which you are otherwise entitled.
You may also refuse to answer any questions you do not want to
answer. There is no penalty if you withdraw from the study, and you
will not lose any benefits to which you are otherwise entitled.

Include the Following Paragraph in this Section Only if Relevant
The investigator may withdraw you from this research if
circumstances arise which warrant doing so. [Describe the
anticipated circumstances under which the subject's participation
may be terminated by the investigator without regard to the subject's
consent.]

FOR BIOMEDICAL STUDIES ONLY, ADD THE FOLLOWING SECTION HERE
Alternatives to Participation (If Applicable)
Describe any appropriate alternative therapeutic, diagnostic, or
preventive procedures that should be considered before the subjects
decide whether to participate in the study. If applicable, explain
why these procedures are being withheld. If there are no efficacious
alternatives, state that an alternative is not to participate in the study.

IDENTIFICATION OF INVESTIGATORS
If you have any questions or concerns about this research, please
contact [identify research personnel: Principal Investigator, Faculty
Sponsor (if a student is the PI), Co-Investigator(s), if any]. Include
daytime phone numbers, addresses, and email addresses for all listed
individuals. For some studies of greater than minimal risk, it may be
necessary to include night/emergency phone numbers.

RIGHTS OF RESEARCH SUBJECTS


The Michigan Tech Institutional Review Board has reviewed my
request to conduct this project. If you have any concerns about your
rights in this study, please contact Joanne Polzien of the Michigan
Tech-IRB at 906-487-2902 or email jpolzien@mtu.edu.
I understand the procedures described above. My questions have
been answered to my satisfaction, and I agree to participate in this
study. I have been given a copy of this form.

________________________________________
Printed Name of Subject

________________________________________    ________________________________________
Signature of Subject                        Date

________________________________________    ________________________________________
Signature of Witness                        Date

BIBLIOGRAPHY
1. www.uoguelph.ca/research/forms/.../sample%20consent%20form.doc

Index
Page numbers followed by f refer to figure and t refer to table

A
Alternate hypothesis, types of 60
Analytical observational studies 14
Antibody test 106

B
Bar charts 46
Basic statistical tests 110
Bias 89
control of selection 92
interviewer 91
misclassification 91
types of 89
Biostatistics 51
Blinding 24

C
Calculating odds ratio 87
Case control study 15
design 15
Categorical data 43
Causes of CRI 11
Central tendency, measures of 51
Chronic kidney disease 11t, 62, 95, 134, 144f
Citing book reference 159
Citing internet and electronic sources 161
Citing journal article 157
Closed ended questions 116
Cluster random sampling technique 37
Cluster sampling 32, 36
Cohort studies 17
Comorbidity index 11
Comparative studies 14
Conduct research 4t
Consecutive manner 37
Consecutive sampling 37
Consent form 25
Convenience sampling 37
Coronary artery disease 22f
Coronary heart disease 94
Cross-sectional studies 12
design of 13
Cumulative incidence rate 73

D
Data analysis 123, 143
plan 120
Data collection techniques, overview of 115
Data processing 122
Data types, classification of 42
Descriptive analysis 143
Descriptive observational studies 10
Diabetes 6
Different data collection techniques 115
Disease frequency, measures of 69
Disease prevalence, effect of 108
Dissertation reference 161
Dissertation writing 151
Dissertation, format of 151
Dyspepsia 45

E
End-stage renal disease 131
Epidemiological study designs, types of 8, 9
Estimation and hypothesis testing 57
Ethical review board 25
Experimental study design, sketch of 21

F
Fever 45
Focus group discussion 118
Formulate analysis plan 60

G
Gender distribution of
respondents 47f
General ethical principles 164
Generating hypothesis, observational designs for 8
Graphs 45
types of 45

H
Headache 45
Histograms 47
Hypertension 6
Hypothesis 57, 134, 135, 153
alternative 59
test of 59

I
Incidence 72
density rate 74
rates, special types of 73
Information bias 91
Interpretation 66, 86, 88

J
Journal article, title of 158
Journal title 158
Judgmental sampling 38

L
Laboratory values 11
Line graphs 48
Literature search, resources of 5
Lottery sampling technique 32f

M
Mapping and scaling 119
Mean, median and mode, example
of 51
Methodology 129
Morbidity rate 75
Mortality rate 76
Multivariate analysis 146t
Multivariate regression analysis 149
Myocardial infarction, relation
of 93t

N
Nausea 45
Negative predictive value 107
Nephropathy 7
Nonparametric tests 112
Nonprobability sampling
techniques 31, 37
Null hypothesis 59
Numerical data 43

O
Observation bias 91
Odds ratio 86
Open-ended questions 116
Operational definition 134
Optional components 154
Oral contraceptive and breast
cancer 87
Oral contraceptive use 93t

P
Page numbers 159
Participation and withdrawal 172
Pie charts 46
Population 30
Positive predictive value 107-109
Post-marketing
clinical trials 27
surveillance 26
Probability sampling techniques 31, 32
Processing and analysis of qualitative data 126
Projective techniques 118
Prospective cohort study 17, 18

Q
Qualitative data 43
Qualitative research 1
Quantitative data 43, 122, 123
Quantitative research 3
Quasi-experimental studies 25
Questions, types of 116
Quota sampling 39

R
Recall bias 91
References 155
study 140
writing 157
Research questions and study types 27
Research subjects, rights of 173
Research topic, selection of 3
Research
classification of 2
types of 1
Retrospective cohort study 19

S
Sample data 60
Sample of title page 155
Sample size 95
calculation 139
calculation result 100t
estimation 95
for single group mean 96
for single proportion 95
Sampling
method 138
procedure 30
techniques 31, 32f
Scatter plots 49
Selection bias 90
Significance level, selection of 60
Simple linear regression 81, 82f
Simple random sampling 32
Snowball sampling 38
technique 39
Solving hypothesis testing problems 65
Sorting data 121
Special package for social sciences 83
Standard error of mean 54
State appropriate conclusion 66
Steps in
hypothesis testing 60
writing dissertation 151
Stratified random sampling 32, 35
technique 36f
Study designs 8
Study duration 139
Study objective 153
Study purpose 169
Synopsis writing 129
Systematic random sampling 32, 33, 34f, 35f
Systolic blood pressure 45

T
Table of content 152
Title 152
page 152
Tuberculosis 16
morbidity rate of 75

V
Variables, types of 41
Variation, measures of 52
Volume number 159
Vomiting 45
