The Journal Club Toolbox
Critical reading of “original” medical articles
Arjun Rajagopalan
CONTENTS
STEP 1: Don't read the abstract
STEP 2: Read the introduction instead
STEP 3: The title tags the type
STEP 4: Identify the clinical question – in 4-part harmony
BREAK 1: Bias (systematic error)
STEP 5: Looking for sampling bias
STEP 6: Looking for measurement bias
STEP 7: Bias in comparison – apples with apples, not oranges
BREAK 2: Estimating the impact of the inherent variability of populations
STEP 8: Now for the results
STEP 9A: Interpreting interventional studies
STEP 9B: Interpreting studies on value of diagnostic tests
STEP 9C: Interpreting studies on risk/ association/ causality
STEP 10: Closing the loop – applying the results in your practice
© Dr Arjun Rajagopalan
The fact that an opinion has been widely held is no evidence

whatever that it is not utterly absurd; indeed in view of the silliness
of the majority of mankind, a widespread belief is more likely to be
foolish than sensible. Bertrand Russell
...where the value of a treatment, new or old, is doubtful, there may

be a higher moral obligation to test it critically than to continue to
prescribe it year-in and year-out with the support of custom or
wishful thinking. F H K Green
Familiarity with medical statistics leads inevitably to the conclusion

that common sense is not enough… Many people are not capable of
using common sense in the handling and interpretation of numerical
data until they have been instructed. Austin Bradford Hill
It is only prudent never to place complete confidence in that by

which we have even once been deceived. Rene Descartes
A reasonable probability is the only certainty. E W Howe
In the space of one hundred and seventy-six years, the Lower

Mississippi has shortened itself two hundred and forty-two miles.
That is an average of a trifle over one mile and a third per year.
Therefore, any calm person, who is not blind or idiotic, can see that
in the old Oolitic Silurian Period, just over a million years ago next
November, the Lower Mississippi River was upward of one million
three hundred thousand miles long, and stuck over the Gulf of
Mexico like a fishing rod. And by the same token any person can see
that seven hundred and forty-two years from now, the Lower
Mississippi River will be only a mile and three quarters long, and
Cairo and New Orleans will have joined their streets together, and
be plodding along comfortably under a single mayor and a joint
board of aldermen. There is something fascinating about science.
One gets such wholesale returns of conjecture out of such trifling
investment of fact. Mark Twain – Life on the Mississippi
There's way too much stuff out there waiting to be read. The
repeated chanting of the mantra of “evidence based
medicine” leaves you with the nagging feeling that you
should at least try to read some of the stuff, but, if you are like me,
you have never received any formal instruction on how to go about
making sense of arcane stuff like “p” values and 95% confidence
intervals. Your memory of journal clubs is that of the potato chips
and snacks that were provided.
There are 2 kinds of journal readers:
1. Most of us have a short list of half a dozen or fewer journals

whose Table of Contents we scan now and then, rejecting
most items as fluff that no one will read except the authors.
An article catches our eye and we turn to the page, or mark
it for inclusion in a "journal club" or some such gathering.
We are not narrowly focused researchers, but we see
ourselves as "evidence-based" clinicians, without any real
idea of what the term means. We browse journals in the
following fashion:
a) First, we scan the title.
b) Then, we quickly look at the institutional affiliation(s) of

the authors.
c) We jump to the Abstract and land first on the

"Conclusion" section.
2. The hard core researchers and nerds who will wade through
a hundred and thirty-six references to answer a simple
question and who, somehow, don't strike us as being
capable of taking care of patients in the real world.
This offering is for the first group. Don't be a wallflower at journal

clubs, content to eat the potato chips, allowing a small group of
loud, forceful people to act like they know it all. Get in there and
pitch it back at them.
This handbook provides you a tool to approach your selected bunch.

First of all, it requires no special knowledge of biostatistics. What
you need to know is sprinkled across the document in a painless,
jargon-free manner. It provides a step-wise method that, when
applied systematically, will give you the ability to critically analyze
medical journal articles. You don't have to depend on the word of
experts and residents of academic ivory towers. What's more, as
you get good at this, you can burst their bubbles with confidence.
The handbook is only a tool. Like all tools, it is up to you to use it

and get good at it. Enjoy.
STEP 1
DON'T READ THE ABSTRACT

Yes, I mean it. This is the worst way to read the literature. It makes
you feel good but ends up implanting false ideas in the brain. Your
mind is primed by biases before you begin. What's more, in the very
likely event that you don't read the article critically, you will carry
the biases acquired from the a-b-c method described above, through your day-to-day
practice, often subliminally.
STEP 2
READ THE INTRODUCTION INSTEAD

It is more fruitful to start first with the introduction to the paper.
Almost always, it will have this general format.
1. A proposition or statement.
2. A declaration of the current state of understanding: a point ("we are now here") somewhere on a scale that runs from complete clarity to total ignorance.
3. A statement of intent: "We (the author(s)) would like to show by this study that the state of understanding can be shifted."
STEP 3
THE TITLE TAGS THE TYPE

Published medical evidence falls into one of three major groups.
• Observational studies: Data is collected and analyzed for
  significant patterns. Inferences are made from the patterns
  observed. There is no attempt to alter the natural course of
  the process that is studied. Observational studies can, again,
  be of three varieties:
    - Those that only look for patterns in the data that is
      obtained – classical epidemiological studies.
    - Risk/ association estimation - the study attempts to
      answer the question: “Is there an association between A
      and X?” or “Does the presence of P pose an increased
      risk of developing R?”
    - Causation. “Is alpha the cause of delta?” Unlike risk, the
      relationship is of the “on-off” variety: if the cause is
      present, the outcome is inevitable; if not, there is no
      such event.
• Evaluating diagnostic tests: The study attempts to
  evaluate the utility of a diagnostic test (decision making
  under states of uncertainty).
• Interventional trials: An external intervention, usually
  therapeutic, is made in an effort to alter the outcome from a
  disease, and evaluated in terms of this effect. These are the
  studies that will most commonly interest us as clinicians.
A necessary first step is to tag the study as one of the three types
listed. The title of the paper will be sufficient in most cases.
Be wary of articles that attempt to be more than one of the

three main types; they will rarely succeed in doing any one
item well, let alone all that they are attempting.
You will evaluate articles on the basis of their label. Tagging is an
easy task. To convince you, the following section lists a few papers
from the literature. Tag them using the principles outlined above.
Practice – tagging article type

Classify the articles listed in the table below as being one of three major types:
observational (O), diagnostic test assessment (D) or interventional (I).

Title of article | TAG
Prevention of contrast induced nephropathy with sodium bicarbonate |
Clinical value of the total white blood cell count and temperature in the evaluation of patients with suspected appendicitis |
The risk of cesarean delivery with neuraxial analgesia given early versus late in labor |
Outcomes in young adulthood for very-low-birth-weight infants |
Homocysteine as a predictive factor for hip fracture in older persons |
Randomised trial of a brief physiotherapy intervention compared with usual physiotherapy for neck pain patients: outcomes and patients’ preference |
A longitudinal, population-based, cohort study of childhood asthma followed to adulthood |
Emergency ultrasound in the acute assessment of haemothorax |
The tag is the defining element in assessment of a study. This will

become clearer when we get into the process of evaluating the
results of a study.
STEP 4
IDENTIFY THE CLINICAL QUESTION – IN 4-PART HARMONY

Example:
Low-dose ramipril reduces microalbuminuria in type 1 diabetic patients without

hypertension: results of a randomized controlled trial.
O'Hare P, Bilous R, Mitchell T, O'Callaghan CJ, Viberti GC
Diabetes Care 2000. 23(12): 1823-9
This study attempts to answer the 4-part question:
In normotensive, type 1 diabetics (population), does a low dose of 1.25 mg of the ACE inhibitor
ramipril (indicator variable) prevent progression of incipient diabetic nephropathy as measured by
microalbuminuria (outcome variable), as compared to standard, 5 mg doses of ramipril or placebo (comparison)?
The paper's conclusion needs to be measured against the stated question and assessed for the
degree of completeness and lack of evasiveness in addressing the issues raised. In this study,
the authors have concluded:
Microalbuminuria was reduced significantly by ramipril treatment in type 1 diabetic patients without
hypertension, as compared to placebo. Although the magnitude of the response was greater, there
was no significant difference between responses to 1.25 or 5 mg ramipril.
(Note: The paper is reported without any attempt at examining the details of the study)
Once you have completed this task, identify the 4-part clinical
question as:
1. Explicitly stated and complete.
2. Stated but incomplete.
3. Fuzzy.
The more explicitly stated and narrower the clinical question,

the greater the likelihood of the authors' ability to establish
the validity of their argument and overall soundness of the
paper. Very broad and poorly articulated queries are unlikely
to be adequately validated.
A SHORT BREAK TO EXPLAIN A CONCEPT
BIAS (SYSTEMATIC ERROR)
Assume it is everywhere
Bias is defined as a prevailing preference or

tendency that inhibits impartial comparisons
and judgements.
Bias is also called "systematic error" as

opposed to random errors that occur due to
the inherent variability of human groups.
Bias has to be presumed to exist in all clinical

studies. The challenge in study design and
evidence based medicine is to minimise or
eliminate bias - not easily done in many
clinical situations.
In clinical trials and studies, bias occurs at three important points in the design of a study:
1. Sampling: in the process of designing and drawing a sample for the study.
2. Measurement: in the various measurements and observations that are a part of the study.
3. Comparison: in designing and making comparisons between groups.
The importance of tagging a study lies in looking hard for a specific

bias. Although all three sources of bias can flaw any study,
specifically, each tag has a key concern:
• Observational studies – sampling bias.
• Diagnostic accuracy – measurement bias.
• Interventional studies – comparison bias.
STEP 5
LOOKING FOR SAMPLING BIAS
Non-probability sample
Non-probability sampling - creation by a non-random process of a sample
that is a facsimile of what would be a probability sample

Type | Process | Reason
Consecutive sample | Including every patient who meets criteria over a given number or time frame | Simplest and most commonly used option in clinical research
Convenience sample | Using easily available members of an accessible population | Easy strategy when any sample will be representative
Judgemental sample | Hand picking most appropriate members from an accessible population | Easy strategy when any sample will be representative
Clinical trials, necessarily, have to recruit patients from clinics and
hospitals. Therefore, they are almost always non-probability samples
and are inherently flawed in this respect. Multicentre trials can
mitigate this bias to an extent. Given a choice, consecutive samples
are the least subject to bias with convenience and judgmental
samples running high risks of sampling bias.
All clinical trials, regardless of complexity, have to go through 4

stages in sampling. These are outlined in the diagram shown below.
It is best to look at an example first:
Sampling bias can be induced at 3 points in a study:
• The geography of the sample can induce cultural and socio-

economic bias that may render the study invalid for general
application and external validation. The time period of the
study may be critical if significant changes have occurred
during this time frame in the understanding and
management of the condition that is being studied.
• Inclusion and exclusion criteria are a necessary part of good

study protocols. However, if they are too tight or too loose,
the resulting sample will be non-representative and biased.
• All studies will have drop outs between the intended

population (those available after application of inclusion/
exclusion criteria) and those available till the study is
completed as defined in the protocol. Any study that has a
drop out rate greater than 15% is suspect and automatically
invalid. One has to assume that drop outs occur because of
unfavorable outcomes.
Sampling errors are the most common reasons for flawed

studies, yet, the average reader spends the least amount of
time, if any, in critically evaluating the sampling process.
Regardless of all else, a biased sample red flags a poor study.
STRENGTHENING THE SAMPLING PROCESS – A PRIORI SAMPLE SIZE ESTIMATION
Later in this exercise, we will look at the impact of sample size (n)
on estimating the significance of differences shown in studies. A
large number of trials, although well done and showing differences,
are weakened by being “underpowered”. Simply stated, the size of
the sample is not large enough to prove the point beyond doubt.
Good studies will attempt to handle this critical issue by estimating

the sample size, a priori. This can be done by a biostatistician if the
investigator provides three criteria:
1. The size of the difference between the groups (as a

percentage or proportion) that he is looking for or
postulating.
2. The power of the test to detect this difference – expressed

as a percentage and usually set at 80 – 90%.
3. The confidence level or the margin of error (the alpha level)
that the investigator is prepared to accept – expressed as a
decimal or percentage.
This example will clarify: (Ref: The risk of cesarean delivery with
neuraxial analgesia given early versus late in labor. NEJM 2005; 352:655-65.)
“The study was designed to have 80 percent power to detect a
difference of 50% in the rate of cesarean delivery, with a two-sided
alpha level of 0.05. The sample size required to detect this
difference was 350 subjects per group.”
A priori sample size estimation strengthens the study and protects it

from the danger of small numbers or the expense of larger trials
than are necessary.
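To show how the three criteria combine, here is a minimal Python sketch using the standard normal-approximation formula for comparing two proportions. The function name and the event rates (20% versus 30%) are hypothetical, chosen only for illustration; they are not taken from the cited trial, and a biostatistician may well use a different method.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_treated, power=0.80, alpha=0.05):
    """Approximate subjects needed per group to compare two proportions.

    Normal-approximation formula, two-sided test, equal group sizes.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # criterion 3: margin of error
    z_beta = NormalDist().inv_cdf(power)           # criterion 2: power
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    effect = p_control - p_treated                 # criterion 1: size of difference
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical rates, for illustration only.
n_per_group = sample_size_per_group(0.20, 0.30)
print(n_per_group)
```

Asking for more power, a smaller alpha or a smaller difference all push the required number up, which is why these three figures have to be fixed before the study starts.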
STEP 6
LOOKING FOR MEASUREMENT BIAS

All studies involve measurements; as part of the study protocol,
during the study itself or in assessing results. Bias can creep into
measurements through various routes.
Measurement: precision & accuracy

Any measurement has to have 2 characteristics: precision and accuracy. The diagram describes the
two entities with the simple example of the results of hitting a target.
[Diagram: four targets, labelled "Precise and accurate", "Precise but not accurate", "Accurate but not precise" and "Neither precise nor accurate"]
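The same two ideas can be put into numbers. In this rough sketch, accuracy is taken as the distance of the average reading from the true value and precision as the scatter of the readings around their own average. The blood pressure readings and the reference value are invented for illustration.

```python
from statistics import mean, stdev


def describe_measurements(readings, true_value):
    """Return (bias, spread) for repeated readings of a known quantity."""
    bias = mean(readings) - true_value  # accuracy: how far the average is from the truth
    spread = stdev(readings)            # precision: how tightly the readings cluster
    return bias, spread


TRUE_SYSTOLIC = 120  # hypothetical reference value in mmHg

precise_and_accurate = [119, 120, 121, 120]
precise_not_accurate = [129, 130, 131, 130]

bias_a, spread_a = describe_measurements(precise_and_accurate, TRUE_SYSTOLIC)
bias_b, spread_b = describe_measurements(precise_not_accurate, TRUE_SYSTOLIC)
print(bias_a, spread_a)
print(bias_b, spread_b)
```

Both sets of readings are equally tight, but the second is consistently 10 mmHg off target, which is the numerical picture of "precise but not accurate".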
Measurement errors can arise from 2 sources:
1. The observer making the measurement.
2. The instrument (or device) used to measure. We tend to

overlook common bedside tools like pulse rate, BP
measurement and temperature as sources of measurement
error, but, studies can be weakened by such simple devices.
Each of these can again be random (due to the nature of the

sample) or systematic (bias) – part of the study design.
Observer bias
Good studies will attempt to reduce observer bias through various
strategies. Look for the following:
1. Training of observers: Putting the observers through a

training and certification protocol to ensure reproducibility of
results.
2. Standard protocols: Formal statements of procedures and

processes that are to be used.
3. Using numerical scoring and scaling systems where

subjective decisions are involved. A typical example would
be the use of visual analogue pain scores to assess pain
levels rather than descriptive terms such as "mild",
"moderate", or "severe".
4. Blinding: This is the most classical of techniques to reduce

what is called differential bias. The variants of blinding
include:
a. Single blinding (independent observer): where it is

not possible for the investigator to administer an
intervention without knowledge of its nature, but, the
outcome variables can be assessed by a person who is
unaware of the intervention given. For eg. using an

independent infection control nurse to assess
postoperative wound infections rather than the surgeon
who performed the procedure.
b. Double blinding: where neither the person

administering the intervention nor the subject receiving
the intervention is aware of the nature of the
intervention. In such cases, observations need not
necessarily be made by independent observers.
c. Cross-over, double blind studies: At some point in the

protocol, the interventions are switched in the subject.
For eg. those on the placebo, now get placed on the
treatment and vice versa, the entire process being
carried out double blinded. Each patient, thus serves as
his own control.
Measuring observer variability: No two successive observations
will be identical. Variability will occur between observations made by
the same person on the same situation, on successive occasions –
intra-observer variability, and between different persons making the
same observation – inter-observer variability. This source of error
can be estimated statistically and expressed as a “kappa” statistic,
a valuable, objective figure that corrects the observed agreement for
the agreement expected by chance alone. The final figure is usually
stated as a decimal between 0 and 1. A kappa of "1" indicates perfect
concordance between observers (virtually impossible) and a kappa of "0"
represents agreement no better than chance; negative values, which are
rare, mean agreement worse than chance. Practically, a kappa value
above 0.6 is considered good agreement between observers.
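For the curious, this is roughly what goes on inside the kappa calculation when two observers each give a yes/no verdict on the same patients. The sketch uses Cohen's kappa for two observers; the counts are hypothetical, and the function and argument names are my own.

```python
def cohen_kappa(both_yes, a_yes_b_no, a_no_b_yes, both_no):
    """Cohen's kappa for two observers making a yes/no call on the same cases."""
    total = both_yes + a_yes_b_no + a_no_b_yes + both_no
    observed = (both_yes + both_no) / total      # agreement actually seen
    a_yes = (both_yes + a_yes_b_no) / total      # how often observer A says yes
    b_yes = (both_yes + a_no_b_yes) / total      # how often observer B says yes
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)  # agreement by chance
    return (observed - expected) / (1 - expected)


# Hypothetical counts: 100 wounds assessed for infection by two observers.
kappa = cohen_kappa(both_yes=40, a_yes_b_no=10, a_no_b_yes=5, both_no=45)
print(round(kappa, 2))
```

Here the observers agree on 85 of 100 cases, but half of that agreement would be expected by chance, so the kappa comes out lower than the raw 85% might suggest.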
Measurement bias
Strategies for reducing measurement bias include:
1. Repetition: Obtaining multiple readings from the same event

would increase the reliability of the values.
2. Calibration against a "gold standard". This aspect is

particularly important when using advanced generation
instruments that are yet to be proven. Some simple
examples include the use of digital blood pressure recorders
and weighing machines. Periodic calibration against a
mercury column machine or a beam balance would be
necessary.
3. Suitability of the instrument to what is being measured:

Quite often this element is glossed over or missed. A well
done study that uses an unsuitable instrument, will raise
doubt on the validity of the results.
Systematically list everything that the study measures and

put them through the checklist for measurement bias. Don't
take anything for granted; look for explicit statements or the
absence thereof.
STEP 7
BIAS IN COMPARISON – COMPARING APPLES WITH APPLES, NOT ORANGES
Progress in Medicine is a slow, incremental process of comparing the
new with what is traditional or established. “Paradigm shifts” are
very rare. Fair comparisons are a critical aspect of valid studies.
Against the backdrop of the vast range of variability in human
populations, assuring fair comparisons between groups is not an
easy task.
Comparison: being in "control"
Most clinical trials are designed on the principle of making comparisons between groups.
Observations and/or outcomes of a group (study group) are compared with another reference group
(control group) and the differences established. The study, in conclusion, will make a decision
whether this difference is valid or not.
[Diagram: Group A (observation/ outcome “A”) versus Group B (observation/ outcome “B”): the difference - “is it real?”]
It is therefore mandatory in all comparative trials that there be a reference group (control) against
which the comparison is made. A large number of studies are invalid as evidence because of
inadequacies in the control and the compared groups. Sources of error include:
1. No controls. However large the numbers and however rigorously designed the study, an
uncontrolled study is unacceptable and ranks only as anecdotal evidence. (It is like the sound of one
hand clapping!).
2. Historical controls. Quite commonly, the lack of controls in the study will be addressed by
comparing the results of the study with other studies on a similar topic, done elsewhere, at
different times in the past. Historical controls are not fair comparisons and are unacceptable.
3. Poorly matched controls. Controls may be present, but are not comparable with the study
group. As an example, those in the study group may be younger and have fewer comorbidities than the
controls and therefore, be associated with better outcomes - comparing apples with oranges.
Good studies will outline the salient characteristics of control

and study group and analyze differences using tests of
statistical significance. These lists have to be scrutinized,
certified as comparable, and, significant differences noted.
No interventional trial is valid unless it is compared with a control
group. At present, the prospective, randomized, controlled trial
(RCT) is the only method of ensuring comparable groups in clinical
trials. By extension, valid interventional trials must therefore, be
RCTs. Retrospective analyses are always flawed in that allocation to
control and study arms can never be unbiased. Comparisons in
retrospective studies are seldom fair, the groups compared are
always dissimilar, and, therefore, differences in outcome can never
be proven to be true.
By now, you would have weighed the study in terms of systematic

error or bias and would like to move on to looking at the results of
the study, usually expressed as differences between two or more
groups. There is another important source of error that needs to be
examined at this juncture – random error; the consideration that
differences between groups might be purely an accident and an
expression of the inherent variability and inhomogeneity of human
populations.
A SECOND (AND FINAL) BREAK TO TAKE IN ANOTHER IMPORTANT CONCEPT
ESTIMATING THE IMPACT OF THE INHERENT VARIABILITY OF HUMAN POPULATIONS
The commonly used term, “the bell shaped curve”, encapsulates all
our ideas, knowledge and emotions regarding the tendency of
populations to exhibit values for single indicants that are distributed
widely around an average or central tendency and not tightly
bunched around a mean. An immediate implication of this
phenomenon is that differences between human groups cannot be
casually taken as real differences. In fact, it is assumed that any
noticed difference is a random event and, the burden of proof is on
the author or investigator to show that the difference is real,
consistent and repeatable - in statistical jargon, “rejecting the null
(no difference) hypothesis”.
This is where we need the help of biostatistical methods. No study is

acceptable unless put through the mill of statistical analysis to
determine the significance of differences that have been
demonstrated in the study.
I can hear you saying, “Don't go there now!” In truth, contrary to

popular belief, it is not necessary to have any knowledge of
biostatistics to be able to critically analyze medical articles. You can
safely consider all the arcane, background mumbo-jumbo as taking
place in a black box that will finally put out two indices that you
need to know about and be able to apply intelligently:
• The 'p' value, and
• The 95% confidence interval (CI).
'p' value
A quick recap first
The major task ahead of us is that of deciding whether differences seen in
outcomes are real or just due to chance. Errors may be systematic (bias) or
random (chance). Systematic errors are minimised by good study design. Random
errors are evaluated by established biostatistical methods.
Biostatistical analysis is a formal process of taking the data generated from the
clinical trial and putting it through well established procedures and processes that
will permit us to make the decision regarding the truth - i.e. are we to reject
the null (no difference) hypothesis or not.
Regardless of the type of statistical test used, a 'p' value will be generated at the
end that is a simple estimate of the probability of error or chance in producing
this observed difference.
[Diagram: Group A (observation/ outcome “A”) versus Group B (observation/ outcome “B”): the difference - “is it real?” Data from the clinical trial goes into the biostatistical black box, which puts out the 'p' value as a measure of the truth.]
The 'p' value is a simple estimate of the probability of error or chance in producing the observed
difference. Strictly speaking, it is the probability of seeing a difference at least as large as the one
observed if the null (no difference) hypothesis were true. By convention, it is expressed as a decimal
fraction, for eg. p = 0.03. The table shows some examples, using the common working interpretation.

'p' value | Explanation
0.38 | 38% chance that the differences are not real
< 0.05 | Less than 5% chance that the differences are not real
0.001 | 1 in 1000 chance that the differences are not real

In biostatistical usage, a p value of 0.05 or less is taken as a significant difference; i.e. less than
5% probability that the difference observed is due to chance alone.
Errors in assigning significance

 | Null hypothesis true (no difference) | Null hypothesis false (difference is valid)
Accept hypothesis | Correct decision | Type II (beta) error (false rejection)
Reject hypothesis | Type I (alpha) error (false validation) | Correct decision

Since the decision regarding the 'p' value cut off is somewhat
arbitrary, it is essential that the value be chosen with the least
potential for error. If the limit is made too narrow (small 'p' value)
then the likelihood of Type II errors (false rejection) rises.
Conversely, if too wide (large 'p' value) the likelihood of a Type I
error (false validation) increases. The 0.05 cut-off appears to be the
"sweet spot" in this regard and determines significance.
Although a convenient and easily understood measure of the truth

behind differences, the 'p' value has many shortcomings.
1. The arbitrary nature of the cut-off point for significance

creates a false dichotomy in what is often a continuum. This
is particularly important when differences are close to the
borderline - eg. p = 0.06.
2. The 'p' value tells us nothing about the size of the difference
or its direction. It is not a quantitative measure.
3. Sample size has a major impact on 'p' values. A small

difference in studies with large sample sizes will have the
same 'p' value as a much larger difference with smaller
samples. As the 'n' in the study increases, more 'p' values
are likely to become significant.
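The effect of sample size on the 'p' value is easy to see with a small calculation. The sketch below uses a two-proportion z-test, which is only one of many tests that could sit inside the black box, and the event counts are made up: the same 10% versus 15% difference is tested once with 100 patients per group and once with 1000.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p value for the difference between two proportions (z-test)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)  # overall rate if there were no difference
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical data: identical rates (10% vs 15%), different sample sizes.
p_small = two_proportion_p_value(10, 100, 15, 100)
p_large = two_proportion_p_value(100, 1000, 150, 1000)
print(f"n = 100 per group:  p = {p_small:.3f}")
print(f"n = 1000 per group: p = {p_large:.4f}")
```

The difference between the groups is identical in both runs; only the 'n' has changed, yet the verdict flips from "not significant" to "significant".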
In recent years, the "95% confidence interval" is preferred as a

measure of examining the differences between groups on a
numerical basis. It is more likely to be useful to practitioners making
clinical decisions.
The 95% confidence interval (CI)
...versus 'p' values
All studies rely on samples that are hopefully representative of the truth. The confidence interval is a
quantitative estimate of the range of values within which the truth is likely to lie, with a specified
degree of confidence. The 95% confidence interval is the range of values within which we can be
95% sure that the truth regarding the population lies. CIs can be calculated for any continuous
variable. CIs may be expressed for any value, 90%, 99% etc, but, like the 'p' value, the sweet spot is
the 95% cut-off. The table below compares the 95% CI with 'p' values.
 | 'p' value | 95% CI
Nature | Qualitative | Quantitative
Basis of decision | Dichotomous (significant/ non-significant) | Continuous set of numbers
Size of difference | Not indicated | Indicated
Direction of difference | Not indicated | Indicated
Impact of 'n' | Present (reader cannot make a judgement) | Present (reader can make a judgement)

95% CIs can be derived for common indices like:
- Differences between means or proportions
- Relative risks and odds ratios
- Sensitivities, specificities and likelihood ratios
(These items are discussed later in the document)
The concept of 95% CI is best understood through examples
The commonest kind of clinical trial is one that compares two

groups. Specific outcome variables are measured and, the mean
values or proportions compared. In the light of what we know about
the tendency of measured values to be distributed across a range
and not tightly bunched around the mean, we need some other
measures that will permit us to judge if these differences are real
and valid or merely due to chance. The 95% CI is the most
commonly used measure. Consider the following example:
The data summarized in the table shown below is from a RCT testing
the efficacy of pertussis vaccine. Patients were randomly assigned to
receive either pertussis vaccine (study group) or a placebo (control).
The table shows the outcome of the study as measured by the
number of study subjects who developed pertussis in the follow up
period.
 | Vaccine (n = 1670) | Placebo (n = 1665)
Developed pertussis | 72 (4.3%) | 240 (14.4%)

Ref: Trollfors B, Taranger J, Lagergard T, et al. NEJM 1995; 333: 1045-50
The difference in the rate of development of pertussis between

placebo and vaccination (absolute risk reduction - ARR) was 10.1%.
The 95% CI for this difference was calculated and reported as 8.2 -
12.0%. This tells us that even assuming the lowest value as being
the truth, there is a real difference because it does not reach or
cross zero, the point at which one would have to consider that one
end of the truth estimate reaches the point of no difference. It is
possible for the value to go below zero as well, indicating a possible
negative difference. (In contrast, the 'p' value gives us no feel for
the magnitude or direction of the difference.)
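The figures in this example can be reproduced approximately from the table. The sketch below uses the simple normal-approximation (Wald) interval for a difference between two proportions. The paper's own method is not stated here, so that choice is an assumption, and the result may differ from the reported interval by about a tenth of a percentage point.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_ci(events_control, n_control, events_treated, n_treated,
                       confidence=0.95):
    """Absolute risk reduction with a normal-approximation confidence interval."""
    risk_control = events_control / n_control
    risk_treated = events_treated / n_treated
    arr = risk_control - risk_treated
    se = sqrt(risk_control * (1 - risk_control) / n_control
              + risk_treated * (1 - risk_treated) / n_treated)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return arr, arr - z * se, arr + z * se


# Counts taken from the pertussis vaccine table above.
arr, lower, upper = risk_difference_ci(240, 1665, 72, 1670)
print(f"ARR = {arr:.1%}, 95% CI {lower:.1%} to {upper:.1%}")
```

The lower limit stays well above zero, which is the numerical version of saying the difference is real.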
The width of the CI is dependent on the numbers sampled - the 'n'.

The more the numbers involved, the greater the degree of precision
with which the 95% CI can be represented. This concept is best
expressed in the following hypothetical situation involving the
sensitivity of a diagnostic test.
 | Sensitivity | 95% CI
n = 24 | 95.8% | 75 – 100%
n = 240 | 95.8% | 92.5 – 98.0%
In the example shown, increasing the 'n' from 24 to 240 results in a

significantly more precise 95% CI, even though the underlying
outcome variable, the sensitivity of the test, remains unchanged.
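A short calculation shows the narrowing directly. This sketch assumes the sensitivity of 95.8% corresponds to 23 of 24 and 230 of 240 true positives, and it uses the Wilson score interval, one of several accepted methods for a proportion. The table does not say which method was used, so the limits will not match it exactly.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a single proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# Assumed counts giving a sensitivity of 95.8% at both sample sizes.
low_24, high_24 = wilson_interval(23, 24)
low_240, high_240 = wilson_interval(230, 240)
print(f"n = 24:  {low_24:.1%} to {high_24:.1%}")
print(f"n = 240: {low_240:.1%} to {high_240:.1%}")
```

The point estimate is the same in both cases; only the range around it tightens as the 'n' grows.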
Two practical points in applying the 95% CI

1. When calculated for raw numbers (continuous variables)
such as the differences observed between groups, the
95%CI will be expressed as a range that can include “0” as a
value and even be negative. When the 95% CI touches or
extends across the zero value (eg. -1.8 to 3.4), it means that
there is a likelihood of no real difference between the groups.
2. When calculated for ratios such as risk or odds ratios, the

95% CI will be expressed as a range that can include “1” and
span values from small decimals just above zero to
any value above 1. If the 95% CI includes “1” within its
range (eg. 0.04 to 2.6), it means that there is a likelihood that
the differences shown are not real or statistically significant.
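The two rules boil down to one check: does the interval contain the "no effect" value, which is 0 for differences and 1 for ratios? A minimal sketch, with a function name of my own choosing and the example intervals quoted in this section:

```python
def ci_excludes_no_effect(lower, upper, kind="difference"):
    """True if the interval does not touch or cross the 'no effect' value.

    kind="difference" uses 0 (differences in means or proportions);
    kind="ratio" uses 1 (risk ratios, odds ratios).
    """
    null_value = 0.0 if kind == "difference" else 1.0
    return not (lower <= null_value <= upper)


print(ci_excludes_no_effect(8.2, 12.0))                # pertussis ARR interval
print(ci_excludes_no_effect(-1.8, 3.4))                # difference spanning zero
print(ci_excludes_no_effect(0.04, 2.6, kind="ratio"))  # ratio spanning one
```

Only the first interval clears its "no effect" value; the other two leave open the possibility that there is no real difference.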
The 95% CI provides quantitative information and permits the
reader to make finer clinical judgments and decisions without the
dichotomy of 'p' values. In practice, it is worthwhile looking at both
estimates.
STEP 8
NOW FOR THE RESULTS

At this point, you need to go back and recall two important
elements:
1. The type of study that the paper represents (STEP 3).
Depending on which one of the three types the study
represents, you will have to go through a specific series of
steps to evaluate the article.
2. The four-part clinical question that the study is trying to
address. It is here that most studies will become evasive
and throw a large number of tables, charts and data at you.
Hold your ground and keep focused on the major stated
intentions of the study. Don't get distracted by post hoc data
analysis that was not in the stated intentions.
It is a fact of life that journals don't like publishing studies
with equivocal results. To get around this, many studies will
descend into creating subgroups and making comparisons. If
this was the intention ahead of time, the authors should
have clearly stated that in their clinical question and
stratified the sample at the time of enrollment of study
subjects. Don't accept attempts at post hoc subgroup
analysis. For example, a study may not show any significant
difference with the use of intervention X in a disease, as
compared with the controls. The authors may then go on to
state that women who were post-menopausal and diabetic (or
some such) showed a difference. If this was their aim, they
should have stratified their study at enrollment into men vs
women; the women into post- and premenopausal; and each
of these into diabetic and non-diabetic: a strategy that would
vastly increase the complexity, sample size and expense.
Watch out for this trick; it is all too common.
STEP 9A
INTERPRETING INTERVENTIONAL STUDIES

Ask yourself the following questions:
• Were the interventions explicitly stated? This is particularly
important when interventions involve skills in the person
making the intervention. Ask:
○ Were protocols drawn up before the study and was there
a process for assuring basic competence in the
intervention? Was there any monitoring by independent
observers to assure quality during the study?
○ Is this level of training/ competence/ monitoring
practical in day-to-day life?
○ If skill levels could not be standardized, were the results
analyzed on the basis of known or assumed competence
of the person making the intervention?
• What was the level of significance of demonstrated
differences? Look at the 95% CI and 'p' values for the
differences.
Simpler still, express the major difference as a fraction and
invert it to get the NNT. (If the difference between control
and study group was 12%, then the NNT is the reciprocal of
12/100, that is 100/12, which is about 8.)
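The NNT arithmetic is short enough to sketch in Python. Rounding up to the next whole patient is a common convention that is added here as an assumption; it is not stated in the text. The 10.1% figure is the pertussis ARR quoted earlier.

```python
import math

def nnt(arr_percent):
    """Number needed to treat, from an absolute risk reduction in percent."""
    if arr_percent <= 0:
        raise ValueError("ARR must be positive for the NNT to be meaningful")
    return 100 / arr_percent

exact = nnt(12)                        # 100/12, a little over 8
rounded = math.ceil(exact)             # assumed convention: round up
pertussis = math.ceil(nnt(10.1))       # ARR from the pertussis example

print(f"ARR 12%: NNT = {exact:.2f}, rounded up to {rounded}")
print(f"ARR 10.1%: NNT rounded up to {pertussis}")
```

The same function can be applied to the two ends of a 95% CI for the ARR to get a range for the NNT.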
STEP 9B
INTERPRETING STUDIES ON VALUE OF DIAGNOSTIC TESTS

First, closely scrutinize the study for measurement bias.
Tests are obtained to resolve uncertainty and address the question,
"Will I be better off after the test in terms of coming to a diagnosis?"
To explain the point further:
1. The positive likelihood ratio (LR+) compares the true positive
rate with the false positive rate: sensitivity / (1 - specificity).
In a group with equal numbers of diseased and non-diseased
people, an LR+ of 5 means that there will be one false positive
for every five true positive tests.
2. Similarly, the negative likelihood ratio (LR-) compares the
false negative rate with the true negative rate:
(1 - sensitivity) / specificity. In the same group, an LR- of 0.2
means that there will be one false negative for every 5 true
negative tests.
Using sensitivities and specificities in a raw fashion is erroneous
because the likelihood of resolving the uncertainty behaves in a
quirky fashion. Consider the table shown below.
Sensitivity   Specificity   LR+     LR-
95%           95%           19      0.05
95%           80%           4.75    0.06
80%           95%           16      0.21
85%           75%           3.4     0.20
75%           85%           5       0.29
75%           65%           2.1     0.38
65%           75%           2.6     0.47
95%           65%           2.7     0.08
65%           95%           13      0.37
70%           70%           2.33    0.43
A test that is 70% sensitive and 70% specific is practically
useless.
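The figures in the table can be reproduced from sensitivity and specificity alone. The sketch below uses the standard definitions, LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity; the function name is chosen for illustration.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity given as fractions.

    Assumes specificity is strictly between 0 and 1 to avoid dividing by zero.
    """
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg

# Three rows taken from the table above
for sens, spec in [(0.95, 0.95), (0.80, 0.95), (0.70, 0.70)]:
    lr_pos, lr_neg = likelihood_ratios(sens, spec)
    print(f"{sens:.0%} / {spec:.0%}: LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

Swapping sensitivity and specificity changes both ratios, which is why the raw pair of percentages can mislead.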
Let's look at the results of this study: "Magnetic resonance imaging
for preoperative evaluation of breast cancer: a comparative study
with mammography and ultrasonography". (Hata T, Takahashi H,
Watanabe K, et al. J Am Coll Surg 2004; 198:190-197.) The authors
claim that MRI can detect intraductal spread more accurately
than the other two methods and conclude that MRI "appears to be
indispensable in breast conserving surgery to minimize local
recurrence". Is Hata san correct? Let's look at his numbers.
             Sensitivity   Specificity   LR+     LR-
Ultrasound   21%           85%           1.4     0.93
Mammogram    22%           86%           1.57    0.91
MRI          67%           64%           1.86    0.52
At first glance, to the uninitiated, it looks as though MRI does better -
sensitivity/ specificity of 67/64% as compared to mammography
(22/86%) and ultrasound (21/85%).
Now see what happens when Likelihood Ratios are calculated (which
the authors did not).
• All three have LR+ < 2 - that puts them in the "should we
bother getting it?" category. In simple words, this means
that, with equal numbers of affected and unaffected patients,
there will be at least one false positive test for every two
true positives.
• As far as the LR- is concerned, US (0.93) and mammography
(0.91) are practically worthless: a negative result leaves
almost an equal chance of a true and a false negative. MRI
(0.52) is better but still weak, with roughly one false
negative for every two true negatives. A negative MRI does
not come close to ruling out intraductal spread.
"Indispensable?"
Learn to use likelihood ratios, not the knee-jerk sensitivity/
specificity figures, to determine the value of a diagnostic test.
It is well worth the effort.
Rule of thumb: Look at the LR+ when the emphasis is on diagnosing
(ruling in). Use the LR- when the issue is one of exclusion (ruling
out, screening).
STEP 9C
INTERPRETING STUDIES ON RISK/ ASSOCIATION/ CAUSALITY

Clinical trials often attempt to assess the degree of harm or risk that
accompanies exposure to various events and circumstances.
Risk assessment: have measurements been made on more than one
occasion?
NO:  Cross-sectional study (case controlled), which yields an odds ratio
YES: Longitudinal study (prospective cohort), which yields a relative risk
Evidence from these studies is acceptable only if they fall into one of
the two types that are shown above. The longitudinal cohort study is
the better choice but may not always be practical, in which event a
well done case-control study is an option.
Longitudinal cohort studies can assess incidence of the disease since
they involve follow-up over a period of time and therefore can
measure risk. Longitudinal studies will yield a "relative risk".
Cross-sectional studies (case controlled) cannot provide estimates of
risk. They will only yield an "odds ratio".
• Since these studies do not perform interventions, sampling
strategies have to be closely scrutinized for bias.
• Studies on risk and association will have to be represented
as shown in the diagram below, keeping in mind the 4-part
clinical question.
From the numbers in this simple 2x2 table the following results will
be stated:
The calculation is not as hard as it looks. It's OK if you don't want to
be bothered by it. Just remember the principle behind each and look
for the numbers in the study. The more important aspect of using
these numbers is outlined in the next section.
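For readers who do want to see the arithmetic, here is a minimal Python sketch. The cell labels a, b, c, d and the counts are hypothetical, chosen only for illustration, and the 95% CIs use the usual log-based approximation, which is an assumption rather than the method of any particular study. Both ratios are computed from the same table only to show the arithmetic; in a case-control design only the odds ratio is meaningful.

```python
import math

def risk_and_odds_ratio(a, b, c, d, z=1.96):
    """Relative risk and odds ratio, each with an approximate 95% CI.

    Hypothetical 2x2 layout:
        a = exposed, outcome present      b = exposed, outcome absent
        c = unexposed, outcome present    d = unexposed, outcome absent
    All four cells must be greater than zero.
    """
    rr = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    # Standard errors on the log scale
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    rr_ci = (rr * math.exp(-z * se_log_rr), rr * math.exp(z * se_log_rr))
    or_ci = (odds_ratio * math.exp(-z * se_log_or),
             odds_ratio * math.exp(z * se_log_or))
    return {"rr": rr, "rr_ci": rr_ci, "or": odds_ratio, "or_ci": or_ci}

# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed have the outcome
result = risk_and_odds_ratio(30, 70, 10, 90)
print(f"RR = {result['rr']:.2f}, 95% CI {result['rr_ci'][0]:.2f} to {result['rr_ci'][1]:.2f}")
print(f"OR = {result['or']:.2f}, 95% CI {result['or_ci'][0]:.2f} to {result['or_ci'][1]:.2f}")
```

In this made-up table both intervals lie entirely above 1, so neither ratio includes the point of no difference.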
From the earlier discussion on 95% CI we learned that when
calculated for ratios such as risk or odds ratios, the 95% CI will be
expressed as a range that can include “1” and span values from a
small fraction just above 0 to any value above 1. If the 95%
CI includes “1” within its range (e.g. 0.04 to 2.6), it means that the
differences shown may not be real, that is, they are not statistically
significant.
This will be clearer as you go through the examples provided below.

Association of obesity and cancer risk in Canada.
Pan SY, Johnson KC, Ugnat AM et al. Am J Epidemiol. 2004; 159:259-68.
This is a population-based, case-control study of 21,022 incident cases
of 19 types of cancer and 5,039 controls aged 20-76 years during
1994-1997 to examine the association between obesity and the risks of
various cancers. The study compared people with a body mass index of
less than 25 kg/m² with people with a body mass index of 30 kg/m² or
more.
            Odds ratio   95% CI
Overall     1.34         1.22 – 1.48
Colon       1.93         1.61 – 2.31
Pancreas    1.51         1.19 – 1.92
Breast      1.66         1.33 – 2.06
Ovary       1.95         1.44 – 2.64
Prostate    1.27         1.09 – 1.47

The use of estrogens and progestins and the risk of breast cancer in
postmenopausal women.
Colditz GA, Hankinson SE, Hunter DJ et al. NEJM, 1995; 332:1589-93.
This is data from a prospective, cohort study. During 725,550
person-years of follow-up, 1935 cases of invasive breast cancer were
newly diagnosed. When compared with women who had never used
hormones, the data was as shown.
                       Relative risk   95% CI
Estrogen alone         1.32            1.14 – 1.54
Estrogen + progestin   1.41            1.15 – 1.74
5-9 yrs users          1.46            1.22 – 1.74

When are risk/ odds ratios significant?
At what level do we consider risk ratios and odds ratios to be
clinically significant? Although there are formulae for calculating
numbers needed to harm (much like the NNT seen earlier), these
are not easily applied. As a general rule of thumb, it may be stated
that:
• Risk ratios of 3 or more are significant.
• Odds ratios of 4 or more are significant.
• A risk ratio greater than 20 practically implies causality.
As a general caveat, the threshold for significance would also be
determined by the seriousness of the adverse event. The more
serious the event, the smaller the ratio that could be considered
significant.
Studies on risk may sometimes present a group of risk ratios/ odds
ratios in a graphical form such as what is shown below. This is from
a paper that seeks to address the question whether women who
have breast cancer detected early by mammographic screening
programs have a lower risk of cancer-related death than those
picked up conventionally.
With your knowledge of interpreting risk ratios and their
significance, you can eyeball this chart and come to your own
conclusions. The middle dot in each line represents the risk ratio
figure for each study and the line represents the spread of the 95%
CI. As you can see, several of them cross the “1” mark, the point of
no significance, but, if you take the bottom value (red arrow), the
one that represents the summation of all the values above it, you
can see that the relative risk falls well short of the “1” mark. This
means that women who were treated for mammographically
detected breast cancer had a lower risk of cancer-related deaths on
long term follow up than the control population.
Proving causality
When your 2 x 2 table shows numbers like this, it is very tempting
to extrapolate that the predictor variable may be the cause of the
outcome. This is dangerous. It is suggested that 5 questions be
asked and answered before a causal link can be established.
1. Is it clear that the exposure to the risk factor preceded the
outcome?
2. Is a dose-response gradient demonstrable: consistently
increasing harmful effects with increasing exposure?
3. Is there evidence from a "dechallenge-rechallenge" study:
the adverse effect decreases/ disappears when the risk
factor is withdrawn and reappears when it is reinstituted?
4. Is the association consistent and repeatable in other
studies?
5. Does the association make biological sense?
Most often, evidence in support will be lacking or weak. Many
common diseases are complex disorders with multiple etiological
possibilities. Confounding - the process where a cause produces its
effect through a less visible, but stronger factor - is commonplace in
Medicine. The process of estimating such effects involves complex
mathematical processes that would make you scream. Take them on
faith. They will ultimately yield the familiar odds ratios, relative risk
and 95% CIs that you can use to interpret the results.
When multiple risk factors are being assessed in a single situation, it
is possible to evaluate the relative strengths of each. The process of
establishing strength of association is a complex mathematical effort
using regression analysis and other terrifying animals that we would
rather take on faith than confront. Moreover, these complex
regression analyses are usually pieces of sophistry that can seldom
be applied in real life. If they scare you, leave them alone. You will
not know the difference.
STEP 10
CLOSING THE LOOP – APPLYING THE RESULTS IN YOUR PRACTICE

Journal reading should not be a sterile exercise. If you are satisfied
with the quality of the article you have read, you need to apply its
conclusion in your practice. A simple mnemonic might make the task
easier.
INFER, where
• I – Interesting: does the topic fall within your sphere of
interest?
• N – Novel: is it saying something new?
• F – Feasible: can you do it in your daily environment? Use
the NNT, likelihood ratios and risk ratios to make the
decision.
• E – Ethical: subtle issues like patient's preferences, cultural
constraints and so on have to be looked at besides overall
ethical concerns.
• R – Resources: what impact will it have on resources –
yours, the patient's, the hospital's, society's as a whole?
“I never promised you a rose garden.”
For some worked out examples and a downloadable, 2-page, study
evaluation form visit: http://www.ebm4d.org
Authors:
Journal:
Affiliation:
4-part research question
  Population:
  Predictor variable:
  Outcome variable:
  Comparison:
Inclusion criteria   R ? T
Exclusion criteria   R ? T
BACKGROUND:
Sampling
  Probability sample | |
    Simple random | |   Stratified random | |   Cluster | |
  Non-probability sample | |
    Consecutive | |   Convenience | |   Judgmental | |
Sampling scorecard
  Target population
  Accessible population
  Intended population (after inclusion/ exclusion)
  Drop outs (%)
  Study population
EBM Dashboard
Primary nature of study
  Interventional
  Diagnosis
  Risk/ association/ causality
  Observational
Evidence hierarchy
  Double blind RCT
  Randomised controlled trial (RCT)
  Prospective cohort
  Case control
  Case series
Summing up
  Systematic error (bias)   1 2 3 4 5
    Sampling
    Measurement
    Comparison
  Applicability   1 2 3 4 5
    Interesting
    Novel
    Feasible
    Ethical
    Relevant
/ /200
EBM 4 dummies – 1 of 2
© Dr Arjun Rajagopalan

Authors:
Journal:
Affiliation:
Measurement
  Devices used
  Measurement error: Device error / Observer error
    Protocols
    Repetition
    Gold standard
    Training
    Scoring
    Blinding
    Device suited to task
    Device used
  R ? T
Comparison
  Controls
    Randomised
    Case controlled
    Non-random
    Historical
    None
  Controls - details
  Randomisation method
  Comparability
  Disparity
Details
  1.
  2.
  3.
  4.
  5.
  6.
  7.
  8.
Personal notes and observations
EBM 4 dummies – 2 of 2

