THE POCKET GUIDE TO


CRITICAL APPRAISAL:
A HANDBOOK FOR HEALTH
CARE PROFESSIONALS
by
IAIN K. CROMBIE
Department of Epidemiology & Public Health,
Ninewells Hospital & Medical School, Dundee, Scotland


© BMJ Publishing Group 1996

All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system, or transmitted, in any form or by any means, electronic,
mechanical, photocopying, recording and/or otherwise, without the prior written
permission of the publishers.
First published in 1996
Second impression 1997
by the BMJ Publishing Group, BMA House, Tavistock Square,
London WC1H 9JR
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

ISBN 0-7279-1099-X
Typeset by Apek Typesetters Ltd, Nailsea, Bristol
Printed and bound in Great Britain by Latimer Trend Ltd, Plymouth

Contents

Preface
Acknowledgements
1 Introducing critical appraisal
Clarify your reasons for reading
Specify your information need
Identify the relevant reports
Critically appraise the papers

2 Questions to ask when reading a paper
Is it of interest?
Why was it done?
How was it done?
What has it found?
What are the implications?
What else is of interest?

3 Identifying the research method
Surveys
Cohort studies
Clinical trials
Case-control studies

4 Interpreting the results
Statistical significance
The play of chance
Probability
The logic of statistical tests
Confidence intervals
Pitfalls in the analysis
Outliers
Skew
Non-independence
Serendipity masquerading as hypothesis
Black-box analyses
Bias
Confounding

5 Introduction to the check lists
Evaluating the flaws
Best case/worst case/likely case

6 The standard appraisal questions
Are the aims clearly stated?
Was the sample size justified?
Are the measurements likely to be valid and reliable?
Are the statistical methods described?
Did untoward events occur during the study?
Were the basic data adequately described?
Do the numbers add up?
Was the statistical significance assessed?
What do the main findings mean?
How are null findings interpreted?
Are important effects overlooked?
How do the results compare with previous reports?
What implications does the study have for your practice?

7 Appraising surveys
The essential questions
The specific questions
Rationale for the essential questions
Rationale for the specific questions
The complete list for the appraisal of surveys

8 Appraising cohort studies
The essential questions
The specific questions
Rationale for the essential questions
Rationale for the specific questions
The complete list for the appraisal of cohort studies

9 Appraising clinical trials
The essential questions
The specific questions
Rationale for the essential questions
Rationale for the specific questions
The complete list for the appraisal of clinical trials

10 Appraising case-control studies
The essential questions
The specific questions
Rationale for the essential questions
Rationale for the specific questions
The complete list for the appraisal of case-control studies

11 Appraising review papers
The essential questions
The specific questions
Rationale for the essential questions
Rationale for the specific questions
The complete list for the appraisal of review papers

Index

Preface

This book was written to meet the needs of health professionals as
medicine moves to be evidence-based. The initial idea arose during
discussion with postgraduate students on their difficulties of
interpreting the medical literature. It quickly became apparent that
their needs would be best met by a short book detailing criteria for
critical appraisal.
The book is organised in two parts. The first five chapters
provide an introduction to critical appraisal, indicating how papers
can be read and how the results can be interpreted. Experienced
researchers could easily omit these chapters. The final six chapters
provide annotated check lists for critical appraisal. The first of these
contains the general questions which can be asked of any study,
irrespective of the method used. The succeeding chapters review, in
turn, the questions which are specific for each research method.
For convenience, each of the latter chapters concludes with a
combined list of general and specific questions.
The book has been written to be simple to use. Technical terms
are avoided where possible, and the assessment criteria are
explained but not justified. A larger and less accessible text would
have been needed to give a proper rationale for each of the check
lists. To keep this a pocket guide it was also decided to omit
evaluations of other topics such as qualitative methods, health
economics, clinical audit, decision analysis, and screening tests. An
argument could be made for the inclusion of each, but to include
them all would nearly double the size of the book. I hope the check
lists provided prove useful.
I K CROMBIE

Acknowledgements

I was encouraged to write this book by my colleague and friend
Huw Davies. Several other colleagues gave constructive comments
on the manuscript, including Fiona Williams, Linda Irvine, Beth
Alder, Gordon McLaren, Jane Knight, and Charles Florey. Special
thanks go to Janet Tucker whose gentle but incisive comment 'it
won't work that way' led to a substantial improvement in the text.
The check lists were based on my own experiences of conducting
research and refereeing papers, but I have augmented and improved
them through comparison with previously published ones.
I gratefully acknowledge my debt to the authors of these check lists.
The preparation of this book was supported by the Scottish Office
Home and Health Department.

1 Introducing critical appraisal

The medical literature is vast and rapidly expanding. Forays into
the library can be exhausting, as the reader is overwhelmed by the
papers on offer. When read, some will cite interesting references
which spur the reader into a lengthy paper chase. A major hazard
of reading is to pursue a subject in too much depth. Instead of
following this haphazard course, the process of reading should be
carefully planned to provide a worthwhile return on the time
invested. Establishing control over your reading means following a
number of steps:

Clarify your reasons for reading
Specify your information need
Identify the relevant reports
Critically appraise the papers.

Clarify your reasons for reading

Health professionals read the literature for many reasons: to keep
up to date, to answer specific clinical questions, or to pursue a
research interest. Each reason requires a different kind of literature
search. To keep abreast of professional developments a skim
through the most recent issues of the main journals will suffice.
Specific clinical questions can be answered by reading recent high
quality studies. In contrast, pursuing a research interest can require
an extensive computerised literature search to determine whether
the study has been done before. Whatever the reason for reading,
the library should be approached only when the reasons for reading
have been clarified.

Specify your information need
Clarifying what you want to find out should indicate the amount
of information required. Many queries can be best answered from
current textbooks or review articles. But these will not contain the
most recent studies, and will not contain the level of detail of the
original papers. Thus, the reader should ask: What kind of reports
do I want? How much detail do I need? How comprehensive do I
need to be? How far back should I search? The answers to these
questions follow from the reasons for reading.

Identify the relevant reports


Knowing what you want to find out leads to the question of how
to get it most easily. There are many ways of accessing the literature
in addition to browsing through journals: indexing journals such as
Index Medicus; abstracting journals like Current Contents; computerised
literature searches using Medline. The local librarian will
advise on the types of search which you can conduct. There are
usually several ways of tracking down papers, so any library search
should be approached by asking how this can be done most easily.
Even brief visits to the library can generate several dozen papers
to inspect. Many will be of marginal relevance, and should be set
aside. Selectivity in reading is essential to ensure that there is time
for the detailed inspection of important papers.

Critically appraise the papers


Having identified potentially useful articles, the reader needs to
appraise them critically. The process of appraisal is the focus of this
book. There are many poor quality studies whose claims should be
discounted. Others contain some information of value mixed in
with much that is dross. This book provides check lists by which the
useful information can be readily identified.

2 Questions to ask when reading a paper

Research papers are organised into four main sections: introduction,
methods, results, and discussion. Most also begin with an
abstract or summary which presents the key points from each of
the main sections. Papers can be read by asking a series of
questions, addressed to the various sections. These questions will
elicit the important information that each section contains, and will
also provide the basis for the evaluation of the quality of the study.
The questions are:

Is it of interest?
Why was it done?
How was it done?
What has it found?
What are the implications?
What else is of interest?

Is it of interest? Title, abstract

An immediate guide to whether a paper may be worth reading
comes from the title and abstract. This will indicate how relevant
the topic is to the information needed and how interesting the
results are likely to prove. The abstract should also give a
preliminary indication of how well the study was conducted.

Why was it done? Introduction

The main function of the introduction is to provide the
background to the study, indicating why it was carried out. To do
this the introduction briefly reviews previous work, but does so
mainly to highlight gaps in our current knowledge. It can also show
why these are major gaps which urgently need to be filled. Often
this is achieved by describing the clinical importance of the topic
in terms of mortality, morbidity, or cost to the health service.
The introduction should end with a clear statement of the
purpose of the study. This may be phrased as a hypothesis to be
tested or as a question to be answered. The absence of such a
statement can imply that the authors themselves had no clear idea
of what they were trying to find out. If this were the case it is likely
that they did not find out much of interest.

How was it done? Methods


The methods section gives the details of how the study was
carried out. The descriptions are usually succinct, and references
are often given to papers which provide fuller explanations. Despite
this brevity, there should be sufficient information to indicate who
was studied and how they were recruited (for example, which clinic
the patients attended, what the diagnostic criteria were, what age
and sex groups were included). Without this information it will not
be possible to say how widely the findings can be generalised.
There should also be sufficient detail to allow the reader to
decide whether the data which have been collected are accurate. If
measurements were made, the circumstances in which they were
taken should be described, together with the steps taken to
standardise the measuring procedures. The structure of any
questionnaires used should also be given, and mention made of
how they were tested for validity and reliability. The information in
the methods section provides an important guide to the quality of
the paper. Finally, the methods section should indicate which
statistical methods were used in the analysis.

What has it found? Results


The main findings of the study are presented in tables and
figures which are explained by the text in the results section. The
data should be presented in a logical fashion, starting with quite
simple observations and proceeding, when appropriate, to complex
analyses. The text should lead the reader through the data,
highlighting the key findings. When the results follow a more
haphazard course, which impedes understanding, they may not
have been fully analysed.
The text in the results section will give the authors' view of what
is important. This need not be the only view; authors can make
mistakes. Readers should make up their own minds about what the
study has found. It is also worth checking whether the results fulfil
the aims of the study. When an aim presented in the introduction
is not addressed in the results, a number of questions arise. Why
was it missed? Was this just an oversight? Were the appropriate data
not collected? Or were the findings, for some reason, unacceptable
to the authors? The omission raises doubts about the whole paper.
Another warning sign is when the paper does not expand on
phrases like: 'the results for the analysis are shown in Table 2'. This
can suggest that the authors have not worked out what the findings
really mean. If they do not have the interest to interpret their own
data they may not have had the application to design and conduct
the study properly.
As part of the interpretation of the results the reader should
search for the flaws and inconsistencies in the study. All research is
flawed in some way, and it is simply a matter of finding out how.
Often the problems are minor and can be ignored, but sometimes
they may undermine the main findings. Critical appraisal does not
just involve finding flaws; their potential impact must also be
assessed. Only then can a decision be made on what the results
really mean.

What are the implications? Abstract/discussion

The value of research usually lies in the extent to which the
findings can be generalised to other times and other locations. A
study which has meaning only for the clinic in which it was
conducted is almost certainly not worth reading. The wider
implications of a study should be reviewed in the discussion, and
they are often summarised in the abstract.
Identifying implications is largely a subjective process, and as
such should be approached with caution. We would all like our own
studies to have earth-shaking significance. Thus it is not surprising
that authors are not always impartial when interpreting their
results. The following questions should be asked when assessing
the implications: What is new? What does it mean for health care?
Is it relevant to my patients? Certainly the findings should be
compared with other studies and any discrepancies addressed. The
key question for health professionals is whether the findings should
be acted upon by introducing changes in current clinical practice.
The answer is best delayed until the check lists presented in
Chapters 6 to 11 have been scrutinised.

What else is of interest? Introduction/discussion


The results contained in a paper may not be the only interesting
feature. Useful references may be cited in the introduction and
discussion. These sections may also discuss important or novel
ideas. Thus, even if the results are to be discounted, there may still
be benefit to reading a paper. Critical appraisal is not just a
fault-finding exercise. It is a process of reviewing a paper to identify
information of value.

3 Identifying the research method

Many of the criteria for appraisal apply to all methods of research.
But others are specific to a single method. Thus, the use of the
detailed check lists requires first that the research method be
identified. This chapter provides an outline of the research methods
to be used for identification. It does not attempt to provide a
definitive description of each method as this would entail much
more detail than is needed. Instead, it just gives sufficient detail to
enable the reader to recognise one when it is used.
There are several key terms which are specific for certain
methods; technically, they should be diagnostic for the method.
Unfortunately some authors use these terms indiscriminately so
that their appearance in the text does not necessarily identify the
method. There is no recourse but to look in detail at how the study
was conducted to confirm the method used. In most instances this
will be a simple matter, with difficulties arising only when the
authors misuse key terms.

Surveys
Surveys are used to describe how things are now. A sample of
individuals is identified and data are obtained on each at more or
less the same time. The sample being studied may be taken from
the general population, or may be a highly selected one. For
example, surveys could be carried out to establish the levels of
serum cholesterol in the general population. However, surveys can
also be used to study specific groups such as pregnant women,
physiotherapists, or persons aged between 65 and 90 years. Surveys
can even be carried out on inanimate objects such as fire
extinguishers, or emergency trolleys.
Essential features
In principle, surveys start by obtaining a complete list of the
group of interest. Then a sample of individuals on the list is
selected for further study. The selection is carried out randomly
(not haphazardly) so that each individual has an equal chance of
being chosen. In practice, a complete list of the group may not be
available, and imaginative alternative methods may be used.
However, the overall process should achieve the same end: random
sampling should be used to obtain a representative sample. Data
are then collected on the current status of the sampled individuals.
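
The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list of 500 patient identifiers is invented, and the function name draw_random_sample is hypothetical. It shows random selection from a complete list, in which every individual has the same chance of being chosen.

```python
import random


def draw_random_sample(sampling_frame, sample_size, seed=None):
    """Select sample_size individuals so that each has an equal chance of being chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(sampling_frame, sample_size)


# Hypothetical complete list of the group of interest (an assumption for illustration).
patient_list = [f"patient-{i:03d}" for i in range(1, 501)]

sample = draw_random_sample(patient_list, 50, seed=1)
print(len(sample))       # number of individuals selected
print(len(set(sample)))  # no individual is selected twice
```

A haphazard choice (for example, the first 50 names on the list) would not give each individual an equal chance, which is why the draw is left to a random number generator.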

Complications
Most surveys do not have a separate control or comparison
group. Thus, studies which have them are usually not surveys.
However, in the analysis of surveys one subgroup in the sample
may be compared with another (for example, men versus women,
or old versus young). Comparisons are being made but there is no
sense in which one group is acting as a control to another group. All
the individuals have been selected at the same time and then
internal comparisons are made.

Terms of identification
Use of the term survey in a paper should identify the method, but
sometimes the term is mistakenly used for what is really a cohort
study. Cross-sectional is a helpful term because it is seldom used
with any other research method. The terms sample and random
sample are unhelpful because they often appear in the description of
the other research designs. There are many different ways of
drawing a sample, described by the terms stratified, cluster, and
systematic. These terms are seldom used with the other research
methods, except stratified, which can be used in a clinical trial.

Cohort studies
Cohort studies are used to find out what happens to patients. For
example, they could investigate how long patients with acute
low-back pain take to recover; or they could monitor the natural history
of peptic ulcers. Whatever the topic, a group of individuals is
identified and watched to see what events befall them. Cohort
studies may have a comparison or control group. These will be
identified at about the same time and will be followed for roughly
the same length of time. However, a control group is not an
essential feature, and many cohort studies do not have one.

Essential features
The defining characteristic of cohort studies is the element of
time: in cohort studies time flows forwards. A set of individuals is
identified at one point in time, and followed up to a later time to
ascertain what has happened. The direction of time is always
forwards. Studies in which individuals are selected at one point and
traced backwards to see how they were at some time previously are
not cohort studies.

Complications
Cohort studies can be readily imagined as identifying a group of
patients and following them into the future. However, some studies
identify a set of patients at some time in the past and follow them
up to the present. These studies might at first appear to be looking
backwards in time, but they are not. Time flows forwards from the
point at which the patients are identified.

Terms of identification

The term cohort should be diagnostic for this method, although
sometimes the word is used in the context of clinical trials. The
same counsel applies to the terms prospective, follow-up, and
outcome. The term retrospective can be used of cohort studies which
have identified a set of patients at some time in the past. However,
this term is also used with case-control studies.

Clinical trials
Clinical trials should be the easiest method to identify. They are
used to test whether one health care intervention is superior to
another. Clinical trials are often described in terms of testing
drugs, but they can be used to investigate many different types of
health care intervention: surgery, vaccination, anti-pressure sore
mattresses, and health education. When a completely new type of
treatment has been developed it may be tested against a placebo
because there is no other treatment to compare it against.

Essential features
Clinical trials are always concerned with effectiveness. A
characteristic of well-conducted clinical trials is that they identify a
set of patients with a diagnosed disease, and then randomly allocate
them to the new or current best treatment. The focus of the study
should be on the outcome of the treatments, seeking the one which
is superior. Clinical trials are also concerned with the side-effects of
treatments.

Complications
Sometimes cohort studies are used to assess effectiveness; in
such studies a group of treated patients is followed up to see how
many gain benefit. Cohort studies are a poor method of assessing
a treatment and can be severely criticised: it is difficult to make a
fair comparison between treatments in a cohort study.
The brief outline of clinical trials stated that two treatments were
being compared. This is commonly the case, but in some instances
more than two treatments can be investigated. Doing so adds to the
complexity of the study and its analysis, although the resulting
study can still be a valid clinical trial.
Terms of identification
The terms effectiveness, efficacy, and evaluation or phrases like
assess the value of or improve the outcome, are often used in papers on
clinical trials. The terms double blind and placebo-controlled are
seldom used except in a clinical trial. Random allocation of patients
to treatments is essential for a fair comparison, and thus its
presence suggests that a study is a clinical trial. (However, the term
random selection is more likely to refer to a survey.) The simple term
outcome can feature in cohort studies as well as in clinical trials.

Case-control studies

Case-control studies ask what makes a group of individuals
different. Often the group of individuals will have some disease, in
which case the question will be directed at the causes of their
disease. In other instances the individuals will have behaved in
some way, such as failed to comply with therapy or failed to attend
for a clinic appointment.

Essential features

Case-control studies select a set of patients with a defining
characteristic: a diagnosed disease (for example, women with
breast cancer) or lack of attendance at breast screening. The
characteristics of these are compared with a control group (often
similar in age, sex, and background) who do not have the
characteristic of interest.

Complications
Case-control studies often have an element of time in which a
backwards look is taken to past events. They look to see whether
current disease could have been caused by past events. The
directionality of time is crucial for distinguishing between cohort
studies and case-control studies: cohorts look forwards,
case-control studies look backwards.
Terms of identification
As well as case-control there are several other terms for this type
of study, including case-referent, case-comparator and
case-comparison. The possession of a control group is not a defining
characteristic because clinical trials should also have one, and
cohort studies often do. Because the method looks backwards in
time it is sometimes called a retrospective study, but this term can be
used for cohort studies.

4 Interpreting the results

Interpreting the results presented in a paper provides the major
challenge to the critical faculties. Each table and figure should be
approached asking 'what do I think this really means?'. Caution is
the watchword: large, exciting and unexpected results are
exceedingly rare. In contrast, flawed studies and misleading findings are
much more common. Results should be approached with care by
assessing their significance and by looking for possible pitfalls in
the analysis. This chapter presents fundamental ideas for the
interpretation of results.

Statistical significance
In the past some papers simply presented tables and graphs with
a description and explanation of the main findings. It is now
conventional for research studies to assess the statistical
significance of the findings through statistical tests. The need for
statistical tests arises because of the pervasiveness of the play of
chance.

The play of chance


Whenever a group of patients is selected for study, and
measurements are made, there is an opportunity for the play of
chance to affect the findings. The effects of chance are most evident
with small samples. Suppose a sample of ten newborn babies was
taken. Although we would expect that about half would be girls, no
one would be surprised if there were seven girls and only three
boys. If a second sample were taken, the observation of four girls
and six boys would, again, cause no concern. Although in the long
run we expect approximately equal numbers of boys and girls, the
play of chance means that we seldom get a 50:50 split in small
samples.
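
The example of ten newborn babies can be worked through with a short calculation. The sketch below assumes, purely for illustration, that each birth is independent and that a girl is exactly as likely as a boy (probability 0.5); the function name prob_girls is hypothetical. Under these assumptions the number of girls follows a binomial distribution.

```python
from math import comb


def prob_girls(k, n=10, p=0.5):
    """Binomial probability of exactly k girls among n babies (assumes p = 0.5)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Exactly five girls and five boys: 252 of the 1024 equally likely arrangements.
even_split = prob_girls(5)

# A split at least as lopsided as 7:3, in either direction.
lopsided = sum(prob_girls(k) for k in range(11) if abs(k - 5) >= 2)

print(round(even_split, 3))  # 0.246
print(round(lopsided, 3))    # 0.344
```

On these assumptions an exact 50:50 split turns up in only about a quarter of samples of ten, while a split of 7:3 or worse turns up in about a third of them.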
The effects of the play of chance are seen everywhere in medical
research. Suppose two treatments are being compared in a clinical
trial. Patients will have been randomly allocated to the two groups.
Randomisation protects against systematic differences between the
two groups, but it does not prevent differences arising by chance.
For example, it could happen that slightly more of the severely ill
patients could have been allocated to one treatment group, creating
an apparent difference between the treatments even if there were
none. In practice it is very unusual for the two groups in a clinical
trial to be exactly the same: there are often small chance differences
between them. Less often, there are very large chance differences
between the two groups.
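
The chance imbalance described above can also be quantified. The sketch below uses invented numbers: a trial of 40 patients, 12 of them severely ill, randomly allocated to two groups of 20. The function name is hypothetical. The number of severely ill patients falling into the first group then follows a hypergeometric distribution.

```python
from math import comb


def prob_severe_in_first_group(k, severe=12, total=40, group_size=20):
    """Probability that exactly k severely ill patients are allocated to the first group."""
    return (
        comb(severe, k)
        * comb(total - severe, group_size - k)
        / comb(total, group_size)
    )


# Perfect balance: six severely ill patients in each group.
perfect_balance = prob_severe_in_first_group(6)

# One group has at least four more severely ill patients than the other.
imbalance_of_four_or_more = sum(
    prob_severe_in_first_group(k) for k in range(13) if abs(k - (12 - k)) >= 4
)

print(f"perfect balance: {perfect_balance:.3f}")
print(f"imbalance of four or more: {imbalance_of_four_or_more:.3f}")
```

Even with proper randomisation, perfect balance is the exception rather than the rule in a trial of this hypothetical size.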
The importance of the play of chance lies in the extent to which
it might have affected the observed results. Sometimes, what looks
like an interesting result may finally prove to be a statistical fluke.
Fortunately, statistical methods allow us to estimate whether or not
the observed results could be due to the play of chance. Central to
these methods is the concept of probability.

Probability
The probability of throwing a six (with a fair, six-sided dice) is
one in six. The probability of a single ticket winning the national
lottery is one in 14 million. Probabilities are simply a way of
describing how likely it is that an event will happen. They are often
expressed as decimal fractions, where one in six becomes 0.167.
The interpretation of probabilities is quite straightforward. When
an event has a very small probability, for example, 0.0001, it is very
unlikely to happen. When the probability is large, say 0.9, the event
is very likely to happen.
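
The conversion from a 'one in n' chance to a decimal fraction is simple arithmetic, shown here as a minimal sketch. The helper name one_in is hypothetical; the figures are the two examples given in the text.

```python
def one_in(n):
    """Convert a chance of 'one in n' to a decimal probability."""
    return 1 / n


six_on_dice = one_in(6)           # throwing a six with a fair dice
lottery_win = one_in(14_000_000)  # a single lottery ticket winning

print(f"{six_on_dice:.3f}")  # 0.167
print(f"{lottery_win:.1e}")  # 7.1e-08
```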
Probabilities vary between 0.0 and 1.0, where zero means an
event will never happen and 1.0 means it is certain to happen.
Thus, the probability that a healthy adult will eventually die is 1.0,
because we all die sometime. In contrast, the probability of the
adult dying tomorrow is less than one in 100 000, i.e. 0.00001. It
is not quite zero because some unlikely event, such as being run
over by a bus, might just happen. But it is very small because
unlikely events are unlikely to happen.
Probabilities lie at the heart of statistical tests. They are often
termed P-values, in which the letter P stands for probability. They
can be written as P = 0.003, indicating that an event has a three in
1000 chance of occurring, a somewhat rare occurrence. Sometimes
this will be written P < 0.01, where the < symbol means
'less than'. The decimal 0.003 is less than 0.01, so if P = 0.003 it
is also true that P will be less than 0.01. The < symbol is widely
used, but saying P < 0.01 is less accurate than P = 0.003. There was
a fashion for using the < symbol, where probabilities were
rounded up to certain values. The most common were P < 0.05,
P < 0.01, and P < 0.001. However, it is now preferred to give the
exact P-value because rounding up leads to an approximate value
which wastes information.
The logic of statistical tests
Statistical tests use what sometimes appears to be a curious logic.
It can seem that they do so just to be difficult, but the approach is
chosen because it is the only one which is valid. Consider a study
comparing two treatments in a clinical trial in which one treatment
gives better results than the other. The first step is to propose that
the difference observed between the treatments is due solely to the
play of chance, i.e. that there is really no difference between the
treatments. (This is not what we are hoping for: everyone would
want a new treatment to be superior to the conventional one.
However, it is the way the logic leads us.) The statistical test then
calculates how likely it was that, by chance alone, we would have
seen a difference at least as big as that observed. The test provides
us with a probability, a P value, of seeing such a difference by
chance alone. When this is very small (for example, P < 0.001), we
conclude that the result is unlikely to be due to chance. This leads
us to reject the proposal that there is no difference between the
treatments; we can conclude that one really is better than the
other. (The hypothesis that there is no difference between the two
treatments, other than that due to chance, is commonly called the
null hypothesis.)
The P-value is a very convenient guide to whether or not the
observed results could be due to chance: small P-values indicate
that the result is unlikely to be due to chance. All we have to decide
is when the P-value is small enough. There is a convenient, if
arbitrary, rule: when the P-value is less than 0.05 (i.e. P < 0.05) we
exclude chance as an explanation. When the P-value is this small
the result is said to have achieved statistical significance.
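The logic described above can be illustrated with a small simulation. The sketch below is not a method taken from this book: it uses a permutation approach, chosen only because it mirrors the reasoning step by step, and the blood pressure figures and function names are invented for illustration. It shuffles the group labels many times and counts how often chance alone gives a difference at least as big as the one observed.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate the chance of a difference in means at least as big as
    the observed one, if there were really no difference between groups."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_big = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis the group labels are arbitrary,
        # so reshuffling them shows what chance alone can produce.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for rounding
            at_least_as_big += 1
    return at_least_as_big / n_shuffles


# Hypothetical falls in diastolic blood pressure (mm Hg) on two drugs.
drug_a = [14, 16, 15, 17, 13, 15]
drug_b = [5, 4, 6, 7, 3, 5]
print(permutation_p_value(drug_a, drug_b))
```

A small value printed here would lead us to reject the null hypothesis; two groups with identical values would give a value of 1.0, since every shuffle then matches the observed difference of zero.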
The arbitrary rule (P < 0.05) does not correspond to a guarantee.
Suppose we carried out a large number of statistical tests. We would
expect a spuriously significant result to occur on average once for
every 20 significance tests which have been carried out. (This is
because P = 0.05 actually says that chance alone could create
the result one time in 20.) There are two corollaries. First, studies
which have conducted a multitude of significance tests will
regularly encounter spuriously significant results. Second, smaller
P-values, say P < 0.01 or even P < 0.001, give increased confidence
that the result was not a chance affair.
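The first corollary can be put into numbers. The short sketch below assumes, purely for illustration, that the tests are independent of one another and that there is no real effect in any of them; the function names are invented here.

```python
def expected_spurious_results(n_tests, alpha=0.05):
    """Average number of spuriously significant results among n_tests."""
    return n_tests * alpha


def chance_of_any_spurious_result(n_tests, alpha=0.05):
    """Chance that at least one of n_tests is spuriously significant,
    assuming independent tests and no real effects."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 20, 100):
    print(n, expected_spurious_results(n),
          round(chance_of_any_spurious_result(n), 3))
```

Under these assumptions a study reporting 20 significance tests has roughly a two in three chance of containing at least one spurious result, which is why a multitude of tests calls for caution.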
Confidence intervals
Confidence intervals provide an alternative way of assessing the
effects of chance. They can also be more informative. Suppose a
clinical trial of two antihypertensive drugs showed that one lowered
diastolic blood pressure by an average of 15 mm Hg, whereas a
second lowered it by an average of only 5 mm Hg. The average
difference of 10 mm Hg seems impressive, but we know that this
value could be influenced by chance. The important question is
whether the true value for the difference between treatments could
be as low as zero (i.e. no difference). We find this from the 95%
confidence interval. It gives the range within which we are 95%
certain that the true value lies. If the antihypertensive trial had
given a range of 3 to 17, we would say that the true value could be
as low as 3 or as high as 17. Zero lies outside this interval so we
conclude that it is unlikely to be the true value. This is equivalent
to obtaining a P-value of P < 0.05, i.e. the result achieved statistical
significance. (Note that if the confidence interval had been wider,
say ranging from -4 to 24, the interpretation would be different.
This range includes zero, so that it could be the true value. We
would then conclude that we have no evidence that there is a
difference between the treatments.)
Confidence intervals are always interpreted in the same way,
whether the research method was a clinical trial, a cohort study, or
whatever. They are inspected to test the proposal that there was no
effect, for example, no difference between two groups. If the zero
difference lies within the confidence interval we conclude that
there is no evidence of an effect. If it lies outside the range, we
exclude no effect as being unlikely. This is equivalent to saying that
the result was statistically significant.
The advantage of confidence intervals is that they do more than
indicate whether the result might be a chance effect. They show,
allowing for the play of chance, how small and how large the true
size of effect might be.
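The antihypertensive example can be reproduced with a few lines of arithmetic. The sketch below uses the common approximation that a 95% confidence interval is the estimate plus or minus 1.96 standard errors. The standard errors are assumptions chosen so that the intervals roughly match those quoted in the text, which does not report them.

```python
def confidence_interval_95(difference, standard_error):
    """Approximate 95% confidence interval: estimate +/- 1.96 standard errors."""
    margin = 1.96 * standard_error
    return difference - margin, difference + margin


def includes_zero(interval):
    """True if 'no difference' is a plausible value for the true effect."""
    lower, upper = interval
    return lower <= 0 <= upper


# Difference of 10 mm Hg with an assumed standard error of 3.57 mm Hg:
narrow = confidence_interval_95(10, 3.57)   # roughly 3 to 17
# The same difference with an assumed standard error of 7.14 mm Hg:
wide = confidence_interval_95(10, 7.14)     # roughly -4 to 24

print(narrow, includes_zero(narrow))
print(wide, includes_zero(wide))
```

The observed difference is the same in both cases; only the precision changes, and with it the conclusion about whether zero can be excluded.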

Pitfalls in the analysis

P-values and confidence intervals provide a valuable guide to the
interpretation of results when the analysis has been carried out
[...] is flawed. From the details given in a paper it can be difficult
[...] whether there are [...] the analysis. All statistical tests make
some assumptions about the data, but without access to the raw
data it is not possible to test whether they [...].
Nonetheless, there are some pointers to possible defects.

Outliers

When data are presented in tables and figures there is the
opportunity to look for [...] of adults in the general population
would be expected to [...] other. A small child placed on the long
arm can overcome the [...] of a much heavier child on the short
arm. In statistical tests a few distant points can pull against the
bulk of the data [...] some steps should be taken to [...] coped with.

Skew
Some measurements, such as length [...] grouped around an
average value but with a long tail of high values. The high values
can't be called outliers because they are not really separate from
the bulk of the data. Instead, they gradually shade away from the
most common values. The presence of a long tail of observations on
one side of the average is called skewness. Skew acts in the same
way as outliers to distort statistical tests. It is commonly seen
when the observations [...] patients. [...] be made for the presence
of skew by transforming the [...] how it was coped with.
Non-independence

A common assumption in statistical [...] observations are
independent. [...] measuring the heights of a sample of
schoolchildren it would be assumed that measuring [...] effect on
the measurement of the next child. The assumption that all the
measurements are [...] the children [...] statistical methods [...]
special [...] measures designs) which can cope [...] of the common
statistical tests cannot.
Serendipity masquerading as hypothesis
Many, if not most, research studies collect quite large amounts of
data on the subjects being studied. These studies [...] been set up
to test a few specific hypotheses. However, [...] many other
avenues which can be explored [...] between the variables, [...]
subdivided (for example, by [...] or severity of disease). Such
detailed [...] not to explore the data fully [...] be wasteful. [...]
hypotheses which [...] Chance observations are presented as if they
[...] were being tested. [...] of two antihypertensives, the
researchers might check whether the same size of effect was seen
in the [...] compared with the old, or in men compared with women.
[...] to report any difference seen between these subgroups [...]
had been clearly specified in advance [...] It would be wrong to
pretend that this was a [...] study was set up to test.
There are many ways in which data can be subdivided, so that
by chance alone some are [...] reveal apparently interesting
effects. [...] lots of interesting observations [...] tested in
subsequent studies. A single [...] be used to generate and test
hypotheses.

Black-box analyses
Appropriate statistical tests increase the credibility of a paper.
However, it does not follow that the more sophisticated the tests
used the more authoritative the findings. Present day computer
packages make it very easy to carry out complex analyses, even
when the user has an incomplete understanding of the method
being used. The more complex analyses tend to make more
assumptions about the data being analysed. It is also much easier to
make mistakes during the analysis.
Papers which present only the results of some complex analysis
should be viewed with some suspicion. This approach makes it
more difficult to identify problems in the data, such as skewness
and outliers. It is better if the results of more simple analyses are
presented first. Then these can be checked to see whether they
accord with the more complex. If discrepancies are found they
should be explained in the text of the paper. Unexplained
discrepancies cast doubt on the propriety of the analysis.

Bias
Bias is the bugbear of research. It means that the results we get
are systematically different from what they should have been. Bias
can occur in a variety of guises but its effect is always the same: the
observed results are misleading and the conclusions drawn may be
wrong.
Bias can arise when the study subjects were selected. Even when
the study has been carried out carefully, it is possible that those
included differ from the general run of subjects. Some could have
been included because they were more severely ill and hence, being
bed-bound, were easily contacted. Alternatively, the severely ill
might be excluded if, for example, they were away for treatment at
a specialist centre. The need to obtain informed consent from the
study subjects can also introduce bias. Those who refuse may differ
from those who agree to co-operate. This problem is particularly
acute for research concerned with attitudes and beliefs, as these are
the very characteristics which influence participation in the study.
Bias can also occur when data are being collected. Measuring
instruments may be wrongly calibrated, giving consistently high or
low readings. The way in which questions are asked of patients may
influence their replies: a sympathetic interviewer may encourage
patients to describe their experiences in full. A second study with a
more abrasive interviewer could gather much less detail, leading to
[...] factors such as their need to find an explanation for [...]
study of past events might be influenced more by factors affecting
recall than by whether or not past events [...].
It is often difficult to determine whether bias has entered a study.
[...] has occurred can assist in its identification.

Confounding

[...] part of the observed relationship [...] cancer are [...]
smoking, and it is smoking which causes lung [...] pressure tends
to [...] observed relationship. [...] hypertension causes [...] they
reflect differing aspects of ageing.

5 Introduction to the check lists

The check lists provide a series of questions to be asked of
published papers. The questions lead towards an informed
judgement on the meaning of the findings and their relevance for
clinical practice. Thus, the check lists provide a framework for the
appraisal of papers. Each question has accompanying text which
explains and expands on what is being assessed.
The check lists begin with a set of standard questions (Chapter
6) which apply irrespective of the research method which has been
used. There follow five further check lists (Chapters 7 to 11)
covering clinical trials, surveys, cohort studies, case-control
studies, and review papers. These check lists are arranged in two parts.
The first, the essential appraisal, acts as an initial screen to identify
any major defects in the research. Thus, papers passing these
checks have been given the equivalent of a provisional certificate of
approval. Appraisal may stop at this stage, when the reader has
identified a paper with potentially interesting new facts. However,
if the paper is central to the reader's clinical practice then a more
detailed assessment may be required. Equally, if a health
professional is new to critical appraisal, then it will be helpful to review
the second check list, the detailed appraisal.
The second check list provides for a more detailed review. It is
organised in sections corresponding to the main sections of
scientific papers. For convenience, the list incorporates the set of
standard questions outlined in Chapter 6, but they are given in
italics so that they can be easily distinguished. Together, the
questions extend the process of uncovering and evaluating flaws in
study design and execution. They also review the overall quality of
the study and its analysis, and facilitate an assessment of the wider
significance of the results. This simplifies the task of making a
balanced assessment of the paper.
The complete check lists contain an extensive array of items [...]
the assessment of published papers. However, these do not provide
a tick-box guide to the quality of a paper. Many of the items require
a subjective assessment of quality. In particular, flaws must be
evaluated to determine their potential impact on the study's
findings.

Evaluating the flaws

Just because a study contains a flaw does not mean that it should
be discarded. Given the difficulties of designing and conducting
research and the operation of Murphy's many laws of disaster it is
little surprise that many published papers contain flaws. Indeed
a [...] critic could probably find some flaw in every paper.
[...] which are serious and which are merely trivial.
It is difficult to draw up general rules for the evaluation of flaws
because their impact will depend on the purposes of the study, the
method employed, and the way the study was carried out.
However, the items listed in the essential appraisal section specify
[...] flaws which can beset each type of method. If a study is
seriously deficient on any of these it is quite likely that its
conclusions are unfounded. For example, suppose a new drug for
rheumatoid arthritis were being investigated. Treatments are best
evaluated in randomised controlled trials. If the new drug had been
tested in a study without a control group, any claims about
[...] recorded along the lines indicated in Table 5.1. Writing the
details can help to focus the mind and facilitate a decision on the
merits of the paper.

Table 5.1  Assessment of the importance of flaws

Flaws encountered   Type of flaw        Direction    Is the finding of
                    (nature and size)   of effect    the study negated?
Design
Conduct
Interpretation

Best case/worst case/likely case

Often it is difficult to judge what effect a defect may have on the
study findings. A useful technique for many defects is to speculate
on alternative possibilities, posing best case/worst case/likely case
scenarios. Consider a postal questionnaire survey of the levels of
literacy among elderly patients in long-term care. Suppose that it
found that 90% had a high literacy score but the response rate was
only 50%. The low response rate could result in bias. The worst
case would be that all the non-responders were illiterate, explaining
why they did not reply. This would mean that the true proportion
with a high literacy score was only 45%. Alternatively it is possible
that all the non-responders were highly literate, although this
seems unlikely. The most likely case is that many non-responses
occurred for reasons other than literacy, but that reading
difficulties made a substantial contribution to the non-response. If it
happened that half the non-response was due to poor literacy, then
the true figure for high literacy would be 67.5%.
In practice we do not know the reasons for the 50% non-response,
but this type of speculation gives an indication of the
consequences that bias could have. It will help to clarify whether
the study results should be dismissed, or whether, with due
caution, they can be accepted as an addition to our knowledge. For
this example, given that the true level of high literacy could lie
between 45% and 90%, the study findings would appear
unreliable. Although the lower figure of 45% is unlikely to be correct,
poor literacy is likely to make a substantial contribution to
non-response. Thus, the study result of 90% is likely to be misleading.
The size of effect of flaws should always be assessed to determine
their likely impact on the study's conclusions.
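The arithmetic behind the literacy example can be set out as a short sketch. The function name and the way the scenarios are expressed are illustrative choices; the figures (90% observed, 50% response rate) come from the example above. The true proportion is a weighted average of the rate among responders and an assumed rate among non-responders.

```python
def true_proportion(response_rate, rate_in_responders, rate_in_nonresponders):
    """Overall proportion with a high score, combining responders
    with an assumed proportion among the non-responders."""
    return (response_rate * rate_in_responders
            + (1 - response_rate) * rate_in_nonresponders)


response_rate = 0.50
observed = 0.90  # proportion of responders with a high literacy score

# Worst case: every non-responder is illiterate.
worst = true_proportion(response_rate, observed, 0.0)
# Best case: every non-responder is highly literate.
best = true_proportion(response_rate, observed, 1.0)
# Likely case: half the non-responders are illiterate, and the other
# half resemble the responders.
likely = true_proportion(response_rate, observed, 0.5 * 0.0 + 0.5 * observed)

print(f"worst {worst:.1%}, likely {likely:.1%}, best {best:.1%}")
```

The worst and likely cases reproduce the 45% and 67.5% quoted above. Under the (unlikely) assumption that every non-responder was highly literate, the same calculation gives 95%.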

6 The standard appraisal questions

[...] incorporated into the check lists [...] questions follow
[...] but the rationale for them is only given [...] the sequence
in which information is presented in published papers:
Are the aims clearly stated?
Was the sample size justified?
Are the measurements likely to be valid and reliable?
Are the statistical methods described?
Did untoward events occur during the study?
Were the basic data adequately described?
Do the numbers add up?
Was the statistical significance assessed?
What do the main findings mean?
How are null findings interpreted?
Are important effects overlooked?
How do the results compare with previous reports?
What implications does the study have for your practice?

Are the aims clearly stated?

The [...] of the study should [...] explanation of why the study
[...] tackled an important [...] research hypothesis may have been
specified in advance, resulting in a well planned study. In contrast,
wide ranging or woolly aims suggest that many different issues
were being pursued to see what popped up. Such studies are less
likely to collect useful data. Further, they give the opportunity for
trawling through the results, performing multiple significance
tests. This makes it likely that some spurious statistically
significant results would be obtained.

Was the sample size justified?

Research should be carried out only when it has a good chance
of meeting the study aims. One essential part of this is that the
study should be large enough to give an accurate picture of what is
going on. Conventionally, the size of effect being sought (for
example, the likely difference between two treatments) is specified.
Then, a formal sample size calculation is carried out to determine
how big the study should be to detect this effect. The details of this
calculation should be in the methods section. Studies which are too
small often fail to detect clinically important effects. When the trial
has been completed, the question can be asked in a different form:
what size of effect did the study have the power to detect? Studies
of high quality address this issue, usually in the discussion section,
but many studies ignore it.
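As an illustration of what a formal sample size calculation involves, the sketch below uses one widely used textbook approximation for comparing the means of two equal-sized groups. It is not a formula given in this book, and the 5 mm Hg difference, the 10 mm Hg standard deviation, and the function name are assumptions made for the example.

```python
import math
from statistics import NormalDist


def sample_size_per_group(difference, standard_deviation,
                          alpha=0.05, power=0.80):
    """Approximate number of subjects needed in each of two groups to
    detect a given difference in means (two-sided test, normal
    approximation, equal group sizes and equal variability assumed)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * standard_deviation / difference) ** 2
    return math.ceil(n)


# Hypothetical trial: detect a 5 mm Hg difference in blood pressure
# when the standard deviation between patients is 10 mm Hg.
print(sample_size_per_group(5, 10))
# Halving the difference sought roughly quadruples the numbers needed.
print(sample_size_per_group(2.5, 10))
```

The calculation makes plain why small studies often miss clinically important effects: the numbers required rise steeply as the effect being sought becomes smaller.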

Are the measurements likely to be valid and reliable?

Poor measuring techniques can lead to substantial errors. The
methods of measurement should be described in some detail
(references may be given to methods which are described
elsewhere). They should be read critically, asking how errors could
be introduced. Particular attention should be paid to difficult
measurements, such as those involving subjective assessments.
When there is more than one observer, for example in multicentre
studies, some effort should have been made to standardise
measurements. The issue of measurement error is often tackled in
the discussion section, but if it is ignored the reader must ask
whether there could be errors in measurement, and whether these
could be important. The main concerns are whether the methods
are likely to be valid and reliable.
A valid measure is one which measures what it is supposed to
measure. For example, when estimating alcohol consumption,
valid answers are unlikely to be obtained to the question 'how
much do you drink?'. Many subjects will understate their true
consumption, although some boastful ones may exaggerate. A
reliable measure is one which gives a similar result when applied on
more than one occasion. For example, in theory, an adult's height
should be reliably measured, as it varies only slightly throughout
the day. In practice, many factors, such as the way an individual
stands, the way the measuring equipment is placed, whether or not
shoes are worn, can contribute to differences between
measurements. A feature of studies of high quality is that they discuss how
validity and reliability were assessed.

Are the statistical methods described?

The statistical methods which were used should be described in
the methods section, and should be referenced. Inappropriate
statistical analysis can produce misleading results. All statistical
tests make some assumptions about the data being analysed, and it
is encouraging when this issue is explicitly addressed. If there is
doubt about statistical propriety then it would be well to contact a
statistician. One warning sign is the use of exotic statistical tests:
was the test selected because of the P-value it yielded? Concern is
heightened when the only results presented are those from a
sophisticated statistical technique; simple analyses should be
presented first and compared with the more complex ones.
Another warning sign is the suggestion that a large number of tests
were carried out: as more tests are carried out it becomes
increasingly likely that spurious significance will result.

Did untoward events occur during the study?

In some studies it can prove difficult to follow exactly the initial
research design: some subjects may not be contactable, and others
may subsequently disappear. It may also prove impossible to make
measurements on certain individuals. Most of these types of
problem should have been identified and dealt with in pilot studies,
so their occurrence in the main study may indicate inadequate
preparation. Substantial amounts of missing data give ample
opportunity for bias to intrude. More worrying is when problems
encountered during the conduct of the study lead to changes in the
design. Such changes may be poorly thought out and lead to
further problems. For example, they could result in the data
collected before the change being incompatible with those
collected after. Some untoward events are truly unpredictable and
thus beyond the researchers' control, but often such events signal a
study of poor quality.

Were the basic data adequately described?

All studies should report the number of subjects who were
investigated, and how they were obtained. The basic characteristics
of the subjects should be described, usually giving the mean or
median for the principal measurements together with an indication
of how the subjects vary (for example, the standard deviation or
interquartile range). This information allows an assessment of the
extent to which the findings can be generalised, and whether they
are likely to be relevant to the reader's clinical practice.
The study should begin with simple analyses, giving the main
outcomes in simple tables or figures. These give the reader a feel for
what is going on in the data. Complex statistical methods, which
investigate the effects of many factors simultaneously, should be
presented only after the simple analyses have been given. The
findings from the more sophisticated analyses would normally fit
with the more simple ones. Any discrepancy between these analyses
should be explained in detail.

Do the numbers add up?

Subjects are sometimes lost from parts of a study, either because
they truly disappeared, or because measurements made on them
were not included in the final report. Many papers present several
tables in which the data are subdivided in different ways. This
provides an opportunity to check for absent subjects and missing
data. Ideally, the number of subjects in all tables should add up to
the value stated at the beginning of the results section (sometimes
the number of subjects is given in the methods). Inconsistencies in
the number of subjects should be explained in the text. Failure to
do so indicates some sloppiness; the authors may not have checked
the tables for typing errors, or they may be unconcerned about the
consequences of missing data. Small discrepancies (of the order of
1%) are unlikely to have much impact on the findings, but large
discrepancies are a serious hazard warning.

Was the statistical significance assessed?

The results of all research studies are influenced by the play of
chance. Sometimes chance effects can appear quite large,
especially when the sample size is small. Thus, the statistical
significance of the main findings should be assessed. A P-value of
less than 0.05 provides good evidence that the result is likely to be
real rather than chance. Even smaller P-values, such as P < 0.01, or
P < 0.001, give extra confidence that the result was not a chance
event.
Many medical journals prefer confidence intervals to P-values.
These also provide a test of statistical significance, but give
additional information. They show the range within which the true
value could lie (see Chapter 4). This allows an assessment of just
how large, or how small, the true effect might be. Further, when
the range is broad the meaning of the estimated size of effect is
called into question.

What do the main findings mean?

The interpretation of study findings follows a standard sequence.
The size of each reported effect is scrutinised to see whether it
might be of clinical importance. The level of statistical significance
is not necessarily a good guide to the clinical significance of a
finding. However, the confidence interval can be helpful, showing
the range within which the true value is likely to lie. The key
findings can then be matched against any defects in the design,
conduct and analysis, allowing the reader to make an informed
decision about what the study has really shown. The author's
conclusions cannot always be relied upon because researchers are
often more enthusiastic about their findings than is strictly
warranted. Instead, a careful search should be made for possible
biases or confounding (see Chapter 4).
Findings can be given more weight when there is some internal
consistency, i.e. similar results are seen when the data are divided
into subgroups, such as age and sex. Supporting evidence, such as
a dose-response relationship (for example, that a moderate
exposure carries a risk which is intermediate between low- and
high-level exposures), also makes it more likely that the result is
not a chance aberration. Finally, the inherent plausibility of the
results can be assessed. Do they make biological sense? Do they fit
with what is known about the disease? Is the timing of the events
plausible? The reader should not just accept the author's
interpretation, but should mull over findings to decide whether, in
their view, they make sense. Interpreting the findings of a study is a
matter for judgement, aided by experience. The process may be
subjective, and hence imperfect. But even an imperfect evaluation
is better than passive acceptance of the results at face value.

How are null findings interpreted?

Null findings (for example, a new treatment which was found
not to be better than a conventional one) need to be interpreted
with particular care. The lack of an effect could arise because the
study was too small to have a reasonable chance of detecting
anything. This would be seen in the confidence interval, which
would be wide. For example, in a clinical trial the confidence
interval would cover a range from the new treatment being much
better to the new treatment being much worse. Null findings can
also arise through weaknesses in the design or conduct of the study.
Whatever the explanation, lack of evidence of an association is not
the same as evidence of no association.

Are important effects overlooked?

There is an understandable tendency among researchers to draw
attention to findings which fit their preconceptions. Results which
do not fit their views, or which flatly contradict them, are
sometimes not commented on. Thus, the results need to be
explored for interesting looking effects, even null findings, which
are unremarked.

How do the results compare with previous reports?

The findings from a single study seldom provide convincing
evidence. New findings are usually accepted only when there is a
substantial body of research, involving several studies preferably
from more than one research group. (Confidence is diminished if it
is thought that other research groups have had difficulty in
confirming a result.) Thus the results of any study need to be
interpreted in the light of previous reports.
The supportive studies cited in a paper may not be sufficient to
confirm a finding. Some authors may be tempted to overstate
previous findings which support their own, and may omit mention
of contradictory results. Instead, the findings from a single report
need to be fitted into a balanced overview of all reported studies.
This is seldom possible as only a small number of people will be
expert in any given field. Nonetheless, the reader who is not
familiar with a particular field should be more circumspect about
accepting the claims of a single paper.

What implications does the study have for your


practice?
Possibly the most important issue when reviewing a paper is
whether it should lead to changes in the management of one's own
patients. The decision can be between subjecting patients to a useless
therapy and denying them access to an effective one. Alternatively, it
could be whether to advise them to avoid some harmful behaviour
at the risk of generating anxiety. The first question is, how big was
the effect, and is it clinically important? Then the quality of the
study should be assessed, together with the amount of supporting
evidence, asking whether the finding is likely to be true. Finally, the
relevance of the finding to your practice should be reviewed by
asking whether the patients studied were similar to yours and
whether the conditions in which the study was carried out resemble
local circumstances. This should indicate whether the same size of
effect is likely to occur in your patients.

7 Appraising surveys

This chapter presents the issues which are relevant primarily to
surveys. These are arranged in two parts: the brief essential
questions and the more detailed specific questions. An explanation
is given for each item. These lists are then combined, at the end of
the chapter, with the list of standard questions described in
Chapter 6. These lists provide a complete guide to the appraisal of
surveys.

The essential questions


Surveys are easy to carry out. In consequence, the method is
widely used and, some would say, widely abused. Surveys simply
involve identifying a group of subjects - be they patients, health
professionals, or members of the general population. Data are
collected on each subject, often by questionnaire or interview.
Surveys are used to make general statements about a group wider
than that studied. It is the validity of these generalisations that lies
at the heart of the brief appraisal of surveys.
Who was studied?
How was the sample obtained?
What was the response rate?

The specific questions


The specific questions are also concerned with the validity of the
generalisations, but have a different focus. Their main concern is
with the common pitfalls which beset the design and interpretation
of surveys.
Is the design appropriate to the stated objectives?
Is there a suggestion of haste?
How could selection bias arise?
Were the findings serendipitous?
Can the results be generalised?

Rationale for the essential questions


Who was studied?
The interpretation of survey findings naturally depends on who
is being investigated. The source of the sample will determine
whether the results apply generally or whether they are restricted to
a highly selected group. Selection criteria for entry into the sample
(such as age, sex, or severity of disease) should be inspected
carefully to see how these could influence the findings. A clear
description of the source population allows an assessment of
whether the method of drawing the sample is flawed.
How was the sample obtained?
The method of obtaining the sample is crucial to the validity of
the findings. Some studies take what is called a 'grab sample' of
subjects. Individuals are included because they are convenient to
hand, ignoring the problem that the factors which make them
convenient may also make them unrepresentative. The personal
case-series (for example, the patients treated at a particular clinic)
falls into this category. It might be thought that the sample is
comprehensive if all the patients seen over some time period are
included. However, there are many selection factors that determine
which patients are seen by a health professional: there will usually
be many patients who do not make it through the gates controlling
access to a clinic.
The process of obtaining a sample needs to be a rigorous one,
and should be adequately described. Everyone who would be
considered eligible for the study should have an equal chance of
being selected for the sample. Thus, there needs to be some kind of
list of all potential subjects, together with a mechanism for drawing
a random sample (using random numbers). Sampling is flawed
whenever there are groups of subjects who will be largely
overlooked. The size of the problem will depend on the number
overlooked, and the extent to which these subjects are likely to
differ from those included.
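
The procedure just described (a list of all potential subjects plus a
random-number mechanism) can be sketched in a few lines of Python. This
is only an illustration: the function name draw_random_sample, the
made-up patient identifiers, and the fixed seed are assumptions for the
example, not details of any particular study.

```python
import random


def draw_random_sample(sampling_frame, sample_size, seed=None):
    """Draw a simple random sample without replacement.

    Every subject on the list has the same chance of being selected.
    """
    if sample_size > len(sampling_frame):
        raise ValueError("sample size exceeds the number of eligible subjects")
    rng = random.Random(seed)  # a recorded seed makes the draw reproducible
    return rng.sample(sampling_frame, sample_size)


# Hypothetical list of all potential subjects (illustrative identifiers).
frame = [f"patient-{i:04d}" for i in range(1, 2001)]
sample = draw_random_sample(frame, 200, seed=1996)
print(len(sample), "subjects selected from", len(frame))
```

Anyone missing from the list can never be drawn, which is why the
completeness of the list matters as much as the random mechanism.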

What was the response rate?


In surveys, many subjects cannot be contacted, or refuse to take
part. The concern with non-response is whether it could introduce
bias. Those who do not respond often differ systematically from
those who do. Individuals who have moved home, or been admitted
to long-term care, may be missed; those who have emigrated, or
died, certainly will be. Information may also not be obtained on
those who have difficulty reading, dislike filling in questionnaires
(or taking part in interviews), or have some personal antipathy to
the subject being investigated. The reasons which lead people to be
difficult to contact also mark them out as being different from
those contacted. Thus, once the study has been conducted it is
worth asking 'Who is likely to be missed, and what effect would
this have on the results?'.
The greater the percentage of non-responders, the larger the bias
which could result. There is no magic rule for what is an acceptable
response rate. Instead, the circumstances of the study should be
reviewed, balancing the study findings against the size of
non-response and the likely reasons for it. Best case/worst case
calculations can be carried out (see Chapter 5). The question is: to
what extent have the findings been influenced by non-response?
Studies which do not give the response rate, or claim a near-perfect
response, should be regarded with suspicion. Unless the
circumstances are quite exceptional, such as a captive audience who are
compelled to answer, the response rate will always be less than
100%.
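
One possible reading of a best case/worst case calculation for
non-response is sketched below in Python. It brackets a survey
proportion by assuming, in turn, that none and then all of the
non-responders have the characteristic of interest. The function name
and the figures (1000 sampled, 700 responders, 210 reporting the
characteristic) are hypothetical and chosen only to show the arithmetic.

```python
def nonresponse_bounds(n_sampled, n_responded, n_positive):
    """Return (lowest, observed, highest) estimates of a proportion.

    lowest:   assumes no non-responder has the characteristic
    observed: the proportion among responders only
    highest:  assumes every non-responder has the characteristic
    """
    if not 0 <= n_positive <= n_responded <= n_sampled or n_responded == 0:
        raise ValueError("counts are inconsistent")
    n_missing = n_sampled - n_responded
    lowest = n_positive / n_sampled
    observed = n_positive / n_responded
    highest = (n_positive + n_missing) / n_sampled
    return lowest, observed, highest


# Hypothetical survey: 1000 sampled, 700 responded, 210 report the characteristic.
lowest, observed, highest = nonresponse_bounds(1000, 700, 210)
print(f"observed {observed:.0%}, possible range {lowest:.0%} to {highest:.0%}")
```

The width of the range grows directly with the number of
non-responders, which is the point made above: the lower the response
rate, the less the observed figure can be trusted on its own.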

Rationale for the specific questions


Is the design appropriate to the stated objectives?
The evidence from surveys is considered particularly weak.
Surveys can describe what is going on - for example, how care is
currently being delivered, or whether patients were satisfied with
their management. They can provide insight into the current
organisation and consequences of care, but they are not well suited
to explaining why events happen as they do. In particular, they
cannot be used to assess whether one form of care was more
effective than another. Surveys should be assessed by asking
whether the study aims might be better met by one of the other
research methods.

Is there a suggestion of haste?


A difficulty with surveys is that it is very easy to conduct one with
only a vague idea of what is being investigated. There is a danger
that they are undertaken before the details of the study design have
been properly worked out. Thus, a few questions may be cobbled
together and a questionnaire distributed to a set of individuals who
are conveniently to hand.
Surveys which have been conducted in haste often give scant
descriptions of how the sample was selected and how the
measurements were made. Usually they will give no indication that
a pilot study was carried out. Another warning sign is that no
attempt has been made to contact those who did not respond to the
first invitation to participate. The nature of the key findings should
also be checked. If subtle or sensitive issues have been investigated
then the methods section should explain how the questions were
developed and tested. If physical measurements, such as height or
blood pressure, have been taken then the steps taken to standardise
the conditions of measurement should be described.

How could selection bias arise?


The possibility that the method of drawing the sample could
affect the results was introduced in the brief appraisal questions.
However, selection bias has such importance for surveys that it
should be explicitly reviewed. The following questions are being
asked: What could possibly happen to make the sample that was
obtained atypical? Was the group from which the sample was
drawn unusual in any way? Could the method of drawing the
sample have introduced bias? Could certain types of participant
have been selectively lost? Note that having a very large sample
provides no protection. If selection bias is occurring it will have as
great an effect on a study with one million participants as it would
with one of 500.
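
The claim that sample size gives no protection can be illustrated with
a short calculation, sketched here in Python under stated assumptions:
a true prevalence of 20%, and people with the condition being twice as
likely to end up in the sample as people without it. Both figures, and
the function names, are invented for the illustration.

```python
import math


def expected_sample_prevalence(true_prevalence, relative_inclusion):
    """Expected prevalence in the sample when people with the condition are
    `relative_inclusion` times as likely to be included as those without."""
    with_condition = true_prevalence * relative_inclusion
    without_condition = 1.0 - true_prevalence
    return with_condition / (with_condition + without_condition)


def standard_error(proportion, n):
    """Approximate (binomial) standard error of a sample proportion."""
    return math.sqrt(proportion * (1.0 - proportion) / n)


true_prevalence = 0.20     # assumed true value in the population
relative_inclusion = 2.0   # assumed strength of the selection effect

expected = expected_sample_prevalence(true_prevalence, relative_inclusion)
for n in (500, 1_000_000):
    bias = expected - true_prevalence
    print(f"n={n:>9,}: expected estimate {expected:.3f}, "
          f"bias {bias:+.3f}, standard error {standard_error(expected, n):.4f}")
```

The standard error shrinks as the sample grows, but the bias term does
not contain the sample size at all: a larger study simply gives a more
precise estimate of the wrong value.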

Were the findings serendipitous?


Surveys provide the opportunity to collect data on many
different items: the analysis then looks to see what falls out.
Problems arise if unexpected results are presented as though they
were the very findings which were being sought. This flawed
procedure invalidates tests of statistical significance. If lots of data
have been collected it is likely that several spurious statistically
significant results will be found. Significance tests on survey data
should be treated with great caution. They can be used as a method
for discarding unimportant (i.e. non-significant) findings; the
difficulty lies in interpreting the significant ones. Papers should be
inspected carefully for suggestions that many statistical tests were
carried out but that only the significant ones are presented.
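
To see why many tests produce spurious significant results, the
arithmetic can be sketched in Python. The sketch assumes the simplest
possible case: every test is independent, no real effect exists, and the
conventional 5% significance level is used. Real survey analyses rarely
meet these assumptions exactly, so the numbers are indicative only.

```python
def expected_spurious_results(n_tests, alpha=0.05):
    """Expected number of 'significant' results when no real effect exists."""
    return n_tests * alpha


def chance_of_spurious_result(n_tests, alpha=0.05):
    """Probability of at least one 'significant' result among independent
    tests when no real effect exists."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    print(f"{n_tests:>3} tests: expect {expected_spurious_results(n_tests):.2f} "
          f"spurious results; chance of at least one "
          f"{chance_of_spurious_result(n_tests):.0%}")
```

Under these assumptions a paper reporting one significant finding out
of twenty tests has found roughly what chance alone would supply.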

Can the results be generalised?


The extent to which findings can be generalised, to other times
and other locations, is of concern for all research methods.
However, the issue is particularly acute for surveys because a
sample has been taken precisely for what it says about some wider
group. The extent of generalisation depends on how well the study
performed on the other appraisal questions. If it appears robust,
particularly on the essential questions, generalisations can be made
with confidence. A few minor defects should not affect the major
conclusions. However, evidence of powerful selection, including
substantial non-response, will mean that the findings cannot be
generalised.

The complete list for the appraisal of surveys


The essential questions
Who was studied?
How was the sample obtained?
What was the response rate?

The detailed questions*

Design
Are the aims clearly stated?

Is the design appropriate to the stated objectives?

Was the sample size justified?
Are the measurements likely to be valid and reliable?
Are the statistical methods described?

Is there a suggestion of haste?

Conduct
Did untoward events occur during the study?

Analysis
Were the basic data adequately described?
Do the numbers add up?
Was the statistical significance assessed?

Were the findings serendipitous?

Interpretation
What do the main findings mean?

How could selection bias arise?

How are null findings interpreted?
Are important effects overlooked?

Can the results be generalised?

How do the results compare with previous reports?
What implications does the study have for your practice?

* The questions in italics are the standard ones which were described in Chapter 6.

8 Appraising cohort studies

This chapter presents the issues which are relevant primarily to
cohort studies. These are arranged in two parts: the brief essential
questions and the more detailed specific questions. An explanation
is given for each item. These lists are then combined, at the end of
the chapter, with the list of standard questions described in
Chapter 6. These lists provide a complete guide to the appraisal of
cohort studies.

The essential questions


Cohort studies follow patients through time to determine what
becomes of them. The object may be to follow the natural time
course of disease, or to determine whether some medical
intervention has had an unintended consequence. Thus, the appraisal of
cohort studies focuses on the patient group and the outcome being
investigated, asking:
Who exactly has been studied?
Was a control group used? Should one have been used?
How adequate was the follow-up?

The specific questions


The specific questions extend the concern with the nature of the
study group and the details of the follow-up. They also explore the
factors which can contribute to misleading findings.
Is the design appropriate to the stated aims?
Was the exposure/intervention accurately measured?
Were relevant outcome measures ignored?
Did the analysis allow for the passage of time?
What else might influence the observed outcome?

Rationale for the essential questions


Who exactly has been studied?
The features of the group being studied are important because
the events which befall patients depend crucially on their
characteristics. The nature and number of clinical events which occur
among patients will depend on the duration and severity of their
illness, the range of interventions to which they have been exposed,
and the presence or absence of other medical conditions. Thus, the
key to interpreting the findings is a clear idea of who has been
studied, enabling the reader to ask: is this a surprising result? The
extent to which the findings can be generalised to other groups of
patients will also depend on who has been studied. Thus, the
source of the patients should be specified, for example, whether
they have been identified through a specialist clinic or through
general practice. The definition of eligibility for entry to the study,
such as the disease definition or the nature of the exposure, should
be given. Finally, if the study subjects have been obtained by some
form of sampling (for example, selecting GPs and taking a sample
of their patients) the details of this should be reviewed. These will
indicate whether the sample obtained is likely to be representative
of a wider group.
Was a control group used? Should one have been used?
Follow-up studies investigating exposure/intervention cannot
easily be interpreted when data have been collected only on the
exposed group. Suppose that the cancer risk of industrial pollution
were being assessed. Cancers occur spontaneously, so the true
question is whether there is an increase in the frequency of cancer
among those exposed. This can be determined only by comparison
with some control group who resemble the study group in all ways
except exposure. A crucial question concerns the appropriateness
of the control group: does it enable a fair comparison to be made?
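
The comparison with a control group can be expressed as a relative risk
and a risk difference. The Python sketch below shows the arithmetic; the
function name and the counts (30 cancers among 1000 exposed people, 20
among 1000 controls) are hypothetical and serve only to illustrate why
the exposed group's figure cannot be interpreted by itself.

```python
def compare_risks(exposed_cases, exposed_total, control_cases, control_total):
    """Compare the risk of an event in an exposed group and a control group."""
    risk_exposed = exposed_cases / exposed_total
    risk_control = control_cases / control_total
    # The relative risk is undefined when no events occur among the controls.
    relative_risk = risk_exposed / risk_control if risk_control > 0 else None
    return {
        "risk_exposed": risk_exposed,
        "risk_control": risk_control,
        "relative_risk": relative_risk,
        "risk_difference": risk_exposed - risk_control,
    }


# Hypothetical counts, invented for illustration.
result = compare_risks(exposed_cases=30, exposed_total=1000,
                       control_cases=20, control_total=1000)
print(f"risk in exposed {result['risk_exposed']:.1%}, "
      f"risk in controls {result['risk_control']:.1%}, "
      f"relative risk {result['relative_risk']:.2f}")
```

Knowing only that 3% of the exposed group developed cancer says nothing
about excess risk; it is the control group's 2% that turns the figure
into a comparison. The calculation is only as fair as the control group
is appropriate.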
How adequate was the follow-up?
In follow-up studies patients have many opportunities to
disappear. Marriage, death, emigration, or admission to a long-stay
hospital can all result in patients being lost to follow-up. Whatever
the reasons for it, those lost to follow-up are likely to differ from
those who remain in view. The greater the extent of this loss the
greater the potential for bias. Thus, the key questions are: how
great was the loss? And to what extent, given the circumstances of
the study, could this influence the results?
Another crucial aspect of follow-up is the way that the outcome
was measured. For signal events like death or the diagnosis of
cancer, measurement is straightforward because the information
will be stated in official records. Physiological or biochemical
measures of disease status, such as peak flow for respiratory disease
or plasma creatinine for renal disease, also provide objective
measures. But when clinical judgement is being used, for example
to arrive at a diagnosis, there is an opportunity for error. Equally,
when patients are being interviewed, the nature of the questioning,
either probing or passing on, could influence the answers obtained.
Thus, the question to be asked is: could the method of obtaining
the outcome data materially affect the result? It is important then
that the person taking these decisions is blind to the exposure
group; otherwise bias might occur.
Finally, the length of follow-up should be reviewed to clarify
whether it was long enough to have a reasonable chance of
detecting important events. This is particularly important for
studies in which no events were observed. The minimum length of
follow-up depends on the events being studied. For example, if the
study investigated pain and discomfort after discharge from
day-case surgery, then a few weeks of follow-up would be sufficient.
However, if the aim was to detect adverse events associated with a
new treatment then a much longer follow-up would be required.
Side-effects may take several months to develop. Further, if there
was interest in diseases like cancer, which can take many years to
occur, then a substantially longer follow-up would be needed. The
question to be asked is: how long could it be until the events of
interest occur? The follow-up period should be substantially longer
than the expected time.

Rationale for the specific questions


Is the design appropriate to the stated aims?
Was the exposure/intervention accurately measured?
Were relevant outcome measures ignored?
What else might influence the observed outcome?
Did the analysis allow for the passage of time?
Are the findings unexpected?

Is the design appropriate to the stated aims?


Cohort studies are used to answer questions of the form 'What
happens next?'. They are clearly the method of choice when
studying disease prognosis, but they are also used for investigating
the consequences of exposure to potentially harmful agents. Cohort
studies are commonly used to investigate causality - for example,
whether an exposure caused a particular disease. They are
not the best method for answering this question: a clinical trial
would be preferred. However, it is plainly unethical to expose
people to potentially noxious agents in a clinical trial. Cohort
studies are used instead, recognising their limitations.
One area in which cohort studies are particularly poor is the
assessment of treatment efficacy. It is often possible to identify
cohorts of patients who received different treatments, and it might
appear tempting to compare outcomes. For example, the outcome
for those receiving radical surgery could be compared with that for
patients with the same condition who were more conservatively
treated. The problem is that there will often be good clinical
reasons for the choice of operation: perhaps those with the more
serious disease receive the more radical treatment. Whatever the
basis for choice, the two groups being compared are likely to be
different before they are treated. Differences in outcome cannot
then be assigned to the nature of the treatment; they could easily be
due to the way patients were allocated to each treatment.
Was the exposure/intervention accurately measured?
When the cohort study is investigating the consequences of some
medical intervention or exposure to a noxious chemical, the nature
of the exposure/intervention should be properly described. It is
best if the extent of exposure has been measured objectively. For
example, the amount of exposure to medical X-rays could be
obtained through case-note searches of the type and number of
X-rays that each patient had had. It would also be possible to confirm
that the controls had not been exposed.
Sometimes the exposure will be more difficult to quantify - for
example, exposure to an environmental hazard such as airborne
pollution from an industrial plant. Residence close to the plant
could be used as a proxy for exposure, but individuals will vary in
the amounts of time they spend at home. Further, some of the
control group, chosen because they live far from the plant, could
actually work at the plant. Measuring exposure can be difficult,
especially if the concern is with chronic exposure, when
individuals' exposure levels can change over time. Thus, whenever
exposure/intervention is described, two key questions to be asked
are 'How accurately was exposure measured?' and 'Could some of
the controls have been exposed?'.

Were relevant outcome measures ignored?


The impact of medical interventions or exposure to noxious
agents could be measured in different ways. For example, if an
atmospheric pollutant is suspected of causing or exacerbating
asthma, exposed individuals could be followed up in several ways.
Deaths from asthma might not be helpful because they would be
rare, but emergency hospital admissions for asthma could well be
used. Alternatively, outcome could be assessed in general practice,
counting recorded new diagnoses or exacerbations of asthma. Even
indirect measures, such as requests for anti-asthma medication, or
the cashing of prescriptions, could be used. Finally, a community
survey could be carried out to investigate symptoms of
breathlessness or wheezing. The different outcome measures vary not just
in their frequency but also in their severity and their validity.
Published papers should give some rationale for their choice of
outcome measure, but it is up to the reader to decide how best the
outcome could have been assessed.

Did the analysis allow for the passage of time?


If cohort studies follow people for several years they may need to
allow for the ageing of study participants. Most diseases increase in
frequency with age so the analysis should take account of this.
Further, if important factors which could influence the outcome
have been identified these should also be incorporated into the
analysis. It can be difficult for the non-statistician to be sure
whether the methods are truly appropriate, but it will be possible to
check that some attempts have been made to cope with the
complexities which have been identified.
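
The text does not say which method should be used to allow for age.
Direct age standardisation is one common approach, and the Python
sketch below uses it purely as an illustration. All counts, the two age
bands, and the equal standard weights are invented; the function names
are likewise assumptions made for the example.

```python
def crude_rate(events, person_years):
    """Overall event rate, ignoring age."""
    return sum(events.values()) / sum(person_years.values())


def directly_standardised_rate(events, person_years, standard_weights):
    """Weight each age-specific rate by a common standard population."""
    total_weight = sum(standard_weights.values())
    weighted = sum(
        standard_weights[band] * events[band] / person_years[band]
        for band in standard_weights
    )
    return weighted / total_weight


# Hypothetical cohorts: identical age-specific rates, different age mix.
exposed = {"events": {"40-59": 10, "60-79": 120},
           "person_years": {"40-59": 1000, "60-79": 3000}}
controls = {"events": {"40-59": 30, "60-79": 40},
            "person_years": {"40-59": 3000, "60-79": 1000}}
standard_weights = {"40-59": 0.5, "60-79": 0.5}  # assumed standard population

for name, group in (("exposed", exposed), ("controls", controls)):
    crude = crude_rate(group["events"], group["person_years"])
    adjusted = directly_standardised_rate(
        group["events"], group["person_years"], standard_weights)
    print(f"{name}: crude rate {crude:.4f}, age-standardised rate {adjusted:.4f}")
```

In this made-up example the crude rates differ only because the exposed
group contains more older person-years; once age is allowed for, the two
groups are indistinguishable. A reader can at least check whether a
paper reports some adjustment of this kind.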

What else might influence the outcome?


The major limitation of cohort studies is that the researcher has
no control over the group to be investigated. The study group is
selected because they have some disease or have been exposed to
some potentially noxious agent. Individuals who develop a disease
may differ in many ways from those who are disease-free. Equally,
those who have been exposed to a noxious agent are likely to differ
from those who have not. The process of selecting individuals with
some defined characteristic can influence the study outcome.
Thus, the outcome needs to be reviewed to determine which
factors, clinical and behavioural, could influence it. These could be
aspects of individual behaviour, such as smoking, drinking, diet,
and exercise, or characteristics of the disease or its management.
Having identified these factors, the paper can be inspected to see
whether these factors were investigated and, if so, how they were
allowed for in the analysis.

The complete list for the appraisal of cohort studies

The essential questions
Who exactly has been studied?
Was a control group used? Should one have been used?
How adequate was the follow-up?

The detailed questions*

Design
Are the aims clearly stated?
Is the design appropriate to the stated aims?
Was the sample size justified?
Are the measurements likely to be valid and reliable?
Was the exposure/intervention accurately measured?
Were relevant outcome measures ignored?
Are the statistical methods described?

Conduct
Did untoward events occur during the study?

Analysis
Did the analysis allow for the passage of time?
Do the numbers add up?
Were the basic data adequately described?
Was statistical significance assessed?

Interpretation
What do the main findings mean?
What else might influence the observed outcome?
How are null findings interpreted?
Are important effects overlooked?
How do the results compare with previous reports?
What implications does the study have for your practice?

* The questions in italics are the standard ones which were described in Chapter 6.

9 Appraising clinical trials

This chapter presents the issues which are relevant primarily to
clinical trials. These are arranged in two parts: the brief essential
questions and the more detailed specific questions. An explanation
is given for each item. These lists are then combined, at the end
of the chapter, with the list of standard questions described in
Chapter 6. These lists provide a complete guide to the appraisal of
clinical trials.

The essential questions


Clinical trials are the method for assessing the effectiveness of
specific treatments. In essence they involve comparing one
treatment with another to determine which is better. The key
requirement for clinical trials is that a fair comparison is made. The
essential appraisal questions are directed at the main reasons why
the comparison might not be fair:
Were treatments randomly allocated?
Were all the patients accounted for?
Were outcomes assessed blind?

The specific questions


The specific questions also search for potential causes of
unfairness in the comparison of treatments. However, they extend
the evaluation into other areas and inspect the overall quality of the
trial. Evidence of poor quality can cast doubt on a study, even when
the brief appraisal suggests that the comparison appears to be fair.
The questions also review the wider significance of the findings.
The extent to which the findings can be generalised depends on
how the study was conducted: the types of patients, the nature of
the treatment, and the way the outcome was assessed.
Could the choice of subjects influence the size of treatment effect?
How was the randomisation carried out?
Were there ambiguities in the description of the treatment and its administration?
Could lack of blinding introduce bias?
Are the outcomes clinically relevant?
Were the treatment groups comparable at baseline?
Were deviations from planned treatment reported?
Were results analysed by intention to treat?
Is the size of effect clinically important?
Were side-effects reported?

Rationale for the essential questions


Were treatments randomly allocated?
For a fair comparison the two treatments must be given to
similar types of patients. This can be best achieved by randomly
allocating patients to one of the two treatments. (The process uses
computer-generated random numbers to avoid any problems of
human frailty.) If randomisation has not been used, the patients
receiving the two treatments are likely to be systematically
different. The term 'quasi-randomised' is a warning signal; it
usually means that patients have been allocated to treatments using
a method which is convenient to the researchers (for example, by
day of the week). But the convenience may result in subtle
differences between the two groups.
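
A minimal sketch of allocation by computer-generated random numbers is
given below in Python. It uses simple (unrestricted) randomisation with
two treatments labelled A and B; the function name, patient identifiers,
and seed are assumptions for the example, and real trials typically rely
on a separately managed allocation schedule rather than a script like
this.

```python
import random
from collections import Counter


def allocate_treatments(patient_ids, treatments=("A", "B"), seed=None):
    """Allocate each patient to a treatment purely at random.

    The allocation does not depend on the patient, the clinician,
    or the day of the week.
    """
    rng = random.Random(seed)  # the seed would be kept away from recruiting clinicians
    return {patient_id: rng.choice(treatments) for patient_id in patient_ids}


# Hypothetical trial with 20 patients.
patients = [f"patient-{i:02d}" for i in range(1, 21)]
allocation = allocate_treatments(patients, seed=42)
print(Counter(allocation.values()))
```

Simple randomisation of this kind does not guarantee equal group sizes,
especially in small trials, which is one reason a report should describe
exactly how the randomisation was done.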

Were all the patients accounted for?


In clinical studies contact with some patients may be lost. The
concern in clinical trials is whether the patients who disappear are
special in any way. For example, patients may fail to attend a
scheduled appointment because they are so severely ill that they are
unable to travel. Alternatively, they may be completely recovered,
and not see any need to attend. If one of the treatments being
tested was truly effective then patients who were being cured might
not attend. If the other treatment was much less effective it might
well be the very sick who do not attend. It is not possible to know
why some patients disappear from follow-up. However, doubts
about the fairness of the comparison emerge if a substantial
number of those who were randomised have disappeared. Concern
is particularly heightened if more patients are lost from one
treatment group than the other.

Were outcomes assessed blind?


When clinical judgement is needed to assess the outcome of
treatment there is the opportunity for personal views to intrude.
One researcher may be an enthusiast for a new treatment, and may
subconsciously record a better outcome for the patients receiving
it. Another researcher, aware of this possible bias, may
overcompensate and give a better rating to the other treatment. The
problem is prevented if those assessing treatment outcome are
blind to the treatment each patient received.

Rationale for the specific questions


Could the choice of subjects influence the size of treatment effect?
To be able to assess the effects of a treatment the source and
nature of the patients studied need to be fully described. It is
possible that a treatment which is highly effective for the severely ill
patients seen at a specialist centre would have little impact on those
with a milder form of the disease who are managed by a general
practitioner. The information which should be given is:
The setting from which the patients were recruited (community, hospital, or specialist clinic)
The diagnostic criteria for entry to the trial
Factors which led to patients being excluded from the study (for example, contra-indications to therapy)
A description of the duration and severity of disease at entry to the study
This information allows the reader to decide the extent to which
the study findings apply elsewhere. In particular, it helps to answer
the question 'Could this treatment help my patients?'.

How was the randomisation carried out?


The randomisation of patients should be organised so as to
minimise the opportunities for the randomisation code to be
broken. Thus, the report should indicate how the process was
undertaken. In the past the randomisation codes were sometimes
held in individually sealed envelopes, but nowadays the codes are
usually kept at a remote central trial office. As each patient is
entered the clinician would phone the trial office to be told the
appropriate code. For example, in a drugs trial the two treatments
might be identically packaged, one labelled A the other B. The
clinician would be given the appropriate letter for each patient,
thus preserving blindness to treatment group.

Were there ambiguities in the description of the treatment and its administration?

To be used in practice the treatments tested in a clinical trial
need to be fully described. If the information is absent, it may be
difficult for others to use a successful treatment in their own
practice. Further, if these details have not been clearly worked out
the treatment may not have been given properly to all the study
patients. Were this to happen the size of any treatment effect could
be reduced.

Could lack o f blinding introduce bias?


Bias can be introduced in several ways when it is known which
treatment each patient received. It is better that everyone - patient,
clinician and statistician - be blind to treatment details. Patients
who believe they are getting an expensive new drug may report
being better than they really are. There is evidence that, through
mechanisms which are little understood, these patients may
actually experience clinical benefit.
The management given to patients by the caring physician could
also be affected by knowledge of the treatment. If one treatment is
thought to be less effective, those given it might receive more care
and attention to compensate for the anticipated deficiencies. If
knowledge of the treatment group influences overall management,
bias could result.
The question of biased outcome assessment was discussed in the
brief appraisal, although bias during the statistical analysis was not.
If it were known which treatment group was which, there could be
a temptation to search the data for some difference which would
support one of the therapies. Obscure differences which did not
support this therapy might be disregarded and only supportive
ones reported. If there appears to be some data-torturing, be
46

A PPR A ISIN G C L IN IC A L T R IA L S

su sp iciou s if the an alysis w as n o t c a rrie d out b lin d to treatm en t


group.

Are the outcomes clinically relevant?


Deciding whether a treatment is effective is not always easy. For
some diseases there may be an obvious yardstick: for example, for
cancer it might be the length of time for which patients survive. For
other diseases, such as rheumatism, clear-cut measures such as
time to death are not appropriate. Alternative measures, such as
levels of disease activity, extent of disease-related physical mobility,
or quality of life, need to be used. There are often many different
ways in which the effect of treatment could be measured. It is
always worth asking whether the one which was used provides the
best way to assess the management of this disease. At its most
extreme, the treatment might improve one aspect of the disease
(level of symptoms or patient satisfaction) while other aspects
deteriorate (leading to serious complications or even death).
Short-term measures (for example, patient status at 2 weeks) are
particularly suspect.
When several outcome measures have been used, it is best if one
of them has been nominated as the primary measure to be used to
judge treatment. This guards against the dangers of multiple
testing, when out of a host of measures the one producing
(spurious) statistical significance is highlighted while the others are
overlooked.
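
The arithmetic behind the danger of multiple testing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the outcome measures are taken to be independent, the treatment is assumed to have no real effect, and each measure is tested at the conventional 5% level. The function name is hypothetical and the figures do not come from any real trial.

```python
def chance_of_spurious_result(n_outcomes: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent tests reaches
    'significance' purely by chance when there is no true effect."""
    return 1.0 - (1.0 - alpha) ** n_outcomes


# How the risk grows as more outcome measures are examined.
for k in (1, 5, 10, 20):
    print(f"{k:>2} outcome measures: {chance_of_spurious_result(k):.2f}")
```

Under these assumptions the chance of at least one spurious 'significant' finding rises quickly with the number of measures, which is why a single pre-specified primary measure is reassuring.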

Were the treatment groups comparable at baseline?

Random allocation of treatments is used to guard against bias in
assigning patients to treatments. However, this technique does not
guarantee that the two treatment groups will be identical at the
start of the study. By the play of chance it is possible for more of the
severely ill patients to be allocated to one of the groups than the
other. Thus, the two groups need to be assessed to determine
whether they were similar at the start of the study. This evaluation
should focus on those items most relevant to prognosis and
outcome. For example, suppose an antihypertensive drug were
being tested and effectiveness were being measured by the number
of patients who had a stroke or a heart attack. Age, gender, serum
cholesterol, and cigarette smoking are major risk factors for these
diseases, so it would be important to check that the two groups
were similar for these factors.
If there are differences at baseline in important factors, this need
not negate the whole study. Careful statistical analysis can go a long
way to take account of such differences. Thus, when they exist, the
question is whether they have been allowed for in the analysis.

Were deviations from planned treatment reported?

Several events can affect the smooth running of a trial. Patients
may fail to comply with therapy. They may have their treatments
stopped, because of side-effects or concerns about a deteriorating
condition. They may be transferred to other therapies or may be
given additional treatments to those under investigation. If any of
these events were to happen more commonly in one group than in
the other, a fair comparison could not be made. Managing one
group differently to the other could produce a difference in
outcome unrelated to the treatments being studied. Good trials
report these details separately for the two treatment groups,
allowing the reader to assess the potential for bias. Studies which
do not report these details arouse suspicion.

Were results analysed by intention to treat?

In the course of the trial patients may have their treatments
changed and may even swap from one treatment to the other.
Whatever has happened, the best advice is to analyse the study by
the groups to which the patients were first allocated. The concern
is that patients who change treatments, or even withdraw from the
study, may be systematically different to those who do not change.
Excluding these patients from the analysis could introduce
differences between the two groups of patients at entry into the
study. The comparison of treatments would no longer be fair. (It
might be argued that bias could arise because some patients had
their treatment changed. This could occur, but it is considered a
lesser evil than excluding these patients.)
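
A small worked example may make the point concrete. The sketch below uses entirely invented data: ten patients allocated to each arm, with four of those allocated to the new treatment switching to the standard one and mostly doing badly. The record layout, the names and the numbers are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical records: (arm allocated, treatment actually received, recovered?)
patients = (
    [("new", "new", True)] * 5
    + [("new", "new", False)] * 1
    + [("new", "standard", True)] * 1
    + [("new", "standard", False)] * 3
    + [("standard", "standard", True)] * 5
    + [("standard", "standard", False)] * 5
)


def recovery_rates(records, keep_switchers):
    """Recovery rate per allocated arm, optionally dropping patients who switched."""
    recovered = defaultdict(int)
    total = defaultdict(int)
    for allocated, received, outcome in records:
        if not keep_switchers and allocated != received:
            continue  # excluding switchers breaks the original allocation
        total[allocated] += 1
        recovered[allocated] += int(outcome)
    return {arm: recovered[arm] / total[arm] for arm in total}


intention_to_treat = recovery_rates(patients, keep_switchers=True)
switchers_excluded = recovery_rates(patients, keep_switchers=False)
print("Intention to treat:", intention_to_treat)
print("Switchers excluded:", switchers_excluded)
```

In this invented data the new treatment looks considerably better once the switchers are dropped, simply because the patients who did poorly have been removed from one arm only. Analysing by allocated group avoids that distortion.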

Were side-effects reported?

Many treatments have unwanted side-effects: amoxycillin can
cause nausea and diarrhoea; amitriptyline causes dry mouth and
sedation; and surgical operations certainly have their hazards.
Thus, the beneficial effect of any therapy has to be balanced against
its side-effects. Often the balance is clear, when the therapeutic
effect outweighs the adverse effects. But when two similar
treatments are being compared, a difference in side-effect profile could
be more important than a treatment difference.

The complete list for the appraisal of clinical trials

The essential questions
Were treatments randomly allocated?
Were all the patients accounted for?
Were outcomes assessed blind?

The detailed questions*


Design
Are the aims clearly stated?
Was the sample size justified?
Are the measurements likely to be valid and reliable?
Could the choice of subjects influence the size of treatment effect?
Were there ambiguities in the description of the treatment and its
administration?
Are the statistical methods described?
Could lack of blinding introduce bias?
Are the outcomes clinically relevant?

Conduct
How was the randomisation carried out?
Did untoward events occur during the study?

Analysis
Were the treatment groups comparable at baseline?
Were results analysed by intention to treat?
Was the statistical significance assessed?
Were the basic data adequately described?
Do the numbers add up?
Were side-effects reported?

Interpretation
What do the main findings mean?
How are null findings interpreted?
Are important effects overlooked?
How do the results compare with previous reports?
What implications does the study have for your practice?

* The questions in italics are the standard ones which were described in Chapter 6.

10 Appraising case-control studies

This chapter presents the issues which are relevant primarily to
case-control studies. These are arranged in two parts: the brief
essential questions and the more detailed specific questions. An
explanation is given for each item. These lists are then combined,
at the end of the chapter, with the list of standard questions
described in Chapter 6. These lists provide a complete guide to the
appraisal of case-control studies.

The essential questions


Case-control studies investigate why certain people develop a
specific illness. They can also investigate why some sets of patients
behave as they do - for example, why some do not attend for
cervical screening. The studies seek out the factors which mark out
these individuals as being special, in the hope that this will identify
the causes of the disease or the explanation for their behaviour. The
method does this by comparing the characteristics of those with a
disease (or characteristic of interest) against a suitable group of
control individuals. The assumption is that differences between
cases and controls will reveal why cases become cases. The essential
questions focus on the validity of this assumption.
How were the cases obtained?
Is the control group appropriate?
Were data collected the same way for cases and controls?

The specific questions


The specific questions are also concerned with the interpretation
of the case and control comparisons, but they deal with the
methodological problems which are particularly acute for
case-control studies.
Is the design appropriate to the aims?
Where are the biases?
Could there be confounding?
Was there data-dredging?

Rationale for the essential questions


How were the cases obtained?
The characteristics of the cases should be clearly stated. This
would include the definition of a case which should be broad
enough to ensure that true cases are not missed, yet specific enough
to ensure that only true cases are included. In many instances the
case definition will be a statement of diagnostic criteria for the
disease together with any exclusion criteria (such as concomitant
disease).
The source of the cases should also be given, whether from the
general population or some specialist centre. The average severity
of the condition is likely to vary with the source of the cases.
Another concern is whether the cases represent newly diagnosed
disease, or whether they are cases with long-standing illness. Those
with long-standing disease are a selected group; those who have
been cured or have died will be lost. Selective loss could bias the
severity of the disease in two ways: those cured may have had mild
disease, whereas those who died might have had more severe
illness. Bias could also arise if some of those who were sought for
the study were not included in it. Patients who avoid inclusion in
research studies tend to differ from those included.
These details of the cases studied are central to the interpretation
of the results. Atypical cases can produce atypical findings, so that
the nature of the cases will also influence the extent to which the
findings can be generalised.

Is the control group appropriate?


Selecting appropriate controls is one of the major challenges of
case-control studies. They are generally selected from the same
source as the cases, be that the community, general practice or
specialist centre. The intention is that they should resemble the
cases, except that they do not have the disease being studied. Since
there are many forces of selection which determine where cases are
managed, it is best if both cases and controls have been subject to
the same forces. Controls are usually selected to be similar in terms
of age, sex, social class, and area of residence to the cases.
The obsession with the comparability of controls follows from
the analysis of these studies: evidence about causes of disease
comes from a comparison of cases and controls. If differing forces
of selection introduce differences between cases and controls, these
might be mistaken for risk factors for the disease.

Were data collected in the same way for cases and controls?

Having recruited cases and controls, both must be asked about
past exposure to potential risk factors. This information should be
obtained in the same way for each, whether by interview, postal
questionnaire or case-note review. But even using the same method
may not produce identical information gathering. If the data
collector knows which are cases and which controls, then
interviews or case-note searches may be influenced. Blinding to
case-control status should be used where possible.
Another source of problems is recall bias: patients who have a
serious disease tend to review their past history in detail to find an
explanation for their illness. Thus they are more likely to report
events which the controls might forget. Case-control studies are
bedevilled by biased information gathering. The details of the
data-collection techniques need to be scrutinised carefully to determine
whether data collection was identical for cases and controls.

Rationale for the specific questions


Is the design appropriate to the aims?
Case-control studies are a powerful research method, but they
have limitations. They involve collecting data retrospectively, after
cases have developed the disease (or other characteristics of
interest). They cannot be used to assess the effectiveness of a new
treatment. They are often used to investigate possible cause and
effect: whether a specific factor is associated with the risk of
developing a disease. However, they seldom provide definitive
evidence for cause and effect because of their potential for bias.
Finally, because the cases are often highly selected, they cannot be
used to make more general statements about how commonly certain
features occur: this would require a survey.

Where are the biases?

Case-control studies are notorious for their susceptibility to bias.
The brief appraisal questions introduced the idea that bias could
arise through the selection of cases and controls and through the
method of data collection. But there are many subtle variants on
these biases. One form of this is surveillance bias: individuals who
are taking regular medication will be more likely to have regular
contact with doctors. Thus, newly occurring asymptomatic or mild
diseases are more likely to be diagnosed. This could create an
apparent but spurious relationship between the medication and the
new mild disease.
Another form of bias is misclassification bias: some of the cases
may not have the disease of interest but suffer instead from a
condition which resembles it. For example, endometrial
hyperplasia could be misclassified as frank carcinoma. Then any factors
associated with hyperplasia (for example, exogenous oestrogens)
would be falsely associated with the cancer.
The willingness of patients to participate in studies and provide
information can also result in bias. Those who have a serious illness
are likely to differ on both counts from other groups in the
community. In short, the opportunities for bias in case-control
studies are legion. The challenge is to identify the source and to try
to estimate what effect it could have on the study findings.
Whenever associations between diseases and risk factors are
identified these should be reviewed by asking 'How else could this
have occurred?', 'What might be special about the cases?' and
'How could the measurements be biased?'. Case-control studies
should always be approached in the expectation that there will be
bias.

Could there be confounding?


Confounding arises when an observed association between two
variables is due to the action of a third factor. For example, excess
sugar consumption will lead to an increase in dental caries. It will
also increase the risk of developing mature-onset diabetes. There
could thus be an apparent relationship between caries and diabetes
which would arise because some individuals had a particular
fondness for sugar.
Case-control studies are often used to investigate why disease
occurs in certain individuals. Thus, whenever a case-control study
shows that a disease is related to some factor, attention should be
given to other factors that both might be related to. Could a third
factor be driving the observed relationship? It is often difficult to
guess what the third factor might be, but the subject should be
raised at some point in the discussion section. When confounding
factors are identified in a study they should be taken into account
in the analysis. There are standard statistical methods to do this. It
can be difficult to tell whether these have been used correctly, but
at least they should have been used.
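
The sugar example can be turned into a small numerical sketch. The counts below are invented so that, within each level of sugar consumption, caries and diabetes are unrelated (odds ratio of 1), yet an association appears when the two levels are lumped together. The table layout and names are assumptions made for illustration; stratifying by the suspected confounder is only one of the standard ways of allowing for it.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table:
    a = caries & diabetes, b = caries & no diabetes,
    c = no caries & diabetes, d = no caries & no diabetes."""
    return (a * d) / (b * c)


# Hypothetical counts (a, b, c, d) within each level of sugar consumption.
strata = {
    "high sugar": (24, 96, 16, 64),
    "low sugar": (2, 38, 8, 152),
}

# Ignoring sugar: add the two tables together cell by cell.
crude = tuple(sum(cells) for cells in zip(*strata.values()))

print(f"Crude odds ratio: {odds_ratio(*crude):.2f}")
for level, table in strata.items():
    print(f"Odds ratio within {level}: {odds_ratio(*table):.2f}")
```

With these made-up numbers the crude analysis suggests that caries is linked to diabetes, while the analysis within each sugar group shows no link at all. The apparent relationship is produced entirely by the third factor.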

Was there data-dredging?


Case-control studies are often exploratory, casting a wide net to
see what can be caught. These studies are asking why cases are
different from controls (i.e. what is it that is special about cases?).
They allow a number of different hypotheses to be tested at the
same time. Data are collected on as many different items as the
imagination and resources of the researchers allow. The analysis
then involves trawling through the variables, looking for anything of
interest. When this is done, multiple significance testing becomes a
major hazard. Calculated P-values can no longer be interpreted at
face value, and some instances of spurious statistical significance
should be expected. Thus, reports should be carefully scrutinised
for evidence of multiple testing, and findings interpreted more
cautiously if this has occurred.
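
How many spurious findings to expect can be estimated roughly. The sketch below assumes that none of the variables examined is truly related to the disease and that each is tested at the 5% level. It also shows the Bonferroni adjustment, which is one conventional way of tightening the threshold and is offered here only as an illustration, not as a method the text prescribes. The function names are hypothetical.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of 'significant' results when no real associations exist."""
    return n_tests * alpha


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """P-value threshold per test that keeps the overall error rate near alpha."""
    return alpha / n_tests


n_variables = 40  # hypothetical number of exposures trawled through
print(f"Expected spurious findings: {expected_false_positives(n_variables):.1f}")
print(f"Bonferroni threshold per test: {bonferroni_threshold(n_variables):.5f}")
```

A report that tested dozens of variables and highlights a handful with P-values just under 0.05 is therefore showing roughly what chance alone would deliver.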


The complete list for the appraisal of case-control studies

The essential questions
How were the cases obtained?
Is the control group appropriate?
Were data collected the same way for cases and controls?

The detailed questions*


Design
Are the aims clearly stated?
Is the method appropriate to the aims?
Was the sample size justified?
Are the measurements likely to be valid and reliable?
Are the statistical methods described?

Conduct
Did untoward events occur during the study?

Analysis
Were the basic data adequately described?
Do the numbers add up?
Was there data-dredging?
Was the statistical significance assessed?

Interpretation
What do the main findings mean?
Where are the biases?
Could there be confounding?
How are null findings interpreted?
Are important effects overlooked?
How do the results compare with previous reports?
What implications does the study have for your practice?

* The questions in italics are the standard ones which were described in Chapter 6.

11 Appraising review papers

This chapter presents the issues which are relevant primarily to
review papers and meta-analyses. These are arranged in two parts:
the brief essential questions and the more detailed specific
questions. An explanation is given for each item. These lists are
combined, at the end of the chapter, with those items from the
list of standard questions which are relevant to this type of study.
These lists provide a complete guide to the appraisal of review
papers and meta-analyses.

The essential questions


Preparing a review of published research calls for the same
diligence and rigour which characterises the best of original
research. Systematic reviews are heir to the same pitfalls and biases
as original research. Thus, it is little surprise that the appraisal of
review papers follows an identical sequence to that used for the
research methods, focusing in turn on the design, the conduct of
the study, and the analysis of the review materials.
How were the papers identified?
How was the quality of papers assessed?
How were the results summarised?

The specific questions


The specific questions explore in more detail how the study was
conducted, to see how the pitfalls common to review papers were
dealt with.
Is the topic well defined?
Were the detailed study designs reviewed?
Was missing information sought?
Was publication bias taken into account?
Was heterogeneity of effect investigated?
Are there other findings which merit attention?
Are the conclusions justified?

Rationale for the essential questions


How were the papers identified?
Research papers are the raw data of a review and need to be
gathered with care. In the past some review articles were prepared
using the personal set of papers which their author had collected
over the years. These will reflect the interests of the author and are
likely to be incomplete. The concern is that they may be a biased
sample of all papers. Computerised searches are so easy to conduct
that they are now essential for any good review. The details of the
search should be described: which databases were searched and
what key terms were used in the search. But even computerised
searches will not yield all relevant reports. Because of the way
papers are indexed in the databases, some papers will not be found
even when all the sensible key terms have been used in the search.
Thus, to be complete, manual searches of selected journals need to
be undertaken. Plainly, the workload of these searches will be so
large that they will be possible only if special funding is provided.
The absence of these extended systematic searches should temper
the conclusions which can be drawn from a review.
One consequence of exhaustive searching is that a large number
of possibly relevant studies will be identified. Clearly defined
criteria will be needed to decide which to include in the review,
because they are particularly pertinent to the topic, and which to
exclude. These criteria would need to be inspected carefully to
determine whether they might bias the papers to be included.

How was the quality of papers assessed?

Not all research studies are well designed and conducted.
Including studies of poor quality on the same footing as those of
high quality is plainly undesirable. Thus, the quality of the papers
identified needs to be assessed. This can be achieved with formal
check lists, such as those presented in this book. Assessment of
quality, in the absence of a list of specific points, is less
satisfactory.
The strength of evidence from a paper also depends on the
research method which was used. Controlled clinical trials are
reckoned to provide the most powerful evidence, followed in
decreasing order of strength by cohort studies, case-control
studies, surveys, and case-series. The assessment of quality and
strength of evidence is an essential prelude to summarising the
results from the papers identified.

How were the results summarised?


The results of the individual studies can be presented in a table
or figure to allow the reader to judge whether, on balance, they give
a consistent answer. This display will also indicate the amount of
variation between individual studies. Individual judgement can be
used to draw conclusions from the data. But far better is to use the
statistical methods of meta-analysis. The major advantage of
systematic reviews is that, by combining studies, they have
substantially increased power to detect significant results. This
power can be harnessed only by using the appropriate statistical
techniques.
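
The gain in power from combining studies can be illustrated with a minimal sketch. It assumes inverse-variance (fixed-effect) weighting, which is one common way of pooling results and is not necessarily the method used in any given review. The three studies, their effect estimates and their standard errors are invented, and the function name is hypothetical.

```python
import math


def pooled_estimate(studies):
    """Inverse-variance weighted pooled effect and its standard error.
    `studies` is a list of (effect, standard_error) pairs."""
    weights = [1.0 / se ** 2 for _, se in studies]
    pooled = sum(w * eff for w, (eff, _) in zip(weights, studies)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


# Hypothetical treatment effects (e.g. log odds ratios) with standard errors.
studies = [(-0.30, 0.20), (-0.10, 0.25), (-0.25, 0.15)]

for effect, se in studies:
    print(f"Single study: effect {effect:+.2f}, z = {effect / se:+.2f}")

pooled, pooled_se = pooled_estimate(studies)
print(f"Pooled: effect {pooled:+.3f}, z = {pooled / pooled_se:+.2f}")
```

In this invented example no single study reaches the conventional threshold (an absolute z value of 1.96), but the pooled estimate does, because its standard error is smaller than that of any individual study.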
One unanswered question is what to do with poor-quality
studies. Excluding them entirely could be wasteful of information.
One solution is to use a weighting system where the poor studies
are given such a low weight that they have only a small effect on the
conclusions (the details of these weighting systems are described in
standard texts on meta-analysis). An alternative approach is to
conduct what are called sensitivity analyses. This involves first
summarising the results with all possible studies included. Then,
those of the lowest quality would be excluded and the analysis
repeated. This process would be repeated, progressively higher
quality thresholds being set for papers to be included. This would
indicate how sensitive the conclusions are to the inclusion of
poor-quality papers. If broadly the same result is obtained across a range
of quality thresholds then the findings can be accepted.
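
The sensitivity analysis described above can be sketched as a simple loop. The quality scores, the thresholds and the study results below are all invented, and inverse-variance pooling is assumed purely for illustration; a real review would use whatever scoring system and pooling method its authors had justified.

```python
def pooled_effect(studies):
    """Inverse-variance weighted mean of (effect, standard_error, quality) records."""
    weights = [1.0 / se ** 2 for _, se, _ in studies]
    return sum(w * eff for w, (eff, _, _) in zip(weights, studies)) / sum(weights)


# Hypothetical studies: (effect, standard error, quality score out of 10).
studies = [
    (-0.30, 0.20, 9),
    (-0.10, 0.25, 7),
    (-0.25, 0.15, 8),
    (-0.60, 0.30, 3),
    (-0.45, 0.35, 4),
]

# Repeat the summary with progressively higher quality thresholds.
results = {}
for threshold in (0, 5, 8):
    kept = [s for s in studies if s[2] >= threshold]
    results[threshold] = pooled_effect(kept)
    print(f"Quality >= {threshold}: {len(kept)} studies, "
          f"pooled effect {results[threshold]:+.3f}")
```

If the pooled effect barely moves as the threshold rises, the conclusions do not hinge on the weaker studies. In this made-up data the poor-quality studies report the largest effects, so the pooled estimate is somewhat larger when they are included.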

Rationale for the specific questions


Is the topic well defined?
The focus of the review should be clearly stated. The review is
more likely to identify all relevant articles when limited to a narrow
area of medicine. Wide-ranging reviews, for example covering the
diagnosis, treatment, long-term prognosis and epidemiology, may
well miss important papers. Such broad overviews have their place
in research, but they tend to present the personal view of the
author. They should be distinguished from the exhaustive process
of producing a systematic review.

Were the detailed study designs reviewed?


Individual studies often vary in the details of their design - for
example, in the types of patient they have studied (diagnostic
criteria, age range etc.) or in the ways in which treatment was given
(dosage level, timing of treatments etc.). These differences should
be identified and compared with the size of effects which were
reported. Broadly similar findings across a range of clinical
conditions do not just strengthen the conclusions, they indicate
that the findings can be more widely generalised.

Was missing information sought?


Sometimes information on some key details about a study is not
contained in the published report. This could make it difficult to
assess the quality of a study or, in some instances, to interpret the
results. Painstaking reviewers will write to the authors requesting
these details.

Was publication bias taken into account?


Papers which report positive findings have a higher chance of
publication than those which conclude that a new treatment was
ineffective or a hoped-for effect was not found. This may reflect the
enthusiasm with which researchers seek publication, or the
reluctance of journals to accept negative reports. The concern
about unpublished studies is that their findings might be
systematically different from those that were published. If included in the
review, these studies might weight the findings sufficiently to lead
to quite different conclusions.
Finding unpublished studies is difficult. It could involve writing
to researchers known to be active in the field, asking whether they
have conducted, or possibly know of, unpublished studies. Details
of any studies uncovered would then have to be sought. When some
effort has been made to find them, more weight can be given to the
conclusions from the review. When unpublished papers have not
been sought, the main findings may be biased - they most likely
overestimate the sizes of any effects. The degree of overestimation
can only be guessed at, so the only safe advice is to interpret with
caution.

Was heterogeneity of effect investigated?


Some variation in results of individual studies would be expected
just by the play of chance. However, heterogeneity of effect can
occur because of differences in design. Formal statistical methods
should be used to determine whether the amount of variation is
greater than would be expected by chance. If this has occurred it
should be investigated further to see which of the design features
might explain it. For example, if there were large differences in the
ages of the patients investigated, then individual studies could be
subdivided by age to see whether the effect is consistent within age
groups.
Even if there is no statistical evidence of heterogeneity, the
studies should still be inspected carefully to see whether studies
which share design features have broadly similar results.
(Unfortunately the statistical tests are rather poor at detecting
heterogeneity.) It might be, for example, that large effects are seen among
the young with more modest effects among the elderly. Whenever
there is evidence of heterogeneity, the process of summarising all
the studies with a single measure becomes doubtful.
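
One widely used formal check is Cochran's Q statistic, often reported alongside the I-squared index. The text does not name a particular test, so the choice here is an assumption, as are the function name and the four invented studies (two in younger patients with large effects, two in the elderly with modest ones).

```python
def heterogeneity(studies):
    """Cochran's Q, its degrees of freedom and I-squared for
    a list of (effect, standard_error) pairs."""
    weights = [1.0 / se ** 2 for _, se in studies]
    pooled = sum(w * eff for w, (eff, _) in zip(weights, studies)) / sum(weights)
    q = sum(w * (eff - pooled) ** 2 for w, (eff, _) in zip(weights, studies))
    df = len(studies) - 1
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i_squared


# Hypothetical effects: two studies in the young, two in the elderly.
studies = [(-0.80, 0.20), (-0.70, 0.25), (-0.10, 0.20), (-0.05, 0.25)]

q, df, i_squared = heterogeneity(studies)
print(f"Q = {q:.2f} on {df} degrees of freedom, I-squared = {i_squared:.0%}")
```

Q is compared with the chi-squared distribution on the stated degrees of freedom; for 3 degrees of freedom the 5% critical value is 7.815, so this invented set of studies shows more variation than chance would readily explain. The next step would be to look at the age groups separately rather than report one pooled figure.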

Are there other findings which merit attention?

Systematic reviews are a major undertaking, so that their results
should be inspected with great care. They should contain all the
valid studies on a particular topic. They will clarify not only what is
known but also what is not yet known: they should cast in sharp
relief the gaps in current knowledge. This will help to identify the
major research questions in the field which are yet to be answered.

Are the conclusions justified?


The interpretation of systematic reviews is as prone to errors as
is the interpretation of any data. Thus, the reader should ask: Do
they reflect the weight of evidence? Was due allowance made for the
strengths of the research methods? Were defects in the studies taken
into account? Review articles may appear unchallengeable,
particularly those which have involved extensive searches and have
combined findings using meta-analysis techniques. This semblance
of infallibility should be rejected; review articles are prepared by
people, and people make mistakes. As for any other research
method, readers should make up their own minds about the
conclusions.


The complete list for the appraisal of review papers

The essential questions
How were the papers identified?
How was the quality of papers assessed?
How were the results summarised?

The detailed questions*


Design
Is the topic well defined?
Are the statistical methods described?

Conduct
Were the detailed study designs reviewed?
Was missing information sought?

Analysis
Were the basic data adequately described?
Was publication bias taken into account?
Was heterogeneity of effect investigated?

Interpretation
What do the main findings mean?
Are there other findings which merit attention?
Are the conclusions justified?
How do the findings compare with previous reports?
What implications does the study have for your practice?

* The questions in italics are the standard ones which were described in Chapter 6.

Index

ageing, o f cohort study


participants 40
aims, o f research study 4, 2 3 -4
addressed in results? 5
analysis of results, pitfalls 16 -19
appraisal, see questions
best case 2 1 - 2
bias 1 8 -1 9
in case-control studies 5 1 ,5 2 ,
53
in clinicai trials 45, 4 6 -7
misclassification bias 53
missing data 25, 26
non-response to surveys 32
publication bias in review papers
5 9 -6 0
recall bias 52
in review papers 57
selection bias in surveys 33
surveillance bias 53
black-box analysis 18
blinding
to case-control details 52
to treatment details 4 5 , 4 6 -7
case-comparator 11
casecomparison 11
case-control studies
appraisal o f 5 0 -5
bias susceptibility 5 1 , 52, 53
case characteristics and sources
51
complications 11
confounding 5 3 -4
control groups 11 , 5 1 - 2
data collection 52
data-dredging 54
design and aims of 52

features 10-11, 50
generalisation of results 51
limitations 52
questions about 23-9, 50-5
terminology 11
case-referent 11
causality 39
chance 12-13
P-values and 14-15
see also probability
chance observations 17, 33-4
check lists 20-1
see also questions
clinical trials
accounting for patients 44-5
appraisal of 43-9
bias 45, 46-7
choice of subjects 45
complications 10
deviations from planned
treatment 48
features 9-10, 43
intention to treat 48
outcome assessments 45
placebos 9
powerful evidence 58
questions about 23-9, 43-9
randomisation of patients 44,
45-6
relevance of outcomes 47
side-effects 10, 48
terminology 10
treatment allocation 44
treatment description and
administration 46
of treatment effectiveness 10
treatment group comparability
47-8
cluster sampling 8



cohort 9
cohort studies
ageing of participants 40
appraisal of 36-42
complications 9
control and comparison groups
9, 37
design and aims of 39
exposure/intervention
measurement 39-40
features 8-9, 36
follow-up 37-8
generalisation of results 37
influences on outcome 40-1
outcome measures 40
questions about 23-9, 36-42
sampling 37
terminology 9
and treatment assessment 10
comparison groups
in cohort studies 9
in surveys 8
computer analysis 18
computer searches, of research
papers 57
confidence intervals 15, 27
and null findings 28
confounding 19
in case-control studies 53-4
control groups
in case-control studies 11,
51-2
in cohort studies 9, 37
in surveys 8
cross-sectional 8
data
description 26
missing 25, 26
numerical discrepancies 26
see also sampling; subjects
data-dredging 54
design changes 25
double blind 10
effect
heterogeneity of 60
size of 24, 27, 29
effectiveness 10
efficacy 10
evaluation 10

exposure, measurement of 39-40
findings
effects overlooked? 28
implications for practice 29
interpretation of 27-8
null 28
see also generalisation; results
flaws
analysis pitfalls 16-19
evaluation of 21-2
finding and assessing 5
in sampling 31
in statistical procedures 33-4
use of check lists 20-1
follow-up 9
follow-up, in cohort studies 37-8
generalisation, of results 34
case-control studies 51
cohort studies 37
surveys 30, 34
grab samples 31
heterogeneity of effect 60
hypotheses 4
and chance observations 17
null 14
independence 17
information needs 2
intervention, measurement of
39-40
likely case 20-1
literature search 1-2
measuring techniques 24-5
misclassification bias 53
multicentre studies 24
non-independence 17
non-response, see response rate
null findings 28
null hypotheses 14
observers, multiple 24

outcome 9, 10
assessing value of 10
improving 10
outcomes, in cohort studies 40,
40-1
follow-up data 38


outliers 16
pilot studies 25, 33
placebo-controlled 10
placebos 9
practice, implications of study for
29
probability 13-14, 14-15
see also chance
prospective 9
purpose of research study, see aims
P-values 13-14, 14-15
quasi-randomised treatments 44
questionnaires 4
questions
for all research papers 23-9
about case-control studies
23-9, 50-5
check lists 20-1
about clinical trials 23-9, 43-9
about cohort studies 23-9,
36-42
when reading research papers
3-6
on review papers 56-61
about surveys 23-9, 30-5
random allocation 10
randomisation, in clinical trials
44, 45-6
randomisation codes 45-6
random sample 8
random selection 10
reading
appraisal of 2
clarifying reasons for 1
identifying reports 2
selectivity in 2
specifying information 2
steps in 1
recall bias 52
reliability 25
research papers
abstract 3, 5
aims 4
computer searches 57
discussion 5, 6
implications 5-6
introduction 3-4, 6
methodology 4, 7
presentation of data 4
references 6
sensitivity analyses 58
structure of 3
title 3
unpublished 59-60
weighting systems 58
see also case-control studies;
clinical trials; cohort studies;
flaws; questions; reading;
results; review papers; surveys
response rate 32
follow-up of non-responders 33
reasons for non-response 32
results
analysis pitfalls 16-19
comparison with previous
reports 28-9
fulfilling study aims? 5
interpretation of 5, 12
presentation of 4-5
statistical significance 12-15
see also findings; generalisation
retrospective 9, 11
review papers
appraisal of 56-61
bias in 57
conclusions 60-1
focus of 58-9
heterogeneity of effect 60
identification of research papers
57
missing information 59
other findings? 60
publication bias 59-60
quality of research papers 57
questions about 56-61
review of study designs 59
summary of results 58
unpublished studies 59-60
sample 8
sample size 12-13, 24
sampling
cluster 8
in cohort studies 37
flawed 31
grab samples 31
selection bias 33
stratified


in surveys 7, 8, 31
systematic 8
selection, see sampling
selection bias 33
sensitivity analyses 58
serendipity 17, 33-4
side-effects 10, 48
significance tests 14-15
multiple 54
on survey data 33-4
see also statistical tests
skew 16-17
statistical significance 12-15, 26-7
statistical tests
assumptions in 17, 25
description of methods 25
logic of 14-15
number and complexity of 25,
26
pitfalls in analysis 16-19
see also chance; probability;
significance tests
stratified sampling 8
studies, see research papers
subgroups, analysis by 17
subjects
missing 26
variability of 26
surveillance bias 53
survey 8
surveys
appraisal of 30-5
complications 8


control and comparison groups
8
design and objectives 32
features 7-8, 30
generalisation of results 30, 34
haste 33
questions about 23-9, 30-5
response rate 32
sampling 7, 8, 31
selection bias 33
terminology 8
unexpected results 33-4
systematic sampling 8
time, direction of
backward 11
forward 9
treatments
assessment of 9-10
description and administration
of 46
deviations 48
effect of, and choice of subjects
45
efficacy 39
quasi-randomised 44
randomisation 44
side-effects 10, 48
validity 24
weighting 58
worst case 21-2
