A Statistical Perspective
John Maindonald,
Centre for Bioinformation Science
John Curtin School of Medical Research and School of Mathematical Sciences
Australian National University.
john.maindonald@anu.edu.au
Summary of Contents
If we teach only the findings and products of science no matter how useful and even inspiring
they may be without communicating its critical method, how can the average person possibly
distinguish science from pseudoscience? . . . Many, perhaps most, textbooks for budding
scientists tread lightly here. It is enormously easier to present in an appealing way the wisdom
distilled from centuries of patient and collective interrogation of nature than to detail the
messy distillation apparatus. The method of science, as stodgy and grumpy as it may seem, is
far more important than the findings of science.
[Sagan 1997, The Demon-Haunted World, p. 26, Headline Book Publishing, London.]
Contents
Summary of Contents
   Part I
   Part II (Available Separately)
Introduction
9. Critical Review
   9.1 A Springboard for New Research
   9.2 The Power of Multiple Sets of Data
   9.3 Data-Based Overview
   9.4 The Historical Sciences
   9.5 Social Research
References and Further Reading
Contents - Part II
(This is available as a separate document.)
Summary of Contents
Part I
Research as Learning (Introduction & Ch. 1)
Openness to new ideas versus Scepticism
Questionnaire design
Qualitative research
Computer modelling
Science as a human activity.
Appendices
Questions for Researchers
Exploratory data analysis should precede and inform more formal analysis
Instructive examples
Research data should, except with good reason, be in the public domain
The overall content of journals requires regular review, from the perspective of all
major skill areas that enter into the research.
Additional Material
Material that supplements these notes may be found on the web page:
http://wwwmaths.anu.edu.au/~johnm/planning
As of December 2000, the main addition is a set of notes on the design of experiments.
Introduction
In this case I believe much more could be done than is, in fact, done to prepare for the future
scientific career. For the logical principles of experimental design and of reasoning from
experimental results are of great interest to post-graduate students, who would appreciate
definite courses in this subject. In fact however, and at present, the majority of scientific
workers enter their careers without this preparation, and learn as they go, by their own
mistakes and those of their colleagues.
[Fisher, R.A. in Bennett, J.H. (ed.) 1989, pp. 343-346. See chapter 9 references.]
These notes address, at a preliminary level, broad planning principles that apply to many
different areas of research. Anyone who has a research degree should be aware of them,
whether or not they arise in their own research. They also give pointers that may help in
getting a clear view of where the researcher's project is headed. I will have been successful in
my endeavour if I kindle in at least some readers an interest both in the research process itself
and in the examples.
There are several reasons why researchers should take an interest in broad-ranging issues in
research planning:
1. The immediate research project may take twists and turns that are different from those
for which earlier study has been a preparation. This is especially likely for highly
applied projects, which typically demand a range of diverse skills.
2. Those who acquire a wide range of research skills are thereby better placed, after
graduation, to turn their hand to tasks different from those for which their immediate
research training has equipped them.
3. Broad-based research skills will best equip researchers to respond to changing demands,
as they move from task to task and from job to job in the course of their careers.
4. Many of the skills are highly relevant to the planning of any substantial undertaking.
Designing the instrument panel on a large aeroplane may appear like an engineering problem. It
has, also, a large human engineering component. A layout that has the potential to confuse
pilots may, in an emergency, be fatal.¹
This is not a text on statistical methodology, even though there is extensive discussion of
statistical issues. It discusses, with numerous examples, issues that should influence the design
of data collection, the eventual analysis of the resulting data, and the reporting. The emphasis is
on the way that statistical issues impact on the quality of the science.
There is a strong focus on the critical and questioning role of scientific ways of thinking. It does
not much matter where you start practising scientific thinking. What is important is that you
start. As Sagan (1997) notes²:
Because its explanatory power is so great, once you get the hang of scientific reasoning you
are bound to start applying it everywhere.
¹ Thus if a warning indicator does not indicate clearly which engine has experienced problems, the
pilot may shut down the wrong engine. An emergency may become a disaster.
² In The Demon-Haunted World, Headline Book Publishing, London, p. 279.
Criticism and questioning are in tension with the openness to imaginative insight that is equally
important to the research process. Data may be in tension with the theoretical insights that
generated their collection.
The issue of evidence is central. There must be an assessment of the evidence in the literature
that is the starting point for the research project. There must be a research strategy that will
bring together data that address the research question. Statistical analysis will extract from the
data evidence that relates to the research question. Finally, the new research evidence must be
integrated into the body of earlier knowledge, creating a coherent account that will appear as a
report or paper or thesis.
My examples range widely, from social science through to pure and applied biology and physical
science, with medical and health examples strongly represented. Most people are interested in
their own health. I am hopeful that such examples will be of wide interest to non-medical as
well as medical researchers. I have tried to find examples that are not unduly technical. I have
found it helpful, at various points, to draw on ideas from the approach to clinical medicine that
has the name Evidence-based Medicine (EBM). For those who want to understand the
practicalities of Evidence-Based Medicine, I recommend the book Smart Health Choices,
subtitled How to make informed health decisions, by Judy Irwig and collaborators. These ideas
may assist researchers both with their health needs and with their research planning!
The first drafts of this monograph were written for a course that introduced a series of short
courses on statistical design and analysis. Any statistical analysis must have a context. Data
collection and data analysis serve the wider aims of the research project. This requires a clear
view of the project's aims. There are principles that should guide the design of data collection
whenever this lies in the researcher's control. Where the researcher does not have this control,
it is important to examine the processes that generated the data. Focusing attention back onto
the contexts from which data have come is important both for use of the data that the
researcher may already have, and for thinking about any future data collection. Data do not just
happen!
I will be glad to receive comments or corrections, or examples that illustrate points that I have
made. I am in debt to researchers from many different areas who over the years have brought
me questions and data.
Dr Harold Henderson, from AgResearch (New Zealand), has given me extensive help in
removing errors and obscurities from these notes, and in drawing interesting examples to my
attention. Professor Susan Wilson, from the ANU Centre for Mathematics and its Applications,
has made a number of useful suggestions. Dr Gail Craswall, from the ANU Study Skills and
Learning Centre, has helped with proofreading. In no way are these individuals responsible for
what I have made of their help!
John Maindonald
22 September 2000
Note: This document is in two parts. Part I discusses general research planning issues. Part II
discusses statistical analysis and wider planning issues that are likely to be important to
researchers, though without getting into the details of statistical analysis methodology.
1. The Research Enterprise
[Sagan 1997, The Demon-Haunted World, p. 287. Headline Book Publishing, London.]
There is an inherent tension between openness to new ideas, and the ruthless
criticism to which the scientific research process insists (or should insist) on
exposing every new idea. As well as research principles and methodologies specific
to particular disciplines, there are general principles and methodologies. These
notes will focus on these general principles and methodologies, and particularly on
statistical methodologies, though avoiding any attempt at rigid prescription of
acceptable scientific procedure. In order to discuss research planning, we will
establish a framework that is broad enough for most research projects. The plan
should include examination of existing knowledge, a decision on a research question
or questions, a plan to follow in seeking answers, an analysis of the research data,
and an eventual report.
scientifically fruitless iridology, palmistry, crystal balls, the star signs, sympathetic magic,
augury, UFOs, and so on. Ideas from these sources have been singularly unhelpful to the
progress of science. When ideas appear, there must be mechanisms for deciding which are
worth pursuing. Time and energy will not be well spent on the investigation of every crackpot
idea. But how does one know which ideas really are totally crackpot, and which are worth
pursuing? There can be no sure criteria. Typically the researcher will stay away from lines of
research that have proved unfruitful in the past. There is a risk that in rejecting such sources
out of hand, an important insight will sometime be missed. It is a risk that most researchers
think justified by their assessment of the trade-offs.
Repeatability
In many (but not all) areas of knowledge, it is appropriate to ask whether results can be, and
have been, repeated by different workers in different places. An effective way to silence would-be
critics is to demonstrate that the results can be repeated. Results that have been obtained in one
time and place, and that others elsewhere are unable to reproduce, cannot contribute to science.
To become part of the body of useful scientific knowledge, results must be repeatable. Thus
Fisher (1935, p. 7) argued that
. . . no isolated experiment, however significant in itself, can suffice for the
experimental demonstration of a natural phenomenon. . . . In relation to the test of
significance, we may say that a phenomenon is experimentally demonstrable when we
know how to conduct an experiment which will rarely fail to give us a statistically
significant result.
Tukey (1991) notes that:
Long ago Fisher . . . recognised that . . . solid knowledge came from a demonstrated
ability to repeat experiments . . . . This is unhappy for the investigator who would like
to settle things once and for all, but consistent with the best accounts we have of the
scientific method . . . .
Scherr (1983) uses more colourful language to make a similar point:
The glorious endeavour that we know today as science has grown out of the murk of
sorcery, religious ritual, and cooking. But while witches, priests and chefs were
developing taller and taller hats, scientists worked out a method for determining the
validity of their results: they learned to ask "Are they reproducible?"
The demand for repeatability applies with different force and in different ways in different areas
of science.
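Fisher's criterion of rarely failing to give a statistically significant result is, in modern terms, a statement about statistical power. The following sketch (in Python; the effect sizes, noise level and sample size are invented purely for illustration) estimates by simulation how often a repeated two-group experiment would reach significance:

```python
import random
import statistics

def t_statistic(x, y):
    """Two-sample t statistic, using a pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / (
        pooled_var * (1 / nx + 1 / ny)) ** 0.5

def demonstrability(effect, sd, n, trials=2000, crit=2.0):
    """Estimate, by simulation, the fraction of repeated experiments
    that give |t| > crit: Fisher's 'rarely fail' criterion."""
    rng = random.Random(1)  # fixed seed, so the estimate is reproducible
    hits = 0
    for _ in range(trials):
        treated = [rng.gauss(effect, sd) for _ in range(n)]
        control = [rng.gauss(0.0, sd) for _ in range(n)]
        if abs(t_statistic(treated, control)) > crit:
            hits += 1
    return hits / trials

# A strong effect: nearly every repetition gives a significant result.
print(demonstrability(effect=1.0, sd=1.0, n=30))
# A weak effect, same experiment: most repetitions fail.
print(demonstrability(effect=0.2, sd=1.0, n=30))
```

With a strong effect almost every repetition succeeds; with a weak effect the same experiment fails far more often than it succeeds, and the phenomenon is not, in Fisher's sense, experimentally demonstrable.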
Where it is not possible to demonstrate a claim experimentally, what recourses are available?
There are other ways of gathering and using evidence, which however rarely give the secure
knowledge that comes from a properly conducted experiment. The two examples in the next
section will illustrate some of the issues.
[Bar chart: four pairs of bars comparing Englishmen in the general population with English
soldiers, for age groups 20-25, 25-30, 30-35 and 35-40; horizontal scale 0 to 20 deaths per
1000 per annum.]
Fig. 1: Florence Nightingale's data showing deaths per 1000 per annum,
for the general population and for soldiers living in barracks.
The clear message of Fig. 1 is that, at the time of the Crimean War, it was much more
dangerous to be a soldier living in barracks in England than to be a male in the general
population. Note that the pattern is the same for all four age groups. There were other
important sources of evidence. Evidence about poor sanitation and hygiene at army barracks
supported what the data seemed to say.
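The rates plotted in Fig. 1 are simple standardisations: deaths divided by the population at risk, scaled to a rate per 1000 per annum. A minimal sketch of the calculation, using invented counts (these are not Nightingale's actual tallies):

```python
def deaths_per_1000(deaths, population):
    """Annual deaths per 1000 of population, as plotted in Fig. 1."""
    return 1000 * deaths / population

# Invented illustrative counts -- not Nightingale's actual tallies:
groups = {
    "Englishmen, age 25-30": (4600, 500_000),
    "English soldiers, age 25-30": (900, 50_000),
}
for name, (deaths, population) in groups.items():
    rate = deaths_per_1000(deaths, population)
    print(f"{name}: {rate:.1f} deaths per 1000 per annum")
```

Standardising in this way is what makes the general population and the much smaller barracks population comparable at all.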
How much effort went into the collection of these data? Was it straightforward, just a matter of
tallying up readily accessible official records? Or was it necessary to organise clerks to go out
and collect it? What was Florence Nightingale's purpose in collecting it?
Often houses in the same street would get their water, some from one company and some from
the other. So the source of the difference did seem to be the different sources of water. Snow
noted that in 1853 the Lambeth Company had moved its supply upstream to Thames Ditton,
where the water was relatively uncontaminated. Snow wrote:
It is extremely worthy of remark that whilst only 563 deaths occurred in the whole
metropolis in the four weeks ending August 5th (1853), more than one half of them
took place amongst the customers of the Southwark and Vauxhall company and a great
proportion of the remaining deaths were those of mariners and persons employed in the
shipping on the Thames, who almost invariably drew their drinking water from the
river.
Shoe Leather
Florence Nightingale and John Snow did much more than present data. Florence Nightingale's
argument was of the kind: "Isn't this what you would expect from the conditions that prevail in
British army barracks?" For Snow the evidence from the 1854 epidemic clinched what he had
begun to suspect on other grounds. Great cholera epidemics occurred in the British Isles
between 1831 and 1866. There were competing theories as to the cause, with many blaming the
air. Snow noted that cholera affected the intestines rather than the lungs, making it unlikely that
it was spread as a poison in the air. He noted that when a ship went from a cholera-free
country to a cholera-stricken port, the sailors would get the disease only after they had landed or
taken on supplies. Exposure to the air was not enough. Snow engaged in scientific detective
work. In one of the earliest epidemics he found the seaman who had been the first case, and
noted that he had newly arrived from Hamburg, where the disease was active. Snow's book is a
classic for the way he builds his case from the variety of evidence.
In a paper titled "Statistical Models and Shoe Leather", Freedman (1991) describes how Snow
tramped around London gathering his information. Not just statistical analysis, but shoe leather,
was crucial to the case that Snow finally made. It is always thus. The context from which the
data come is crucial to their use and interpretation.
Clear research questions keep the research focused, and are a safeguard against diversion of
undue energy into bypaths. One may have specific hypotheses, e.g. that two treatments for
blood pressure are indistinguishable in their effect. Or one may wish to estimate the effect of a
particular treatment. How does living at high altitudes affect the lung capacities of ten-year-old
children?
Good research planning and execution has multiple components. It should bring together
relevant insights and skills from all contributing disciplines. This is a particular challenge for
highly applied research, where there may be diverse multi-disciplinary demands.
Examples
Here are examples where there is disagreement:
In discussing the teaching of reading, examination of data from studies that compare different
methods is important, but not the only thing we ought to look at. We would like to know, not
just that some methods work and others do not, but why they work. McGuinness's study has
the virtue that she presents both a rationale that explains why her methods work, and data from
studies that seem to show that her methods do indeed work better than other methods. We have
a theory, supported at many of the crucial points by experimental data, that lends support to her
claims. We do not always have a conjunction of scientific insight and statistical evidence that
gives such coherence to the argument.
The Lehman, Wortman and Williams study of the effects of sudden and unexpected loss
differed from many previous studies because of its use of a control group. It may therefore
seem unsurprising that it reached different conclusions. How can one assess the effects of
traumatic loss, unless there is an adequate standard for comparison?
The Deere, Murphy and Welch study of the employment consequences of minimum wage
legislation does not directly contradict the Card and Krueger study. Card and Krueger
examined employment in one industry only. The strength of their study is that they tried, by
their choice of a control, to isolate all effects except that due to the change in minimum wage.
They use a single instance to challenge a broad general theory. Deere et al. rely instead on
regression adjustments. Their choice of explanatory variables is then open to question.
Changing the explanatory variables, or using a transformed scale, may lead to quite different
conclusions.
The discipline of statistics provides the framework of balance sheets and income
statements for scientific knowledge. Statistics is an accounting discipline, but instead of
accounting for money, it is accounting for scientific credibility.
Both the statistical mainstream and many of these separate streams have placed an exaggerated
emphasis on tests of hypotheses. Outside of the mainstream there has been a neglect of pattern,
an all too common insistence on styles of analysis that are not insightful, a failure to take on
board modern statistical analysis approaches, and the policy of some editors of publishing only
those studies that show a significant effect. Thus Nelder (1999) argues that
the practice of statistics has become encumbered with non-scientific procedures
which perceptive scientists and experimenters are increasingly finding to be irrelevant to
the making of scientific inferences. The kernel of these non-scientific procedures is
the obsession with significance tests as the endpoint of any analysis.
Why do these procedures continue in use, if they are in fact of such little help in making
scientific inferences? Nelder has two targets of blame: (1) editors who will not accept papers
unless they follow these procedures, and (2) his perception that many scientists pass through
their training without getting any real insight into the methods of science. Nelder is arguing that
statistical science is a key component of scientific method.
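One reason the obsession matters is easily demonstrated. If editors publish only estimates that reach significance, the published record systematically exaggerates the true effect. A simulation sketch (the true effect, the standard error and the significance rule are all invented for illustration):

```python
import random
import statistics

rng = random.Random(42)

true_effect = 0.2   # assumed (small) real effect -- invented for illustration
se = 0.15           # standard error of each individual study's estimate

# 10,000 independent studies of the same phenomenon:
estimates = [rng.gauss(true_effect, se) for _ in range(10_000)]

# An editor who accepts only 'significant' results (|estimate| > 2 x se):
published = [e for e in estimates if abs(e) / se > 2]

print(statistics.mean(estimates))   # close to the true effect
print(statistics.mean(published))   # noticeably exaggerated
```

The full set of estimates averages close to the truth; the "published" subset does not.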
Statistical Software
The interplay between computing power and theoretical development has made a huge impact
on statistical methodology, both for design of data collection and for analysis. These
developments have taken advantage of the increased power of computers and of the programs
that drive them. We can do a much better job on many analyses than was possible ten years
ago. We have become much more aware of the benefits and traps of different analysis
approaches. Both the teaching and the practice of statistics need to change to reflect these
advances. Why continue to use makeshift methods that were necessary when statistical
computing software was at a very early stage of development?
Influences from new research developments are obvious in the best of the statistical packages
that have been designed or adapted for use in teaching statistics. Examples are Data Desk³ and
the more recent JMP (from SAS⁴). Both have a fresh and modern style, have great graphics,
and link data analysis closely with graphics. The large packages that go back to the mainframe
era of computers have often been slower to adapt.
SPSS⁵ has been popular for the processing of data from large surveys. It has been slow to
incorporate the modern abilities that one finds in S-PLUS⁶, which I discuss below. Minitab⁷,
which at one time seemed the package of choice for use in teaching, now has a number of
competitors in this market. Each package has its own areas of strength and weakness.
I have used S-PLUS, a system that is popular with professional statistical users, for the graphs
that appear in this monograph. It has been a common test-bed for the implementation of new
statistical methodology. It is strong on graphics, with a tight linkage between graphics and
analysis. If an analysis is not already available, it is often straightforward to write a few lines of
S-PLUS code that will do what is wanted. S-PLUS is built around an implementation of the S
statistical language.
R⁸ implements a dialect of the same S language that is used in S-PLUS. An attraction of R is
that it is free. Development of R is a substantial international co-operative effort. R has
spawned a variety of associated projects. It is setting new directions for statistical software
development, and will be highly important for the future of statistical computing.
The ANU Statistical Consulting Unit has had a tradition of using GenStat⁹. GenStat handles
hierarchical analysis of variance in a highly elegant manner. Its windows interface is superior to
that in S-PLUS, especially for novices. Also it does better than S-PLUS at providing, by
default, diagnostic output that users should examine as a matter of course.
Particularly for medical applications, Stata¹⁰ is attractive. It has a high quality of technical
documentation. Its web page is unusually helpful and careful in the documentation of known
bugs and in the provision of fixes.
³ http://www.datadesk.com
⁴ http://www.sas.com
⁵ http://www.spss.com
⁷ http://www.minitab.com
⁸ To find out more about R, or to copy down the code (for the PC under Windows, for Unix or for
Linux), go to the web site http://mirror.aarnet.edu.au/CRAN . My document that describes the use
of R for data analysis and graphics is available from
http://wwwmaths.anu.edu.au/~johnm/r/usingR.pdf .
⁹ http://www.nag.co.uk/stats/tt/5thedition/new_5th.html
¹⁰ http://www.stata.com
All of these packages have the potential to be generally good vehicles for initial analysis. None
of them can be a substitute for expert knowledge or assistance. For anything that is non-trivial,
decoding and understanding the output is usually, also, a non-trivial task.
1.6 Practicalities
Many issues that are important for researchers lie outside the scope of this monograph. These
include: 1) funding; 2) the use of libraries and other information resources; 3) computing
system requirements; 4) sources of help; 5) oral presentation of results; 6) intellectual property;
and 7) job search. There are brief comments on all of these, and useful references, in
Greenfield (1997).
¹¹ Provided the column alignment (click on Format, then on Cells) is set to General, such illegal
entries will appear left adjusted, whereas numbers are right adjusted. This allows a visual check.
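The same check can be automated once the data leave the spreadsheet. A small sketch in Python (the function name and the sample entries are hypothetical):

```python
def suspect_entries(column):
    """Return (row, value) for entries that fail to parse as numbers --
    the programmatic analogue of the left-adjusted/right-adjusted check."""
    bad = []
    for row, value in enumerate(column, start=1):
        try:
            float(value)
        except (TypeError, ValueError):
            bad.append((row, value))
    return bad

# A letter O mistyped for zero, and a stray unit, are both caught:
print(suspect_entries(["12.4", "O.31", "7", "5 kg", "0.31"]))
```

Here a letter O mistyped for a zero, and a value with a stray unit attached, are both flagged for inspection before they can silently corrupt an analysis.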
References and Further Reading
Greenfield, Tony, ed. 1996. Research Methods. Guidance for Postgraduates. Arnold,
London.
Lehman, D., Wortman, C., and Williams, A. 1987. Long term effects of losing a spouse or a
child in a motor vehicle crash. Journal of Personality and Social Psychology 52: 218-231.
McGuinness, D. 1997. Why our Children Can't Read. The Free Press, New York.
Nelder, J. A. 1999. From statistics to statistical science. Journal of the Royal Statistical
Society, Series D, 48, 257-267.
Neumark, D. and Wascher, W. 1992. Employment effects of minimum and subminimum
wages: panel data on state minimum wage laws. Industrial and Labor Relations Review 46:
55-81. [See also (1993) 47: 487-512 for a critique by Card and Krueger and a reply by
Neumark and Wascher.]
Sacks, F.M., Svetkey, L.P., Vollmer, W.M., Appel, L.J., Bray, G.A., Harsha, D., Obarzenek,
E., Conlin, P.R., Miller, E.R., Simons-Morton, D.G., Karanja, N., and Lin, P.-H. 2001.
Effects on blood pressure of reduced dietary sodium and the Dietary Approaches to Stop
Hypertension (DASH) diet. New England Journal of Medicine 344: 3-10.
SAS Institute Inc. 1996. JMP Start Statistics.
Scherr, G. H. 1983. Irreproducible Science: Editor's Introduction. In The Best of the Journal of
Irreproducible Results, Workman Publishing, New York.
Snow, John. (1855) 1965. On the mode of communication of cholera. Reprint ed., Hafner,
New York.
Taubes, G. 1998. The (political) science of salt. Science 281: 898-907 (14 August).
Tufte, E. R. 1983. The Visual Display of Quantitative Information. Graphics Press, Cheshire,
Connecticut, U.S.A.
Tufte, E. R. 1990. Envisioning Information. Graphics Press, Cheshire, Connecticut, U.S.A.
Tufte, E. R. 1997. Visual Explanations. Graphics Press, Cheshire, Connecticut, U.S.A.
Tukey, J. W. 1991. The philosophy of multiple comparisons. Statistical Science 6: 100-116.
Velleman, P. 2000. ActivStats for Excel. Data Description Inc., and Addison Wesley Longman.
Wainer, H. 1997. Visual Revelations: Graphical Tales of Fate and Deception from Napoleon
Bonaparte to Ross Perot. Copernicus Books.
Wilkinson, L. 1999. The Grammar of Graphics. Springer, New York.
2 The Structure of a Research Project
There are broad planning principles that apply across many different areas of
research, and which are the subject of this monograph. In addition there are insights
and approaches that are specific to particular areas of research.
Any effective research project must build on existing knowledge, and must ask
pertinent and incisive questions. Where new data are needed, data collection
methods should be designed to ensure that they are accurate, relevant and
interpretable. The information in the data must be teased out in ways that will help
answer the research question. Finally this information must be communicated.
Techniques for gathering, refining, systematising and interpreting information are a
large part of research methodology. Some techniques are highly specific to
individual subject areas. Others have a more general relevance that extends broadly
to all research. Statistical techniques and insights may be needed at many different
stages of a research project, starting with the overview of existing knowledge.
Different research areas may have very different demands.
1. the methods that will be used for collecting data (experiment, published data, data
archives, cross-sectional or longitudinal survey, etc.);
2. the extent to which you will need to develop new methodology or new measuring
instruments;
[It is possible to occupy a whole PhD with the development of methodology that other
researchers can then use, perhaps a new method for estimating the amount of carbon in
the soil, or perhaps a new health measurement scale.]
3. the extent to which validity seems an issue. Are the data what they seem to be; do they
really measure, for example, well-being? This is commonly a key issue in marketing or
health social science. It is often an issue in biology. It is much less often an issue in
physical science;
4. the signal to noise ratio: commonly low in marketing or health social science and high
in physics, with biology somewhere in between;
5. the types of measurement instrument: whether questionnaires, visual assessment (e.g.
of a pattern on an agar plate), physical measurement, or a mixture.
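The practical force of item 4 lies in sample size. By a widely used rule of thumb (Lehr's approximation; the effects and standard deviations below are invented for illustration), the number of observations needed per group for roughly 80% power, at a two-sided 5% significance level, grows as the inverse square of the standardised signal to noise ratio:

```python
import math

def approx_n_per_group(effect, sd):
    """Lehr's rule of thumb: about 16 / d**2 observations per group give
    roughly 80% power at a two-sided 5% significance level, where
    d = effect / sd is the standardised signal to noise ratio."""
    d = effect / sd
    return math.ceil(16 / d ** 2)

# Physics-like precision: d = 1, so about 16 observations per group.
print(approx_n_per_group(effect=5.0, sd=5.0))
# Social-science noise: d = 0.2, so about 400 observations per group.
print(approx_n_per_group(effect=2.0, sd=10.0))
```

Dividing the signal to noise ratio by five multiplies the required sample size by twenty-five, which is one reason why study sizes in the social and health sciences dwarf those in much of physical science.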
One result of these differences in predominant emphasis is that researchers who have been
trained in one area may find it difficult to make the necessary adjustments when they move to
another area. For example, there are many areas of engineering where the signal to noise ratio
is so high that noise can, most of the time, be ignored. Those who have come from this background
of experience may have difficulty making the necessary adjustment when they come to work on
engineering aspects of experimentation with fruit, e.g. the mechanics of bruising.
Investigations that work very close to the limits of detectability require special care. Biases that
are unimportant in more robust experiments can create havoc. The techniques used to detect a
few molecules of a trace chemical must be far more rigorous than those that one would use to
detect concentrations of a few milligrams per litre.
There are good reasons why you should be aware of the differing research demands of different
areas of work. There are large areas of research that cross disciplinary boundaries. There may
be components of your research that will call for a style of research different from that for
which your undergraduate training has prepared you. Increasingly engineers who design new
systems must worry about human engineering issues, whether or not these have been part of
their training. Human engineering issues are for example crucially important in the design of
aircraft instrument panels, in the design of aircraft fly-by-wire systems, and in the design of
computerised systems for delivering precise doses of radiation. Biologists and anthropologists
may, for their work, need to use measurement or chemical assay devices.
Many of those employed to do research on fruit storage or transport have been trained as
engineers or chemists or physicists. They thus move from an area where variability is
commonly not a major issue to an area where everything varies. The research demands are
thus different. Food chemists may find it hard to adjust to working with the subjective
judgements provided by taste panels. Engineers who move into management positions may be
uncomfortable with market research methodology. Econometricians whose models of the total
Australian economy cannot be rigorously tested may not be well attuned to the careful criticism
and testing that is desirable in situations where this is a possibility. Models for use in hospital
economics can and should be rigorously tested and criticised, in a manner that may not be
possible for models of the Australian economy.
So even if some of the discussion seems remote from the current demands of your own
research, bear in mind that you may at some point move into an area of work that requires this
knowledge.
In addition to differences already identified, projects may differ:
1. in the extent to which the researcher requires new knowledge, and in the extent to which that
new knowledge is available from such 'obvious' sources as books and journal articles;
2 The Structure of a Research Project
2. in the extent to which the research will be an individual effort, or part of a co-operative
project.
3. in the range and extent of multi-disciplinary demands.
In all these areas, be prepared for surprises. Current measuring instruments may prove less
adequate than you had expected, and you may have to develop your own. The skill demands
may be different, and/or more diverse, than what you had initially expected.
Below, I will now set out steps that a research project might follow, and comment on the role of
statistical insights and methods at each step.
Question: For each of the criteria 1-4 above, where in the spectrum does your project fall? Are there
other special issues that arise for your research that none of these criteria capture?
[The answers you give to this question may affect the importance you attach to the steps that I
describe below for a 'typical' research project.]
Eight Steps
1. Search out the research context
There are several facets to this. It is necessary to know, as well as you can, the state of existing
knowledge, what existing data may be available, etc.
What is the state of existing knowledge?
You will discover this by talking to any experts you can find, and by reviewing the literature. In
a case where the experts disagree, or seem unable to give convincing reasons for their
judgments, you should be careful about accepting at face value the opinion of any one expert.
You may, finally, need to make your own judgment.
You may need to look critically at claims made in the literature. This may include
(i) looking critically at the experimental or sampling design that generated the data;
(ii) critical examination of the data analysis;
iv. In addition to the logistical problems of doing experiments, there are cost issues. Experiments
in which large commercial buildings are randomly assigned to two different construction
methods are, at the very least, unusual. They'd need a wealthy and enlightened backer.
[Experiments of this kind have however been undertaken to compare the effects of different
insulation regimes.] An experiment in which, after construction, there was a destruction test
to determine the strength of the building, would require a very wealthy backer!
v. Observational or quasi-observational studies are typically much less expensive than
experiments, and easier to mount. One way to make the comparison between the two types
of construction method is to compare buildings that have been constructed using the two
different methods. There will from time to time be earthquakes in one or other place that do
an unplanned destruction test. Are the data from this just as good as data from a planned
experiment? Are they even more useful?
vi. Governments and organisations, by the changes they make, are all the time carrying out
experiments, though usually they describe them as 'reforms'. These changes might often be
better run, in the first instance, as formal experiments. For example, Government might take
five pairs of hospitals, with the two members of the pair carefully matched, then randomly
assign one member of each pair to the current management regime and the other to the new
management regime that is under trial.
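Such a matched-pairs randomisation is easy to automate. Below is a minimal sketch in Python; the hospital names and the use of the standard random module are illustrative assumptions, not part of any actual trial.

```python
import random

def assign_pairs(pairs, seed=None):
    """Randomly assign one member of each matched pair to the current
    regime and the other to the new regime."""
    rng = random.Random(seed)
    allocation = {}
    for a, b in pairs:
        if rng.random() < 0.5:
            allocation[a], allocation[b] = "current", "new"
        else:
            allocation[a], allocation[b] = "new", "current"
    return allocation

# Five carefully matched pairs of hospitals (names purely illustrative).
pairs = [("H1", "H2"), ("H3", "H4"), ("H5", "H6"),
         ("H7", "H8"), ("H9", "H10")]
allocation = assign_pairs(pairs, seed=1)
# Every pair contributes exactly one hospital to each regime, so the
# comparison of regimes is protected against differences between pairs.
```

Because the randomisation is within pairs, differences between pairs (size, case-mix, region) cannot bias the comparison of the two regimes.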
vii. In what is really a case-control study, every motor-cyclist and every tenth car driver are
stopped on a freeway and asked whether they have had a serious accident, requiring hospital
admission, in the previous 12 months. The rate among car drivers is found to be twice that
among motor-cyclists. Motor-cycle accidents may more often be fatal. Motor-cyclists who have
serious accidents may give up motor-cycling and become car drivers. There are
'confounding' effects at work here. (Christie et al., 1987.)
viii. In all studies that have an observational element, there is a potential for confounding. In a
case-control study, the two groups may differ in more than the exposure.
Once again, what matters is that the study should in principle be able to answer the questions
that are asked. This is an issue of statistical design.
4. Design the study
There are large statistical issues here. For experiments, what are the treatment units, how large
should they be, and how many of them are needed? How can one avoid confounding? Would
some form of blocking improve precision? How will information be collected on each
experimental unit (e.g. measure all plants, or just a sample), and how should it be collected?
For sample surveys, what is your target population? What sample design will give the best
precision for a given cost? How many primary sampling units are required, how many
secondary sampling units, and so on? Will you design your own questionnaire, or will you
adapt an existing questionnaire? How can you avoid questions that may puzzle respondents,
loaded questions and/or ambiguous questions? How will you handle non-response?
Your design should include planning of the details of data recording. Will you enter data onto a
sheet, or directly into a computer? If onto a sheet, do you need a specially designed form or
forms? If into a computer, do you need a computer entry form that can be displayed on the
screen? How can you be sure that the data are entered correctly?
In experimental work, photographs and/or video recordings may be useful as records of
information that you may want to check on later. (We found them invaluable when, in the apple
transport experiment, we needed to check back afterwards on the original labelling on some of
the wooden bins.)
5. Design and carry out a pilot study
This provides a check that your planning has been adequate, and should lead to refinement of
your study design. The pilot study provides a check, of your general study planning, of the
study design, of your measurement devices or instruments, and of practical aspects of data
collection. In deciding whether you need a pilot study, consider whether you could afford to
repeat the study should something go wrong.
The 'piloting' of a new form of questionnaire that is to be used as an 'instrument' for measuring,
e.g., hospital patient satisfaction or general sense of well-being, may be a long and demanding
process.
6. Carry out the study and collect the data
This is where the quality of your planning is, finally, tested! Logistical, rather than statistical,
skills are required at this point. Be sure, however, to keep your eyes and ears open for
evidence of problems, or for the unexpected. A factor that you had not incorporated into your
design may turn out to be important. There may be implications for your later interpretation of
the data. Thus in the apple transport experiment that I mentioned earlier, the intention was to
compare the effect of two truck suspension systems (mechanical and air bag). It turned out that
the major source of damage was unstable bins! We became aware of this when we noticed
that one bin that showed unusually serious damage was rickety.
An adjunct to the process of data collection must be careful checking and re-checking of data,
to avoid errors. It is often helpful to do initial exploratory data summaries as data are collected.
Any problems in the data can be investigated there and then.
7. Analyse the data
The data analysis has, broadly, two parts. There is an exploratory data analysis where you
examine various forms of data summary, both in case they have a message that you need to
consider and in order to check whether the assumptions of the intended formal analysis seem
reasonable. Exploratory data analysis allows the data, as far as possible, to speak for
themselves. I referred earlier to an apple transport experiment. In that experiment the
exploratory data analysis started when fruit were examined for transport damage. Unusually
heavy damage in a particular bin alerted us to the need to look for some major source of huge
damage that had nothing to do with truck suspension.
The formal data analysis directly addresses the issues that the study was designed to examine.
Following the formal analysis, there is further exploratory data analysis that one can and should
do. There can be more carefully targeted checks on assumptions. (After the smooth has been
removed, you can see the rough more clearly.) You can check whether there is anything that
the analysis has missed.
8. Write the report(s) and/or the paper(s)
There are important issues here of statistical presentation. One can debate whether they are
specifically statistical issues. They are issues where statisticians will have comments and
insights. It is important to communicate results clearly and accurately. If those who need to
assess or use the results cannot understand the exposition, the effort may have been largely
wasted.
2. The results of the literature review may have big implications for planning. So do not set
plans in concrete until you know what the literature says.
3. Wherever possible, use a pilot study to test the design, the techniques and logistics before
proceeding with any major experimental or data collection exercise. Changes made to the
design part way through an experiment or data collection exercise can be a recipe for
disaster.
4. Think carefully about how you will handle changes to the plan that may be forced on you by
unexpected circumstances. If it becomes obvious part way through that changes really are
needed, talk to a statistician about whether this is possible without invalidating the design.
Ideally you should carry the current experiment through to conclusion, and then mount a
new experiment with the changed plan.
5. Plan your general approach to data analysis, and ensure that you will have access to the
resources and skills that you need. Unless you have been through the same type of analysis
with the same type of data so many times that it has become routine, you should not try to
plan the analysis in detail. The data may have a message for you about the details of the
appropriate analysis.
Some investigations lend themselves to the use of a dummy run that is designed to check
logistics. For example, a hospital-based clinical trial may involve procedures that must be
followed when each new patient arrives and is enrolled in the trial. It makes sense to do one or
more dummy runs of those procedures. Dummy runs of interviewing procedures are an
essential part of the training of interviewers who will administer questions in a sample survey.
[Footnote 12: The placebo effect is an improvement that occurs merely because the patient is
receiving the attention of medical staff. There may be an improvement from giving patients
harmless and ineffectual tablets, e.g. made of glucose, to swallow.]
improve over time, can operate in subtle ways to induce biases. It is necessary to ensure that
the control group and the treatment group benefit equally from any placebo effect.
These issues become even more important when you examine reports, or documents copied
down from the internet. Such material has often not been refereed at all, either by a subject
area specialist or by a statistician.
[Footnote 13: The document may be found on the web site http://www.faseb.org/arvo/helsinki.htm]
The quality of the science is an ethical issue. Flawed studies, if they carry any credence at all,
may mislead. One should not put patients at risk or inconvenience, in order to carry out a study
that brings no benefit or may mislead. For just these same reasons, there is a duty on
researchers to fairly elicit and present the information that is in the data. These same issues
arise, though perhaps less cogently, in other research. (Greenfield 1997, chapter 5.)
Silverman (1998) has extensive discussion of issues that relate to the conduct of clinical trials.
See especially chapter 13, pp.48-52.
Christie, D., Gordon, I., and Heller, R. 1987. Epidemiology. An Introductory Text for Medical
and Other Health Science Students. New South Wales University Press, Kensington NSW,
Australia.
Natural Experiments
Diamond, J. 2001. Damned experiments. Science 294: 1847-1848.
Greenfield, T., ed. 1996. Research Methods. Guidance for Postgraduates. Arnold, London.
Manly, B. F. J. 1992. The Design and Analysis of Research Studies. Cambridge University
Press.
Silverman, W.A. 1998. Where's the Evidence? Debates in Modern Medicine. Oxford University Press, Oxford.
3 Alternative Types of Study Design
A first task must be to decide on a clear research question. The type of study design
will depend on what it is hoped to achieve, on what information is already available,
and on available resources. The study design will impose limits on the inferences
that can be drawn from the data. Large studies may have components of two or more
different types of study design.
Structured methods for collecting data include experiments, censuses or sample surveys,
prospective or retrospective longitudinal studies, case-control studies, cross-sectional
studies, and various forms of structured observational study. Properly designed experiments
or sample surveys are the most structured of all these approaches to data collection.
My focus is on quantitative studies. There are in addition various types of qualitative study.
Often, some mix of qualitative and quantitative approaches will be appropriate.
There is a correlation of 0.43 between blood pressure and sodium. However the graph makes it
clear that there are really two clusters of results, one for the industrialised societies, and one for
the non-industrialised societies. For industrialised societies there is a slight negative correlation
that is not, however, statistically significant. So what is one to make of these results?
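The pattern described here, where pooling two distinct clusters yields a correlation quite unlike the within-cluster correlations, is easy to demonstrate. The numbers below are invented purely to illustrate the arithmetic; they are not the blood pressure data:

```python
def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Two synthetic clusters, each with a perfect *negative* within-cluster trend.
x1, y1 = [0, 1, 2], [2, 1, 0]          # "cluster 1"
x2, y2 = [10, 11, 12], [12, 11, 10]    # "cluster 2", shifted upwards
print(pearson(x1, y1))                      # -1.0 within cluster 1
print(pearson(x2, y2))                      # -1.0 within cluster 2
print(round(pearson(x1 + x2, y1 + y2), 2))  # 0.95: pooling reverses the sign
```

The pooled correlation is driven entirely by the separation between the clusters, which is exactly the situation with the industrialised and non-industrialised societies in Fig. 2.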
The arguments have always been controversial:
o The body needs salt for proper functioning. Too little salt may be dangerous. What is the
optimum level? Dahl's rats developed hypertension only when fed huge amounts of salt.
The human equivalent would be 500 grams per day.
o Indigenous societies differ from industrialised societies in many ways, not just in their
consumption of salt.
It is now widely accepted that the most valid evidence comes from randomised controlled trials
that meet strict protocols. Intra-population studies are commonly (but not universally) thought
to be more valid than inter-population studies. Here is a summary of the results from these
different types of evidence:
1. Some ecologic studies have shown big differences between populations, correlating with their
salt intake. [It makes a lot of difference which data one focuses on: witness the difference
between non-industrialised and industrialised societies in Fig. 2.]
2. Intra-population studies have generally been unable to show a link between salt intake and
blood pressure.
3. An overall analysis (meta-analysis) that included 30 randomised trials and 48 unrandomised
trials found a substantial effect. This study has been criticised for failing to distinguish
between randomised and unrandomised trials.
4. Randomised controlled trials have shown either no effect or a very small effect. A recent
meta-analysis (overall analysis) shows a small effect, possibly too small to be of clinical
importance.
5. Cross-over trials have shown a substantial effect.
It is with good reason that ecologic studies are widely regarded as unreliable. Almost inevitably,
the populations differ in many respects. Thus, in the salt studies, the populations almost
certainly differ in the level of intake of fruit and vegetables. Why focus on salt? Pretending
that one is seeing the effect of salt alone may just be wishful thinking. A number of other
effects are at work. The effects are confounded, i.e. the data do not allow you to separate
them. One might say that confounding is a confounded nuisance! Confounding is a very
serious problem in observational studies.
Societies that have high salt intakes are typically those that consume highly salted preserved
foods. They consume these foods because they do not have access to fruit and vegetables.
Thus, in the inter-population studies, the effects of salt are confounded with the effects of low
levels of fruit and vegetable consumption. Recently the DASH (Dietary Approaches to Stop
Hypertension) collaborative research group has reported on a series of trials that investigated the
use of a diet rich in fruit, vegetables and low fat dairy products (Appel et al. 1997). The blood
pressure was reduced both for normal subjects and for mild hypertensives, slightly more for the
latter. There was no reduction in salt consumption. More recent results from the DASH group
(Sacks et al. 2001) suggest that salt intake has an effect over and above these other dietary
effects.
The most reliable evidence is, in many contexts, that from carefully conducted randomised
clinical trials that use appropriate controls. Randomised trials that compare the effect of different
diets are however very difficult to carry out. It may be difficult to change the amount of salt in
the diet without causing changes to other aspects of the diet. This difficulty is one possible
explanation for the small size of the effect that Graudal (1998) found in his meta-analysis of 58
randomised controlled trials of persons with high blood pressure and 56 trials of persons with
normal blood pressure. There are questions of what trials should be included in such a meta-
analysis. Should one limit attention to trials that are double-blind, i.e. neither the patients
themselves nor the staff administering the trial know who is on which diet? It is important to
know how well other aspects of diet were controlled. Poor control is likely to attenuate the
estimate of the treatment effect. As often, one has to sift out the more directly relevant and
reliable sources of information, and use them to interpret less reliable and/or relevant sources of
information.
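The arithmetic at the heart of a simple fixed-effect meta-analysis is inverse-variance weighting: each trial's effect estimate is weighted by the reciprocal of its variance, so that large, precise trials dominate the pooled estimate. Here is a minimal sketch, with invented effect sizes and standard errors (not the Graudal figures):

```python
def fixed_effect_meta(effects, ses):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    weights = [1.0 / se**2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = (1.0 / sum(weights)) ** 0.5
    return pooled, pooled_se

# Hypothetical reductions in systolic blood pressure (mm Hg) from three trials.
effects = [-2.0, -1.0, -4.0]
ses = [0.5, 1.0, 2.0]
est, se = fixed_effect_meta(effects, ses)
# The large, precise first trial (se = 0.5) dominates the pooled estimate.
```

A fixed-effect pooling of this kind is only as good as the trials that go into it, which is why the choice of trials to include, and whether to distinguish randomised from unrandomised trials, matters so much.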
Cross-over trials, where participants in one arm follow a salt-reduced diet and then their usual
diet, while those in a second arm of the trial reverse the order, use each individual as his/her
own control. This allows, in principle, a more precise estimate of the treatment effect. Carry-over
effects may in practice be a serious problem for cross-over studies. Cross-over studies suggested a
more substantial effect than indicated by randomised controlled trials with a separate treatment
and control group.
A good place to start investigation of the controversy is the Taubes (1998) article in the journal
Science. Taubes draws attention to the major overview studies, and presents the views of the
main protagonists. Important studies that are subsequent to Taubes' article are Graudal et al.'s
(1998) meta-analysis and Sacks et al.'s (2001) report of a recent very careful study that
controlled for other aspects of diet as well as for sodium intake. Sacks et al. conclude that
sodium contributes substantially to hypertension, with the effect strongest for those who do not
exercise especial care with other aspects of their diet. The results that Sacks et al. report are in
line with R.A. Fisher's dictum that Nature responds best when she is asked multiple questions,
albeit in a highly structured manner. In order to get a precise indication of the effects of
sodium, one must at the same time investigate other dietary effects.
In reviewing the literature you need to be aware of the strengths and weaknesses of different
types of study. In planning your own study, you need to know the strengths and weaknesses of
the alternative designs that are available to you. Additional issues arise when there are multiple
studies.
been no change. The neighbouring state where there had been no change was used as a control.
But what if we have an intervention (a change in minimum wage requirements), but no control?
Can we mount a before/after argument? Here are summaries of a range of possible studies, for
the study of minimum wage legislation:
1. Use U. S. national monthly data to study the effects of increases in the Federal
minimum wage on April 1 1980 and April 1 1981. (Deere et al. 1995).
[NB there was no control group.]
2. Conduct a panel study of state minimum wage changes, 1973-1989. (Neumark and
Wascher 1992).
[Horizontal comparisons across states, at one time, rely heavily on analytic models and
numeric adjustments.]
3. Compare New Jersey (where there was a change in the minimum wage) with nearby
Eastern Pennsylvania (where there was no change). (Card and Krueger 1994).
[NB Control was chosen by the investigator. We have only one comparison between
'treatment' and 'control'.]
In an example (Freedman 1999) from the early history of investigations into the health effects of
smoking, cases were persons admitted to hospital after diagnosis with lung cancer. Controls
were patients admitted for other reasons. In such case-control studies, it is the outcome (lung
cancer or not) that determines who will be in the study. The investigator then peeks to see what
treatment the patient received.
o Analytical sampling
o Descriptive sampling
o Sampling for Pattern
There is a great deal more that might be said. Sampling issues arise in experimental as well as in
non-experimental studies.
17. You are asked for advice on what sorts of studies are needed to decide once and for all the
dietary effects of salt. Is one individual study likely to be useful? Should the focus be on
careful evaluation of existing data, or on a new study? What advice would you give? Bear
in mind that most research to date has focused on effects on blood pressure. Are there
other effects of changes in dietary salt that ought to be a concern?
[The answers are not at all obvious. They are, though, good questions to think about.]
18. You are asked for advice on the validity of the evidence that Dianne McGuinness (1997)
presents in her book Why Our Children Can't Read. What would be a good way to
proceed? How long will you need? What help will you need?
19. A private health provider is responsible for 20 hospitals. It plans to move to a new funding
and management regime. Before making the change, it wants to be sure that the changes
will work and will be an improvement. Would you recommend moving some of the
hospitals to the new regime on an experimental basis?
20. You have read the book Smart Health Choices (Irwig et al., 1999). You applaud the
encouragement that it gives to patients to ask clinicians probing questions about their
treatment choices. But will clinicians be able to respond well to such demands? Design a
study to answer this question.
21. What are the pros and cons of screening for prostate cancer? [See e.g. Irwig et al. 1999;
Moynihan 1998].
22. Consider the design of a study of the effects of changing sociological and political forces on
taxation regimes in the Commonwealth of Australia since Federation.
Irwig, J., Irwig, L., and Swift, M. 1999. Smart Health Choices. How to make informed health
decisions. Allen and Unwin, Sydney.
Manly, B. F. J. 1992. The Design and Analysis of Research Studies. Cambridge University
Press.
McGuinness, D. 1997. Why Our Children Can't Read. The Free Press, New York.
Natural Experiments
Diamond, J. 2001. Damned experiments. Science 294: 1847-1848.
Terborgh, J., Lopez, L., Núñez, P., Rao, M., Shahabuddin, G., Orihuela, G., Riveros, M.,
Ascanio, R., Adler, G.H., Lambert, T.D., and Balbas, L. 2001. Ecological Meltdown in
Predator-Free Forest Fragments. Science 294: 1923-1926.
Dahl, L. K. 1960. Possible role of salt intake in the development of hypertension. In Cottier,
P., Bock, K. D., eds. Essential Hypertension an International Symposium, pp. 53-65.
Springer-Verlag, Berlin.
Dahl, L. K. 1970. Salt and Hypertension. American Journal of Clinical Nutrition 25: 231-244.
Graudal, N. A., Galloe, A. M., Garred, P. 1998. Effects of sodium restriction on blood
pressure, renin, aldosterone, catecholamines, cholesterols, and triglyceride. Journal of the
American Medical Association 279: 1383-1391.
Sacks, F.M., Svetkey, L.P., Vollmer, W.M., Appel, L.J., Bray, G.A., Harsha, D., Obarzanek,
E., Conlin, P.R., Miller, E.R., Simons-Morton, D.G., Karanja, N., and Lin, P.-H. 2001.
Effects on blood pressure of reduced dietary sodium and the Dietary Approaches to Stop
Hypertension (DASH) diet. New England Journal of Medicine 344: 3-10.
Taubes, G. 1998. The (political) science of salt. Science 281: 898-907 (14 August).
4. Experimental Design
The statistical tools of experimental psychology were borrowed from agronomy, where they
were invented to gauge the effects of different fertilizers on crop yields. The tools work just
fine in psychology, even though, as one psychological statistician wrote, "we do not deal in
manure, at least not knowingly." The power of these tools is that they can be applied to any
problem (how color vision works, how to put a man on the moon, whether mitochondrial
Eve was an African) no matter how ignorant one is at the outset.
[Pinker, S. 1997. How the Mind Works, p.303. Norton, New York.]
The methods of science, with all its imperfections, can be used to improve social, political
and economic systems, and this is, I think, true no matter what criterion of improvement is
adopted. How is this possible if science is based on experiment? Humans are not electrons
or laboratory rats. But every act of Congress, every Supreme Court decision, every
Presidential National Security Directive, every change in the Prime Rate is an experiment.
Every shift in economic policy, every increase or decrease in funding for Head Start, every
toughening of criminal sentences is an experiment. Exchanging needles, making condoms
freely available, or decriminalizing marijuana are all experiments. . . . In almost all these
cases, adequate control experiments are not performed, or variables are insufficiently
separated. Nevertheless, to a certain and often useful degree, such ideas can be tested. The
great waste would be to ignore the results of social experiments because they seem to be
ideologically unpalatable.
[Sagan 1997, The Demon-Haunted World, pp. 396-397. Headline Book Publishing, London.]
Experiments may answer questions you never thought to ask! Experiments teach by
experience. Receptive and trained minds will learn more. Different applications
have different needs.
There is no more effective way to settle a disputed question than to do an experiment,
when an experiment is possible. When fire-walkers walk across hot charcoal and
emerge unharmed, it demonstrates that such a thing is possible. When one plant
grows like crazy in a bed of compost, while its neighbour has no compost and wilts, it
seems a convincing demonstration that compost helps growth. It seems convincing
even though this is a rather poorly designed experiment.
The aim of experimental design is to ensure that the experiment can detect the
treatment effects that are of interest, and uses available resources to get the best
precision possible. The choice of design can make a huge difference.
The account that I give here will, as in the case of much else that this monograph touches on,
be introductory. My aim is to give the flavour of experimental design, as it applies to a number
of different application areas.
Francis Bacon (1561-1626) gives an early example of a controlled experiment. He applied five
different treatments to wheat seeds: water mixed with cow dung, urine, and three different
wines. The winner was urine, followed by the cow dung. By the standards of modern
experimental design, Bacon's experiment was inadequate. It was not randomised, i.e. he did not
use a random mechanism for assigning seeds to treatments.
Very simple experiments vary just one factor at a time. Indeed there are still experimenters who
regard this as the proper strategy. Where there are multiple factors, the one-factor-at-a-time
approach makes it very difficult to detect interactions. If there are no interactions, it may work
reasonably well, but is inefficient.
Multi-factor experiments allow the detection of interactions. Degrees of freedom that are
associated with any interactions that prove to be negligible are available for improving the
precision of the standard deviation estimate. So the experimenter wins both ways. For purposes
of estimating main effects, a single four-factor experiment is in general far more efficient than
four single factor experiments. It will give the same accuracy with a much smaller use of
resources.
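The double use of the data can be seen in miniature in a 2x2 factorial: each main effect is estimated from all four runs, averaged over the levels of the other factor. The responses below are invented for illustration (an effect of +3 for factor A, +2 for factor B, and no interaction):

```python
# Invented responses for a 2x2 factorial; keys are (level of A, level of B).
response = {(0, 0): 10.0, (1, 0): 13.0, (0, 1): 12.0, (1, 1): 15.0}

def main_effect(response, factor):
    """Mean response at the high level of one factor minus the mean at the
    low level, averaging over the levels of the other factor."""
    high = [r for levels, r in response.items() if levels[factor] == 1]
    low = [r for levels, r in response.items() if levels[factor] == 0]
    return sum(high) / len(high) - sum(low) / len(low)

effect_A = main_effect(response, 0)  # 3.0, estimated from all four runs
effect_B = main_effect(response, 1)  # 2.0, from the same four runs re-used
```

A one-factor-at-a-time scheme would need separate runs for each comparison; in the factorial every run does double duty, which is the source of the efficiency gain.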
Randomised controlled trials are a good setting in which to consider a number of elementary
aspects of experimental design. By contrast with agricultural experimentation, the design for a
randomised controlled trial is often very simple in concept.
Where there are two treatment groups, subjects are randomly assigned to one or other
treatment, and the result determined. Complications arise from the ethical and logistical
difficulties of conducting a properly designed clinical trial.
A minor elaboration of the two-sample trial arises when subjects are matched, or when
treatment comparisons can be made within subjects. In this case it may be possible to perform
the analysis on the difference between the responses or on log(ratio) of the responses, or on
some other measure of the difference. The analysis then reduces to a single sample analysis.
There are numerous examples of interventions that were introduced without first doing an
experiment, and where the intervention was later shown to be harmful. Hormone injections in
pregnancy were at one time thought to prevent miscarriage. A randomised controlled trial
showed no effect, compared with placebo injections. Moreover this unproved therapy later
proved to give an excess of cases of vaginal carcinoma and of breast cancer (Christie et al.
1987; Gehan & Lemak 1994, p.159). Section 5.1 gives initial data from this study.
Randomised controlled trials where there is matching provide a simple example of a block
design. The individuals who are matched form a single block. Another form of matching arises
when the different treatments are applied, in turn, to the one patient. The issue of whether
there is treatment carry-over is then important. Also one has to design the trial so that changes
over time can be distinguished from the treatment effect.
The investigator uses a ruler to read off the results. One way to make this easy is to place the 1
at 10mm, the 30 at 30mm, and so on. The 'x' is at about 36mm. A reasonable way to do the
experiment is to give each person both products. Here then is a set of results (shown as mm)
from such an experiment:
Person 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
4 units 72 74 70 72 46 60 50 42 38 61 37 39 25 44 42 46 56
1 unit 58 69 60 60 54 57 61 37 38 43 34 14 17 54 32 22 36
Diff. 14 5 10 12 -8 3 -11 5 0 18 3 25 8 -10 10 24 20
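Because each taster assesses both samples, the analysis reduces to a single sample of differences. The calculation can be sketched in a few lines of Python (standard library only; the paired t-statistic is computed by hand):

```python
import math

four_units = [72, 74, 70, 72, 46, 60, 50, 42, 38, 61, 37, 39, 25, 44, 42, 46, 56]
one_unit   = [58, 69, 60, 60, 54, 57, 61, 37, 38, 43, 34, 14, 17, 54, 32, 22, 36]

diffs = [a - b for a, b in zip(four_units, one_unit)]
n = len(diffs)
mean_diff = sum(diffs) / n
sd = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
t = mean_diff / (sd / math.sqrt(n))   # paired t-statistic on n - 1 = 16 d.f.
# mean_diff is about 7.5 mm; t is about 2.8, suggestive of a systematic
# difference between the two amounts of additive.
```

With 16 degrees of freedom, a t-statistic of about 2.8 corresponds to a two-sided p-value of roughly 0.01.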
[Figure: scatterplot of the assessments, one point per taster, with 'four' (assessment for the
sample with four units of additive) on the vertical axis and 'one' (assessment for the sample
with one unit) on the horizontal axis; both axes run from 2 to 7.]
The diagonal line shows where the assessments for the two samples would be equal. Notice
that tasters who give a higher assessment for the sample with one unit of additive also tend to
give a higher assessment for the sample with four units of additive. The differences in the two
assessments are relatively consistent.
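The analysis of the differences is then a one-sample analysis. A minimal sketch in Python, using only the standard library (the t statistic is the standard paired-comparison summary, not a calculation given in the text):

```python
import math
import statistics

# Differences (4 units minus 1 unit) for the 17 tasters, from the table above.
diffs = [14, 5, 10, 12, -8, 3, -11, 5, 0, 18, 3, 25, 8, -10, 10, 24, 20]

n = len(diffs)
mean_diff = statistics.mean(diffs)   # average preference for 4 units of additive
sd_diff = statistics.stdev(diffs)    # sample standard deviation (divisor n - 1)
se_diff = sd_diff / math.sqrt(n)     # standard error of the mean difference

# One-sample t statistic for the null hypothesis of no difference.
t_stat = mean_diff / se_diff
print(f"mean difference = {mean_diff:.2f}, t = {t_stat:.2f} on {n - 1} df")
```

The mean difference is positive and the t statistic sits comfortably beyond the usual 5% critical value, consistent with the visual impression that the differences are relatively consistent.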
The individual tasters have the role that blocks would have in an agricultural field design. Each
taster compares the two treatments. In the field design, the two treatments are placed alongside
in the one block. The design can easily be extended to allow a comparison with, for example,
milk with no additive. There would then be three treatments per taster.
Experimental design questions that one might ask include:
1. What are the pros and cons of the above experiment, as against an experiment where
34 tasters are divided randomly into two groups of 17, with tasters in the first group all
receiving the milk with one unit of additive, and those in the second group receiving the
milk with four units of additive?
[This alternative experiment would be a very imprecise experiment. Differences
between tasters would introduce unwanted noise into the comparison between amounts
of additive.]
2. What is the best way to improve accuracy? It is easier and cheaper to get each taster to
repeat the comparison a number of times, rather than to bring in new tasters.
[If individual tasters are highly consistent from one occasion to another, relative to the
variation between tasters, it will not help much to get each taster to repeat the
comparison a number of times. Increasing the number of tasters will always, in theory,
improve the expected precision.]
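The point in brackets can be made concrete with a small variance calculation. The variance components below are hypothetical, chosen only to illustrate why pairing pays off when tasters differ markedly:

```python
# Hypothetical variance components (illustrative only, not estimated from the data).
var_between_tasters = 100.0  # variance of taster-to-taster level differences
var_within_taster = 16.0     # variance of repeat assessments by the same taster
n = 17

# Paired design: each taster's difference cancels the taster effect,
# leaving only within-taster variation.
var_paired_mean = 2 * var_within_taster / n

# Two independent groups of n tasters each: taster variation stays in.
var_two_group_mean = 2 * (var_between_tasters + var_within_taster) / n

print(var_paired_mean, var_two_group_mean)
```

With these assumed components the paired comparison is several times more precise than the two-group comparison, and only extra tasters, not extra repeats, shrink the between-taster term.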
Replication reduces the effect of variability on treatment comparisons, and ensures that there will
be a valid estimate of experimental and other error. Note the contrast between replication of
experimental units and repeated measurements on the same experimental unit. Repeated
measurements on an experimental unit increase the accuracy for that unit. One still has only the
one experimental unit.
Randomisation aims to balance out the effects of factors that are not amenable to experimental
control. It does this by making chances equal. It does not ensure that treatment groups will be
balanced with respect to these uncontrollable factors, only that the chances are equal.
Also possible are block designs in which only a subset of the treatments appears in each block. For
example, we might have
Block 1 Block 2 Block 3
Treatments A, B B, C C, A
One treatment has been left out in each block, in a balanced way. This is a balanced
incomplete block design. I have used this type of design for comparing the readings of different
firmness testing devices on the same fruit. Each fruit was in effect a block. We did two sets of
two readings, one pair with each of the devices, on the one fruit.
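A defining property of a balanced incomplete block design is that every pair of treatments occurs together in the same number of blocks. The block contents below come from the example above; the check itself is a standard property, coded here as a sketch:

```python
from itertools import combinations

# The three blocks from the example: each block omits one treatment.
blocks = [{"A", "B"}, {"B", "C"}, {"C", "A"}]
treatments = ["A", "B", "C"]

# Count, for every pair of treatments, the blocks that contain both.
pair_counts = {
    pair: sum(set(pair) <= block for block in blocks)
    for pair in combinations(treatments, 2)
}
print(pair_counts)  # balance: every pair appears together equally often
```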
Block designs are widely used in agriculture, where the aim is to maximise the precision of
treatment comparisons. Thus each block is chosen to be as uniform as possible. In the simplest
form of randomised block design, all treatments occur once in each block. Blocks should be
sampled from the wider population to which it is intended to generalise results, so that they
might be on different sites.
In controlled climate chambers, each chamber may form a block, with one or more units from
each treatment in each chamber. Or if there are differences between trays in a chamber, each
tray may form a block.
In clinical trials, blocks are more often used as a way of making it hard to predict treatment
allocations for individual patients. Allocation of treatments to patients is random within blocks,
subject to devices that achieve roughly equal numbers in the different treatment groups (ICH 1998,
p. 21). Where a surgical trial involves several different surgeons, blocking may be highly
desirable as a mechanism for controlling variation. The patients allocated to a surgeon
form a block, with random allocation to treatments within each such block.
Randomisation
Randomisation prevents intentional or unintentional favouring of one treatment over another. It
is also a way to ensure that observations are all drawn, independently, from the same
distribution. Haphazard allocation, where the experimenter allocates treatments in any
unsystematic way that seems right, is not randomisation.
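Randomisation within blocks, as described above, can be sketched as follows. The treatments, block size, and seed are hypothetical:

```python
import random

# Hypothetical setup: three treatments, each occurring once in every block
# of three patients; the seed makes the allocation reproducible.
treatments = ["A", "B", "C"]
n_blocks = 4
rng = random.Random(2024)

allocation = []
for _ in range(n_blocks):
    block = treatments[:]  # every block contains each treatment exactly once
    rng.shuffle(block)     # random order within the block
    allocation.extend(block)

print(allocation)
```

Within each block the order is unpredictable, yet the numbers per treatment are forced to be equal.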
Replication
Genuine replication increases the number of treatment units. Where there are blocks, there is a
choice between increasing the number of blocks, and increasing the number of experimental
units in each block. Increasing the number of observations on each experimental unit, while it is
often a good idea, is not genuine replication.
4.5 Confounding
Experiments can and should be designed so that they are capable of revealing the effects of the
factors and factor combinations that are of interest. In observational studies there may be such
limited control over the design that it is impossible to separate effects out in this way. Some
confounding is almost inevitable.
For example, two measuring instruments that are believed functionally identical may be set
differently. If one instrument is used for measuring results from treatment A, and the other for
measuring results from treatment B, the effect of the treatment is confounded with the effect of
the instrument. Or if one technician assesses results from treatment A, and another technician
assesses results from treatment B, there may be a technician effect that is confounded with the
treatment effect.
The simplest form of experimental confounding occurs when two factors change together. High
correlations between pairs of variables, common in observational studies, provide an indication
that it will be difficult to separate their effects. Contrast this with the way that experiments vary
factor levels under the control of the experimenter, to ensure that they do not change together.
Suppose we have two factors: level of lime, and level of phosphate. The following three
designs illustrate the three different possibilities. An x indicates that a particular combination of
factor levels is present.
[Table: three candidate designs, each a grid of lime levels (0, 1000, 2500, 8000) against
phosphate levels, with an x marking each combination of factor levels that is present.]
The first design is much preferable to the third. The same selection of levels of phosphate
appears for each different level of lime. The second design does not allow any possibility for
separating the effects of lime from those of phosphate. It is a hopeless design, unless one
already knows the optimum ratio of phosphate to lime.
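The contrast between a fully crossed design and a confounded one can be stated in code. The phosphate levels below are hypothetical; only the structure of the two designs matters:

```python
from itertools import product

lime = [0, 1000, 2500, 8000]
phosphate = [0, 50, 100]  # hypothetical levels, for illustration only

# Crossed design: every phosphate level at every lime level, so the
# two factors vary independently and their effects can be separated.
crossed = list(product(lime, phosphate))

# Confounded design: lime and phosphate change together, so no
# analysis can tell their effects apart.
confounded = [(0, 0), (1000, 50), (2500, 100)]

# In the crossed design, each lime level sees the same set of phosphate levels.
for level in lime:
    assert {p for l, p in crossed if l == level} == set(phosphate)
print(len(crossed), "runs in the crossed design")
```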
In clinical trials, age or sex may be a confounding factor. Suppose one has:

              Treatment A   Treatment B
Females            7             15
Males              9              3

Then gender is a confounding factor for purposes of making treatment comparisons. The
treatment A results will be slightly biased towards the results for males, while the treatment B
results will be close to the results for females14.
14
These are the numbers in a trial reported in Gordon, N. C. et al. (1995): Enhancement of
Morphine Analgesia by the GABAB Agonist Baclofen. Neuroscience 69: 345-349. Treatment A was
pentazocine plus placebo, while treatment B was pentazocine plus baclofen. When the data were
analysed to take account of the gender effect, it turned out that the main effect was a gender effect,
with a much smaller difference between treatments.
Different application areas differ in the types of design that find predominant use. In specific
applications, there will be a range of practical issues that require attention. Robinson (2000) is
attractive for the way that it combines attention to such practical issues with attention to the
theory as and when it is necessary. Examples are drawn from many application areas, with a
focus on industrial applications. For field experimentation, see Mead (1988), Peterson (1985),
Pearce et al. (1988), and Williams and Matheson (1994). See also the very brief discussion of
experimental design in Maindonald (1992). The manual for the statistical package Genstat
(Payne et al. 1993) has helpful discussions of designs that are common in field experimentation.
For clinical trials, Piantadosi (1997) and Silverman (1985) are particularly good. See also other
books that are noted in the references.
Armitage, P. and Berry, G. 1987. Statistical Methods in Medical Research, 2nd edn.
Blackwell Scientific Publications, Oxford.
Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., Pitkin, R., Rennie, D.,
Schulz, K. F., Simel, D., and Stroup, D. F. 1996. Improving the Quality of Reporting of
Randomised Controlled Trials: the CONSORT Statement. Journal of the American Medical
Association 276: 637-639.
[The checklist that appeared as part of this statement can be found at:
http://www.ama-assn.org/public/journals/jama/jlist.htm]
Box, G.E.P., Hunter, W.G., and Hunter, J.S. 1978. Statistics for Experimenters. Wiley, New
York.
Cochran, W.G. and Cox, G.M. 1957. Experimental Designs, 2nd edn. Wiley, New York.
Gehan, E. A. and Lemak, N. A. 1994. Statistics in Medical Research. Plenum Medical Book
Company, New York.
ICH 1998. Statistical Principles for Clinical Trials. International Conference on Harmonisation
of Technical Requirements for Registration of Pharmaceuticals for Human Use. Available
from http://www.pharmweb.net/pwmirror/pw9/ifpma/ich1.html
Maindonald, J.H. 1992. Statistical design, analysis and presentation issues. New Zealand Journal
of Agricultural Research 35: 121-141.
Mead, R. 1988. The Design of Experiments: Statistical Principles for Practical Application.
Cambridge University Press, Cambridge.
Payne, R.W., Lane, P.W., Digby, P.G.N., Harding, S.A., Leech, P.K., Morgan, G.W., Todd,
A.D., Thompson, R., Tunnicliffe Wilson, G., Welham, S.J. and White, R.P. 1993. Genstat
5 Release 3 Reference Manual. Clarendon Press, Oxford.
Pearce, S.C., Clarke, G.M., Dyke, G.V. and Kempson, R.E. 1988. Manual of Crop
Experimentation. Griffin, London.
Peterson, R.G. 1985. Design and Analysis of Experiments. Marcel Dekker, New York.
Piantadosi, S. 1997. Clinical Trials: A Methodologic Perspective. Wiley, New York.
Robinson, G.K. 2000. Practical Strategies for Experimenting. Wiley, New York.
Schulz, K. F. 1996. Randomised Trials, Human Nature, and Reporting Guidelines. Lancet
348: 596-598.
41
4. Experimental Design
Shapiro, S. H. and Louis, T. A. 1983. Clinical Trials: Issues and Approaches. Marcel Dekker,
New York.
Silverman, W. A. 1985. Human Experimentation. A Guided Step into the Unknown. Oxford
University Press, Oxford.
Williams, E.R. and Matheson, A.C. 1994. Experimental Design and Analysis for Use in Tree
Improvement. CSIRO Information Services, Melbourne.
5. Quasi-Experimental and Observational Studies
As noted in the previous chapter, the only sure way to know the effect of one or other
change is to make the change and see what happens, i.e. do an experiment. However
there are severe practical and ethical limits on what experiments are possible. Hence
the various quasi-experimental methods that exercise non-experimental forms of
control on the generation of data. Or the conditions of an experiment may be created
by an accident of management or of nature.
Even though we do not have an experiment, is it sometimes (or often) possible to get data that
we can treat, to a greater or lesser extent, as though they had come from an experiment? This
section will explore several types of study that aim to do just that. As we will see, there can be
severe obstacles to reliable inference from such studies.
Data from quasi-experimental studies are commonly analysed as though they had been gathered
under experimental control. If the mechanisms that generated the data closely mimic those of a
genuine experiment, this makes good sense. Where the data have few of the characteristics of
experimental data, inferences that rely on statistical models are in general hazardous. The
literature offers guidelines, arising from work such as I will discuss below, that careful
researchers will study and use.
I will start with those types of study where there is the greatest potential to reproduce the
conditions of an experiment, moving through to those farthest from the conditions of an
experiment. The examples are all from clinical medicine. The following section discusses the
use and limitations of regression modelling. It uses examples from the economics literature.
Getting results that are credible and defensible, if this is possible, is not a simple matter of
running data through a multiple regression program!
health experience of doctors who smoked at the point of entry to the study with the health
experience of those who did not. The doctors were not randomly assigned to a smoking and a
non-smoking group! So there might be something different about the doctors who smoke,
affecting both their health experience and their tendency to smoke. Much of the work on the
health effects of smoking has been directed to ruling out such explanations.
Case-control studies
Again we wish to assess the effects of an exposure. Case-control studies aim, by the choice of
'cases', to exercise an 'after the event' control that as far as possible substitutes for direct
experimental control. Those subjects who have the disease are 'cases', while the 'controls',
chosen from the same population as the cases, do not have the disease. We classify both cases
and controls as exposed or unexposed. The estimation of relative risk relies on cases and
controls being representative of cases and controls in the community, with no regard to the
likelihood of exposure or non-exposure. Depending on how subjects are selected, such
associations are common. Persons known to have been exposed, and therefore thought more
likely to be cases, may be more likely to find their way into hospital records.
Occasionally, case-control studies involving quite small numbers of patients provide highly
convincing evidence. Adenocarcinoma of the vagina in young women had been recorded rarely before it
was diagnosed in eight patients treated in two Boston hospitals between 1966 and 1969. Each of
the eight patients was matched with a female born nearest the time of the patient and from the
same service. Seven of the eight mothers of patients with carcinoma had received
diethylstilbestrol (DES), starting during the first trimester. No control mother had been given
the synthetic estrogen. Thus we have
                           With cancer   Without cancer
Mother had taken DES             7               0
Mother had not taken DES         1               8
In seven of the pairs, the mother of the daughter with carcinoma had taken DES, while the
other mother had not. In one of the pairs, neither mother had taken DES. The probability of
finding this discordance in seven or more pairs out of 8, if the probability of the mother taking
or not taking DES is the same irrespective of whether the daughter had cancer, is 0.004
(Gehan & Lemak 1994, pp. 158-159)15.
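Exact probabilities of this kind come from the binomial distribution. The sketch below conditions on the seven discordant pairs and computes a one-sided sign-test probability; the exact figure depends on how the calculation is conditioned, so it need not match the value quoted above:

```python
from math import comb

def binom_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Seven discordant pairs; under the null hypothesis each pair is equally
# likely to favour either direction. Probability that all seven favour
# the DES direction:
p_one_sided = binom_tail(7, 7)
print(round(p_one_sided, 4))  # 0.0078
```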
Cross-sectional Studies
Essentially a cross-sectional study is a type of survey. It shows a current reality: the
prevalence of smoking or the prevalence of lung cancer. It does not tell us incidence: the rate
at which people are taking up smoking or getting lung cancer. There is no time dimension.
Moreover there is a survivor effect: the only people who can be asked questions are those who
are available to be asked. Christie et al. (1987) quote the (fictitious?) example of stopping all
motor-cycle riders and every tenth car driver on a freeway and asking whether they have had a
serious accident, requiring hospital admission, in the past 12 months. The rate among car
drivers is found to be twice that among motor-cyclists. Serious accidents may be more likely to
kill motor-cyclists. Or perhaps, following a serious accident, many motor-cyclists give up their
motor-cycles and become car drivers.
15
It is not appropriate to apply a chi-squared test to the two-way table. Such an analysis ignores the
pairing, and would be wrong.
The data suggest a better prognosis for the group whose cancer is detected as a result of
screening. However there are at least two differences between the screened and the unscreened
group:
1. It is possible that some in the screened group would never have presented at a clinic; some
of these cancers may stay dormant;
2. The screening will detect cancers at an earlier stage. Even without treatment these patients
should survive longer than those whose cancer is detected, almost inevitably at a more
advanced stage, when they present at a medical service.
Because the process that led to the detection of cancer was different between the screened and
unscreened groups, the two groups are not comparable. The method of detection is a
confounding factor. The screening may lead to surgery for some cancers that would otherwise
lie dormant for long enough that they would never attract clinical attention.
One needs a longitudinal study that compares all patients in a screened group with all patients in
an unscreened group. Table 2 presents results from such a study16 (cf. Hisamuchi et al. 1991):

                   Number   Mortality over 1960-1977
Unscreened Group    2683         95/100,000 p.a.
Screened Group      4325         45/100,000 p.a.
16
The data appeared in Sugawara et al.(1984), in Japanese, in a paper of which I have no other details.
studies, and various prospective control studies, gave the lowest risks. See also Andersen
(1990).
more extensively than the corresponding results for females. Male 1978 earnings increased,
relative to those in the control group, by an average of $886 [SE $472].
Lalonde's idea was to replace the experimental control group with two non-experimental groups
that had been studied extensively, and then to use regression methods to estimate the effect on
earnings. The results are discouraging. The estimate depends strongly on the form of
regression adjustment. Even more disturbingly, it was in every case negative, and different for
the different comparison groups. The closest agreement was a decrease in earnings of $1844
[SE $762] when the analysis used one non-experimental control group, and a decrease of $987
[SE $452] when it used the other non-experimental control group. The figures improved
slightly, i.e. became less negative, when comparisons were with subsets of the non-experimental
control groups that more closely matched the characteristics of the treatment group.
Dehejia and Wahba (1999) revisited Lalonde's study, using his data. They used the propensity
score methodology, as expounded e.g. in Rosenbaum and Rubin (1983). Here is a simplified
description of the approach, as used by Dehejia and Wahba (1999). A propensity score is a measure,
determined by covariate values, of the probability that an observation will fall in the treatment
rather than in the control group. Various forms of discriminant analysis may be used to
determine scores. Comparison of treatment and control groups then uses only those
observations whose propensity scores lie within the overlapping parts of the ranges of treatment
and control groups. Comparison of treatment and control group then proceeds using the
propensity score as the only covariate. Dehejia and Wahba (1999) used this methodology to
reproduce, in comparisons using the non-experimental control groups, results that closely
matched the experimental results. The task is not as hopeless as Lalondes study seemed to
indicate. It does however require a careful and subtle use of a methodology that is adapted for
handling non-experimental comparisons. A straightforward use of regression methods will not
work. In general Dehejia and Wahbas methods require extensive data. A key requirement is
that the data must include information on all relevant covariates.
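The common-support step in the description above can be sketched with hypothetical, already-estimated propensity scores (the discriminant or logistic model that would produce them is not shown):

```python
# Hypothetical propensity scores (estimated probability of being in the
# treatment group), assumed to come from a model fitted elsewhere.
treat_scores = [0.35, 0.50, 0.62, 0.71, 0.90]
control_scores = [0.05, 0.12, 0.30, 0.41, 0.55, 0.66]

# Common support: the overlap of the two groups' score ranges.
lo = max(min(treat_scores), min(control_scores))
hi = min(max(treat_scores), max(control_scores))

# Only observations whose scores fall inside the overlap are compared.
treat_kept = [s for s in treat_scores if lo <= s <= hi]
control_kept = [s for s in control_scores if lo <= s <= hi]
print(treat_kept, control_kept)
```

Observations outside the overlap have no comparable counterparts in the other group, which is why they are set aside before the propensity score is used as a covariate.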
This work warns that coefficients in regression equations can be highly misleading. Regression
modelling places two demands on the coefficients. They must model within-group relationships
acceptably well, and in addition they must model effects that relate to differences between
groups. Even where the groups are reasonably well matched on relevant variables, the
methodology may not be able to reconcile these perhaps conflicting demands. Where the ranges
of some variables differ widely between the groups, the task is harder still.
Christie, D., Gordon, I., and Heller, R. 1987. Epidemiology: An Introductory Text for Medical
and Other Health Science Students. New South Wales University Press, Kensington NSW,
Australia.
Freedman, D. 1999. From association to causation: some remarks on the history of statistics.
Statistical Science 14: 243-258.
Hisamuchi, S., Fukao, P., Sugawara, N., et al. 1991. Evaluation of mass screening programme
for stomach cancer in Japan. In Miller, A.B., Chamberlain, J., Day, N.E., et al., eds, Cancer
Screening. Cambridge University Press, Cambridge, pp. 357-372.
Neumark, D. and Wascher, W. 1992. Employment effects of minimum and subminimum
wages: panel data on state minimum wage laws. Industrial and Labor Relations Review 46:
55-81. [See also (1993) 47: 487-512 for a critique by Card and Krueger and a reply by
Neumark and Wascher.]
Petitti, D. 1994. Meta-analysis, Decision Analysis and Cost-Effectiveness Analysis. Oxford
University Press.
Rosenbaum, P. and Rubin, D. 1983. The central role of the propensity score in observational
studies for causal effects. Biometrika 70: 41-55.
6. Sample Surveys, Questionnaires and Interviews
It must be stressed that fact-collecting is no substitute for thought and desk research, and that
the comparative ease with which survey techniques can be mastered is all the more reason
why their limitations as well as their capabilities should be understood. Sound judgement in
their use depends on this. It is no good, for instance, blindly applying the formal standardized
methods generally used in survey or market research enquiries to many of the more complex
problems in which sociologists are interested.
[Moser and Kalton 1971, p.3]
Sampling is ubiquitous. A person buying a sack of potatoes will use a small sample
of the potatoes as a basis for assessment of the contents of the sack. Auditors who are
checking for mistakes or fraud will examine a sample of a firm's accounts. This
chapter focuses on sample surveys that use samples to gain information on a human
population.
Important concepts are target population and sampling frame. Probability based
sampling schemes help avoid sampling bias and allow estimates of accuracy. Simple
random sampling is the simplest such scheme. More complex schemes combine
simple random sampling with cluster sampling and/or stratified sampling. Non-
response is the bane of surveys of human populations. It may introduce serious bias.
Human sample surveys typically work with questionnaires. An inappropriate choice
of questions, and/or poor overall design of the questionnaire, can bias responses.
What strategies and checks can ensure that responses do genuinely answer the
questions that were in the researcher's mind?
Qualitative approaches should often complement quantitative approaches.
Qualitative investigation may help indicate what forms of quantitative investigation
may be helpful and useful. It may shed light on what respondents intended by their
answers.
A cook takes a spoonful of soup from the cooking pot to determine whether the amount of salt
is right. From the taste of the spoonful, the cook generalizes to the whole pot of soup. Wine
tasters taste a sample of the wine in a bottle, and on that basis make a judgment about the
whole bottle. Auditors are not able to examine all transactions in the accounts that they
scrutinise. Instead they take a sample of the accounts, and base conclusions on the sample. All
the time we sample.
In an experiment, it may be necessary to take a sample from the experimental unit. If the
experiment is a clinical trial that collects data on how the treatment affects the patient's blood,
any measurement must be made on a sample of the blood! Results from the sample are taken
as indicative of all the blood in the patient's body. In an experiment where trees are the
experimental units, measurements of the amounts of calcium in the apples will be taken on a
sample of the apples.
Survey data are widely used for decision and policy making. Unlike in an experiment, the aim is
not to study the effect of change, but to learn what is! While surveys may sometimes be used to
gather data that will be used to evaluate the effects of contrived change, this is not a necessary
or predominant survey context. Decisions on whether and how to market a new product, on
the effects on government finances of changes in tax rates, or on priorities for new housing
development, may rely crucially on information from surveys.
In this chapter the focus is on studies where samples are used to survey a human population,
typically using questionnaires to elicit information. Many of the points carry over to surveys of
organizations, or of animal or plant populations.
Logistical Issues
The logistics of carrying out the survey must be planned. Will responses be obtained by
interview, post, telephone17, or by some other method? Face to face interviewing can allow
relatively subtle forms of questioning, and can give a good response rate. Effective conduct of
interviews does however require skills that, for most individuals, take time and experience to
develop. With postal and other forms of self-completion questionnaires, some form of
motivation to respond is almost essential. There may be a reward. Even then, it will almost
certainly be necessary to send reminders, or even phone or visit non-respondents, in order to get
a reasonable response rate. Failure to follow up non-respondents can wreck an otherwise well-
conducted survey.
Surveys of official agencies, or of organizations, will require the co-operation of the relevant
officials or managers, people that Lynn (1996) calls 'gatekeepers'. Processes must be followed
that may be specific to each organization. Negotiating a way through these processes can be
frustrating and time-consuming.
Detailed logistics cannot be worked out until sampling design issues are resolved. What are the
different tasks that are involved? Who will perform these various tasks? In a major survey,
there are huge planning demands. For further discussion see e.g. Duoba and Maindonald
(1988), Moser and Kalton (1971).
Sampling Design
1. What is the target population, i.e. the population about which information is required?
2. What is the sampling frame, i.e. the population from which individuals will be sampled?
While this should ideally be the same as the target population, some compromise is usually
necessary. In a simple survey design, this may be a list of names and addresses, or names
and phone numbers.
17
Issues that arise in telephone surveys are discussed in Collins (1999).
3. What method will be used for selecting the sample? Ideally the sample should be chosen
using a probabilistic sampling scheme, of which the simplest is simple random sampling.
Non-probabilistic methods, e.g. including in the sample whoever one can most readily find,
have a serious risk of bias. Self-selected samples, e.g. where people who are interested ring
in to give answers to questions that have appeared in a magazine, may be very seriously
biased.
4. What steps will be taken to ensure high levels of response? What efforts will be made to
follow up non-respondents?
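Simple random sampling from a frame, as mentioned in step 3, takes only a few lines; the frame and sample size here are invented:

```python
import random

# Hypothetical sampling frame: one identifier per listed individual.
frame = [f"person-{i:04d}" for i in range(1, 1001)]

rng = random.Random(7)          # seeded so the draw is reproducible
sample = rng.sample(frame, 50)  # simple random sample, without replacement

print(len(sample), "sampled from a frame of", len(frame))
```

Every subset of 50 individuals is equally likely, which is what makes estimates of sampling error possible.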
18
These are discussed briefly, with references, in Streiner and Norman 1995.
1. Make a draft of the questionnaire. Check that it has a clear coherent structure. Be sure to
include a short preamble that explains the purpose of the questionnaire, what will happen to
the results, what has been done to ensure confidentiality, and so on.
2. Get someone who is experienced with questionnaires to look over it with a critical eye.
Make any necessary revisions.
3. Seek the co-operation of 10-15 potential respondents. Administer the questionnaire verbally.
Note, using the headings in section 6.3, the behaviours that each question elicits. (Behaviour
coding).
4. This is a follow-up to step 3. Once each set of results is complete, ask the respondent to
explain their answers in a sentence or two. (Probing).
Following these steps, there should be a pilot survey. For a large survey, the pilot might be
conducted with as many as 30-100 respondents. After entering the data from the pilot and
carrying out a summary analysis, there should be a review both of the questionnaire and of the
conduct of the survey.
This is a five-point Likert scale. 'Not at all interesting' rates as 1, 'Not very interesting' rates
as 2, and so on. There may be four or five questions that focus around this same theme of
attitudes to science, with high ratings indicating positive attitudes and low ratings indicating
19
For random samples, these are the ranges that are consistent with a 50/50 split, as assessed by a
95% confidence interval for the population proportion. In practice, because simple random sampling
has not been used, and because of non-response bias, these ranges may realistically be much wider
than stated.
negative attitudes to science. A simple way to get an overall 'attitudes to science' score may be
to add the scores from the four or five individual questions20. If this seems inappropriate, one
might use principal components analysis to determine scores21. It may even be possible to
combine results from several themes into a single score.
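Adding the item scores can be sketched as follows. The responses are invented, and the reverse-coding of negatively worded items is a standard refinement not discussed in the text:

```python
# Invented responses of one person to five Likert items scored 1-5.
# Items q2 and q5 are negatively worded, so their scores are reversed.
responses = {"q1": 4, "q2": 2, "q3": 5, "q4": 4, "q5": 1}
negatively_worded = {"q2", "q5"}

def item_score(item, value):
    # Reverse-code on a 1-5 scale: 1 becomes 5, 2 becomes 4, and so on.
    return 6 - value if item in negatively_worded else value

attitude_score = sum(item_score(q, v) for q, v in responses.items())
print(attitude_score)  # between 5 (most negative) and 25 (most positive)
```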
o Using a sample of 3000 from a sampling frame similar to that used by the Literary
Digest, he predicted that the Digest poll would give Roosevelt 44% of the vote!
Even Gallup's sample of 50,000 was enormously larger than polling organizations would use
today. Even in very well conducted sample surveys, non-sampling biases typically become
more important than sampling error once the sample size is more than one or two thousand. In
less well conducted surveys, or where the tradition of experience has been too short to allow the
honing of the methodology, the cross-over point may be a few hundred or less.
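The diminishing returns from larger samples follow from the usual margin-of-error formula for a proportion under simple random sampling, roughly 1.96 times the square root of p(1-p)/n:

```python
import math

def margin_of_error(n, p=0.5):
    # Half-width of an approximate 95% confidence interval for a
    # proportion, assuming simple random sampling.
    return 1.96 * math.sqrt(p * (1 - p) / n)

for n in (100, 1000, 10000, 50000):
    print(n, round(100 * margin_of_error(n), 2), "percentage points")
```

Past a couple of thousand respondents the sampling margin is already near one percentage point, so further gains are swamped by any non-sampling bias.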
20
I am unconvinced by arguments that the ratings are not on an interval scale and should not be added.
What is the alternative? The scores have to be combined somehow, formally or informally. The scale
should have been chosen so that the distance between 'Not at all' and 'Not very' is intuitively similar to
that between 'Somewhat' and 'Not very'. This is not to deny a need for caution.
21
Principal components analysis determines a weighted combination of the scores, designed to
account for as much of the variability as possible in the individual scores.
names on the electoral roll. We will discuss elaborations of this simple scheme in the next
section.
Non-sampling Errors
Non-response is one of several types of non-sampling error. Other non-sampling errors may
arise because the questions have been misunderstood, or have been interpreted differently from
the way that the survey planners intended. The next section will examine implications for
questionnaire design. Comments in Moser & Kalton (1971, p.482) are apt:
There is incongruity in the present position. One part of the survey process (the sampling)
is tackled by a tool of high precision that makes accurate estimates of errors possible, while
in the other parts errors of generally unknown proportions subsist. This incongruity has a
double implication. It means, first of all, that the survey designer is only partly able to plan
towards his goal of getting the maximum precision for a given outlay of money, since the
errors (and even costs) associated with the various non-sampling phases cannot be
satisfactorily estimated in advance. And secondly, so long as these errors cannot be
properly estimated from the results of a survey, the practitioner is in a position to give his
client an estimate of the sampling error only, not of the total of all kinds of error. This is a
weakness, and there is here a field of fertile research for students of research methodology.
... The operation of memory errors, the kinds of errors introduced in informal as opposed to
formal interviewing, the effects of length of questionnaire on errors, the errors associated
with different kinds of question, the influence of interviewer selection, training and
supervision, the errors introduced in coding and tabulation --- these are but a few of the
many fields in which ... there remains scope for research.
In a carefully conducted mail survey, there will be a second mail-out that will seek a response
from those who did not respond to the first mail-out. In telephone surveys, it will often be
necessary to make several calls in order to contact some of those in the sample. Respondents
should then be classified according to the ease with which it was possible to contact them, and
the response compared. If differences are greater than can be accounted for by statistical error, this will suggest that non-respondents may be even more different.
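This comparison can be sketched with a simple two-proportion z statistic. The counts below are invented for illustration, and `two_prop_z` is just a convenience name, not a function from any library:

```python
from math import sqrt

def two_prop_z(x1, n1, x2, n2):
    """z statistic comparing two sample proportions x1/n1 and x2/n2,
    with the standard error based on the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Invented counts: 120 of 300 easy-to-contact respondents answer "yes",
# against 30 of 100 respondents who needed repeated follow-up.
z = two_prop_z(120, 300, 30, 100)
print(round(z, 2))  # 1.79
```

A z value well beyond about 2 in magnitude would suggest that easy and hard-to-contact respondents really do differ, and hence that non-respondents may differ further still.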
Quota Sampling
Many commercial market research organizations use this as their preferred method. Its
principal advantage is reduced cost, though technological change may now be changing the
relative costs. There are serious, and usually unknown, risks of bias. Quota sampling is not
usually carried out in a manner that allows a realistic estimate of error from any individual
sample. This may perhaps be acceptable where the aim is to get ballpark indications only.
There are mechanisms that may help calibrate results from quota sampling. Error may be
estimated by examining the results of repeated quota samples. Bias can be estimated by making
occasional comparisons with a probabilistic sample that is conducted in parallel.
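As a minimal sketch of the repeated-samples idea (the five estimates are made-up numbers, not real survey results), the spread across repeated quota samples gives an empirical standard error:

```python
from statistics import mean, stdev

# Invented proportions estimated from five repeated quota samples
# of the same population.
estimates = [0.52, 0.48, 0.50, 0.55, 0.47]

centre = mean(estimates)          # combined estimate
empirical_se = stdev(estimates)   # spread across the repeated samples
print(round(centre, 3), round(empirical_se, 3))  # 0.504 0.032
```

Note that this measures the repeatability of the quota procedure, not its bias; bias still has to be assessed against a parallel probabilistic sample.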
Question: Do quota sampling and other non-probabilistic sampling methods have a role? If so, when
are they appropriate?
Self-selected Samples
For example, readers of a magazine may be asked to write in and give their opinion. These are
the most hazardous of all.
Question: What other planned ways are there to collect data, apart from experiments, sample surveys,
longitudinal studies, and case-control studies? What are the different challenges of these other
approaches for the statistical analyst?
target population
sample frame
non-response
In addition we introduced the idea of simple random sampling. An example was the choice of
names at random from an electoral roll.
Multi-stage sampling
Cluster sampling can be mixed with stratified sampling to give stratified cluster sampling. Instead of using simple random sampling within each stratum, one uses cluster sampling. More generally, the sampling procedure may be multi-layered, leading to multi-stage sampling. At each stage the method used may be stratified random sampling, or cluster sampling, or a mixture of the two.
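The sampling schemes just described can be sketched in a few lines of Python, using the standard library's random module. The frame, the stratum names and the sample sizes below are all invented for illustration:

```python
import random

def simple_random_sample(frame, n, seed=0):
    """Draw a simple random sample of n units from a sampling frame."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def stratified_sample(strata, n_per_stratum, seed=0):
    """Draw a simple random sample within each stratum.

    strata: dict mapping stratum name -> list of units in that stratum.
    n_per_stratum: dict mapping stratum name -> sample size.
    """
    rng = random.Random(seed)
    return {name: rng.sample(units, n_per_stratum[name])
            for name, units in strata.items()}

# A toy frame: an "electoral roll" split into two regions (strata).
frame = {"urban": [f"u{i}" for i in range(1000)],
         "rural": [f"r{i}" for i in range(500)]}
sample = stratified_sample(frame, {"urban": 20, "rural": 10})
print(len(sample["urban"]), len(sample["rural"]))  # 20 10
```

A multi-stage design would apply the same idea recursively: sample clusters within each stratum, then sample units within each selected cluster.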
Behaviour Coding
Where an interviewer administers the questionnaire, coding of respondent behaviour may be
used to identify actual or potential problems. The problem behaviour code categories used in
Oksenberg et al. (1991) were
1. Respondent interrupts initial question-reading with answer.
4. Answer is inadequate.
Probing
Respondents answer the question as they understand it. This may differ from what the
researcher intended. So a facet of the pre-testing is to follow administration of the questionnaire
with probing designed to discover how the respondent understood the question. Oksenberg et
al.(1991) quote as an example:
During the past twelve months, that is, since January 1 1987, about how many days did illness or injury keep you in bed for more than half the day?
Most respondents took this to mean not getting up in the morning and staying in bed till about
noon or later. Others had in mind lengths of time, as little as 2-4 hours or as much as 12 or
more hours. Another issue was whether staying in bed because they felt they were coming
down with something would count as illness. About two thirds would have included this, while
the other third would not.
9. Recall/response is impossible
How many kilometres did you drive in the last year?
[Few people will know this.]
17. Awkward syntax (an especial problem when an interviewer has to read the question out).
The Department of Social Security has information in its files about census items like date of birth and sex for nearly everyone. Would you favour or oppose giving this information to the Bureau of Statistics for use in the Census?
[You surely don't mean "sex for nearly everyone". You mean that the DSS holds information on everyone's date of birth and sex.]
"Cognitive laboratory methods" is a collective name for methods that try to tease out the thought processes that led to a particular response (Forsyth and Lessler 1991).
Content Validity
Do the statistical data connect strongly with the problem in which we are interested? Issues
of content validity arise with particular force in psychometric testing. Do IQ tests really
measure intelligence? Perhaps, if we knew what intelligence was, we could say. Note
Nunnally's (1978, p. 94) comment:
In spite of some efforts to settle every issue about psychological measurement by a flight
into statistics, content validity is mainly settled in other ways. Although helpful hints are
obtained from analyses of statistical findings, content validity primarily rests upon an appeal
to the propriety of content and the way that it is presented.
Even apparently hard factual questions may measure something different from what we think
they measure. Questions about sexual and other practices where there are strong social
constraints are particularly difficult.
Face validity
Broadly, this has to do with the extent to which those who work in the area find the measure a
credible instrument for its claimed purpose. Of course, researchers may be wrong.
There will now be studies that will allow estimation of the distribution of any person-specific
bias. If the person-specific bias proves substantial, this will seriously undermine the use of the
food frequency questionnaire as a measuring instrument in studies where relatively fine
discrimination is required.
Sample Surveys
Biemer, P. B., Groves, R. M., Lyberg, L. E., Mathiowetz, N. A. and Sudman, S. (eds.) 1991.
Measurement Error in Surveys. Wiley, New York.
Collins, M. 1999. Editorial: Sampling for UK telephone surveys. Journal of the Royal
Statistical Society A, 162: 1-4.
Gallup, G. [1972] rev. 1976. The Sophisticated Poll Watcher's Guide. Princeton Opinion
Press.
Lynn, P. 1996. Sampling in human studies. In Greenfield, T., ed.: Research Methods.
Guidance for Postgraduates, chapter 17.
Moser, C. A. and Kalton, G., 2nd edn. 1971. Survey Methods in Social Investigation.
Heinemann Educational Books, London.
Questionnaire Design
Forsyth, B. H. and Lessler, J. T. 1991. Cognitive laboratory methods: A taxonomy. In Biemer, Groves, Lyberg, Mathiowetz and Sudman (eds.), Measurement Error in Surveys, pp. 393-418. Wiley, New York.
Judd, C. M., Smith, E. R. and Kidder, L. H. 1991. Measurement: From abstract concepts to concrete representations. In Research Methods in Social Relations (sixth edition), Holt, Rinehart and Winston, 42-61. (See also Maximising Construct Validity, pp. 30-32.)
Nunnally, J. C., 2nd edn 1978. Psychometric Theory. McGraw-Hill, New York.
Oksenberg, L., Cannell, C., and Kalton, G. 1991. New strategies for pretesting survey
questions. Journal of Official Statistics (Statistics Sweden) 7: 349-365.
Presser, S. and Blair, J. 1994. Survey pretesting: Do different methods produce different
results? Sociological Methodology 24: 73-104.
Streiner, D. L. and Norman, G. R., 2nd edn., 1995. Health measurement scales: a practical
guide to their development and use. Oxford University Press.
Qualitative Research
Britten, N., Jones, R., Murphy, E. and Stacy, R. 1995. Qualitative research methods in general practice and primary care. Family Practice 12: 104-114.
Greenhalgh, T. 1997. How to read a paper: the basics of evidence-based medicine. BMJ,
London.
[See the chapter on Qualitative Research.]
Kuzel, A. J. 1992. Sampling in qualitative inquiry. In Crabtree, B. K. and Miller, W. L. (eds.), Doing Qualitative Research (Vol. 3, pp. 31-44). Sage Publications, Newbury Park.
7 Sample Size Calculations
Sample size calculations may be needed for many different types of study.
Researchers should know roughly what precision they can expect from their study.
How large a difference between treatments is detectable?
Sample size issues should be considered alongside, and be subordinate to, sample
structure issues. It is good design that is needed, not necessarily a large sample size.
In a randomised controlled trial with control and treatment groups, a decision is needed on how
many will be in the control group, and how many in the treatment group. Or if there is a
limitation on the available numbers in the two groups, the researchers will want to know the
implications for the accuracy of the result. A sample size calculation relies on various
assumptions. A provisional model is needed for the data. Often it is possible to make a stab at
the information that is needed. Where research breaks totally new ground, getting a good
guesstimate may be more difficult.
3. Each new study should be seen as part of a total learning process. The key issue for the
researcher is how the new study can best contribute to the total learning process, given the
state of existing knowledge.
4. In highly exploratory studies, the effort put into trying to get high precision may be largely
wasted. The initial study will often provide information that calls for substantial modification
of the initial design. Such studies have the character of pilot studies. The priority should
often be the collection of information that will assist in the design of later studies, rather than
high precision.
5. Sample size calculations have received a huge amount of attention in such studies as medical
case-control studies and clinical trials. Here, generally, they do have a useful role. However
as with other studies, sample size is only one of a number of important design issues. It has
too often been treated as the one issue of major importance.
6. If a study is to stand on its own, then sample size is highly important. If it is one in a series
of studies that will finally be analysed together, then the sample size in that individual study
may have more limited consequence.
7. There is an urgent need for mechanisms that will foster co-operation between different
researchers who are working on similar questions, so that their work meets similar standards
and can finally be evaluated in a single overall analysis. Questions of sample size in
individual studies should be addressed in this wider context. This is a particular issue for
clinical trials.
8. The aim should be accurate estimation of variability, and ensuring there are enough degrees
of freedom to do this, rather than replication as such.
9. Once the study has been conducted, the initial sample size calculation has no relevance. The
analysis will provide information on the accuracy of the estimated effects, and it is this that is
of interest.
Strong assumptions may underlie sample size calculations. If the assumptions are not satisfied,
then the answer may be seriously astray. If the same faulty assumptions underpin the eventual
analysis, that will be wrong also.
Sample Structure
We have so far assumed that it is obvious what the sample units are. It is not always that
simple. Suppose that two different devices for measuring fruit firmness are to be compared. A
sample of fruit will be taken, with half then randomly assigned to one instrument and half to the
other. The instrument used for any particular fruit will make two measurements. Note that
even though two measurements are made on each fruit, it is the number of fruit that is crucial.
The experimental unit is an individual fruit. (A better design would of course be one where
each instrument makes one or more assessments on the same fruit.)
If the aim is to generalise results of an ecological investigation to a wide range of sites, then it is
necessary to have an adequate sample of sites. The investigation of large numbers of apples
from one tree does not allow generalisation to multiple trees.
n = ((tα + tβ) SDD / δ)²
where δ is the smallest difference that it is desired to detect, and SDD is the standard deviation of the difference for a sample of one in each group. For a test at the 5% level, when n is large, tα = 1.96. If all that is required is a 50:50 chance of finding a difference, then tβ = 0. For 80% power when n is large, tβ = 0.84, while for 90% power tβ = 1.28. The formula applies also to confidence intervals, where now the "power" is the desired probability that the confidence interval for the difference of interest will have a half-width of less than δ.
Here are some special cases:
1. For a one-sample t-test, SDD is the standard deviation s. Thus for matched samples, s is the standard deviation of differences between sample pairs and n is the number of sample pairs.
2. For a two-sample t-test with s the pooled standard deviation, SDD = √2 s.
3. For a two-sample t-test with different standard deviations s1 and s2 for the two samples, SDD = √(s1² + s2²).
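A minimal numerical sketch of the formula n = ((tα + tβ)·SDD/δ)² and of the two-sample special case. The values s = 10 and δ = 5 are arbitrary illustrations; tα = 1.96 and tβ = 0.84 are the large-n values quoted above for a 5% test with 80% power:

```python
from math import ceil, sqrt

def sample_size(delta, sd_diff, t_alpha=1.96, t_beta=0.84):
    """Per-group sample size: n = ((t_alpha + t_beta) * SD_D / delta)**2,
    rounded up to the next whole number."""
    return ceil(((t_alpha + t_beta) * sd_diff / delta) ** 2)

# Two-sample comparison (special case 2): pooled SD s = 10, so the SD of
# a difference for one unit per group is sqrt(2) * s; we want to detect
# a difference of delta = 5 with 80% power at the 5% level.
s = 10.0
n = sample_size(delta=5.0, sd_diff=sqrt(2) * s)
print(n)  # 63 per group
```

Setting t_beta=0.0 (a 50:50 chance of detection) drops the requirement to 31 per group, which shows how much of the sample size is bought by power rather than by the significance level.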
Clustering Effects
The formulae we have given assume that there is no clustering. In situations where there is
clustering, the between cluster variance will often dominate the variance of estimated totals or
means or differences of means. The number of clusters, not the total number of individuals,
may be crucial. This is equivalent to the insight, in the experimental design context, that the
number of experimental units is crucial.
A simple case, which however illustrates the general principle, arises when all clusters are the same size m. The variance (or its estimate) can be partitioned into a within-cluster component sw², i.e. between individuals in the same cluster, and a between-cluster component sb², i.e. between individuals in different clusters. Then the variance of the mean of a sample of size m from a randomly chosen cluster is sb² + sw²/m. For a given cluster size m, one can estimate SD² = sb² + sw²/m, and use the square root of this, multiplied by √2 if the interest is in differences, as a standard deviation to plug into the formulae given above. The formula will give the number of clusters that are required.
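A small numerical sketch, with invented variance components, cluster size and target difference:

```python
from math import ceil, sqrt

def clusters_needed(delta, s_b, s_w, m, t_alpha=1.96, t_beta=0.84,
                    differences=True):
    """Number of clusters needed when the variance of a cluster mean is
    s_b**2 + s_w**2 / m (clusters all of size m).  For a difference of
    two means the relevant SD is multiplied by sqrt(2)."""
    sd = sqrt(s_b ** 2 + s_w ** 2 / m)
    if differences:
        sd *= sqrt(2)
    return ceil(((t_alpha + t_beta) * sd / delta) ** 2)

# Invented values: between-cluster SD 2, within-cluster SD 6, clusters
# of size 10, and a difference of 3 to be detected with 80% power.
print(clusters_needed(delta=3.0, s_b=2.0, s_w=6.0, m=10))  # 14
```

Because sb² does not shrink as m grows, increasing the cluster size beyond a point buys very little: it is the number of clusters that matters.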
Detecting Change
Given a statistic and a standard error estimate for it, one can adapt the above methodology.
Thus in a straight line regression calculation that assumes independent and identically distributed errors with variance that we estimate to be s², the variance (= SE²) of the slope estimate is:
SE² = s² / (n sx²), where we define sx² = Σ(xᵢ − x̄)² / n
Substituting this standard error into the sample size formula gives
n = ((tα + tβ) s / (δ sx))²
The minimum detectable effect size is
δ = (tα + tβ) s / (sx √n)
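A sketch of the minimum detectable slope δ = (tα + tβ)s/(sx√n), with invented values for s, sx and n:

```python
from math import sqrt

def min_detectable_slope(s, s_x, n, t_alpha=1.96, t_beta=0.84):
    """Minimum detectable slope:
    delta = (t_alpha + t_beta) * s / (s_x * sqrt(n))"""
    return (t_alpha + t_beta) * s / (s_x * sqrt(n))

# Invented values: residual SD s = 2, SD of the x-values s_x = 3,
# n = 25 points, 5% test with 80% power.
delta = min_detectable_slope(s=2.0, s_x=3.0, n=25)
print(round(delta, 3))  # 0.373
```

Note that δ shrinks with sx as well as with √n: spreading the x-values out is as effective a design tool as collecting more points.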
200. These are the numbers that are typically required to distinguish clinically significant
effects. While smaller trials may sometimes be useful, they risk capturing the characteristics of
an idiosyncratic subgroup of patients.
Johnson's advice is specific to clinical trials, and perhaps to psychiatric trials. In other areas,
there will be different norms.
8 The Rationale of Research
The aim of science is to seek the simplest explanation of complex facts . . . seek simplicity and
distrust it.
[A. N. Whitehead]
Both scepticism and wonder are skills that need honing and practice. Their harmonious
marriage within the mind of every schoolchild ought to be a principal goal of public
education.
[Sagan 1997, p. 289.]
Any adequate account of the scientific method must allow for the exercise of
imaginative insight. It must also place checks on the unconstrained use of the
imagination. There must be a mechanism for distinguishing claims that can be
substantiated from claims that cannot be substantiated.
It must allow a role both for data and for theory. Any collection of data pre-supposes
some notion that these particular data are likely to be interesting and useful. In this
sense, science is driven by theory. It is the genius of science that data may challenge
and even destroy the theory that guided their collection. This is the means by which
science places a check on unbridled exercise of the imagination.
Theory works with models. Our special interest is in statistical models. A good
model captures those aspects of a phenomenon that are relevant for the purpose in
hand. A model is, inevitably, an incomplete account of the phenomenon. The reward
for simplifying by ignoring what is irrelevant for present purposes is that the model is tractable: we can use it to make predictions.
I use the word "science" in a broad sense, not much different from the word "knowledge".
Scientific research is directed to gaining new knowledge.
need to check further. Previously inexplicable facts now make perfect sense. Even here one
has to proceed with caution, keeping in mind our capacity for mistake and self-deception, and
our proneness to jump to conclusions. Scepticism, directed at current assumptions as well as at
any new theory, must be the order of the day. There are many case-histories that demonstrate
the need for caution. An example is the claimed link between salt and hypertension that we
discussed in Section 3.1.
There are by contrast well-known instances where the scientific community refused to take
seriously, on the grounds that there was no mechanism, an idea that had strong empirical
support. Or important and significant results may be dismissed out of hand. The examples that
follow illustrate, in turn, these two possibilities.
Continental drift
My discussion pretty much follows the very readable account in Hallam (1989).
Wegener (1880-1930) presented a range of evidence in support of his theory that the present
continental land masses had formed from the splitting apart of older continental masses. He
pointed out that the Western coast of Europe and Africa fits fairly well the contours of the
Eastern seaboard of the Americas. He argued that former land bridges between continents
explained important features of the present distribution of fauna and flora. But geologists had a
long tradition of mechanistic explanation. Prominent and influential figures denounced
Wegener's ideas, creating an intellectual climate where any young and bold spirit who took up
these ideas thereby placed their career at risk.
Biologists were more sympathetic. They had rarely been lucky enough to find detailed
mechanisms for the phenomena that they studied, and were more willing to live with the idea
that an understanding of mechanisms would have to come later. At the same time, they
respected the prevailing judgment of geologists that such splitting and moving of land masses
was impossible. The opposition to Wegener's ideas remained strong into the 1950s.
The highly respected geophysicist and mathematician Harold Jeffreys (1891-1989) was
especially vocal in his opposition to Wegener's ideas.
A further impossible hypothesis has often been associated with hypotheses of
continental drift and with other geological hypotheses based on the earth as devoid of
strength. . . . In Wegener's theory, for instance . . . the assumption that the earth can
be deformed indefinitely by small forces, provided only that they act long enough, is
therefore a very dangerous one, and liable to lead to serious error.
[Jeffreys 1926, p.261]
A group of younger researchers who revived Wegener's ideas, still without much idea of the mechanism involved, thereby risked their careers. One of those younger researchers, Edward Irving, took a position at the Australian National University. Australia provided, at that time,
more fertile ground for his ideas. Far from leading geologists into serious error, the theory has
been the point of departure for huge advances in the understanding of earth history. It is a
cornerstone in a unified framework for the interpretation of data from biogeography, geophysics
and geology.
paper ever to appear in that journal (Clark 1995, p. 42). It marked the beginning of fundamental
discoveries regarding the immune system.
There are many reasons why a good idea may be slow to gain acceptance. The forces of
conservatism can act just as strongly in scientific communities as in other communities. The
word of one dominating and influential figure may be enough to prevent a hearing: "How dare you challenge my authority?" While it is the force of the argument that should prevail, not the
pronouncements of elder statesmen, this may not be what happens.
Data
Data are crucial to science. Up until the 20th century a prevailing view was that science was
generalisation from data. The name given to this process of generalisation is induction, which
contrasts with deduction as used in mathematics and logic.
The view of science that emphasised induction and generalisation from data was strongly
influenced by Francis Bacon, who in 1620 published a book that argued for a new method of
research that, as he claimed, gave "True Directions Concerning the Interpretation of Nature". In Bacon's improved plan of discovery, laws were to be derived from collections of observations (Silverman 1985).
Theory
Scientists do not collect any old data. They collect the data that seem most useful. How do
they get this sense that some data will be helpful, and other data of little use? For example a
study of the effects of passive smoking is likely to look for specific effects, most likely effects
that are known to be a result of active smoking. One would not expect to find that passive
smokers have an unusually high number of ingrown toenails! So we will not waste effort on
gathering data on ingrown toenails. We will examine the occurrence of lung cancer, bronchitis,
heart disease, and so on, but not ingrown toenails. There's no theory to suggest that smoking of
any kind might cause ingrown toenails.
For studying the health of children living in some area of New Guinea, one might collect data on
age, sex, height and weight. Hair colour and eye colour are unlikely to be of interest, for this
purpose. It seems obvious that height and weight are important indicators, but that hair and eye
colour are unlikely to be relevant. It is assumed that some measures are useful and some are
not. There is an extensive literature that provides guidance on what measures other workers
have found useful, which sets out "theory" that anyone who now undertakes collection of data on the health status of one or other human group will want to note.22 Those who initiated work
in this area had to make their own judgments on measures that seemed useful indicators of
health status.
22. See for example chapters 7 and 8 in Little and Haas (1989).
Any adequate understanding of science must have regard both to theory and to data.
Researchers do not collect just any data. Data collection is driven by a judgement of what is worth
collecting. It is in this sense that theory drives scientific research. None of the great scientists
have followed Bacon's prescription. Typically they showed unusual insight, aided sometimes
by good fortune, in the data that they collected.
Data may carry within themselves the power to challenge and perhaps destroy the theory that
guided their collection. It is this that gives science its power. Statistical insights and approaches
have a key role both in data collection and the extraction of information from data. They assist
in the efficient choice of data, in teasing out pattern from the data, and in distinguishing genuine
pattern from random variation. The pattern may be as simple as a difference between the
means of two treatment groups, or a linear relationship between two variables.
This is a convenient place to introduce the idea of a model. This is an important idea, both in
science generally and in statistics.
8.3 Models
Consider the formula for the distance that a falling object, starting at rest above the earth's surface, moves under gravity in some stated time. The formula is:
d = ½ g t²
where t is the time in seconds, g (≈ 9.8 m/s²) is the acceleration due to gravity, and d is the distance in metres. Thus a freely falling object will fall 4.9 metres in the first second, 19.6 metres in the first two seconds, and so on. This formula describes the way that objects fall.
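The formula is easy to check numerically; this brief sketch simply evaluates d = ½gt²:

```python
g = 9.8  # acceleration due to gravity, m/s^2

def distance_fallen(t):
    """Distance in metres fallen from rest after t seconds: d = 0.5*g*t**2."""
    return 0.5 * g * t ** 2

print(distance_fallen(1))  # 4.9
print(distance_fallen(2))  # 19.6
```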
Observing the fall of a stone (especially if you happen to be underneath) is a different
experience from encountering the formula on a piece of paper. There are important aspects of
the fall about which the formula tells us nothing. It gives no indication of the likely damage if
the stone were to strike one's foot. The formula can tell us only about the distance traversed in
a given time, and other information that we can deduce from distance information.
Watching the stone fall and making measurements is different from doing calculations using the
formula. The results will not be quite identical, if only because of the limits of accuracy of the
measurements. The formula is a model, not the real thing. It is not totally accurate: it neglects the effects of air resistance. For the limited purpose of giving information about distance fallen it is, though, a pretty good formula. As Clarke (1968) says: "Models and hypotheses succeed in simplifying complex situations by ignoring information outside their frame and by accurate generalization within it."
A good model captures those aspects of a phenomenon that are relevant for the purpose in
hand. A model is, inevitably, an incomplete account of the phenomenon. The reward for
simplifying by ignoring what is irrelevant for present purposes is that the model is tractable: we can use it to make predictions.
There are also non-mathematical models. An engineer may build a scale model of a bridge or a
building that is to be constructed. Medical researchers may speak of using some aspect of
mouse physiology as a model for human physiology. The hope is that results from experiments
in the mouse will give a good idea of what to expect in humans. As those who know the history
of such research understand all too well, animal medical models can be misleading. At best,
they provide clues that must be tested out in direct investigation with human subjects.
The model captures important features of the object that it represents, enough features to be
useful for the purpose in hand. An engineer can use a scale model of a building to show its
visual appearance. The scale model might be useful for checking the routing of the plumbing.
The model will be almost useless for assessing the acoustics of seminar rooms that are included
in the building.
23. Data are from Stewart, K.M., Van Toor, R.F. and Crosbie, S.F. 1988. Control of grass grub
(Coleoptera: Scarabaeidae) with rollers of different design. N.Z. Journal of Experimental Agriculture
16: 141-150.
We might expect that depression would be proportional to roller weight. That is the signal part.
The values for Depression/Weight make it clear that this is not the whole story. Rather, we
have
Depression = b Weight + Noise
Here b is a constant, which we do not know but can try to estimate. The Noise is different for
each different part of the lawn. If there were no noise, all the points would lie exactly on a line,
and we would know the line exactly. In Fig. 4 the points clearly do not lie on a line. We
therefore explain deviations from the line as random noise, at least until some more insightful
explanation becomes available.
[Fig. 4: Depression in lawn (mm) plotted against weight of roller (tonnes), with an example of a possible line drawn through the points.]
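The slope b can be estimated by least squares. The weights and depressions below are invented illustrative numbers, not the Stewart et al. (1988) measurements; for a line through the origin the least-squares estimate is b̂ = Σxy / Σx²:

```python
# Least-squares estimate of b in: Depression = b * Weight + Noise,
# a line through the origin.  These data are made up for illustration.
weights = [2.0, 4.0, 6.0, 8.0, 10.0]         # tonnes
depressions = [5.0, 11.0, 13.0, 22.0, 24.0]  # mm

# For a line through the origin, b_hat = sum(x*y) / sum(x*x).
b_hat = (sum(x * y for x, y in zip(weights, depressions))
         / sum(x * x for x in weights))

# The residuals are the "noise": what the straight line leaves unexplained.
residuals = [y - b_hat * x for x, y in zip(weights, depressions)]
print(round(b_hat, 3))  # 2.491
```

The least-squares choice of b makes the residuals orthogonal to the weights, which is exactly the sense in which the line extracts the signal and leaves the noise behind.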
We need a model for the noise also. We'll leave the details till later. Anyone who has done a first year course in statistics will expect to hear words such as "normal" and "independently distributed" used to describe the noise components. For now, let's call it a random term without spelling out the details.
It is a feature of statistical models that they have a signal component and a noise component. In
some data the signal is strong and the noise small. In other data noise may dominate the signal.
Fig. 5 illustrates the range of possibilities:
Fig. 5: Different positions along the horizontal axis correspond to
different mixes of signal and noise. At the left extreme, there is
only signal, while at the right extreme there is nothing except noise.
Statistical models lie somewhere between these extremes.
We would prefer to get rid of the noise altogether. That is not a totally silly idea. While we
cannot get rid of the noise altogether, we may be able to reduce it. There are several ways in
which we might be able to do this:
1. By using more accurate measuring equipment.
2. By improving the design of the data collection.
A skilled experimenter will get as near as is reasonably possible to the extreme left in Fig. 5.
That is where every experimenter would like to be.
Question: In the lawn roller experiment, how might one reduce the noise, i.e. reduce the scatter
about the line or other response curve?
24. It almost survives it. Later work found small anomalies in the orbit of the planet Mercury. Einstein's theory of relativity is required to give a completely accurate description of the orbit of Mercury.
2. Vegetation Effects: What is the extent of continuing damage from new clearing of
vegetation? What is the potential remediation role of new tree plantings? Is it possible
to find tree species that will grow and survive in saline soil?
3. Irrigation practices: How much of the problem is the result of past and current
irrigation practices? How might changes in irrigation practices assist remediation? How
effective (and cost-effective) would it be to use bores to replace the use of water from
irrigation channels?
5. Engineering of irrigation channels: What effects (e.g. damage to adjacent roads from the
build-up of salt in the soil and/or from waterlogging) arise from loss of water from
irrigation channels? What engineering solutions (e.g. better lining of channels) are
available?
25. The Australian Government web site http://www.ndsp.gov.au is devoted to issues of dryland salinity.
6. Land use strategies: What changes in patterns of land use might assist remediation? The replacement of agriculture by forestry can be highly effective. Crops that do not require heavy irrigation are preferable.
7. Flow-on effects: How much of the problem in one or another area is the result of
practices in other areas, perhaps more elevated or perhaps upstream?
8. Ecology: What are the effects on fauna and flora? How would alternative remediation
strategies affect fauna and flora?
9. Social issues: What steps will ensure that remediation measures do not unduly
disadvantage individual communities?
Also open to scientific study are political and economic consequences, flowing both from the
present degradation of land and from proposed remedies.
There must be strategies for gathering whatever information is needed under each of these
headings, and for creating from them an integrated plan of understanding and action. Questions
worth considering are:
1. Are there changes that would be easy and cheap, and that would make substantial
inroads on the problem?
2. What changes, ignoring for the moment their costs, would make the largest inroads?
Questions: Why is it hard to get action on the degradation of Australian land that is a result of
salinity? Are there no good strategies? Or is the problem an inability to implement the strategies that
are available? Is the needed co-operative action too difficult for our society's political and economic
structures?
Human impacts on climate change are a serious issue for our time. For science it is a huge
problem of the "how do we put the pieces back together" type. Many different sources of
information and evidence must come together. Computer modelling seems the only viable
approach.
Increased atmospheric levels of carbon dioxide and other implicated greenhouse gases26
increase the effectiveness of the earth's atmosphere as a heat shield. Much of the focus has
been on increases in carbon dioxide levels that have resulted from increased use of fossil fuels.
A 0.5°C average global increase in the temperature of the earth over the past century seems in
part due to this and other human activities. Schneider (1996) reports an assessment of tree-ring
and other evidence for temperature change in the past ten thousand years that suggests that such
a large 100-year change has been unusual over this time, occurring no more than once in a
thousand years. See also Crowley (2000).
Projections drawn up by the Intergovernmental Panel on Climate Change predict an average
global warming of between 1.0°C and 3.5°C over the next century, a greater rate of climate
change than at any earlier time in the past 10,000 years. Predictions are that sea levels will rise,
some low-lying areas will be covered by sea, there will be loss of vegetation, farmers may need
to change to new crops that are viable in the new climatic conditions, weather patterns will be
less stable, and tropical diseases will affect many sub-tropical regions.
How were these figures obtained? It is not sensible to try to project current temperature trends
into the future. The world's climate has changed continuously over time, making short-term
trends a poor guide to what may happen in the future. Rather the evidence comes from
computer modelling, including modelling of the effect of projected ongoing emissions of
greenhouse gases in the atmosphere. The predictions from this modelling are unequivocal:
present rates of release of CO2 into the earth's atmosphere will lead to a temperature increase.
If these rates continue to increase at about 1.5% per annum as in the recent past, the
temperature increase over the next 100 years will be correspondingly larger.
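The arithmetic of a steady 1.5% per annum increase is easily sketched. The short calculation below is illustrative only; the 1.5% figure is simply the recent growth rate quoted above.

```python
def growth_factor(annual_rate, years):
    # Compound growth: each year's rate of release is (1 + annual_rate)
    # times the previous year's rate.
    return (1 + annual_rate) ** years

# At 1.5% per annum, the rate of CO2 release after 100 years is
# roughly 4.4 times the present rate.
print(round(growth_factor(0.015, 100), 2))  # → 4.43
```

This is the sense in which continued growth in emissions makes the projected temperature increase "correspondingly larger": the rate of release itself more than quadruples over the century.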
Atmospheric and ocean currents are the moving parts of a huge engine that is driven by the
sun's heat. The blanketing effect of the atmosphere, itself affected by life processes on land and
in the sea and by human activities that include the use of fossil fuel, is a part of the engine's
control mechanisms. Understanding of the functioning of the individual components seems
adequate for the building of computer models that make gross predictions, always assuming that
ocean (and air) currents continue to follow pretty much their current patterns of movement. A
worrying aspect of potential large temperature changes is that they may cause the engine to
reconfigure itself. Changes in the flow of major ocean currents, such as have happened in past
geological times, would bring changes in climate patterns that would be even more traumatic.
Computer models must accommodate, as best they can, all these different effects. Statistical
methodology has a clear role in checking the predictions of individual components against
experimental and observational data. Checks that model predictions over several years for
different regions of the earth's surface agree with observation are encouraging, but not clinching
evidence. By the time that clinching evidence of the accuracy of model predictions is available,
the damage will be irreversible. Hence the importance of close critical scrutiny of the separate
components of the models, of the way that those components are linked and of sensitivity
analyses that check how predictions would change if there were changes to those model
assumptions that are open to challenge.
Scientists from many different disciplinary backgrounds have critically scrutinised the computer
models. There has been extensive refinement of the details. Qualitative model predictions have
withstood these criticisms remarkably well. The most persistent criticism has come from those
with a political axe to grind, usually in defence of inaction! Such critics have the option, and the
26. Other gases that are implicated are methane, nitrous oxide and hydrofluorocarbons.
challenge, to build and offer for scientific scrutiny models that give predictions that are more to
their taste.
[Scientific theories] . . . are constructed specifically to be blown apart if proved wrong, and if
so destined, the sooner the better. "Make your mistakes quickly" is a rule in the practice of
science. I grant that scientists often fall in love with their own constructions. I know, I have.
They may spend a lifetime vainly trying to shore them up. A few squander their prestige and
academic political capital in the effort. In that case, as the economist Paul Samuelson once
quipped, "funeral by funeral, theory advances."
[Wilson, E.O., 1998, p.56]
Humans are not inherently rational creatures. Much of what passes for reasoned argument is
rationalisation: the use of reason to defend positions that we hold for other reasons. An
attitude of mind that judiciously balances openness to new ideas with rigorous critical scrutiny
does not come easily to our human nature. Prejudice readily takes precedence over the
demands of rationality. Scientists are not inherently different from other humans who are prey
to idiosyncratic belief systems and spurious claims. Gilovich (1991) is one of many books
devoted to the discussion of our irrational foibles.
Fallible Scientists
Scientists are not immune from the tendency to rationalise. Thus craniology, the measurement
of brain capacity (often with the aim of relating brain capacity to racial differences), became
a popular subject of study in the nineteenth century. Not surprisingly, much of this work
collected and used data in ways that reflected the racial and sexual prejudices of the scientists
who undertook it. Gould (1996), in a highly readable book, discusses this and other similar
examples. Fortunately the processes of scientific criticism and re-evaluation do in the course of
time tend to expose and correct such abuse. (Gould's book has itself attracted accusations of
bias from academic critics.)
Still today, rationalisation and prejudice compromise science. New prejudices and new
rationalisations have arisen to replace those that we hoped to have conquered. Such
rationalisations find it especially easy to establish and retain a foothold in those areas where
there is a dearth of external checks on the exercise of imaginative reconstruction. Dogma easily
masquerades as science.
Researchers may become more concerned about maintaining their funding or their position
within the profession than about truth. Science easily degenerates, in some times and some
corners, into pseudo-science. There is self-deception, there is an often exaggerated deference to
authority, there is deliberate manipulation, and there is a yielding to self-interest. There is a
challenge to devise ways of funding and directing scientific research that reduce opportunity for
manipulation, for deviousness, and for prejudice and dogma that masquerade as science.
Different scientists have different qualities. Some may be receptive to new ideas, but not good
at criticism. Others may be good at criticism, but not receptive to new ideas. They may apply
high standards of criticism in their own area, but make idiosyncratic judgments when the
scientific demands change. They may be hypercritical, not understanding the different nature of
the evidence that the new and unfamiliar area demands. Or, failing to note the different
opportunities for self-deception that this new area offers, they may be unduly credulous. There
are few who can examine claims in medicine or social science or physics with more or less equal
critical incisiveness.
Dominant authorities
As in all communities, there are some whose pronouncements carry especial weight, or whose
positions give them special authority. They may be editors of major journals, or have a large
influence in the decisions of funding agencies. There are practical reasons for listening to the
voices of such dominant figures. Their judgments can be effective in weeding out ideas that are
not worth pursuing. At the same time they may weed too ruthlessly, their own speculative
notions may acquire the force of dogma, and they may resist anything that they find too novel.
This may be a particular danger if there are just one or two dominant figures, individuals who
occupy the sort of position that Harold Jeffreys occupied in geophysics in the 1950s. It is
healthier if the dominant figures do not altogether agree among themselves.
Jealousy and backbiting also flourish. Other scientists may be seen, not as partners in a
common endeavour, but as threats to one's own enterprise who must be cut down by any
means available. Political concerns may influence scientific judgements. Even if such attitudes
are not overt, they may lurk below the surface. Perhaps we should be surprised that the
demands for scientific rationality do so often prevail over these human influences. Only an
overarching insistence on rigorous criticism can keep science from becoming prey to
irrationality. There will never be total success. There is however plenty of scope for
improvement on the way that science is now conducted.
27. For a recent wide-ranging critique of Kuhn's views, see Fuller (2000).
Reductionist Scientists?
Scientists who wish to publish extensively and advance in their chosen research area will do well
to limit their attention to a narrow range of problems that seem likely to yield easily to their
skills. This narrowness of focus, which can be beneficial in making initial progress in a closely
defined area of research, does not give the breadth of view needed to tackle "big issue"
questions. Determining the structure of an organic chemical compound found in the river water,
or using radio-isotopes to trace its progress through the river system, does not of itself give the
breadth of view needed to tackle such "big picture" problems as dryland salinity.
Wilson (1998, p.40) has apt comments:
The vast majority of scientists have never been more than journeymen prospectors. That is
even more the case today. . . . They acquire the training needed to travel to the frontier and
make discoveries of their own, and as fast as possible, because life at the growing edge is
expensive and chancy. The most productive scientists, installed in million-dollar
laboratories, have no time to think about the big picture and see little profit in it.
The skills of a "journeyman prospector" may serve well those who expect to join multi-million
dollar research laboratories. A narrow training focus seems clearly inappropriate for anyone
whose work is likely to demand skills different from those of their Ph.D. or other research
degree, or who is likely at some time to work on "big picture" issues.
Commercial Pressures
Money speaks volumes. Commercial pressures may be a potent influence. Wilkinson (1998)
offers a series of case studies that highlight some of the issues. Edmeades (2000) is an
interesting study of the aftermath of a celebrated defamation claim that occupied the New
Zealand High Court for 135 days. What were the rights and duties of fertiliser scientists who
wished to make the results of their research available to the farming community that they had a
responsibility to serve?
The scientific study of human nature and abilities is a sensitive area, for all sorts of reasons.
Are humans able to pursue such studies objectively, with the detachment that science demands?
Supposed scientific objectivity readily becomes a vehicle for particular prejudices.
The Heritability of IQ
Studies of the genetic basis of IQ have had a long and tangled history. A key and greatly
overplayed concept has been the heritability coefficient, the proportion of variation (measured
using the statistical variance) that is due to genetic variation. The heritability coefficient has
been widely used in animal and plant breeding studies, where the outcome variable of interest
has been weight or milk production. A high heritability suggests a potential to get further
improvements from breeding. Comparison between heritability coefficients from different trials
makes sense only if environmental variation is comparable. This may be reasonable if, as in
many animal and plant breeding studies, conditions are similar across different trials.
Studies of twins, both identical and non-identical and including separated pairs, have been the
main source of evidence for the heritability of IQ in human populations. As one might expect,
the two members of a separated pair are often reared in very similar circumstances, more
similar than for two randomly chosen members of the population. Thus the studies tell us
nothing about heritability in a section of the population where the range of social disadvantage is
large. Lewontin (1979) has argued, rightly in my view, that
. . . there is no way in human populations to break the correlation between genetic similarity
and environmental similarity, except by randomised adoptions.
One would need to randomly assign adoptees to the whole range of social circumstances to
which it was intended to generalise results. Such an experiment is surely out of the question.
There is a further issue. Twins share a common maternal environment. Daniels et al. (1997),
in a meta-analysis of more than 200 studies, estimate that the shared maternal environment of
twins accounts for 20% of the total variance. The ignoring of this component in earlier analyses
of data from twin-adoption IQ studies led to a substantial over-estimate of heritability. Assigning
to the wrong source a component that turns out to be 20% of the total is perhaps excusable in
the initial tentative investigations. Long before one has the 212 sets of results that Daniels et al.
analysed, this surely has acquired the status of a fundamental biological error! This analysis still
leaves large questions unanswered. What is the relevance of these studies, if any, to a wider
population where the range of environmental effects may be far larger than those typically
experienced by the separated twins?
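The variance bookkeeping behind this point can be made concrete. In the sketch below the component values are invented purely for illustration; only the 20% shared-environment figure echoes the Daniels et al. estimate.

```python
# Hypothetical variance components for an IQ-like outcome (illustrative only).
V_G = 0.50  # genetic variance
V_C = 0.20  # shared (maternal) environment, cf. the 20% of Daniels et al.
V_E = 0.30  # remaining environmental variance
V_total = V_G + V_C + V_E

true_h2 = V_G / V_total               # heritability, with V_C correctly separated out
inflated_h2 = (V_G + V_C) / V_total   # estimate if V_C is wrongly credited to genes

print(round(true_h2, 2), round(inflated_h2, 2))  # 0.5 versus 0.7
```

With these made-up numbers, lumping the shared maternal environment in with the genetic component inflates the apparent heritability from 0.5 to 0.7, which is the direction and rough scale of the over-estimate described above.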
IQ tests capture a small part of the rich texture of human abilities. Mental and other abilities
continue to change and develop through into old age. Mind Sculpture (Robertson 1999) is the
title of a book that discusses evidence on how our brains develop and change as a result of
demands placed on them. The emphasis should perhaps move from the study of mental testing
to the study of mind sculpture.
Sociobiology
In his 1975 book Sociobiology: The New Synthesis, Wilson defined sociobiology (see footnote
28) as "the systematic study of the biological basis of all social behaviour". Wilson hoped to find a genetic
basis for behaviour. Sustained controversy followed its publication. While most of the book
was devoted to the study of animal and especially insect societies, the final chapter speculated
on genetic influences on human behaviour. Why all the fuss? The discussion that now follows
draws at several points on the account in Segerstråle (2000).
28. Note also the more recent term "evolutionary psychology", used to describe an area of study that has
a large overlap with sociobiology.
Any initial foray into an area that is as complex as genetic effects on animal behaviour must
over-simplify. But what if the simplifications that seem required are precisely those that readily
feed into racial, sexual, national and other such forms of prejudice? Wilson was aware of the
risks of the area into which he had ventured, and took care to protect his words from such
misuse. His critics were not satisfied, either with his science or with the care that he had
exercised. Criticisms were of several different types:
o Wilson was charged with specific scientific errors.
o Notwithstanding the generally liberal tenor of Wilson's views, it was argued that they
lent support to those opposed to steps that would ameliorate the position of socially and
economically disadvantaged groups.
o Criticism of Wilson's book became a convenient starting point for promoting wider
scientific and political agendas. In some instances statements were taken out of
context, charging Wilson with views that were at variance with specific statements in
the surrounding text.
There is a succinct statement of the criticisms in Rose et al. (1984). Segerstråle attempts to
disentangle the various strands of this controversy. It is worth noting that a wide spectrum of
political views is found both among those who emphasise genetic influences on human
behaviour and abilities, and among those who emphasise environmental effects.
The first tentative steps in a new area of study may use overly simplistic models, which will be
refined as understanding advances. Problems arise when there are perceived implications for
the way that we regard or treat fellow humans. The misuse of claimed scientific results has a
long history; it is the theme of Gould's The Mismeasure of Man (see footnote 29). Where such
implications are perceived, it behoves scientists to tread with extreme care, to acknowledge
obvious limitations in their models, and to acknowledge the tentative character of their results.
This may conflict with the motivation that researchers feel to persuade themselves and others of
the importance and significance of their work.
A useful outcome of the sociobiology controversies has been a closer scrutiny of the scientific
methodology than has been common in other areas of biology that rely extensively on
observational data. This scrutiny needs to go further. Such statistical methodologies as
regression are too often used uncritically, without regard to traps such as were discussed in
section 5.2. Even if the models are correct, estimates of key parameters may be wrong.
29. Gould's account has itself attracted strong criticism from some academic reviewers.
Desmond, A., paperback edition 1999. Huxley. From Devil's Disciple to Evolution's High
Priest. Perseus Books, Reading MA.
Diamond, J. M. 1997. Guns, Germs, and Steel: the Fates of Human Societies. Random House,
London.
Edmeades, D.C. 2000. Science Friction. The Maxicrop Case and the Aftermath. Fertiliser
Information Services Ltd., P.O. Box 9147, Hamilton, N.Z.
Fuller, S. 2000. Thomas Kuhn: A Philosophical History for Our Times. University of Chicago
Press.
Gilovich, T. 1991. How We Know What Isn't So. The Free Press, New York.
Gleick, J. 1987. Chaos: Making a New Science. Viking, New York.
Gould, S. J., revised and expanded edition, 1996. The Mismeasure of Man. Penguin Books.
Greenhalgh, T. 1997. How to Read a Paper: the basics of evidence-based medicine. BMJ
Publishing Group, London.
Hallam, A., 2nd. edn 1989. Great Geological Controversies. Oxford University Press.
Harré, R. 1967. The principles of scientific thinking. In Harré, R., ed.: The Sciences. Their
Origin and Methods, pp. 142-174. Blackie and Son Ltd., Glasgow.
Jeffreys, H. 1926. The Earth, its Origin, History and Physical Constitution. Cambridge
University Press.
Kuhn, T., 2nd edn, 1970. The Structure of Scientific Revolutions. University of Chicago
Press, Chicago.
Lewontin, R.C. 1979. Sociobiology as an adaptionist program. Behavioural Science 24: 5-14.
Little, M.A. and Haas, J.D., eds. 1989. Human Population Biology. A Transdisciplinary
Science. Oxford University Press.
Maindonald, J. H. 1986. Apple transport in wooden bins. New Zealand Journal of Technology
2: 171-176.
Robertson, I. H. 1999. Mind Sculpture. Bantam, London.
Sagan, C. 1997. The Demon-Haunted World. Science as a Candle in the Dark. Headline
Book Publishing, London.
Schneider, S.H. 1996. Laboratory Earth. The Planetary Gamble We Can't Afford to Lose.
Weidenfeld and Nicholson, London.
Segerstråle, U. 2000. Defenders of the Truth. The Battle for Science in the Sociobiology
Debate and Beyond. Oxford University Press, Oxford.
Silverman, W. A. 1985. Human Experimentation. A Guided Step into the Unknown. Oxford
University Press, Oxford.
Taubes, G. 1998. The (political) science of salt. Science 281: 898-907 (14 August).
Wilkinson, T. 1998. Science Under Siege: The Politicians War on Nature and Truth. Johnson
Press, Boulder CO.
Wilson, E.O. 1975. Sociobiology: The New Synthesis. Harvard University Press, Cambridge
MA.
Wilson, E.O. 1998. Consilience. The Unity of Knowledge. Abacus, London.
9. Critical Review
To give a basis for independence of judgement it is, I believe, of far more importance than is
generally supposed that the worker should allot a considerable fraction of his working time
to making himself acquainted with the published literature. . . . The student's reading may
have been well directed, but it has covered almost certainly only a small fraction of the
published researches bearing on his problems. The junior worker should receive
encouragement, and his duties should allow him to read, with adequate care, far beyond the
limited series of papers which his chief may indicate to him as necessary for the work of his
department. The object should be to familiarise the reader with the stages whereby current
opinions have been developed, and to train him, by scrutinising the results of past
experimentation, to exercise his own judgement on the value of the experimental evidence
available on different disputable points.
[Fisher, R.A., in Bennett 1989.]
Critical review of previous research is the appropriate starting point for a new study.
The aim is, as far as possible, to start from what is already known. New research
should build on and learn from what others have or have not done. Look also for
other ways of getting a research consensus, such as talking directly to 'experts'.
The principles of critical review have wide application. They are pretty much those
of evidence-based medicine. One can apply them to the use of medical advice, and
one can apply them to research.
Statistical insights are often crucial in assessing the literature. Not all studies are of
equal quality. It is necessary to decide, as objectively as possible, which studies are
relevant and of high quality. This requires careful and critical scrutiny of each
individual study. Watch for confounding, i.e. situations where more than one explanation is
available for an observed effect. Ask about possibilities for bias. Ask whether the study had
sufficient precision to detect the effects that are of interest. Influences on precision
include measurement instruments, experimental or sampling design, and sample size.
Inadequate description of methodology may be a warning sign of methodological
inadequacies. Do not automatically give authors the benefit of the doubt.
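One way to ask the precision question concretely is through a sample-size calculation. The sketch below uses the standard normal-approximation formula for comparing two proportions; the 15% and 10% rates are invented for illustration.

```python
import math

def n_per_group(p1, p2):
    """Approximate patients per group for a two-sided 5% significance test
    with 80% power to detect a difference between two proportions
    (normal approximation)."""
    z_alpha = 1.96    # two-sided 5% critical value
    z_beta = 0.8416   # one-sided value giving 80% power
    p_bar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil((num / abs(p1 - p2)) ** 2)

# Detecting a drop in some adverse-outcome rate from 15% to 10% needs
# on the order of 700 patients in each group.
print(n_per_group(0.15, 0.10))
```

A study far smaller than this would stand little chance of detecting such an effect; that is the sense in which an under-powered study lacks the precision that critical review should check for.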
Some researchers must contend with a large number of papers bearing on their
chosen topic. A first step is to determine whether someone else has already done a
thorough competent critical overview. If one or more overview studies are available,
how careful and reliable are they? Are studies that show a clear effect more likely to
be represented? What are the possibilities for bias? Is it possible that an effect that
has shown up in a data-based overview is the result of a similar bias that has affected
all studies?
Before starting ones own research, there should be a good sense of what other workers have
achieved. It is well to be sure that any new piece of research has a good chance of providing
new, relevant, information. In some instances the research supervisor may provide a research
question that he/she is sure is unworked ground. At the other extreme, it may be impossible to
decide on a sensible research question until one has canvassed the state of existing knowledge,
and examined openings for new research.
The examination of existing data may be a desirable preliminary to the gathering of new data. A
first step will be to examine the highly summarised data that appear in published papers. If
access is then needed to original data, this may not be easy to get. In rare cases, the data may
already be available from an internet site. Some researchers are meticulous in keeping their data
on file, while some are not. Some make their data freely available to other workers, while
others may not.
One hospital had unusually high death rates for one specific group of patients, those requiring
CABG on an emergency basis. Here are the figures:
Risk-Adjusted Mortality Rates for Emergency CABG Patients
St Peter's Hospital, Albany: 27%
An investigation revealed that steps taken to stabilise patients before surgery were inadequate.
From 7 deaths among 42 patients in 1992, the hospital went to no deaths among 54 patients in 1993.
There are several points:
1. One can only make such comparisons if there are very large numbers. This study was
effective because it could compare one hospital with hospitals as a whole, and an individual
surgeon or groups of surgeons with surgeons as a whole. It is terribly important to bring all
the data together, and to study it in context.
2. If one pools data from low-volume surgeons, there are enough data to make useful
comments. Only where individual low-volume surgeons had exceptionally high mortality
rates could one argue that an individual surgeon was not performing well. Some of the low
volume surgeons might actually have been quite good.
3. The true long-term figure, for an individual surgeon who performed 50 operations over the
3-year period and had a 6% mortality rate, might be anywhere between 1.4% and 22%. We
have very little idea, with such scant data, as to how good that surgeon really is.
[I have given a 99% confidence interval.]
4. It was essential to make adjustments to allow for the higher number of high risk patients
operated on by some surgeons and in some hospitals. Use of the figures without such
adjustment would have been an abuse of statistics.
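Point 3 can be checked directly. The 1.4% to 22% quoted in the text presumably comes from an exact binomial method; the Wilson score interval sketched below, which needs only elementary arithmetic, gives broadly similar limits for 3 deaths in 50 operations.

```python
import math

def wilson_ci(deaths, n, z=2.576):
    """Wilson score interval for a proportion (z = 2.576 gives ~99% coverage)."""
    p = deaths / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# 3 deaths in 50 operations: a 6% observed mortality rate.
lo, hi = wilson_ci(3, 50)
print(round(lo, 3), round(hi, 3))  # about 0.015 and 0.208
```

With only 50 operations, the interval runs from under 2% to over 20%: exactly the "very little idea of how good that surgeon really is" described above.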
These sorts of comparative figures are open to serious abuse. If two surgeons have each
performed 200 operations, one with 5 deaths and the other with 10 deaths, it would be wrong to
try to make anything of the difference. Some reporters focused on just these sorts of differences
when the data were first reported. The New York State Department of Health started a
program to educate reporters on how to interpret the figures, leading to huge improvements in
reporting standards.
Researchers who contend with a large number of papers bearing on their chosen topic must
somehow get an adequate overview. Overview may be informal, largely supported by
qualitative judgements. Or it may follow approaches that have been developed by specialists in
the art of overview, and may be supported by quantitative analysis. In either case the "file
drawer" problem is a concern: how complete a sample does the published literature provide of
the evidence? Typically, studies that show an effect are over-represented among those that find
their way to publication. Also, with high apparent precision available from the meta-analysis of a
large number of trials, any systematic bias that affects a large number of the trials becomes
important. Any reviewer needs to pay attention to possibilities for bias.
One or more overview studies may already be available in the literature. It is then necessary to
assess the quality of this work. What have the authors of any overview studies done to attend
to the difficult issues noted in the previous paragraph?
[Fig. 6: Mortality in the treatment group plotted against mortality in the control group for each
trial (scale 0.0 to 0.6), with the line y = x shown for reference.]
Here we discuss results from a data-based overview (Cochrane Injuries Group Albumin
Reviewers 1998). Fig. 6 presents results from the 24 trials in which there was at least one
death, in either or both of the treatment or control group. These represent 1204 patients in all.
The authors were thorough in their searching for information on randomised controlled trials.
They searched various trials registers as well as international journals. They identified 30 trials
(1419 patients) that met their criteria, and for which mortality data were available. There were
two further trials (44 patients) where the mortality data were not available. All compared an
albumin treatment with a control that did not involve albumin.
Fig. 6 suggests that, contrary to previous expectation, albumin may actually be dangerous to
patients. Most trials, and almost all of the larger and hence more accurate trials, have points
that lie above the line y = x, i.e. mortality was higher in the treatment group than in the control
group.
A meta-analysis indicates that giving human albumin to patients in critical illness increases the
risk of death, by around 1 death for every 17 critically ill patients (see footnote 30) who receive albumin. The
results of this study go against what had been received medical wisdom. They build a picture
that was not available from any individual trial. Theoretical justifications for the use of albumin,
based on its presumed ability to restore blood volume, have yielded to hard data.
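The relationship between the relative risk and the "1 death for every 17 patients" figure can be sketched as follows. The pooled counts below are hypothetical, chosen only so that the summary statistics come out near those quoted (relative risk about 1.68, number needed to harm about 17); they are not the review's actual trial-level data.

```python
def risk_summary(deaths_treat, n_treat, deaths_ctrl, n_ctrl):
    """Relative risk and number needed to harm (NNH) from pooled 2x2 counts."""
    r_t = deaths_treat / n_treat
    r_c = deaths_ctrl / n_ctrl
    rr = r_t / r_c                 # ratio of mortality risks
    nnh = 1.0 / (r_t - r_c)        # patients treated per one extra death
    return rr, nnh

# Hypothetical pooled counts (illustrative only):
rr, nnh = risk_summary(85, 600, 51, 604)
print(round(rr, 2), round(nnh, 1))  # → 1.68 17.5
```

The NNH is simply the reciprocal of the absolute risk difference, which is why, as the footnote notes, it often makes better intuitive sense than the relative risk.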
The authors checked several possibilities for bias, to the extent that the data allowed it. Most of
these were small trials; all except two had fewer than 40 patients. Small trials are not always
conducted to the same standards as larger trials. Strict adherence to a pre-determined protocol
is a necessity for a large trial, whereas in a smaller trial there may be less stringent planning and
procedures. If this were a consideration here, one would expect the effect size to change with
the size of the trial. It does not. Even so, the small size of most of the trials is a reason for
interpreting results with caution.
A further issue is that, in some of the trials, allocation concealment was inadequate or unclear.
Is it possible that some clinicians did not follow the protocol strictly, giving albumin to more
30. The 95% confidence interval was 9 to 32. This is a "Number Needed to Harm" (NNH) form of
presentation of the results, which makes better intuitive sense than relative risk. The estimated
relative risk from using albumin rather than an alternative was 1.68 (95% C.I. 1.26 to 2.23).
seriously ill patients? In order to check this, the reviewers did an analysis that excluded trials
where the protocol may not have been strict. Exclusion of such trials made almost no
difference to the estimates of relative risk.
There is now evidence that albumin has a variety of effects, some perhaps unhelpful.
Interestingly, cohort studies that have measured the levels of albumin in the blood of seriously ill
patients have shown that the risk of death reduces with increasing levels of albumin. Fig. 6
suggests that it is dangerous to add to the albumin that is already present.
of data overview, formal or informal, is inevitable when research results are brought together
and their implications for practical decision-making assessed.
An adequate statistical theory, for use in data-based overview, was slow to develop. For a long
time, developing a theory that would handle data from an individual field site or from an
individual clinical trial was more than adequate challenge for theoretical skills. Scientists
have often been protective of their experiments and their data, which they may believe should
stand on their own independently of the work of other scientists. The tradition of analysing
separately data from each field experiment or each trial became firmly established. It remains
firmly entrenched in horticulture, and in other research areas also. Experimenters who have
worked on different sites may each claim that the other is wrong, when it is unclear whether the difference is a geographical effect, or is perhaps due to differences in experimental procedure.
districts that are represented? What insight does the art, and the groups into which art items
fall, shed into historical cultural connections in the region?
Her study has the potential to build, from relatively disconnected items of published information at a site level, a coherent account. As well as forming the building blocks of that
account, those individual published site reports will surely have a more regional relevance
and meaning within the framework of her account. Moreover, better understanding of the
chains of connection between the art at the various sites must lead to better understanding of
the individual motifs.
Just as with other types of overview, there have been reporting inadequacies that create
difficulties for the study. Future studies of individual sites will do well to note those
criticisms.
with an unusually simple technology once in Tasmania, in South Australia became canal builders
running a productive fishery, and ended up extinct when transplanted to truly appalling
conditions on Flinders Island.
Diamond identifies four groups of factors that help explain inter-continental differences in the technological patterns of development of different human societies. They are:
1. Wild plant and animal species available for domestication
2. Rates of diffusion and migration within continents
3. Rates of diffusion between continents
4. Differences between continents in area and in total population size
Diamond presents evidence on the way that each of these types of factor has affected the histories of the peoples living on the different continents. Thus, in respect of the first point above, he presents a table that compares the distribution of large-seeded grass species that, once domesticated, might have provided food crops:

Area                                  Number of Species
West Asia, Europe, North Africa              33
  Mediterranean zone                         32
  England                                     1
East Asia                                     6
Sub-Saharan Africa                            4
Americas                                     11
  North America                               4
  Mesoamerica                                 5
  South America                               2
Northern Australia                            2
Diamond discusses the characteristics of species that were suitable for domestication, and
argues that there were good reasons why none of the African species were domesticated.
In regard to point 2, he argues that diffusion will happen most readily along lines of similar
latitude, i.e. between regions with similar climate and able to grow similar types of plant species.
So one expects that diffusion of domesticated plants and animals, and of human populations,
will be more rapid in Eurasia than along the predominant longitudinal axes of the Americas and
sub-Saharan Africa. He discusses such archaeological data as are available on rates of diffusion.
He handles points 3 and 4 in much the same way, making general points and backing them up with whatever archaeological and other evidence is available.
Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., Pitkin, R., Rennie, D.,
Schulz, K. F., Simel, D., and Stroup, D. F. 1996. Improving the Quality of Reporting of
Randomised Controlled Trials: the CONSORT Statement. Journal of the American Medical
Association 276: 637 - 639.
Bennett, J.H. (ed.) 1989. Statistical Inference and Analysis. Selected correspondence of R. A.
Fisher, letter to J. R. Baker, pp. 343-346. Oxford.
Bussell, W. T., Maindonald, J. H. and Morton, J. R. 1997. What is a correct plant density for
transplanted green asparagus? New Zealand Journal of Crop & Horticultural Science 25:
359-368.
Chalmers, I. and Altman, D. G., eds. 1995. Systematic Reviews. BMJ Publishing Group,
London.
Chassin, M. R., Hannan, E. L. and DeBuono, B. A. 1996. Benefits and hazards of reporting medical
outcomes publicly. New England Journal of Medicine 334: 394-398.
Chatfield, C. 1995. Uncertainty, data mining and inference (with discussion). Journal of the
Royal Statistical Society A, 158: 419-466.
Cochrane Injuries Group Albumin Reviewers 1998. Human albumin administration in critically
ill patients: systematic review of randomised controlled trials. British Medical Journal 317:
235-240.
Diamond, J. M. 1997. Guns, Germs, and Steel: The Fates of Human Societies. Random House,
London.
Draper, D., Gaver, D. P., Goel, P. K., Greenhouse, J. B., Hedges, L. V., Morris, C. N., Tucker, J. R. and Waternaux, C. M. 1992. Combining Information: Statistical Issues and Opportunities for
Research. National Academy Press, Washington D.C.
Easterbrook, P. J., Berlin, J. A., Gopalan, R. and Matthews, D. R. 1991. Publication bias in
clinical research. Lancet 337: 867-872.
Ehrenberg, A. S. C. 1990. A hope for the future of statistics: MSOD. The American
Statistician 44: 195-196.
Fairburn, M. 1999. Social history: Problems, strategies and methods. Macmillan Press,
London.
Greenhalgh, T. 1997. How to read a paper. The basics of evidence based medicine. BMJ
Publishing Group, London.
Hubbard, R. and Armstrong, J.S. 1994. Replications and extensions in marketing: rarely
published but quite contrary. International Journal of Research in Marketing 11: 233-248.
Law, M. R., Frost, C. D., and Wald, N. J. 1991. By how much does dietary sodium lower
blood pressure? III Analysis of data from trials of salt reduction. British Medical Journal
302: 819-824.
Lindsay, R. M. and Ehrenberg, A. S. C. 1993. The design of replicated studies. The American
Statistician 47: 217-228.
McGuinness, D. 1997. Why Our Children Can't Read. The Free Press, New York.
Moynihan, R. 1998. Too Much Medicine. Australian Broadcasting Corporation.
Oxman, A. D. and Guyatt, G. H. 1993. The science of reviewing research. Annals of the New
York Academy of Sciences 703: 125-131.
Sackett, D. L. and Oxman, A. D., eds. 1994. The Cochrane Collaboration Handbook.
Cochrane Collaboration, Oxford.
10. Presenting and Reporting Results
It is easy to lie with statistics. It is hard to tell the truth without statistics.
[Andrejs Dunkels]
The setting out of conclusions in a way that is vivid, simple, accurate and integrated with subject
matter considerations is a very important part of statistical analysis.
[D. R. Cox 1981.]
Keep in mind from the beginning the required style and content for the eventual
report, paper or thesis. This will help plan and structure your project. It is a good
idea to include a provisional list of chapter or section headings in the research plan.
This outline can be filled out and modified as the project proceeds.
Much of the focus of this chapter is on the presentation of statistical results. Efficient
and cost-effective collection of quality data, and analysis that gets from the data all
the information that is reasonably available, are central to research. The endpoint is
the presentation of clear and coherent results. How does one present the message so
that it accurately reflects the data, so that it is clear, and so that it will be heard and
used?
Appendix III has a checklist for the authors of reports. Appendix IV has a checklist of
statistical presentation issues for the use of authors and referees. These supplement
the comments in this chapter.
I will begin with a discussion of general reporting issues, moving on to the special issues that
relate to the reporting of statistical results. Inevitably an understanding of some points will
require more knowledge of statistics than has been assumed in earlier chapters. Readers who
are puzzled by specific points that affect their research should take this as a warning to seek
expert statistical help. There is no royal road to the understanding of statistics that comes from
years of professional study and experience.
Always demonstrate that conclusions are soundly based. This may require a modest level of
technical detail. In a report for a commercial client it is often best to consign technical detail to
an appendix. Research theses may include substantial appendices.
Try to put yourself in the shoes of a reader of your report or thesis. Does it start with a
summary that presents the major insights and conclusions? Does it present a clear coherent
story? Does your report read well? Is the supporting evidence in place? Does the text focus
on the major issues?
The next two sections are largely adapted from Maindonald (1992). The advice is set out in a
pithy tutorial style. It is intended as a basis for consideration and debate.
For a paper, critically examine papers that others have published in that journal. For a thesis,
critically examine a well-regarded earlier thesis. If there is a scholarly book that canvasses
themes similar to yours, examine how it is structured. It may serve as a starting point for
developing a layout for your own work.
streamlined. The graphs that you used to explore data may need substantial modification, if they
are appropriate at all, when you come to present the data. Output from computer packages is
rarely suitable for direct use; you will need to modify and adapt it.
Scientific Interpretation
Interpret all statistical results, as far as possible, in subject matter terms. Use the statistic that
translates easily into subject matter terms in preference to a statistic that does not easily
translate. Translate regression coefficients into rate of change terms whenever this seems
helpful. Instead of reporting the relative risk of two medical treatment regimes, it is often more
meaningful to report the number needed to treat (NNT) to avoid one death.
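As a minimal sketch of this translation (the event rates below are invented, not taken from any trial discussed here):

```python
# Relative risk versus number needed to treat (NNT): the same
# comparison, expressed two ways.  NNT is the reciprocal of the
# absolute difference in risk, so it reads directly as "treat this
# many patients to avoid one adverse outcome".
def relative_risk(risk_treated, risk_control):
    return risk_treated / risk_control

def number_needed_to_treat(risk_treated, risk_control):
    return 1.0 / abs(risk_treated - risk_control)

risk_control = 0.20   # invented: 20% of controls have the outcome
risk_treated = 0.15   # invented: 15% under the new regime

print(f"Relative risk: {relative_risk(risk_treated, risk_control):.2f}")
print(f"NNT: {number_needed_to_treat(risk_treated, risk_control):.0f}")
```

Here the NNT of 20 ("treat 20 patients to avoid one death") says more, at a glance, than the relative risk of 0.75.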
Translate all transformed values back into meaningful units for presentation. On graphs you
may wish to plot using transformed units, with the axes labelled using the original units.
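A small sketch of the back-transformation step (the data values are invented): analysis on the log scale, with the mean back-transformed to a geometric mean in the original units.

```python
import math

# Analyse on the log scale, report in the original units.
concentrations = [1.2, 3.5, 0.8, 2.9, 1.7]   # invented measurements

log_values = [math.log(x) for x in concentrations]
mean_log = sum(log_values) / len(log_values)

# Back-transforming the mean of the logs gives the geometric mean,
# which is in the same units as the raw measurements.
geometric_mean = math.exp(mean_log)
print(f"Mean on log scale: {mean_log:.3f}")
print(f"Back-transformed (geometric) mean: {geometric_mean:.2f}")
```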
Economic Implications
It is often helpful to give an assessment of economic implications. But be realistic about
uncertainties and limitations. Present calculations of economic return in such a way that it is
straightforward to work out how results would be different under different economic conditions.
Scientific models
Analyses that use models that are motivated by scientific understanding are in general more
insightful than analyses that use ad hoc and/or empirical models. Use any scientific
understanding that is available to help direct the study design and the analysis. At the same
time, be sensitive to questions that the data may raise for current scientific perceptions. Allow
the data to speak for themselves.
Description of the design
Describe the study (experiment, sample survey, . . .) accurately and fairly. Be careful to
identify experimental or sampling units and the units on which measurements were made.
Where experimental data are reported describe the blocking structure, the exact form of
randomisation, and other details of the experimental design. Explain the reasons for your choice
of design. In field experiments either provide a drawing of the field layout, or else describe it in
sufficient detail that the reader can sketch a diagram.
Describe realistically and accurately the population to which results apply.
Measures of Precision
Include SEs or SEDs (or their equivalent) and sample sizes wherever relevant. Where there are
multiple error strata, be sure to quote the SE that is relevant to the comparison that is made. If
results do not have the replication that would allow determination of the relevant SE, note this.
Note sources of variability that have been excluded in determining standard errors.
If the data allow it, present one SE rather than different SEs for different groups.
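The distinction between an SE and an SED can be sketched as follows; the measurements are invented, and a single pooled variance is assumed, in line with the advice above.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

group_a = [12.1, 11.4, 13.0, 12.6]   # invented measurements
group_b = [10.2, 10.9, 9.8, 10.5]

# Pooled variance, assuming the two groups share a common variance.
na, nb = len(group_a), len(group_b)
pooled_var = ((na - 1) * sample_variance(group_a) +
              (nb - 1) * sample_variance(group_b)) / (na + nb - 2)

se_mean = math.sqrt(pooled_var / na)             # SE of one group mean
sed = math.sqrt(pooled_var * (1 / na + 1 / nb))  # SE of the difference
print(f"SE of a group mean: {se_mean:.3f}")
print(f"SED for the comparison of means: {sed:.3f}")
```

With equal group sizes, the SED is the square root of 2 times the SE of a single mean.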
Curve fitting
When estimating a particular point on a fitted curve (e.g. time to 99% mortality, or a maximum),
it is crucial that the curve fits well in the neighbourhood of that point. If necessary, the fitting
procedure should omit points that are at one (or both) extreme(s) from the point that is of
interest.
Consider the use of a smoother as an alternative to the use of a curve that follows a specific
mathematical form.
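A running mean is one of the simplest smoothers. The sketch below (invented y-values, window of 3) shows the idea, though in practice a kernel or spline smoother would usually be preferred.

```python
# A running-mean smoother: each point is replaced by the average of
# the points in a window centred on it (the window shrinks at the ends).
def running_mean(ys, window=3):
    half = window // 2
    smoothed = []
    for i in range(len(ys)):
        lo = max(0, i - half)
        hi = min(len(ys), i + half + 1)
        segment = ys[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed

ys = [2.0, 2.4, 1.9, 3.1, 3.6, 3.2, 4.0]   # invented responses
print(running_mean(ys))
```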
Measures of Relationship
The standard Pearson product-moment correlation is a measure of straight-line association. Use
it only if you can justify restricting attention to linear association. Scatterplots will highlight
gross departures from linearity. In addition there are statistical methods for testing linearity
against specific curvilinear forms of response.
Correlation and regression calculations should ordinarily be supported by relevant plots.
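The point about linear association can be made concrete. In this invented example an exact, but curved, quadratic relationship yields a Pearson correlation of zero:

```python
import math

# The Pearson correlation measures straight-line association only.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]   # a perfect, but curved, relationship

print(pearson_r(xs, ys))    # 0.0: no linear association at all
```

A scatterplot of these data would reveal the relationship immediately; the correlation coefficient alone would suggest none exists.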
Reserve multiple or adjusted R2 for comparisons across similar experimental or sampling
designs. Use adjusted R2 in preference to multiple R2.
Note that a high correlation or multiple R2 does not automatically imply that the relationship is
adequate. The size of R2 must be judged against the scatter in the data. If there is little scatter,
it will require a correspondingly high R2 to justify the claim that the fitted curve adequately
captures the data.
Unless experience with earlier comparable results has shown what magnitude of R2 to expect,
do not rely on R2 as a measure of model adequacy. Instead use a graphical check, perhaps
backed up with a formal test for absence of systematic departure from the assumed form of
response.
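The warning can be illustrated numerically. In this invented example a straight line fitted to mildly curved data gives an R2 above 0.99, yet the residuals show a systematic pattern that a graphical check would expose:

```python
# Least-squares straight line, fitted to data that follow a gentle curve.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
         sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

xs = [float(x) for x in range(1, 9)]
ys = [x + 0.05 * x ** 2 for x in xs]   # invented, mildly curved response

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
ss_res = sum(r ** 2 for r in residuals)
my = sum(ys) / len(ys)
ss_tot = sum((y - my) ** 2 for y in ys)
r2 = 1 - ss_res / ss_tot

print(f"R^2 = {r2:.4f}")               # about 0.995
# Residuals are positive at both ends and negative in the middle:
# systematic curvature that R^2 alone would never reveal.
print([round(r, 3) for r in residuals])
```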
Significance Tests
Use p-values, if appropriate, to back up what you see as the major points that you have to
make. Otherwise be abstemious in the use of p-values. Be sensitive to alternative ways of
presenting the data that may reveal its major patterns.
Highlight the Trends
Where effects are quantitative use a trend curve or response surface analysis in preference to
individual tests of significance. Multiple range tests are not appropriate for structured data.
Overall Analyses
Where work is widely extended in space and time, present an overall analysis that captures the
major results. This extends to results that have been obtained by different workers, but carrying
out closely related studies. Such analyses will identify how, after allowing for systematic effects
due, e.g., to geography and soil type, local results stand up against site-to-site variation. In the
absence of such an overview the effort that has gone into the individual trials may be largely
wasted.
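One simple form of overall analysis is an inverse-variance weighted (fixed-effect) combination of the separate estimates. The effect estimates and standard errors below are invented:

```python
import math

# Fixed-effect combination: weight each trial's estimate by the
# inverse of its variance, so that precise trials count for more.
estimates = [1.8, 2.4, 1.2, 2.0]   # invented effect estimates, four sites
ses       = [0.9, 0.7, 1.1, 0.8]   # invented standard errors

weights = [1.0 / se ** 2 for se in ses]
pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

print(f"Pooled estimate: {pooled:.2f} (SE {pooled_se:.2f})")
```

The pooled SE is smaller than any single-site SE, which is the formal counterpart of the power of multiple sets of data discussed in Chapter 9.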
Consider the relevance of results to those who may use them. Farmers and horticulturists are
interested in effects that apply to their farm or orchard. They can be confident in using results
that have appeared consistently over different locations and years. Doctors are interested in
results that apply to their patients.
Studies that have not yielded statistically significant results must be included in overview
analyses.
Graphical Presentation
Put major conclusions into graphical form. Make captions comprehensive and informative.
Use appropriate graphical presentations to reduce reliance on tables and on verbal description.
The best statistical software links statistical analysis closely with graphical presentation.
Effective presentation of data and of statistical results will similarly link the results of the
analysis with graphical presentation.
Design graphs to make their point tersely and clearly, with a minimum waste of ink. Avoid
distracting irrelevancies. Label as necessary to identify important features. Use scatterplots in
preference to e.g. bar graphs whenever the horizontal axis represents a quantitative effect. Keep
the information to ink ratio in mind.
Use graphs from which information can be read directly and easily in preference to those that
rely on visual impression and perspective. Thus in scientific papers contour plots are much
preferable to surface plots or two-dimensional bar-graphs.
Draw graphs so that reduction and reproduction will not interfere with visual clarity.
Explain clearly how error bars should be interpreted: SE limits, 95% confidence interval, SD limits, or .... You must explain what source of error is represented. It is pointless to present information on a source of error that is of little or no interest.
Appendix 1 Questions for Researchers to Consider
You were left pretty much on your own to find a research question.
2. Is (was) your research question clear from the beginning, or did you (will you need to)
refine it?
Clear from the start Some refinement required Extensive refinement required
C. New Knowledge?
4. Will you require skills and knowledge that you did not acquire in your earlier study?
5. If in question 3 you ticked Extensive or Some, what are the likely sources of this new
knowledge? (Tick one or more)
E. Multi-disciplinary Demands
7. Does your research demand skills from several different disciplines?
The skills that are required are from one main area.
8. Do your supervisors come from one main skill area, or from several different skill areas?
F. Measuring Instruments
9. Will you need to develop new measuring instruments?
10. What types of measuring instrument(s) will you primarily use? [Tick one or more.]
Questionnaire
Other .
G. Sources of Data
11. What methods will you use for collecting data?
H. Use of Controls
12. If your study is experimental or quasi-experimental, what form of assignment to control will
you use?
Not random, but not under the control of the researcher (e.g. haphazard or systematic)
No control. How then will you deal with the limits this imposes?
Other: ..
J. Use of Time
14. Where will you spend major amounts of your time? (Tick one or more)
Fieldwork/overseas
L. Practical Implications
16. What are the practical implications of your research?
M. Issues of Validity
17. Is it clear what your instruments measure?
[For example, what does GDP (Gross Domestic Product) measure? Does it measure
anything useful? What do public opinion polls measure?]
One could argue about it, but there is not much room for doubt.
Appendix II: Checklist for Use with Published Papers
Data Collection
4. How were the data obtained? Some of the possibilities are Sample, Experiment,
Informed opinion, Guess.
[How many tens of thousands of people did the papers say marched across Sydney
Harbour Bridge? Did someone count them all? Was the number a stab in the dark?]
5. Do the data make sense; are they free of apparent serious anomalies?
[Some numbers may be impossible. Or, e.g., a height/weight ratio may be
impossible.]
6. Do any of the claims go beyond what the data could support?
7. Do the data answer the research question?
8. Are the measurements/questions clear? Or is there ambiguity?
[e.g. using data from a limited local study to support claims that relate to another
geographical location.]
9. Are the data valid for the intended use?
10. In a study of human subjects, who had contact with the participants and how?
11. Who/what was studied and what was the selection process?
12. Are the data sampled from the population to which the researchers wish to generalise?
[A sample of Sydney-siders is not a good basis for generalising to what Canberra
residents think.]
13. Was the study capable of detecting effects of a magnitude that were of interest?
[Influences on precision include measurement instruments, experimental or sampling
design, and sample size.]
14. What biases may have been present in the data?
[Consider, measuring instrument bias, observer bias, selection bias, etc.]
15. Where groups are compared, are there extraneous differences?
[e.g., in clinical trials, differences that have nothing to do with the treatment.]
Data Analysis
16. Is the arithmetic correct?
17. Does the analysis take account of data structure (fixed effects, random effects, clustering, etc.)?
18. Is the description of the method of analysis clear and complete, with a reference given
if the methodology is at all non-standard?
19. Has account been taken of clear grouping (e.g. males/females, different species, etc.)
in the data? If results were combined across groups, is justification given?
Appendix III A Checklist for Authors
Points that will quickly attract the casual reader's attention appear in italics. In a report for a
commercial client, these will often be the main focus of attention. They may become important
to a commercial client (and to the report writer) when claims made in the report are challenged,
when the report goes to other consultants for review, or when other specialists make use of
information in the report.
Other points relate more directly to statistical or other professional concerns. They are intrinsic
to doing a thoroughly professional job. In a research thesis these are likely to be the major focus
of attention.
Appendix IV Checklist for Presentation of Statistical Results.
ii If there are multiple standard error bars, are they all necessary? (But where standard errors clearly differ, take care that this is reflected in the requisite number of error bars.)
15. Is assistance with the design and/or statistical analysis and/or interpretation acknowledged
by
i authorship?
ii acknowledged help?
16. From the statistical viewpoint, is the paper of an acceptable standard for publication?
17. Comment on any points not covered by the above questions.
[Adapted from the checklist on page 1486 of Gardner, Altman, Jones and Machin (1983).]
Maindonald, J. H. 1992. Statistical design, analysis and presentation issues. New Zealand Journal of Agricultural Research 35: 121-141.
Murray, A.W.A. 1988. Recommendations of the editorial board on use of statistics in papers
submitted to JSFA --- guidelines to authors as formulated by A W A Murray. Journal of the
Science of Food and Agriculture 42, no. 1, following p. 94. Reprinted in vol. 61, no. 1, 1993.
Index to Part I

accident of nature, 43
behavioural studies
  animal, 65
bias to noise ratio, 94
bias, non-experimental studies, 45
case-control study, 44
cause & effect, 28, 78
checklist
  for authors, 113
  for use with published papers, 111
  presentation of results, 115
clustering, 66
Cochrane collaboration, 93
cohort study, 43
complex systems, 77
computer modelling, 79
confounding, 39
correlation, 104
cross-sectional study, 44
data
  & theory, 71
Diamond, Jared, 96
ecological studies, classification, 29
economic implications, 103
effect size, 65
ethics, 22
Evidence-based Medicine (EBM), 2, 93
examples
  1936 Literary Digest poll, 53
  antibody production, 70
  climate change, 79
  death rates in heart surgery, 89
  diethylstilbestrol (DES), 44
  fertilizer trials, 95
  gastric cancer, 45
  health status studies, 71
  human albumin, 91, 92
  labour training program, 46
  minimum wage legislation, 9, 29
  salinity, 78
  salt, & blood pressure, 8, 26, 89
  teaching of reading, 8, 95
  traumatic loss, consequences, 9
Excel, 13
experiment, 33
  blocking, 38
  experimental unit, 37
  haphazard assignment, 38
  levels of variation, 38
  measurement unit, 38
  precision, 36
  randomisation, 38
  randomisation, replication & blocking, 36
  replication, 36, 38
  treatment unit, 38
Food Frequency Questionnaire (FFQ), 59
graphs, 11, 104
historical sciences, 96
history
  as a science, 98
hypothesis testing, 76
imaginative insight, 76
Knowledge Discovery in Databases (KDD), 47
Kuhn, Thomas, 76, 82
law-like behaviour, 73
literature review, 21, 88
measurement instrument, 16
meta-analysis, 27
multiple R2, 104