Assessment & Evaluation in Higher Education
Vol. 30, No. 1, February 2005, pp. 65–72

Multiple-choice and true/false tests: myths and misapprehensions
Richard F. Burton*
University of Glasgow, UK

*Institute of Biomedical and Life Sciences, Thomson Building, University of Glasgow, Glasgow G12 8QQ, UK. Email: R.F.Burton@bio.gla.ac.uk

Examiners seeking guidance on multiple-choice and true/false tests are likely to encounter various
faulty or questionable ideas. Twelve of these are discussed in detail, having to do mainly with the
effects on test reliability of test length, guessing and scoring method (i.e. number-right scoring or
negative marking). Some misunderstandings could be based on evidence from tests that were badly
written or administered, while others may have arisen through the misinterpretation of reliability
coefficients. The usefulness of item response theory in the analysis of academic test items is briefly
dismissed.

Introduction
Multiple-choice and true/false tests have a valuable role in higher education, but, as
with other examination techniques, they must be used with care. This article is mainly
intended to help academics who are reconsidering their current practices, or else
trying to persuade colleagues that what they do is appropriate or otherwise. Our main
concern here is with the reliability of unspeeded tests (i.e. tests with ample time for
completion) and thus with their format, length and manner of scoring (of which
number-right scoring and negative marking have long been the main contenders).
There is an extensive literature for guidance that includes too many textbooks and
journal articles for most of us to read, as well as more casual and ephemeral writings
that may sometimes have disproportionate impact. Many of the relevant arguments
are based in statistics and probability calculations that many test-setters, and also a
few writers on the topic, have no other reason to know much about. Inevitably,
therefore, misunderstandings and questionable ideas arise and flourish, some gaining
currency, one suspects, because they support examination procedures favoured for
other reasons.
Ideas on test reliability depend not only on mathematical reasoning, but on
experience. To be useful, empirical evidence needs to come from tests in which
items are well-constructed and unambiguous. It is also essential that the examinees understand
fully the implications of the marking system. Unfortunately, however, instructions to examinees
are often inadequate (see Discussion). It is likely that particular tests, and with them their formats
and scoring methods, have sometimes been judged as unreliable simply because of flawed items
and procedures. Inevitably, this is hard to substantiate on the basis of published evidence, but what
is indisputable is that many tests are indeed badly written and administered (e.g. Holsgrove &
Elzubeir, 1998; Downing, 2003).
We could thus take as our first 'misapprehension' the idea that any teacher has
the skill to write good test items without instruction. However, the point of this
paper is more to deal with the following 12 myths and misunderstandings that
are familiar to the author (hereafter simply labelled 'myths'). They are listed first
as stark statements, and then discussed in the same numbered sequence. Except
when otherwise stated, they refer equally to multiple-choice and true/false
tests. The term `multiple-choice' is sometimes applied to small groups of
related true/false items, but here refers only to items with three or more answer
options.

Twelve myths and misapprehensions


Myth 1. With number-right scoring, random guessing generally has little effect on test
reliability.
Myth 2. Blind guessing is reduced if the test items are well-constructed.
Myth 3. Blind guessing is harder when test items are complicated.
Myth 4. With number-right scoring, the extra score obtained by blind guessing can be
exactly calculated from the number of test items guessed blindly (if known) and the
number of answer options per item.
Myth 5. Negative marking `corrects for guessing'.
Myth 6. Negative marking of true/false tests is necessarily more unfair than number-
right scoring because people differ in their willingness to gamble on uncertain
answers.
Myth 7. Negative marking never works.
Myth 8. Incorrect knowledge has the same effect on number-right scores as does
complete ignorance.
Myth 9. True/false tests are limited to testing for factual recall.
Myth 10. Tests of, say, 60 items generally suffice to sample the facts and ideas taught
in a given course, assuming these to be much more numerous.
Myth 11. Reliability coefficients (sometimes called 'reliabilities') measure test
reliability.
Myth 12. Item response theory (including the simplest version, the Rasch model) is
superior to classical test theory for analysing items in academic tests.

Discussion
Myth 1
The deleterious effect of guessing on test reliability has long been recognised. With
number-right scoring, the worst situation pertains when some examinees do not
record a response to every item, while others do so, guessing as necessary. Examinees
must therefore be urged to answer every item, and guessing, random or not, is thus
part of the system. Then the effects of completely random guessing on test reliability
are calculable for defined circumstances, and there is no doubt that they can be
unacceptably serious, especially in short tests (Burton & Miller, 1999; Burton, 2001,
2002). Despite the importance of guessing, many examiners are susceptible to
counter-arguments (see, for example, myths 2 and 3). One of these, given by several
authors (e.g. Ebel, 1979), is that random guessing on its own is extremely unlikely to
give a high test score. Thus the probability of scoring 60 in a 70-item true/false test
with no knowledge at all is only 0.0000000003 (Downing, 2003). That is reassuring,
but beside the point; of more concern is the fact that the success or failure of a student
of near-passing ability (i.e. not one who is completely ignorant) can very much depend
on the vagaries of guessing. To illustrate, let us arbitrarily take the pass mark in this
true/false test as 53 and consider a student who knows 35 of the 70 answers and
guesses the rest blindly: although that person's most likely scores are 52 or 53, the
probabilities of scoring less than 49 or more than 56 are each quite high, namely 9%.
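These figures follow directly from the binomial distribution. The short Python sketch below is not part of the original article; it simply reproduces the numbers quoted above, assuming every blind guess on a true/false item succeeds with probability 0.5.

```python
from math import comb

def binom_pmf(n, k, p=0.5):
    """Probability of exactly k successes in n independent guesses."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of scoring exactly 60 out of 70 true/false items by blind guessing alone.
print(f"P(60/70 by pure guessing) = {binom_pmf(70, 60):.10f}")   # about 0.0000000003

# A student who knows 35 of the 70 answers and guesses the remaining 35 blindly:
# total score = 35 + X, where X ~ Binomial(35, 0.5), so the most likely scores are 52-53.
p_low  = sum(binom_pmf(35, k) for k in range(0, 14))    # score below 49 (X <= 13)
p_high = sum(binom_pmf(35, k) for k in range(22, 36))   # score above 56 (X >= 22)
print(f"P(score < 49) = {p_low:.2f}, P(score > 56) = {p_high:.2f}")  # each roughly 0.09
```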
The water is muddied by the fact that guesses are often not blind (i.e. not
completely uninformed and random), but, instead, are assisted by 'partial knowledge',
a concept which itself has several facets (Burton, 2002). We return to this in
discussing myth number 6. Guessing will be less often blind if poorly constructed test
items contain unintended clues, such as bad wording (e.g. grammatical or syntactical
clues) or obviously implausible distractors. The presence of these should tend to raise
the average scores in a test and, with the effective number of options per item reduced,
score reliability should be reduced too.

Myths 2 and 3
The author has only seen the first of these two in print. Both are hard to understand
since, as already noted, if one has no idea of the correct answer, one can only score by
guessing blindly. As also just noted, guessing is easier when items are poorly
constructed and contain unintended clues, but the guesses are not then blind.

Myths 4 and 5
That the extra score to be obtained by blind guessing is exactly calculable from the
number of items guessed blindly and the number of answer options per item is a
mistake more often implied than stated explicitly, and it is not always clear whether
the writers really believe it. What can be calculated, or rather estimated, is the average
extra score and the variation around that average. Similarly, it is sometimes unclear
whether 'correction for guessing' is meant literally, as opposed to being just a standard
phrase for negative marking and other forms of 'formula scoring'. Presumably
'correction' is not taken to mean 'punishment'! In any case the main, and usual,
justification for negative marking is the discouraging of blind guessing, not correcting
for it. Whether particular textbooks are actually saying or implying this is not always
clear (e.g. Stanley & Hopkins, 1972; Ebel, 1979; Popham, 1990). Some of this may
seem like quibbling, but people can be misled by injudicious wording.
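For concreteness (this worked formula is not in the original article, but follows from elementary probability): if an examinee guesses n_b items blindly, each having k equally likely options, the number of lucky hits is binomial, so only the average gain and its spread can be stated, never any individual's gain.

```latex
\[
  \mathbb{E}[\text{extra score}] \;=\; \frac{n_b}{k},
  \qquad
  \mathrm{SD}[\text{extra score}] \;=\; \sqrt{\,n_b\,\frac{1}{k}\Bigl(1-\frac{1}{k}\Bigr)} .
\]
% Under the commonly used formula-scoring rule S = R - W/(k-1), a blind guess gains 1 with
% probability 1/k and loses 1/(k-1) otherwise, so its expected contribution is zero:
\[
  \frac{1}{k}\cdot 1 \;+\; \frac{k-1}{k}\cdot\Bigl(-\frac{1}{k-1}\Bigr) \;=\; 0 ,
\]
% which is why such penalties deter blind guessing on average rather than 'correct' any one score.
```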
A secondary, but important, effect of negative marking is that marks are lost for
beliefs that are incorrect, but confidently held (see comments on myth 8). This can
improve test reliability (Burton, 2004b).

Myth 6
It is indeed unfortunate, and is generally considered inappropriate, if people score
worse than others simply through a greater reluctance to take risks. There is certainly
evidence that students do differ in this regard (e.g. Stanley & Hopkins, 1972; Rowley
& Traub, 1977; Masters & Keeves, 1999). Differences in personality are partly
responsible, and grade expectations may be relevant too in that examinees tend to
guess more when they expect a low grade (Bereby-Meyer et al., 2002). However, test-
wiseness is also important and students in higher education can be taught to recognise
that guesses based on hunches and partial knowledge should on average be worth
risking (e.g. Hammond et al., 1998). Although instructions given to examinees should
make this clear, they sometimes say instead that answers should only be recorded if
known to be correct, leaving the wilier examinees to work out a better approach. As
Hammond et al. (1998) point out, the instruction never to guess is sometimes due to
the examiners' mistaken belief that guessing usually leads to a net loss of marks.
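As a hypothetical illustration of why informed guesses are, on average, worth risking under negative marking (this sketch is not from the article; it assumes the standard formula-scoring penalty of 1/(k-1) per wrong answer and a five-option item):

```python
# Expected marks gained per guess on a k-option item scored as S = R - W/(k-1),
# when the examinee can eliminate some options and guesses among the m that remain.
def expected_gain(k, m):
    return (1 / m) - ((m - 1) / m) * (1 / (k - 1))

for m in range(5, 0, -1):
    print(f"5-option item, {m} options still plausible: "
          f"expected gain = {expected_gain(5, m):+.2f}")
# A truly blind guess (m = 5) gains nothing on average; eliminating even one
# distractor makes the expectation positive, so hunches are worth acting on.
```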
The differences amongst examinees in regard to risk-taking have been seen as a
conclusive argument against negative marking, but the key issue, generally
disregarded, is whether this effect on test reliability is worse than the alternative,
namely the effect of guessing on number-right scores. With 300-item true/false tests,
available evidence suggests that the 'willingness-to-gamble' effect is actually the lesser
of the two evils (Burton, 2002). With shorter true/false tests, negative marking would
then be even more strongly favoured. As to multiple-choice tests, which are not
included in the statement of myth 6, similar evidence is unavailable, and the analysis
would be complicated by the greater role played in these by partial knowledge. In true/
false tests the latter is largely a matter of uncertainty, while in multiple-choice tests
one or more distractors per item can often be dismissed with complete certainty
(Burton, 2002). (Incidentally, it is a general presumption that all partial knowledge
deserves to be rewarded, but one is free to question this, notably when the partial
knowledge comes from recognising unintended clues provided by poor item writing.)
We have just been considering guessing that is not completely blind or random.
Another minor aspect to this should be mentioned in relation to multiple-choice items
lest it prove a stumbling block. Even when a multiple-choice response is completely
uninformed and supposedly random, options may sometimes be chosen for their
length or position in the list (Stanley & Hopkins, 1972). This is immaterial provided
that there is no corresponding pattern to the wrong or right answers, but that
condition is not always met.

Myth 7
Holsgrove (2000) says of attempts to minimise guessing by negative marking that 'the
only thing they have in common is that none work'. One may interpret this in two
ways: either that all examinees still answer every item, which is contrary to
experience, or that there are other problems. As to the latter, the author set some
very bad tests in his youth, through ignorance. On the other hand, negative marking is
still being used with apparent success by many institutions (e.g. Muijtjens et al.,
1999). If many educationists have abandoned negative marking (Downing, 2003),
this is no proof that it does not work. Some students may feel it to be unfair, but this
could be because they are unaware of the considerable unfairness associated with the
vagaries of unpenalised guessing. With true/false tests, the probability of guessing an
item correctly is high (0.5), so that negative marking is almost a necessity if the tests
are not to be inordinately long, i.e. long enough that the random component of a
score is about the same for all equally knowledgeable examinees.
It can happen that some examinees choose to answer all items in the belief, often
correct, that it pays to do so. If all do this, then negative marking is unhelpful, but, if
some examinees do not do so, there is likely to be some overall benefit to test
reliability. For experimental purposes (Muijtjens et al., 1999; Burton, 2002) negative
marking can be applied even when the examinees are expecting number-right scoring,
but this does not affect test reliability.

Myth 8
With number-right scoring, incorrect responses due to incorrect knowledge score
nothing, but, whenever one has no knowledge at all, correct or not, one must guess,
and may happen to do so correctly. Thus misinformation results in lost opportunities
to score by guessing. This may or may not be generally misunderstood, but the author
has not seen this fact mentioned in relation to negative marking when people decry the
penalising of misinformation as opposed to penalising bad guesses. Perhaps this myth
should be worded differently: 'Negative marking differs from number-right scoring in
that only the former penalises misinformation'. There seems to be a general
presumption that confidently believed misinformation is no worse than simple
ignorance, but sometimes a single confident error implies fundamental
misunderstanding of a whole topic.

Myth 9
Ebel (1979) demonstrates clearly that true/false items can be made to present quite
difficult and complex problems. This applies to multiple-choice items too.

Myth 10
This discussion has so far been concerned with guessing, but tests of any kind are
inherently unreliable if they do not adequately sample the relevant knowledge
domain. When a large body of facts and ideas is sampled by a short test, individual
examinees can be very lucky or unlucky in what they are asked, and this can make their
scores seriously unreliable. As in relation to guessing, reliability increases less than
proportionately with test length. That 60 items are too few for adequate sampling of a
large knowledge domain is readily demonstrated for a simple, idealised test model
(Posey, 1932; Ebel, 1979; Burton, 2001). In this model all items test single facts or
ideas with number-right scoring and are sampled with equal probability. Guessing is
excluded. Realistic modifications may be introduced to the model, but do not suggest
a significantly happier conclusion. That people have faith in tests with even fewer than
60 items, even as few as 30, is evident from their widespread use (e.g. Downing &
Haladyna, 2004) and longer tests may be seen as too burdensome for examiners or
students. Of course, a test of 60 items must suffice when there are only 60 or so facts
and ideas to be tested, or when reliability happens not to be an issue. Independently of
this sampling effect, a test of 60 items may be far too short anyway if guessing occurs
(Burton & Miller, 1999).
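A minimal simulation of the idealised model just described may make the scale of the sampling effect concrete. The domain size and the student's level of knowledge below are hypothetical, and guessing is excluded, as in the model.

```python
import random

N_DOMAIN, N_ITEMS, FRACTION_KNOWN = 600, 60, 0.6   # hypothetical figures
random.seed(0)
known_facts = set(random.sample(range(N_DOMAIN), int(FRACTION_KNOWN * N_DOMAIN)))

def score_on_one_test():
    """Draw a 60-item test at random from the domain and count the items the student knows."""
    return sum(1 for fact in random.sample(range(N_DOMAIN), N_ITEMS) if fact in known_facts)

scores = [score_on_one_test() for _ in range(10_000)]
mean = sum(scores) / len(scores)
sd = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
print(f"mean = {mean:.1f}/60, SD = {sd:.1f}, range = {min(scores)} to {max(scores)}")
# The student 'deserves' 36/60 every time, yet individual scores routinely stray
# several marks either way purely through luck in which facts happen to be asked.
```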
The choice here of '60' is somewhat arbitrary, but corresponds to one item per
minute over one hour. In this connection, a general problem with any kind of test item
requiring extra reading or thought is that test length may have to be reduced in order
to fit the available time. This could apply to multiple-choice tests with elaborate
distractors, to some of those true/false tests requiring more than factual recall, or to
some of the variants of multiple-choice tests that are not discussed here. Amongst the
latter are those employing 'confidence weighting' (e.g. Wood, 1991), 'extended-matching'
(Case et al., 1994) and 'assertion-reason' items (Connelly, 2004).

Myth 11
A reliability coef®cient may be seen as the estimated correlation between strictly
parallel tests (Cronbach, 1951; Ebel, 1979). Therefore it does partly reflect test
reliability, but it also depends very much on the spread of knowledge amongst the
whole group of examinees (e.g. Wood, 1991; Burton, 2004b). Thus, to take a fanciful
illustration, the coef®cient can be increased by bringing in extra examinees who know
nothing (or everything) of the subject. In a certain sense, reliability coef®cients can be
seen as measuring reliability, but not of the test per se. Thus, if the ignorant examinees
just mentioned were duly to score zero, the proportion of all examinees achieving
exactly appropriate scores, unless already 100% of them, would surely be increased.
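The point can be illustrated with Cronbach's (1951) coefficient alpha. The following sketch is not from the article; the group sizes and ability levels are invented purely to show how the coefficient responds when the spread of knowledge widens while the test itself stays the same.

```python
import random
from statistics import pvariance

def cronbach_alpha(responses):
    """Coefficient alpha for a matrix of 0/1 item scores (rows = examinees, columns = items)."""
    n_items = len(responses[0])
    sum_item_vars = sum(pvariance([row[i] for row in responses]) for i in range(n_items))
    total_var = pvariance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum_item_vars / total_var)

random.seed(0)
# Fifty examinees of broadly similar ability on a 40-item test (hypothetical data).
group = []
for _ in range(50):
    p = random.uniform(0.6, 0.8)                      # this examinee's chance per item
    group.append([1 if random.random() < p else 0 for _ in range(40)])
print(f"alpha, similar-ability group:          {cronbach_alpha(group):.2f}")

# Add ten examinees who know nothing: the items are unchanged, but the coefficient
# rises because the spread of knowledge among the examinees has widened.
widened = group + [[0] * 40 for _ in range(10)]
print(f"alpha after adding ignorant examinees: {cronbach_alpha(widened):.2f}")
```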
Reliability coef®cients do not reveal the reduction in test reliability due to variable
risk-taking with negative marking (myth number 6). However, they may indicate
differences in test reliability in the special situation that the spread of knowledge can
be taken as constant from one test to another. Obviously 'misapprehensions' can
come about through the misinterpretation of reliability coef®cients. One often meets
statements about the 'reliabilities' of tests without the measure being further specified;
this point may need to be checked before the evidence is accepted. Reliabilities need
to be assessed in relation to the objectives of the test in question and there is no single,
all-purpose numerical measure (Burton, 2004b).

Myth 12
Item response theory (e.g. Wood, 1991; Masters & Keeves, 1999) may seem attractive
because it is sophisticated and often carries the label of 'modern', but academic tests
do not typically have appropriate characteristics (Burton, 2004a). Moreover,
empirical comparisons with classical test theory have not shown that the latter gives
inferior results (e.g. MacDonald & Paunonen, 2002). It is therefore far from obvious
that the trouble and expense of applying item response theory is justi®ed.

Conclusion
The author toyed with the idea of presenting the 12 numbered statements as a true/
false test for the reader, but some would be too equivocal for that purpose, as the
subsequent commentaries show. True/false tests fare well in the discussion, and some
ingenuity would be needed to turn the 12 (false) statements into good multiple-choice
items. Whether true/false or multiple-choice tests are more generally preferable must
often depend partly on what exactly is to be tested. Against what could now be the
general tide of opinion, discussion here has also been favourable to negative marking,
especially with short tests. However, it is not our present concern to prescribe test
procedures, nor to consider every relevant issue (Wood, 1991). The most important
message is rather that opinions, arguments and evidence in this field need to be
considered critically and carefully.

Acknowledgements
I thank Dr R. Orchardson and Dr D. J. Miller for their helpful comments.

Notes on contributor
Richard F. Burton is an Honorary Research Fellow in the University of Glasgow, with
research interests in comparative physiology, psychophysiology and botany. He
has many years of examining experience in Physiology.

References
Bereby-Meyer, Y., Meyer, J. & Flascher, O. M. (2002) Prospect theory analysis of guessing in multiple choice tests, Journal of Behavioral Decision Making, 15, 313–327.
Burton, R. F. (2001) Quantifying the effects of chance in multiple-choice and true/false tests: item selection and guessing of answers, Assessment and Evaluation in Higher Education, 26, 41–50.
Burton, R. F. (2002) Misinformation, partial knowledge and guessing in true/false tests, Medical Education, 36(9), 805–811.
Burton, R. F. (2004a) Can item response theory help us improve our tests? Medical Education, 38(4), 338–339.
Burton, R. F. (2004b) Multiple choice and true/false tests: reliability measures and some implications of negative marking, Assessment and Evaluation in Higher Education, 29, 585–595.
Burton, R. F. & Miller, D. J. (1999) Statistical modelling of multiple-choice and true/false tests: ways of considering, and of reducing, the uncertainties attributable to guessing, Assessment and Evaluation in Higher Education, 24, 399–411.
Case, S. M., Swanson, D. B. & Ripkey, D. R. (1994) Comparison of items in five-option and extended-matching formats for assessment of diagnostic skills, Academic Medicine, 69 (Suppl), S1–S3.
Connelly, L. B. (2004) Assertion-reason assessment in formative and summative tests: results from two graduate case studies, in: R. Ottewill, E. Borredon, L. Falque, B. Macfarlane & A. Wall (Eds) Educational innovation in economics and business, VIII. Pedagogy, technology and innovation (Dordrecht, Netherlands, Kluwer Academic Publishers), 359–378.
Cronbach, L. J. (1951) Coefficient alpha and the internal structure of tests, Psychometrika, 16(3), 297–334.
Downing, S. M. (2003) Guessing on selected-response examinations, Medical Education, 37(8), 670–671.
Downing, S. M. & Haladyna, T. M. (2004) Validity threats: overcoming interference with proposed interpretations of assessment data, Medical Education, 38(3), 327–333.
Ebel, R. L. (1979) Essentials of educational measurement (3rd edn) (Englewood Cliffs, NJ, Prentice-Hall).
Hammond, E. J., McIndoe, A. K., Sansome, A. J. & Spargo, P. M. (1998) Multiple-choice examinations: adopting an evidence-based approach to exam technique, Anaesthesia, 53(11), 1105–1108.
Holsgrove, G. (2000, February 4) Test answer is anybody's guess, Times Higher Education Supplement, p. 15.
Holsgrove, G. & Elzubeir, M. (1998) Imprecise terms in UK medical multiple-choice questions: what examiners think they mean, Medical Education, 32(4), 343–350.
MacDonald, P. & Paunonen, S. V. (2002) A Monte Carlo comparison of item and person statistics based on item response theory versus classical test theory, Educational and Psychological Measurement, 62(6), 921–943.
Masters, G. N. & Keeves, J. P. (1999) Advances in measurement in educational research and assessment (Amsterdam, Pergamon).
Muijtjens, A. M. M., van Mameren, H., Hoogenboom, R. J. I., Evers, J. L. H. & van der Vleuten, C. P. M. (1999) The effect of 'don't know' option on test scores: number-right and formula scoring, Medical Education, 33(4), 267–275.
Popham, W. J. (1990) Modern educational measurement. A practitioner's perspective (New York, Harcourt Brace Jovanovich).
Posey, C. (1932) Luck and examination grades, Journal of Engineering Education, 60, 292–296.
Rowley, G. L. & Traub, R. E. (1977) Formula scoring, number-right scoring, and test-taking strategy, Journal of Educational Measurement, 14(1), 15–22.
Stanley, J. C. & Hopkins, K. D. (1972) Educational and psychological measurement and evaluation (Englewood Cliffs, NJ, Prentice-Hall).
Wood, R. (1991) Assessment and testing: a survey of research commissioned by the University of Cambridge Local Examinations Syndicate (Cambridge, Cambridge University Press).
