
Scottish Qualifications Authority

WRITING OBJECTIVE TESTS


Advice on the construction of objective tests

Abstract
This document summarises best practice in writing selected response questions and the
creation of objective tests. The advice was distilled from a literature review carried out in
2007 and updated in 2017. Topics covered include: misconceptions about objective
testing, types of selected response questions, Bloom's Taxonomy, difficulty and demand,
writing questions for higher level skills, item analysis and setting pass marks. It is
intended for authors, vetters, test experts and others with an interest in assessment.
Version 2.0, November 2017

Bobby Elliott
bobby.elliott@sqa.org.uk
Licensed under Creative Commons: Attribution-NonCommercial 4.0 International
CONTENTS

Introduction
Misconceptions about Objective Testing
Question Types
Constructed Response Questions
Selected Response Questions
Uses of Selected Response Questions
Types of Selected Response Questions
True/False Questions
Matching Questions
Multiple Choice Questions
Multiple Response Questions
Ranking Questions
Assertion/Reason Questions
Likert Scale Questions
Best Answer and Exceptions
Variants and Clones
Advantages & Disadvantages of Question Types
Using Selected Response Questions
Bloom's Taxonomy
Identifying the level of a question
Difficulty and demand
Question types and demands
Writing Multiple Choice Questions
Anatomy of an MCQ
The Item
The Stem
The Options
Advice on avoiding cueing
Disclosers
Writing MCQs for Formative Assessment
Writing questions for Higher Level Skills
Techniques for writing higher order questions
Scenario questions
Passage-based reading
Item Analysis
Facility Value
Discrimination Index
Other metrics
Constructing tests
Authoring tests
Determining test length
Techniques for setting pass marks in objective tests
Informed judgement
Initial Pass Mark
Angoff method
Contrasting Groups
Dealing with guessing
Setting an appropriate pass mark
Negative marking
Correction-for-guessing
Confidence Levels
Mixing and Sequencing questions

INTRODUCTION

Objective testing has been around for a long time. It was first used in the early 20th Century and
gained popularity in the middle of that century when mass testing became common in developed
countries. Its popularity has ebbed-and-flowed during the last century. The relatively recent
emergence of computer-based assessment increased its popularity. It remains relatively popular
today, alongside other forms of assessment. Both advocates and critics have settled on the view that
there is a place for this form of testing in conjunction with complementary approaches.

This document is intended for anyone with an interest in objective tests. It will be of particular
interest to the following individuals.

Test writers: the people who create test questions.


Test vetters: the people who check test questions.
Internal and external verifiers: the people who check the quality of assessments.
Awarding bodies: the people who are responsible for national qualifications.

This guide has several aims. Its main aim is to summarise best practice in the creation of objective
tests. Although different sources sometimes provide conflicting advice, there is a significant
consensus around most aspects of objective test creation; this guide seeks to distil that advice into
a single document. More specifically, this guide aims to:

describe the range of question types


explain the strengths and weaknesses of each
summarise best practice in writing multiple choice questions
explain how to combine questions into an objective test
introduce item analysis.

A subsidiary objective is to standardise terminology. Objective testing is a technical area with lots of
jargon, some of which is used inconsistently. This guide is the result of a wide-ranging literature
review and seeks to harmonise the terminology with that used internationally.

Although some topics (such as item banking) overlap with computer-assisted assessment (CAA), this
guide focuses on the production of objective tests irrespective of how they are administered. The
document has eight sections:

Section 1 Introduction to selected response questions


Section 2 Types of selected response questions
Section 3 Choosing selected response questions
Section 4 Writing multiple choice questions
Section 5 Writing questions for higher level skills
Section 6 Item analysis
Section 7 Constructing tests
Section 8 Dealing with guessing.


MISCONCEPTIONS ABOUT OBJECTIVE TESTING

Although this document does not seek to promote one type of assessment over another, it does aim
to dispel some commonly-held, but inaccurate, views about objective testing. Some of the most
common misconceptions are rehearsed below.

1. Objective tests dumb-down education/Objective tests are easy. Objective tests are as dumb
as you make them. Many high stakes tests (such as university medical examinations in the UK
and SATs in the United States) use objective tests.
2. Objective tests can only be used to assess basic knowledge. While this is largely true in
practice, there is nothing inherent in the design of objective tests to make them unsuitable for
assessing high order skills.
3. Objective tests encourage guessing. The problem of guessing can be resolved through one of a
number of established methods.
4. Writing an objective test is easy. The construction of high quality objective questions is skilled
and requires significant knowledge and experience.
5. Objective testing is only fashionable because of e-assessment. It's true that objective tests are
well suited to computer-assisted assessment but they are also valid and reliable forms of
assessment in their own right.
6. Objective tests aren't appropriate for my subject. While objective tests have traditionally been
used in the physical and social sciences (such as Physics and Psychology), they can be used in
any subject.

QUESTION TYPES

Question types can be categorised under two headings.

1. Constructed response questions


2. Selected response questions.

CONSTRUCTED RESPONSE QUESTIONS


Constructed response questions (CRQs) are questions that require learners to create (construct)
an answer. Examples of constructed response questions include short answer questions and essays.
Example 1 illustrates a short answer question.

Question Translate "Good morning, mother" into Spanish.


Response
Example 1: Constructed response question

CRQs can be divided into two sub-categories: (1) restricted response questions; and (2) extended
response questions.


A restricted response question (RRQ) is a question whose answer is limited to a few words.
Examples of RRQs include complete-the-sentence, missing word and short answer.

An extended response question (ERQ) is one whose answer requires the candidate to write longer
responses, normally consisting of two or more paragraphs. Examples of ERQs include reports, essays
and dissertations. There is no hard-and-fast rule about where a restricted response question ends
and an extended response question begins.

Figure 1: Constructed response question types (constructed response questions divide into restricted response and extended response questions)

SELECTED RESPONSE QUESTIONS


A selected response question (SRQ) is one whose answer is pre-determined and involves the learner
choosing (selecting) the response from a list of options. Because the answer is pre-determined and
there is only one correct answer, these types of question are often referred to as objective
questions. Examples of SRQs include true/false, multiple choice and matching questions.

Question The capital of the United States is New York.


Response True/False
Example 2: Selected response question

This guide focuses on SRQs, which are becoming increasingly popular for a variety of reasons.

ADVANTAGES OF SELECTED RESPONSE QUESTIONS


1. SRQs are more reliable than CRQs: they eliminate the inherent subjectivity in marking written
responses.
2. SRQs take less time to answer: reducing the time that learners spend on assessment.
3. SRQs are quick to mark: reducing the time that teachers spend on assessment.
4. SRQs are well suited to formative assessment: learners' responses can be analysed and used to
provide diagnostic feedback.
5. SRQs are good for assessing breadth of knowledge: they are well suited for assessing a broad
range of topics in a short time.
6. SRQs are well suited to computer-assisted assessment: and facilitate item banking.

The low writing load of SRQs means that the focus is on the learner's knowledge rather than their
writing or language skills, which is a common problem with constructed response questions.


Also, the speed of answering SRQs addresses another common criticism of assessment: that it takes
too much time.

Figure 2: Selected response question types (true/false, matching, MCQ, MRQ, ranking, assertion and Likert Scale)

Research into the marking of CRQs and SRQs has shown significant differences in the reliability of
the two approaches, with objective tests proving to be significantly more reliable than written
tests. This is the main reason for their widespread adoption in the United States, where testing
organisations operate in a more litigious environment.

The compatibility between objective tests and computer-assisted assessment is a driver for the
current popularity of objective testing. Many testing organisations are in the process of building
large collections of questions (item banks), which can be delivered to learners over the Internet.

DISADVANTAGES OF SELECTED RESPONSE QUESTIONS


1. SRQs are not suitable for assessing certain abilities, such as communication skills or creativity;
they are also not appropriate when learners are required to construct an argument or provide
an original response.
2. SRQs may be less valid than CRQs and suffer from low professional credibility.
3. SRQs that assess higher order skills are difficult (and time consuming) to produce.

The first and second disadvantages are linked. There is nothing inherent in the design of SRQs to
make them less valid than CRQs but, because they have been used inappropriately (to measure
skills that cannot be properly measured by this style of question), they have established a reputation
for having low validity among some educators.

Most teachers are comfortable with using SRQs to assess low order skills (such as factual recall,
exemplified by Example 2). They are less comfortable with their use in assessing deeper knowledge
and understanding. Most contemporary examples of SRQs confirm this view by focussing on the
assessment of surface knowledge.


Traditionally, the costs of carrying out assessment come at the end of the process: setting the
question paper is relatively speedy; the time-consuming (and labour-intensive) part comes when the
papers have to be marked. Objective tests reverse this model: the time-consuming part is the
production of the questions, with marking taking little time. The total time is not greater (it's much
less if an item bank is frequently used), but the front loading is a culture change and the long-term
benefits are less immediately obvious.

Another criticism of SRQs is that they can atomise teaching and learning, encouraging "teaching to
the test" and surface learning. This, combined with their efficiency in assessing large numbers of
learners in short periods of time, has resulted in them acquiring a reputation as "weapons of mass
instruction", with poor standing among many educators.

USES OF SELECTED RESPONSE QUESTIONS

Objective tests can be used for formative and summative assessment. When used summatively,
objective testing tends to be used for low-stakes assessment rather than high stakes assessment,
which largely remains the preserve of constructed response questions. However, some subjects
(such as Biology) do employ objective testing in high stakes assessment, and Higher Education has a
long tradition in using objective testing for high-stakes summative purposes in areas such as
Mathematics and Medicine.

Objective testing is well suited to formative assessment since it is quick to administer and assess,
and addresses the main reason for the lack of use of formative assessment: time. It is particularly
suited to diagnostic assessment since it can be used to identify specific misunderstandings or
weaknesses.

Question Select the larger fraction in each pair.


1 3/7 or 5/7
2 3/4 or 4/5
3 5/7 or 5/9
Example 3: Formative assessment

These questions were used in an English SSAT test in 2014. Unsurprisingly, 90% of learners answered
question 1 correctly; 75% answered question 2 correctly; but only 15% answered question 3
correctly. Learners had no difficulty when the denominator was fixed (question 1), a little more
difficulty when denominators changed but the fractions were familiar (question 2), and a great deal
of difficulty when the denominators were different and the fractions were not familiar (question 3).
These responses, collectively, show that this cohort did not understand how to compare fractions by
converting to the lowest common denominator, a finding that can be used to intervene and correct
this problem.
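
To illustrate the arithmetic behind question 3: converting both fractions to the common denominator
63 gives 5/7 = 45/63 and 5/9 = 35/63, so 5/7 is the larger fraction.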

Historically, objective tests have been widely used for psychometric testing (tests of attitude and
aptitude) and, more recently, they have been widely used in job appraisal. They are also used in entry
examinations for some professional bodies (such as ACCA). Objective tests are widely used


internationally, including high-stakes assessments such as SATs in the United States, which are used
for university entry. They are also widely used within vendor examinations (such as Microsoft's
global certification programme). Awarding bodies in every country are focusing on computer-
assisted assessment, which has resulted in a renewed interest in objective testing. These
organisations share the view that the increasing popularity of e-learning will drive demand for
e-assessment, which will be underpinned by item banks consisting of large numbers of selected
response questions.


TYPES OF SELECTED RESPONSE QUESTIONS

There are several types of selected response questions (SRQs). Although they share some common
characteristics, they each have unique features and applications. But they all share a fundamental
characteristic: they have one correct answer.

There are seven types of SRQ.

1. True/false questions.
2. Matching questions.
3. Multiple choice questions.
4. Multiple response questions.
5. Ranking/sequence questions.
6. Assertion/reason questions.
7. Likert Scale questions.

TRUE/FALSE QUESTIONS

A true/false question (T/F) is a statement that is either true or false. The learner must select one of
two possible responses: true or false.1

Question (x + 1) is a factor of x² + 2x + 3. True/False


Example 4: True/False question
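
(A quick check using the factor theorem: substituting x = -1 into x² + 2x + 3 gives 1 - 2 + 3 = 2, which
is not zero, so (x + 1) is not a factor and the statement is false.)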

Because learners have a 50/50 chance of answering these questions correctly, this type of question
is considered easy and is associated with low order knowledge. However, true/false questions can
assess higher order skills; and setting an appropriate pass mark can eliminate the effects of guessing.
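
As a minimal illustration of this point (the test length and pass mark below are assumptions chosen for the example, not recommendations from this guide), the following Python sketch shows how unlikely it is to reach a suitably high pass mark on a true/false test by blind guessing alone.

```python
from math import comb

def prob_pass_by_guessing(num_items, pass_mark, p_correct=0.5):
    """Probability of scoring at least pass_mark by guessing every item."""
    return sum(comb(num_items, k) * p_correct**k * (1 - p_correct)**(num_items - k)
               for k in range(pass_mark, num_items + 1))

# On a 20-item true/false test with a pass mark of 15 (75%), pure guessing
# succeeds only about 2% of the time.
print(prob_pass_by_guessing(20, 15))  # ~0.021
```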

MATCHING QUESTIONS

A matching question requires learners to match an object with one or more characteristics.

Match the storage technologies on the left with the storage


characteristics on the right.
A Flash memory 1 High capacity
B Hard disk 2 Low capacity
C RAM 3 Non-volatile
D ROM 4 Volatile
Example 5: Matching question

1 Any question that has two possible answers is a true/false question, irrespective of the possible responses (true/false, yes/no, etc.). These types of question are also known as alternative response items.


The objects on the left are called stimulators and the matching statements on the right are called
responses.

This type of question is often used to assess learners' knowledge of the characteristics of certain
objects. It is particularly well suited to computer-based assessment since it can be implemented as
drag-and-drop (dragging each response onto an associated stimulator).
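
As a sketch of how such an item might be implemented (the data structure, function name and answer key below are illustrative assumptions, not taken from Example 5), a drag-and-drop matching question can be represented as a mapping from each stimulator to the set of responses expected to be dropped onto it:

```python
# Illustrative answer key only; a real item would define its own key.
answer_key = {
    "Hard disk": {"High capacity", "Non-volatile"},
    "RAM": {"Volatile"},
}

def mark_matching(key, response):
    """Award one mark per stimulator whose dragged responses exactly match the key."""
    return sum(1 for stimulator, expected in key.items()
               if response.get(stimulator, set()) == expected)

learner_response = {"Hard disk": {"High capacity", "Non-volatile"}, "RAM": {"Non-volatile"}}
print(mark_matching(answer_key, learner_response))  # 1 mark out of a possible 2
```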

MULTIPLE CHOICE QUESTIONS

A multiple choice question (MCQ) consists of a question (or incomplete statement) followed by a list
of possible responses from which learners must select one.

There are normally three to five options, with four being the most common.

Which one of the following psychological terms means holding two contradictory points of view about a subject?
A Cognitive dissonance.
B Delusion.
C Dementia.
D Dissociative disorder.
Example 6: Multiple choice question

MULTIPLE RESPONSE QUESTIONS

A multiple response question (MRQ) is similar to a multiple choice question (MCQ) but has two or
more correct responses.

Which of the following statements relating to earthquakes is/are true?


A An earthquake generates seismic waves.
B The boundary of tectonic plates is called the fault plane.
C The point of origin of seismic waves is called its epicentre.
D The severity of an earthquake is measured by its magnitude and intensity.
Example 7: Multiple response question

There are some misconceptions about MRQs. They are not necessarily more difficult than MCQs;
they are as hard or as easy as you choose to make them. There is no need to indicate the number of
correct options; this only encourages guessing. And there is nothing wrong with making every option
correct; in fact, prohibiting this possibility reduces the flexibility of MRQs.
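
One common way of marking an MRQ objectively, sketched below, is all-or-nothing scoring: the learner's selected options must exactly match the full set of correct options, however many there are. The marking scheme and the example key are assumptions for illustration, not the key to Example 7.

```python
def mark_mrq(correct, selected):
    """Score 1 only if the selected options exactly match the full correct set."""
    return 1 if set(selected) == set(correct) else 0

print(mark_mrq({"A", "D"}, {"A", "D"}))  # 1: exactly the right set
print(mark_mrq({"A", "D"}, {"A"}))       # 0: an incomplete selection earns nothing
```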

It is common practice for MCQs to begin with "Which one of the following…" and MRQs to begin
with "Which of the following…".


RANKING QUESTIONS

A ranking question involves ordering the options into a sequence. The sequence can be numerical,
chronological or some other defined sequence.

Rank the following countries in terms of their population


densities. Rank the highest density first.
A Australia.
B France.
C Germany.
D United Kingdom.
Example 8: Ranking question

Ranking questions are easily implemented by computers using drag-and-drop.

ASSERTION/REASON QUESTIONS

Assertion/reason questions consist of a statement (assertion) and a possible explanation (reason).


Learners must decide if the assertion and reason are true, and whether the reason is a correct
explanation of the assertion.

The following assertion and reason relate to World War Two. Choose the corresponding letter (A-E) to indicate if the
assertion and/or reason is/are true.
Assertion Japan's lack of natural resources was the main reason for the war in Asia.
Reason Japan lacked raw materials except for small deposits of coal and iron.
A Assertion is false and the reason is false.
B Assertion is false and the reason is true.
C Assertion is true and the reason is false.
D Assertion is true and the reason is true and the reason is the correct explanation for the assertion.
E Assertion is true and the reason is true but the reason is not the correct explanation for the assertion.
Example 9: Assertion/reason question

Assertion/reason questions are, effectively, multi-true/false questions.

LIKERT SCALE QUESTIONS

The Likert Scale was named after Rensis Likert who invented the scale in 1932. It is widely used
within questionnaires to gauge respondents' attitudes. The Likert Scale consists of five responses.

1. Strongly agree.
2. Agree.
3. Neither agree nor disagree.
4. Disagree.
5. Strongly disagree.


Some psychometricians add or remove options (the neutral option "Neither agree nor disagree"
is often removed).

My manager supports me when needed but otherwise permits me to work without interference.
A Strongly agree.
B Agree.
C Neither agree nor disagree.
D Disagree.
E Strongly disagree.
Example 10: Likert Scale question

This type of SRQ is almost exclusively used for attitudinal surveys.

BEST ANSWER AND EXCEPTIONS

Although the existence of a single, unambiguous, correct response is a fundamental feature of SRQs,
the usefulness of SRQs can be extended through "best answer" and "exception" type questions.
These techniques increase the flexibility of SRQs at the expense of some of their objectivity.

A best answer question is one whose answer is the closest (best) answer selected from a list of
possible answers of which more than one may be true. Used carefully, best answer questions can be
as objective as standard SRQs.

A user wishes to use a search engine to look for information relating to Celtic music that originated in Scotland. Which
one of the following queries is most likely to produce the best results?
A Celtic music Scotland
B Celtic music Scotland football
C Scotland +celtic +music +originate
D Celtic music that originated in Scotland
Example 11: Best answer question

Note that more than one of the responses is correct (in fact, they are all more-or-less correct). But
only one option is the best answer (B).

The use of best answer questions is particularly appropriate to the social sciences and arts subjects,
which have a less well defined body of objective knowledge compared to the physical sciences. Best
answer questions can also be used to assess some higher order skills since they frequently require
an element of judgment.

An exception question is one where all of the options are correct except one of the responses. This
type of question effectively reverses the logic of the standard SRQ.


Smoking is a contributory factor in the following conditions except:


A diabetes.
B heart disease.
C lung cancer.
D Parkinson's disease.
Example 12: Exception question

A question that includes "not" is effectively an exception type question. For example, the above
question could have been rephrased as "Which one of the following conditions is not caused by
smoking?".

The use of exception (and negative) questions can simplify the wording of questions or increase the
number of questions that can be asked.

VARIANTS AND CLONES

A question that uses similar wording to assess the same content as another question is known as a
variant. Variant questions are worded differently and their options may be entirely different but,
fundamentally, they assess the same learning objective. For example, a question about the impact
of diet on diabetes could be considered a variant of Example 12.

A question that is (almost) identical to another question is known as a clone. Clones differ only in
their variables. For example, the question below is a clone of Example 4, the only difference being
the expression to be factorised.

Question (x + 1) is a factor of x² + 2x + 1. True/False


Example 13: Clone

Variants and clones have important applications in e-assessment since they provide a quick and
simple way of rapidly populating item banks.
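
As a rough sketch of how this might work in practice (the generator below is an assumption, not part of the original guidance), clones of the factorisation item in Example 13 could be produced automatically by varying only the coefficients:

```python
import random

def make_clone():
    """Generate a true/false clone of the form '(x + a) is a factor of x^2 + px + q'."""
    a = random.randint(1, 9)              # the candidate factor is (x + a)
    b = random.randint(1, 9)              # second root, used when the statement is true
    is_true = random.choice([True, False])
    p, q = a + b, a * b                   # (x + a)(x + b) = x^2 + (a + b)x + ab
    if not is_true:
        q += random.randint(1, 5)         # perturb the constant so (x + a) is no longer a factor
    return {"stem": f"(x + {a}) is a factor of x^2 + {p}x + {q}. True/False", "key": is_true}

print(make_clone())
```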

ADVANTAGES & DISADVANTAGES OF QUESTION TYPES

As stated previously, each question type has unique characteristics and uses. The applications of
each type are determined by its strengths and weaknesses.

Type: True/False
Advantages: Well suited to basic knowledge. Easy to write. Rapid to mark. Suited to dichotomous knowledge. Good for formative assessment, especially diagnostic assessment.
Disadvantages: Limited applications (best suited to dichotomous knowledge).

Type: Matching
Advantages: Relatively easy to write. Quick to mark. Good for assessing knowledge of characteristics/features or relationships between variables. Well suited to computerisation (drag-and-drop).
Disadvantages: Limited to knowledge and comprehension. Best used for homogenous content i.e. classifying types.

Type: MCQ/MRQ
Advantages: Can assess a wide range of cognitive abilities (up to analysis). Scenario-based questions can assess higher order skills. Well suited to diagnostic assessment (distractors can target learning difficulties). Item analysis provides detailed feedback (to assessors and candidates). Simple MCQs are quick and easy to construct. High re-usability of items.
Disadvantages: High quality MCQs (at any level) are difficult and time consuming to construct. MCQs that assess high level abilities require skilled authors. Unsuitable for assessing synthesis and evaluative skills.

Type: Assertion
Advantages: Well suited to assessing relationships between variables. Well suited to assessing understanding of cause-and-effect. Good for constructing demanding items.
Disadvantages: Difficult to construct. Limited applications (compared to MCQs). Difficult to read and understand.

Table 1 - Advantages and disadvantages of question types


USING SELECTED RESPONSE QUESTIONS

The previous section explored the characteristics of different types of selected response question.
This section looks at how each type is best used.

BLOOM'S TAXONOMY

One of the key determinants in the selection of SRQs is the kind of knowledge or understanding that
you are seeking to assess. Although every type of question can be used to test every level of
cognition, some are more appropriate than others. For example, factual recall can be adequately
assessed using true/false questions; deeper understanding may require more complex question
types such as multiple response questions.

As a starting point, we need a method of classifying knowledge and understanding. The most widely
used classification system is Bloom's Taxonomy. Benjamin Bloom wrote Taxonomy of Educational
Objectives, Book 1: Cognitive Domain in 1956 in an attempt to standardise the terminology used by
teachers to describe academic abilities. Until the publication of this book, different people used
different words to describe the same thing or, worse, used the same words to describe different
things. His book described a classification system that could be used to categorise cognitive abilities.
The taxonomy (which became known as Bloom's Taxonomy) is widely used within the educational
community. Bloom's Taxonomy is not the only way to classify academic abilities. There are
alternatives, some linked to Bloom's (but more up-to-date) and some entirely different. But
Bloom's Taxonomy remains the most widely used classification system.

Bloom's Taxonomy classifies academic abilities into six categories.

1. Knowledge
2. Comprehension
3. Application
4. Analysis
5. Synthesis
6. Evaluation.

A brief description of each follows.

Knowledge: Knowledge involves the recall of specific facts and figures, or the recall of specific methods and processes. Knowledge is the bottom of Bloom's Taxonomy but underpins the higher order abilities. There are three types of knowledge: knowledge of specifics, knowledge of methods, and knowledge of universals. The higher levels (knowledge of methods and universals) can be intellectually demanding. This category includes: knowledge of terminology, knowledge of specific facts, knowledge of conventions, knowledge of trends and sequences, knowledge of classifications, knowledge of criteria, knowledge of methodology, knowledge of principles and generalisations, and knowledge of theories and structures.

Comprehension: Comprehension differs from knowledge in that it relates to the mental processes of organising and re-organising information for a particular purpose. It includes: translation, interpretation and extrapolation. Translation relates to the ability to translate (or decode) a communication from one format (or language) to another. Interpretation involves the explanation or summarisation of a communication. Whereas translation involves a mechanistic, part-for-part rendering of a communication, interpretation involves a more holistic re-ordering or re-arrangement of the information. Extrapolation involves extending trends or sequences beyond the given data to infer consequences or corollaries.

Application: This involves the use of knowledge and comprehension in specific situations for specific purposes. For example, knowledge of a specific programming language's syntax and data structures, together with an understanding of how to test programs, can be applied to the task of debugging code.

Analysis: Analysis involves the breakdown of a communication into its constituent parts so that the relationship between the elements is made clear. Analysis is intended to clarify or explain communications or processes. This cognitive skill includes the ability to: (1) analyse elements (identification of the components of the communication); (2) analyse relationships (the ability to check the consistency or accuracy of a hypothesis, and skills in comprehending the inter-relationships among different ideas or concepts); and (3) analyse organisational principles (the ability to recognise form and pattern in a communication, and the ability to recognise general techniques used within a subject area).

Synthesis: Synthesis involves combining the parts so as to form a whole. It involves combining and arranging parts or pieces of a communication to create something new. It may involve: (1) the production of a unique communication; (2) the production of a plan; and (3) the derivation of a set of abstract relations to represent physical phenomena.

Evaluation: Evaluation involves making judgements about the value of particular phenomena for given purposes. Evaluation is carried out using criteria and involves qualitative and quantitative judgements based on these criteria. The criteria may be given or created. This includes measuring the internal consistency of the communication using criteria such as: quality of writing, accuracy of the information contained within it, and consistency of argument; and measuring the external consistency of the communication, which requires the evaluator to have a detailed knowledge of the types of phenomena under review since it will be evaluated in terms of the general criteria that are applied to phenomena of this type.

Table 2: Bloom's Taxonomy

Bloom's Taxonomy is a hierarchy in that each category builds on the one below. For example,
application depends on comprehension, which in turn depends on knowledge. Or, to put it more
simply: you can't apply something until you understand it, and you can't understand it until you know
about it. It is worth noting that, in practice, every level of Bloom's Taxonomy can be reduced to
factual recall if the learner answers a question through rote learning. This is the principle behind
contemporary translation systems, which have no (real) concept of language and, instead, use AI
and big data techniques to translate text.


Figure 3: Bloom's Taxonomy as a hierarchy (from bottom to top: Knowledge, Comprehension, Application, Analysis, Synthesis, Evaluation)

A common error is to consider all knowledge to be less demanding than any comprehension.
Bloom's Taxonomy is a hierarchy only within a single subject domain. Understanding gravitational
waves (comprehension) is more intellectually demanding than using a spreadsheet to solve a routine
problem (application). But within a single subject domain, the hierarchy applies.

IDENTIFYING THE LEVEL OF A QUESTION

When attempting to identify the level of demand of a question (using Bloom's categories), the verb
in the question can provide a clue to the level.2

Level Verbs

Knowledge define, describe, label, list, name, recall, show, who, when, where

Comprehension compare, discuss, distinguish, estimate, explain, interpret, predict, summarise

Application apply, calculate, demonstrate, illustrate, relate, show, solve

Analysis analyse, arrange, categorise, compare, connect, explain, infer, order, separate

Synthesis arrange, combine, compose, create, design, formulate, hypothesize, integrate, invent,
modify, plan, write

Evaluation assess, compare, decide, defend, discriminate, evaluate, judge, justify, measure, rank,
recommend.

Table 3: Verbs associated with levels within Bloom's Taxonomy

So, for example, a question that commences "Define…" is likely to assess basic knowledge; a
question that begins "Discuss…" is likely to assess analytical and/or evaluative skills.

2 Note that the same verb can be used in different levels.
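
As a rough illustration of how the verb can serve as a heuristic (the mapping below is a partial, assumed lookup based on Table 3, not a definitive classifier; remember that the same verb can appear at more than one level):

```python
VERB_LEVELS = {
    "define": "Knowledge", "list": "Knowledge", "name": "Knowledge",
    "explain": "Comprehension", "summarise": "Comprehension",
    "calculate": "Application", "solve": "Application",
    "analyse": "Analysis", "categorise": "Analysis",
    "design": "Synthesis", "compose": "Synthesis",
    "evaluate": "Evaluation", "justify": "Evaluation",
}

def likely_level(question):
    """Guess the Bloom level from the question's opening verb."""
    first_word = question.split()[0].strip(",:").lower()
    return VERB_LEVELS.get(first_word, "Unknown")

print(likely_level("Define the term item bank."))      # Knowledge
print(likely_level("Justify your choice of method."))  # Evaluation
```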


DIFFICULTY AND DEMAND

Bloom's Taxonomy provides an indication of the demand of a question; it does not define its
difficulty. A question's demand is a measure of its intellectual requirements; its difficulty is a measure
of how hard it is to answer correctly. Although difficulty and demand are related (most demanding
questions are difficult), a question can have high demand and low difficulty, or low demand and high
difficulty. Asking a simple question about some esoteric piece of knowledge (the staple diet of most
TV quiz shows) is harder to answer correctly than asking a demanding (according to Bloom) question
about something that's commonly understood (such as evaluating road conditions when crossing the
road). So, merely climbing Bloom's hierarchy is no guarantee of difficulty.

QUESTION TYPES AND DEMANDS

Each question type can be related to one or more levels in Bloom's Taxonomy. While it's possible to
use any question type for any of Bloom's levels, some are better suited than others to specific levels,
as the following table describes.

True/False: While mostly used to assess knowledge, T/F questions can be used to assess knowledge, comprehension and application levels.

Matching: Again, mostly used to assess basic knowledge but can be used to assess knowledge and comprehension.

MCQ: MCQs are the most flexible type of SRQ and can assess all levels except synthesis and evaluation.

MRQ: MRQs can assess the same range of levels as MCQs (i.e. knowledge, comprehension, application and analysis) but have the potential to create more difficult questions within each category.

Ranking: Ranking questions are well suited to assessing application and analysis.

Assertion: Suitable for knowledge, comprehension and analysis.

Table 4 - Question types and demand

While certain question types are more suited to different levels of demand, that does not mean that
one type is necessarily harder than another. For example, multiple response questions are not
more difficult than multiple choice questions. They can be. But they can also be easier.

So, in theory, SRQs can assess all of Bloom's hierarchy except the top two levels (synthesis and
evaluation). However, in practice, it is uncommon to come across SRQs that assess anything other
than knowledge and comprehension. This is not an inherent limitation of their design. Assessing
higher order skills can be done but it is a time consuming and difficult task.


WRITING MULTIPLE CHOICE QUESTIONS

This section focuses on the construction of one type of selected response question the multiple
choice question. However, much of the advice is transferable to other forms of selected response
question.

Multiple choice questions (MCQs) are the most common type of SRQ; they're also one of the most
flexible, and one of the most difficult to write.

ANATOMY OF AN MCQ

A single, complete multiple-choice question is called an item. An item consists of a question and a
number of possible responses, from which the learner selects one. An MCQ has the following
structure.

Figure 4 - Architecture of an MCQ

Stem (or stimulus): the question or problem.


Options (or responses or alternatives): the list of possible answers.
Key: the correct (or best) answer.
Distractors: the incorrect alternatives to the key.3
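
To make the terminology concrete, here is a minimal sketch (the representation is an assumption, not prescribed by this guide) of an item as a data structure, using the stem and options from Example 6:

```python
from dataclasses import dataclass

@dataclass
class Item:
    stem: str              # the question or problem
    key: str               # the correct (or best) answer
    distractors: list      # the incorrect alternatives

    def options(self):
        """All possible answers, presented here in alphabetical order."""
        return sorted([self.key] + self.distractors)

item = Item(
    stem="Which one of the following psychological terms means holding two "
         "contradictory points of view about a subject?",
    key="Cognitive dissonance",
    distractors=["Delusion", "Dementia", "Dissociative disorder"],
)
print(item.options())
```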

WRITING MULTIPLE CHOICE QUESTIONS

There is no formula for writing high quality items. However, there is some guidance that aids their
construction.

THE ITEM
A vital aspect of writing good items (authoring) is to ensure that the question directly relates to
the associated learning objectives, and that it is clearly presented and free from unnecessary details.
A question should not be a test of reading ability; the focus must be on the learning outcome that it
is seeking to assess.

3 Note the spelling of "distractor", which is the US-English spelling rather than the International English spelling ("distracter").


Ensure that each item is related to a learning objective.


Ensure that the level of language is appropriate to the target cohort.
Assess one thing at a time (unless you intend to ask an integrative question).
One correct answer only.

Don't include unnecessary words.


Pre-test items whenever possible.
One of the most difficult aspects of writing an item is to ensure that there is only one correct
answer. Having more than one potentially correct answer is the most common complaint from
learners. It's a challenge to write an item with one unambiguously correct answer, which is not
subjective or context dependent (where the key is correct in some circumstances but not others). One
solution is to spell out the context, but this may make the item clumsy or give clues to the correct
answer. Another option is to use "best" or "most likely" in the stem; it's easier to argue that the key
is the most likely answer than the only answer.

Although the initial construction of questions has to be the work of an individual, it's vital that items
are reviewed prior to being used operationally. It's impossible for a single author to both write and
review items. SRQs are well suited to pre-testing, which means trying them out on learners before
using them. Pre-testing will confirm the item's suitability (or not). It also generates valuable data
about the question that can be used in item analysis (see Section 6).
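
As a brief preview of what that data supports (the 27% top/bottom grouping below is one widely used convention, assumed here rather than taken from Section 6), pre-test responses allow a facility value and a simple discrimination index to be calculated for each item:

```python
def facility(correct_flags):
    """Proportion of learners who answered the item correctly."""
    return sum(correct_flags) / len(correct_flags)

def discrimination(correct_flags, total_scores, fraction=0.27):
    """Facility among the highest-scoring learners minus facility among the lowest-scoring."""
    ranked = [flag for _, flag in sorted(zip(total_scores, correct_flags), reverse=True)]
    n = max(1, round(len(ranked) * fraction))
    return facility(ranked[:n]) - facility(ranked[-n:])

# One entry per learner: whether they got this item right, and their total test score.
flags = [True, True, False, True, False, False, True, False, True, True]
totals = [38, 35, 12, 30, 15, 10, 33, 18, 28, 25]
print(facility(flags), discrimination(flags, totals))  # 0.6 and 1.0
```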

STYLE GUIDE
Each item should follow an agreed house-style to provide guidance on language use. A style guide
for item writing would normally include advice about:

spelling
punctuation
use of emphasis
prose style
language.

For example, spelling advice would include the treatment of numbers (spelt in words or written as
digits?); punctuation advice would include information on the punctuation to use within options
(should they end with a period or without any form of punctuation?); emphasis rules would include
the use of bold and italics; prose style and language would provide general advice about the type
and level of language to be used.

THE STEM
It's best to phrase the stem as a self-contained question rather than a partial statement, although
the latter approach is neither uncommon nor invalid.

Try to phrase the stem as a complete question unless this is too contrived, in which case an
incomplete statement should be used.


Use clear, straight-forward language suitable for the target cohort.

Place necessary wording in the stem not in each of the options.

Avoid irrelevant or unnecessary information.

Avoid negative wording if possible or use negatives sparingly; always avoid double negatives.

Specify any standards implied.

Avoid subjectivity, e.g. "Which one of the following do you think is…".
Any words that would be repeated in each of the options should be included in the stem. Options
should not begin or end with identical words and phrases. The following examples illustrate this.

If the pressure of a certain amount of gas is held constant, what will happen if its volume is increased?
A The temperature of the gas will decrease.
B The temperature of the gas will increase.
C The temperature of the gas will remain the same.
Example 14: Repeated text in the options

Here is the same question with the repeating text removed.

If the pressure of a certain amount of gas is held constant, what will happen to the temperature if its volume is increased?
A Decrease.
B Increase.
C Remain the same.
Example 15: Repeated text removed

Avoid words like "could" and "would". For example, asking a learner "What would you do…" cannot
be answered incorrectly (since only the learner can know what s/he would do). Instead write: "What
should you do…". The following example illustrates a poor question.

A computer is running slowly. What could be responsible?


A Insufficient memory
B Over-heating
C Small hard drive
D Virus
Example 16: Subjective wording

The author intends D to be the correct answer but any of the options could be correct. Here is an
improved version.


A computer suddenly runs slowly without any known changes to its hardware or software. Which one of the following is
most likely to be responsible?
A Insufficient memory
B Over-heating
C Small hard drive
D Virus
Example 17: Subjective wording removed

Note the added contextual information in the stem to improve the clarity of the question and the
replacement of "could" with "most likely". Notice, too, the increased word length. This illustrates
how clarity can come at the cost of added length and complexity.

Specify any standards implied. If an item calls for a judgment, specify the authority or standard upon
which the correct answer is based.

According to the American Medical Association, the diet of the average American provides vitamins in amounts that are
what?
A Adequate for normal consumption.
B Inadequate for normal consumption.
C In excess of normal requirements.
D Variable in relation to individual requirements.
Example 18: Standards specified

The key to good stem construction is to keep the question (or statement) as short as possible,
consistent with providing sufficient information to unambiguously pose the question. But don't be
tempted to reduce the length of the stem by moving information into each of the options; this
complicates the question and increases the reading time. Negative wording is not prohibited but it's
better to word a question positively when this is possible. Double negatives should be completely
avoided, i.e. two negatives in the stem, or a negative in the stem and a negative in the options.
However, some questions can be made unnecessarily complex by avoiding a single negative, in
which case use a negative. When negatives are used, emphasise NOT (or whatever construct is
used) in the stem (or the options).

THE OPTIONS

Provide between three and five options; four options are most common.

Ordering of the options should follow a consistent and logical sequence.

The length of options should be comparable.

Options should be mutually exclusive.

Only one correct (or best) answer.

The one correct answer (key) should be actually correct.


The key should not be worded in a way that would make it likely to change over time.

Ensure that none of the distractors is conditionally correct (depending on circumstances or


context unless these are defined in the stem).

Do not create distractors that are too close to the key.

Don't use words such as "not", "never" or "always" to make an option incorrect.

"None of the above"/"All of the above" should be used sparingly (and, when used, should be the
correct answer some of the time).

Avoid pejorative language (such as "bad", "low", "ignore" etc.).

Avoid syllogistic reasoning, e.g. "Both A and B are correct".


Some of the advice is conflicting, such as "stems should be short and simple" and "move
information to the stem rather than repeat it in each option". Dealing with these tensions is the art
of item writing.

The advice about pejorative language is quite subtle. Any option that uses words such as "bad",
"low" and "ignore" is usually wrong; authors rarely use such words in the key.

At higher levels of understanding, it can be difficult to construct questions with one objectively
correct answer and it is a common error in such questions to offer options that include more than
one potentially correct answer. Careful wording ("Which one of the following is likely to be the best
answer…") can get around this potential problem.

SEQUENCING OPTIONS
The options within an item should follow a logical order. If using numbers or dates,
the options should be listed numerically or chronologically in ascending or descending order
(normally ascending). Text responses should be sorted alphabetically unless there is a natural
sequence to the options, in which case the natural sequence should be used. Do not order the
options to try to evenly distribute the answers (i.e. to ensure each option A, B, C and D is used
approximately the same number of times) nor attempt to avoid clustering keys (e.g. A-B-B-B-C).
Both of these strategies reduce the randomness of a test.

USE OF NONE OF THE ABOVE


"None of the above"/"All of the above" should be used sparingly. It is preferable to avoid the use of
these options. Studies have shown that they decrease item discrimination and test score reliability
(see Section 6). However, they can be used if they are:

used in several items in a test


sometimes the correct option (but not always)
not used after a negative stem
not used as padding (because you are short of ideas).


ADVICE ON WRITING DISTRACTORS


The quality of distractors has a big impact on the quality of the question. Distractors have a
particularly important role to play in formative assessment since their careful selection can provide a
wealth of diagnostic information about the learner's present understanding. In summative
assessment, carefully selected distractors can catch out unprepared (or under-prepared) learners.
Writing distractors, therefore, requires as much thought as writing the key.

Distractors should be plausible; do not use unrealistic or humorous distractors as this effectively
reduces the number of options.

Distractors should be plausible (no silly distractors), although some can be relatively weak.

Common misunderstandings make good distractors.

Incorrect paraphrasing of the question makes for good distractors.

Correct sounding distractors are good for poorly prepared learners.

True statements that do not answer the question are good distractors.
However, there is a balance to be struck between writing good distractors and trying to dupe
learners. Distractors should not entrap learners, that is, catch out learners through clever
wording, very fine distinctions or tricks-of-the-trade. If you want to write a difficult question then do
so through the knowledge and skills required to answer it, not by trying to trick the learner into
giving the wrong answer.

ADVICE ON AVOIDING CUEING


Cueing is the tendency for the stem (or the options) to imply the key. It is a common problem with
SRQs. The following question has only one option (A) which is grammatically correct (the stem ends
with "an" and only option A begins with a vowel).

A word used to describe a noun is called an


A adjective
B conjunction
C pronoun
D verb
Example 19: Cueing the answer

Here are some dos and don'ts to avoid this problem.

Ensure that the options flow from the stem, are in the same format and tense, and are
grammatically correct.

Don't allow the wording of the options to provide obvious clues to the correct answer.

Avoid the use of "always" and "never" in the options since these responses are rarely correct.


Avoid the use of "sometimes" and "often" in the options since these responses are frequently
correct.

Avoid using stereotypical language that could give away the answer.

Avoid using phrases from textbooks.

Avoid pejorative wording ("bad", "low" etc.) since these words are rarely used in the key.

Avoid absolute language such as "always", "never", etc. since these are rarely correct.

Avoid complex language in one option compared with other options (this option tends to be the
correct answer).

Avoid similar language in the stem and the options since the option with the most similar
language is most likely to be the key.

Avoid visual cueing, i.e. one option being much longer or standing out in some other way from
the other options; this option is likely to be the correct answer.
The length of options should be comparable. An option that stands out from the others can indicate
to a learner that it is the right answer. If different lengths are unavoidable then use two long options
adjacent to each other and two short options adjacent to each other.

DISCLOSERS
A concept associated with cueing is disclosing. A discloser is a question that contains the answer to
another question. Unless otherwise intended, every question should be independent of every other
question and should contain the minimum information required to answer the question. However, it
can happen that the stem or options in one question inadvertently help learners to answer another
question.

Disclosure is a particular problem in item banking since it is impossible to predict which items will be
included in a particular instance of a test (such tests are usually dynamically generated by a
computer and a computer is unlikely to spot the subtleties of disclosure).
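
One simple safeguard, sketched below (the approach and names are assumptions, not a description of any particular delivery system), is to record known disclosing pairs in the item bank and exclude them when a test instance is generated:

```python
import random

def generate_test(bank, disclosing_pairs, length):
    """Pick items at random, skipping any item that discloses an already-chosen item."""
    chosen = []
    for item in random.sample(bank, len(bank)):
        if all(frozenset({item, picked}) not in disclosing_pairs for picked in chosen):
            chosen.append(item)
        if len(chosen) == length:
            break
    return chosen

bank = ["Q1", "Q2", "Q3", "Q4", "Q5"]
disclosing_pairs = {frozenset({"Q1", "Q3"})}  # Q1's stem gives away the answer to Q3
print(generate_test(bank, disclosing_pairs, 3))
```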

WRITING MCQS FOR FORMATIVE ASSESSMENT

Formative assessment is fundamentally different from summative assessment. The purpose of
summative assessment (referred to as "assessment of learning") is to accredit (and, often, grade)
learning. It is primarily used for certification or progression purposes. The purpose of formative
assessment (referred to as "assessment for learning") is to improve learning. The critical aspect of
summative assessment is the learner's score. The critical aspect of formative assessment is the
learner's current understandings and misunderstandings.

The previous advice for writing good MCQs applies equally to formative questions. However,
formative and summative items differ in the following respects.


Each question is designed from a learning perspective rather than an assessment focus.
Marks are not normally awarded to formative items.
Selection of distractors is critical.
Feedback must be provided.

The first point affects the wording of questions, which may be explanatory in nature or provide
hints. The focus of the question is to improve learning, not to judge learning. Instead of marking each
response, detailed feedback is provided. The feedback should be customised to each response and
may include references to learning material to permit remediation when necessary. The selection of
distractors is, if anything, more vital in formative assessment than in summative assessment. They
must be carefully selected to elicit what learners know and don't know. Common errors and
popular misconceptions provide particularly good distractors in formative questions. Trick questions
are permissible if they seek to reinforce a learning objective.

The example below illustrates these points. The feedback is adjacent to each response.

The base of a number is the number of unique digits in the number system. For example, binary is base 2 and has two
digits (0 and 1). The number 100 is written in base two as 100₂. The following numbers have different bases. Which
one is the largest? Carefully consider the base of each number before deciding.

A 1011₂: This number looks the biggest but because it is base 2 it is small (11 in decimal). Base 2 numbers
require a lot of digits before they become large. Although this number has the most digits it is the
smallest number in the list. The correct answer is B, which is much larger than this number.

B 251₈: Correct! This base 8 number (169 in decimal) is much larger than the other numbers.

C 92₁₀: This is a decimal number (base 10). It is the second largest number in the list. The base 8 number
is larger when converted to decimal (169 in decimal).

D F₁₆: Although this has the largest base (base 16), it only has one digit and is the second smallest
number in the list (15 in decimal).
Example 20: Formative question

It is instructive to compare this with the equivalent summative question.

Which one of the following numbers is the largest?


A 1011₂
B 251₈
C 92₁₀
D F₁₆
Example 21: Summative question

As can be seen, there is an element of education in the formative question (the learner is reminded
of what is meant by 'base'). There is also a hint in the question ('Carefully consider the base of each
number before deciding'). Each response has its own unique feedback (including the correct response),
which tries to explain why that response is wrong (or right).

Formative assessments can be scored. Marking formative assessments permits an overall mark to be
derived, which can be used to determine whether the learner should progress to the next stage of
their learning or whether remediation is required before progression.

Writing formative items is significantly more demanding (in terms of time and effort) than writing
summative items. However, formative assessment has a significant impact on learner achievement.

WRITING QUESTIONS FOR HIGHER LEVEL SKILLS

Multiple choice questions (MCQs) have gained a reputation for being a 'quick-and-dirty' way of
assessing low level knowledge. However, they can also be used to assess higher level skills, although this
requires a great deal more effort on the part of the writer. This section explores the potential of
MCQs to assess higher level skills.

As has been previously stated, MCQs can be used to assess four of Bloom's cognitive levels:
knowledge, comprehension, application and analysis. This section explores a couple of techniques
for writing higher order questions, and exemplifies these against each level in Bloom's Taxonomy.

Writing MCQs to assess higher order skills contradicts some of the previous advice about writing
good items. For example, such questions often involve long stems; complex language is frequently
used; standards are often omitted (or the question becomes one of knowledge of the standard); and
they often require an element of judgement on the part of the learner (and, as a consequence, are
frequently 'best answer' type questions).

TECHNIQUES FOR WRITING HIGHER ORDER QUESTIONS

Writing higher level questions is easier in some subjects than others. Some fields, such as
mathematics, are based on problem solving; it is relatively straight-forward to produce questions that
assess more than knowledge and comprehension in these subject areas. In other fields it is more
difficult.

However, there are a few techniques that can be used to help authors produce more demanding
questions. We will look at two:

1. scenario questions
2. passage-based reading.

SCENARIO QUESTIONS
The main method of writing higher level questions is to present a scenario to learners and then pose
one or more related questions. The scenario can be anything from a paragraph to a page (although a
very long scenario really requires a significant number of follow-on questions to justify the time
required to read it). The associated question(s) may involve a range of cognitive abilities including
explanation (comprehension), interpretation (comprehension), prediction (comprehension),
calculation (application), problem solving (application), inference (analysis), categorisation (analysis)
and decision making (analysis).

Scenarios can be used in all subjects but are particularly suitable in the social sciences. Science
subjects are inherently well suited to problem solving and it is easier in these areas to pose
demanding questions without the need for lengthy scenarios.

The examples provided in this section are given without detailed comment. You are encouraged to
critically appraise each question yourself. When you do, you will realise that few (if any) items are
without their issues.

A scenario question has a straight-forward construction. It consists of some text, which may be
illustrated with a diagram or photograph, and one or more associated questions. The scenario can
take one of a number of forms including:

a description of a specific environment
a description of a specific situation
a description of a principle or theorem
a description of a problem
an explanation of an event
the results of an experiment
the results of research.

Most scenario questions involve an element of interpretation on the part of the learner. Learners
will take more time to process a scenario question as it requires a high level of reading ability. This
should be taken into account when determining the duration of a test (see Section 7).

The following example uses a single scenario and a number of linked questions of increasing
demand.

Raj and Sophie had two children (Ben aged 8 and Shazia aged 2) during a co-habiting partnership that lasted ten years.
Their relationship has ended. Sophie has now married John. Raj has agreed that the children can live with Sophie and
John for the time being.
For questions 1-3, the options are:
A Raj and Sophie.
B Raj, Sophie and John.
C Sophie and John.
D Sophie only.
E Raj only.
Example 22: Scenario question

1. Who is able to apply as of right (without leave) for a residence or contact order?
2. If Raj obtained a contact order to see the children every week, who would have parental
responsibility for the children?
3. If Section 8 orders are required in respect of the children, who could apply as of right (without
leave) for a Section 8 order?

PASSAGE-BASED READING
A second technique to help with the writing of higher order questions is to use passage-based items.
This involves presenting a passage of around 100 to 800 words and asking one or more linked
questions about the passage.

The passage can be narrative, argumentative or expository in nature. The questions can ask learners
about the meaning of words in the passage (vocabulary in context); ask questions about significant
information contained within the passage (literal comprehension); or measure learners' abilities to
analyse information as well as to evaluate the assumptions made and the techniques used by the
author (extended reasoning).

Psychoanalysis has been criticised on a variety of grounds by Karl Popper, Adolf Grünbaum, Mario Bunge, Hans
Eysenck, L. Ron Hubbard and others. Popper argues that it is not scientific because it is not falsifiable. Grünbaum
argues that it is falsifiable, and in fact turns out to be false. The other schools of psychology have produced alternative
methods of psychotherapy, including behavioural therapy, cognitive therapy, primal therapy and person-centred
psychotherapy.
An important consequence of the wide variety of psychoanalytic theories is that psychoanalysis is difficult to criticise as
a whole. Many critics have attempted to offer criticisms of psychoanalysis that were in fact only criticisms of specific
ideas present in one or more theories, rather than in all of psychoanalysis. For example, it is common for critics of
psychoanalysis to focus on Freud's ideas, even though only a fraction of contemporary analysts still hold to Freud's
major theses. (Wikipedia)
Example 23: Passage based question

A number of linked questions could be asked about this passage. For example, a vocabulary-in-
context question could ask about the meaning of a word such as 'falsifiable' or a term such as
'cognitive therapy'; a literal comprehension question could ask about the learner's understanding
of the passage (such as asking them to choose the best one-line summary of the passage); and a
number of extended reasoning questions could be posed (such as one asking about criticisms of
Freudian psychoanalysis).

ITEM ANALYSIS

One of the major advantages of selected response questions (SRQs) is that they can be easily
analysed. Item analysis permits a more scientific approach to assessment. If you know the properties
of each question (for example, how difficult it is or how well it separates learners of differing
abilities) then you can construct a better test.

Item analysis has garnered a (deserved) reputation for being complex and opaque, using statistical
techniques with no real world analogue. This has resulted in it (sometimes) being entirely avoided
even by test experts and awarding bodies. This section explores two straight-forward (and real)
methods of analysing items: (1) measuring their difficulty; and (2) measuring how well they separate
learners. The next section explains how these measures can be used to create tests.

FACILITY VALUE

The facility value (FV) of an item is a measure of its difficulty or, more accurately, its easiness. It
represents the proportion of learners who answer the item correctly, and is expressed as a decimal
fraction between zero and one. A FV of zero means that no-one answered the question correctly; a
FV of one means that everyone answered the question correctly; and a FV of 0.6 means that 60% of
the test takers answered it correctly.

The lower the FV, the more difficult the item; the higher the FV, the easier the item. A very easy item
might have a FV of 0.9 (meaning that 90% of learners are expected to answer it correctly) and a very
difficult item might have a FV of 0.1 (meaning that 10% of learners are expected to answer it
correctly).

Facility values should be assigned during pre-testing. Once a sample group of learners has attempted
the item (assuming that this sample is representative of the target cohort), an initial FV can be
assigned. If pre-testing is not possible (or, more likely, not feasible) a predicted facility value (PFV)
can be assigned by the test authors. Predicted FVs are assigned by subject matter experts (SMEs)
and represent the best guess of two or more SMEs. This initial estimate can be re-calibrated once
the item is used operationally.

Note that a FV is a relative measure of an item's difficulty: it is relative to the target cohort's age and
stage. For example, a simple addition question might have a low FV for Primary 2 pupils but a high
FV for Primary 4 pupils.

Also note that, in theory, any SRQ will have a minimum FV greater than zero. For example, a
true/false question will have a minimum FV of 0.5 (representing the 50-50 chance of guessing
the answer correctly) and an MCQ with four options will have a minimum FV of 0.25 (no matter
how difficult it is). However, in practice, some FVs may be lower than these values due to the
'gravitational pull' of well-designed distractors.

It is recommended that items with FVs greater than 0.9 are discarded (too easy); similarly FVs lower
than 0.1 should be avoided (too difficult).
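
The calculation itself is trivial to automate. The sketch below (Python) is purely illustrative and not part of any SQA procedure; the function names and discard thresholds are assumptions based on the 0.1-0.9 range suggested above.

# Illustrative sketch only: computing a facility value from scored responses
# (1 = correct, 0 = incorrect) and flagging items outside the 0.1-0.9 range.

def facility_value(responses):
    if not responses:
        raise ValueError("no responses supplied")
    return sum(responses) / len(responses)

def flag_item(fv, too_easy=0.9, too_hard=0.1):
    if fv > too_easy:
        return "discard: too easy"
    if fv < too_hard:
        return "discard: too difficult"
    return "retain"

# Example: 60 learners attempt an item and 18 answer it correctly.
responses = [1] * 18 + [0] * 42
fv = facility_value(responses)
print(fv, flag_item(fv))        # 0.3 retain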

DISCRIMINATION INDEX

The discrimination index (DI) of an item is a measure of how well that item separates (discriminates
between) learners. It relates each learner's test score to his/her performance on a specific item,
and then compares the top candidates with the bottom candidates.

For example, if 30 candidates attempt an item, the DI compares the performance of the top third
(top ten) of the learners (in terms of their overall test scores) with the bottom third (bottom ten) of
learners (the proportion can vary: DIs can be based on top/bottom thirds, quarters or other fractions).
If eight of the top ten answered the item correctly and two of the bottom ten answered
it correctly then the item's DI is:

DI = (8-2)/10 = 6/10 = 0.6.

DI values range from +1 (all of the top learners answered it correctly and none of the bottom
learners) to -1 (all of the bottom learners answered it correctly and none of the top learners); a DI
of zero means that the same number of top and bottom learners answered it correctly. A positive DI
(which shows some discrimination) is essential. If an item yields a zero or negative DI, discard it. The
above example illustrates good discrimination. It is recommended that an item has a DI of at least 0.2;
items with DI values of 0.4 and above are considered to have good discrimination.
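
As a minimal sketch of the calculation just described (assuming results are held as pairs of overall test score and item correctness), the discrimination index can be computed as follows; the function name and sample data are illustrative assumptions.

# Illustrative sketch only: discrimination index from top/bottom thirds.
# Each result is a (total_test_score, answered_item_correctly) pair.

def discrimination_index(results, fraction=3):
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    group_size = len(ranked) // fraction
    top = ranked[:group_size]
    bottom = ranked[-group_size:]
    top_correct = sum(1 for _, correct in top if correct)
    bottom_correct = sum(1 for _, correct in bottom if correct)
    return (top_correct - bottom_correct) / group_size

# The example from the text: 30 candidates, 8 of the top ten and
# 2 of the bottom ten answer the item correctly.
results = ([(30 - i, i < 8) for i in range(10)]     # top ten
           + [(15, False)] * 10                     # middle ten
           + [(10 - i, i < 2) for i in range(10)])  # bottom ten
print(discrimination_index(results))                # 0.6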

Discrimination indices cannot be predicted. They must be derived through pre-testing or operational
use.

There is a link between a questions facility value and its discrimination index. In general, items with
low facility values will have high discrimination values (in other words, difficult questions separate
learners). However, not all questions with low FVs will have high DIs. A poorly designed question
that is difficult to answer due to lack of clarity or inappropriate language may have a low FV and low
discrimination (low ability learners are as likely to get it right as good learners).

The following example illustrates the facility value and discrimination index for a specific question.
The item was designed to assess the mathematical knowledge of learners in the early stages of
secondary school education. It was pre-tested on 60 learners, of whom 18 answered it correctly; 15
in the top third and three in the bottom third.

If the radius of a circle increases by 20%, which one of the following represents the corresponding increase in the circles
area?
A 40%
B 44%
C 120%
D 144%
Example 24: Facility values and discrimination indices


This gave the following item analysis:

FV = 18/60 = 0.30
DI = (15-3)/20 = 0.60

This item is difficult. Given that blind guessing would produce a one-in-four chance of answering it
correctly (FV=0.25), the recorded FV of 0.30 (representing 30% of the sample) is very low. It also
discriminates well, meaning that it is likely to separate learners and aid grading.

It is worth noting that this item is slightly cued. The digits '44' appear twice in the options (in B and D),
which might encourage some candidates to assume one of these options is correct (which would be
a correct assumption: the key is B). This could have been avoided by selecting a different value for
D (such as 160%).

OTHER METRICS

There is a range of other metrics that can be calculated for SRQs. Most are complex and, unlike
facility values and discrimination indices, have no real meaning. However, the distractor pattern
provides useful information about which of the options learners choose. For example, the following
distractor pattern illustrates the choices made by 100 learners for the previous question.

Option    Frequency of selection
A         15
B         40
C         10
D         35

This distribution would imply that distractors A and C are under-performing and need to be
strengthened. It might also indicate that distractor D is too strong and may require weakening. It
would appear that this question comes down to a straight choice between options B and D for most
learners.

There isn't a perfect distribution for the options, but options that are rarely selected or a distractor
that is more popular than the key warrant attention.
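
A simple routine can apply these rules of thumb automatically. The sketch below is illustrative only; the 20% 'weak distractor' threshold is an assumption chosen to match the commentary above, not a published standard.

# Illustrative sketch only: reviewing a distractor pattern.

def review_distractors(frequencies, key, weak_threshold=0.2):
    total = sum(frequencies.values())
    notes = []
    for option, count in frequencies.items():
        if option == key:
            continue
        if count / total < weak_threshold:
            notes.append(f"distractor {option} is rarely chosen - strengthen it")
        if count > frequencies[key]:
            notes.append(f"distractor {option} is more popular than the key - review the item")
    return notes

# The distractor pattern above (key = B).
pattern = {"A": 15, "B": 40, "C": 10, "D": 35}
print(review_distractors(pattern, key="B"))   # flags A and C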

Item analysis provides a means of improving item banks by identifying weak (under-performing)
items and eliminating them. The initial calibration of items can be done formally (through field
testing items prior to their use) or informally (using predicted facility values, for example) and these
initial values can be re-calibrated once the items are used in earnest. However, to be effective, item
banks need a mechanism to identify and replace under-performing items.

CONSTRUCTING TESTS

AUTHORING TESTS

This section looks at the process of combining questions into a test. The following diagram illustrates
the test generation procedure.

Test specification → Assemble test team → Authoring event → Item bank

Figure 5 - Test generation procedure

TEST SPECIFICATION
The test specification is the document (or blueprint) that defines the precise nature of the test.
The test specification will include the following information:

description (including links with source unit(s) and outcome(s))
question format(s)
number of questions
duration
rubric (including the marking scheme)
pass mark (including grade boundaries where applicable)
conditions of assessment.

A sample test specification is provided in the appendices.

The description of the test must (at a minimum) define the learning objectives that the test is
seeking to measure.

The question format defines the type of question that the test will employ. This might be true/false,
matching, multiple choice or multiple response, or a mix of these types. For example, a test might
use 15 MCQs and 5 MRQs; the test spec should spell this out.

The number of questions is self-evident but note that where more than one question type is
employed, the spec should specify the number of each type.

The duration of the test will depend on the number of questions and the complexity of the
questions. Simplistic formulas for the duration of a test ('two minutes per question') should be
avoided. Scenario questions, in particular, take time to read, assimilate and answer. The duration
should be based on a typical learner undertaking a typical test. If in doubt, err on the side of
generosity unless speed of response is a critical aspect of the assessment.

The rubric defines the marking scheme and provides instructions to learners. Setters may adopt a
simple marking scheme (one mark per question) or more complex schemes (involving one, two or
more marks for each item depending on its importance or complexity). Simple marking schemes are
recommended. This section should also provide any special instructions for learners.

The pass mark (or cutting score) is the minimum mark that learners must gain in order to achieve a
pass in the test. There are a number of techniques for setting pass marks, some of which are
discussed later in this section. Fifty percent is rarely an appropriate cut score for an objective test
(due to the effects of guessing; see below) and neither is a heuristic such as 'the pass mark is 70%
for all objective tests'.

If a test is graded (beyond the basic pass/fail threshold), the grade boundaries must be defined. The
grade boundaries define the marks required to gain an A or B or C pass. For example, a C pass might
require a total score between 60% and 74%, B between 75% and 89%, and an A pass 90% or more.

Finally, the test spec should describe any special conditions that have not already been described
elsewhere in the specification. Examples include: access to reference material (Is the assessment
open book? Or open web?) and permitted materials (such as calculators or special instruments).
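
Where specifications are stored electronically, the blueprint above maps naturally onto a simple record. The sketch below is illustrative only; the field names and example values are assumptions, not an official SQA template.

# Illustrative sketch only: a test specification as a data structure.
from dataclasses import dataclass, field

@dataclass
class TestSpecification:
    description: str                     # links to source unit(s) and outcome(s)
    question_formats: dict               # e.g. {"MCQ": 15, "MRQ": 5}
    duration_minutes: int
    marks_per_question: int = 1          # simple marking schemes are recommended
    pass_mark: int = 0                   # cutting score, set using a recognised method
    grade_boundaries: dict = field(default_factory=dict)   # e.g. {"C": 60, "B": 75, "A": 90}
    conditions: str = ""                 # open book, permitted materials, etc.

spec = TestSpecification(
    description="Outcome 1: number systems",
    question_formats={"MCQ": 15, "MRQ": 5},
    duration_minutes=40,
    pass_mark=14,
    grade_boundaries={"C": 60, "B": 75, "A": 90},
    conditions="Closed book; calculators not permitted",
)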

ASSEMBLING THE TEST TEAM


The test team is responsible for constructing the test, using the test specification as a blueprint. This
team will normally consist of a test expert and a number of subject matter experts (SMEs). The SMEs
should have prior knowledge and experience of writing SRQs. The size of the team will depend on a
number of factors such as the number of items required and the time available to write them.

Subject matter experts may need training in the construction of selected response questions. This
can be done at the authoring event (see below) or, prior to this event, at a specific training event.

AUTHORING EVENT
Due to the collaborative nature of item writing, it is recommended that questions are produced over
a short period of intensive activity. For example, a team of four SMEs might be asked to produce 200
items over an intensive working weekend. A suggested workflow during the authoring event is
provided in Figure 6.

Authors need to be clear about the learning objectives (outcomes) that they are to assess. Where
more than one learning outcome is to be covered by an individual SME, the number of questions for
each outcome should be agreed. Each author's targets should also include the types of question and
number of each type of question (for example, twenty multiple choice questions and ten multiple
response questions), the average facility value for their set of questions (see below), and the
expected productivity rate (for example, eight items per hour).

Writing items is a solitary activity. Although authors may seek advice when they write questions, the
act of putting pen to paper (or, more likely, finger to keyboard) is an individual task. Authors should
be provided with a question template before commencing. This template defines the precise format
of the question and will include metadata about the item (such as the associated keywords and its
predicted facility value).

If the items are being written for a test with a known pass mark, authors will need to know the
target facility value (FV) to aim for. For example, if the writers are producing items for a test with a
pass mark of 15/20 then the target FV will be 0.75 and each author should ensure that each batch of
questions has an average FV of 0.75 (so that the overall item bank has the correct average FV).
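
A quick check of a batch against its target is straightforward to automate. The sketch below is illustrative only; the tolerance of 0.05 and the sample predicted FVs are assumptions.

# Illustrative sketch only: checking a batch of items against a target average FV.

def batch_on_target(predicted_fvs, target=0.75, tolerance=0.05):
    average = sum(predicted_fvs) / len(predicted_fvs)
    return average, abs(average - target) <= tolerance

batch = [0.9, 0.8, 0.7, 0.75, 0.6, 0.85, 0.7, 0.75]
print(batch_on_target(batch))        # (0.75625, True)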

Allocate learning outcomes to SMEs → Agree targets with SMEs → Write item → Add item to batch →
Locate multimedia → Completed batch? (No: write the next item; Yes: pass the batch to the reviewer) →
Reviewer checks batch → Accept item? (Yes: add the item to the item bank; No: revise or discard the item)
Figure 6 - Authoring event workflow

Authors should batch items before passing a group of questions to a designated reviewer for
checking. The reviewer will then consider each item and do one of three things: (1) accept it without
change; (2) accept it with revisions; or (3) reject it. It is unlikely that the author and reviewer will be
unable to reach a compromise about a disputed item but, in such cases, the test expert should make
the final decision. Reviewing is best done 'blind' (i.e. without knowing the identity of the author) to prevent
personality conflicts from interfering with the process. While group reviewing is a good means of
training writers and reviewers, it is an inefficient way to create large numbers of items.

The output from the authoring event will be an item bank of approved and calibrated items. Target
setting and regular milestones will play an important part in ensuring a successful outcome. At
various points during the event, the co-ordinating person should convene review meetings at which
progress can be measured, and problems or bottlenecks can be collectively identified and
addressed.

DETERMINING TEST LENGTH

Determining the number of questions to include in a test is an important decision. The length of a
test has a direct relationship with the test's reliability: the longer the test (and, by implication, the
more questions in the test), the more reliable that test will be as a measure of the learner's ability.

There are a number of factors that affect test length including:

the importance of the test
the size of the domain being assessed
the range of knowledge and skills contained within the domain
the time available.

A high stakes test needs to be more reliable than a low stakes test and therefore needs to be
longer. However, the improvement in reliability levels off after a certain number of questions.

The number of learning objectives being assessed also has a bearing on the size of the test. A test
that assesses several outcomes (or one large outcome) will obviously require more items than one
that assesses fewer outcomes (or smaller outcomes). However, even a test that assesses a single
outcome may require lots of questions if that outcome covers a broad range of knowledge and skills.

And, finally, the time available needs to be considered. There is no point in designing a test with 60
questions, requiring two hours to complete, if this is not feasible in centres.

There is no formula for test length. Criticality, domain size and practical considerations need to be
balanced. However, in most instances, it is best to keep tests as short as possible to reduce the
assessment burden on teachers and learners.

TECHNIQUES FOR SETTING PASS MARKS IN OBJECTIVE TESTS

There are a number of ways to set a pass mark. We will look at four methods:

1. informed judgement
2. initial pass mark
3. Angoff method
4. contrasting groups

Some are more scientific than others but, no matter which method is used, none of them replaces
the need for human judgement.

INFORMED JUDGEMENT
This technique involves the most human judgement and, as a consequence, is the most subjective
method of determining pass marks. At its most basic level, informed judgement simply involves the
opinions of the setting team. These subject matter experts (SMEs) agree a common sense pass
mark based on their expert judgement and the following considerations:

the minimum mark achievable through guessing
the criticality of the judgement being made about learners
the complexity of the subject domain
the difficulty of the test items
the age and stage of the learners.

No matter how little a learner knows, s/he is unlikely to score zero marks in an objective test due to
the effects of guessing. For example, in an objective test consisting of 100 multiple choice items,
each with four options, blind guessing should produce a minimum mark of 25% (representing the
one in four chance of guessing the correct answer to each question). For this reason, the pass mark
in an objective test is usually higher than 50%.

If there is an existing item bank, the difficulty of the items in the bank can be used to determine the
pass mark. For example, if we know that an item bank contains difficult questions then that would
result in a lower pass mark; conversely, a simple item bank would lead to a higher pass mark.
Associated with this is the complexity of the subject domain.

In practice, informed judgement would be based on all of these considerations some of which may
drive the pass mark up and some may push it down. For example, an undergraduate true/false test
for medical students may have a significantly higher pass mark than a multiple response test for
primary children.

The initial judgement may be refined after further consultation or pre-testing. For example,
practising teachers may be asked for their views on the proposed pass mark; and/or the
assessment may be field-tested and the pass mark adjusted in the light of the resulting scores.

The remaining methods are more scientific than informed judgement since they compute their
respective pass marks.

INITIAL PASS MARK


A simple formula to determine an initial pass mark in an objective test is given below.

Initial Pass Mark = Guessed marks + (Remaining marks x True pass mark)

Or: IPM = G + (R x T).

Put more simply, work out the number of marks that will typically be achieved through guessing (G)
and add to that the number of marks remaining (R) multiplied by the true pass mark (T). For
example, a multiple choice test comprising 16 questions, each with four options, with a true pass
mark of 50%, would have the following pass mark.

IPM = 4 + (12 x 0.5) = 10.

Learners would be expected to achieve 4 marks through blind guessing (G=4). If a true 50% pass
score is applied (T=0.5) to the remaining marks (R=12) then you would expect learners to get six of
these correct, giving a pass mark of 10 (out of 16). Once determined, this pass mark can be adjusted
upwards or downwards, depending on the professional judgement of teachers.

This method can be applied to any objective test. For example, a true/false test consisting of 40 questions
would have a G (guess) value of 20 and an R (remaining) value of 20; applying a true pass mark of 50%
to the remaining marks gives 10 more marks, resulting in a pass mark of 30.

While the proportion of marks achieved through guessing will always be known in an objective test,
the true pass mark (50% in each of the above examples) can be adjusted. For example, in a high
stakes assessment of medical knowledge, consisting of 50 MCQs, each with five options, the true
pass mark could be raised to 60%, giving the following initial pass mark.

IPM = 10+24 = 34.

This method provides a simple way of setting an initial pass mark, which can be refined once the
items are actually used by learners.
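
The formula is easy to automate. The sketch below is illustrative only and simply reproduces the three worked examples in this section.

# Illustrative sketch only: the initial pass mark formula IPM = G + (R x T).

def initial_pass_mark(num_questions, options_per_question, true_pass_mark):
    guessed = num_questions / options_per_question   # marks expected from blind guessing
    remaining = num_questions - guessed
    return guessed + remaining * true_pass_mark

print(initial_pass_mark(16, 4, 0.5))   # 10.0 (16-item MCQ test, true pass mark 50%)
print(initial_pass_mark(40, 2, 0.5))   # 30.0 (40-item true/false test)
print(initial_pass_mark(50, 5, 0.6))   # 34.0 (50-item medical MCQ test, true pass mark 60%)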

ANGOFF METHOD
This method involves aggregating the facility values (FVs) for each item and estimating the pass mark
based on this figure. The following example illustrates this method.

Question    FV
1           0.8
2           0.6
3           0.6
4           0.3
5           0.4
Total       2.7

Pass mark   3/5

Table 5 - Setting pass marks using Angoff

Recall that the facility value is a measure of the probability (between 0 and 1) of learners answering
the question correctly. For example, based on the above table, there is an 80% probability that
learners will answer question 1 correctly (FV=0.8). Adding the FVs for each question, therefore,
provides an indication of the total score that a learner should achieve (in this case 2.7). Subject
matter experts would then either round this value down or up using their professional judgements
(in this case the aggregate FV was rounded up). The resulting pass mark for this test is three out of
five.

In practice, pass marks (for tests) are usually defined in the test specification, and the task,
therefore, becomes one of selecting questions with FVs that add up to this pass mark. We effectively
reverse engineer the Angoff method. For example, if the test specification defines a pass mark of
7/10 then the test should consist of questions whose FVs add to seven (give or take a decimal place).
This is a very simple task for a computer.
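
The calculation in both directions is trivial to automate. The sketch below is illustrative only; the half-mark tolerance is an assumption.

# Illustrative sketch only: the Angoff aggregate and its "reverse" use.

def angoff_pass_mark(facility_values):
    # SMEs round the aggregate up or down using their professional judgement.
    return sum(facility_values)

def matches_specification(facility_values, specified_pass_mark, tolerance=0.5):
    # Check that a draft test's aggregate FV is close to the specified pass mark.
    return abs(sum(facility_values) - specified_pass_mark) <= tolerance

fvs = [0.8, 0.6, 0.6, 0.3, 0.4]                 # the five items in Table 5
print(round(angoff_pass_mark(fvs), 2))          # 2.7, rounded up by the SMEs to 3/5
print(matches_specification(fvs, 3))            # True (within half a mark)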

Care must be taken when using the Angoff method to compute pass marks. If the facility value for
each item is based on the probability of a typical learner answering the question correctly then the
aggregate facility value (the pass mark according to the Angoff method) is the score you would
expect from average learners; getting less than this simply tells you that the learner is below this
norm. If the test is meant to be a judge of competency then the FVs would have to be based on
minimally competent learners before an accurate judgement could be made.

CONTRASTING GROUPS
This method, unlike the previous ones, requires pre-testing. The test is issued to two groups of
learners: one group who are expected to pass and one group who are expected to fail. The
resulting scores are then plotted on a chart and the intersection of the graphs provides an initial
pass mark. This initial pass mark is then refined using the SMEs' expert judgement.

The graph below illustrates the results for two groups of learners. One group (the blue line) is
expected to fail and one group (the red line) is expected to pass.

[Chart: two overlapping score distributions, one per group; x-axis: Marks (0 to 100), y-axis: No of candidates]

Figure 7 - Setting pass marks using contrasting groups

The initial cut score would be around 55% (the approximate intersection of the two distributions).
Raising this to 60% would reduce the number of incompetent learners who would pass the test
but increase the number of competent learners who would fail. Conversely, decreasing the pass
mark to 50% would reduce the number of 'false fails' but increase the number of 'false passes'.
The final decision is based on the professional judgement of the SMEs.
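
Where the two sets of scores are available as data rather than as a chart, the intersection can be approximated numerically. The sketch below is illustrative only; it chooses the cut score that minimises misclassifications (false passes plus false fails), and the score lists are invented for demonstration.

# Illustrative sketch only: estimating an initial cut score from two contrasting groups.

def initial_cut_score(expected_fail_scores, expected_pass_scores):
    best_cut, fewest_errors = None, None
    for cut in range(0, 101):
        false_passes = sum(1 for s in expected_fail_scores if s >= cut)
        false_fails = sum(1 for s in expected_pass_scores if s < cut)
        errors = false_passes + false_fails
        if fewest_errors is None or errors < fewest_errors:
            best_cut, fewest_errors = cut, errors
    return best_cut

fail_group = [30, 35, 40, 45, 50, 52, 55, 58, 60, 65]   # expected to fail
pass_group = [50, 55, 58, 60, 65, 70, 75, 80, 85, 90]   # expected to pass
print(initial_cut_score(fail_group, pass_group))        # 53 for these invented lists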

These methods can be used alone or in combination. They all provide a scientific basis to the process
of setting the pass mark.

DEALING WITH GUESSING

Guessing is often cited as a serious problem with selected response questions, and it is true that
blind guessing can produce relatively high marks. For example, blind guessing in a true/false test
should produce a result of approximately 50%. However, there are well established ways of dealing
with guessing. These are: pass mark setting, negative marking, correction-for-guessing and
confidence levels.

SETTING AN APPROPRIATE PASS MARK

The simplest way of dealing with guessing is to adjust the pass mark accordingly. Instead of the
traditional 50% pass mark, the cut score can be made higher to compensate for the effects of
guessing. For example, a multiple choice test that has a pass mark of 70% is unlikely to be passed by
blind guessing. We have already seen four ways of determining the pass mark for an objective test
(informed judgement, initial pass mark, Angoff method and contrasting groups). Any of these
methods will eliminate (or greatly reduce) the effects of guessing.

NEGATIVE MARKING

Negative marking involves deducting marks for incorrect answers. For example, the following table
illustrates a candidate's scoring pattern in a five-item test where one mark is awarded for a correct
answer, zero marks where a question is not answered, and one mark is deducted for an incorrect
answer.

Question Mark
1 1
2 1
3 0
4 -1
5 1
Total 2
The main problem with negative marking is that it penalises partial knowledge. Selecting a good
distractor is better than choosing a bad distractor but both choices will result in the loss of a
mark.
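
Scoring under this scheme is simple to express in code. The sketch below is illustrative only and reproduces the five-item example above.

# Illustrative sketch only: negative marking (+1 correct, 0 unanswered, -1 incorrect).

def negative_marking_score(responses):
    marks = {"correct": 1, "unanswered": 0, "incorrect": -1}
    return sum(marks[r] for r in responses)

print(negative_marking_score(
    ["correct", "correct", "unanswered", "incorrect", "correct"]))   # 2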

CORRECTION-FOR-GUESSING

This technique involves deducting a certain number of marks from every learner to compensate for
the effects of guessing. The number of marks deducted can be worked out in a number of ways,
ranging from the crude (a fixed number of marks deducted from every learner) to the more
sophisticated (when the number of marks deducted is not fixed and is based on an estimate of how
many guesses each learner has made). An example of the second approach follows.

In a 50 item test, where each item is a multiple choice question consisting of four options (a key and
three distractors), a learner scores 38/50. The proportion of marks deducted is based on the number
of incorrect answers (which are assumed to be guesses) and is worked out as follows:

No. of marks to deduct = No. of wrong answers x (1/No. of distractors)

In this case:

No. of marks deducted = 12 x 1/3 = 4 marks.

So, four marks would be deducted from this learner giving an adjusted score of 34.
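
The deduction is easy to compute. The sketch below is illustrative only; it treats every non-scoring item as a wrong answer, as in the worked example.

# Illustrative sketch only: correction-for-guessing.
# marks deducted = wrong answers x (1 / number of distractors)

def corrected_score(raw_score, num_questions, options_per_question):
    wrong = num_questions - raw_score
    deduction = wrong * (1 / (options_per_question - 1))
    return raw_score - deduction

print(corrected_score(38, 50, 4))   # 34.0 (the worked example above)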

While less crude than negative marking, this method suffers from similar problems: it penalises
partial knowledge as much as no knowledge, and disproportionately affects low risk-takers who will
choose not to attempt a question rather than answer it incorrectly for fear of losing marks, resulting
in many unanswered questions and deflated scores.

CONFIDENCE LEVELS

This technique for dealing with guessing is a mix of negative marking and correction for guessing.
The principle is to require learners to state their confidence in their responses and adjust the
marks accordingly.

Table 6 illustrates a possible distribution for marks.

Confidence       Marks (correct answer)    Marks (incorrect answer)
Confident                  3                         -2
Unsure                     2                         -1
Not confident              1                          0
Table 6: Confidence levels

A learner who confidently selects the correct answer gains the most marks (3); a learner who
confidently selects an incorrect answer loses the most marks (-2); a learner who selects an incorrect
answer but is not confident about his/her answer scores zero.

Although the introduction of confidence levels complicates scoring, it distinguishes between learners
who guess the correct answer and those who know the correct answer. The increased complexity is
not an issue when the scoring is carried out by computer, and its ability to distinguish between
responses significantly improves differentiation between learners.

Confidence levels can be applied to all types of selected response questions.
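
Scoring with confidence levels is also straightforward to automate. The sketch below is illustrative only and implements the marks in Table 6.

# Illustrative sketch only: scoring with the confidence levels in Table 6.

CONFIDENCE_MARKS = {
    "confident":     {"correct": 3, "incorrect": -2},
    "unsure":        {"correct": 2, "incorrect": -1},
    "not confident": {"correct": 1, "incorrect": 0},
}

def confidence_score(responses):
    # responses: list of (confidence, "correct" or "incorrect") pairs
    return sum(CONFIDENCE_MARKS[confidence][outcome]
               for confidence, outcome in responses)

print(confidence_score([("confident", "correct"),
                        ("unsure", "incorrect"),
                        ("not confident", "correct")]))   # 3 - 1 + 1 = 3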

MIXING AND SEQUENCING QUESTIONS

An objective test can mix question types. Learners have no particular difficulty in answering different
question types within the same test. It is particularly common to mix MCQs and MRQs. However, it
is not advisable to mix too many question types in the same test.

When mixing question types, avoid crude rules such as '1 mark for an MCQ, 2 marks for an MRQ and
3 marks for an assertion/reason'. As previously discussed, question type is not a proxy for difficulty
or demand. In fact, it is best to avoid mixing the marks for questions within an objective test; mixing
marks causes more confusion among learners than mixing question types.

When deciding the order of items in a test, bear in mind that tests should start with
relatively simple questions and progress to more complex questions. It is also advisable (but not
vital) to group item types together, for example all true/false items followed by all MCQs. So, in most
cases, it is advisable to begin with straight-forward, low-difficulty true/false questions and progress
to more complex, higher-difficulty MRQ or assertion/reason items.
