
Psychology Learning and Teaching 8(2), 30–36

Reports
High-level multiple choice questions in advanced psychology modules
Richard M. Wilkie¹, Clare Harley and Catriona Morrison
University of Leeds, UK

Traditional approaches to assessing students assume that multiple choice questions (MCQs) are
adequate only for assessing basic, low-level knowledge at the early stages of the higher education
(HE) curriculum. Increasingly, however, teachers in HE across a variety of subject areas are keen
to explore the opportunities for developing higher-level MCQ formats for assessing more advanced
stages of the curriculum. This has many benefits: students are unable to question-spot and are
required to demonstrate a breadth as well as a depth of knowledge; tests can be administered
electronically; and because marking can be instantaneous, feedback can be delivered more quickly
than in a traditional paper-and-pencil assessment, without an onerous marking load for staff. Here
we report the use of high-level MCQs (hMCQ) in Level 2 of our BSc Psychology programme. We
demonstrate the success of this format in differentiating between students, and highlight important
factors in designing questions. We argue that this type of examination format offers an assessment
that discriminates between students and which can be simply evaluated to ensure there is a
suitable fit between the questions that make up the assessment tool and the student population
under evaluation.

Introduction

Multiple choice questions (MCQs) have become increasingly popular as the assessment method of choice for psychology courses in UK HE institutions. This position has been driven by a desire to ensure students cover the full curriculum in their exam revision, while avoiding an overly onerous marking task for staff. Where test banks need to be set up, the staff workload is fairly intensive; however, once those test banks exist they can be used and reused, sampled from year after year, with departments building up useful statistics on which questions best distinguish the knowledgeable and less knowledgeable students. In most cases papers will be machine-read, and increasingly, tools such as "Question Mark Perception" are available to administer tests online that can offer instant marks and even feedback on individual questions. Because there is no marking load, this form of examination gracefully scales to any size of student cohort. This is highly appealing for courses where the push to increase student numbers has resulted in modules taking around 200 students.

For all their popularity, MCQs are not a panacea for assessment problems. Their major drawback is that they are widely regarded as being able only to measure rudimentary, descriptive knowledge. In terms of Bloom's taxonomy (Bloom, 1956), MCQs are often considered to test simple concrete abilities such as knowledge and comprehension, but are not regarded as offering the capacity to assess higher-level skills such as analysis, synthesis or evaluation.

Recently this view has been challenged, with exam setters attempting more creative assessment methods. Higher-level MCQs (hMCQs) use a format that requires students to select an answer or answers from a list of choices, but also requires students to engage in higher-level learning processes such as synthesis. For example, Connelly (2004) reported the use of assertion reason question (ARQ) examinations in a Master of Business Administration course. In this exam format students read both an assertion and a reason statement and select, from a choice of answer combinations, the one that states whether each statement is true or false. An example of an ARQ used in a business school is shown in Table 1. Connelly showed that ARQ tests were good substitutes for more conventional MCQs/short-answer questions, and that ARQ scores were a good predictor of student performance in essays. Connelly concluded that MCQs of this sort did tap deep learning and provided the opportunity to assess high-level knowledge at advanced stages of the HE curriculum.

¹ Correspondence concerning this report should be addressed to the first author at: Institute of Psychological Sciences, Faculty of Medicine and Health, University of Leeds, Leeds LS2 9JT, UK. Email: r.m.wilkie@leeds.ac.uk

An additional criticism of MCQs has been that they involve no active engagement on the part of the student and the task is not considered "authentic", "worthy" or "intellectual" (Wiggins, 1990). Supporters of MCQs, however, argue that high-level questions of the ARQ type do require evaluation, synthesis and critical analysis on the part of the student, and as such they are able to achieve the highest level of assessment possible in terms of Bloom's taxonomy (Bloom, 1956).

The Assessment Context

In a recent set of examinations we trialled a new format of questions for the BSc Psychology Level 2 Perception and Language module. Originally the exam format had consisted of six short answer questions (SAQs) followed by one long answer question (LAQ) (chosen from three possible essay titles), with 60 minutes recommended for each section.

The SAQs were deemed unsatisfactory, so after exploring a range of formats we replaced the SAQs with a set of 12 hMCQs. By doubling the number of questions we expanded the range of topics covered and were able to introduce differing levels of complexity in the questions. The time to complete each question was related to its complexity; we therefore weighted the questions by assigning more marks to the more complex questions. Twelve questions were designed, worth three, five or seven marks, with four questions in each mark group. In total 60 marks were available, and students were given 60 minutes to complete the section. As with previous years, the second part of the exam was to answer one long answer question, selected from a choice of three essay titles.

It should be made clear that the hMCQs used were a step beyond those used to examine factual understanding. For example, at BSc Psychology Level 1 an example of an MCQ worth one mark could be:

MCQ. The retinal receptors linked to colour are the:

a. rods
b. cones
c. intermediate cells
d. ganglion cells

In contrast, one of our seven-mark questions has the following structure:

hMCQ. Read the following abstract from <details of research paper> (worth 7 marks)

<Abstract from research paper>

1. <Assertion 1> and <Assertion 2>
2. <Assertion 3> and <Assertion 4>
3. <Assertion 5> and <Assertion 6>
4. <Assertion 7> and <Assertion 8>

Select the answer below (a–d) which correctly identifies the statements above which are consistent with the information in the abstract AND your understanding of the subject area:

a. 1 and 2
b. 2 and 4
c. 1, 3 and 4
d. 2, 3 and 4

This style of question requires several stages of evaluation prior to answering. In the above example, students must comprehend the abstract in the context of their learning, evaluate the truthfulness of each assertion, and decide which combination of assertions provides the correct answer.

Table 1²
Example ARQ question (adapted from Williams, 2006) where the answer requires a true/false assessment of the 'assertion', a true/false assessment of the 'reason', and, if both are deemed true, a final assessment to determine whether the 'reason' provides an accurate explanation of the 'assertion'.

Assertion: In a small, open economy, if the prevailing world price of a good is lower than the domestic price, the quantity supplied by the domestic producer will be greater than the domestic quantity demanded, increasing domestic producer surplus.

BECAUSE

Reason: In a small, open economy, any surplus in the domestic market will be absorbed by the rest of the world. This increases domestic consumer surplus.

a. True; True; Correct reason
b. True; True; Incorrect reason
c. True; False
d. False; True
e. False; False

(The correct answer is d.)

² Adapted from Williams, J. (2006). Assertion-reason multiple-choice testing as a tool for deep learning: A qualitative analysis. Assessment and Evaluation in Higher Education, 31(3), 287–301. Reprinted by permission of the publisher (Taylor & Francis Group, http://www.informaworld.com).

As such this format of questioning requires a level of analysis, synthesis and evaluation that would not be assessed by the Level 1 MCQ format. We decided not to apply negative marking, on the basis that some answers contained both correct and incorrect statements and therefore we could be penalising students who had some, but not complete, understanding.

To evaluate the usefulness of the new exam format we carried out a series of analyses to examine student performance in the old SAQ exam and the new hMCQ exam. Our prediction was that the hMCQs would be better than the SAQs at discriminating between students who had achieved different levels of learning, and as a result we would see a wider distribution of marks for the hMCQ exam. In addition we carried out a Rasch analysis, which would allow not only a direct comparison between the performance of questions in the SAQ and hMCQ, but also an evaluation of the difficulty of the questions used. This in turn allows an evaluation of the weightings applied to the hMCQ questions.

Method

Data Collection: Participants and Procedure

The responses to the SAQs or hMCQs for a Visual Perception and Language module were collected over two consecutive years. The two-hour exam comprised two parts. In the first year, 152 students (92% single honours, 81% female) completed the SAQ exam, comprising six SAQs. In the second year, 167 students (89% single honours, 92% female) completed the hMCQ exam, which included 12 questions. For both sets of students, the exam also required the student to complete one LAQ. The overall module mark was an equally weighted average of the LAQ and either the SAQ or hMCQ score (see Table 2).

Data Analysis

Descriptive data were examined to determine the distribution of exam grades for each method of assessment. Rasch analysis methods were then applied to examine SAQs and hMCQs across the two exam years. Rasch analysis illustrates the comparative fit of item location (question difficulty) against person measures (student ability) along postulated latent traits (e.g., 'understanding of topic'). This analysis method can therefore be used as a tool for determining the source of misfit between students and questions for improved examination construction (Bhakta, Tennant, Horton, Lawton, & Andrich, 2005). The Rasch model applies a probabilistic unidimensional model asserting that: a) easier questions are more likely to be answered correctly by the students; and b) the more able the student, the more likely they are to answer a question correctly. Student ability and question difficulty are estimated jointly to produce parameter estimates called 'logits' (or log-odds), which are independent of the student sample and the question items included. The fit of each question to the model can be assessed using the weighted mean square (infit) and unweighted mean square (outfit) statistics. The fit statistics have an expected value of 1.0 but can range from 0 to infinity; values indicating good fit range between 0.7 and 1.3. Values greater than 1.3 can be interpreted as noise or lack of fit, whilst values below 0.7 can be interpreted as redundancy or overlap (Linacre, 2005).
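For reference, the dichotomous Rasch model described above is standardly written as follows (the notation here is ours, added for illustration; it is not reproduced from the original analysis):

    P(X_{ni} = 1) = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)}

where X_{ni} is the scored response (1 = correct) of student n to question i, and \beta_n (student ability) and \delta_i (question difficulty) are both expressed in logits. The infit and outfit statistics are mean squares of the standardized residuals z_{ni} = (X_{ni} - P_{ni}) / \sqrt{W_{ni}}, with W_{ni} = P_{ni}(1 - P_{ni}) the model variance of each response:

    \text{outfit}_i = \frac{1}{N}\sum_n z_{ni}^2, \qquad \text{infit}_i = \frac{\sum_n W_{ni}\, z_{ni}^2}{\sum_n W_{ni}}

Outfit weights all students equally and is therefore sensitive to outlying responses, whereas infit weights each residual by its information, making it more sensitive to misfit among students whose ability is close to the question's difficulty.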
In the current data set, student responses to each SAQ were analysed as a rating scale using the exam score for each question. The hMCQ responses were analysed as dichotomous data, with the correct answer being given a value of 1 and all other (incorrect) response options assigned a value of 0.
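To make the scoring and estimation concrete, the sketch below shows how dichotomous scoring and a crude joint maximum-likelihood Rasch fit could be carried out. This is a minimal illustration on simulated data, not the software used for the analyses reported here (a dedicated program such as WINSTEPS; Linacre, 2006, additionally handles perfect scores, standard errors and fit statistics):

import numpy as np

# Simulated responses: rows = students, columns = hMCQ items (hypothetical data).
rng = np.random.default_rng(0)
responses = rng.integers(0, 4, size=(167, 12))  # each entry is the chosen option (a-d)
key = rng.integers(0, 4, size=12)               # correct option for each item

# Dichotomous scoring: 1 for the correct option, 0 for all other responses.
X = (responses == key).astype(float)

def rasch_jmle(X, n_iter=200, lr=0.5):
    """Crude joint maximum-likelihood estimates of student ability and item
    difficulty (both in logits) by gradient ascent on the Rasch likelihood."""
    n_students, n_items = X.shape
    ability = np.zeros(n_students)
    difficulty = np.zeros(n_items)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
        resid = X - p                        # observed minus expected response
        ability += lr * resid.sum(axis=1) / n_items
        difficulty -= lr * resid.sum(axis=0) / n_students
        difficulty -= difficulty.mean()      # anchor the scale at mean difficulty 0
    return ability, difficulty

ability, difficulty = rasch_jmle(X)
print(np.round(difficulty, 2))               # item difficulties in logits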
Results and discussion

To evaluate the general student performance on each exam we examined the distribution of grades for the SAQ and hMCQ exams (presented in Figure 1). It can be seen that there was a greater spread of grades awarded to students sitting the hMCQ than the SAQ. The hMCQs extended the distribution of exam marks across the higher and lower grades, using almost the full range of possible scores (from 18% to 100%). This indicates that the hMCQs may provide better discrimination between students of different ability, since grades using SAQs were clustered in a narrow band between high 2:2 and low 2:1 grades, with few students managing to excel.

Table 2
Mean and standard deviation (SD) of exam marks (%) for the two exam periods³.

                        Mean SAQ/hMCQ   SD SAQ/hMCQ   Mean LAQ   SD LAQ   Overall mark
Exam period 1: SAQ           55             9.4          60        9.1        58
Exam period 2: hMCQ          63            17.7          61       10.1        62

³ SAQ = short answer questions; LAQ = long answer questions; hMCQ = high-level multiple choice questions.


Figure 1 The proportion of students achieving each exam mark for the old SAQ, shown in white, and the hMCQs, shown in black. [Histogram; y-axis: Proportion of Students (%); x-axis: Exam Mark (%); series: SAQ and MCQ.]

The results of the Rasch analysis allow us to determine how well the individual questions on the SAQ and hMCQ exams targeted the abilities of the students in each year. Figure 2 shows the results of the Rasch analysis as a 'person-item map'.

Both exams seem to target student ability reasonably well, but the distributions of question difficulty and student ability are narrower for the SAQ exam compared to the hMCQ exam. The lack of variability of student performance on the SAQ exam is reflected in the small standard error values in Table 3. The narrowness of distribution for the SAQ means that the exam questions were less able to assess a broad range of student ability.
The SAQ questions LN1 and LN3 appear together on the map (Figure 2), suggesting overlap between these questions. Because LN3 had poor fit statistics it seems that this question was redundant. The question LN2 also had poor infit and outfit statistics, which indicates that it contributed little to the Rasch model. The spread of the hMCQ data was much broader than the SAQ data, and Figure 2 shows there was no overlap of hMCQ questions. Additionally, Table 3 confirms that all questions on the hMCQ exam had good fit (with no overlap or redundancy) and had greater variability (standard error) in student responses than did the SAQ exam questions.

The hMCQs contained items that had been assigned three levels of difficulty based on the predetermined design of the questions. We used Rasch analysis to determine whether the chosen weighting was appropriate. In Figure 2 the questions are shown in the actual order of difficulty, descending from most difficult at the top to easiest at the bottom. The questions are assigned a logit value based on the number of students correctly answering each question, and are then rank ordered. The predetermined weights assigned to each question were: Questions 1 and 2 (VP1, VP2, LN1, LN2) each received three marks; Questions 3 and 4 (VP3, VP4, LN3, LN4) each received five marks; and Questions 5 and 6 (VP5, VP6, LN5, LN6) each received seven marks. If this allocation of grades were accurate, it might be expected that Questions 1 and 2 would appear at the bottom of the map, Questions 3 and 4 in the middle, and Questions 5 and 6 at the top of the map.

The distribution of hMCQ questions in Figure 2 shows that the assumed difficulty was not strongly linked to student performance. Some questions were placed appropriately – for example, VP2 and LN2 were only worth three marks, are placed at the bottom of the map, and were answered correctly by approximately 95% of students. Some seven-mark questions were clearly harder – for example, only 53% of students correctly answered LN6 and VP5, which are much higher on the map. There are also exceptions, however, since only 45% answered VP1 correctly (worth three marks) and this is near the top of the map, whilst LN5 (worth seven marks) was answered correctly by 79% of students and lies in the middle of the map.
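As a rough guide to how these logit values relate to the percentages quoted above: with mean student ability anchored at zero, a question answered correctly by a proportion p of students has an approximate difficulty of (this back-of-envelope formula is ours, not part of the reported analysis)

    \delta \approx \ln\left(\frac{1 - p}{p}\right)

so VP2, at roughly 95% correct, sits near \ln(0.05/0.95) \approx -2.9 logits (very easy, bottom of the map), while VP1, at 45% correct, sits near \ln(0.55/0.45) \approx +0.2 logits (above the middle).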
The original concept behind the hMCQ examination was to design questions of varying difficulty, to target different levels of student learning. For basic question items, only retention of key facts is necessary, with little additional understanding of the topic required; it was therefore predicted that the majority of students would be successful on these questions. Other questions that targeted synthesis and analysis of more complex materials were thought less likely to be answered correctly by the students. It was anticipated that by using a variety of questions, the hMCQ would better allow discrimination between students who had achieved different levels of learning, and the overall exam performance would therefore reflect the level of learning.

The Rasch analysis revealed that some questions were weighted particularly appropriately (e.g., VP2, LN4 and VP5), so we looked more closely at how students performed in these 'ideal' questions (shown in Figure 3). We grouped the responses to each question on the basis of overall score on the test to see how these questions discriminated between students of different abilities. The groups of students who achieved a very good overall grade in the exam performed at a similar level across each type of question (achieving close to 100% in each), whereas the groups who achieved a lower overall grade contained fewer students who answered the five-mark and seven-mark questions correctly.


Figure 2 Person-item map. This figure illustrates exam question difficulty against student ability for each assessment method, transformed to a logit scale. Students with better ability and the more difficult exam questions are shown at the top of the map, descending to students with lower ability and easier exam questions at the bottom. The number of students at each point on the scale is represented cumulatively by a series of symbols on the left side of each map: '#' (3), ':' (2) and '.' (1); exam questions are positioned on the right of each map. The left panel shows SAQ scores for the six exam questions and the right panel shows hMCQ scores for the 12 multiple choice exam questions.

[Two person-item maps (SAQ left, hMCQ right) on a logit scale, running from 'more able/more difficult' at the top to 'less able/less difficult' at the bottom. Question labels: VP = Visual Perception, LN = Language.]

Table 3
Mean-square (MNSQ) fit statistics and variability for SAQ and hMCQ exam questions.

Exam question    Infit MNSQ    Outfit MNSQ    Standard error
Short answer questions
VP1                 1.22           1.24            0.6
VP2                 1.19           1.20            0.6
VP3                 1.08           1.19            0.5
LN1                 0.86           0.92            0.5
LN2                 0.47*          0.48*           0.5
LN3                 1.28           1.38*           0.5
High-level multiple choice questions
VP1                 1.02           1.06            1.7
VP2                 0.94           0.83            3.1
VP3                 1.02           1.01            1.7
VP4                 1.02           1.12            1.7
VP5                 0.87           0.83            1.7
VP6                 0.86           0.84            1.8
LN1                 1.00           1.00            1.7
LN2                 1.07           1.23            4.2
LN3                 1.03           1.03            1.7
LN4                 1.13           1.13            2.1
LN5                 0.92           0.96            2.0
LN6                 1.08           1.15            1.7
Note. Infit/outfit statistics < 0.7 or > 1.3 (marked *) indicate particularly poor fit.
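The infit and outfit mean squares in Table 3 follow the standard definitions given earlier; a minimal sketch of computing them, continuing the hypothetical X, ability and difficulty arrays from the earlier snippet (again, not the software actually used for the reported analysis):

def fit_statistics(X, ability, difficulty):
    """Per-item mean-square fit: outfit is the unweighted mean of squared
    standardized residuals; infit weights them by the response variance."""
    p = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
    w = p * (1.0 - p)                      # model variance of each response
    z2 = (X - p) ** 2 / w                  # squared standardized residuals
    outfit = z2.mean(axis=0)
    infit = ((X - p) ** 2).sum(axis=0) / w.sum(axis=0)
    return infit, outfit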


In Figure 3 the slopes of the lines indicate the discrimination achieved by each question type, and it is clear that the gradient becomes steeper for the questions worth more marks. Many of the students (irrespective of their overall score) answered the three-mark question correctly, and this is appropriate given the basic understanding required for such questions. We also wanted to reward higher-level skills, and some questions performed very well at discriminating between those who performed well at the test and those who scored poorly. By using the forms of analysis outlined here we can reflect on the relative difficulty of each hMCQ item, and over time refine our questions to better reflect a distribution of topics as well as depth of understanding. This will allow us to increasingly discriminate between students who have attained different levels of knowledge and understanding, as well as critical and analytical skills.

It is still not clear whether differentially weighted questions (worth three, five or seven marks) are the best way to reward question complexity, since this may have had both positive and negative outcomes. Differential weighting is a useful way of using a small number of items to discriminate between students by awarding more marks for deeper understanding of topics. Accurately estimating weightings, however, does require good insight into how hard each question is for the students, since misjudging the level of difficulty can result in rewarding students for only superficial understanding. In the current hMCQ design, correctly answering a seven-mark question contributes 6% of the overall exam mark (7 of the 60 hMCQ marks, which in turn make up half of the overall mark: 7/60 × 50% ≈ 6%), so such questions need to be pitched correctly. Because of the uncertainty in estimating question weighting, if an academic were considering introducing a new hMCQ format we would recommend reducing the contribution of the hMCQ exam mark in the first year of use until the performance of each question (especially highly-weighted ones) can be confirmed. We use a 50/50 split between the hMCQ and the LAQ in our exam format, but in the first instance a 30/70 split may be more prudent.

The differential marks are supposed to reflect the amount of time spent on a question (rather than purely being a measure of difficulty). More time taken on one question would, however, reduce the amount spent on other questions, effectively making them harder. Further work is needed to evaluate how long students spent on each question and how this relates to the marks made available for that question.

The ultimate measure of whether implementation of the new hMCQs has been a success is the opinion of the teaching staff, the students, and – not least – the external examiners.

Figure 3 The proportion of students who correctly answered a three-mark, a five-mark and a seven-mark multiple choice question (VP2, LN4 and VP5 respectively). The students' data were split into five groups based on their overall score for the whole hMCQ, from lowest (0–50%) to highest (81–100%).

[Line graph; y-axis: Students who answered correctly (%); x-axis: Overall Score in MCQ, banded 0–50, 51–60, 61–70, 71–80 and 81–100%; separate lines for VP2 (3 marks), LN4 (5 marks) and VP5 (7 marks).]


Needless to say, teaching staff appreciate the time saved as a result of not having to mark hundreds of hand-written exam essays, and although many initially approached the exercise with reservations, we have now implemented the use of hMCQs across the board at Level 2, to staff approval. The students find the format challenging, but again the feedback has been very positive in consultations via the Staff-Student Committee and other means. Crucially, both external examiners gave the new hMCQ exam format a ringing endorsement and were impressed with the way in which this form of exam discriminates between students.

Conclusion

We have shown that it is possible to generate a small number of hMCQs that perform across a range of student abilities, and effectively discriminate between students with varying abilities. High-level MCQs are not trivial to write, but we have demonstrated that a large number of questions is not required in order for the examination to be useful, as long as they tap high-level skills. To make a reusable and validated hMCQ question bank we recommend annual analysis of the performance of individual questions to ensure a good spread of more and less difficult questions which also discriminate between students of varying abilities. Despite the time costs associated with generating and assessing the questions, once they are in place, they can remove the heavy workload associated with marking large numbers of essay-based questions.

References

Bhakta, B., Tennant, A., Horton, M., Lawton, G., & Andrich, D. (2005). Using item response theory to explore the psychometric properties of extended matching questions examination in undergraduate medical education. BMC Medical Education, 5(1), 9.

Bloom, B. S. (Ed.). (1956). Taxonomy of educational objectives: The classification of educational goals: Handbook I, cognitive domain. London: Longman Group.

Connelly, L. B. (2004). Assertion-reason assessment in formative and summative tests: Results from two graduate case studies. In R. Ottewill, E. Borredon, L. Falque, B. Macfarlane, & A. Wall (Eds), Educational innovation in economics and business VIII: Pedagogy, technology and innovation (pp. 359–378). Dordrecht: Kluwer Academic Publishers.

Linacre, J. M. (2005). A user's guide to Winsteps/Ministeps Rasch-Model programs. Chicago, IL: MESA Press.

Linacre, J. M. (2006). WINSTEPS [Rasch measurement computer program]. Chicago, IL: Winsteps.com.

Wiggins, G. (1990). The case for authentic assessment. ERIC Digest (ERIC Document Reproduction Service No. ED328611). Retrieved 9 December, 2009 from http://www.ericdigests.org/pre-9218/case.htm

Williams, J. (2006). Assertion-reason multiple-choice testing as a tool for deep learning: A qualitative analysis. Assessment and Evaluation in Higher Education, 31(3), 287–301.

Manuscript received 22 September 2008.
Revision accepted for publication 6 May 2009.
