
58  Administration, Scoring, and Reporting Scores

Ari Huhta
University of Jyväskylä, Finland

Introduction
Administration, scoring, and reporting scores are essential elements of the testing
process because they can significantly impact the quality of the inferences that
can be drawn from test results, that is, the validity of the tests (Bachman &
Palmer, 1996; McCallin, 2006; Ryan, 2006). Not surprisingly, therefore, professional
language-testing organizations and educational bodies more generally cover these
elements in some detail in their guidelines of good practice.
The Standards for Educational and Psychological Testing devote several pages
to describing standards that relate specifically to test administration, scoring, and
reporting scores (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [AERA,
APA, & NCME], 1999, pp. 61–6). Also, the three major international language-testing organizations, namely the International Language Testing Association
(ILTA), the European Association for Language Testing and Assessment (EALTA),
and the Association of Language Testers in Europe (ALTE), make specific recommendations about administration, scoring, and reporting scores for different contexts and purposes (e.g., classroom tests and large-scale examinations) and for
different stakeholders (e.g., test designers, institutions, and test takers).
Although the detailed recommendations vary depending on the context, stakeholder, and professional association, the above guidelines endorse very similar
practices. Guidelines on the administration of assessments typically aim at creating standardized conditions that would allow test takers to have a fair and equal
opportunity to demonstrate their language proficiency. These include, for example,
clear and uniform directions to test takers, an environment that is free of noise
and disruptions, and adequate accommodations for disadvantaged test takers,
such as extra time for people with dyslexia or a different version of the test for
blind learners. A slightly different consideration is test security: Individual test
takers should not have an unfair advantage over others by accessing test material
prior to the test or by copying answers from others during the test because of
inadequate invigilation, for example. Administration thus concerns everything
that is involved in presenting the test to the test takers: time, place, equipment,
and instructions, as well as support and invigilation procedures (see Mousavi,
1999, for a detailed definition).
Scoring, that is, giving numerical values to test items and tasks (Mousavi, 1999), is a
major concern for all types of testing, and professional associations therefore give
several recommendations. From the point of view of test design, these associations
emphasize the creation of clear and detailed scoring guidelines for all kinds of
tests but especially for those that contain constructed response items and speaking
and writing tasks. Accurate and exhaustive answer keys should be developed for
open-ended items, raters should be given adequate training, and the quality of
their work should be regularly monitored. Test scores and ratings should also be
analyzed to examine their quality, and appropriate action should be taken to
address any issues to ensure adequate reliability and validity.
The main theme in reporting, namely "communicating test results to stakeholders" (Cohen & Wollack, 2006, p. 380), is ensuring the intelligibility and interpretability of the scores. Reporting just the raw test scores is not generally recommended,
so usually test providers convert test scores onto some reporting scale that has
a limited number of score levels or bands, which are often defined verbally. An
increasingly popular trend in reporting scores is to use the Common European
Framework of Reference (CEFR) to provide extra meaning to scores. Other recommendations on reporting scores include that test providers give information about
the quality (validity, reliability) of their tests, and about the accuracy of the scores,
that is, how much the test taker's true score is likely to vary around the reported score.

Test Administration, Scoring, and Reporting Scores


In the following, test administration, scoring, and reporting scores are described
in terms of what is involved in each, and of how differences in the language skills
tested and the purposes and contexts of assessment can affect the way tests are
administered, scored, and reported. An account is also given of how these might
have changed over time and whether any current trends can be discerned.

Administration of Tests
The administration of language tests and other types of language assessments is
highly dependent on the skill tested and task types used, and also on the purpose
and stakes involved. Different administration conditions can significantly affect
test takers' performance and, thus, the inferences drawn from test scores. As was
described above, certain themes emerge in the professional guidelines that are
fairly common across all kinds of test administrations. The key point is to create
standardized conditions that allow test takers a fair opportunity to demonstrate
what they can do in the language assessed, and so to get valid, comparable
information about their language skills. Clear instructions, a chance for the test
taker to ask for clarifications, and appropriate physical environment in terms
of, for example, noise, temperature, ventilation, and space all contribute in their
own ways to creating a fair setting (see Cohen & Wollack, 2006, pp. 356–60, for a
detailed discussion of test administration and special accommodations).
A general administration condition that is certain to affect test-taker performance is the time limit set for the test. Some tests can be
speeded on purpose, especially if they attempt to tap time-critical aspects of performance, such as in a scanning task where test takers have to locate specific
information in the text fast. Setting up a speeded task in an otherwise nonspeeded
paper-based test is challenging administratively; on computer, task-specific time
limits are obviously easy to implement. In most tests, time is not a key component
of the construct measured, so enough time is given for almost everybody to finish
the test. However, speededness can occur in nonspeeded tests when some learners
cannot fully complete the test or have to change their response strategy to be able
to reply to all questions. Omitted items at the end of a test are easy to spot but
other effects of unintended speededness are more difficult to discover (see Cohen
& Wollack, 2006, pp. 357–8 on research into the latter issue).
A major factor in test administration is the aspect of language assessed; in practice, this boils down to testing speaking versus testing the other skills (reading,
writing, and listening). Most aspects of language can be tested in groups, sometimes in very large groups indeed. The prototypical test administration context is
a classroom or a lecture hall full of learners sitting at their own tables writing in
their test booklets. Testing reading and writing or vocabulary and structures can
be quite efficiently done in big groups, which is obviously an important practical
consideration in large-scale testing, as the per learner administration time and
costs are low (for more on test practicality as an aspect of overall test usefulness,
see Bachman & Palmer, 1996). Listening, too, can be administered to big groups,
if equal acoustic reception can be ensured for everybody.
Certain tests are more likely to be administered to somewhat smaller groups.
Listening tests and, more recently, computerized tests of any skill are typically
administered to groups of 10–30 learners in dedicated language studios or computer laboratories that create more standardized conditions for listening tests, as
all test takers can wear headphones.
Testing speaking often differs most from testing the other skills when it comes
to administration. If the preferred approach to testing speaking is face to face with
an interviewer or with another test taker, group administrations become almost
impossible. The vast majority of face-to-face speaking tests involve one or two test
takers at a time (for different oral test types, see Luoma, 2004; Fulcher, 2003; Taylor,
2011). International language tests are no exception: Tests such as the International
English Language Testing System (IELTS), the Cambridge examinations, the
Goethe Instituts examinations, and the French Diplme dtudes en langue
franaise (DELF) and Diplme approfondi de langue franaise (DALF) examinations all test one or two candidates at a time.
Interestingly, the practical issues in testing speaking have led to innovations in
test administration such as the creation of semidirect tests. These are administered
in a language or computer laboratory: Test takers, wearing headphones and microphones, perform speaking tasks following instructions they hear from a tape or
computer, and possibly also read in a test booklet. Their responses are recorded
and rated afterwards. There has been considerable debate about the validity of
this semidirect approach to testing speaking. The advocates argue that these tests
cover a wider range of contexts, their administration is more standardized, and
they result in very similar speaking grades compared with face-to-face tests (for
a summary of research, see Malone, 2000). The approach has been criticized on
the grounds that it solicits somewhat different language from face-to-face tests
(Shohamy, 1994). Of the international examinations, the Test of English as a Foreign
Language Internet-based test (TOEFL iBT) and the Test Deutsch als Fremdsprache
(TestDaF), for example, use computerized semidirect speaking tests that are scored
afterwards by human raters. The new Pearson Test of English (PTE) Academic
also employs a computerized speaking test but goes a step further as the scoring
is also done by the computer.
The testing context, purpose, and stakes involved can have a marked effect on
test administration. The higher the stakes, the more need there is for standardization of test administration, security, confidentiality, checking of identity, and measures against all kinds of test fraud (see Cohen & Wollack, 2006, for a detailed
discussion on how these affect test administration). Such is typically the case in
tests that aim at making important selections or certifying language proficiency
or achievement. All international language examinations are prime examples of
such tests. However, in lower stakes formative or diagnostic assessments, administration conditions can be more relaxed, as learners should have fewer reasons
to cheat, for example (though of course, if an originally low stakes test becomes
more important over time, its administration conditions should be reviewed).
Obviously, avoidance of noise and other disturbances makes sense in all kinds of
testing, unless the specific aim is to measure performance under such conditions.
Low stakes tests are also not tied to a specific place and time in the same way as
high stakes tests are. Computerization, in particular, offers considerable freedom
in this respect. A good example is DIALANG, an online diagnostic assessment
system which is freely downloadable from the Internet (Alderson, 2005) and
which can thus be taken anywhere, any time. Administration conditions of some
forms of continuous assessment can also differ from the prototypical invigilated
setting: Learners can be given tasks and tests that they do at home in their own
time. These tasks can be included in a portfolio, for example, which is a collection
of different types of evidence of learners' abilities and progress for either formative or summative purposes, or both (on the popular European Language Portfolio, see Little, 2005).

Scoring and Rating Procedures


The scoring of test takers' responses and performances should be as directly
related as possible to the constructs that the tests aim at measuring (Bachman &
Palmer, 1996). If the test has test specifications, they typically contain information
about the principles of scoring items, as well as the scales and procedures for the
rating of speaking and writing. Traditionally, a major concern about scoring has
been reliability: To what extent are the scoring and rating consistent over time and
across raters?

Figure 58.1  Steps in scoring item-based tests: (1) individual item responses, (2) scoring (guided by the scoring key), (3) individual item scores, (4) weighting of scores, if any (informed by item analyses, e.g., the deletion of items), (5) the sum of scores on the test score scale, (6) the application of cutoffs (derived through standard setting), and (7) the score band or reporting scale.

The rating of speaking and writing performances, in particular, continues to be a major worry and considerable attention is paid to ensuring a
fair and consistent assessment, especially in high stakes contexts. A whole new
trend in scoring is computerization, which is quite straightforward in selected
response items but much more challenging the more open-ended the tasks are.
Despite the challenges, computerized scoring of all skills is slowly becoming a
viable option, and some international language examinations have begun employing it.
As was the case with test administration, scoring, too, is highly dependent on
the aspects of language tested and the task types used. The purpose and stakes
of the test do not appear to have such a significant effect on how scoring is done,
although attention to, for instance, rater consistency is obviously closer in high
stakes contexts. The approach to scoring is largely determined by the nature of
the tasks and responses to be scored (see Millman & Greene, 1993; Bachman &
Palmer, 1996). Scoring selected response items dichotomously as correct versus
incorrect is a rather different process from rating learners' performances on
speaking and writing tasks with the help of a rating scale or scoring constructed
response items polytomously (that is, awarding points on a simple scale depending on the content and quality of the response).
Let us first consider the scoring of item-based tests. Figure 58.1 shows the main
steps in a typical scoring process: It starts with the test takers' responses, which
can be choices made in selected response items (e.g., A, B, C, D) or free responses
to gap-fill or short answer items (parts of words, words, sentences). Prototypical
responses are test takers' markings on the test booklets that also contain the task
materials. Large-scale tests often use separate optically readable answer sheets for
multiple choice items. Paper is not, obviously, the only medium used to deliver
tests and collect responses. Tape-mediated speaking tests often contain items that
are scored rather than rated, and test takers' responses to such items are normally
recorded on tape. In computer-based tests, responses are captured in electronic
format, too, to be scored either by the computer applying some scoring algorithm
or by a human rater.
In small-scale classroom testing the route to step 2, scoring, is quite straightforward. The teacher simply collects the booklets from the students and marks the
papers. In large-scale testing this phase is considerably more complex, unless we
have a computer-based test that automatically scores the responses. If the scoring
is centralized, booklets and answer sheets first need to be mailed from local test
centers to the main regional, national, or even international center(s). There the
optically readable answer sheets, if any, are scanned into electronic files for
further processing and analyses (see Cohen & Wollack, 2006, pp. 372–7 for an
extended discussion of the steps in processing answer documents in large-scale
examinations).
Scoring key: An essential element of scoring is the scoring key, which for the
selected response items simply tells how many points each option will be awarded.
Typically, one option is given one point and the others zero points. However,
sometimes different options receive different numbers of points depending on
their degree of correctness or appropriateness. For productive items, the scoring
can be considerably more complex. Some items have only one acceptable answer;
this is typical of items focusing on grammar or vocabulary. For short answer items
on reading and listening, the scoring key can include a number of different but
acceptable answers but the scoring may still be simply right versus wrong, or it
can be partial-credit and polytomous (that is, some answers receive more points
than others).
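To make the distinction concrete, the following sketch (in Python) shows one way a scoring key of this kind could be applied; the item identifiers, key entries, and point values are invented for illustration and are not drawn from any operational test.

```python
# A minimal, hypothetical sketch of applying a scoring key. Item IDs, keys,
# and responses are invented for illustration only.

# Dichotomous key for selected response items: the keyed option earns 1 point.
mcq_key = {"item1": "B", "item2": "D"}

# Partial-credit (polytomous) key for short answer items: several acceptable
# answers, possibly worth different numbers of points.
saq_key = {
    "item3": {"in the morning": 2, "morning": 1},
    "item4": {"because it rained": 2, "rain": 1},
}

def score_response(item_id: str, response: str) -> int:
    """Return the points awarded to one response according to the scoring key."""
    response = response.strip().lower()
    if item_id in mcq_key:
        return 1 if response == mcq_key[item_id].lower() else 0
    if item_id in saq_key:
        return saq_key[item_id].get(response, 0)  # unlisted answers score 0
    raise KeyError(f"No key entry for {item_id}")

answers = {"item1": "B", "item2": "A", "item3": "morning", "item4": "because it rained"}
scores = {item: score_response(item, resp) for item, resp in answers.items()}
print(scores)  # {'item1': 1, 'item2': 0, 'item3': 1, 'item4': 2}
```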
The scoring key is usually designed when the test items are constructed. The
key can, however, be modified during the scoring process, especially for open-ended items. Some examinations employ a two-stage process in which a proportion of the responses is first scored by a core group of markers who then
complement the key for the marking of the majority of papers by adding to the
list of acceptable answers based on their work with the first real responses.
Markers and their training: Another key element of the scoring plan is the selection of scorers or markers and their training. In school-based testing, the teacher
is usually the scorer, although sometimes she may give the task to the students
themselves or, more often, do it in cooperation with colleagues. In high stakes
contexts, the markers and raters usually have to meet specified criteria to qualify.
For example, they may have to be native speakers or non-native speakers with
adequate proficiency and they probably need to have formally studied the language in question.
Item analyses: An important part of the scoring process in professionally
designed language tests is item analyses. The so-called classical item analyses
are probably still the most common approach; they aim to find out how demanding the items are (item difficulty or facility) and how well they discriminate
between good and poor test takers. These analyses can also identify problematic
items or items tapping different constructs. Item analyses can result in the acceptance of additional responses or answer options for certain items (a change in the
scoring key) or the removal of entire items from the test, which can change the
overall test score.
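As an illustration, a classical item analysis of this kind might be computed along the following lines; this is a sketch only, the score matrix is invented, and the item–rest correlation shown is just one of several discrimination indices in use.

```python
# Hypothetical sketch of classical item analysis: item facility (proportion of
# points obtained) and a simple discrimination index (correlation between the
# item and the total of the remaining items). Data are invented.
# statistics.correlation requires Python 3.10+.
import statistics

# Rows = test takers, columns = dichotomously scored items.
matrix = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]

for i in range(len(matrix[0])):
    item = [row[i] for row in matrix]
    rest = [sum(row) - row[i] for row in matrix]   # total score on the other items
    facility = statistics.mean(item)
    discrimination = statistics.correlation(item, rest)
    print(f"item {i + 1}: facility = {facility:.2f}, discrimination = {discrimination:.2f}")
```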
Test score scale: When the scores of all items are ready, the next logical step is to
combine them in some way into one or more overall scores. The simplest way to
arrive at an overall test score is to sum up the item scores; here the maximum
score equals the number of items in the test, if each item is worth one point. The
scoring of a test comprising a mixture of dichotomously (0 or 1 point per item)
scored multiple choice items and partial-credit/polytomous short answer items
is obviously more complex. A straightforward sum of such items results in the
short answer questions being given more weight because test takers get more

Administration, Scoring, and Reporting Scores

points from them; for example, three points for a completely acceptable answer
compared with only one point from a multiple choice item. This may be what we want,
if the short answer items have been designed to tap more important aspects of
proficiency than the other items. However, if we want all items to be equally
important, each item score should be weighted by an appropriate number.
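The sketch below illustrates the arithmetic with invented scores and maximum points: dividing each item score by the item's maximum equalizes the items' contributions to the total.

```python
# Hypothetical sketch: equalizing the contribution of dichotomous (0/1) and
# partial-credit (0-3) items by weighting each item score by 1 / max_points,
# so that every item contributes at most one point to the weighted total.
raw_scores = {"mcq1": 1, "mcq2": 0, "saq1": 3, "saq2": 2}   # invented item scores
max_points = {"mcq1": 1, "mcq2": 1, "saq1": 3, "saq2": 3}

unweighted_total = sum(raw_scores.values())
weighted_total = sum(raw_scores[i] / max_points[i] for i in raw_scores)

print(unweighted_total)           # 6 -> the short answer items dominate the total
print(round(weighted_total, 2))   # 2.67 out of 4 -> every item now counts equally
```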
Language test providers increasingly complement classical item analyses with
analyses based on what is known as modern test theory or item response theory
(IRT; one often-used IRT approach is Rasch analysis). What makes them particularly useful is that they are far less dependent than the classical approaches on
the characteristics of the learners who happened to take the test and the items in
the test. With the help of IRT analyses, it is possible to construct test score scales
that go beyond the simple summing up of item scores, since they are adjusted for
item difficulty and test takers' ability, and sometimes also for item discrimination
or guessing. Most large-scale international language tests rely on IRT analyses as
part of their test analyses, and also to ensure that their tests are comparable across
administrations.
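The core idea can be illustrated with the Rasch item response function; the sketch below is a generic illustration of the model, not of any particular provider's procedure, and the ability and difficulty values are invented.

```python
# Minimal sketch of the Rasch (one-parameter IRT) model: the probability of a
# correct response depends only on the difference between test-taker ability
# and item difficulty, both expressed on the same logit scale. Values invented.
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """P(correct) = exp(ability - difficulty) / (1 + exp(ability - difficulty))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

for theta in (-1.0, 0.0, 1.0):        # three hypothetical test takers
    for b in (-0.5, 0.5):             # an easier and a harder item
        print(f"ability {theta:+.1f}, difficulty {b:+.1f}: "
              f"P(correct) = {rasch_probability(theta, b):.2f}")
```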
An example of a language test that combines IRT analysis and item weighting
in the computation of its score scale is DIALANG, the low stakes, multilingual
diagnostic language assessment system mentioned above (Alderson, 2005). In the
fully developed test languages of the system, the items are weighted differentially,
ranging from 1 to 5 points, depending on their ability to discriminate.
Setting cutoff points for the reporting scale: Instead of reporting raw or weighted
test scores many language tests convert the score to a simpler scale for reporting
purposes, to make the test results easier to interpret. The majority of educational
systems probably use simple scales comprising a few numbers (e.g., 1–5 or 1–10)
or letters (e.g., A–F). Sometimes it is enough to report whether the test taker passes
or fails a particular test, and thus a simple two-level scale (pass or fail) is sufficient
for the purpose. Alternatively, test results can be turned into developmental scores
such as age- or grade-equivalent scores, if the group tested are children and if
such age- or grade-related interpretations can be made from the particular test
scores. Furthermore, if the reporting focuses on rank ordering test takers or comparing them with some normative group, percentiles or standard scores (z or T
scores) can be used, for example (see Cohen & Wollack, 2006, p. 380).
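As a brief illustration of the standard-score option, the sketch below converts a raw score first to a z score and then to a T score (mean 50, standard deviation 10); the norm-group scores are invented.

```python
# Hypothetical sketch of standard-score reporting: a raw score is expressed as
# a z score (distance from the norm-group mean in standard deviation units)
# and as a T score (50 + 10z). The norm-group scores are invented.
from statistics import mean, stdev

norm_group = [42, 55, 61, 48, 70, 66, 53, 59]
mu, sigma = mean(norm_group), stdev(norm_group)

def z_score(raw: float) -> float:
    return (raw - mu) / sigma

def t_score(raw: float) -> float:
    return 50 + 10 * z_score(raw)

raw = 61
print(f"raw {raw}: z = {z_score(raw):.2f}, T = {t_score(raw):.1f}")
```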
The conversion of the total test score to a reporting scale requires some mechanism for deciding how the scores correspond to the levels on the reporting scale.
The process through which such cutoff points (cut scores) for each level are
decided is called standard setting (step 6 in Figure 58.1).
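In its simplest form, step 6 amounts to looking a total score up against a set of cut scores, as in the hypothetical sketch below; the cutoff values and band labels are invented and would in practice come from a standard-setting procedure.

```python
# Hypothetical sketch of applying cutoff scores (step 6 in Figure 58.1):
# a raw total score is mapped onto a reporting band. Cutoffs and labels invented.
import bisect

cutoffs = [20, 35, 50, 65]              # minimum total score for each band above the lowest
bands = ["A1", "A2", "B1", "B2", "C1"]  # reporting scale (here CEFR-style labels)

def to_band(total_score: int) -> str:
    """Map a raw total score onto the reporting scale."""
    return bands[bisect.bisect_right(cutoffs, total_score)]

for score in (12, 20, 42, 70):
    print(score, "->", to_band(score))  # 12 -> A1, 20 -> A2, 42 -> B1, 70 -> C1
```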
Intuition and tradition are likely to play at least as big a role as any empirical
evidence in setting the cutoffs; few language tests have the means to conduct
systematic and sufficient standard-setting exercises. Possibly the only empirical
evidence available to teachers, in particular, is to compare their students with each
other (ranking), with the students' performances on previous tests, or with other
students' performance on the same test (norm referencing). The teacher may focus
on the best and weakest students and decide to use cutoffs that result in the
regular top students getting top scores in the current test, too, and so on. If the
results of the current test are unexpectedly low or high, the teacher may raise or
lower the cutoffs accordingly.


Many large-scale tests are obviously in a better position to make more empirically based decisions about cutoff points than individual teachers and schools. A
considerable range of standard-setting methods has been developed to inform
decisions about cutoffs on test score scales (for reviews, see Kaftandjieva, 2004;
Cizek & Bunch, 2006). The most common standard-setting methods focus on the
test tasks; typically, experts evaluate how individual test items match the levels
of the reporting scale. Empirical data on test takers' performance on the items
or the whole test can also be considered when making judgments. In addition
to these test-centered standard-setting methods, there are examinee-centered
methods in which persons who know the test takers well (typically teachers) make
judgments about their level. Learners' performances on the items and the test are
then compared with the teachers' estimates of the learners to arrive at the most
appropriate cutoffs.
Interestingly, the examinee-centered approaches resemble what most teachers
are likely to do when deciding on the cutoffs for their own tests. Given the difficulty and inherent subjectivity of any formal standard-setting procedure, one
wonders whether experienced teachers who know their students can in fact make
at least equally good decisions about cutoffs as experts relying on test-centered
methods, provided that the teachers also know the reporting scale well.
Sometimes the scale score conversion is based on a type of norm referencing
where the proportion of test takers at the different reporting scale levels is kept
constant across different tests and administrations. For example, the Finnish
school-leaving matriculation examination for 18-year-olds reports test results on
a scale where the highest mark is always given to the top 5% in the score distribution, the next 15% get the second highest grade, the next 20% the third grade, and
so on (Finnish Matriculation Examination Board, n.d.).
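A norm-referenced conversion of this kind can be sketched as follows; only the top three quotas come from the example above, while the final quota, the generic grade labels, and the score distribution are invented.

```python
# Hypothetical sketch of norm-referenced grading: boundaries are chosen so that
# fixed proportions of test takers receive each grade (5% / 15% / 20% from the
# top, as in the example above; the final quota, the labels, and the score
# distribution are invented).
import random

random.seed(0)
scores = sorted((random.gauss(60, 15) for _ in range(1000)), reverse=True)

quotas = [0.05, 0.15, 0.20, 0.60]   # last quota invented to cover the remainder
labels = ["grade 4 (top)", "grade 3", "grade 2", "grade 1"]

boundaries, cumulative = [], 0.0
for quota in quotas[:-1]:
    cumulative += quota
    boundaries.append(scores[int(cumulative * len(scores)) - 1])  # lowest score still in the band

def grade(score: float) -> str:
    for boundary, label in zip(boundaries, labels):
        if score >= boundary:
            return label
    return labels[-1]

print([round(b, 1) for b in boundaries])      # the cutoffs implied by the quotas
print(grade(scores[0]), grade(scores[500]))   # top scorer vs. a mid-ranking scorer
```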
A recent trend in score conversion concerns the CEFR. Many language tests
have examined how their test scores relate to the CEFR levels in order to give
added meaning to their results and to help compare them with the results of other
language tests (for a review, see Martyniuk, 2011). This is in fact score conversion
(or setting cutoffs) at a higher or secondary level: The first one involves converting
the test scores to the reporting scale the test uses, and the second is about converting the reporting scale to the CEFR scale.

Scoring Tests Based on Performance Samples


The scoring of speaking and writing tasks usually takes place with the help of one
or more rating scales that describe test-taker performance at each scale level. The
rater observes the test taker's performance and decides which scale level best
matches the observed performance. Such rating is inherently criterion referenced
in nature as the scale serves as the criteria against which test takers' performances
are judged (Bachman & Palmer, 1996, p. 212). This is in fact where the rating of
speaking and writing differs the most from the scoring of tests consisting of items
(e.g., reading or listening): In many tests the point or level on the rating scale
assigned to the test taker is what will be reported to him or her. There is thus no
need to count a total speaking score and then convert it to a different reporting
scale, which is the standard practice in item-based tests. The above simplifies
matters somewhat because in reality some examinations use more complex procedures and may do some scale conversion and setting of cutoffs also for speaking
and writing. However, in its most straightforward form, the rating scale for speaking and writing is the same as the reporting scale, although the wording of the
two probably differs because they target different users (raters vs. test score users).
It should be noted that instead of rating, it is possible to count, for example,
features of language in speaking and writing samples. Such attention to detail at
the expense of the bigger picture may be appropriate in diagnostic or formative
assessment that provides learners with detailed feedback.
Rating scales are a specific type of proficiency scale and differ from the more
general descriptive scales designed to guide the selection of test content and teaching
materials or to inform test users about the test results (Alderson, 1991). Rating
scales should focus on what is observable in test takers' performance, and they
should be relatively concise in order to be practical. Most rating scales refer to
both what the learners can and what they cannot do at each level; other types of
scales may often avoid references to deficiencies in learners' proficiency (e.g., the
CEFR scales focus on what learners can do with the language, even at the lowest
proficiency levels).
Details of the design of rating scales are beyond the scope of this chapter; the
reader is advised to consult, for example, McNamara (1996) and Bachman and
Palmer (1996). Suffice it to say that test purpose significantly influences scale
design, as do the designers views about the constructs measured. A major decision concerns whether to use only one overall (holistic) scale or several scales. For
obtaining broad information about a skill for summative, selection, and placement
purposes, one holistic scale is often preferred as a quick and practical option. To
provide more detailed information for diagnostic or formative purposes, analytic
rating makes more sense. Certain issues concerning the validity of holistic rating,
such as difficulties in balancing the different aspects lumped together in the level
descriptions, have led to recommendations to use analytic rating, and if one
overall score is required, to combine the component ratings (Bachman & Palmer,
1996, p. 211). Another major design feature relates to whether only language is to
be rated or also content (Bachman & Palmer, 1996, p. 217). A further important
question concerns the number of levels in a rating scale. Although a very fine-grained scale could yield more precise information than a scale consisting of just
three or four levels, if the raters are unable to distinguish the levels it would cancel
out these benefits. The aspect of language captured in the scale can also affect the
number of points in the scale; it is quite possible that some aspects lend themselves
to being split into quite a few distinct levels whereas others do not (see, e.g., the
examples in Bachman & Palmer, 1996, pp. 214–18).
Since rating performances is usually more complex than scoring objective items,
a lot of attention is normally devoted, in high stakes tests in particular, to ensuring
the dependability of ratings. Figure 58.2 describes the steps in typical high stakes
tests of speaking and writing. While most classroom assessment is based on only
one rater, namely the teacher, the standard practice in most high stakes tests is for
at least a proportion of performances to be double rated (step 3 in Figure 58.2).
Sometimes the first rating is done during the (speaking) test (e.g., the rater is
present in the Cambridge examinations but leaves the conduct of the test to an
interlocutor), but often the first and second ratings are done afterwards from an
audio- or videorecording, or from the scripts in the writing tests. Typically, all
raters involved are employed and trained by the testing organization, but sometimes the first rater, even in high stakes tests, is the teacher (as in the Finnish
matriculation examination) even if the second and decisive rating is done by the
examination board.

Figure 58.2  Steps in rating speaking and writing performances: (1) performance during the test; (2) first rating, either (a) during the test (speaking) or (b) afterwards from a recording or a script; (3) second rating (typically afterwards); (4) identification of (significant) discrepancies between raters and of performances that are difficult to rate; (5) third and possibly further ratings; (6) compilation of the different raters' ratings; (7) compilation of the different rating criteria into one, if analytic rating is used but only one score is reported; (8) sum of scores, if the final rating is not directly based on the rating scale categories or levels; (9) application of cutoffs (through standard setting); and (10) reporting of results on the reporting scale. Rating scale(s), benchmark samples, monitoring, and rater or rating analyses support the rating steps.
Large-scale language tests employ various monitoring procedures to try to
ensure that their raters work consistently enough. Double rating is in fact one such
monitoring device, as it will reveal significant disagreement between raters;
if this can be spotted while rating is still in progress, one or both of the raters can
be given feedback and possibly retrained before being allowed to continue. Some
tests use a small number of experienced master raters who continuously sample
and check the ratings of a group of raters assigned to them. The TOEFL iBT has
an online system that forces the raters to start each new rating session by assessing
a number of calibration samples, and only if the rater passes them is he or she
allowed to proceed to the actual ratings.
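A very simple version of such monitoring can be sketched as follows: performances on which two raters differ by more than a tolerated margin are flagged for a third rating or for rater feedback. The ratings, candidate identifiers, and threshold below are invented.

```python
# Hypothetical sketch of monitoring double ratings: flag performances where the
# first and second ratings differ by more than a tolerated margin, so that a
# third rating (or rater feedback/retraining) can be triggered. Data invented.
ratings = {
    "candidate_01": (4, 4),
    "candidate_02": (3, 5),   # two-band gap -> flag
    "candidate_03": (2, 3),
    "candidate_04": (5, 2),   # three-band gap -> flag
}

MAX_DISCREPANCY = 1           # tolerated difference on the rating scale

def flag_discrepancies(ratings: dict) -> list:
    """Return candidates whose first and second ratings differ too much."""
    return [cand for cand, (r1, r2) in ratings.items()
            if abs(r1 - r2) > MAX_DISCREPANCY]

print(flag_discrepancies(ratings))   # ['candidate_02', 'candidate_04']
```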


A slightly different approach to monitoring raters involves adjusting their ratings up or down depending on their severity or lenience, which can be estimated with the help of multifaceted Rasch analysis. For example, the TestDaF,
which measures German needed in academic studies, regularly adjusts reported
scores for rater severity or lenience (Eckes et al., 2005, p. 373).
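The idea behind such adjustment can be caricatured as follows; an operational system would estimate severity with multifaceted Rasch measurement rather than with the crude mean-centring used in this invented example.

```python
# Deliberately simplified sketch of severity adjustment (NOT a Rasch analysis):
# each rater's severity is crudely estimated as the deviation of their mean
# rating from the overall mean, and their ratings are corrected by that amount.
# All ratings are invented.
from statistics import mean

ratings_by_rater = {
    "rater_A": [3, 4, 3, 2, 4],   # rates somewhat low (severe)
    "rater_B": [4, 5, 4, 4, 5],   # rates somewhat high (lenient)
}

overall_mean = mean(r for rs in ratings_by_rater.values() for r in rs)

for rater, rs in ratings_by_rater.items():
    severity = mean(rs) - overall_mean     # negative = severe, positive = lenient
    adjusted = [r - severity for r in rs]  # shift ratings back toward the overall mean
    print(f"{rater}: severity {severity:+.2f}, adjusted {[f'{a:.2f}' for a in adjusted]}")
```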
Analytic rating scales appear to be the most common approach to rating speaking and writing in large-scale international language examinations, irrespective
of language. Several English (IELTS, TOEFL, Cambridge, Pearson), German
(Goethe Institut, TestDaF), and French (DELF, DALF) language examinations
implement analytic rating scales, although they typically report speaking and
writing as a single score or band.
It is usually also the case that international tests relying on analytic rating weight
all criteria equally and take the arithmetic or conceptual mean rating as the overall
score for speaking or writing (step 7, Figure 58.2). Exceptions to this occur,
however. The International Civil Aviation Organization (ICAO) specifies that all
aviation English tests adhering to their guidelines must implement the five dimensions of oral proficiency in a noncompensatory fashion (Bachman & Palmer, 1996,
p. 224). That is, the lowest rating across the five criteria determines the overall
level reached by the test taker (ICAO, 2004).
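The difference between the two ways of combining analytic ratings can be shown in a few lines; the criterion names and ratings below are invented.

```python
# Hypothetical sketch: combining analytic ratings compensatorily (mean) versus
# noncompensatorily (the lowest criterion decides). Criteria and ratings invented.
from statistics import mean

analytic_ratings = {"pronunciation": 4, "structure": 5, "vocabulary": 5,
                    "fluency": 4, "comprehension": 3}

compensatory = round(mean(analytic_ratings.values()))   # 4: strengths offset the weak area
noncompensatory = min(analytic_ratings.values())        # 3: the weakest criterion decides

print(compensatory, noncompensatory)                    # 4 3
```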

Reporting Scores
Score reports inform different stakeholders, such as test takers, parents, admission
officers, and educational authorities, about individuals' or groups' test results for
possible action. Thus, these reports can be considered more formal feedback to
the stakeholders. Score reports are usually pieces of paper that list the scores or
grades obtained by the learner, possibly with some description of the test and the
meaning of the grades. Some, typically more informal, reports may be electronic
in format, if they are based on computerized tests and intended only for the learners and their teachers (e.g., the report and feedback from DIALANG). Score
reports use the reporting scale onto which raw scores were converted, as described
in the previous section.
Score reports are forms of communication and thus have a sender, receiver,
content, and medium; furthermore, they serve particular purposes (Ryan, 2006,
p. 677). Score reports can be divided into two broad types: reports on individuals
and reports on groups. Reporting scores is greatly affected by the purpose and
type of testing.
The typical sender of score reports on individual learners based on classroom tests is the teacher, who acts on behalf of the school and municipality and
ultimately also as a representative of some larger public or private educational
system. The sender of more formal end-of-term school reports or final school-leaving certificates is most often the school, again acting on behalf of a larger
entity. The main audiences of both score reports and formal certificates are the
students and their parents, who may want to take some action based on the results
(feedback) given to them. School-leaving certificates have also other users such as
higher-level educational institutions or employers making decisions about admitting and hiring individual applicants.


School-external tests and examinations are another major originator of score
reports for individuals. The sender here is typically an examination board, a
regional or national educational authority, or a commercial test provider. Often
such score reports are related to examinations that take place only at important
points in the learners' careers, such as the end of compulsory education, end of
pre-university education, or when students apply for a place in a university. The
main users of such reports are basically the same as for school-based reports
except that in many contexts external reports are considered more prestigious and
trustworthy, and may thus be the only ones accepted as proof of language proficiency, for instance for studying in a university abroad.
In addition to score reports on individuals' performance, group-level reports
are also quite common. They may be simply summaries of individual score
reports at the class, school, regional, or national level. Sometimes tests are
administered from which no reports are issued to individual learners; only
group-level results are reported. The latter are typically tests given by educational authorities to evaluate students' achievement across the regions of a
country or across different curricula. International comparative studies on educational achievement exist, in language subjects among others. The best known
is the Programme for International Student Assessment (PISA) by the Organisation for Economic Co-operation and Development (OECD), which regularly tests
and reports country-level results of 15-year-olds' reading skills in their language
of education.
The content of score reports clearly depends on the purpose of assessment. The
prototypical language score report provides information about the test taker's
proficiency on the reporting scale used in the educational system or the test in
question. Scales consisting of numbers or letters are used in most if not all educational systems across the world. With the increase in criterion-referenced testing,
such simple scales are nowadays often accompanied by descriptions of what different scale points mean in terms of language proficiency. Entirely non-numeric
reports also exist; in some countries the reporting of achievement in the first years
of schooling consists of only verbal descriptions.
Score reports from language proficiency examinations and achievement tests
often report on overall proficiency only as a single number or letter (e.g., the
Finnish matriculation examination). Some proficiency tests, such as the TOEFL
iBT and the IELTS, issue subtest scores in addition to a total score. In many
placement contexts, too, it may not be necessary to report more than an overall
estimate of candidates' proficiency. However, the more the test aims at supporting learning, as diagnostic and formative tests do, the more useful it is to report
profiles based on subtests or even individual tasks and items. For example,
the diagnostic DIALANG test reports on test-, subskill-, and item-level
performance.

Current Research
Research on the three aspects of the testing process covered here is very uneven.
Test administration appears to be the least studied (McCallin, 2006, pp. 639–40),
except for the types of testing where it is intertwined with the test format, such
as in computerized testing, which is often compared with paper-based testing,
and in oral testing, where factors related to the setting and participants have been
studied. Major concerns with computerized tests include the effect of computer
familiarity on the test results and to what extent such tests are, or should be,
comparable with paper-based tests (e.g., adaptivity is really possible only with
computerized tests) (Chapelle & Douglas, 2006).
As far as oral tests are concerned, their characteristics and administration have
been studied for decades. In particular, the nature of the communication and the
effect of the tester (interviewer) have been hotly debated. For example, can the
prototypical test format, the oral interview, represent normal face-to-face communication? The imbalance of power, in particular, has been criticized (Luoma,
2004, p. 35), which has contributed to the use of paired tasks in which two candidates interact with each other, in a supposedly more equal setting. Whether the
pairs are in fact equal has also been a point of contention (Luoma, 2004, p. 37).
Research seems to have led to more mixed use of different types of speaking tasks
in the same test, such as both interviews and paired tasks. Another issue with the
administration conditions and equal treatment of test takers concerns the consistency of interviewers' behavior: Do they treat different candidates in the same
way? Findings indicating that they do not (Brown, 2003) have led the IELTS, for
example, to impose stricter guidelines on their interviewers to standardize their
behavior.
An exception to the paucity of research into the more general aspects of test
administration concerns testing time. According to studies reviewed by McCallin
(2006, pp. 631–2), allowing examinees more time on tests often benefits everybody,
not just examinees with disabilities. One likely reason for this is that many tests
that are intended to test learners' knowledge ("power tests") may in fact be at
least partly speeded.
Compared with test administration, research on scoring and rating of performances has a long tradition. Space does not allow a comprehensive treatment but
a list of some of the important topics gives an idea of the research foci:
•   analysis of factors involved in rating speaking and writing, such as the rater, rating scales, and participants (e.g., Cumming, Kantor, & Powers, 2002; Brown, 2003; Lumley, 2005);
•   linking test scores (and reporting scales) with the CEFR (e.g., Martyniuk, 2011);
•   validity of automated scoring of writing and speaking (e.g., Bernstein, Van Moere, & Cheng, 2010; Xi, 2010); and
•   scoring short answer questions (e.g., Carr & Xi, 2010).
Research into reporting scores is not as common as studies on scoring and
rating. Goodman and Hambleton (2004) and Ryan (2006) provide reviews of practices, issues, and research into reporting scores. Given that the main purpose of
reports is to provide different users with information, Ryan's statement that "whatever research exists presents a fairly consistent picture of the ineffectiveness of
score reports to communicate meaningful information to various stakeholder
groups" (2006, p. 684) is rather discouraging. The comprehensibility of large-scale
assessment reports, in particular, seems to be poor due to, for example, the use of
technical terms, too much information too densely packed, and lack of descriptive
information (Ryan, 2006, p. 685). Such reports could be made more readable, for
example, by making them more concise, by providing a glossary of the terms used,
by displaying more information visually, and by supporting figures and tables
with adequate descriptive text.
Ryan's own study on educators' expectations of the score reports from the statewide assessments in South Carolina, USA, showed that his informants wanted
more specific information about the students' performance and better descriptions
of what different scores and achievement levels meant in terms of knowledge and
ability (2006, p. 691). The educators also reviewed different types of individual
and group score reports for mathematics and English. The most meaningful report
was the "achievement performance level narrative," a four-level description of
content and content demands that systematically covered what learners at a particular level could and could not do (Ryan, 2006, pp. 692–705).

Challenges
Reviews of test administration (e.g., McCallin, 2006, p. 640) suggest that nonstandard administration practices can be a major source of construct-irrelevant variation in test results. The scarcity of research on test administration is therefore all
the more surprising. McCallin calls for a more systematic gathering of information
from test takers about administration practices and conditions, and for a more
widespread use of, for example, test administration training courseware as effective ways of increasing the validity of test scores (2006, p. 642).
Scoring and rating continue to pose a host of challenges, despite considerable
research. The multiple factors that can affect ratings of speaking and writing, in
particular, deserve further attention across all contexts where these are tested.
One challenge such research faces is that applying such powerful approaches as
multifaceted Rasch analysis in the study of rating data requires considerable
expertise.
Automated scoring will increase in the future, and will face at least two major
challenges. The first is the validity of such scoring: to what extent it can capture
everything that is relevant in speaking and writing, in particular, and whether it
works equally well with all kinds of tasks. The second is the acceptability of automated scoring, if used as the sole means of rating. Recent surveys of users indicate
that the majority of test takers feel uneasy about fully automated rating of speaking (Xi, Wang, & Schmidgall, 2011).
As concerns reporting scores, little is known about how different reports are
actually used by different stakeholders (Ryan, 2006, p. 709), although something
is already known about what makes a score report easy or difficult to understand.
Another challenge is how to report reliable profile scores for several aspects of
proficiency when each aspect is measured by only a few items (see, e.g., Ryan,
2006, p. 699). This is particularly worrying from the point of view of diagnostic
and formative testing, where rich and detailed profiling of abilities would be
useful.


Future Directions
The major change in the administration and scoring of language tests and in the
reporting of test results in the past decades has been the gradual introduction of
different technologies. Computer-based administration, automated scoring of
fairly simple items, and the immediate reporting of scores have been technically
possible for decades, even if not widely implemented across educational systems.
With the advent of new forms of information and communication technologies
(ICT) such as the Internet and the World Wide Web, all kinds of online and
computer-based examinations, tests, and quizzes have proliferated.
High stakes international language tests have implemented ICT since the time
optical scanners were invented. Some of the more modern applications are less
obvious, such as the distribution of writing and speaking samples for online
rating. The introduction of a computerized version of such high stakes examinations as the TOEFL in the early 2000s marked the beginning of a new era. The new
computerized TOEFL iBT and the PTE are likely to show the way most large-scale
language tests are headed.
The most important recent technological innovation concerns automated assessment of speaking and writing performances. The TOEFL iBT combines human
and computer scoring in the writing test, and implements automated rating in its
online practice speaking tasks. The PTE implements automated scoring in both
speaking and writing, with a certain amount of human quality control involved
(see also the Versant suite of automated speaking tests [Pearson, n.d.]). It can be
predicted that many other high stakes national and international language tests
will become computerized and will also implement fully or partially automated
scoring procedures.
What will happen at the classroom level? Changes in major examinations will
obviously impact schools, especially if the country has high stakes national
examinations. Thus, the inevitable computerization of national examinations will
have some effect on schools over time, irrespective of their current use of ICT.
The effect may simply be a computerization of test preparation activities, but
changes may be more profound, because there is another possible trend in computerized testing that may impact classrooms: more widespread use of computerized formative and diagnostic tests. Computers have potential for highly
individualized feedback and exercises based on diagnosis of learners' current
proficiency and previous learning paths. The design of truly useful diagnostic
tools and meaningful interventions for foreign and second language learning is
still in its infancy, and much more basic research is needed to understand
language development (Alderson, 2005). However, different approaches to
designing more useful diagnosis and feedback are being taken currently, including studies that make use of insights into dyslexia in the first language (Alderson
& Huhta, 2011), analyses of proficiency tests for their diagnostic potential (Jang,
2009), and dynamic assessment based on dialogical views on learning (Lantolf
& Poehner, 2004), all of which could potentially lead to tools that are capable of
diagnostic scoring and reporting, and could thus have a major impact on language education.


SEE ALSO: Chapter 51, Writing Scoring Criteria and Score Reports; Chapter 52,
Response Formats; Chapter 56, Statistics and Software for Test Revisions; Chapter
59, Detecting Plagiarism and Cheating; Chapter 64, Computer-Automated Scoring
of Written Responses; Chapter 67, Accommodations in the Assessment of English
Language Learners; Chapter 80, Raters and Ratings

References
Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (Eds.), Language testing
in the 1990s: The communicative legacy (pp. 71–86). London, England: Macmillan.
Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between learning
and assessment. New York, NY: Continuum.
Alderson, J. C., & Huhta, A. (2011). Can research into the diagnostic testing of reading in
a second or foreign language contribute to SLA research? In L. Roberts, M. Howard,
M. Laioire, & D. Singleton (Eds.), EUROSLA yearbook. Vol. 11 (pp. 30–52). Amsterdam, Netherlands: John Benjamins.
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (1999). Standards for educational and
psychological testing. Washington, DC: American Educational Research Association.
Bachman, L., & Palmer, A. (1996). Language testing in practice: Designing and developing useful
language tests. Oxford, England: Oxford University Press.
Bernstein, J., Van Moere, A., & Cheng, J. (2010). Validating automated speaking tests. Language Testing, 27(3), 355–77.
Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency.
Language Testing, 20(1), 1–25.
Carr, N., & Xi, X. (2010). Automated scoring of short-answer reading items: Implications
for constructs. Language Assessment Quarterly, 7(2), 205–18.
Chapelle, C., & Douglas, D. (2006). Assessing language through computer technology. Cambridge, England: Cambridge University Press.
Cizek, G., & Bunch, M. (2006). Standard setting: A guide to establishing and evaluating performance standards on tests. London, England: Sage.
Cohen, A., & Wollack, J. (2006). Test administration, security, scoring, and reporting. In
R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 355–86). Westport, CT: ACE.
Cumming, A., Kantor, R., & Powers, D. (2002). Decision making while rating ESL/EFL
writing tasks: A descriptive framework. Modern Language Journal, 86, 67–96.
Eckes, T., Ellis, M., Kalnberzina, V., Pižorn, K., Springer, C., Szollás, K., & Tsagari, C. (2005).
Progress and problems in reforming public language examinations in Europe: Cameos
from the Baltic States, Greece, Hungary, Poland, Slovenia, France and Germany. Language Testing, 22(3), 355–77.
Finnish Matriculation Examination Board. (n.d.). Finnish Matriculation Examination.
Retrieved July 14, 2011 from http://www.ylioppilastutkinto.fi
Goodman, D., & Hambleton, R. (2004). Student test score reports and interpretive guides:
Review of current practices and suggestions for future research. Applied Measurement
in Education, 17(2), 145–221.
International Civil Aviation Organization. (2004). Manual on the implementation of ICAO
language proficiency requirements. Montréal, Canada: Author.
Jang, E. (2009). Cognitive diagnostic assessment of L2 reading comprehension ability: Validity arguments for Fusion Model application to LanguEdge assessment. Language
Testing, 26(1), 31–73.


Kaftandjieva, F. (2004). Standard setting. Reference supplement to the preliminary pilot version
of the manual for relating language examinations to the Common European Framework of
Reference for Languages: Learning, teaching, assessment. Strasbourg, France: Council of
Europe.
Lantolf, J., & Poehner, M. (2004). Dynamic assessment: Bringing the past into the future.
Journal of Applied Linguistics, 1, 49–74.
Little, D. (2005). The Common European Framework and the European Language Portfolio:
Involving learners and their judgments in the assessment process. Language Testing,
22(3), 321–36.
Lumley, T. (2005). Assessing second language writing: The rater's perspective. Frankfurt,
Germany: Peter Lang.
Luoma, S. (2004). Assessing speaking. Cambridge, England: Cambridge University Press.
Malone, M. (2000). Simulated oral proficiency interview: Recent developments (EDO-FL-00-14).
Retrieved July 14, 2011 from http://www.cal.org/resources/digest/0014simulated.html
Martyniuk, W. (Ed.). (2011). Aligning tests with the CEFR: Reflections on using the Council of
Europe's draft manual. Cambridge, England: Cambridge University Press.
McCallin, R. (2006). Test administration. In S. Downing & T. Haladyna (Eds.), Handbook of
test development (pp. 625–51). Mahwah, NJ: Erlbaum.
McNamara, T. (1996). Measuring second language performance. Boston, MA: Addison Wesley
Longman.
Millman, J., & Greene, J. (1993). The specification and development of tests of achievement
and ability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 335–66). Phoenix,
AZ: Oryx Press.
Mousavi, S. E. (1999). A dictionary of language testing (2nd ed.). Tehran, Iran: Rahnama
Publications.
Pearson. (n.d.). Versant tests. Retrieved July 14, 2011 from http://www.versanttest.com
Ryan, J. (2006). Practices, issues, and trends in student test score reporting. In S. Downing
& T. Haladyna (Eds.), Handbook of test development (pp. 677–710). Mahwah, NJ: Erlbaum.
Shohamy, E. (1994). The validity of direct versus semi-direct oral tests. Language Testing,
11(2), 99–123.
Taylor, L. (2011). Examining speaking: Research and practice in assessing second language speaking. Cambridge, England: Cambridge University Press.
Xi, X. (2010). Automated scoring and feedback systems: Where are we and where are we
heading? Language Testing, 27(3), 291–300.
Xi, X., Wang, Y., & Schmidgall, J. (2011, June). Examinee perceptions of automated scoring of
speech and validity implications. Paper presented at the LTRC 2011, Ann Arbor, MI.

Suggested Readings
Abedi, J. (2008). Utilizing accommodations in assessment. In E. Shohamy & N. Hornberger
(Eds.), Encyclopedia of language and education. Vol. 7: Language testing and assessment (2nd
ed., pp. 331–47). New York, NY: Springer.
Alderson, J. C. (2000). Assessing reading. Cambridge, England: Cambridge University Press.
Becker, D., & Pomplun, M. (2006). Technical reporting and documentation. In S. Downing
& T. Haladyna (Eds.), Handbook of test development (pp. 711–23). Mahwah, NJ: Erlbaum.
Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human
sciences (2nd ed.). Mahwah, NJ: Erlbaum.
Buck, G. (2000). Assessing listening. Cambridge, England: Cambridge University Press.
Fulcher, G. (2003). Testing second language speaking. Harlow, England: Pearson.
Fulcher, G. (2008). Criteria for evaluating language quality. In E. Shohamy & N. Hornberger
(Eds.), Encyclopedia of language and education. Vol. 7: Language testing and assessment (2nd
ed., pp. 157–76). New York, NY: Springer.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book.
London, England: Routledge.
North, B. (2001). The development of a common framework scale of descriptors of language proficiency based on a theory of measurement. Frankfurt, Germany: Peter Lang.
Organisation for Economic Co-operation and Development. (n.d.). OECD Programme
for International Student Assessment (PISA). Retrieved July 14, 2011 from http://
www.pisa.oecd.org
Weigle, S. (2002). Assessing writing. Cambridge, England: Cambridge University Press.
Xi, X. (2008). Methods of test validation. In E. Shohamy & N. Hornberger (Eds.), Encyclopedia of language and education. Vol. 7: Language testing and assessment (2nd ed.,
pp. 177–96). New York, NY: Springer.
