Article

Language Testing
29(1) 91–108
© The Author(s) 2011
Reprints and permission: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0265532211411078
ltj.sagepub.com

TOEFL iBT™ speaking test scores as indicators of oral communicative language proficiency

Brent Bridgeman, Donald Powers, Elizabeth Stone and Pamela Mollaun
Educational Testing Service, USA

Corresponding author:
Brent Bridgeman, 09-R, ETS, Princeton, NJ 08541, USA.
Email: bbridgeman@ets.org.

Abstract
Scores assigned by trained raters and by an automated scoring system (SpeechRater™) on
the speaking section of the TOEFL iBT™ were validated against a communicative competence
criterion. Specifically, a sample of 555 undergraduate students listened to speech samples from
184 examinees who took the Test of English as a Foreign Language Internet-based test (TOEFL
iBT). Oral communicative effectiveness was evaluated both by rating scales and by the ability
of the undergraduate raters to answer multiple-choice questions that could be answered only
if the spoken response was understood. Correlations of these communicative competence
indicators from the undergraduate raters with speech scores were substantially higher for
the scores provided by the professional TOEFL iBT raters than for the scores provided by
SpeechRater. Results suggested that both expert raters and SpeechRater are evaluating aspects
of communicative competence, but that SpeechRater fails to measure aspects of the construct
that human raters can evaluate.

Keywords
automated speech scoring, communicative competence, test validity, TOEFL

Improving the assessment of communicative language proficiency was one of the driving
forces behind the development of the Test of English as a Foreign Language Internet-
based Test (TOEFL iBT) (Chapelle, Grabe, & Berns, 1997; Butler, Eignor, Jones,
McNamara, & Suomi, 2000; Chapelle, Enright, & Jamieson, 2008); therefore, part of the
validity argument for the speaking component of the TOEFL iBT should focus on the oral
communicative effectiveness of the examinees. The speaking test is intended to measure
test takers’ ability to actually communicate in English about what they have read and
heard. This emphasis seems consistent with thoughtful explications of communicative
competence provided elsewhere (see, e.g., Bachman, 1990; Canale & Swain, 1980;
Douglas & Smith, 1997; Duran, Canale, Penfield, Stansfield, & Liskin-Gasparro, 1985;
Henning & Cascallar, 1992; Stansfield, 1986). These models can get fairly complex as
evidenced, for example, in the sociolinguistic, grammatical, strategic, and discourse com-
ponents of communicative competence explicated by Canale and Swain (1980). We are
using the term in its most basic sense simply to emphasize that the speech sample col-
lected as part of the assessment should serve as an indicator of how comprehensible the
speaker would be to ordinary listeners. That is, we are not concerned that the speaker
follow all the grammatical rules or that pronunciation be perfect, but only that the
speaker’s message be understood by ordinary listeners. But the ratings produced during
the test scoring process are not supplied by ordinary listeners; rather they are provided by
expert raters with years of experience listening to the speech of non-native speakers of
English. The ratings supplied by these expert raters may or may not be a good indication
of how well the speaker communicates with ordinary listeners.
Previous research comparing naive raters and expert raters has produced mixed
results. Hadden (1991) showed videotapes of eight Chinese students speaking extempo-
raneously to 25 ESL teachers and 32 non-teachers. She found that the non-teachers rated
the linguistic ability higher than the ESL teachers, but that ratings on other factors were
not significantly different. Elder (1993) developed a classroom-based observation sched-
ule and had the math/science teaching performance of non-native students evaluated by
nine ESL teachers and eight math/science subject specialists. She found a reasonably
strong relationship between the scores assigned by these different groups in terms of
overall communicative effectiveness, although they did appear to differentially weight
specific factors in reaching their overall judgments. Chalhoub-Deville (1995) evaluated
ratings of audiotaped speech samples of Arabic as a foreign language by three different
rater groups, all of whom were native speakers of Arabic: 15 teachers of Arabic as a
foreign language, 31 non-teachers residing in the USA for at least a year, and 36 residents
of Lebanon. She found that the rater groups differed in their expectations and evalua-
tions. Although Hadden (1991), Elder (1993), and Chalhoub-Deville (1995) all con-
trasted ratings by experts in second language teaching with ratings by non-experts, their
teaching experts did not have extensive training or experience as raters. On the other
hand, Barnwell (1989) used a trained and experienced ACTFL oral interviewer and com-
pared the ratings to those made by naive native speakers. He found that there was ‘con-
siderable divergence’ in ratings of oral communication between the ACTFL interviewer
and naive native speaker raters, and that the naive raters were stricter. Although the
expert rater was presumably well trained, this research depended on the scores produced
by this single expert. Indeed, nearly all research in this area relies on small samples of
speakers or small samples of raters, or small samples of both speakers and raters.
Prior research on expert and naive raters has tended to focus on differences in rated
performance, but not on actual evidence of communicative effectiveness. Receiving a
high rating on a rating scale does not provide direct evidence that a particular message
was understood. Referential, or informing, communication, as described by Dickson and
Patterson (1981) is concerned with the accuracy of communication. Evidence of accurate
communication is that listeners can take some action or answer content-related questions
based on what they have heard. Powers, Schedl, Wilson-Leung, and Butler (1999) argued that
evidence of successful referential communication provides a very meaningful criterion
in building the validity argument for a speaking test. In this approach, both listeners and
speakers are essential in the co-construction of meaning, albeit asynchronously. Powers
et al. showed that the ability of undergraduates to answer questions based on what they
heard was indeed related to the score levels on the Test of Spoken English as determined
by expert raters.
As machines begin to be used as raters of spoken English, it becomes necessary to
also evaluate the machine judgments as indicators of communicative competence.
Although automated (or computer) scoring of open-ended written responses has a rela-
tively long history (see, e.g., Shermis & Burstein, 2003; Williamson, Mislevy, & Bejar,
2006), automated scoring of spoken responses has a somewhat shorter history.
An automated system for scoring speech, SpeechRater™, was developed by
Educational Testing Service (ETS). It is designed to evaluate spontaneous speech from
non-native speakers. It evaluates relatively extended, open-ended speech, in contrast to
earlier systems that are limited to short and predictable responses (e.g. Bernstein, 1999;
Bernstein, Van Moere, & Cheng, 2010). SpeechRater requires both sophisticated automated
speech recognition tools and natural language processing tools. It assesses
features related to the pace, pronunciation, and fluency of recorded speech. As described
by its developers:

SpeechRater consists of three main components: the speech recognizer, trained on about 30
hours of non-native speech, the feature computation module, computing about 40 features
predominantly in the fluency dimension, and the scoring model, which combines a selected set
of speech features to predict a speaking score using multiple regression (Zechner, Higgins, &
Xi, 2007).

Evaluating speech is a more difficult endeavor than evaluating written responses, in
large part because simply recognizing words is substantially more difficult for a computer
with spoken than with written language (Zechner, Bejar, & Hemat, 2007).
When the task is constrained to what the computer can do well (e.g. fluency evaluation),
the agreement of humans and machine is quite high, especially if the task is further lim-
ited to read speech as compared to spontaneous speech. Cucchiarini, Strik, & Boves
(2002) found correlations as high as 0.92 between rated fluency and a machine evalua-
tion of rate of speech. When the scoring rubric includes features that are not easily evalu-
ated by machine, such as the coherence, idea progression, and content relevance features
of the TOEFL speaking rubric, correlations between machine scores and human ratings
would be expected to be substantially reduced. Nevertheless, because the features the
computer can evaluate tend to be correlated with features the computer cannot evaluate,
the computer is reasonably successful at predicting human scores even with the rela-
tively unconstrained TOEFL spontaneous speech samples, yielding a correlation of 0.68
(Xi, Higgins, Zechner, & Williamson, 2008).
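As an illustrative sketch only, the following shows the general form of a regression-based scoring model of the kind quoted above: extracted delivery features are combined with fixed weights to predict a task score. The feature names, weights, intercept, and clipping to the 1–4 rubric range below are hypothetical, not SpeechRater’s actual features or coefficients.

```python
# Hypothetical sketch of a regression-based speech scoring model: a fixed set of
# delivery features is combined linearly (weights estimated offline by multiple
# regression on human-scored responses) to predict one task score on the 1-4 rubric.
# Feature names, weights, and the intercept are illustrative only.
import numpy as np

feature_names = ["words_per_second", "silence_ratio", "pronunciation", "type_token_ratio"]
weights = np.array([0.9, -1.2, 1.5, 0.8])  # hypothetical regression coefficients
intercept = 1.4

def predict_task_score(features: np.ndarray) -> float:
    """Linear prediction of one task score, clipped to the 1-4 rubric range."""
    raw = intercept + weights @ features
    return float(np.clip(raw, 1.0, 4.0))

# One hypothetical response: fast speech, few silences, good pronunciation.
print(predict_task_score(np.array([0.7, 0.3, 0.8, 0.6])))  # approximately 3.35
```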
A primary rationale for the automatic evaluation of both written and spoken responses
has been efficiency. Simply put, there are gains to be made in both time and cost when
computers are able to perform evaluations that have typically required the careful train-
ing and labor-intensive use of human judges. By and large, significant efficiencies have
been achieved and documented in this regard. Another often overlooked advantage of
automated scoring is the potential for ‘construct control,’ that is, the ability to diminish
(or in some instances to completely eliminate) the influence of irrelevant factors that may
affect human judgments, and therefore possibly dilute the validity of scores based on
them (Bennett & Bejar, 1998). Automated scoring also has promise for establishing and
consistently maintaining appropriate weights for those factors that are deemed to be most
relevant for evaluating responses.
Beyond efficiency, the accuracy and validity of computer evaluations are another mat-
ter for both written and spoken responses. With respect to the evaluation of written
responses, there is ample evidence that automated scores can be computed for which
agreement with human raters is quite high. However, using judgments from expert
human raters as the sole ‘gold standard’ is problematic (Bennett & Bejar, 1998).
Additional validity evidence can be obtained by evaluating automated scores against
independent, external criteria that represent defensible and valued outcomes (Bernstein,
Van Moere, & Cheng, 2010).
The current research is an attempt to validate both trained judges’ and machine scores
against an external criterion of communicative understanding. The external criterion is
represented by naive listeners who are representative of the target context: in this case,
undergraduate students. The criterion includes a direct measure of the undergraduate
students’ ability to comprehend the speech as evidenced by their answering multiple-
choice questions that could be answered only if the speech of the test takers was under-
stood, along with a rating of the amount of effort required to understand the speaker.
The criterion we have used here is judgments made by naive (that is, untrained)
judges. Counterintuitively perhaps, we suggest that the use of naive raters – in this case
undergraduate students – has advantages over the use of trained, experienced experts,
even though the latter may be far more adept at understanding the utterances of non-
native speakers than are typical ‘persons-on-the-street.’ The use of naive raters may be
especially appropriate because the objective of the test is to establish not how well test
takers can communicate with experts, but rather how well they can relate with peers, who
are, generally, not expert at interpreting the speech of non-native speakers.
The specific research questions addressed were as follows:

1. What is the relationship of speaking scores from expert TOEFL raters to ratings
of the communicative effectiveness of the same speech samples by naive under-
graduates, with effectiveness defined as effort expended to understand the speech,
confidence that the message was understood, interference of language abilities
(e.g. pronunciation and grammar) with understanding, and whether the speaker
answered the question posed in the prompt?
2. What is the relationship of speaking scores from expert TOEFL raters to compre-
hension of the same speech samples by naive undergraduates as indicated by
their ability to answer multiple-choice questions based on what they heard?
3. What is the relationship of scores from an automated speech scoring system to
ratings and comprehension scores from undergraduates?

Method
Participants
Undergraduate students who served as the raters were recruited from five universities in
the United States, including three public universities in the East, a private university in
the South, and a private university in the West. Local recruiters in each university were
paid $10 for each student recruited. The student volunteers had to be enrolled full time in
the university, be native speakers of English, and have access to a PC or Mac with a high-
speed Internet connection and the ability to play spoken responses (preferably through
earphones). Each student volunteer received a $50 gift check for listening to and evalu-
ating 12 speech samples. A final sample of 555 raters was obtained. Slightly less than
20% of the raters were freshmen, and 57% were juniors or seniors. They came from a
diverse array of undergraduate majors; about 20% were education majors, with about
10% each from business, arts and humanities, social sciences, and natural sciences; the
remainder were either undeclared or from other fields.

Speech samples
The speech samples came from regular operational administrations of two TOEFL iBT
speaking forms, referred to in this paper as Form 1 and Form 2. These forms were
selected because the content of the questions had previously been disclosed, so there
were no security concerns in exposing speech samples from these forms to the under-
graduate raters. Each form contained six speaking tasks, so each form produced six
speech samples for each examinee. Two of the tasks, the independent tasks, ask the
examinee to speak about familiar topics. The remaining four tasks require a spoken
response that integrates material presented in either a short reading passage or spoken
stimulus. All tasks are in an academic or campus-based context. These six tasks are
highly related and are summarized in a single speaking score. This score is the sum of
ratings on the six tasks with the condition that any given rater could evaluate no more
than two responses from a particular examinee. Each task is rated on a 1–4 scale, so raw
scores across the six tasks could range from 6 to 24. Although these scores are converted
to a 0–30 scale for reporting purposes, we based our analyses on the raw scores.
Operational scores and audio response files associated with these forms were retrieved
for 400 randomly selected examinees from each form. Because our student raters might
not be using very high-fidelity playback on their home computers, and because the
recording quality from operational administrations is variable, we screened the responses
for the technical quality of the recording and retained the 100 best responses for each
form. Note that ‘best’ here refers only to technical quality of the recording – the selected
responses covered the full range of speaking ability levels. In addition, three sample
responses to the first task were selected. These samples were from widely different score
levels and were used to get raters accustomed to the range of abilities in the samples they
were about to rate. These samples should not be considered as training sets, as there was
no score attached to any of the samples.

Materials
Comprehension questions.  An experienced test developer wrote comprehension ques-
tions for the student raters for each of the six tasks in each form. These questions were
in a five-choice multiple-choice format, and could be answered by the raters only if
they understood what the examinee had said. The comprehension questions for Form
1 are in the Appendix; questions for Form 2 were similar. Although we would have
preferred multiple comprehension questions for each task, for some tasks we were
able to produce only a single question that could reasonably be answered by a naive
listener. Hence the number of comprehension questions per task ranged from one to
three. Across the six tasks there were 13 comprehension questions for Form 1 and 12
for Form 2.

Rating scales.  The other questions were 5-point rating scales in four areas: effort (e.g. ‘As
a listener, how much effort was required to understand the speaker’ with choices from
‘very little’ to ‘not comprehensible’), confidence (e.g. ‘How confident are you that you
understood what the speaker was trying to say?’ with choices from ‘extremely confident’
to ‘not confident at all’), interference (e.g. ‘How much did the speaker’s English lan-
guage abilities [pronunciation, vocabulary, or grammar] interfere with your understand-
ing of the response?’ with choices from ‘Did not interfere at all’ to ‘Almost always
interfered’), and task fulfillment. For the task fulfillment question, the rater was first
given some background on the task assigned to the speaker and then asked how success-
ful the speaker was at answering the question. For example:

Now read the question that the speaker was asked:

Using the research described by the professor, explain what scientists have learned about the
mathematical abilities of babies.

In your opinion, how successful was the speaker at answering the question?
A)  Extremely successful
B)  Very successful
C)  Somewhat successful
D)  Not very successful
E)  Not at all successful

Task fulfillment questions in this format could not be written for every task either because
the task was simply to provide an opinion or because an extensive presentation of exactly
what the speaker had read and heard would be required; we were able to produce four task
fulfillment questions for Form 1 and three for Form 2.

Design and procedures


Comprehension scores and ratings were based on means provided by virtual ‘teams’ of
naive undergraduate judges. This approach extends the number of evaluations well
beyond the relatively limited number of raters that are typically employed to judge test
takers’ responses. The use of multiple raters allows relevant sources of test score varia-
tion to accumulate while irrelevant sources wash out, thus resulting in more valid and
generalizable scores (Messick, 1989). Raters were assigned to 100 virtual teams by the
order in which they logged in to the scoring system. The first to log in was assigned to
Team 1, the next to Team 2, and so on until all 100 teams had one member, then the next
person to log in would be the second member of Team 1, etc. Each rating team consisted
of approximately five raters. They were considered a team because each member of a
team rated the same spoken response, and the scores on the ratings and multiple-choice
questions were based on the average across all members of a team. But all ratings were
done independently, and the ‘team’ members never interacted with each other. A rating
team could rate only one response to a particular task because once they had heard one
good response they could answer the multiple-choice question related to that task. Each
team rated six different examinees, one for each task from Form 1, and another six
examinees for Form 2. Thus, each rater rated a total of 12 examinees, and the six speech
samples from a given examinee would be rated by six different teams. Although there
were supposed to be 100 examinees evaluated from each form, because of a software
glitch the final sample with usable data consisted of 94
examinees from each form.
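A minimal sketch of this round-robin assignment is shown below; the rater identifiers are hypothetical, and nothing about the actual scoring system’s implementation is implied.

```python
# Sketch of the round-robin team assignment described above: raters are assigned
# to 100 virtual teams in the order in which they log in, so with 555 raters each
# team ends up with five or six members. Rater identifiers are hypothetical.
def assign_teams(rater_ids, n_teams=100):
    """Return {team_number: [rater_ids]} with raters assigned in login order."""
    teams = {t: [] for t in range(1, n_teams + 1)}
    for i, rater in enumerate(rater_ids):
        teams[(i % n_teams) + 1].append(rater)
    return teams

teams = assign_teams([f"rater_{i:03d}" for i in range(1, 556)])
print(len(teams[1]), len(teams[100]))  # 6 5 -> teams of five or six raters
```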
After a rater logged in to the system, the task was described as follows:

The purpose of this study is to investigate how well tests of speaking ability, such as the
Speaking component of the Test of English as a Foreign Language (TOEFL), relate to the
ability to communicate. This communicative ability will be ascertained by obtaining information
from you and other study participants about the quality of spoken responses to TOEFL
questions. In some of the test questions for which you will hear responses, speakers were asked
to express opinions on a topic, using their background knowledge and personal experiences to
support their answers. In other questions, they read and/or listened to information and were
then asked to speak about the information they had read and heard.

In this activity, you will hear responses to 12 test questions. You will listen to one response
for each question. Each response will be provided by a different speaker, and each will take
approximately one minute or less to listen to. After listening to each response, you will be
asked to complete a set of questions about the response. You will not be provided with the
reading or listening information from the test questions, nor will you hear the questions the
speakers were asked. Instead you will be asked to complete your evaluations based on
the information conveyed in the responses. Please complete all of the evaluation questions
for each response.

Click the Play button to play a response. Please note that you may listen to each response only
one time.

On the next few screens you will listen to sample responses at three different levels of ability to
one test question and see an example of a question you might be asked about what you heard.
The sample responses are in order of increasing score to give you an idea of the various levels
of speaking ability you might hear. After you have completed this brief sample task, you will
be directed to begin the activity.

For each examinee on each task the data consisted of a score from the regular raters
in the operational administration (on a 1–4 scale), a score from SpeechRater version 1.0,
and scores from the rating scales and multiple-choice questions provided by the naive
undergraduate raters. Each of these scores was summed across the six tasks taken by an
individual examinee to form a total score in each category (e.g. a total effort score [from
the rating teams], a total expert score [from the regular TOEFL raters], and a total
SpeechRater score).

Results and discussion


Comprehension scores
The comprehension score reflected the raters’ understanding of the speaker, as demonstrated
by their ability to answer the multiple-choice content questions. The mean score across the
13 questions in Form 1 was 6.43 (SD = 1.80), and for the 12 questions on Form 2 the mean
was 6.01 (SD = 1.68), so for both forms the raters got about half of the questions correct,
suggesting that the questions
were at an appropriate level to provide meaningful discriminations among examinees.
The alpha reliability of the scores from these questions was 0.76 in both forms. This is a
remarkably high reliability for such short tests, especially considering that differences
across both tasks and raters are included in this estimate.
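For reference, the coefficient reported here is the usual Cronbach's alpha for a score composed of $k$ items:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_X^{2}}\right),$$

where $k$ is the number of comprehension questions (13 for Form 1, 12 for Form 2), $\sigma_i^{2}$ is the variance of question $i$, and $\sigma_X^{2}$ is the variance of the total comprehension score across examinees.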

Rating scale scores


Number of questions, means, standard deviations, and means adjusted for the number of
questions for the four rating scales are presented in Table 1. The five-point rating scales
were scored such that high scores always indicate a positive outcome (e.g. a 5 on the
effort scale means very little effort was required, while a 5 on the confidence scale indi-
cates a high level of confidence that the speaker was understood). When divided by the
number of items in the scale, most ratings were near the mid-point of 3.

Table 1.  Means and standard deviations for rating scales

Scale             Form 1                                  Form 2
                  # Quest.  M      SD    M / # Quest.     # Quest.  M      SD    M / # Quest.
Effort            6         17.95  4.60  2.99             6         17.50  4.03  2.92
Confidence        6         17.39  4.01  2.89             6         17.17  3.65  2.86
Interference      6         17.26  4.40  2.88             6         16.58  4.00  2.76
Task fulfillment  4         11.63  2.12  2.91             3         8.78   1.62  2.92

Table 2.  Alpha reliability of rating scales

Rating scale      Form 1  Form 2
Effort            0.92    0.88
Confidence        0.89    0.85
Interference      0.92    0.89
Task fulfillment  0.76    0.62

Table 3.  Correlations among rating scales

Scale         Form 1                                          Form 2
              Confidence  Interference  Task fulfillment      Confidence  Interference  Task fulfillment
Effort        0.98        0.98          0.83                  0.97        0.98          0.82
Confidence                0.95          0.82                              0.96          0.85
Interference                            0.80                                            0.82

Table 4.  Means and standard deviations for operational (human) and SpeechRater scores

Score                 Form 1           Form 2
                      M       SD       M       SD
Operational (human)   16.52   3.76     15.62   3.26
SpeechRater           16.55   1.99     16.47   2.37

Alpha reliabilities for the four scales are in Table 2. As expected, alphas were lowest
for the very short task fulfillment scales.
Correlations among the rating scale scores are presented in Table 3. When corrected
for unreliability, these correlations reach or exceed 0.99. Given these high correlations,
we produced a rating scale total score which is the simple sum of scores on the four
scales; only this total is used in subsequent analyses.
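The correction applied here is presumably the standard Spearman correction for attenuation,

$$\hat{r} = \frac{r_{xy}}{\sqrt{r_{xx}\,r_{yy}}},$$

where $r_{xy}$ is the observed correlation and $r_{xx}$ and $r_{yy}$ are the reliabilities of the two scores; for example, the Form 1 effort and task fulfillment scales give $0.83/\sqrt{0.92 \times 0.76} \approx 0.99$.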
Although the objective comprehension scores and subjective rating total scores could
be tapping somewhat different aspects of oral communicative competence, these scores
were substantially correlated – 0.89 in Form 1 and 0.82 in Form 2.

Operational human scores and SpeechRater scores


Means and standard deviations for the scores assigned by the regular human raters and
by SpeechRater are presented in Table 4. As part of the SpeechRater development proc-
ess, the SpeechRater mean is scaled to the human mean over a large number of forms,
but for any given form the means could diverge. For the forms used in this study the
means for the operational and SpeechRater scores are quite comparable, but standard
deviations are somewhat lower for the SpeechRater scores.
The primary research question is the extent to which communicative competence, as
defined by the scores from the undergraduate raters, is related to operational (human)
scores and SpeechRater scores. The answer to this question is in Table 5.

Table 5.  Correlation of undergraduate rater scores with operational (human) scores and SpeechRater scores

Undergraduate rater scores    Form 1                        Form 2
                              Operational  SpeechRater      Operational  SpeechRater
Comprehension                 0.79         0.47             0.66         0.32
Rating scale total            0.79         0.45             0.67         0.28
For both the comprehension scores and rating scale scores, correlations with opera-
tional scores are relatively high, suggesting that standard TOEFL iBT scores obtained
from expert raters are a valid indicator of communicative competence. The alpha relia-
bility of the human scores across the six tasks was 0.87 in both forms. Thus, corrected for
unreliability, the correlation of the operational (human) scores and the comprehension
scores was 0.97 in Form 1, and 0.81 in Form 2.
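Under the same correction for attenuation, these values follow directly from the correlations in Table 5 and the two reliabilities (0.87 for the human scores, 0.76 for the comprehension scores):

$$\frac{0.79}{\sqrt{0.87 \times 0.76}} \approx 0.97, \qquad \frac{0.66}{\sqrt{0.87 \times 0.76}} \approx 0.81.$$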
Although correlations of the criterion scores with SpeechRater scores are substantial,
they are far lower than the correlations of the criteria with scores from expert human
raters. The differences in the correlations with comprehension scores between expert
human and SpeechRater scores were statistically significant (Form 1: matched-sample
t[91] = 6.32, p < .01; Form 2: t[91] = 5.21, p < .01). Correlation differences for
the rating scale scores were similarly significant. The alpha reliability of the SpeechRater
scores was 0.95 in Form 1 and 0.96 in Form 2; corrected for unreliability, the correlation
of SpeechRater scores with comprehension scores was 0.55 for Form 1 and 0.37 in
Form 2. The difference in the correlations for expert humans and SpeechRater suggests
that SpeechRater is capturing some aspects of communicative competence, but is also
missing some important components of the construct. This interpretation is consistent
with claims made for SpeechRater, specifically that, ‘Although the [SpeechRater] mod-
els did not include any Topic Development features such as coherence, progression of
ideas, and content relevance and did not cover the full spectrum of Language Use, they
represented the Delivery features very well, especially fluency’ (Xi, Higgins, Zechner,
& Williamson, 2008, p. 63). Further confirming this interpretation of overlap but not
identity in the constructs assessed by human raters (whether experts or undergraduates)
and by SpeechRater is the correlation between operational and SpeechRater scores
observed in our sample; the correlation was 0.69 for the Form 1 sample and 0.65 for the
Form 2 sample.
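The text does not name the matched-sample test used above. One standard procedure for comparing two dependent correlations that share a variable, the Hotelling–Williams t with df = n − 3, reproduces the reported values from the correlations given in the text; the sketch below is an assumed reconstruction, not necessarily the authors’ procedure.

```python
# Sketch of the Hotelling-Williams t test for two dependent correlations that share
# a variable (here, the comprehension criterion correlated with the operational human
# score and with the SpeechRater score). Assumed reconstruction; it happens to
# reproduce the reported t values from the rounded correlations in the text.
import math

def williams_t(r12, r13, r23, n):
    """t (df = n - 3) for H0: r12 = r13, where variable 1 is shared."""
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    rbar = (r12 + r13) / 2
    num = (n - 1) * (1 + r23)
    den = 2 * ((n - 1) / (n - 3)) * det + rbar**2 * (1 - r23)**3
    return (r12 - r13) * math.sqrt(num / den)

# Form 1: comprehension vs. human (0.79), vs. SpeechRater (0.47); human vs. SpeechRater 0.69.
print(round(williams_t(0.79, 0.47, 0.69, 94), 2))  # 6.32, matching the reported t[91]
# Form 2: 0.66, 0.32, 0.65; prints 5.22 from these rounded inputs (5.21 reported).
print(round(williams_t(0.66, 0.32, 0.65, 94), 2))
```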
The superiority of expert human scores to SpeechRater scores is apparent in both
forms, but all correlations are lower in Form 2. The reason for these lower correlations
was not immediately apparent; the comprehension scores were equally reliable in both
forms, and rating scale scores were only slightly less reliable in Form 2. Although the
item specifications in all forms are tightly controlled to be comparable and the forms are
parallel, the sample of examinees who take a particular form is not controlled. Because
of the way the test must be administered to avoid time zone cheating, the proportion of
students taking the test from a particular native language group varies across forms. For
Form 1 in the current study, about a third of the examinees reported that their native
language was one of three Asian languages (Chinese, Japanese, and Korean) and another
third reported one of three European languages (French, German, and Italian) with the
remainder from 18 other languages. For Form 2, the proportion of examinees whose
native language was one of these three European languages was far lower (7%), and the
proportion with Asian languages was somewhat higher (51%). Table 6 shows the number
of examinees in each form by these language groupings along with their operational
(human) and SpeechRater scores, and their comprehension and rating total scores from
the undergraduate raters.

Table 6.  Means, standard deviations, and sample sizes by language group

Language       Score            Form 1              Form 2
                                M        SD         M        SD
Asian(a)       n                34                  48
               Operational      13.97    3.13       14.29    2.84
               SpeechRater      15.46    1.93       15.87    2.31
               Comprehension    5.47     1.65       5.71     1.57
               Rating total     57.16    13.00      57.80    12.00
European(b)    n                33                  7
               Operational      17.88    3.76       16.43    3.10
               SpeechRater      16.74    1.67       16.52    1.48
               Comprehension    7.04     1.80       6.12     2.06
               Rating total     69.66    16.21      60.66    15.18
Other          n                27                  39
               Operational      18.07    2.70       17.10    3.16
               SpeechRater      17.70    1.76       17.21    2.42
               Comprehension    6.91     1.48       6.40     1.71
               Rating total     66.51    11.19      62.68    13.52

(a) Chinese, Japanese, and Korean. (b) French, German, and Italian.
For all measures on both forms, scores were higher in the European language group
than in the Asian language group. Examinees are not randomly assigned to forms, so
there is no expectation that within a language group means would be comparable across
forms. Nevertheless, in the relatively large Asian group, means for all measures were
fairly constant across forms.

Table 7.  Correlation of undergraduate rater scores with operational (human) scores and SpeechRater scores by language group

Language       Score            Form 1                        Form 2
                                Operational  SpeechRater      Operational  SpeechRater
Asian(a)       Comprehension    0.71         0.31             0.60         0.25
               Rating total     0.74         0.31             0.76         0.32
European(b)    Comprehension    0.80         0.52             0.91         0.81
               Rating total     0.88         0.62             0.93         0.74
Other          Comprehension    0.66         0.31             0.66         0.26
               Rating total     0.51         0.14             0.52         0.11

(a) Chinese, Japanese, and Korean. (b) French, German, and Italian.
Results relevant to the question of primary interest, the correlations of operational and
SpeechRater scores with scores from the undergraduate raters for the different language
groups, are presented in Table 7. Across language groups (Table 5), correlations were
consistently higher for Form 1, but in Table 7 the correlation of the rating total with both
operational and SpeechRater scores was consistently higher (or equal) in Form 2. This is
an example of Simpson’s Paradox, in which results across groups are inconsistent with
the results within each group. The explanation of the paradox is in the differences in
subgroup sample sizes across forms; correlations are substantially higher for the
European language group, and, as shown in Table 6, this group is much larger in Form 1
than in Form 2.
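A small simulation with hypothetical data illustrates the mechanism: holding within-group correlations fixed, the pooled correlation rises as the higher-scoring, more consistently rated group makes up a larger share of a form's sample. The two-group split, means, and correlations below are illustrative, not the study data.

```python
# Illustration of the composition effect behind the Simpson's-paradox pattern:
# within-group correlations are held constant, but the pooled correlation depends
# on the mix of groups. Group sizes echo the European-like group vs. the rest
# (33 vs. 61 on Form 1, 7 vs. 87 on Form 2); all other values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def group(n, mean, r):
    # bivariate normal scores (e.g., human score and undergraduate rating total)
    cov = [[1.0, r], [r, 1.0]]
    return rng.multivariate_normal([mean, mean], cov, size=n)

def pooled_r(n_high, n_low):
    # "high" group: higher means and a stronger within-group correlation
    data = np.vstack([group(n_high, 2.0, 0.85), group(n_low, 0.0, 0.60)])
    return round(float(np.corrcoef(data[:, 0], data[:, 1])[0, 1]), 2)

print(pooled_r(33, 61))  # Form-1-like mix: the high group is well represented
print(pooled_r(7, 87))   # Form-2-like mix: the high group is scarce; pooled r is typically lower
```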
Although not relevant to the problem of explaining form differences, Table 6 also
highlights an area for future research. Mean operational human scores and SpeechRater
scores are comparable in the European and Other language groups, but in the Asian lan-
guage group humans appear to award lower scores than SpeechRater. Note that in these
relatively small samples this difference is not statistically significant, but it is large
enough to warrant follow-up in a larger study. The same pattern of relatively lower scores
from humans than from machines for examinees from Asia has also been noted for essays
(Bridgeman, Trapani, & Attali, in press), although the reasons for this discrepancy could
be quite different for essays than for spoken responses.
Correlations are a useful summary statistic for showing the overall strength of a (lin-
ear) relationship, but they do not provide a direct index of classification errors at the top
and bottom of the scale. For example, the correlation by itself would not show if
SpeechRater agreed well with our raters at the top of the scale, but not at the bottom, or
vice versa. To display such potential relationships, we divided the sample into thirds based
on the scores assigned by the undergraduate raters, into thirds based on the operational
human scores, and into thirds based on SpeechRater scores. We omitted the middle groups
to simplify the presentation. A notable discrepancy would be one in which our undergraduate
raters put an examinee in the top third, but the operational human score or the SpeechRater
score put the person in the bottom third. As shown in Table 8, there were no notable
discrepancies when comparing comprehension scores from the undergraduate raters with
the operational score in Form 1; in our sample of 94 examinees, not a single person was
scored high (top third) by the undergraduate raters and low (bottom third) by the
operational human raters. Agreement was also fairly high for SpeechRater, but there were
eight misclassifications; five were too high and three were too low relative to the
comprehension score from the undergraduate raters. Consistent with the lower correlations
in Form 2 (and lower proportion of more consistently rated Europeans), the
misclassification rate is somewhat higher in Form 2, but the pattern of the SpeechRater
scores producing more errors than operational human scores, relative to the scores of the
undergraduate raters, is maintained.

Table 8.  Cross-tabulation of examinees with high and low comprehension scores by high and low operational (human) and SpeechRater scores

                    Form 1 comprehension       Form 2 comprehension
                    High        Low            High        Low
Operational–High    20          0              19          2
Operational–Low     0           25             6           20
SpeechRater–High    18          5              11          8
SpeechRater–Low     3           19             9           13

Table 9.  Cross-tabulation of examinees with high and low rating total scores by high and low operational (human) and SpeechRater scores

                    Form 1 rating scale        Form 2 rating scale
                    High        Low            High        Low
Operational–High    19          0              18          6
Operational–Low     0           27             4           19
SpeechRater–High    18          6              10          12
SpeechRater–Low     3           18             7           12
Table 9 tells essentially the same story for the Rating Total score – very high agree-
ment with the operational human score, and reasonably good agreement with
SpeechRater, but with nine misclassifications in Form 1. On Form 2, examinees with
high human scores were much more likely to receive high ratings than low ratings, but
examinees with high SpeechRater scores were almost equally likely to be rated high as
rated low.
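The cross-tabulations in Tables 8 and 9 can be produced from raw scores with a tercile split of the kind sketched below; the scores and column names are simulated for illustration, since the study data are not reproduced here.

```python
# Sketch of the tercile cross-tabulation used for Tables 8 and 9: each score is
# split into thirds, the middle third is dropped, and top/bottom agreement is
# tabulated. Scores and column names are simulated for illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 94  # examinees per form in the final sample
df = pd.DataFrame({"comprehension": rng.normal(size=n)})
# a noisy machine score moderately correlated with the criterion (illustrative)
df["speechrater"] = 0.5 * df["comprehension"] + rng.normal(scale=0.8, size=n)

def thirds(s):
    return pd.qcut(s, 3, labels=["Low", "Mid", "High"]).astype(str)

df["comp_third"] = thirds(df["comprehension"])
df["sr_third"] = thirds(df["speechrater"])
keep = (df["comp_third"] != "Mid") & (df["sr_third"] != "Mid")
print(pd.crosstab(df.loc[keep, "sr_third"], df.loc[keep, "comp_third"]))
```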

Conclusions
This study provides strong evidence supporting the validity of TOEFL iBT scores as a
measure of oral communicative competence. Listeners in the target context, undergradu-
ate students without any training in evaluating speech of international students, rated
their comprehension of speech samples in a manner that was highly consistent with
the way examinees were ordered by experienced TOEFL iBT raters. More important,
the ability of the undergraduates to answer questions that required comprehension of
the spoken responses was closely linked to the scores provided by professional raters.
SpeechRater scores were also significantly related to the comprehension scores
and ratings of the undergraduate raters, but to a notably lesser degree than the scores
from the professional raters. SpeechRater is apparently reliably assessing some por-
tion of the speaking construct, but is not assessing the full construct in a way that is
consistent with human evaluations of speaking competence, whether those human
judgments are from experienced professionals or naive undergraduates. This outcome
is entirely consistent with the current state of the art in the evaluation of free-form
speech. SpeechRater can evaluate features related to fluency and pronunciation, and
these features are related to the overall comprehensibility of spoken responses, but
they are not the entire story. Although not directly assessed in this study, a reasonable
speculation is that smoothly articulated nonsense will get a high score from
SpeechRater, but will not help students answer comprehension questions. These
results tend to support a low-stakes use of SpeechRater as a rough gauge in a practice
or learning environment, but more research and development is needed before
SpeechRater could be used in a high-stakes test.

Funding acknowledgement
This work was supported by Educational Testing Service.

References
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford Univer-
sity Press.
Barnwell, D. (1989). ‘Naive’ native speakers and judgments of oral proficiency in Spanish. Lan-
guage Testing, 6, 152–163.
Bennett, R. E., & Bejar, I. I. (1998). Validity and automated scoring: It’s not only the scoring.
Educational Measurement: Issues and Practices, 17(4), 9–17.
Bernstein, J. (1999). PhonePass testing: Structure and construct. Menlo Park, CA: Ordinate Cor-
poration.
Bernstein, J., Van Moere, A., & Cheng, J. (2010). Validating automated speaking tests. Language
Testing, 27, 355–377.
Bridgeman, B., Trapani, C., & Attali, Y. (in press). Comparison of human and machine scoring
of essays: Differences by gender, ethnicity, and country. Applied Measurement in Education.
Butler, F. A., Eignor, D., Jones, S., McNamara, T., & Suomi, B. K. (2000). TOEFL® 2000 speak-
ing framework: A working paper (ETS Research Memorandum RM-00-06). Princeton, NJ:
Educational Testing Service.
Canale, M., & Swain, M. (1980). Theoretical basis of communicative approaches to second lan-
guage teaching and testing. Applied Linguistics, 1, 1–47.
Chalhoub-Deville, M. (1995). Deriving oral assessment scales across different test and rater groups.
Language Testing, 12, 16–33.
Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (Eds.). (2008). Building a validity argument for
the Test of English as a Foreign Language. New York: Routledge.
Chapelle, C., Grabe, W., & Berns, M. (1997). Communicative language proficiency: Definition
and implications for TOEFL-2000. TOEFL Monograph Series MS-10. Princeton, NJ: Educa-
tional Testing Service.
Cucchiarini, C., Strik, H., & Boves, L. (2002). Quantitative assessment of second language learn-
ers’ fluency: Comparison between read and spontaneous speech. Journal of the Acoustical
Society of America, 111, 2862–2873.
Dickson, W. P., & Patterson, J. H. (1981). Evaluating referential communication games for teach-
ing speaking and listening skills. Communication Education, 30, 11–21.
Douglas, D., & Smith, J. (with Schedl, M., Netten, G., & Miller, M.) (1997). Theoretical underpin-
nings of the Test of Spoken English revision project. ETS Research Memorandum RM-97-2.
Princeton, NJ: Educational Testing Service.
Duran, R. P., Canale, M., Penfield, J., Stansfield, C. W., & Liskin-Gasparro, J. E. (1985). TOEFL
from a communicative viewpoint on language proficiency: A working paper. TOEFL Research
Report No. 17, ETS RR-85-8. Princeton, NJ: Educational Testing Service.
Elder, C. (1993). How do subject specialists construe classroom language proficiency? Language
Testing, 10, 235–254.
Hadden, B. L. (1991). Teacher and nonteacher perceptions of second language communication.
Language Learning, 41, 1–20.
Henning, G., & Cascallar, E. (1992). A preliminary study of the nature of communicative compe-
tence. TOEFL Research Report No. 36, ETS RR-92-17. Princeton, NJ: Educational Testing
Service.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York:
Macmillan.
Powers, D. E., Schedl, M. A., Wilson-Leung, S., & Butler, F. (1999). Validating the revised
Test of Spoken English against a criterion of communicative success. Language Testing,
16, 399–425.
Shermis, M., & Burstein, J. (2003). Automated essay scoring: A cross disciplinary perspective.
Mahwah, NJ: Lawrence Erlbaum.
Stansfield, C. W. (1986). Toward communicative competence testing: Proceedings of the second
TOEFL invitational conference. TOEFL Research Report No. 21. Princeton, NJ: Educational
Testing Service.
Williamson, D. M., Mislevy, R. J., & Bejar, I. I. (Eds.) (2006). Automated scoring of complex tasks
in computer-based testing. Hillsdale, NJ: Lawrence Erlbaum.
Xi, X., Higgins, D., Zechner, K., & Williamson, D. (2008). Automated scoring of spontaneous
speech using SpeechRater v. 1.0. TOEFL Research Report No. 62. Princeton, NJ: Educational
Testing Service.
Zechner, K., Bejar, I. I., & Hemat, R. (2007). Toward an understanding of the role of speech rec-
ognition in nonnative speech assessment. TOEFL iBT Research Report No. 02. Princeton, NJ:
Educational Testing Service.
Zechner, K., Higgins, D., & Xi, X. (2007). Paper presented at the 2007 Workshop of the
International Speech Communication Association (ISCA) Special Interest Group on Speech
and Language Technology in Education (SLaTE). International Speech Communication
Association.

Appendix: Content Questions for Form 1


1. Which of the following best expresses the topic of the speaker’s response?
 • An interesting personal experience
 • An important book
 • His or her field of study
 • His or her favorite film
 • Topic was unclear

2. Which of the following best describes the topic the speaker is expressing an
opinion about?
 • Early television programming
 • The effects of television viewing
 • His or her favorite television program
 • Children’s television programming
 • Topic was unclear

3a. Which of the following best expresses the speaker’s purpose or intent?
 • To evaluate an argument
 • To describe a problem
 • To summarize an opinion
 • To express his or her point of view
 • Purpose was unclear

3b. Which of the following best expresses the topic of the speaker’s response?
 • An increase in university fees
 • A new fine arts building
 • The location and cost of a work of art
 • The need for a new soccer field
 • Topic was unclear

3c. Based on what you heard in the response, which of the following did the speaker
state or imply?
A. The university spent too much money on the sculpture
B. The sculpture should not be exhibited on campus
C. Paul makes incorrect assumptions about who is paying for the sculpture
D. The sculpture will be located in an area where Paul plays soccer
• A, B, and C
• A and C
• C and D
• A, B, C, and D
• None of the above

4a. Which of the following best expresses the speaker’s purpose or intent?
 • To compare two points of view expressed about Groupthink
 • To relate his or her own experience on the subject of Groupthink
 • To explain how an example illustrates the concept of Groupthink
 • To refute claims about Groupthink expressed in the reading and listening material
 • Purpose was unclear

4b. Which of the following best expresses the topic of the speaker’s response?
 • Reasons for working in groups
 • Advantages of working in groups
 • Avoiding problems when working in groups
 • A negative effect of working in groups
 • Topic was unclear

4c. Based on what you heard in the response, which of the following best describes
the speaker’s understanding of the effects of ‘Groupthink’?
 • Computer company sales increased due to Groupthink
 • Going against the decision of the group can be bad for business
 • Changing one’s opinion to conform to the group can result in bad decisions
 • When all group members are listened to, the group can be successful
 • None of the above

5a. Which of the following best expresses the speaker’s purpose or intent?
 • To explain the speaker’s point of view
 • To recount a sequence of events
 • To describe a problem and possible solution(s)
 • To explain the causes of a problem
 • Purpose was unclear

5b. Which of the following best expresses the topic of the speaker’s response?
 • Problems working with children
 • Finding transportation to the zoo
 • Joining a volunteer program
 • Advantages and disadvantages of public transportation
 • Topic was unclear

6a. Which of the following best expresses the speaker’s purpose or intent?
 • To critique the conclusions reached by scientists
 • To compare the results of two research studies
 • To describe a research study and its conclusions
 • To evaluate a research study
 • Purpose was unclear

6b. Which of the following best expresses the topic of the speaker’s response?
 • Intellectual ability of babies
 • How children learn to count
 • Observing babies’ reactions to dolls
 • Babies’ emotional development
 • Topic was unclear

6c. Based on what you heard in the response, which of the following does the speaker
state or imply?
A. The baby was shown two dolls
B. The baby was placed behind a screen
C. The baby showed surprise when it saw only one doll
D. The baby understood 1 plus 1 equals 2
• A and D
• A, B, and C
• A, C, and D
• B, C, and D
• A, B, C, and D
