Nonverbal Delivery in Speaking Assessment
From an Argument to a Rating Scale Formulation and Validation
Mingwei Pan
Faculty of English Language and Culture
Guangdong University of Foreign Studies
Guangzhou, Guangdong
China
Springer Science+Business Media Singapore Pte Ltd. is part of Springer Science+Business Media
(www.springer.com)
Preface
thus, Tom is reduced to stagnation, where only the mere occurrence of nonverbal delivery can be captured. In stark contrast, Diana, a representative of the advanced proficiency level who is assigned a full mark in nonverbal delivery, is found to be articulate in eclectically drawing on a repertoire of nonverbal channels to accompany her verbiage. At certain points, her nonverbal performance can also instantiate intended meanings in the absence of any synchronised verbal language. Judged from the perspective of metafunctions, she is capable of realising a variety of meaning potentials via nonverbal delivery. Although she seems somewhat aggressive in group discussion, her frequent shifts among nonverbal channels instantiating discrepant metafunctions would impress the other discussants as those of an active, negotiating speaker as well as an attentive listener.
Although Linda, whose subscore in nonverbal delivery is 3, performed quite satisfactorily in terms of formal nonverbal channels, she is found to be slightly passive and hesitant in the group discussion. In particular, when the interpersonal meaning of her gestures is looked into, she seems self-contained, creating a certain distancing effect on her peer discussants. The above profile of the three candidates' performance in nonverbal delivery can also be aligned with the descriptors of nonverbal delivery on the rating scale, thus lending weightier support to the validation of the proposed rating scale.
This research project is significant in that it organically integrates multimodal discourse analysis, a research method scarcely explored in language assessment, with rating scale validation, thus extending the literature on applying this method to further research of a similar kind. In addition, the research findings also suggest how nonverbal delivery can be integrated into EFL learning and teaching. In particular, this book illuminates how EFL textbooks could be compiled multimodally to carry a greater load of meaning making and how teaching practitioners could optimise EFL teaching by incorporating nonverbal delivery into daily instruction.
References
Bachman, L.F. 1990. Fundamental considerations in language testing. Oxford: Oxford University
Press.
Bachman, L.F., and A.S. Palmer. 1996. Language testing in practice: designing and developing
useful language tests. Oxford: Oxford University Press.
Hood, S.E. 2007. Gesture and meaning making in face-to-face teaching. Paper presented at the
Semiotic Margins Conference, University of Sydney.
Hood, S.E. 2011. Body language in face-to-face teaching: a focus on textual and interpersonal
meaning. In Semiotic margins: meanings in multimodalities, eds. S. Dreyfus, S. Hood, and
S. Stenglin, pp. 31–52. London: Continuum.
Liu, Q., and M. Pan. 2010a. A tentative study on non-verbal communication ability in Chinese college students' oral English. Computer-assisted Foreign Language Education in China 2: 38–43.
Liu, Q., and M. Pan. 2010b. Constructing a multimodal spoken English corpus of Chinese Science and Engineering major learners. Modern Educational Technology 4: 69–72.
A great many people have helped in the writing of this book. In particular, I feel profoundly indebted to Professor David D. Qian of The Hong Kong Polytechnic University, whose resourcefulness, insightfulness and supportiveness have removed my amateurishness in research and nurtured me towards professionalism in the field of language assessment.
I also wish to thank other scholars who so generously offered their time and
voices—Professor Frederick G. Davidson from the University of Illinois at
Urbana-Champaign, Professor Alister Cumming from the University of Toronto and
Professor Zou Shen from Shanghai International Studies University—and all the
others whose voices are also recorded here. Their scholarship continues to be a
great source of stimulation.
I would also like to thank the Springer editorial team who have been such a
pleasure to work with, in particular Ms. Rebecca Zhu and Ms. Yining Zhao.
Contents
1 Introduction
  1.1 Research Background
  1.2 Research Objectives
  1.3 General Research Questions
  1.4 Research Significance
  1.5 Book Layout
  1.6 Summary
  References
2 Literature Review
  2.1 Nonverbal Delivery
    2.1.1 Eye Contact
    2.1.2 Gesture
    2.1.3 Head Movement
  2.2 Communicative Competence
    2.2.1 Hymes' Notion of Communicative Competence
    2.2.2 Communicative Competence Model
    2.2.3 Communicative Language Ability Model
    2.2.4 Communicative Language Competence Model
    2.2.5 An Integrated Review on Communicative Competence
  2.3 Rating Scale and Formative Assessment
    2.3.1 Rating Scale
    2.3.2 Taxonomies of Rating Scales
    2.3.3 A Critique on the Existing Rating Scales
    2.3.4 Formative Assessment
    2.3.5 Properties of the Present Rating Scale
  2.4 Validity and Validation
    2.4.1 Validity: A Componential Notion
    2.4.2 Validity: A Unitary Notion
    2.4.3 Argument-Based Validation and AUA
English has long attracted considerable attention in language teaching and learning. Almost without exception, large-scale international English test batteries incorporate an oral testing component aimed at measuring candidates' communicative competence. In addition, irrespective of the testing form, be it an oral proficiency interview, a dialogue or a discussion, rating scales are generally regarded as the yardsticks against which candidates' communicative competence is observed and measured.
In the context of tertiary education in China, English language assessment has long been prioritised by education authorities, university administrators, teaching practitioners, students and parents alike. Be it the College English Test (CET) or the Test for English Majors (TEM), a separate speaking assessment is routinely administered to candidates whose scores in the written tests reach the required threshold. Apart from these domestic tests, Chinese EFL learners also have access to language proficiency tests administered worldwide, such as the Test of English as a Foreign Language (TOEFL), the International English Language Testing System (IELTS) and the Business English Certificate (BEC), all with oral assessments included. Most of these tests are so highly valued that Chinese college students spare no effort in obtaining high scores, whether to meet degree-conferring requirements, to equip themselves with a competitive edge in the job market or to gauge their own abilities, among other reasons.
However, although Chinese EFL learners seem to be caught up in a fervent craze for English proficiency tests, a substantial gap still exists between their general spoken English proficiency and what is stipulated in the curriculum requirements for college English learning in China. Cai's (2002a) study reveals that a total of 32,107 candidates took the College English Test Spoken English Test (CET-SET) from January 1999 to May 2001, and among them 18,550 test-takers, approximately 57.8 %, were assigned Grade B, equivalent to the intermediate proficiency level, signalling that more than half were only capable of developing certain familiar topics in English. In Cai's (2002b) follow-up study, it is further argued that, given the huge CET test population, those who would be judged as qualified English communicators can be expected to account for only a tiny proportion of all Chinese EFL learners nationwide.
Although the above figures provide insufficient evidence for a detailed profile, an abundance of studies might aggravate the concern over the status quo of Chinese college EFL learners' spoken English proficiency. In a preliminary study of the TEM4 Oral Test (TEM4-OT), Wen et al. (1999) find that, in terms of spoken English proficiency, English majors in China are, except for speech rate, generally below the benchmarks stipulated in the curriculum for English majors. Wen et al. (2001), in an investigation on a larger scale, reconfirm that Chinese EFL learners' spoken English is characterised by inaccurate expression, disfluency, a lack of innovative ideas and a poor mastery of the interaction strategies expected in daily communication.
It has to be admitted that the above studies as a whole might leave the impression that it is commonplace for Chinese college EFL learners' spoken English proficiency to be far from satisfactory. What they are poor at, as exposed above, is
These four research questions will be answered at different phases of this research. RQ1 touches upon the role of nonverbal delivery in EFL learners' spoken production. This question is raised in response to an intended argument for including nonverbal delivery in speaking assessment and can be resolved by verifying that nonverbal delivery differentiates well among learners across a range of proficiency levels. Soundly supported by such an argument, the remaining RQs encapsulate the follow-up rating scale formulation and validation. RQ2 deals with the development of a rating scale that incorporates nonverbal delivery, whereas RQ3 and RQ4, in an integrated manner, are devoted to addressing
the properties of the rating scale, viz. its validity,¹ reliability, practicality and discriminating power.
With the above general RQs substantially, discretely and satisfactorily addressed, the present study is also anticipated to yield much significance and value. Firstly, when the proposed rating scale is used in formative assessment, tertiary EFL learners' merits and demerits in oral English proficiency can be fully captured and measured, particularly with regard to their performance in nonverbal delivery. Teachers will thus be informed of the demerits their students share and of the particularly obtrusive demerits individual learners possess, so that adjustments to their instruction can be made. In a similar vein, learners can ameliorate their spoken English by anchoring their performance against the rating scale descriptors and the assessment results.
Secondly, with special attention to construct validity, this study will provide a complete validation procedure for the rating scale with both quantitative and qualitative approaches. In particular, as will be detailed later, the qualitative approach this study adopts is multimodal discourse analysis (MDA), a method underused in language assessment. It is hoped that the integration of language testing and MDA can, in a much broader sense, provide practical guidance for investigating the interface between these two domains. As far as validation methods are concerned, this study would thus inform the area of rating scale validation and shed light on research of a similar kind.
Lastly, the study will demonstrate a theoretically sound and practically feasible rating scale for EFL learners at the tertiary level in China. With appropriate alterations, it is expected to be further applied to assessing learners at other levels in the Chinese EFL context, such as secondary school students. What is even more promising is that the proposed rating scale can also be referred to in oral English assessment for specific or professional purposes, on condition that the relevant part of the assessment construct remains basically unchanged. In that sense, its utility will be considerably widened.
¹ This study conceptualises construct validity as a unitary and overarching notion to which all the components of validity contribute. See Sect. 2.4 for details and justifications.
Having highlighted the expected significance of this study, this section outlines the layout of this book, which is sequentially arranged into nine chapters.
This chapter serves as an introduction to the whole research project, overviewing the research background, the research aims and objectives, the general research questions as well as the anticipated value of the present study.
In Chap. 2, a crucial part for the literature review, five sections are earmarked in response to the three key issues involved in this research. The first section is concerned with the most essential notion of this study, viz. nonverbal delivery, outlining previous studies on this notion and how they might inform the present study. The second section, by elaborating on the conceptualisation of communicative competence along with the relevant models, identifies the most suitable rationale on which a rating scale should be based. The third section continues with a review of the taxonomies of rating scales in the context of language assessment and then describes the properties of the rating scale to be proposed. The second and third sections therefore address the key issue of how to develop a rating scale. The last two sections of the chapter review the concepts of validity and validation as well as the validation methods in language testing. In so doing, clarifications can be made as to what notion of test validity this study subscribes to and what validation methods best accommodate the present study. Thus, these two sections provide an answer to, and navigate the process of, validating the rating scale to be proposed.
Chapter 3 depicts a general picture of the research design and clarifies the
research methods utilised in this study. In addition, how the data were collected,
processed and analysed, and how three datasets were allocated to serve different
research purposes in each phase of the project are also detailed in this chapter.
Chapter 4, based on a comparatively smaller dataset of test-takers’ group dis-
cussion, reports on a preliminary study with a special view to empirically verifying
the necessity of a new dimension, nonverbal delivery, to be incorporated in spoken
English assessment. In a way, this chapter spearheads the whole project in that it
builds an argument to justify an indispensable role of nonverbal delivery in
assessing EFL learners’ communicative ability in a comprehensive manner.
Chapter 5 addresses two broad components of the proposed rating scale.
Informed by the results from a questionnaire administered to both teaching prac-
titioners and learners in the Chinese EFL context, the first half of this chapter sheds
light on the descriptors of those “conventional” dimensions on the rating scale, such
as pronunciation and intonation, vocabulary and grammar, and discourse man-
agement. The second half draws upon the research findings of the study reported in
Chapter 4, with which nonverbal delivery, as an “unconventional” dimension, is
brought forth in a gradable manner on the rating scale.
Chapter 6 links the development with the validation of the proposed rating scale.
In this chapter, the rating scale, as an interim product based on the findings reported
1.6 Summary
This chapter panoramically introduces what this book intends to convey. Against the background of the under-acknowledged role of nonverbal delivery, the low spoken English proficiency of Chinese tertiary EFL learners and the prevalence of standardised summative speaking assessments, this study sets out to build an argument for an essential role of nonverbal delivery in speaking assessment, on the basis of which a rating scale that includes nonverbal delivery in formative assessment is formulated and validated. This chapter then outlines the research aims and subsidiary objectives of this study. Having established all the above, this chapter proposes four general research questions, encompassing the role of nonverbal delivery in EFL learners' speaking assessment and the components, reliability, validity, practicality and discriminating power of the rating scale to be proposed. In the end, this chapter sketches out how this book is arranged on a chapter-by-chapter basis.
References
Cappella, J.N., and M.T. Palmer. 1989. The structure and organization of verbal and nonverbal
behaviour: Data for models of reception. Journal of Language and Social Psychology 8:
167–191.
Davitz, J.R. 1969. The repertoire of nonverbal behaviour: Categories, origins, usage, and coding.
Semiotica 69: 49–97.
Harrison, R. 1965. Nonverbal communication: Exploration into time, space, action and object.
Florence, KY: Wadsworth Publishing Co., Inc.
Leathers, D.G. 1979. The impact of multichannel message inconsistency on verbal and nonverbal
decoding behaviours. Communication Monographs 46: 88–100.
Leathers, D.G., and T.H. Emigh. 1980. Decoding facial expressions: A new test with decoding
norms. Quarterly Journal of Speech 66: 418–436.
Liu, Q., and M. Pan. 2010a. A tentative study on non-verbal communication ability in Chinese
college students’ oral English. Computer-assisted Foreign Language Education in China 2:
38–43.
Liu, Q., and M. Pan. 2010b. Constructing a multimodal spoken English corpus of Chinese Science
and Engineering major learners. Modern Educational Technology 4: 69–72.
Pan, M. 2011a. Reconceptualising and reexamining communicative competence: A multimodal
perspective. Unpublished PhD thesis. Shanghai: Shanghai International Studies University.
Pan, M. 2011b. Incorporating nonverbal delivery into spoken English assessment: A preliminary
study. English Language Assessment 6: 29–54.
Rothman, A.D., and S. Nowicki. 2004. A measure of the ability to identify emotion in children’s
tone of voice. Journal of Nonverbal Behavior 28: 67–92.
Sternglanz, R.W., and B.M. DePaulo. 2004. Reading nonverbal cues to emotions: The advantages
and liabilities of relationship closeness. Journal of Nonverbal Behavior 28: 245–266.
Swain, M. 1985. Communicative competence: Some roles of comprehensible input and
comprehensible output in its development. In Input in second language acquisition, ed.
S. Gass, and C. Madden, 235–256. New York: Newbury House.
Wen, Q., C. Wu, and L. So. 1999. Evaluating the oral proficiency of TEM4: The requirements
from the teaching curriculum. Foreign Language Teaching and Research 1: 29–34.
Wen, Q., X. Zhao, and W. Wang. 2001. A guide to TEM4 oral test. Shanghai: Shanghai Foreign
Language Education Press.
Chapter 2
Literature Review
This chapter reviews the literature pertaining to the present study. As the whole
research can be chronologically broken down into three main phases, covering
(1) building an argument for embedding nonverbal delivery into speaking assess-
ment, (2) the formulation and (3) the validation of the rating scale for group
discussion in formative assessment, this chapter is accordingly organised into five
sections, with the first section reviewing nonverbal delivery relating to the first
phase, and the other four sections consecutively addressing the related literature
concerning rating scale development and validation.
Specifically, the first section reviews previous research on nonverbal delivery. Rather than remaining within the arena of language assessment, the review commences with nonverbal delivery in other fields of research, against which the dearth of related studies in the context of language testing becomes apparent. The second section is more concerned with the conceptualisation of communicative competence, addressing the issue of what rationale the rating scale development in the present study should be based on. In particular, a link between nonverbal delivery and strategic competence will be drawn so that a theoretical argument can be tentatively advanced for embedding nonverbal delivery in speaking assessment. The third section, appertaining to the categorisations of rating scales in language assessment and the essentials of formative assessment, paves the way for determining the basic properties of the rating scale to be designed in this research. In response to the issue of rating scale validation, the fourth and fifth sections respectively dwell on the notions of validity and validation, and on the quantitative and qualitative approaches to be adopted for validating the rating scale proposed in this study.
In retrospect, meaning conveyance via nonverbal delivery can be traced back to classical rhetoric: Quintilian (AD 35–100), one of the first in recorded history to do so, drew scholarly attention to the use of gesture. He divides rhetorical delivery into vox (voice) and gestus (the use of gesture). In a quite similar vein,
© Springer Science+Business Media Singapore 2016
M. Pan, Nonverbal Delivery in Speaking Assessment,
DOI 10.1007/978-981-10-0170-3_2
2.1.1 Eye Contact
The central role of eye contact in nonverbal delivery has long been acknowledged. A host of researchers have devoted themselves to studying the language of the eyes and have arrived at a consensus that there may well be a language of the eyes with its own syntax and grammar (Webbink 1986). Janik et al. (1978) find that attention is focused on the eyes for 43.4 % of the communication duration. When eye contact is investigated in a social context, more interest lies in identifying how eye contact makes meanings in social interactions (Kendon 1967; Street 1993). For example, Bourne and Jewitt (2003) study the various purposes of eye contact in young learners' English learning process. Besides, there are also extensive studies on the roles of eye contact in the development of children's language and communication, indicating that eye contact is primal in establishing shared attention between infants and adults (Tomasello 2003).
Leathers and Eaves (2008) list a total of seven functions that eye contact possibly serves. The first function is attentiveness. Argyle and Cook (1976) emphasise that mutual eye contact "has the special meaning that two people are attending to each other, [which] is usually necessary for social interaction to begin or be sustained" (p. 170). The enlargement of the pupils can indicate that the listener's or speaker's attentiveness is accordingly heightened (Hess 1975). The second is the persuasive function, whereby the persuader wishing to be perceived as trustworthy must maintain eye contact both while speaking and while being spoken to by the persuadee (Burgoon and Saine 1978; Burgoon et al. 1986; Grootenboer 2006). The third function is intimacy, which is conducive to establishing interpersonal relations. In interpreting this function, Hornik (1987) and Kleinke (1986) assert that the intensity of eye contact, or the duration of gaze, plays a crucial role in developing intimacy between persons. The fourth is the regulatory function, which refers to alerting the decoder that the encoding process is occurring and continuing, by signalling to the encoder whether listening and decoding are occurring and by indicating when the listener is to speak (Ellsworth and Ludwig 1971; Kalma 1992). Fifth, eye contact can also serve an affective function. Eye contact, along with facial expression, can function as a powerful medium of emotional communication (Zebrowitz 1997), or as Schlenker (1980) concisely phrases it, "the eyes universally symbolise affect" (p. 258). Sixth, eye contact has its power function, which largely concerns the eyes' capacity to exert authority or to mesmerise (Henley 1977; Henley and Harmon 1985). Seventh, the impression management function, as its name suggests, refers to the speaker's efforts to form either positive or negative impressions on the addressees (see Iizuka 1992; Kleinke 1986).
However, it should be noted that the above taxonomy of communicative functions is framed in such a broad social context that it might not be directly applicable to studying the eye contact deployed by EFL learners. For instance, in a language assessment context, where candidates perform an oral task, occurrences of eye contact with the intimacy or power function are unlikely, as there is almost no call for them in this particular setting. In addition, a few of the communicative functions elaborated above might overlap, and a single occurrence of eye contact might serve more than one function, in which case judging what function(s) a captured occurrence serves can be complicated. The isolation of eye contact from its accompanying verbiage is another drawback of the above taxonomy: without the synchronised verbal utterance, it would be a practical challenge to fathom what exactly eye contact attempts to convey.
For observing and measuring an occurrence of eye contact, Poggi (2001) proposes a set of measures that analyse eye contact from the perspective of the bodily organs involved, roughly including the eyebrows (inner, medial and outer parts), the eyelids (upper or lower), wrinkles and the eyes themselves (humidity, reddening, pupil dilation, eye position and eye direction). Fine-grained as these measures are, observing various occurrences of eye contact within such a specified frame may be technologically demanding and may be jeopardised by its complexity and by judgement subjectivity. In practice, when eye contact is measured in this study, the descriptive analysis in the first phase, where an empirical argument is tentatively built for embedding nonverbal delivery into speaking assessment, will refrain from resorting to the detailed taxonomy of bodily organs. Instead, analyses will be largely based on candidates' eye contact as it is de facto presented, mainly from the angles of directionality and duration, because both measures can tentatively allow an observation of the frequency and intensity with which candidates visualise various referents (Cerrato 2005). When the occurrences of eye contact are described and analysed, the taxonomy by Leathers and Eaves (2008) outlined above will be referred to.
Nonetheless, when the rating scale is validated qualitatively, eye contact will, in view of greater explanatory power and applicability, be probed with an MDA approach, basically within an integrated framework drawn from the studies by Martinec (2000b, 2001, 2004) and Hood (2007, 2011). In such a context, not only will the frequency and duration of eye contact be probed as salient measures, but other vehicles carried via eye contact, such as eye contact shift, will also be focused on. The framework operationalised from Martinec's (2000b, 2001, 2004) and Hood's (2007, 2011) studies will be further expounded in detail, along with an elaboration on the MDA approach, in Sect. 2.5 of this chapter.
2.1.2 Gesture
Unlike eye contact, whose manifestations mainly concern such issues as the duration, directionality and intensity of pupil fixation, gesture can be instantiated via a plethora of different manifestations. Thus, the question of what constitutes a unit of gesture is contested, with compelling reasons offered for various perspectives. Within the field of nonverbal communication, gesture can be broadly defined as "any distinct bodily action that is regarded by participants as being directly involved in the process of deliberate utterance" (Kendon 1985, p. 215). Kendon (1996) further proposes that a gesture consists of "phases of bodily action that have those characteristics that permit them to be 'recognised' as components of willing communicative action" (p. 8). However, this begs the question of recognition by whom. In addition, there can be concerns about the subjectivity involved in unambiguously identifying what counts as willing communicative gesture. Kendon (2004) explains that a prototypical gesture passes through three phases, namely the preparation, the stroke and the retraction, with the stroke phase being the only obligatory element. McNeill (1992) describes the stroke phase as "the phase carried out with the quality of 'effort' a gesture in kinetic term" (p. 375). He continues to argue that "[s]emantically, it is the content-bearing part of the gesture" (p. 376). With the above in mind, when gesture is observed in this study, more focus will be placed on the meaning potential it makes, though the judgement will basically follow Kendon's (2004) proposed prototypical gesture, with the stroke phase as the core.
Beyond the formal instantiation of gestures, quite a few studies decipher what various gestures supposedly convey in particular settings, viz. their emblematic or iconic meanings. However, these studies rarely go beyond an inventory of the respective verbal glosses in various social contexts (e.g. Barakat 1973; Creider 1977; Efron 1941; Green 1968; Saitz and Cervenka 1972; Sparhawk 1978; Wylie 1977), though efforts have also been made regarding gestures' role in generating thinking (Alibali et al. 1997), in enhancing teaching and learning as part of complex ensembles (Kress et al. 2001) and in coordinating workplace discourse (Heath and Luff 2007). However, emblematic meaning alone does not constitute all the possible conveyances or functions of gestures.
Ekman and Friesen's (1969) taxonomy of gesture functions encapsulates emblems, illustrators, affect displays, regulators and adaptors. Emblems are gestures with a direct verbal translation, consisting of a word or two with a precise meaning known by most members of a given culture; emblematic gestures are thus mostly speech independent. For instance, the thumbs-up sign, made by a fist with the thumb pointing upward, is a classic example of an emblem.
Illustrators are used to augment what is being said and to reinforce or de-intensify the perceived strength of the emotions experienced by the communicator. Examples of illustrators are therefore signals for turn-taking in conversations (pointing at the next turn-holder with an upward palm) or batons (the slamming of a hand). Given that illustrators are interpreted in close association with the accompanying verbiage, they can be regarded as speech dependent.
14 2 Literature Review
In contrast to the fervour that concentrates solely on eye contact and gesture, few studies, if any, have been exclusively devoted to a third essential and conspicuous channel of nonverbal delivery: head movement. This channel is somewhat akin to eye contact in that the directionality of head movement, in most cases, naturally corresponds to that of eye contact. It differs from gesture, however, in that head movements, with their comparatively limited variety, are overwhelmingly instantiated as head nods or head shakes, though other vertical or horizontal movements of the head, such as a one-way leftward movement from a central position, can also constitute a basic occurrence of the head movement under discussion.
A limited number of studies reveal cultural influence on head movement (e.g. Maynard 1987, 1990; Weiner et al. 1972). For instance, a head shake is usually interpreted as negation or disagreement in Chinese culture, whereas in certain other cultures the same movement can be understood as agreement (Matsumoto 2006). Take head nodding as another example: Jungheim (2001) deems it a backchannelling signal “giving feedback to indicate the success or failure of communication” (p. 4), especially when interactants in Japanese culture intend to (1) show agreement with what is said, (2) pay respect to other speakers, or (3) indicate that they are attentively listening to the speaker (see Maynard 1987, 1989, 1990; White 1989).
Considering the dearth of any existing framework concerning the communicative functions of head movement that this study can comfortably rest upon, Ekman and Friesen’s (1969) aforementioned framework, in its general application, is tentatively referred to in building an argument for incorporating head movement as one of the dimensions of nonverbal delivery in speaking assessment. Since the main purpose of that research phase is merely to discriminate candidates across the predetermined proficiency levels, only the semantically loaded head nod (generally interpreted as agreement) and head shake (generally interpreted as disagreement) will be investigated in terms of formal head movement. When head movement as one subdimension of the rating scale descriptors for nonverbal delivery is validated, more fine-grained head movements beyond nods and shakes, such as vertical or horizontal movements of high frequency within an interval unit, are also taken into account, following an integrated framework drawn from Martinec’s (2000b, 2001, 2004) and Hood’s (2007, 2011) research to be unfolded below.
The above provides a review of nonverbal delivery, with a particular view to the three most representative channels and to the approaches this study will adopt in observing and analysing formal nonverbal delivery at different phases of the study. This review suggests that nonverbal delivery, given its proven significance and salience in communication, should be embedded into speaking assessment, where meaning making is realised not through verbal language alone. The ensuing section will then review the notion of communicative competence and specifically indicate the role that nonverbal delivery legitimately plays in assessing EFL learners’ communicative ability.
construed, Bachman (1990) and Bachman and Palmer (1996) eclectically put forward the model of communicative language ability (CLA), which has been credited as a widely recognised framework offering new insights into language ability. The most recent framework with regard to communicative competence is the conceptualisation of communicative language competence (Council of Europe 2001), a by-product of the Common European Framework of Reference (CEFR).
The above brief introduction to the notional evolution of communicative competence makes it necessary for this review to outline, critique and compare the above models so as to identify the fittest one for explaining what domains should be measured in a speaking rating scale and why nonverbal delivery plays a crucial role in the assessment of communicative competence.
Hymes (1972) asserts that one’s capacity is composed of language knowledge and the ability to use language, and that communicative competence consists of four parameters that include “communicative form and function in integral relation to each other” (Leung 2005b, p. 119). Concerning communication beyond Chomsky’s (1965) demarcation between competence and performance, he proposes a framework comprising the following four questions to explain what communicative competence should include.
The first question deals with what is possible considering the language form: what is possible refers to something acceptable within a formal system, be it grammatical, cultural or communicative (Hymes 1972). However, communicative competence is not fully interpreted when what is possible stands alone; the second question therefore touches upon feasibility, such as memory limitations and perceptual devices, or, rephrased, concerns what is biologically and psychologically feasible. To illustrate this parameter, Royce (2007) offers an example in which a sentence may itself be grammatically well formed, yet so lengthy that it fails to convey what is intended. The third question is more concerned with the appropriateness of language use in particular settings, reflecting the sociological and pragmatic aspects of language use. The last parameter bears upon a communicator’s knowledge of probabilities, that is, whether what is conveyed is in fact commonly done, which in turn determines whether successful communication can be fulfilled.
Reaffirmed by Hymes’ other works (1973, 1974, 1982), his proposition can be interpreted as follows: communicative competence includes not only grammatical knowledge but also the language user’s ability to judge whether what is said is practical, appropriate and probable. That means a language user with the expected communicative competence should be aware of the above parameters, and the most salient connotation of performance is “that of imperfect manifestation of underlying system” (Hymes 1972, p. 289).
As discussed above, the notion put forward by Hymes (1972) has exerted a great impact on language teaching, yet the four parameters are challenged mainly on the grounds of their operationalisation. Although plenty of studies taking communicative competence as a point of departure manage to apply the notion to language teaching, in such domains as syllabus design (Munby 1978) and language classroom teaching (Savignon 1983; Widdowson 1978), such application largely operates on a micro level. Against this backdrop, Canale and Swain (1980) contrive a model with more pertinent foci on the overall reflection of communicative competence, comprising grammatical competence, sociolinguistic competence and strategic competence. Later, Canale (1983) adds discourse competence to expand the model.
Fig. 2.1 Communicative Competence Model (Canale and Swain 1980; Canale 1983)
2.2 Communicative Competence 21
In the early 1990s, Lyle F. Bachman, an American applied linguist, drawing on a critique of the weaknesses of Lado’s (1961) and Carroll’s (1961, 1968) interpretations of language ability, develops the prevailing models posited by Halliday (1976), van Dijk (1977), Hymes (1972, 1973, 1982), Savignon (1983), Canale and Swain (1980) and Canale (1983) into a new conceptualisation.
Bachman (1990) gestates the construct of CLA on the basis of three core components, viz. language competence, strategic competence and psychophysiological mechanisms. Figure 2.2 illustrates the componential breakdown and the internal correlation of the model. As is shown, knowledge structures refer to language users’ social and cultural knowledge and the general knowledge about the material world,
Fig. 2.2 CLA components in communicative language use (Bachman 1990, p. 85)
whereas the context of situation includes the reciprocal sides of the communication,
situation, topic and purpose (Bachman 1990). In addition to the knowledge in both
regards, the three core parts constituting the CLA model are language competence,
strategic competence and psychophysiological mechanisms, all coordinating with
the knowledge structures and situation context to depict an overall picture of
communicative competence.
Language Competence
Fig. 2.3 Subcomponents of language competence in the CLA model (Bachman 1990, p. 87)
clauses in accordance with the rules stipulating cohesion and rhetorical organisation. Some cohesive devices are salient, such as lexical connection, reference, substitution and omission (Halliday and Hasan 1976); there are also devices with implied functions, regulating the occurring sequence of new and given information
in a text. Rhetorical organisation in the CLA model mainly touches upon methods such as narration, description and classification (McCrimmon 1984).
2. Pragmatic competence
Pragmatic competence is more concerned with how discourse, clauses and intentions realise their meanings and functions in a particular context, or, as Bachman (1990) pinpoints, this competence deals with “the relationships between (the) signs and their referents on the one hand, and the language users and the context of communication, on the other” (p. 89). Pragmatic competence can be split into two subcomponents: illocutionary competence and sociolinguistic competence.
Illocutionary competence encompasses “the knowledge of the pragmatic conventions for performing acceptable language functions” (Bachman 1990, p. 90). This concept bears much relevance to Speech Act Theory (Searle 1969), which includes such functions as assertion, warning and imagination. As is shown in
Fig. 2.3, illocutionary competence is further classified into four groups: ideational,
manipulative, heuristic and imaginative. The ideational function is used to “express meaning in terms of our experience of the real world” (Halliday 1973, p. 20), including the use of language either to express propositions or to exchange information about such knowledge. The manipulative function is mainly applied to affect the world around us. The abilities falling into this group include the instrumental function, used to get things done, such as making suggestions, requests, commands and warnings; the regulatory function, used to control others’ behaviour by regulating the persons or objects in the environment; and the interactional function, which serves to form, maintain or change interpersonal relationships. The heuristic function is applied to share with others our knowledge of the world, frequently occurring in such acts as teaching, learning, problem solving and conscious memorising. The imaginative function enables language users to create or extend humour or aesthetic values by constructing and communicating fantasies, creating metaphors, attending plays and so forth (Bachman 1990).
Sociolinguistic competence, as the other part of pragmatic competence, is defined as “the sensitivity to, or control of the conventions of language use that are determined by the features of the use context” (Bachman 1990, p. 94). The sensitivity referred to concerns the extent to which communicators are able to recognise dialects, language varieties, differences in register (Halliday et al. 1964), cultural references and figures of speech, as well as the degree to which speakers can appropriately and naturally generate the utterances expected in the target language in a specific language-use context (Pawley and Syder 1983).
Strategic Competence
2. Planning component
The planning strategy enables communicators to formulate a plan for realising a communicative purpose with certain language knowledge selected. If the speakers are interacting in their mother tongue, the knowledge needed derives from their first-language ability. If, however, the communication takes place in a bilingual or a second/foreign language setting, the language knowledge needed may instead be abilities either transferred from the first language or gradually fostered in the interlanguage. The main functions of the planning strategy are to select the relevant language knowledge, schemata and mind mapping.
3. Execution component
The strategy of execution is a critical stage before communication is realised under the co-functioning of psychophysiological mechanisms (see Section “Psychophysiological Mechanisms”). For instance, in the receptive channel of language input, visual and auditory faculties are applied.

[Fig. 2.4: from the goal (interpret or express speech with specific function, modality and content), via situational assessment and a planning process retrieving items from language competence (L1, Li, L2), to a plan whose execution, a neurological and physiological process involving psychophysiological mechanisms, leads to the utterance]

Bachman (1990)
holds that the three components of strategic competence, in effect, co-exist in the
whole process of communication, interacting with language ability and
language-use context. Having integrated the flow chart originally taken from Færch
and Kasper’s model (1983), Bachman (1990) visualises how the above components
and the other parts of the CLA model co-function, as illustrated in Fig. 2.4.
As can be seen, along the central line from goal to utterance, both language competence and psychophysiological mechanisms exert their respective influences on the planning process and the execution stage. The whole process also involves situational assessment, which impacts the planning process and the utterance because communicators need to make situation-specific judgements on what communication channels should be adopted to optimise meaning conveyance. Bachman (1991) further contends that language knowledge can only be realised with the involvement of strategic competence. Therefore, the strategies concerning assessment, planning and execution are intrinsically interdependent.
Psychophysiological Mechanisms
The above review of the CLA model shows how language knowledge interacts with the context of language use, integrating language knowledge and a series of cognitive strategies. Such a notional presentation is characterised by greater explanatory power, and the CLA model is epitomised as a leap forward compared with Canale and Swain’s communicative competence model. The CLA model embeds strategic competence and regards it as serving more than a compensatory function, which, to a certain extent, echoes the modified model by Canale (1983). More importantly, the CLA model recognises the roles of cognitive strategies and pragmatic competence, together with their impact on the realisation of communicative competence. On the whole, the CLA model is theoretically sound and empirically verified and has been credited as a state-of-the-art representation (Alderson and Banerjee 2002).
Despite its prevalence, the CLA model is not without caveats. McNamara (1990) believes that when performance tests are taken into account, the model seems less operationalisable because raters are very likely to assign unbalanced weightings to a particular component of language knowledge. Upshur and Turner (1999), on the same side, believe that a cure-all, construct-only approach to evaluating complex performance may obscure the influences that task context and discourse have on how raters interpret rating scales in the assessment of communicative competence, because such a disproportion may beget a biased focus on one component only. In a similar vein, Purpura (2004), addressing the subcomponent of grammatical competence, contends that since “meaning” plays a central role in the CLA model, the model per se would be more consolidated if it theoretically defined “meaning” and specified how grammatical resources can be employed to express denotative and connotative meanings on the one hand and a variety of pragmatic meanings on the other. Chapelle (1998), from an interactionist perspective towards construct definition, critiques that the CLA model is defined and operationalised more on a trait basis, and further states that, “[t]rait components can no
With almost the same name as, yet a discrepant academic background from, the CLA model, the communicative language competence (CLC) Model (Council of Europe 2001; North 2010a, b) is a by-product of the CEFR (Council of Europe 2001). It is based on the initial considerations of providing a common basis for language syllabi, curriculum guidelines, examinations, textbooks and so on, and of relating a European credit scheme to fixed points in a framework (van Ek 1975). The framework is inspired by documents such as Threshold, Vantage, Waystage, Breakthrough, Effective Operational Proficiency and Mastery (Alderson 2010). It is then developed with detailed descriptors for each level of expected behavioural descriptions of language ability in various domains (Little 2006). In terms of its theoretical groundings, therefore, it stems more from a political and educational demand than from an academic motive, though the above documents effectively guide the model formulation and the conceptualisation of communicative competence in its own right.
As stipulated by the Council of Europe (2001), the CLC Model consists of three domains: “linguistic competences, sociolinguistic competences and pragmatic competences” (p. 108), as outlined in Fig. 2.5.
As illustrated, linguistic competences are concerned with the “knowledge of and ability to use language resources to form well structured messages” (Council of Europe 2001, p. 109), and can be subcategorised into lexical competence, grammatical competence, semantic competence, phonological competence and orthoepic competence. Judging from the interpretation of these subcomponents, linguistic competences bear much relation to grammatical competence in the CLA model, reflecting a mastery of language knowledge in a traditional and narrow sense.
Sociolinguistic competences refer to the “possession of knowledge and skills for appropriate language use in a social context” (Council of Europe 2001, p. 118). They include linguistic markers of social relations, politeness conventions, expressions of folk wisdom, register differences as well as dialect and accent. Whereas sociolinguistic competence in the CLA model is labelled within pragmatic competence, this subcomponent is somewhat elevated to one of the core components in the CLC Model, in which the social realisation of language use is emphasised.
How pragmatic competences are defined is largely based on the description of what their subcomponents are made up of. Pragmatic competences embed discourse competences (abilities to organise, construct and arrange knowledge), functional competences (abilities to generate communication-inductive meaning) and design competences (abilities to sequence messages in accordance with schemes and
interactiveness) (Council of Europe 2001).

Fig. 2.5 Components of the CLC Model (Council of Europe 2001, pp. 108–129)

Given this, an understanding can be reached that pragmatic competences in the CLC Model seem to indicate a broader
sense of pragmatics, with only partial anchoring in pragmatic competence in the CLA model.
As one of the by-products of the CEFR, the CLC Model has provided a Europe-specific reference for language teaching, learning as well as assessment. The Council of Europe (2001) claims the CEFR to be comprehensive in that “it should attempt to specify as full a range of language knowledge, skills and use as possible…and all users should be able to describe their objectives, etc. by reference to it” (p. 7). In that sense, the CLC Model is an important point of reference, but neither an instrument of coercion nor one of accountability (Alderson 2010).
Nevertheless, how communicative competence is defined in the CLC Model leaves the model itself with several flaws. First, the construct of language ability in this model and the descriptors of the different levels are basically drawn from teachers’ and learners’ perceptions, with little empirical research or theoretical basis. In addition, the descriptors take “insufficient account of how variations in terms of contextual parameters may affect performances by raising or lowering the actual difficulty level of carrying out the target ‘can-do’ statement” (Weir 2005, p. 281). Although the CLC Model refers to such documents as Waystage, Threshold and Vantage, as previously mentioned, they are barely different from each other (Alderson 2010). While the CEFR claims to cover both proficiency and development in its six ascending levels of proficiency, it fails to do so consistently (e.g. Alderson et al. 2006; Hulstijn 2011; Norris 2005). A number of researchers (e.g. Cumming 2009; Fulcher 2004; Hulstijn 2007; Spolsky 2008) express concerns regarding the foundation of the CEFR system. Spolsky (2008), for instance, criticises the CEFR as an “arbitrary” standard intended to produce uniformity, whereas Cumming (2009) points out the dilemma of the imprecision of standards such as the CEFR “in view of the complexity of languages and human behaviour” (p. 92).
Second, a comparison between the CLC Model and the previously highlighted models reveals that the CLC Model excludes strategic competence, which, though partially included in pragmatic competences, is largely abandoned. As a result, the above-mentioned pragmatic competences in a broader sense no longer correspond to how the pragmatic aspect of language use is conventionally conceptualised. As reviewed above, strategic competence, playing a quintessential role in language use, should be a subcomponent attached to communicative language ability as a whole. Such abandonment would also render infeasible any test development or validation whose rationale resides in the CLC Model (see Alderson 2002; Morrow 2004).
Third, the naming of sociolinguistic competences itself might be problematic, because the literal sense of the term suggests that they are naturally subordinate to linguistic competences, another core component of the CLC Model, which would thus appear to override them.
Endeavouring to seek the fittest model for designing a rating scale with nonverbal delivery included as a dimension, the above review elaborates on communicative competence, covering the background of the notion (Hymes 1972) and its subsequent notional evolution (Bachman 1990; Bachman and Palmer 1996; Canale and Swain 1980; Canale 1983; Council of Europe 2001). Admittedly, other frameworks relating to communicative competence have also emerged in the process of notional development. Celce-Murcia et al. (1997), for instance, extend the communicative competence model by further dividing sociolinguistic competence into sociocultural competence and actional competence. With regard to renovations of the CLA model, Douglas (2000) proposes a model with a particular view to language use for specific purposes, in which professional or topical knowledge is equally emphasised. Likewise, Purpura (2004) develops an extended model based on the CLA model, proposing “a model of language knowledge with two interacting components: grammatical knowledge and pragmatic knowledge” (Purpura 2008, p. 60).
An in-depth analysis of the above modified models or frameworks, though excluded here, would suffice to show that they are characterised either by domain-specificity or by further breakdowns derived from the CLA model. It is therefore justifiable to regard the CLA model as an umbrella model that covers the notions and models just briefed.
A retrospective review of the communicative competence model, the CLA model and the CLC Model on a chronological continuum, as illustrated in Fig. 2.6, can provide a better understanding of communicative competence and of which model can be judged the fittest. Components linked by arrows, as an indication from a developmental point of view, are basically of the same conceptual referents. It can be observed that when the notion is ushered into the CLA model, as the arrows in the figure indicate, its components are the most comprehensive and inclusive, with integrated interactions and mechanisms between different components. Notably, the CLA model substantiates the component of strategic competence and incubates psychophysiological mechanisms, though related studies on the latter are unavailable.
When the notion evolves into the CLC Model, strategic competence disappears; design competences in the CLC Model, judging from the definition previously mentioned, have only a seemingly partial connection with psychophysiological mechanisms in the CLA model, as indicated by a dotted arrow in Fig. 2.6. It can therefore be argued that the absence of strategic competence in explaining what communicative competence is gives rise to the model’s caveats, and in turn that the CLA model should be selected as the fittest model for its inclusiveness and explanatory power. All these enhance the justification that the CLA model is the most appropriate theoretical rationale, based on
[Fig. 2.6: components of the communicative competence model, the CLA model and the CLC Model aligned on a chronological continuum, including discourse competence, sociolinguistic competence(s), illocutionary competence, strategic competence (assessment, planning, execution), psychophysiological mechanisms and design competences]
latter, issues arise as to what components can best represent and constitute the construct of the notion. As showcased in the above elaborations, communicative competence is multi-componential; thus, when communicative competence is assessed, EFL learners are supposed to be assessed in different domains. This also echoes the philosophy of the present study in that a rating scale, particularly in the context of formative assessment, should be designed as analytic instead of holistic. This issue will be re-addressed and further resolved in the next section of this chapter. Second, Connor and Mbaye (2002) pinpoint that a sound model of communicative competence offers a convenient framework for categorising components of written and spoken discourse, in which all the possible competences should be reflected in the scoring criteria. A substantial number of test designers have indeed adopted the CLA model as the basis of rating scale design (e.g. Clarkson and Jensen 1995; Grierson 1995; Hawkey 2001; Hawkey and Barker 2004; McKay 1995; Milanovic et al. 1996). To that end, the selection of the CLA model in the present study can be further justified.
Therefore, following the CLA model, the rating scale to be proposed will comprise two broad dimensions: language competence and strategic competence. The former is quite self-explanatory within the model with regard to what detailed assessment domains should be looked at; strategic competence, however, seems less observable because it is explained in the model in terms of three metacognitive strategies. In that context, enlightened by the definition of strategic competence, which mainly concerns how a speaker resorts to non-linguistic means to sustain communication, and informed by the review of nonverbal delivery in the previous section, the present study attempts to incorporate nonverbal delivery into the rating scale as one observable dimension corresponding to strategic competence. Although it has to be admitted that nonverbal delivery alone cannot depict a full picture of strategic competence, it can to a large extent provide a detectable and representative profile of candidates’ performance in speaking assessment.
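The correspondence just described can be sketched, purely illustratively, as a small data structure. The dimension names and band ranges below are hypothetical inventions for exposition, not the scale proposed later in this study:

```python
# Purely illustrative sketch of an analytic rating scale in which nonverbal
# delivery serves as the observable dimension corresponding to strategic
# competence in the CLA model. All dimension names and band ranges below
# are hypothetical, not those of the actual scale.

ANALYTIC_SCALE = {
    # observable dimensions drawn from language competence
    "grammatical_accuracy": range(0, 6),    # bands 0-5
    "discourse_organisation": range(0, 6),
    "pragmatic_appropriacy": range(0, 6),
    # observable proxy for strategic competence
    "nonverbal_delivery": range(0, 6),
}

def validate_profile(scores):
    """Check a candidate's subscores against the scale; each dimension is
    reported separately, so every subscore carries diagnostic information."""
    for dim, band in scores.items():
        if dim not in ANALYTIC_SCALE:
            raise KeyError(f"unknown dimension: {dim}")
        if band not in ANALYTIC_SCALE[dim]:
            raise ValueError(f"{dim}: band {band} outside the scale")
    return scores

# e.g. a candidate who is strong verbally but only moderately expressive
# in nonverbal delivery
candidate = {"grammatical_accuracy": 5, "discourse_organisation": 4,
             "pragmatic_appropriacy": 4, "nonverbal_delivery": 3}
print(validate_profile(candidate))
```

The point of the sketch is only that nonverbal delivery enters the scale as one band-scored dimension alongside the language-competence dimensions, so a candidate's nonverbal performance remains separately visible in the score profile.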
With the above, incorporating nonverbal delivery into speaking assessment appears well grounded because nonverbal delivery is intrinsically rooted in strategic competence in the CLA model. Yet such a perception largely remains at the theoretical level. If an argument for embedding nonverbal delivery into speaking assessment can be built via an empirical study verifying that competence in this aspect can indeed distinguish candidates across a range of proficiency levels, the argument can be further consolidated. It can also pave the way for the formulation and validation of a rating scale with such a consideration. As aforementioned, this argument will be made in the first phase of this study.
This section will touch upon the literature concerning rating scales and the context of the rating scale to be proposed in this study, viz. formative assessment. It will address four questions: (1) what is a rating scale in language assessment? (2) what are the different categorisations of rating scales? (3) what is formative assessment and how can it benefit EFL learners? and (4) what type of rating scale best accommodates the context of formative assessment? The end of this section will integrate the review to summarise the wide-ranging properties of the rating scale to be proposed.
The second categorisation is that of holistic and analytic scales. It was first brought forth by Shohamy (1981) and has long served as the most salient and best-documented categorisation (e.g. Bachman 1988; Bachman and Savignon 1986; Douglas and Smith 1997; Fulcher 1997; Ingram and Wylie 1993; Underhill 1987; Weir 1990). As this taxonomy is commonly referred to (see Barkaoui 2007; Cooper 1977; Fulcher 2003; Goulden 1992, 1994; Hamp-Lyons 1991; Weigle 2002), more elaborations will accordingly be unfolded in this section of the review. A holistic rating scale is also referred to as an impressionistic or global scale. It is first defined in the context of writing assessment, when Cooper (1977) posits that a holistic rating scale refers to
any procedure which stops short of enumerating linguistic, rhetorical, or informational
features of a piece of writing … [s]ome holistic procedures may specify a number of
particular features and even require that each feature be scored separately, but the reader is
never required to stop and count or tally incidents of the feature (p. 4).
more than one domain of assessment, such as accuracy, vocabulary and fluency. However, all the descriptors of a particular band are grouped together, unlike in multi-trait scoring, where the different domains of assessment are separately described in detail.
Since only one score is supposed to be given, this scoring method usually triggers controversy on account of its incomplete account of the targeted construct (Fulcher 2003). It also seems less powerful in explaining the intriguing nature of speaking. Another problem with holistic scoring is that, in speaking assessment, raters might overlook one or two aspects, in which case candidates might be rated on their strengths instead of being penalised for weaknesses (Bacha 2001; Charney 1984; Cumming 1990; Hamp-Lyons 1990; Knoch 2009). Holistic scoring is nonetheless favoured primarily by large-scale language assessments, where the time allocated for rating is of topmost concern, yet it is spurned in classroom assessment because it provides limited feedback for students and teachers about what might be revealed from the assessment per se.
Primary-trait scoring was developed to assess certain expected language functions
or rhetorical features elicited by an assessment task (Lloyd-Jones 1977). It was first
adopted by the National Assessment of Educational Progress (NAEP) for the
purpose of obtaining more information from one single score. As Applebee (2000)
explains, regarding writing assessment, “primary trait assessment in its initial for-
mulations focused on the specific approach that a writer might take to be successful
on a specific writing task; every task required its own unique scoring guide” (p. 4).
Thus, in primary-trait scoring, raters predetermine a main trait for successful task
fulfilment, so that the scoring criteria are usually reduced to one chief dimension
and are therefore context-dependent (Fulcher 2003).
Although only one score needs to be assigned in primary-trait scoring, that single
score largely depends on the degree to which the candidate addresses the specific
requirements of a given oral assessment task (Barkaoui 2007).
This kind of rating scale is advantageous by virtue of its focus on one targeted
observable aspect of language performance, and it is a relatively quick way to score
speaking performance, especially when rating emphasises one specific aspect of
that performance. For example, if candidates are requested to perform a presenta-
tion as an assessment task, a rater would rather concentrate on candidates’ artic-
ulation than lexical density. In that case, the primary-trait articulation is assessed
with a focused weighting. However, precisely because this way of scoring con-
centrates on only one primary trait, it is debatable whether the aspect singled out
for assessment is primary enough to base a single score on (Knoch 2009).
Hamp-Lyons (1991) puts forward multiple-trait scoring, or multi-trait scoring,
for rating scales designed to offer feedback to learners and other stakeholders
about performance on contextually appropriate and task-specific criteria. As the
name of this scoring method suggests, it involves evaluating various traits to reach
an overall score. Although this approach is similar to primary-trait scoring in that
both methods are holistic in nature, it allows raters to observe more than one dimension.
Given that, it can also be regarded as an extended version of holistic scoring method
38 2 Literature Review
as the band descriptors of each assessment domain are much more detailed and
concrete.
Since large-scale language assessments usually take rating duration into serious
consideration, the rating scales adopted by IELTS Speaking (see Appendix I) and
TOEFL iBT Independent Speaking Tasks (see Appendix II) are typical of this
category. In the former case, a rater is supposed to judge examinees’ performance in
four aspects (fluency and coherence, lexical resource, grammatical range and
accuracy, and pronunciation) and assign an overall score according to nine bands
(Band 1 to Band 9). What is slightly different in the case of TOEFL is that a number
of general descriptions concerning task fulfilment, coherence or intelligibility are
also attached to the rating scale in addition to the descriptors of the three individual
traits (delivery, language use and topic development). Yet a rater is still expected to
assign an overall score to the speech sample within a range of five bands (Band 0 to
Band 4).
By contrast, Cooper (1977) defines the analytic approach as requiring the rater “to
count or tally incidents of the features” (p. 4). Analytic rating scales comprise
separate categories representing different aspects or dimensions of performance. For
example, dimensions for oral performance might include fluency, vocabulary and
accuracy. Each dimension is scored separately, and then dimension scores are
totalled. Analytic rating scales can be extremely similar to multi-trait scoring in the
sense that both require raters to assign more than one score to a speech sample.
However, they differ in that multi-trait scoring is more task-specific, usually
focusing on the specific features of performance necessary for successful task
fulfilment, whereas analytic scoring is more generalisable to a plethora of assess-
ment tasks, with generic dimensions of language production included.
For example, the rating scale for the Test of English for Educational Purposes
(TEEP) takes this form (see Appendix III). A rater is supposed to tick one number
for each of the six assessment domains (appropriateness, adequacy of vocabulary
for purpose, grammatical accuracy, intelligibility, fluency, and relevance and
adequacy of content) and then sum up the subscores. One special example is the
rating scale of the BEC Oral Test, which combines holistic and analytic rating (see
Appendix IV for Level 1). One interlocutor, responsible for communicating with
candidates, marks holistically, while another assessor takes charge of analytic
marking; the two scores are subsequently averaged into a final score.
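The arithmetic behind such summed and combined marking can be sketched in a few lines. The domain names, band maxima and equal weighting below are illustrative assumptions only, not the actual TEEP or BEC criteria.

```python
# Sketch of analytic summation (TEEP-style) and combined holistic/analytic
# marking (BEC-style). Domain names, maxima and weighting are hypothetical.

def analytic_score(subscores):
    """Sum the subscores ticked for each assessment domain."""
    return sum(subscores.values())

def combined_score(holistic, analytic, max_holistic=9, max_analytic=36):
    """Average a holistic mark and an analytic total after normalising
    both onto a common 0-1 range."""
    return ((holistic / max_holistic) + (analytic / max_analytic)) / 2

subscores = {"appropriateness": 5, "vocabulary": 6, "accuracy": 4,
             "intelligibility": 5, "fluency": 6, "relevance": 5}
total = analytic_score(subscores)             # 31 out of a hypothetical 36
final = combined_score(holistic=7, analytic=total)
print(round(final, 3))                        # 0.819 on a 0-1 scale
```

The equal weighting of the two raters mirrors the averaging described above; an operational scheme might well weight the interlocutor's and the assessor's marks differently.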
However, analytic scoring is criticised insofar as the various separate domains do
not necessarily add up to the whole. In other words, individual subscores for
different dimensions might not supply reliable information about what is assessed
globally. On the other hand, since scoring is multifaceted, raters might assign
correspondingly lower subscores to all the assessment domains if one particular
domain is not performed as satisfactorily as expected. The resulting tendency to
assign the same low grades across all the domains is known as the “halo effect”
(Thorndike 1920) or “cross-contamination” (Alderson 1981).
On the positive side, when rating is conducted analytically, raters are less likely
to confuse dimensions, since they are supposed to assign a subscore to each
assessment dimension. Weir (1990) also comments that
2.3 Rating Scale and Formative Assessment 39
analytic rating scales facilitate rater training and scoring calibration, especially for
inexperienced raters. In addition, the advantages of adopting analytic over holistic
rating scales include access to fine-grained information about examinees’ lan-
guage ability (Bachman et al. 1995; Brown and Bailey 1984; Kondo-Brown 2002;
Pollitt and Hutchinson 1987), because rating analytically across a variety of
dimensions may reveal more about what students excel in.
Weigle (2002), in the context of writing assessment, also contends that analytic
rating scales are generally accepted to result in higher reliability and construct
validity, especially for second language writers, although they can be
time-consuming. This accords with Sawaki’s (2007) view that in second
language assessments, analytic rating scales are often used to assess candidates’
language ability within a single modality, viz. speaking in the case of this study. When
it comes to the construction of a rating scale for formative assessment, whether
analytic or holistic scale is preferred will be further discussed in the follow-up
section.
Regarding the process of rating scale design, North (1996) describes the develop-
ment of rating scales as condensing the complexity of performance into thin
descriptors. The ways in which rating scales and rating criteria are constructed and
interpreted by raters also act as de facto test constructs (McNamara 2000).
Therefore, another categorisation of rating scales takes the perspective of how they
are developed: intuition-based, theory-based, empirically driven and performance
decision trees (PDTs) (Brindley 1991; Fulcher 2003, 2010; Fulcher et al. 2011;
North 2003). The first type tends to be an a priori measuring instrument, or the
“armchair method of scale development” (Fulcher 2010, p. 209). The a priori method usually
refers to constructing the descriptors of the rating scales by an expert, often using
his/her own intuitive judgment concerning the nature of language proficiency, along
with consultation with other experts. It is believed to be the most
prevalent method of generating a rating scale (Knoch 2009). The a priori method can
be subclassified into more specific development methodologies (North 1994), but
they mostly have in common “the lack of any empirical underpinnings, except as
post hoc validity studies” (Jarvis 1986, p. 21).
The second type is based on an existing theory or framework. Lantolf and
Frawley (1985) expound that the validity of a rating scale can be limited if no
linguistic theory or research on the definition of proficiency is taken into
account. As mentioned above, the advantage of basing a rating scale on a model of
communicative competence is that “these models are generic and therefore not
context-dependent” (Knoch 2009, p. 48), resulting in higher generalisability.
The third type, designed in a post hoc fashion, is driven by the data
elicited from a sample of testees, from which rating scale developers extract the
features that distinguish candidates across various proficiency levels. For example,
Fulcher (1987, 1993, 1996a) developed a rating scale of fluency in spoken English
assessment based on the distinctive discourse features discernable in candidates’
oral production. Another data-based method of rating scale development is a
corpus-based/corpus-driven approach. Hawkey (2001) and Hawkey and Barker
(2004) designed a universal rating scale that covers Cambridge ESOL writing
examinations at different proficiency levels.
The latest development in rating scales is the fourth type, which starts
with an analysis of the discourse features expected in real-life interaction and then
frames the assessment domains of a particular context as decision trees.
Afterwards, a decision on whether the obligatory elements are present in each tree
determines what should be assessed, as reflected in a rating scale (Fulcher
2010). Fulcher et al. (2011) employ a scoring model for service encounters with
PDTs and recommend this method for performance tests within a specific
communicative context.
Since it is not quite necessary for the rating scales used in low-stakes speaking
assessments to be constructed from data, most of them are developed intuitively.
However, when this approach is applied to the formulation of the rating scale for
large-scale and high-stakes tests, problems of validation and reliability might arise.
For instance, Skehan (1984) and Fulcher (1987, 1993) criticise the English
Language Testing Service for its intuitively developed rating scale.
Likewise, Brindley (1986, 1991) and Pienemann and Johnston (1987) find that the
rating scale used in Australian Second Language Proficiency Ratings (ASLPR)
lacks validity due to its intuitive development. Bachman (1988), Bachman and
Savignon (1986), Fulcher (1996b), Lantolf and Frawley (1985, 1988), Matthews
(1990) and Spolsky (1993) challenge the validity of the ACTFL scales, either
through empirical studies or by reasoning that the scales confuse linguistic with
non-linguistic criteria. It can therefore be generalised that even when a rating scale
is developed intuitively or on theoretical underpinnings, it is better validated with,
or informed by, data-driven methods.
Specific to the present study, on the one hand, the development of the rating
scale is based on a priori consideration of the CLA model, together with the
possible discriminating features informed by data-driven evidence when an
argument for embedding nonverbal delivery into speaking assessment is built. On
the other hand, post hoc quantitative and qualitative validation studies will con-
tribute to the finalisation of the rating scale. However, as formative assessment in
the context of this study does not fall into professional English testing, there is
little need to apply the PDTs method. Therefore, the rating scale, with nonverbal
delivery included as a dimension, integrates theory-based design with empirical
validation.
Alderson and Banerjee (2002) divide rating scales in terms of task specificity. One
type is generic scales, constructed in advance for almost all sorts of assessment
tasks; the other is task-specific scales, used to evaluate test-takers’ performance on
particular target tasks. Rating scales and tasks are thus directly linked because the
scales describe the speaking skills that tasks might elicit (Luoma 2004). However,
as different assessment tasks feature discrepant task characteristics, it is
questionable whether such a generic rating scale can be designed. Since the present
study proposes a rating scale applicable to formative assessment, it makes no claim
to being a generic one, because the assessment task in the present study, viz. the
group discussion to be elaborated below, is prespecified.
The last categorisation focuses more on the physical layout of rating scales. The
simplest type in this categorisation is the graphic and numerical rating scale, in
which a continuum runs between two points representing the ends of a scale, yet
with no descriptors of the behaviours expected from candidates (North 2003).
Therefore, the subjectivity among various raters becomes the main drawback of
such design. The second type is a labelled rating scale, viz. a scale with cues
attached to various points along the scale. Nonetheless, it can still be regarded as
less assessor-friendly as the cues provided might be vague, such as a range from
poor to excellent (Knoch 2009). The third type is a vertical rating scale with each
point elaborately defined, so that sufficient space is allowed for longer descriptions.
For instance, Shohamy et al.’s (1992) ESL writing scale falls into this type.
However, since there is no significant difference in the reliability of the different
designs (Myford 2002), this study will first aim at a rating scale with sufficiently
defined behavioural expectations for rater-friendliness, yet subject to revision after
expert judgment in the rating scale formulation phase.
The above definition clarifies that one purpose of conducting formative
assessment is to diminish, or even remove, the possible negative backwash of
high-stakes tests on language learning (Wang et al. 2006). Against this background,
increasing attention is paid to the great potential of formative assessment; con-
ventional summative testing of language learning outcomes also gradually incor-
porates formative modes of assessing language learning as an ongoing process
(Davison 2004). However, formative assessment vis-à-vis summative assessment is
still underexplored (Black and Wiliam 1998; Davies and LeMahieu 2003; Leung
2005a; Leung and Mohan 2004).
2.3.4.1 Definition
The wording of this broad definition, such as purpose and source, mainly touches
upon the functions of formative assessment. However, because of its broadness,
many aspects of formative assessment fail to be specified, such as the referents of
educators and the nature of the information source as further guidance in
In fact, the above analyses of the definitions of formative assessment already reveal
its functions and purposes, which can also be credited as benefits in four
aspects.
First, as far as the nature of formative assessment is concerned, it provides a
wealth of feedback from assessors to learners. Such feedback is therefore char-
acterised by learner-specificity and full description on an individual basis (Sadler
1989). Herman and Choi (2008) examine teachers’ and learners’ perceptions to
see whether both sides share a similar understanding of formative assessment.
The results indicate that the perceptions and attitudes on both sides are
consistent, and the study also emphasises the significance of improving learners
with the information available from formative assessment. Rea-Dickens (2006,
p. 168), however, advises that in formative assessment the feedback to learners
should be “descriptive” rather than “evaluative” so that it is not negatively perceived.
Second, given that the classroom is usually the primary venue for formative
assessment, learners’ anxiety can be much reduced by a familiar
environment. Davidson and Lynch (2002), Lynch (2001, 2003) and McNamara
(2001) generally agree in endorsing formative assessment over conventional
testing methods as a shift of the locus of control from a centralised authority into the
hands of classroom teachers and learners’ peers. If the assessment environment is
familiar to candidates, they will presumably be better positioned to give full play to
their potential.
Third, as formative assessment may include tasks or activities, such as ongoing
self-assessment, peer-assessment, projects and portfolios (Cohen 1994), most
assessment methods can be task-based. As Ross (2005) points out, one of the key
appeals of formative assessment is the autonomy given to learners.
Formative assessment is thus thought to influence learner development through a
widened sphere of feedback during their engagement with various learning tasks.
Last, regarding validity in alignment with traditional standardised assessment,
research has emerged on validating formative assessment as a testing method.
Huerta-Macias (1995) prioritises the direct face validity of alternatives to
conventional achievement tests as sufficient justification for their use. This view
also accords with the notion of learner and teacher empowerment (Shohamy
2001). It follows that if a rating scale for formative assessment is also rigorously
validated, it can be applied as a valid measure.
With the benefits of formative assessment outlined above, it is necessary to
develop, in the formative assessment context, a rating scale with a dimension of
nonverbal delivery. In so doing, teachers can assess learners from various
dimensions, and learners may also have access to feedback on various aspects for
self-enhancement.
The above outlines the various benefits that formative assessment might
offer. This part of the review turns to group discussion as a formative assessment
task for assessing EFL learners’ spoken English, in order to justify why group
discussion is chosen as the main assessment task in the present study.
Prior to unfolding the usefulness of group discussion in formative assessment,
previous studies on group discussion as an assessment task are first reviewed. In
the first large-scale study on group discussion concerning the accuracy of
test-takers’ production, Liski and Puntanen (1983) find that test-takers’ performance
in group discussions can serve as a good predictor of their overall academic success. In
addition, Fulcher (1996a) also reports that test-takers consider group discussion as a
valid form of second language testing and that examinees feel less anxious and
more confident when speaking to other discussants instead of examiners or inter-
locutors (Folland and Robertson 1976; Fulcher 1996a). Fulcher (1996a) also finds
that group discussion is an easily organised task compared with picture talk, where
an interlocutor or an interview based on speaking prompts will be involved. In
addition, group discussion, similar to paired discussion (Brooks 2009), may elicit
richer language functions than oral proficiency interviews (OPI) so that commu-
nicative ability can be more comprehensively assessed (Nakatsuhara 2009; van
Moere 2007).
Pre-eminently, in the context of formative assessment, group discussion can be
assessed not only by instructors but also by learners and their peers on condition
that the rating scales and criteria are made transparent and accessible to all the
parties concerned (Fulcher 2010; Shepard 2000). However, previous studies also
indicate that without substantial experience of applying the scoring criteria to work
samples, self-assessments may fluctuate substantially (Ross 1998; Patri 2002). By
contrast, peer-assessments are likely to be much more reliable though they can be
more lenient than instructor-assessments (Matsuno 2009). Based on these
considerations, the present study, instead of including self-assessment as a rating
method, resorts to teacher- and peer-rating when the proposed rating scale is
validated.
Although the reliability of group discussion as an assessment task in standard-
ised large-scale testing is challenged as raters might not be able to assign reliable
scores when candidates are tested in groups (Folland and Robertson 1976; Hilsdon
1995), such unreliability is hardly recorded empirically. In response to that, Nevo
and Shohamy (1984) compare 16 language assessment experts’ intuitions and
perceptions of group discussion as an assessment task with those of other forms,
such as role-play and OPI, only to find that group discussion ranks top in terms of
task utility standards but stands in the middle on fairness, which probably explains
testing experts’ suspicion of the reliability of group discussion. Despite this, scant
evidence has been collected to show that group discussion is unreliable.
In terms of task usefulness and task characteristics, group discussion also has a
few distinctive features. It is first of all highly interactive and authentic (Kormos
1999; Lazaraton 1996b; van Lier 1989), with all the discussants involved in a
meaning-making and negotiating process. It is also characterised by a high degree
of feasibility and economy in the sense that formative assessment of this kind can
just take place in classrooms and can be time-saving because several students are
grouped together to be assessed, thus greatly reducing the time that traditional
testing methods would call for (Ockey 2001). Another point inherent in formative
assessment that also credits group discussion is that all candidates discuss with
familiar faces without interlocutors, which tends to lower their anxiety and avoid
errors arising from the intervention of interlocutors (Ross and Berwick 1992;
Johnson and Tyler 1998; Young and He 1998a; Brown 2003). What is more, even
though candidates’ weaknesses are disclosed in various aspects, they do not feel as
ashamed as they otherwise would in the face of generally stern examiners or
interlocutors.
To briefly summarise, the above review provides positive evidence that this
particular assessment task can be judged as ideal from the perspectives of face
validity, reliability, authenticity, interactiveness, impact and practicality, which,
incidentally but purposefully, accords with Bachman and Palmer’s (1996)
framework of test usefulness.
Integrating the above review on rating scales and formative assessment,
conclusions can be reached regarding the properties of the rating scale that this
study intends to propose. It will be an assessor-oriented analytic rating scale specifically
for group discussion in formative assessment. The band and level descriptors aim to
be defined and descriptive instead of merely evaluative. The design of the rating
scale will be firstly theory-grounded on the construct of the CLA model and pre-
liminary discriminating features identifiable in candidates’ nonverbal delivery and
then undergo empirical corroboration with data-driven involvement.
As the last phase of the present study sets out to validate a proposed rating scale
with nonverbal delivery included as an assessment dimension, it is of importance to
review the conceptualisation of validity and the evolution of validation methods.
What should be pointed out is that validity is an integral and most basic concept in
language assessment because “accepted practices of test validation are critical to
decisions about what constitutes a good language test for a particular situation”
(Chapelle 1999, p. 254). How validity is defined in reality determines how a test is
to be validated.
Historically, test validity is an ever-changing concept and has undergone
metamorphoses chronologically (Angoff 1988; Cronbach 1988, 1989; Goodwin
1997, 2002; Goodwin and Leech 2003; Kane 1994, 2001; Messick 1988, 1989a, b;
Langenfeld and Crocker 1994; McNamara and Roever 2006; Moss 1992; Shepard
1993; Yang and Weir 1998). Researchers with different perceptions towards
validity (e.g. Angoff 1988; Kane 2001; Goodwin and Leech 2003) have various
demarcations of its development. Nonetheless, what is certain is that the
introduction of construct in conceptualising validity is widely regarded as a mile-
stone. Therefore, all demarcations can fall into three phases in terms of how the role
of construct validity evolves: (1) the preconstruct-validity phase, a period before
construct validity was put forward by Cronbach and Meehl (1955); (2) the initial
phase of construct validity, a period covering the range from the 1970s to the 1980s,
when construct validity was made co-existent with other types of validity in lan-
guage testing; and (3) the core phase of construct validity, a period when the
concept starts to play a quintessential role in test validation.
In recent decades, with the popularity of argument-based validation methods,
other perspectives on conceptualising validity have also emerged, among which the
Assessment Use Argument (AUA) is widely utilised as “[an] overarching
logical structure that provides a basis both for test design and development and for
score interpretation and use” (Bachman 2005, p. 24). However, as far as the essence
of AUA is concerned, it still falls into the third phase as this notion calls for
evidence collection in support of construct validity.
Therefore, concerning the concept of validity, this part will embark upon a
review of the componential notion of validity, followed by the unitary concept of
validity, with construct validity at its core. Afterwards, details of the newly
established AUA (Bachman 2005; Bachman and Palmer 2010) will also be briefly
reviewed; however, the critique on AUA in this section will lead to an argument
that caution should be taken in employing AUA as the framework of validating the
proposed rating scale. This section of review will wind up with the justification of
employing both quantitative and qualitative approaches for the validation of the
rating scale to be proposed based on the unitary notion of test validity; in particular,
the incorporation of nonverbal delivery calls for a qualitative approach in validating
the rating scale.
The early phase of test validity stresses test purposes. Guilford (1946) points out
that every test is purpose-specific and that one test can be valid for a particular
purpose but invalid for another. Whether they are test providers or test users, all
stakeholders should be responsible for verifying that a test is valid for the
particular purpose it serves. In that sense, how validity is defined is closely
associated with test purposes, as reflected in Garrett’s (1947)
definition that “the validity of a test is the extent to which it measures what it
2.4 Validity and Validation 49
purports to measure” (p. 394). Similarly, Cureton (1950) also views test purpose
as the basic issue of test validity and phrases its importance as “how
well a test does the job it was employed to do” (p. 621). Given this
viewpoint, test purposes can be twofold: either diagnosing existing issues or
predicting future performance. Accordingly, the American Psychological
Association (APA), the American Educational Research Association (AERA) and
the National Council on Measurement in Education (NCME), in the early versions
of their Standards for Educational and Psychological Testing (Standards), divide
criterion-related validity into concurrent validity and predictive validity (see APA
1954; APA et al. 1966).
In fact, criterion-related validity is deeply rooted in a realist philosophy of
science, which holds that every individual can produce a value on the specific
assessment characteristics and the assessment purpose is to estimate or predict that
value as accurately as possible. In the context of standardised testing, the “true
score”, or the estimates most approximating the “true score”, reflects the extent to
which the test has precisely estimated that value (Thorndike 1997). In that sense,
the precision of estimation is the degree of test validity.
The above definition reveals that criterion-related validity is concerned with the
test per se and that it is a static property attached to test validity (Goodwin and
Leech 2003). Criterion-related validity therefore equates to “the correlation of
scores on a test with some other objective measure of that which the test is used to
measure” (Angoff 1988, p. 20). A test can be judged as valid or invalid according to
the measuring results (Cureton 1950; Gulliksen 1950) and “[i]n a very general
sense, a test is valid for anything with which it correlates” (Guilford 1946, p. 429).
The key to validating criterion-related validity then lies in how to lay down the
criterion measure in order to obtain standardised test scores, without which such
validation studies cannot be carried out. Cureton (1950) puts forward the following
method.
A more direct method of investigation, which is always to be preferred wherever feasible, is
to give the test to a representative sample of the group with whom it is to be used, observe
and score performances of the actual task by the members of this sample, and see how well
the test performances agree with the task performances. (p. 623)
As revealed above, the first step is sampling the target candidates and observing
their performances in the real assessment tasks to assign the corresponding scores.
The scores ultimately derived should become the standard scores with reference to
the criterion. When other tests are in the process of validation, the newly observed
scores will be correlated with the standard scores to see the extent to which the test
consistently measures the candidates’ ability. Therefore, “the test is valid in the
sense of correlating with other [valid and reliable language] tests” (Oller 1979,
pp. 417–418). Ebel (1961), however, argues that some language tests
can be regarded as valid merely through subjective judgment and that language
assessment experts’ judgment can be employed to measure test validity.
Once the validity criterion is determined, it is possible to design standard testing for
the validation of other tests.
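The correlational logic described above can be sketched in a few lines of Python. The score lists below are fabricated purely for illustration; a high coefficient would be read as evidence that the new test measures consistently with the criterion.

```python
# Minimal sketch of criterion-related validation: correlate scores on a test
# under validation with "standard" scores derived from observed task
# performance. All score values here are fabricated for illustration.
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

standard = [62, 70, 75, 81, 90]   # criterion scores from observed task performance
new_test = [60, 68, 77, 80, 88]   # scores on the test being validated

r = pearson_r(new_test, standard)
print(round(r, 3))  # a high r would be taken as evidence of criterion-related validity
```

The same computation would serve Cronbach's (1971) equivalent-tests method for content validation, where the two lists are scores on two tests with the same content coverage.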
Content validity usually refers to the extent to which the test items or tasks
sufficiently represent the domain or universe of the content to be covered in
a test. It was explained as “[whether] the behaviours demonstrated in testing con-
stitute a representative sample of behaviours to be exhibited in a desired perfor-
mance domain” (APA et al. 1974, p. 28).
Angoff (1988), in summarising what content validity in language assessment
represents from the aspects of content relevance, content coverage and content
significance, posits that a test has content validity when all the test items are
representative not only of the domain but also of the number and significance of the
domain. Messick (1988), from the interface between content and construct, asserts
that “[w]hat is judged to be relevant and representative of the domain is not the
surface content of test items or tasks, but the knowledge, skill, or other pertinent
attributes measured by the items or tasks” (p. 38).
The main validation method regarding content validity is based on logical
judgement, such as expert evaluation and a review of the test content by assessment
experts (Angoff 1988). Since much subjectivity is involved, this validation method
is usually controversial (Guion 1977; Kane 2001). Given this, there was once a call
for empirical validation of expert evaluation (Bachman et al. 1995). However, in
direct performance tests, there are indeed advantages for expert evaluation
(Cronbach 1971), which is still being utilised in many assessment settings (Kane
2001). Cronbach (1971) also puts forward equivalent tests for content validation, in
which two sets of scores obtained from two different tests with the same content
coverage are correlated. A low correlation coefficient can indicate that at least one
of them does not have high content validity, yet it is challenging to determine which
particular test it is. On the other hand, if the correlation coefficient is high, it can be
generally thought that both tests have content validity.
Unlike criterion-related validation, whose problem lies in the availability of a
real standardised test, content validation is challenged in that the representativeness
of test content can hardly be guaranteed. On the one hand, the domain or universe of
a test cannot be easily operationalised because what is assessed can be either
language knowledge and language skills, or complicated performances or pro-
cesses. On the other hand, the number of test items, coverage of test materials and
method of sampling all impact the representativeness of a test content as well as its
facility and discriminating power (Angoff 1988). Latent variables mentioned above
2.4 Validity and Validation 51
Construct validity is first conceptualised by Paul Meehl and Robert Challman upon
their draft of the Standards (1954), and further nourished by Cronbach and Meehl
(1955). The introduction of construct validity, together with criterion-related
validity and content validity, signifies the beginning of a “trinity view” of test
validity (see APA et al. 1966). Therefore, construct validity has been regarded as a
hallmark in the evolution of test validity. However, when this notion is first con-
ceptualised, it is treated as a mere supplement to criterion-related validity
(Cronbach and Meehl 1955). This is because when the criterion measure is not
available, researchers would turn to an indirect validation method, which highlights
the trait or quality underlying the test instead of test behaviour or scores on the
criteria. This “trait or quality underlying the test” is precisely what a construct is.
The Standards (APA et al. 1974) define a psychological construct as
[a]n idea developed or “constructed” as a work of informed, scientific imagination; that is, it
is a theoretical idea developed to explain and to organise some aspects of existing
knowledge. Terms such as “anxiety”, “clerical aptitude”, or “reading readiness” refer to
such constructs, but the construct is much more than the label; it is a dimension understood
or inferred from its network of interrelationships. (p. 29)
Face validity, as its name suggests, usually refers to the degree to which the surface
features of a test, such as the language and instructions used, the layout and the
printing quality of the test paper, are acceptable to candidates and the
public (Hughes 2003). Whether test validity in this regard should also be treated as
a component of validity has long been debatable. Because face validity is only
confined to the acceptability of the test paper at the surface level without any
involvement of psychological measurement, it cannot truly reflect the validity of a
test in the strictest sense, nor can it be a yardstick of measuring the degree of
validity for a test. Mosier (1947), when criticising the ambiguity of face validity,
thinks that “any serious consideration of face validity should be abandoned” (cited
in Angoff 1988, p. 23). Angoff (1988) also mentions that “superficial judgments
of the validity of a test made solely on the basis of its appearance can easily be very
wrong” (p. 24).
Although face validity is challenged as unqualified to be a component of validity,
quite a number of researchers have noted its importance. Anastasi
(1982) believes that “the language and contexts of test items can be expressed in
ways that would look valid and be acceptable to the test-taker and the public
generally” (p. 136). Likewise, Nevo (1985) also acknowledges the usefulness of
face validity and thinks that face validity should also be reported in test validation.
In brief, test validity at the first evolution phase was perceived as a
componential entity, with criterion-related validity, content validity and construct
validity as its tenets. However, in the case of this study, where a rating scale
embedding nonverbal delivery into speaking assessment is to be validated, it seems
quite impractical to accumulate evidence from all three aspects of validity. The
following part sheds light on validity as a unitary notion in which construct validity
plays the core role, and thereby justifies that, in validating the proposed rating
scale, construct validity will be the main object of scrutiny.
The two latest versions of the Standards (AERA et al. 1985, 1999) define construct
validity from a unitary perspective; the 1999 Standards further add test use and
consequence, reflecting an extension of test validity. The Standards (AERA et al.
1985) regard validity as
the appropriateness, meaningfulness and usefulness of the specific inferences made from
test scores. Test validation is the process of accumulating evidence to support such infer-
ences. A variety of inferences may be made from scores produced by a given test, and there
are many ways of accumulating evidence to support any particular inference. Validity,
however, is a unitary concept. Although evidence may be accumulated in many ways,
validity always refers to the degree to which that evidence supports the inferences that are
made from test scores. (p. 9)
The above definitions of validity can be comprehended as the extent to which all
sorts of evidence can support the score interpretation and use. Hence, validity is a
unitary concept with construct validity as the core. Compared with the componential
notion, this view has three distinguishing features.
First, unlike the preference for classifying validity into several components
(see Angoff 1988; Langenfeld and Crocker 1994; Messick 1988, 1989b, 1995;
Shepard 1993), this view no longer treats validity as divisible; rather, it is a unifying
force (Goodwin and Leech 2003; Messick 1988). The previous criterion-related
validity and content validity are also embedded into the evidence collection con-
cerning “content relevance”, “content coverage”, “predictive utility” and “diag-
nostic utility” (Messick 1980, p. 1015). The validation process includes collecting
evidence from various sources, interpreting and using the evidence for verification.
In light of construct validity, Cronbach (1988) puts forward two kinds of validation
programmes as follows.
The weak programme is sheer exploratory empiricism; any correlation of the test score with
another variable is welcomed. …The strong programme, spelled out in 1955 (Cronbach &
Meehl) and restated in 1982, by Meehl and Golden, calls for making one’s theoretical ideas
as explicit as possible, then devising deliberate challenges. (pp. 12–13)
It can be seen that the weak programme focuses on the correlation between test
scores and other variables, while the strong one tends to seek theory-based ideas.
The former holds that evidence should be gathered from a variety of sources so that
its advantage consists in its diversity and complementariness. However, just as
2.4 Validity and Validation 55
Kane (2001) points out, the weakness of this programme is its opportunistic
strategy; in other words, it seeks “readily available data rather than more relevant
but less accessible evidence” (p. 326). The strong programme follows an approach
of validation-through-falsification, viz. “an explanation gains credibility chiefly
from falsification attempts that fail” (Cronbach 1988, p. 13). Yet it also has its
weakness in that this approach is limited in its utility given an absence of a
well-grounded theory to test (Kane 2001).
The unitary notion of validity lays more emphasis on the complementariness,
rather than the alternativeness, of evidence. This view has been widely accepted and
reinforced since the 1980s. Bachman (1990) notes that “it is important to recognise that none
of these by itself is sufficient to demonstrate the validity of a particular interpre-
tation or use of test scores” (p. 237). In a similar vein, Weir (2005) also emphasises
that
[v]alidity is multifaceted and different types of evidence are needed to support any claims
for the validity of scores on a test. These are not alternatives but complementary aspects of
an evidential basis for test interpretation…No single validity can be considered superior to
another. Deficit in any one raises questions as to the well-foundedness of any interpretation
of test scores. (p. 13)
Second, the unitary concept of validity has shifted its focus from the test per se
to the interpretation of test scores, or more precisely, to the extent to which the
score interpretation can be supported by evidence. In 1986, Educational Testing
Service (ETS) sponsored a symposium themed Test validity for the 1990s and
beyond, and most of the keynote speeches are compiled in the proceedings by
Wainer and Braun (1988). On the first page of the prelude, there is a footnote to the
effect that a test itself cannot be claimed to be valid; rather, the inferences made
from the test scores should be used as the sources of validation.
In fact, Cronbach (1971) shares the above view when stating that “one validates
not a test, but an interpretation of data arising from a specified procedure” (p. 447)
and “one does not validate a test, but only a principle for making inferences”
(Cronbach and Meehl 1955, p. 297). Based on this, McNamara and Roever (2006)
go even further, holding that there is no such thing as a truly valid test, but only
interpretations that are defensible to a certain degree. Therefore, the unitary
concept of validity shows that test validity is manifested in score interpretation
rather than in the test per se.
Third, after the unitary concept of test validity is put forward, test use and its
consequence also invite great concern. Although they are not new in validity
studies, the Standards (1985) include neither of them in the definition of validity.
With the maturing of the unitary concept, there has been an increasing awareness of
and concern over the intended and unintended purposes, potential and actual
consequences (Cronbach 1988; Linn 1994; Messick 1989b, 1994; Shepard 1993).
Fitting into that trend, the new version of the Standards (1999) officially includes
test use and consequence in the definition of validity.
However, there are also researchers (e.g. Dwyer 2000; Popham 1997) who prefer
to confine validity to the boundary of score interpretation and traditional
Although the second evolution stage of test validity deems the notion a unitary
one, it is still etched with many dimensions. Messick is among the first proponents
of a unitary concept of test validity, and his works (1975, 1980, 1988, 1989a, b,
1992, 1994, 1995, 1996) exert far-reaching influence. Messick (1995) defines
validity as
validity as
nothing less than an evaluative summary of both the evidence for and the actual – as well as
the potential – consequences of score interpretation and use. This comprehensive view of
validity integrates considerations of content, criteria and consequences into a comprehen-
sive framework for empirically testing rational hypotheses about score meaning and utility.
(p. 742)
As can be interpreted from the above definition, the unitary concept is reflected
in an evaluative summary and a comprehensive view. In other words, this concept
encompasses the test content, test criterion and test consequence with hypotheses
and empirical verification. Therefore, this concept is characterised by its multidimensionality,
where score interpretation, test use, evidential basis and consequential
basis interact with one another for a comprehensive evaluation, as illustrated
in Table 2.2 (Messick 1988, p. 42).
Regarding test validation, Messick (1989b, 1995, 1996) also suggests that
evidence from six distinguishable aspects should be collected in order to verify the
overall validity. These six aspects are “content, substantive, structural, generalis-
ability, external and consequential aspects of construct validity” (p. 248).
When evidence is collected for further verification, one way is to structure all the
evidence in the form of arguments (Cronbach 1980, 1988; House 1980) because
validity arguments provide a comprehensive evaluation of the intended interpre-
tation and uses of test scores (Cronbach 1988). Following Cronbach and Messick,
Kane (1990, 1992, 2001, 2002, 2004, 2006, 2010), Kane et al. (1999) develop an
interpretive framework of arguments to provide guidance for justifying interpre-
tations of test scores and use. Later, Mislevy (2003), Mislevy et al. (2002, 2003)
propose an evidence-centred design (ECD), at the heart of which is what is referred
to as an evidentiary argument. In recent developments, the argument-based approach
remains prevalent in test validation (e.g. Bachman and Palmer 2010; Chapelle et al.
2008, 2010; Xi 2010). Therefore, the following part will review and critique the
most representative argument-based framework, AUA, and justify why
the present study will still employ the unitary notion of test validity instead of
resorting to this newly established framework.
Bachman (2005) first puts forward AUA, and Bachman and Palmer (2010) later
revise and enrich the framework with a number of tests in real-life settings. As
aforementioned, this framework is attracting an increasing number of test validation
studies, so a review of its essence becomes necessary in the present study. What,
then, is the essence of AUA? In fact, any argument-based framework lays its
foundation on a base argument whose structure makes explicit the reasoning logic
employed to justify the plausibility of the conclusion or claim. AUA is no
exception. Therefore, the structure of the base argument is of crucial importance; a
minor modification may divert the general direction of reasoning, thus resulting in
utterly different outcomes. Since AUA bases its argument structure on the Toulmin
model, it is necessary to obtain a full understanding of the Toulmin argument
structure and its reasoning logic before a critique of the framework can be made.
Toulmin does not explicitly put forward the notion of “the Toulmin model” himself,
but rather regards it as “one of the unforeseen by-products of the uses of argument”
(Toulmin 2003, p. viii). Toulmin’s aim in writing the book is strictly philosophical:
to criticise the syllogism, or demonstrative deductions in general. His major
viewpoint is that the form of the syllogism is simplistic and ambiguous, with no
practical use in daily arguments. To do justice to the situation, Toulmin builds up a
pattern of
argument analysis. This pattern can be illustrated with the typical example of the
Toulmin model (see Fig. 2.7): by appealing to the datum (D)—“Harry was born in
Bermuda”, one can make a claim (C) about Harry’s nationality—“So, presumably,
Harry is a British subject”. The step from the datum to the claim is guaranteed by the
implicit warrant—“A man born in Bermuda will generally be a British subject”,
which is an inference drawn on the British Nationality Acts, and whose authority
relies on its backing which makes an account of the British statutes and other legal
provisions. Considering the potential exceptional conditions, such as “Both Harry’s
parents may be aliens” and “Harry might have changed his nationality since birth”, a
qualifier—“presumably” is included to indicate a tentative modality in the claim.
This is clearly a judgmental reasoning process. According to Toulmin (2003),
the rationality of a logical or practical argument is guaranteed by “‘Data such as D
entitle one to draw conclusions, or make claims, such as C’, or alternatively ‘Given
data D, one may take it that C’” (p. 91). In other words, the data “on which the
claim is based” (p. 90) should reveal sufficient bearing of warrants, which are
“general, hypothetical statements, which can act as bridges, and authorise the sort of
step to which our particular argument commits us” (p. 91). Meanwhile, the warrants
should be further supported by the backing, which Toulmin defines as “straightforward
matters-of-fact” (p. 96) to provide other assurances for the reasoning.
Fig. 2.7 The Toulmin argument pattern: D (Harry was born in Bermuda), so, Q
(presumably), C (Harry is a British subject); since W, unless R; on account of B
(the following statutes and other legal provisions: ……)
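The Toulmin pattern described above can be sketched as a small data structure. This is only an illustrative rendering of the pattern, not anything proposed by Toulmin or Bachman; the class and field names simply follow the terminology of the figure (data, claim, warrant, backing, qualifier, rebuttal).

```python
from dataclasses import dataclass, field

@dataclass
class ToulminArgument:
    """Toulmin's pattern: Data, so (Qualifier) Claim; since Warrant,
    on account of Backing; unless one of the Rebuttals applies."""
    data: str
    claim: str
    warrant: str
    backing: str
    qualifier: str = "presumably"          # what keeps the claim tentative
    rebuttals: list = field(default_factory=list)

    def render(self) -> str:
        # Lay the argument out in Toulmin's own connective order.
        line = f"{self.data}; so, {self.qualifier}, {self.claim}"
        line += f" (since {self.warrant}, on account of {self.backing}"
        if self.rebuttals:
            line += "; unless " + " or ".join(self.rebuttals)
        return line + ")"

harry = ToulminArgument(
    data="Harry was born in Bermuda",
    claim="Harry is a British subject",
    warrant="a man born in Bermuda will generally be a British subject",
    backing="the British Nationality Acts",
    rebuttals=["both Harry's parents were aliens",
               "Harry has changed his nationality since birth"],
)
print(harry.render())
```

Note that the qualifier is a mandatory field with a default, whereas the rebuttals are merely listed, never verified; this mirrors the point made below about what AUA's modifications remove and retain.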
When applying the Toulmin model to build up the framework, Bachman (2005)
makes a few changes to the basic structure of the model: (1) the Q element has been
removed; (2) the rebuttal remains at its original position, but a new component,
rebuttal data, has been added to justify the rebuttal—to “support, weaken or reject
the alternative explanation” (Bachman 2005, p. 10); and (3) Bachman and Palmer
(2010) change rebuttal data into rebuttal backing.
As can be seen in Fig. 2.8, all changes are targeted at the elements that Toulmin
employs to attack the syllogism: the qualifier is gone, while the rebuttal is reinforced.
This is somewhat contrary to Toulmin’s intention. The qualifier is what makes a
Toulmin argument, without which the Toulmin claim is reduced back to a syllogistic
one: either yes or no, all or none; and without which there is no need to
consider the rebuttals in the first place. On the other hand, the rebuttals, though
negative to the claim, are generally exceptional and rare; they need to be considered
for the claim to be plausible, yet have to be set aside if any claim is to be made at
all. However, in the modified versions, the qualifier is nowhere to be found,
whereas the rebuttal is retained.
Given the above, AUA is not entirely consistent with Toulmin’s argument model,
especially in terms of its base argument. What, then, is its reasoning logic? An
analysis of the roles of the rebuttal and the rebuttal backing will help to reach some
insights. As mentioned in the earlier discussion of substantial and analytical
arguments, the backing consists of straightforward matters-of-fact or truths; when
factual backing is used to guarantee a claim, or a hypothetical statement for that
matter, no reasoning is involved and no argument is necessary.
However, Bachman and Palmer (2010) change Rebuttal Data (the Rebuttal
within a frame) in Fig. 2.8 into Rebuttal Backing. Therefore, as long as the rebuttal
is to be verified within the reasoning process from the data to the claim, the
reasoning process is undermined. As long as the rebuttal cannot be ignored, the
claim is hardly convincing, or even predictable. In that case, the whole reasoning
process falls into an infinite regress.
The example in Fig. 2.9 illustrates how the rebuttal is supported by the
rebuttal backing and the claim is thus rejected (Bachman and Palmer 2010). Based
on the data Jim is going to the hospital, the claim, Jim is sick, is to be made (no
claim yet); although the warrant, People often go to the hospital when they are sick,
should provide enough guarantee to make the claim, we must check whether the
rebuttal Jim could be visiting someone who is in the hospital, is true or not; it is true
that Jim is visiting his partner in the hospital, so Jim is not sick.
Fig. 2.9 Structure of example practical argument (Bachman and Palmer 2010,
p. 97): Data (Jim is going to the hospital), so Claim (Jim is sick); unless Rebuttal
(Jim could be visiting someone who is in the hospital), supported by Rebuttal
Backing (Jim is visiting his partner in the hospital), which yields the Counterclaim
(Jim is not sick)
However, the above reasoning reduces to: Jim is going to the hospital, so Jim is not
sick. This does not seem to be the result of Toulmin reasoning. If assembled in
the form of a Toulmin argument, the reasoning should be as follows.
A: Jim is going to the hospital (SINCE people often go to the hospital when they are sick,
UNLESS they are going to the hospital for some other reasons,) SO PRESUMABLY Jim is
sick.
B: Jim is visiting his partner in the hospital (SINCE we can take it that people are not sick
themselves when they are visiting someone in the hospital, UNLESS they are indeed sick
themselves,) SO PROBABLY Jim is not sick.
This is how arguments are supposed to be settled. As can be seen, each side has
its own claim; each claim is justified with a separate reasoning process; each
process is guaranteed with its own warrant. Most importantly, both sides take the
rebuttal into consideration, but neither tries to verify the rebuttal within the same
reasoning process; instead, a proper qualifier is included.
If there is a need to reason by the logic of AUA, the rebuttal has to be verified as
well. As can be seen in Fig. 2.9, even if Jim is visiting his partner, he may still be
sick himself. If this rebuttal needs to be verified, chances are that the validation will
fall into an endless paradoxical cycle. In other words, before any claim is made, the
rebuttals must be verified first. As a consequence, another verification process is
embedded in the current one, so that, in terms of model construction, the model
always contains “a self” within itself.
In brief, although argument-based validation can be viewed as a step
forward in comparison with the unitary concept of test validity, caution must be
exercised in applying it to validate the proposed rating scale in question. In particular, it
needs further exploration as to how to embed all sorts of validity arguments into a
coherent and significantly sufficient argument with construct validity as the core.
Therefore, in terms of validation for the rating scale with nonverbal delivery
embedded, the present study will still refer to a unitary notion of validity.
The previous section reviews the evolution of validity in language testing and
justifies the application of a unitary concept as the theoretical base of validation for
the rating scale to be proposed in the present study. Then, when it comes to the
validation of rating scales, it is still felt necessary to review how rating scales can be
validated.
With regard to the facets of rating scale validity, Knoch (2009) tailors Bachman
and Palmer’s (1996) framework of test usefulness and excludes the facet of
interactiveness, as it is not an integral tenet that necessarily applies to
rating scale validation. In addition, Knoch’s (2009) revised framework
emphasises the role of the construct validity of a rating scale and puts forward three
criteria for validity evaluation as follows.
The scale provides the intended assessment outcome appropriate to purpose and context
and the raters perceive the scale as representing the construct adequately…The trait scales
successfully discriminate between test takers and the raters report that the scale is func-
tioning adequately…The rating scale descriptors reflect current applied linguistics theory as
well as research. (p. 65)
To briefly interpret the above criteria, in validating a rating scale, three aspects
should be taken into account: (1) the extent to which a rating scale reflects the
construct; (2) the extent to which a rating scale discriminates candidates across
various proficiency levels; and (3) the extent to which a rating scale manifests a
selected theory. Therefore, at the phase of rating scale validation, these three criteria
serve as the guidelines in constructing the phase-specific research questions.
In terms of rating scale validation methods, both quantitative and qualitative
methods are well documented. A majority of previous studies employ quantitative
methods to validate a rating scale. Because a rating scale with explicitly defined
categories facilitates consistent rating, a few studies use multifaceted Rasch
measurement to examine whether differences between score categories are clear
(Bonk and Ockey 2003; McNamara 1996) or to examine other factors impacting
scoring results (Lumley and O’Sullivan 2005; O’Loughlin 2002). Besides, multidimensional scaling has also
been applied to the scale development for different tests and rater groups
(Chalhoub-Deville 1995; Kim 2009). More robust statistical methods, such as an
MTMM approach and differential item functioning analysis, have been used for the
validation of classroom assessment (Llosa 2007), of speaking tests (Kim 2001) or of a
rating scale (Yamashiro 2002). It might be found that, with the ever-growing
involvement of statistical tools in the language assessment community, an
increasing number of sophisticated statistical methods have been applied to, and
have enriched, the study of rating and rating scales.
2.5 Rating Scale Evaluation and Validation 63
On the other hand, qualitative methods are also increasingly employed in test
validation studies (Lazaraton 2008), including speaking assessment validation (e.g.
Lazaraton 1992, 2002, 2008). Commonly adopted methods can be rater verbal
protocols and analysis of test discourse (e.g. Brown et al. 2005; Cumming et al.
2006). By aligning the rater verbal protocol with the descriptors stipulated in the
rating scale, researchers are able to validate a rating scale supposedly reflective of
the underlying construct a particular test intends to elicit. More elaborations will be
made on qualitative approaches to test validation in the last section of this chapter.
In order to obtain more sources for the validation of the rating scale, both
quantitative and qualitative methodologies will be employed in the present study.
On the quantitative side, as the rating scale to be proposed touches upon formative
assessment with a consideration of embedding nonverbal delivery as an assessment
dimension, different traits from candidates’ performances as reflected in their group
discussions can be measured via different methods, such as teacher-rating and
peer-rating; therefore, an MTMM approach will be adopted, which is rather suitable
and powerful in addressing the extent to which different measures or methods that
assess one given construct are substantially correlated among themselves. As for the
qualitative side, since the main argument for validating the proposed rating scale is
to validate the dimension of nonverbal delivery, an MDA approach will be used.
Further justifications will be made after light is shed on the related qualitative
approaches to assessment validation.
MTMM is first introduced by Campbell and Fiske (1959), who direct the attention of
construct validity research to the extent to which data exhibit evidence in
three areas, or meet three requirements. The first concern is convergent validity
(CV), referring to the extent to which different assessment methods concur in their
measurement of the same trait; these correlations are expected to be moderately high
if construct validity is to be supported. The second concern is discriminant validity (DV),
indicating the extent to which independent assessment methods diverge in their
assessment of different traits. Contrary to the requirement for CV, the values for DV
should demonstrate minimal convergence. The last consideration is method effects
(MEs), deemed an extension of DV. MEs represent bias that could derive
from using the same method in the assessment of different traits; correlations among
traits measured by the same method would typically be higher than those among
traits measured by different methods.
The original MTMM design (Campbell and Fiske 1959) receives criticism
because more external, multiple and quantifiable criteria are expected to be
incorporated into the model (e.g. Marsh 1988, 1989; Schmitt and Stults 1986).
Widaman (1985) also notes that the original MTMM design fails to explicitly state
the requirement of uncorrelated methods. In response to these criticisms, Widaman
(1985) proposes an approach of nested-model comparisons, where a baseline model
is first specified to be compared with other
Against the above, it can be argued that, without probing into the de facto
assessment processes, especially if candidates’ performance is not investigated
analytically with a qualitative approach, a full picture of whether what is tested
conforms to what is intended to be tested will never be obtained. Therefore, for
triangulation it is essential to apply a qualitative approach to validating the rating
scale to be proposed.
As far as rating scale validation is concerned, there are mainly two prevailing
qualitative methods: verbal protocol analysis (VPA) and the discourse-based approach,
particularly conversation analysis (CA). They are either adopted singly for rating or
test validation, or orchestrated with other quantitative methods to triangulate
research findings. The ensuing part is devoted to a review of both methods, followed
by the details of MDA, so that further justifications, in addition to what was
previously argued in the section on nonverbal delivery, can be made for adopting
information that is (or has been) attended to as a particular task is (or has been)
carried out” (pp. 1–2).
A survey of the terrain where VPA is empirically adopted reveals its overwhelming
popularity among researchers working on writing assessment rating. For
instance, Cumming (1990) uses VPA to compare experienced and novice raters in
their judgments on the criterion range of analytic assessment; Cumming et al.
(2001, 2002) also examine the criteria extracted from the VPA data to come up with
the general categories for essay evaluation. Similarly, this method is also employed
by other studies in either describing rating process or comparing raters with various
extraneous variables or characteristics (e.g. Connor and Carrel 1993; Erdosy 2004;
Lumley 2002, 2005; Milanovic et al. 1996; Smith 2000; Vaughan 1991; Weigle
1994, 1999; Wolfe 1997; Wolfe et al. 1998).
However, applying VPA to speaking assessment rating seems to be underex-
plored. One of the few studies in spoken language assessment using that method is
conducted by Brown et al. (2005). In their study, they use VPA to investigate rater
orientation in the context of academic English assessment. The study finds that
expert EAP teachers generally assess test-takers’ vocabulary skills and frequently
comment on the adequacy of their vocabulary for a particular purpose. Ducasse and
Brown (2009), also using VPA, find that teacher-raters can identify three interaction
parameters in assessing paired oral communication, which yields implications for a
fuller understanding of the construct of effective interaction.
It has to be admitted that using VPA enables researchers to validate a rating scale
in terms of the extent to which raters score the candidates’ products in line with
what is stipulated in the descriptors of a rating scale. In other words, it can mainly
enhance the degree of scoring validity. However, when it comes to the construct
validation of a rating scale, this method seems less powerful, because the data
elicited from rater verbal protocols are unlikely to cover the whole thinking
process; thus, VPA may yield only an incomplete record of the rater’s mind
(Barkaoui 2011). When evaluating this method, Green (1998), Lumley
and Brown (2005) also point out a few drawbacks of VPA. Besides its conspicuous
disadvantage of time consumption, this method might also result in individual
differences in the sense that respondents might either produce long or short reports
of their mental processing. If not enough due attention is paid to the wording of
verbal report elicitation, respondents’ verbal reports might also be disrupted as they
could be somewhat coerced to “keep talking” (Ericsson and Simon 1993).
Coupled with the above drawbacks, most of the studies outlined above justify
their choice of VPA merely on the grounds that previous studies on rating in
writing assessment also rely heavily on this method. Considering the
applicability of VPA to the present study, whose focus differs significantly from
writing assessment, and given the practicality issue that VPA in the context of
oral assessment might consume even more time than in that of writing, this
method has to be discarded.
2.5 Rating Scale Evaluation and Validation

¹ For detailed descriptions of turn, refer to Sacks (1992), Sacks et al. (1974) and Oreström (1983).
that both rating scales share the weakness that the continuous development of
language acquisition, which is supposed to be reflected in rating scales, is
lacking. Another large-scale application of CA is conducted by Lazaraton
(1991, 1992, 1995, 1996a, b) on a series of Cambridge EFL examinations, covering
both the interview conversation structure of spoken language assessment and
interlocutor/candidate behaviours. In these studies, she not only aligns
candidates' responses with possible communicative functions to see whether the
tests really elicit the intended construct, but also profiles the role of the
interlocutor in certain assessment settings.
Therefore, CA clearly serves as a necessary and reasonable complement to the
validation of language tests. Psathas (1995) evaluates CA as "an approach and a
method for studying social interaction, utilisable for a wide, unspecified phenom-
ena… it is a method that can be taught and learned, that can be demonstrated and
that has achieved reproducible results" (p. 67). However, largely due to its
constraint of being applicable only to small-scale data, one of the criticisms
against CA is that the analytic methodology itself and the descriptive categories
it adopts might be too vaguely defined to be usable and replicable in studies of a
similar nature (Brown and Yule 1983; Cortazzi 1993; Eggins and Slade 1997;
Wolfson 1989). On the other hand, since CA is a method that involves much
training and practice, most researchers have to spend more time familiarising
themselves with the transcription conventions than transcribing the data (Hopper
et al. 1986). On top of that, Schiffrin (1994) and Levinson (1983) also note that
CA seems less capable of bridging the gap between language form and language
function.
Having critiqued the above, this part calls for an awareness that although
CA is conducive to tracking speakers' utterances on a turn-by-turn basis, it is
not equally powerful or explanatory in synchronising what happens
non-linguistically with what is uttered verbally. The section reviewing rating
scales has already reiterated that a majority of prevailing rating scales do not
assess candidates' nonverbal delivery. If all meaning-making resources need to
be probed into, CA seems a dispreferred option. This is because, although it
might be argued that nonverbal delivery could still be transcribed using a
"second-line" (Lazaraton 2002, p. 71), this method can neither align verbal
delivery with nonverbal channels on a large scale, nor can it analyse
interactions among different nonverbal channels, such as eye contact, gesture
and head movement, as previously reviewed. Therefore, applying CA to the present
study seems beyond its strength. In order to find a method that is able to
scrutinise more meaning-generation resources, this study turns to an emerging
discourse-based approach: MDA.
Before unfolding what MDA can offer, two key concepts need to be clarified in
foregrounding the notion. Since different approaches to MDA will be shed light
on later, the definitions of these key concepts could also be slightly different.
Defining MDA in a stratum-by-stratum manner, Stöckl (2004) views the multimodal
as "communicative artefacts and processes which combine various sign systems
(modes) and whose production and reception calls upon the communicators to
semantically and formally interrelate all sign repertoires present" (p. 9). Then,
what is the point of mediating the multimodal with discourse analysis? The main
reason is that quite a portion of meaning is conveyed through nonverbal channels.
In that case, communication should not be understood as a process realised only
by one particular sensory organ. Therefore, the discourse elicited in such
settings is multimodal discourse (Zhang 2009).
The stratum of multimodal discourse naturally extends to the method with which
multimodal discourse is examined, viz. MDA. Jewitt (2006) thinks that MDA is a
perspective from which discourse is analysed when all the communicative modes
are deemed as meaning-making resources and that it depicts an approach that
“understand[s] communication and representation to be more than language, and
which attend to the full range of communicational forms people use—image,
gesture, gaze, posture, and so on—and the relationships between them” (Jewitt
2009, p. 14). O’Halloran (2011), in a similar vein, defines MDA as “[extending] the
study of language per se to the study of language in combination with other
70 2 Literature Review
resources, such as image, scientific symbolism, gesture, action, music and sound”
(p. 120).
Having noted that MDA also looks at meaning-making resources other than
verbal language alone, this section maps out the terrains that this particular method
can cover. Simpson (2003) points out six domains that MDA mainly focuses on:
(1) multimodality and new media; (2) application of multimodality in the academic
and educational context; (3) multimodality and literacy; (4) construction of multi-
modal corpora; (5) multimodality and typology; and (6) MDA and its rationale.
Baldry and Thibault (2006), however, posit six slightly different topics for MDA
research: (1) what is multimodal text; (2) how to transcribe and analyse such text;
(3) what technologies are needed to analyse multimodal texts and construct mul-
timodal corpora; (4) how meaning potential can be exponentially increased when
meaning-making resources from multimedia are applied to hypertext; (5) how to
relate language studies to multimodality and multimedia; and (6) to what extent
MDA can bring changes to linguistics. It can be felt that two things are shared
even though the above research domains vary slightly from each other. One is that
the ultimate purpose of MDA is to perceive all the meaning-making resources,
particularly those beyond the boundary of verbal language. The other is the trend
that MDA can be applied to large-scale research by means of corpus construction.
Bateman et al. (2004) and Bateman (2008) also believe that one of the multimodal
study foci is to formulate an analytical framework for dealing with multimodal data
in corpora. In fact, this domain is also foregrounded by the fact that previous
discourse-based analysis methods usually fail to quantitatively account for and
generalise research findings.
Nonetheless, even though MDA sets explicit directions for research and further
development, there are still different approaches to, or streams of, MDA, as
foreshadowed above. In order to select a suitable approach for this study and
maintain a consistent line of analysis, the following part introduces these
approaches, reviews how they are applied in studies related to Chinese EFL
learners, and then justifies the selection of an MDA approach for this study.
Approaches to Multimodality
Broadly divided, there are two approaches to MDA with different theoretical
underpinnings. One approach lays its foundation on Halliday's (1978, 1985)
social semiotic approach to language studies, in which all potential meanings are
structured and construed in sets of interrelated systems. Therefore, this stream
is usually known as systemic functional multimodal discourse analysis (SF-MDA),
whose bases are established by the works of Kress and van Leeuwen (1996, 1998,
2001, 2002, 2006; Kress et al. 2001, 2005; van Leeuwen 1999, 2001), O'Toole
(1994, 2010), Baldry and O'Halloran (2005, 2008a, 2011) and so forth. The other
stream of MDA, whose rationale can be traced back to activity theory (Engeström
1987; Daniels 2001) (AT-MDA), draws upon interactional sociolinguistics and
intercultural communication. That stream includes mediated discourse theory
(MDT) (Norris 2002, 2004; Norris and Jones 2005; Scollon 2001; Scollon and
Scollon 2004) and situated discourse analysis (SDA) (Gu 2006a, b, 2007, 2009).
SF-MDA
One of the main reasons why SF-MDA emerges and develops exponentially is that
its underpinnings can be directly borrowed from systemic functional linguistics
(SFL). Specifically, SF-MDA absorbs the notions of language as social semiotic
and of meaning potential, and extends the boundary of meaning-making resources.
In addition, with reference to metafunctional meanings, SF-MDA also holds that
multimodal discourse is multifunctional in that discourse is embedded with
ideational, interpersonal and textual meanings. SF-MDA also develops the theory
concerning register and associates the interpretation of discourse with its
particular context. All these features provide SF-MDA with a fitting platform on
which all the SFL-related theories, without any further modification, can
immediately serve as its strong support.
Within the scenario of SF-MDA, most studies concentrate on the analyses and
interpretations of the pictorial system, especially within a framework of
analysing visual text and its communicative meaning (Kress and van Leeuwen 1996,
2006). Congruent with ideational, interpersonal and textual meanings in SFL
studies, this framework describes meanings as not only representational (the
representation of entities, physical or semiotic), but also interactive (images
constructing the nature of relations between viewers and what is viewed) and
compositional (the distribution of information value or the relative emphasis
among elements of the image). Therefore, how images convey meanings also
conforms to certain grammatical rules, which are beyond the conventional sense
of grammar in linguistics. In their follow-up work, having noted that a drawback
of their framework lies in the isolated grammar for each individual modality,
Kress and van Leeuwen (2001) draw attention to perceiving all the modalities in
a coherent context. Their broader framework is supposed to identify four strata
of meaning making in any communicative practice: discourse, design, production
and distribution.
Other representative researchers also mainly train the lens of MDA on images.
For instance, O'Toole (1994, 2010) applies a visual arts grammar to the analyses
of paintings and architecture and arrives at similar terms regarding meaning
making: representational meaning, modal meaning and compositional meaning.
Likewise, SF-MDA is also tailored to study other semiotic resources, including
visual images (Kress and van Leeuwen 2006; O'Halloran 2008b), mathematical
symbols (O'Halloran 2005), movement and gesture (Martinec 2000b, 2001, 2004),
video texts and Internet sites (Djonov 2006; Iedema 2001; Lemke 2002; O'Halloran
2004) and three-dimensional sites (Ravelli 2000) as well.
The above research on SF-MDA frameworks indicates that this stream does have
much to offer, especially when meaning-making resources other than verbal
language are probed into. However, in reviewing SF-MDA, Jewitt (2009) points out
that this stream is not without flaws. It might dawn upon this review that most
of the analyses of images, symbols and other resources, if not all, are rather
impressionistic. In other words, if perceived by different researchers with
varied cultural or educational backgrounds, the interpretations might diverge to
a certain extent. The reason might be that although SF-MDA has linked the
signifier with the signified to a great extent, the way their relevance is
interpreted is still based on subjective perceptions.
Another limitation pointed out by Jewitt (2009) is that "MDA is a kind of 'lin-
guistic imperialism' that imports and imposes linguistic terms on everything"
(p. 26). However, this limitation can be justified, as most SF-MDA studies are
undertaken within the field of linguistics. If MDA is intended to interpret a
language system, there should be no "linguistic imperialism" to speak of. It
might also be controversial that SF-MDA is only concerned with static discourse,
such as images and architecture, and how they convey meanings through different
channels. Nevertheless, this flaw can again be defended by the fact that even
though most SF-MDA studies focus on static discourses, it does not necessarily
follow that the approach would be powerless in dealing with dynamic discourses,
such as situated discourses embodying human actions. This can be supported by
Hood's (2007, 2010, 2011) studies, in which an SF-MDA approach is adopted to
present a multimodal analysis of a poet's performance and of the role of body
language in face-to-face teaching. Therefore, although this flaw exists, it
might be caused by the lower profile of SF-MDA research on dynamic discourse
rather than by the powerlessness of the approach per se.
AT-MDA
In addition to applying SFL to studying various modalities, a host of researchers are
also interested in basing their MDA studies on activity theory.
By integrating sociolinguistics, ethnolinguistics and intercultural
communication, Scollon (2001; Scollon and Scollon 2003) proposes MDT, which
integrates social activity with discourse. This is a step forward in that
previous discourse analysis studies usually neglect the significance of
activity, whereas sociological theories, in most cases, do not take discourse
into account either. Unlike a conventional sense of discourse analysis, which
treats a text or a genre as the unit of analysis, MDT mainly looks at mediated
action and "social actors as they are acting because these are the moments in
social life when the discourses in which we are interested are instantiated in
the social world as social action, not simply as material objects" (Scollon
2001, p. 3). According to Scollon (2001), any social actor conducts a mediated
action by means of material objects (including the actor's own dress, body and
so forth) in the material world. Based on Scollon's framework of AT-MDA, Norris
(2002, 2004) devises an MDA framework, where mediated action is still taken as
the unit of analysis. Her framework substantiates AT-MDA in the sense that she
further distinguishes mediated actions into low-level actions (e.g. a simple
gesture) and high-level actions (a series of concrete actions), and that the
framework quantifies the degree of complexity of high-level actions by ushering
in the notion of mode density (Norris and Jones 2005).
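The idea of quantifying a high-level action by its mode density can be illustrated with a minimal sketch. Everything below is a hypothetical assumption made for demonstration: the mode names, the weighting scheme and the function itself are illustrative, not Norris's own operationalisation.

```python
def modal_density(action_modes, weights=None):
    """Toy measure of mode density: the weighted number of
    communicative modes co-deployed in one high-level action.
    Modes without an explicit weight default to 1.0."""
    weights = weights or {}
    return sum(weights.get(mode, 1.0) for mode in action_modes)

# A hypothetical high-level action: answering a question while
# nodding and maintaining eye contact; speech is weighted as the
# more intense mode in this made-up scheme.
action = ["speech", "head_movement", "gaze"]
print(modal_density(action, weights={"speech": 2.0}))  # 4.0
```

On such a toy scale, an action drawing on more channels, or on more intensely deployed ones, would receive a higher density value.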
Fig. 2.10 Content and medium layers in agent-oriented modelling (Gu 2006a)

² Gu (2006b) uses the term agent-oriented modelling language (AOML), yet he later changes the term to agent-oriented modelling (AOM) because AOM perceives the modelling as a methodology, while AOML emphasises its relation with UML as the modelling metalanguage (Gu 2009).
An Integrated Evaluation
Having reviewed both approaches to MDA and how MDA is employed in the
Chinese EFL context, this part comes to an integrated evaluation and justifies
the approach adopted in the present study. What needs to be addressed first is
that there is no absolute distinction as to which approach is right or wrong.
Gu (2006b) expresses his concern with the foreseeable collaboration between
SF-MDA and AT-MDA, though both approaches have solid foundations in their own
right. Indeed, considering the ultimate research purpose and explanatory power,
the two approaches are not contradictory; their divergence only lies in
different perspectives of looking at multimodal discourses and meaning-making
resources. SF-MDA treats multimodal texts on the basis of social semiotics in
its fullest sense. By comparison, as AT-MDA focuses more on how discourse is
realised in a social activity context, it can be fully operationalised in
dynamic discourses.
This study adopts SF-MDA based on the following considerations. Regarding
the nature and aims of this study, which intends to design and validate a rating
scale embedding nonverbal delivery in speaking assessment, it should be noted
that nonverbal delivery will be looked into to a great extent. As critiqued
above, AT-MDA seems less explored in dealing with static discourse, while
SF-MDA can be applied to both static and dynamic discourses, though previous
studies have not rendered great concern for dynamic discourse. In that case, if
full use is made of SF-MDA to probe into static discourse and more of its
potential is tapped to analyse dynamic discourse, this study can not only
qualitatively analyse how candidates perform, but also benefit SF-MDA in terms
of its extended scope of applicability. It may be argued that both SF-MDA and
AT-MDA could be applied to the present study in an interwoven manner, as both
have their strengths in approaching different types of multimodal texts.
However, adopting SF-MDA does not necessarily mean that the two approaches are
irreconcilable; rather, the decision follows the principle of consistently
referring to the same framework and applying it to qualitatively validate the
rating scale to be proposed.
Static discourse here mainly refers to the transcription of candidates' verbal
language, while dynamic discourse takes a closer look at candidates' nonverbal
delivery. To be more specific, at the rating scale validation stage, when
candidates' performances are investigated for alignment with their analytic
scores and the descriptors regarding verbal utterances, all possible
meaning-making resources will be analysed with SF-MDA as the theoretical
framework. On the other hand, when how candidates perform and synchronise their
verbal language with nonverbal delivery is examined, SF-MDA will also be
referred to.
Apart from the consideration of discourse nature, another concern is that since
MDA will only be adopted in the qualitative stage of rating scale validation,
the randomly selected samples will not be that large in scale compared with
those analysed when MTMM, a quantitative approach, is utilised. Therefore, the
previously mentioned weakness of SF-MDA in reliably bridging the signifier and
the signified can be reduced to a minimum. Otherwise, if all the samples were to
be analysed with an SF-MDA approach, the analyses would wind up as an almost
endless inventory, giving rise to logistical issues jeopardising the
practicality or implementation of this study. Furthermore, as aforementioned,
AT-MDA demands a higher level of technological literacy, which might constrain
this study.
It could also be asked why, since nonverbal delivery can be probed into within
the paradigm of nonverbal communication studies, this study adopts SF-MDA as
the validation method for the rating scale to be proposed. Scollon and Scollon
(2009) also note the similarities between the current interest in multimodality
and the research in the field of nonverbal communication, as best represented by
the works of Pike (1967), Ruesch and Kees (1956) and Hall (1959). However, while
acknowledging that the work in nonverbal communication can inform multimodal
studies, they highlight that "it is not simply a return", as the crucial
difference is that "[n]o longer is language taken to be the model by which
these other phenomena are studied, but, rather, language itself is taken to be
equally grounded in human action with material means in specific earth-grounded
sites of engagement" (Scollon and Scollon 2009, p. 177).
Based on all the above considerations, this study employs an SF-MDA approach
in the qualitative validation of the rating scale to be proposed, and any
reference to MDA henceforth denotes SF-MDA. At this stage, however, what remains
unaddressed is how to apply the framework of MDA to operationalise the rating
scale validation. The next part will sketch out an operationalised framework
informed by MDA and provide a revised one drawn from Martinec's (2000b, 2001,
2004) and Hood's (2007, 2011) studies.
with a power function to convince other discussants of his/her own opinion with an
upward pointing index finger. In that manner, the three strata are comprehensively
probed into and candidates’ performance can be qualitatively aligned with the
rating scale descriptors and the subscores assigned by teacher and peer raters.
The above general framework provides a sketch of how candidates' nonverbal
delivery can be analysed from an MDA perspective. In order to particularise a
repertoire of nonverbal delivery channels and re-address the analysis framework
held back in the earlier review of nonverbal delivery, this study will mainly
refer to Martinec's (2000b, 2001, 2004) and Hood's (2007, 2011) studies in
qualitatively validating the rating scale to be proposed.
Fig. 2.13 The structure of Appraisal Theory (Martin and White 2005, p. 38)
Building on the work of McNeill (1992, 1998, 2000) and Enfield (2009) in
cognitive studies, as well as Kendon's (1980, 2004) research in psychology, Hood
(2007, 2011) takes an SFL perspective to investigate nonverbal delivery, with a
special focus on gestures. In terms of interpersonal meanings, Hood (2011),
informed by Appraisal Theory (Martin 1995, 2000; Martin and White 2005),
identifies gestures that embody attitude, engagement and graduation, as
illustrated in Fig. 2.13. Hood (2011) further argues that nonverbal channels,
such as gestures, can express feelings and values in attitude, can grade meaning
along various dimensions in graduation, and can expand or contract space for
others during interaction in engagement.
In Appraisal Theory, attitudes can instantiate a variety of interpersonal
meanings. However, considering the three main nonverbal channels in the present
study, a polarised set of values that broadly classifies attitudes as Positive
and Negative is proposed. This is because, unlike facial expression, eye
contact, gesture and head movement generally signify either a positive or a
negative attitude rather than affect, appreciation and judgment as outlined in
Fig. 2.13. For instance, a positive attitude can be embodied in an occurrence of
a head nod, while a negative attitude can be instantiated by the gesture of
crossing both hands in front of the chest when a candidate intends to interrupt
other speakers.
Graduation in interpersonal meaning is also elaborated by Hood (2004, 2006).
She notes that "by grading an objective (ideational) meaning the speaker gives a
subjective slant to the meaning, signalling for the meaning to be interpreted
evaluatively" (Hood 2011, p. 43). In line with Appraisal Theory, Hood (2011)
extends graduation as force to the meanings of intensity, size, quantity and
scope, and graduation as focus to specificity. Instead of addressing all these
aspects, this study will mainly look at the pace of different nonverbal delivery
occurrences, such as the frequency of head nods in an interval unit.
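As a rough illustration of how such pace might be counted, the sketch below groups timestamped annotations of nonverbal occurrences into fixed interval units. The channel labels, timings and interval length are all fabricated assumptions for demonstration, not data or coding conventions from this study.

```python
from collections import Counter

def pace_per_interval(events, interval=10.0):
    """Count occurrences of each nonverbal channel per fixed time
    interval -- a simple proxy for the pace of nonverbal delivery
    (e.g. head-nod frequency per interval unit)."""
    counts = Counter()
    for timestamp, channel in events:
        bucket = int(timestamp // interval)  # which interval the event falls in
        counts[(bucket, channel)] += 1
    return counts

# Hypothetical annotations: (seconds into the discussion, channel)
events = [(2.1, "head_nod"), (4.8, "head_nod"), (6.3, "gesture"),
          (12.0, "head_nod"), (15.5, "eye_contact"), (17.2, "head_nod")]
print(pace_per_interval(events))
# two head nods fall in each of the first two 10-second intervals
```

A candidate's nod pace could then be compared across discussion phases, or across candidates, by inspecting these per-interval counts.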
The third aspect of Appraisal Theory is engagement. Specific to gestures,
engagement is realised via the positioning of the hands to expand or contract
the negotiation space for other addressees. In describing interpersonal meanings
instantiated by teachers' gestures, Hood (2011) suggests an open palm or palms-up
2.6 Summary
Revolving around the three key phases of the present study, viz. (1) building an
argument for embedding nonverbal delivery into speaking assessment, and the
issues of (2) how to design and (3) how to validate a rating scale with such a
consideration informed by the argument, this chapter has reviewed the related
literature. The first section reviews the topical issue of this study: nonverbal
delivery. The next two sections address the issue of rating scale design, while
the last two sections pave the way for the concrete procedures of validating a
rating scale, especially the notion of validity and validation methods.
Specifically, the first section pinpoints the significance of nonverbal
delivery in communication and in a repertoire of research fields, and also
outlines previous studies on the three most representative channels of nonverbal
delivery. In that sense, the theoretical argument for incorporating nonverbal
delivery into speaking assessment can be felt to call for a corresponding
empirical argument.
In the second section, by comparing and contrasting the evolution of
communicative competence-related models, the review outlines their components
and respective strengths and weaknesses, justifies the employment of the CLA
model as the theoretical framework for the rating scale design, and points out
the quintessential role of nonverbal delivery in the CLA model. The third
section also responds to the issue of rating scale design. With a review of the
prevailing taxonomies of rating scales in language assessment and
exemplifications of a few existing rating scales used by major language testing
batteries, this section explicitly informs the formulation of the rating scale
with nonverbal delivery embedded as an assessment dimension. Moreover, by
highlighting the context where the rating scale is to be applied, the properties
that the rating scale is supposed to possess are also specified.
The fourth section is devoted to conceptualising validity and validation. An
overview is provided of the three evolutionary phases of validity in the
language assessment scenario, based on which this study justifies its adoption
of a unitary notion of validity with construct validity at the core. In terms of
validation methods, the last section argues for the necessity of using both
quantitative and qualitative methodologies in rating scale validation. MTMM is
reviewed so that a glimpse is offered of how this quantitative method will be
adopted to verify the construct validity of the rating scale, with
teacher-rating and peer-rating as different scoring methods and the different
subdimensions on the scale as traits. The last section also introduces MDA in
detail, ranging from its theoretical origin and different streams of research to
its application both worldwide and in the Chinese EFL context. The end of the
last section provides fine-grained frameworks informed by an MDA approach so
that the proposed rating scale can be validated qualitatively by aligning
candidates' nonverbal delivery performance with the corresponding rating scale
descriptors and the subscores they are assigned.
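To make the MTMM logic concrete, the following sketch computes the monotrait-heteromethod (convergent validity) correlations for a toy design with two traits (subdimensions) and two methods (teacher-rating and peer-rating). The subscores, candidate counts and trait names are fabricated for illustration only and bear no relation to the actual data of this study.

```python
import statistics

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical subscores for five candidates on two traits
# (subdimensions), each scored by two methods.
teacher = {"nonverbal delivery": [3, 4, 2, 5, 3], "fluency": [4, 4, 3, 5, 2]}
peer    = {"nonverbal delivery": [3, 5, 2, 4, 3], "fluency": [3, 4, 3, 5, 3]}

# Monotrait-heteromethod diagonal: high values indicate that the two
# scoring methods converge on the same trait.
for trait in teacher:
    print(f"{trait}: r = {pearson(teacher[trait], peer[trait]):.2f}")
```

In a full MTMM analysis, the heterotrait-monomethod and heterotrait-heteromethod correlations would also be inspected, so that these convergent coefficients can be checked against the discriminant ones.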
References
AERA, APA, and NCME. 1999. Standards for educational and psychological testing.
Washington, DC: American Educational Research Association.
Alderson, J.C. 1981. Report of the discussion on general language proficiency. In Issues in
language testing, ed. J.C. Alderson, and A. Hughes, 87–92. London: The British Council.
Alderson, J.C. 1991. Bands and scores. In Language testing in the 1990s, ed. J.C. Alderson, and
B. North, 71–86. London: Modern English Publications and the British Council.
Alderson, J.C. (ed.). 2002. Common European Framework of Reference for Languages: learning,
teaching, assessment: case studies. Strasbourg: Council of Europe.
Alderson, J.C. 2010. The Common European Framework of Reference for Language. Invited
seminar at Shanghai Jiao Tong University, Shanghai, China, Oct 2010.
Alderson, J.C., and J. Banerjee. 2002. Language testing and assessment (Part 2). Language
Teaching 35(2): 79–113.
Alderson, J.C., N. Figueras, H. Kuiper, and G. Nold. 2006. Analyzing tests of reading and
listening in relation to the Common European Framework of Reference: the experience of the
Dutch CEFR Construct Project. Language Assessment Quarterly 3(1): 3–30.
Alibali, M.W., L. Flevares, and S. Goldin-Meadow. 1997. Assessing knowledge conveyed in
gesture: do teachers have the upper hand? Journal of Educational Psychology 89: 183–193.
Allal, L., and L.M. Lopez. 2005. Formative assessment of learning: a review of publication in
French. In Formative assessment: improving learning in secondary classrooms, ed. J. Looney,
241–264. Paris: Organisation for Economic Cooperation and Development.
Anastasi, A. 1950. Some implications of cultural factors for test construction. New York:
Educational Testing Service.
Anastasi, A. 1954. Psychological testing. New York: Macmillan.
Anastasi, A. 1961. Psychological testing, 2nd ed. New York: Macmillan.
Anastasi, A. 1976. Psychological testing, 4th ed. New York: Macmillan.
Anastasi, A. 1982. What do intelligence tests measure? In On educational testing:
intelligence, performance standards, test anxiety, and latent traits, ed. S.B. Anderson,
and J.S. Helmick, 5–28. San Francisco: Jossey-Bass.
Angoff, W. 1988. Validity: an evolving concept. In Test validity, ed. H. Wainer, and H.I. Braun,
19–32. Hillsdale: Lawrence Erlbaum Associates.
APA. 1954. Technical recommendations for psychological tests and diagnostic techniques.
Psychological Bulletin Supplement 51(2): 1–38.
APA, AERA, and NCME. 1966. Standards for educational and psychological tests and manuals.
Washington, DC: American Psychological Association.
APA, AERA, and NCME. 1974. Standards for educational and psychological tests and manuals.
Washington, DC: American Psychological Association.
Applebee, A.N. 2000. Alternative models of writing development. In Perspectives on writing:
research, theory, practice, ed. R. Indrisano, and J.R. Squire, 90–111. Newark: International
Reading Association.
Argyle, M., and M. Cook. 1976. Gaze and mutual gaze. Cambridge: Cambridge University Press.
Bacha, N. 2001. Writing evaluation: what can analytic versus holistic essay scoring tell us? System
29: 371–383.
Bachman, L.F. 1988. Problems in examining the validity of the ACTFL oral proficiency interview.
Studies in Second Language Acquisition 10(2): 149–164.
Bachman, L.F. 1990. Fundamental considerations in language testing. Oxford: Oxford University
Press.
Bachman, L.F. 1991. What does language testing have to offer? TESOL Quarterly 25(4): 671–704.
Bachman, L.F. 2005. Building and supporting a case for test use. Language Assessment Quarterly
2(1): 1–34.
Bachman, L.F., and A.S. Palmer. 1981. The construct validation of the FSI oral interview.
Language Learning 31: 67–86.
Bachman, L.F., and A.S. Palmer. 1982. The construct validation of some components of
communicative proficiency. TESOL Quarterly 16(4): 449–465.
Bachman, L.F., and A.S. Palmer. 1989. The construct validation of self-ratings of communicative
language ability. Language Testing 6(4): 449–465.
Bachman, L.F., and A.S. Palmer. 1996. Language testing in practice: designing and developing
useful language tests. Oxford: Oxford University Press.
Bachman, L.F., and A.S. Palmer. 2010. Language assessment in practice: developing language
tests and justifying their use in the real world. Oxford: Oxford University Press.
Bachman, L.F., and S.J. Savignon. 1986. The evaluation of communicative language proficiency:
a critique of the ACTFL oral interview. Modern Language Journal 70(3): 380–390.
Bachman, L.F., B.M. Lynch, and M. Mason. 1995. Investigating variability in tasks and rater
judgments in a performance test of foreign language speaking. Language Testing 12(2): 238–257.
Bae, J., and L.F. Bachman. 1998. A latent variable approach to listening and reading: testing
factorial invariance across two groups of children in the Korean/English two-way immersion
program. Language Testing 15(3): 380–414.
Baird, L.L. 1983. The search for communication skills. Educational Testing Service Research
Report, No. 83-14. Princeton: Educational Testing Service.
Baldry, A., and P. Thibault. 2006. Multimodal transcription and text analysis. London: Equinox.
Barakat, R.A. 1973. Arabic gestures. Journal of Popular Culture 6(4): 749–787.
Barkaoui, K. 2007. Rating scale impact on EFL essay marking: a mixed-method study. Assessing
Writing 12(2): 86–107.
Barkaoui, K. 2011. Think-aloud protocols in research on essay rating: an empirical study of their
veridicality and reactivity. Language Testing 28(1): 51–75.
Bateman, J.A. 2008. Multimodality and genre: a foundation for the systematic analysis of
multimodal documents. London: Palgrave Macmillan.
Bateman, J., J. Delin, and R. Henschel. 2004. Multimodality and empiricism: preparing for a
corpus-based approach to the study of multimodal meaning-making. In Perspectives on
multimodality, ed. E. Ventola, C. Charles, and M. Kaltenbacher, 65–88. Philadelphia: John
Benjamins.
Bateman, J.A., J. Delin, and R. Henschel. 2006. Mapping the multimodal genres of traditional and
electronic newspapers. In New directions in the analysis of multimodal discourse, ed.
T.D. Royce, and W.L. Bowcher, 147–172. Mahwah: Lawrence Erlbaum Associates.
Black, P., and D. Wiliam. 1998. Assessment and classroom learning. Assessment in Education
5(1): 7–74.
Black, P., and D. Wiliam. 2009. Developing the theory of formative assessment. Educational
Measurement, Evaluation and Accountability 21(1): 5–31.
Bloom, B.S., J.T. Hasting, and G.F. Madaus (eds.). 1971. Handbook of formative and summative
evaluation of student learning. New York: McGraw-Hill.
Bonk, W.J., and G.J. Ockey. 2003. A many-facet Rasch analysis of the second language group oral
discussion task. Language Testing 20(1): 89–110.
Bourne, J., and C. Jewitt. 2003. Orchestrating debate: a multimodal approach to the study of the
teaching of higher order literacy skills. Reading: Literacy and Language, UKRA, July, 64–72.
Brindley, G. 1986. The assessment of second language proficiency: issues and approaches.
Adelaide: National Curriculum Resource Centre.
Brindley, G. 1991. Defining language ability: the criteria for criteria. In Current developments in
language testing, ed. S. Anivan, 139–164. Singapore: Regional Language Centre.
Brindley, G. 2002. Issues in language assessment. In The Oxford handbook of applied linguistics,
ed. R.B. Kaplan, 459–470. Oxford: Oxford University Press.
Brookhart, S.M. 2004. Classroom assessment: tensions and intersection in theory and practice.
Teachers College Record 106(3): 429–458.
Brookhart, S.M. 2007. Expanding views about formative classroom assessment: a review of the
literature. In Formative classroom assessment: theory into practice, ed. J.H. McMillan, 43–62.
New York: Teachers College Press.
Brooks, L. 2009. Interacting in pairs in a test of oral proficiency: co-constructing a better
performance. Language Testing 26(3): 341–366.
References 89
Brown, A. 2003. Interviewer variation and the co-construction of speaking proficiency. Language
Testing 20(1): 1–25.
Brown, A., N. Iwashita, and T. McNamara. 2005. An examination of rater orientations and test
taker performance on English for academic purposes speaking tasks. TOEFL Monograph
Series, No. TOEFL-MS-29. Princeton: Educational Testing Service.
Brown, G., and G. Yule. 1983. Discourse analysis. Cambridge: Cambridge University Press.
Brown, J.D., and K.M. Bailey. 1984. A categorical instrument for scoring second language
writing skills. Language Learning 34(1): 21–42.
Brown, J.D., and T. Hudson. 1998. The alternatives in language assessment. TESOL Quarterly
32(4): 653–675.
Brumfit, C.J. 1984. Communicative methodology in language teaching: the roles of fluency and
accuracy. Cambridge: Cambridge University Press.
Brumfit, C.J., and K. Johnson. 1979. The communicative approach to language teaching. Oxford:
Oxford University Press.
Burgoon, J.K., and T. Saine. 1978. The unspoken dialogue: an introduction to nonverbal
communication. Boston: Houghton Mifflin Company.
Burgoon, J.K., D.A. Coker, and R.A. Coker. 1986. Communicative effects of gaze behavior: a test
of two contrasting explanations. Human Communication Research 12: 495–524.
Campbell, D.T., and D.W. Fiske. 1959. Convergent and discriminant validation by the multi-trait
multi-method matrix. Psychological Bulletin 56: 81–105.
Canale, M. 1983. From communicative competence to communicative language pedagogy. In
Language and communication, ed. J.C. Richards, and R.W. Schmidt, 2–27. London: Longman.
Canale, M., and M. Swain. 1980. Theoretical bases of communicative approaches to second
language teaching and testing. Applied Linguistics 1(1): 1–47.
Candlin, C.N. 1986. Explaining communicative competence: limits of testability? In Toward
communicative competence testing: proceedings of the second TOEFL invitational conference,
ed. C.W. Stansfield, 38–57. Princeton: Educational Testing Service.
Caple, H. 2008. Intermodal relations in image nuclear news stories. In Multimodal semiotics:
functional analysis in contexts of education, ed. L. Unsworth, 125–138. London: Continuum.
Carroll, J.B. 1961. The nature of the data, or how to choose a correlation coefficient. Psychometrika
26(4): 347–372.
Carroll, J.B. 1968. The psychology of language testing. In Language testing symposium: a
psycholinguistic perspective, ed. A. Davies, 46–69. London: Oxford University Press.
Celce-Murcia, M., Z. Dörnyei, and S. Thurrell. 1997. Direct approaches in L2 instruction: a
turning point in communicative language teaching? TESOL Quarterly 31(1): 141–152.
Cerrato, L. 2005. Linguistic functions of head nods. In Gothenburg papers in theoretical
linguistics 92: proceedings from 2nd Nordic conference on multi-modal communication, ed.
J. Allwood, and B. Dorriots, 137–152. Sweden: Gothenburg University.
Chafe, W. 1994. Discourse, consciousness, and time: the flow and displacement of conscious
experience in speaking and writing. Chicago: University of Chicago Press.
Chalhoub-Deville, M. 1995. Deriving oral assessment scales across different tests and rater groups.
Language Testing 12(1): 16–33.
Chapelle, C.A. 1998. Field independence: a source of language test variance? Language Testing
15(1): 62–82.
Chapelle, C.A. 1999. Validity in language assessment. Annual Review of Applied Linguistics 19:
254–272.
Chapelle, C.A., M.K. Enright, and J. Jamieson (eds.). 2008. Building a validity argument for the
Test of English as a Foreign Language. New York: Routledge.
Chapelle, C.A., M.K. Enright, and J. Jamieson. 2010. Does an argument-based approach to
validity make a difference? Educational Measurement: Issues and Practice 29(1): 3–13.
Charney, D. 1984. The validity of using holistic scoring to evaluate writing: a critical overview.
Research in the Teaching of English 18(1): 65–81.
Chen, R. 2008. Some words on writing a multimodal lesson ware for English teaching. Journal of
Fujian Education Institute 1: 75–77.
Chen, Y., and G. Huang. 2009. Multimodal construal of heteroglossia: evidence from language
textbooks. Computer Assisted Foreign Language Education 6: 35–41.
Chen, Y., and H. Wang. 2008. Ideational meaning of image and text-image relations. Journal of
Ningbo University (Education Edition) 1: 124–129.
Cheng, L. 2005. Changing language teaching through language testing: a washback study.
Cambridge: Cambridge University Press.
Chomsky, N. 1965. Aspects of the theory of syntax. Cambridge: MIT Press.
Cienki, A. 2008. Why study metaphor and gesture? In Metaphor and gesture, ed. A. Cienki, and
C. Müller, 5–26. Amsterdam: John Benjamins.
Cizek, G.J. 2010. An introduction to formative assessment: history, characteristics and challenges.
In Handbook of formative assessment, ed. H.L. Andrade, and G.J. Cizek, 3–17. New York:
Routledge.
Clark, J.L. 1985. Curriculum renewal in second language learning: an overview. Canadian
Modern Language Review 42(3): 342–360.
Clarkson, R., and M.T. Jensen. 1995. Assessing achievement in English for professional
employment programmes. In Language assessment in action, ed. G. Brindley, 165–194.
Sydney: National Centre for English Language Teaching and Research.
Cohen, A. 1994. Assessing language ability in the classroom, 2nd ed. Boston: Heinle and Heinle
Publishers.
Connor, U., and P.L. Carrel. 1993. The interpretation of the tasks by writers and readers in
holistically rated directed assessment of writing. In Reading in the composition classroom:
second language perspectives, ed. J.G. Carson, and I. Leki, 141–160. Boston: Heine & Heine.
Connor, U., and A. Mbaye. 2002. Discourse approaches to writing assessment. Annual Review of
Applied Linguistics 22: 263–278.
Cooper, C.R. 1977. Holistic evaluation of writing. In Evaluating writing: describing, measuring,
judging, ed. C.R. Cooper, and L. Odell, 3–31. Urbana: NCTE.
Corder, S.P. 1983. Strategies of communication. In Strategies in interlanguage communication,
ed. C. Færch, and G. Kasper, 15–19. London: Longman.
Cortazzi, M. 1993. Narrative analysis. London: Falmer Press.
Council of Europe. 2001. Common European framework of reference for languages: learning,
teaching, assessment. Cambridge: Cambridge University Press.
Cowie, B., and B. Bell. 1999. A model of formative assessment in science education. Assessment
in Education 6(1): 102–116.
Creider, C. 1977. Towards a description of East African gestures. Sign Language Studies 14: 1–20.
Cronbach, L.J. 1949. Essentials of psychological testing. New York: Harper & Row.
Cronbach, L.J. 1971. Test validation. In Educational measurement, 2nd ed, ed. R.L. Thorndike,
443–507. Washington, DC: American Council on Education.
Cronbach, L.J. 1980. Validity on parole: how can we go straight? In New directions for testing and
measurement: measuring achievement over a decade. Proceedings of the 1979 ETS invitational
conference, 99–108. San Francisco: Jossey-Bass.
Cronbach, L.J. 1988. Five perspectives on validity argument. In Test validity, ed. H. Wainer, and
H.I. Braun, 3–17. Hillsdale: Lawrence Erlbaum Associates.
Cronbach, L.J. 1989. Construct validation after thirty years. In Intelligence: measurement, theory,
and public policy, ed. R. Linn, 147–167. Urbana: University of Illinois Press.
Cronbach, L.J., and P.C. Meehl. 1955. Construct validity in psychological tests. Psychological
Bulletin 52(4): 281–302.
Cumming, A. 1990. Expertise in evaluating second language composition. Language Testing 7(1):
31–51.
Cumming, A. 2009. Language assessment in education: tests, curricula and teaching. Annual
Review of Applied Linguistics 29: 90–100.
Cumming, A., R. Kantor, and D.E. Powers. 2001. Scoring TOEFL essays and TOEFL 2000
prototype writing tasks: an investigation into raters’ decision making and development of a
preliminary analytic framework. TOEFL Monograph Series, No. TOEFL-MS-22. Princeton:
Educational Testing Service.
Cumming, A., R. Kantor, and D.E. Powers. 2002. Decision making while rating ESL/EFL writing
tasks: a descriptive framework. Modern Language Journal 86: 67–96.
Cumming, A., R. Kantor, K. Baba, U. Erdosy, K. Eouanzoui, and M. James. 2006. Analysis of
discourse features and verification of scoring levels for independent and integrated tasks for
the new TOEFL. Princeton: Educational Testing Service.
Cureton, E.E. 1950. Validity. In Educational measurement, ed. E.F. Lingquist, 621–694.
Washington, DC: American Council on Education.
Daly, A., and L. Unsworth. 2011. Analysis and comprehension of multimodal texts. Australian
Journal of Language and Literacy 34(1): 61–80.
Daniels, H. 2001. Vygotsky and pedagogy. London: Routledge.
Davidson, F., and B. Lynch. 2002. Testcraft: a teacher’s guide to writing and using language test
specifications. New Haven: Yale.
Davies, A., and P. LeMahieu. 2003. Assessment for learning: reconsidering portfolio and research
evidence. In Optimising new modes of assessment: in search of qualities and standards, ed.
M. Sergers, F. Dochy, and E. Cascallar, 141–169. Dordrecht: Kluwer Academic Publishers.
Davies, A., A. Brown, C. Elder, K. Hill, T. Lumley, and T. McNamara. 1999. Dictionary of
language testing. Cambridge: Cambridge University Press.
Davison, C. 2004. The contradictory culture of teacher-based assessment: ESL assessment
practices in Australian and Hong Kong secondary schools. Language Testing 21(3): 305–334.
de Jong, J.H.A.L. 1992. Assessment of language proficiency in the perspective of the 21st century.
AILA Review 9: 39–45.
Derewianka, B., and C. Coffin. 2008. Visual representations of time in history textbooks. In
Multimodal semiotics, ed. L. Unsworth, 187–200. London: Continuum.
Djonov, E.N. 2006. Analysing the organisation of information in websites: from hypermedia
design to systemic functional hypermedia discourse analysis. Unpublished Ph.D. thesis,
University of New South Wales, Australia.
Douglas, D., and J. Smith. 1997. Theoretical underpinnings of the Test of Spoken English revision
project. TOEFL Monograph Series, No. TOEFL-MS-9. Princeton: Educational Testing
Service.
Douglas, D. 2000. Assessing languages for specific purposes. Cambridge: Cambridge University
Press.
Ducasse, A.M., and A. Brown. 2009. Assessing paired orals: raters’ orientation to interaction.
Language Testing 26(3): 423–443.
Dwyer, C.A. 2000. Excerpt from validity: theory into practice. The Score 22(4): 6–7.
Ebel, R.L. 1961. Must all tests be valid? American Psychologist 16(10): 640–647.
Ebel, R.L., and D.A. Frisbie. 1991. Essentials of educational measurement, 5th ed. Englewood
Cliffs: Prentice-Hall.
Efron, D. 1941. Gesture, race and culture. The Hague: Mouton.
Egbert, M.M. 1998. Miscommunication in language proficiency interviews of first-year German
students: a comparison with natural conversation. In Talking and testing: discourse approaches
to the assessment of oral proficiency, ed. R. Young, and W. He, 147–172. Philadelphia: John
Benjamins.
Eggins, S., and D. Slade. 1997. Analysing casual conversation. London: Cassell.
Ekman, P., and W.V. Friesen. 1969. Nonverbal leakage and clues to deception. Psychiatry 32: 88–
106.
Ekman, P., and W.V. Friesen. 1974. Detecting deception from body or face. Journal of Personality
and Social Psychology 29: 288–298.
Ellsworth, P.C., and L.M. Ludwig. 1971. Visual behaviour in social interaction. Journal of
Communication 21(4): 375–403.
Enfield, N.J. 2009. The anatomy of meaning: speech, gesture, and composite utterances.
Cambridge: Cambridge University Press.
Engestrom, Y. 1987. Learning by expanding: an activity theoretical approach to developmental
research. Helsinki: Orienta-Konsultit Oy.
Erdosy, M.U. 2004. Exploring variability in judging writing ability in a second language: a study
of four experienced raters of ESL compositions. TOEFL Research Report, No. RR-03-17.
Princeton: Educational Testing Service.
Ericsson, K.A., and H. Simon. 1993. Protocol analysis. Cambridge: MIT Press.
Færch, C., and G. Kasper (eds.). 1983. Strategies in interlanguage communication. London:
Longman.
Færch, C., et al. 1984. Learner language and language learning. Philadelphia: Multilingual
Matters Ltd.
Feng, D. 2011. Visual space and ideology: a critical cognitive analysis of spatial orientations in
advertising. In Multimodal studies: exploring issues and domains, ed. K.L. O’Halloran, and
B.A. Smith, 55–75. London: Routledge.
Folland, D., and D. Robertson. 1976. Towards objectivity in group oral testing. ELT Journal 30(2):
156–167.
Fulcher, G. 1987. Tests of oral performance: the need for data-based criteria. ELT Journal 41(4):
287–291.
Fulcher, G. 1993. The construction and validation of rating scales for oral tests in English as a
foreign language. Unpublished Ph.D. thesis. University of Lancaster, UK.
Fulcher, G. 1996a. Does thick description lead to smart tests? A data-based approach to rating
scale construction. Language Testing 13(2): 208–238.
Fulcher, G. 1996b. Invalidating validity claims for the ACTFL oral rating scale. System 24(2):
163–172.
Fulcher, G. 1997. The testing of speaking in a second language. In Encyclopaedia of language and
education, vol. 7: Language testing and assessment, ed. C. Clapham, and D. Corson, 75–85.
New York: Springer.
Fulcher, G. 2003. Testing second language speaking. London: Longman/Pearson Education.
Fulcher, G. 2004. Deluded by artifices? The Common European Framework and harmonization.
Language Assessment Quarterly 1(4): 253–266.
Fulcher, G. 2010. Practical language testing. London: Hodder Education.
Fulcher, G., and F. Davidson. 2007. Language testing and assessment: an advanced resource
book. London: Routledge.
Fulcher, G., F. Davidson, and J. Kemp. 2011. Effective rating scale development for speaking
tests: performance decision trees. Language Testing 28(1): 5–29.
Galloway, V.B. 1987. From defining to developing proficiency: a look at the decisions. In Defining
and developing proficiency: guidelines, implementations, and concepts, ed. H. Byrnes, and
M. Canale, 25–73. Lincolnwood: National Textbook Company.
Garrett, H.E. 1947. Statistics in psychology and education, 3rd ed. New York: Longmans, Green
& Company.
Goldin-Meadow, S., and M.A. Singer. 2003. From children’s hands to adults’ ears: gesture’s role
in teaching and learning. Developmental Psychology 39: 509–520.
Goodwin, C., and J.C. Heritage. 1990. Conversation analysis. Annual Review of Anthropology 19:
283–307.
Goodwin, L.D. 1997. Changing conceptions of measurement validity. Journal of Nursing
Education 36: 102–107.
Goodwin, L.D. 2002. Changing conceptions of measurement validity: an update on the new
standards. Journal of Nursing Education 41: 100–106.
Goodwin, L.D., and N.L. Leech. 2003. The meaning of validity in the new standards for
educational and psychological testing: implications for measurement courses. Measurement
and Evaluation in Counseling and Development 36(3): 181–191.
Goulden, N.R. 1992. Theory and vocabulary for communication assessments. Communication
Education 41(3): 258–269.
Goulden, N.R. 1994. Relationship of analytic and holistic methods to raters’ scores for speeches.
The Journal of Research and Development in Education 27: 73–82.
Grant, L., and L. Ginther. 2000. Using computer-tagged linguistic features to describe L2 writing
differences. Journal of Second Language Writing 9: 123–145.
Green, A. 1998. Verbal protocol analysis in language testing research: a handbook. Cambridge:
Cambridge University Press.
Green, A. 2007. Washback to learning outcomes: a comparative study of IELTS preparation and
university pre-sessional language courses. Assessment in Education 14(1): 75–97.
Green, J.R. 1968. A gesture inventory for the teaching of Spanish. Philadelphia: Chilton Books.
Grierson, J. 1995. Classroom-based assessment in intensive English centres. In Language
assessment in action, ed. G. Brindley, 239–270. Sydney: National Centre for English Language
Teaching and Research.
Grootenboer, H. 2006. Treasuring the gaze: eye miniature portraits and the intimacy of vision. Art
Bulletin 88(3): 496–507.
Gu, Y. 2006a. Multimodal text analysis: a corpus linguistic approach to situated discourse. Text &
Talk 26(2): 127–167.
Gu, Y. 2006b. Agent-oriented modelling language, Part 1: modelling dynamic behaviour.
Proceedings of the 20th international CODATA conference, Beijing, 21–47. Beijing:
Information Centre, Chinese Academy of Social Sciences.
Gu, Y. 2007. Learning by multimedia and multimodality. In E-learning in China: Sino-UK
initiatives into policy, pedagogy and culture, ed. H. Spencer-Oatey, 37–56. Hong Kong: The
Hong Kong University Press.
Gu, Y. 2009. From real life situated discourse to video-stream data-mining: an argument for
agent-oriented modelling for multimodal corpus compilation. International Journal of Corpus
Linguistics 14(4): 433–466.
Guijarro, A.J.M., and M.J.P. Sanz. 2009. On interaction of image and verbal text in a picture book:
a multimodal and systemic functional study. In The world told and the world shown:
multisemiotic issues, ed. E. Ventola, and A.J.M. Guijarro, 107–123. Hampshire: Palgrave
Macmillan.
Guilford, J.P. 1946. New standards for test evaluation. Educational and Psychological
Measurement 6(3): 427–438.
Guion, R.M. 1977. Content validity: the source of my discontent. Applied Psychological
Measurement 1(1): 1–10.
Gulliksen, H. 1950. Theory of mental tests. Hillsdale: Lawrence Erlbaum Associates.
Guo, L. 2004. Multimodality in biology textbooks. In Multimodal discourse analysis:
systemic-functional perspectives, ed. K.L. O’Halloran, 196–219. London: Continuum.
Hale, G.A., D.A. Rock, and T. Jirele. 1989. Confirmatory factor analysis of the TOEFL. TOEFL
Research Report, No. RR-32. Princeton: Educational Testing Service.
Hall, E.T. 1959. The silent language. New York: Doubleday.
Halliday, M.A.K. 1973. Explorations in the functions of language. London: Edward Arnold.
Halliday, M.A.K. 1976. The form of a functional grammar. In Halliday: system and function in
language, ed. G. Kress, 101–135. Oxford: Oxford University Press.
Halliday, M.A.K. 1978. Language as social semiotic: the social interpretation of language and
meaning. London: Edward Arnold.
Halliday, M.A.K. 1985. An introduction to functional grammar. London: Arnold.
Halliday, M.A.K., and R. Hasan. 1976. Cohesion in English. London: Longman.
Halliday, M.A.K., and C.M.I.M. Matthiessen. 2004. An introduction to functional grammar, 3rd
ed. London: Edward Arnold.
Halliday, M.A.K., A. McIntosh, and P. Strevens. 1964. The linguistic sciences and language
teaching. Bloomington: Indiana University Press.
Hamp-Lyons, L. 1990. Second language writing: assessment issues. In Second language writing:
research insights for the classroom, ed. B. Kroll, 69–87. New York: Cambridge University
Press.
Hamp-Lyons, L. 1991. Scoring procedures for ESL contexts. In Assessing second language
writing in academic contexts, ed. L. Hamp-Lyons, 241–276. Norwood: Ablex.
Hamp-Lyons, L. 1997. Washback, impact and validity: ethical concerns. Language Testing 14(3):
295–303.
Hatch, E. 1978. Discourse analysis and second language acquisition. In Second language
acquisition: a book of readings, ed. E. Hatch, 401–435. Rowley: Newbury House.
Hattie, J., and H. Timperley. 2007. The power of feedback. Review of Educational Research 77(1):
81–112.
Hawkey, R. 2001. Towards a common scale to describe L2 writing performance. Cambridge
Research Notes 5: 9–13.
Hawkey, R., and F. Barker. 2004. Developing a common scale for the assessment of writing.
Assessing Writing 9(2): 122–159.
He, W. 1998. Answering questions in LPIs: a case study. In Talking and testing: discourse
approaches to the assessment of oral proficiency, ed. R. Young, and W. He, 101–116.
Philadelphia: John Benjamins.
Heath, C.C., and P. Luff. 2007. Gesture and institutional interaction: figuring bids in auctions of
fine art and antiques. Gesture 7(2): 215–240.
Hempel, C.G. 1965. Aspects of scientific explanation and other essays in the philosophy of
science. Glencoe: Free Press.
Henley, N.M. 1977. Body politics: power, sex, and nonverbal communication. Englewood Cliffs:
Prentice-Hall.
Henley, N.M., and S. Harmon. 1985. The nonverbal semantics of power and gender: a perceptual
study. In Power, dominance, and nonverbal behavior, ed. S.L. Ellyson, and J.F. Dovidio, 151–
164. New York: Springer.
Herman, J.L., and K. Choi. 2008. Formative assessment and the improvement of middle school
science learning: the role of teacher accuracy. CRESST Report 740. Los Angeles: National
Center for Research on Evaluation, Standards, and Student Testing.
Hess, E.H. 1975. The tell-tale eye: how your eyes reveal hidden thoughts and emotions. New
York: van Nostrand Reinhold.
Hilsdon, J. 1995. The group oral exam: advantages and limitations. In Language testing in the
1990s: the communicative legacy, ed. C. Alderson, and B. North, 189–197. Hertfordshire:
Prentice Hall International.
Hood, S. 2004. Managing attitude in undergraduate academic writing: a focus on the introductions
to research reports. In Analysing academic writing: contextualized frameworks, ed.
L.J. Ravelli, and R.A. Ellis, 24–44. London: Continuum.
Hood, S. 2006. The persuasive power of prosodies: radiating values in academic writing. Journal
of English for Academic Purposes 5(1): 37–49.
Hood, S.E. 2007. Gesture and meaning making in face-to-face teaching. Paper presented at the
Semiotic Margins Conference, University of Sydney.
Hood, S.E. 2010. Mimicking and mocking identities: the roles of language and body language in
Taylor Mali’s “Speak with conviction”. Invited seminar at the Hong Kong Polytechnic
University, 4 November 2010.
Hood, S.E. 2011. Body language in face-to-face teaching: a focus on textual and interpersonal
meaning. In Semiotic margins: meanings in multimodalities, ed. S. Dreyfus, S. Hood, and S.
Stenglin, 31–52. London: Continuum.
Hopper, R., S. Koch, and J. Mandelbaum. 1986. Conversation analysis methods. In Contemporary
issues in language and discourse processes, ed. D.G. Ellis, and W.A. Donohue, 169–186.
Hilldale: Lawrence Erlbaum Associates.
Hornik, J. 1987. The effect of touch and gaze upon compliance and interest of interviewees. The
Journal of Social Psychology 127: 681–683.
House, E.R. 1980. Evaluating with validity. Beverly Hills: Sage Publications.
Hu, L.T., and P.M. Bentler. 1999. Cutoff criteria for fit indexes in covariance structure analysis:
conventional criteria versus new alternatives. Structural Equation Modelling: A
Multidisciplinary Journal 6: 1–55.
Hu, Z., and J. Dong. 2006. How meaning is construed multimodally: a case study of a PowerPoint
presentation contest. Computer Assisted Foreign Language Education 3: 3–12.
Huerta-Macias, A. 1995. Alternative assessment: responses to commonly asked questions. TESOL
Journal 5(1): 8–11.
Hughes, A. 2003. Testing for language teachers, 2nd ed. Cambridge: Cambridge University Press.
Hulstijn, J.H. 2007. The shaky ground beneath the CEFR: quantitative and qualitative dimensions
of language proficiency. The Modern Language Journal 91(4): 663–667.
Hulstijn, J.H. 2011. Language proficiency in native and nonnative speakers: an agenda for research
and suggestions for second-language assessment. Language Assessment Quarterly 8(3): 229–
249.
Hymes, D.H. 1962. The ethnography of speaking. In Anthropology and human behaviour, ed.
T. Gladwin, and W.C. Sturtevant, 13–53. Washington: The Anthropology Society of
Washington.
Hymes, D.H. 1964. Introduction: toward ethnographies of communication. American
Anthropologist 6(6): 1–34.
Hymes, D.H. 1972. On communicative competence. In Sociolinguistics, ed. J. Pride, and
J. Holmes, 269–293. Harmondsworth: Penguin.
Hymes, D.H. 1973. Toward linguistic competence. Texas working papers in sociolinguistics
(working paper No. 16). Austin: Centre for Intercultural Studies in Communication, and
Department of Anthropology, University of Texas.
Hymes, D.H. 1974. Foundations in sociolinguistics: an ethnographic approach. Philadelphia:
University of Pennsylvania Press.
Hymes, D.H. 1982. Toward linguistic competence. Philadelphia: Graduate School of Education,
University of Pennsylvania.
Iedema, R. 2001. Analysing film and television: a social semiotic account of hospital: an unhealthy
business. In Handbook of visual analysis, ed. T. van Leeuwen, and C. Jewitt, 183–204.
London: Sage.
Iizuka, Y. 1992. Extraversion, introversion and visual interaction. Perceptual and Motor Skills 74:
43–59.
Ingram, D., and E. Wylie. 1993. Assessing speaking proficiency in the international English
language testing system. In A new decade of language testing research: selected papers from
the 1990s language testing research colloquium, ed. D. Douglas, and C. Chapelle, 220–234.
Alexandria: TESOL Inc.
Jackendoff, R. 1983. Semantics and cognition. Cambridge: MIT Press.
Jacob, E. 1988. Clarifying qualitative research: a focus on traditions. Educational Researcher
17(1): 16–24.
Janik, S.W., A.R. Wellens, M.L. Goldberg, and L.F. Dell’Osso. 1978. Eyes as the centre of focus
in the visual examination of human faces. Perceptual and Motor Skills 47: 857–858.
Jarvis, G.A. 1986. Proficiency testing: a matter of false hopes? ADFL Bulletin 18: 20–21.
Jewitt, C. 2002. The move from page to screen: the multimodal reshaping of school English.
Journal of Visual Communication 1(2): 171–196.
Jewitt, C. 2006. Technology, literacy and learning: a multimodal approach. London: Routledge.
Jewitt, C. 2009. An introduction to multimodality. In The Routledge handbook of multimodal
analysis, ed. C. Jewitt, 14–27. London: Routledge.
Jewitt, C. 2011. The changing pedagogic landscape of subject English in UK classrooms. In
Multimodal studies: exploring issues and domains, ed. K.L. O’Halloran, and B.A. Smith, 184–
201. London: Routledge.
Johnson, K., and H. Johnson. 1999. Encyclopaedic dictionary of applied linguistics: a handbook
for language teaching. Malden: Blackwell Publishers Inc.
Johnson, M., and A. Tyler. 1998. Re-analysing the OPI: how much does it look like natural
conversation? In Talking and testing: discourse approaches to the assessment of oral
proficiency, ed. R. Young, and W. He, 27–51. Philadelphia: John Benjamins.
Jöreskog, K.G. 1993. Testing structural equation models. In Testing structural equation models,
ed. K.A. Bollen, and J.S. Long, 294–316. Newbury Park: Sage Publications.
Jungheim, N.O. 1995. Assessing the unsaid: the development of tests of nonverbal ability. In
Language testing in Japan, ed. J.D. Brown, and S.O. Yamashita, 149–165. Tokyo: JALT.
Jungheim, N.O. 2001. The unspoken element of communicative competence: evaluating language
learners’ nonverbal behaviour. In A focus on language test development: expanding the
language proficiency construct across a variety of tests, ed. T. Hudson, and J.D. Brown, 1–34.
Honolulu: University of Hawaii, Second Language Teaching and Curriculum Centre.
Kaindl, K. 2005. Multimodality in the translation of humour in comics. In Perspectives on
multimodality, ed. E. Ventola, C. Charles, and M. Kaltenbacher, 173–192. Amsterdam: John
Benjamins.
Kalma, A. 1992. Gazing in triads: a powerful signal in floor apportionment. British Journal of
Social Psychology 31: 21–39.
Kane, M.T. 1990. An argument-based approach to validation. Iowa City: The American College
Testing Program.
Kane, M.T. 1992. An argument-based approach to validity. Psychological Bulletin 112(3): 527–
535.
Kane, M.T. 1994. Validating interpretative arguments for licensure and certification examinations.
Evaluation and the Health Professions 17(2): 133–159.
Kane, M.T. 2001. Current concerns in validity theory. Journal of Educational Measurement 38(4):
319–342.
Kane, M.T. 2002. Validating high-stakes testing programs. Educational Measurement: Issues and
Practice 21(1): 31–41.
Kane, M.T. 2004. Certification testing as an illustration of argument-based validation.
Measurement: Interdisciplinary Research and Perspectives 2(3): 135–170.
Kane, M.T. 2006. Validation. In Educational measurement, 4th ed, ed. R. Brennan, 17–64.
Westport: American Council on Education and Praeger.
Kane, M.T. 2010. Validity and fairness. Language Testing 27(2): 177–182.
Kane, M.T., T. Crooks, and A. Cohen. 1999. Validating measures of performance. Educational
Measurement: Issues and Practice 18(2): 5–17.
Kasper, G., and K.R. Rose. 2002. Pragmatic development in a second language. Oxford:
Blackwell.
Kendon, A. 1967. Some functions of gaze-direction in social interaction. Acta Psychologica 26:
22–63.
Kendon, A. 1980. Gesticulation and speech: Two aspects of the process of utterance. In The
relationship of verbal and nonverbal communication, ed. M.R. Key, 207–227. The Hague:
Mouton and Co.
Kendon, A. 1981. The organization of behavior in face-to-face interaction: observations on the
development of a methodology. In Handbook of research methods in nonverbal behavior, ed.
P. Ekman, and K. Scherer, 440–505. Cambridge: Cambridge University Press.
Kendon, A. 1985. Some uses of gesture. In Perspectives on silence, ed. D. Tannen, and
M. Saville-Troike, 215–234. Norwood: Ablex.
Kendon, A. 1996. Gesture in language acquisition. Multilingual 15: 201–214.
Kendon, A. 2004. Gesture: visible action as utterance. Cambridge: Cambridge University Press.
Kim, M. 2001. Detecting DIF across the different language groups in a speaking test. Language
Testing 18(1): 89–114.
Kim, Y. 2009. An investigation into native and non-native teachers’ judgments of oral English
performance: a mixed methods approach. Language Testing 26(2): 187–217.
Kleinke, C.L. 1986. Gaze and eye contact: a research review. Psychological Bulletin 100(1):
78–100.
Knoch, U. 2009. Diagnostic writing assessment: the development and validation of a rating scale.
Frankfurt: Peter Lang.
Knox, J.S. 2008. Online newspapers and TESOL classrooms: a multimodal perspective. In
Multimodal semiotics: functional analysis in contexts of education, ed. L. Unsworth, 139–158.
London: Continuum.
Kok, A.K.C. 2004. Multisemiotic mediation in hypertext. In Multimodal discourse analysis:
systemic-functional perspectives, ed. K.L. O’Halloran, 131–159. London: Continuum.
Kondo-Brown, K. 2002. A FACETS analysis of rater bias in measuring Japanese second language
writing performance. Language Testing 19(1): 3–31.
References 97
the 15th Language Testing Research Colloquium, Cambridge and Arnhem, ed. M. Milanovic,
and N. Saville, 18–33. Cambridge: Cambridge University Press.
Lazaraton, A. 2002. A qualitative approach to the validation of oral language tests. Cambridge:
Cambridge University Press.
Lazaraton, A. 2008. Utilising qualitative methods for assessment. In Encyclopaedia of language
and education, 2nd ed., vol. 7: Language testing and assessment, 197–209. New York:
Springer.
Leathers, D.G., and H.M. Eaves. 2008. Successful nonverbal communication: principles and
applications, 4th ed. New York: Pearson Education Inc.
Lemke, J.L. 2002. Travels in hypermodality. Visual Communication 1(3): 299–325.
Lennon, P. 1990. Investigating fluency in EFL: a quantitative approach. Language Learning 40(3):
387–417.
Leung, C. 2005a. Convivial communication: recontextualising communicative competence.
International Journal of Applied Linguistics 15(2): 119–143.
Leung, C. 2005b. Classroom teacher assessment of second language development: construct as
practice. In Handbook of research in second language teaching and learning, ed. E. Hinkel,
869–888. Mahwah: Lawrence Erlbaum Associates.
Leung, C., and B. Mohan. 2004. Teacher formative assessment and talk in classroom contexts:
assessment as discourse and assessment of discourse. Language Testing 21(3): 335–359.
Levine, P., and R. Scollon (eds.). 2004. Discourse and technology: multimodal discourse analysis.
Washington: Georgetown University Press.
Levinson, S.C. 1983. Pragmatics. Cambridge: Cambridge University Press.
Linn, R.L. 1994. Performance assessment: policy promises and technical measurement standards.
Educational Researcher 23(9): 4–14.
Linn, R.L. 1997. Evaluating the validity of assessments: the consequences of use. Educational
Measurement: Issues and Practice 16(2): 14–16.
Liski, E., and S. Puntanen. 1983. A study of the statistical foundations of group conversation tests
in spoken English. Language Learning 33(2): 225–246.
Little, D. 2006. The Common European Framework of Reference for Languages: content, purpose,
origin, reception and impact. Language Teaching 39(3): 167–190.
Llosa, L. 2007. Validating a standards-based classroom assessment of English proficiency: a
multi-trait multi-method approach. Language Testing 24(4): 489–515.
Lloyd-Jones, R. 1977. Primary trait scoring. In Evaluating writing: describing, measuring,
judging, ed. C.R. Cooper, and L. Odell, 33–66. Urbana: National Council of Teachers of
English.
Long, Y., and P. Zhao. 2009. The interaction study between multimodality and metacognitive
strategy in college English listening comprehension teaching. Computer Assisted Foreign
Language Education 4: 58–74.
Lowe, P. 1985. The ILR proficiency scale as a synthesising research principle: the view from the
mountain. In Foreign language proficiency in the classroom and beyond, ed. C.J. James, 9–54.
Lincolnwood: National Textbook Company.
Lumley, T. 2002. Assessment criteria in a large-scale writing test: what do they really mean to the
raters? Language Testing 19: 246–276.
Lumley, T. 2005. Assessing second language writing: the rater’s perspective. New York: Peter
Lang.
Lumley, T., and A. Brown. 2005. Research methods in language testing. In Handbook of research
in second language teaching and learning, ed. E. Hinkel, 855–933. Mahwah: Lawrence
Erlbaum Associates.
Lumley, T., and B. O’Sullivan. 2005. The effect of test-taker gender, audience and topic on task
performance in tape-mediated assessment of speaking. Language Testing 22(4): 415–437.
Luoma, S. 2004. Assessing speaking. Cambridge: Cambridge University Press.
Lynch, B. 2001. Rethinking assessment from a critical perspective. Language Testing 18(4): 333–
349.
Lynch, B. 2003. Language assessment and programme evaluation. New Haven: Yale.
Macken-Horarik, M. 2004. Interacting with the multimodal text: reflections on image and verbiage
in ArtExpress. Visual Communication 3(1): 5–26.
Macken-Horarik, M., L. Love, and L. Unsworth. 2011. A grammatics ‘good enough’ for school
English in the 21st century: four challenges in realising the potential. Australian Journal of
Language and Literacy 34(1): 9–23.
Maiorani, A. 2009. The Matrix phenomenon. A linguistic and multimodal analysis. Saarbrucken:
VDM Verlag.
Marsh, H.W. 1988. Multi-trait multi-method analyses. In Educational research methodology, and
evaluation: an international handbook, ed. J.P. Keeves, 570–578. Oxford: Pergamon.
Marsh, H.W. 1989. Confirmatory factor analysis of multi-trait multi-method data: many problems
and a few solutions. Applied Psychological Measurement 15: 47–70.
Martin, J.R. 1995. Interpersonal meaning, persuasion and public discourse: Packing semiotic
punch. Australian Journal of Linguistics 15(1): 33–67.
Martin, J.R. 2000. Beyond exchange: Appraisal systems in English. In Evaluation in text:
Authorial stance and the construction of discourse, ed. S. Hunston, and G. Thompson, 142–
175. Oxford: Oxford University Press.
Martin, J.R. 2008. Intermodal reconciliation: mates in arms. In New literacies and the English
curriculum, ed. L. Unsworth, 112–148. London: Continuum.
Martin, J.R., and P.R.R. White. 2005. The language of evaluation: Appraisal in English. London:
Palgrave.
Martinec, R. 2000a. Types of processes in action. Semiotica 130(3): 243–268.
Martinec, R. 2000b. Construction of identity in Michael Jackson’s “Jam”. Social Semiotics 10(3):
313–329.
Martinec, R. 2001. Interpersonal resources in action. Semiotica 135(1): 117–145.
Martinec, R. 2004. Gestures that co-occur with speech as a systematic resource: the realisation of
experiential meanings in indexes. Social Semiotics 14(2): 193–213.
Matsumoto, D. 2006. Culture and cultural worldviews: Do verbal descriptions about culture reflect
anything other than verbal descriptions of culture? Culture and Psychology 12(1): 33–62.
Matsuno, S. 2009. Self-, peer- and teacher-assessments in Japanese university EFL writing
classrooms. Language Testing 26(1): 75–100.
Matthews, M. 1990. The measurement of productive skills: doubts concerning the assessment
criteria of certain public examinations. English Language Teaching Journal 44(2): 117–121.
Matthiessen, C.M.I.M. 2007. The multimodal page: a systemic functional exploration. In New
directions in the analysis of multimodal discourse, ed. T.D. Royce, and W.L. Bowcher, 1–62.
Mahwah: Lawrence Erlbaum Associates.
Maynard, S.K. 1987. Interactional functions of a nonverbal sign: head movement in Japanese
dyadic casual conversation. Journal of Pragmatics 11: 589–606.
Maynard, S.K. 1989. Japanese conversation: self-contextualisation through structure and
interactional management. Norwood: Albex.
Maynard, S.K. 1990. Understanding interactive competence in L1/L2 contrastive context: a case of
backchannel behaviour in Japanese and English. In Language proficiency: defining, teaching,
and testing, ed. L.A. Arena, 41–52. New York: Plenum Press.
McCrimmon, J.M. 1984. Writing with a purpose, 8th ed. Boston: Houghton Mifflin.
McKay, P. 1995. Developing ESL proficiency descriptions for the school context: the NLLIA ESL
band scales. In Language assessment in action, ed. G. Brindley, 3–34. Sydney: National Centre
for English Language Teaching and Research.
McNamara, T. 1990. Item response theory and the validation of an ESP test for health
professionals. Language Testing 7(1): 52–76.
McNamara, T. 1996. Measuring second language performance. London: Longman.
McNamara, T. 2000. Language testing. Oxford: Oxford University Press.
McNamara, T. 2001. Language assessment as social practice: challenges for research. Language
Testing 18(4): 333–349.
McNamara, T., and C. Roever. 2006. Language testing: the social dimension. Oxford: Blackwell
Publishing.
McNeill, D. 1979. The conceptual basis of language. Hillsdale: Lawrence Erlbaum Associates.
McNeill, D. 1992. Hand and mind: what gestures reveal about thought. Chicago: The University
of Chicago Press.
McNeill, D. 1998. Speech and gesture integration. In The nature and functions of gesture in
children’s communication. New directions for child development, ed. J.M. Iverson, and
S. Goldin-Meadow, 11–27. San Francisco: Jossey-Bass.
McNeill, D. (ed.). 2000. Language and gesture. Cambridge: Cambridge University Press.
McNeill, D. 2005. Gesture and thought. Chicago: The University of Chicago Press.
Mehrens, W.A. 1997. The consequences of consequential validity. Educational Measurement:
Issues and Practice 16(2): 16–18.
Messick, S. 1975. The standard problem: meaning and values in measurement and evaluation.
American Psychologist 30(10): 955–966.
Messick, S. 1980. Test validity and the ethics of assessment. American Psychologist 35(11): 1012–
1027.
Messick, S. 1988. The once and future issues of validity: assessing the meaning and consequences
of measurement. In Test validity, ed. H. Wainer, and H.I. Braun, 33–45. Hillsdale: Lawrence
Erlbaum Associates.
Messick, S. 1989a. Meaning and value in test validation: the science and ethics of assessment.
Educational Researcher 18(2): 5–11.
Messick, S. 1989b. Validity. In Educational measurement, 3rd ed, ed. R.L. Linn, 13–103. New
York: American Council on Education & Macmillan Publishing Company.
Messick, S. 1992. Validity of test interpretation and use. In Encyclopaedia of educational
research, 6th ed, ed. M.C. Alkin, 1487–1495. New York: Macmillan.
Messick, S. 1994. The interplay of evidence and consequences in the validation of performance
assessment. Educational Researcher 23(2): 13–23.
Messick, S. 1995. Standards of validity and the validity of standards in performance assessment.
Educational Measurement: Issues and Practice 14(4): 5–8.
Messick, S. 1996. Validity and washback in language testing. Language Testing 13(3): 241–256.
Mickan, P. 2003. What’s your score? An investigation into language descriptors for rating written
performance. Canberra: IELTS Australia.
Milanovic, M., N. Saville, A. Pollitt, and A. Cook. 1996. Developing and validating rating scales
for CASE: theoretical concerns and analyses. In Validation in language testing, ed.
A. Cumming, and R. Berwick, 15–38. Philadelphia: Multilingual Matters Ltd.
Mislevy, R.J. 2003. Substance and structure in assessment arguments. Law, Probability, and Risk
2(4): 237–258.
Mislevy, R.J., L.S. Steinberg, and R.G. Almond. 2003. On the structure of educational
assessments. Measurement: Interdisciplinary Research and Perspectives 1(1): 3–67.
Mislevy, R.J., R.G. Almond, and L.S. Steinberg. 2002. On the roles of task model variables in
assessment design. In Generating items for cognitive tests: theory and practice, ed. S. Irvine,
and P. Kyllonen, 97–128. Hillsdale: Lawrence Erlbaum Associates.
Morrow, K. (ed.). 2004. Insights from the Common European Framework. Oxford: Oxford
University Press.
Mosier, C.I. 1947. A critical examination of the concepts of face validity. Educational and
Psychological Measurement 7(2): 191–205.
Moss, P.A. 1992. Shifting conceptions of validity in educational measurement: implications for
performance assessment. Review of Educational Research 62(3): 229–258.
Munby, J. 1978. Communicative syllabus design. Cambridge: Cambridge University Press.
Myford, C.M. 2002. Investigating design features of descriptive graphic rating scales. Applied
Measurement in Education 15(2): 187–215.
Nakatsuhara, F. 2009. Conversational styles in group oral tests: how is the conversation
co-constructed? Unpublished Ph.D. thesis, The University of Essex, UK.
Nambiar, M.K., and C. Goon. 1993. Assessment of oral skills: a comparison of scores obtained
through audio recordings to those obtained through face-to-face evaluation. RELC Journal 24
(1): 15–31.
Neu, J. 1990. Assessing the role of nonverbal communication in the acquisition of communicative
competence in L2. In Developing communicative competence in a second language: series on
issues in second language research, ed. C.R. Scarcella, S.E. Andersen, and D.S. Krashen, 121–
138. New York: Newbury House Publishers.
Nevo, D., and E. Shohamy. 1984. Applying the joint committee’s evaluation standards for the
assessment of alternative testing methods. Paper presented at the annual meeting of the
American Educational Research Association, New Orleans.
Nevo, B. 1985. Face validity revisited. Journal of Educational Measurement 22(4): 287–293.
Norris, S. 2002. Theoretical framework for multimodal discourse analysis presented via the
analysis of identity construction of two women living in Germany. Unpublished Ph.D. thesis,
Georgetown University, USA.
Norris, S. 2004. Analysing multimodal interaction: a methodological framework. London:
Routledge.
Norris, J.M. 2005. Book review: Common European Framework of Reference for Languages:
learning, teaching, assessment. Language Testing 22(3): 399–405.
Norris, S., and R.H. Jones (eds.). 2005. Discourse in action: introducing mediated discourse
analysis. London: Routledge.
North, B. 1994. Scales of language proficiency: a survey of some existing systems. Washington,
DC: Georgetown University Press.
North, B. 1996. The development of a common framework scale of descriptors of language
proficiency based on a theory of measurement. Unpublished Ph.D. thesis, Thames Valley
University, UK.
North, B. 2000. The development of a common framework scale of language proficiency. New
York: Peter Lang Publishing Inc.
North, B. 2003. Scales for rating language performance: descriptive models, formulation styles,
and presentation formats. TOEFL Monograph, No. TOEFL-MS-24. Princeton: Educational
Testing Service.
North, B. 2010a. Levels and goals: central frameworks and local strategies. In The handbook of
educational linguistics, ed. B. Spolsky, and F.M. Hult, 220–230. Malden: Wiley-Blackwell.
North, B. 2010b. Assessment, certification and the CEFR: an overview. Plenary speech at
IATEFL TEA SIG & EALTA conference, Barcelona, Spain.
North, B., and G. Schneider. 1998. Scaling descriptors for language proficiency scales. Language
Testing 15(2): 217–262.
O’Halloran, K.L. 2000. Classroom discourse in mathematics: a multisemiotic analysis. Linguistics
and Education 10(3): 359–388.
O’Halloran, K.L. 2004. Visual semiosis in film. In Multimodal discourse analysis:
systemic-functional perspectives, ed. K.L. O’Halloran, 109–130. London: Continuum.
O’Halloran, K.L. 2005. Mathematical discourse: language, symbolism and visual images.
London: Continuum.
O’Halloran, K.L. 2008a. Inter-semiotic expansion of experiential meaning: hierarchical scales and
metaphor in mathematics discourse. In New developments in the study of ideational meaning:
from language to multimodality, ed. C. Jones, and E. Ventola, 231–254. London: Equinox.
O’Halloran, K.L. 2008b. Systemic functional-multimodal discourse analysis (SF-MDA):
constructing ideational meaning using language and visual imagery. Visual Communication
7(4): 443–475.
O’Halloran, K.L. 2009. Historical changes in the semiotic landscape: from calculation to
computation. In The Routledge handbook of multimodal analysis, ed. C. Jewitt, 98–113.
London: Routledge.
O’Halloran, K.L. 2011. Multimodal discourse analysis. In Continuum companion to discourse
analysis, ed. K. Hyland, and B. Paltridge, 120–137. London: Continuum.
O’Halloran, K.L., and F.V. Lim. 2009. Sequential visual discourse frames. In The world told and
the world shown: multisemiotic issues, ed. E. Ventola, and A.J.M. Guijarro, 139–156.
Hampshire: Palgrave Macmillan.
O’Loughlin, K.K. 2002. The impact of gender in oral proficiency testing. Language Testing 19(2):
169–192.
O’Malley, J.M., and A.U. Chamot. 1990. Learning strategies in second language acquisition.
Cambridge: Cambridge University Press.
O’Toole, M. 1994. The language of displayed art. London: Leicester University Press.
O’Toole, M. 2010. The language of displayed art, 2nd ed. London: Routledge.
O’Toole, M. 2011. Art vs. computer animation: integrity and technology in “South Park”. In
Multimodal studies: exploring issues and domains, ed. K.L. O’Halloran, and B.A. Smith, 239–
252. London: Routledge.
Ockey, G.J. 2001. Is the oral interview superior to the group oral? Working paper on language
acquisition and education, International University of Japan, vol. 11, pp. 22–41.
Oller, J.W. 1979. Language tests at school. London: Longman.
Oller, J.W. 1983. Evidence for a general language proficiency factor: an expectancy grammar. In
Issues in language testing research, ed. J.W. Oller, 3–10. Rowley: Newbury House.
Oller, J.W., and F.B. Hinofotis. 1980. Two mutually exclusive hypotheses about second language
ability: indivisible or partially divisible competence. In Research in language testing, ed. J.W.
Oller, and K. Perkins, 13–23. Rowley: Newbury House.
Oreström, B. 1983. Turn-taking in English conversation. Lund Studies in English 66. Lund:
CWK Gleerup.
Painter, C. 2007. Children’s picture book narratives: reading sequences of images. In Advances in
language and education, ed. A. McCabe, M. O’Donnell, and R. Whittaker, 40–59. London:
Continuum.
Painter, C. 2008. The role of colour in children’s picture books. In New literacies and the English
curriculum, ed. L. Unsworth, 89–111. London: Continuum.
Painter, C., J.R. Martin, and L. Unsworth. 2013. Reading visual narratives: Image analysis of
children’s picture books. Bristol: Equinox Publishing.
Patri, M. 2002. The influence of peer feedback on self- and peer-assessment. Language Testing 19
(2): 109–132.
Pawley, A., and F.H. Syder. 1983. Two puzzles for linguistic theory: nativelike selection and
nativelike fluency. In Language and communication, ed. J.C. Richards, and R.W. Schmidt,
191–225. London: Longman.
Pienemann, M., and M. Johnston. 1987. Factors influencing the development of language
proficiency. In Applying second language acquisition research, ed. D. Nunan, 89–94.
Adelaide: National Curriculum Resource Centre.
Pike, K.L. 1967. Language in relation to a unified theory of the structure of human behaviour, 2nd
ed. The Hague: Mouton & Co.
Poggi, I. 2001. The lexicon of the conductor’s face. In Language, vision and music, ed.
P. McKevitt, S. Nuallsin, and C. Mulvihill, 271–284. Amsterdam: John Benjamins.
Pollitt, A., and C. Hutchinson. 1987. Calibrating graded assessment: Rasch partial credit analysis
of performance in writing. Language Testing 4(1): 72–92.
Pomerantz, A., and B.J. Fehr. 1997. Conversation analysis: An approach to the study of social
action as sense making practices. In Discourse as social action, discourse studies: a
multidisciplinary introduction, vol. 2, ed. T.A. van Dijk, 64–91. London: Sage Publications.
Popham, W.J. 1990. Modern educational measurement: a practitioner’s perspective. New York:
Prentice Hall.
Popham, W.J. 1997. Consequential validity: right concern—wrong concept. Educational
Measurement: Issues and Practice 16(2): 9–13.
Popham, W.J. 2008. Transformative assessment. Alexandria: Association for Supervision and
Curriculum Development.
Psathas, G. 1995. Conversation analysis: the study of talk-in-interaction. Thousand Oaks: Sage.
Purpura, J. 1999. Learner strategy use and performance on language tests: a structural equation
modelling approach. Cambridge: Cambridge University Press.
Purpura, J. 2004. Assessing grammar. Cambridge: Cambridge University Press.
L2 and EFL writing: a structural equation modelling approach. In New directions for research
in L2 writing, ed. S. Ransdell, and M.L. Barbier, 101–122. Dordrecht: Kluwer Academic.
Scollon, R. 2001. Mediated discourse: the nexus of practice. London: Routledge.
Scollon, R., and S.W. Scollon. 2003. Discourses in place: language in the material world.
London: Routledge.
Scollon, R., and S.W. Scollon. 2004. Nexus analysis: discourse and the emerging internet.
London: Routledge.
Scollon, R., and S.W. Scollon. 2009. Multimodality and language: a retrospective and prospective
view. In The Routledge handbook of multimodal analysis, ed. C. Jewitt, 170–180. London:
Routledge.
Scriven, M. 1967. The methodology of evaluation. In Perspectives on curriculum evaluation, ed.
R.W. Tylor, R.M. Gagne, and M. Scriven, 39–83. Chicago: Rand McNally.
Searle, J.R. 1969. Speech acts: an essay in the philosophy of language. Cambridge: Cambridge
University Press.
Shepard, L.A. 1993. Evaluating test validity. In Review of research in education, vol. 19, ed.
L. Darling-Hammond, 405–450. Washington DC: American Educational Research
Association.
Shepard, L.A. 1997. The centrality of test use and consequences for test validity. Educational
Measurement: Issues and Practice 16(2): 5–8, 13, 24.
Shepard, L.A. 2000. The role of assessment in a learning culture. Educational Researcher 29(7):
4–14.
Shohamy, E. 1981. Inter-rater and intra-rater reliability of the oral interview and concurrent
validity with cloze procedure. In The construct validation of tests of communicative
competence, ed. A.S. Palmer, J.M. Groot, and G.A. Trosper, 94–105. Washington, DC:
TESOL.
Shohamy, E. 1996. Competence and performance in language testing. In Performance and
competence in second language acquisition, ed. G. Brown, K. Malmkjaer, and J. William,
138–151. Cambridge: Cambridge University Press.
Shohamy, E. 2001. The power of tests: a critical perspective of the uses of language tests. London:
Longman.
Shohamy, E., C.M. Gordon, and R. Kraemer. 1992. The effect of raters’ background and training
on the reliability of direct writing tests. Modern Language Journal 76: 27–33.
Shute, V.J. 2008. Focus on formative feedback. Review of Educational Research 78(1): 153–189.
Simpson, J. 2003. Report on BAAL/CUP seminar on multimodality and applied linguistics.
Reading, UK.
Sinclair, J.M., and M. Coulthard. 1975. Towards an analysis of discourse. Oxford: Oxford
University Press.
Skehan, P. 1984. Issues in the testing of English for specific purposes. Language Testing 1(2):
202–220.
Skehan, P. 1995. Analysability, accessibility and ability for use. In Principles and practice in
applied linguistics, ed. G. Cook, and B. Seidlhofer, 91–106. Oxford: Oxford University Press.
Skehan, P. 1996. Second language acquisition research and task-based instruction. In Challenge
and change in language teaching, ed. J. Willis, and D. Willis, 17–30. Oxford: Heinemann.
Smith, D. 2000. Rater judgments in the direct assessment of competency-based second language
writing ability. In Studies in immigrant English language assessment, vol. 1, ed. G. Brindley,
159–189. Sydney: Macquarie University.
Sparhawk, C.M. 1978. Contrastive identificational features of Persian gesture. Semiotica 24: 49–
86.
Spolsky, B. 1986. A multiple choice for language testers. Language Testing 3(2): 147–158.
Spolsky, B. 1989a. Communicative competence, language proficiency and beyond. Applied
Linguistics 10(2): 138–156.
Spolsky, B. 1989b. Conditions for second language learning: introduction to a general theory.
Oxford: Oxford University Press.
Spolsky, B. 1993. Testing and examinations in a national foreign language policy. In National
foreign language policies: practice and prospects, ed. K. Sajavaara, S. Takala, D. Lambert,
and C. Morfit, 124–153. Jyväskyla: Institute for Education Research, University of Jyväskyla.
Spolsky, B. 2008. Introduction: language testing at 25: maturity and responsibility? Language
Testing 25(3): 297–305.
Stein, P. 2008. Multimodal pedagogies in diverse classrooms: representation, rights and
resources. London: Routledge.
Stern, H.H. 1978. The formal-functional distinction in language pedagogy: a conceptual
clarification. Paper presented at the 5th AILA congress, Montreal, Canada.
Stöckl, H. 2004. In between modes: language and image in printed media. In Perspectives on
multimodality, ed. E. Ventola, C. Charles, and M. Kaltenbacher, 9–30. Amsterdam: John
Benjamins.
Street, B.V. (ed.). 1993. Cross-cultural approaches to literacy. Cambridge: Cambridge University
Press.
Suppe, F. 1977. The structure of scientific theories, 2nd ed. Urbana: University of Illinois Press.
Swain, M. 1985. Communicative competence: some roles of comprehensible input and
comprehensible output in its development. In Input in second language acquisition, ed.
S. Gass, and C. Madden, 235–256. New York: Newbury House.
Tan, S. 2009. A systemic functional framework for the analysis of corporate television
advertisements. In The world told and the world shown: multisemiotic issues, ed. E. Ventola,
and A.J.M. Guijarro, 157–182. Hampshire: Palgrave Macmillan.
Tan, S. 2010. Modelling engagement in a web-based advertising campaign. Visual
Communication 9(1): 91–115.
Tarone, E.E., and G. Yule. 1989. Focus on the language learner: approaches to identifying and
meeting the needs of second language learners. Oxford: Oxford University Press.
Teasdale, A., and C. Leung. 2000. Teacher assessment and psychometric theory: a case of
paradigm crossing? Language Testing 17(2): 163–184.
Thibault, P.J. 2000. The multimodal transcription of a television advertisement. In Multimodality
and multimediality in the distance learning age, ed. A. Baldry, 311–385. Campobasso, Italy:
Palladino.
Thorndike, E.L. 1920. A constant error in psychological ratings. Journal of Applied Psychology 4:
469–477.
Thorndike, R.M. 1997. Measurement and evaluation in psychology and education. Upper Saddle
River: Merrill.
Tomasello, M. 2003. Constructing a language: a usage-based theory of language acquisition.
London: Harvard University Press.
Toulmin, S.E. 2003. The uses of argument. Cambridge: Cambridge University Press.
Tseng, C., and J. Bateman. 2010. Chain and choice in filmic narrative: an analysis of multimodal
narrative construction in The Fountain. In Narrative revisited, ed. C.R. Hoffmann, 213–244.
Amsterdam: John Benjamins.
Turner, C.E. 1989. The underlying factor structure of L2 cloze test performance in Francophone,
university-level students: causal modelling as an approach to construct validation. Language
Testing 6(2): 172–197.
Turner, C.E., and J.A. Upshur. 2002. Rating scales derived from student samples: effects of the
scale maker and the student sample on scale content and student scores. TESOL Quarterly 36
(1): 49–70.
Underhill, N. 1987. Testing spoken English. Cambridge: Cambridge University Press.
Unsworth, L., and E. Chan. 2009. Bridging multimodal literacies and national assessment
programs in literacy. Australian Journal of Language and Literacy 32(3): 245–257.
Upshur, J.A., and C.E. Turner. 1995. Constructing rating scales for second language tests. ELT
Journal 49(1): 3–12.
Upshur, J.A., and C.E. Turner. 1999. Systematic effects in the rating of second language speaking
ability: test method and learner discourse. Language Testing 16(1): 82–111.
van Dijk, T.A. 1977. Text and context: exploration in the semantics and pragmatics of discourse.
London: Longman.
van Ek, J.A. 1975. The threshold level in a European unit/credit system for modern language
learning by adults. Strasbourg: Council of Europe.
van Leeuwen, T. 1999. Speech, sound and music. London: Macmillan.
van Leeuwen, T. 2001. Visual racism. In The semiotics of racism, ed. R. Wodak, and M. Reisigl,
333–350. Vienna: Passagen Verlag.
van Leeuwen, T. 2011. The language of colour: an introduction. London: Routledge.
van Lier, L. 1989. Reeling, writhing, drawling, stretching, and fainting in coils: oral proficiency
interviews as conversation. TESOL Quarterly 23(3): 489–508.
van Moere, A. 2007. Group oral test: how does task affect candidate performance and test score?
Unpublished Ph.D. thesis, The University of Lancaster, UK.
Vaughan, C. 1991. Holistic assessment: what goes on in the rater’s mind? In Assessing second
language writing in academic contexts, ed. L. Hamp-Lyons, 111–125. Norwood: Ablex.
Verhoeven, L. 1997. Sociolinguistics and education. In The handbook of sociolinguistics, ed.
F. Coulmas, 389–404. Oxford: Blackwell.
Wainer, H., and H.I. Braun (eds.). 1988. Test validity. Hillsdale: Lawrence Erlbaum Associates.
Wang, Y. 2009. The design of multimodal listening autonomous learning and its effect. Computer
Assisted Foreign Language Education 6: 62–65.
Wang, L., G. Beckett, and L. Brown. 2006. Controversies of standardised assessment in school
accountability reform: a critical synthesis of multidisciplinary research evidence. Applied
Measurement in Education 19(4): 305–328.
Webbink, P. 1986. The power of the eyes. New York: Springer.
Wei, Q. 2009. A study on multimodality and college students’ multiliteracies. Computer Assisted
Foreign Language Education 2: 28–32.
Weigle, S.C. 1994. Effects of training on raters of ESL compositions. Language Testing 11(2):
197–223.
Weigle, S.C. 1999. Investigating rater/prompt interactions in writing assessment: quantitative and
qualitative approaches. Assessing Writing 6(2): 145–178.
Weigle, S.C. 2002. Assessing writing. Cambridge: Cambridge University Press.
Wiener, M., et al. 1972. Nonverbal behaviour and nonverbal communication. Psychological
Review 79: 185–214.
Weir, C.J. 1990. Communicative language testing. Englewood Cliffs: Prentice Hall Regents.
Weir, C.J. 2005. Limitations of the Common European Framework of Reference for Languages
(CEFR) for developing comparable examinations and tests. Language Testing 22(3): 281–300.
White, E.M. 1985. Teaching and assessing writing. San Francisco: Jossey-Bass Inc.
White, S. 1989. Backchannels across cultures: a study of Americans and Japanese. Language in
Society 18: 59–76.
Widaman, K.F. 1985. Hierarchically tested covariance structure models for multi-trait
multi-method data. Applied Psychological Measurement 9: 1–26.
Widdowson, H.G. 1978. Teaching language as communication. Oxford: Oxford University Press.
Wolfe, E.W. 1997. The relationship between essay reading style and scoring proficiency in a
psychometric scoring system. Assessing Writing 4(1): 83–106.
Wolfe, E.W., C. Kao, and M. Ranney. 1998. Cognitive differences in proficient and non-proficient
essay scorers. Written Communication 15: 465–492.
Wolfe-Quintero, K., S. Inagaki, and H.-Y. Kim. 1998. Second language development in writing:
measures of fluency, accuracy and complexity. Honolulu: University of Hawaii at Manoa.
Wolfson, N. 1989. Perspectives: sociolinguistics and TESOL. New York: Newbury House.
Wylie, L. 1977. Beaux gestes: a guide to French body talk. New York: E. P. Dutton.
Xi, X. 2010. How do we go about investigating test fairness? Language Testing 27(2): 147–170.
Yamashiro, A.D. 2002. Using structural equation modelling for construct validation of an English
as a foreign language public speaking rating scale. Unpublished Ph.D. thesis, Temple
University, USA.
References 107
Yang, H., and C.J. Weir. 1998. Validation study of the national College English Test. Shanghai:
Shanghai Foreign Language Education Press.
Young, R. 1995. Discontinuous language development and its implications for oral proficiency
rating scales. Applied Language Learning 6: 13–26.
Young, R., and W. He. 1998a. Language proficiency interviews: a discourse approach. In Talking
and testing: discourse approaches to the assessment of oral proficiency, ed. R. Young, and
W. He, 1–24. Philadelphia: John Benjamins.
Young, R., and W. He (eds.). 1998b. Talking and testing: discourse approaches to the assessment
of oral proficiency. Philadelphia: John Benjamins.
Zebrowitz, L.A. 1997. Reading faces: window to the soul?. Boulder: Westview Press.
Zhang, D. 2009. On a synthetic theoretical framework for multimodal discourse analysis. Foreign
Languages in China 1: 24–30.
Zhang, Z. 2010. A co-relational study of multimodal PPT presentation and students’ learning
achievements. Foreign Languages in China 3: 54–58.
Zhang, D., and L. Wang. 2010. The synergy of different modes in multimodal discourse and their
realisation in foreign language teaching. Foreign Language Research 2: 97–102.
Zhu, Y. 2007. Theory and methodology of multimodal discourse analysis. Foreign Language
Research 5: 82–86.
Zhu, Y. 2008. Studies on multiliteracy ability and reflections on their effects on teaching.
Chapter 3
Research Design and Methods
In accordance with the aims of building an argument for nonverbal delivery in speaking assessment and of designing and validating a rating scale in the context of group discussion assessment, the entire research was chronologically broken down into (1) an argument-building (henceforth AB) phase, (2) a rating scale formulation (henceforth RSF) phase and (3) a rating scale validation (henceforth RSV) phase.
© Springer Science+Business Media Singapore 2016 109
M. Pan, Nonverbal Delivery in Speaking Assessment,
DOI 10.1007/978-981-10-0170-3_3
[Fig. 3.1 An overview of the research design. Dataset 1 (questionnaire results from teachers and learners in the Chinese EFL context), Dataset 2 (150 samples of EFL learners' group discussions in the formative assessment context) and Dataset 3 (teacher and peer rating results) feed the three phases. The AB phase draws on Hymes's notion of communicative competence, Canale and Swain's communicative competence model, Bachman's communicative language ability model, the CEFR's communicative language competence model and studies on nonverbal delivery, with strategic competence (nonverbal delivery) at the core. The RSF phase (Research Phase II; 30 + 20 samples) covers rating scale orientation (assessor-oriented), scoring approach (analytic), rating scale focus (construct-focussed) and rating scale design (theory-based and empirically driven). The RSV phase (100 samples) takes construct validity as the core, set against componential, unitary and argument-based notions of validity, and cross-validates the rating scale quantitatively with a multi-trait multi-method (MTMM) approach to validate the construct validity of the rating scale, and qualitatively with a multimodal discourse analysis (MDA) approach to align scores with multimodal performance and the descriptors.]
Afterwards, as delineated, the RSF phase was carried out in three steps, with RSF-I and RSF-II addressing the operationalisation of language competence and strategic competence on the rating scale. However, the ways in which the band descriptors for the two parts were formulated differed: RSF-I attempted to describe the language competence part on the basis of the assessment domains that teachers and learners in the Chinese EFL context were believed to perceive (Dataset 1). To that end, a questionnaire survey was the main research instrument, and the statistical method was exploratory factor analysis (EFA), to be detailed in Sect. 3.3.1. By comparison, instead of resorting to questionnaires, the strategic competence part of the rating scale was drawn from the findings of the empirical study at the AB phase. As mentioned earlier, the AB phase not only evidenced that the nonverbal delivery employed by learners across different proficiency levels could be differentiated, but also informed the range finders with gradable descriptors at adjacent levels for formulating nonverbal delivery on the rating scale. At this phase, 30 samples of group discussion with an equal distribution of candidates' proficiency levels from Dataset 2 were analysed. Having been formulated into a tentative version, the rating scale was then trialled and prevalidated (RSF-III) on a smaller scale (20 samples from Dataset 2) so as to resolve the issue of practicality and to make modifications, if any, before it was used to rate a larger sample at the RSV phase.
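Exploratory factor analysis itself is best left to dedicated statistical software, but the item-level screening that typically precedes factor extraction can be sketched. The snippet below computes corrected item–total correlations on fabricated five-point Likert responses; all data and names are illustrative and are not the study's actual questionnaire results.

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither list is constant

def item_total_correlations(responses):
    """responses: one list of item scores per respondent.
    For each item, correlate it with the total of the *other* items
    (the corrected item-total correlation used in item screening)."""
    n_items = len(responses[0])
    out = []
    for i in range(n_items):
        item = [r[i] for r in responses]
        rest = [sum(r) - r[i] for r in responses]
        out.append(pearson(item, rest))
    return out

# Fabricated 5-point responses (rows = respondents, columns = items);
# item 4 is deliberately made to run against the other three.
data = [
    [5, 5, 4, 1],
    [4, 4, 4, 2],
    [2, 3, 2, 5],
    [3, 3, 3, 4],
    [5, 4, 5, 2],
]
corrs = item_total_correlations(data)
```

An item with a markedly negative corrected item–total correlation (here, the fourth) would be flagged for reverse-coding or removal before EFA is run.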
The RSV phase also began with a review of the relevant literature, addressing the issue of how a rating scale should be validated. The answers covered not only the conceptualisation of validity in language assessment but also the validation methods. Having cast doubt on the feasibility of argument-based validity (see the dotted arrow in Fig. 3.1), this study reverted to a unitary notion of validity, placing construct validity in the central position. The review of validation methods justified the methods by which the rating scale was cross-validated, namely MTMM (RSV-I) and MDA (RSV-II).
As shown in Fig. 3.1, the RSV phase, particularly RSV-I, involved teacher raters' and peer raters' scoring (Dataset 3) of 100 samples randomly selected from Dataset 2. In practice, all the subscores assigned by teacher and peer raters against the revised rating scale were processed in EQS (see Sect. 3.2.2) to obtain the statistical output of the MTMM model comparison. The model-fit indices would indicate whether the different traits embedded in the intended construct of the rating scale could be consistently measured by different rating methods. However, given the inadequacy of deploying a quantitative approach alone and the uncertainty over whether the assigned subscores were aligned with candidates' de facto performance, the RSV phase proceeded to RSV-II, where an MDA approach was applied.
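The MTMM reasoning can be illustrated numerically: scores on the same trait obtained by different methods (the convergent validities) should correlate more strongly than scores on different traits. The sketch below uses fabricated teacher and peer subscores for two traits; the study's actual analysis was CFA-based model comparison in EQS, which this toy computation does not reproduce.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Fabricated subscores for six candidates, keyed by (trait, method).
scores = {
    ("language", "teacher"):  [3, 4, 2, 5, 4, 3],
    ("language", "peer"):     [3, 4, 2, 5, 3, 3],
    ("nonverbal", "teacher"): [2, 5, 3, 4, 2, 4],
    ("nonverbal", "peer"):    [2, 4, 3, 4, 2, 5],
}

def mtmm(scores):
    """Correlate every pair of (trait, method) score vectors,
    yielding the lower triangle of an MTMM correlation matrix."""
    keys = sorted(scores)
    return {(a, b): pearson(scores[a], scores[b])
            for i, a in enumerate(keys) for b in keys[i + 1:]}

m = mtmm(scores)
# Same trait, different methods: a convergent validity coefficient.
convergent = m[(("language", "peer"), ("language", "teacher"))]
```

In a well-behaved matrix the same-trait/different-method entries exceed the different-trait entries, which is the informal counterpart of the model-fit comparison described above.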
Therefore, the integration of quantitative and qualitative validation methods paved the way for scrutinising whether the proposed rating scale was characterised by the anticipated construct validity and for reaching the best-fitting MTMM model to explain the intended CLA construct. The rating scale would be subject to further modifications should such a need arise from the results of the RSV phase. Ultimately, as illustrated in Fig. 3.1, this project yielded its final product, viz. a rating scale with sound construct validity and practicality for scoring Chinese tertiary EFL learners' performance in group discussion in formative assessment.
3.2 Data
As illustrated in Fig. 3.1, three datasets thread through the whole study and need detailed description. Each dataset is specific to a research aim and was collected independently. This section therefore elaborates on the phase-specific data for the three main phases of the study.
1 The key institutions in China are those granted Project 211 and/or Project 985 status, whereas non-key institutions are those without either grant. These two grants are sound indicators of comparatively high rankings among institutions of higher learning in the Chinese mainland.
for Science and Technology (USST). Table 3.1 outlines the distribution of the data sources, featuring a relative balance between key and non-key institutions as well as geographic diversity among the institutions with which the participants are affiliated. In addition, the participants' majors (liberal arts, engineering, science, law, management, etc.) are also generally spread out.
A total of 1400 questionnaires (1100 for learners and 300 for teachers) were distributed to respondents in the seven institutions specified above in the academic year 2009–2010. Before the questionnaires were administered, the researcher liaised with the coordinators of each institution to clarify how the questionnaires should be administered so as to engage the respondents in completing them conscientiously. On the coordinators' suggestion, the questionnaires were administered to learner respondents in their spoken English class, where one of the topics for oral discussion was what makes a good English speaker in a group discussion. For teacher respondents, the questionnaires were distributed during regular departmental meetings. The administration was designed so that respondents' reluctance would be minimised, thus enhancing response reliability (Table 3.2).
As a result, 1312 questionnaires were returned. For various reasons, such as incomplete responses and detected invalid responses (e.g. all choices being identical; see Sect. 5.2.2 for more details), a few returned questionnaires were discarded.
Among the valid questionnaires, 1039 copies were from learner respondents (return
rate 94.5 %) and 273 from teacher respondents (return rate 91 %).
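The return-rate figures and the straight-lining check mentioned above (all choices being identical) are simple to reproduce; the sketch below uses the counts reported in the text, while the sample responses are invented for illustration.

```python
def is_straight_lined(answers):
    """Flag a questionnaire whose choices are all identical,
    one of the invalid-response patterns screened out above."""
    return len(set(answers)) == 1

def return_rate(valid, distributed):
    """Valid returns as a percentage of questionnaires distributed."""
    return round(100 * valid / distributed, 1)

# Counts reported in the text: 1039 valid of 1100 learner copies,
# 273 valid of 300 teacher copies.
learner_rate = return_rate(1039, 1100)   # 94.5
teacher_rate = return_rate(273, 300)     # 91.0

# Invented responses: the second batch member would be discarded.
batch = [[4, 3, 5, 2], [3, 3, 3, 3]]
flags = [is_straight_lined(r) for r in batch]
```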
Concerning teaching experience, as reflected by the length of English teaching, the range falls between 4.2 and 7.5 years, with a mean of 6.07 years. This indicates that the teacher respondents had accumulated a satisfactory amount of teaching experience, so their responses may largely be deemed reliable and representative in revealing their perceptions of the assessment domains of language competence. The learner respondents' length of English learning basically corresponds to the length of mainstream schooling in China, falling between 6.8 and 10.3 years. The dispersion might be caused by localised language policies in China that fine-tune the starting point of learning English as a compulsory subject. However, with a mean learning length of 8.36 years, it is convincing that the learner respondents as a whole had been exposed to English learning for a rather long period. Therefore, all the returned questionnaires could be regarded as representative of teachers and learners in the Chinese EFL context. Given that the questionnaire was originally devised from the CLA model (see Sect. 5.2.2 for more details), the participants' responses, revealing their perceptions of what constitutes language competence in group discussion, could usefully inform how the language competence part should be formulated.
Dataset 2 was involved in almost every phase of the research, ranging from building an argument for nonverbal delivery in differentiating candidates across proficiency levels to validating the proposed rating scale with quantitative and qualitative approaches. Although Dataset 2 was also collected from the same seven institutions specified above, more complex logistic issues were involved. As such, four aspects of collecting and processing Dataset 2 will be presented below, viz. recording, transcribing, applying and presenting the data (Leech et al. 1995; Thompson 2005).
The ultimate product of this study, viz. a validated rating scale for group discussion in formative assessment in the Chinese EFL context, logically determined that samples of group discussion should be collected as the base data. Therefore, a total of 150 samples of group discussion were collected from the seven institutions previously outlined. Gathered as a data pool, Dataset 2 was then separated into three subsets, subject to further processing and analyses in conformity with the phase-specific research objectives.
More specifically, 30 proficiency-stratified samples of group discussion were used not only to build a further empirical argument for the necessity of embedding nonverbal delivery in speaking assessment (AB phase) but also to depict the discernible nonverbal delivery employed by candidates across a range of proficiency levels (RSF-II). Dataset 2 therefore had to meet a specific requirement: the candidates' proficiency levels had to be predetermined against a reasonable and consistent yardstick. Likewise, RSF-III, with another 20 samples of group discussion, served the purpose of trialling the tentative version of the rating scale so that its practicality could be tested to the fullest possible extent.
It should be pointed out that no sample was "recycled" in any research phase. The remaining 100 samples of group discussion, comprising around 300 candidates' performances, were all reserved for the RSV phase to meet the case-number threshold for quantitative validation. Given the above, the following describes the participants involved in Dataset 2, followed by other details of this dataset.
The Participants
2 The College English Test (CET) is a large-scale, high-stakes written test of English language proficiency at the tertiary level in the Chinese mainland. At present, the test battery is divided into two tests, CET4 and CET6, which differ largely in degree of difficulty. The test is large scale in that millions of candidates sit it yearly, and high stakes in that a host of institutions may take the CET score as one of the thresholds for conferring bachelor's degrees on their graduates.
Data Collection
This part details the procedures by which this dataset was collected. Before the data were recorded, all the participants were informed of the assessment task by the coordinators in each of the seven universities. All the assessments in the form of group discussion were conducted during either Semester 1 or Semester 2 of the academic year 2009–2010. The participants were told approximately one week in advance that they would take part in group discussions of around five minutes as part of formative assessment. With the coordinators' permission, the researcher specified all the topics for the group discussions, covering an extensive range from campus life and cultural differences to other topical issues, all of which were assumed to be familiar to tertiary students so that utterances could be elicited with comparative ease. In addition, none of the topics demands prior professional or academic knowledge. Table 3.3 provides a full list of the group discussion topics for candidates to choose from.
When this dataset was collected, several considerations of data authenticity and naturalness were borne in mind. First, instead of being assigned to a particular group, all the participants in each institution were free to choose their own peer discussants, with a maximum of four participants per group. Second, all the assessments were administered in the candidates' own classroom, a familiar environment that could reduce their anxiety to the minimum degree possible. Another consideration was the agreement reached with all the subject teachers, via the coordinators in each university, that the candidates' performance would not be scored instantly on the spot, in order to guarantee a smooth continuation of the entire assessment process. The last consideration was research ethics clearance, as this study involved video-recording. With the help of the coordinators, all the participants were told that their performance would be audio- and video-recorded for research purposes only. Only those participants who signed the written consent forms would be recorded. They were also told that their performance would not be graded negatively if they were unwilling to be videotaped. As the researcher foresaw the need to present a number of the participants' portraits as snapshots when the proposed rating scale was validated, the written consent also included an agreement to appear in illustrations in this project.
Data Recording
After all the preparations for data collection had been made, the researcher travelled to each institution during the appointed periods, when the participants' formative assessments were due to take place. While the assessment was going on, the coordinator and the subject teacher acted as organisers while the researcher video-recorded the samples of group discussion. Before each group discussion began, either the coordinator or the subject teacher would remind the candidates to perform as naturally as possible and explain that the researcher was present merely for recording purposes. In case any of the participants were unwilling to be videotaped, the recording would be suspended and the researcher would excuse himself from the classroom so that the formative assessment could still be administered as planned.
In order to ensure the best quality of video-recording, the seating arrangement for the group discussion was designed as exemplified in Fig. 3.2. As can be seen, the camera was positioned in the centre of the classroom to capture all the discussants. The seats were arranged in the shape of a crescent so that candidates would be within each other's vision, with the camera in the middle of their crescent-shaped seating.
[Fig. 3.2 The seating arrangement and camera position for recording group discussions]
For the further analyses in each specific research phase, all the samples of group discussion in Dataset 2 needed to be transcribed into both monomodal and multimodal texts. In fact, the transcription of monomodal texts was a step completed prior to multimodal transcription, because the former would be embedded as one tier of the latter. The ensuing part elaborates on the transcription of both types of texts.
The transcription format of spoken language is a serious concern; yet "there is little agreement among researchers about the standardisation of [transcription] conventions" (Lapadat and Lindsay 1999, p. 65), and no strictly standard approach is used to transcribe talk in corpus linguistics research (Cameron 2001). Transcription is nonetheless the basis of any further analysis, and consensus has been reached that it is selective in nature, conventional by design, theoretically motivated, socially situated and methodologically driven (see Atkinson 1992; Edward 1993; Fairclough 1992; Goodwin 1981, 1994; Green et al. 1997; Gumperz 1992; Mehan 1993; Ochs 1979; Roberts 1997).
When the present study proceeded to data transcription, therefore, the researcher considered the issue of reliability and adhered to transcribing the utterances verbatim. Another important concern before transcription was the metadata, without which candidates' utterances would be nothing but a bundle of words of unknowable provenance or authenticity (Burnard 2005). In the present study, particular significance was attached to the header information, one of the basic components of metadata. It includes the institution level (key or non-key), the institution name, the participants' majors, their language proficiency level, their name initials, their genders and the particular topic they chose. Figure 3.3 shows an example of the header information format specified in this study.
<schoollevel=local> </schoollevel>
<school=USST> </school>
<major=wiring> </major>
<level=1> </level>
<speakers sp1=CC, male sp2=ZMJ, male sp3=XB, male> </speakers>
<topic=What is your opinion towards college students' having part-time jobs?> </topic>

As illustrated in Fig. 3.3, several field names constitute the header information; each field name is contained within a set of angle brackets together with its specified value, and is closed with an identical field name preceded by a forward slash. With these field names labelling the corresponding demographic information of the candidates, the needed data could be tracked, sorted and retrieved from Dataset 2 in batches. For example, all the verbal language produced by Group A candidates could be retrieved by setting the field name level to A. The example in Fig. 3.3, therefore, can be interpreted as a sample of a group discussion by three male Group A candidates majoring in wiring at USST, a non-key university in the Chinese mainland; their topic was what is your opinion towards college students' having part-time jobs.
The transcription takes one turn as the basic unit, with each speaker's turn sequence number attached. Figure 3.4 illustrates an excerpt of the transcribed data. As shown, the whole data transcription is contained within a set of markers (<conversation> and </conversation>). Within that set, each speaker's utterances on a turn-by-turn basis are also marked with a starting marker (e.g. <sp1>) and an ending marker (e.g. </sp1>).
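Markup of this kind, both the header fields of Fig. 3.3 and the turn markers, lends itself to simple pattern-based retrieval. The sketch below uses Python's standard re module; the sample strings are shortened and purely illustrative of the formats described here.

```python
import re

# Abridged header in the Fig. 3.3 format (values illustrative).
HEADER = """<schoollevel=local> </schoollevel>
<school=USST> </school>
<level=1> </level>"""

# Abridged turn-by-turn transcription in the Fig. 3.4 format.
SAMPLE = """<conversation>
<sp1> Then let's talk about the topic we choose. </sp1>
<sp2> I'm more inclined to let they stay independent. </sp2>
</conversation>"""

def parse_header(text):
    """Each field name carries its value inside the opening bracket."""
    return dict(re.findall(r"<(\w+)=([^>]*)>", text))

def parse_turns(text):
    """Return (speaker, utterance) pairs on a turn-by-turn basis;
    the backreference \\1 matches each turn's own closing marker."""
    return [(sp, utt.strip())
            for sp, utt in re.findall(r"<(sp\d+)>(.*?)</\1>", text, re.S)]

fields = parse_header(HEADER)
turns = parse_turns(SAMPLE)
```

Filtering a whole dataset then reduces to a comprehension over parsed headers, e.g. keeping only the samples whose level field has a given value.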
<conversation>
<sp1> Then let's talk about the topic we choose. How do you prepare to treat your parents when they are old? To let those living with you stay independence by themselves, or stay in the retirement house? </sp1>
<sp2> I'm more inclined to let they stay independent, because that they need quiet atmosphere and they, what they will stay would be the old level poor. Take my parents for example, en...if they live with me, it's not convenient and en...they can't enjoy their own life. En...because they just speak dialect. But our dialect is very different from the common speech. En...when they were to talk with others, then, there won't make big progress. Or then...they may not a, accustomed to our lifestyle. </sp2>
<sp3> En...I don't think so. I want them to stay with me. En...because if there is no relatives to be with them, they will feel lonely. And as we all know, old people often for your and, they are more particular, they are particularly easily to miss the kids. En...if they lived with us, en...we can take more care of them and give them a good living environment. And we can also en...avoid the long trip to visit them. </sp3>
</conversation>

After the monomodal text transcription was completed, this study continued to transcribe the candidates' nonverbal delivery multimodally. It is arguable that such transcription amounts to annotating the occurrences of cocontextualised nonverbal delivery; however, the distinction between transcription and annotation largely lies in whether the data were perceived directly by the sensory organs.
Annotation ought to be based on a certain theory held by the annotator, who treats the data through theory-laden lenses (Allwood et al. 2003; Garside et al. 1997; Gu 2006, 2009). Considering that "[m]ultimodal texts are composite products of the combined effects of all the resources used to create and interpret them" (Baldry and Thibault 2006, p. 18) and that the critical issue of representing the simultaneity of different modalities has not been ideally resolved (Flewitt et al. 2009), multimodal texts are mainly based on descriptions of what is factually presented by the data. Therefore, as Dataset 2 was processed directly through the researcher's observation, without hinging upon any evaluative subjectivity, this study worked on Dataset 2 in the sense of transcription.
In the present study, ELAN3 was employed as the multimodal transcriber (Version 4.0.1) (see Fig. 3.5 for a screenshot). ELAN allows candidates' verbal utterances to be input and all the occurrences of nonverbal delivery to be transcribed in defined tiers. It can also export all the transcription results along with their time frames, so that both the frequency and the cumulative durations of the specified nonverbal channels can be automatically calculated. In that sense, the transcription can be seen as multiplicative rather than merely additive (Baldry and Thibault 2006; Lemke 1998).
Four main tiers were defined for multimodal text transcription: verbal utterance, eye contact, gesture and head movement. The first tier is the same as the monomodal transcription, recording what candidates verbally produced in the group discussions. It should be noted that at the AB phase this study investigated group-based performance; in other words, irrespective of the number of discussants in a group discussion, their verbal utterances were transcribed into one group-based tier. The other three tiers were defined to transcribe, respectively, the occurrences of the participants' eye contact, gestures and head movements. The nonverbal delivery of different candidates was likewise transcribed into the three allocated tiers at this exploratory phase. However, for the fine-grained investigation following the analytic framework of MDA reviewed in Chap. 2, the transcription of candidates' nonverbal delivery at RSV-II was conducted on an individual basis.
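ELAN's exported annotations (tier name, start time, end time, annotation value) make the frequency and cumulative-duration counts mentioned above straightforward to compute. The sketch below runs on a fabricated export; the tier and label names are merely illustrative of the study's coding scheme.

```python
from collections import defaultdict

# Fabricated ELAN-style export rows: (tier, start_ms, end_ms, label).
annotations = [
    ("eye_contact", 0,    1200, "EC/p"),
    ("eye_contact", 1500, 2100, "EC/n"),
    ("gesture",     300,  900,  "iconic"),
    ("eye_contact", 2500, 4000, "EC/p"),
    ("head",        1000, 1400, "nod"),
]

def tier_summary(annotations):
    """Per (tier, label): occurrence frequency and cumulative
    duration in milliseconds, as derivable from ELAN's time frames."""
    freq = defaultdict(int)
    duration = defaultdict(int)
    for tier, start, end, label in annotations:
        freq[(tier, label)] += 1
        duration[(tier, label)] += end - start
    return dict(freq), dict(duration)

freq, duration = tier_summary(annotations)
```

Run over the real export, such tallies yield, for instance, how often and for how long a candidate maintained each type of eye contact across a discussion.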
The transcription was piloted to reach a general profile of what was to be transcribed. For example, concerning the transcription of eye contact and head movement, it was felt that a prescribed manner of coding directionality, consistent with the analytical framework, could be adopted, which would also facilitate comparatively objective judgement. Basically, a candidate would make eye contact with peer(s) (EC/p), with the researcher (EC/r), with the camera (EC/c), or with nothing in particular (no eye contact at all) or with other physical objects in the classroom (e.g. gazing at the ceiling or looking out of the window) (EC/n). The first three types, whose targets are more specific, can be more easily identified and feasibly transcribed, whereas the last one, with seemingly
3 Freeware downloadable from http://www.lat-mpi.eu/tools/elan (accessed on 9 November 2012).
[Fig. 3.5 A screenshot of ELAN, showing the media file player, the transcription tiers and the automatic retrieval of transcriptions]
As expounded in Fig. 3.1, Dataset 3, pertaining to the RSF and RSV phases, comprised the assessment results based on the proposed rating scale. It was needed in RSF-III (20 samples) and in both steps of the RSV phase (100 samples).
qualitative feedback the researcher collected during a meeting with the expert raters.
When both aspects were addressed, which signified the accomplishment of RSF-III,
the RSF phase as a whole was brought to an end.
After the tentatively proposed rating scale had been fine-tuned in congruence with the research findings of RSF-III, the RSV phase called for teacher- and peer-rating of the remaining 100 samples of group discussion in Dataset 2. Regarding the selection of teacher raters, the researcher thought it unnecessary to invite the same three experienced raters introduced above, largely for two reasons. First, as the three raters had previously been involved in the rating process, they might still adhere to the tentative version of the rating scale even though the revised version was to be used: their first impression of the rating scale might be so sharply etched in their minds that subconscious or unconscious reluctance to accept the revised version could arise from their familiarity with the trial version. Second, the rating scale proposed in this study is intended to be generalisable to formative assessment, an environment that does not and cannot necessarily require experienced raters; nor would it be possible for all EFL teachers to be expert raters.
With the above considerations, another three teacher raters were invited at RSV-I. Although not as experienced as the expert raters, they well epitomised the frontline instructors involved in formative assessment. As with the training session conducted in RSF-III, the raters were given a half-day workshop to become acquainted with the band descriptors and the initial data screening. However, unlike the previous half-day rating process, the rating at RSV-I took much longer given the larger data size. As it was impractical to require all three teacher raters to score the candidates' performance within one consecutive period, they were allowed to take the data away and return the rating results to the researcher within the following five days. This accommodation was partly due to the heavy rating workload and partly based on a consideration of intra-rater reliability, as the longer the rating process lasted, the less reliable the rating results within individual raters might become. The three teacher raters were given an honorarium as a token of appreciation.
When it comes to peer-rating, there was almost no possibility of returning the data to the particular institutions where the samples of group discussion had been collected and requesting the candidates to rate their peers' performance. The main reason lay in logistic constraints: the peer ratings were to be based on the revised rating scale, which only came into being after Dataset 2 had been collected. This study therefore adopted an indirect approach, in which the samples were rated by peers from different institutions to which the researcher had comparatively easy access. The samples of group discussion at different proficiency levels were also rated by the learners of the
At RSF-I, where the assessment domains were designed to be incubated via the
results from the questionnaires, the method of how those domains could be
3.3 Methods and Instruments 129
Fig. 3.7 An EQS example of path diagram with embedded parameter estimates
3.4 Summary

From logistic concerns and the perspective of general design, this chapter has outlined the research procedure, the data, the methods and the research instruments of this study. Based on the literature review of nonverbal delivery and of how to design and validate a rating scale that embeds nonverbal delivery into speaking assessment, the first section details how this study was carried out in a three-phase design. In the AB phase, an argument was advanced for incorporating nonverbal delivery as a dimension for differentiating candidates across a range of proficiency levels. When the study proceeded to the RSF phase, a rating scale informed by that argument was formulated, basically in the domains of language competence and strategic competence, the latter of which, enlightened by the review of the previous literature, can largely be represented by nonverbal delivery. The RSF phase ended with a small-scale prevalidation study in which certain modifications were made to refine the tentative version of the rating scale. The RSV phase, in turn, proceeded along two lines, quantitative and qualitative validation respectively. The second section of this chapter describes the data and profiles a few considerations on data collection, processing and analysis. In particular, with a number of exemplifications, more light is shed on how the three datasets threading through this study were processed and further analysed to serve phase-specific purposes. The last section wraps up the chapter with an elaboration of the statistical methods and the corresponding software used in rating scale formulation and validation.
Chapter 4
Building an Argument for Embedding
Nonverbal Delivery into Speaking
Assessment
This chapter reports on the AB phase of this research, an empirical study that
grounds the entire research project. Prior to advancing to the formulation and
validation of a rating scale as the ultimate product of this project, the study
first needs to build an argument for embedding nonverbal delivery into speaking
assessment. Specifically, an empirical study was conducted on how particular
channels of nonverbal delivery deployed by Chinese EFL learners can be
described, so that not only can their achievement in this respect be
microscopically characterised, but the argument mentioned above can also be
articulated. The research findings at this phase would also inform how the
strategic competence part of the rating scale (RSF-II), mainly reflected by
nonverbal delivery, can be subsequently formulated.
particularising the observable rating scale descriptors for nonverbal delivery
discerning candidates across proficiency levels. Hence, this phase of the study
is crucial in the sense that the wording of modifiers in the band descriptors,
if saliently distinguishable, could reflect gradable changes between adjacent
proficiency levels.
In retrospect, the very first general research question raised in the first
chapter addresses the role that nonverbal delivery plays in EFL learners' spoken
production in group discussions. At this phase of the research, this question
becomes more addressable, since the research objectives above provide pertinent
insights into the fine-grained research questions specific to this phase, as
outlined below. How these questions can be further operationalised is addressed
in the research design section below.
AB-RQ1: What are the main characteristics of Chinese EFL learners’ nonverbal
delivery in group discussion in the context of formative assessment?
AB-RQ2: To what extent can Chinese EFL learners’ employment of nonverbal
delivery be differentiated across different proficiency levels?
AB-RQ3: How does Chinese EFL learners’ nonverbal delivery interact with their
verbal utterance?
4.2 Method
1 These samples were selected in an ascending order of their sequence numbers in each proficiency group.
This part explicates the research findings and discussions on the three most
representative nonverbal channels reviewed earlier. The findings for each
nonverbal channel are reported consecutively in three sections. The first
section mainly deals with the two dimensions of measurement: frequencies/
occurrences and cumulative durations of nonverbal channels. The second section,
beyond the statistical spectrum, takes a closer look at how candidates across
different language proficiency levels instantiate nonverbal channels and what
communicative functions their nonverbal delivery might serve. The last section
touches upon the interaction between verbal language and nonverbal channels so
that the interface between the two modalities can be examined.
Considering the different durations of group discussions, this study standardised the
occurrences of eye contact in each sample to the frequencies in a unit interval of
136 4 Building an Argument for Embedding Nonverbal Delivery …
partly divergent from what was previously uncovered from the frequency
dimension. The echoing part is that Group A ranks first among the three groups
in both dimensions of frequency and cumulative duration. What is divergent is
that not only Group B but also Group C differs from Group A in the EC/p versus
ASD ratio, whereas only Group B is significantly different from Group A in EC/p
frequency.
The duration data, converted into seconds and standardised, were also submitted
to one-way ANOVA, as the data present a normal distribution. The durations of
EC/p are found to be significantly different across groups (see Table 4.4,
p = 0.002 < 0.01). A further post hoc Tamhane's T2 test reveals that Group C
exhibits a significantly shorter duration of EC/p than Group A.
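As a concrete illustration of this analytical step, a one-way ANOVA can be run
in a few lines with SciPy. The sketch below uses invented duration values as
stand-ins for Groups A, B and C; it is not the study's data or software.

```python
# Illustrative one-way ANOVA on standardised EC/p durations (seconds).
# The three groups below are invented stand-ins for Groups A, B and C;
# they are NOT the study's data.
from scipy import stats

group_a = [41.2, 38.5, 45.0, 39.8, 43.1]
group_b = [30.4, 28.9, 33.2, 27.5, 31.0]
group_c = [22.1, 25.3, 20.8, 24.0, 21.7]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```

A p-value below the chosen alpha then licenses a post hoc pairwise test such as
Tamhane's T2, which SciPy does not implement directly.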
Table 4.5 lists the descriptive statistics of EC/r cumulative duration. As
shown, the EC/r versus ASD ratios descend in the order of Group A (30.87 %),
Group B (21.18 %) and Group C (5.36 %). The one-way ANOVA shows a significant
inter-group difference (Table 4.6, p = 0.036 < 0.05), and a further post hoc
Tamhane's T2 test indicates that Group-A candidates spent significantly more
time on EC/r than Group C (p = 0.041 < 0.05). This, to a certain extent, runs
counter to the earlier finding that in terms of the frequency
Table 4.4 One-way ANOVA of EC/p cumulative duration across the groups

                 Sum of squares   df   Mean square   F       Sig.
Between groups   0.495            2    0.248         6.372   0.002
Within groups    5.712            27   0.039
Total            6.207            29
Table 4.6 One-way ANOVA of EC/r cumulative duration across the groups

                 Sum of squares   df   Mean square   F       Sig.
Between groups   0.086            2    0.043         3.512   0.036
Within groups    0.738            27   0.012
Total            0.824            29
dimension, Group C has higher frequencies of EC/r than Group A. More discussion
will be devoted to this issue later.
The findings then turn to the cumulative durations of EC/c across the range of
proficiency levels, as outlined in Table 4.7. Percentagewise, the groups appear
similar in their EC/c versus ASD ratios; nonetheless, it can be deduced from
the maximum duration that the longest sample from Group A (4:46.5) covered
almost the entire ASD of that particular group (5:46.9). It can therefore be
interpreted that the candidates concerned engaged in constant and continuous
EC/c throughout the entire discussion period. The one-way ANOVA finds no
significant difference across the proficiency groups (see Table 4.8,
p = 0.316 > 0.05).
What remains is an account of the candidates' eye contact with other or
non-detectable physical objects (EC/n) in group discussions. As it would not be
Table 4.8 One-way ANOVA of EC/c cumulative duration across the groups

                 Sum of squares   df   Mean square   F       Sig.
Between groups   0.019            2    0.010         1.168   0.316
Within groups    0.665            27   0.008
Total            0.684            29
4.3 Research Findings 139
[Figure: two transcribed examples of eye contact synchronised with verbal
language. Example 1 — sp_1: "Em...I have er...I think we er...we have a lot of
time, we should use em...we should use free time to study more courses. Don't
you agree?"; nonverbal delivery: eye contact with sp_2 (intensify; persuasion);
sp_2: "Er...I don't agree." Example 2 — sp_1: "So she can help me buy the, buy
the thing which I required. All in all, I'd like the friends who are
different"; nonverbal delivery: eye contact (compensate; regulatory).]
As unfolded above, the candidates were generally not observed to be highly
active in initiating EC/p in group discussions, nor were their EC/p durations
long and constant. In a sense, the lack of EC/p might partly be attributable to
an inexact understanding of what they were supposed to do. Most learners of
intermediate and elementary proficiency, if not all, might regard the group
discussion task as a platform on which to voice their own views rather than to
play the role of a group member with active interaction and engagement.
Therefore, they intrinsically discarded EC/p with attentive or persuasive
functions, or performed it poorly.
In response to the finding that Group A outnumbered Group B in EC/p frequency,
it is thought that learners of advanced proficiency, with more exposure to
English learning and incidental culture acquisition, would employ more
conversation management strategies, so that their intended meanings can be
further intensified by, or compensated for by, the accompanying verbiage.
Although there is no significant difference in EC/p frequencies between Group A
and Group C, the corresponding duration of the latter is shorter. This is
because, on the one hand, elementary-level candidates might be excessively
cautious in their discussion, turning to their peers for negotiation or
turn-taking via eye contact; on the other hand, such occurrences of eye contact
were usually brief and unstable. This would only augment the absolute frequency
of Group-C candidates' eye contact with peers, whose duration, nevertheless, is
not proportionate to its frequency of occurrence.
The occurrences of candidates' eye contact, especially those with the
teacher/researcher, were also characterised by an excess of impression
management. Admittedly, eye contact may be employed for impression purposes on
certain occasions. However, in a group discussion where the candidates are
already acquainted with the other discussants, eye contact with someone other
than the discussants should not be encouraged. Prior to video-recording, the
researcher explicitly clarified that the researcher was not an on-the-spot
assessor; despite such reassurance, the candidates still seemed to wrongly
treat the teacher/researcher as their discourse referent.
With regard to differences across the proficiency levels, Group C outnumbered
Group A in EC/r frequency, yet the picture is reversed for cumulative
durations. It is considered that elementary learners geared their discourse
referent to the researcher for fear of committing errors in spoken production.
Each shift in the directionality of their eye contact would not last long,
because such an action was taken just for the sake of receiving "not-that-bad"
reassurance from the on-the-spot researcher. By contrast, Group A, despite the
rather satisfactory mastery of conversation management strategies pinpointed
above, virtually talked to the researcher; Group-A candidates would therefore
explore every possible means of impression making, lengthening their duration
of EC/r. He and Dai (2006) also found that CET-SET candidates would "express their own line
In what follows, the findings on gestures are presented. Table 4.11 lists the
descriptive statistics of gesture frequencies on a sample basis. On the whole,
there was an average of 10.82 gesture occurrences in each observed sample of
group discussion. It can therefore be interpreted that if ASD is standardised
to five minutes with three candidates in each group, the mean gesturing
frequency across all the observed samples was approximately one occurrence per
minute for each candidate. This initially reveals that the candidates did not
frequently resort to gestures synchronised with their verbiage. A comparison
across the groups shows that Group A ranked first with 13.89 gesture
occurrences per sample; Group B and Group C came next with average frequencies
of 10.58 and 7.85, respectively.
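The normalisation behind this per-candidate estimate can be made explicit; the
figures below are those reported in the text (10.82 gestures per sample, a
5-minute ASD, three candidates per group).

```python
# Normalising the reported mean gesture frequency (10.82 per sample)
# to a per-candidate, per-minute rate, given a 5-minute standardised
# discussion (ASD) with 3 candidates per group.
mean_per_sample = 10.82
asd_minutes = 5
candidates_per_group = 3

rate = mean_per_sample / (asd_minutes * candidates_per_group)
print(f"{rate:.2f} gesture occurrences per candidate per minute")
```

The rate works out to about 0.7, which the chapter rounds to roughly one
occurrence per candidate per minute.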
Being normally distributed, the data representing gesture frequencies were
further processed by one-way ANOVA. The test shows a significant difference in
gesture frequencies across the groups (see Table 4.12, p = 0.001 < 0.01). Since
the variances across the proficiency groups are not homogeneous, a post hoc
Tamhane's T2 test was deployed for further inter-group comparison. It is found
that Group-A candidates exhibited statistically more gesture occurrences than
their Group-C counterparts (p = 0.002 < 0.01).
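The homogeneity check that motivates choosing Tamhane's T2 (a post hoc test
that does not assume equal variances) can be sketched with Levene's test; the
counts below are invented, not the study's data.

```python
# Levene's test for homogeneity of variances across three groups of
# (invented) gesture counts; a small p suggests unequal variances,
# in which case an unequal-variance post hoc test such as Tamhane's T2
# or Games-Howell is preferred over Tukey's HSD.
from scipy import stats

group_a = [14, 12, 16, 13, 15, 11, 18, 12]
group_b = [10, 11, 9, 12, 10, 11, 10, 11]
group_c = [8, 7, 9, 6, 8, 7, 9, 8]

w_stat, p_value = stats.levene(group_a, group_b, group_c)
print(f"Levene W = {w_stat:.3f}, p = {p_value:.4f}")
```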
The research findings then turn to the descriptive statistics of cumulative
gesture duration. Gesture duration might not be a sound parameter for
discerning candidates across proficiency levels, and might not be included in
the rating scale descriptors; at this exploratory phase, however, it is
advisable to include this tentative measure, as more insightful and interesting
findings might thus be produced.
As indicated in Table 4.13, be it in cumulative gesture duration or in the
gesture versus ASD ratio, the rankings remain the same, in the order of
Group A, Group B and Group C. Among the groups, the cumulative duration of
gesture in Group-A samples accounted on average for 40.45 % of ASD, indicating
that candidates quite frequently synchronised their verbiage with gestures of
various manifestations. The maximum cumulative gesture duration in Group A
(6′ 48.5″) was even longer than the ASD of that particular group. This is
because gestures by all the candidates in each group were transcribed on the
same time frame, so that the gestures of two or more candidates could be
encoded simultaneously. This extreme case, though comparatively rare, shows
that advanced-level candidates could synchronise their verbal utterances with
gesturing almost entirely.
A similar approach of one-way ANOVA and post hoc Tamhane's T2 was used to test
possible disparities in cumulative gesture duration across the groups. A
significant inter-group difference is found (see Table 4.14, p = 0.045 < 0.05),
and the cumulative gesture duration of Group-C candidates was marginally but
significantly shorter than that of Group A (p = 0.044 < 0.05).
The rough finding so far is that the candidates generally kept a low profile in
employing gestures, yet those of higher proficiency tended to instantiate more
gestures. At this stage, however, a fuller understanding of the candidates' de
facto gesturing cannot be reached without in-depth analyses of their gesture
manifestations. Given this, the findings turn to the descriptive transcriptions
of gestures.
A random sift of the transcription texts shows that a majority of them embed a
number of keywords related to the gestures defined in the present study: HAND,
FINGER, PALM, ARM and FIST, their plural forms included. The concordance
frequencies of these keywords constitute 95.14 % of all the gesture
transcriptions, ensuring that the extracted keywords can to a great extent
account for how the candidates instantiated their gestures. Tentatively, the
keyword HAND(S) was examined first with a view to extracting all the verbs
related to it, because this keyword can be thought of as the most direct word
for describing various gestures. Table 4.15 lists all the HAND(S)-related verbs
across the groups with their respective rankings.
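The study reports concordance-based retrieval; as a simplified stand-in for
that step, the sketch below counts the -ing verbs co-occurring with HAND/HANDS
in a few invented transcription lines.

```python
# Simplified stand-in for the concordance step: count the -ing verbs that
# co-occur with HAND/HANDS in gesture transcriptions. The transcription
# lines here are invented examples, not the study's data.
import re
from collections import Counter

transcriptions = [
    "raising right hand upwards",
    "moving both hands forward",
    "waving left hand slightly",
    "raising both hands from the thighs",
]

verb_counts = Counter()
for line in transcriptions:
    if re.search(r"\bhands?\b", line, re.IGNORECASE):
        verb = re.match(r"(\w+ing)\b", line)  # leading present participle
        if verb:
            verb_counts[verb.group(1).upper()] += 1

print(verb_counts.most_common())
```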
As shown in Table 4.15, 16 verbs were retrieved from both Group A and Group B,
whereas 13 were retrieved from Group C. This disparity basically corresponds
with the previous findings that Group C presents a lower profile in both the
frequency and the cumulative duration of gesture use. A detailed comparison
among the top-ranked verbs further reveals that candidates of all proficiency
levels share the same descriptive verbs with basically similar rankings:
MOVING, RAISING, SHAKING and WAVING.
The next step took a closer look at these shared verbs as a revelation of how
the candidates performed gestures. A pilot screening showed that these verbs
could be divided into two broad categories in relation to the
meaning-productiveness of gestures, as shown by part of the concordance lines
in Figs. 4.4 and 4.5. Referring to the accompanying verbiage of the gesture
transcription for meaning making in Fig. 4.4, MOVING was mainly associated with
the movement of hand(s) for meaning conveyance; RAISING mostly referred to the
use of a hand in yielding the turn to other group members; SHAKING, as its face
meaning suggests, often indicates an act of hand-shaking; what is worth mentioning is that
Two aspects are shared by the different proficiency groups, as can be found in
Table 4.17. One is that candidates across the range of proficiency groups share
many THINK-related chunks, an indicator that when they synchronised their
gestures with verbal utterances, the meanings intended were mostly expressing
their own viewpoints or requesting others' opinions. Regarding the
communicative functions of gestures (Ekman and Friesen 1969), the candidates'
gestures in this aspect should fall into illustrators, because the candidates
resorted to a variety of hand or arm movements in making themselves
comprehended.
The second shared aspect mainly concerns meanings involving adjective or adverb
degrees. As comparative and superlative degrees might serve an emphatic
purpose, this finding illustrates that when learners instantiated meanings with
emphatic foci, they were likely to use gestures in synchronisation with the
accompanying verbiage. Such occurrences of gesture realised the function of
illustrators as far as their communicative function is concerned.
Considering the interaction between verbal language and gesture, it can be
found that the two modalities achieve complementarity in the process of meaning
transmission. For instance, as illustrated in Fig. 4.7, Speaker 1, after
advancing an opinion with the comparative degree more important, yielded his
turn to another discussant; meanwhile, Speaker 1 slightly raised the forefinger
of the left hand upwards as if pointing at something, for an emphasis on the
accompanying verbiage. As the comparative degree in this case expressed
illustrative meaning and the gesture functioned as an illustrator, the
interaction between the two modalities was intensified.
Table 4.17 also shows a few uncommon phraseologies. As can be found, Group B
and Group C tended to synchronise their gestures with the accompanying verbiage
of That's all, a signal of turn termination. In terms of communicative
function, such gestures should fall into adaptors in Ekman and Friesen's (1969)
taxonomy because, instead of gesturing to signal an intention to yield the
turn, the candidates were mostly reassuring themselves of task fulfilment.
Nonetheless, there are also exceptions, as illustrated in Fig. 4.8. When
Speaker 1 finished the turn with That's all, instead of gesturing to invite
other candidates to take the floor, he still seemed to continue his turn by
raising both hands upwards from a resting position on the thighs. The
accompanying verbiage intended for turn termination therefore seemed
inconsistent with what the gesture instantiated; hence, the two modalities
bifurcated.
The phraseology agree with you can also be found in Group A in Table 4.17. By
reading the concordance lines retrieved from the gesture transcription texts,
it can be noticed that advanced-level candidates were able to appropriately use
[Figure 4.8: verbal language — sp_1: "Think about our history and some famous,
famous people and event. Yes. That's all." (turn-ending); nonverbal delivery —
sp_1 raising both hands upwards from a static position resting on the thighs
(adaptor); the two modalities diverge.]
The last nonverbal channel examined in this phase of the study is head
movement, mainly manifested as head nods and shakes. Table 4.18 lists the
descriptive statistics of head movement frequency. The minimum occurrence of
head movement is 1. As far as mean frequencies are concerned, the occurrences
of head movement rank in ascending order of proficiency level: Group C (5.26),
Group B (7.53) and Group A (9.47). If 5 min is again taken as ASD, each
candidate had only about one occurrence of head movement every 2.5 min.
As the data present normal distribution and heterogeneity of variance, one-way
ANOVA and a post hoc Tamhane's T2 test were conducted to detect possible
significant inter-group differences. As shown in Table 4.19, the three
proficiency groups are significantly different from each other
(p = 0.025 < 0.05). The post hoc Tamhane's T2 test further locates this
difference between Group A and Group C (p = 0.026 < 0.05). It can therefore be
interpreted that, in terms of frequency, candidates of higher proficiency
instantiated more head movements than their lower-level counterparts.
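The post hoc step can be illustrated as follows. SciPy does not implement
Tamhane's T2, so this sketch uses Tukey's HSD (`scipy.stats.tukey_hsd`,
available in SciPy 1.8+) as the closest built-in analogue, on invented
head-movement counts.

```python
# Post hoc pairwise comparison after a significant one-way ANOVA.
# Tamhane's T2 (used in the study for unequal variances) is not in SciPy;
# Tukey's HSD is shown here as the closest built-in analogue.
# The counts are invented, not the study's data.
from scipy import stats

group_a = [10, 9, 11, 10, 8, 12, 9, 10]   # hypothetical head-movement counts
group_b = [8, 7, 9, 8, 7, 8, 9, 8]
group_c = [5, 6, 5, 7, 6, 5, 6, 6]

res = stats.tukey_hsd(group_a, group_b, group_c)
# res.pvalue[i][j] is the p-value for comparing group i with group j
print(f"A vs C: p = {res.pvalue[0][2]:.4f}")
```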
Likewise, the cumulative duration of head movements, together with the ratio of
head movement to ASD in each group discussion, was also calculated; Table 4.20
lists these statistics. Impressionistically, the head movement duration versus
ASD ratio, the most obvious parameter indicative of the extent of head movement
instantiation, shows that in Group A approximately 20 % of the discussion
period was accompanied by head movements, while Group B and Group C had
moderately lower percentages in this regard.
When the data were standardised and tested by one-way ANOVA (see Table 4.21), a
significant difference in head movement can be found across the three
proficiency groups (p = 0.004 < 0.05), and a post hoc Tamhane's T2 test further
indicates that Group C was significantly different from Group B
(p = 0.005 < 0.05) and Group A (p = 0.007 < 0.05) in this respect. A brief
summary can therefore be made that, in both dimensions of head movement
frequency and duration, the candidates generally kept their heads in a rather
static position: during group discussion, only about one-fifth of the time
witnessed occurrences of head movement. Group C was significantly different
from Group A in head movement frequency, whereas Group C could be distinguished
significantly from both Group A and Group B from the duration perspective.
At this stage, the statistics can only provide a sketchy profile of how the
candidates performed head movement. Admittedly, in the Chinese social context,
nodding is generally understood as agreement, while head shaking usually
signals disagreement. However, a fuller picture of how the candidates aligned
appropriate head movements with what they expressed verbally can only be
depicted when verbal language is taken into consideration. It is therefore
necessary to examine how head movement interacts with the accompanying verbiage
and to analyse the communicative functions such movements might serve.
Fig. 4.9 Concordance lines of synchronisation between head nod and verbal language
[Figure: an example of divergence between head movement and verbal language.
sp_2: "It doesn't mean you want to lie or something. It just meant you want to
don't hurt others and want to make others more comfortable."; sp_3: nodding
(agreement) while replying "I'm afraid I don't think so. Anyway, a lie is a
lie." (disagreement); the two modalities diverge.]
Fig. 4.11 Concordance lines of synchronisation between head shake and verbal language
the negation signals not or no as context words, as shown in Fig. 4.11, only a
total of 7 occurrences were found. Holistically, when negation was conveyed,
the candidates seemed rarely or reluctantly to accompany their intended
verbiage of negation or disagreement with a head shake.
When the verbal language synchronised with head movements was retrieved in the
form of phraseologies, as encompassed in Table 4.22, it can be found that the
candidates generally expressed their own viewpoints or elicited other
discussants' responses when performing head movements, as evidenced by such
phraseologies as I think, do you think and what about you.
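The phraseology retrieval can be sketched as a simple n-gram count over the
utterances synchronised with head movements; the utterances below are invented,
not the study's transcripts.

```python
# Counting recurrent bigrams in (invented) utterances that were
# synchronised with head movements, as a stand-in for the study's
# phraseology retrieval.
from collections import Counter

utterances = [
    "i think we should study more",
    "what about you",
    "do you think it is good",
    "i think it is better",
    "i think so too",
]

bigrams = Counter()
for u in utterances:
    words = u.split()
    for i in range(len(words) - 1):
        bigrams[" ".join(words[i : i + 2])] += 1

print(bigrams.most_common(3))
```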
However, inter-group differences concerning the phraseologies can also be found
in Table 4.22. Two points are worthy of attention. One is that when
advanced-level candidates expressed the meaning of agreement, there was also
accompanying head movement, as can be cross-validated with the findings above.
The other is that the expression indicative of turn termination, that's all,
was again uttered by Group-B and Group-C candidates. More specifically, when
nearing the end of their turns, they would partially yield the turn by means of
nodding, so that other discussants might be hinted to take the floor. This
finding also corresponds to what was discovered in the section on eye contact,
where candidates, instead of resorting to verbal utterance, might perform eye
contact with other discussants for turn-taking.
Confining the instantiations of head movement to nods and shakes, this phase of
the research finds that the candidates presented more occurrences of nodding
than of head shaking. Generally, they were able to nod when an intended
verbiage of agreement or a signal of backchannelling was required, though
occasional cases of inappropriate nodding also occurred. What follows is a
discussion of these findings.
First, as head movement is one of the most salient nonverbal channels, as
reviewed earlier, candidates would be expected, whenever necessary, to
accompany their verbiage with head nods or shakes in group discussion, a task
which usually elicits conflicts of viewpoints or negotiation. As found above,
however, the frequency of head movement keeps a comparatively low profile: when
verbal utterances intended the meanings of agreement or disagreement,
synchronised head movements were rarely observed. Microscopically, a number of
candidates, particularly those of elementary proficiency, might be unable to
initiate head movement as backchannelling, in that their proficiency may deter
them from fully fathoming what other discussants conveyed. Another possibility
is that they might not pay due attention to others' utterances, so that no
response in the form of head movement could be detected. This also suggests
that head movement tends to be instantiated in contexts where a need for
backchannelling arises. The infrequent head shakes could also be partly
explained by cultural influence: in the Chinese social context, communicators,
out of courtesy, might not frequently shake their heads even in cases of
disagreement. This is consistent with the findings of Jungheim's (2001) study,
in which Japanese EFL learners, contextualised in a similar culture of
courtesy, were found to perform frequent nodding when assessed by native
speakers of English.
Second, as far as the communicative functions of head movements are concerned,
the main purpose of head nods and shakes should be to indicate agreement or
disagreement in an enhanced fashion. If candidates did not appropriately nod or
shake their heads in synchronisation with the intended verbiage, or sometimes
produced head movements only for regulatory purposes as a result of anxiety in
the assessment context, their performance cannot be regarded as communicative.
The degree of appropriateness can therefore serve as one of the dividing lines
for discerning candidates across a range of proficiency levels when head
movement, a salient domain of nonverbal delivery, is incorporated into speaking
assessment.
4.4 Summary
References
Ekman, P., and W.V. Friesen. 1969. Nonverbal leakage and clues to deception. Psychiatry 32:
88–106.
He, L., and Y. Dai. 2006. A corpus-based investigation into the validity of the CET-SET group
discussion. Language Testing 23(3): 370–401.
Hood, S.E. 2007. Gesture and meaning making in face-to-face teaching. Paper presented at the
Semiotic Margins Conference, University of Sydney.
Hood, S.E. 2011. Body language in face-to-face teaching: A focus on textual and interpersonal
meaning. In Semiotic margins: Meanings in multimodalities, ed. S. Dreyfus, S. Hood, and
S. Stenglin, 31–52. London and New York: Continuum.
Jungheim, N.O. 2001. The unspoken element of communicative competence: Evaluating language
learners’ nonverbal behaviour. In A focus on language test development: Expanding the
language proficiency construct across a variety of tests, ed. T. Hudson, and J.D. Brown, 1–34.
Honolulu: University of Hawaii, Second Language Teaching and Curriculum Centre.
Leathers, D.G., and H.M. Eaves. 2008. Successful nonverbal communication: Principles and
applications, 4th ed. Pearson Education, Inc.
Martinec, R. 2000. Types of processes in action. Semiotica 130(3): 243–268.
Martinec, R. 2001. Interpersonal resources in action. Semiotica 135(1): 117–145.
Martinec, R. 2004. Gestures that co-occur with speech as a systematic resource: The realisation of
experiential meanings in indexes. Social Semiotics 14(2): 193–213.
Chapter 5
Rating Scale Formulation
Overall, this phase of study mainly aims to formulate a tentative version of the
rating scale with language competence and strategic competence as two broad
dimensions. As aforementioned, how both dimensions will be formulated is
© Springer Science+Business Media Singapore 2016 159
M. Pan, Nonverbal Delivery in Speaking Assessment,
DOI 10.1007/978-981-10-0170-3_5
5.2 Method
This section presents the research design of RSF-I. Since the participants involved
in this phase of study have already been introduced with regard to their demographic
information, and since exploratory factor analysis, the statistical method
adopted to extract the assessment domains from the questionnaire responses, has also
been outlined in Chap. 4, this part mainly describes the research procedure and
explains the questionnaire design.
Given that the ultimate product of RSF-I would be the assessment domains
and descriptors of language competence on the rating scale, RSF-I was executed in
three steps, as illustrated in Fig. 5.1. The first step was to operationalise various
manifestations of language competence into statements. This step was followed by
the core of RSF-I with questionnaire as a research instrument (see Sect. 5.2.2 for
more details). A good number of operationalised statements concerning language
competence were then itemised for generating the trial versions of questionnaires
for teachers and learners, respectively. The modified questionnaires were dis-
tributed to the respondents after the trial use mainly to disambiguate the band
descriptors. In order to distil the essence of the respondents' perceptions of
what should constitute language competence in group discussion, their ratings
of the questionnaire statements were analysed with EFA. The last step,
deriving from the questionnaire response analyses, proceeded to design the
extracted assessment domains and the corresponding descriptors for measuring
language competence in group discussion on the rating scale.
Fig. 5.1 The three steps of RSF-I: Step 1, operationalise manifestations of language competence into statements; Step 2, formulate a questionnaire based on the operationalisations; Step 3, design the extracted assessment domains and corresponding descriptors
The questionnaires serving as the core research instrument at this phase are intro-
duced in detail in this section. As previously mentioned, the questionnaire could be
regarded as a granular epitome, or the operationalisation of language competence in
the CLA model. The following part presents the conceptual components of lan-
guage competence and a few assumedly aligned operationalisation statements in the
questionnaire (see Appendices V and VII for the trial versions for teacher and
learner respondents, respectively).
Recall that language competence in the CLA model is categorised
into organisational competence and pragmatic competence. The former can be
further divided into grammatical competence (GC) and textual competence (TC),
whereas the latter is composed of illocutionary competence (IC) and sociolinguistic
competence (SC). Blending the nature and practicality of group discussion with
these four domains, and with a view to determining assessment domains and
benchmarking each domain as observable and characterisable, RSF-I adapts and
tabulates the above in Tables 5.1 and 5.2.
From Table 5.1, it can be seen that although the CLA model stratifies different
layers of ingredients regarding organisational competence, modifications are made
to more effectively foreground the components when the assessment task is
Before a presentation of the research findings, this part first dwells on the
threshold values for running EFA on the observed dataset. As reviewed concerning
the threshold indices for EFA (see Sect. 3.3.1), the number of
respondents amounts to 1312, a figure well exceeding the minimum requirement of 300.
With the method of principal component EFA, this phase of study first checks the
threshold values as follows. Table 5.3 shows that the KMO value is 0.758, indi-
cating sound fitness of the dataset for EFA. Bartlett’s test also presents statistical
significance (p = 0.000), which further reveals the appropriateness for the teachers’
and learners’ rating data to run factor analysis.
Table 5.4 reflects the communalities of each item (statement) after extraction in
the factor analysis. As mentioned in Chap. 3, an extraction value above 0.5 is
acceptable for further data interpretation. All the extraction values in
Table 5.4 are above 0.5, showing that a fairly large proportion of the variance in
each item (statement) can be explained by the latent factors.
Then, the research findings of RSF-I move to the core part of EFA deriving from
the questionnaire responses. Table 5.5 presents the factor loadings of each variable
on the latent components when eigenvalue threshold is set at 1.0 by default. Judging
from the results in Table 5.5, four components are extracted from the 16 variables
(statements), whose loadings exceeding 0.3 on each corresponding latent compo-
nent are displayed (loadings below 0.3 are discarded due to poor interpretability).
Component 1, heavily loaded on the variables from GC_1 through GC_6, can be
regarded as one of the main contributors to GC. The only low loading, yet above
0.3, is found in the case of GC_4 (0.319). Component 2 is closely related to the
variables from GC_7 to GC_11, serving as another main contributor to GC.
However, there are two variables with low loadings on this component: GC_7
(0.308) and GC_8 (0.314), indicating that only a marginal proportion of the
variance in either statement is explained by Component 2. Component 3 is heavily loaded
on TC_1 and SC_2, and the remaining three variables (IC_1, IC_2 and SC_1)
mainly contribute to Component 4. Table 5.5 also shows that the latent
components together explain 69.63 % of the accumulated variance, again
a sound indicator of the explanatory power of the extracted components
with regard to what all the statements intend to reflect.
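The extraction procedure just described, principal components with an eigenvalue threshold of 1.0, loadings below 0.3 suppressed, communalities, and cumulative variance, can be sketched as follows. The two-cluster dataset is synthetic and stands in for the 16 questionnaire items; no names here come from the study's instruments.

```python
import numpy as np

def extract_components(data, eigen_threshold=1.0, loading_cutoff=0.3):
    """Principal-component extraction with the Kaiser criterion.

    Returns eigenvalues (descending), unrotated loadings with entries below
    the cutoff suppressed to zero, communalities, and the cumulative variance
    explained by the retained components.
    """
    R = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > eigen_threshold           # Kaiser criterion (eigenvalue > 1)
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    communalities = np.sum(loadings ** 2, axis=1)
    cumulative_variance = eigvals[keep].sum() / len(eigvals)
    display = np.where(np.abs(loadings) >= loading_cutoff, loadings, 0.0)
    return eigvals, display, communalities, cumulative_variance

# Hypothetical two-cluster data standing in for questionnaire items
rng = np.random.default_rng(1)
f1, f2 = rng.normal(size=(300, 1)), rng.normal(size=(300, 1))
data = np.hstack([f1 + 0.6 * rng.normal(size=(300, 4)),
                  f2 + 0.6 * rng.normal(size=(300, 4))])
eigvals, loadings, communalities, cum_var = extract_components(data)
```

The suppressed loading matrix mirrors the presentation in Table 5.5, and the communalities correspond to the extraction column of Table 5.4.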
Another issue to be cross-checked was whether the components extracted were
inter-correlated because in principal component analysis, promax rotation of the
latent factors was selected for maximising data fit. Considering the nature of
non-orthogonal rotation in this method, the correlations between latent factors might
be unpredictably high. Bearing the possible results of promax rotation in mind,
Table 5.6 presents a correlation matrix of the four latent components. It can be
noted that the four components, even after promax rotation, were not highly
correlated with each other, with only two correlation coefficients above 0.3.
One of them is the correlation coefficient between Component 1 and Component 2
(0.338), both of which originally derived from the operationalisations of GC in the
CLA model. It is therefore understandable that this correlation coefficient falls
within an acceptable range. Another, slightly higher, correlation coefficient can be found
between Component 3 and Component 4 (0.502). An initial possible explanation
might be that after the promax rotation, where orthogonal angle was no longer
maintained, these two components might be clustered slightly closer to be more
interdependent. This issue will be re-addressed in the discussion below.
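The cross-check itself is mechanical: flag any off-diagonal coefficient of the component correlation matrix that exceeds 0.3. In the sketch below, 0.338 and 0.502 are the two coefficients reported above; the remaining entries of the matrix are assumed purely for illustration.

```python
import numpy as np

# Hypothetical 4 x 4 component correlation matrix; 0.338 and 0.502 are the
# two coefficients reported in the text, the other entries are assumed.
phi = np.array([
    [1.000, 0.338, 0.120, 0.090],
    [0.338, 1.000, 0.150, 0.110],
    [0.120, 0.150, 1.000, 0.502],
    [0.090, 0.110, 0.502, 1.000],
])

def pairs_above(phi, threshold=0.3):
    """List the component pairs whose correlation exceeds the threshold."""
    idx = np.triu_indices_from(phi, k=1)
    return [(i + 1, j + 1, phi[i, j])
            for i, j in zip(*idx) if abs(phi[i, j]) > threshold]

flagged = pairs_above(phi)
```

Only the pairs (1, 2) and (3, 4) are flagged here, matching the two coefficients discussed in connection with Table 5.6.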
Given what is found above, including four initially extracted components, their
factor loadings and the degree of independence among latent factors, two issues
should be addressed. First, why did certain individual variables fail to be loaded on
the supposedly latent component for a particular assessment domain? Second, how
could these four extracted components inform the formulation of language com-
petence on the rating scale? The following discussion attempts to address these
questions in detail.
5.4 Discussion
speaking volume, pitch and stress. Such convergence not only indicates teachers’
and learners’ shared perceptions in that domain but also reveals that pronunciation
and intonation should be one of the legitimate and key elements in assessing
candidates’ GC. Against this, these elements would be reflected in the formulation
of the rating scale, particularly in the dimension concerning pronunciation and
intonation. The exception, as found above, derives from the statement GC4
speaking smoothly and loudly can help clear communication, with a loading
marginally exceeding 0.3. This means speaking smoothly and loudly in group
discussion does not substantially contribute to respondents’ perceptions of what
should be assessed, which consequently disqualifies that particular element off the
descriptors on the rating scale.
The statements falling into Component 2 from EFA are also relevant to GC
stipulated in the CLA model. However, dissimilar to Component 1, this component
seems to statistically contribute to grammar (correctness and variation) and vo-
cabulary (range, correctness and appropriateness). In that case, it can be assumed
that grammar and vocabulary can be grouped together as another assessment
dimension on the rating scale. Although the statements GC_7 and GC_8 have no
high factor loadings on Component 2, it does not necessarily follow that their
contribution to this latent variable is negligible. A closer reading of both
statements shows that they are intended to capture grammar correctness and
variation. Therefore, the possible reasons for their low loadings could be either
the low salience of grammaticality to teachers and learners in group discussion,
or the foreseeable infeasibility of observing grammar variation in the rating
process. In that context, when the rating scale is formulated, due caution needs
to be taken in describing grammar correctness and variation.
So far the first two latent variables have been discussed. As reviewed in the literature,
most existing rating scales for speaking assessment (see Appendices I through IV)
also “conventionally” include (1) pronunciation and intonation, and (2) grammar
and vocabulary. The discrepancy, if any, among certain analytic rating scales
consists in a further demarcation of assessment domains into more concrete points.
In that sense, these two assessment dimensions extracted from EFA greatly cor-
respond with a majority of prevailing rating scales, and they would also be naturally
set as two dimensions on the rating scale of this study. The discussion then
proceeds to the other latent variables from EFA, which touch upon dimensions
uncommon in existing rating scales.
The findings reveal that Component 3 is loaded with TC_1 and SC_2, and
Component 4 with IC_1, IC_2 and SC_1. It has to be acknowledged that a
majority of these variables were originally designed with a view to operationalising
TC, IC and SC in the CLA model. However, the three intended dimensions have
shrunk into only two latent variables after EFA, and the statements representing
the three dimensions even diverged across the extracted factors. This might have
resulted from the promax rotation, which posed a possible threat to the
independence between the components, as can be cross-validated by the
correlation matrix in Table 5.6. In addition, a host of remaining items, such
as IC_1 and SC_1, carried less heavy loadings on either component.
Against that context, it is reasonable to generalise and integrate the
intended construct of these remaining statements into one unitary component:
discourse management, which covers coherence and cohesion, fluency and topic
development. The naming of this assessment dimension to a great extent is
expected to reflect how candidates can manage their discourse in executing group
discussion.
To draw an interim summary, when language competence in the CLA model
was operationalised into individualised statements, based on which questionnaires
were designed and further administered to teachers and learners in the Chinese EFL
context, what should be assessed regarding language competence in group dis-
cussion was extracted in an exploratory manner. With the method of EFA, RSF-I
extracted four latent variables. The first two corresponded with GC,
comprising pronunciation and intonation, and grammar and vocabulary.
The remaining two latent variables, however, were integrated into one,
provided that no manifestation perceived in the questionnaire was to be
eliminated. This integrated dimension is named Discourse Management.
Having been informed by the results from the questionnaire and the follow-up EFA,
RSF-I in this section embarks upon formulating the part of language competence on
the rating scale. Basically, this part of the rating scale will be presented in two steps.
First, the rating scale for each analytic dimension, together with the corresponding
band descriptors, will be outlined. Second, the specifications will be provided to
illuminate how each band descriptor is brought forth mainly with respect to the
discriminating power across a range of proficiency levels.
Fig. 5.2 Pronunciation and Intonation on the rating scale (five-point scale, 5-1): Pronunciation anchored by Intelligible/Unintelligible and Native/Foreign; Intonation anchored by Appropriate/Inappropriate and Varied/Monotonous
foci on what is supposed to be assessed. What is worth noting is that although this
dimension embeds more than one aspect, when rating proceeds the rater is
supposed to assign only one score on a five-point scale, in an integrated manner,
to evaluate candidates' performance in this regard.
Prior to using this rating scale, raters would routinely be expected to familiarise
themselves with all the band descriptors; a correct and consistent understanding
of these descriptors facilitates the follow-up field rating. Table 5.7
presents the band descriptors for Pronunciation and Intonation.
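To make this scoring convention concrete, the sketch below models an analytic dimension with bipolar reminder anchors and a single integrated 1-5 band per candidate. The class, field and candidate names are purely illustrative and do not come from the study's instruments.

```python
from dataclasses import dataclass, field

@dataclass
class Dimension:
    """One analytic dimension: bipolar anchors, one integrated band score."""
    name: str
    anchors: list                      # bipolar reminder pairs on the scale ends
    bands: tuple = (1, 2, 3, 4, 5)
    scores: dict = field(default_factory=dict)

    def rate(self, candidate: str, band: int) -> None:
        # The scale stipulates exactly one integrated score per dimension
        if band not in self.bands:
            raise ValueError(f"band must be one of {self.bands}")
        self.scores[candidate] = band

pron_int = Dimension(
    name="Pronunciation and Intonation",
    anchors=[("Intelligible", "Unintelligible"), ("Native", "Foreign"),
             ("Appropriate", "Inappropriate"), ("Varied", "Monotonous")],
)
pron_int.rate("Candidate A", 4)
```

The validation in `rate` reflects the constraint that raters weigh all the anchored aspects but record a single band, rather than one score per anchor pair.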
The second dimension on the rating scale, namely Grammar and Vocabulary, which
is extracted from the questionnaire responses, bears much resemblance to the first
dimension formulated above, yet with the slight difference that the adjectives
used on both ends as reminders for raters are more congruent with the keywords
in the questionnaire statements. Figure 5.3 exhibits this dimension on the rating
scale. On the two continuums concerning the subdimension of Grammar,
accurate/inaccurate and varied/monotonous are provided for positioning and
observation purposes. Comparatively, Vocabulary is chiefly manifested by its
observable breadth and depth as well as by whether or not what is conveyed
reflects the idiomatic use expected in the native speech community of English.
Fig. 5.3 Grammar and Vocabulary on the rating scale (five-point scale, 5-1): Grammar anchored by Accurate/Inaccurate and Varied/Monotonous; Vocabulary anchored by Broad-Deep/Narrow-Shallow and Idiomatic/Unidiomatic
Following the same practice as for the first dimension on the rating scale, the
researcher of this study also requires that raters get acquainted with the
dimension of Grammar and Vocabulary. The scales
with modifiers attached on both ends serve the purpose of reminding raters of what
assessment domains should be carefully observed as stipulated in the band
descriptors. Similar to the practice of the first assessment dimension, raters would
also be supposed to assign only one score to this dimension based on their
observation and judgment on candidates’ performance in this aspect.
Table 5.8 lists the detailed band descriptors for the second dimension.
A microscopic look at the descriptors of one particular band will more effectively
enhance an understanding of what constitutes this dimension and how it is drawn
from the results of the questionnaire survey. Take Band 4, a level to be considered as
higher-intermediate proficiency, for example. The first descriptor at this level
indicates the degree of grammaticality, which can tolerate "occasional grammatical
error” only. The second descriptor is laid down with more reference to a consid-
eration of syntactic variation. At this level, candidates would be anticipated to
produce a range of variations though occasional inaccuracy and inflexibility might
be excused. The third descriptor deals with accuracy at the sentence level, still
conforming to the explanatory continuum anchored by accurate and inaccurate.
With regard to the fourth and fifth descriptors, more emphasis is placed on the
twofold aspect of vocabulary. On the one hand, candidates to be assigned to this
level should be proven to have both vocabulary breadth and depth, though
lapses can be tolerated. On the other hand, candidates might not be able to
constantly produce idiomatic expressions, but certain effort towards idiomatic
expression should be detectable. All the foregoing descriptors constitute what
would be expected of candidates falling into that level.
It should be noted that, to better discern candidates across a range of proficiency
levels with regard to their grammar and vocabulary, all the gradable modifiers
between two adjacent levels on the rating scale, such as those indicating
frequency (e.g. constant, frequent) and those indicating degree (e.g. repetitive,
limited), are largely informed by and formulated from the frequency and degree
modifiers deployed in the questionnaire statements.
Fig. 5.4 Discourse Management on the rating scale (five-point scale, 5-1): anchored by Fluency/Disfluency, Coherent/Fragmentary and Developed/Underdeveloped
What is elaborated above addresses the first broad dimension of the rating scale in
this study. RSF-II then dwells upon how strategic competence, mainly as reflected
by nonverbal delivery, can be formulated. As the development of this dimension
heavily relies on the research findings of the AB phase, this section will first
recapture the empirical study that aims at building an argument for embedding
nonverbal delivery into speaking assessment. Afterwards, the dimension of
Nonverbal Delivery, together with its corresponding descriptors, will be presented.
The AB phase, based on a small sample, examines the use of three nonverbal
channels by Chinese college EFL learners in their group discussions in formative
assessment. What follows in this section draws a synopsis of what has been
captured and also proposes how the research findings and discussion can inform
the formulation of Nonverbal Delivery on the rating scale in this study.
In terms of eye contact, candidates generally tended to make relatively little eye
contact with their peers, and there were significant inter-group differences in
frequency and duration. Advanced learners, comparatively, were capable of
resorting to gazing in fulfilling the assessment task and of switching their eye
contact between attentive and persuasive functions when turn-taking was involved.
By contrast, candidates of elementary and intermediate proficiency, in most
respects, gazed at other discussants largely for attentive and regulatory purposes.
In all likelihood, this observation results from their inexact speech referents or a
discrepant mastery of strategic competence. A majority of candidates across
different proficiency levels tended to make eye contact with an aim of impression
management. However, the ultimate goals of doing so were discernibly different
across proficiency levels: advanced learners were more likely to dominate or
impress the discourse referents, whereas those of lower proficiency were prone to
appear timid or fidgety in expressing themselves, or afraid of committing errors,
when shifting their eye contact towards the on-the-spot researcher.
Similarly, when the dimension of gesture was probed into, candidates did not
frequently avail themselves of gestures in synchronisation with their verbiage.
However many occurrences of gestures there might be in group discussions, the
cumulative durations might still be short. Candidates of different proficiency levels
presented certain differences in gesturing in the context, where candidates of
advanced proficiency exhibited better performances in both gesture variety and the
degree to which their gestures could explain or intensify the intended accompa-
nying verbiage. In stark contrast, although candidates of elementary and interme-
diate levels could use gestures to partly illustrate or reinforce accompanying verbal
language, their gestures were still less satisfactory given a dearth in diversity and
With the above recap, the design of the rating scale for formative assessment can
draw resourceful insights from the AB phase research findings. In addition, the
"unconventional" dimension of Nonverbal Delivery can also be formulated in a
describable manner, an avenue first explored by Jungheim (1995), who argues for
the necessity of formulating Nonverbal Ability Scales. Given that the candidates
across various proficiency levels in the AB phase of research exhibit significantly
different performance
on three most salient nonverbal channels, the descriptors of this dimension on the
rating scale should be naturally drawn from what is found regarding the statistical
and descriptive differences among different groups.
Therefore, informed by the research findings and discussions in the AB phase,
particularly the descriptions discerning the employment of nonverbal delivery by
candidates across a range of proficiency levels, RSF-II comes to formulate the part
of nonverbal delivery on the rating scale as shown in Fig. 5.5. Following a similar
approach as practised in formulating language competence on the rating scale, in
terms of layout, the part of Nonverbal Delivery is also characterised by extreme
modifiers on both ends with five possible grades positioned in the centre. The
modifiers still serve to remind raters of what should be primarily observed. For
instance, they are supposed to judge whether a candidate would instantiate a higher
or lower frequency of eye contact with other discussants in a group discussion and
whether the occurrences of eye contact, if any, are mostly durable ones or merely
brief glances. In addition, whether candidates' gestures feature variety or
monotony and whether they can instantiate appropriate head movements are also
etched on both ends of the scale for scoring.
Fig. 5.5 Nonverbal Delivery on the rating scale (five-point scale, 5-1): anchored by Frequent/Infrequent, Durable/Brief, Varied/Monotonous and Appropriate/Inappropriate
Despite the reminders on the rating scale, raters are still expected and strongly
encouraged to familiarise themselves with each individual descriptor so that their
scoring results would not yield inconsistency arising from discrepant
understandings.
The band descriptors for nonverbal delivery on the rating scale are shown in
Table 5.10. The five-level division on this part of the rating scale is the same as
for the previous three dimensions in RSF-I. The band descriptors for each level
revolve around the three nonverbal channels recaptured above. For eye contact,
the measures of frequency and duration are given significant consideration. In
addition to gesturing frequency, whether gestures are characterised by formal
diversity and whether they can perceivably enhance meaning making along with
candidates' verbiage in group discussions are also reflected as domains to be
assessed.
5.7 Summary
Building on what was found in the AB phase of this study, this chapter addresses
the phases of RSF-I and RSF-II, viz. how the rating scale, with a consideration of
embedding nonverbal delivery into speaking assessment, is formulated.
Appendix IX provides a tentative version of the rating scale.
When the part of language competence was formulated on the rating scale, this
study used a questionnaire comprising perceptibly operationalised statements
originating from the CLA model, from which teachers' and learners' ratings in the
Chinese EFL context could be drawn for an extraction of possible assessment
dimensions on the rating scale. After the processing of EFA and a further
discussion on latent variable naming, this phase of study proposed three dimensions
representing the core components of language competence, namely Pronunciation
and Intonation, Grammar and Vocabulary, and Discourse Management. In
particular, Discourse Management emerged from a few remaining salient features
that were not statistically heavily loaded on the intended factors. Therefore, an
integration approach was adopted for the formulation of this dimension.
Afterwards, the rating scale descriptors were developed by referring to certain
modifiers signifying degree and frequency on candidates’ potential performance.
The gradable descriptors were aimed at discriminating candidates across a range of
proficiency levels.
The development of strategic competence was largely based on the research
findings of the empirical study in the AB phase. As it has been found that
candidates of predetermined proficiency levels might be discerned with regard to their
performance in eye contact, gesture and head movements, strategic competence, as
mainly reflected by the dimension of Nonverbal Delivery on the rating scale, can be
developed with the aid of certain observable distinguishing features detected from
the study previously conducted. In a similar vein, certain degree and frequency
modifiers are employed with a view to reflecting the discriminating power of the
gradable descriptors.
Therefore, a tentative rating scale with four dimensions is so far brought forth.
However, considering that this rating scale is still subject to refinement, rather than
directly applying this rating scale for any validation, this study proceeds to RSF-III,
where a prevalidation study is conducted based on expert raters’ trial rating and
their feedback. It is expected that with the results from the trial rating as well as the
suggestions contributed by the expert raters, this rating scale can be further shaped
up for an enhancement of its perceived construct validity and rater-friendliness.
References
Bachman, L.F. 1990. Fundamental considerations in language testing. Oxford: Oxford University
Press.
Bachman, L.F., and A.S. Palmer. 1996. Language testing in practice: Designing and developing
useful language tests. Oxford: Oxford University Press.
Jungheim, N.O. 1995. Assessing the unsaid: The development of tests of nonverbal ability. In
Language testing in Japan, ed. J.D. Brown, and S.O. Yamashita, 149–165. Tokyo: JALT.
Chapter 6
Rating Scale Prevalidation
and Modification
The previous two chapters respectively develop two core components of the rating
scale drawn from the CLA model: language competence and strategic competence.
Generally speaking, the proposed rating scale is developed into a five-band one,
with three dimensions contributing to language competence and one dimension to
strategic competence. Detailed descriptors and discriminating wording between
each two adjacent bands are also substantiated so as to assess potential candidates
with respect to their all-round attainment of communicative language ability in the
context of group discussion. However, due caution should be taken before this
tentatively formulated rating scale proceeds to be validated; it should be first trialled
or, in a sense, prevalidated to eliminate any potential impracticality or
rater-unfriendliness. Bearing the above as a crucial consideration for this phase of
study, this chapter reports on the last step of the RSF phase, where the proposed
rating scale is processed via a small-scale validation by expert rating and judgment
for further rating scale refinement.
With trialling the tentatively proposed rating scale as a point of departure, this phase
of study mainly aims to test the rater-friendliness of this rating scale, viz. the extent
to which expert raters would perceive it as practical, and to adjust and disambiguate
any inappropriate diction that could possibly attenuate the validity of the
proposed rating scale. Expert judgment in this case, therefore, would be expected to
fine-tune the rating scale so that candidates can be even better distinguished
between adjacent proficiency levels. Answers are sought to the following four
research questions.
RSF-III-RQ1: To what extent is the tentatively proposed rating scale valid?
RSF-III-RQ2: To what extent is the tentatively proposed rating scale
rater-friendly?
RSF-III-RQ3: To what extent can the proposed rating scale distinguish candi-
dates across a range of proficiency levels?
RSF-III-RQ4: How can the proposed rating scale be revised?
As a wrapping-up step of the RSF phase, this phase of study was conducted to
initially test the construct validity of the proposed rating scale and also its
practicality, without which the RSV phase could not proceed with full preparations.
This section, therefore, outlines the research procedure and the methods used.
This phase of study serves as a prevalidation comprising three steps anchored in
expert rater scoring and judgment. To commence, three invited expert raters were
requested to score the same 20 samples of group discussion against the tentatively
proposed rating scale. Afterwards, a group interview with them was conducted to
procure feedback, mainly dwelling on the extent to which the tentative rating scale
would be rater-friendly. After the gathering of the expert raters' scoring and the
interview data, namely the raters' responses to the interview
questions along with their suggestions, this phase of study would proceed to the
analyses investigating the construct validity of the proposed rating scale by
correlating the subscores assigned. In addition, the experts’ comments would also
be qualitatively retrieved so as to inform how the rating scale could be better
modified. Upon the completion of all these steps, both the analyses of the scoring
results and the interview responses would be referred to for a refinement of the
rating scale formulation.
RSF-III needed 20 samples of group discussion from Dataset 2, the expert rating
results and the interview data. As how Dataset 2 and Dataset 3 concerning expert
rating were collected has been detailed (see Sects. 3.2.2 and 3.2.3), no further
description is rendered here. However, more elaboration will be made
below on how the interview with the expert raters was conducted, and how the
related data in this phase of study would be processed and analysed.
Based on the above research purposes and design, this section will first unfold the
quantitative findings of the initial examination on the construct validity of the rating
scale proposed, which is followed by the qualitative findings of expert evaluation in
the interview.
Prior to presenting the findings on construct validity, it is necessary to first check the inter-rater reliability of the scores that the three expert raters assigned against the proposed rating scale, based on the candidates' performance in group discussion. Since more than two raters were involved, this study finds it less appropriate to resort to the conventional Kappa coefficient, as that method is typically deployed to scrutinise agreement between two raters only. Instead, correlations among the three raters on the same assessment dimensions of the rating scale were analysed. Since the raters assigned each subscore within a range of 1 to 5, there was no risk of a spurious correlation arising from agreement in rank order rather than in magnitude, so an intra-class correlation check was not required.
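As a minimal sketch of this pairwise-correlation check (the rater labels and toy subscores below are invented for illustration, not data from this study), the agreement among three raters on one dimension can be computed with scipy:

```python
from itertools import combinations

from scipy.stats import pearsonr

# Hypothetical subscores (1-5) that three expert raters assigned to the
# same ten candidates on one assessment dimension; real data would come
# from the 20 rated samples of group discussion.
scores = {
    "Rater_A": [4, 3, 5, 2, 4, 3, 5, 4, 2, 3],
    "Rater_B": [4, 3, 4, 2, 5, 3, 5, 4, 2, 3],
    "Rater_C": [5, 3, 5, 2, 4, 2, 5, 4, 3, 3],
}

# With three raters, a single two-rater Kappa does not suffice, so every
# rater pair is correlated on the same dimension instead.
pairwise_r = {}
for a, b in combinations(scores, 2):
    r, p = pearsonr(scores[a], scores[b])
    pairwise_r[(a, b)] = round(r, 3)

print(pairwise_r)
```

Coefficients above the 0.70 threshold adopted in the study would then be read as satisfactory agreement for that dimension.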
Table 6.2 displays the results of the Pearson correlations as an indication of rating agreement. As there are four assessment dimensions on the proposed rating scale, the correlation analysis was computed dimension by dimension among the raters. As Table 6.2 shows, almost all the correlation coefficients are well above 0.70 (p < 0.01). Although there appears to be some controversy concerning the
Table 6.2 Inter-rater reliability of expert rater scoring
6.3 Research Findings
Before unveiling the correlation matrix of the subscores, a brief picture of the descriptive statistics of the scores is necessary to profile how proficient the candidates were when measured against the tentatively proposed rating scale.
Table 6.3 lists the descriptive statistics of the expert rating results. As indicated, the mean score for Dimension 1 (4.07) is the highest among all the dimension scores. Given that Dimension 1 mainly assesses candidates' pronunciation and intonation, this suggests that the candidates observed had a quite satisfactory command of English pronunciation and intonation, nearly aligning with the near-advanced-level descriptors of the rating scale (Band 4). By comparison, there is no great gap among the mean subscores for the other three dimensions, which fall within a range of 3.29 to 3.85. This indicates that the observed candidates could pass the middle demarcation of the bands (Band 3) on the rating scale. Notably, the mean subscore of Dimension 4 is the lowest (3.29), suggesting that the candidates generally did not attain the anticipated performance on nonverbal delivery. Since the skewness and kurtosis statistics do not indicate a normal distribution of the dataset, Spearman's rho was adopted for nonparametric correlation analysis in the follow-up.
¹For example, Landis and Koch (1977) and Altman (1991) propose that an inter-rater reliability coefficient between 0.60 and 0.80 is considered substantial or good, whereas Fleiss (1981) more loosely sets the range of 0.40 to 0.75 as intermediate to good. This study generally takes 0.70 as the threshold for moderate-to-high correlation strength, as suggested by Gwet (2012).
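The normality screening described above can be sketched as follows (the subscores are fabricated for illustration, and the cut-off of 1.0 is purely illustrative, not the study's criterion):

```python
from scipy.stats import kurtosis, skew

# Fabricated mean subscores for ten candidates on one dimension; real
# values would be the expert ratings summarised in Table 6.3.
scores = [4.3, 4.0, 4.7, 3.3, 4.0, 4.3, 5.0, 4.0, 3.0, 4.3]

# Sample skewness and excess kurtosis near zero are consistent with
# normality; marked departures argue for a nonparametric (Spearman)
# rather than a Pearson correlation in the follow-up analysis.
sk = skew(scores)
ku = kurtosis(scores)  # Fisher definition: 0.0 for a normal distribution
use_spearman = abs(sk) > 1.0 or abs(ku) > 1.0  # illustrative cut-off
print(round(sk, 2), round(ku, 2), use_spearman)
```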
Table 6.4 shows the correlations of the mean subscores assigned by the expert raters. As can be seen, the correlation between every pair of dimensions features quite high coefficients, above 0.70 (p < 0.01). For example, Dimension 1 correlates most highly with Dimension 3, with a coefficient of 0.818 (p < 0.01). To a great extent, this means that although Dimension 1 (Pronunciation and Intonation) and Dimension 3 (Discourse Management) are intended for different domains of assessment, the subscores in these regards are so highly and positively related that a unitary construct is effectively being observed and measured. The same holds for the correlations among the other dimensions on the proposed rating scale.
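An inter-dimension matrix of the kind shown in Table 6.4 can be obtained in one call; in this sketch the dimension labels are taken from the scale but the ten candidates' mean subscores are invented:

```python
import numpy as np
from scipy.stats import spearmanr

# Invented mean subscores (averaged over the three expert raters) for
# ten candidates on the four dimensions PI, GV, DM and ND.
dimensions = ["PI", "GV", "DM", "ND"]
subscores = np.array([
    [4.3, 3.7, 4.0, 3.7],
    [4.0, 3.3, 3.7, 3.0],
    [4.7, 4.3, 4.3, 4.0],
    [3.3, 3.0, 3.0, 2.3],
    [4.0, 3.7, 3.7, 3.3],
    [4.3, 3.7, 4.0, 3.0],
    [4.7, 4.3, 4.7, 4.3],
    [4.0, 3.7, 3.7, 3.3],
    [3.0, 2.7, 2.7, 2.7],
    [4.3, 3.7, 4.0, 3.3],
])

# spearmanr on a 2-D array treats columns as variables and returns the
# full 4 x 4 rho matrix; uniformly high positive values would point to
# a unitary underlying construct.
rho_matrix, p_matrix = spearmanr(subscores)
print(np.round(rho_matrix, 3))
```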
As specified above, after the three expert raters had completed scoring the 20 samples of group discussion using the proposed rating scale, an interview was conducted to obtain their feedback on the structured questions listed in Table 6.1. This part presents the synthesised interview responses to each question, which, taken together, ultimately inform the rating scale modification when the RSF phase concludes.
188 6 Rating Scale Prevalidation and Modification
Interview-Q1: Is it possible that teacher raters and peer raters would misunderstand the rating scale because of the diction in the various band descriptors?
The three expert raters unanimously agreed that, on the whole, the rating scale features clear wording across the bands. However, a few places could be improved so that misunderstanding, if any, is reduced to a minimum.
1. Rater_A pointed out that the wording "foreign accent" in the dimension of "Pronunciation and Intonation" could cause misunderstanding because "foreign accent" cannot be equated with "Chinese accent". Rater_A therefore proposed that "foreign accent" be changed to "Chinese transfer" so that raters would have clearer reference points for what should be observed and compared. This change makes sense because the intended raters are Chinese EFL teachers and learners, for whom "Chinese transfer" might be more directly comprehensible in this particular context.
2. Rater_A also held that the wording "flexibility" in the dimension of "Grammar and Vocabulary" needs clarification, especially regarding what "flexibility" means with respect to syntactic variation. The other two expert raters agreed on the need for such clarification. Rater_C suggested deleting the word "flexibility" because "range of syntactic variation" already largely covers what "flexibility" is intended to denote.
3. Rater_C observed that the rating scale would be presented to raters in its English version, in which context certain unfamiliar terms could confuse peer raters. For example, the dimension of "Discourse Management" contains such terms as "coherence", "cohesion", "connectors" and "discourse markers". While EFL teachers might have a basic understanding of these terms through their research experience, peer raters would find them challenging and be bewildered as to what to observe and what the terms really refer to. However, when suggestions on how to address this flaw were invited, all the expert raters preferred maintaining an English-only rating scale rather than converting it into a bilingual one. How to resolve this issue is therefore taken up in the discussion below.
4. Rater_B pointed out that "expressiveness" in the dimension of "Grammar and Vocabulary", as in "occasional grammatical errors without reducing expressiveness", could be confusing to peer raters. To facilitate their understanding, Rater_B suggested rephrasing the descriptor as "with the intended meaning maintained" so that the intended meanings on the rating scale would be more accessible to users.
5. Rater_B also drew attention to the dimension of "Nonverbal Delivery", in whose descriptors "changeable eye contact" might cause misunderstanding, because raters could be disoriented by the possible opposing pair of "changeable" and "unchangeable", the latter suggesting the opposite extreme of one possible interpretation. The researcher explained that this descriptor was derived from the relevant findings in the AB phase and that "changeable" here refers to a candidate's ability to establish and switch eye contact with different addressees in group discussion as turn-taking occurs. Rater_B therefore suggested replacing it with "manageable" or "controllable" and emphasised conveying the proper interpretation of this wording during rater training.
6. Rater_C remarked that a descriptor in Band 2 of the "Nonverbal Delivery" dimension contains "regulatory gesture", which might be elusive to raters. The researcher responded that "regulatory gesture" was adopted from a previous taxonomy of gesture functions. Rater_C thought it more advisable to avoid technical terms in the descriptors for the sake of comprehensibility. In this context, a descriptor such as "gestures not conducive to verbal language conveyance", the opposite of the corresponding Band 5 descriptor in the same dimension, might be crisper.
Interview-Q2: Is there any need to add more dimensions of descriptors to the rating scale? If so, what should be added?
All the expert raters believed that although the rating scale has only four assessment dimensions, each dimension actually covers multiple traits to be observed. Hence, the scale as a whole already reflects the comprehensiveness of communicative competence as inspired by the CLA model. If one more dimension were added, the practicality of the rating scale might be jeopardised, as teacher and peer raters would be overburdened with too many dimensions during scoring. As the expert raters perceived, from a cognitive perspective, five assessment dimensions might be the maximum cognitive load for raters; any additional domain would distract raters' attention in real rating practice. Moreover, the limited time of group discussion available for on-the-spot scoring also renders a rating scale with more than four dimensions impractical.
Interview-Q3: Is there any need to delete descriptors that would most likely fail to distinguish candidates across different proficiency levels? If so, what should be deleted?
The expert raters suggested that two kinds of descriptors should be considered for deletion. One kind comprises redundant descriptors that are almost fully explained by other descriptors within the same band, viz. "overlapping descriptors". The other comprises descriptors that would not perceivably function well or carry as much discriminating power as expected, viz. "weak descriptors". The following collects the experts' viewpoints on both scenarios.
1. Rater_C pointed out that, in the dimension of "Grammar and Vocabulary", the rating scale features the descriptors "[a]lmost all sentences are error-free" (Band 4) and "[f]requent error-free sentences" (Band 3). However, both descriptors overlap with, or are largely accounted for by, the related descriptors concerning sentential accuracy, such as "[a] range of syntactic variations with occasional inaccuracy" (Band 4). The descriptors in this respect may therefore be deleted.
2. Both Rater_B and Rater_C cast doubt on the feasibility of assessing "idiomatic chunks" as described in the dimension of "Grammar and Vocabulary", because judging whether a chunk is idiomatic depends substantially on a rater's own language proficiency and sensitivity to degrees of idiomaticity. Although incorporating a judgment on chunk idiomaticity is highly recommendable in principle, the potential subjectivity introduced for end-users of the rating scale might be proportionately problematic. In addition, the expert raters noted that the rating scale describes chunk idiomaticity from Band 3 through Band 5, yet no such descriptors appear in the bottom two bands. Furthermore, in describing chunk idiomaticity, the scale jumps abruptly from "frequent use" (Band 5) to "infrequent use" (Band 4) between adjacent bands. Given the above, it is doubtful whether chunk idiomaticity, with the expected power to distinguish candidates of various proficiency levels, should be embedded in the rating scale.
3. Rater_A echoed Rater_B and Rater_C and also observed that modifiers such as "rare" and "occasional" might be interpreted variably or inconsistently, depending on raters' subjective judgment. Rater_A therefore proposed that the two adjacent bands be integrated into one. This is partly because such a solution can avoid rater leniency or harshness incurred by subjective judgments of frequency adverbials, and partly because the three expert raters estimated that even candidates with foreseeably excellent performance could only be categorised under a mix of the descriptors of the top two bands. Accordingly, Rater_A preferred a reduction of the top two bands and suggested that they be condensed into a single band.
During the extended discussion and occasional digression, the three expert raters also offered a good number of insightful suggestions on how the top two bands could be interwoven. In summary, most modifiers in the descriptors were softened (e.g. avoiding absolute wording) so that the revised top band of the rating scale would represent a near-native proficiency level, guided by the notion of communicative competence in the context of group discussion.
Interview-Q4: Can the adjacent bands really reflect gradable descriptions of communicative competence in the context of group discussion? Is there any possibility that two adjacent bands overlap too vaguely?
The expert raters' feedback on the last point of RSF-III-Q3 naturally led to a discussion of RSF-III-Q4. Rater_B and Rater_C shared their perceptions of slight vagueness and overlap between Band 4 and Band 5, noting that only candidates with the best performance in assessment settings might align partially with the descriptors of Band 4 and partially with those of Band 5. Rater_B commented as follows.
I would think it challenging for either teacher raters or peer raters to distinguish the shades of difference in the descriptors between Band 4 and Band 5. In addition, very few candidates in the Chinese EFL context would be able to reach the ideal proficiency level of Band 5. So what can be suggested is that the top two levels be integrated into one single level. Compared with the bottom two levels, the top two levels can be somewhat overlapping in their respective descriptors.
As such, it is worth considering reducing the number of bands from five to four. The detailed revision is unfolded in the next section.
Interview-Q5: How is the layout of the rating scale? Is it easy and friendly for teacher raters and peer raters to understand and use?
As this question involved the practicality of the proposed rating scale, the expert raters spoke freely at their discretion. Most of their inclinations were well informed by their professional practice in using the proposed rating scale, as well as by their previous experience in monitoring rating quality for large-scale, high-stakes assessments.
All the expert raters thought that although presenting the rating scale with extreme modifiers on both ends of a five-point continuum would help remind raters of what should be assessed in each dimension, certain side effects might also arise: too many descriptions or domains would have to be observed in one rating, in which case raters would be more distracted than reminded. Rater_B therefore recommended that the rating scale be physically composed of two parts only: one part with all the detailed band descriptors for rater training and reference, and the other a separate sheet on which raters assign marks for each assessment dimension.
The above are the excerpts and analyses drawn from the group interview conducted after the expert raters had completed scoring the 20 samples of group discussion. Beyond the structured questions, the expert raters also foregrounded rater training. They unanimously and emphatically placed the significance of rater training in the limelight, believing that without it raters would fail to reach a shared understanding of the rating scale descriptors. In that case, scoring results would not be generalisable to other contexts, nor could this study ensure that the intended construct would be measured consistently by raters in the Chinese EFL context. For instance, Rater_A pinpointed the importance of rater training as follows.
Rater training, no matter whether for teacher raters or learner raters, is quite essential for this validation study because, in this way, consensus can be reached concerning some key areas to be observed. Also, rater training, especially on the side of peer raters, can be indispensable, as this group of raters will be likely to judge on their own with little consideration of concordance with what is described in the rating scale.
6.4 Discussion
Departing from both quantitative and qualitative aspects, the previous section shed light on the analyses of the inter-dimension correlations and of how the expert raters perceived the usefulness of the proposed rating scale. Generally speaking, when the subscores assigned by the expert raters were correlated, the dimensions proved to be highly correlated with each other, indicating that the expert raters, after training, were able to measure candidates' communicative competence in group discussion consistently, with a shared construct as reflected in the proposed rating scale. The only exception occurs in the correlations of Dimension 3 (Discourse Management) with the other assessment dimensions. This might be because raters needed to observe various aspects contributing to candidates' competence in managing their discourse, possibly leading to slight divergence in the scoring results. However, as the correlation coefficients still satisfactorily meet the basic requirements for examining the construct validity of the rating scale, its validity can be preliminarily verified in that sense. Based on the qualitative findings and analyses from the interview, the following part discusses, in five facets, how the previously proposed rating scale should be revised in light of the prevalidation analyses.
First, in order to minimise probable misunderstanding of the rating scale caused by descriptor wording, this study takes the advice of the three expert raters. As elicited from the interview, unclear wording involves either ambiguous phrasing or frequency adverbials that trigger differences in raters' subjective judgment. It is therefore essential to revise flawed and ambiguous wording with the aim of removing problematic perceptions from the rating process. This phase of the study accordingly revises the wording problems detected by the expert raters as analysed above. The modified wording, in accordance with the experts' suggestions, is reflected in the revised rating scale in the next section.
Second, there were two options for accommodating peer raters' difficulties in comprehending certain terms adopted in the rating scale descriptors. One option was to provide more examples for peer raters to facilitate their observation and judgment. However, after consultation with the expert raters, the other option was favoured: the examples would not be rendered explicitly on the rating scale; instead, more exemplification would be provided during peer-rater training, together with rated samples of group discussion, so that peer raters would not only know what is meant by such terms as "discourse markers" and "connectors" but could also familiarise themselves with vivid, anchorable examples in training. The issue of unfamiliar terms in the rating scale is thus addressed by means of more informative explanation in the rater training process.
Third, concerning the doubts arising from possibly weak descriptors in certain bands of the rating scale, this study responds by deleting the few descriptors that can be largely explained by an integration of other descriptors within the same band. For instance, as stated above, it is unnecessary to include the descriptor "almost all sentences are error-free" in the dimension of Grammar and Vocabulary for Band 4 because it is substantially covered by "accurate syntactic variation". Therefore, inspired by the expert raters' suggestion, the descriptors that could not foreseeably distinguish candidates across different proficiency levels were eliminated, as reflected in the revised version of the rating scale below.
Fourth, as suggested by the expert raters' feedback in the group interview, this study needs to consider whether the proposed five-band rating scale should be contracted to four bands. The interview analysis indicates that the expert raters foresaw that, at the cost of losing finer distinctions between the top two adjacent bands, a four-band rating scale would be more feasible and rater-friendly, and in particular more practical in identifying the top performers in the assessment. Candidates who perform extraordinarily well can first be categorised into the top band and then provided with more detailed and pertinent feedback on an individual basis, which also echoes what formative assessment uniquely excels at. In addition, the three expert raters doubted whether many candidates would really be assigned Band 5, given how ideally it is described. A reduction in the number of bands, viz. an integration of the top two bands' descriptors, also resonates with the spoken rating scale calibration of the TOEFL iBT (Chapelle et al. 2008), where a four-band rating scale not only retains discriminating power across a range of proficiency levels as well as a five-band one does, but also spares raters the painstaking effort of choosing among five prescribed levels of descriptors. It is also worth noting that aligning candidates' performance in group discussion with a five-band rating scale could be even more challenging for peer raters, who would rarely assign a five to their peers. Hence, the top band in a five-band rating scale would not be as discriminating as expected; certain Band 5 descriptors will accordingly be attenuated and integrated into the Band 4 descriptors.
Fifth, as all the expert raters expressed concern about possible distraction from having more than one reminder at each end of the continuum on the rating scale, this study needs to consider rearranging its layout. Following the expert raters' proposition, the rating scale will retain only the names of the assessment dimensions, while the words placed at the ends of the continuum will be discarded.
Another concern is the necessity of rater training. This issue was not prioritised in the interview questions but was brought forth among the top concerns after consultation with the expert raters, because if raters are not rigorously trained, their understandings are prone to diverge. In addition, to enhance scoring reliability, the rater training process should be deemed an ingredient that helps yield consistent rating results if another group of teacher raters or peer raters is invited to score the samples in this research.
As judged and suggested by the expert raters, this phase of the study produced a rating scale ready for the validation phases. Table 6.5 presents the revised full version of the rating scale, with the necessary modifications of descriptor wording, band ranges and layout.
Table 6.5 The revised rating scale

Band 4
Pronunciation and Intonation: almost no listener effort for intelligibility, with acceptable slips of the tongue only; almost no foreign accent of Chinese transfer; occasional mispronunciation; flexible stress on words and sentences; correctness and variation in intonation at the sentence level
Grammar and Vocabulary: almost no detectable grammatical errors, with only self-repaired minor lapses; a range of syntactic variations (complex and simple structures) with accuracy; vocabulary breadth and depth almost sufficient for natural and accurate expression
Discourse Management: rare repetition or self-correction; effective use of fillers to compensate for occasional hesitation(s); general coherence and cohesion achieved by controllable use of connectors and discourse markers; topic discussed with reasoning, personal experience or other examples for in-depth development, with only minor irrelevance
Nonverbal Delivery: frequent, controllable eye contact with other discussants; frequent and various communication-conducive gestures; evidence of appropriate head nods/shakes

Band 3
Pronunciation and Intonation: detectable accent slightly reducing overall intelligibility; mispronunciations of some words with possible confusion; inappropriate stress on words and sentences reducing meaning conveyance; occasional inappropriate or awkward intonation noticeable at the sentence level
Grammar and Vocabulary: noticeable grammatical errors slightly reducing expressiveness; effective and accurate use of simple structures, with less frequent use of complex structures; vocabulary breadth sufficient for the topic, with less noticeable vocabulary depth; rare use of idiomatic chunks
Discourse Management: a generally continuous flow of utterance can be maintained, yet repetition, self-correction and hesitation over words and grammar are noticeable; coherence and cohesion basically achieved by the use of connectors and discourse markers, though sometimes inappropriately; topic discussed with relevant utterances, but attempts at longer responses are sometimes limited
Nonverbal Delivery: only brief eye contact with other discussants; frequent gestures lacking in variety; head nods/shakes detectable, but sometimes inappropriate

Band 2
Pronunciation and Intonation: effort needed in sound recognition for intelligibility; detectable foreign accent that sometimes causes
Grammar and Vocabulary: noticeable grammatical errors seriously reducing expressiveness; fairly accurate use of simple structures, with inaccuracy in complex structures
Discourse Management: frequent repetition, self-correction and long noticeable pauses over words and grammar; constant use of only a limited number of connectors and discourse markers for
Nonverbal Delivery: infrequent eye contact with other discussants; gestures mostly for non-communicative purposes; inappropriate head nods/shakes

Dimension Score
6.5 Summary
This chapter has dwelt on the prevalidation of the rating scale based on the expert raters' scoring of the 20 samples of group discussion and their judgments concerning possibly problematic wording, discriminating power and other issues of practicality. The experts' judgments and suggestions on the de facto use of the rating scale have informed a multifaceted modification of the rating scale descriptors, band ranges and layout. In addition, the significance of rater training, for both teacher raters and peer raters, is re-emphasised as another outcome of this phase of the study.
This chapter thus serves as a bridge between the formulation and the validation of the rating scale, in which the construct validity and certain practical issues of the tentatively proposed scale were initially examined. The end product of the RSF phase is a revised and supposedly more rater-friendly version of the rating scale, paving the way for the large-scale validation in the next phase.
References
Alderson, J.C. 1993. Judgments in language testing. In A new decade of language testing
research: Selected papers from the 1990 Language Testing Research Colloquium, ed.
D. Douglas, and C. Chapelle, 46–50. Washington, DC: Teachers of English to Speakers of
Other Languages Inc.
Altman, D.G. 1991. Practical statistics for medical research. London: Chapman and Hall.
Chapelle, C.A., M.K. Enright, and J. Jamieson (eds.). 2008. Building a validity argument for the Test of English as a Foreign Language. New York: Routledge.
Fleiss, J.L. 1981. Statistical methods for rates and proportions, 2nd ed. New York: Wiley.
Gwet, K.L. 2012. Handbook of inter-rater reliability: The definitive guide to measuring the extent
of agreement among multiple raters, 3rd ed. Gaithersburg: Advanced Analytics LLC.
Landis, J.R., and G.G. Koch. 1977. The measurement of observer agreement for categorical data.
Biometrics 33: 159–174.
Chapter 7
Rating Scale Validation: An MTMM Approach
On the basis of the rating scale formulated and further revised, the research project proceeds to the validation stage, where the rating scale undergoes a larger-sample validation with the quantitative method previously elaborated, so that the proposed rating scale can be shown to be statistically robust in validly measuring the anticipated construct of communicative competence in candidates' performance in group discussion. At issue in this phase of the study is whether the revised rating scale can be validated through the observation of multiple assessment dimensions coupled with different rating methods. This chapter will first outline certain methodological issues concerning how the quantitative validation, viz. MTMM, is conducted in the RSV-I phase, and then analyse the data, especially the goodness-of-fit statistics, in line with Widaman's (1985) framework of alternative model comparison, to probe whether, and if so how, the different assessment dimensions of the rating scale are modelled.
With the quantitative validation of the revised rating scale as its primary point of departure, this phase of the study has two subsidiary objectives: (1) to compare the alternative CFA MTMM models and select the one that best fits the dataset, namely 100 samples of group discussion; and (2) to determine the parameter estimates of the selected model in order to investigate the extent to which each trait and method factor contributes to it. As such, this phase addresses a single research question: To what extent do different rating methods measure the construct of communicative ability as reflected by the different assessment dimensions of the proposed rating scale?
In reporting the findings of the quantitative validation of the revised rating scale, this subsection unfolds in three consecutive parts. First, the baseline CFA MTMM model specific to the present study, together with all the alternative models, is displayed and probed to explore a range of model-fit indices. Second, as the selection of the best-fitting model largely depends on convergent validity, discriminant validity and the absence of method bias, a triangulated comparison is made to see which model fits the data appropriately and effectively. The last part determines the parameter estimates of the selected model so as to further examine how each factor functions within the model and correlates with the others.
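As one hedged illustration of the nested-model comparisons in Widaman's framework (the chi-square and degrees-of-freedom values below are invented placeholders, not results from this study), the significance of the fit difference between two nested CFA MTMM models can be checked as follows:

```python
from scipy.stats import chi2

# Invented goodness-of-fit statistics for two nested CFA MTMM models:
# a full trait-method model and a restricted comparison model.
chisq_full, df_full = 52.4, 38
chisq_restricted, df_restricted = 88.9, 46

# Widaman-style nested comparison: the chi-square difference is itself
# chi-square distributed, with df equal to the difference in df.
delta_chisq = chisq_restricted - chisq_full
delta_df = df_restricted - df_full
p_value = chi2.sf(delta_chisq, delta_df)
print(round(delta_chisq, 1), delta_df, round(p_value, 4))
```

A significant difference would indicate that the restriction (e.g. dropping the method factors) worsens fit, favouring the fuller model.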
From the perspective of basic model composition, MTMM models contain a
series of linear equations relating dependent variables to independent variables.
Dependent variables are defined as those receiving a path from another variable in
the model and thus appear on the left-hand side of an equation (Kline 2005). In the
case of the present research, the dependent variables are equated with four
assessment dimensions, namely (F1) pronunciation and intonation (PI), (F2)
grammar and vocabulary (GV), (F3) discourse management (DM) and (F4) non-
verbal delivery (ND), which, in an integrated manner, comprise the underlying
communicative ability in the context of group discussion via the rating by teachers
(F5) and peers (F6). On the other hand, independent variables are those that
originate paths but do not receive a path and appear on the right-hand side of an
equation. In this study, the observed variables, viz. all the analytic scores
assigned by teacher and peer raters, serve as the independent variables, and are
conventionally drawn as squares arranged vertically in the centre of the model
diagram. The basic layout of the model can be seen in the figures accompanying
the research findings below.
Table 7.1 outlines the univariate and multivariate statistics for model assumption
checks. Univariate normality is usually tested by referring to skewness and kurtosis.
If skewness and kurtosis values fall within |3.30| (z score at p < 0.01), univariate
normality can accordingly be recognised (Tabachnick and Fidell 2007). As
indicated in Table 7.1, all the skewness and kurtosis values fall within
|1.38| (z score at p < 0.01), showing that the data present univariate
normality. As regards multivariate normality, Mardia's normalised estimate was
checked, with values of 5.00 or below considered to indicate multivariate normality
(Byrne 2006). Table 7.1 also shows that Mardia's normalised estimate is
4.8345, indicating that the observed data do not violate the assumption of
multivariate normality. With the above model assumptions checked, the ensuing section
can reassuringly proceed to conducting the three steps concerning model devel-
opment, comparison and parameter estimate determination specified in Widaman’s
(1985) framework of MTMM model comparison.
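The two screening rules above can be sketched in a few lines of Python. Note that the standard errors sqrt(6/N) and sqrt(24/N) used to convert raw skewness and kurtosis into z statistics are common large-sample conventions assumed here; the study itself does not spell out its formulas.

```python
import math

def skew_z(skewness, n):
    # z statistic for skewness under the common large-sample SE sqrt(6/N)
    return skewness / math.sqrt(6.0 / n)

def kurt_z(kurtosis, n):
    # z statistic for kurtosis under the common large-sample SE sqrt(24/N)
    return kurtosis / math.sqrt(24.0 / n)

def univariate_normal(z_skew, z_kurt, bound=3.30):
    # Both z-statistics must fall within |3.30| (z at p < 0.01;
    # Tabachnick and Fidell 2007)
    return abs(z_skew) <= bound and abs(z_kurt) <= bound

def multivariate_normal(mardia_normalised, cutoff=5.00):
    # Mardia's normalised estimate of 5.00 or below is taken to indicate
    # multivariate normality (Byrne 2006)
    return mardia_normalised <= cutoff

# Hypothetical raw skewness/kurtosis for one observed variable, N = 100
print(univariate_normal(skew_z(0.30, 100), kurt_z(-0.42, 100)))  # True
# Mardia's normalised estimate reported in the study
print(multivariate_normal(4.8345))                               # True
```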
Fig. 7.1 The baseline CFA MTMM model (Model 1). PI Pronunciation and Intonation, GV
Grammar and Vocabulary, DM Discourse Management, ND Nonverbal Delivery, T-rating
Teacher-rating, P-rating Peer-rating
The first model, representing the hypothesised CFA MTMM model as shown in
Fig. 7.1, is the baseline model against which all the subsequent alternative MTMM
models are compared. This model designates the traits (assessment dimensions) to
be correlated in pairs and the scoring methods independent of each other. The
reason why the baseline model is designed with a consideration of uncorrelated
scoring methods is that either teacher-rating or peer-rating should be regarded as
unique. Since in MTMM models estimating the factor loadings is the primary focus,
instead of fixing factor loadings, the variances of factors are fixed to 1 for the
purpose of model identification. Therefore, all factor loadings and the covariances
among the trait factors are freely estimated. However, just as previously
justified, covariances among method factors are constrained to be 0 in the
baseline model, given that each scoring method is unique. As is shown in
Table 7.2, the fit
indices indicate the baseline model (Model 1) provides a good fit for the data
(χ2(28) = 462.796, p = 0.818; CFI = 1.000; NNFI = 1.024; SRMR = 0.015;
RMSEA = 0.000; 90 % C.I. = 0.000, 0.060).
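A minimal sketch of how such fit indices can be screened against cutoffs. The thresholds below (CFI at or above 0.95, SRMR at or below 0.08, RMSEA at or below 0.06) are widely used rules of thumb and are an assumption here, not necessarily the criteria adopted in this study.

```python
def acceptable_fit(cfi, srmr, rmsea,
                   cfi_min=0.95, srmr_max=0.08, rmsea_max=0.06):
    # Screen a model's fit indices against rule-of-thumb cutoffs
    checks = {
        "CFI": cfi >= cfi_min,
        "SRMR": srmr <= srmr_max,
        "RMSEA": rmsea <= rmsea_max,
    }
    return all(checks.values()), checks

# Baseline model (Model 1) indices reported in the text
ok, detail = acceptable_fit(cfi=1.000, srmr=0.015, rmsea=0.000)
print(ok)    # True

# Model 2 indices: the CFI check fails, so the model is rejected
ok2, _ = acceptable_fit(cfi=0.528, srmr=0.106, rmsea=0.043)
print(ok2)   # False
```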
Model 2 specifies that no trait is present in the model, only the two scoring
methods, themselves uncorrelated, as is displayed in Fig. 7.2.

Table 7.2 Fit indices for the baseline model (Model 1)
Bentler–Bonett normed fit index = 0.995
Bentler–Bonett non-normed fit index = 1.024
Comparative fit index (CFI) = 1.000
Root mean square residual (RMR) = 0.008
Standardised RMR = 0.015
Root mean square error of approximation (RMSEA) = 0.000
90 % confidence interval of RMSEA = (0.000, 0.060)

As indicated by the goodness-of-fit statistics shown in Table 7.3, the fit for
this model is extremely poor (χ2(19) = 59.716, p = 0.000; CFI = 0.528;
NNFI = 0.894; SRMR = 0.106; RMSEA = 0.043; 90 % C.I. = 0.076, 0.136),
indicating that this model cannot plausibly explain the observed data.
Following Model 2, which eschews any trait, Model 3, as displayed in
Fig. 7.3, integrates all the observed variables into one latent variable,
Communicative Language Ability. As with the baseline model, each observed
variable loads on both a trait and a method factor in Model 3. However, the
correlations among the trait factors are fixed to 1, thus treating the four
traits as one overall "umbrella factor". As is shown in Table 7.4, the
goodness-of-fit results indicate that the fit of this model is marginally
acceptable, albeit substantially worse than that of the baseline model
(χ2(11) = 37.116, p = 0.000; CFI = 0.854; NNFI = 0.882; SRMR = 0.056;
RMSEA = 0.111; 90 % C.I. = 0.073, 0.151).
As presented in Fig. 7.4, another alternative model is Model 4, which differs
from the baseline model only in that the trait factors are left uncorrelated.
This lack of correlation among the traits enables a comparison that provides
evidence of the extent to which the traits are distinct from one another. The
fit indices shown in Table 7.5 reveal that Model 4 does not meet the
statistical criteria of fit (χ2(12) = 84.882, p = 0.000; CFI = 0.871;
NNFI = 0.699; SRMR = 0.211; RMSEA = 0.178; 90 % C.I. = 0.143, 0.213).
Model 5, as displayed in Fig. 7.5, can be regarded as the least restrictive
model (Schmitt and Stults 1986; Widaman 1985) in that both trait and method
factors are specified and correlations among the traits and among the methods
are both allowed. Comparing this model with the baseline model provides the
discriminant evidence related to the method factors. A review of the
goodness-of-fit results shows that this model fits the data exceptionally well
(χ2(5) = 454.251, p = 0.813; CFI = 0.998; NNFI = 1.017; SRMR = 0.019;
RMSEA = 0.009; 90 % C.I. = 0.000, 0.079). However, as Model 5 correlates the
two method factors, whether this model is more interpretable in the context of the
present study still needs to be further explored and accounted for in the follow-up
discussion (Table 7.6).
The final CFA MTMM model is illustrated in Fig. 7.6. In this model, a
higher-order factor perceived as communicative language ability in group discus-
sion affects the rating on all observed variables through the first-order factors. As
previously noted, the fit indices of this model are assumed to be the same as
those of the baseline model, because a higher-order model with four first-order
factors is statistically equivalent, in terms of fit, to the corresponding
correlated four-factor model (Rindskopf and Rose 1988; Shin 2005). Nonetheless, this model has more
explanatory power regarding the inter-factor covariances when the factors are
highly correlated with each other.
The previous subsection has examined the goodness-of-fit results of all the
suggested alternative MTMM models. In this subsection, in determining the
evidence of construct validity of the proposed rating scale at the matrix
level, the baseline model is compared with the other four CFA MTMM models,
noting that Model 1 and Model 6 are intrinsically the same. Goodness-of-fit
indices for all six MTMM models are summarised in Table 7.7.
As observed earlier, the evidence of construct validity is twofold: convergent
validity and discriminant validity. The first criterion provides the basis for
judging convergent evidence among the trait factors. Using Widaman's (1985)
approach, this study
compares Model 1 with the model whose traits are not specified (Model 2).
A significant χ2 difference (Δχ2) between the two models represents convergent
evidence among the traits. Cheung and Rensvold (2002) also suggest that difference
in CFI (ΔCFI), the value of which exceeds 0.01 within the context of invariance
testing, should also serve as the yardstick of significant difference. In the case of the
present study, as indicated in Table 7.8, a comparison between Model 1 and
Model 2 yields Δχ²(9) = 403.08, a highly significant difference (p < 0.001),
with ΔCFI = 0.472 constituting a substantial difference as well.
The evidence of discriminant validity is sought not only from the perspective
of the trait factors but is also assessed in terms of the method factors. The first comparison is made
between the models whose traits are freely correlated (Model 1) and the one in
which traits are perfectly correlated, namely with a single trait (Model 3). The
comparison results shown in Table 7.8 indicate a significant difference
(Δχ²(17) = 425.68, p < 0.001) and a sizeable CFI difference (ΔCFI = 0.146),
revealing the anticipated evidence of discriminant validity among traits. On the
other hand, as Model 4 features uncorrelated traits, a comparison between Model 1
and Model 4 would be able to suggest the extent to which each trait factor is
separable from the others. As indicated in Table 7.8, the comparison between
Model 1 and Model 4 yields an exceedingly significant difference
(Δχ²(16) = 377.914, p < 0.001) and a ΔCFI greater than 0.01 (ΔCFI = 0.129),
neither of which departs excessively from the acceptable range, once again
supporting a close relationship among the trait factors.
Based on the same logic, though in reverse, the second comparison is made to
test the evidence of discriminant validity regarding method factors, where the
baseline model with uncorrelated methods is compared with a freely correlated
model (Model 5). As explained above, Model 5 imposes the fewest restrictions
and is thereby less restrictive than the baseline model.
What is noteworthy is that a more restricted model with more degrees of freedom
can be a stronger candidate model in that it has to withstand a greater chance of
being rejected (Raykov and Marcoulides 2006). Against this backdrop, the
larger the discrepancy in Δχ² and ΔCFI values between Model 1 and Model 5, the
weaker the support for evidence of discriminant validity between method factors
would be. Table 7.8 also shows an insignificant Δχ²(23) = 8.545 (p > 0.05)
and an almost negligible ΔCFI of 0.004. In that case, evidence of a
discriminant relationship between the method factors can be collected, and it
can fairly be argued that the observed data present a minimal effect of common
method bias across methods of measurement.
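The nested-model comparisons above can be sketched as follows. The chi-square p-value here uses the Wilson-Hilferty cube-root normal approximation, a substitution of my own for whatever exact routine the study's software used; it is adequate for screening but not exact.

```python
import math
from statistics import NormalDist

def chi2_sf(x, df):
    # Upper-tail p-value of a chi-square statistic via the Wilson-Hilferty
    # cube-root normal approximation (a screening approximation, not exact)
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * df))) \
        / math.sqrt(2.0 / (9.0 * df))
    return 1.0 - NormalDist().cdf(z)

def nested_diff(chi2_a, df_a, chi2_b, df_b, cfi_a, cfi_b):
    # Delta-chi2 significance plus the |Delta-CFI| > 0.01 yardstick
    # (Cheung and Rensvold 2002) for two nested models
    d_chi2 = abs(chi2_a - chi2_b)
    d_df = abs(df_a - df_b)
    return chi2_sf(d_chi2, d_df), abs(cfi_a - cfi_b)

# Model 1 vs Model 2: Delta-chi2(9) = 403.08, Delta-CFI = 0.472
p, d_cfi = nested_diff(462.796, 28, 59.716, 19, 1.000, 0.528)
print(p < 0.001 and d_cfi > 0.01)   # True, convergent evidence among traits

# Model 1 vs Model 5: Delta-chi2(23) = 8.545, insignificant
p2, _ = nested_diff(462.796, 28, 454.251, 5, 1.000, 0.998)
print(p2 > 0.05)                    # True, methods remain discriminable
```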
In line with the requirements of CFA MTMM model comparison, what is found
above, to a great extent, demonstrates that the data in this phase of study are
characterised by satisfactory convergent relationship among the traits and dis-
criminant relationship among the traits and between the methods. As model com-
parison at the matrix level is only able to provide a global assessment of evidence of
construct validity (Byrne and Bazana 1996), individual parameter estimates would
be subsequently examined so that the trait- and method-related variance could be
evaluated more precisely.
However, before proceeding to parameter estimates, determination should be
made as to which candidate model can be selected as the final model. The previous
research findings have pinpointed that Models 1 and 6 feature better goodness-of-fit
results than Models 2, 3 and 4 and that Models 1 and 6 are also more interpretable
than Model 5 in the sense that scoring methods should be regarded as individually
unique in lieu of being interrelated. Therefore, Models 1 and 6 could stand out to be
better candidates than the other ones, yet the issue of selecting between Model 1
and Model 6 remains to be resolved. The difference between these two models,
as previously noted, is that the latter is a higher-order factor model, which
supports selecting Model 6 as the final model: within it, the trait factors
are not only closely related to each other but also correlated with a
higher-order factor. In that sense, Model 6 is more parsimonious and
interpretable given the hypothesised notion of CLA. Thus, the factor loadings
of and the correlations within Model 6 are further investigated below.
Considering the comparison results of model fit and parsimony, the above findings
have led to the selection of Model 6 with a higher-order factor as the final model. In
order to seek a more precise assessment of construct validity, the extent of
variance accounted for by the trait and method factors is examined, and the
corresponding factor loadings and error variances of Model 6 are accordingly
outlined in Table 7.9. All the factor loadings are standardised parameter
estimates, which have been scaled to a mean of 0 and a standard deviation of 1
(Byrne and Bazana 1996). Bollen (1989) argues that standardised parameter
estimates are more
useful than unstandardised parameter counterparts for interpretability because the
former is more powerful in reflecting the relative sizes of the factor loadings in a
model. As such, all the factor loadings outlined in Table 7.9 are standardised
parameter estimates.
In examining individual parameters, convergence is reflected in the magnitude of
the trait loadings. The more significant and substantial the factor loadings,
the more evidence of convergence among traits and methods can be collected. As
indicated in Table 7.9, all the trait factor loadings are significant and
substantial, indicating overall convergent evidence of construct validity.
Given the loadings of the four assessment dimensions on the underlying
higher-order factor CLA (PI = 0.990, GV = 0.998, DM = 0.991, ND = 0.991), a
reasonably sound indicator of CLA comprising the above four dimensions of the
rating scale can be established at the parameter level. At the same time, such
high first-order factor loadings temper the evidence of discrimination, which
is typically determined by examining the factor correlation matrices or, in
this case, the higher-order factor loadings.
When factor loadings are compared across traits, methods and error variances,
the proportion of trait variance exceeds the corresponding error variance for
all observed variables except Discourse Management (DM) rated by peers. The
factor loading of DM_P on DM is 0.405, slightly lower than the corresponding
error variance of 0.526. This means that when the dimension of discourse
management was rated by peers, more measurement error might occur, likely
because peer raters failed to capture, or to assess as accurately as teacher
raters, the candidates' de facto performance in managing their discourse.
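The variance reasoning behind this comparison can be made explicit. In a standardised CFA solution, a squared loading is the share of an observed variable's variance attributable to that factor; the figures below are the DM_P estimates reported above, and it should be noted that the text compares the raw loading (0.405) with the error variance (0.526), whereas the trait's actual variance share is the squared loading.

```python
def trait_share(trait_loading, error_variance):
    # Squared standardised loading = proportion of the indicator's variance
    # captured by the trait factor
    share = trait_loading ** 2
    return share, share < error_variance

share, error_dominates = trait_share(0.405, 0.526)
print(round(share, 3))   # 0.164, the DM trait explains roughly 16 % of DM_P
print(error_dominates)   # True, measurement error dominates the trait share
```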
Discriminant validity bearing on particular traits and methods is determined by
examining the factor correlation matrices, as shown in Table 7.10. Conceptually
and ideally, although correlations among traits should be almost negligible to sat-
isfy evidence of discriminant validity, “such findings are highly unlikely in general
and with respect to psychological data in particular” (Byrne 2006, p. 344).
Generally speaking, the coefficients among the traits in Table 7.10 are below
moderate, which indicates that the four assessment dimensions are not
interdependent. One exception is the correlation between DM and ND, which
reaches 0.673, making these two traits more than moderately correlated.
Since the previous findings have revealed that the proposed model is a
higher-order factor model and that the higher-order factor loads heavily on
the four assessment dimensions, the generally below-moderate correlations
among the traits are understandable.
Finally, an examination of the method factor correlation touches upon the
discriminability of the methods and thus the extent to which they are
maximally dissimilar, an important underlying assumption of an MTMM approach.
Given the obvious dissimilarity between teacher-rating and peer-rating, it is
not surprising to find a statistically insignificant correlation of 0.080
between the two scoring methods, as shown in Table 7.10.
7.4 Discussion
The above research findings have been presented in three aspects. The first aspect
addresses the goodness-of-fit results of the baseline CFA MTMM model and the
alternative models. In line with the predetermined criteria drawn from the literature,
Models 1, 5 and 6 can be regarded as well-fitting. On the grounds of
interpretability, Model 5 has been eliminated because it correlates the two
scoring methods, which should instead be regarded as unique. Finally, Model 6
is selected as the final model given its interpretability and its consistency
with previous studies on the taxonomy of speaking ability and language
ability as a whole. In
particular, Model 6 would be soundly supported by Sawaki’s (2007) research,
where a validation study of assessment scales is conducted for L2 speaking ability
for the purpose of student placement and diagnosis. Her analysis also shows that
speaking ability consists of several dimensions yet with an underlying higher-order
ability. In addition, such a hierarchical model of L2 communicative language ability
has received extensive support from other well-documented studies as well (e.g.
Bachman and Palmer 1989; Llosa 2007; Sawaki et al. 2009; Shin 2005).
In the second place, pairs of hierarchically nested models are compared using
chi-square difference tests to determine whether the assessment dimensions display
convergence, discrimination and method effects. In terms of global model fit,
evidence of convergence, discrimination and method effects is found in the final
model. Nonetheless, when it comes to the third aspect, a closer inspection of
the individual parameter estimates, a slightly more nuanced picture emerges.
On the one hand, extremely high factor loadings on the higher-order factor
Communicative Language Ability are obtained, further confirming the perceived
CFA MTMM model and thus lending support to the construct validity of the
revised rating scale. On the other hand, as found above, the factor loading of
DM_P on DM is 0.405, below the corresponding error variance of 0.526. This
means peer raters might experience certain difficulty in assessing candidates'
performance in managing their discourse. Part of the reason could be confusing
wording in the band descriptors for Discourse Management. It should be borne
in mind that this assessment dimension incorporates textual competence,
7.5 Summary
In an attempt to gather evidence of construct validity for the revised rating scale,
confirmatory factor analysis of MTMM data was conducted in this research phase.
In general, this phase of study gathered the convergent and discriminant evidence as
well as the absence of method effect that enabled the revised rating scale to validly
address communicative language ability, a higher-order latent factor perceived in
the final CFA MTMM model. Although certain noise was detected in this
validation phase, such as peer raters' possibly improper handling of assessing
discourse management and the unexpectedly high correlation between certain
assessment dimensions, the main causes were expounded and could be largely
attributed to weaknesses inherent in an analytic rating scale per se. In order to take a closer
look at the correlation between candidates’ performance and the scores they were
assigned by teacher and peer raters, the next phase of validation will take a qual-
itative approach so that more arguments can be collected to validate the rating scale
proposed in this study.
References
Bachman, L.F., and A.S. Palmer. 1989. The construct validation of self-ratings of communicative
language ability. Language Testing 6(4): 449–465.
Bollen, K.A. 1989. Structural equations with latent variables. New York: John Wiley and Sons.
Byrne, B.M. 2006. Structural equation modeling with EQS: Basic concepts, applications, and
programming, 2nd ed. Mahwah: Lawrence Erlbaum Associates.
Byrne, B.M., and P.G. Bazana. 1996. Investigating the measurement of social and academic
competencies for early/late preadolescents and adolescents: A multitrait-multimethod analysis.
Applied Measurement in Education 9: 113–132.
Cheung, G.W., and R.B. Rensvold. 2002. Evaluating goodness-of-fit indexes for testing
measurement invariance. Structural Equation Modeling 9(2): 233–255.
Kline, R.B. 2005. Principles and practice of structural equation modeling, 2nd ed. New York: The
Guildford Press.
Llosa, L. 2007. Validating a standards-based classroom assessment of English proficiency: A
multi-trait multi-method approach. Language Testing 24(4): 489–515.
Raykov, T., and G.A. Marcoulides. 2006. A first course in structural equation modeling, 2nd ed.
Mahwah: Lawrence Erlbaum Associates.
Rindskopf, D., and T. Rose. 1988. Some theory and applications of confirmatory second-order
factor analysis. Multivariate Behavioral Research 23: 51–67.
Sawaki, Y. 2007. Construct validation of analytic rating scales in a speaking assessment:
Reporting a score profile and a composite. Language Testing 24(3): 355–390.
Sawaki, Y., L.J. Stricker, and A.H. Oranje. 2009. Factor structure of the TOEFL internet-based
test. Language Testing 26(1): 5–30.
Schmitt, N., and D.M. Stults. 1986. Methodology review: Analysis of multi-trait multi-method
matrices. Applied Psychological Measurement 10: 1–22.
Shin, S.K. 2005. Did they take the same test? Examinee language proficiency and the structure of
language tests. Language Testing 22(1): 31–57.
Tabachnick, B.G., and L.S. Fidell. 2007. Using multivariate statistics, 5th ed. Needham Heights,
MA: Allyn and Bacon.
Widaman, K.F. 1985. Hierarchically tested covariance structure models for multi-trait
multi-method data. Applied Psychological Measurement 9: 1–26.
Chapter 8
Rating Scale Validation: An MDA Approach
In the research design, it has been noted that this phase of study will draw upon the
de facto performance in nonverbal delivery by the candidates and analyse their
performances in an MDA approach reviewed in the literature (see Sect. 2.5.2.3);
therefore, this phase of study is rather straightforward in its research
procedure. The data used in this phase are the candidates' performances and
their respective scores assigned by teacher and peer raters.
As aforementioned, this phase of study qualitatively addresses the
candidates' nonverbal delivery; thus, a number of candidates needed to be
selected for analysis. As only three candidates were to be selected, instead
of conducting stratified random sampling for a larger sample size, this study
consistently targeted the group discussion numbered 50 in each proficiency
group. The second speaker in each selected group was then chosen as the
representative of that proficiency group. The candidates' privacy was protected, as
pseudonyms would be used in the follow-up analyses and descriptions. Table 8.1
outlines the selected candidates with the averaged subscores from teacher- and
peer-rating attached. Tom, Linda and Diana represent elementary, intermediate and
advanced proficiency groups, respectively. From Table 8.1, it can be noticed that
their total scores measured against the rating scale present an ascending order,
meaning that Diana from Group A performed best (total score equal to 14) and Tom
from Group C performed worst (total score equal to 7). A closer look at their
respective subscores on nonverbal delivery (ND) shows that these three
candidates' performance on ND also corresponds to the ordering of their
predetermined proficiency levels. Although there is slight variation between
teacher- and peer-rating in the three cases, any inconsistency is still within
one adjacent band, which can generally be deemed acceptable. More
specifically to ND, the three candidates were assigned 1.5, 3 and 4, respectively, if
the teacher raters’ and peer raters’ scoring results are averaged. Given the quali-
tative approach this phase of research aims to adopt, this score distribution thus
indicates that the randomly selected candidates can be representative of different
levels in light of nonverbal delivery.
Table 8.2 presents additional information about the whole duration of the
group discussion each selected candidate was engaged in, as well as the
cumulative duration of their participation in the group discussion. As can be
seen, both the group discussion length and the duration of the candidates'
verbal participation follow the lowest-to-highest sequence
of their proficiency levels. When these two time parameters are standardised
to seconds, the extent to which the candidates were actually involved in the
group discussion can be profiled. Table 8.2 indicates that Linda from the
intermediate group was involved most (38.85 %), even though the time she spent
speaking in the group discussion (1′ 55″) was shorter than Diana's (2′ 28″).
It is also noteworthy, however, that on average the selected candidates each
verbally engaged in approximately one-third of the whole group discussion,
thus justifying the comparability across the selected candidates. In addition,
Table 8.2 also indicates that Tom and Linda were seated during the group
discussion, while Diana was standing when talking to the other discussants.
Without any intervention from the researcher, these postures were subject to
the candidates' own preference. When the candidates' nonverbal delivery
frequencies were calculated, they were standardised to occurrences in a 5-min
group discussion.
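The standardisation steps above can be sketched as follows. The 296-second total discussion length is an assumed figure chosen only to reproduce the reported 38.85 % involvement for Linda; the eye-contact count is likewise illustrative.

```python
def to_seconds(minutes, seconds):
    # Convert a minute/second pair (e.g. 1'55") to seconds
    return minutes * 60 + seconds

def involvement(speaking_s, discussion_s):
    # Percentage of the discussion in which a candidate verbally engaged
    return 100.0 * speaking_s / discussion_s

def per_five_minutes(count, discussion_s):
    # Standardise a raw frequency to occurrences per 5-min (300-s) discussion
    return count * 300.0 / discussion_s

# Linda spoke for 1'55" (115 s); 296 s total is assumed, not reported
speak = to_seconds(1, 55)
print(round(involvement(speak, 296), 2))    # 38.85
# e.g. 12 eye-contact occurrences in a 296-s discussion
print(round(per_five_minutes(12, 296), 1))  # 12.2
```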
Having specified the demographic and data information of the selected candi-
dates above, the following section will outline an inventory of measures on which
the MDA analyses and the above three aspects of alignment will be based.
In line with the general framework adapted to this study and the integrated
framework (Martinec 2000, 2001, 2004; Hood 2007, 2011) reviewed in the liter-
ature to investigate the metafunctions of candidates’ nonverbal delivery, each
nonverbal channel is examined from the perspectives of its formal manifestations
and the corresponding metafunctions. As how metafunctions are classified has been
previously expounded (see Sections “Nonverbal Delivery: Communicative Versus
Performative”, “Martinec’s Taxonomy on Actions” and “Hood’s Taxonomy on
Nonverbal Delivery Metafunctions”), this section only outlines how the formal
nonverbal channels are observed.
Table 8.3 lists the measures of the three nonverbal delivery channels observed.
The ticked cells in Table 8.3 indicate on which measures each nonverbal
channel will be studied. Regarding eye contact, this phase will touch upon the
frequency, directionality and duration of eye contact. It is worth mentioning
that duration here refers not only to the mean duration of each occurrence of
eye contact but also to the cumulative duration of a particular candidate's
eye contact in the group discussion. Level of eye contact is also included, as it can be
feasible to judge the level of eye contact from the perspective of the recipient,
which is more associated with metafunctional meanings reviewed in the literature
(see Sects. “Martinec’s Taxonomy on Actions” and “Hood’s Taxonomy on
Nonverbal Delivery Metafunctions”).
Apart from the measures of frequency and directionality that it shares with
eye contact, gesture will be examined in terms of its level rather than its
duration. Two additional measures arising from gesture realisation are also
observed, viz. how the hand(s) are shaped (e.g. palm open or fist) and which
hand(s) are used (e.g. right, left or both). As head movement (e.g. head
movement naturally accompanying an eye contact transition) can be regarded as
a broader realisation of eye contact, this phase of study will focus on the
measures of frequency and directionality only.
In line with Martinec’s (2000, 2001, 2004) taxonomy of action and Hood’s (2007,
2011) research on nonverbal delivery metafunctions, this chapter will revolve
around the research findings in three aspects of alignment. The first alignment is
concerned with the correspondence between the nonverbal delivery channels and
the rating scale descriptors regarding nonverbal delivery. The second alignment is
more focused on descriptive elaborations of how the candidates' performance
in nonverbal delivery is realised from the MDA perspective and how much
communicativeness is achieved relative to the rating scale descriptors. The third
alignment will further look into the interaction, particularly the complementarities,
between the candidates’ verbal and nonverbal delivery in relation to their respective
proficiency levels. However, the presentation of the research findings below still
follows the taxonomy of different nonverbal delivery channels, viz. eye contact,
gesture and head movement, with both their formal realisations and metafunctions
addressed in-depth.
In accordance with the specifications above, the findings on eye contact will be
presented from the perspectives of formal eye contact and its metafunctions.
Meanwhile, the candidates’ performance in nonverbal delivery will be associated
with the above two perspectives under the given operationalisation for an analysis
of the alignments with the candidates’ overall proficiency level and the proposed
rating scale descriptors.
The formal eye contact is first presented with regard to its directionalities.
Figure 8.1, in the form of a bar chart, indicates that Diana, Linda and Tom exhibited
a descending order of eye contact frequency (see the rightmost column sum) and
that all of them had the forward eye contact,1 a commonplace directionality in
communication, but none had any backward eye contact. A more careful scrutiny
concerning the different directionalities would uncover more interesting findings.
Although Tom had the fewest occurrences of eye contact in group discussion, he
1 The directionality of eye contact here is slightly distinguished from that in the AB phase,
where the recipient of eye contact, such as the camera, was described. In this phase, forward
eye contact means having an occurrence of eye contact with an unspecified object physically
located in front of the speaker. Conversely, backward eye contact refers to an occurrence in
which a speaker looks at a position behind him/her.
[Fig. 8.1 Frequencies of eye contact by directionality (leftward, rightward, forward, backward, upward, downward) and in sum for Tom, Linda and Diana]
Table 8.4 Eye contact duration (s)
                     Tom    Linda   Diana
Mean duration        3.15   3.29    4.38
Min. duration        0.50   0.55    0.85
Max. duration        6.80   6.70   10.36
Cumulative duration 41.10  79.05  114.05
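The four summary statistics in Table 8.4 are straightforward to derive from per-occurrence gaze durations; the list below is illustrative, not the study's raw data.

```python
def duration_summary(durations):
    # Summarise per-occurrence eye contact durations (in seconds) into the
    # four statistics reported in Table 8.4
    return {
        "mean": round(sum(durations) / len(durations), 2),
        "min": min(durations),
        "max": max(durations),
        "cumulative": round(sum(durations), 2),
    }

sample = [0.85, 2.10, 4.38, 10.36, 3.20]   # seconds, illustrative only
print(duration_summary(sample))
# {'mean': 4.18, 'min': 0.85, 'max': 10.36, 'cumulative': 20.89}
```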
had the highest frequency of downward eye contact,2 which was absent in
Diana's case. When eye contact in the context of group discussion is
investigated, the directionality is assumed to be more horizontal than
vertical; the downward eye contact might represent Tom's eye contact with the
ground. Both Linda and Diana had one occurrence of upward eye contact. In
addition, Tom had no leftward or rightward eye contact, indicating a
comparatively sedentary posture and less varied eye contact positioning. It is
also noted that Linda had no rightward eye contact, which can be partially
explained by her rightmost sitting position among the three group discussants.
Table 8.4 outlines the duration of eye contact by the three candidates. The
results, especially the ordering, are similar to what was previously found in fre-
quencies, with Tom’s mean duration and cumulative duration of eye contact at 3.15
and 41 s, respectively (shortest), and Diana’s at 4.38 and 114 s (longest). However,
it was also found that Linda, positioned in the middle, did not feature a significantly longer mean duration than Tom. In particular, when the minimum and maximum
durations of eye contact fixation (gaze) were investigated, Linda performed a
shorter max duration than Tom. The findings below concerning the metafunctions
of eye contact will take one step further in untangling the discrepancies.
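Duration measures of this kind can be reproduced from time-stamped annotations. The sketch below is a minimal illustration rather than the study's actual annotation pipeline; the data structure (start/end pairs in seconds) and the sample values are hypothetical:

```python
# Compute mean, minimum, maximum and cumulative duration (in seconds)
# of annotated eye contact occurrences, each stored as a (start, end) pair.

def duration_stats(intervals):
    durations = [end - start for start, end in intervals]
    return {
        "mean": round(sum(durations) / len(durations), 2),
        "min": round(min(durations), 2),
        "max": round(max(durations), 2),
        "cumulative": round(sum(durations), 2),
    }

# Hypothetical occurrences for one candidate
occurrences = [(0.0, 0.5), (10.2, 17.0), (25.0, 28.1)]
print(duration_stats(occurrences))
```

Applied per candidate, a function of this kind would yield figures of the sort reported in Table 8.4.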
The proposed rating scale descriptors pertaining to eye contact focus on three
aspects of eye contact: frequency, controllability and briefness. Judging from the
above findings, it can be summarised that Tom, with the least variation in eye
2 Upward eye contact and downward eye contact are described as looking at the objects whose location is, respectively, above (see Frame 8.4A as an illustration) and below (see Frame 8.4B as an illustration) the horizontal vision of the speaker. They are usually synchronised with moving the speaker's head to a higher or lower position, which might facilitate the researcher's judgment.
8.3 Research Findings
Having obtained the above findings of the formal eye contact, this section, informed
by Martinec’s (2000, 2001, 2004) and Hood’s (2007, 2011) works, turns to the
metafunctions of eye contact following the integrated operational framework for
MDA analyses. In practice, the research findings will be unfolded in the spectrums
of the three metafunctions: ideational, interpersonal and textual meanings.
Ideational Meaning
To commence with, the findings regarding the ideational meaning of eye contact are
presented. Although Martinec’s (2000) demarcation of actions into presenting,
representing and indexical might blur the judgment of eye contact in relation to its
ideational meaning, the co-contextualisation of eye contact with the candidates’
verbiage might be of great assistance in facilitating the judgment in this study.
Figure 8.2 outlines the distribution of eye contact with regard to the above tax-
onomy. Among the candidates, Tom performed the largest number of presenting
actions in this regard, indicating that most of his eye contact, if not all, cannot
practically serve communicative purposes. In contrast, Linda and Diana kept an
almost negligible profile of eye contact falling into the category of presenting
action; most of their eye contact occurrences belong to indexical actions. As
indexical actions are usually language dependent, an abundance of eye contact in
[Fig. 8.2: distribution of eye contact occurrences across presenting, indexical and representing actions]
this category can also be justified because most eye contact occurrences request the
co-contextualisation of verbiage for meaning access. Eye contact of the representing type refers to the established conveyance of a certain formal eye contact, such as wearing a disdainful look to show disagreement and rolling the eyes to indicate prolonged, inconclusive thinking. It should be noted that the judgment on these representing-type eye contact occurrences might be confined to the generally accepted Chinese social context.
Presenting Action
Eye contact serving the presenting function,3 though keeping a comparatively modest profile in Fig. 8.2, deserves a closer look because such eye contact via
material, state and mental processes can reflect how the candidates performed in
group discussions. However, eye contact of this type does not practically enhance
communication effectiveness, but mostly serves adaptive purposes particularly in an
assessment context.
Based on the findings in Fig. 8.2, this section looks into the occurrences of eye
contact of presenting type by Tom, as illustrated in Fig. 8.3. When material is
concerned, judging from the level of Tom’s vision, he presented eye contact with
the other discussant’s clothes (Frame 8.3A), or simply with the ground of the
classroom (Frame 8.3B) in the course of the discussion. Specifically, in Frame
8.3A, while Tom was holding the turn, he seemed to gaze at the other discussant’s
clothes (dashed arrow) while the others were attentively gazing at Tom (arrows).
Likewise, in Frame 8.3B, when Tom yielded the turn to another discussant, his eye contact, instead of targeting the speaker, chose the ground as the material. Neither of the eye contact occurrences above is therefore regarded as semantically loaded or
3 In accordance with Martinec's (2000) taxonomy, presenting functions mainly refer to those that do not generate representational or communicative meanings, such as those actions representative of the candidate's nervousness in assessment contexts (see Section "Martinec's Taxonomy on Actions" for more explanations).
Representing Action
Representing actions with regard to eye contact refer to those with self-explanatory
gaze in a given communication and social context. They can be either language
independent or language correspondent, as explained before. Considering the
Indexical Action
If an occurrence of eye contact falls into the category of indexical action, accom-
panying language is indispensable for the full access to the ideational meaning of
the eye contact that is intended. Among the indexical eye contact occurrences by
the three candidates, this study has mainly retrieved two kinds of ideational
meanings: agreement and uncertainty, to be illustrated below.
As is observed in this study, agreement and uncertainty conveyed via eye contact
are usually realised as a result of long-time eye contact fixation of forward direction,
fulfilling a basic function of gaze: tendering response after attentiveness is shown.
The two frames in Fig. 8.6 illustrate the occurrences of eye contact indicating
agreement and uncertainty, respectively. In Frame 8.6A, upon terminating her turn (the verbiage suggesting Hainan as a travel destination for the forthcoming vacation), one of the discussants had eye contact with Linda, who, in return,
continued her gaze with an accompanying verbiage of “yeah” and even added a
smile as a response of agreement. Diana’s eye contact with the peer seems to be
different, as is shown in Frame 8.6B. Having been asked about her plan after graduation by one of the discussants, Diana took over the turn and expressed her
uncertainty via a gaze at that particular discussant. Diana’s gesture (both hands
across in a fisted form) can also indicate an uncertainty in this regard (see the
findings in the section of gestures below for further triangulation). It is found that
both kinds of the eye contact occurrences with similar verbiage exemplified above
dominate the indexical eye contact in the cases of Linda and Diana.
Associating the above findings with the nonverbal delivery descriptors in the
rating scale, it can be felt that the keywords in the descriptors, viz. controllable and
brief, can be further validated. This is because Tom's eye contact, with the fewest occurrences, long durations of gazing at physical objects with almost no communication-enhancing effect, and most instances falling into the category of presenting function, can be judged as neither controllable nor brief. Although
Linda had eye contact with the other discussants realising the intended representing
and indexical functions, her eye contact seemed to be less empowered due to its
briefness and constant shift in eye contact directions. With various ideational
meanings expressed, Diana’s eye contact with the other discussants at her own turn
or during the others’ turns, can be credited as controllable. She was able to employ
the gaze of showing her attention when the others held the turns and the gaze
serving as a signal of persuasion or agreement when the turn was yielded to her.
Accordingly, the nonverbal delivery subscores that the candidates were assigned
can also be aligned with the above findings.
Interpersonal Meaning
Eye contact in a certain manner can denote positive or negative attitudes. The
interpersonal meaning in relation to attitudes can particularly overlap with indexical
eye contact in that indexical eye contact, as analysed above, mostly contributes to the
ideational meaning of agreement and uncertainty. Therefore, how attitudes are
realised via eye contact will not be redundantly elaborated. However, interpersonal
meaning can also be realised via engagement, which might include the eye contact
indicating neutral, expansion, contraction and possibility. As the commonly
observed eye contact with neutral engagement leaves limited space to be explored
in-depth, and engagement of possibility can be similar to the conveyance of
uncertainty in ideational meaning conveyance, it would be more worthwhile to tap
the potential of the expansion and contraction engagement of eye contact.
When eye contact carries the interpersonal meaning of engagement, expansion
can be realised when the candidate performs a durable and slightly upward gaze
with the other discussant(s) as shown in Frame 8.7A of Fig. 8.7. This is because
Diana’s gaze with such a direction might indicate plenty of negotiation space
provided to show receptivity. Therefore, eye contact in this manner indicates not
only attentiveness but also broad-mindedness in listening to others. Another form of
engagement can be instantiated by an occurrence of slightly downward eye contact,
as is illustrated in Frame 8.7B and Frame 8.7C. In both frames, Linda performed a
downward gaze during her turn though the other discussants were both gazing at
her to show attentiveness. This can be understood as Linda’s unwillingness to be
interrupted when her flow of thought was going on and no other suggestion con-
cerning another travel destination would be allowed at that particular moment, thus
instantiating an engagement of contraction and realising a distancing effect.
The last realisation of interpersonal meaning via eye contact is graduation, which can be
measured via the duration of how long one occurrence of representing or indexical
eye contact would take in shifting from one contact target to another. It has to be
pointed out, however, that the eye contact targets mainly refer to the discussants in
the group because if eye contact is shifted to certain physical objects, it could only be regarded as a failure to realise any interpersonal meaning. The cut-off criteria for graduation in eye contact are tentatively set as follows in this study: fast at 0.5 s and shorter; slow at 1 s and longer; and medium between 0.5 and 1 s. Figure 8.8
outlines the frequency distribution of the three candidates’ eye contact shift when
measured accordingly.
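These cut-offs can be stated as a simple classification rule. A minimal sketch (the function name is illustrative; the thresholds, in seconds, are the ones given above):

```python
# Classify an eye contact shift by its duration, using the study's
# tentative cut-offs: fast at 0.5 s and shorter, slow at 1 s and
# longer, medium in between.

def classify_shift(duration_s):
    if duration_s <= 0.5:
        return "fast"
    if duration_s >= 1.0:
        return "slow"
    return "medium"

print([classify_shift(d) for d in (0.3, 0.5, 0.7, 1.2)])
# ['fast', 'fast', 'medium', 'slow']
```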
Fig. 8.7 Engagement of eye contact in interpersonal meaning: expansion and contraction
[Fig. 8.8: frequency distribution of eye contact shifts classified as fast, medium and slow]
As is revealed from Fig. 8.8, among all the occurrences of representing and indexical eye contact, none of Tom's was lined up with the other discussants as targets. Therefore, the realisation of graduation seems to be absent in his case. Diana and Linda tended to shift their gaze rapidly from one discussant to the
other especially when they were supposed to be attentive in the discussion.
Figure 8.9 presents such an eye contact shift in the case of Diana, where she shifted
her eye contact target swiftly from one discussant (Frame 8.9A, on her right) to the
other (Frame 8.9B, on her left) in accordance with the turn change between them.
One caveat is that due to Linda's rightmost sitting position, her leftward
eye contact might simultaneously capture both discussants, causing a shortened
duration of eye contact shift. By comparison, as Diana was standing in the middle, it
would take longer time for her to shift the eye contact from one side to the other.
With the above, when interpersonal meaning of eye contact is considered, the
candidates’ performances can also be justifiably aligned with the rating scale
descriptor in nonverbal delivery and the subscores they were assigned. Both Linda
and Diana were able to perform eye contact with positive and negative attitudes and
shifted their gaze at the other discussants quickly to achieve a high degree of
graduation. Yet Linda is reduced to a disadvantageous position in this regard in that
she is felt to be more passive given more manifestations of her contraction
engagement compared with Diana’s expansion engagement. The occurrences of
Tom’s eye contact can be hardly felt to realise any interpersonal meaning.
Therefore, the keywords of controllable and brief in describing eye contact on the
rating scale are further validated. As Tom's representing and indexical eye contact amounts to only two occurrences in total, his being assigned 1.5 (between infrequent and almost no eye contact) can also be justified.
Textual Meaning
Ideational meaning and interpersonal meaning alone cannot optimise the intended
meaning. To a certain extent, textual meaning should also be involved so that all the
meaning potentials can co-function in a semiotic network. Informed by the oper-
ational framework specified, textual meaning instantiated via eye contact mainly
involves two aspects: what/who is the recipient of eye contact and how specific is eye
contact. The former can be observed via the object(s)/person(s) at which an occurrence of eye contact is targeted, whereas the latter is more concerned with the
duration of such occurrences of eye contact. The longer an occurrence of eye
contact would last, the more specific it is. In fact, such specificity can also be
measured via the size of pupils because the enlarged pupil size can mean a higher
degree of specificity or attentiveness. However, given the practical technology
constraints, neither the collected data nor the analysing instrument is suitable for
such a measurement.
Figure 8.10 outlines the distribution of the targets that the candidates’ eye
contact is aimed at. Basically, their eye contact targeted at the other discussants
(peers), the teacher on the spot, the camera for recording purpose, and other tangible
objects in the classroom, such as window, ground and ceiling. Among all the target
objects, the three candidates would have the highest frequency of eye contact with
the other discussants. Except for Tom, who had a salient number of eye contact occurrences with the ground, all three candidates seemed to exhibit
the aforementioned eye contact with the physical objects that would possibly
attenuate communication effectiveness. For example, both Diana and Linda seemed
to have brief eye contact with the ceiling (occurrence of upward gaze). One
interesting finding is that Diana and Linda also had eye contact with their hand(s) or
[Fig. 8.10: frequency of eye contact targets (peers, teacher, camera, window, ground, ceiling, hand(s)/finger(s)) for the three candidates]
Table 8.5 Eye contact with peers: duration (s)

Contact target      Measure  Tom    Linda  Diana
Peers               Mean     1.26   2.67   4.45
                    Min.     0.20   0.35   0.89
                    Max.     4.27   6.70   10.36
Hand(s)/finger(s)   Mean     0.00   1.72   1.25
                    Min.     0.00   0.85   0.68
                    Max.     0.00   2.58   1.82
finger(s). This is because when they intended to express or reinforce the meaning
via gestures, their own gaze at the hand(s) or finger(s) would arouse the others’
attention, the phenomenon of which will be further unfolded in the findings on
gestures below.
However, these eye contact targets alone cannot help explain much with regard
to the textual meaning realised. This study then looks into the duration of the
candidates’ eye contact with the other discussants given the fact that only eye
contact of this kind would be textual meaning intended.
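Per-target duration figures such as those in Table 8.5 can be obtained by grouping annotated occurrences by their contact target. A minimal sketch with hypothetical records (target label, duration in seconds), not the study's actual data:

```python
# Group eye contact occurrences by contact target and report
# mean, minimum and maximum duration (in seconds) per target.

def per_target_stats(records):
    grouped = {}
    for target, duration in records:
        grouped.setdefault(target, []).append(duration)
    return {
        target: {
            "mean": round(sum(ds) / len(ds), 2),
            "min": round(min(ds), 2),
            "max": round(max(ds), 2),
        }
        for target, ds in grouped.items()
    }

# Hypothetical annotation records
records = [("peers", 2.0), ("peers", 4.0), ("hand(s)/finger(s)", 1.5)]
print(per_target_stats(records))
```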
As is indicated in Table 8.5, the candidates’ eye contact with the peers can be
largely similar to the results in Table 8.4 as the eye contact of this category accounts
for a majority of all the occurrences. A scrutiny of the means reveals a scenario of Diana's comparatively more durable eye contact. This
indicates that when Diana gazed at the peers, she would conscientiously and sin-
cerely look at the other discussants, thus achieving a higher degree of specificity.
However, the statistics concerning the eye contact with hand(s)/finger(s) will render
a different picture. It is found that Linda (mean: 1.72 s) presented longer gaze at her
own hand(s)/finger(s) than Diana (mean: 1.25 s). As such, it can be said that Linda,
when performing gestures in realisation of metafunctions, would also resort to her
eye contact, another form of meaning-making resource, to pinpoint the significance
of the gesture being performed. In this way, the other discussants' attention would be mobilised as a result of the specificity in Linda's eye contact.
Although not much alignment between the findings from the perspective of
textual meaning and the nonverbal delivery descriptors can be made, a picture of
how textual meaning is realised by Linda and Diana can be captured. One of the
main reasons why such an alignment seems not operationalisable is that the rating
scale descriptor is supposed to bring forth the most salient features instead of being too fine-grained; excessive granularity might render what is supposed to be observed inaccessible to raters. In that sense, even though Linda seemed to perform better than
Diana in giving full play to the possible textual meaning of her eye contact, this
cannot serve as hard evidence in justifying that Linda outperformed Diana, whose
overall delivery via eye contact, as previously analysed, should still be appraised.
Fig. 8.11 Directionality of gestures
8.3.2 Gesture
Talking about the formal gestures by the three candidates, this section will present
the findings with regard to (1) the directionality of gesture4; (2) descriptions of
hands; (3) use of hands; and (4) hands level.5 Prior to the qualitative findings, the
frequency analyses of gestures concerning the above measures are presented below.
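Frequency analyses of this kind amount to tallying annotated category labels per candidate. A minimal sketch (the record fields and labels are hypothetical illustrations, not the study's coding scheme):

```python
# Tally annotated gesture labels per candidate for a given measure
# (e.g. directionality, hand description, use of hands, hands level).
from collections import Counter

def frequency(records, candidate, measure):
    return Counter(r[measure] for r in records if r["candidate"] == candidate)

# Hypothetical annotation records
records = [
    {"candidate": "Diana", "directionality": "rightward", "hands": "right"},
    {"candidate": "Diana", "directionality": "forward", "hands": "both"},
    {"candidate": "Tom", "directionality": "downward", "hands": "right"},
]

print(frequency(records, "Diana", "directionality"))
```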
Figure 8.11 showcases the frequency of gesture directionalities in the candidates' group discussion. Judging from the rightmost column (sum), Diana is found to have performed the largest number of gesture directionalities of various kinds. In particular, there was an extraordinarily high frequency of Diana's using the right hand in her gestures. Comparatively, Tom did not have a noticeably high frequency of any gesture directionality. Therefore, it can be initially deemed that Tom kept an extremely low profile of gestures in synchronisation with his verbal utterances in the group discussion.
Proceeding from the directionality of gestures in general to the description of hands, as illustrated in Fig. 8.12, this study finds a slight variation against the above gesture directionality comparison. One similar tendency is that Tom, compared with his counterparts, can be generally found to have the least variation in terms of hand descriptions. However, there are a few exceptions. Tom tended to
be more often fisted than Linda, who never had any occurrence of fist in her group
discussion. This finding urges an in-depth exploration of what role or function a fist
could play when Tom was involved in the group discussion. The follow-up
4 Similar to the directionalities of eye contact described in Sect. 8.3.1.1, gestures were observed with regard to the directions of hand movement. For instance, if a hand is moved upwards from a lower position, the directionality of the gesture is judged as upward.
5 Hands level is judged when the location of the hand(s) is considered in relation to the speaker's head, chest, legs and waist.
[Fig. 8.12: frequency of hand descriptions (e.g. fisted, pointing, open palm) for Tom, Linda and Diana]
discussion will turn to this point again. In addition, Fig. 8.12 also reveals that Linda
used more pointing than Tom or Diana. Pointing can be a form of reference in
communication heavily loaded with the textual meaning of gestures, after the
exploration of which Linda’s prominent use of pointing can be explained below.
One peculiar finding in this figure is that Diana was found to open her palm constantly, while Linda performed such a gesture only once.
Figure 8.13 illustrates the use of hands, either left hand, right hand or both, by the candidates. Individually, all the candidates tended to use the right hand more often. However, it can be found that Diana's right-hand use far exceeded that of her left hand. In addition, Linda seldom used both hands in her gestures. Hood (2011) puts forward that, to a certain extent, gesturing with both hands usually produces larger and more dramatic gestures, whereas one hand usually triggers smaller and more reserved gestures. This seems to be consistent with what is found above concerning hand descriptions, where Linda performed significantly more pointing with fingers only, yet presented fewer palm-triggered gestures.
The comparison in Fig. 8.14 shows that when candidates instantiated gestures,
their hands level might also vary. Tom’s hands level was either at the leg level or
above the head level, yet it was only Tom who would have the occurrences of
[Fig. 8.14: frequency of hands level (head, chest, legs, waist) for Tom, Linda and Diana]
gestures above the head. Comparatively, it can be sensed that Linda and Diana
placed their hands at a wider range of positions and levels.
At this stage, if the proposed rating scale is aligned with the candidates’ perfor-
mance in nonverbal delivery, the descriptors concerning gestures with a focus on
frequency and variety can be validated. Diana, assigned a subscore of 4, presented not
only frequent but also diversified gestures, the latter of which can be more manifested
in the directionalities of her gesturing, the use and description of her hands as well as her
hands level. By comparison, although Linda also had high frequency of gestures, the
above measures of hand descriptions and hand use would justifiably downgrade her to
the subscore of 3. The case of Tom in this regard is not quite up to the standards of
being frequent and various in gesture use. As Tom was assigned a subscore of 1.5 as
an averaged result of teacher- and peer-rating, a retrospective review on its upper
adjacent band, namely Band 2, is necessary. The gesture descriptor for Band 2 is
“gesture, most of them are for non-communicative purposes”; therefore, raters, due to
Tom’s poor performance, might not even move into length to consider the issues of
frequency or variation of communication-conducive gestures. The research findings about the metafunctions of Tom's gestures will further testify that his gestures, if not all of them, are overwhelmingly performative with non-representational meanings, as expounded below.
[Fig. 8.15: distribution of gesture occurrences across presenting, indexical and representing actions]
might not be semantically loaded or wilfully performed. For example, his gesture could be scratching the head or rubbing his hands on the legs. By contrast, neither Linda nor Diana presented a salient profile of presenting actions; instead, Linda had more representing actions, whereas Diana featured more indexical actions. The following part will further scrutinise how the three metafunctions are realised in the above three types of gestures in the candidates' group discussion.
Ideational Meaning
Presenting Action
Figure 8.15 above has indicated that presenting action keeps a lower profile compared with representing and indexical actions, and this sort of action can be more commonly found in Tom's nonverbal delivery. However, presenting action does not actually serve much communicativeness in group discussion; as such, an unexpectedly abundant use of presenting gestures can be interpreted as not being communication conducive. Therefore, a judgment of Tom's low subscore on nonverbal delivery justifies starting with an analysis of his presenting gestures.
As foreshadowed, gestural presenting actions can be realised by various means,
such as material, behavioural, state and mental processes. Material process refers to
the involvement of objects in the gestural realisation. This study finds regular
occurrences of material processes in Tom’s gestures, which can be showcased in
Fig. 8.16, where Tom, sitting in the leftmost position among the peers, moved the
chair slightly forward with both hands. This action might be interpreted in a bi-fold
manner. One explanation is that for the purpose of drawing physically closer to the
other two discussants, Tom performed a subtle forward movement of his chair. The
other explanation would be that Tom was too nervous in the assessment settings to
be aware of sitting calmly in the group discussion. One word of caution, however, should be borne in mind: Tom performed that action three times, inclining this study more in favour of the second explanation. Lim (2011), in analysing the
teachers’ gestures in the lecturing environment, argues that material processes that
“are extraneous to the focus of the lesson may draw attention away from the main
communicative event” (p. 273). Likewise, Tom’s action in this case would also be
liable to disrupt communication effectiveness.
Behavioural process can refer to the action of crying, laughing or moaning, or
other physiological process like breathing, coughing and burping (Martinec 2000);
naturally, this process can also be realised in a gestural fashion. As group discussion
might trigger viewpoint exchanging and experience sharing, the candidates' gestures are assumed to be embedded with behavioural processes. Figure 8.17 illustrates the presentation of behavioural processes in Linda's gestures. Frame 8.17A snapshots Linda, sitting on the leftmost side, laughing yet hiding her face with both palms when one of the other discussants (sitting in the middle) shared an unpleasant travelling experience with the group. Therefore, Linda's presenting gesture might be interpreted as finding the discussant's story laughable. Another example can be found in Frame 8.17B, where Linda was trying to
hide her face with the left hand and index finger touching the forehead when
another discussant suggested a travel destination that Linda had already been to.
As such, Linda performed that gesture as if she was showing her unwillingness to
revisit a travel destination.
What is worth pointing out is that these behavioural processes are quite evident
in Linda’s performance in the group discussion; however, it does not necessarily
mean that Diana from the advanced group did not have any realisation of these
behaviours. This is because in the case of Diana, she would be more likely to realise
laughing, breathing or surprise via facial expressions, the domain of which is
practically beyond a measurable scope of nonverbal delivery assessment in this
study.
With regard to the state processes, it is also found that the occurrences of Tom’s
gestures could be instantiated by long-time sitting. Martinec (2000) proposes the
category of state processes to describe processes that “have no significant move-
ment and have no obvious expenditure of energy” (p. 249). Echoing this definition,
Tom was constantly sitting still without much noticeable energy-consuming
movement whenever holding or yielding his turn in the group discussion.
Integrating with the findings of material process, this study finds that Tom would
either move the chair occasionally due to nervousness in communication or merely
sit still. The comparatively low profile of these two processes, therefore, might have
justifiably placed Tom at a disadvantageous position when he was assessed.
In stark contrast, although Linda was basically keeping the posture of sitting in
relation to the state process, her overall performance in nonverbal delivery,
particularly gestures observed, would trigger dynamics from time to time. It has to
be admitted that a sitting posture will, to a certain extent, confine the space of
gesturing in the domains of material and state processes. However, Linda seemed to
have accommodated herself to such a confinement by natural and constant gesturing
when discussing with the group members. With a standing posture, Diana was
naturally endowed with more flexibility; thus, the whole duration of the group
discussion witnessed almost no conspicuous happening of gestures with salient
expenditure of energy.
Another realisation of presenting action is mental process, instances of which can be described as a finger or hand pursed at the chin.
Although gestural presenting action does not serve much communicative purpose, it
would somehow mirror the candidate’s inner mindset, such as hesitation and
meditation. Figure 8.18 illustrates Diana’s (standing in the middle) mental process
in relation to her gestural presenting action. In Frame 8.18A, Diana was placing the
index finger of her left hand gently upon the tip of her chin on the left side when she was a bit timid in asking her group members what their future husbands would be like.
Similarly, in Frame 8.18B, after yielding her turn to the discussant to her left, she
again pursed her index finger at the chin as if she was presenting uncertainty, or her
spontaneous reaction to a question that requested time buffering. As stated,
although mental process signifies the ideational meaning of presenting gestures, it
does not serve communication purposes. However, since this action is under the
category of performative gesture, raters might be impressed by the candidates’
performance if they would be able to realise the mental process with gestural
vehicles. As such, Diana’s high subscore in nonverbal delivery can be justified.
Representing Action
Following the ideational meaning of gestural presenting actions, this section will
continue with the ideational meaning of representing action in relation to gestures,
which can be regarded as more pertinent to analysing the alignment of the candidates' nonverbal delivery performance with the communication effects, both
implicit and explicit, achieved.
As is reviewed, representing gestures can be further categorised into language-independent and language-correspondent gestures. The former in its own right lends support to the iconic meaning of gestures in a certain social context and conveys the meaning without relying on any synchronised language; the latter conveys the meaning in correspondence with the verbal utterance it usually co-occurs with. In the case of the three selected candidates
in this phase of study, both language-independent and language-correspondent ges-
tures can be retrieved.
Figure 8.19 renders three instances with which the representing gestures can be
captured and interpreted. Frame 8.19A is a presentation of Tom’s representing
gesture of waving his right hand towards the end of the discussion, signifying
“goodbye”. It should be noted that accompanying this gesture, Tom actually did not
utter the word “goodbye”, the case of which falls into the category of language-
independent gestures. This is because conventionally in the Chinese social context,
waving hands upon the termination of the group discussion might be interpreted as
Indexical Action
As is shown in Fig. 8.15, indexical actions account for the largest proportion of all
the gestures observed. Under most circumstances, indexical actions are language
dependent, which determines their close affinity with the accompanying verbal
language for the full interpretation of the meaning. In the context of the present
study, where the candidates were supposed to hold group discussion in the for-
mative assessment, it has been observed that the presenting gestures were primarily
intended for the conveyance of importance, receptivity and relation.
Importance can be instantiated by a rhythmical movement in the candidates’
indexical gestures. Figure 8.21 illustrates two frames, which, respectively, indicate
Diana’s and Linda’s rhythmic beat in highlighting the points they were conveying.
In Frame 8.21A, Diana was listing various disadvantages of living in a
cosmopolitan city. Each time she came up with one disadvantage, she would clap her
hands once, as if attaching significance by counting off the points. In a quite similar
vein, in emphasising a number of criteria for selecting an ideal travel destination,
Linda expanded and contracted her palms rhythmically, as is shown in Frame
8.21B.
Another realisation of indexical gestures is receptivity, which is usually
instantiated by means of open palm, as illustrated in Fig. 8.22. In Frame 8.22A,
Interpersonal Meaning
The following part will be geared towards the interpersonal meaning interpreted
from the three candidates’ gestures. As is specified, representing and indexical
gestures might carry much interpersonal metafunction, which, as far as gestures are
concerned, can be probed into from the perspectives of attitude, engagement and
graduation (Hood 2011).
Either representing or indexical gestures can transmit an intended
interpersonal conveyance of being positive or negative. Figure 8.25 illustrates
the distribution of positive and negative gestures with interpersonal meanings
across the three candidates. It is found that Tom and Linda basically kept a balance in
expressing positive and negative interpersonal meaning though Linda’s gestures
with attitudes embedded far outnumbered Tom’s. Diana is found to be distinguished
in that she tended to have more gestures with a positive polarity. This can also be
echoed with the findings of head movement below, where there was much more
nodding than head shaking. As Tom’s formal gestures are extremely limited in
number, the following analyses correspondingly reserve limited space for his case.
[Figure 8.25: counts of positive vs. negative gestures across the three candidates]
8.3 Research Findings 241
[Figure: counts of gestures by graduation (fast, medium, slow)]
The extent to which the gestures are judged to fall into one of the graduation
subcategories depends on the automatic retrieval of the gesture duration by ELAN.
Fast gestures are tentatively cut off at 0.5 s and below, and slow gestures at 1 s
and above; gestures falling into the range of 0.5–1 s are judged as medium.
Against the criteria, Fig. 8.29 lists the distribution of gestures in relation to
interpersonal meaning of graduation. It is found that Diana’s gestures are basically
characterised by swiftness and that Linda performed more medium than slow
gestures. In the case of Tom, only a small number of gestures could be
grouped into medium and slow graduation. This holistic finding is consistent with
the above observations regarding the candidates’ activeness in that Diana and Linda
engaged themselves in the discussion with various gestures, while Tom was still
sedentary. In order to make a comparison across the candidates, this study selects
the shared gestures when all the candidates intended to express the negative attitude
of interpersonal meaning, as is illustrated in Fig. 8.30. Similar to the distribution
reflected in Fig. 8.29, Diana and Linda waved the palm in fast (Frame 8.30A) and
medium (Frame 8.30B) motion, respectively, while Tom was almost still (Frame
8.30C) in performing a similar interpersonal-meaning-embedded gesture.
In brief, when the interpersonal meaning channelled in the candidates'
gestures is assessed, Diana is found not only to make lavish and constant use of
gestures indicating positive and negative attitudes, but also to shift between
different forms of engagement in line with her turns, with a large number of
gestures rapidly performed. In that sense, Diana can be judged as a frisky or even
quick-witted communicator to a large extent, thus again aligning her gestures with a great sense of
Textual Meaning
Textual meaning serves as a bridge linking the resources of ideational and inter-
personal meaning. According to Hood (2011), textual meaning with regard to
gestures can be realised by pointing, which can be assessed from the aspects of
directionalities and specificity. Figure 8.31 illustrates the distribution of various
possible directionalities of pointing, which can be broadly broken down into the
directions with reference to human body and those concerning physical objects or
geographic locations. Very few gestures fall into the uncategorised group, mainly
those not entirely captured by the camera or those whose reference could not be
determined because of a moving pointing.
It can be found that Tom would occasionally point at the other discussants to get
his viewpoints across. By comparison, Linda’s pointing at various directionalities
seems to be balanced. In other words, she would point not only at herself or the
other discussants with the referred person(s) embedded but also at the physical
objects, such as the window and the door in the classroom, or the geographic
directions like "south". Diana shared a similar profile of pointing at objects and
directions with Linda, yet her pointing at the other discussants was noticeably
more frequent. This might be understood as a preferred reference to the other members
in the manner of pointing when she intended to convince them (e.g. accompanying
the verbiage of “don’t you think so”), to build a rapport in communication or draw
their attention, all of which seems to echo the above observation of her
vivaciousness.
The last channel of nonverbal delivery observed in this study is head movement,
whose stereotyped manifestations mainly include nodding and head shaking. The
following presents the research findings in terms of formal head movements
and how they realise metafunctions in accordance with the integrated analytic
framework of this study.
Although head movement is conventionally categorised into nodding and head
shaking, in light of its possible forms it can also include head upward, head
downward, head right and head left. It should be noted that nodding and shaking
refer, respectively, to the dynamic movement (more than one repetitive occurrence)
of the head in vertical and horizontal manners, while the remaining four forms refer
to only one occurrence of a particular movement direction followed by a maintained
position for a certain period. For example, head downward can be a turn of the head
to a downward position followed by a maintained period, yet without any positive
or negative meaning such as might be implied by nodding or head shaking.
Figure 8.33 outlines the distribution of various formal head movements by the
three candidates. It is found that, in terms of frequency, Diana performed the largest
number of head movements in various directions except for downwardness. By
contrast, Tom had only a few occurrences of head movement, mainly downward
movements, which also corresponds with what is found above regarding eye
contact. This is because when an occurrence of downward eye contact is captured, a
corresponding downward head movement might occur as an accompanying action.
When nodding and shaking are looked into, both Linda and Diana seem to have
performed more nodding than shaking. In the Chinese social context, this suggests
that more positive expression was conveyed via their utterances. Similar to what
is found in eye contact, Linda moved her head
to the left a few times, with no occurrence of rightward head movement, which can
again be explained by her rightmost sitting position among the discussants.
Therefore, as far as the above findings of head movement frequencies are
concerned, since Tom only had a few salient downward head movements, there is no
appropriate contextualised nodding or head shaking to speak of. The nonverbal
delivery subscore that Tom was assigned (1.5) can thus be justified because his
performance can be regarded as falling between inappropriate head nod/shake and
no head nod/shake. Although Linda had fewer occurrences of head movement than Diana,
both of them can be said to have detectable head nodding and shaking. However,
whether these occurrences can be judged as appropriate would require further
exploration when the metafunctions of head movement are analysed.
[Figure: counts of head movements by presenting, representing and indexical actions]
Diana. This can be particularly true when it comes to their head movements other
than nodding and shaking because only the discussion context can be referred to in
interpreting what is intended by an upward, downward, left or right head
movement.
Ideational Meaning
Ideational meaning in the case of head movement refers to the surface meaning
that such movements instantiate. The following presents the findings of ideational
meaning realised via presenting, representing and indexical head movements.
Presenting Action
The integrated analytic framework stipulates that the ideational meaning of
nonverbal delivery channels can theoretically be instantiated by material,
behavioural, state, verbal and mental processes. However, given the practicalities
of head movements, which involve no object contact (material), exhaustible
movements (behavioural), a predetermined dynamic state and inapplicable verbal
processes, only the mental process is analysed here.
The above findings already indicate that Tom’s presenting head movements
were usually manifested by downwardness, coinciding with the findings in the
directionality of his eye contact. Such occurrences of presenting head movements
can be tentatively described as absent-mindedness because during the other dis-
cussants’ turn, Tom did not show his attentiveness by appropriately gazing at the
turn-holder; instead, the change of his eye contact direction was naturally accom-
panied with the downward movement of his head. However, when Linda’s and
Diana’s presenting head movements, though in a limited number, are analysed, it
can be felt that ideational meaning can be realised by the mental process, as
illustrated in Fig. 8.35.
In Frame 8.35A, Linda was listening to another discussant on suggesting Tibet
as the travel destination, the verbiage being “there is some culture”, upon the
termination of which, Linda subtly moved her head to the left (see Frame 8.35B)
and maintained that position for a certain period with the verbiage of “yes, some
traditional culture” as a signal of confirmed agreement. Although this process
seemed to be less noticeable than other vibrant movements of the head, the detected
action of this type reveals her thinking, an ongoing mental process.
Representing Action
Representing head movements can be interpreted without the language. As most
occurrences of head nodding and shaking are already semantically loaded,
known, respectively, as positive and negative meanings, they would correspondingly
fall into representing actions. In particular, the act of nodding can indicate not
only a speaker's agreement with what others utter but also his or her attentiveness at
that particular moment, also known as nonverbal backchannelling (e.g. White 1989;
Young and Lee 2004).
Take an occurrence of nodding by Diana as an illustration. In Frame 8.36A of
Fig. 8.36, Diana was listening to one of the discussants airing her view on the
given topic. Statically, Diana was gazing, yet dynamically she was nodding in the
transition to Frame 8.36B. Diana's gaze at the other discussant, with varying
levels of vision (see the dashed arrows), not only shows her attentiveness
(backchannelling showing attention or interest) but also implies her agreement
with the other's view. In that case, it can be felt that the whole process of this
head movement can be accessed without any involvement of verbal language.
Indexical Action
Indexical head movements are language dependent, indicating that their meanings
would be blurred if the accompanying language or the verbal context is not given.
This study finds that most of the indexical head movements would instantiate the
meaning of importance or receptivity, as analysed below.
Figure 8.37 illustrates two frames with a similar conveyance of importance via
head movement. In Frame 8.37A, Linda was trying to emphasise that one of the
selling points of travelling to Tibet is to see the special animals like antelope.
Accompanying her verbiage, she vibrantly moved her head downward to highlight
the word "special" in her verbiage. However, such a downward head movement,
though intended to convey importance, does not seem to be as effective or
appropriate as anticipated because a downward action would more often suggest
weakening than strengthening. This can to some extent be understood because her
accompanying indexical gesture already accounts for the intended meaning of
importance. When Diana was attaching importance to her return to her hometown
after graduation, she moved her head upward a bit (see Frame 8.37B), as if to
achieve an effect of awakening, along with an uplifted open palm. In both cases,
the candidates intended to show the meaning of importance.
In addition, indexical head movements can also express receptivity especially
when the speaker intends to yield the turn to the next speaker. Figure 8.38 is just a
case in point, illustrating the only occurrence of Tom’s indexical head movement.
Frame 8.38A shows that Tom was talking in a static sitting posture. When he
intended to yield his turn to another discussant with the verbiage “what do you
think, Mr. Zhang?”, he moved his head to the left with a synchronised gaze (see
Frame 8.38B, the dashed arrows). In the meantime, the third discussant also turned
aside (see the arrow). Up to that moment, Tom had performed as expected; however,
the transition to Frame 8.38C would lead to disappointment because while the
turn-holder was substantiating the discussion, Tom moved his head back to gaze at
the third discussant (see the dashed arrows), whereas the third discussant was still
gazing at the turn-holder (see the arrow). Against this, it can be said that the only
occurrence of Tom’s indexical head movement fails to salvage him from the low
nonverbal delivery subscore assigned.
If the findings concerning formal head movements are not sufficient to align the
candidates' performance in nonverbal delivery with the corresponding descriptor on
the rating scale, especially regarding the appropriateness of head nodding and
shaking, the ways in which ideational meanings are realised via presenting,
representing and indexical head movements above can to a certain degree account
for why Linda's head movement might occasionally be regarded as inappropriate and
why Diana's performance in head movement can not only present "evidence of
appropriate head nod/shake" but also feature well-timed co-ordination with other
nonverbal channels, such as gestures, to maximise meaning potential.
Interpersonal Meaning
Consistent with the realisations of interpersonal meaning via eye contact and ges-
ture, head movement is also able to realise interpersonal meaning by means of
attitude, engagement and graduation.
It is evident that, in terms of attitude, head movement can realise positive and
negative meaning through head nodding and shaking, respectively. As is noted in
Fig. 8.33, Linda and Diana had fewer occurrences of head shaking than nodding.
Since nodding, as an indication of attentiveness and agreement, has been elaborated
above, this section will bring forth more insights on head shaking. Throughout the
discussion, it has been observed that Linda had only one occurrence of head
shaking, as illustrated in Fig. 8.39, where both frames present a dynamic horizontal
movement, as indicated by the arrows. However, a further integration with the
accompanying verbiage again captures an inappropriate use of head movement.
When Linda was agreeing to plan a trip by uttering "oh, that's a good idea", she
[Fig. 8.39 frames (a)/(b); figure: counts of head movements by graduation (fast, medium, slow)]
even though the graduation of Tom’s head movements also features slowness, the
corresponding interpersonal meaning cannot be instantiated. In the case of Linda,
most occurrences of head movement fall into medium graduation, indicating that
her head movements cannot be characterised by deliberateness or urgency.
When the descriptor of nonverbal delivery on the rating scale is validated again
by referring to what is found above concerning how interpersonal meaning is
realised via the candidates’ performance in head movements, more evidence of
alignment can be collected. The demarcation in the head movement descriptor between
Band 3 and Band 4 lies in the appropriateness of head movement. As Linda is
found to shake her head accompanying verbiage of positive conveyance, coupled
with the unexpected occurrence of downward head movement detected above, her
head nod/shake can be judged as less appropriate. The appropriateness of
Diana’s head movements with regard to the interpersonal meaning can again
support the subscore she was assigned because she is found to perform head
nodding and shaking as expected in the given social context and also control the
graduation of head movement in instantiating different meanings.
Textual Meaning
Textual meaning with regard to head movement can be twofold. On one hand, when
a candidate performs head nod or shake, the wavelength can be measured to
indicate the degree of agreement or disagreement, respectively. This is because a
head nod or shake can be understood as a more confirmed occurrence of agreement
or disagreement if it features higher frequencies in a unit interval. For example,
nodding rapidly as a token of positive backchannelling can be felt as a substantiated
acknowledgement of agreement. In order to standardise this measure, this
study retrieved the frequencies of horizontal (head shake) or vertical (head nod)
movements that occurred in one second. On the other hand, concerning the head
movements other than nodding or shaking, this study looked into the amplitude of
head movement because this measure, akin to the pointing in gesture, can tell the
specificity, particularly that of attentiveness, tendering the organisational resources
for ideational and interpersonal meanings.
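The two measures described here can be sketched in a few lines; the function names, the per-second normalisation and the use of angular displacement for amplitude are illustrative assumptions rather than the study's actual instrumentation:

```python
def standardised_rate(n_movements: int, duration_s: float) -> float:
    """Standardised 'wavelength' measure: vertical (nod) or horizontal
    (shake) movements per second of the head-movement interval."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return n_movements / duration_s

def amplitude(start_deg: float, end_deg: float) -> float:
    """Amplitude of a non-nod/shake head movement, taken here as the
    absolute angular displacement (degrees are an assumed unit)."""
    return abs(end_deg - start_deg)

# A nod of 3 vertical movements in 1.5 s is faster than one of
# 2 movements in 1.6 s.
print(standardised_rate(3, 1.5))  # 2.0
print(standardised_rate(2, 1.6))  # 1.25
```

On this measure, a higher rate reads as a more confirmed signal of agreement or disagreement, which is how the Linda/Diana contrast below is drawn.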
Informed by the fact that there is no detectable head nod or shake by Tom,
Table 8.6 lists the standardised wavelength of head movement performed only by
Linda and Diana. The higher the frequency a candidate performs in a second, the
more accelerated the head nod or shake is. Thus, as is revealed, Linda seems to
perform nodding and head shaking more slowly than Diana. This indicates that when
Linda nodded or shook her head, she might have transmitted a mere signal of hesitant positive
Linda can be evaluated to be almost similar except for the fact that her head
movement in a unit interval seems longer, thus seemingly aggravating a scenario of
a non-committal approval or rebuttal.
8.4 Discussion
Having presented the findings of the three randomly selected candidates’ perfor-
mances in nonverbal delivery with regard to its various forms and the respective
metafunctions with an MDA approach, this section continues with a further dis-
cussion on the three research questions.
RSV-II-RQ1: What functions do the candidates’ nonverbal delivery channels serve?
When each nonverbal delivery channel, viz. eye contact, gesture and head move-
ment, is investigated, both Martinec’s (2000, 2001, 2004) and Hood’s (2007, 2011)
frameworks are referred to. With regard to the former, the candidates’ eye contact,
gesture and head movement are categorised into performative actions (presenting)
and communicative actions (representing and indexical), the judgment of which
mainly relies on an interwoven evaluation of their potential of communicativeness
and the synchronised verbal language. The latter framework, after being accom-
modated and slightly revised in the present study, is able to multimodally analyse
the three metafunctions instantiated by the nonverbal delivery channels in accor-
dance with an MDA approach.
Tom is the candidate found to perform the fewest occurrences of nonverbal
delivery in any channel, leaving an impression of being sedentary. Regarding eye
contact, Tom is characterised by frequent durable gaze at
the ground in the group discussion, blocking the instantiation and realisation of the
corresponding ideational and textual meanings. Likewise, his gesturing was also
limited in light of variation, with merely a detected occurrence of monotonous arm
swing as presenting action and that of a waving hand as the only representing action
of bidding farewell upon the termination of the group discussion. In addition, Tom,
instead of having any head nod or shake, only performed downward head move-
ment coinciding with the finding of constant and noticeable gaze at the ground.
Therefore, it can be said that in the meaning-making process, Tom resorted almost
exclusively to the verbal modality for meaning conveyance. Most supposed functions,
especially those that can be instantiated by representing and indexical actions,
failed to enhance the accompanying verbiage. Judging from the above, this study
holds that Tom basically reaches only the first stratum of the meaning-making
network, namely the ability to employ conventional monomodality. The second stratum of
the network, viz. how individual modality presents different metafunctions, and the
third stratum, viz. how different modalities achieve complementarities, seem to be
groundless for an analysis in Tom’s case.
Moving to the case of Linda, a candidate from the intermediate proficiency level,
it can be found that more meaning-making resources are made use of. Linda’s eye
contact features comparatively high frequency yet with briefness and constant shifts
in gaze directionalities. An interpersonal meaning of contraction can be described
as a result of a few occurrences of downward gaze during her own turn and others’
turns. However, Linda is competent in instantiating more desirable textual mean-
ings in that some durable gaze features the specificity of her gesturing. Although
she has a good number of gesture occurrences of various kinds and directionalities,
due to her leftmost sitting position, she could have performed even better if more
physical space had been provided for freer instantiation. In addition, the tendency
of contraction in her eye contact can also be triangulated with her salient
down-palm gestures, which draw more social distance and limit the negotiation space
between speakers. Concerning head movement, Linda is able to instantiate textual meaning
by her leftward head movement with great amplitude so that more of her atten-
tiveness and initiation of turn-yielding can be realised. Nevertheless, Linda’s head
movement occasionally fails to realise the expected ideational meanings because
certain head movements of hers violate contextualised appropriateness.
Therefore, when the stratum of metafunctions realised by nonverbal delivery is
considered, this study holds that Linda, despite her occasional inactiveness that
might be triggered by her personality, performed quite satisfactorily in the domain
of nonverbal delivery because her eye contact, gesture and head movement all
achieve the desired and describable metafunctions to a certain degree. Even moving
to the stratum of inter-semiotic complementarities, her gestures and eye contact can
co-function to instantiate the accompanying verbiage.
The case of Diana can be judged as a model. From the statistics of formal
nonverbal delivery channels with regard to their respective frequency, duration and
variation, she performed better than the other two candidates. Unavoidably, Diana
had only a few occurrences of performative, or presenting actions. However, those
cannot serve as a counterargument to downgrade her performance in this regard. In
addition to the anticipated ideational meanings, her eye contact, with its durability
and firmness, can also instantiate positive and negative attitudes and control the
engagement of contraction and expansion in accordance with the turn shifts.
Likewise, her rapid gestures would indicate her activeness and openness in wel-
coming different viewpoints, while presenting an invisible defensiveness when a
need of building her own arguments in support of her view arose. Her head
movement is also properly controlled as she not only performs various swift head
movements in conveying surface meanings but also shows her own attentiveness
via such movements.
Therefore, as a whole, Diana can be felt to be natural in the meaning-making
process of group discussion. Diana is even more proficient than Linda because not
only can her intended meaning be instantiated by various nonverbal delivery
channels, with ideational, interpersonal and textual meanings realised, but the
different modalities of nonverbal delivery, along with the modality of verbal
language, can also co-ordinate in an integrated manner to maximise the meaning potential.
RSV-II-RQ2: To what extent are teacher raters’ and peer raters’ scoring in non-
verbal delivery alignable with the corresponding descriptors of the proposed rating
scale?
This research question can be generally addressed with the above fine-grained
analyses and discussion, in two respects.
First, a closer look at the nonverbal delivery descriptors might generate a few
keywords, or crucial points of observation. In describing eye contact, the
main demarcation lies in frequency, controlledness and briefness, with the first
keyword pertaining to formal eye contact and the latter two concerned with the
metafunctions explored in an MDA approach above. Gesture, in addition to
frequency (formal gesture), is also described in terms of variation (formal gesture) and
communicativeness (metafunctions) on the rating scale. Head movement, as the last
dimension of nonverbal delivery, is judged against appropriateness of head nod or
shake. The exclusion of frequency in head movement, as previously explained, is to
minimise the intervening effect that candidates' diversified personalities and
cultural backgrounds might exert on the scoring results. Therefore, appropriateness
in head movement can be aligned via both the formal features and the metafunctions
of head movements. The detailed descriptions of the three candidates' performance
in eye contact, gesture and head movement indicate that what is found above can almost
perfectly match what is supposed to be observed and stipulated in the rating scale.
Second, when the nonverbal subscores assigned by the teacher raters and peer
raters are considered, there was no inconsistency in Linda’s (3) and Diana’s
(4) subscores, and most observable and analysable characteristics of their formal
nonverbal delivery and their respective metafunctions can be accorded with the
respective bands. Tom was assigned 1 by peer raters and 2 by teacher raters. This
discrepancy can be mediated because all raters were supposed to, respectively,
observe eye contact, gesture and head movement to reach one subscore of non-
verbal delivery. The judgment on the poor performance in one nonverbal delivery
channel might unconsciously impair the judgment of another. In the case of Tom,
there was no detectable head nod or shake, for which raters might assign 1, yet
raters might also assign 2 owing to their observation that most of his gestures,
though detectable, were not communication-enhancing. Therefore, justifications can
be made that
teacher raters’ and peer raters’ scoring in nonverbal delivery can be to a great extent
alignable with the nonverbal delivery descriptors of the proposed rating scale.
RSV-II-RQ3: To what extent can the nonverbal delivery descriptors distinguish
candidates across different proficiency levels?
This research question addresses the discriminating power of the gradable
descriptors of the rating scale. As specified in the research design of this chapter,
the three candidates were randomly selected from three predetermined proficiency
groups. The scoring results against the proposed rating scale have already
distinguished them across three levels, with Diana and Linda, candidates from the advanced and
intermediate proficiency levels, respectively, falling into Band 4 and Band 3, and
Tom positioned between Band 2 and Band 1. Therefore, this ranking basically
corresponds to the predetermined proficiency levels of these candidates.
As is found above, the nonverbal delivery descriptors of the rating scale can
effectively discern the case of Tom because his poor performance aligns closely
with the detailed descriptors specified previously. Linda is distinguished from
Diana by a few formal nonverbal delivery performances and the corresponding
metafunctions. Formally, Linda's eye contact is found to be brief instead of
durable and firm, and occasionally she also presented certain inappropriate head
nodding. Considering the metafunctions, her inactiveness as reflected in the
interpersonal meaning of eye contact, and her greater tendency towards the
engagement of contraction, can account for the downgraded subscore she was
assigned. Conversely, Diana is found to be satisfactory in the aspects where Linda
fell short. Therefore, the
discriminating power of the rating scale, particularly with regard to the nonverbal
delivery descriptors, can also be accordingly validated.
8.5 Summary
Following the line of validating the revised rating scale, this phase of study adopted
an MDA approach to analyse three randomly selected candidates’ (Tom, Linda and
Diana) nonverbal delivery performance. When nonverbal channels were investigated
from the perspective of their formal manifestations, a series of parameters,
such as frequency, directionality, duration and levels, were probed into. However,
due to the complexity of gestures, this study also focused on the use of hands and
detailed gesture descriptions for a further analysis. When nonverbal channels were
analysed with regard to their metafunctions, the integrated framework drawn from
Martinec’s (2000, 2001, 2004) and Hood’s (2007, 2011) research was referred to.
In investigating formal nonverbal channels, namely the first stratum of the
general framework reviewed in the literature, this study found that the three can-
didates differ in their employment of nonverbal delivery, yet their individual per-
formance on nonverbal delivery may be generally aligned with the corresponding
rating scale descriptors, especially concerning the quantifier descriptors, such as
the parameters of frequency and duration. Among the candidates, Tom seemed to be
most sedentary, without salient performance in any of the nonverbal channels
observed. Comparatively, Linda and Diana performed better in that they both
frequently and constantly resorted to eye contact, gesture and head movement in
accompanying their verbal language.
Further elevated to the second stratum of the general framework, where the
metafunctions instantiated by the candidates’ nonverbal channels were analysed,
this study focused on an even more fine-grained comparison between Linda and
Diana, as an analysis of Tom's performance was largely excluded due to his low
profile in nonverbal delivery. The comparison found that Diana was able to
instantiate different metafunctional meanings via her nonverbal delivery. In
addition, she was shown to impress the other discussants as an engaged, articulate
8.5 Summary 259
and strategic speaker in the group discussion. Diana could also shift the meta-
functions of a particular nonverbal channel in accordance with turn-taking.
Although Linda also performed quite satisfactorily in nonverbal delivery, the
metafunctions realised via her nonverbal channels presented the image of a
slightly passive and hesitant speaker among the discussants. This comparison
lends further support to the alignment of the candidates’ performance with the
subscores assigned to them, as well as with the observable descriptors of
nonverbal delivery on the rating scale. In particular, the key qualifiers used in
the descriptors, such as controlled (eye contact), communication-conducive
(gesture) and appropriate (head movement), can be further validated.
Beyond what is summarised above, this study also explored the third stratum
specified in the general framework for validating the rating scale with an MDA
approach. Diana was found to deploy different channels of nonverbal delivery
alongside her accompanying verbiage so that the intended meaning could be
conveyed more effectively. Even when there was no synchronised verbal language,
different nonverbal channels could co-function to enhance meaning instantiation
in Diana’s case. By contrast, very limited co-ordination across nonverbal
channels could be detected in Tom’s and Linda’s performances.
In light of what the analyses of the candidates’ nonverbal delivery revealed,
the discriminating power of the rating scale, as reflected by its four gradable
bands, was accordingly validated. It can be concluded that nonverbal delivery,
as a newly devised assessment dimension incorporated into this rating scale, is
valid in measuring candidates’ nonverbal delivery performance, which is judged
the most salient representation of strategic competence under the CLA model.
Therefore, by combining a validation study using an MTMM approach with one
using an MDA approach, this research project accomplishes the validation of the
proposed rating scale in a triangulated manner.
References
Hood, S.E. 2007. Gesture and meaning making in face-to-face teaching. Paper Presented at the
Semiotic Margins Conference, University of Sydney.
Hood, S.E. 2011. Body language in face-to-face teaching: A focus on textual and interpersonal
meaning. In Semiotic margins: Meaning in multimodalities, ed. S. Dreyfus, S. Hood and
M. Stenglin, 31–52. London and New York: Continuum.
Lim, F.V. 2011. A systemic functional multimodal discourse analysis approach to pedagogic
discourse. Unpublished PhD thesis. Singapore: National University of Singapore.
Martinec, R. 2000. Types of processes in action. Semiotica 130(3): 243–268.
Martinec, R. 2001. Interpersonal resources in action. Semiotica 135(1): 117–145.
Martinec, R. 2004. Gestures that co-occur with speech as a systematic resource: The realisation of
experiential meanings in indexes. Social Semiotics 14(2): 193–213.
White, S. 1989. Backchannels across cultures: A study of Americans and Japanese. Language in
Society 18: 59–76.
Young, R.F., and J. Lee. 2004. Identifying units in interaction: Reactive tokens in Korean and
English conversations. Journal of Sociolinguistics 8(3): 380–407.
Chapter 9
Conclusion
This section briefly summarises the main findings of the three research phases in
this project.
In the AB phase, this study conducted an empirical investigation into the role of
nonverbal delivery in Chinese EFL candidates’ performance in group discussion,
particularly into how candidates across a range of proficiency levels might be
discriminated by their nonverbal delivery performance. In a sense, if nonverbal
delivery can discriminate well among candidates of predetermined proficiency
levels, an argument for incorporating nonverbal delivery into speaking assessment
can accordingly be advanced.
In this phase of the study, it was found that although candidates generally kept
a low profile in employing nonverbal delivery in group discussion, those across a
range of proficiency levels could be statistically discerned with regard to their
performance of eye contact, gesture and head movement. Candidates of advanced
proficiency were characterised by a higher frequency and longer duration of eye
contact. Elementary-level candidates, though featuring a high frequency of eye
contact, were inclined to shift their gaze hurriedly without much fixed or durable
eye contact with their peer discussants. In addition, rather than enhancing
communication effectiveness, most occurrences of their eye contact, if not all,
served regulatory or adaptive purposes. Although intermediate-level candidates
were found to make eye contact with other discussants, the degree to which their
eye contact served attentive purposes was more limited compared with that of
their advanced counterparts.
Candidates’ gestures could mainly be distinguished in terms of frequency,
diversity and communication-conduciveness. Advanced candidates performed
satisfactorily on all of these measures, whereas candidates at the elementary
proficiency level were found to resort to gestures extremely rarely in
accompanying their verbal language. The intermediate-level
(Δχ2(17) = 425.68, p < 0.001, ΔCFI = 0.146). The standardised parameter estimates
and trait–method correlations revealed no method effect or bias concerning rating
methods. Thus, this rating scale, with nonverbal delivery included as a crucial
dimension, was validated from a statistical perspective.
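As a hedged illustration of the nested-model comparison behind a Δχ2 statistic of this kind, the sketch below performs a chi-square difference test. The model-level chi-squares (625.68 and 200.00) and degrees of freedom (37 and 20) are hypothetical placeholders chosen only so that their difference matches the reported Δχ2(17) = 425.68; they are not the study’s actual model fits:

```python
# Critical value of the chi-square distribution for df = 17 at alpha = 0.001,
# taken from a standard chi-square table.
CRITICAL_17_001 = 40.79

def chi_square_difference(chisq_restricted, df_restricted, chisq_full, df_full):
    """Chi-square difference test for nested CFA/MTMM models.

    The restricted model has fewer free parameters, hence more degrees of
    freedom and a chi-square at least as large as the full model's.
    """
    delta_chisq = chisq_restricted - chisq_full
    delta_df = df_restricted - df_full
    return delta_chisq, delta_df

# Hypothetical model fits whose difference reproduces the reported statistic.
delta_chisq, delta_df = chi_square_difference(625.68, 37, 200.00, 20)
significant = delta_chisq > CRITICAL_17_001  # True implies p < 0.001
print(delta_chisq, delta_df, significant)
```

A difference this far beyond the critical value indicates that constraining the restricted model significantly worsens fit, which is the logic underlying the reported comparison.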
The rating scale, especially its assessment dimension of Nonverbal Delivery, was
further validated at the micro-level with an MDA approach. Three randomly
selected candidates (pseudonymised as Tom, Linda and Diana) representing
different proficiency levels were examined with regard to their de facto
performance in nonverbal delivery. Tom, with a subscore of 1.5 on nonverbal
delivery, was found to be rather sedentary and passive in the group discussion
because only a limited number of nonverbal channels with ideational meanings
were instantiated. The majority of his nonverbal delivery occurrences remained
performative, or served as a likely regulation to adapt himself to the assessment
setting. In that sense, almost no interpersonal or textual meanings could be
detected from his nonverbal delivery; thus, Tom was reduced to stagnation, where
only the first stratum of nonverbal delivery employment could be taken into
account in his case.
In stark contrast, Diana, as a representative of the advanced proficiency level
who was assigned a full mark in nonverbal delivery, was found to be articulate
in eclectically resorting to a repertoire of nonverbal channels to accompany her
verbiage. At certain points, her nonverbal performance could also instantiate
intended meanings without any synchronised verbal language. Judging from the
perspective of metafunctions, she was found to be capable of realising a variety
of meaning potentials via nonverbal delivery. Although she seemed somewhat
aggressive in the group discussion, her frequent shifts in instantiating different
nonverbal channels with discrepant metafunctions would impress the other
discussants as an active speaker willing to negotiate meaning as well as an
attentive listener. Although Linda, whose subscore for nonverbal delivery was 3,
performed quite satisfactorily in terms of formal nonverbal channels, she was
found to be slightly passive and hesitant in the group discussion. In particular,
when the interpersonal meaning of her gestures was looked into, she seemed
self-contained and appeared to create a certain distancing effect on her peer
discussants.
The above profile of the three candidates’ performance on nonverbal delivery can
also be aligned with the descriptors of nonverbal delivery on the rating scale
and the subscores they were assigned. Therefore, the MDA approach further
validated the rating scale with regard to certain keywords to be observed in the
rating process, as well as a number of quantifiers that reflect the discriminant
bands of candidates’ nonverbal delivery.
per se is also anticipated to yield implications. Since much hope is pinned on
this product being routinely applied in group discussion in formative assessment,
certain washback effects (Alderson and Wall 1993; Cheng 2005; Green 2007)
should also be considered. This section dwells upon the possible implications
this rating scale might have for English teaching and textbook compilation, both
of which are the main sources from which EFL learners acquire the English language.
the classroom or somewhere in the corner. Although this way of seating may be
prevalent in certain EFL teaching contexts, it would be highly desirable in the
Chinese EFL context, given its large population of English learners.
Despite the significance of what has been revealed in this study and implications
yielded above, it has to be admitted that this research is not without caveats. The
following two points have to be highlighted when the limitations of the study are
considered.
First, as reviewed in the literature, nonverbal delivery can be highly specific
to the social context, which means there can be substantial differences in
nonverbal communication from one social context to another. It is therefore
likely that, similar to language transfer, EFL learners exhibit the same
nonverbal delivery performance as they would in their native language. Although
this point might be claimed as an excuse for EFL learners to keep a low profile
in their nonverbal delivery in certain social contexts, it should be borne in
mind that since EFL learners communicate and are assessed in English, they are
supposed to perform as expected in the target language. In order to minimise the
possible effects of L1 nonverbal delivery transfer, this study maintained a
homogeneous social context, in which all the data, ranging from learners’
video-recordings to the scoring results, were collected in the Chinese EFL
context. With regard to rater characteristics, the raters were homogeneous in
that they were all Chinese nationals. These findings thus rest on the expectation
that raters score candidates from the same social context; if raters from other
social contexts had been selected for this study, the scoring might have been
jeopardised because they might have been either more severe or more lenient with
the candidates in the Chinese EFL context.
Second, nonverbal delivery is also highly personality-oriented. It can be
observed that more extroverted learners might be more likely to resort to
nonverbal delivery channels. However, this study managed to offset this weakness
by “being lenient” in the descriptors to be observed. When the argument for
embedding nonverbal delivery into speaking assessment was built, a good number
of parameters were taken into account, whereas when the descriptors of nonverbal
delivery were formulated, not every fine-grained parameter, such as the duration
of gesture, was written into the rating scale descriptors. This is because, if
all the details of nonverbal delivery channels were considered, not only would
raters find it infeasible to observe so many points in the scoring process, but
they might also be required to be too tough on less extroverted candidates.
Therefore, it can be claimed that the corresponding descriptors only manifest
the most basic and salient presentations of expected nonverbal delivery.
The above research limitations can indeed provide insights into future research
directions, outlined as follows.
First, rater characteristics can be regarded as a variable to be further explored.
Should native speakers or speakers from other EFL contexts be designated as
raters to score the same performances against the proposed rating scale, there
might or might not be differences. If there is no discrepancy in the rating
results between native and non-native speakers, the possible effect of raters’
social contexts on the scoring results can be considered negligible. However,
should significant differences emerge, a word of caution would be in order,
limiting the applicability of the proposed rating scale to a homogeneous social
context only. In a recent study, Gui (2012) posits that Chinese and American
raters might hold different perceptions of nonverbal communication when scoring
contestants’ performance in public speaking. A follow-up study deriving from the
present research would thus further help validate the rating scale with regard
to its scope of utility.
Second, the argument for embedding nonverbal delivery into speaking assessment
can also be further consolidated by comparing different scoring contexts, in
which raters are provided either with the video-recording or with the
audio-recording only. If raters are deprived of the visual channel that would
otherwise enable them to view candidates’ nonverbal delivery, the rating
differences in candidates’ overall performance across a range of proficiency
levels might not be as significant as revealed in this study. In the context of
formative assessment, where more detailed feedback to learners and teaching
practitioners is required, blocking the visual channel in the scoring process
can be regarded as an impediment to comprehensive assessment and a potential
danger to test fairness.
9.5 Summary
This chapter, recapitulating the main findings of each research phase, draws the
whole research project to a conclusion. Proceeding from three research aims, this
study links the argument for embedding nonverbal delivery into speaking
assessment with the development and validation of a rating scale, so that the
role of nonverbal delivery in assessing communicative ability is given
increasing prominence. It is highlighted that the final product of this study,
namely a validated rating scale to be used for group discussion in the context
of formative assessment, would not only be of much practical utility but also
achieve positive washback effects on EFL teaching and textbook compilation. The
last two sections, respectively, clarify the limitations of this study
concerning candidate variability in nonverbal delivery performance, and point
out directions for exploring nonverbal delivery from the perspectives of rater
characteristics and of whether rating should be conducted via audio- and/or
video-recordings in formative assessment.
References
Alderson, J.C. 1993. Judgments in language testing. In A new decade of language testing
research: Selected papers from the 1990 language testing research colloquium, ed.
D. Douglas, and C. Chapelle, 46–50. Washington, DC: Teachers of English to Speakers of
Other Languages Inc.
Alderson, J.C., and D. Wall. 1993. Does washback exist? Applied Linguistics 14(2): 115–129.
Allwright, R. 1984. The importance of interaction in classroom language teaching. Applied
Linguistics 5: 156–171.
Bailey, K.M., and D. Nunan (eds.). 1996. Voices from the language classroom: Qualitative
research in second language education. New York: Cambridge University Press.
Cheng, L. 2005. Changing language teaching through language testing: A washback study.
Cambridge: Cambridge University Press.
Cunningsworth, A. 1995. Choosing your coursebook. Oxford: Heinemann.
Ellis, R. 1990. Instructed second language acquisition. Oxford: Blackwell.
Frank, C. 1999. Ethnographic eyes: A teacher’s guide to classroom observation. Westport:
Heinemann.
Green, A. 2007. Washback to learning outcomes: A comparative study of IELTS preparation and
university pre-sessional language courses. Assessment in Education 14(1): 75–97.
Gui, M. 2012. Exploring differences between Chinese and American EFL teachers’ evaluations of
speech performance. Language Assessment Quarterly 9(2): 186–203.
Lim, F.V. 2011. A systemic functional multimodal discourse analysis approach to pedagogic
discourse. Unpublished PhD thesis. Singapore: National University of Singapore.
Long, M.H., and C.J. Sato. 1983. Classroom foreigner talk discourse: Forms and functions of
teachers’ questions. In Classroom oriented research in second language acquisition, ed.
H.W. Seliger and M.H. Long, 268–285. Rowley, MA: Newbury House.
Appendix I
IELTS Speaking Rating Scale
(Band 8 and Band 9)
Score | General description | Delivery | Language use | Topic development
General description: … at this level is characterised by at least two of the following …
Delivery: (… is not significantly affected)
Language use: somewhat limited in the range of structures used. This may affect overall fluency, but it does not seriously interfere with the communication of the message
Appendix III
TEEP Speaking Rating Scale
Appropriateness
0 Unable to function in the spoken language.
1 Able to operate only in a very limited capacity: responses characterised by
sociocultural inappropriateness.
2 Signs of developing attempts at response to role, setting, etc., but
misunderstandings may occasionally arise through inappropriateness,
particularly of sociocultural convention.
3 Almost no errors in the sociocultural conventions of language; errors not
significant enough to be likely to cause social misunderstandings.
Grammatical accuracy
0 Unable to function in the spoken language; almost all grammatical patterns
are inaccurate, except for a few stock phrases.
1 Syntax is fragmented and there are frequent grammatical inaccuracies;
some patterns may be mastered but speech may be characterised by a
telegraphic style and/or confusion of structural elements.
2 Some grammatical inaccuracies; developing a control of major patterns,
but sometimes unable to sustain coherence in longer utterances.
3 Almost no grammatical inaccuracies; occasional imperfect control of a few
patterns.
Intelligibility
0 Severe and constant rhythm, intonation and pronunciation problems cause
almost complete unintelligibility.
1 Strong interference from L1 in rhythm, intonation and pronunciation;
understanding is difficult and achieved often only after frequent repetition.
2 Rhythm, intonation and pronunciation require concentrated listening, but
only occasional misunderstanding is caused or repetition required.
3 Articulation is reasonably comprehensible to native speakers; there may be
a marked “foreign accent” but almost no misunderstanding is caused and
repetition required only infrequently.
Fluency
0 Utterances halting, fragmentary and incoherent.
1 Utterances hesitant and often incomplete except in a few stock remarks
and responses. Sentences are, for the most part, disjointed and restricted in
length.
2 Signs of developing attempts at using cohesive devices, especially
conjunctions. Utterances may still be hesitant, but are gaining in coherence,
speed and length.
3 Utterances, while occasionally hesitant, are characterised by an evenness
and flow hindered, very occasionally, by groping, rephrasing and
circumlocutions. Inter-sentential connectors are used effectively as fillers.
Appendix IV: BEC Level 1 Rating Scale

0 NONSPEAKER: Insufficient sample to make an assessment or totally incomprehensible.
1 VERY LIMITED SPEAKER AT BEC 1 LEVEL: Has considerable difficulty communicating in everyday situations, even when listener is patient and supportive. Basic structures consistently distorted; lack of vocabulary makes communication on familiar topics consistently difficult. Very limited range of structures; little or no attempt at using cohesive devices; speech halting; pauses may be lengthy; utterances sometimes abandoned. Speech generally fragmented; no lengthy utterances attempted; turns not developed. Frequent pronunciation errors; intrusive first-language characteristics consistently hinder understanding. Stress and intonation patterns generally distorted. May not understand language and purpose of talk. Often depends on interlocutor/partner for initiating or sustaining utterances. Has difficulty in responding to topic-shifts; often seems unaware of them. Can use very basic conversational formulae but may interact inappropriately. Generally unable to repair communication problems himself/herself. Listening ability: often requires rephrasing.
2 Some features of 1 and some of 3.
3 BASIC SPEAKER AT BEC 1 LEVEL: Able to communicate in everyday situations if listener is patient and supportive. Most utterances are basic structures, with frequent errors of grammar, vocabulary and style. Range of vocabulary and style only partly adequate for familiar topics and situations. Limited range of structures; attempt at using cohesive devices; speech often halting, though some utterances flow smoothly. Most utterances short; turns rarely developed. Fairly frequent pronunciation errors; first-language characteristics noticeably hinder understanding. Strongly marked first-language interference in prosody. Generally understands language and purpose of task. Sometimes has to be drawn out; requires assistance from interlocutor/partner. Has some difficulty in responding to topic-shifts. Often inappropriate or ineffective in turn-taking or responding to interlocutor/partner. Has difficulty using basic repair strategies. Listening ability: sometimes requires rephrasing.
4 Some features of 3 and some of 5.
5 MODERATE SPEAKER AT BEC 1 LEVEL: Generally able to communicate in everyday situations with little strain on listener. Basic structures sufficiently accurate for everyday use; difficulty with more complex structures. Adequate range of vocabulary for familiar topics; some errors in style. Some range of structures; some use of cohesive devices, though not always successfully. Speech generally flows smoothly; some hesitation while searching for language. Often uses appropriately long utterances, though may leave turns undeveloped. Some pronunciation errors; first-language characteristics may hinder understanding. Fairly marked first-language interference in prosody. Deals with tasks reasonably effectively. Occasionally relies on assistance of interlocutor/partner in initiating or sustaining utterances. Responds to topic-shifts, but may require time to do so. Usually appropriate and effective in turn-taking and responding to interlocutor/partner. Generally uses appropriate repair strategies. Listening ability: occasionally requires rephrasing.
Analytic rating scale

Grammar and vocabulary
0 Impossible to understand or insufficient to assess.
1 Frequently difficult to understand. Basic structures consistently distorted; lack of vocabulary makes communication on familiar topics consistently difficult.
2 Some features of 1 and some of 3.
3 Meaning sometimes obscured. Most utterances are basic structures, with frequent errors of grammar, vocabulary and style; range of vocabulary and style only partly adequate for familiar topics and situations.
4 Some features of 3 and some of 5.
5 Meaning generally conveyed despite errors. Basic structures sufficiently accurate for everyday use; difficulty with more complex structures; adequate range of vocabulary for familiar topics; some errors in style.

Discourse management
0 (Almost) no linguistic resources.
1 Very limited range of linguistic resources: very limited range of structures; little or no attempt at using cohesive devices; speech halting; pauses may be lengthy; utterances sometimes abandoned; speech generally fragmented; no lengthy utterances attempted; turns not developed.
2 Some features of 1 and some of 3.
3 Limited range of linguistic resources: limited range of structures; some attempt at using cohesive devices; speech often halting, though some utterances flow smoothly; most utterances short; turns rarely developed.
4 Some features of 3 and some of 5.
5 Fair range of linguistic resources: some range of structures; some use of cohesive devices, though not always successfully; speech generally flows smoothly; some hesitation while searching for language; often uses appropriately long utterances, though may leave turns undeveloped.

Pronunciation
0 Impossible to understand or insufficient to assess.
1 Frequently difficult to understand: frequent pronunciation errors; very intrusive first-language characteristics consistently hinder understanding; stress and intonation patterns generally distorted.
2 Some features of 1 and some of 3.
3 Sometimes difficult to understand: fairly frequent pronunciation errors; first-language characteristics noticeably hinder understanding; strongly marked first-language interference in prosody.
4 Some features of 3 and some of 5.
5 Occasionally difficult to understand: some pronunciation errors; first-language characteristics may hinder understanding; fairly marked first-language interference in prosody.

Interactive communication
0 (Almost) no interaction with interlocutor/partner.
1 Frequently dependent in interaction: may not understand language and purpose of task; often depends on interlocutor/partner for …
2 Some features of 1 and some of 3.
3 Sometimes dependent in interaction: generally understands language and purpose of task; sometimes has to be drawn out; requires …
4 Some features of 3 and some of 5.
5 Fairly independent in interaction: deals with tasks reasonably effectively; occasionally relies on assistance of …
Dear Teachers,
Many thanks for participating in this questionnaire survey. It is related to a study on
“Nonverbal Delivery in Speaking Assessment: From an Argument to a Rating Scale
Development and Validation”. It is my honour to have invited you to share your
views on the features of good oral English proficiency in group discussion. It
will take you about 10–15 min to complete this questionnaire. Please carefully
read the following directions before you proceed to your responses.
******************************************************************
Directions: Please circle the number corresponding to your perception of
each statement. If you strongly agree with the statement, please circle the
number 5; if you agree with the statement, please circle the number 4; if you
find it hard to make a judgment, please circle the number 3; if you disagree
with the statement, please circle the number 2; if you strongly disagree with
the statement, please circle the number 1.
1. Pronunciation accuracy is important in assessing candidates’ oral English
proficiency.
1 2 3 4 5
2. Intelligibility in pronunciation to facilitate listener’s effort is important in
assessing candidates’ oral English proficiency.
1 2 3 4 5
3. Good pronunciation in oral English proficiency means sounding native-like.
1 2 3 4 5
4. Speaking smoothly and loudly can help clear communication.
1 2 3 4 5
5. Effective use of pitch patterns and pauses means effective control of
intonation.
1 2 3 4 5
6. Effective use of stress means effective control of intonation.
1 2 3 4 5
Dear Teachers,
Many thanks for participating in this questionnaire survey. It is related to a study on
“Nonverbal Delivery in Speaking Assessment: From an Argument to a Rating Scale
Development and Validation”. It is my honour to have invited you to share your
views on the features of good oral English proficiency in group discussion. It
will take you about 10–15 min to complete this questionnaire. Please carefully
read the following directions before you proceed to your responses.
尊敬的老師:
非常感謝您能參加此次問卷調查。此次問卷調查是有關“口語測試中之非言
語行為:論述的構建到評分量表的設計與驗證”之博士論文科研項目。我們很
榮幸能夠邀請到您,並由您向我們提供您對學生小組討論時評估其英語口語
能力特徵的看法。本次問卷大約會佔用您10至15分鐘的時間。勞煩您在填寫
以下問卷之前仔細閱讀填寫細則。
******************************************************************
Directions: Please circle the number corresponding to your perception of
each statement. If you strongly agree with the statement, please circle the
number 5; if you agree with the statement, please circle the number 4; if you
find it hard to make a judgment, please circle the number 3; if you disagree
with the statement, please circle the number 2; if you strongly disagree with
the statement, please circle the number 1.
以下是對英語小組討論時學生口語能力特徵的部分描述,請在相應的數字上
畫圈。如果您極為贊同這一描述,則請在數字5上面畫圈;如果您贊同這一描
述,則請在數字4上面畫圈;如果您對這一描述較難判斷,則請在數字3上面畫
圈;如果您不贊同這一描述,則請在數字2上面畫圈;如果您極為不贊同這一描
述,則請在數字1上面畫圈。
11. Choosing appropriate words and phrases is important in assessing the candi-
dates’ vocabulary.
選擇恰當的詞語及短語對評估學生的詞彙很重要。
1 2 3 4 5
12. Employing cohesive devices, such as those indicating cause and effect (be-
cause, therefore) and sequence (then), and discourse markers, such as well, I
mean, in group discussion is important in assessing the candidates’ oral
English proficiency.
在小組討論中運用銜接手段,如表明因果關係(because, therefore)和秩序關
係(then)和話
語標記語,如well及 I mean, 對評估學生英語口語能力很重要。
1 2 3 4 5
Directions: Please circle the number corresponding to your perception of each
statement. If you strongly disagree with the statement, please circle the number 5; if
you disagree with the statement, please circle the number 4; if you find it hard to
make a judgment, please circle the number 3; if you agree with the statement, please
circle the number 2; if you strongly agree with the statement, please circle the
number 1.
以下是對英語小組討論時學生口語能力特徵的部分描述,請在相應的數字上
畫圈。如果您極為不贊同這一描述,則請在數字5上面畫圈;如果您不贊同這一
描述,則請在數字4上面畫圈;如果您對這一描述較難判斷,則請在數字3上面畫
圈;如果您贊同這一描述,則請在數字2上面畫圈;如果您極為贊同這一描述,則
請在數字1上面畫圈。
1. Fulfilling language communicative functions, such as greeting and apology, is
important in assessing the candidates’ oral English proficiency.
能夠完成各種語言交流功能,比如問候和道歉,對評估學生英語口語能力很
重要。
1 2 3 4 5
2. Stating topic-related ideas with reasons and examples is important in assessing
the candidates’ oral English proficiency.
運用說理和舉例來闡述與話題有關的內容對評估學生英語口語能力很重
要。
1 2 3 4 5
3. Choosing appropriate language to fit different contexts and audience means
good oral English proficiency.
根據不同的場合和聽眾來選擇恰當的語言意味着較好的英語口語能力。
1 2 3 4 5
Appendix VI: Questionnaire for Teachers (Final Version)
4. Knowing how to use fillers, such as so, I mean and well, to compensate for
occasional hesitation and control speech means good oral English proficiency.
懂得運用填充語,如so, I mean和well以彌補偶爾的遲疑來控制話語意味着
較好的英語口語
能力。
1 2 3 4 5
***************************************************************
再次感謝您的合作和支持!
Appendix VII
Questionnaire for Learners (Trial Version)
Dear Students,
Many thanks for participating in this questionnaire survey. It is related to a study on
“Nonverbal Delivery in Speaking Assessment: From an Argument to a Rating Scale
Development and Validation”. It is my honour to have invited you to share your
views on the features of good oral English proficiency in group discussion. It
will take you about 10–15 min to complete this questionnaire. Please carefully
read the following directions before you proceed to your responses.
親愛的同學:
非常感謝您能參加此次問卷調查。此次問卷調查是有關“口語測試中之非言
語行為:論述的構建到評分量表的設計與驗證”之博士論文科研項目。我們很
榮幸能夠邀請到您,並由您向我們提供您對學生小組討論時評估其英語口語
能力特徵的看法。本次問卷大約會佔用您10至15分鐘的時間。勞煩您在填寫
以下問卷之前仔細閱讀填寫細則。
******************************************************************
Directions: Please circle the number corresponding to your perception of each
statement. If you strongly agree with the statement, please circle the number 5;
if you agree with the statement, please circle the number 4; if you find it hard
to make a judgment, please circle the number 3; if you disagree with the
statement, please circle the number 2; if you strongly disagree with the
statement, please circle the number 1.
以下是對英語小組討論時學生口語能力特徵的部分描述,請在相應的數字上
畫圈。如果您極為贊同這一描述,則請在數字5上面畫圈;如果您贊同這一描
述,則請在數字4上面畫圈;如果您對這一描述較難判斷,則請在數字3上面畫圈;
如果您不贊同這一描述,則請在數字2上面畫圈;如果您極為不贊同這一描述,
則請在數字1上面畫圈。
11. Choosing appropriate words and phrases is important in assessing the candi-
dates’ vocabulary.
選擇恰當的詞語及短語對評估學生的詞彙很重要。
1 2 3 4 5
12. Employing cohesive devices, such as those indicating cause and effect (because, therefore) and sequence (then), and discourse markers, such as well and I mean, in group discussion is important in assessing the candidates’ oral English proficiency.
在小組討論中運用銜接手段,如表明因果關係(because, therefore)和秩序關係(then),和話語標記語,如well及I mean,對評估學生英語口語能力很重要。
1 2 3 4 5
Directions: Please circle the number corresponding to your perception of each
statement. If you strongly disagree with the statement, please circle the number 5;
if you disagree with the statement, please circle the number 4; if you find it hard
to judge, please circle the number 3; if you agree with the statement, please circle
the number 2; if you strongly agree with the statement, please circle the number 1.
以下是對英語小組討論時學生口語能力特徵的部分描述,請在相應的數字上
畫圈。如果您極為不贊同這一描述,則請在數字5上面畫圈;如果您不贊同這一
描述,則請在數字4上面畫圈;如果您對這一描述較難判斷,則請在數字3上面畫
圈;如果您贊同這一描述,則請在數字2上面畫圈;如果您極為贊同這一描述,則
請在數字1上面畫圈。
1. Fulfilling language communicative functions, such as greeting and apology, is
important in assessing the candidates’ oral English proficiency.
能夠完成各種語言交流功能,比如問候和道歉,對評估學生英語口語能力很
重要。
1 2 3 4 5
2. Stating topic-related ideas with reasons and examples is important in assessing
the candidates’ oral English proficiency.
運用說理和舉例來闡述與話題有關的內容對評估學生英語口語能力很重
要。
1 2 3 4 5
3. Choosing appropriate language to fit different contexts and audience means
good oral English proficiency.
根據不同的場合和聽眾來選擇恰當的語言意味着較好的英語口語能力。
1 2 3 4 5
Appendix VIII: Questionnaire for Learners (Final Version)
4. Knowing how to use fillers, such as so, I mean and well, to compensate for occasional hesitation and control speech indicates good oral English proficiency.
懂得運用填充語,如so, I mean和well以彌補偶爾的遲疑來控制話語意味着較好的英語口語能力。
1 2 3 4 5
***************************************************************
Thank you again for your cooperation and support!
再次感謝您的合作和支持!
Appendix IX
Proposed Rating Scale (Tentative Version)
Discourse Management
Fluent         5 4 3 2 1    Disfluent
Coherent       5 4 3 2 1    Scattered
Developed      5 4 3 2 1    Underdeveloped
Nonverbal Delivery
Frequent       5 4 3 2 1    Infrequent
Durable        5 4 3 2 1    Brief
Appropriate    5 4 3 2 1    Inappropriate
Varied         5 4 3 2 1    Monotonous
Band Band descriptors for grammar and vocabulary
Vocabulary breadth and depth sufficient for expression, with occasional detectable
inaccuracy
Accompanying infrequent use of idiomatic chunks
3 Noticeable grammatical errors slightly reducing expressiveness
Effective and accurate use of simple structures, with less frequent use of complex
structures
Frequent error-free sentences
Vocabulary breadth sufficient for the topic, with less noticeable vocabulary depth
Rare use of idiomatic chunks
2 Noticeable grammatical errors seriously reducing expressiveness
Fairly accurate use of simple structures, with inaccuracy in complex structures
Frequently incomplete and choppy sentences
Vocabulary breadth insufficient for the topic
Inaccurate use of words causing confusion
1 Frequent grammatical errors, with no intention of self-correction
Detectable and repetitive formulaic expressions
Inaccuracy and inability to use basic structures
Topic development seriously limited by vocabulary scarcity
<sp3> This is a hard choice between Shanghai and hometown. And what do you
want to know about future? </sp3>
<sp1> If I know if I have special power, I want to know what air environment will
be. After some years later or some decades later, as you know that we are in face of
many environmental problems and some the local problems have graduated into the
international issues. Sometimes we may talk about what we will do if the end of the
earth really occurs. </sp1>
<sp3> I know you say in the movie. </sp3>
<sp2> Just just just to me. Two days before I dreamed of there in Shanghai have an
earthquake. So horrible, so horrible. </sp2>
<sp1> Really? So terrible. Maybe you when when you woke up, you will feel
lucky that it was just a dream. </sp1>
<sp3> Yeah, my major is environmental engineering. I think I can do something to
the environment, yes? </sp3>
<sp2> Yes, we must protect our environment. </sp2>
<sp3> I will do something to protect the river, to en…yeah to the air and to
something else. </sp3>
<sp2> And and what what do you want to know about your future? </sp2>
<sp3> What I want to know most is the condition of my parents’ health. They do
not have some serious illness these days but some small ones will come up just now
and then. Some days ago, my father told me that he feels it is a little bit hard for
him to go downstairs, so I worried about him very much. I asked her, him to to do
more exercise so he will en…her condition will be better, I think. En…everyone
will die but I don’t want them to suffer from a lot of pain before that day come.
That’s the thing I care about the most. What do you have anything else you want to
know? </sp3>
<sp2> I want to know what kind of person you will, you will marry. </sp2>
<sp1> We are also. </sp1>
<sp3> We too. </sp3>
<sp2> Is he taller or is he shorter? </sp2>
<sp1> Is he handsome? </sp1>
<sp3> Handsome? </sp3>
<sp2> Handsome? Yeah. </sp2>
<sp3> I hope he can be very kind and responsible, yes. If he is handsome and tall, it
can be better. </sp3>
<sp1> I hope that. </sp1>
<sp3> We all hope want a tall and handsome boyfriend, yes? </sp3>
<sp2> Yes, and I want to, I want to know what kind of job I will I will take I will
do in my future. </sp2>
<sp3> What do you want to do? </sp3>
<sp2> I want to be a university teacher. </sp2>
<sp3> I support you. </sp3>
<sp1> I want to be a white collar to earn a lot of money. </sp1>
<sp3> I want to be a psychologist but it is different, a little bit different from my
major. I want to know if I will fulfil my dream in the future. </sp3>
Appendix X: Transcriptions of the Three Selected Group Discussions
<sp2> Yes, because I want to see the singing star there, and that’s a good place for
shopping. </sp2>
<sp3> But I think it maybe so expensive, and it is so complicated to make a
passport. </sp3>
<sp2> Oh, yes, that’s a problem. </sp2>
<sp1> Yeah, the time is not enough. I think we should consider, not consider
abroad, because it’s too expensive. </sp1>
<sp2> In our country, China. </sp2>
<sp1> Yeah. </sp1>
<sp2> Do you have some idea? </sp2>
<sp3> I want to go Tibet. </sp3>
<sp1> Tibet? </sp1>
<sp3> Yeah, It’s my favourite space. And they are so cultural there, traditional
cultural. Em… and it’s mystery here. </sp3>
<sp2> Yes, Great, I think so. And I think there we can see some special animals,
like antelope, Tibet antelope and others. Do you think so? </sp2>
<sp1> But I think, Tibet is too far away, and the air pressures is not fit us. </sp1>
<sp3> Do you have some suggestions? </sp3>
<sp1> Em…let me think about. Maybe we, we can go to Guilin. </sp1>
<sp2> Guilin? Oh, that’s a good place. </sp2>
<sp1> Yeah, it’s very beautiful. The scenery attract me a lot. And in the TV
program, I see the shaped, strange shaped mountain and the beautiful river there.
And em…have a lot of, has a lot of legends there. I er…look forward there very
much. </sp1>
<sp3> But do you know as an attractive place, so in summer vacation there will be
so many people. </sp3>
<sp1> It’s a problem. </sp1>
<sp2> That’s a pity. I have never been to Guilin. And I’ve heard the saying that
Guilin’s scenery is the best of the world. </sp2>
<sp1> Yeah. </sp1>
<sp2> But I think maybe there are too many people there at that time. </sp2>
<sp3> So we should think about the close to Shanghai. </sp3>
<sp2> Some place close to Shanghai? </sp2>
<sp3> Yes. For example, Hangzhou or Suzhou? </sp3>
<sp1> Hangzhou, Suzhou is. </sp1>
<sp2> Hangzhou is my hometown! </sp2>
<sp1> They are very good place, but I think I have been there many, many times,
and I don’t want to go there again. </sp1>
<sp2> How about Suzhou? </sp2>
<sp1> Suzhou, em… </sp1>
<sp2> Can you introduce Suzhou? </sp2>
<sp3> Just some gardens, special. </sp3>
<sp1> Suzhou Garden is very famous. </sp1>
<sp3> Hanshan Temple. </sp3>
<sp2> Hanshan Temple! </sp2>
<sp2> Em…er…I think you are all right. But I’d better living in the city, because I
think the city is better for me. See you! </sp2>
<sp1><sp3> See you! </sp3></sp1>
</conversation>