Nonverbal Delivery in Speaking Assessment
From an Argument to a Rating Scale Formulation and Validation
Mingwei Pan
Faculty of English Language and Culture
Guangdong University of Foreign Studies
Guangzhou, Guangdong
China
Springer Science+Business Media Singapore Pte Ltd. is part of Springer Science+Business Media
(www.springer.com)
Preface
thus, Tom is reduced to stagnation, where only the mere occurrence of nonverbal delivery can be captured. In stark contrast, Diana, a representative of the advanced proficiency level who is assigned a full mark in nonverbal delivery, is found to be articulate in eclectically drawing on a repertoire of nonverbal channels to accompany her verbiage. At certain points, her nonverbal performance can also instantiate intended meanings in the absence of any synchronised verbal language. Judged from the perspective of metafunctions, she is capable of realising a variety of meaning potentials via nonverbal delivery. Although she seems somewhat aggressive in group discussion, her frequent shifts among nonverbal channels instantiating discrepant metafunctions would impress the other discussants as those of an active, negotiating speaker as well as an attentive listener.
Although Linda, whose subscore in nonverbal delivery is 3, performed quite satisfactorily in terms of formal nonverbal channels, she is found to be slightly passive and hesitant in the group discussion. In particular, when the interpersonal meaning of her gestures is looked into, she seems self-contained, creating a certain distancing effect on her peer discussants. The above profile of the three candidates' performance in nonverbal delivery can also be aligned with the descriptors of nonverbal delivery on the rating scale, thus lending weightier support to the validation of the proposed rating scale.
This research project is significant in that it organically integrates multimodal discourse analysis, a research method scarcely explored in language assessment, with rating scale validation, thus extending the literature on applying this method to further research of a similar kind. In addition, the research findings also suggest how nonverbal delivery can be integrated into EFL learning and teaching. In particular, this book illuminates how EFL textbooks could be compiled multimodally to carry a greater load of meaning making and how teaching practitioners could optimise EFL teaching by incorporating nonverbal delivery into daily instruction.
References
Bachman, L.F. 1990. Fundamental considerations in language testing. Oxford: Oxford University
Press.
Bachman, L.F., and A.S. Palmer. 1996. Language testing in practice: designing and developing
useful language tests. Oxford: Oxford University Press.
Hood, S.E. 2007. Gesture and meaning making in face-to-face teaching. Paper presented at the
Semiotic Margins Conference, University of Sydney.
Hood, S.E. 2011. Body language in face-to-face teaching: a focus on textual and interpersonal
meaning. In Semiotic margins: meanings in multimodalities, eds. S. Dreyfus, S. Hood, and
S. Stenglin, pp. 31–52. London: Continuum.
Liu, Q., and M. Pan. 2010a. A tentative study on non-verbal communication ability in Chinese college students' oral English. Computer-assisted Foreign Language Education in China 2: 38–43.
Liu, Q., and M. Pan. 2010b. Constructing a multimodal spoken English corpus of Chinese Science and Engineering major learners. Modern Educational Technology 4: 69–72.
A great many people have helped in the writing of this book. In particular, I feel profoundly indebted to Professor David D. Qian of The Hong Kong Polytechnic University, whose resourcefulness, insightfulness and supportiveness have removed my amateurishness in research and nurtured me towards professionalism in the field of language assessment.
I also wish to thank other scholars who so generously offered their time and
voices—Professor Frederick G. Davidson from the University of Illinois at
Urbana-Champaign, Professor Alister Cumming from the University of Toronto and
Professor Zou Shen from Shanghai International Studies University—and all the
others whose voices are also recorded here. Their scholarship continues to be a
great source of stimulation.
I would also like to thank the Springer editorial team who have been such a
pleasure to work with, in particular Ms. Rebecca Zhu and Ms. Yining Zhao.
Contents
1 Introduction
  1.1 Research Background
  1.2 Research Objectives
  1.3 General Research Questions
  1.4 Research Significance
  1.5 Book Layout
  1.6 Summary
  References
2 Literature Review
  2.1 Nonverbal Delivery
    2.1.1 Eye Contact
    2.1.2 Gesture
    2.1.3 Head Movement
  2.2 Communicative Competence
    2.2.1 Hymes' Notion of Communicative Competence
    2.2.2 Communicative Competence Model
    2.2.3 Communicative Language Ability Model
    2.2.4 Communicative Language Competence Model
    2.2.5 An Integrated Review on Communicative Competence
  2.3 Rating Scale and Formative Assessment
    2.3.1 Rating Scale
    2.3.2 Taxonomies of Rating Scales
    2.3.3 A Critique on the Existing Rating Scales
    2.3.4 Formative Assessment
    2.3.5 Properties of the Present Rating Scale
  2.4 Validity and Validation
    2.4.1 Validity: A Componential Notion
    2.4.2 Validity: A Unitary Notion
    2.4.3 Argument-Based Validation and AUA
English has long attracted considerable attention in language teaching and learning. Almost without exception, large-scale international English test batteries incorporate an oral testing component aimed at measuring candidates' communicative competence. In addition, irrespective of the testing form, be it an oral proficiency interview, a dialogue or a discussion, rating scales are generally regarded as the yardsticks against which candidates' communicative competence is observed and measured.
In the context of tertiary education in China, English language assessment has long been prioritised by education authorities, university administrators, teaching practitioners, students and parents alike. Be it the College English Test (CET) or the Test for English Majors (TEM), a separate speaking assessment is routinely administered to candidates whose scores in the written tests reach the required threshold. Apart from these domestic tests, Chinese EFL learners also have access to language proficiency tests administered worldwide, such as the Test of English as a Foreign Language (TOEFL), the International English Language Testing System (IELTS) and the Business English Certificate (BEC), all with oral assessments included. Most of these tests are so highly valued that Chinese college students spare no effort in obtaining high scores, whether to meet degree-conferring requirements, to equip themselves with a competitive edge in the job market or to gauge their own abilities, among other reasons.
However, although Chinese EFL learners seem to be caught up in a fervent craze for English proficiency tests, a substantial gap still exists between their general spoken English proficiency and what is stipulated in the curriculum requirements for college English learning in China. Cai's (2002a) study reveals that a total of 32,107 candidates took the College English Test Spoken English Test (CET-SET) from January 1999 to May 2001, and among them 18,550 test-takers, approximately 57.8 %, were assigned Grade B, equivalent to the intermediate proficiency level, signalling that more than half were only capable of developing certain familiar topics in English. In Cai's (2002b) follow-up study, it is further argued that, given the huge CET test population, those who would be judged as qualified English communicators can be expected to account for only a tiny proportion of all Chinese EFL learners nationwide.
Although the above figures provide insufficient evidence for a detailed profile, an abundance of studies might aggravate the concern over the status quo of Chinese college EFL learners' spoken English proficiency. In a preliminary study of the TEM4 Oral Test (TEM4-OT), Wen et al. (1999) find that, in terms of spoken English proficiency, English majors in China are, except for speech rate, generally below the benchmarks stipulated in the curriculum for English majors. Wen et al. (2001), in an investigation on a larger scale, reconfirm that Chinese EFL learners' spoken English is characterised by inaccurate expression, disfluency, a lack of innovative ideas and a poor mastery of the interaction strategies expected in daily communication.
It has to be admitted that the above studies as a whole might leave the impression that it is commonplace for Chinese college EFL learners' spoken English proficiency to be far from satisfactory. What they are poor at, as exposed above, is
These four research questions will be answered at different phases of this research. RQ1 touches upon the role of nonverbal delivery in EFL learners' spoken production. This question is raised in response to an intended argument for including nonverbal delivery in speaking assessment and can be resolved by verifying that nonverbal delivery differentiates well among learners across a range of proficiency levels. Soundly supported by such an argument, the remaining RQs encapsulate the follow-up rating scale formulation and validation. RQ2 deals with the development of a rating scale that incorporates nonverbal delivery, whereas RQ3 and RQ4, in an integrated manner, are devoted to addressing
the properties of the rating scale, viz. its validity,¹ reliability, practicality and discriminating power.
With the above general RQs substantially, discretely and satisfactorily addressed, the present study is also anticipated to yield much significance and value. Firstly, when the proposed rating scale is used in formative assessment, tertiary EFL learners' merits and demerits in oral English proficiency can be fully captured and measured, particularly with regard to their performance in nonverbal delivery. Teachers will thus be informed of the demerits their students share and of the particularly obtrusive demerits individual learners possess, so that adjustments to their instruction can be made. In a similar vein, learners can ameliorate their spoken English by anchoring their performance against the rating scale descriptors and the assessment results.
Secondly, with special attention to construct validity, this study will provide a complete validation procedure for the rating scale with both quantitative and qualitative approaches. In particular, as will be detailed later, the qualitative approach this study adopts is multimodal discourse analysis (MDA), a method underused in language assessment. It is hoped that the integration of language testing and MDA can, in a much broader sense, provide practical guidance for investigating the interface between these two domains. As far as validation methods are concerned, this study would thus inform the area of rating scale validation and shed light on research of a similar kind.
Lastly, the study will demonstrate a theoretically sound and practically feasible rating scale for EFL learners at the tertiary level in China. With appropriate alterations, it is expected to be further applied to assessing learners at other levels in the Chinese EFL context, such as secondary school students. What is even more promising is that the proposed rating scale can also be referred to in oral English assessment for specific or professional purposes, on condition that the relevant part of the assessment construct remains basically unchanged. In that sense, its utility will be considerably widened.
¹ This study conceptualises construct validity as a unitary and overarching notion to which all the components of validity contribute. See Sect. 2.4 for details and justifications.
Having highlighted the expected significance of this study, this section outlines the layout of this book, which is sequentially arranged into nine chapters.
This chapter serves as an introduction to the whole research project, overviewing the research background, the research aims and objectives, the general research questions as well as the anticipated value of the present study.
In Chap. 2, a crucial part for the literature review, five sections are earmarked in response to the three key issues involved in this research. The first section is concerned with the most essential notion of this study, viz. nonverbal delivery, outlining previous studies on this notion and how they might inform the present study. The second section, by elaborating on the conceptualisation of communicative competence along with the relevant models, identifies the most suitable rationale on which a rating scale should be based. The third section continues with a review of the taxonomies of rating scales in the context of language assessment and then describes the properties of the rating scale to be proposed. The second and third sections therefore address the key issue of how to develop a rating scale. The last two sections of the chapter review the concepts of validity and validation as well as the validation methods in language testing. In so doing, clarifications can be made as to what notion of test validity this study subscribes to and what validation methods best accommodate the present study. Thus, these two sections provide an answer to, and navigate the process of, validating the rating scale to be proposed.
Chapter 3 depicts a general picture of the research design and clarifies the
research methods utilised in this study. In addition, how the data were collected,
processed and analysed, and how three datasets were allocated to serve different
research purposes in each phase of the project are also detailed in this chapter.
Chapter 4, based on a comparatively smaller dataset of test-takers’ group dis-
cussion, reports on a preliminary study with a special view to empirically verifying
the necessity of a new dimension, nonverbal delivery, to be incorporated in spoken
English assessment. In a way, this chapter spearheads the whole project in that it
builds an argument to justify an indispensable role of nonverbal delivery in
assessing EFL learners’ communicative ability in a comprehensive manner.
Chapter 5 addresses two broad components of the proposed rating scale.
Informed by the results from a questionnaire administered to both teaching prac-
titioners and learners in the Chinese EFL context, the first half of this chapter sheds
light on the descriptors of those “conventional” dimensions on the rating scale, such
as pronunciation and intonation, vocabulary and grammar, and discourse man-
agement. The second half draws upon the research findings of the study reported in
Chapter 4, with which nonverbal delivery, as an “unconventional” dimension, is
brought forth in a gradable manner on the rating scale.
Chapter 6 links the development with the validation of the proposed rating scale.
In this chapter, the rating scale, as an interim product based on the findings reported
1.6 Summary
This chapter panoramically introduces what this book intends to convey. Against the background of the under-acknowledged role of nonverbal delivery, the low spoken English proficiency of Chinese tertiary EFL learners and the prevalence of standardised summative speaking assessments, this study sets out to build an argument for an essential role of nonverbal delivery in speaking assessment, on the basis of which a rating scale that includes nonverbal delivery in formative assessment is formulated and validated. This chapter then outlines the research aims and subsidiary objectives of this study. Having established all the above, this chapter proposes four general research questions, encompassing the role of nonverbal delivery in EFL learners' speaking assessment and the components, reliability, validity, practicality and discriminating power of the rating scale to be proposed. In the end, this chapter sketches out how this book is arranged on a chapter-by-chapter basis.
References
Cappella, J.N., and M.T. Palmer. 1989. The structure and organization of verbal and nonverbal
behaviour: Data for models of reception. Journal of Language and Social Psychology 8:
167–191.
Davitz, J.R. 1969. The repertoire of nonverbal behaviour: Categories, origins, usage, and coding.
Semiotica 69: 49–97.
Harrison, R. 1965. Nonverbal communication: Exploration into time, space, action and object.
Florence, KY: Wadsworth Publishing Co., Inc.
Leathers, D.G. 1979. The impact of multichannel message inconsistency on verbal and nonverbal
decoding behaviours. Communication Monographs 46: 88–100.
Leathers, D.G., and T.H. Emigh. 1980. Decoding facial expressions: A new test with decoding
norms. Quarterly Journal of Speech 66: 418–436.
Liu, Q., and M. Pan. 2010a. A tentative study on non-verbal communication ability in Chinese
college students’ oral English. Computer-assisted Foreign Language Education in China 2:
38–43.
Liu, Q., and M. Pan. 2010b. Constructing a multimodal spoken English corpus of Chinese Science
and Engineering major learners. Modern Educational Technology 4: 69–72.
Pan, M. 2011a. Reconceptualising and reexamining communicative competence: A multimodal
perspective. Unpublished PhD thesis. Shanghai: Shanghai International Studies University.
Pan, M. 2011b. Incorporating nonverbal delivery into spoken English assessment: A preliminary
study. English Language Assessment 6: 29–54.
Rothman, A.D., and S. Nowicki. 2004. A measure of the ability to identify emotion in children’s
tone of voice. Journal of Nonverbal Behavior 28: 67–92.
Sternglanz, R.W., and B.M. DePaulo. 2004. Reading nonverbal cues to emotions: The advantages
and liabilities of relationship closeness. Journal of Nonverbal Behavior 28: 245–266.
Swain, M. 1985. Communicative competence: Some roles of comprehensible input and
comprehensible output in its development. In Input in second language acquisition, ed.
S. Gass, and C. Madden, 235–256. New York: Newbury House.
Wen, Q., C. Wu, and L. So. 1999. Evaluating the oral proficiency of TEM4: The requirements
from the teaching curriculum. Foreign Language Teaching and Research 1: 29–34.
Wen, Q., X. Zhao, and W. Wang. 2001. A guide to TEM4 oral test. Shanghai: Shanghai Foreign
Language Education Press.
Chapter 2
Literature Review
This chapter reviews the literature pertaining to the present study. As the whole
research can be chronologically broken down into three main phases, covering
(1) building an argument for embedding nonverbal delivery into speaking assess-
ment, (2) the formulation and (3) the validation of the rating scale for group
discussion in formative assessment, this chapter is accordingly organised into five
sections, with the first section reviewing nonverbal delivery relating to the first
phase, and the other four sections consecutively addressing the related literature
concerning rating scale development and validation.
Specifically, the first section reviews previous research on nonverbal delivery. Rather than remaining within the arena of language assessment, the review commences with nonverbal delivery in other fields of research, against which the dearth of related studies in the context of language testing becomes apparent. The second section is more concerned with the conceptualisation of communicative competence, addressing the issue of what rationale the rating scale development in the present study should be based on. In particular, a link between nonverbal delivery and strategic competence will be drawn so that a theoretical argument can be tentatively advanced for embedding nonverbal delivery in speaking assessment. The third section, appertaining to the categorisations of rating scales in language assessment and the essentials of formative assessment, paves the way for determining the basic properties of the rating scale to be designed in this research. In response to the issue of rating scale validation, the fourth and fifth sections respectively dwell on the notions of validity and validation, and on the quantitative and qualitative approaches to be adopted for validating the rating scale proposed in this study.
In retrospect, meaning conveyance via nonverbal delivery can be traced back to classical rhetoric: Quintilian (AD 35–100), one of the first in recorded history to do so, drew scholarly attention to the use of gesture. He divides rhetorical delivery into vox (voice) and gestus (the use of gesture). In a quite similar vein,
© Springer Science+Business Media Singapore 2016
M. Pan, Nonverbal Delivery in Speaking Assessment,
DOI 10.1007/978-981-10-0170-3_2
2.1.1 Eye Contact
The central role of eye contact in nonverbal delivery has long been acknowledged. A host of researchers have devoted themselves to studying the language of the eyes and have arrived at a consensus that there may well be a language of the eyes with its own syntax and grammar (Webbink 1986). Janik et al. (1978) find that attention is focused on the eyes for 43.4 % of the communication duration. When eye contact is investigated in a social context, more interest lies in identifying how eye contact makes meanings in social interactions (Kendon 1967; Street 1993). For example, Bourne and Jewitt (2003) study the various purposes of eye contact in young learners' English learning process. Besides, there are also extensive studies on the roles of eye contact in the development of children's language and communication, indicating that eye contact is primal in establishing shared attention between infants and adults (Tomasello 2003).
Leathers and Eaves (2008) list a total of seven functions that eye contact possibly serves. The first function is attentiveness. Argyle and Cook (1976) emphasise that mutual eye contact "has the special meaning that two people are attending to each other, [which] is usually necessary for social interaction to begin or be sustained" (p. 170). The enlargement of the pupils can indicate that the listener's or speaker's attentiveness is accordingly heightened (Hess 1975). The second is the persuasive function, whereby the persuader wishing to be perceived as trustworthy must maintain eye contact both while speaking and while being spoken to by the persuadee (Burgoon and Saine 1978; Burgoon et al. 1986; Grootenboer 2006). The third function is intimacy, which is conducive to establishing interpersonal relations. In interpreting this function, Hornik (1987) and Kleinke (1986) assert that the intensity of eye contact, or the duration of gaze, plays a crucial role in developing intimacy between persons. The fourth is the regulatory function, which refers to alerting the decoder that the encoding process is occurring and continuing, by signalling to the encoder whether listening and decoding are occurring and by indicating when the listener is to speak (Ellsworth and Ludwig 1971; Kalma 1992). Fifth, eye contact can also serve an affective function. Eye contact, along with facial expression, can function as a powerful medium of emotional communication (Zebrowitz 1997), or as Schlenker (1980) concisely phrases it, "the eyes universally symbolise affect" (p. 258). Sixth, eye contact has its power function, which largely concerns the eyes' capacity to exert authority or to mesmerise (Henley 1977; Henley and Harmon 1985). Seventh, the impression management function, as its name suggests, refers to the speaker's efforts to form either positive or negative impressions on the addressees (see Iizuka 1992; Kleinke 1986).
However, it should be noted that the above taxonomy of communicative functions is framed in such a broad social context that it might not be directly applicable to studying the eye contact deployed by EFL learners. For instance, in a language assessment context, where candidates perform an oral task, occurrences of eye contact with the intimacy or power function are unlikely, as there is almost no call for them in this particular setting. In addition, a few of the communicative functions elaborated above might overlap, and a single occurrence of eye contact might serve more than one function, in which case judging what function(s) a captured occurrence serves can be complicated. The isolation of eye contact from its accompanying verbiage is another drawback of the above taxonomy: without the synchronised verbal utterance, it would be a practical challenge to fathom what exactly eye contact attempts to convey.
For observing and measuring an occurrence of eye contact, Poggi (2001) proposes a set of measures that analyse eye contact from the perspective of the bodily organs involved, roughly including the eyebrows (inner, medial and outer parts), the eyelids (upper or lower), wrinkles and the eyes themselves (humidity, reddening, pupil dilation, eye position and eye direction). Fine-grained as these measures are, observing various occurrences of eye contact within such a specified frame may be technologically demanding and may be jeopardised by its complexity and by judgement subjectivity. In practice, when eye contact is measured in this study, the descriptive analysis in the first phase, where an empirical argument is tentatively built for embedding nonverbal delivery into speaking assessment, will refrain from resorting to the detailed taxonomy of bodily organs. Instead, analyses will be largely based on candidates' eye contact as it is de facto presented, mainly from the angles of directionality and duration, because both measures can tentatively allow an observation of the frequency and intensity with which candidates visualise various referents (Cerrato 2005). When the occurrences of eye contact are described and analysed, the taxonomy by Leathers and Eaves (2008) outlined above will be referred to.
Nonetheless, when the rating scale is validated qualitatively, eye contact will, in view of greater explanatory power and applicability, be probed with an MDA approach, basically within an integrated framework drawn from the studies by Martinec (2000b, 2001, 2004) and Hood (2007, 2011). In such a context, not only will the frequency and duration of eye contact be probed as salient measures, but other vehicles carried via eye contact, such as eye contact shift, will also be focused on. The framework operationalised from Martinec's (2000b, 2001, 2004) and Hood's (2007, 2011) studies will be further expounded in detail, along with an elaboration on the MDA approach, in Sect. 2.5 of this chapter.
2.1.2 Gesture
Unlike eye contact, whose manifestations mainly concern such issues as the duration, directionality and intensity of pupil fixation, gesture can be instantiated via a plethora of different manifestations. Thus, the question of what constitutes a unit of gesture is contested, with compelling reasons offered for various perspectives. Within the field of nonverbal communication, gesture can be broadly defined as "any distinct bodily action that is regarded by participants as being directly involved in the process of deliberate utterance" (Kendon 1985, p. 215). Kendon (1996) further proposes that a gesture consists of "phases of bodily action that have those characteristics that permit them to be 'recognised' as components of willing communicative action" (p. 8). However, this begs the question of recognition by whom. In addition, there can be concerns about the subjectivity involved in unambiguously identifying what counts as willing communicative gesture. Kendon (2004) explains that a prototypical gesture passes through three phases, namely the preparation, the stroke and the retraction, with the stroke phase being the only obligatory element. McNeill (1992) describes the stroke phase as "the phase carried out with the quality of 'effort' a gesture in kinetic term" (p. 375). He continues to argue that "[s]emantically, it is the content-bearing part of the gesture" (p. 376). With the above in mind, when gesture is observed in this study, more focus will be placed on the meaning potential it makes, though the judgement will basically follow Kendon's (2004) proposed prototypical gesture, with the stroke phase as the core.
Beyond the formal instantiation of gestures, quite a few studies decipher what various gestures supposedly convey in particular settings, viz. their emblematic or iconic meanings. However, these studies rarely go beyond an inventory of the respective verbal glosses in various social contexts (e.g. Barakat 1973; Creider 1977; Efron 1941; Green 1968; Saitz and Cervenka 1972; Sparhawk 1978; Wylie 1977), though efforts have also been made regarding gestures' role in generating thinking (Alibali et al. 1997), in enhancing teaching and learning as part of complex ensembles (Kress et al. 2001) and in coordinating workplace discourse (Heath and Luff 2007). However, emblematic meaning alone does not constitute all the possible conveyances or functions of gestures.
Ekman and Friesen's (1969) taxonomy of gesture functions encapsulates emblems, illustrators, affect displays, regulators and adaptors. Emblems are gestures with a direct verbal translation, consisting of a word or two with a precise meaning known by most members of a given culture; emblematic gestures are thus mostly speech independent. For instance, the thumbs-up sign, made by a fist with the thumb pointing upward, is a classic example of an emblem.
Illustrators are used to augment what is being said and to reinforce or de-intensify the perceived strength of the emotions experienced by the communicator. Examples of illustrators are therefore signals for turn-taking in conversations (pointing at the next turn-holder with an upward palm) or batons (the slamming of a hand). Given that illustrators are interpreted in close association with the accompanying verbiage, they can be regarded as speech dependent.
14 2 Literature Review
In contrast to the fervour that concentrates solely on eye contact and gesture, few studies, if any, have been exclusively devoted to a third essential and conspicuous channel of nonverbal delivery: head movement. This channel is somewhat akin to eye contact in that the directionality of head movement, in most cases, naturally corresponds to that of eye contact. It differs from gesture, however, in that head movements, with their comparatively limited variety, are overwhelmingly instantiated as head nods or head shakes, though other vertical or horizontal movements of the head, such as a one-way leftward movement from a central position, can also constitute a basic occurrence of the head movement under discussion.
A limited number of studies reveal cultural influence on head movement (e.g. Maynard 1987, 1990; Weiner et al. 1972). For instance, a head shake is usually interpreted as negation or disagreement in Chinese culture, whereas in certain other cultures the same movement can be understood as agreement (Matsumoto 2006). Take head nodding as another example: Jungheim (2001) deems it a backchannelling signal “giving feedback to indicate the success or failure of communication” (p. 4), especially when interactants in Japanese culture intend to (1) show agreement with what is said, (2) pay respect to other speakers, or (3) indicate that they are attentively listening to the speaker (see Maynard 1987, 1989, 1990; White 1989).
Considering the dearth of any existing framework concerning the communicative functions of head movement that this study can comfortably rest upon, Ekman and Friesen’s (1969) aforementioned framework, in its general application, is tentatively referred to in building an argument for incorporating head movement as one of the dimensions of nonverbal delivery in speaking assessment. Since the main purpose of that research phase is merely to discriminate candidates across the predetermined proficiency levels, only the semantically loaded head nod (generally interpreted as agreement) and head shake (generally interpreted as disagreement) will be investigated in terms of formal head movement. When head movement as one subdimension of the rating scale descriptors for nonverbal delivery is validated, more fine-grained head movements beyond nods and shakes, such as vertical or horizontal movements of high frequency within an interval unit, are also taken into account, following an integrated framework drawn from Martinec’s (2000b, 2001, 2004) and Hood’s (2007, 2011) research to be unfolded below.
The above provides a review of nonverbal delivery, with a particular view to the three most representative channels and to the approaches this study will adopt in observing and analysing formal nonverbal delivery at different phases of the study. This review suggests that nonverbal delivery, given its proven significance and salience in communication, should be embedded into speaking assessment, where meaning making is realised not through verbal language alone. The ensuing section will then review the notion of communicative competence and specifically indicate the role that nonverbal delivery legitimately plays in assessing EFL learners’ communicative ability.
construed, Bachman (1990) and Bachman and Palmer (1996) eclectically put forward the model of communicative language ability (CLA), which has been credited as a widely recognised framework offering new insights into language ability. The most recent framework with regard to communicative competence is the conceptualisation of communicative language competence (Council of Europe 2001), a by-product of the Common European Framework of Reference (CEFR).
The above brief introduction to the notional evolution of communicative competence makes it necessary for this review to outline, critique and compare the above models so as to identify the fittest one for explaining what domains should be measured in a speaking rating scale and why nonverbal delivery plays a crucial role in the assessment of communicative competence.
Hymes (1972) asserts that one’s capacity is composed of language knowledge and the ability to use language, and that communicative competence consists of four parameters that include “communicative form and function in integral relation to each other” (Leung 2005b, p. 119). Concerning communication beyond Chomsky’s (1965) demarcation between competence and performance, he proposes a framework comprising the following four questions to explain what communicative competence should include.
The first question deals with what is possible considering the language form: what is possible refers to something acceptable within a formal system, be it grammatical, cultural or communicative (Hymes 1972). However, communicative competence is not fully interpreted when what is possible stands alone; the second question therefore touches upon feasibility, such as memory limitations and perceptual devices, or, rephrased, concerns what is biologically and psychologically feasible. To illustrate this parameter, Royce (2007) offers an example in which a sentence may itself be grammatically well formed, yet so lengthy that it fails to convey what is intended. The third question is more concerned with the appropriateness of language use in particular settings, reflecting the sociological and pragmatic aspects of language use. The last parameter bears upon a communicator’s knowledge of probabilities, that is, whether what is conveyed is in fact commonly done, which in turn determines whether successful communication can be fulfilled.
Reaffirmed by Hymes’ other works (1973, 1974, 1982), his proposition can be interpreted as follows: communicative competence includes not only grammatical knowledge but also the language user’s ability to judge whether what is said is practical, appropriate and probable. That means a language user with the expected communicative competence should be aware of the above parameters, and the most salient connotation of performance is “that of imperfect manifestation of underlying system” (Hymes 1972, p. 289).
As discussed above, the notion put forward by Hymes (1972) has exerted a great impact on language teaching, yet the four parameters are challenged mainly on the grounds of their operationalisation. Although plenty of studies taking communicative competence as a point of departure manage to apply the notion to language teaching, in such domains as syllabus design (Munby 1978) and language classroom teaching (Savignon 1983; Widdowson 1978), such application largely operates on a micro level. Against this backdrop, Canale and Swain (1980) contrive a model with more pertinent foci on the overall reflection of communicative competence, comprising grammatical competence, sociolinguistic competence and strategic competence. Later, Canale (1983) adds discourse competence to expand the model.
Fig. 2.1 Communicative Competence Model (Canale and Swain 1980; Canale 1983)
2.2 Communicative Competence 21
In the early 1990s, Lyle F. Bachman, an American applied linguist, drawing on a critique of the weaknesses of Lado’s (1961) and Carroll’s (1961, 1968) interpretations of language ability, develops the prevailing models posited by Halliday (1976), van Dijk (1977), Hymes (1972, 1973, 1982), Savignon (1983), Canale and Swain (1980) and Canale (1983) into a new conceptualisation.
Bachman (1990) gestates the construct of CLA on the basis of three core components, viz. language competence, strategic competence and psychophysiological mechanisms. Figure 2.2 illustrates the componential breakdown and the internal correlation of the model. As is shown, knowledge structures refer to language users’ social and cultural knowledge and the general knowledge about the material world,
Fig. 2.2 CLA components in communicative language use (Bachman 1990, p. 85)
whereas the context of situation includes the reciprocal sides of the communication,
situation, topic and purpose (Bachman 1990). In addition to the knowledge in both
regards, the three core parts constituting the CLA model are language competence,
strategic competence and psychophysiological mechanisms, all coordinating with
the knowledge structures and situation context to depict an overall picture of
communicative competence.
Language Competence
Fig. 2.3 Subcomponents of language competence in the CLA model (Bachman 1990, p. 87)
clauses in accordance with the rules stipulating cohesion and rhetorical organisation. Some cohesive devices are salient, such as lexical connection, reference, substitution and omission (Halliday and Hasan 1976); there are also devices with implied functions, regulating the occurring sequence of new and given information
in a text. Rhetorical organisation in the CLA model mainly touches upon methods such as narration, description and classification (McCrimmon 1984).
2. Pragmatic competence
Pragmatic competence is more concerned with how discourse, clauses and intentions realise their meanings and functions in a particular context, or, as Bachman (1990) pinpoints, this competence deals with “the relationships between (the) signs and their referents on the one hand, and the language users and the context of communication, on the other” (p. 89). Pragmatic competence can be split into two subcomponents: illocutionary competence and sociolinguistic competence.
Illocutionary competence encompasses “the knowledge of the pragmatic conventions for performing acceptable language functions” (Bachman 1990, p. 90). This concept bears much relevance to Speech Act Theory (Searle 1969), which includes such functions as assertion, warning and imagination. As is shown in
Fig. 2.3, illocutionary competence is further classified into four groups: ideational,
manipulative, heuristic and imaginative. The ideational function is used to “express meaning in terms of our experience of the real world” (Halliday 1973, p. 20), including the use of language either to express propositions or to exchange information about such knowledge. The manipulative function is mainly applied to affect the world around us. The abilities falling into this group include the instrumental function, used to get things done, such as making suggestions, requests, commands and warnings; the regulatory function, used to control others’ behaviour by regulating the persons or objects in the environment; and the interactional function, which serves to form, maintain or change interpersonal relationships. The heuristic function is applied to share with others our knowledge of the world, frequently occurring in such acts as teaching, learning, problem solving and conscious memorising. The imaginative function enables language users to create or extend humour or aesthetic values by constructing and communicating fantasies, creating metaphors, attending plays and so forth (Bachman 1990).
Sociolinguistic competence, as the other part of pragmatic competence, is defined as “the sensitivity to, or control of the conventions of language use that are determined by the features of the use context” (Bachman 1990, p. 94). The sensitivity referred to concerns the extent to which communicators are able to recognise dialects, language varieties, differences in register (Halliday et al. 1964), cultural references and figures of speech, as well as the degree to which speakers can appropriately and naturally generate the utterances expected in the target language in a specific language-use context (Pawley and Syder 1983).
Strategic Competence
2. Planning component
The planning strategy enables communicators to formulate a plan for realising a communicative purpose with certain language knowledge selected. If the speakers are interacting in their mother tongue, the knowledge needed derives from their first-language ability. If, however, the communication takes place in a bilingual or a second/foreign language setting, the language knowledge needed may instead be abilities either transferred from the first language or gradually fostered in the interlanguage. The main functions of the planning strategy are to select the relevant language knowledge, schemata and mind mapping.
3. Execution component
The strategy of execution is a critical stage before communication is realised under the co-functioning of psychophysiological mechanisms (see Section “Psychophysiological Mechanisms”). For instance, in the receptive channel of language input, visual and auditory faculties are applied.

[Fig. 2.4: from the goal (interpret or express speech with specific function, modality and content), via situational assessment and a planning process retrieving items from language competence (L1, Li, L2), to a plan whose execution, a neurological and physiological process involving psychophysiological mechanisms, leads to the utterance]

Bachman (1990)
holds that the three components of strategic competence, in effect, co-exist in the
whole process of communication, interacting with language ability and
language-use context. Having integrated the flow chart originally taken from Færch
and Kasper’s model (1983), Bachman (1990) visualises how the above components
and the other parts of the CLA model co-function, as illustrated in Fig. 2.4.
As can be seen, along the central line from goal to utterance, both language competence and psychophysiological mechanisms exert their respective influences on the planning process and the execution stage. The whole process also involves situational assessment, which impacts the planning process and the utterance because communicators need to make situation-specific judgements on what communication channels should be adopted to optimise meaning conveyance. Bachman (1991) further contends that language knowledge can only be realised with the involvement of strategic competence. Therefore, the strategies concerning assessment, planning and execution are intrinsically interdependent.
Psychophysiological Mechanisms
The above review of the CLA model shows how language knowledge interacts with the context of language use, integrating language knowledge and a series of cognitive strategies. Such a notional presentation is characterised by greater explanatory power, and the CLA model is epitomised as a leap forward compared with Canale and Swain’s communicative competence model. The CLA model embeds strategic competence and regards it as serving more than a compensatory function, which, to a certain extent, echoes the modified model by Canale (1983). More importantly, the CLA model recognises the roles of cognitive strategies and pragmatic competence, together with their impact on the realisation of communicative competence. On the whole, the CLA model is theoretically sound and empirically verified and has been credited as a state-of-the-art representation (Alderson and Banerjee 2002).
Despite its prevalence, the CLA model is not without caveats. McNamara (1990) believes that when performance tests are taken into account, the model seems less operationalisable because raters are very likely to assign unbalanced weightings to a particular component of language knowledge. Upshur and Turner (1999), on the same side, believe that a cure-all, construct-only approach to evaluating complex performance may obscure the influences that task context and discourse have on how raters interpret rating scales in the assessment of communicative competence, because such a disproportion may beget a biased focus on one component only. In a similar vein, Purpura (2004), addressing the subcomponent of grammatical competence, contends that since “meaning” plays a central role in the CLA model, the model per se would be more consolidated if it theoretically defined “meaning” and specified how grammatical resources can be employed to express denotative and connotative meanings on the one hand and a variety of pragmatic meanings on the other. Chapelle (1998), from an interactionist perspective towards construct definition, critiques that the CLA model is defined and operationalised more on a trait basis, and further states that, “[t]rait components can no
With almost the same name as, yet a discrepant academic background from, the CLA model, the communicative language competence (CLC) Model (Council of Europe 2001; North 2010a, b) is a by-product of the CEFR (Council of Europe 2001). It is based on the initial considerations of providing a common basis for language syllabi, curriculum guidelines, examinations, textbooks and so on, and of relating a European credit scheme to fixed points in a framework (van Ek 1975). The framework is inspired by documents such as Threshold, Vantage, Waystage, Breakthrough, Effective Operational Proficiency and Mastery (Alderson 2010). It is then developed with detailed descriptors for each level of expected behavioural descriptions of language ability in various domains (Little 2006). In terms of its theoretical groundings, therefore, it stems more from a political and educational demand than from an academic motive, though the above documents effectively guide the model formulation and the conceptualisation of communicative competence in its own right.
As stipulated by the Council of Europe (2001), the CLC Model consists of three domains: “linguistic competences, sociolinguistic competences and pragmatic competences” (p. 108), as outlined in Fig. 2.5.
As illustrated, linguistic competences are concerned with the “knowledge of and ability to use language resources to form well structured messages” (Council of Europe 2001, p. 109), and can be subcategorised into lexical competence, grammatical competence, semantic competence, phonological competence and orthoepic competence. Judging from the interpretation of these subcomponents, linguistic competences bear much relation to grammatical competence in the CLA model, reflecting a mastery of language knowledge in a traditional and narrow sense.
Sociolinguistic competences refer to the “possession of knowledge and skills for appropriate language use in a social context” (Council of Europe 2001, p. 118). They include linguistic markers of social relations, politeness conventions, expressions of folk wisdom, register differences as well as dialect and accent. Whereas sociolinguistic competence in the CLA model is labelled within pragmatic competence, this subcomponent is somewhat elevated to one of the core components in the CLC Model, in which the social realisation of language use is emphasised.
How pragmatic competences are defined is largely based on the description of what their subcomponents are made up of. Pragmatic competences embed discourse competences (abilities to organise, construct and arrange knowledge), functional competences (abilities to generate communication-inductive meaning) and design competences (abilities to sequence messages in accordance with schemes and
interactiveness) (Council of Europe 2001).

Fig. 2.5 Components of the CLC Model (Council of Europe 2001, pp. 108–129)

Given this, an understanding can be reached that pragmatic competences in the CLC Model seem to indicate a broader
sense of pragmatics, with only partial anchoring in pragmatic competence in the CLA model.
As one of the by-products of the CEFR, the CLC Model has provided a Europe-specific reference for language teaching, learning as well as assessment. The Council of Europe (2001) claims the CEFR to be comprehensive in that “it should attempt to specify as full a range of language knowledge, skills and use as possible…and all users should be able to describe their objectives, etc. by reference to it” (p. 7). In that sense, the CLC Model is an important point of reference, but neither an instrument of coercion nor one of accountability (Alderson 2010).
Nevertheless, how communicative competence is defined in the CLC Model leaves the model itself with several flaws. First, the construct of language ability in this model and the descriptors of the different levels are basically drawn from teachers’ and learners’ perceptions, with little empirical research or theoretical basis. In addition, the descriptors take “insufficient account of how variations in terms of contextual parameters may affect performances by raising or lowering the actual difficulty level of carrying out the target ‘can-do’ statement” (Weir 2005, p. 281). Although the CLC Model refers to such documents as Waystage, Threshold and Vantage, as previously mentioned, they are barely different from each other (Alderson 2010). While the CEFR claims to cover both proficiency and development in its six ascending levels of proficiency, it fails to do so consistently (e.g. Alderson et al. 2006; Hulstijn 2011; Norris 2005). A number of researchers (e.g. Cumming 2009; Fulcher 2004; Hulstijn 2007; Spolsky 2008) express concerns regarding the foundation of the CEFR system. Spolsky (2008), for instance, criticises the CEFR as an “arbitrary” standard intended to produce uniformity, whereas Cumming (2009) points out the dilemma of the imprecision of standards such as the CEFR “in view of the complexity of languages and human behaviour” (p. 92).
Second, a comparison between the CLC Model and the previously highlighted models reveals that the CLC Model excludes strategic competence, which, though partially included in pragmatic competences, is largely abandoned. As a result, the above-mentioned pragmatic competences in a broader sense no longer correspond to how the pragmatic aspect of language use is conventionally conceptualised. As reviewed above, strategic competence, playing a quintessential role in language use, should be a subcomponent attached to communicative language ability as a whole. Such abandonment would also render infeasible any test development or validation whose rationale resides in the CLC Model (see Alderson 2002; Morrow 2004).
Third, the naming of sociolinguistic competences itself might be problematic, because the literal sense of the term suggests that they are naturally subordinate to linguistic competences, another core component of the CLC Model, which would thus appear to override them.
Endeavouring to seek the fittest model for designing a rating scale with nonverbal delivery included as a dimension, the above review elaborates on communicative competence, covering the background of the notion (Hymes 1972) and its subsequent notional evolution (Bachman 1990; Bachman and Palmer 1996; Canale and Swain 1980; Canale 1983; Council of Europe 2001). Admittedly, other frameworks relating to communicative competence have also emerged in the process of notional development. Celce-Murcia et al. (1997), for instance, extend the communicative competence model by further dividing sociolinguistic competence into sociocultural competence and actional competence. With regard to renovations of the CLA model, Douglas (2000) proposes a model with a particular view to language use for specific purposes, in which professional or topical knowledge is equally emphasised. Likewise, Purpura (2004) develops an extended model based on the CLA model, proposing “a model of language knowledge with two interacting components: grammatical knowledge and pragmatic knowledge” (Purpura 2008, p. 60).
An in-depth analysis of the above modified models or frameworks, though excluded here, would suffice to show that they are characterised either by domain-specificity or by further breakdowns derived from the CLA model. It is therefore justifiable to regard the CLA model as an umbrella model that covers the notions and models just briefed.
A retrospective review of the communicative competence model, the CLA model and the CLC Model on a chronological continuum, as illustrated in Fig. 2.6, can provide a better understanding of communicative competence and of which model can be judged the fittest. Components linked by arrows, as an indication from a developmental point of view, are basically of the same conceptual referents. It can be observed that when the notion is ushered into the CLA model, as the arrows in the figure indicate, its components are the most comprehensive and inclusive, with integrated interactions and mechanisms between different components. Notably, the CLA model substantiates the component of strategic competence and incubates psychophysiological mechanisms, though related studies on the latter are unavailable.
When the notion evolves into the CLC Model, strategic competence disappears; design competences in the CLC Model, judging from the definition previously mentioned, have only a seemingly partial connection with psychophysiological mechanisms in the CLA model, as indicated by a dotted arrow in Fig. 2.6. It can therefore be argued that the absence of strategic competence in explaining what communicative competence is gives rise to the model’s caveats, and in turn that the CLA model should be selected as the fittest model for its inclusiveness and explanatory power. All these enhance the justification that the CLA model is the most appropriate theoretical rationale, based on
[Fig. 2.6: components of the communicative competence model, the CLA model and the CLC Model aligned on a chronological continuum, including discourse competence, sociolinguistic competence(s), illocutionary competence, strategic competence (assessment, planning, execution), psychophysiological mechanisms and design competences]
latter, issues arise as to what components can best represent and constitute the construct of the notion. As showcased in the above elaborations, communicative competence is multi-componential; thus, when communicative competence is assessed, EFL learners are supposed to be assessed in different domains. This also echoes the philosophy of the present study in that a rating scale, particularly in the context of formative assessment, should be designed as analytic instead of holistic. This issue will be re-addressed and further resolved in the next section of this chapter. Second, Connor and Mbaye (2002) pinpoint that a sound model of communicative competence offers a convenient framework for categorising components of written and spoken discourse, in which all the possible competences should be reflected in the scoring criteria. A substantial number of test designers have indeed adopted the CLA model as the basis of rating scale design (e.g. Clarkson and Jensen 1995; Grierson 1995; Hawkey 2001; Hawkey and Barker 2004; McKay 1995; Milanovic et al. 1996). To that end, the selection of the CLA model in the present study can be further justified.
Therefore, following the CLA model, the rating scale to be proposed will comprise two broad dimensions: language competence and strategic competence. The former is quite self-explanatory within the model with regard to what detailed assessment domains should be looked at; strategic competence, however, seems less observable because it is explained in the model in terms of three metacognitive strategies. In that context, enlightened by the definition of strategic competence, which mainly concerns how a speaker resorts to non-linguistic means to sustain communication, and informed by the review of nonverbal delivery in the previous section, the present study attempts to incorporate nonverbal delivery into the rating scale as one observable dimension corresponding to strategic competence. Although it has to be admitted that nonverbal delivery alone cannot depict a full picture of strategic competence, it can to a large extent provide a detectable and representative profile of candidates’ performance in speaking assessment.
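The correspondence just described can be sketched, purely illustratively, as a small data structure. The dimension names and band ranges below are hypothetical inventions for exposition, not the scale proposed later in this study:

```python
# Purely illustrative sketch of an analytic rating scale in which nonverbal
# delivery serves as the observable dimension corresponding to strategic
# competence in the CLA model. All dimension names and band ranges below
# are hypothetical, not those of the actual scale.

ANALYTIC_SCALE = {
    # observable dimensions drawn from language competence
    "grammatical_accuracy": range(0, 6),    # bands 0-5
    "discourse_organisation": range(0, 6),
    "pragmatic_appropriacy": range(0, 6),
    # observable proxy for strategic competence
    "nonverbal_delivery": range(0, 6),
}

def validate_profile(scores):
    """Check a candidate's subscores against the scale; each dimension is
    reported separately, so every subscore carries diagnostic information."""
    for dim, band in scores.items():
        if dim not in ANALYTIC_SCALE:
            raise KeyError(f"unknown dimension: {dim}")
        if band not in ANALYTIC_SCALE[dim]:
            raise ValueError(f"{dim}: band {band} outside the scale")
    return scores

# e.g. a candidate who is strong verbally but only moderately expressive
# in nonverbal delivery
candidate = {"grammatical_accuracy": 5, "discourse_organisation": 4,
             "pragmatic_appropriacy": 4, "nonverbal_delivery": 3}
print(validate_profile(candidate))
```

The point of the sketch is only that nonverbal delivery enters the scale as one band-scored dimension alongside the language-competence dimensions, so a candidate's nonverbal performance remains separately visible in the score profile.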
With the above, incorporating nonverbal delivery into speaking assessment appears well grounded because nonverbal delivery is intrinsically rooted in strategic competence in the CLA model. Yet such a perception largely remains at the theoretical level. If an argument for embedding nonverbal delivery into speaking assessment can be built via an empirical study verifying that competence in this aspect can indeed distinguish candidates across a range of proficiency levels, the argument can be further consolidated. It can also pave the way for the formulation and validation of a rating scale with such a consideration. As aforementioned, this argument will be made in the first phase of this study.
This section will touch upon the literature concerning rating scales and the context of the rating scale to be proposed in this study, viz. formative assessment. It will address four questions: (1) what is a rating scale in language assessment? (2) what are the different categorisations of rating scales? (3) what is formative assessment and how can it benefit EFL learners? and (4) what type of rating scale best accommodates the context of formative assessment? The end of this section will integrate the review to summarise the wide-ranging properties of the rating scale to be proposed.
The second categorisation is that of holistic and analytic scales. It was first brought forth by Shohamy (1981) and has long served as the most salient and best-documented categorisation (e.g. Bachman 1988; Bachman and Savignon 1986; Douglas and Smith 1997; Fulcher 1997; Ingram and Wylie 1993; Underhill 1987; Weir 1990). As this taxonomy is commonly referred to (see Barkaoui 2007; Cooper 1977; Fulcher 2003; Goulden 1992, 1994; Hamp-Lyons 1991; Weigle 2002), more elaborations will accordingly be unfolded in this section of the review. A holistic rating scale is also referred to as an impressionistic or global scale. It is first defined in the context of writing assessment, when Cooper (1977) posits that a holistic rating scale refers to
any procedure which stops short of enumerating linguistic, rhetorical, or informational
features of a piece of writing … [s]ome holistic procedures may specify a number of
particular features and even require that each feature be scored separately, but the reader is
never required to stop and count or tally incidents of the feature (p. 4).
more than one domain of assessment, such as accuracy, vocabulary and fluency. However, all the descriptors of a particular band are grouped together, unlike in multi-trait scoring, where the different domains of assessment are separately described in detail.
Since only one score is supposed to be given, this scoring method usually triggers controversy on account of its incomplete account of the targeted construct (Fulcher 2003). It also seems less powerful in explaining the intriguing nature of speaking. Another problem with holistic scoring is that, in speaking assessment, raters might overlook one or two aspects, in which case candidates might be rated on their strengths instead of being penalised for weaknesses (Bacha 2001; Charney 1984; Cumming 1990; Hamp-Lyons 1990; Knoch 2009). Holistic scoring is nonetheless favoured primarily by large-scale language assessments, where the time allocated for rating is of topmost concern, yet it is spurned in classroom assessment because it provides limited feedback for students and teachers about what might be revealed from the assessment per se.
Primary-trait scoring was developed to assess certain expected language functions
or rhetorical features elicited by an assessment task (Lloyd-Jones 1977). It was first
adopted by the National Assessment of Educational Progress (NAEP) for the
purpose of obtaining more information from one single score. As Applebee (2000)
explains, regarding writing assessment, “primary trait assessment in its initial for-
mulations focused on the specific approach that a writer might take to be successful
on a specific writing task; every task required its own unique scoring guide” (p. 4).
Thus, in primary-trait scoring, raters predetermine a main trait for successful task
fulfilment, so that the scoring criteria are usually reduced to one chief dimension
and are therefore context-dependent (Fulcher 2003).
Although only one score needs to be assigned in primary-trait scoring, that single
score largely depends on the degree to which the candidate addresses the specific
requirements of a given oral assessment task (Barkaoui 2007).
This kind of rating scale is advantageous by virtue of its focus on one targeted
observable aspect of language performance, and it is a relatively quick way to score
speaking performance, especially when rating emphasises one specific aspect of
that performance. For example, if candidates are requested to perform a presenta-
tion as an assessment task, a rater would rather concentrate on candidates’ artic-
ulation than lexical density. In that case, the primary-trait articulation is assessed
with a focused weighting. However, precisely because this way of scoring con-
centrates on only one primary trait, it is debatable whether the aspect singled out
for assessment is primary enough to base a single score on (Knoch 2009).
Hamp-Lyons (1991) puts forward multiple-trait scoring, or multi-trait scoring,
for rating scales designed to offer feedback to learners and other stakeholders
about performance on contextually appropriate and task-specific criteria. As the
name of this scoring method suggests, it involves evaluating various traits to reach
an overall score. Although this approach is similar to primary-trait scoring in that
both methods are holistic in nature, it allows raters to observe more than one dimension.
Given that, it can also be regarded as an extended version of holistic scoring method
38 2 Literature Review
as the band descriptors of each assessment domain are much more detailed and
concrete.
Since large-scale language assessments usually take rating duration into serious
consideration, the rating scales adopted by IELTS Speaking (see Appendix I) and
TOEFL iBT Independent Speaking Tasks (see Appendix II) are typical of this
category. In the former case, a rater is supposed to judge examinees’ performance in
four aspects (fluency and coherence, lexical resource, grammatical range and
accuracy, and pronunciation) and assign an overall score according to nine bands
(Band 1 to Band 9). What is slightly different in the case of TOEFL is that a number
of general descriptions concerning task fulfilment, coherence or intelligibility are
also attached to the rating scale in addition to the descriptors of the three individual
traits (delivery, language use and topic development). Yet a rater is still expected to
assign an overall score to the speech sample within a range of five bands (Band 0 to
Band 4).
By contrast, Cooper (1977) defines the analytic approach as requiring the rater “to
count or tally incidents of the features” (p. 4). Analytic rating scales comprise
separate categories representing different aspects or dimensions of performance. For
example, dimensions for oral performance might include fluency, vocabulary and
accuracy. Each dimension is scored separately, and then dimension scores are
totalled. Analytic rating scales can be extremely similar to multi-trait scoring in the
sense that both require raters to assign more than one score to a speech sample.
However, they differ in that multi-trait scoring is more task-specific, usually
focusing on the specific features of performance necessary for successful task
fulfilment, whereas analytic scoring is more generalisable to a plethora of assess-
ment tasks, with generic dimensions of language production included.
For example, the rating scale for the Test of English for Educational Purposes
(TEEP) takes this form (see Appendix III). A rater is supposed to tick one number
for each of the six assessment domains (appropriateness, adequacy of vocabulary
for purpose, grammatical accuracy, intelligibility, fluency, and relevance and
adequacy of content) and then sum up the subscores. One special example is the
rating scale of the BEC Oral Test, which combines holistic and analytic rating (see
Appendix IV for Level 1). One interlocutor, responsible for communicating with
candidates, marks holistically, while another assessor takes charge of analytic
marking; the two scores are subsequently averaged into a final score.
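The arithmetic behind such summed and combined marking can be sketched in a few lines. The domain names, band maxima and equal weighting below are illustrative assumptions only, not the actual TEEP or BEC criteria.

```python
# Sketch of analytic summation (TEEP-style) and combined holistic/analytic
# marking (BEC-style). Domain names, maxima and weighting are hypothetical.

def analytic_score(subscores):
    """Sum the subscores ticked for each assessment domain."""
    return sum(subscores.values())

def combined_score(holistic, analytic, max_holistic=9, max_analytic=36):
    """Average a holistic mark and an analytic total after normalising
    both onto a common 0-1 range."""
    return ((holistic / max_holistic) + (analytic / max_analytic)) / 2

subscores = {"appropriateness": 5, "vocabulary": 6, "accuracy": 4,
             "intelligibility": 5, "fluency": 6, "relevance": 5}
total = analytic_score(subscores)             # 31 out of a hypothetical 36
final = combined_score(holistic=7, analytic=total)
print(round(final, 3))                        # 0.819 on a 0-1 scale
```

The equal weighting of the two raters mirrors the averaging described above; an operational scheme might well weight the interlocutor's and the assessor's marks differently.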
However, analytic scoring is criticised insofar as the various separate domains do
not necessarily add up to the whole. In other words, individual subscores for
different dimensions might not supply reliable information about what is assessed
globally. On the other hand, since scoring is multifaceted, raters might assign
correspondingly lower subscores to all the assessment domains if one particular
domain is not performed as satisfactorily as expected. The resulting tendency to
assign the same low grades across all the domains is known as the “halo effect”
(Thorndike 1920) or “cross-contamination” (Alderson 1981).
On the positive side, when rating is conducted analytically, raters are less likely
to confuse dimensions, since they are supposed to assign a subscore to each
assessment dimension. Weir (1990) also comments that
2.3 Rating Scale and Formative Assessment 39
analytic rating scales facilitate rater training and scoring calibration, especially for
inexperienced raters. In addition, the advantages of adopting analytic over holistic
rating scales include access to fine-grained information about examinees’ lan-
guage ability (Bachman et al. 1995; Brown and Bailey 1984; Kondo-Brown 2002;
Pollitt and Hutchinson 1987), because rating analytically across a variety of
dimensions may reveal more about what students excel in.
Weigle (2002), in the context of writing assessment, also contends that analytic
rating scales are generally accepted to result in higher reliability and construct
validity, especially for second language writers, although they can be
time-consuming. This accords with Sawaki’s (2007) view that in second
language assessments, analytic rating scales are often used to assess candidates’
language ability within a single modality, viz. speaking in the case of this study. When
it comes to the construction of a rating scale for formative assessment, whether
analytic or holistic scale is preferred will be further discussed in the follow-up
section.
Regarding the process of rating scale design, North (1996) describes the develop-
ment of rating scales as condensing the complexity of performance into thin
descriptors. The ways in which rating scales and rating criteria are constructed and
interpreted by raters also act as de facto test constructs (McNamara 2000).
Therefore, another categorisation of rating scales takes the perspective of how they
are developed: intuition-based, theory-based, empirically driven and performance
decision trees (PDTs) (Brindley 1991; Fulcher 2003, 2010; Fulcher et al. 2011;
North 2003). The first type tends to be an a priori measuring instrument, or the
“armchair method of scale development” (Fulcher 2010, p. 209). The a priori method usually
refers to constructing the descriptors of the rating scales by an expert, often using
his/her own intuitive judgment concerning the nature of language proficiency, along
with consultation with other experts. It is believed to be the most
prevalent method of generating a rating scale (Knoch 2009). The a priori method can
be subclassified into more specific development methodologies (North 1994), but
they mostly have in common “the lack of any empirical underpinnings, except as
post hoc validity studies” (Jarvis 1986, p. 21).
The second type is based on an existing theory or framework. Lantolf and
Frawley (1985) expound that the validity of a rating scale can be limited if no
linguistic theory or research on the definition of proficiency is taken into
account. As mentioned above, the advantage of basing a rating scale on a model of
communicative competence is that “these models are generic and therefore not
context-dependent” (Knoch 2009, p. 48), resulting in higher generalisability.
The third type, designed in a post hoc fashion, is driven by the data
elicited from a sample of testees, from which rating scale developers extract the
features that distinguish candidates across various proficiency levels. For example,
Fulcher (1987, 1993, 1996a) developed a rating scale of fluency in spoken English
assessment based on the distinctive discourse features discernable in candidates’
oral production. Another data-based method of rating scale development is a
corpus-based/corpus-driven approach. Hawkey (2001) and Hawkey and Barker
(2004) designed a universal rating scale that covers Cambridge ESOL writing
examinations at different proficiency levels.
The latest development in rating scales is the fourth type, which starts
with an analysis of the discourse features expected in real-life interaction and then
frames the assessment domains of a particular context as decision trees.
Afterwards, a decision on whether the obligatory elements are present in each tree
determines what should be assessed, as reflected in a rating scale (Fulcher
2010). Fulcher et al. (2011) employ a scoring model for service encounters with
PDTs and recommend this method for performance tests within a specific
communicative context.
Since it is not quite necessary for the rating scales used in low-stakes speaking
assessments to be constructed from data, most of them are developed intuitively.
However, when this approach is applied to the formulation of the rating scale for
large-scale and high-stakes tests, problems of validation and reliability might arise.
For instance, Skehan (1984) and Fulcher (1987, 1993) criticise the English
Language Testing Service for its intuitively developed rating scale.
Likewise, Brindley (1986, 1991) and Pienemann and Johnston (1987) find that the
rating scale used in Australian Second Language Proficiency Ratings (ASLPR)
lacks validity due to its intuitive development. Bachman (1988), Bachman and
Savignon (1986), Fulcher (1996b), Lantolf and Frawley (1985, 1988), Matthews
(1990) and Spolsky (1993) challenge the validity of the ACTFL scales, either
through empirical studies or by reasoning that the scales confuse linguistic with
non-linguistic criteria. It can therefore be generalised that even when a rating scale
is developed intuitively or on theoretical underpinnings, it is better validated with,
or informed by, data-driven methods.
Specific to the present study, on the one hand, the development of the rating
scale is based on a priori consideration of the CLA model, together with the
possible discriminating features informed by data-driven evidence when an
argument for embedding nonverbal delivery into speaking assessment is built. On
the other hand, post hoc quantitative and qualitative validation studies will con-
tribute to the finalisation of the rating scale. However, as formative assessment in
the context of this study does not fall into professional English testing, there is
little need to apply the PDTs method. Therefore, the rating scale, with nonverbal
delivery included as a dimension, integrates theory-based design with empirical
validation.
Alderson and Banerjee (2002) divide rating scales in terms of task specificity. One
type is generic scales, constructed in advance for almost all sorts of assessment
tasks; the other is task-specific scales, used to evaluate test-takers’ performance on
particular target tasks. Rating scales and tasks are thus directly linked because the
scales describe the speaking skills that tasks might elicit (Luoma 2004). However,
as different assessment tasks feature discrepant task characteristics, it is
questionable whether such a generic rating scale can be designed. Since the present
study proposes a rating scale applicable to formative assessment, it makes no claim
to being a generic one, because the assessment task in the present study, viz. the
group discussion to be elaborated below, is prespecified.
The last categorisation focuses more on the physical layout of rating scales. The
simplest type in this categorisation is the graphic and numerical rating scale, in
which a continuum runs between two points representing the ends of a scale, yet
with no descriptors of the behaviours expected from candidates (North 2003).
Therefore, the subjectivity among various raters becomes the main drawback of
such design. The second type is a labelled rating scale, viz. a scale with cues
attached to various points along the scale. Nonetheless, it can still be regarded as
less assessor-friendly as the cues provided might be vague, such as a range from
poor to excellent (Knoch 2009). The third type is a vertical rating scale with each
point elaborately defined, so that sufficient space is allowed for longer descriptions.
For instance, Shohamy et al.’s (1992) ESL writing scale falls into this type.
However, since there is no significant difference in the reliability of the different
designs (Myford 2002), this study will first aim at a rating scale with sufficiently
defined behavioural expectations for rater-friendliness, yet subject to revision after
expert judgment in the rating scale formulation phase.
The above definition clarifies that one purpose of conducting formative
assessment is to diminish, or even remove, the possible negative backwash of
high-stakes tests on language learning (Wang et al. 2006). Against this background,
increasing attention is paid to the great potential of formative assessment; con-
ventional summative testing of language learning outcomes also gradually incor-
porates formative modes of assessing language learning as an ongoing process
(Davison 2004). However, formative assessment vis-à-vis summative assessment is
still underexplored (Black and Wiliam 1998; Davies and LeMahieu 2003; Leung
2005a; Leung and Mohan 2004).
2.3.4.1 Definition
The wording of this broad definition, such as purpose and source, mainly touches
upon the functions of formative assessment. However, because of its broadness,
many aspects of formative assessment fail to be specified, such as the referents of
educators and the nature of the information source as further guidance in
In fact, the above analyses of the definitions of formative assessment already reveal
its functions and purposes, which can also be credited as benefits in four
aspects.
First, as far as the nature of formative assessment is concerned, it provides a
wealth of feedback from assessors to learners. Such feedback is therefore char-
acterised by learner-specificity and full description on an individual basis (Sadler
1989). Herman and Choi (2008) examine teachers’ and learners’ perceptions to
see whether both sides share a similar understanding of formative assessment.
The results indicate that the perceptions and attitudes on both sides are
consistent, and the study also emphasises the significance of improving learners
with the information available from formative assessment. Rea-Dickens (2006,
p. 168), however, advises that in formative assessment the feedback to learners
should be “descriptive” rather than “evaluative” so that it is not negatively perceived.
Second, given that the classroom is usually the primary venue for formative
assessment, learners’ anxiety can be much reduced by a familiar
environment. Davidson and Lynch (2002), Lynch (2001, 2003) and McNamara
(2001) generally agree in endorsing formative assessment over conventional
testing methods as a shift of the locus of control from a centralised authority into the
hands of classroom teachers and learners’ peers. If the assessment environment is
familiar to candidates, they will presumably be better positioned to give full play to
their potential.
Third, as formative assessment may include tasks or activities, such as ongoing
self-assessment, peer-assessment, projects and portfolios (Cohen 1994), most
assessment methods can be task-based. As Ross (2005) points out, one of the key
appeals of formative assessment is the autonomy given to learners.
Formative assessment is thus thought to influence learner development through a
widened sphere of feedback during their engagement with various learning tasks.
Last, regarding validity in alignment with traditional standardised assessment,
research has emerged on validating formative assessment as a testing method.
Huerta-Macias (1995) prioritises the direct face validity of alternatives to
conventional achievement tests as sufficient justification for their use. This view
also accords with the notion of learner and teacher empowerment (Shohamy
2001). It follows that if a rating scale for formative assessment is also rigorously
validated, it can be applied as a valid measure.
With the benefits of formative assessment outlined above, it is necessary to
develop, in the formative assessment context, a rating scale with a dimension of
nonverbal delivery. In so doing, teachers can assess learners from various
dimensions, and learners may also have access to feedback on various aspects for
self-enhancement.
The above outlines the various benefits that formative assessment might
offer. This part of the review turns to group discussion as a formative assessment
task for assessing EFL learners’ spoken English, in order to justify why group
discussion is chosen as the main assessment task in the present study.
Prior to unfolding the usefulness of group discussion in formative assessment,
previous studies on group discussion as an assessment task are first reviewed. In
the first large-scale study on group discussion concerning the accuracy of
test-takers’ production, Liski and Puntanen (1983) find that test-takers’ performance
in group discussions can serve as a good predictor of their overall academic success. In
addition, Fulcher (1996a) also reports that test-takers consider group discussion as a
valid form of second language testing and that examinees feel less anxious and
more confident when speaking to other discussants instead of examiners or inter-
locutors (Folland and Robertson 1976; Fulcher 1996a). Fulcher (1996a) also finds
that group discussion is an easily organised task compared with picture talk, where
an interlocutor or an interview based on speaking prompts will be involved. In
addition, group discussion, similar to paired discussion (Brooks 2009), may elicit
richer language functions than oral proficiency interviews (OPI) so that commu-
nicative ability can be more comprehensively assessed (Nakatsuhara 2009; van
Moere 2007).
Pre-eminently, in the context of formative assessment, group discussion can be
assessed not only by instructors but also by learners and their peers on condition
that the rating scales and criteria are made transparent and accessible to all the
parties concerned (Fulcher 2010; Shepard 2000). However, previous studies also
indicate that without substantial experience of applying the scoring criteria to work
samples, self-assessments may fluctuate substantially (Ross 1998; Patri 2002). By
contrast, peer-assessments are likely to be much more reliable though they can be
more lenient than instructor-assessments (Matsuno 2009). Based on these
considerations, the present study, instead of including self-assessment as a rating
method, resorts to teacher- and peer-rating when the proposed rating scale is
validated.
Although the reliability of group discussion as an assessment task in standard-
ised large-scale testing is challenged as raters might not be able to assign reliable
scores when candidates are tested in groups (Folland and Robertson 1976; Hilsdon
1995), such unreliability is hardly recorded empirically. In response to that, Nevo
and Shohamy (1984) compare 16 language assessment experts’ intuitions and
perceptions of group discussion as an assessment task with those of other forms,
such as role-play and OPI, only to find that group discussion ranks top in terms of
task utility standards but stands in the middle on fairness, which probably explains
testing experts’ suspicion of the reliability of group discussion. Despite this, scant
evidence has been collected to show that group discussion is unreliable.
In terms of task usefulness and task characteristics, group discussion also has a
few distinctive features. It is first of all highly interactive and authentic (Kormos
1999; Lazaraton 1996b; van Lier 1989), with all the discussants involved in a
meaning-making and negotiating process. It is also characterised by a high degree
of feasibility and economy in the sense that formative assessment of this kind can
just take place in classrooms and can be time-saving because several students are
grouped together to be assessed, thus greatly reducing the time that traditional
testing methods would call for (Ockey 2001). Another point inherent in formative
assessment that also credits group discussion is that all candidates discuss with
familiar faces without interlocutors, which tends to lower their anxiety and avoid
errors arising from the intervention of interlocutors (Ross and Berwick 1992;
Johnson and Tyler 1998; Young and He 1998a; Brown 2003). What is more, even
though candidates’ weaknesses are disclosed in various aspects, they do not feel as
ashamed as they otherwise would in the face of generally stern examiners or
interlocutors.
To briefly summarise, the above review provides positive evidence that this
particular assessment task can be judged as ideal from the perspectives of face
validity, reliability, authenticity, interactiveness, impact and practicality, which,
incidentally but purposefully, accords with Bachman and Palmer’s (1996)
framework of test usefulness.
Integrating the above review on rating scales and formative assessment,
conclusions can be reached regarding the properties of the rating scale that this
study intends to propose. It will be an assessor-oriented analytic rating scale specifically
for group discussion in formative assessment. The band and level descriptors aim to
be defined and descriptive instead of merely evaluative. The design of the rating
scale will be firstly theory-grounded on the construct of the CLA model and pre-
liminary discriminating features identifiable in candidates’ nonverbal delivery and
then undergo empirical corroboration with data-driven involvement.
As the last phase of the present study sets out to validate a proposed rating scale
with nonverbal delivery included as an assessment dimension, it is of importance to
review the conceptualisation of validity and the evolution of validation methods.
What should be pointed out is that validity is an integral and most basic concept in
language assessment because “accepted practices of test validation are critical to
decisions about what constitutes a good language test for a particular situation”
(Chapelle 1999, p. 254). How validity is defined in reality determines how a test is
to be validated.
Historically, test validity is an ever-changing concept and has undergone
metamorphoses chronologically (Angoff 1988; Cronbach 1988, 1989; Goodwin
1997, 2002; Goodwin and Leech 2003; Kane 1994, 2001; Messick 1988, 1989a, b;
Langenfeld and Crocker 1994; McNamara and Roever 2006; Moss 1992; Shepard
1993; Yang and Weir 1998). Researchers with different perceptions towards
validity (e.g. Angoff 1988; Kane 2001; Goodwin and Leech 2003) have various
demarcations of its development. Nonetheless, what is certain is that the
introduction of construct in conceptualising validity is widely regarded as a mile-
stone. Therefore, all demarcations can fall into three phases in terms of how the role
of construct validity evolves: (1) the preconstruct-validity phase, a period before
construct validity was put forward by Cronbach and Meehl (1955); (2) the initial
phase of construct validity, a period covering the range from the 1970s to the 1980s,
when construct validity was made co-existent with other types of validity in lan-
guage testing; and (3) the core phase of construct validity, a period when the
concept starts to play a quintessential role in test validation.
In recent decades, with the popularity of argument-based validation methods,
other perspectives on conceptualising validity have also emerged, among which the
Assessment Use Argument (AUA) is widely utilised as “[an] overarching
logical structure that provides a basis both for test design and development and for
score interpretation and use” (Bachman 2005, p. 24). However, as far as the essence
of AUA is concerned, it still falls into the third phase as this notion calls for
evidence collection in support of construct validity.
Therefore, concerning the concept of validity, this part will embark upon a
review of the componential notion of validity, followed by the unitary concept of
validity, with construct validity at its core. Afterwards, details of the newly
established AUA (Bachman 2005; Bachman and Palmer 2010) will also be briefly
reviewed; however, the critique on AUA in this section will lead to an argument
that caution should be taken in employing AUA as the framework of validating the
proposed rating scale. This section of review will wind up with the justification of
employing both quantitative and qualitative approaches for the validation of the
rating scale to be proposed based on the unitary notion of test validity; in particular,
the incorporation of nonverbal delivery calls for a qualitative approach in validating
the rating scale.
The early phase of test validity stresses test purposes. Guilford (1946) points out
that every test is purpose-specific and that one test can be valid for a particular
purpose but invalid for another. Whether they are test providers or test users, all
stakeholders should be responsible for verifying that a test is valid for the
particular purpose it serves. In that sense, how validity is defined is closely
associated with test purposes, as reflected in Garrett’s (1947)
definition that “the validity of a test is the extent to which it measures what it
2.4 Validity and Validation 49
purports to measure” (p. 394). Similarly, Cureton (1950) also views test purpose
as the basic issue of test validity and phrases its importance as “how
well a test does the job it was employed to do” (p. 621). Given this
viewpoint, test purposes can be twofold: either diagnosing existing issues or
predicting future performance. Accordingly, the American Psychological
Association (APA), the American Educational Research Association (AERA) and
the National Council on Measurement in Education (NCME), in the early versions
of their Standards for Educational and Psychological Testing (Standards), divide
criterion-related validity into concurrent validity and predictive validity (see APA
1954; APA et al. 1966).
In fact, criterion-related validity is deeply rooted in a realist philosophy of
science, which holds that every individual can produce a value on the specific
assessment characteristics and the assessment purpose is to estimate or predict that
value as accurately as possible. In the context of standardised testing, the “true
score”, or the estimates most approximating the “true score”, reflects the extent to
which the test has precisely estimated that value (Thorndike 1997). In that sense,
the precision of estimation is the degree of test validity.
The above definition reveals that criterion-related validity is concerned with the
test per se and that it is a static property attached to test validity (Goodwin and
Leech 2003). Criterion-related validity therefore equates to “the correlation of
scores on a test with some other objective measure of that which the test is used to
measure” (Angoff 1988, p. 20). A test can be judged as valid or invalid according to
the measuring results (Cureton 1950; Gulliksen 1950) and “[i]n a very general
sense, a test is valid for anything with which it correlates” (Guilford 1946, p. 429).
The key to validating criterion-related validity then lies in how to lay down the
criterion measure in order to obtain standardised test scores, without which such
validation studies cannot be carried out. Cureton (1950) puts forward the following
method.
A more direct method of investigation, which is always to be preferred wherever feasible, is
to give the test to a representative sample of the group with whom it is to be used, observe
and score performances of the actual task by the members of this sample, and see how well
the test performances agree with the task performances. (p. 623)
As revealed above, the first step is sampling the target candidates and observing
their performances in the real assessment tasks to assign the corresponding scores.
The scores ultimately derived should become the standard scores with reference to
the criterion. When other tests are in the process of validation, the newly observed
scores will be correlated with the standard scores to see the extent to which the test
consistently measures the candidates’ ability. Therefore, “the test is valid in the
sense of correlating with other [valid and reliable language] tests” (Oller 1979,
pp. 417–418). Ebel (1961), however, argues that some language tests
can be regarded as valid merely through subjective judgment and that language
assessment experts’ judgment can be employed to measure test validity.
Once the validity criterion is determined, it is possible to design standard testing for
the validation of other tests.
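The correlational logic described above can be sketched in a few lines of Python. The score lists below are fabricated purely for illustration; a high coefficient would be read as evidence that the new test measures consistently with the criterion.

```python
# Minimal sketch of criterion-related validation: correlate scores on a test
# under validation with "standard" scores derived from observed task
# performance. All score values here are fabricated for illustration.
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

standard = [62, 70, 75, 81, 90]   # criterion scores from observed task performance
new_test = [60, 68, 77, 80, 88]   # scores on the test being validated

r = pearson_r(new_test, standard)
print(round(r, 3))  # a high r would be taken as evidence of criterion-related validity
```

The same computation would serve Cronbach's (1971) equivalent-tests method for content validation, where the two lists are scores on two tests with the same content coverage.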
Content validity usually refers to the extent to which the test items or tasks
sufficiently represent the domain or universe of the content to be covered in
a test. It was explained as “[whether] the behaviours demonstrated in testing con-
stitute a representative sample of behaviours to be exhibited in a desired perfor-
mance domain” (APA et al. 1974, p. 28).
Angoff (1988), in summarising what content validity in language assessment
represents from the aspects of content relevance, content coverage and content
significance, posits that a test has content validity when all the test items are
representative not only of the domain but also of the number and significance of the
domain. Messick (1988), from the interface between content and construct, asserts
that “[w]hat is judged to be relevant and representative of the domain is not the
surface content of test items or tasks, but the knowledge, skill, or other pertinent
attributes measured by the items or tasks” (p. 38).
The main validation method regarding content validity is based on logical
judgement, such as expert evaluation and a review of the test content by assessment
experts (Angoff 1988). Since much subjectivity is involved, this validation method
is usually controversial (Guion 1977; Kane 2001). Given this, there was once a call
for empirical validation of expert evaluation (Bachman et al. 1995). However, in
direct performance tests, there are indeed advantages for expert evaluation
(Cronbach 1971), which is still being utilised in many assessment settings (Kane
2001). Cronbach (1971) also puts forward equivalent tests for content validation, in
which two sets of scores obtained from two different tests with the same content
coverage are correlated. A low correlation coefficient can indicate that at least one
of them does not have high content validity, yet it is challenging to determine which
particular test it is. On the other hand, if the correlation coefficient is high, it can be
generally thought that both tests have content validity.
Unlike criterion-related validation, whose problem lies in the availability of a
real standardised test, content validation is challenged in that the representativeness
of test content can hardly be guaranteed. On the one hand, the domain or universe of
a test cannot be easily operationalised because what is assessed can be either
language knowledge and language skills, or complicated performances or pro-
cesses. On the other hand, the number of test items, coverage of test materials and
method of sampling all impact the representativeness of a test content as well as its
facility and discriminating power (Angoff 1988). Latent variables mentioned above
2.4 Validity and Validation 51
Construct validity is first conceptualised by Paul Meehl and Robert Challman upon
their draft of the Standards (1954), and further nourished by Cronbach and Meehl
(1955). The introduction of construct validity, together with criterion-related
validity and content validity, signifies the beginning of a “trinity view” of test
validity (see APA et al. 1966). Therefore, construct validity has been regarded as a
hallmark in the evolution of test validity. However, when this notion is first con-
ceptualised, it is treated as a mere supplement to criterion-related validity
(Cronbach and Meehl 1955). This is because when the criterion measure is not
available, researchers would turn to an indirect validation method, which highlights
the trait or quality underlying the test instead of test behaviour or scores on the
criteria. This “trait or quality underlying the test” is precisely what a construct is.
The Standards (APA et al. 1974) define a psychological construct as
[a]n idea developed or “constructed” as a work of informed, scientific imagination; that is, it
is a theoretical idea developed to explain and to organise some aspects of existing
knowledge. Terms such as “anxiety”, “clerical aptitude”, or “reading readiness” refer to
such constructs, but the construct is much more than the label; it is a dimension understood
or inferred from its network of interrelationships. (p. 29)
Face validity, as its name suggests, usually refers to the degree to which the surface
features of a test, such as the language and instructions used, the layout and the
printing quality of the test paper, are acceptable to candidates and the
public (Hughes 2003). Whether test validity in this regard should also be treated as
a component of validity has long been debatable. Because face validity is only
confined to the acceptability of the test paper at the surface level without any
involvement of psychological measurement, it cannot truly reflect the validity of a
test in the strictest sense, nor can it be a yardstick of measuring the degree of
validity for a test. Mosier (1947), when criticising the ambiguity of face validity,
thinks that “any serious consideration of face validity should be abandoned” (cited
in Angoff 1988, p. 23). Angoff (1988) also mentions that “superficial judgments
of the validity of a test made solely on the basis of its appearance can easily be very
wrong” (p. 24).
Although face validity is challenged as unqualified to be a component of validity,
quite a number of researchers have noted its importance. Anastasi
(1982) believes that “the language and contexts of test items can be expressed in
ways that would look valid and be acceptable to the test-taker and the public
generally” (p. 136). Likewise, Nevo (1985) also acknowledges the usefulness of
face validity and thinks that face validity should also be reported in test validation.
In brief, test validity at the first evolution phase was perceived as a
componential entity, with criterion-related validity, content validity and construct
validity as its tenets. However, in the case of this study, where a rating scale
embedding nonverbal delivery into speaking assessment is to be validated, it seems
quite impractical to accumulate evidence from all three aspects of validity. The
following part sheds light on validity as a unitary notion in which construct validity
plays the core role, and thereby justifies that, in validating the proposed rating
scale, construct validity will be the main object of scrutiny.
The two latest versions of the Standards (AERA et al. 1985, 1999) define construct
validity from a unitary perspective; the 1999 Standards further add test use and
consequence, reflecting an extension of test validity. The Standards (AERA et al.
1985) regard validity as
the appropriateness, meaningfulness and usefulness of the specific inferences made from
test scores. Test validation is the process of accumulating evidence to support such infer-
ences. A variety of inferences may be made from scores produced by a given test, and there
are many ways of accumulating evidence to support any particular inference. Validity,
however, is a unitary concept. Although evidence may be accumulated in many ways,
validity always refers to the degree to which that evidence supports the inferences that are
made from test scores. (p. 9)
The above definitions of validity can be comprehended as the extent to which all
sorts of evidence can support the score interpretation and use. Hence, validity is a
unitary concept with construct validity as the core. Compared with the componential
notion, this view has three distinguishing features.
First, unlike the preference for classifying validity into several components
(see Angoff 1988; Langenfeld and Crocker 1994; Messick 1988, 1989b, 1995;
Shepard 1993), this view no longer treats validity as divisible; rather, it is a unifying
force (Goodwin and Leech 2003; Messick 1988). The previous criterion-related
validity and content validity are also embedded into the evidence collection con-
cerning “content relevance”, “content coverage”, “predictive utility” and “diag-
nostic utility” (Messick 1980, p. 1015). The validation process includes collecting
evidence from various sources, interpreting and using the evidence for verification.
In light of construct validity, Cronbach (1988) puts forward two kinds of validation
programmes as follows.
The weak programme is sheer exploratory empiricism; any correlation of the test score with
another variable is welcomed. …The strong programme, spelled out in 1955 (Cronbach &
Meehl) and restated in 1982, by Meehl and Golden, calls for making one’s theoretical ideas
as explicit as possible, then devising deliberate challenges. (pp. 12–13)
It can be seen that the weak programme focuses on the correlation between test
scores and other variables, while the strong one tends to seek theory-based ideas.
The former holds that evidence should be gathered from a variety of sources so that
its advantage consists in its diversity and complementariness. However, just as
2.4 Validity and Validation 55
Kane (2001) points out, the weakness of this programme is its opportunistic
strategy; in other words, it seeks “readily available data rather than more relevant
but less accessible evidence” (p. 326). The strong programme follows an approach
of validation-through-falsification, viz. “an explanation gains credibility chiefly
from falsification attempts that fail” (Cronbach 1988, p. 13). Yet it also has its
weakness in that this approach is limited in its utility given an absence of a
well-grounded theory to test (Kane 2001).
The unitary notion of validity lays more emphasis on the complementariness,
rather than the alternativeness, of evidence. This view has been widely accepted and
reinforced since the 1980s. Bachman (1990) notes that “it is important to recognise that none
of these by itself is sufficient to demonstrate the validity of a particular interpre-
tation or use of test scores” (p. 237). In a similar vein, Weir (2005) also emphasises
that
[v]alidity is multifaceted and different types of evidence are needed to support any claims
for the validity of scores on a test. These are not alternatives but complementary aspects of
an evidential basis for test interpretation…No single validity can be considered superior to
another. Deficit in any one raises questions as to the well-foundedness of any interpretation
of test scores. (p. 13)
Second, the unitary concept of validity has shifted its focus from the test per se
to the interpretation of test scores, or more precisely, to the extent to which the
score interpretation can be supported by evidence. In 1986, Educational Testing
Service (ETS) sponsored a symposium themed Test validity for the 1990s and
beyond, and most of the keynote speeches are compiled in the proceedings by
Wainer and Braun (1988). On the first page of the prelude, there is a footnote to the
effect that a test itself cannot be claimed to be valid; rather, the inferences made
from the test scores should be used as the sources of validation.
In fact, Cronbach (1971) shares the above view when stating that “one validates
not a test, but an interpretation of data arising from a specified procedure” (p. 447)
and “one does not validate a test, but only a principle for making inferences”
(Cronbach and Meehl 1955, p. 297). Based on this, McNamara and Roever (2006)
go even further, holding that there is no such thing as a truly valid test, but only
interpretations that are defensible to a certain degree. Therefore, the unitary
concept of validity shows that test validity is manifested in score interpretation
rather than in the test per se.
Third, after the unitary concept of test validity is put forward, test use and its
consequence also invite great concern. Although they are not new in validity
studies, the Standards (1985) include neither of them in the definition of validity.
With the maturing of the unitary concept, there has been an increasing awareness of
and concern over the intended and unintended purposes, potential and actual
consequences (Cronbach 1988; Linn 1994; Messick 1989b, 1994; Shepard 1993).
Fitting into that trend, the new version of the Standards (1999) officially includes
test use and consequence in the definition of validity.
However, there are also researchers (e.g. Dwyer 2000; Popham 1997) who prefer
to confine validity to the boundary of score interpretation and traditional
Although the second evolution stage of test validity deems the notion a unitary
one, it is still etched with many dimensions. Messick is among the first proponents
of a unitary concept of test validity, and his works (1975, 1980, 1988, 1989a, b,
1992, 1994, 1995, 1996) exert far-reaching influence. Messick (1995) defines
validity as
validity as
nothing less than an evaluative summary of both the evidence for and the actual – as well as
the potential – consequences of score interpretation and use. This comprehensive view of
validity integrates considerations of content, criteria and consequences into a comprehen-
sive framework for empirically testing rational hypotheses about score meaning and utility.
(p. 742)
As can be interpreted from the above definition, the unitary concept is reflected
in an evaluative summary and a comprehensive view. In other words, this concept
encompasses the test content, test criterion and test consequence with hypotheses
and empirical verification. Therefore, this concept is characterised by its multidimensionality,
where score interpretation, test use, evidential basis and consequential
basis interact with one another for a comprehensive evaluation, as illustrated
in Table 2.2 (Messick 1988, p. 42).
Regarding test validation, Messick (1989b, 1995, 1996) also suggests that
evidence from six distinguishable aspects should be collected in order to verify the
overall validity. These six aspects are “content, substantive, structural, generalis-
ability, external and consequential aspects of construct validity” (p. 248).
When evidence is collected for further verification, one way is to structure all the
evidence in the form of arguments (Cronbach 1980, 1988; House 1980) because
validity arguments provide a comprehensive evaluation of the intended interpre-
tation and uses of test scores (Cronbach 1988). Following Cronbach and Messick,
Kane (1990, 1992, 2001, 2002, 2004, 2006, 2010), Kane et al. (1999) develop an
interpretive framework of arguments to provide guidance for justifying interpre-
tations of test scores and use. Later, Mislevy (2003), Mislevy et al. (2002, 2003)
propose an evidence-centred design (ECD), at the heart of which is what is referred
to as an evidentiary argument. In recent developments, the argument-based approach
remains prevalent in test validation (e.g. Bachman and Palmer 2010; Chapelle et al.
2008, 2010; Xi 2010). Therefore, the following part will review and critique the
most representative argument-based framework, AUA, and justify why
the present study will still employ the unitary notion of test validity instead of
resorting to this newly established framework.
Bachman (2005) first puts forward AUA, and Bachman and Palmer (2010) later
revise and enrich the framework with a number of tests in real-life settings. As
aforementioned, this framework is attracting an increasing number of test validation
studies, so a review of its essence becomes necessary in the present study. What,
then, is the essence of AUA? In fact, any argument-based framework lays its
foundation on a base argument whose structure makes explicit the reasoning logic
employed to justify the plausibility of the conclusion or claim. AUA is no
exception. Therefore, the structure of the base argument is of crucial importance; a
minor modification may divert the general direction of reasoning, thus resulting in
utterly different outcomes. Since AUA bases its argument structure on the Toulmin
model, it is necessary to obtain a full understanding of the Toulmin argument
structure and its reasoning logic before a critique of the framework can be made.
Toulmin does not explicitly put forward the notion of “the Toulmin model” himself,
but rather regards it as “one of the unforeseen by-products of the uses of argument”
(Toulmin 2003, p. viii). Toulmin’s aim in writing the book is strictly philosophical:
to criticise the syllogism, or demonstrative deductions in general. His major
viewpoint is that the form of the syllogism is simplistic and ambiguous, with no
practical use in daily arguments. To do justice to the situation, Toulmin builds up a
pattern of
argument analysis. This pattern can be illustrated with the typical example of the
Toulmin model (see Fig. 2.7): by appealing to the datum (D)—“Harry was born in
Bermuda”, one can make a claim (C) about Harry’s nationality—“So, presumably,
Harry is a British subject”. The step from the datum to the claim is guaranteed by the
implicit warrant—“A man born in Bermuda will generally be a British subject”,
which is an inference drawn on the British Nationality Acts, and whose authority
relies on its backing which makes an account of the British statutes and other legal
provisions. Considering the potential exceptional conditions, such as “Both Harry’s
parents may be aliens” and “Harry might have changed his nationality since birth”, a
qualifier—“presumably” is included to indicate a tentative modality in the claim.
This is clearly a judgmental reasoning process. According to Toulmin (2003),
the rationality of a logical or practical argument is guaranteed by “‘Data such as D
entitle one to draw conclusions, or make claims, such as C’, or alternatively ‘Given
data D, one may take it that C’” (p. 91). In other words, the data “on which the
claim is based” (p. 90) should reveal sufficient bearing of warrants, which are
“general, hypothetical statements, which can act as bridges, and authorise the sort of
step to which our particular argument commits us” (p. 91). Meanwhile, the warrants
should be further supported by the backing, which Toulmin defines as “straightforward
matters-of-fact” (p. 96) to provide other assurances for the reasoning.
Fig. 2.7 The Toulmin argument pattern: D (Harry was born in Bermuda), so, Q
(presumably), C (Harry is a British subject); since W, unless R; on account of B
(the following statutes and other legal provisions: ……)
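The Toulmin pattern described above can be sketched as a small data structure. This is only an illustrative rendering of the pattern, not anything proposed by Toulmin or Bachman; the class and field names simply follow the terminology of the figure (data, claim, warrant, backing, qualifier, rebuttal).

```python
from dataclasses import dataclass, field

@dataclass
class ToulminArgument:
    """Toulmin's pattern: Data, so (Qualifier) Claim; since Warrant,
    on account of Backing; unless one of the Rebuttals applies."""
    data: str
    claim: str
    warrant: str
    backing: str
    qualifier: str = "presumably"          # what keeps the claim tentative
    rebuttals: list = field(default_factory=list)

    def render(self) -> str:
        # Lay the argument out in Toulmin's own connective order.
        line = f"{self.data}; so, {self.qualifier}, {self.claim}"
        line += f" (since {self.warrant}, on account of {self.backing}"
        if self.rebuttals:
            line += "; unless " + " or ".join(self.rebuttals)
        return line + ")"

harry = ToulminArgument(
    data="Harry was born in Bermuda",
    claim="Harry is a British subject",
    warrant="a man born in Bermuda will generally be a British subject",
    backing="the British Nationality Acts",
    rebuttals=["both Harry's parents were aliens",
               "Harry has changed his nationality since birth"],
)
print(harry.render())
```

Note that the qualifier is a mandatory field with a default, whereas the rebuttals are merely listed, never verified; this mirrors the point made below about what AUA's modifications remove and retain.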
When applying the Toulmin model to build up the framework, Bachman (2005)
makes a few changes to the basic structure of the model: (1) the Q element has been
removed; (2) the rebuttal remains at its original position, but a new component,
rebuttal data, has been added to justify the rebuttal—to “support, weaken or reject
the alternative explanation” (Bachman 2005, p. 10); and (3) Bachman and Palmer
(2010) change rebuttal data into rebuttal backing.
As can be seen in Fig. 2.8, all changes are targeted at the elements that Toulmin
employs to attack the syllogism: the qualifier is gone, while the rebuttal is reinforced.
This is somewhat contrary to Toulmin’s intention. The qualifier is what makes a
Toulmin argument, without which the Toulmin claim is reduced back to a syllogistic
one: either yes or no, all or none; and without which there is no need to
consider the rebuttals in the first place. On the other hand, the rebuttals, though
negative to the claim, are generally exceptional and rare; they need to be considered
for the claim to be plausible, yet have to be set aside if any claim is to be made at
all. However, in the modified versions, the qualifier is nowhere to be found,
whereas the rebuttal is retained.
Given the above, AUA is not entirely consistent with Toulmin’s argument model,
especially in terms of its base argument. What, then, is its reasoning logic? An
analysis of the roles of the rebuttal and the rebuttal backing will help to reach some
insights. As mentioned in the earlier discussion of substantial and analytical
arguments, the backing consists of straightforward matters-of-fact or truths; when
factual backing is used to guarantee a claim, or a hypothetical statement for that
matter, no reasoning is involved and no argument is necessary.
However, Bachman and Palmer (2010) change Rebuttal Data (the Rebuttal
within a frame) in Fig. 2.8 into Rebuttal Backing. Therefore, as long as the rebuttal
is to be verified within the reasoning process from the data to the claim, the
reasoning process is undermined. As long as the rebuttal cannot be ignored, the
claim is hardly convincing, or even predictable. In that case, the whole reasoning
process falls into an infinite regress.
The example in Fig. 2.9 illustrates how the rebuttal is supported by the
rebuttal backing and the claim is thus rejected (Bachman and Palmer 2010). Based
on the data Jim is going to the hospital, the claim, Jim is sick, is to be made (no
claim yet); although the warrant, People often go to the hospital when they are sick,
should provide enough guarantee to make the claim, we must check whether the
rebuttal Jim could be visiting someone who is in the hospital, is true or not; it is true
that Jim is visiting his partner in the hospital, so Jim is not sick.
Fig. 2.9 Structure of example practical argument (Bachman and Palmer 2010,
p. 97): Data (Jim is going to the hospital), so Claim (Jim is sick); unless Rebuttal
(Jim could be visiting someone who is in the hospital), supported by Rebuttal
Backing (Jim is visiting his partner in the hospital), which yields the Counterclaim
(Jim is not sick)
However, the above reasoning reduces to: Jim is going to the hospital, so Jim is not
sick. This does not seem to be the result of Toulmin reasoning. If assembled in
the form of a Toulmin argument, the reasoning should be as follows.
A: Jim is going to the hospital (SINCE people often go to the hospital when they are sick,
UNLESS they are going to the hospital for some other reasons,) SO PRESUMABLY Jim is
sick.
B: Jim is visiting his partner in the hospital (SINCE we can take it that people are not sick
themselves when they are visiting someone in the hospital, UNLESS they are indeed sick
themselves,) SO PROBABLY Jim is not sick.
This is how arguments are supposed to be settled. As can be seen, each side has
its own claim; each claim is justified with a separate reasoning process; each
process is guaranteed with its own warrant. Most importantly, both sides take the
rebuttal into consideration, but neither tries to verify the rebuttal within the same
reasoning process; instead, a proper qualifier is included.
If there is a need to reason by the logic of AUA, the rebuttal has to be verified as
well. As can be seen in Fig. 2.9, even if Jim is visiting his partner, he may still be
sick himself. If this rebuttal needs to be verified, chances are that the validation will
fall into an endless paradoxical cycle. In other words, before any claim is made, the
rebuttals must be verified first. As a consequence, another verification process is
embedded in the current one, so that, in terms of model construction, the model
always contains “a self” within itself.
In brief, although argument-based validation can be viewed as a step
forward in comparison with the unitary concept of test validity, caution must be
exercised in applying it to validate the proposed rating scale in question. In particular, it
needs further exploration as to how to embed all sorts of validity arguments into a
coherent and significantly sufficient argument with construct validity as the core.
Therefore, in terms of validation for the rating scale with nonverbal delivery
embedded, the present study will still refer to a unitary notion of validity.
The previous section reviews the evolution of validity in language testing and
justifies the application of a unitary concept as the theoretical base of validation for
the rating scale to be proposed in the present study. Then, when it comes to the
validation of rating scales, it is still felt necessary to review how rating scales can be
validated.
With regard to the facets of rating scale validity, Knoch (2009) tailors Bachman
and Palmer’s (1996) framework of test usefulness and excludes the facet of
interactiveness, as it is not an integral tenet that necessarily applies to
rating scale validation. In addition, Knoch’s (2009) revised framework
emphasises the role of the construct validity of a rating scale and puts forward three
criteria for validity evaluation as follows.
The scale provides the intended assessment outcome appropriate to purpose and context
and the raters perceive the scale as representing the construct adequately…The trait scales
successfully discriminate between test takers and the raters report that the scale is func-
tioning adequately…The rating scale descriptors reflect current applied linguistics theory as
well as research. (p. 65)
To briefly interpret the above criteria, in validating a rating scale, three aspects
should be taken into account: (1) the extent to which a rating scale reflects the
construct; (2) the extent to which a rating scale discriminates candidates across
various proficiency levels; and (3) the extent to which a rating scale manifests a
selected theory. Therefore, at the phase of rating scale validation, these three criteria
serve as the guidelines in constructing the phase-specific research questions.
In terms of rating scale validation methods, both quantitative and qualitative
methods are well documented. A majority of previous studies employ quantitative
methods to validate a rating scale. Because a rating scale with explicitly defined
categories facilitates consistent rating, a few studies use multifaceted Rasch
measurement to examine whether differences between score categories are clear
(Bonk and Ockey 2003; McNamara 1996) or to examine other factors impacting
scoring results (Lumley and O’Sullivan 2005; O’Loughlin 2002). Besides, multidimensional scaling has also
been applied to the scale development for different tests and rater groups
(Chalhoub-Deville 1995; Kim 2009). More robust statistical methods, such as an
MTMM approach and differential item functioning analysis, have been used for the
validation of classroom assessment (Llosa 2007), of speaking tests (Kim 2001) or of a
rating scale (Yamashiro 2002). It might be found that, with the ever-growing
involvement of statistical tools in the language assessment community, an
increasing number of sophisticated statistical methods have been applied to, and
have enriched, the study of rating and rating scales.
2.5 Rating Scale Evaluation and Validation 63
On the other hand, qualitative methods are also increasingly employed in test
validation studies (Lazaraton 2008), including speaking assessment validation (e.g.
Lazaraton 1992, 2002, 2008). Commonly adopted methods can be rater verbal
protocols and analysis of test discourse (e.g. Brown et al. 2005; Cumming et al.
2006). By aligning the rater verbal protocol with the descriptors stipulated in the
rating scale, researchers are able to validate a rating scale supposedly reflective of
the underlying construct a particular test intends to elicit. More elaborations will be
made on qualitative approaches to test validation in the last section of this chapter.
In order to obtain more sources for the validation of the rating scale, both
quantitative and qualitative methodologies will be employed in the present study.
On the quantitative side, as the rating scale to be proposed touches upon formative
assessment with a consideration of embedding nonverbal delivery as an assessment
dimension, different traits from candidates’ performances as reflected in their group
discussions can be measured via different methods, such as teacher-rating and
peer-rating; therefore, an MTMM approach will be adopted, which is rather suitable
and powerful in addressing the extent to which different measures or methods that
assess one given construct are substantially correlated among themselves. As for the
qualitative side, since the main argument for validating the proposed rating scale is
to validate the dimension of nonverbal delivery, an MDA approach will be used.
Further justifications will be made after light is shed on the related qualitative
approaches to assessment validation.
MTMM is first introduced by Campbell and Fiske (1959), who direct the attention of
construct validity research to the extent to which data exhibit evidence in
three areas, or meet three requirements. The first concern is convergent validity
(CV), referring to the extent to which different assessment methods concur in their
measurement of the same trait; these correlations are expected to be moderately high
if construct validity is to be supported. The second concern is discriminant validity (DV),
indicating the extent to which independent assessment methods diverge in their
assessment of different traits. Contrary to the requirement for CV, the values for DV
should demonstrate minimal convergence. The last consideration is method effects
(MEs), deemed an extension of DV. MEs represent bias that could derive
from using the same method in the assessment of different traits; correlations among
traits measured by the same method would typically be higher than those among
traits measured by different methods.
The original MTMM design (Campbell and Fiske 1959) receives criticism
because more external, multiple and quantifiable criteria are expected to be
incorporated into the model (e.g. Marsh 1988, 1989; Schmitt and Stults 1986).
Widaman (1985) also notes that the original MTMM design fails to explicitly state
the requirement of uncorrelated methods. In response to these criticisms, Widaman
(1985) proposes an approach of nested-model comparisons, where a baseline model
is first specified to be compared with other
Against the above, it can be argued that, without probing into the de facto
assessment processes, especially if candidates’ performance is not investigated
analytically with a qualitative approach, a full picture of whether what is tested
conforms to what is intended to be tested will never be obtained. Therefore, for
triangulation it is essential to apply a qualitative approach to validating the rating
scale to be proposed.
As far as rating scale validation is concerned, there are mainly two prevailing
qualitative methods: verbal protocol analysis (VPA) and the discourse-based approach,
particularly conversation analysis (CA). They are either adopted singly for rating or
test validation, or orchestrated with other quantitative methods to triangulate
research findings. The ensuing part is devoted to a review of both methods, followed
by the details of MDA, so that further justifications, in addition to what was
previously argued in the section on nonverbal delivery, can be made for adopting
information that is (or has been) attended to as a particular task is (or has been)
carried out” (pp. 1–2).
A survey of the terrain where VPA is empirically adopted reveals its overwhelming
popularity among researchers working on writing assessment rating. For
instance, Cumming (1990) uses VPA to compare experienced and novice raters in
their judgments on the criterion range of analytic assessment; Cumming et al.
(2001, 2002) also examine the criteria extracted from the VPA data to come up with
the general categories for essay evaluation. Similarly, this method is also employed
by other studies in either describing rating process or comparing raters with various
extraneous variables or characteristics (e.g. Connor and Carrel 1993; Erdosy 2004;
Lumley 2002, 2005; Milanovic et al. 1996; Smith 2000; Vaughan 1991; Weigle
1994, 1999; Wolfe 1997; Wolfe et al. 1998).
However, applying VPA to speaking assessment rating seems to be underex-
plored. One of the few studies in spoken language assessment using that method is
conducted by Brown et al. (2005). In their study, they use VPA to investigate rater
orientation in the context of academic English assessment. The study finds that
expert EAP teachers generally assess test-takers’ vocabulary skills and frequently
comment on the adequacy of their vocabulary for a particular purpose. Ducasse and
Brown (2009), also using VPA, find that teacher-raters can identify three interaction
parameters in assessing paired oral communication, which yields implications for a
fuller understanding of the construct of effective interaction.
It has to be admitted that using VPA enables researchers to validate a rating scale
in terms of the extent to which raters score the candidates’ products in line with
what is stipulated in the descriptors of a rating scale. In other words, it can mainly
enhance the degree of scoring validity. However, when it comes to the construct
validation of a rating scale, this method seems less powerful, because the data
elicited from rater verbal protocols are unlikely to cover the whole thinking
process; thus, VPA may yield only an incomplete record of the rater’s mind
(Barkaoui 2011). When evaluating this method, Green (1998), Lumley
and Brown (2005) also point out a few drawbacks of VPA. Besides its conspicuous
disadvantage of time consumption, this method might also result in individual
differences in the sense that respondents might either produce long or short reports
of their mental processing. If not enough due attention is paid to the wording of
verbal report elicitation, respondents’ verbal reports might also be disrupted as they
could be somewhat coerced to “keep talking” (Ericsson and Simon 1993).
Coupled with the above drawbacks, most of the studies outlined above justify
their choice of VPA merely on the grounds that previous studies on rating in
writing assessment also rely heavily on this method. Considering the
applicability of VPA to the present study, whose focus differs significantly from
writing assessment, and given the practicality issue that VPA in the context of
oral assessment might consume even more time than in that of writing, this
method has to be discarded.
2.5 Rating Scale Evaluation and Validation

¹ For detailed descriptions of turn, refer to Sacks (1992), Sacks et al. (1974) and Oreström (1983).
that both rating scales share the weakness that the continuous development of
language acquisition, which is supposed to be reflected in rating scales, is
lacking. Another large-scale application of CA is conducted by Lazaraton
(1991, 1992, 1995, 1996a, b) on a series of Cambridge EFL examinations, covering
both the interview conversation structure of spoken language assessment and
interlocutor/candidate behaviours. In these studies, she not only aligns
candidates' responses with possible communicative functions to see whether the
tests really elicit the intended construct, but also profiles the role of the
interlocutor in certain assessment settings.
Therefore, CA clearly serves as a necessary and reasonable complement to the
validation of language tests. Psathas (1995) evaluates CA as "an approach and a
method for studying social interaction, utilisable for a wide, unspecified phenom-
ena… it is a method that can be taught and learned, that can be demonstrated and
that has achieved reproducible results" (p. 67). However, largely due to its
constraint of being applicable only to small-scale data, one of the criticisms
against CA is that the analytic methodology itself and the descriptive categories
it adopts might be too vaguely defined to be usable and replicable in studies of a
similar nature (Brown and Yule 1983; Cortazzi 1993; Eggins and Slade 1997;
Wolfson 1989). On the other hand, since CA is a method that involves much
training and practice, most researchers have to spend more time familiarising
themselves with the transcription conventions than transcribing the data (Hopper
et al. 1986). On top of that, Schiffrin (1994) and Levinson (1983) also note that
CA seems less capable of bridging the gap between language form and language
function.
Having critiqued the above, this part calls for an awareness that although
CA is conducive to tracking speakers' utterances on a turn-by-turn basis, it is
not equally powerful or explanatory in synchronising what happens
non-linguistically with what is uttered verbally. The section reviewing rating
scales has already reiterated that a majority of prevailing rating scales do not
assess candidates' nonverbal delivery. If all meaning-making resources need to
be probed into, CA seems a dispreferred option. This is because, although it
might be argued that nonverbal delivery could still be transcribed using a
"second-line" (Lazaraton 2002, p. 71), this method can neither align verbal
delivery with nonverbal channels on a large scale, nor can it analyse
interactions among different nonverbal channels, such as eye contact, gesture
and head movement, as previously reviewed. Therefore, applying CA to the present
study seems beyond its strength. In order to find a method that is able to
scrutinise more meaning-generation resources, this study turns to an emerging
discourse-based approach: MDA.
Before unfolding what MDA can offer, two key concepts need to be clarified in
foregrounding the notion. Since different approaches to MDA will be shed light
on later, the definitions of these key concepts could also be slightly different.
Defining MDA in a stratum-by-stratum manner, Stöckl (2004) views the multimodal
as "communicative artefacts and processes which combine various sign systems
(modes) and whose production and reception calls upon the communicators to
semantically and formally interrelate all sign repertoires present" (p. 9). Then,
what is the point of mediating the multimodal with discourse analysis? The main
reason is that quite a portion of meaning is conveyed through nonverbal channels.
In that case, communication should not be understood as a process realised only
by one particular sensory organ. Therefore, the discourse elicited in such
settings is multimodal discourse (Zhang 2009).
The stratum of multimodal discourse naturally extends to the method with which
multimodal discourse is examined, viz. MDA. Jewitt (2006) thinks that MDA is a
perspective from which discourse is analysed when all the communicative modes
are deemed as meaning-making resources and that it depicts an approach that
“understand[s] communication and representation to be more than language, and
which attend to the full range of communicational forms people use—image,
gesture, gaze, posture, and so on—and the relationships between them” (Jewitt
2009, p. 14). O’Halloran (2011), in a similar vein, defines MDA as “[extending] the
study of language per se to the study of language in combination with other
70 2 Literature Review
resources, such as image, scientific symbolism, gesture, action, music and sound”
(p. 120).
Having noted that MDA also looks at meaning-making resources other than
verbal language alone, this section maps out the terrains that this particular method
can cover. Simpson (2003) points out six domains that MDA mainly focuses on:
(1) multimodality and new media; (2) application of multimodality in the academic
and educational context; (3) multimodality and literacy; (4) construction of multi-
modal corpora; (5) multimodality and typology; and (6) MDA and its rationale.
Baldry and Thibault (2006), however, posit six slightly different topics for MDA
research: (1) what is multimodal text; (2) how to transcribe and analyse such text;
(3) what technologies are needed to analyse multimodal texts and construct mul-
timodal corpora; (4) how meaning potential can be exponentially increased when
meaning-making resources from multimedia are applied to hypertext; (5) how to
relate language studies to multimodality and multimedia; and (6) to what extent
MDA can bring changes to linguistics. It can be felt that two things are shared
even though the above research domains vary slightly from each other. One is that
the ultimate purpose of MDA is to perceive all the meaning-making resources,
particularly those beyond the boundary of verbal language. The other is the trend
that MDA can be applied to large-scale research by means of corpus construction.
Bateman et al. (2004) and Bateman (2008) also believe that one of the multimodal
study foci is to formulate an analytical framework for dealing with multimodal data
in corpora. In fact, this domain is also foregrounded by the fact that previous
discourse-based analysis methods usually fail to quantitatively account for and
generalise research findings.
Nonetheless, even though MDA sets explicit directions for research and further
development, there are still different approaches to, or streams of, MDA, as
foreshadowed above. In order to select a suitable approach for this study and
maintain a consistent line of analysis, the following part introduces these
approaches, reviews how they are applied in studies related to Chinese EFL
learners, and then justifies the selection of an MDA approach for this study.
Approaches to Multimodality
Broadly divided, there are two approaches to MDA with different theoretical
underpinnings. One approach lays its foundation on Halliday's (1978, 1985)
social semiotic approach to language studies, in which all potential meanings are
structured and construed in sets of interrelated systems. Therefore, this stream
is usually known as systemic functional multimodal discourse analysis (SF-MDA),
whose bases are established by the works of Kress and van Leeuwen (1996, 1998,
2001, 2002, 2006; Kress et al. 2001, 2005; van Leeuwen 1999, 2001), O'Toole
(1994, 2010), Baldry and O'Halloran (2005, 2008a, 2011) and so forth. The other
stream of MDA, whose rationale can be traced back to activity theory (Engeström
1987; Daniels 2001) (AT-MDA), draws upon interactional sociolinguistics and
intercultural communication. That stream includes mediated discourse theory
(MDT) (Norris 2002, 2004; Norris and Jones 2005; Scollon 2001; Scollon and
Scollon 2004) and situated discourse analysis (SDA) (Gu 2006a, b, 2007, 2009).
SF-MDA
One of the main reasons why SF-MDA emerges and develops exponentially is that
its underpinnings can be directly borrowed from systemic functional linguistics
(SFL). Specifically, SF-MDA absorbs the notions of language as social semiotic
and of meaning potential, and extends the boundary of meaning-making resources.
In addition, with reference to metafunctional meanings, SF-MDA also holds that
multimodal discourse is multifunctional in that discourse is embedded with
ideational, interpersonal and textual meanings. SF-MDA also develops the theory
concerning register and associates the interpretation of discourse with its
particular context. All these features provide SF-MDA with a fitting platform on
which all the SFL-related theories, without any further modification, can
immediately serve as its strong support.
Within the scenario of SF-MDA, most studies concentrate on the analyses and
interpretations of the pictorial system, especially within a framework of
analysing visual text and its communicative meaning (Kress and van Leeuwen 1996,
2006). Congruent with ideational, interpersonal and textual meanings in SFL
studies, this framework describes meanings as not only representational (the
representation of entities, physical or semiotic), but also interactive (images
constructing the nature of relations between viewers and what is viewed) and
compositional (the distribution of information value or the relative emphasis
among elements of the image). Therefore, how images convey meanings also
conforms to certain grammatical rules, which are beyond the conventional sense
of grammar in linguistics. In their follow-up work, having noted that a drawback
of their framework lies in the isolated grammar for each individual modality,
Kress and van Leeuwen (2001) draw attention to perceiving all the modalities in
a coherent context. Their broader framework is supposed to identify four strata
of meaning making in any communicative practice: discourse, design, production
and distribution.
Other representative researchers also mainly train the lens of MDA on images.
For instance, O'Toole (1994, 2010) applies a visual arts grammar to the analyses
of paintings and architecture and arrives at similar terms regarding meaning
making: representational meaning, modal meaning and compositional meaning.
Likewise, SF-MDA is also tailored to study other semiotic resources, including
visual images (Kress and van Leeuwen 2006; O'Halloran 2008b), mathematical
symbols (O'Halloran 2005), movement and gesture (Martinec 2000b, 2001, 2004),
video texts and Internet sites (Djonov 2006; Iedema 2001; Lemke 2002; O'Halloran
2004) and three-dimensional sites (Ravelli 2000) as well.
The above research on SF-MDA frameworks indicates that this stream does have
much to offer, especially when meaning-making resources other than verbal
language are probed into. However, in reviewing SF-MDA, Jewitt (2009) points out
that this stream is not without flaws. It might dawn upon this review that most
of the analyses of images, symbols and other resources, if not all, are rather
impressionistic. In other words, if perceived by different researchers with
varied cultural or educational backgrounds, the interpretations might diverge to
a certain extent. The reason might be that although SF-MDA has linked the
signifier with the signified to a great extent, the way their relevance is
interpreted is still based on subjective perceptions.
Another limitation pointed out by Jewitt (2009) is that "MDA is a kind of 'lin-
guistic imperialism' that imports and imposes linguistic terms on everything"
(p. 26). However, this limitation can be justified, as most SF-MDA studies are
undertaken within the field of linguistics. If MDA is intended to interpret a
language system, there should be no "linguistic imperialism" to speak of. It
might also be controversial that SF-MDA is only concerned with static discourse,
such as images and architecture, and how they convey meanings through different
channels. Nevertheless, this flaw can again be defended by the fact that even
though most SF-MDA studies focus on static discourses, it does not necessarily
follow that the approach would be powerless in dealing with dynamic discourses,
such as situated discourses embodying human actions. This can be supported by
Hood's (2007, 2010, 2011) studies, in which an SF-MDA approach is adopted to
present a multimodal analysis of a poet's performance and of the role of body
language in face-to-face teaching. Therefore, although this flaw exists, it
might be caused by the lower profile of SF-MDA research on dynamic discourse
rather than by the powerlessness of the approach per se.
AT-MDA
In addition to applying SFL to studying various modalities, a host of researchers are
also interested in basing their MDA studies on activity theory.
By integrating sociolinguistics, ethnolinguistics and intercultural
communication, Scollon (2001; Scollon and Scollon 2003) proposes MDT, which
integrates social activity with discourse. This is a step forward in that
previous discourse analysis studies usually neglect the significance of
activity, whereas sociological theories, in most cases, do not take discourse
into account either. Unlike a conventional sense of discourse analysis, which
treats a text or a genre as the unit of analysis, MDT mainly looks at mediated
action and "social actors as they are acting because these are the moments in
social life when the discourses in which we are interested are instantiated in
the social world as social action, not simply as material objects" (Scollon
2001, p. 3). According to Scollon (2001), any social actor conducts a mediated
action by means of material objects (including the actor's own dress, body and
so forth) in the material world. Based on Scollon's framework of AT-MDA, Norris
(2002, 2004) devises an MDA framework, where mediated action is still taken as
the unit of analysis. Her framework substantiates AT-MDA in the sense that she
further distinguishes mediated actions into low-level actions (e.g. a simple
gesture) and high-level actions (a series of concrete actions), and that the
framework quantifies the degree of complexity of high-level actions by ushering
in the notion of mode density (Norris and Jones 2005).
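The idea of quantifying a high-level action by its mode density can be illustrated with a minimal sketch. Everything below is a hypothetical assumption made for demonstration: the mode names, the weighting scheme and the function itself are illustrative, not Norris's own operationalisation.

```python
def modal_density(action_modes, weights=None):
    """Toy measure of mode density: the weighted number of
    communicative modes co-deployed in one high-level action.
    Modes without an explicit weight default to 1.0."""
    weights = weights or {}
    return sum(weights.get(mode, 1.0) for mode in action_modes)

# A hypothetical high-level action: answering a question while
# nodding and maintaining eye contact; speech is weighted as the
# more intense mode in this made-up scheme.
action = ["speech", "head_movement", "gaze"]
print(modal_density(action, weights={"speech": 2.0}))  # 4.0
```

On such a toy scale, an action drawing on more channels, or on more intensely deployed ones, would receive a higher density value.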
Fig. 2.10 Content and medium layers in agent-oriented modelling (Gu 2006a)

² Gu (2006b) uses the term agent-oriented modelling language (AOML), yet he later changes the term to agent-oriented modelling (AOM) because AOM perceives the modelling as a methodology, while AOML emphasises its relation with UML as the modelling metalanguage (Gu 2009).
An Integrated Evaluation
Having reviewed both approaches to MDA and how MDA is employed in the
Chinese EFL context, this part comes to an integrated evaluation and justifies
the approach adopted in the present study. What needs to be addressed first is
that there is no absolute distinction as to which approach is right or wrong.
Gu (2006b) expresses his concern with the foreseeable collaboration between
SF-MDA and AT-MDA, though both approaches have solid foundations in their own
right. Indeed, considering the ultimate research purpose and explanatory power,
the two approaches are not contradictory; their divergence only lies in
different perspectives of looking at multimodal discourses and meaning-making
resources. SF-MDA treats multimodal texts on the basis of social semiotics in
its fullest sense. By comparison, as AT-MDA focuses more on how discourse is
realised in a social activity context, it can be fully operationalised in
dynamic discourses.
This study adopts SF-MDA based on the following considerations. Regarding
the nature and aims of this study, which intends to design and validate a rating
scale embedding nonverbal delivery in speaking assessment, it should be noted
that nonverbal delivery will be looked into to a great extent. As critiqued
above, AT-MDA seems less explored in dealing with static discourse, while
SF-MDA can be applied to both static and dynamic discourses, though previous
studies have not rendered great concern for dynamic discourse. In that case, if
full use is made of SF-MDA to probe into static discourse and more of its
potential is tapped to analyse dynamic discourse, this study can not only
qualitatively analyse how candidates perform, but also benefit SF-MDA in terms
of its extended scope of applicability. It may be argued that both SF-MDA and
AT-MDA could be applied to the present study in an interwoven manner, as both
have their strengths in approaching different types of multimodal texts.
However, adopting SF-MDA does not necessarily mean that the two approaches are
irreconcilable; rather, the decision follows the principle of consistently
referring to the same framework and applying it to qualitatively validate the
rating scale to be proposed.
Static discourse here mainly refers to the transcription of candidates' verbal
language, while dynamic discourse takes a closer look at candidates' nonverbal
delivery. To be more specific, at the rating scale validation stage, when
candidates' performances are investigated for alignment with their analytic
scores and the descriptors regarding verbal utterances, all possible
meaning-making resources will be analysed with SF-MDA as the theoretical
framework. On the other hand, when how candidates perform and synchronise their
verbal language with nonverbal delivery is examined, SF-MDA will also be
referred to.
Apart from the consideration of discourse nature, another concern is that since
MDA will only be adopted in the qualitative stage of rating scale validation,
the randomly selected samples will not be that large in scale compared with
those analysed when MTMM, a quantitative approach, is utilised. Therefore, the
previously mentioned weakness of SF-MDA in reliably bridging the signifier and
the signified can be reduced to a minimum. Otherwise, if all the samples were to
be analysed with an SF-MDA approach, the analyses would wind up as an almost
endless inventory, giving rise to logistical issues jeopardising the
practicality or implementation of this study. Furthermore, as aforementioned,
AT-MDA demands a higher level of technological literacy, which might constrain
this study.
It could also be asked why, since nonverbal delivery can be probed into within
the paradigm of nonverbal communication studies, this study adopts SF-MDA as
the validation method for the rating scale to be proposed. Scollon and Scollon
(2009) also note the similarities between the current interest in multimodality
and the research in the field of nonverbal communication, as best represented by
the works of Pike (1967), Ruesch and Kees (1956) and Hall (1959). However, while
acknowledging that the work in nonverbal communication can inform multimodal
studies, they highlight that "it is not simply a return", as the crucial
difference is that "[n]o longer is language taken to be the model by which
these other phenomena are studied, but, rather, language itself is taken to be
equally grounded in human action with material means in specific earth-grounded
sites of engagement" (Scollon and Scollon 2009, p. 177).
Based on all the above considerations, this study employs an SF-MDA approach
in the qualitative validation of the rating scale to be proposed, and any
reference to MDA henceforth denotes SF-MDA. At this stage, however, what remains
unaddressed is how to apply the framework of MDA to operationalise the rating
scale validation. The next part will sketch out an operationalised framework
informed by MDA and provide a revised one drawn from Martinec's (2000b, 2001,
2004) and Hood's (2007, 2011) studies.
with a power function to convince other discussants of his/her own opinion with an
upward pointing index finger. In that manner, the three strata are comprehensively
probed into and candidates’ performance can be qualitatively aligned with the
rating scale descriptors and the subscores assigned by teacher and peer raters.
The above general framework provides a sketch of how candidates' nonverbal
delivery can be analysed from an MDA perspective. In order to particularise a
repertoire of nonverbal delivery channels and re-address the analysis framework
held back in the earlier review of nonverbal delivery, this study will mainly
refer to Martinec's (2000b, 2001, 2004) and Hood's (2007, 2011) studies in
qualitatively validating the rating scale to be proposed.
Fig. 2.13 The structure of Appraisal Theory (Martin and White 2005, p. 38)
Building on the work of McNeill (1992, 1998, 2000) and Enfield (2009) in
cognitive studies, as well as Kendon's (1980, 2004) research in psychology, Hood
(2007, 2011) takes an SFL perspective to investigate nonverbal delivery, with a
special focus on gestures. In terms of interpersonal meanings, Hood (2011),
informed by Appraisal Theory (Martin 1995, 2000; Martin and White 2005),
identifies gestures that embody attitude, engagement and graduation, as
illustrated in Fig. 2.13. Hood (2011) further argues that nonverbal channels,
such as gestures, can express feelings and values in attitude, can grade meaning
along various dimensions in graduation, and can expand or contract space for
others during interaction in engagement.
In Appraisal Theory, attitudes can instantiate a variety of interpersonal
meanings. However, considering the three main nonverbal channels in the present
study, a polarised set of values that broadly classifies attitudes as Positive
and Negative is proposed. This is because, unlike facial expression, eye
contact, gesture and head movement generally signify either a positive or a
negative attitude rather than affect, appreciation and judgment as outlined in
Fig. 2.13. For instance, a positive attitude can be embodied in an occurrence of
a head nod, while a negative attitude can be instantiated by the gesture of
crossing both hands in front of the chest when a candidate intends to interrupt
other speakers.
Graduation in interpersonal meaning is also elaborated by Hood (2004, 2006).
She notes that "by grading an objective (ideational) meaning the speaker gives a
subjective slant to the meaning, signalling for the meaning to be interpreted
evaluatively" (Hood 2011, p. 43). In line with Appraisal Theory, Hood (2011)
extends graduation as force to the meanings of intensity, size, quantity and
scope, and graduation as focus to specificity. Instead of addressing all these
aspects, this study will mainly look at the pace of different nonverbal delivery
occurrences, such as the frequency of head nods in an interval unit.
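As a rough illustration of how such pace might be counted, the sketch below groups timestamped annotations of nonverbal occurrences into fixed interval units. The channel labels, timings and interval length are all fabricated assumptions for demonstration, not data or coding conventions from this study.

```python
from collections import Counter

def pace_per_interval(events, interval=10.0):
    """Count occurrences of each nonverbal channel per fixed time
    interval -- a simple proxy for the pace of nonverbal delivery
    (e.g. head-nod frequency per interval unit)."""
    counts = Counter()
    for timestamp, channel in events:
        bucket = int(timestamp // interval)  # which interval the event falls in
        counts[(bucket, channel)] += 1
    return counts

# Hypothetical annotations: (seconds into the discussion, channel)
events = [(2.1, "head_nod"), (4.8, "head_nod"), (6.3, "gesture"),
          (12.0, "head_nod"), (15.5, "eye_contact"), (17.2, "head_nod")]
print(pace_per_interval(events))
# two head nods fall in each of the first two 10-second intervals
```

A candidate's nod pace could then be compared across discussion phases, or across candidates, by inspecting these per-interval counts.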
The third aspect of Appraisal Theory is engagement. Specific to gestures,
engagement is realised via the positioning of the hands to expand or contract
the negotiation space for other addressees. In describing interpersonal meanings
instantiated by teachers' gestures, Hood (2011) suggests an open palm or palms-up
2.6 Summary
Revolving around the three key phases of the present study, viz. (1) building an
argument for embedding nonverbal delivery into speaking assessment, and the
issues of (2) how to design and (3) how to validate a rating scale with such a
consideration informed by the argument, this chapter has reviewed the related
literature. The first section reviews the topical issue of this study: nonverbal
delivery. The next two sections address the issue of rating scale design, while
the last two sections pave the way for the concrete procedures of validating a
rating scale, especially the notion of validity and validation methods.
Specifically, the first section pinpoints the significance of nonverbal
delivery in communication and in a repertoire of research fields, and also
outlines previous studies on the three most representative channels of nonverbal
delivery. In that sense, the theoretical argument for incorporating nonverbal
delivery into speaking assessment can be felt to call for a corresponding
empirical argument.
In the second section, by comparing and contrasting the evolution of
communicative competence-related models, the review outlines their components
and respective strengths and weaknesses, justifies the employment of the CLA
model as the theoretical framework for the rating scale design, and points out
the quintessential role of nonverbal delivery in the CLA model. The third
section also responds to the issue of rating scale design. With a review of the
prevailing taxonomies of rating scales in language assessment and
exemplifications of a few existing rating scales used by major language testing
batteries, this section explicitly informs the formulation of the rating scale
with nonverbal delivery embedded as an assessment dimension. Moreover, by
highlighting the context where the rating scale is to be applied, the properties
that the rating scale is supposed to possess are also specified.
The fourth section is devoted to conceptualising validity and validation. An
overview is provided of the three evolutionary phases of validity in the
language assessment scenario, based on which this study justifies its adoption
of a unitary notion of validity with construct validity at the core. In terms of
validation methods, the last section argues for the necessity of using both
quantitative and qualitative methodologies in rating scale validation. MTMM is
reviewed so that a glimpse is offered of how this quantitative method will be
adopted to verify the construct validity of the rating scale, with
teacher-rating and peer-rating as different scoring methods and the different
subdimensions on the scale as traits. The last section also introduces MDA in
detail, ranging from its theoretical origin and different streams of research to
its application both worldwide and in the Chinese EFL context. The end of the
last section provides fine-grained frameworks informed by an MDA approach so
that the proposed rating scale can be validated qualitatively by aligning
candidates' nonverbal delivery performance with the corresponding rating scale
descriptors and the subscores they are assigned.
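To make the MTMM logic concrete, the following sketch computes the monotrait-heteromethod (convergent validity) correlations for a toy design with two traits (subdimensions) and two methods (teacher-rating and peer-rating). The subscores, candidate counts and trait names are fabricated for illustration only and bear no relation to the actual data of this study.

```python
import statistics

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical subscores for five candidates on two traits
# (subdimensions), each scored by two methods.
teacher = {"nonverbal delivery": [3, 4, 2, 5, 3], "fluency": [4, 4, 3, 5, 2]}
peer    = {"nonverbal delivery": [3, 5, 2, 4, 3], "fluency": [3, 4, 3, 5, 3]}

# Monotrait-heteromethod diagonal: high values indicate that the two
# scoring methods converge on the same trait.
for trait in teacher:
    print(f"{trait}: r = {pearson(teacher[trait], peer[trait]):.2f}")
```

In a full MTMM analysis, the heterotrait-monomethod and heterotrait-heteromethod correlations would also be inspected, so that these convergent coefficients can be checked against the discriminant ones.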
References
AERA, APA, and NCME. 1999. Standards for educational and psychological testing.
Washington, DC: American Educational Research Association.
Alderson, J.C. 1981. Report of the discussion on general language proficiency. In Issues in
language testing, ed. J.C. Alderson, and A. Hughes, 87–92. London: The British Council.
Alderson, J.C. 1991. Bands and scores. In Language testing in the 1990s, ed. J.C. Alderson, and
B. North, 71–86. London: Modern English Publications and the British Council.
Alderson, J.C. (ed.). 2002. Common European Framework of Reference for Languages: learning,
teaching, assessment: case studies. Strasbourg: Council of Europe.
Alderson, J.C. 2010. The Common European Framework of Reference for Language. Invited
seminar at Shanghai Jiao Tong University, Shanghai, China, Oct 2010.
Alderson, J.C., and J. Banerjee. 2002. Language testing and assessment (Part 2). Language
Teaching 35(2): 79–113.
Alderson, J.C., N. Figueras, H. Kuiper, and G. Nold. 2006. Analyzing tests of reading and
listening in relation to the Common European Framework of Reference: the experience of the
Dutch CEFR Construct Project. Language Assessment Quarterly 3(1): 3–30.
Alibali, M.W., L. Flevares, and S. Goldin-Meadow. 1997. Assessing knowledge conveyed in
gesture: do teachers have the upper hand? Journal of Educational Psychology 89: 183–193.
Allal, L., and L.M. Lopez. 2005. Formative assessment of learning: a review of publication in
French. In Formative assessment: improving learning in secondary classrooms, ed. J. Looney,
241–264. Paris: Organisation for Economic Cooperation and Development.
Anastasi, A. 1950. Some implications of cultural factors for test construction. New York:
Educational Testing Service.
Anastasi, A. 1954. Psychological testing. New York: Macmillan.
Anastasi, A. 1961. Psychological testing, 2nd ed. New York: Macmillan.
Anastasi, A. 1976. Psychological testing, 4th ed. New York: Macmillan.
Anastasi, A. 1982. What do intelligence tests measure? In On educational testing:
intelligence, performance standards, test anxiety, and latent traits, ed. S.B. Anderson,
and J.S. Helmick, 5–28. San Francisco: Jossey-Bass.
Angoff, W. 1988. Validity: an evolving concept. In Test validity, ed. H. Wainer, and H.I. Braun,
19–32. Hillsdale: Lawrence Erlbaum Associates.
APA. 1954. Technical recommendations for psychological tests and diagnostic techniques.
Psychological Bulletin Supplement 51(2): 1–38.
APA, AERA, and NCME. 1966. Standards for educational and psychological tests and manuals.
Washington, DC: American Psychological Association.
APA, AERA, and NCME. 1974. Standards for educational and psychological tests and manuals.
Washington, DC: American Psychological Association.
Applebee, A.N. 2000. Alternative models of writing development. In Perspectives on writing:
research, theory, practice, ed. R. Indrisano, and J.R. Squire, 90–111. Newark: International
Reading Association.
Argyle, M., and M. Cook. 1976. Gaze and mutual gaze. Cambridge: Cambridge University Press.
Bacha, N. 2001. Writing evaluation: what can analytic versus holistic essay scoring tell us? System
29: 371–383.
Bachman, L.F. 1988. Problems in examining the validity of the ACTFL oral proficiency interview.
Studies in Second Language Acquisition 10(2): 149–164.
Bachman, L.F. 1990. Fundamental considerations in language testing. Oxford: Oxford University
Press.
Bachman, L.F. 1991. What does language testing have to offer? TESOL Quarterly 25(4): 671–704.
Bachman, L.F. 2005. Building and supporting a case for test use. Language Assessment Quarterly
2(1): 1–34.
Bachman, L.F., and A.S. Palmer. 1981. The construct validation of the FSI oral interview.
Language Learning 31: 67–86.
Bachman, L.F., and A.S. Palmer. 1982. The construct validation of some components of
communicative proficiency. TESOL Quarterly 16(4): 449–465.
Bachman, L.F., and A.S. Palmer. 1989. The construct validation of self-ratings of communicative
language ability. Language Testing 6(4): 449–465.
Bachman, L.F., and A.S. Palmer. 1996. Language testing in practice: designing and developing
useful language tests. Oxford: Oxford University Press.
Bachman, L.F., and A.S. Palmer. 2010. Language assessment in practice: developing language
tests and justifying their use in the real world. Oxford: Oxford University Press.
Bachman, L.F., and S.J. Savignon. 1986. The evaluation of communicative language proficiency:
a critique of the ACTFL oral interview. Modern Language Journal 70(3): 380–390.
Bachman, L.F., B.M. Lynch, and M. Mason. 1995. Investigating variability in tasks and rater
judgments in a performance test of foreign language speaking. Language Testing 12(2): 238–257.
Bae, J., and L.F. Bachman. 1998. A latent variable approach to listening and reading: testing
factorial invariance across two groups of children in the Korean/English two-way immersion
program. Language Testing 15(3): 380–414.
Baird, L.L. 1983. The search for communication skills. Educational Testing Service Research
Report, No. 83-14. Princeton: Educational Testing Service.
Baldry, A., and P. Thibault. 2006. Multimodal transcription and text analysis. London: Equinox.
Barakat, R.A. 1973. Arabic gestures. Journal of Popular Culture 6(4): 749–787.
Barkaoui, K. 2007. Rating scale impact on EFL essay marking: a mixed-method study. Assessing
Writing 12(2): 86–107.
Barkaoui, K. 2011. Think-aloud protocols in research on essay rating: an empirical study of their
veridicality and reactivity. Language Testing 28(1): 51–75.
Bateman, J.A. 2008. Multimodality and genre: a foundation for the systematic analysis of
multimodal documents. London: Palgrave Macmillan.
Bateman, J., J. Delin, and R. Henschel. 2004. Multimodality and empiricism: preparing for a
corpus-based approach to the study of multimodal meaning-making. In Perspectives on
multimodality, ed. E. Ventola, C. Charles, and M. Kaltenbacher, 65–88. Philadelphia: John
Benjamins.
Bateman, J.A., J. Delin, and R. Henschel. 2006. Mapping the multimodal genres of traditional and
electronic newspapers. In New directions in the analysis of multimodal discourse, ed.
T.D. Royce, and W.L. Bowcher, 147–172. Mahwah: Lawrence Erlbaum Associates.
Black, P., and D. Wiliam. 1998. Assessment and classroom learning. Assessment in Education
5(1): 7–74.
Black, P., and D. Wiliam. 2009. Developing the theory of formative assessment. Educational
Measurement, Evaluation and Accountability 21(1): 5–31.
Bloom, B.S., J.T. Hasting, and G.F. Madaus (eds.). 1971. Handbook of formative and summative
evaluation of student learning. New York: McGraw-Hill.
Bonk, W.J., and G.J. Ockey. 2003. A many-facet Rasch analysis of the second language group oral
discussion task. Language Testing 20(1): 89–110.
Bourne, J., and C. Jewitt. 2003. Orchestrating debate: a multimodal approach to the study of the
teaching of higher order literacy skills. Reading: Literacy and Language, UKRA, July, 64–72.
Brindley, G. 1986. The assessment of second language proficiency: issues and approaches.
Adelaide: National Curriculum Resource Centre.
Brindley, G. 1991. Defining language ability: the criteria for criteria. In Current developments in
language testing, ed. S. Anivan, 139–164. Singapore: Regional Language Centre.
Brindley, G. 2002. Issues in language assessment. In The Oxford handbook of applied linguistics,
ed. R.B. Kaplan, 459–470. Oxford: Oxford University Press.
Brookhart, S.M. 2004. Classroom assessment: tensions and intersection in theory and practice.
Teachers College Record 106(3): 429–458.
Brookhart, S.M. 2007. Expanding views about formative classroom assessment: a review of the
literature. In Formative classroom assessment: theory into practice, ed. J.H. McMillan, 43–62.
New York: Teachers College Press.
Brooks, L. 2009. Interacting in pairs in a test of oral proficiency: co-constructing a better
performance. Language Testing 26(3): 341–366.
References 89
Brown, A. 2003. Interviewer variation and the co-construction of speaking proficiency. Language
Testing 20(1): 1–25.
Brown, A., N. Iwashita, and T. McNamara. 2005. An examination of rater orientations and test
taker performance on English for academic purposes speaking tasks. TOEFL Monograph
Series, No. TOEFL-MS-29. Princeton: Educational Testing Service.
Brown, G., and G. Yule. 1983. Discourse analysis. Cambridge: Cambridge University Press.
Brown, J.D., and K.M. Bailey. 1984. A categorical instrument for scoring second language
writing skills. Language Learning 34(1): 21–42.
Brown, J.D., and T. Hudson. 1998. The alternatives in language assessment. TESOL Quarterly
32(4): 653–675.
Brumfit, C.J. 1984. Communicative methodology in language teaching: the roles of fluency and
accuracy. Cambridge: Cambridge University Press.
Brumfit, C.J., and K. Johnson. 1979. The communicative approach to language teaching. Oxford:
Oxford University Press.
Burgoon, J.K., and T. Saine. 1978. The unspoken dialogue: an introduction to nonverbal
communication. Boston: Houghton Mifflin Company.
Burgoon, J.K., D.A. Coker, and R.A. Coker. 1986. Communicative effects of gaze behavior: a test
of two contrasting explanations. Human Communication Research 12: 495–524.
Campbell, D.T., and D.W. Fiske. 1959. Convergent and discriminant validation by the multi-trait
multi-method matrix. Psychological Bulletin 56: 81–105.
Canale, M. 1983. From communicative competence to communicative language pedagogy. In
Language and communication, ed. J.C. Richards, and R.W. Schmidt, 2–27. London: Longman.
Canale, M., and M. Swain. 1980. Theoretical bases of communicative approaches to second
language teaching and testing. Applied Linguistics 1(1): 1–47.
Candlin, C.N. 1986. Explaining communicative competence: limits of testability? In Toward
communicative competence testing: proceedings of the second TOEFL invitational conference,
ed. C.W. Stansfield, 38–57. Princeton: Educational Testing Service.
Caple, H. 2008. Intermodal relations in image nuclear news stories. In Multimodal semiotics:
functional analysis in contexts of education, ed. L. Unsworth, 125–138. London: Continuum.
Carroll, J.B. 1961. The nature of the data, or how to choose a correlation coefficient. Psychometrika
26(4): 347–372.
Carroll, J.B. 1968. The psychology of language testing. In Language testing symposium: a
psycholinguistic perspective, ed. A. Davies, 46–69. London: Oxford University Press.
Celce-Murcia, M., Z. Dörnyei, and S. Thurrell. 1997. Direct approaches in L2 instruction: a
turning point in communicative language teaching? TESOL Quarterly 31(1): 141–152.
Cerrato, L. 2005. Linguistic functions of head nods. In Gothenburg papers in theoretical
linguistics 92: proceedings from 2nd Nordic conference on multi-modal communication, ed.
J. Allwood, and B. Dorriots, 137–152. Sweden: Gothenburg University.
Chafe, W. 1994. Discourse, consciousness, and time: the flow and displacement of conscious
experience in speaking and writing. Chicago: University of Chicago Press.
Chalhoub-Deville, M. 1995. Deriving oral assessment scales across different tests and rater groups.
Language Testing 12(1): 16–33.
Chapelle, C.A. 1998. Field independence: a source of language test variance? Language Testing
15(1): 62–82.
Chapelle, C.A. 1999. Validity in language assessment. Annual Review of Applied Linguistics 19:
254–272.
Chapelle, C.A., M.K. Enright, and J. Jamieson (eds.). 2008. Building a validity argument for the
Test of English as a Foreign Language. New York: Routledge.
Chapelle, C.A., M.K. Enright, and J. Jamieson. 2010. Does an argument-based approach to
validity make a difference? Educational Measurement: Issues and Practice 29(1): 3–13.
Charney, D. 1984. The validity of using holistic scoring to evaluate writing: a critical overview.
Research in the Teaching of English 18(1): 65–81.
Chen, R. 2008. Some words on writing a multimodal lesson ware for English teaching. Journal of
Fujian Education Institute 1: 75–77.
Chen, Y., and G. Huang. 2009. Multimodal construal of heteroglossia: evidence from language
textbooks. Computer Assisted Foreign Language Education 6: 35–41.
Chen, Y., and H. Wang. 2008. Ideational meaning of image and text-image relations. Journal of
Ningbo University (Education Edition) 1: 124–129.
Cheng, L. 2005. Changing language teaching through language testing: a washback study.
Cambridge: Cambridge University Press.
Chomsky, N. 1965. Aspects of the theory of syntax. Cambridge: MIT Press.
Cienki, A. 2008. Why study metaphor and gesture? In Metaphor and gesture, ed. A. Cienki, and
C. Müller, 5–26. Amsterdam: John Benjamins.
Cizek, G.J. 2010. An introduction to formative assessment: history, characteristics and challenges.
In Handbook of formative assessment, ed. H.L. Andrade, and G.J. Cizek, 3–17. New York:
Routledge.
Clark, J.L. 1985. Curriculum renewal in second language learning: an overview. Canadian
Modern Language Review 42(3): 342–360.
Clarkson, R., and M.T. Jensen. 1995. Assessing achievement in English for professional
employment programmes. In Language assessment in action, ed. G. Brindley, 165–194.
Sydney: National Centre for English Language Teaching and Research.
Cohen, A. 1994. Assessing language ability in the classroom, 2nd ed. Boston: Heinle and Heinle
Publishers.
Connor, U., and P.L. Carrel. 1993. The interpretation of the tasks by writers and readers in
holistically rated directed assessment of writing. In Reading in the composition classroom:
second language perspectives, ed. J.G. Carson, and I. Leki, 141–160. Boston: Heine & Heine.
Connor, U., and A. Mbaye. 2002. Discourse approaches to writing assessment. Annual Review of
Applied Linguistics 22: 263–278.
Cooper, C.R. 1977. Holistic evaluation of writing. In Evaluating writing: describing, measuring,
judging, ed. C.R. Cooper, and L. Odell, 3–31. Urbana: NCTE.
Corder, S.P. 1983. Strategies of communication. In Strategies in interlanguage communication,
ed. C. Færch, and G. Kasper, 15–19. London: Longman.
Cortazzi, M. 1993. Narrative analysis. London: Falmer Press.
Council of Europe. 2001. Common European framework of reference for languages: learning,
teaching, assessment. Cambridge: Cambridge University Press.
Cowie, B., and B. Bell. 1999. A model of formative assessment in science education. Assessment
in Education 6(1): 102–116.
Creider, C. 1977. Towards a description of East African gestures. Sign Language Studies 14: 1–20.
Cronbach, L.J. 1949. Essentials of psychological testing. New York: Harper & Row.
Cronbach, L.J. 1971. Test validation. In Educational measurement, 2nd ed, ed. R.L. Thorndike,
443–507. Washington, DC: American Council on Education.
Cronbach, L.J. 1980. Validity on parole: how can we go straight? In New directions for testing and
measurement: measuring achievement over a decade. Proceedings of the 1979 ETS invitational
conference, 99–108. San Francisco: Jossey-Bass.
Cronbach, L.J. 1988. Five perspectives on validity argument. In Test validity, ed. H. Wainer, and
H.I. Braun, 3–17. Hillsdale: Lawrence Erlbaum Associates.
Cronbach, L.J. 1989. Construct validation after thirty years. In Intelligence: measurement, theory,
and public policy, ed. R. Linn, 147–167. Urbana: University of Illinois Press.
Cronbach, L.J., and P.C. Meehl. 1955. Construct validity in psychological tests. Psychological
Bulletin 52(4): 281–302.
Cumming, A. 1990. Expertise in evaluating second language composition. Language Testing 7(1):
31–51.
Cumming, A. 2009. Language assessment in education: tests, curricula and teaching. Annual
Review of Applied Linguistics 29: 90–100.
Cumming, A., R. Kantor, and D.E. Powers. 2001. Scoring TOEFL essays and TOEFL 2000
prototype writing tasks: an investigation into raters’ decision making and development of a
preliminary analytic framework. TOEFL Monograph Series, No. TOEFL-MS-22. Princeton:
Educational Testing Service.
Cumming, A., R. Kantor, and D.E. Powers. 2002. Decision making while rating ESL/EFL writing
tasks: a descriptive framework. Modern Language Journal 86: 67–96.
Cumming, A., R. Kantor, K. Baba, U. Erdosy, K. Eouanzoui, and M. James. 2006. Analysis of
discourse features and verification of scoring levels for independent and integrated tasks for
the new TOEFL. Princeton: Educational Testing Service.
Cureton, E.E. 1950. Validity. In Educational measurement, ed. E.F. Lingquist, 621–694.
Washington, DC: American Council on Education.
Daly, A., and L. Unsworth. 2011. Analysis and comprehension of multimodal texts. Australian
Journal of Language and Literacy 34(1): 61–80.
Daniels, H. 2001. Vygotsky and pedagogy. London: Routledge.
Davidson, F., and B. Lynch. 2002. Testcraft: a teacher’s guide to writing and using language test
specifications. New Haven: Yale.
Davies, A., and P. LeMahieu. 2003. Assessment for learning: reconsidering portfolio and research
evidence. In Optimising new modes of assessment: in search of qualities and standards, ed.
M. Sergers, F. Dochy, and E. Cascallar, 141–169. Dordrecht: Kluwer Academic Publishers.
Davies, A., A. Brown, C. Elder, K. Hill, T. Lumley, and T. McNamara. 1999. Dictionary of
language testing. Cambridge: Cambridge University Press.
Davison, C. 2004. The contradictory culture of teacher-based assessment: ESL assessment
practices in Australian and Hong Kong secondary schools. Language Testing 21(3): 305–334.
de Jong, J.H.A.L. 1992. Assessment of language proficiency in the perspective of the 21st century.
AILA Review 9: 39–45.
Derewianka, B., and C. Coffin. 2008. Visual representations of time in history textbooks. In
Multimodal semiotics, ed. L. Unsworth, 187–200. London: Continuum.
Djonov, E.N. 2006. Analysing the organisation of information in websites: from hypermedia
design to systemic functional hypermedia discourse analysis. Unpublished Ph.D. thesis,
University of New South Wales, Australia.
Douglas, D., and J. Smith. 1997. Theoretical underpinnings of the Test of Spoken English revision
project. TOEFL Monograph Series, No. TOEFL-MS-9. Princeton: Educational Testing
Service.
Douglas, D. 2000. Assessing languages for specific purposes. Cambridge: Cambridge University
Press.
Ducasse, A.M., and A. Brown. 2009. Assessing paired orals: raters’ orientation to interaction.
Language Testing 26(3): 423–443.
Dwyer, C.A. 2000. Excerpt from validity: theory into practice. The Score 22(4): 6–7.
Ebel, R.L. 1961. Must all tests be valid? American Psychologist 16(10): 640–647.
Ebel, R.L., and D.A. Frisbie. 1991. Essentials of educational measurement, 5th ed. Englewood
Cliffs: Prentice-Hall.
Efron, D. 1941. Gesture, race and culture. The Hague: Mouton.
Egbert, M.M. 1998. Miscommunication in language proficiency interviews of first-year German
students: a comparison with natural conversation. In Talking and testing: discourse approaches
to the assessment of oral proficiency, ed. R. Young, and W. He, 147–172. Philadelphia: John
Benjamins.
Eggins, S., and D. Slade. 1997. Analysing casual conversation. London: Cassell.
Ekman, P., and W.V. Friesen. 1969. Nonverbal leakage and clues to deception. Psychiatry 32: 88–
106.
Ekman, P., and W.V. Friesen. 1974. Detecting deception from body or face. Journal of Personality
and Social Psychology 29: 288–298.
Ellsworth, P.C., and L.M. Ludwig. 1971. Visual behaviour in social interaction. Journal of
Communication 21(4): 375–403.
Enfield, N.J. 2009. The anatomy of meaning: speech, gesture, and composite utterances.
Cambridge: Cambridge University Press.
Engestrom, Y. 1987. Learning by expanding: an activity theoretical approach to developmental
research. Helsinki: Orienta-Konsultit Oy.
Erdosy, M.U. 2004. Exploring variability in judging writing ability in a second language: a study
of four experienced raters of ESL compositions. TOEFL Research Report, No. RR-03-17.
Princeton: Educational Testing Service.
Ericsson, K.A., and H. Simon. 1993. Protocol analysis. Cambridge: MIT Press.
Færch, C., and G. Kasper (eds.). 1983. Strategies in interlanguage communication. London:
Longman.
Færch, C., et al. 1984. Learner language and language learning. Philadelphia: Multilingual
Matters Ltd.
Feng, D. 2011. Visual space and ideology: a critical cognitive analysis of spatial orientations in
advertising. In Multimodal studies: exploring issues and domains, ed. K.L. O’Halloran, and
B.A. Smith, 55–75. London: Routledge.
Folland, D., and D. Robertson. 1976. Towards objectivity in group oral testing. ELT Journal 30(2):
156–167.
Fulcher, G. 1987. Tests of oral performance: the need for data-based criteria. ELT Journal 41(4):
287–291.
Fulcher, G. 1993. The construction and validation of rating scales for oral tests in English as a
foreign language. Unpublished Ph.D. thesis. University of Lancaster, UK.
Fulcher, G. 1996a. Does thick description lead to smart tests? A data-based approach to rating
scale construction. Language Testing 13(2): 208–238.
Fulcher, G. 1996b. Invalidating validity claims for the ACTFL oral rating scale. System 24(2):
163–172.
Fulcher, G. 1997. The testing of speaking in a second language. In Encyclopaedia of language and
education, vol. 7: Language testing and assessment, ed. C. Clapham, and D. Corson, 75–85.
New York: Springer.
Fulcher, G. 2003. Testing second language speaking. London: Longman/Pearson Education.
Fulcher, G. 2004. Deluded by artifices? The Common European Framework and harmonization.
Language Assessment Quarterly 1(4): 253–266.
Fulcher, G. 2010. Practical language testing. London: Hodder Education.
Fulcher, G., and F. Davidson. 2007. Language testing and assessment: an advanced resource
book. London: Routledge.
Fulcher, G., F. Davidson, and J. Kemp. 2011. Effective rating scale development for speaking
tests: performance decision trees. Language Testing 28(1): 5–29.
Galloway, V.B. 1987. From defining to developing proficiency: a look at the decisions. In Defining
and developing proficiency: guidelines, implementations, and concepts, ed. H. Byrnes, and
M. Canale, 25–73. Lincolnwood: National Textbook Company.
Garrett, H.E. 1947. Statistics in psychology and education, 3rd ed. New York: Longmans, Green
& Company.
Goldin-Meadow, S., and M.A. Singer. 2003. From children’s hands to adults’ ears: gesture’s role
in teaching and learning. Developmental Psychology 39: 509–520.
Goodwin, C., and J.C. Heritage. 1990. Conversation analysis. Annual Review of Anthropology 19:
283–307.
Goodwin, L.D. 1997. Changing conceptions of measurement validity. Journal of Nursing
Education 36: 102–107.
Goodwin, L.D. 2002. Changing conceptions of measurement validity: an update on the new
standards. Journal of Nursing Education 41: 100–106.
Goodwin, L.D., and N.L. Leech. 2003. The meaning of validity in the new standards for
educational and psychological testing: implications for measurement courses. Measurement
and Evaluation in Counseling and Development 36(3): 181–191.
Goulden, N.R. 1992. Theory and vocabulary for communication assessments. Communication
Education 41(3): 258–269.
Goulden, N.R. 1994. Relationship of analytic and holistic methods to raters’ scores for speeches.
The Journal of Research and Development in Education 27: 73–82.
Grant, L., and L. Ginther. 2000. Using computer-tagged linguistic features to describe L2 writing
differences. Journal of Second Language Writing 9: 123–145.
Green, A. 1998. Verbal protocol analysis in language testing research: a handbook. Cambridge:
Cambridge University Press.
Green, A. 2007. Washback to learning outcomes: a comparative study of IELTS preparation and
university pre-sessional language courses. Assessment in Education 14(1): 75–97.
Green, J.R. 1968. A gesture inventory for the teaching of Spanish. Philadelphia: Chilton Books.
Grierson, J. 1995. Classroom-based assessment in intensive English centres. In Language
assessment in action, ed. G. Brindley, 239–270. Sydney: National Centre for English Language
Teaching and Research.
Grootenboer, H. 2006. Treasuring the gaze: eye miniature portraits and the intimacy of vision. Art
Bulletin 88(3): 496–507.
Gu, Y. 2006a. Multimodal text analysis: a corpus linguistic approach to situated discourse. Text &
Talk 26(2): 127–167.
Gu, Y. 2006b. Agent-oriented modelling language, Part 1: modelling dynamic behaviour.
Proceedings of the 20th international CODATA conference, Beijing, 21–47. Beijing:
Information Centre, Chinese Academy of Social Sciences.
Gu, Y. 2007. Learning by multimedia and multimodality. In E-learning in China: Sino-UK
initiatives into policy, pedagogy and culture, ed. H. Spencer-Oatey, 37–56. Hong Kong: The
Hong Kong University Press.
Gu, Y. 2009. From real life situated discourse to video-stream data-mining: an argument for
agent-oriented modelling for multimodal corpus compilation. International Journal of Corpus
Linguistics 14(4): 433–466.
Guijarro, A.J.M., and M.J.P. Sanz. 2009. On interaction of image and verbal text in a picture book:
a multimodal and systemic functional study. In The world told and the world shown:
multisemiotic issues, ed. E. Ventola, and A.J.M. Guijarro, 107–123. Hampshire: Palgrave
Macmillan.
Guilford, J.P. 1946. New standards for test evaluation. Educational and Psychological
Measurement 6(3): 427–438.
Guion, R.M. 1977. Content validity: the source of my discontent. Applied Psychological
Measurement 1(1): 1–10.
Gulliksen, H. 1950. Theory of mental tests. Hillsdale: Lawrence Erlbaum Associates.
Guo, L. 2004. Multimodality in biology textbooks. In Multimodal discourse analysis:
systemic-functional perspectives, ed. K.L. O’Halloran, 196–219. London: Continuum.
Hale, G.A., D.A. Rock, and T. Jirele. 1989. Confirmatory factor analysis of the TOEFL. TOEFL
Research Report, No. RR-32. Princeton: Educational Testing Service.
Hall, E.T. 1959. The silent language. New York: Doubleday.
Halliday, M.A.K. 1973. Explorations in the functions of language. London: Edward Arnold.
Halliday, M.A.K. 1976. The form of a functional grammar. In Halliday: system and function in
language, ed. G. Kress, 101–135. Oxford: Oxford University Press.
Halliday, M.A.K. 1978. Language as social semiotic: the social interpretation of language and
meaning. London: Edward Arnold.
Halliday, M.A.K. 1985. An introduction to functional grammar. London: Arnold.
Halliday, M.A.K., and R. Hasan. 1976. Cohesion in English. London: Longman.
Halliday, M.A.K., and C.M.I.M. Matthiessen. 2004. An introduction to functional grammar, 3rd
ed. London: Edward Arnold.
Halliday, M.A.K., A. McIntosh, and P. Strevens. 1964. The linguistic sciences and language
teaching. Bloomington: Indiana University Press.
Hamp-Lyons, L. 1990. Second language writing: assessment issues. In Second language writing:
research insights for the classroom, ed. B. Kroll, 69–87. New York: Cambridge University
Press.
Hamp-Lyons, L. 1991. Scoring procedures for ESL contexts. In Assessing second language
writing in academic contexts, ed. L. Hamp-Lyons, 241–276. Norwood: Ablex.
Hamp-Lyons, L. 1997. Washback, impact and validity: ethical concerns. Language Testing 14(3):
295–303.
Hatch, E. 1978. Discourse analysis and second language acquisition. In Second language
acquisition: a book of readings, ed. E. Hatch, 401–435. Rowley: Newbury House.
Hattie, J., and H. Timperley. 2007. The power of feedback. Review of Educational Research 77(1):
81–112.
Hawkey, R. 2001. Towards a common scale to describe L2 writing performance. Cambridge
Research Notes 5: 9–13.
Hawkey, R., and F. Barker. 2004. Developing a common scale for the assessment of writing.
Assessing Writing 9(2): 122–159.
He, W. 1998. Answering questions in LPIs: a case study. In Talking and testing: discourse
approaches to the assessment of oral proficiency, ed. R. Young, and W. He, 101–116.
Philadelphia: John Benjamins.
Heath, C.C., and P. Luff. 2007. Gesture and institutional interaction: figuring bids in auctions of
fine art and antiques. Gesture 7(2): 215–240.
Hempel, C.G. 1965. Aspects of scientific explanation and other essays in the philosophy of
science. Glencoe: Free Press.
Henley, N.M. 1977. Body politics: power, sex, and nonverbal communication. Englewood Cliffs:
Prentice-Hall.
Henley, N.M., and S. Harmon. 1985. The nonverbal semantics of power and gender: a perceptual
study. In Power, dominance, and nonverbal behavior, ed. S.L. Ellyson, and J.F. Dovidio, 151–
164. New York: Springer.
Herman, J.L., and K. Choi. 2008. Formative assessment and the improvement of middle school
science learning: the role of teacher accuracy. CRESST Report 740. Los Angeles: National
Center for Research on Evaluation, Standards, and Student Testing.
Hess, E.H. 1975. The tell-tale eye: how your eyes reveal hidden thoughts and emotions. New
York: van Nostrand Reinhold.
Hilsdon, J. 1995. The group oral exam: advantages and limitations. In Language testing in the
1990s: the communicative legacy, ed. C. Alderson, and B. North, 189–197. Hertfordshire:
Prentice Hall International.
Hood, S. 2004. Managing attitude in undergraduate academic writing: a focus on the introductions
to research reports. In Analysing academic writing: contextualized frameworks, ed.
L.J. Ravelli, and R.A. Ellis, 24–44. London: Continuum.
Hood, S. 2006. The persuasive power of prosodies: radiating values in academic writing. Journal
of English for Academic Purposes 5(1): 37–49.
Hood, S.E. 2007. Gesture and meaning making in face-to-face teaching. Paper presented at the
Semiotic Margins Conference, University of Sydney.
Hood, S.E. 2010. Mimicking and mocking identities: the roles of language and body language in
Taylor Mali’s “Speak with conviction”. Invited seminar at the Hong Kong Polytechnic
University, 4 November 2010.
Hood, S.E. 2011. Body language in face-to-face teaching: a focus on textual and interpersonal
meaning. In Semiotic margins: meanings in multimodalities, ed. S. Dreyfus, S. Hood, and S.
Stenglin, 31–52. London: Continuum.
Hopper, R., S. Koch, and J. Mandelbaum. 1986. Conversation analysis methods. In Contemporary
issues in language and discourse processes, ed. D.G. Ellis, and W.A. Donohue, 169–186.
Hilldale: Lawrence Erlbaum Associates.
Hornik, J. 1987. The effect of touch and gaze upon compliance and interest of interviewees. The
Journal of Social Psychology 127: 681–683.
House, E.R. 1980. Evaluating with validity. Beverly Hills: Sage Publications.
Hu, L.T., and P.M. Bentler. 1999. Cutoff criteria for fit indexes in covariance structure analysis:
conventional criteria versus new alternatives. Structural Equation Modelling: A
Multidisciplinary Journal 6: 1–55.
Hu, Z., and J. Dong. 2006. How meaning is construed multimodally: a case study of a PowerPoint
presentation contest. Computer Assisted Foreign Language Education 3: 3–12.
Huerta-Macias, A. 1995. Alternative assessment: responses to commonly asked questions. TESOL
Journal 5(1): 8–11.
Hughes, A. 2003. Testing for language teachers, 2nd ed. Cambridge: Cambridge University Press.
Hulstijn, J.H. 2007. The shaky ground beneath the CEFR: quantitative and qualitative dimensions
of language proficiency. The Modern Language Journal 91(4): 663–667.
Hulstijn, J.H. 2011. Language proficiency in native and nonnative speakers: an agenda for research
and suggestions for second-language assessment. Language Assessment Quarterly 8(3): 229–
249.
Hymes, D.H. 1962. The ethnography of speaking. In Anthropology and human behaviour, ed.
T. Gladwin, and W.C. Sturtevant, 13–53. Washington: The Anthropology Society of
Washington.
Hymes, D.H. 1964. Introduction: toward ethnographies of communication. American
Anthropologist 6(6): 1–34.
Hymes, D.H. 1972. On communicative competence. In Sociolinguistics, ed. J. Pride, and
J. Holmes, 269–293. Harmondsworth: Penguin.
Hymes, D.H. 1973. Toward linguistic competence. Texas working papers in sociolinguistics
(working paper No. 16). Austin: Centre for Intercultural Studies in Communication, and
Department of Anthropology, University of Texas.
Hymes, D.H. 1974. Foundations in sociolinguistics: an ethnographic approach. Philadelphia:
University of Pennsylvania Press.
Hymes, D.H. 1982. Toward linguistic competence. Philadelphia: Graduate School of Education,
University of Pennsylvania.
Iedema, R. 2001. Analysing film and television: a social semiotic account of hospital: an unhealthy
business. In Handbook of visual analysis, ed. T. van Leeuwen, and C. Jewitt, 183–204.
London: Sage.
Iizuka, Y. 1992. Extraversion, introversion and visual interaction. Perceptual and Motor Skills 74:
43–59.
Ingram, D., and E. Wylie. 1993. Assessing speaking proficiency in the international English
language testing system. In A new decade of language testing research: selected papers from
the 1990s language testing research colloquium, ed. D. Douglas, and C. Chapelle, 220–234.
Alexandria: TESOL Inc.
Jackendoff, R. 1983. Semantics and cognition. Cambridge: MIT Press.
Jacob, E. 1988. Clarifying qualitative research: a focus on traditions. Educational Researcher
17(1): 16–24.
Janik, S.W., A.R. Wellens, M.L. Goldberg, and L.F. Dell’Osso. 1978. Eyes as the centre of focus
in the visual examination of human faces. Perceptual and Motor Skills 47: 857–858.
Jarvis, G.A. 1986. Proficiency testing: a matter of false hopes? ADFL Bulletin 18: 20–21.
Jewitt, C. 2002. The move from page to screen: the multimodal reshaping of school English.
Journal of Visual Communication 1(2): 171–196.
Jewitt, C. 2006. Technology, literacy and learning: a multimodal approach. London: Routledge.
Jewitt, C. 2009. An introduction to multimodality. In The Routledge handbook of multimodal
analysis, ed. C. Jewitt, 14–27. London: Routledge.
Jewitt, C. 2011. The changing pedagogic landscape of subject English in UK classrooms. In
Multimodal studies: exploring issues and domains, ed. K.L. O’Halloran, and B.A. Smith, 184–
201. London: Routledge.
Johnson, K., and H. Johnson. 1999. Encyclopaedic dictionary of applied linguistics: a handbook
for language teaching. Malden: Blackwell Publishers Inc.
Johnson, M., and A. Tyler. 1998. Re-analysing the OPI: how much does it look like natural
conversation? In Talking and testing: discourse approaches to the assessment of oral
proficiency, ed. R. Young, and W. He, 27–51. Philadelphia: John Benjamins.
Jöreskog, K.G. 1993. Testing structural equation models. In Testing structural equation models,
ed. K.A. Bollen, and J.S. Long, 294–316. Newbury Park: Sage Publications.
Jungheim, N.O. 1995. Assessing the unsaid: the development of tests of nonverbal ability. In
Language testing in Japan, ed. J.D. Brown, and S.O. Yamashita, 149–165. Tokyo: JALT.
Jungheim, N.O. 2001. The unspoken element of communicative competence: evaluating language
learners’ nonverbal behaviour. In A focus on language test development: expanding the
language proficiency construct across a variety of tests, ed. T. Hudson, and J.D. Brown, 1–34.
Honolulu: University of Hawaii, Second Language Teaching and Curriculum Centre.
Kaindl, K. 2005. Multimodality in the translation of humour in comics. In Perspectives on
multimodality, ed. E. Ventola, C. Charles, and M. Kaltenbacher, 173–192. Amsterdam: John
Benjamins.
Kalma, A. 1992. Gazing in triads: a powerful signal in floor apportionment. British Journal of
Social Psychology 31: 21–39.
Kane, M.T. 1990. An argument-based approach to validation. Iowa City: The American College
Testing Program.
Kane, M.T. 1992. An argument-based approach to validity. Psychological Bulletin 112(3): 527–
535.
Kane, M.T. 1994. Validating interpretative arguments for licensure and certification examinations.
Evaluation and the Health Professions 17(2): 133–159.
Kane, M.T. 2001. Current concerns in validity theory. Journal of Educational Measurement 38(4):
319–342.
Kane, M.T. 2002. Validating high-stakes testing programs. Educational Measurement: Issues and
Practice 21(1): 31–41.
Kane, M.T. 2004. Certification testing as an illustration of argument-based validation.
Measurement: Interdisciplinary Research and Perspectives 2(3): 135–170.
Kane, M.T. 2006. Validation. In Educational measurement, 4th ed, ed. R. Brennan, 17–64.
Westport: American Council on Education and Praeger.
Kane, M.T. 2010. Validity and fairness. Language Testing 27(2): 177–182.
Kane, M.T., T. Crooks, and A. Cohen. 1999. Validating measures of performance. Educational
Measurement: Issues and Practice 18(2): 5–17.
Kasper, G., and K.R. Rose. 2002. Pragmatic development in a second language. Oxford:
Blackwell.
Kendon, A. 1967. Some functions of gaze-direction in social interaction. Acta Psychologica 26:
22–63.
Kendon, A. 1980. Gesticulation and speech: Two aspects of the process of utterance. In The
relationship of verbal and nonverbal communication, ed. M.R. Key, 207–227. The Hague:
Mouton and Co.
Kendon, A. 1981. The organization of behavior in face-to-face interaction: observations on the
development of a methodology. In Handbook of research methods in nonverbal behavior, ed.
P. Ekman, and K. Scherer, 440–505. Cambridge: Cambridge University Press.
Kendon, A. 1985. Some uses of gesture. In Perspectives on silence, ed. D. Tannen, and
M. Saville-Troike, 215–234. Norwood: Ablex.
Kendon, A. 1996. Gesture in language acquisition. Multilingual 15: 201–214.
Kendon, A. 2004. Gesture: visible action as utterance. Cambridge: Cambridge University Press.
Kim, M. 2001. Detecting DIF across the different language groups in a speaking test. Language
Testing 18(1): 89–114.
Kim, Y. 2009. An investigation into native and non-native teachers’ judgments of oral English
performance: a mixed methods approach. Language Testing 26(2): 187–217.
Kleinke, C.L. 1986. Gaze and eye contact: a research review. Psychological Bulletin 100(1):
78–100.
Knoch, U. 2009. Diagnostic writing assessment: the development and validation of a rating scale.
Frankfurt: Peter Lang.
Knox, J.S. 2008. Online newspapers and TESOL classrooms: a multimodal perspective. In
Multimodal semiotics: functional analysis in contexts of education, ed. L. Unsworth, 139–158.
London: Continuum.
Kok, A.K.C. 2004. Multisemiotic mediation in hypertext. In Multimodal discourse analysis:
systemic-functional perspectives, ed. K.L. O’Halloran, 131–159. London: Continuum.
Kondo-Brown, K. 2002. A FACETS analysis of rater bias in measuring Japanese second language
writing performance. Language Testing 19(1): 3–31.
References 97
the 15th Language Testing Research Colloquium, Cambridge and Arnhem, ed. M. Milanovic,
and N. Saville, 18–33. Cambridge: Cambridge University Press.
Lazaraton, A. 2002. A qualitative approach to the validation of oral language tests. Cambridge:
Cambridge University Press.
Lazaraton, A. 2008. Utilising qualitative methods for assessment. In Encyclopaedia of language
and education, 2nd ed., vol. 7: Language testing and assessment, 197–209. New York:
Springer.
Leathers, D.G., and H.M. Eaves. 2008. Successful nonverbal communication: principles and
applications, 4th ed. New York: Pearson Education Inc.
Lemke, J.L. 2002. Travels in hypermodality. Visual Communication 1(3): 299–325.
Lennon, P. 1990. Investigating fluency in EFL: a quantitative approach. Language Learning 40(3):
387–417.
Leung, C. 2005a. Convivial communication: recontextualising communicative competence.
International Journal of Applied Linguistics 15(2): 119–143.
Leung, C. 2005b. Classroom teacher assessment of second language development: construct as
practice. In Handbook of research in second language teaching and learning, ed. E. Hinkel,
869–888. Mahwah: Lawrence Erlbaum Associates.
Leung, C., and B. Mohan. 2004. Teacher formative assessment and talk in classroom contexts:
assessment as discourse and assessment of discourse. Language Testing 21(3): 335–359.
Levine, P., and R. Scollon (eds.). 2004. Discourse and technology: multimodal discourse analysis.
Washington: Georgetown University Press.
Levinson, S.C. 1983. Pragmatics. Cambridge: Cambridge University Press.
Linn, R.L. 1994. Performance assessment: policy promises and technical measurement standards.
Educational Researcher 23(9): 4–14.
Linn, R.L. 1997. Evaluating the validity of assessments: the consequences of use. Educational
Measurement: Issues and Practice 16(2): 14–16.
Liski, E., and S. Puntanen. 1983. A study of the statistical foundations of group conversation tests
in spoken English. Language Learning 33(2): 225–246.
Little, D. 2006. The Common European Framework of Reference for Languages: content, purpose,
origin, reception and impact. Language Teaching 39(3): 167–190.
Llosa, L. 2007. Validating a standards-based classroom assessment of English proficiency: a
multi-trait multi-method approach. Language Testing 24(4): 489–515.
Lloyd-Jones, R. 1977. Primary trait scoring. In Evaluating writing: describing, measuring,
judging, ed. C.R. Cooper, and L. Odell, 33–66. Urbana: National Council of Teachers of
English.
Long, Y., and P. Zhao. 2009. The interaction study between multimodality and metacognitive
strategy in college English listening comprehension teaching. Computer Assisted Foreign
Language Education 4: 58–74.
Lowe, P. 1985. The ILR proficiency scale as a synthesising research principle: the view from the
mountain. In Foreign language proficiency in the classroom and beyond, ed. C.J. James, 9–54.
Lincolnwood: National Textbook Company.
Lumley, T. 2002. Assessment criteria in a large-scale writing test: what do they really mean to the
raters? Language Testing 19: 246–276.
Lumley, T. 2005. Assessing second language writing: the rater’s perspective. New York: Peter
Lang.
Lumley, T., and A. Brown. 2005. Research methods in language testing. In Handbook of research
in second language teaching and learning, ed. E. Hinkel, 855–933. Mahwah: Lawrence
Erlbaum Associates.
Lumley, T., and B. O’Sullivan. 2005. The effect of test-taker gender, audience and topic on task
performance in tape-mediated assessment of speaking. Language Testing 22(4): 415–437.
Luoma, S. 2004. Assessing speaking. Cambridge: Cambridge University Press.
Lynch, B. 2001. Rethinking assessment from a critical perspective. Language Testing 18(4): 333–
349.
Lynch, B. 2003. Language assessment and programme evaluation. New Haven: Yale.
Macken-Horarik, M. 2004. Interacting with the multimodal text: reflections on image and verbiage
in ArtExpress. Visual Communication 3(1): 5–26.
Macken-Horarik, M., L. Love, and L. Unsworth. 2011. A grammatics ‘good enough’ for school
English in the 21st century: four challenges in realising the potential. Australian Journal of
Language and Literacy 34(1): 9–23.
Maiorani, A. 2009. The Matrix phenomenon. A linguistic and multimodal analysis. Saarbrucken:
VDM Verlag.
Marsh, H.W. 1988. Multi-trait multi-method analyses. In Educational research methodology, and
evaluation: an international handbook, ed. J.P. Keeves, 570–578. Oxford: Pergamon.
Marsh, H.W. 1989. Confirmatory factor analysis of multi-trait multi-method data: many problems
and a few solutions. Applied Psychological Measurement 15: 47–70.
Martin, J.R. 1995. Interpersonal meaning, persuasion and public discourse: Packing semiotic
punch. Australian Journal of Linguistics 15(1): 33–67.
Martin, J.R. 2000. Beyond exchange: Appraisal systems in English. In Evaluation in text:
Authorial stance and the construction of discourse, ed. S. Hunston, and G. Thompson, 142–
175. Oxford: Oxford University Press.
Martin, J.R. 2008. Intermodal reconciliation: mates in arms. In New literacies and the English
curriculum, ed. L. Unsworth, 112–148. London: Continuum.
Martin, J.R., and P.R.R. White. 2005. The language of evaluation: Appraisal in English. London:
Palgrave.
Martinec, R. 2000a. Types of processes in action. Semiotica 130(3): 243–268.
Martinec, R. 2000b. Construction of identity in Michael Jackson’s “Jam”. Social Semiotics 10(3):
313–329.
Martinec, R. 2001. Interpersonal resources in action. Semiotica 135(1): 117–145.
Martinec, R. 2004. Gestures that co-occur with speech as a systematic resource: the realisation of
experiential meanings in indexes. Social Semiotics 14(2): 193–213.
Matsumoto, D. 2006. Culture and cultural worldviews: Do verbal descriptions about culture reflect
anything other than verbal descriptions of culture? Culture and Psychology 12(1): 33–62.
Matsuno, S. 2009. Self-, peer- and teacher-assessments in Japanese university EFL writing
classrooms. Language Testing 26(1): 75–100.
Matthews, M. 1990. The measurement of productive skills: doubts concerning the assessment
criteria of certain public examinations. English Language Teaching Journal 44(2): 117–121.
Matthiessen, C.M.I.M. 2007. The multimodal page: a systemic functional exploration. In New
directions in the analysis of multimodal discourse, ed. T.D. Royce, and W.L. Bowcher, 1–62.
Mahwah: Lawrence Erlbaum Associates.
Maynard, S.K. 1987. Interactional functions of a nonverbal sign: head movement in Japanese
dyadic casual conversation. Journal of Pragmatics 11: 589–606.
Maynard, S.K. 1989. Japanese conversation: self-contextualisation through structure and
interactional management. Norwood: Albex.
Maynard, S.K. 1990. Understanding interactive competence in L1/L2 contrastive context: a case of
backchannel behaviour in Japanese and English. In Language proficiency: defining, teaching,
and testing, ed. L.A. Arena, 41–52. New York: Plenum Press.
McCrimmon, J.M. 1984. Writing with a purpose, 8th ed. Boston: Houghton Mifflin.
McKay, P. 1995. Developing ESL proficiency descriptions for the school context: the NLLIA ESL
band scales. In Language assessment in action, ed. G. Brindley, 3–34. Sydney: National Centre
for English Language Teaching and Research.
McNamara, T. 1990. Item response theory and the validation of an ESP test for health
professionals. Language Testing 7(1): 52–76.
McNamara, T. 1996. Measuring second language performance. London: Longman.
McNamara, T. 2000. Language testing. Oxford: Oxford University Press.
McNamara, T. 2001. Language assessment as social practice: challenges for research. Language
Testing 18(4): 333–349.
McNamara, T., and C. Roever. 2006. Language testing: the social dimension. Oxford: Blackwell
Publishing.
McNeill, D. 1979. The conceptual basis of language. Hillsdale: Lawrence Erlbaum Associates.
McNeill, D. 1992. Hand and mind: what gestures reveal about thought. Chicago: The University
of Chicago Press.
McNeill, D. 1998. Speech and gesture integration. In The nature and functions of gesture in
children’s communication. New directions for child development, ed. J.M. Iverson, and
S. Goldin-Meadow, 11–27. San Francisco: Jossey-Bass.
McNeill, D. (ed.). 2000. Language and gesture. Cambridge: Cambridge University Press.
McNeill, D. 2005. Gesture and thought. Chicago: The University of Chicago Press.
Mehrens, W.A. 1997. The consequences of consequential validity. Educational Measurement:
Issues and Practice 16(2): 16–18.
Messick, S. 1975. The standard problem: meaning and values in measurement and evaluation.
American Psychologist 30(10): 955–966.
Messick, S. 1980. Test validity and the ethics of assessment. American Psychologist 35(11): 1012–
1027.
Messick, S. 1988. The once and future issues of validity: assessing the meaning and consequences
of measurement. In Test validity, ed. H. Wainer, and H.I. Braun, 33–45. Hillsdale: Lawrence
Erlbaum Associates.
Messick, S. 1989a. Meaning and value in test validation: the science and ethics of assessment.
Educational Researcher 18(2): 5–11.
Messick, S. 1989b. Validity. In Educational measurement, 3rd ed, ed. R.L. Linn, 13–103. New
York: American Council on Education & Macmillan Publishing Company.
Messick, S. 1992. Validity of test interpretation and use. In Encyclopaedia of educational
research, 6th ed, ed. M.C. Alkin, 1487–1495. New York: Macmillan.
Messick, S. 1994. The interplay of evidence and consequences in the validation of performance
assessment. Educational Researcher 23(2): 13–23.
Messick, S. 1995. Standards of validity and the validity of standards in performance assessment.
Educational Measurement: Issues and Practice 14(4): 5–8.
Messick, S. 1996. Validity and washback in language testing. Language Testing 13(3): 241–256.
Mickan, P. 2003. What’s your score? An investigation into language descriptors for rating written
performance. Canberra: IELTS Australia.
Milanovic, M., N. Saville, A. Pollitt, and A. Cook. 1996. Developing and validating rating scales
for CASE: theoretical concerns and analyses. In Validation in language testing, ed.
A. Cumming, and R. Berwick, 15–38. Philadelphia: Multilingual Matters Ltd.
Mislevy, R.J. 2003. Substance and structure in assessment arguments. Law, Probability, and Risk
2(4): 237–258.
Mislevy, R.J., L.S. Steinberg, and R.G. Almond. 2003. On the structure of educational
assessments. Measurement: Interdisciplinary Research and Perspectives 1(1): 3–67.
Mislevy, R.J., R.G. Almond, and L.S. Steinberg. 2002. On the roles of task model variables in
assessment design. In Generating items for cognitive tests: theory and practice, ed. S. Irvine,
and P. Kyllonen, 97–128. Hillsdale: Lawrence Erlbaum Associates.
Morrow, K. (ed.). 2004. Insights from the Common European Framework. Oxford: Oxford
University Press.
Mosier, C.I. 1947. A critical examination of the concepts of face validity. Educational and
Psychological Measurement 7(2): 191–205.
Moss, P.A. 1992. Shifting conceptions of validity in educational measurement: implications for
performance assessment. Review of Educational Research 62(3): 229–258.
Munby, J. 1978. Communicative syllabus design. Cambridge: Cambridge University Press.
Myford, C.M. 2002. Investigating design features of descriptive graphic rating scales. Applied
Measurement in Education 15(2): 187–215.
Nakatsuhara, F. 2009. Conversational styles in group oral tests: how is the conversation
co-constructed? Unpublished Ph.D. thesis, The University of Essex, UK.
Nambiar, M.K., and C. Goon. 1993. Assessment of oral skills: a comparison of scores obtained
through audio recordings to those obtained through face-to-face evaluation. RELC Journal 24
(1): 15–31.
Neu, J. 1990. Assessing the role of nonverbal communication in the acquisition of communicative
competence in L2. In Developing communicative competence in a second language: series on
issues in second language research, ed. C.R. Scarcella, S.E. Andersen, and D.S. Krashen, 121–
138. New York: Newbury House Publishers.
Nevo, D., and E. Shohamy. 1984. Applying the joint committee’s evaluation standards for the
assessment of alternative testing methods. Paper presented at the annual meeting of the
American Educational Research Association, New Orleans.
Nevo, B. 1985. Face validity revisited. Journal of Educational Measurement 22(4): 287–293.
Norris, S. 2002. Theoretical framework for multimodal discourse analysis presented via the
analysis of identity construction of two women living in Germany. Unpublished Ph.D. thesis,
Georgetown University, USA.
Norris, S. 2004. Analysing multimodal interaction: a methodological framework. London:
Routledge.
Norris, J.M. 2005. Book review: Common European Framework of Reference for Languages:
learning, teaching, assessment. Language Testing 22(3): 399–405.
Norris, S., and R.H. Jones (eds.). 2005. Discourse in action: introducing mediated discourse
analysis. London: Routledge.
North, B. 1994. Scales of language proficiency: a survey of some existing systems. Washington,
DC: Georgetown University Press.
North, B. 1996. The development of a common framework scale of descriptors of language
proficiency based on a theory of measurement. Unpublished Ph.D. thesis, Thames Valley
University, UK.
North, B. 2000. The development of a common framework scale of language proficiency. New
York: Peter Lang Publishing Inc.
North, B. 2003. Scales for rating language performance: descriptive models, formulation styles,
and presentation formats. TOEFL Monograph, No. TOEFL-MS-24. Princeton: Educational
Testing Service.
North, B. 2010a. Levels and goals: central frameworks and local strategies. In The handbook of
educational linguistics, ed. B. Spolsky, and F.M. Hult, 220–230. Malden: Wiley-Blackwell.
North, B. 2010b. Assessment, certification and the CEFR: an overview. Plenary speech at
IATEFL TEA SIG & EALTA conference, Barcelona, Spain.
North, B., and G. Schneider. 1998. Scaling descriptors for language proficiency scales. Language
Testing 15(2): 217–262.
O’Halloran, K.L. 2000. Classroom discourse in mathematics: a multisemiotic analysis. Linguistics
and Education 10(3): 359–388.
O’Halloran, K.L. 2004. Visual semiosis in film. In Multimodal discourse analysis:
systemic-functional perspectives, ed. K.L. O’Halloran, 109–130. London: Continuum.
O’Halloran, K.L. 2005. Mathematical discourse: language, symbolism and visual images.
London: Continuum.
O’Halloran, K.L. 2008a. Inter-semiotic expansion of experiential meaning: hierarchical scales and
metaphor in mathematics discourse. In New developments in the study of ideational meaning:
from language to multimodality, ed. C. Jones, and E. Ventola, 231–254. London: Equinox.
O’Halloran, K.L. 2008b. Systemic functional-multimodal discourse analysis (SF-MDA):
constructing ideational meaning using language and visual imagery. Visual Communication
7(4): 443–475.
O’Halloran, K.L. 2009. Historical changes in the semiotic landscape: from calculation to
computation. In The Routledge handbook of multimodal analysis, ed. C. Jewitt, 98–113.
London: Routledge.
O’Halloran, K.L. 2011. Multimodal discourse analysis. In Continuum companion to discourse
analysis, ed. K. Hyland, and B. Paltridge, 120–137. London: Continuum.
O’Halloran, K.L., and F.V. Lim. 2009. Sequential visual discourse frames. In The world told and
the world shown: multisemiotic issues, ed. E. Ventola, and A.J.M. Guijarro, 139–156.
Hampshire: Palgrave Macmillan.
O’Loughlin, K.K. 2002. The impact of gender in oral proficiency testing. Language Testing 19(2):
169–192.
O’Malley, J.M., and A.U. Chamot. 1990. Learning strategies in second language acquisition.
Cambridge: Cambridge University Press.
O’Toole, M. 1994. The language of displayed art. London: Leicester University Press.
O’Toole, M. 2010. The language of displayed art, 2nd ed. London: Routledge.
O’Toole, M. 2011. Art vs. computer animation: integrity and technology in “South Park”. In
Multimodal studies: exploring issues and domains, ed. K.L. O’Halloran, and B.A. Smith, 239–
252. London: Routledge.
Ockey, G.J. 2001. Is the oral interview superior to the group oral? Working paper on language
acquisition and education, International University of Japan, vol. 11, pp. 22–41.
Oller, J.W. 1979. Language tests at school. London: Longman.
Oller, J.W. 1983. Evidence for a general language proficiency factor: an expectancy grammar. In
Issues in language testing research, ed. J.W. Oller, 3–10. Rowley: Newbury House.
Oller, J.W., and F.B. Hinofotis. 1980. Two mutually exclusive hypotheses about second language
ability: indivisible or partially divisible competence. In Research in language testing, ed. J.W.
Oller, and K. Perkins, 13–23. Rowley: Newbury House.
Oreström, B. 1983. Turn-taking in English conversation. Lund Studies in English 66. Lund:
CWK Gleerup.
Painter, C. 2007. Children’s picture book narratives: reading sequences of images. In Advances in
language and education, ed. A. McCabe, M. O’Donnell, and R. Whittaker, 40–59. London:
Continuum.
Painter, C. 2008. The role of colour in children’s picture books. In New literacies and the English
curriculum, ed. L. Unsworth, 89–111. London: Continuum.
Painter, C., J.R. Martin, and L. Unsworth. 2013. Reading visual narratives: Image analysis of
children’s picture books. Bristol: Equinox Publishing.
Patri, M. 2002. The influence of peer feedback on self- and peer-assessment. Language Testing 19
(2): 109–132.
Pawley, A., and F.H. Syder. 1983. Two puzzles for linguistic theory: nativelike selection and
nativelike fluency. In Language and communication, ed. J.C. Richards, and R.W. Schmidt,
191–225. London: Longman.
Pienemann, M., and M. Johnston. 1987. Factors influencing the development of language
proficiency. In Applying second language acquisition research, ed. D. Nunan, 89–94.
Adelaide: National Curriculum Resource Centre.
Pike, K.L. 1967. Language in relation to a unified theory of the structure of human behaviour, 2nd
ed. The Hague: Mouton & Co.
Poggi, I. 2001. The lexicon of the conductor’s face. In Language, vision and music, ed.
P. McKevitt, S. Nuallsin, and C. Mulvihill, 271–284. Amsterdam: John Benjamins.
Pollitt, A., and C. Hutchinson. 1987. Calibrating graded assessment: Rasch partial credit analysis
of performance in writing. Language Testing 4(1): 72–92.
Pomerantz, A., and B.J. Fehr. 1997. Conversation analysis: An approach to the study of social
action as sense making practices. In Discourse as social action, discourse studies: a
multidisciplinary introduction, vol. 2, ed. T.A. van Dijk, 64–91. London: Sage Publications.
Popham, W.J. 1990. Modern educational measurement: a practitioner’s perspective. New York:
Prentice Hall.
Popham, W.J. 1997. Consequential validity: right concern—wrong concept. Educational
Measurement: Issues and Practice 16(2): 9–13.
Popham, W.J. 2008. Transformative assessment. Alexandria: Association for Supervision and
Curriculum Development.
Psathas, G. 1995. Conversation analysis: the study of talk-in-interaction. Thousand Oaks: Sage.
Purpura, J. 1999. Learner strategy use and performance on language tests: a structural equation
modelling approach. Cambridge: Cambridge University Press.
Purpura, J. 2004. Assessing grammar. Cambridge: Cambridge University Press.
L2 and EFL writing: a structural equation modelling approach. In New directions for research
in L2 writing, ed. S. Ransdell, and M.L. Barbier, 101–122. Dordrecht: Kluwer Academic.
Scollon, R. 2001. Mediated discourse: the nexus of practice. London: Routledge.
Scollon, R., and S.W. Scollon. 2003. Discourses in place: language in the material world.
London: Routledge.
Scollon, R., and S.W. Scollon. 2004. Nexus analysis: discourse and the emerging internet.
London: Routledge.
Scollon, R., and S.W. Scollon. 2009. Multimodality and language: a retrospective and prospective
view. In The Routledge handbook of multimodal analysis, ed. C. Jewitt, 170–180. London:
Routledge.
Scriven, M. 1967. The methodology of evaluation. In Perspectives on curriculum evaluation, ed.
R.W. Tylor, R.M. Gagne, and M. Scriven, 39–83. Chicago: Rand McNally.
Searle, J.R. 1969. Speech acts: an essay in the philosophy of language. Cambridge: Cambridge
University Press.
Shepard, L.A. 1993. Evaluating test validity. In Review of research in education, vol. 19, ed.
L. Darling-Hammond, 405–450. Washington DC: American Educational Research
Association.
Shepard, L.A. 1997. The centrality of test use and consequences for test validity. Educational
Measurement: Issues and Practice 16(2): 5–8, 13, 24.
Shepard, L.A. 2000. The role of assessment in a learning culture. Educational Researcher 29(7):
4–14.
Shohamy, E. 1981. Inter-rater and intra-rater reliability of the oral interview and concurrent
validity with cloze procedure. In The construct validation of tests of communicative
competence, ed. A.S. Palmer, J.M. Groot, and G.A. Trosper, 94–105. Washington, DC:
TESOL.
Shohamy, E. 1996. Competence and performance in language testing. In Performance and
competence in second language acquisition, ed. G. Brown, K. Malmkjaer, and J. William,
138–151. Cambridge: Cambridge University Press.
Shohamy, E. 2001. The power of tests: a critical perspective of the uses of language tests. London:
Longman.
Shohamy, E., C.M. Gordon, and R. Kraemer. 1992. The effect of raters’ background and training
on the reliability of direct writing tests. Modern Language Journal 76: 27–33.
Shute, V.J. 2008. Focus on formative feedback. Review of Educational Research 78(1): 153–189.
Simpson, J. 2003. Report on BAAL/CUP seminar on multimodality and applied linguistics.
Reading, UK.
Sinclair, J.M., and M. Coulthard. 1975. Towards an analysis of discourse. Oxford: Oxford
University Press.
Skehan, P. 1984. Issues in the testing of English for specific purposes. Language Testing 1(2):
202–220.
Skehan, P. 1995. Analysability, accessibility and ability for use. In Principles and practice in
applied linguistics, ed. G. Cook, and B. Seidlhofer, 91–106. Oxford: Oxford University Press.
Skehan, P. 1996. Second language acquisition research and task-based instruction. In Challenge
and change in language teaching, ed. J. Willis, and D. Willis, 17–30. Oxford: Heinemann.
Smith, D. 2000. Rater judgments in the direct assessment of competency-based second language
writing ability. In Studies in immigrant English language assessment, vol. 1, ed. G. Brindley,
159–189. Sydney: Macquarie University.
Sparhawk, C.M. 1978. Contrastive identificational features of Persian gesture. Semiotica 24: 49–
86.
Spolsky, B. 1986. A multiple choice for language testers. Language Testing 3(2): 147–158.
Spolsky, B. 1989a. Communicative competence, language proficiency and beyond. Applied
Linguistics 10(2): 138–156.
Spolsky, B. 1989b. Conditions for second language learning: introduction to a general theory.
Oxford: Oxford University Press.
Spolsky, B. 1993. Testing and examinations in a national foreign language policy. In National
foreign language policies: practice and prospects, ed. K. Sajavaara, S. Takala, D. Lambert,
and C. Morfit, 124–153. Jyväskyla: Institute for Education Research, University of Jyväskyla.
Spolsky, B. 2008. Introduction: language testing at 25: maturity and responsibility? Language
Testing 25(3): 297–305.
Stein, P. 2008. Multimodal pedagogies in diverse classrooms: representation, rights and
resources. London: Routledge.
Stern, H.H. 1978. The formal-functional distinction in language pedagogy: a conceptual
clarification. Paper presented at the 5th AILA congress, Montreal, Canada.
Stöckl, H. 2004. In between modes: language and image in printed media. In Perspectives on
multimodality, ed. E. Ventola, C. Charles, and M. Kaltenbacher, 9–30. Amsterdam: John
Benjamins.
Street, B.V. (ed.). 1993. Cross-cultural approaches to literacy. Cambridge: Cambridge University
Press.
Suppe, F. 1977. The structure of scientific theories, 2nd ed. Urbana: University of Illinois Press.
Swain, M. 1985. Communicative competence: some roles of comprehensible input and
comprehensible output in its development. In Input in second language acquisition, ed.
S. Gass, and C. Madden, 235–256. New York: Newbury House.
Tan, S. 2009. A systemic functional framework for the analysis of corporate television
advertisements. In The world told and the world shown: multisemiotic issues, ed. E. Ventola,
and A.J.M. Guijarro, 157–182. Hampshire: Palgrave Macmillan.
Tan, S. 2010. Modelling engagement in a web-based advertising campaign. Visual
Communication 9(1): 91–115.
Tarone, E.E., and G. Yule. 1989. Focus on the language learner: approaches to identifying and
meeting the needs of second language learners. Oxford: Oxford University Press.
Teasdale, A., and C. Leung. 2000. Teacher assessment and psychometric theory: a case of
paradigm crossing? Language Testing 17(2): 163–184.
Thibault, P.J. 2000. The multimodal transcription of a television advertisement. In Multimodality
and multimediality in the distance learning age, ed. A. Baldry, 311–385. Campobasso, Italy:
Palladino.
Thorndike, E.L. 1920. A constant error in psychological ratings. Journal of Applied Psychology 4:
469–477.
Thorndike, R.M. 1997. Measurement and evaluation in psychology and education. Upper Saddle
River: Merrill.
Tomasello, M. 2003. Constructing a language: a usage-based theory of language acquisition.
London: Harvard University Press.
Toulmin, S.E. 2003. The uses of argument. Cambridge: Cambridge University Press.
Tseng, C., and J. Bateman. 2010. Chain and choice in filmic narrative: an analysis of multimodal
narrative construction in The Fountain. In Narrative revisited, ed. C.R. Hoffmann, 213–244.
Amsterdam: John Benjamins.
Turner, C.E. 1989. The underlying factor structure of L2 cloze test performance in Francophone,
university-level students: causal modelling as an approach to construct validation. Language
Testing 6(2): 172–197.
Turner, C.E., and J.A. Upshur. 2002. Rating scales derived from student samples: effects of the
scale maker and the student sample on scale content and student scores. TESOL Quarterly 36
(1): 49–70.
Underhill, N. 1987. Testing spoken English. Cambridge: Cambridge University Press.
Unsworth, L., and E. Chan. 2009. Bridging multimodal literacies and national assessment
programs in literacy. Australian Journal of Language and Literacy 32(3): 245–257.
Upshur, J.A., and C.E. Turner. 1995. Constructing rating scales for second language tests. ELT
Journal 49(1): 3–12.
Upshur, J.A., and C.E. Turner. 1999. Systematic effects in the rating of second language speaking
ability: test method and learner discourse. Language Testing 16(1): 82–111.
van Dijk, T.A. 1977. Text and context: exploration in the semantics and pragmatics of discourse.
London: Longman.
van Ek, J.A. 1975. The threshold level in a European unit/credit system for modern language
learning by adults. Strasbourg: Council of Europe.
van Leeuwen, T. 1999. Speech, sound and music. London: Macmillan.
van Leeuwen, T. 2001. Visual racism. In The semiotics of racism, ed. R. Wodak, and M. Reisigl,
333–350. Vienna: Passagen Verlag.
van Leeuwen, T. 2011. The language of colour: an introduction. London: Routledge.
van Lier, L. 1989. Reeling, writhing, drawling, stretching, and fainting in coils: oral proficiency
interviews as conversation. TESOL Quarterly 23(3): 489–508.
van Moere, A. 2007. Group oral test: how does task affect candidate performance and test score?
Unpublished Ph.D. thesis, The University of Lancaster, UK.
Vaughan, C. 1991. Holistic assessment: what goes on in the rater’s mind? In Assessing second
language writing in academic contexts, ed. L. Hamp-Lyons, 111–125. Norwood: Ablex.
Verhoeven, L. 1997. Sociolinguistics and education. In The handbook of sociolinguistics, ed.
F. Coulmas, 389–404. Oxford: Blackwell.
Wainer, H., and H.I. Braun (eds.). 1988. Test validity. Hillsdale: Lawrence Erlbaum Associates.
Wang, Y. 2009. The design of multimodal listening autonomous learning and its effect. Computer
Assisted Foreign Language Education 6: 62–65.
Wang, L., G. Beckett, and L. Brown. 2006. Controversies of standardised assessment in school
accountability reform: a critical synthesis of multidisciplinary research evidence. Applied
Measurement in Education 19(4): 305–328.
Webbink, P. 1986. The power of the eyes. New York: Springer.
Wei, Q. 2009. A study on multimodality and college students’ multiliteracies. Computer Assisted
Foreign Language Education 2: 28–32.
Weigle, S.C. 1994. Effects of training on raters of ESL compositions. Language Testing 11(2):
197–223.
Weigle, S.C. 1999. Investigating rater/prompt interactions in writing assessment: quantitative and
qualitative approaches. Assessing Writing 6(2): 145–178.
Weigle, S.C. 2002. Assessing writing. Cambridge: Cambridge University Press.
Wiener, M., et al. 1972. Nonverbal behaviour and nonverbal communication. Psychological
Review 79: 185–214.
Weir, C.J. 1990. Communicative language testing. Englewood Cliffs: Prentice Hall Regents.
Weir, C.J. 2005. Limitations of the Common European Framework of Reference for Languages
(CEFR) for developing comparable examinations and tests. Language Testing 22(3): 281–300.
White, E.M. 1985. Teaching and assessing writing. San Francisco: Jossey-Bass Inc.
White, S. 1989. Backchannels across cultures: a study of Americans and Japanese. Language in
Society 18: 59–76.
Widaman, K.F. 1985. Hierarchically tested covariance structure models for multi-trait
multi-method data. Applied Psychological Measurement 9: 1–26.
Widdowson, H.G. 1978. Teaching language as communication. Oxford: Oxford University Press.
Wolfe, E.W. 1997. The relationship between essay reading style and scoring proficiency in a
psychometric scoring system. Assessing Writing 4(1): 83–106.
Wolfe, E.W., C. Kao, and M. Ranney. 1998. Cognitive differences in proficient and non-proficient
essay scorers. Written Communication 15: 465–492.
Wolfe-Quintero, K., S. Inagaki, and H.-Y. Kim. 1998. Second language development in writing:
measures of fluency, accuracy and complexity. Honolulu: University of Hawaii at Manoa.
Wolfson, N. 1989. Perspectives: sociolinguistics and TESOL. New York: Newbury House.
Wylie, L. 1977. Beaux gestes: a guide to French body talk. New York: E. P. Dutton.
Xi, X. 2010. How do we go about investigating test fairness? Language Testing 27(2): 147–170.
Yamashiro, A.D. 2002. Using structural equation modelling for construct validation of an English
as a foreign language public speaking rating scale. Unpublished Ph.D. thesis, Temple
University, USA.
References 107
Yang, H., and C.J. Weir. 1998. Validation study of the national College English Test. Shanghai:
Shanghai Foreign Language Education Press.
Young, R. 1995. Discontinuous language development and its implications for oral proficiency
rating scales. Applied Language Learning 6: 13–26.
Young, R., and W. He. 1998a. Language proficiency interviews: a discourse approach. In Talking
and testing: discourse approaches to the assessment of oral proficiency, ed. R. Young, and
W. He, 1–24. Philadelphia: John Benjamins.
Young, R., and W. He (eds.). 1998b. Talking and testing: discourse approaches to the assessment
of oral proficiency. Philadelphia: John Benjamins.
Zebrowitz, L.A. 1997. Reading faces: window to the soul?. Boulder: Westview Press.
Zhang, D. 2009. On a synthetic theoretical framework for multimodal discourse analysis. Foreign
Languages in China 1: 24–30.
Zhang, Z. 2010. A co-relational study of multimodal PPT presentation and students’ learning
achievements. Foreign Languages in China 3: 54–58.
Zhang, D., and L. Wang. 2010. The synergy of different modes in multimodal discourse and their
realisation in foreign language teaching. Foreign Language Research 2: 97–102.
Zhu, Y. 2007. Theory and methodology of multimodal discourse analysis. Foreign Language
Research 5: 82–86.
Zhu, Y. 2008. Studies on multiliteracy ability and reflections on their effects on teaching.
Chapter 3
Research Design and Methods
In accordance with the aims of building an argument for nonverbal delivery in speaking assessment and of designing and validating a rating scale in the context of group discussion assessment, the entire research was chronologically broken down into (1) an argument-building (henceforth AB) phase, (2) a rating scale formulation (henceforth RSF) phase and (3) a rating scale validation (henceforth RSV) phase.
© Springer Science+Business Media Singapore 2016 109
M. Pan, Nonverbal Delivery in Speaking Assessment,
DOI 10.1007/978-981-10-0170-3_3
[Fig. 3.1 An overview of the research design. Dataset 1 (questionnaire results from teachers and learners in the Chinese EFL context), Dataset 2 (150 samples of EFL learners' group discussions in the formative assessment context) and Dataset 3 (teacher and peer rating results) feed the three phases. The AB phase draws on Hymes's notion of communicative competence, Canale and Swain's communicative competence model, Bachman's communicative language ability model, the CEFR's communicative language competence model and studies on nonverbal delivery, with strategic competence (nonverbal delivery) at the core. The RSF phase (Research Phase II; 30 + 20 samples) covers rating scale orientation (assessor-oriented), scoring approach (analytic), rating scale focus (construct-focussed) and rating scale design (theory-based and empirically driven). The RSV phase (100 samples) takes construct validity as the core, set against componential, unitary and argument-based notions of validity, and cross-validates the rating scale quantitatively with a multi-trait multi-method (MTMM) approach to validate the construct validity of the rating scale, and qualitatively with a multimodal discourse analysis (MDA) approach to align scores with multimodal performance and the descriptors.]
Afterwards, as delineated, the RSF phase was carried out in three steps, with RSF-I and RSF-II addressing the operationalisation of language competence and strategic competence on the rating scale. However, the ways in which the band descriptors for the two parts were formulated differed: RSF-I attempted to describe the language competence part on the basis of the assessment domains that teachers and learners in the Chinese EFL context were believed to perceive (Dataset 1). To that end, a questionnaire survey was the main research instrument, and the statistical method was exploratory factor analysis (EFA), to be detailed in Sect. 3.3.1. By comparison, instead of resorting to questionnaires, the strategic competence part of the rating scale was drawn from the findings of the empirical study at the AB phase. As mentioned earlier, the AB phase not only evidenced that the nonverbal delivery employed by learners across different proficiency levels could be differentiated, but also informed the range finders with gradable descriptors at adjacent levels for formulating nonverbal delivery on the rating scale. At this phase, 30 samples of group discussion with an equal distribution of candidates' proficiency levels from Dataset 2 were analysed. Having been formulated into a tentative version, the rating scale was then trialled and prevalidated (RSF-III) on a smaller scale (20 samples from Dataset 2) so as to resolve the issue of practicality and to make modifications, if any, before it was used to rate a larger sample at the RSV phase.
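Exploratory factor analysis itself is best left to dedicated statistical software, but the item-level screening that typically precedes factor extraction can be sketched. The snippet below computes corrected item–total correlations on fabricated five-point Likert responses; all data and names are illustrative and are not the study's actual questionnaire results.

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither list is constant

def item_total_correlations(responses):
    """responses: one list of item scores per respondent.
    For each item, correlate it with the total of the *other* items
    (the corrected item-total correlation used in item screening)."""
    n_items = len(responses[0])
    out = []
    for i in range(n_items):
        item = [r[i] for r in responses]
        rest = [sum(r) - r[i] for r in responses]
        out.append(pearson(item, rest))
    return out

# Fabricated 5-point responses (rows = respondents, columns = items);
# item 4 is deliberately made to run against the other three.
data = [
    [5, 5, 4, 1],
    [4, 4, 4, 2],
    [2, 3, 2, 5],
    [3, 3, 3, 4],
    [5, 4, 5, 2],
]
corrs = item_total_correlations(data)
```

An item with a markedly negative corrected item–total correlation (here, the fourth) would be flagged for reverse-coding or removal before EFA is run.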
The RSV phase also began with a review of the relevant literature, addressing the issue of how a rating scale should be validated. The answers covered not only the conceptualisation of validity in language assessment but also the validation methods. Having cast doubt on the feasibility of argument-based validity (see the dotted arrow in Fig. 3.1), this study reverted to a unitary notion of validity, placing construct validity in the central position. The review of validation methods justified the methods by which the rating scale was cross-validated, namely MTMM (RSV-I) and MDA (RSV-II).
As shown in Fig. 3.1, the RSV phase, particularly RSV-I, involved teacher raters' and peer raters' scoring (Dataset 3) of 100 samples randomly selected from Dataset 2. In practice, all the subscores assigned by teacher and peer raters against the revised rating scale were processed in EQS (see Sect. 3.2.2) to obtain the statistical output of the MTMM model comparison. The model-fit indices would indicate whether the different traits embedded in the intended construct of the rating scale could be consistently measured by different rating methods. However, given the inadequacy of deploying a quantitative approach alone and the uncertainty over whether the assigned subscores were aligned with candidates' de facto performance, the RSV phase proceeded to RSV-II, where an MDA approach was applied.
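The MTMM reasoning can be illustrated numerically: scores on the same trait obtained by different methods (the convergent validities) should correlate more strongly than scores on different traits. The sketch below uses fabricated teacher and peer subscores for two traits; the study's actual analysis was CFA-based model comparison in EQS, which this toy computation does not reproduce.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Fabricated subscores for six candidates, keyed by (trait, method).
scores = {
    ("language", "teacher"):  [3, 4, 2, 5, 4, 3],
    ("language", "peer"):     [3, 4, 2, 5, 3, 3],
    ("nonverbal", "teacher"): [2, 5, 3, 4, 2, 4],
    ("nonverbal", "peer"):    [2, 4, 3, 4, 2, 5],
}

def mtmm(scores):
    """Correlate every pair of (trait, method) score vectors,
    yielding the lower triangle of an MTMM correlation matrix."""
    keys = sorted(scores)
    return {(a, b): pearson(scores[a], scores[b])
            for i, a in enumerate(keys) for b in keys[i + 1:]}

m = mtmm(scores)
# Same trait, different methods: a convergent validity coefficient.
convergent = m[(("language", "peer"), ("language", "teacher"))]
```

In a well-behaved matrix the same-trait/different-method entries exceed the different-trait entries, which is the informal counterpart of the model-fit comparison described above.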
Therefore, the integration of quantitative and qualitative validation methods paved the way for scrutinising whether the proposed rating scale was characterised by the anticipated construct validity and for reaching the best-fitting MTMM model to explain the intended CLA construct. The rating scale would be subject to further modifications should such a need arise from the results of the RSV phase. Ultimately, as illustrated in Fig. 3.1, this project yielded its final product, viz. a rating scale with sound construct validity and practicality for scoring Chinese tertiary EFL learners' performance in group discussion in formative assessment.
3.2 Data
As illustrated in Fig. 3.1, three datasets thread through the whole study and need detailed description. Each dataset is specific to a research aim and was collected independently. This section therefore elaborates on the phase-specific data for the three main phases of the study.
1 The key institutions in China are those granted Project 211 and/or Project 985 status, whereas non-key institutions are those without either grant. These two grants are sound indicators of comparatively high rankings among institutions of higher learning in the Chinese mainland.
for Science and Technology (USST). Table 3.1 outlines the distribution of the data sources, featuring a relative balance between key and non-key institutions as well as geographic diversity among the institutions with which the participants are affiliated. In addition, the participants' majors (liberal arts, engineering, science, law, management, etc.) are also generally spread out.
A total of 1400 questionnaires (1100 for learners and 300 for teachers) were distributed to respondents in the seven institutions specified above in the academic year 2009–2010. Before the questionnaires were administered, the researcher liaised with the coordinators of each institution to clarify how the questionnaires should be administered so as to engage the respondents in completing them conscientiously. On the coordinators' suggestion, the questionnaires were administered to learner respondents in their spoken English class, where one of the topics for oral discussion was what makes a good English speaker in a group discussion. For teacher respondents, the questionnaires were distributed during regular departmental meetings. The administration was designed so that respondents' reluctance would be minimised, thus enhancing response reliability (Table 3.2).
As a result, 1312 questionnaires were returned. For various reasons, such as incomplete responses and detected invalid responses (e.g. all choices being identical; see Sect. 5.2.2 for more details), a few returned questionnaires were discarded.
Among the valid questionnaires, 1039 copies were from learner respondents (return
rate 94.5 %) and 273 from teacher respondents (return rate 91 %).
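The return-rate figures and the straight-lining check mentioned above (all choices being identical) are simple to reproduce; the sketch below uses the counts reported in the text, while the sample responses are invented for illustration.

```python
def is_straight_lined(answers):
    """Flag a questionnaire whose choices are all identical,
    one of the invalid-response patterns screened out above."""
    return len(set(answers)) == 1

def return_rate(valid, distributed):
    """Valid returns as a percentage of questionnaires distributed."""
    return round(100 * valid / distributed, 1)

# Counts reported in the text: 1039 valid of 1100 learner copies,
# 273 valid of 300 teacher copies.
learner_rate = return_rate(1039, 1100)   # 94.5
teacher_rate = return_rate(273, 300)     # 91.0

# Invented responses: the second batch member would be discarded.
batch = [[4, 3, 5, 2], [3, 3, 3, 3]]
flags = [is_straight_lined(r) for r in batch]
```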
Concerning teaching experience, as reflected by the length of English teaching, the range falls between 4.2 and 7.5 years, with a mean of 6.07 years. This indicates that the teacher respondents had accumulated a satisfactory amount of teaching experience, so their responses may largely be deemed reliable and representative in revealing their perceptions of the assessment domains of language competence. The learner respondents' length of English learning basically corresponds to the length of mainstream schooling in China, falling between 6.8 and 10.3 years. The dispersion might be caused by localised language policies in China that fine-tune the starting point of learning English as a compulsory subject. However, with a mean learning length of 8.36 years, it is convincing that the learner respondents as a whole had been exposed to English learning for a rather long period. Therefore, all the returned questionnaires could be regarded as representative of teachers and learners in the Chinese EFL context. Given that the questionnaire was originally devised from the CLA model (see Sect. 5.2.2 for more details), the participants' responses, revealing their perceptions of what constitutes language competence in group discussion, could usefully inform how the language competence part should be formulated.
Dataset 2 was involved in almost every phase of the research, ranging from building an argument for nonverbal delivery in differentiating candidates across proficiency levels to validating the proposed rating scale with quantitative and qualitative approaches. Although Dataset 2 was also collected from the same seven institutions specified above, more complex logistic issues were involved. As such, four aspects of collecting and processing Dataset 2 will be presented below, viz. recording, transcribing, applying and presenting the data (Leech et al. 1995; Thompson 2005).
The ultimate product of this study, viz. a validated rating scale for group discussion in formative assessment in the Chinese EFL context, logically determined that samples of group discussion should be collected as the base data. Therefore, a total of 150 samples of group discussion were collected from the seven institutions previously outlined. Gathered as a data pool, Dataset 2 was then separated into three subsets, subject to further processing and analyses in conformity with the phase-specific research objectives.
More specifically, 30 proficiency-stratified samples of group discussion were used not only to build a further empirical argument for the necessity of embedding nonverbal delivery in speaking assessment (AB phase) but also to depict the discernible nonverbal delivery employed by candidates across a range of proficiency levels (RSF-II). Dataset 2 therefore had to meet a specific requirement: the candidates' proficiency levels had to be predetermined against a reasonable and consistent yardstick. Likewise, RSF-III, with another 20 samples of group discussion, served the purpose of trialling the tentative version of the rating scale so that its practicality could be tested to the fullest possible extent.
It should be pointed out that no sample was "recycled" in any research phase. The remaining 100 samples of group discussion, comprising around 300 candidates' performances, were all reserved for the RSV phase to meet the case-number threshold for quantitative validation. Given the above, the following describes the participants involved in Dataset 2, followed by other details of this dataset.
The Participants
2 The College English Test (CET) is a large-scale, high-stakes written test of English language proficiency at the tertiary level in the Chinese mainland. At present, the test battery is divided into two tests, CET4 and CET6, which differ largely in degree of difficulty. The test is large scale in that millions of candidates sit it yearly, and high stakes in that a host of institutions may take the CET score as one of the thresholds for conferring bachelor's degrees on their graduates.
Data Collection
This part details the procedures by which this dataset was collected. Before the data were recorded, all the participants were informed of the assessment task by the coordinators in each of the seven universities. All the assessments in the form of group discussion were conducted during either Semester 1 or Semester 2 of the academic year 2009–2010. The participants were told approximately one week in advance that they would take part in group discussions of around five minutes as part of formative assessment. With the coordinators' permission, the researcher specified all the topics for the group discussions, covering an extensive range from campus life and cultural differences to other topical issues, all of which were assumed to be familiar to tertiary students so that utterances could be elicited with comparative ease. In addition, none of the topics demands prior professional or academic knowledge. Table 3.3 provides a full list of the group discussion topics for candidates to choose from.
When this dataset was collected, several considerations of data authenticity and naturalness were borne in mind. First, instead of being assigned to a particular group, all the participants in each institution were free to choose their own peer discussants, with a maximum of four participants per group. Second, all the assessments were administered in the candidates' own classroom, a familiar environment that could reduce their anxiety to the minimum degree possible. Another consideration was the agreement reached with all the subject teachers, via the coordinators in each university, that the candidates' performance would not be scored instantly on the spot, in order to guarantee a smooth continuation of the entire assessment process. The last consideration was research ethics clearance, as this study involved video-recording. With the help of the coordinators, all the participants were told that their performance would be audio- and video-recorded for research purposes only. Only those participants who signed the written consent forms would be recorded. They were also told that their performance would not be graded negatively if they were unwilling to be videotaped. As the researcher foresaw the need to present a number of the participants' portraits as snapshots when the proposed rating scale was validated, the written consent also included an agreement to appear in illustrations in this project.
Data Recording
After all the preparations for data collection had been made, the researcher travelled to each institution during the appointed periods, when the participants' formative assessments were due to take place. While the assessment was going on, the coordinator and the subject teacher acted as organisers while the researcher video-recorded the samples of group discussion. Before each group discussion began, either the coordinator or the subject teacher would remind the candidates to perform as naturally as possible and explain that the researcher was present merely for recording purposes. In case any of the participants were unwilling to be videotaped, the recording would be suspended and the researcher would excuse himself from the classroom so that the formative assessment could still be administered as planned.
In order to ensure the best quality of video-recording, the seating arrangement for the group discussion was designed as exemplified in Fig. 3.2. As can be seen, the camera was positioned in the centre of the classroom to capture all the discussants. The seats were arranged in the shape of a crescent so that candidates would be within each other's vision, with the camera in the middle of their crescent-shaped seating.
[Fig. 3.2 The seating arrangement and camera position for recording group discussions]
For the further analyses in each specific research phase, all the samples of group discussion in Dataset 2 needed to be transcribed into both monomodal and multimodal texts. In fact, the transcription of monomodal texts was a step completed prior to multimodal transcription, because the former would be embedded as one tier of the latter. The ensuing part elaborates on the transcription of both types of texts.
The transcription format of spoken language is a serious concern; yet "there is little agreement among researchers about the standardisation of [transcription] conventions" (Lapadat and Lindsay 1999, p. 65), and no strictly standard approach is used to transcribe talk in corpus linguistics research (Cameron 2001). Transcription is nonetheless the basis of any further analysis, and consensus has been reached that it is selective in nature, conventional by design, theoretically motivated, socially situated and methodologically driven (see Atkinson 1992; Edward 1993; Fairclough 1992; Goodwin 1981, 1994; Green et al. 1997; Gumperz 1992; Mehan 1993; Ochs 1979; Roberts 1997).
When the present study proceeded to data transcription, therefore, the researcher considered the issue of reliability and adhered to transcribing the utterances verbatim. Another important concern before transcription was the metadata, without which candidates' utterances would be nothing but a bundle of words of unknowable provenance or authenticity (Burnard 2005). In the present study, particular significance was attached to the header information, one of the basic components of metadata. It includes the institution level (key or non-key), the institution name, the participants' majors, their language proficiency level, their name initials, their genders and the particular topic they chose. Figure 3.3 shows an example of the header information format specified in this study.
<schoollevel=local> </schoollevel>
<school=USST> </school>
<major=wiring> </major>
<level=1> </level>
<speakers sp1=CC, male sp2=ZMJ, male sp3=XB, male> </speakers>
<topic=What is your opinion towards college students' having part-time jobs?> </topic>

As illustrated in Fig. 3.3, several field names constitute the header information; each field name is contained within a set of angle brackets together with its specified value, and is closed with an identical field name preceded by a forward slash. With these field names labelling the corresponding demographic information of the candidates, the needed data could be tracked, sorted and retrieved from Dataset 2 in batches. For example, all the verbal language produced by Group A candidates could be retrieved by setting the field name level to A. The example in Fig. 3.3, therefore, can be interpreted as a sample of a group discussion by three male Group A candidates majoring in wiring at USST, a non-key university in the Chinese mainland; their topic was what is your opinion towards college students' having part-time jobs.
The transcription takes one turn as the basic unit, with each speaker's turn sequence number attached. Figure 3.4 illustrates an excerpt of the transcribed data. As shown, the whole data transcription is contained within a set of markers (<conversation> and </conversation>). Within that set, each speaker's utterances on a turn-by-turn basis are also marked with a starting marker (e.g. <sp1>) and an ending marker (e.g. </sp1>).
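Markup of this kind, both the header fields of Fig. 3.3 and the turn markers, lends itself to simple pattern-based retrieval. The sketch below uses Python's standard re module; the sample strings are shortened and purely illustrative of the formats described here.

```python
import re

# Abridged header in the Fig. 3.3 format (values illustrative).
HEADER = """<schoollevel=local> </schoollevel>
<school=USST> </school>
<level=1> </level>"""

# Abridged turn-by-turn transcription in the Fig. 3.4 format.
SAMPLE = """<conversation>
<sp1> Then let's talk about the topic we choose. </sp1>
<sp2> I'm more inclined to let they stay independent. </sp2>
</conversation>"""

def parse_header(text):
    """Each field name carries its value inside the opening bracket."""
    return dict(re.findall(r"<(\w+)=([^>]*)>", text))

def parse_turns(text):
    """Return (speaker, utterance) pairs on a turn-by-turn basis;
    the backreference \\1 matches each turn's own closing marker."""
    return [(sp, utt.strip())
            for sp, utt in re.findall(r"<(sp\d+)>(.*?)</\1>", text, re.S)]

fields = parse_header(HEADER)
turns = parse_turns(SAMPLE)
```

Filtering a whole dataset then reduces to a comprehension over parsed headers, e.g. keeping only the samples whose level field has a given value.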
<conversation>
<sp1> Then let's talk about the topic we choose. How do you prepare to treat your parents when they are old? To let those living with you stay independence by themselves, or stay in the retirement house? </sp1>
<sp2> I'm more inclined to let they stay independent, because that they need quiet atmosphere and they, what they will stay would be the old level poor. Take my parents for example, en...if they live with me, it's not convenient and en...they can't enjoy their own life. En...because they just speak dialect. But our dialect is very different from the common speech. En...when they were to talk with others, then, there won't make big progress. Or then...they may not a, accustomed to our lifestyle. </sp2>
<sp3> En...I don't think so. I want them to stay with me. En...because if there is no relatives to be with them, they will feel lonely. And as we all know, old people often for your and, they are more particular, they are particularly easily to miss the kids. En...if they lived with us, en...we can take more care of them and give them a good living environment. And we can also en...avoid the long trip to visit them. </sp3>
</conversation>

After the monomodal text transcription was completed, this study continued to transcribe the candidates' nonverbal delivery multimodally. It is arguable that such transcription amounts to annotating the occurrences of cocontextualised nonverbal delivery; however, the distinction between transcription and annotation largely lies in whether the data were perceived directly by the sensory organs.
Annotation ought to be based on a certain theory held by the annotator, who treats the data through theory-laden lenses (Allwood et al. 2003; Garside et al. 1997; Gu 2006, 2009). Considering that "[m]ultimodal texts are composite products of the combined effects of all the resources used to create and interpret them" (Baldry and Thibault 2006, p. 18) and that the critical issue of representing the simultaneity of different modalities has not been ideally resolved (Flewitt et al. 2009), multimodal texts are mainly based on descriptions of what is factually presented by the data. Therefore, as Dataset 2 was processed directly through the researcher's observation, without hinging upon any evaluative subjectivity, this study worked on Dataset 2 in the sense of transcription.
In the present study, ELAN3 was employed as the multimodal transcriber (Version 4.0.1) (see Fig. 3.5 for a screenshot). ELAN allows candidates' verbal utterances to be input and all the occurrences of nonverbal delivery to be transcribed in defined tiers. It can also export all the transcription results along with their time frames, so that both the frequency and the cumulative durations of the specified nonverbal channels can be automatically calculated. In that sense, the transcription can be seen as multiplicative rather than merely additive (Baldry and Thibault 2006; Lemke 1998).
Four main tiers were defined for multimodal text transcription: verbal utterance, eye contact, gesture and head movement. The first tier is the same as the monomodal transcription, recording what candidates verbally produced in the group discussions. It should be noted that at the AB phase this study investigated group-based performance; in other words, irrespective of the number of discussants in a group discussion, their verbal utterances were transcribed into one group-based tier. The other three tiers were defined to transcribe, respectively, the occurrences of the participants' eye contact, gestures and head movements. The nonverbal delivery of different candidates was likewise transcribed into the three allocated tiers at this exploratory phase. However, for the fine-grained investigation following the analytic framework of MDA reviewed in Chap. 2, the transcription of candidates' nonverbal delivery at RSV-II was conducted on an individual basis.
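ELAN's exported annotations (tier name, start time, end time, annotation value) make the frequency and cumulative-duration counts mentioned above straightforward to compute. The sketch below runs on a fabricated export; the tier and label names are merely illustrative of the study's coding scheme.

```python
from collections import defaultdict

# Fabricated ELAN-style export rows: (tier, start_ms, end_ms, label).
annotations = [
    ("eye_contact", 0,    1200, "EC/p"),
    ("eye_contact", 1500, 2100, "EC/n"),
    ("gesture",     300,  900,  "iconic"),
    ("eye_contact", 2500, 4000, "EC/p"),
    ("head",        1000, 1400, "nod"),
]

def tier_summary(annotations):
    """Per (tier, label): occurrence frequency and cumulative
    duration in milliseconds, as derivable from ELAN's time frames."""
    freq = defaultdict(int)
    duration = defaultdict(int)
    for tier, start, end, label in annotations:
        freq[(tier, label)] += 1
        duration[(tier, label)] += end - start
    return dict(freq), dict(duration)

freq, duration = tier_summary(annotations)
```

Run over the real export, such tallies yield, for instance, how often and for how long a candidate maintained each type of eye contact across a discussion.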
The transcription was piloted to reach a general profile of what was to be transcribed. For example, concerning the transcription of eye contact and head movement, it was felt that a prescribed manner of coding directionality, consistent with the analytical framework, could be adopted, which would also facilitate comparatively objective judgement. Basically, a candidate would make eye contact with peer(s) (EC/p), with the researcher (EC/r), with the camera (EC/c), or with nothing in particular (no eye contact at all) or with other physical objects in the classroom (e.g. gazing at the ceiling or looking out of the window) (EC/n). The first three types, whose targets are more specific, can be more easily identified and feasibly transcribed, whereas the last one, with seemingly
3 Freeware downloadable from http://www.lat-mpi.eu/tools/elan (accessed on 9 November 2012).
[Fig. 3.5 A screenshot of ELAN, showing the media file player, the transcription tiers and the automatic retrieval of transcriptions]
As expounded in Fig. 3.1, Dataset 3, pertaining to the RSF and RSV phases, comprised the assessment results based on the proposed rating scale. It was needed in RSF-III (20 samples) and in both steps of the RSV phase (100 samples).
qualitative feedback the researcher collected during a meeting with the expert raters.
When both aspects were addressed, which signified the accomplishment of RSF-III,
the RSF phase as a whole was brought to an end.
After the tentatively proposed rating scale had been fine-tuned in congruence with the research findings of RSF-III, the RSV phase called for teacher- and peer-rating of the remaining 100 samples of group discussion in Dataset 2. Regarding the selection of teacher raters, the researcher thought it unnecessary to invite the same three experienced raters introduced above, largely for two reasons. First, as the three raters had previously been involved in the rating process, they might still adhere to the tentative version of the rating scale even though the revised version was to be used: their first impression of the rating scale might be so sharply etched in their minds that subconscious or unconscious reluctance to accept the revised version could arise from their familiarity with the trial version. Second, the rating scale proposed in this study is intended to be generalisable to formative assessment, an environment that does not and cannot necessarily require experienced raters; nor would it be possible for all EFL teachers to be expert raters.
With the above considerations, another three teacher raters were invited at RSV-I. Although not as experienced as the expert raters, they well epitomised the frontline instructors involved in formative assessment. As with the training session conducted in RSF-III, the raters were given a half-day workshop to become acquainted with the band descriptors and the initial data screening. However, unlike the previous half-day rating process, the rating at RSV-I took much longer given the larger data size. As it was impractical to require all three teacher raters to score the candidates' performance within one consecutive period, they were allowed to take the data away and return the rating results to the researcher within the following five days. This accommodation was partly due to the heavy rating workload and partly based on a consideration of intra-rater reliability, as the longer the rating process lasted, the less reliable the rating results within individual raters might become. The three teacher raters were given an honorarium as a token of appreciation.
When it comes to peer-rating, there was almost no possibility of returning the data to the particular institutions where the samples of group discussion had been collected and requesting the candidates to rate their peers' performance. The main reason lay in logistic constraints: the peer ratings were to be based on the revised rating scale, which only came into being after Dataset 2 had been collected. This study therefore adopted an indirect approach, in which the samples were rated by peers from different institutions to which the researcher had comparatively easy access. The samples of group discussion at different proficiency levels were also rated by the learners of the
At RSF-I, where the assessment domains were designed to be incubated via the
results from the questionnaires, the method of how those domains could be
3.3 Methods and Instruments 129
Fig. 3.7 An EQS example of path diagram with embedded parameter estimates
3.4 Summary

From logistic concerns and the perspective of general design, this chapter has outlined the research procedure, the data, the methods and the research instruments of this study. Based on the literature review of nonverbal delivery and of how to design and validate a rating scale that embeds nonverbal delivery into speaking assessment, the first section details how this study was carried out in a three-phase design. In the AB phase, an argument was advanced for incorporating nonverbal delivery as a dimension for differentiating candidates across a range of proficiency levels. When the study proceeded to the RSF phase, a rating scale informed by that argument was formulated, basically in the domains of language competence and strategic competence, the latter of which, enlightened by the review of the previous literature, can largely be represented by nonverbal delivery. The RSF phase ended with a small-scale prevalidation study in which certain modifications were made to refine the tentative version of the rating scale. The RSV phase, in turn, proceeded along two lines, quantitative and qualitative validation respectively. The second section of this chapter describes the data and profiles a few considerations on data collection, processing and analysis. In particular, with a number of exemplifications, more light is shed on how the three datasets threading through this study were processed and further analysed to serve phase-specific purposes. The last section wraps up the chapter with an elaboration of the statistical methods and the corresponding software used in rating scale formulation and validation.
Chapter 4
Building an Argument for Embedding
Nonverbal Delivery into Speaking
Assessment
This chapter reports on the AB phase of this research, an empirical study that
grounds the entire research project. Prior to advancing to the formulation and
validation of a rating scale as the ultimate product of this project, the study
first needs to build an argument for embedding nonverbal delivery into speaking
assessment. Specifically, an empirical study was conducted on how particular
channels of nonverbal delivery deployed by Chinese EFL learners can be
described, so that not only can their achievement in this respect be
microscopically characterised, but the argument mentioned above can also be
articulated. The research findings at this phase would also inform how the
strategic competence part of the rating scale (RSF-II), mainly reflected by
nonverbal delivery, can be subsequently formulated.
particularising the observable rating scale descriptors for nonverbal delivery
discerning candidates across proficiency levels. Hence, this phase of the study
is crucial in the sense that the wording of modifiers in the band descriptors,
if saliently distinguishable, could reflect gradable changes between adjacent
proficiency levels.
In retrospect, the very first general research question raised in the first
chapter addresses the role that nonverbal delivery plays in EFL learners' spoken
production in group discussions. At this phase of the research, this question
becomes more addressable, since the research objectives above provide pertinent
insights into the fine-grained research questions specific to this phase, as
outlined below. How these questions can be further operationalised is addressed
in the research design section below.
AB-RQ1: What are the main characteristics of Chinese EFL learners’ nonverbal
delivery in group discussion in the context of formative assessment?
AB-RQ2: To what extent can Chinese EFL learners’ employment of nonverbal
delivery be differentiated across different proficiency levels?
AB-RQ3: How does Chinese EFL learners’ nonverbal delivery interact with their
verbal utterance?
4.2 Method
1 These samples were selected in an ascending order of their sequence numbers in each proficiency group.
This part explicates the research findings and discussions on the three most
representative nonverbal channels reviewed earlier. The findings for each
nonverbal channel are reported consecutively in three sections. The first
section mainly deals with the two dimensions of measurement: frequencies/
occurrences and cumulative durations of nonverbal channels. The second section,
beyond the statistical spectrum, takes a closer look at how candidates across
different language proficiency levels instantiate nonverbal channels and what
communicative functions their nonverbal delivery might serve. The last section
touches upon the interaction between verbal language and nonverbal channels so
that the interface between the two modalities can be examined.
Considering the different durations of group discussions, this study standardised the
occurrences of eye contact in each sample to the frequencies in a unit interval of
136 4 Building an Argument for Embedding Nonverbal Delivery …
partly divergent from what was previously uncovered from the frequency
dimension. The echoing part is that Group A ranks first among the three groups
in both dimensions of frequency and cumulative duration. What is divergent is
that not only Group B but also Group C differs from Group A in the EC/p versus
ASD ratio, whereas only Group B is significantly different from Group A in EC/p
frequency.
The duration data, converted into seconds and standardised, were also submitted
to one-way ANOVA, as the data present a normal distribution. The durations of
EC/p are found to be significantly different across groups (see Table 4.4,
p = 0.002 < 0.01). A further post hoc Tamhane's T2 test reveals that Group C
exhibits a significantly shorter duration of EC/p than Group A.
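As a concrete illustration of this analytical step, a one-way ANOVA can be run
in a few lines with SciPy. The sketch below uses invented duration values as
stand-ins for Groups A, B and C; it is not the study's data or software.

```python
# Illustrative one-way ANOVA on standardised EC/p durations (seconds).
# The three groups below are invented stand-ins for Groups A, B and C;
# they are NOT the study's data.
from scipy import stats

group_a = [41.2, 38.5, 45.0, 39.8, 43.1]
group_b = [30.4, 28.9, 33.2, 27.5, 31.0]
group_c = [22.1, 25.3, 20.8, 24.0, 21.7]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```

A p-value below the chosen alpha then licenses a post hoc pairwise test such as
Tamhane's T2, which SciPy does not implement directly.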
Table 4.5 lists the descriptive statistics of EC/r cumulative duration. As
shown, the EC/r versus ASD ratios descend in the order of Group A (30.87 %),
Group B (21.18 %) and Group C (5.36 %). The one-way ANOVA shows a significant
inter-group difference (Table 4.6, p = 0.036 < 0.05), and a further post hoc
Tamhane's T2 test indicates that Group-A candidates spent significantly more
time on EC/r than Group C (p = 0.041 < 0.05). This, to a certain extent, runs
counter to the earlier finding that in terms of the frequency
Table 4.4 One-way ANOVA of EC/p cumulative duration across the groups

                 Sum of squares   df   Mean square   F       Sig.
Between groups   0.495            2    0.248         6.372   0.002
Within groups    5.712            27   0.039
Total            6.207            29
Table 4.6 One-way ANOVA of EC/r cumulative duration across the groups

                 Sum of squares   df   Mean square   F       Sig.
Between groups   0.086            2    0.043         3.512   0.036
Within groups    0.738            27   0.012
Total            0.824            29
dimension, Group C has higher frequencies of EC/r than Group A. More discussion
will be devoted to this issue later.
The findings then turn to the cumulative durations of EC/c across the range of
proficiency levels, as outlined in Table 4.7. Percentagewise, the groups appear
similar in their EC/c versus ASD ratios; nonetheless, it can be deduced from
the maximum duration that the longest sample from Group A (4:46.5) covered
almost the entire ASD of that particular group (5:46.9). It can therefore be
interpreted that the candidates concerned engaged in constant and continuous
EC/c throughout the entire discussion period. The one-way ANOVA finds no
significant difference across the proficiency groups (see Table 4.8,
p = 0.316 > 0.05).
What remains is an account of the candidates' eye contact with other or
non-detectable physical objects (EC/n) in group discussions. As it would not be
Table 4.8 One-way ANOVA of EC/c cumulative duration across the groups

                 Sum of squares   df   Mean square   F       Sig.
Between groups   0.019            2    0.010         1.168   0.316
Within groups    0.665            27   0.008
Total            0.684            29
4.3 Research Findings 139
[Figure: two transcribed examples of eye contact synchronised with verbal
language. Example 1 — sp_1: "Em...I have er...I think we er...we have a lot of
time, we should use em...we should use free time to study more courses. Don't
you agree?"; nonverbal delivery: eye contact with sp_2 (intensify; persuasion);
sp_2: "Er...I don't agree." Example 2 — sp_1: "So she can help me buy the, buy
the thing which I required. All in all, I'd like the friends who are
different"; nonverbal delivery: eye contact (compensate; regulatory).]
As unfolded above, the candidates were generally not observed to be highly
active in initiating EC/p in group discussions, nor were their EC/p durations
long and constant. In a sense, the lack of EC/p might partly be attributable to
an inexact understanding of what they were supposed to do. Most learners of
intermediate and elementary proficiency, if not all, might regard the group
discussion task as a platform on which to voice their own views rather than to
play the role of a group member with active interaction and engagement.
Therefore, they intrinsically discarded EC/p with attentive or persuasive
functions, or performed it poorly.
In response to the finding that Group A outnumbered Group B in EC/p frequency,
it is thought that learners of advanced proficiency, with more exposure to
English learning and incidental culture acquisition, would employ more
conversation management strategies, so that their intended meanings can be
further intensified by, or compensated for by, the accompanying verbiage.
Although there is no significant difference in EC/p frequencies between Group A
and Group C, the corresponding duration of the latter is shorter. This is
because, on the one hand, elementary-level candidates might be excessively
cautious in their discussion, turning to their peers for negotiation or
turn-taking via eye contact; on the other hand, such occurrences of eye contact
were usually brief and unstable. This would only augment the absolute frequency
of Group-C candidates' eye contact with peers, whose duration, nevertheless, is
not proportionate to its frequency of occurrence.
The occurrences of candidates' eye contact, especially those with the
teacher/researcher, were also characterised by an excess of impression
management. Admittedly, eye contact may be employed for impression purposes on
certain occasions. However, in a group discussion where the candidates are
already acquainted with the other discussants, eye contact with someone other
than the discussants should not be encouraged. Prior to video-recording, the
researcher explicitly clarified that the researcher was not an on-the-spot
assessor; despite such reassurance, the candidates still seemed to wrongly
treat the teacher/researcher as their discourse referent.
With regard to differences across the proficiency levels, Group C outnumbered
Group A in EC/r frequency, yet the picture is reversed for cumulative
durations. It is considered that elementary learners geared their discourse
referent to the researcher for fear of committing errors in spoken production.
Each shift in the directionality of their eye contact would not last long,
because such an action was taken just for the sake of receiving "not-that-bad"
reassurance from the on-the-spot researcher. By contrast, Group A, despite the
rather satisfactory mastery of conversation management strategies pinpointed
above, virtually talked to the researcher; Group-A candidates would therefore
explore every possible means of impression making, lengthening their duration
of EC/r. He and Dai (2006) also found that CET-SET candidates would "express their own line
In what follows, the findings on gestures are presented. Table 4.11 lists the
descriptive statistics of gesture frequencies on a sample basis. On the whole,
there was an average of 10.82 gesture occurrences in each observed sample of
group discussion. It can therefore be interpreted that if ASD is standardised
to five minutes with three candidates in each group, the mean gesturing
frequency across all the observed samples was approximately one occurrence per
minute for each candidate. This initially reveals that the candidates did not
frequently resort to gestures synchronised with their verbiage. A comparison
across the groups shows that Group A ranked first with 13.89 gesture
occurrences per sample; Group B and Group C came next with average frequencies
of 10.58 and 7.85, respectively.
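The normalisation behind this per-candidate estimate can be made explicit; the
figures below are those reported in the text (10.82 gestures per sample, a
5-minute ASD, three candidates per group).

```python
# Normalising the reported mean gesture frequency (10.82 per sample)
# to a per-candidate, per-minute rate, given a 5-minute standardised
# discussion (ASD) with 3 candidates per group.
mean_per_sample = 10.82
asd_minutes = 5
candidates_per_group = 3

rate = mean_per_sample / (asd_minutes * candidates_per_group)
print(f"{rate:.2f} gesture occurrences per candidate per minute")
```

The rate works out to about 0.7, which the chapter rounds to roughly one
occurrence per candidate per minute.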
Being normally distributed, the data representing gesture frequencies were
further processed by one-way ANOVA. The test shows a significant difference in
gesture frequencies across the groups (see Table 4.12, p = 0.001 < 0.01). Since
the variances across the proficiency groups are not homogeneous, a post hoc
Tamhane's T2 test was deployed for further inter-group comparison. It is found
that Group-A candidates exhibited statistically more gesture occurrences than
their Group-C counterparts (p = 0.002 < 0.01).
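The homogeneity check that motivates choosing Tamhane's T2 (a post hoc test
that does not assume equal variances) can be sketched with Levene's test; the
counts below are invented, not the study's data.

```python
# Levene's test for homogeneity of variances across three groups of
# (invented) gesture counts; a small p suggests unequal variances,
# in which case an unequal-variance post hoc test such as Tamhane's T2
# or Games-Howell is preferred over Tukey's HSD.
from scipy import stats

group_a = [14, 12, 16, 13, 15, 11, 18, 12]
group_b = [10, 11, 9, 12, 10, 11, 10, 11]
group_c = [8, 7, 9, 6, 8, 7, 9, 8]

w_stat, p_value = stats.levene(group_a, group_b, group_c)
print(f"Levene W = {w_stat:.3f}, p = {p_value:.4f}")
```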
The research findings then turn to the descriptive statistics of cumulative
gesture duration. Gesture duration might not be a sound parameter for
discerning candidates across proficiency levels, and might not be included in
the rating scale descriptors; at this exploratory phase, however, it is
advisable to include this tentative measure, as more insightful and interesting
findings might thus be produced.
As indicated in Table 4.13, be it in cumulative gesture duration or in the
gesture versus ASD ratio, the rankings remain the same, in the order of
Group A, Group B and Group C. Among the groups, the cumulative duration of
gesture in Group-A samples accounted on average for 40.45 % of ASD, indicating
that candidates quite frequently synchronised their verbiage with gestures of
various manifestations. The maximum cumulative gesture duration in Group A
(6′ 48.5″) was even longer than the ASD of that particular group. This is
because gestures by all the candidates in each group were transcribed on the
same time frame, so that the gestures of two or more candidates could be
encoded simultaneously. This extreme case, though comparatively rare, shows
that advanced-level candidates could synchronise their verbal utterances with
gesturing almost entirely.
A similar approach of one-way ANOVA and post hoc Tamhane's T2 was used to test
possible disparities in cumulative gesture duration across the groups. A
significant inter-group difference is found (see Table 4.14, p = 0.045 < 0.05),
and the cumulative gesture duration of Group-C candidates was marginally but
significantly shorter than that of Group A (p = 0.044 < 0.05).
The rough finding so far is that the candidates generally kept a low profile in
employing gestures, yet those of higher proficiency tended to instantiate more
gestures. At this stage, however, a fuller understanding of the candidates' de
facto gesturing cannot be reached without in-depth analyses of their gesture
manifestations. Given this, the findings turn to the descriptive transcriptions
of gestures.
A random sift of the transcription texts shows that a majority of them embed a
number of keywords related to the gestures defined in the present study: HAND,
FINGER, PALM, ARM and FIST, their plural forms included. The concordance
frequencies of these keywords constitute 95.14 % of all the gesture
transcriptions, ensuring that the extracted keywords can to a great extent
account for how the candidates instantiated their gestures. Tentatively, the
keyword HAND(S) was examined first with a view to extracting all the verbs
related to it, because this keyword can be thought of as the most direct word
for describing various gestures. Table 4.15 lists all the HAND(S)-related verbs
across the groups with their respective rankings.
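The study reports concordance-based retrieval; as a simplified stand-in for
that step, the sketch below counts the -ing verbs co-occurring with HAND/HANDS
in a few invented transcription lines.

```python
# Simplified stand-in for the concordance step: count the -ing verbs that
# co-occur with HAND/HANDS in gesture transcriptions. The transcription
# lines here are invented examples, not the study's data.
import re
from collections import Counter

transcriptions = [
    "raising right hand upwards",
    "moving both hands forward",
    "waving left hand slightly",
    "raising both hands from the thighs",
]

verb_counts = Counter()
for line in transcriptions:
    if re.search(r"\bhands?\b", line, re.IGNORECASE):
        verb = re.match(r"(\w+ing)\b", line)  # leading present participle
        if verb:
            verb_counts[verb.group(1).upper()] += 1

print(verb_counts.most_common())
```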
As shown in Table 4.15, 16 verbs were retrieved from both Group A and Group B,
whereas 13 were retrieved from Group C. This disparity basically corresponds
with the previous findings that Group C presents a lower profile in both the
frequency and the cumulative duration of gesture use. A detailed comparison
among the top-ranked verbs further reveals that candidates of all proficiency
levels share the same descriptive verbs with basically similar rankings:
MOVING, RAISING, SHAKING and WAVING.
The next step took a closer look at these shared verbs as a revelation of how
the candidates performed gestures. A pilot screening showed that these verbs
could be divided into two broad categories in relation to the
meaning-productiveness of gestures, as shown by part of the concordance lines
in Figs. 4.4 and 4.5. Referring to the accompanying verbiage of the gesture
transcription for meaning making in Fig. 4.4, MOVING was mainly associated with
the movement of hand(s) for meaning conveyance; RAISING mostly referred to the
use of a hand in yielding the turn to other group members; SHAKING, as its face
meaning suggests, often indicates an act of hand-shaking; what is worth mentioning is that
Two aspects are shared by the different proficiency groups, as can be found in
Table 4.17. One is that candidates across the range of proficiency groups share
many THINK-related chunks, an indicator that when they synchronised their
gestures with verbal utterances, the meanings intended were mostly expressing
their own viewpoints or requesting others' opinions. Regarding the
communicative functions of gestures (Ekman and Friesen 1969), the candidates'
gestures in this aspect should fall into illustrators, because the candidates
resorted to a variety of hand or arm movements in making themselves
comprehended.
The second shared aspect mainly concerns meanings involving adjective or adverb
degrees. As comparative and superlative degrees might serve an emphatic
purpose, this finding illustrates that when learners instantiated meanings with
emphatic foci, they were likely to use gestures in synchronisation with the
accompanying verbiage. Such occurrences of gesture realised the function of
illustrators as far as their communicative function is concerned.
Considering the interaction between verbal language and gesture, it can be
found that the two modalities achieve complementarity in the process of meaning
transmission. For instance, as illustrated in Fig. 4.7, Speaker 1, after
advancing an opinion with the comparative degree more important, yielded his
turn to another discussant; meanwhile, Speaker 1 slightly raised the forefinger
of the left hand upwards as if pointing at something, for an emphasis on the
accompanying verbiage. As the comparative degree in this case expressed
illustrative meaning and the gesture functioned as an illustrator, the
interaction between the two modalities was intensified.
Table 4.17 also shows a few uncommon phraseologies. As can be found, Group B
and Group C tended to synchronise their gestures with the accompanying verbiage
of That's all, a signal of turn termination. In terms of communicative
function, such gestures should fall into adaptors in Ekman and Friesen's (1969)
taxonomy because, instead of gesturing to signal an intention to yield the
turn, the candidates were mostly reassuring themselves of task fulfilment.
Nonetheless, there are also exceptions, as illustrated in Fig. 4.8. When
Speaker 1 finished the turn with That's all, instead of gesturing to invite
other candidates to take the floor, he still seemed to continue his turn by
raising both hands upwards from a resting position on the thighs. The
accompanying verbiage intended for turn termination therefore seemed
inconsistent with what the gesture instantiated; hence, the two modalities
bifurcated.
The phraseology agree with you can also be found in Group A in Table 4.17. By
reading the concordance lines retrieved from the gesture transcription texts,
it can be noticed that advanced-level candidates were able to appropriately use
[Figure 4.8: verbal language — sp_1: "Think about our history and some famous,
famous people and event. Yes. That's all." (turn-ending); nonverbal delivery —
sp_1 raising both hands upwards from a static position resting on the thighs
(adaptor); the two modalities diverge.]
The last nonverbal channel examined in this phase of the study is head
movement, mainly manifested as head nods and shakes. Table 4.18 lists the
descriptive statistics of head movement frequency. The minimum occurrence of
head movement is 1. As far as mean frequencies are concerned, the occurrences
of head movement rank in ascending order of proficiency level: Group C (5.26),
Group B (7.53) and Group A (9.47). If 5 min is again taken as ASD, each
candidate had only about one occurrence of head movement every 2.5 min.
As the data present normal distribution and heterogeneity of variance, one-way
ANOVA and a post hoc Tamhane's T2 test were conducted to detect possible
significant inter-group differences. As shown in Table 4.19, the three
proficiency groups are significantly different from each other
(p = 0.025 < 0.05). The post hoc Tamhane's T2 test further locates this
difference between Group A and Group C (p = 0.026 < 0.05). It can therefore be
interpreted that, in terms of frequency, candidates of higher proficiency
instantiated more head movements than their lower-level counterparts.
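The post hoc step can be illustrated as follows. SciPy does not implement
Tamhane's T2, so this sketch uses Tukey's HSD (`scipy.stats.tukey_hsd`,
available in SciPy 1.8+) as the closest built-in analogue, on invented
head-movement counts.

```python
# Post hoc pairwise comparison after a significant one-way ANOVA.
# Tamhane's T2 (used in the study for unequal variances) is not in SciPy;
# Tukey's HSD is shown here as the closest built-in analogue.
# The counts are invented, not the study's data.
from scipy import stats

group_a = [10, 9, 11, 10, 8, 12, 9, 10]   # hypothetical head-movement counts
group_b = [8, 7, 9, 8, 7, 8, 9, 8]
group_c = [5, 6, 5, 7, 6, 5, 6, 6]

res = stats.tukey_hsd(group_a, group_b, group_c)
# res.pvalue[i][j] is the p-value for comparing group i with group j
print(f"A vs C: p = {res.pvalue[0][2]:.4f}")
```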
Likewise, the cumulative duration of head movements, together with the ratio of
head movement to ASD in each group discussion, was also calculated; Table 4.20
lists these statistics. Impressionistically, the head movement duration versus
ASD ratio, the most obvious parameter indicative of the extent of head movement
instantiation, shows that in Group A approximately 20 % of the discussion
period was accompanied by head movements, while Group B and Group C had
moderately lower percentages in this regard.
When the data were standardised and tested by one-way ANOVA (see Table 4.21), a
significant difference in head movement can be found across the three
proficiency groups (p = 0.004 < 0.05), and a post hoc Tamhane's T2 test further
indicates that Group C was significantly different from Group B
(p = 0.005 < 0.05) and Group A (p = 0.007 < 0.05) in this respect. A brief
summary can therefore be made that, in both dimensions of head movement
frequency and duration, the candidates generally kept their heads in a rather
static position: during group discussion, only about one-fifth of the time
witnessed occurrences of head movement. Group C was significantly different
from Group A in head movement frequency, whereas Group C could be distinguished
significantly from both Group A and Group B from the duration perspective.
At this stage, the statistics can only provide a sketchy profile of how the
candidates performed head movement. Admittedly, in the Chinese social context,
nodding is generally understood as agreement, while head shaking usually
signals disagreement. However, a fuller picture of how the candidates aligned
appropriate head movements with what they expressed verbally can only be
depicted when verbal language is taken into consideration. It is therefore
necessary to examine how head movement interacts with the accompanying verbiage
and to analyse the communicative functions such movements might serve.
Fig. 4.9 Concordance lines of synchronisation between head nod and verbal language
[Figure: an example of divergence between head movement and verbal language.
sp_2: "It doesn't mean you want to lie or something. It just meant you want to
don't hurt others and want to make others more comfortable."; sp_3: nodding
(agreement) while replying "I'm afraid I don't think so. Anyway, a lie is a
lie." (disagreement); the two modalities diverge.]
Fig. 4.11 Concordance lines of synchronisation between head shake and verbal language
the negation signals not or no as context words, as shown in Fig. 4.11, only a
total of 7 occurrences were found. Holistically, when negation was conveyed,
the candidates seemed rarely or reluctantly to accompany their intended
verbiage of negation or disagreement with a head shake.
When the verbal language synchronised with head movements was retrieved in the
form of phraseologies, as encompassed in Table 4.22, it can be found that the
candidates generally expressed their own viewpoints or elicited other
discussants' responses when performing head movements, as evidenced by such
phraseologies as I think, do you think and what about you.
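The phraseology retrieval can be sketched as a simple n-gram count over the
utterances synchronised with head movements; the utterances below are invented,
not the study's transcripts.

```python
# Counting recurrent bigrams in (invented) utterances that were
# synchronised with head movements, as a stand-in for the study's
# phraseology retrieval.
from collections import Counter

utterances = [
    "i think we should study more",
    "what about you",
    "do you think it is good",
    "i think it is better",
    "i think so too",
]

bigrams = Counter()
for u in utterances:
    words = u.split()
    for i in range(len(words) - 1):
        bigrams[" ".join(words[i : i + 2])] += 1

print(bigrams.most_common(3))
```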
However, inter-group differences concerning the phraseologies can also be found
in Table 4.22. Two points are worthy of attention. One is that when
advanced-level candidates expressed the meaning of agreement, there was also
accompanying head movement, as can be cross-validated with the findings above.
The other is that the expression indicative of turn termination, that's all,
was again uttered by Group-B and Group-C candidates. More specifically, when
nearing the end of their turns, they would partially yield the turn by means of
nodding, so that other discussants might be hinted to take the floor. This
finding also corresponds to what was discovered in the section on eye contact,
where candidates, instead of resorting to verbal utterance, might perform eye
contact with other discussants for turn-taking.
Confining the instantiations of head movement to nods and shakes, this phase of
the research finds that the candidates presented more occurrences of nodding
than of head shaking. Generally, they were able to nod when an intended
verbiage of agreement or a signal of backchannelling was required, though
occasional cases of inappropriate nodding also occurred. What follows is a
discussion of these findings.
First, as head movement is one of the most salient nonverbal channels, as
reviewed earlier, candidates would be expected, whenever necessary, to
accompany their verbiage with head nods or shakes in group discussion, a task
which usually elicits conflicts of viewpoints or negotiation. As found above,
however, the frequency of head movement keeps a comparatively low profile: when
verbal utterances intended the meanings of agreement or disagreement,
synchronised head movements were rarely observed. Microscopically, a number of
candidates, particularly those of elementary proficiency, might be unable to
initiate head movement as backchannelling, in that their proficiency may deter
them from fully fathoming what other discussants conveyed. Another possibility
is that they might not pay due attention to others' utterances, so that no
response in the form of head movement could be detected. This also suggests
that head movement tends to be instantiated in contexts where a need for
backchannelling arises. The infrequent head shakes could also be partly
explained by cultural influence: in the Chinese social context, communicators,
out of courtesy, might not frequently shake their heads even in cases of
disagreement. This is consistent with the findings of Jungheim's (2001) study,
in which Japanese EFL learners, contextualised in a similar culture of
courtesy, were found to perform frequent nodding when assessed by native
speakers of English.
Second, as far as the communicative functions of head movements are concerned,
the main purpose of head nods and shakes should be to indicate agreement or
disagreement in an enhanced fashion. If candidates did not appropriately nod or
shake their heads in synchronisation with the intended verbiage, or sometimes
produced head movements only for regulatory purposes as a result of anxiety in
the assessment context, their performance cannot be regarded as communicative.
The degree of appropriateness can therefore serve as one of the dividing lines
for discerning candidates across a range of proficiency levels when head
movement, a salient domain of nonverbal delivery, is incorporated into speaking
assessment.
4.4 Summary
References
Ekman, P., and W.V. Friesen. 1969. Nonverbal leakage and clues to deception. Psychiatry 32:
88–106.
He, L., and Y. Dai. 2006. A corpus-based investigation into the validity of the CET-SET group
discussion. Language Testing 23(3): 370–401.
Hood, S.E. 2007. Gesture and meaning making in face-to-face teaching. Paper presented at the
Semiotic Margins Conference, University of Sydney.
Hood, S.E. 2011. Body language in face-to-face teaching: A focus on textual and interpersonal
meaning. In Semiotic margins: Meanings in multimodalities, ed. S. Dreyfus, S. Hood, and
S. Stenglin, 31–52. London and New York: Continuum.
Jungheim, N.O. 2001. The unspoken element of communicative competence: Evaluating language
learners’ nonverbal behaviour. In A focus on language test development: Expanding the
language proficiency construct across a variety of tests, ed. T. Hudson, and J.D. Brown, 1–34.
Honolulu: University of Hawaii, Second Language Teaching and Curriculum Centre.
Leathers, D.G., and H.M. Eaves. 2008. Successful nonverbal communication: Principles and
applications, 4th ed. Pearson Education, Inc.
Martinec, R. 2000. Types of processes in action. Semiotica 130(3): 243–268.
Martinec, R. 2001. Interpersonal resources in action. Semiotica 135(1): 117–145.
Martinec, R. 2004. Gestures that co-occur with speech as a systematic resource: The realisation of
experiential meanings in indexes. Social Semiotics 14(2): 193–213.
Chapter 5
Rating Scale Formulation
Overall, this phase of study mainly aims to formulate a tentative version of the
rating scale with language competence and strategic competence as two broad
dimensions. As aforementioned, how both dimensions will be formulated is
© Springer Science+Business Media Singapore 2016 159
M. Pan, Nonverbal Delivery in Speaking Assessment,
DOI 10.1007/978-981-10-0170-3_5
5.2 Method
This section presents the research design of RSF-I. Since the participants involved
in this phase of study have already been introduced with regard to their demographic
information, and since exploratory factor analysis, the statistical method
adopted to extract the assessment domains from the questionnaire responses, has also
been outlined in Chap. 4, this part mainly describes the research procedure and
explains the questionnaire design.
Given that the ultimate product of RSF-I would be the assessment domains
and descriptors of language competence on the rating scale, RSF-I was executed in
three steps, as illustrated in Fig. 5.1. The first step was to operationalise various
manifestations of language competence into statements. This step was followed by
the core of RSF-I with questionnaire as a research instrument (see Sect. 5.2.2 for
more details). A good number of operationalised statements concerning language
competence were then itemised for generating the trial versions of questionnaires
for teachers and learners, respectively. The modified questionnaires were dis-
tributed to the respondents after the trial use mainly to disambiguate the band
descriptors. In order to distil the essence of the respondents' perceptions of
what should constitute language competence in group discussion, their ratings
of the questionnaire statements were analysed with EFA. The last step,
deriving from the questionnaire response analyses, proceeded to design the
extracted assessment domains and the corresponding descriptors for measuring
language competence in group discussion on the rating scale.
Fig. 5.1 The three steps of RSF-I: Step 1, operationalise manifestations of language competence into statements; Step 2, formulate a questionnaire based on the operationalisations; Step 3, design the extracted assessment domains and corresponding descriptors
The questionnaires serving as the core research instrument at this phase are intro-
duced in detail in this section. As previously mentioned, the questionnaire could be
regarded as a granular epitome, or the operationalisation of language competence in
the CLA model. The following part presents the conceptual components of lan-
guage competence and a few assumedly aligned operationalisation statements in the
questionnaire (see Appendices V and VII for the trial versions for teacher and
learner respondents, respectively).
Recall that language competence in the CLA model is categorised
into organisational competence and pragmatic competence. The former can be
further divided into grammatical competence (GC) and textual competence (TC),
whereas the latter is composed of illocutionary competence (IC) and sociolinguistic
competence (SC). Blending the nature and practicality of group discussion with
these four domains, and with a view to determining assessment domains and
benchmarking each domain as observable and characterisable, RSF-I adapts and
tabulates the above in Tables 5.1 and 5.2.
From Table 5.1, it can be seen that although the CLA model stratifies different
layers of ingredients regarding organisational competence, modifications are made
to more effectively foreground the components when the assessment task is
Before a presentation of the research findings, this part first dwells on the
threshold values for running EFA on the observed dataset. As reviewed concerning
the threshold indices for EFA (see Sect. 3.3.1), the number of
respondents amounts to 1312, a figure well exceeding the minimum requirement of 300.
With the method of principal component EFA, this phase of study first checks the
threshold values as follows. Table 5.3 shows that the KMO value is 0.758, indi-
cating sound fitness of the dataset for EFA. Bartlett’s test also presents statistical
significance (p = 0.000), which further reveals the appropriateness for the teachers’
and learners’ rating data to run factor analysis.
Table 5.4 reflects the communalities of each item (statement) after extraction in
the factor analysis. As mentioned in Chap. 3, an extraction value above 0.5 is
acceptable for further data interpretation. All the extraction values in
Table 5.4 are above 0.5, showing that a fairly large proportion of the variance in
each item (statement) can be explained by the latent factors.
Then, the research findings of RSF-I move to the core part of EFA deriving from
the questionnaire responses. Table 5.5 presents the factor loadings of each variable
on the latent components when eigenvalue threshold is set at 1.0 by default. Judging
from the results in Table 5.5, four components are extracted from the 16 variables
(statements), whose loadings exceeding 0.3 on each corresponding latent compo-
nent are displayed (loadings below 0.3 are discarded due to poor interpretability).
Component 1, heavily loaded on the variables from GC_1 through GC_6, can be
regarded as one of the main contributors to GC. The only low loading, yet above
0.3, is found in the case of GC_4 (0.319). Component 2 is closely related to the
variables from GC_7 to GC_11, serving as another main contributor to GC.
However, there are two variables with low loadings on this component: GC_7
(0.308) and GC_8 (0.314), indicating that only a marginal proportion of the
variance in either statement is explained by Component 2. Component 3 is heavily loaded
on TC_1 and SC_2, and the remaining three variables (IC_1, IC_2 and SC_1)
mainly contribute to Component 4. Table 5.5 also shows that the latent
components together explain 69.63 % of the accumulated variance, again
a sound indicator of the explanatory power of the extracted components
with regard to what all the statements intend to reflect.
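The extraction procedure just described, principal components with an eigenvalue threshold of 1.0, loadings below 0.3 suppressed, communalities, and cumulative variance, can be sketched as follows. The two-cluster dataset is synthetic and stands in for the 16 questionnaire items; no names here come from the study's instruments.

```python
import numpy as np

def extract_components(data, eigen_threshold=1.0, loading_cutoff=0.3):
    """Principal-component extraction with the Kaiser criterion.

    Returns eigenvalues (descending), unrotated loadings with entries below
    the cutoff suppressed to zero, communalities, and the cumulative variance
    explained by the retained components.
    """
    R = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > eigen_threshold           # Kaiser criterion (eigenvalue > 1)
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    communalities = np.sum(loadings ** 2, axis=1)
    cumulative_variance = eigvals[keep].sum() / len(eigvals)
    display = np.where(np.abs(loadings) >= loading_cutoff, loadings, 0.0)
    return eigvals, display, communalities, cumulative_variance

# Hypothetical two-cluster data standing in for questionnaire items
rng = np.random.default_rng(1)
f1, f2 = rng.normal(size=(300, 1)), rng.normal(size=(300, 1))
data = np.hstack([f1 + 0.6 * rng.normal(size=(300, 4)),
                  f2 + 0.6 * rng.normal(size=(300, 4))])
eigvals, loadings, communalities, cum_var = extract_components(data)
```

The suppressed loading matrix mirrors the presentation in Table 5.5, and the communalities correspond to the extraction column of Table 5.4.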
Another issue to be cross-checked was whether the components extracted were
inter-correlated because in principal component analysis, promax rotation of the
latent factors was selected for maximising data fit. Considering the nature of
non-orthogonal rotation in this method, the correlations between latent factors might
be unpredictably high. Bearing the possible results of promax rotation in mind,
Table 5.6 presents a correlation matrix of the four latent components. It can be
noted that the four components, even after promax rotation, were not highly
correlated with each other, with only two correlation coefficients above 0.3.
One of them is the correlation coefficient between Component 1 and Component 2
(0.338), both of which originally derived from the operationalisations of GC in the
CLA model. It is therefore understandable that this correlation coefficient falls
within an acceptable range. Another, slightly higher, correlation coefficient can be found
between Component 3 and Component 4 (0.502). An initial possible explanation
might be that after the promax rotation, where orthogonal angle was no longer
maintained, these two components might be clustered slightly closer to be more
interdependent. This issue will be re-addressed in the discussion below.
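The cross-check itself is mechanical: flag any off-diagonal coefficient of the component correlation matrix that exceeds 0.3. In the sketch below, 0.338 and 0.502 are the two coefficients reported above; the remaining entries of the matrix are assumed purely for illustration.

```python
import numpy as np

# Hypothetical 4 x 4 component correlation matrix; 0.338 and 0.502 are the
# two coefficients reported in the text, the other entries are assumed.
phi = np.array([
    [1.000, 0.338, 0.120, 0.090],
    [0.338, 1.000, 0.150, 0.110],
    [0.120, 0.150, 1.000, 0.502],
    [0.090, 0.110, 0.502, 1.000],
])

def pairs_above(phi, threshold=0.3):
    """List the component pairs whose correlation exceeds the threshold."""
    idx = np.triu_indices_from(phi, k=1)
    return [(i + 1, j + 1, phi[i, j])
            for i, j in zip(*idx) if abs(phi[i, j]) > threshold]

flagged = pairs_above(phi)
```

Only the pairs (1, 2) and (3, 4) are flagged here, matching the two coefficients discussed in connection with Table 5.6.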
Given what is found above, including four initially extracted components, their
factor loadings and the degree of independence among latent factors, two issues
should be addressed. First, why did certain individual variables fail to be loaded on
the supposedly latent component for a particular assessment domain? Second, how
could these four extracted components inform the formulation of language com-
petence on the rating scale? The following discussion attempts to address these
questions in detail.
5.4 Discussion
speaking volume, pitch and stress. Such convergence not only indicates teachers’
and learners’ shared perceptions in that domain but also reveals that pronunciation
and intonation should be one of the legitimate and key elements in assessing
candidates’ GC. Against this, these elements would be reflected in the formulation
of the rating scale, particularly in the dimension concerning pronunciation and
intonation. The exception, as found above, derives from the statement GC4
speaking smoothly and loudly can help clear communication, with a loading
marginally exceeding 0.3. This means speaking smoothly and loudly in group
discussion does not substantially contribute to respondents’ perceptions of what
should be assessed, which consequently disqualifies that particular element off the
descriptors on the rating scale.
The statements falling into Component 2 from EFA are also relevant to GC
stipulated in the CLA model. However, dissimilar to Component 1, this component
seems to statistically contribute to grammar (correctness and variation) and vo-
cabulary (range, correctness and appropriateness). In that case, it can be assumed
that grammar and vocabulary can be grouped together as another assessment
dimension on the rating scale. Although the statements GC_7 and GC_8 have no
high factor loadings on Component 2, it does not necessarily follow that their
contribution to this latent variable is negligible. A closer reading of both
statements shows that they are intended to capture grammar correctness and
variation. Therefore, the possible reasons for their low loadings could be either
the low salience of grammaticality to teachers and learners in group discussion,
or the foreseeable infeasibility of observing grammar variation in the rating
process. In that context, when the rating scale is formulated, due caution needs
to be taken in describing grammar correctness and variation.
So far the first two latent variables have been discussed. As reviewed in the literature,
most existing rating scales for speaking assessment (see Appendices I through IV)
also “conventionally” include (1) pronunciation and intonation, and (2) grammar
and vocabulary. The discrepancy, if any, among certain analytic rating scales
consists in a further demarcation of assessment domains into more concrete points.
In that sense, these two assessment dimensions extracted from EFA greatly cor-
respond with a majority of prevailing rating scales, and they would also be naturally
set as two dimensions on the rating scale of this study. The discussion then
proceeds to the other latent variables from EFA, which touch upon dimensions
uncommon in existing rating scales.
The findings reveal that Component 3 is loaded with TC_1 and SC_2, and
Component 4 with IC_1, IC_2 and SC_1. It has to be acknowledged that a
majority of these variables were originally designed with a view to operationalising
TC, IC and SC in the CLA model. However, the three intended dimensions have
shrunk into only two latent variables after EFA, and the statements representing
the three dimensions even diverged across the extracted factors. This might have
resulted from the promax rotation, which posed a possible threat to the
independence between the components, as can be cross-validated by the
correlation matrix in Table 5.6. In addition, a host of remaining items, such
as IC_1 and SC_1, carried less heavy loadings on either component.
Against that context, it is reasonable to generalise and integrate the
intended construct of these remaining statements into one unitary component:
discourse management, which covers coherence and cohesion, fluency and topic
development. The naming of this assessment dimension to a great extent is
expected to reflect how candidates can manage their discourse in executing group
discussion.
To draw an interim summary, when language competence in the CLA model
was operationalised into individualised statements, based on which questionnaires
were designed and further administered to teachers and learners in the Chinese EFL
context, what should be assessed regarding language competence in group dis-
cussion was extracted in an exploratory manner. With the method of EFA, RSF-I
extracted four latent variables. The first two corresponded with GC,
comprising pronunciation and intonation, and grammar and vocabulary.
The remaining two latent variables, however, were integrated into one,
provided that no manifestation perceived in the questionnaire was to be
eliminated. This integrated dimension is named Discourse Management.
Having been informed by the results from the questionnaire and the follow-up EFA,
RSF-I in this section embarks upon formulating the part of language competence on
the rating scale. Basically, this part of the rating scale will be presented in two steps.
First, the rating scale for each analytic dimension, together with the corresponding
band descriptors, will be outlined. Second, the specifications will be provided to
illuminate how each band descriptor is brought forth mainly with respect to the
discriminating power across a range of proficiency levels.
Fig. 5.2 Pronunciation and Intonation on the rating scale (five-point scale, 5-1): Pronunciation anchored by Intelligible/Unintelligible and Native/Foreign; Intonation anchored by Appropriate/Inappropriate and Varied/Monotonous
foci on what is supposed to be assessed. What is worth noting is that although this
dimension embeds more than one aspect, when rating proceeds the rater is
supposed to assign only one score on a five-point scale, in an integrated manner,
to evaluate candidates' performance in this regard.
Prior to using this rating scale, raters would routinely be expected to familiarise
themselves with all the band descriptors; a correct and consistent understanding
of these descriptors facilitates the follow-up field rating. Table 5.7
presents the band descriptors for Pronunciation and Intonation.
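To make this scoring convention concrete, the sketch below models an analytic dimension with bipolar reminder anchors and a single integrated 1-5 band per candidate. The class, field and candidate names are purely illustrative and do not come from the study's instruments.

```python
from dataclasses import dataclass, field

@dataclass
class Dimension:
    """One analytic dimension: bipolar anchors, one integrated band score."""
    name: str
    anchors: list                      # bipolar reminder pairs on the scale ends
    bands: tuple = (1, 2, 3, 4, 5)
    scores: dict = field(default_factory=dict)

    def rate(self, candidate: str, band: int) -> None:
        # The scale stipulates exactly one integrated score per dimension
        if band not in self.bands:
            raise ValueError(f"band must be one of {self.bands}")
        self.scores[candidate] = band

pron_int = Dimension(
    name="Pronunciation and Intonation",
    anchors=[("Intelligible", "Unintelligible"), ("Native", "Foreign"),
             ("Appropriate", "Inappropriate"), ("Varied", "Monotonous")],
)
pron_int.rate("Candidate A", 4)
```

The validation in `rate` reflects the constraint that raters weigh all the anchored aspects but record a single band, rather than one score per anchor pair.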
The second dimension on the rating scale, namely Grammar and Vocabulary, which
is extracted from the questionnaire responses, bears much resemblance to the first
dimension formulated above, yet with the slight difference that the adjectives
used on both ends as reminders for raters are more congruent with the keywords
in the questionnaire statements. Figure 5.3 exhibits this dimension on the rating
scale. On the two continuums concerning the subdimension of Grammar,
accurate/inaccurate and varied/monotonous are provided for positioning and
observation purposes. Comparatively, Vocabulary is chiefly manifested by its
observable breadth and depth as well as by whether or not what is conveyed
reflects the idiomatic use expected in the native speech community of English.
Fig. 5.3 Grammar and Vocabulary on the rating scale (five-point scale, 5-1): Grammar anchored by Accurate/Inaccurate and Varied/Monotonous; Vocabulary anchored by Broad-Deep/Narrow-Shallow and Idiomatic/Unidiomatic
Following the same practice as for the first dimension on the rating scale, the
researcher of this study also requires that raters get acquainted with the
dimension of Grammar and Vocabulary. The scales
with modifiers attached on both ends serve the purpose of reminding raters of what
assessment domains should be carefully observed as stipulated in the band
descriptors. Similar to the practice of the first assessment dimension, raters would
also be supposed to assign only one score to this dimension based on their
observation and judgment on candidates’ performance in this aspect.
Table 5.8 lists the detailed band descriptors for the second dimension.
A microscopic look at the descriptors of one particular band will more effectively
enhance an understanding of what constitutes this dimension and how it is drawn
from the results of the questionnaire survey. Take Band 4, a level to be considered as
higher-intermediate proficiency, for example. The first descriptor at this level
indicates the degree of grammaticality, which can tolerate "occasional grammatical
error” only. The second descriptor is laid down with more reference to a consid-
eration of syntactic variation. At this level, candidates would be anticipated to
produce a range of variations though occasional inaccuracy and inflexibility might
be excused. The third descriptor deals with accuracy at the sentence level, still
conforming to the explanatory continuum anchored by accurate and inaccurate.
With regard to the fourth and fifth descriptors, more emphasis is placed on the
twofold aspect of vocabulary. On the one hand, candidates to be assigned to this
level should be proven to have both vocabulary breadth and depth, though
lapses can be tolerated. On the other hand, candidates might not be able to
constantly produce idiomatic expressions, but certain effort towards idiomatic
expression should be detectable. All the foregoing descriptors constitute what
would be expected of candidates falling into that level.
It should be noted that, to better discern candidates across a range of proficiency
levels with regard to their grammar and vocabulary, all the gradable modifiers
between two adjacent levels on the rating scale, such as those indicating
frequency (e.g. constant, frequent) and those indicating degree (e.g. repetitive,
limited), are largely informed by and formulated from the frequency and degree
modifiers deployed in the questionnaire statements.
Fig. 5.4 Discourse Management on the rating scale (five-point scale, 5-1): anchored by Fluency/Disfluency, Coherent/Fragmentary and Developed/Underdeveloped
What is elaborated above addresses the first broad dimension of the rating scale in
this study. RSF-II then dwells upon how strategic competence, mainly as reflected
by nonverbal delivery, can be formulated. As the development of this dimension
heavily relies on the research findings of the AB phase, this section will first
recapture the empirical study that aims at building an argument for embedding
nonverbal delivery into speaking assessment. Afterwards, the dimension of
Nonverbal Delivery, together with its corresponding descriptors, will be presented.
The AB phase, based on a small sample, examines the use of three nonverbal
channels by Chinese college EFL learners in their group discussions in formative
assessment. What follows in this section draws a synopsis of what has been
captured and also proposes how the research findings and discussion can inform
the formulation of Nonverbal Delivery on the rating scale in this study.
In terms of eye contact, candidates generally tended to make relatively little eye
contact with their peers, and there were significant inter-group differences in
frequency and duration. Advanced learners, comparatively, were capable of
resorting to gazing in fulfilling the assessment task and of switching their eye
contact between attentive and persuasive functions when turn-taking was involved.
By contrast, candidates of elementary and intermediate proficiency, in most
respects, gazed at other discussants largely for attentive and regulatory purposes.
In all likelihood, this observation results from their inexact speech referents or a
discrepant mastery of strategic competence. A majority of candidates across
different proficiency levels tended to make eye contact with an aim of impression
management. However, the ultimate goals of doing so were discernibly different
across proficiency levels: advanced learners were more likely to dominate or
impress the discourse referents, whereas those of lower proficiency were prone to
appear timid or fidgety in expressing themselves, or afraid of committing errors,
when shifting their eye contact towards the on-the-spot researcher.
Similarly, when the dimension of gesture was probed into, candidates did not
frequently avail themselves of gestures in synchronisation with their verbiage.
However many occurrences of gestures there might be in group discussions, the
cumulative durations might still be short. Candidates of different proficiency levels
presented certain differences in gesturing in the context, where candidates of
advanced proficiency exhibited better performances in both gesture variety and the
degree to which their gestures could explain or intensify the intended accompa-
nying verbiage. In stark contrast, although candidates of elementary and interme-
diate levels could use gestures to partly illustrate or reinforce accompanying verbal
language, their gestures were still less satisfactory given a dearth in diversity and
With the above recap, the design of the rating scale for formative assessment can
draw resourceful insights from the AB phase research findings. In addition, the
"unconventional" dimension of Nonverbal Delivery can also be formulated in a
describable manner, an avenue first explored by Jungheim (1995), who argues for
the necessity of formulating Nonverbal Ability Scales. Given that the candidates
across various proficiency levels in the AB phase of research exhibit significantly
different performance
on three most salient nonverbal channels, the descriptors of this dimension on the
rating scale should be naturally drawn from what is found regarding the statistical
and descriptive differences among different groups.
Therefore, informed by the research findings and discussions in the AB phase,
particularly the descriptions discerning the employment of nonverbal delivery by
candidates across a range of proficiency levels, RSF-II comes to formulate the part
of nonverbal delivery on the rating scale as shown in Fig. 5.5. Following a similar
approach as practised in formulating language competence on the rating scale, in
terms of layout, the part of Nonverbal Delivery is also characterised by extreme
modifiers on both ends with five possible grades positioned in the centre. The
modifiers still serve to remind raters of what should be primarily observed. For
instance, they are supposed to judge whether a candidate would instantiate a higher
or lower frequency of eye contact with other discussants in a group discussion and
whether the occurrences of eye contact, if any, are mostly durable ones or merely
brief glances. In addition, whether candidates' gestures feature variety or
monotony and whether they can instantiate appropriate head movements are also
etched on both ends of the scale for scoring.
Fig. 5.5 Nonverbal Delivery on the rating scale (five-point scale, 5-1): anchored by Frequent/Infrequent, Durable/Brief, Varied/Monotonous and Appropriate/Inappropriate
Despite the reminders on the rating scale, raters are still expected and strongly
encouraged to familiarise themselves with each individual descriptor so that their
scoring results would not yield inconsistency arising from discrepant
understandings.
The band descriptors for nonverbal delivery on the rating scale are shown in
Table 5.10. The five-level division on this part of the rating scale is the same as
for the previous three dimensions in RSF-I. The band descriptors for each level
revolve around the three nonverbal channels recaptured above. For eye contact,
the measures of frequency and duration are given significant consideration. In
addition to gesturing frequency, whether gestures are characterised by formal
diversity and whether they can perceivably enhance meaning making along with
candidates' verbiage in group discussions are also reflected as domains to be
assessed.
5.7 Summary
Building on what was found in the AB phase of this study, this chapter addresses
the phases of RSF-I and RSF-II, viz. how the rating scale, with a consideration of
embedding nonverbal delivery into speaking assessment, is formulated.
Appendix IX provides a tentative version of the rating scale.
When the part of language competence was formulated on the rating scale, this
study used a questionnaire comprising perceptibly operationalised statements
originating from the CLA model, from which teachers' and learners' ratings in the
Chinese EFL context could be drawn for an extraction of possible assessment
dimensions on the rating scale. After the processing of EFA and a further
discussion on latent variable naming, this phase of study proposed three dimensions
representing the core components of language competence, namely Pronunciation
and Intonation, Grammar and Vocabulary, and Discourse Management. In
particular, Discourse Management emerged from a few remaining salient features
that were not statistically heavily loaded on the intended factors. Therefore, an
integration approach was adopted for the formulation of this dimension.
Afterwards, the rating scale descriptors were developed by referring to certain
modifiers signifying degree and frequency on candidates’ potential performance.
The gradable descriptors were aimed at discriminating candidates across a range of
proficiency levels.
The development of strategic competence was largely based on the research
findings of the empirical study in the AB phase. As it has been found that
candidates of predetermined proficiency levels might be discerned with regard to their
performance in eye contact, gesture and head movements, strategic competence, as
mainly reflected by the dimension of Nonverbal Delivery on the rating scale, can be
developed with the aid of certain observable distinguishing features detected from
the study previously conducted. In a similar vein, certain degree and frequency
modifiers are employed with a view to reflecting the discriminating power of the
gradable descriptors.
Therefore, a tentative rating scale with four dimensions is so far brought forth.
However, considering that this rating scale is still subject to refinement, rather than
directly applying this rating scale for any validation, this study proceeds to RSF-III,
where a prevalidation study is conducted based on expert raters’ trial rating and
their feedback. It is expected that with the results from the trial rating as well as the
suggestions contributed by the expert raters, this rating scale can be further shaped
up for an enhancement of its perceived construct validity and rater-friendliness.
References
Bachman, L.F. 1990. Fundamental considerations in language testing. Oxford: Oxford University
Press.
Bachman, L.F., and A.S. Palmer. 1996. Language testing in practice: Designing and developing
useful language tests. Oxford: Oxford University Press.
Jungheim, N.O. 1995. Assessing the unsaid: The development of tests of nonverbal ability. In
Language testing in Japan, ed. J.D. Brown, and S.O. Yamashita, 149–165. Tokyo: JALT.
Chapter 6
Rating Scale Prevalidation
and Modification
The previous two chapters respectively develop two core components of the rating
scale drawn from the CLA model: language competence and strategic competence.
Generally speaking, the proposed rating scale is developed into a five-band one,
with three dimensions contributing to language competence and one dimension to
strategic competence. Detailed descriptors and discriminating wording between
each two adjacent bands are also substantiated so as to assess potential candidates
with respect to their all-round attainment of communicative language ability in the
context of group discussion. However, due caution should be taken before this
tentatively formulated rating scale proceeds to be validated; it should be first trialled
or, in a sense, prevalidated to eliminate any potential impracticality or
rater-unfriendliness. Bearing the above as a crucial consideration for this phase of
study, this chapter reports on the last step of the RSF phase, where the proposed
rating scale is processed via a small-scale validation by expert rating and judgment
for further rating scale refinement.
With trialling the tentatively proposed rating scale as a point of departure, this phase
of study mainly aims to test the rater-friendliness of this rating scale, viz. the extent
to which expert raters would perceive it as practical, and to adjust and disambiguate
any inappropriate diction that could possibly attenuate the validity of the
proposed rating scale. Expert judgment in this case, therefore, would be expected to
fine-tune the rating scale so that candidates can be even better distinguished
between adjacent proficiency levels. Answers are sought to the following four
research questions.
RSF-III-RQ1: To what extent is the tentatively proposed rating scale valid?
RSF-III-RQ2: To what extent is the tentatively proposed rating scale
rater-friendly?
RSF-III-RQ3: To what extent can the proposed rating scale distinguish candi-
dates across a range of proficiency levels?
RSF-III-RQ4: How can the proposed rating scale be revised?
As a wrapping-up step of the RSF phase, this phase of study was conducted to
initially test the construct validity of the proposed rating scale and also its
practicality, without which the RSV phase could not proceed with full preparations.
This section, therefore, outlines the research procedure and the methods used.
This phase of study serves as a prevalidation comprising three steps anchored in
expert rater scoring and judgment. To commence, three invited expert raters were
requested to score the same 20 samples of group discussion against the tentatively
proposed rating scale. Afterwards, a group interview with them was conducted to
procure feedback, mainly dwelling on the extent to which the tentative rating scale
would be rater-friendly. After the gathering of the expert raters' scoring and the
interview data, namely the raters' responses to the interview
questions along with their suggestions, this phase of study would proceed to the
analyses investigating the construct validity of the proposed rating scale by
correlating the subscores assigned. In addition, the experts’ comments would also
be qualitatively retrieved so as to inform how the rating scale could be better
modified. Upon the completion of all these steps, both the analyses of the scoring
results and the interview responses would be referred to for a refinement of the
rating scale formulation.
RSF-III needed 20 samples of group discussion from Dataset 2, the expert rating
results and the interview data. As how Dataset 2 and Dataset 3 concerning expert
rating were collected has been detailed (see Sects. 3.2.2 and 3.2.3), no further
description is rendered here. However, more elaboration will be made
below on how the interview with the expert raters was conducted, and how the
related data in this phase of study would be processed and analysed.
Based on the above research purposes and design, this section will first unfold the
quantitative findings of the initial examination on the construct validity of the rating
scale proposed, which is followed by the qualitative findings of expert evaluation in
the interview.
Prior to presenting the findings on construct validity, it is necessary to first check the inter-rater reliability of the scores that the three expert raters assigned against the proposed rating scale, based on the candidates' performance in group discussion. Since more than two raters were involved, this study finds it less appropriate to resort to the conventional Kappa coefficient, as that method is typically deployed to scrutinise agreement between two raters only. Instead, correlations among the three raters on the same assessment dimensions of the rating scale were analysed. Since the raters assigned each subscore within a range of 1 to 5, there was no risk of a spurious correlation arising from agreement in rank order rather than in magnitude, so an intra-class correlation check was not required.
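As a minimal sketch of this pairwise-correlation check (the rater labels and toy subscores below are invented for illustration, not data from this study), the agreement among three raters on one dimension can be computed with scipy:

```python
from itertools import combinations

from scipy.stats import pearsonr

# Hypothetical subscores (1-5) that three expert raters assigned to the
# same ten candidates on one assessment dimension; real data would come
# from the 20 rated samples of group discussion.
scores = {
    "Rater_A": [4, 3, 5, 2, 4, 3, 5, 4, 2, 3],
    "Rater_B": [4, 3, 4, 2, 5, 3, 5, 4, 2, 3],
    "Rater_C": [5, 3, 5, 2, 4, 2, 5, 4, 3, 3],
}

# With three raters, a single two-rater Kappa does not suffice, so every
# rater pair is correlated on the same dimension instead.
pairwise_r = {}
for a, b in combinations(scores, 2):
    r, p = pearsonr(scores[a], scores[b])
    pairwise_r[(a, b)] = round(r, 3)

print(pairwise_r)
```

Coefficients above the 0.70 threshold adopted in the study would then be read as satisfactory agreement for that dimension.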
Table 6.2 displays the results of the Pearson correlations as an indication of rating agreement. As there are four assessment dimensions on the proposed rating scale, the correlation analysis was computed dimension by dimension among the raters. As Table 6.2 shows, almost all the correlation coefficients are well above 0.70 (p < 0.01). Although there appears to be some controversy concerning the
Table 6.2 Inter-rater reliability of expert rater scoring
6.3 Research Findings
Before unveiling the correlation matrix of the subscores, a brief picture of the descriptive statistics of the scores is necessary to profile how proficient the candidates were when measured against the tentatively proposed rating scale.
Table 6.3 lists the descriptive statistics of the expert rating results. As indicated, the mean score for Dimension 1 (4.07) is the highest among all the dimension scores. Given that Dimension 1 mainly assesses candidates' pronunciation and intonation, this suggests that the candidates observed had a quite satisfactory command of English pronunciation and intonation, nearly aligning with the near-advanced-level descriptors of the rating scale (Band 4). By comparison, there is no great gap among the mean subscores for the other three dimensions, which fall within a range of 3.29 to 3.85. This indicates that the observed candidates could pass the middle demarcation of the bands (Band 3) on the rating scale. Notably, the mean subscore of Dimension 4 is the lowest (3.29), suggesting that the candidates generally did not attain the anticipated performance on nonverbal delivery. Since the skewness and kurtosis statistics do not indicate a normal distribution of the dataset, Spearman's rho was adopted for nonparametric correlation analysis in the follow-up.
¹For example, Landis and Koch (1977) and Altman (1991) propose that an inter-rater reliability coefficient between 0.60 and 0.80 is considered substantial or good, whereas Fleiss (1981) more loosely sets the range of 0.40 to 0.75 as intermediate to good. This study generally takes 0.70 as the threshold for moderate-to-high correlation strength, as suggested by Gwet (2012).
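The normality screening described above can be sketched as follows (the subscores are fabricated for illustration, and the cut-off of 1.0 is purely illustrative, not the study's criterion):

```python
from scipy.stats import kurtosis, skew

# Fabricated mean subscores for ten candidates on one dimension; real
# values would be the expert ratings summarised in Table 6.3.
scores = [4.3, 4.0, 4.7, 3.3, 4.0, 4.3, 5.0, 4.0, 3.0, 4.3]

# Sample skewness and excess kurtosis near zero are consistent with
# normality; marked departures argue for a nonparametric (Spearman)
# rather than a Pearson correlation in the follow-up analysis.
sk = skew(scores)
ku = kurtosis(scores)  # Fisher definition: 0.0 for a normal distribution
use_spearman = abs(sk) > 1.0 or abs(ku) > 1.0  # illustrative cut-off
print(round(sk, 2), round(ku, 2), use_spearman)
```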
Table 6.4 shows the correlations of the mean subscores assigned by the expert raters. As can be seen, the correlation between every pair of dimensions features quite high coefficients, above 0.70 (p < 0.01). For example, Dimension 1 correlates most highly with Dimension 3, with a coefficient of 0.818 (p < 0.01). To a great extent, this means that although Dimension 1 (Pronunciation and Intonation) and Dimension 3 (Discourse Management) are intended for different domains of assessment, the subscores in these regards are so highly and positively related that a unitary construct is effectively being observed and measured. The same holds for the correlations among the other dimensions on the proposed rating scale.
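An inter-dimension matrix of the kind shown in Table 6.4 can be obtained in one call; in this sketch the dimension labels are taken from the scale but the ten candidates' mean subscores are invented:

```python
import numpy as np
from scipy.stats import spearmanr

# Invented mean subscores (averaged over the three expert raters) for
# ten candidates on the four dimensions PI, GV, DM and ND.
dimensions = ["PI", "GV", "DM", "ND"]
subscores = np.array([
    [4.3, 3.7, 4.0, 3.7],
    [4.0, 3.3, 3.7, 3.0],
    [4.7, 4.3, 4.3, 4.0],
    [3.3, 3.0, 3.0, 2.3],
    [4.0, 3.7, 3.7, 3.3],
    [4.3, 3.7, 4.0, 3.0],
    [4.7, 4.3, 4.7, 4.3],
    [4.0, 3.7, 3.7, 3.3],
    [3.0, 2.7, 2.7, 2.7],
    [4.3, 3.7, 4.0, 3.3],
])

# spearmanr on a 2-D array treats columns as variables and returns the
# full 4 x 4 rho matrix; uniformly high positive values would point to
# a unitary underlying construct.
rho_matrix, p_matrix = spearmanr(subscores)
print(np.round(rho_matrix, 3))
```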
As specified above, after the three expert raters had completed scoring the 20 samples of group discussion using the proposed rating scale, an interview was conducted to obtain their feedback on the structured questions listed in Table 6.1. This part presents the synthesised interview responses to each question, which, taken together, ultimately inform the rating scale modification when the RSF phase concludes.
188 6 Rating Scale Prevalidation and Modification
Interview-Q1: Is it possible that teacher raters and peer raters would misunderstand the rating scale because of the diction in the various band descriptors?
The three expert raters unanimously agreed that, on the whole, the rating scale features clear wording across the bands. However, a few places could be improved so that misunderstanding, if any, is reduced to a minimum.
1. Rater_A pointed out that the wording "foreign accent" in the dimension of "Pronunciation and Intonation" could cause misunderstanding because "foreign accent" cannot be equated with "Chinese accent". Rater_A therefore proposed that "foreign accent" be changed to "Chinese transfer" so that raters would have clearer reference points for what should be observed and compared. This change makes sense because the intended raters are Chinese EFL teachers and learners, for whom "Chinese transfer" might be more directly comprehensible in this particular context.
2. Rater_A also held that the wording "flexibility" in the dimension of "Grammar and Vocabulary" needs clarification, especially regarding what "flexibility" means with respect to syntactic variation. The other two expert raters agreed on the need for such clarification. Rater_C suggested deleting the word "flexibility" because "range of syntactic variation" already largely covers what "flexibility" is intended to denote.
3. Rater_C observed that the rating scale would be presented to raters in its English version, in which context certain unfamiliar terms could confuse peer raters. For example, the dimension of "Discourse Management" contains such terms as "coherence", "cohesion", "connectors" and "discourse markers". While EFL teachers might have a basic understanding of these terms through their research experience, peer raters would find them challenging and be bewildered as to what to observe and what the terms really refer to. However, when suggestions on how to address this flaw were invited, all the expert raters preferred maintaining an English-only rating scale rather than converting it into a bilingual one. How to resolve this issue is therefore taken up in the discussion below.
4. Rater_B pointed out that "expressiveness" in the dimension of "Grammar and Vocabulary", as in "occasional grammatical errors without reducing expressiveness", could be confusing to peer raters. To facilitate their understanding, Rater_B suggested rephrasing the descriptor as "with the intended meaning maintained" so that the intended meanings on the rating scale would be more accessible to users.
5. Rater_B also drew attention to the dimension of "Nonverbal Delivery", in whose descriptors "changeable eye contact" might cause misunderstanding, because raters could be disoriented by the possible opposing pair of "changeable" and "unchangeable", the latter suggesting the opposite extreme of one possible interpretation. The researcher explained that this descriptor was derived from the relevant findings in the AB phase and that "changeable" here refers to a candidate's ability to establish and switch eye contact with different addressees in group discussion as turn-taking occurs. Rater_B therefore suggested replacing it with "manageable" or "controllable" and emphasised conveying the proper interpretation of this wording during rater training.
6. Rater_C remarked that a descriptor in Band 2 of the "Nonverbal Delivery" dimension contains "regulatory gesture", which might be elusive to raters. The researcher responded that "regulatory gesture" was adopted from a previous taxonomy of gesture functions. Rater_C thought it more advisable to avoid technical terms in the descriptors for the sake of comprehensibility. In this context, a descriptor such as "gestures not conducive to verbal language conveyance", the opposite of the corresponding Band 5 descriptor in the same dimension, might be crisper.
Interview-Q2: Is there any need to add more dimensions of descriptors to the rating scale? If so, what should be added?
All the expert raters believed that although the rating scale has only four assessment dimensions, each dimension actually covers multiple traits to be observed. Hence, the scale as a whole already reflects the comprehensiveness of communicative competence as inspired by the CLA model. If one more dimension were added, the practicality of the rating scale might be jeopardised, as teacher and peer raters would be overburdened with too many dimensions during scoring. As the expert raters perceived, from a cognitive perspective, five assessment dimensions might be the maximum cognitive load for raters; any additional domain would distract raters' attention in real rating practice. Moreover, the limited time of group discussion available for on-the-spot scoring also renders a rating scale with more than four dimensions impractical.
Interview-Q3: Is there any need to delete descriptors that would most likely fail to distinguish candidates across different proficiency levels? If so, what should be deleted?
The expert raters suggested that two kinds of descriptors should be considered for deletion. One kind comprises redundant descriptors that are almost fully explained by other descriptors within the same band, viz. "overlapping descriptors". The other comprises descriptors that would not perceivably function well or carry as much discriminating power as expected, viz. "weak descriptors". The following collects the experts' viewpoints on both scenarios.
1. Rater_C pointed out that, in the dimension of "Grammar and Vocabulary", the rating scale features the descriptors "[a]lmost all sentences are error-free" (Band 4) and "[f]requent error-free sentences" (Band 3). However, both descriptors overlap with, or are largely accounted for by, the related descriptors concerning sentential accuracy, such as "[a] range of syntactic variations with occasional inaccuracy" (Band 4). The descriptors in this respect may therefore be deleted.
2. Both Rater_B and Rater_C cast doubt on the feasibility of assessing "idiomatic chunks" as described in the dimension of "Grammar and Vocabulary", because judging whether a chunk is idiomatic depends substantially on a rater's own language proficiency and sensitivity to degrees of idiomaticity. Although incorporating a judgment on chunk idiomaticity is highly recommendable in principle, the potential subjectivity introduced for end-users of the rating scale might be proportionately problematic. In addition, the expert raters noted that the rating scale describes chunk idiomaticity from Band 3 through Band 5, yet no such descriptors appear in the bottom two bands. Furthermore, in describing chunk idiomaticity, the scale jumps abruptly from "frequent use" (Band 5) to "infrequent use" (Band 4) between adjacent bands. Given the above, it is doubtful whether chunk idiomaticity, with the expected power to distinguish candidates of various proficiency levels, should be embedded in the rating scale.
3. Rater_A echoed Rater_B and Rater_C and also observed that modifiers such as "rare" and "occasional" might be interpreted variably or inconsistently, depending on raters' subjective judgment. Rater_A therefore proposed that the two adjacent bands be integrated into one. This is partly because such a solution can avoid rater leniency or harshness incurred by subjective judgments of frequency adverbials, and partly because the three expert raters estimated that even candidates with foreseeably excellent performance could only be categorised under a mix of the descriptors of the top two bands. Accordingly, Rater_A preferred a reduction of the top two bands and suggested that they be condensed into a single band.
During the extended discussion and occasional digression, the three expert raters also offered a good number of insightful suggestions on how the top two bands could be interwoven. In summary, most modifiers in the descriptors were softened (e.g. avoiding absolute wording) so that the revised top band of the rating scale would represent a near-native proficiency level, guided by the notion of communicative competence in the context of group discussion.
Interview-Q4: Can the adjacent bands really reflect gradable descriptions of communicative competence in the context of group discussion? Is there any possibility that two adjacent bands overlap too vaguely?
The expert raters' feedback on the last point of RSF-III-Q3 naturally led to a discussion of RSF-III-Q4. Rater_B and Rater_C shared their perceptions of slight vagueness and overlap between Band 4 and Band 5, noting that only candidates with the best performance in assessment settings might align partially with the descriptors of Band 4 and partially with those of Band 5. Rater_B commented as follows.
I would think it challenging for either teacher raters or peer raters to distinguish the shades of difference in the descriptors between Band 4 and Band 5. In addition, very few candidates in the Chinese EFL context would be able to reach the ideal proficiency level of Band 5. So what can be suggested is that the top two levels be integrated into one single level. Compared with the bottom two levels, the top two levels can be somewhat overlapping in their respective descriptors.
As such, it is worth considering reducing the number of bands from five to four. The detailed revision is unfolded in the next section.
Interview-Q5: How is the layout of the rating scale? Is it easy and friendly for teacher raters and peer raters to understand and use?
As this question involved the practicality of the proposed rating scale, the expert raters spoke freely at their discretion. Most of their inclinations were well informed by their professional practice in using the proposed rating scale, as well as by their previous experience in monitoring rating quality for large-scale, high-stakes assessments.
All the expert raters thought that although presenting the rating scale with extreme modifiers on both ends of a five-point continuum would help remind raters of what should be assessed in each dimension, certain side effects might also arise: too many descriptions or domains would have to be observed in one rating, in which case raters would be more distracted than reminded. Rater_B therefore recommended that the rating scale be physically composed of two parts only: one part with all the detailed band descriptors for rater training and reference, and the other a separate sheet on which raters assign marks for each assessment dimension.
The above are the excerpts and analyses drawn from the group interview conducted after the expert raters had completed scoring the 20 samples of group discussion. Beyond the structured questions, the expert raters also foregrounded rater training. They unanimously and emphatically placed the significance of rater training in the limelight, believing that without it raters would fail to reach a shared understanding of the rating scale descriptors. In that case, scoring results would not be generalisable to other contexts, nor could this study ensure that the intended construct would be measured consistently by raters in the Chinese EFL context. For instance, Rater_A pinpointed the importance of rater training as follows.
Rater training, no matter whether for teacher raters or learner raters, is quite essential for this validation study because, in this way, consensus can be reached concerning some key areas to be observed. Also, rater training, especially on the side of peer raters, can be indispensable, as this group of raters will be likely to judge on their own with little consideration of concordance with what is described in the rating scale.
6.4 Discussion
Departing from both quantitative and qualitative aspects, the previous section shed light on the analyses of the inter-dimension correlations and of how the expert raters perceived the usefulness of the proposed rating scale. Generally speaking, when the subscores assigned by the expert raters were correlated, the dimensions proved to be highly correlated with each other, indicating that the expert raters, after training, were able to measure candidates' communicative competence in group discussion consistently, with a shared construct as reflected in the proposed rating scale. The only exception occurs in the correlations of Dimension 3 (Discourse Management) with the other assessment dimensions. This might be because raters needed to observe various aspects contributing to candidates' competence in managing their discourse, possibly leading to slight divergence in the scoring results. However, as the correlation coefficients still satisfactorily meet the basic requirements for examining the construct validity of the rating scale, its validity can be preliminarily verified in that sense. Based on the qualitative findings and analyses from the interview, the following part discusses, in five facets, how the previously proposed rating scale should be revised in light of the prevalidation analyses.
First, in order to minimise probable misunderstanding of the rating scale caused by descriptor wording, this study takes the advice of the three expert raters. As elicited from the interview, unclear wording involves either ambiguous phrasing or frequency adverbials that trigger differences in raters' subjective judgment. It is therefore essential to revise flawed and ambiguous wording with the aim of removing problematic perceptions from the rating process. This phase of the study accordingly revises the wording problems detected by the expert raters as analysed above. The modified wording, in accordance with the experts' suggestions, is reflected in the revised rating scale in the next section.
Second, there were two options for accommodating peer raters' difficulties in comprehending certain terms adopted in the rating scale descriptors. One option was to provide more examples for peer raters to facilitate their observation and judgment. However, after consultation with the expert raters, the other option was favoured: the examples would not be rendered explicitly on the rating scale; instead, more exemplification would be provided during peer-rater training, together with rated samples of group discussion, so that peer raters would not only know what is meant by such terms as "discourse markers" and "connectors" but could also familiarise themselves with vivid, anchorable examples in training. The issue of unfamiliar terms in the rating scale is thus addressed by means of more informative explanation in the rater training process.
Third, concerning the doubts arising from possibly weak descriptors in certain bands of the rating scale, this study responds by deleting the few descriptors that can be largely explained by an integration of other descriptors within the same band. For instance, as stated above, it is unnecessary to include the descriptor "almost all sentences are error-free" in the dimension of Grammar and Vocabulary for Band 4 because it is substantially covered by "accurate syntactic variation". Therefore, inspired by the expert raters' suggestion, the descriptors that could not foreseeably distinguish candidates across different proficiency levels were eliminated, as reflected in the revised version of the rating scale below.
Fourth, as suggested by the expert raters' feedback in the group interview, this study needs to consider whether the proposed five-band rating scale should be contracted to four bands. The interview analysis indicates that the expert raters foresaw that, at the cost of losing finer distinctions between the top two adjacent bands, a four-band rating scale would be more feasible and rater-friendly, and in particular more practical in identifying the top performers in the assessment. Candidates who perform extraordinarily well can first be categorised into the top band and then provided with more detailed and pertinent feedback on an individual basis, which also echoes what formative assessment uniquely excels at. In addition, the three expert raters doubted whether many candidates would really be assigned Band 5, given how ideally it is described. A reduction in the number of bands, viz. an integration of the top two bands' descriptors, also resonates with the spoken rating scale calibration of the TOEFL iBT (Chapelle et al. 2008), where a four-band rating scale not only retains discriminating power across a range of proficiency levels as well as a five-band one does, but also spares raters the painstaking effort of choosing among five prescribed levels of descriptors. It is also worth noting that aligning candidates' performance in group discussion with a five-band rating scale could be even more challenging for peer raters, who would rarely assign a five to their peers. Hence, the top band in a five-band rating scale would not be as discriminating as expected; certain Band 5 descriptors will accordingly be attenuated and integrated into the Band 4 descriptors.
Fifth, as all the expert raters expressed concern about possible distraction from having more than one reminder at each end of the continuum on the rating scale, this study needs to consider rearranging its layout. Following the expert raters' proposition, the rating scale will retain only the names of the assessment dimensions, while the words placed at the ends of the continuum will be discarded.
Another concern is the necessity of rater training. This issue was not prioritised in the interview questions but was brought forth among the top concerns after consultation with the expert raters, because if raters are not rigorously trained, their understandings are prone to diverge. In addition, to enhance scoring reliability, the rater training process should be deemed an ingredient that helps yield consistent rating results if another group of teacher raters or peer raters is invited to score the samples in this research.
As judged and suggested by the expert raters, this phase of the study produced a rating scale ready for the validation phases. Table 6.5 presents the revised full version of the rating scale, with the necessary modifications of descriptor wording, band ranges and layout.
Table 6.5 The revised rating scale

Band 4
Pronunciation and Intonation: almost no listener effort for intelligibility, with acceptable slips of the tongue only; almost no foreign accent of Chinese transfer; occasional mispronunciation; flexible stress on words and sentences; correctness and variation in intonation at the sentence level
Grammar and Vocabulary: almost no detectable grammatical errors, with only self-repaired minor lapses; a range of syntactic variations (complex and simple structures) with accuracy; vocabulary breadth and depth almost sufficient for natural and accurate expression
Discourse Management: rare repetition or self-correction; effective use of fillers to compensate for occasional hesitation(s); general coherence and cohesion achieved by controllable use of connectors and discourse markers; topic discussed with reasoning, personal experience or other examples for in-depth development, with only minor irrelevance
Nonverbal Delivery: frequent, controllable eye contact with other discussants; frequent and various communication-conducive gestures; evidence of appropriate head nods/shakes

Band 3
Pronunciation and Intonation: detectable accent slightly reducing overall intelligibility; mispronunciations of some words with possible confusion; inappropriate stress on words and sentences reducing meaning conveyance; occasional inappropriate or awkward intonation noticeable at the sentence level
Grammar and Vocabulary: noticeable grammatical errors slightly reducing expressiveness; effective and accurate use of simple structures, with less frequent use of complex structures; vocabulary breadth sufficient for the topic, with less noticeable vocabulary depth; rare use of idiomatic chunks
Discourse Management: a generally continuous flow of utterance can be maintained, yet repetition, self-correction and hesitation over words and grammar are noticeable; coherence and cohesion basically achieved by the use of connectors and discourse markers, though sometimes inappropriately; topic discussed with relevant utterances, but attempts at longer responses are sometimes limited
Nonverbal Delivery: only brief eye contact with other discussants; frequent gestures lacking in variety; head nods/shakes detectable, but sometimes inappropriate

Band 2
Pronunciation and Intonation: effort needed in sound recognition for intelligibility; detectable foreign accent that sometimes causes
Grammar and Vocabulary: noticeable grammatical errors seriously reducing expressiveness; fairly accurate use of simple structures, with inaccuracy in complex structures
Discourse Management: frequent repetition, self-correction and long noticeable pauses over words and grammar; constant use of only a limited number of connectors and discourse markers for
Nonverbal Delivery: infrequent eye contact with other discussants; gestures mostly for non-communicative purposes; inappropriate head nods/shakes

Dimension Score
6.5 Summary
This chapter has dwelt on the prevalidation of the rating scale based on the expert raters' scoring of the 20 samples of group discussion and their judgments concerning possibly problematic wording, discriminating power and other issues of practicality. The experts' judgments and suggestions on the de facto use of the rating scale have informed a multifaceted modification of the rating scale descriptors, band ranges and layout. In addition, the significance of rater training, for both teacher raters and peer raters, is re-emphasised as another outcome of this phase of the study.
This chapter thus serves as a bridge between the formulation and the validation of the rating scale, in which the construct validity and certain practical issues of the tentatively proposed scale were initially examined. The end product of the RSF phase is a revised and supposedly more rater-friendly version of the rating scale, paving the way for the large-scale validation in the next phase.
References
Alderson, J.C. 1993. Judgments in language testing. In A new decade of language testing
research: Selected papers from the 1990 Language Testing Research Colloquium, ed.
D. Douglas, and C. Chapelle, 46–50. Washington, DC: Teachers of English to Speakers of
Other Languages Inc.
Altman, D.G. 1991. Practical statistics for medical research. London: Chapman and Hall.
Chapelle, C.A., M.K. Enright, and J. Jamieson (eds.). 2008. Building a validity argument for the Test of English as a Foreign Language. New York: Routledge.
Fleiss, J.L. 1981. Statistical methods for rates and proportions, 2nd ed. New York: Wiley.
Gwet, K.L. 2012. Handbook of inter-rater reliability: The definitive guide to measuring the extent
of agreement among multiple raters, 3rd ed. Gaithersburg: Advanced Analytics LLC.
Landis, J.R., and G.G. Koch. 1977. The measurement of observer agreement for categorical data.
Biometrics 33: 159–174.
Chapter 7
Rating Scale Validation: An MTMM Approach
On the basis of the rating scale formulated and further revised, the research project proceeds to the validation stage, where the rating scale undergoes a larger-sample validation with the quantitative method previously elaborated, so that the proposed rating scale can be shown to be statistically robust in validly measuring the anticipated construct of communicative competence in candidates' performance in group discussion. At issue in this phase of the study is whether the revised rating scale can be validated through the observation of multiple assessment dimensions coupled with different rating methods. This chapter will first outline certain methodological issues concerning how the quantitative validation, viz. MTMM, is conducted in the RSV-I phase, and then analyse the data, especially the goodness-of-fit statistics, in line with Widaman's (1985) framework of alternative model comparison, to probe whether, and if so how, the different assessment dimensions of the rating scale are modelled.
With the quantitative validation of the revised rating scale as its primary point of departure, this phase of the study has two subsidiary objectives: (1) to compare the alternative CFA MTMM models and select the one that best fits the dataset, namely 100 samples of group discussion; and (2) to determine the parameter estimates of the selected model in order to investigate the extent to which each trait and method factor contributes to it. As such, this phase addresses a single research question: To what extent do different rating methods measure the construct of communicative ability as reflected by the different assessment dimensions of the proposed rating scale?
In reporting the findings of the quantitative validation of the revised rating scale, this subsection unfolds in three consecutive parts. First, the baseline CFA MTMM model specific to the present study, together with all the alternative models, is displayed and probed to explore a range of model-fit indices. Second, as the selection of the best-fitting model largely depends on convergent validity, discriminant validity and the absence of method bias, a triangulated comparison is made to see which model fits the data appropriately and effectively. The last part determines the parameter estimates of the selected model so as to further examine how each factor functions within the model and correlates with the others.
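As one hedged illustration of the nested-model comparisons in Widaman's framework (the chi-square and degrees-of-freedom values below are invented placeholders, not results from this study), the significance of the fit difference between two nested CFA MTMM models can be checked as follows:

```python
from scipy.stats import chi2

# Invented goodness-of-fit statistics for two nested CFA MTMM models:
# a full trait-method model and a restricted comparison model.
chisq_full, df_full = 52.4, 38
chisq_restricted, df_restricted = 88.9, 46

# Widaman-style nested comparison: the chi-square difference is itself
# chi-square distributed, with df equal to the difference in df.
delta_chisq = chisq_restricted - chisq_full
delta_df = df_restricted - df_full
p_value = chi2.sf(delta_chisq, delta_df)
print(round(delta_chisq, 1), delta_df, round(p_value, 4))
```

A significant difference would indicate that the restriction (e.g. dropping the method factors) worsens fit, favouring the fuller model.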
From the perspective of basic model composition, MTMM models contain a
series of linear equations relating dependent variables to independent variables.
Dependent variables are defined as those receiving a path from another variable in
the model and thus appear on the left-hand side of an equation (Kline 2005). In the
case of the present research, the dependent variables are equated with four
assessment dimensions, namely (F1) pronunciation and intonation (PI), (F2)
grammar and vocabulary (GV), (F3) discourse management (DM) and (F4) non-
verbal delivery (ND), which, in an integrated manner, comprise the underlying
communicative ability in the context of group discussion via the rating by teachers
(F5) and peers (F6). On the other hand, independent variables are those that
originate paths but do not receive a path and appear on the right-hand side of an
equation. In this study, the observed variables, viz. all the analytic scores
assigned by teacher and peer raters, serve as the independent variables, and are
conventionally drawn as squares arranged vertically in the centre of the model
diagram. The basic layout of the model can be seen in the figures accompanying
the research findings below.
Table 7.1 outlines the univariate and multivariate statistics for model assumption
checks. Univariate normality is usually tested by referring to skewness and kurtosis.
If skewness and kurtosis values fall within |3.30| (z score at p < 0.01), univariate
normality can accordingly be recognised (Tabachnick and Fidell 2007). As
indicated in Table 7.1, all the skewness and kurtosis values fall within
|1.38| (z score at p < 0.01), showing that the data present univariate
normality. As regards multivariate normality, Mardia's normalised estimate was
checked, with values of 5.00 or below considered to indicate multivariate normality
(Byrne 2006). Table 7.1 also shows that Mardia's normalised estimate is
4.8345, indicating that the observed data do not violate the assumption of
multivariate normality. With the above model assumptions checked, the ensuing section
can reassuringly proceed to conducting the three steps concerning model devel-
opment, comparison and parameter estimate determination specified in Widaman’s
(1985) framework of MTMM model comparison.
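The two screening rules above can be sketched in a few lines of Python. Note that the standard errors sqrt(6/N) and sqrt(24/N) used to convert raw skewness and kurtosis into z statistics are common large-sample conventions assumed here; the study itself does not spell out its formulas.

```python
import math

def skew_z(skewness, n):
    # z statistic for skewness under the common large-sample SE sqrt(6/N)
    return skewness / math.sqrt(6.0 / n)

def kurt_z(kurtosis, n):
    # z statistic for kurtosis under the common large-sample SE sqrt(24/N)
    return kurtosis / math.sqrt(24.0 / n)

def univariate_normal(z_skew, z_kurt, bound=3.30):
    # Both z-statistics must fall within |3.30| (z at p < 0.01;
    # Tabachnick and Fidell 2007)
    return abs(z_skew) <= bound and abs(z_kurt) <= bound

def multivariate_normal(mardia_normalised, cutoff=5.00):
    # Mardia's normalised estimate of 5.00 or below is taken to indicate
    # multivariate normality (Byrne 2006)
    return mardia_normalised <= cutoff

# Hypothetical raw skewness/kurtosis for one observed variable, N = 100
print(univariate_normal(skew_z(0.30, 100), kurt_z(-0.42, 100)))  # True
# Mardia's normalised estimate reported in the study
print(multivariate_normal(4.8345))                               # True
```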
Fig. 7.1 The baseline CFA MTMM model (Model 1). PI Pronunciation and Intonation, GV
Grammar and Vocabulary, DM Discourse Management, ND Nonverbal Delivery, T-rating
Teacher-rating, P-rating Peer-rating
The first model, representing the hypothesised CFA MTMM model as shown in
Fig. 7.1, is the baseline model against which all the subsequent alternative MTMM
models are compared. This model designates the traits (assessment dimensions) to
be correlated in pairs and the scoring methods independent of each other. The
reason why the baseline model is designed with a consideration of uncorrelated
scoring methods is that either teacher-rating or peer-rating should be regarded as
unique. Since in MTMM models estimating the factor loadings is the primary focus,
instead of fixing factor loadings, the variances of factors are fixed to 1 for the
purpose of model identification. Therefore, all factor loadings and the covariances
among the trait factors are freely estimated. However, just as previously
justified, covariances among method factors are constrained to be 0 in the
baseline model, given that each scoring method is unique. As is shown in
Table 7.2, the fit
indices indicate the baseline model (Model 1) provides a good fit for the data
(χ2(28) = 462.796, p = 0.818; CFI = 1.000; NNFI = 1.024; SRMR = 0.015;
RMSEA = 0.000; 90 % C.I. = 0.000, 0.060).
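A minimal sketch of how such fit indices can be screened against cutoffs. The thresholds below (CFI at or above 0.95, SRMR at or below 0.08, RMSEA at or below 0.06) are widely used rules of thumb and are an assumption here, not necessarily the criteria adopted in this study.

```python
def acceptable_fit(cfi, srmr, rmsea,
                   cfi_min=0.95, srmr_max=0.08, rmsea_max=0.06):
    # Screen a model's fit indices against rule-of-thumb cutoffs
    checks = {
        "CFI": cfi >= cfi_min,
        "SRMR": srmr <= srmr_max,
        "RMSEA": rmsea <= rmsea_max,
    }
    return all(checks.values()), checks

# Baseline model (Model 1) indices reported in the text
ok, detail = acceptable_fit(cfi=1.000, srmr=0.015, rmsea=0.000)
print(ok)    # True

# Model 2 indices: the CFI check fails, so the model is rejected
ok2, _ = acceptable_fit(cfi=0.528, srmr=0.106, rmsea=0.043)
print(ok2)   # False
```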
Model 2 specifies that no trait is present in the model, only the two scoring
methods, themselves uncorrelated, as is displayed in Fig. 7.2.

Table 7.2 Fit indices for the baseline model (Model 1)
Bentler–Bonett normed fit index = 0.995
Bentler–Bonett non-normed fit index = 1.024
Comparative fit index (CFI) = 1.000
Root mean square residual (RMR) = 0.008
Standardised RMR = 0.015
Root mean square error of approximation (RMSEA) = 0.000
90 % confidence interval of RMSEA = (0.000, 0.060)

As indicated by the goodness-of-fit statistics shown in Table 7.3, the fit for
this model is extremely poor (χ2(19) = 59.716, p = 0.000; CFI = 0.528;
NNFI = 0.894; SRMR = 0.106; RMSEA = 0.043; 90 % C.I. = 0.076, 0.136),
indicating that this model cannot plausibly explain the observed data.
Following Model 2, which eschews any trait, Model 3, as displayed in
Fig. 7.3, integrates all the observed variables into one latent variable,
Communicative Language Ability. As with the baseline model, each observed
variable loads on both a trait and a method factor in Model 3. However, the
correlations among the trait factors are fixed to 1, thus treating the four
traits as one overall "umbrella factor". As is shown in Table 7.4, the
goodness-of-fit results indicate that the fit of this model is marginally
acceptable, albeit substantially worse than that of the baseline model
(χ2(11) = 37.116, p = 0.000; CFI = 0.854; NNFI = 0.882; SRMR = 0.056;
RMSEA = 0.111; 90 % C.I. = 0.073, 0.151).
As presented in Fig. 7.4, another alternative model is Model 4, which differs
from the baseline model only in that the trait factors are left uncorrelated.
This lack of correlation among the traits enables a comparison that provides
evidence of the extent to which the traits are distinct from one another. The
fit indices shown in Table 7.5 reveal that Model 4 does not meet the
statistical criteria of fit (χ2(12) = 84.882, p = 0.000; CFI = 0.871;
NNFI = 0.699; SRMR = 0.211; RMSEA = 0.178; 90 % C.I. = 0.143, 0.213).
Model 5, as displayed in Fig. 7.5, can be regarded as the least restrictive
model (Schmitt and Stults 1986; Widaman 1985) in that both trait and method
factors are specified and correlations among the traits and among the methods
are both allowed. Comparing this model with the baseline model provides the
discriminant evidence related to the method factors. A review of the
goodness-of-fit results shows that this model fits the data exceptionally well
(χ2(5) = 454.251, p = 0.813; CFI = 0.998; NNFI = 1.017; SRMR = 0.019;
RMSEA = 0.009; 90 % C.I. = 0.000, 0.079). However, as Model 5 correlates the
two method factors, whether this model is more interpretable in the context of the
present study still needs to be further explored and accounted for in the follow-up
discussion (Table 7.6).
The final CFA MTMM model is illustrated in Fig. 7.6. In this model, a
higher-order factor perceived as communicative language ability in group discus-
sion affects the rating on all observed variables through the first-order factors. As
previously noted, the fit indices of this model are assumed to be the same as
those of the baseline model, because a higher-order model with four first-order
factors is statistically equivalent, in terms of fit, to the corresponding
correlated four-factor model (Rindskopf and Rose 1988; Shin 2005). Nonetheless, this model has more
explanatory power regarding the inter-factor covariances when the factors are
highly correlated with each other.
The previous subsection has examined the goodness-of-fit results of all the
suggested alternative MTMM models. In this subsection, in determining the
evidence of construct validity of the proposed rating scale at the matrix
level, the baseline model is compared with the other four CFA MTMM models,
noting that Model 1 and Model 6 are intrinsically the same. Goodness-of-fit
indices for all six MTMM models are summarised in Table 7.7.
As observed earlier, the evidence of construct validity is twofold: convergent
validity and discriminant validity. The first criterion provides the basis for
judging convergent evidence among the trait factors. Using Widaman's (1985)
approach, this study
compares Model 1 with the model whose traits are not specified (Model 2).
A significant χ2 difference (Δχ2) between the two models represents convergent
evidence among the traits. Cheung and Rensvold (2002) also suggest that difference
in CFI (ΔCFI), the value of which exceeds 0.01 within the context of invariance
testing, should also serve as the yardstick of significant difference. In the case of the
present study, as indicated in Table 7.8, a comparison between Model 1 and
Model 2 yields Δχ²(9) = 403.08, a highly significant difference (p < 0.001),
with ΔCFI = 0.472 constituting a substantial difference as well.
The evidence of discriminant validity is sought not only from the perspective
of the trait factors but is also assessed in terms of the method factors. The first comparison is made
between the models whose traits are freely correlated (Model 1) and the one in
which traits are perfectly correlated, namely with a single trait (Model 3). The
comparison results shown in Table 7.8 indicate a significant difference
(Δχ²(17) = 425.68, p < 0.001) and a sizeable CFI difference (ΔCFI = 0.146),
revealing the anticipated evidence of discriminant validity among traits. On the
other hand, as Model 4 features uncorrelated traits, a comparison between Model 1
and Model 4 would be able to suggest the extent to which each trait factor is
separable from the others. As indicated in Table 7.8, the comparison between
Model 1 and Model 4 yields an exceedingly significant difference
(Δχ²(16) = 377.914, p < 0.001) and a ΔCFI greater than 0.01 (ΔCFI = 0.129),
neither of which departs excessively from the acceptable range, once again
supporting a close relationship among the trait factors.
Based on the same logic, though in reverse, the second comparison is made to
test the evidence of discriminant validity regarding method factors, where the
baseline model with uncorrelated methods is compared with a freely correlated
model (Model 5). As explained above, Model 5 imposes the fewest restrictions
and is thereby less restrictive than the baseline model.
What is noteworthy is that a more restricted model with more degrees of freedom
can be a stronger candidate model in that it has to withstand a greater chance of
being rejected (Raykov and Marcoulides 2006). Against this backdrop, the
larger the discrepancy in Δχ² and ΔCFI values between Model 1 and Model 5, the
weaker the support for evidence of discriminant validity between method factors
would be. Table 7.8 also shows an insignificant Δχ²(23) = 8.545 (p > 0.05)
and an almost negligible ΔCFI of 0.004. In that case, evidence of a
discriminant relationship between the method factors can be collected, and it
can fairly be argued that the observed data present a minimal effect of common
method bias across methods of measurement.
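The nested-model comparisons above can be sketched as follows. The chi-square p-value here uses the Wilson-Hilferty cube-root normal approximation, a substitution of my own for whatever exact routine the study's software used; it is adequate for screening but not exact.

```python
import math
from statistics import NormalDist

def chi2_sf(x, df):
    # Upper-tail p-value of a chi-square statistic via the Wilson-Hilferty
    # cube-root normal approximation (a screening approximation, not exact)
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * df))) \
        / math.sqrt(2.0 / (9.0 * df))
    return 1.0 - NormalDist().cdf(z)

def nested_diff(chi2_a, df_a, chi2_b, df_b, cfi_a, cfi_b):
    # Delta-chi2 significance plus the |Delta-CFI| > 0.01 yardstick
    # (Cheung and Rensvold 2002) for two nested models
    d_chi2 = abs(chi2_a - chi2_b)
    d_df = abs(df_a - df_b)
    return chi2_sf(d_chi2, d_df), abs(cfi_a - cfi_b)

# Model 1 vs Model 2: Delta-chi2(9) = 403.08, Delta-CFI = 0.472
p, d_cfi = nested_diff(462.796, 28, 59.716, 19, 1.000, 0.528)
print(p < 0.001 and d_cfi > 0.01)   # True, convergent evidence among traits

# Model 1 vs Model 5: Delta-chi2(23) = 8.545, insignificant
p2, _ = nested_diff(462.796, 28, 454.251, 5, 1.000, 0.998)
print(p2 > 0.05)                    # True, methods remain discriminable
```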
In line with the requirements of CFA MTMM model comparison, what is found
above, to a great extent, demonstrates that the data in this phase of study are
characterised by satisfactory convergent relationship among the traits and dis-
criminant relationship among the traits and between the methods. As model com-
parison at the matrix level is only able to provide a global assessment of evidence of
construct validity (Byrne and Bazana 1996), individual parameter estimates would
be subsequently examined so that the trait- and method-related variance could be
evaluated more precisely.
However, before proceeding to parameter estimates, determination should be
made as to which candidate model can be selected as the final model. The previous
research findings have pinpointed that Models 1 and 6 feature better goodness-of-fit
results than Models 2, 3 and 4 and that Models 1 and 6 are also more interpretable
than Model 5 in the sense that scoring methods should be regarded as individually
unique in lieu of being interrelated. Therefore, Models 1 and 6 could stand out to be
better candidates than the other ones, yet the issue of selecting between Model 1
and Model 6 remains to be resolved. The difference between these two models,
as previously noted, is that the latter is a higher-order factor model, which
supports selecting Model 6 as the final model: within it, the trait factors
are not only closely related to each other but also correlated with a
higher-order factor. In that sense, Model 6 is more parsimonious and
interpretable given the hypothesised notion of CLA. Thus, the factor loadings
of and the correlations within Model 6 are further investigated below.
Considering the comparison results of model fit and parsimony, the above findings
have led to the selection of Model 6 with a higher-order factor as the final model. In
order to seek a more precise assessment of construct validity, the extent of
variance accounted for by the trait and method factors is examined, and the
corresponding factor loadings and error variances of Model 6 are accordingly
outlined in Table 7.9. All the factor loadings are standardised parameter
estimates, which have been scaled to a mean of 0 and a standard deviation of 1
(Byrne and Bazana 1996). Bollen (1989) argues that standardised parameter
estimates are more
useful than unstandardised parameter counterparts for interpretability because the
former is more powerful in reflecting the relative sizes of the factor loadings in a
model. As such, all the factor loadings outlined in Table 7.9 are standardised
parameter estimates.
In examining individual parameters, convergence is reflected in the magnitude of
the trait loadings. The more significant and substantial the factor loadings,
the more evidence of convergence among traits and methods can be collected. As
indicated in Table 7.9, all the trait factor loadings are significant and
substantial, indicating overall convergent evidence of construct validity.
Given the loadings of the four assessment dimensions on the underlying
higher-order factor CLA (PI = 0.990, GV = 0.998, DM = 0.991, ND = 0.991), a
reasonably sound indicator of CLA comprising the above four dimensions of the
rating scale can be established at the parameter level. At the same time, such
high first-order factor loadings temper the evidence of discrimination, which
is typically determined by examining the factor correlation matrices or, in
this case, the higher-order factor loadings.
When factor loadings are compared across traits, methods and error variances,
the proportion of trait variance exceeds the corresponding error variance for
all observed variables except Discourse Management (DM) rated by peers. The
factor loading of DM_P on DM is 0.405, slightly lower than the corresponding
error variance of 0.526. This means that when the dimension of discourse
management was rated by peers, more measurement error might occur, likely
because peer raters failed to capture, or to assess as accurately as teacher
raters, the candidates' de facto performance in managing their discourse.
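The variance reasoning behind this comparison can be made explicit. In a standardised CFA solution, a squared loading is the share of an observed variable's variance attributable to that factor; the figures below are the DM_P estimates reported above, and it should be noted that the text compares the raw loading (0.405) with the error variance (0.526), whereas the trait's actual variance share is the squared loading.

```python
def trait_share(trait_loading, error_variance):
    # Squared standardised loading = proportion of the indicator's variance
    # captured by the trait factor
    share = trait_loading ** 2
    return share, share < error_variance

share, error_dominates = trait_share(0.405, 0.526)
print(round(share, 3))   # 0.164, the DM trait explains roughly 16 % of DM_P
print(error_dominates)   # True, measurement error dominates the trait share
```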
Discriminant validity bearing on particular traits and methods is determined by
examining the factor correlation matrices, as shown in Table 7.10. Conceptually
and ideally, although correlations among traits should be almost negligible to sat-
isfy evidence of discriminant validity, “such findings are highly unlikely in general
and with respect to psychological data in particular” (Byrne 2006, p. 344).
Generally speaking, the coefficients among the traits in Table 7.10 are below
moderate, which indicates that the four assessment dimensions are not
interdependent. One exception is the correlation between DM and ND, which
reaches 0.673, making these two traits more than moderately correlated.
Since the previous findings have revealed that the proposed model is a
higher-order factor model and that the higher-order factor loads heavily on
the four assessment dimensions, the generally below-moderate correlations
among the traits are understandable.
Finally, an examination of the method factor correlation touches upon the
discriminability of the methods and thus the extent to which they are
maximally dissimilar, an important underlying assumption of an MTMM approach.
Given the obvious dissimilarity between teacher-rating and peer-rating, it is
not surprising to find a statistically insignificant correlation of 0.080
between the two scoring methods, as shown in Table 7.10.
7.4 Discussion
The above research findings have been presented in three aspects. The first aspect
addresses the goodness-of-fit results of the baseline CFA MTMM model and the
alternative models. In line with the predetermined criteria drawn from the literature,
Models 1, 5 and 6 can be regarded as well-fitting. On the grounds of
interpretability, Model 5 has been eliminated because it correlates the two
scoring methods, which should instead be regarded as unique. Finally, Model 6
is selected as the final model given its interpretability and its consistency
with previous studies on the taxonomy of speaking ability and language
ability as a whole. In
particular, Model 6 would be soundly supported by Sawaki’s (2007) research,
where a validation study of assessment scales is conducted for L2 speaking ability
for the purpose of student placement and diagnosis. Her analysis also shows that
speaking ability consists of several dimensions yet with an underlying higher-order
ability. In addition, such a hierarchical model of L2 communicative language ability
has received extensive support from other well-documented studies as well (e.g.
Bachman and Palmer 1989; Llosa 2007; Sawaki et al. 2009; Shin 2005).
In the second place, pairs of hierarchically nested models are compared using
chi-square difference tests to determine whether the assessment dimensions display
convergence, discrimination and method effects. In terms of global model fit,
evidence of convergence, discrimination and method effects is found in the final
model. Nonetheless, when it comes to the third aspect, a closer inspection of
the individual parameter estimates, a slightly more nuanced picture emerges.
On the one hand, extremely high factor loadings on the higher-order factor
Communicative Language Ability are obtained, further confirming the perceived
CFA MTMM model and thus lending support to the construct validity of the
revised rating scale. On the other hand, as found above, the factor loading of
DM_P on DM is 0.405, below the corresponding error variance of 0.526. This
means peer raters might experience certain difficulty in assessing candidates'
performance in managing their discourse. Part of the reason could be confusing
wording in the band descriptors for Discourse Management. It should be borne
in mind that this assessment dimension incorporates textual competence,
7.5 Summary
In an attempt to gather evidence of construct validity for the revised rating scale,
confirmatory factor analysis of MTMM data was conducted in this research phase.
In general, this phase of study gathered the convergent and discriminant evidence as
well as the absence of method effect that enabled the revised rating scale to validly
address communicative language ability, a higher-order latent factor perceived in
the final CFA MTMM model. Although certain noise was detected in this
validation phase, such as peer raters' possibly improper handling of assessing
discourse management and the unexpectedly high correlation between certain
assessment dimensions, the main causes were expounded and could be largely
attributed to weaknesses inherent in an analytic rating scale per se. In order to take a closer
look at the correlation between candidates’ performance and the scores they were
assigned by teacher and peer raters, the next phase of validation will take a qual-
itative approach so that more arguments can be collected to validate the rating scale
proposed in this study.
References
Bachman, L.F., and A.S. Palmer. 1989. The construct validation of self-ratings of communicative
language ability. Language Testing 6(4): 449–465.
Bollen, K.A. 1989. Structural equations with latent variables. New York: John Wiley and Sons.
Byrne, B.M. 2006. Structural equation modeling with EQS: Basic concepts, applications, and
programming, 2nd ed. Mahwah: Lawrence Erlbaum Associates.
Byrne, B.M., and P.G. Bazana. 1996. Investigating the measurement of social and academic
competencies for early/late preadolescents and adolescents: A multitrait-multimethod analysis.
Applied Measurement in Education 9: 113–132.
Cheung, G.W., and R.B. Rensvold. 2002. Evaluating goodness-of-fit indexes for testing
measurement invariance. Structural Equation Modeling 9(2): 233–255.
Kline, R.B. 2005. Principles and practice of structural equation modeling, 2nd ed. New York: The
Guildford Press.
Llosa, L. 2007. Validating a standards-based classroom assessment of English proficiency: A
multi-trait multi-method approach. Language Testing 24(4): 489–515.
Raykov, T., and G.A. Marcoulides. 2006. A first course in structural equation modeling, 2nd ed.
Mahwah: Lawrence Erlbaum Associates.
Rindskopf, D., and T. Rose. 1988. Some theory and applications of confirmatory second-order
factor analysis. Multivariate Behavioral Research 23: 51–67.
Sawaki, Y. 2007. Construct validation of analytic rating scales in a speaking assessment:
Reporting a score profile and a composite. Language Testing 24(3): 355–390.
Sawaki, Y., L.J. Stricker, and A.H. Oranje. 2009. Factor structure of the TOEFL internet-based
test. Language Testing 26(1): 5–30.
Schmitt, N., and D.M. Stults. 1986. Methodology review: Analysis of multi-trait multi-method
matrices. Applied Psychological Measurement 10: 1–22.
Shin, S.K. 2005. Did they take the same test? Examinee language proficiency and the structure of
language tests. Language Testing 22(1): 31–57.
Tabachnick, B.G., and L.S. Fidell. 2007. Using multivariate statistics, 5th ed. Needham Heights,
MA: Allyn and Bacon.
Widaman, K.F. 1985. Hierarchically tested covariance structure models for multi-trait
multi-method data. Applied Psychological Measurement 9: 1–26.
Chapter 8
Rating Scale Validation: An MDA Approach
In the research design, it has been noted that this phase of study will draw upon the
de facto performance in nonverbal delivery by the candidates and analyse their
performances in an MDA approach reviewed in the literature (see Sect. 2.5.2.3);
therefore, this phase of study is rather straightforward in its research
procedure. The data used in this phase are the candidates' performances and
their respective scores assigned by teacher and peer raters.
As aforementioned, this phase of study qualitatively addresses the
candidates' nonverbal delivery; thus, a number of candidates needed to be
selected for analysis. As only three candidates were to be selected, instead
of conducting stratified random sampling for a larger sample size, this study
consistently targeted the group discussion numbered 50 in each proficiency
group. The second speaker in each selected group was then chosen as the
representative of that proficiency group. The candidates' privacy was protected, as
pseudonyms would be used in the follow-up analyses and descriptions. Table 8.1
outlines the selected candidates with the averaged subscores from teacher- and
peer-rating attached. Tom, Linda and Diana represent elementary, intermediate and
advanced proficiency groups, respectively. From Table 8.1, it can be noticed that
their total scores measured against the rating scale present an ascending order,
meaning that Diana from Group A performed best (total score equal to 14) and Tom
from Group C performed worst (total score equal to 7). A closer look at their
respective subscores on nonverbal delivery (ND) shows that these three
candidates' performance on ND also corresponds to the ordering of their
predetermined proficiency levels. Although there is slight variation between
teacher- and peer-rating in the three cases, any inconsistency is still within
one adjacent band, which can generally be deemed acceptable. More
specifically to ND, the three candidates were assigned 1.5, 3 and 4, respectively, if
the teacher raters’ and peer raters’ scoring results are averaged. Given the quali-
tative approach this phase of research aims to adopt, this score distribution thus
indicates that the randomly selected candidates can be representative of different
levels in light of nonverbal delivery.
Table 8.2 presents additional information about the whole duration of the
group discussion each selected candidate was engaged in, as well as the
cumulative duration of their participation in the group discussion. As can be
seen, both the group discussion length and the duration of the candidates'
verbal participation follow the lowest-to-highest sequence
of their proficiency levels. When these two time parameters are standardised
to seconds, the extent to which the candidates were actually involved in the
group discussion can be profiled. Table 8.2 indicates that Linda from the
intermediate group was involved most (38.85 %), even though the time she spent
speaking in the group discussion (1′ 55″) was shorter than Diana's (2′ 28″).
It is also noteworthy, however, that on average the selected candidates each
verbally engaged in approximately one-third of the whole group discussion,
thus justifying the comparability across the selected candidates. In addition,
Table 8.2 also indicates that Tom and Linda were seated during the group
discussion, while Diana was standing when talking to the other discussants.
Without any intervention from the researcher, these postures were subject to
the candidates' own preference. When the candidates' nonverbal delivery
frequencies were calculated, they were standardised to occurrences in a 5-min
group discussion.
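The standardisation steps above can be sketched as follows. The 296-second total discussion length is an assumed figure chosen only to reproduce the reported 38.85 % involvement for Linda; the eye-contact count is likewise illustrative.

```python
def to_seconds(minutes, seconds):
    # Convert a minute/second pair (e.g. 1'55") to seconds
    return minutes * 60 + seconds

def involvement(speaking_s, discussion_s):
    # Percentage of the discussion in which a candidate verbally engaged
    return 100.0 * speaking_s / discussion_s

def per_five_minutes(count, discussion_s):
    # Standardise a raw frequency to occurrences per 5-min (300-s) discussion
    return count * 300.0 / discussion_s

# Linda spoke for 1'55" (115 s); 296 s total is assumed, not reported
speak = to_seconds(1, 55)
print(round(involvement(speak, 296), 2))    # 38.85
# e.g. 12 eye-contact occurrences in a 296-s discussion
print(round(per_five_minutes(12, 296), 1))  # 12.2
```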
Having specified the demographic and data information of the selected candi-
dates above, the following section will outline an inventory of measures on which
the MDA analyses and the above three aspects of alignment will be based.
In line with the general framework adapted to this study and the integrated
framework (Martinec 2000, 2001, 2004; Hood 2007, 2011) reviewed in the liter-
ature to investigate the metafunctions of candidates’ nonverbal delivery, each
nonverbal channel is examined from the perspectives of its formal manifestations
and the corresponding metafunctions. As how metafunctions are classified has been
previously expounded (see Sections “Nonverbal Delivery: Communicative Versus
Performative”, “Martinec’s Taxonomy on Actions” and “Hood’s Taxonomy on
Nonverbal Delivery Metafunctions”), this section only outlines how the formal
nonverbal channels are observed.
Table 8.3 lists the measures of the three nonverbal delivery channels observed.
The ticked cells in Table 8.3 indicate on which measures each nonverbal
channel will be studied. Regarding eye contact, this phase will touch upon the
frequency, directionality and duration of eye contact. It is worth mentioning
that duration here refers not only to the mean duration of each occurrence of
eye contact but also to the cumulative duration of a particular candidate's
eye contact in the group discussion. Level of eye contact is also included, as it can be
feasible to judge the level of eye contact from the perspective of the recipient,
which is more associated with metafunctional meanings reviewed in the literature
(see Sects. “Martinec’s Taxonomy on Actions” and “Hood’s Taxonomy on
Nonverbal Delivery Metafunctions”).
Apart from the measures of frequency and directionality that it shares with
eye contact, gesture will be examined in terms of its level rather than its
duration. Two additional measures arising from gesture realisation are also
observed, viz. how the hand(s) are shaped (e.g. palm open or fist) and which
hand(s) are used (e.g. right, left or both). As head movement (e.g. head
movement naturally accompanying an eye contact transition) can be regarded as
a broader realisation of eye contact, this phase of study will focus on the
measures of frequency and directionality only.
In line with Martinec’s (2000, 2001, 2004) taxonomy of action and Hood’s (2007,
2011) research on nonverbal delivery metafunctions, this chapter will revolve
around the research findings in three aspects of alignment. The first alignment is
concerned with the correspondence between the nonverbal delivery channels and
the rating scale descriptors regarding nonverbal delivery. The second alignment is
more focused on descriptive elaborations of how the candidates' performance
in nonverbal delivery is realised from the MDA perspective and how much
communicativeness is achieved relative to the rating scale descriptors. The third
alignment will further look into the interaction, particularly the complementarities,
between the candidates’ verbal and nonverbal delivery in relation to their respective
proficiency levels. However, the presentation of the research findings below still
follows the taxonomy of different nonverbal delivery channels, viz. eye contact,
gesture and head movement, with both their formal realisations and metafunctions
addressed in-depth.
In accordance with the specifications above, the findings on eye contact will be
presented from the perspectives of formal eye contact and its metafunctions.
Meanwhile, the candidates’ performance in nonverbal delivery will be associated
with the above two perspectives under the given operationalisation for an analysis
of the alignments with the candidates’ overall proficiency level and the proposed
rating scale descriptors.
The formal eye contact is first presented with regard to its directionalities.
Figure 8.1, in the form of a bar chart, indicates that Diana, Linda and Tom exhibited
a descending order of eye contact frequency (see the rightmost column sum) and
that all of them had the forward eye contact,1 a commonplace directionality in
communication, but none had any backward eye contact. A more careful scrutiny
concerning the different directionalities would uncover more interesting findings.
Although Tom had the fewest occurrences of eye contact in group discussion, he
1 The directionality of eye contact here is slightly distinguished from that in the AB phase,
where the recipient of eye contact, such as the camera, was described. In this phase, forward
eye contact means having an occurrence of eye contact with an unspecified object physically
located in front of the speaker. Conversely, backward eye contact refers to an occurrence in
which a speaker looks at a position behind him/her.
[Fig. 8.1 Frequencies of eye contact by directionality (leftward, rightward, forward, backward, upward, downward) and in sum for Tom, Linda and Diana]
Table 8.4 Eye contact duration (s)
                     Tom    Linda   Diana
Mean duration        3.15   3.29    4.38
Min. duration        0.50   0.55    0.85
Max. duration        6.80   6.70   10.36
Cumulative duration 41.10  79.05  114.05
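The four summary statistics in Table 8.4 are straightforward to derive from per-occurrence gaze durations; the list below is illustrative, not the study's raw data.

```python
def duration_summary(durations):
    # Summarise per-occurrence eye contact durations (in seconds) into the
    # four statistics reported in Table 8.4
    return {
        "mean": round(sum(durations) / len(durations), 2),
        "min": min(durations),
        "max": max(durations),
        "cumulative": round(sum(durations), 2),
    }

sample = [0.85, 2.10, 4.38, 10.36, 3.20]   # seconds, illustrative only
print(duration_summary(sample))
# {'mean': 4.18, 'min': 0.85, 'max': 10.36, 'cumulative': 20.89}
```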
had the highest frequency of downward eye contact,2 which was absent in
Diana's case. When eye contact in the context of group discussion is
investigated, the directionality is assumed to be more horizontal than
vertical; the downward eye contact might represent Tom's eye contact with the
ground. Both Linda and Diana had one occurrence of upward eye contact. In
addition, Tom had no leftward or rightward eye contact, indicating a
comparatively sedentary posture and less varied eye contact positioning. It is
also noted that Linda had no rightward eye contact, which can be partially
explained by her rightmost sitting position among the three group discussants.
Table 8.4 outlines the duration of eye contact by the three candidates. The
results, especially the ordering, are similar to what was previously found in fre-
quencies, with Tom’s mean duration and cumulative duration of eye contact at 3.15
and 41 s, respectively (shortest), and Diana’s at 4.38 and 114 s (longest). However,
it was also found that Linda, positioned in the middle, did not feature a significantly longer mean duration than Tom. In particular, when the minimum and maximum
durations of eye contact fixation (gaze) were investigated, Linda performed a
shorter max duration than Tom. The findings below concerning the metafunctions
of eye contact will take one step further in untangling the discrepancies.
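Duration measures of this kind can be reproduced from time-stamped annotations. The sketch below is a minimal illustration rather than the study's actual annotation pipeline; the data structure (start/end pairs in seconds) and the sample values are hypothetical:

```python
# Compute mean, minimum, maximum and cumulative duration (in seconds)
# of annotated eye contact occurrences, each stored as a (start, end) pair.

def duration_stats(intervals):
    durations = [end - start for start, end in intervals]
    return {
        "mean": round(sum(durations) / len(durations), 2),
        "min": round(min(durations), 2),
        "max": round(max(durations), 2),
        "cumulative": round(sum(durations), 2),
    }

# Hypothetical occurrences for one candidate
occurrences = [(0.0, 0.5), (10.2, 17.0), (25.0, 28.1)]
print(duration_stats(occurrences))
```

Applied per candidate, a function of this kind would yield figures of the sort reported in Table 8.4.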
The proposed rating scale descriptors pertaining to eye contact focus on three
aspects of eye contact: frequency, controllability and briefness. Judging from the
above findings, it can be summarised that Tom, with the least variation in eye
2 Upward eye contact and downward eye contact are described as looking at the objects whose location is, respectively, above (see Frame 8.4A as an illustration) and below (see Frame 8.4B as an illustration) the horizontal vision of the speaker. They are usually synchronised with moving the speaker's head to a higher or lower position, which might facilitate the researcher's judgment.
8.3 Research Findings
Having obtained the above findings of the formal eye contact, this section, informed
by Martinec’s (2000, 2001, 2004) and Hood’s (2007, 2011) works, turns to the
metafunctions of eye contact following the integrated operational framework for
MDA analyses. In practice, the research findings will be unfolded in the spectrums
of the three metafunctions: ideational, interpersonal and textual meanings.
Ideational Meaning
To commence with, the findings regarding the ideational meaning of eye contact are
presented. Although Martinec’s (2000) demarcation of actions into presenting,
representing and indexical might blur the judgment of eye contact in relation to its
ideational meaning, the co-contextualisation of eye contact with the candidates’
verbiage might be of great assistance in facilitating the judgment in this study.
Figure 8.2 outlines the distribution of eye contact with regard to the above tax-
onomy. Among the candidates, Tom performed the largest number of presenting
actions in this regard, indicating that most of his eye contact, if not all, cannot
practically serve communicative purposes. In contrast, Linda and Diana kept an
almost negligible profile of eye contact falling into the category of presenting
action; most of their eye contact occurrences belong to indexical actions. As
indexical actions are usually language dependent, an abundance of eye contact in
[Fig. 8.2: distribution of eye contact occurrences across presenting, indexical and representing actions]
this category can also be justified because most eye contact occurrences request the
co-contextualisation of verbiage for meaning access. Eye contact of the representing type refers to the established conveyance of a certain formal eye contact, such as wearing a disdainful look to show disagreement and rolling the eyes to indicate prolonged, inconclusive thinking. It should be noted that the judgment on these representing-type eye contact occurrences might be confined to the generally accepted Chinese social context.
Presenting Action
Eye contact serving the presenting function,3 though keeping a comparatively modest profile in Fig. 8.2, deserves a closer look because such eye contact via
material, state and mental processes can reflect how the candidates performed in
group discussions. However, eye contact of this type does not practically enhance
communication effectiveness, but mostly serves adaptive purposes particularly in an
assessment context.
Based on the findings in Fig. 8.2, this section looks into the occurrences of eye
contact of presenting type by Tom, as illustrated in Fig. 8.3. When material is
concerned, judging from the level of Tom’s vision, he presented eye contact with
the other discussant’s clothes (Frame 8.3A), or simply with the ground of the
classroom (Frame 8.3B) in the course of the discussion. Specifically, in Frame
8.3A, while Tom was holding the turn, he seemed to gaze at the other discussant’s
clothes (dashed arrow) while the others were attentively gazing at Tom (arrows).
Likewise, in Frame 8.3B, when Tom yielded the turn to another discussant, his eye contact, instead of targeting the speaker, chose the ground as the material. Neither of the eye contact occurrences above is therefore regarded as semantically loaded or
3 In accordance with Martinec's (2000) taxonomy, presenting functions mainly refer to those that do not generate representational or communicative meanings, such as those actions representative of the candidate's nervousness in assessment contexts (see Section "Martinec's Taxonomy on Actions" for more explanations).
Representing Action
Representing actions with regard to eye contact refer to those with self-explanatory
gaze in a given communication and social context. They can be either language
independent or language correspondent, as explained before. Considering the
Indexical Action
If an occurrence of eye contact falls into the category of indexical action, accom-
panying language is indispensable for the full access to the ideational meaning of
the eye contact that is intended. Among the indexical eye contact occurrences by
the three candidates, this study has mainly retrieved two kinds of ideational
meanings: agreement and uncertainty, to be illustrated below.
As is observed in this study, agreement and uncertainty conveyed via eye contact
are usually realised as a result of long-time eye contact fixation of forward direction,
fulfilling a basic function of gaze: tendering response after attentiveness is shown.
The two frames in Fig. 8.6 illustrate the occurrences of eye contact indicating
agreement and uncertainty, respectively. In Frame 8.6A, upon terminating her turn (the verbiage suggesting Hainan as a travel destination for the forthcoming vacation), one of the discussants had eye contact with Linda, who, in return,
continued her gaze with an accompanying verbiage of “yeah” and even added a
smile as a response of agreement. Diana’s eye contact with the peer seems to be
different, as is shown in Frame 8.6B. Having been asked about her plan after graduation by one of the discussants, Diana took over the turn and expressed her
uncertainty via a gaze at that particular discussant. Diana’s gesture (both hands
across in a fisted form) can also indicate an uncertainty in this regard (see the
findings in the section of gestures below for further triangulation). It is found that
both kinds of the eye contact occurrences with similar verbiage exemplified above
dominate the indexical eye contact in the cases of Linda and Diana.
Associating the above findings with the nonverbal delivery descriptors in the
rating scale, it can be felt that the keywords in the descriptors, viz. controllable and
brief, can be further validated. This is because Tom's eye contact, with the fewest occurrences, long durations of gazing at physical objects with almost no communication-enhancing effect, and most instances falling into the category of presenting function, can be judged as neither controllable nor brief. Although
Linda had eye contact with the other discussants realising the intended representing
and indexical functions, her eye contact seemed to be less empowered due to its
briefness and constant shift in eye contact directions. With various ideational
meanings expressed, Diana’s eye contact with the other discussants at her own turn
or during the others’ turns, can be credited as controllable. She was able to employ
the gaze of showing her attention when the others held the turns and the gaze
serving as a signal of persuasion or agreement when the turn was yielded to her.
Accordingly, the nonverbal delivery subscores that the candidates were assigned
can also be aligned with the above findings.
Interpersonal Meaning
Eye contact in a certain manner can denote positive or negative attitudes. The
interpersonal meaning in relation to attitudes can particularly overlap with indexical
eye contact in that indexical eye contact, as analysed above, mostly contributes to the
ideational meaning of agreement and uncertainty. Therefore, how attitudes are
realised via eye contact will not be redundantly elaborated. However, interpersonal
meaning can also be realised via engagement, which might include the eye contact
indicating neutral, expansion, contraction and possibility. As the commonly
observed eye contact with neutral engagement leaves limited space to be explored
in-depth, and engagement of possibility can be similar to the conveyance of
uncertainty in ideational meaning conveyance, it would be more worthwhile to tap
the potential of the expansion and contraction engagement of eye contact.
When eye contact carries the interpersonal meaning of engagement, expansion
can be realised when the candidate performs a durable and slightly upward gaze
with the other discussant(s) as shown in Frame 8.7A of Fig. 8.7. This is because
Diana’s gaze with such a direction might indicate plenty of negotiation space
provided to show receptivity. Therefore, eye contact in this manner indicates not
only attentiveness but also broad-mindedness in listening to others. Another form of
engagement can be instantiated by an occurrence of slightly downward eye contact,
as is illustrated in Frame 8.7B and Frame 8.7C. In both frames, Linda performed a
downward gaze during her turn though the other discussants were both gazing at
her to show attentiveness. This can be understood as Linda’s unwillingness to be
interrupted when her flow of thought was going on and no other suggestion con-
cerning another travel destination would be allowed at that particular moment, thus
instantiating an engagement of contraction and realising a distancing effect.
The last realisation of interpersonal meaning via eye contact is graduation, which can be
measured via the duration of how long one occurrence of representing or indexical
eye contact would take in shifting from one contact target to another. It has to be
pointed out, however, that the eye contact targets mainly refer to the discussants in
the group because if eye contact is shifted to certain physical objects, it could only be regarded as a failure to realise any interpersonal meaning. The cut-off criteria for graduation in eye contact are tentatively set as follows in this study: fast at 0.5 s and shorter; slow at 1 s and longer; and medium between 0.5 and 1 s. Figure 8.8
outlines the frequency distribution of the three candidates’ eye contact shift when
measured accordingly.
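These cut-offs can be stated as a simple classification rule. A minimal sketch (the function name is illustrative; the thresholds, in seconds, are the ones given above):

```python
# Classify an eye contact shift by its duration, using the study's
# tentative cut-offs: fast at 0.5 s and shorter, slow at 1 s and
# longer, medium in between.

def classify_shift(duration_s):
    if duration_s <= 0.5:
        return "fast"
    if duration_s >= 1.0:
        return "slow"
    return "medium"

print([classify_shift(d) for d in (0.3, 0.5, 0.7, 1.2)])
# ['fast', 'fast', 'medium', 'slow']
```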
Fig. 8.7 Engagement of eye contact in interpersonal meaning: expansion and contraction
[Fig. 8.8: frequency distribution of eye contact shifts classified as fast, medium and slow]
As is revealed from Fig. 8.8, among all the occurrences of representing and indexical eye contact, none of Tom's was lined up with the other discussants as targets. Therefore, the realisation of graduation seems to be absent in his case. Diana and Linda tended to shift their gaze rapidly from one discussant to the
other especially when they were supposed to be attentive in the discussion.
Figure 8.9 presents such an eye contact shift in the case of Diana, where she shifted
her eye contact target swiftly from one discussant (Frame 8.9A, on her right) to the
other (Frame 8.9B, on her left) in accordance with the turn change between them.
One caveat is that due to Linda's rightmost sitting position, her leftward
eye contact might simultaneously capture both discussants, causing a shortened
duration of eye contact shift. By comparison, as Diana was standing in the middle, it
would take longer time for her to shift the eye contact from one side to the other.
With the above, when interpersonal meaning of eye contact is considered, the
candidates’ performances can also be justifiably aligned with the rating scale
descriptor in nonverbal delivery and the subscores they were assigned. Both Linda
and Diana were able to perform eye contact with positive and negative attitudes and
shifted their gaze at the other discussants quickly to achieve a high degree of
graduation. Yet Linda is reduced to a disadvantageous position in this regard in that
she is felt to be more passive given more manifestations of her contraction
engagement compared with Diana’s expansion engagement. The occurrences of
Tom’s eye contact can be hardly felt to realise any interpersonal meaning.
Therefore, the keywords of controllable and brief in describing eye contact on the
rating scale are further validated. As Tom's representing and indexical eye contact amounts to only two occurrences in total, his being assigned 1.5 (between infrequent and almost no eye contact) can also be justified.
Textual Meaning
Ideational meaning and interpersonal meaning alone cannot optimise the intended
meaning. To a certain extent, textual meaning should also be involved so that all the
meaning potentials can co-function in a semiotic network. Informed by the oper-
ational framework specified, textual meaning instantiated via eye contact mainly
involves two aspects: what/who is the recipient of eye contact and how specific is eye
contact. The former can be observed via the object(s)/person(s) at which an occurrence of eye contact is targeted, whereas the latter is more concerned with the
duration of such occurrences of eye contact. The longer an occurrence of eye
contact would last, the more specific it is. In fact, such specificity can also be
measured via the size of pupils because the enlarged pupil size can mean a higher
degree of specificity or attentiveness. However, given the practical technology
constraints, neither the collected data nor the analysing instrument is suitable for
such a measurement.
Figure 8.10 outlines the distribution of the targets that the candidates’ eye
contact is aimed at. Basically, their eye contact targeted at the other discussants
(peers), the teacher on the spot, the camera for recording purpose, and other tangible
objects in the classroom, such as window, ground and ceiling. Among all the target
objects, the three candidates would have the highest frequency of eye contact with
the other discussants. Except for Tom, who had a salient number of eye contact occurrences with the ground, all three candidates seemed to exhibit
the aforementioned eye contact with the physical objects that would possibly
attenuate communication effectiveness. For example, both Diana and Linda seemed
to have brief eye contact with the ceiling (occurrence of upward gaze). One
interesting finding is that Diana and Linda also had eye contact with their hand(s) or
[Fig. 8.10: frequency of eye contact targets (peers, teacher, camera, window, ground, ceiling, hand(s)/finger(s)) for the three candidates]
Table 8.5 Eye contact with peers: duration (s)

Contact target      Measure  Tom    Linda  Diana
Peers               Mean     1.26   2.67   4.45
                    Min.     0.20   0.35   0.89
                    Max.     4.27   6.70   10.36
Hand(s)/finger(s)   Mean     0.00   1.72   1.25
                    Min.     0.00   0.85   0.68
                    Max.     0.00   2.58   1.82
finger(s). This is because when they intended to express or reinforce the meaning
via gestures, their own gaze at the hand(s) or finger(s) would arouse the others’
attention, the phenomenon of which will be further unfolded in the findings on
gestures below.
However, these eye contact targets alone cannot help explain much with regard
to the textual meaning realised. This study then looks into the duration of the
candidates’ eye contact with the other discussants given the fact that only eye
contact of this kind would be textual meaning intended.
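Per-target duration figures such as those in Table 8.5 can be obtained by grouping annotated occurrences by their contact target. A minimal sketch with hypothetical records (target label, duration in seconds), not the study's actual data:

```python
# Group eye contact occurrences by contact target and report
# mean, minimum and maximum duration (in seconds) per target.

def per_target_stats(records):
    grouped = {}
    for target, duration in records:
        grouped.setdefault(target, []).append(duration)
    return {
        target: {
            "mean": round(sum(ds) / len(ds), 2),
            "min": round(min(ds), 2),
            "max": round(max(ds), 2),
        }
        for target, ds in grouped.items()
    }

# Hypothetical annotation records
records = [("peers", 2.0), ("peers", 4.0), ("hand(s)/finger(s)", 1.5)]
print(per_target_stats(records))
```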
As is indicated in Table 8.5, the candidates’ eye contact with the peers can be
largely similar to the results in Table 8.4 as the eye contact of this category accounts
for a majority of all the occurrences. A scrutiny of the means reveals a scenario of Diana's comparatively more durable eye contact. This
indicates that when Diana gazed at the peers, she would conscientiously and sin-
cerely look at the other discussants, thus achieving a higher degree of specificity.
However, the statistics concerning the eye contact with hand(s)/finger(s) will render
a different picture. It is found that Linda (mean: 1.72 s) presented longer gaze at her
own hand(s)/finger(s) than Diana (mean: 1.25 s). As such, it can be said that Linda,
when performing gestures in realisation of metafunctions, would also resort to her
eye contact, another form of meaning-making resource, to pinpoint the significance
of the gesture being performed. In this way, the other discussants' attention would be mobilised as a result of the specificity in Linda's eye contact.
Although not much alignment between the findings from the perspective of
textual meaning and the nonverbal delivery descriptors can be made, a picture of
how textual meaning is realised by Linda and Diana can be captured. One of the
main reasons why such an alignment seems not operationalisable is that the rating
scale descriptor is supposed to bring forth the most salient features instead of being too fine-grained; excessive granularity might render what is supposed to be observed inaccessible to raters. In that sense, even though Linda seemed to perform better than
Diana in giving full play to the possible textual meaning of her eye contact, this
cannot serve as hard evidence in justifying that Linda outperformed Diana, whose
overall delivery via eye contact, as previously analysed, should still be appraised.
Fig. 8.11 Directionality of gestures
8.3.2 Gesture
Talking about the formal gestures by the three candidates, this section will present
the findings with regard to (1) the directionality of gesture4; (2) descriptions of
hands; (3) use of hands; and (4) hands level.5 Prior to the qualitative findings, the
frequency analyses of gestures concerning the above measures are presented below.
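Frequency analyses of this kind amount to tallying annotated category labels per candidate. A minimal sketch (the record fields and labels are hypothetical illustrations, not the study's coding scheme):

```python
# Tally annotated gesture labels per candidate for a given measure
# (e.g. directionality, hand description, use of hands, hands level).
from collections import Counter

def frequency(records, candidate, measure):
    return Counter(r[measure] for r in records if r["candidate"] == candidate)

# Hypothetical annotation records
records = [
    {"candidate": "Diana", "directionality": "rightward", "hands": "right"},
    {"candidate": "Diana", "directionality": "forward", "hands": "both"},
    {"candidate": "Tom", "directionality": "downward", "hands": "right"},
]

print(frequency(records, "Diana", "directionality"))
```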
Figure 8.11 showcases the frequency of gesture directionalities in the candidates' group discussion. Judging from the rightmost column (sum), Diana is found to have performed the largest number of gesture directionalities of various kinds. In particular, there was an extraordinarily high frequency of Diana's using the right hand in her gestures. Comparatively, Tom did not have a noticeably high frequency of any gesture directionality. Therefore, it can be initially deemed that Tom kept an extremely low profile of gestures in synchronisation with his verbal utterances in the group discussion.
Proceeding from the directionality of gestures in general to the description of hands, as illustrated in Fig. 8.12, this study finds a slight variation against the above gesture directionality comparison. One similar tendency is that Tom, compared with his counterparts, can be generally found to have the least variation in terms of hand descriptions. However, there are a few exceptions. Tom tended to
be more often fisted than Linda, who never had any occurrence of fist in her group
discussion. This finding urges an in-depth exploration of what role or function a fist
could play when Tom was involved in the group discussion. The follow-up
4 Similar to the directionalities of eye contact described in Sect. 8.3.1.1, gestures were observed with regard to the directions of hand movement. For instance, if a hand is moved upwards from a lower position, the directionality of the gesture is judged as upward.
5 Hands level is judged when the location of the hand(s) is considered in relation to the speaker's head, chest, legs and waist.
[Fig. 8.12: frequency of hand descriptions (e.g. fisted, pointing, open palm) for Tom, Linda and Diana]
discussion will turn to this point again. In addition, Fig. 8.12 also reveals that Linda
used more pointing than Tom or Diana. Pointing can be a form of reference in
communication heavily loaded with the textual meaning of gestures, after the
exploration of which Linda’s prominent use of pointing can be explained below.
One peculiar finding in this figure is that Diana was found to open her palm constantly, while Linda performed such a gesture only once.
Figure 8.13 illustrates the use of hands, either left hand, right hand or both, by the candidates. Individually, all the candidates tended to use the right hand more often. However, it can be found that Diana's right-hand use far exceeded that of her left hand. In addition, Linda seldom used both hands in her gestures. Hood (2011) puts forward that, to a certain extent, gesturing with both hands usually produces larger and more dramatic gestures, whereas one hand usually triggers smaller and more reserved gestures. This seems to be consistent with what is found above concerning hand descriptions, where Linda performed significantly more pointing with fingers only, yet presented fewer palm-triggered gestures.
The comparison in Fig. 8.14 shows that when candidates instantiated gestures,
their hands level might also vary. Tom’s hands level was either at the leg level or
above the head level, yet it was only Tom who would have the occurrences of
[Fig. 8.14: frequency of hands level (head, chest, legs, waist) for Tom, Linda and Diana]
gestures above the head. Comparatively, it can be sensed that Linda and Diana
placed their hands at a wider range of positions and levels.
At this stage, if the proposed rating scale is aligned with the candidates’ perfor-
mance in nonverbal delivery, the descriptors concerning gestures with a focus on
frequency and variety can be validated. Diana, assigned a subscore of 4, presented not
only frequent but also diversified gestures, the latter of which can be more manifested
in the directionalities of her gesturing, the use and description of her hands as well as her
hands level. By comparison, although Linda also had high frequency of gestures, the
above measures of hand descriptions and hand use would justifiably downgrade her to
the subscore of 3. The case of Tom in this regard is not quite up to the standards of
being frequent and various in gesture use. As Tom was assigned a subscore of 1.5 as
an averaged result of teacher- and peer-rating, a retrospective review on its upper
adjacent band, namely Band 2, is necessary. The gesture descriptor for Band 2 is
“gesture, most of them are for non-communicative purposes”; therefore, raters, due to
Tom’s poor performance, might not even move into length to consider the issues of
frequency or variation of communication-conducive gestures. The research findings about the metafunctions of Tom's gestures will further testify that his gestures, if not all of them, are overwhelmingly performative with non-representational meanings, as expounded below.
[Fig. 8.15: distribution of gesture occurrences across presenting, indexical and representing actions]
might not be semantically loaded or wilfully performed. For example, his gesture could be scratching the head or rubbing his hands on the legs. By contrast, neither Linda nor Diana presented a salient profile of presenting actions; instead, Linda had more representing actions, whereas Diana featured more indexical actions. The following part will further scrutinise how the three metafunctions are realised in the above three types of gestures in the candidates' group discussion.
Ideational Meaning
Presenting Action
Figure 8.15 above has indicated that presenting action keeps a lower profile compared with representing and indexical actions, and this sort of action can be more commonly found in Tom's nonverbal delivery. However, presenting action does not actually serve much communicativeness in group discussion; as such, an unexpectedly abundant use of presenting gestures can be interpreted as not being communication conducive. Therefore, a judgment of Tom's low subscore on nonverbal delivery justifies starting with an analysis of his presenting gestures.
As foreshadowed, gestural presenting actions can be realised by various means,
such as material, behavioural, state and mental processes. Material process refers to
the involvement of objects in the gestural realisation. This study finds regular
occurrences of material processes in Tom’s gestures, which can be showcased in
Fig. 8.16, where Tom, sitting in the leftmost position among the peers, moved the
chair slightly forward with both hands. This action might be interpreted in a bi-fold
manner. One explanation is that for the purpose of drawing physically closer to the
other two discussants, Tom performed a subtle forward movement of his chair. The
other explanation would be that Tom was too nervous in the assessment settings to
be aware of sitting calmly in the group discussion. One word of caution, however, should be borne in mind: Tom performed that action three times, inclining this study more in favour of the second explanation. Lim (2011), in analysing the
teachers’ gestures in the lecturing environment, argues that material processes that
“are extraneous to the focus of the lesson may draw attention away from the main
communicative event” (p. 273). Likewise, Tom’s action in this case would also be
liable to disrupt communication effectiveness.
Behavioural process can refer to the action of crying, laughing or moaning, or
other physiological process like breathing, coughing and burping (Martinec 2000);
naturally, this process can also be realised in a gestural fashion. As group discussion
might trigger viewpoint exchanging and experience sharing, the candidates' gestures are assumed to be embedded with behavioural processes. Figure 8.17 illustrates the presentation of behavioural processes in Linda's gestures. Frame 8.17A snapshots Linda, sitting on the leftmost side, laughing yet hiding her face with both palms when one of the other discussants (sitting in the middle) shared an unpleasant travelling experience with the group. Therefore, Linda's presenting gesture might be interpreted as finding the discussant's story laughable. Another example can be found in Frame 8.17B, where Linda was trying to
hide her face with the left hand and index finger touching the forehead when
another discussant suggested a travel destination that Linda had already been to.
As such, Linda performed that gesture as if she was showing her unwillingness to
revisit a travel destination.
What is worth pointing out is that these behavioural processes are quite evident
in Linda’s performance in the group discussion; however, it does not necessarily
mean that Diana from the advanced group did not have any realisation of these
behaviours. This is because in the case of Diana, she would be more likely to realise
laughing, breathing or surprise via facial expressions, the domain of which is
practically beyond a measurable scope of nonverbal delivery assessment in this
study.
With regard to the state processes, it is also found that the occurrences of Tom’s
gestures could be instantiated by long-time sitting. Martinec (2000) proposes the
category of state processes to describe processes that “have no significant move-
ment and have no obvious expenditure of energy” (p. 249). Echoing this definition,
Tom was constantly sitting still without much noticeable energy-consuming
movement whenever holding or yielding his turn in the group discussion.
Integrating with the findings of material process, this study finds that Tom would
either move the chair occasionally due to nervousness in communication or merely
sit still. The comparatively low profile of these two processes, therefore, might have
justifiably placed Tom at a disadvantageous position when he was assessed.
In stark contrast, although Linda was basically keeping the posture of sitting in
relation to the state process, her overall performance in nonverbal delivery,
particularly gestures observed, would trigger dynamics from time to time. It has to
be admitted that a sitting posture will, to a certain extent, confine the space of
gesturing in the domains of material and state processes. However, Linda seemed to
have accommodated herself to such a confinement by natural and constant gesturing
when discussing with the group members. With a standing posture, Diana was
naturally endowed with more flexibility; thus, the whole duration of the group
discussion witnessed almost no conspicuous happening of gestures with salient
expenditure of energy.
Another realisation of presenting action is mental process, instances of which can be described as a finger or hand pursed at the chin.
Although gestural presenting action does not serve much communicative purpose, it
would somehow mirror the candidate’s inner mindset, such as hesitation and
meditation. Figure 8.18 illustrates Diana’s (standing in the middle) mental process
in relation to her gestural presenting action. In Frame 8.18A, Diana was placing the
index finger of her left hand gently upon the tip of her chin on the left side when she was a bit timid in asking her group members what their future husbands would be like.
Similarly, in Frame 8.18B, after yielding her turn to the discussant to her left, she
again pursed her index finger at the chin as if she was presenting uncertainty, or her
spontaneous reaction to a question that requested time buffering. As stated,
although mental process signifies the ideational meaning of presenting gestures, it
does not serve communication purposes. However, since this action is under the
category of performative gesture, raters might be impressed by the candidates’
performance if they would be able to realise the mental process with gestural
vehicles. As such, Diana’s high subscore in nonverbal delivery can be justified.
Representing Action
Following the ideational meaning of gestural presenting actions, this section will
continue with the ideational meaning of representing action in relation to gestures,
which can be regarded as more pertinent to analysing the alignment of the candidates' nonverbal delivery performance with the communication effects, both
implicit and explicit, achieved.
As is reviewed, representing gestures can be further categorised into language-independent and language-correspondent gestures. The former in its own right lends support to the iconic meaning of gestures in a certain social context and conveys the meaning without relying on any synchronised language; the latter conveys the meaning in correspondence with the verbal utterance it usually co-occurs with. In the case of the three selected candidates
in this phase of study, both language-independent and language-correspondent ges-
tures can be retrieved.
Figure 8.19 renders three instances with which the representing gestures can be
captured and interpreted. Frame 8.19A is a presentation of Tom’s representing
gesture of waving his right hand towards the end of the discussion, signifying
“goodbye”. It should be noted that accompanying this gesture, Tom actually did not
utter the word “goodbye”, the case of which falls into the category of language-
independent gestures. This is because conventionally in the Chinese social context,
waving hands upon the termination of the group discussion might be interpreted as
Indexical Action
As is shown in Fig. 8.15, indexical actions account for the largest proportion of all
the gestures observed. Under most circumstances, indexical actions are language
dependent, which determines their close affinity with the accompanying verbal
language for the full interpretation of the meaning. In the context of the present
study, where the candidates were supposed to hold group discussion in the for-
mative assessment, it has been observed that the presenting gestures were primarily
intended for the conveyance of importance, receptivity and relation.
Importance can be instantiated by a rhythmical movement in the candidates’
indexical gestures. Figure 8.21 illustrates two frames, which, respectively, indicate
Diana’s and Linda’s rhythmic beat in highlighting the points they were conveying.
In Frame 8.21A, Diana was listing various disadvantages of living in a
cosmopolitan city. Each time she came up with one disadvantage, she would clap her
hands once, as if attaching significance by counting off the points. In a quite similar
vein, in emphasising a number of criteria for selecting an ideal travel destination,
Linda expanded and contracted her palms rhythmically, as is shown in Frame
8.21B.
Another realisation of indexical gestures is receptivity, which is usually
instantiated by means of open palm, as illustrated in Fig. 8.22. In Frame 8.22A,
Interpersonal Meaning
The following part will be geared towards the interpersonal meaning interpreted
from the three candidates’ gestures. As is specified, representing and indexical
gestures might carry much interpersonal metafunction, which, as far as gestures are
concerned, can be probed into from the perspectives of attitude, engagement and
graduation (Hood 2011).
Either representing or indexical gestures can transmit an intended
interpersonal conveyance of being positive or negative. Figure 8.25 illustrates
the distribution of positive and negative gestures with interpersonal meanings
across the three candidates. It is found that Tom and Linda basically kept a balance in
expressing positive and negative interpersonal meaning though Linda’s gestures
with attitudes embedded far outnumbered Tom’s. Diana is found to be distinguished
in that she tended to have more gestures with a positive polarity. This can also be
echoed with the findings of head movement below, where there was much more
nodding than head shaking. As Tom’s formal gestures are extremely limited in
number, the following analyses correspondingly reserve limited space for his case.
[Figure 8.25: counts of positive vs. negative gestures across the three candidates]
8.3 Research Findings 241
[Figure: counts of gestures by graduation (fast, medium, slow)]
The extent to which the gestures are judged to fall into one of the graduation
subcategories depends on the automatic retrieval of the gesture duration by ELAN.
Fast gestures are tentatively cut off at 0.5 s and below, and slow gestures at 1 s
and above; gestures falling into the range of 0.5–1 s are judged as medium.
Against the criteria, Fig. 8.29 lists the distribution of gestures in relation to
interpersonal meaning of graduation. It is found that Diana’s gestures are basically
characterised by swiftness and that Linda performed more medium than slow
gestures. In the case of Tom, only a small number of gestures could be
grouped into medium and slow graduation. This holistic finding is consistent with
the above observations regarding the candidates’ activeness in that Diana and Linda
engaged themselves in the discussion with various gestures, while Tom was still
sedentary. In order to make a comparison across the candidates, this study selects
the shared gestures when all the candidates intended to express the negative attitude
of interpersonal meaning, as is illustrated in Fig. 8.30. Similar to the distribution
reflected in Fig. 8.29, Diana and Linda waved the palm in fast (Frame 8.30A) and
medium (Frame 8.30B) motion, respectively, while Tom was almost still (Frame
8.30C) in performing a similar interpersonal-meaning-embedded gesture.
In brief, when the interpersonal meaning channelled in the candidates'
gestures is assessed, Diana is found not only to make lavish and constant use of
gestures indicating positive and negative attitudes, but also to shift between
different forms of engagement in line with her turns, with a large number of
gestures rapidly performed. In that sense, Diana can be judged as a frisky or even
quick-witted communicator to a large extent, thus again aligning her gestures with a great sense of
Textual Meaning
Textual meaning serves as a bridge linking the resources of ideational and inter-
personal meaning. According to Hood (2011), textual meaning with regard to
gestures can be realised by pointing, which can be assessed from the aspects of
directionalities and specificity. Figure 8.31 illustrates the distribution of various
possible directionalities of pointing, which can be broadly broken down into the
directions with reference to human body and those concerning physical objects or
geographic locations. Very few gestures fall into the uncategorised group, mainly
those not entirely captured by the camera or those whose reference could not be
determined because of a moving pointing.
It can be found that Tom would occasionally point at the other discussants to get
his viewpoints across. By comparison, Linda’s pointing at various directionalities
seems to be balanced. In other words, she would point not only at herself or the
other discussants with the referred person(s) embedded but also at the physical
objects, such as the window and the door in the classroom, or the geographic
directions like "south". Diana shared a similar profile of pointing at objects and
directions with Linda, yet her pointing at the other discussants was noticeably
more frequent. This might be understood as a preferred reference to the other members
in the manner of pointing when she intended to convince them (e.g. accompanying
the verbiage of “don’t you think so”), to build a rapport in communication or draw
their attention, all of which seems to echo the above observation of her
vivaciousness.
The last channel of nonverbal delivery observed in this study is head movement,
whose stereotyped manifestations mainly include nodding and head shaking. The
following presents the research findings in terms of formal head movements
and how they realise metafunctions in accordance with the integrated analytic
framework of this study.
Although head movement is conventionally categorised into nodding and head
shaking, in light of its possible forms it can also include head upward, head
downward, head right and head left. It should be noted that nodding and shaking
refer, respectively, to the dynamic movement (more than one repetitive occurrence)
of the head in vertical and horizontal manners, while the remaining four forms refer
to only one occurrence of a particular movement direction followed by a maintained
position for a certain period. For example, head downward can be a turn of the head
to a downward position followed by a maintained period, yet without any positive
or negative meaning such as might be implied by nodding or head shaking.
Figure 8.33 outlines the distribution of various formal head movements by the
three candidates. It is found that, in terms of frequency, Diana performed the largest
number of head movements in various directions except for downwardness. By
contrast, Tom had only a few occurrences of head movement, mainly downward
movements, which also corresponds with what is found above regarding eye
contact. This is because when an occurrence of downward eye contact is captured, a
corresponding downward head movement might occur as an accompanying action.
When nodding and shaking are looked into, both Linda and Diana seem to have
performed more nodding than shaking. In the Chinese social context, this suggests
that more positive expression was conveyed via their utterances. Similar to what
is found in eye contact, Linda moved her head
to the left a few times, with no occurrence of rightward head movement, which can
again be explained by her rightmost sitting position among the discussants.
Therefore, as far as the above findings of head movement frequencies are
concerned, since Tom only had a few salient downward head movements, there is no
appropriate contextualised nodding or head shaking to speak of. The nonverbal
delivery subscore that Tom was assigned (1.5) can thus be justified because his
performance can be regarded as falling between inappropriate head nod/shake and
no head nod/shake. Although Linda had fewer occurrences of head movement than Diana,
both of them can be said to have detectable head nodding and shaking. However,
whether these occurrences can be judged as appropriate would require further
exploration when the metafunctions of head movement are analysed.
[Figure: counts of head movements by presenting, representing and indexical actions]
Diana. This can be particularly true when it comes to their head movements other
than nodding and shaking because only the discussion context can be referred to in
interpreting what is intended by an upward, downward, left or right head
movement.
Ideational Meaning
Ideational meaning in the case of head movement refers to the surface meaning
that such movements instantiate. The following presents the findings of ideational
meaning realised via presenting, representing and indexical head movements.
Presenting Action
The integrated analytic framework stipulates that the ideational meaning of
nonverbal delivery channels can theoretically be instantiated by material,
behavioural, state, verbal and mental processes. However, given the practicalities
of head movements, which involve no object contact (material), exhaustible
movements (behavioural), a predetermined dynamic state and inapplicable verbal
processes, only the mental process is analysed here.
The above findings already indicate that Tom’s presenting head movements
were usually manifested by downwardness, coinciding with the findings in the
directionality of his eye contact. Such occurrences of presenting head movements
can be tentatively described as absent-mindedness because during the other dis-
cussants’ turn, Tom did not show his attentiveness by appropriately gazing at the
turn-holder; instead, the change of his eye contact direction was naturally accom-
panied with the downward movement of his head. However, when Linda’s and
Diana’s presenting head movements, though in a limited number, are analysed, it
can be felt that ideational meaning can be realised by the mental process, as
illustrated in Fig. 8.35.
In Frame 8.35A, Linda was listening to another discussant on suggesting Tibet
as the travel destination, the verbiage being “there is some culture”, upon the
termination of which, Linda subtly moved her head to the left (see Frame 8.35B)
and maintained that position for a certain period with the verbiage of “yes, some
traditional culture” as a signal of confirmed agreement. Although this process
seemed to be less noticeable than other vibrant movements of the head, the detected
action of this type reveals her thinking, an ongoing mental process.
Representing Action
Representing head movements can be interpreted without the language. As most
occurrences of head nodding and shaking are already semantically loaded,
known, respectively, as positive and negative meanings, they would correspondingly
fall into representing actions. In particular, the act of nodding can indicate not
only a speaker's agreement with what others utter but also his or her attentiveness at
that particular moment, also known as nonverbal backchannelling (e.g. White 1989;
Young and Lee 2004).
Take an occurrence of nodding by Diana as an illustration. In Frame 8.36A of
Fig. 8.36, Diana was listening to one of the discussants airing her view on the
given topic. Statically, Diana was gazing, yet dynamically she was nodding in the
transition to Frame 8.36B. Diana's gaze at the other discussant, with varying
levels of vision (see the dashed arrows), not only shows her attentiveness
(backchannelling showing attention or interest) but also implies her agreement
with the other's view. In that case, it can be felt that the whole process of this
head movement can be accessed without any involvement of verbal language.
Indexical Action
Indexical head movements are language dependent, indicating that their meanings
would be blurred if the accompanying language or the verbal context is not given.
This study finds that most of the indexical head movements would instantiate the
meaning of importance or receptivity, as analysed below.
Figure 8.37 illustrates two frames with a similar conveyance of importance via
head movement. In Frame 8.37A, Linda was trying to emphasise that one of the
selling points of travelling to Tibet is to see the special animals like antelope.
Accompanying her verbiage, she vibrantly moved her head downward to highlight
the word "special" in her verbiage. However, such a downward head movement,
though intended to convey importance, does not seem to be as effective or
appropriate as anticipated because a downward action would more often suggest
weakening than strengthening. This can to some extent be understood because her
accompanying indexical gesture already accounts for the intended meaning of
importance. When Diana was attaching importance to her return to her hometown
after graduation, she moved her head upward a bit (see Frame 8.37B), as if to
achieve an effect of awakening, along with an uplifted open palm. In both cases,
the candidates intended to show the meaning of importance.
In addition, indexical head movements can also express receptivity especially
when the speaker intends to yield the turn to the next speaker. Figure 8.38 is just a
case in point, illustrating the only occurrence of Tom’s indexical head movement.
Frame 8.38A shows that Tom was talking in a static sitting posture. When he
intended to yield his turn to another discussant with the verbiage “what do you
think, Mr. Zhang?”, he moved his head to the left with a synchronised gaze (see
Frame 8.38B, the dashed arrows). In the meantime, the third discussant also turned
aside (see the arrow). Up to that moment, Tom had performed as expected; however,
the transition to Frame 8.38C would lead to disappointment because while the
turn-holder was substantiating the discussion, Tom moved his head back to gaze at
the third discussant (see the dashed arrows), whereas the third discussant was still
gazing at the turn-holder (see the arrow). Against this, it can be said that the only
occurrence of Tom’s indexical head movement fails to salvage him from the low
nonverbal delivery subscore assigned.
If the findings concerning formal head movements are not sufficient to align the
candidates' performance in nonverbal delivery with the corresponding descriptor on
the rating scale, especially regarding the appropriateness of head nodding and
shaking, the ways in which ideational meanings are realised via presenting,
representing and indexical head movements above can to a certain degree account
for why Linda's head movement might occasionally be regarded as inappropriate and
why Diana's performance in head movement can not only present "evidence of
appropriate head nod/shake" but also feature well-timed co-ordination with other
nonverbal channels, such as gestures, to maximise meaning potential.
Interpersonal Meaning
Consistent with the realisations of interpersonal meaning via eye contact and ges-
ture, head movement is also able to realise interpersonal meaning by means of
attitude, engagement and graduation.
It is evident that, in terms of attitude, head movement can realise positive and
negative meaning through head nodding and shaking, respectively. As is noted in
Fig. 8.33, Linda and Diana had fewer occurrences of head shaking than nodding.
Since nodding, as an indication of attentiveness and agreement, has been elaborated
above, this section will bring forth more insights on head shaking. Throughout the
discussion, it has been observed that Linda had only one occurrence of head
shaking, as illustrated in Fig. 8.39, where both frames present a dynamic horizontal
movement, as indicated by the arrows. However, a further integration with the
accompanying verbiage again captures an inappropriate use of head movement.
When Linda was agreeing to plan a trip by uttering "oh, that's a good idea", she
[Fig. 8.39 frames (a)/(b); figure: counts of head movements by graduation (fast, medium, slow)]
even though the graduation of Tom’s head movements also features slowness, the
corresponding interpersonal meaning cannot be instantiated. In the case of Linda,
most occurrences of head movement fall into medium graduation, indicating that
her head movements cannot be characterised by deliberateness or urgency.
When the descriptor of nonverbal delivery on the rating scale is validated again
by referring to what is found above concerning how interpersonal meaning is
realised via the candidates’ performance in head movements, more evidence of
alignment can be collected. The demarcation in the head movement descriptor between
Band 3 and Band 4 lies in the appropriateness of head movement. As Linda is
found to shake her head accompanying verbiage of positive conveyance, coupled
with the unexpected occurrence of downward head movement detected above, her
head nod/shake can be judged as less appropriate. The appropriateness of
Diana’s head movements with regard to the interpersonal meaning can again
support the subscore she was assigned because she is found to perform head
nodding and shaking as expected in the given social context and also control the
graduation of head movement in instantiating different meanings.
Textual Meaning
Textual meaning with regard to head movement can be twofold. On one hand, when
a candidate performs head nod or shake, the wavelength can be measured to
indicate the degree of agreement or disagreement, respectively. This is because a
head nod or shake can be understood as a more confirmed occurrence of agreement
or disagreement if it features higher frequencies in a unit interval. For example,
nodding rapidly as a token of positive backchannelling can be felt as a substantiated
acknowledgement of agreement. In order to standardise this measure, this
study retrieved the frequencies of horizontal (head shake) or vertical (head nod)
movements that occurred in one second. On the other hand, concerning the head
movements other than nodding or shaking, this study looked into the amplitude of
head movement because this measure, akin to the pointing in gesture, can tell the
specificity, particularly that of attentiveness, tendering the organisational resources
for ideational and interpersonal meanings.
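The two measures described here can be sketched in a few lines; the function names, the per-second normalisation and the use of angular displacement for amplitude are illustrative assumptions rather than the study's actual instrumentation:

```python
def standardised_rate(n_movements: int, duration_s: float) -> float:
    """Standardised 'wavelength' measure: vertical (nod) or horizontal
    (shake) movements per second of the head-movement interval."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return n_movements / duration_s

def amplitude(start_deg: float, end_deg: float) -> float:
    """Amplitude of a non-nod/shake head movement, taken here as the
    absolute angular displacement (degrees are an assumed unit)."""
    return abs(end_deg - start_deg)

# A nod of 3 vertical movements in 1.5 s is faster than one of
# 2 movements in 1.6 s.
print(standardised_rate(3, 1.5))  # 2.0
print(standardised_rate(2, 1.6))  # 1.25
```

On this measure, a higher rate reads as a more confirmed signal of agreement or disagreement, which is how the Linda/Diana contrast below is drawn.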
Informed by the fact that there is no detectable head nod or shake by Tom,
Table 8.6 lists the standardised wavelength of head movement performed only by
Linda and Diana. The higher the frequency a candidate performs in a second, the
more accelerated the head nod or shake is. Thus, as is revealed, Linda seems to
perform nodding and head shaking more slowly than Diana. This indicates that when
Linda nodded or shook her head, she might have transmitted a mere signal of hesitant positive
Linda can be evaluated to be almost similar except for the fact that her head
movement in a unit interval seems longer, thus seemingly aggravating a scenario of
a non-committal approval or rebuttal.
8.4 Discussion
Having presented the findings of the three randomly selected candidates’ perfor-
mances in nonverbal delivery with regard to its various forms and the respective
metafunctions with an MDA approach, this section continues with a further dis-
cussion on the three research questions.
RSV-II-RQ1: What functions do the candidates’ nonverbal delivery channels serve?
When each nonverbal delivery channel, viz. eye contact, gesture and head move-
ment, is investigated, both Martinec’s (2000, 2001, 2004) and Hood’s (2007, 2011)
frameworks are referred to. With regard to the former, the candidates’ eye contact,
gesture and head movement are categorised into performative actions (presenting)
and communicative actions (representing and indexical), the judgment of which
mainly relies on an interwoven evaluation of their potential of communicativeness
and the synchronised verbal language. The latter framework, after being accom-
modated and slightly revised in the present study, is able to multimodally analyse
the three metafunctions instantiated by the nonverbal delivery channels in accor-
dance with an MDA approach.
Tom is the candidate found to perform the fewest occurrences of nonverbal
delivery in any channel, leaving an impression of being sedentary. Regarding eye
contact, Tom is characterised by frequent durable gaze at
the ground in the group discussion, blocking the instantiation and realisation of the
corresponding ideational and textual meanings. Likewise, his gesturing was also
limited in light of variation, with merely a detected occurrence of monotonous arm
swing as presenting action and that of a waving hand as the only representing action
of bidding farewell upon the termination of the group discussion. In addition, Tom,
instead of having any head nod or shake, only performed downward head move-
ment coinciding with the finding of constant and noticeable gaze at the ground.
Therefore, it can be said that in the meaning-making process, Tom resorted almost
exclusively to the verbal modality for meaning conveyance. Most supposed functions,
especially those that can be instantiated by representing and indexical actions,
failed to enhance the accompanying verbiage. Judging from the above, this study
holds that Tom basically reaches only the first stratum of the meaning-making
network, namely the ability to employ conventional monomodality. The second stratum of
the network, viz. how individual modality presents different metafunctions, and the
third stratum, viz. how different modalities achieve complementarities, seem to be
groundless for an analysis in Tom’s case.
Moving to the case of Linda, a candidate from the intermediate proficiency level,
it can be found that more meaning-making resources are made use of. Linda’s eye
contact features comparatively high frequency yet with briefness and constant shifts
in gaze directionalities. An interpersonal meaning of contraction can be described
as a result of a few occurrences of downward gaze during her own turn and others’
turns. However, Linda is competent in instantiating more desirable textual mean-
ings in that some durable gaze features the specificity of her gesturing. Although
she has a good number of gesture occurrences of various kinds and directionalities,
due to her leftmost sitting position, she could have performed even better if more
physical space had been provided for freer instantiation. In addition, the tendency
of contraction in her eye contact can also be triangulated with her salient
down-palm gestures, which draw more social distance and limit the negotiation space
between speakers. Concerning head movement, Linda is able to instantiate textual meaning
by her leftward head movement with great amplitude so that more of her atten-
tiveness and initiation of turn-yielding can be realised. Nevertheless, Linda’s head
movement occasionally fails to realise the expected ideational meanings because
certain head movements of hers violate contextualised appropriateness.
Therefore, when the stratum of metafunctions realised by nonverbal delivery is
considered, this study holds that Linda, despite her occasional inactiveness that
might be triggered by her personality, performed quite satisfactorily in the domain
of nonverbal delivery because her eye contact, gesture and head movement all
achieve the desired and describable metafunctions to a certain degree. Even moving
to the stratum of inter-semiotic complementarities, her gestures and eye contact can
co-function to instantiate the accompanying verbiage.
The case of Diana can be judged as a model. From the statistics of formal
nonverbal delivery channels with regard to their respective frequency, duration and
variation, she performed better than the other two candidates. Unavoidably, Diana
had only a few occurrences of performative, or presenting actions. However, those
cannot serve as a counterargument to downgrade her performance in this regard. In
addition to the anticipated ideational meanings, her eye contact, with its durability
and firmness, can also instantiate positive and negative attitudes and control the
engagement of contraction and expansion in accordance with the turn shifts.
Likewise, her rapid gestures would indicate her activeness and openness in wel-
coming different viewpoints, while presenting an invisible defensiveness when a
need of building her own arguments in support of her view arose. Her head
movement is also properly controlled as she not only performs various swift head
movements in conveying surface meanings but also shows her own attentiveness
via such movements.
Therefore, as a whole, Diana can be felt to be natural in the meaning-making
process of group discussion. Diana is even more proficient than Linda because not
only can her intended meaning be instantiated by various nonverbal delivery
channels, with ideational, interpersonal and textual meanings realised, but the
different modalities of nonverbal delivery, along with the modality of verbal
language, can also co-ordinate in an integrated manner to maximise the meaning potential.
RSV-II-RQ2: To what extent are teacher raters’ and peer raters’ scoring in non-
verbal delivery alignable with the corresponding descriptors of the proposed rating
scale?
This research question can be generally addressed with the above fine-grained
analyses and discussion, in two respects.
First, a closer look at the nonverbal delivery descriptors might generate a few
keywords, or crucial points of observation. In describing eye contact, the
main demarcation lies in frequency, controlledness and briefness, with the first
keyword pertaining to formal eye contact and the latter two concerned with the
metafunctions explored in an MDA approach above. Gesture, in addition to
frequency (formal gesture), is also described in terms of variation (formal gesture) and
communicativeness (metafunctions) on the rating scale. Head movement, as the last
dimension of nonverbal delivery, is judged against appropriateness of head nod or
shake. The exclusion of frequency in head movement, as previously explained, is to
minimise the intervening effect that candidates' diversified personalities and
cultural backgrounds might exert on the scoring results. Therefore, appropriateness
in head movement can be aligned via both the formal features and the metafunctions
of head movements. The detailed descriptions of the three candidates' performance
in eye contact, gesture and head movement indicate that what is found above can almost
perfectly match what is supposed to be observed and stipulated in the rating scale.
Second, when the nonverbal subscores assigned by the teacher raters and peer
raters are considered, there was no inconsistency in Linda’s (3) and Diana’s
(4) subscores, and most observable and analysable characteristics of their formal
nonverbal delivery and their respective metafunctions can be accorded with the
respective bands. Tom was assigned 1 by peer raters and 2 by teacher raters. This
discrepancy can be mediated because all raters were supposed to, respectively,
observe eye contact, gesture and head movement to reach one subscore of non-
verbal delivery. The judgment on the poor performance in one nonverbal delivery
channel might unconsciously impair the judgment of another. In the case of Tom,
there was no detectable head nod or shake, for which raters might assign 1, yet
raters might also assign 2 owing to their observation that most of his gestures,
though detectable, were not communication-enhancing. Therefore, justifications can
be made that
teacher raters’ and peer raters’ scoring in nonverbal delivery can be to a great extent
alignable with the nonverbal delivery descriptors of the proposed rating scale.
RSV-II-RQ3: To what extent can the nonverbal delivery descriptors distinguish
candidates across different proficiency levels?
This research question addresses the discriminating power of the gradable
descriptors of the rating scale. As specified in the research design of this chapter,
the three candidates were randomly selected from three predetermined proficiency
groups. The scoring results against the proposed rating scale have already
distinguished them across three levels, with Diana and Linda, candidates from the advanced and
intermediate proficiency levels, respectively, falling into Band 4 and Band 3, and
Tom positioned between Band 2 and Band 1. Therefore, this ranking basically
corresponds to the predetermined proficiency levels of these candidates.
As is found above, the nonverbal delivery descriptors of the rating scale can
effectively discern the case of Tom because his poor performance aligns closely
with the detailed descriptors specified previously. Linda is distinguished from
Diana by a few formal nonverbal delivery performances and the corresponding
metafunctions. Formally, Linda's eye contact is found to be brief instead of
durable and firm, and occasionally she also presented certain inappropriate head
nodding. Considering the metafunctions, her inactiveness as reflected in the
interpersonal meaning of eye contact, and her greater tendency towards the
engagement of contraction, can account for the downgraded subscore she was
assigned. Conversely, Diana is found to be satisfactory in the aspects where Linda
fell short. Therefore, the
discriminating power of the rating scale, particularly with regard to the nonverbal
delivery descriptors, can also be accordingly validated.
8.5 Summary
Following the line of validating the revised rating scale, this phase of study adopted
an MDA approach to analyse three randomly selected candidates’ (Tom, Linda and
Diana) nonverbal delivery performance. When nonverbal channels were investigated
from the perspective of their formal manifestations, a series of parameters,
such as frequency, directionality, duration and levels, were probed into. However,
due to the complexity of gestures, this study also focused on the use of hands and
detailed gesture descriptions for a further analysis. When nonverbal channels were
analysed with regard to their metafunctions, the integrated framework drawn from
Martinec’s (2000, 2001, 2004) and Hood’s (2007, 2011) research was referred to.
In investigating formal nonverbal channels, namely the first stratum of the
general framework reviewed in the literature, this study found that the three can-
didates differ in their employment of nonverbal delivery, yet their individual per-
formance on nonverbal delivery may be generally aligned with the corresponding
rating scale descriptors, especially concerning the quantifier descriptors, such as
the parameters of frequency and duration. Among the candidates, Tom seemed to be
most sedentary, without salient performance in any of the nonverbal channels
observed. Comparatively, Linda and Diana performed better in that they both
frequently and constantly resorted to eye contact, gesture and head movement in
accompanying their verbal language.
Further elevated to the second stratum of the general framework, where the
metafunctions instantiated by the candidates’ nonverbal channels were analysed,
this study focused on an even more fine-grained comparison between Linda and
Diana, as an analysis of Tom's performance was largely excluded due to his low
profile in nonverbal delivery. The comparison found that Diana was able to
instantiate different metafunctional meanings via her nonverbal delivery. In
addition, she was shown to impress the other discussants as an engaged, articulate
8.5 Summary 259
and strategic speaker in the group discussion. Diana could also shift the meta-
functions of a particular nonverbal channel in accordance with turn-taking.
Although Linda also performed quite satisfactorily in nonverbal delivery, the
metafunctions realised via her nonverbal channels presented the image of a
slightly passive and hesitant speaker among the discussants. This comparison
lends further support to the alignment of the candidates’ performance with the
subscores assigned to them, as well as with the observable descriptors of
nonverbal delivery on the rating scale. In particular, the key qualifiers used in
the descriptors, such as controlled (eye contact), communication-conducive
(gesture) and appropriate (head movement), can be further validated.
Beyond what is summarised above, this study also explored the third stratum
specified in the general framework for validating the rating scale with an MDA
approach. Diana was found to deploy different channels of nonverbal delivery
alongside her accompanying verbiage so that the intended meaning could be
conveyed more effectively. Even when there was no synchronised verbal language,
different nonverbal channels could co-function to enhance meaning instantiation
in Diana’s case. By contrast, very limited co-ordination across nonverbal
channels could be detected in Tom’s and Linda’s performances.
In light of what the analyses of the candidates’ nonverbal delivery revealed,
the discriminating power of the rating scale, as reflected by its four gradable
bands, was accordingly validated. It can be concluded that nonverbal delivery,
as a newly devised assessment dimension incorporated into this rating scale, is
valid in measuring candidates’ nonverbal delivery performance, which is judged
the most salient representation of strategic competence under the CLA model.
Therefore, by combining a validation study using an MTMM approach with one
using an MDA approach, this research project accomplishes the validation of the
proposed rating scale in a triangulated manner.
References
Hood, S.E. 2007. Gesture and meaning making in face-to-face teaching. Paper Presented at the
Semiotic Margins Conference, University of Sydney.
Hood, S.E. 2011. Body language in face-to-face teaching: A focus on textual and interpersonal
meaning. In Semiotic margins: Meaning in multimodalities, ed. S. Dreyfus, S. Hood and
M. Stenglin, 31–52. London and New York: Continuum.
Lim, F.V. 2011. A systemic functional multimodal discourse analysis approach to pedagogic
discourse. Unpublished PhD thesis. Singapore: National University of Singapore.
Martinec, R. 2000. Types of processes in action. Semiotica 130(3): 243–268.
Martinec, R. 2001. Interpersonal resources in action. Semiotica 135(1): 117–145.
Martinec, R. 2004. Gestures that co-occur with speech as a systematic resource: The realisation of
experiential meanings in indexes. Social Semiotics 14(2): 193–213.
White, S. 1989. Backchannels across cultures: A study of Americans and Japanese. Language in
Society 18: 59–76.
Young, R.F., and J. Lee. 2004. Identifying units in interaction: Reactive tokens in Korean and
English conversations. Journal of Sociolinguistics 8(3): 380–407.
Chapter 9
Conclusion
This section briefly summarises the main findings of the three research phases in
this project.
In the AB phase, this study conducted an empirical investigation into the role of
nonverbal delivery in Chinese EFL candidates’ performance in group discussion,
particularly into how candidates across a range of proficiency levels might be
discriminated by their nonverbal delivery performance. In a sense, if nonverbal
delivery can discriminate well among candidates of predetermined proficiency
levels, an argument for incorporating nonverbal delivery into speaking assessment
can accordingly be advanced.
In this phase of the study, it was found that although candidates generally kept
a low profile in employing nonverbal delivery in group discussion, those across a
range of proficiency levels could be statistically discerned with regard to their
performance of eye contact, gesture and head movement. Candidates of advanced
proficiency were characterised by a higher frequency and longer duration of eye
contact. Elementary-level candidates, though featuring a high frequency of eye
contact, were inclined to shift their gaze hurriedly without much fixed or durable
eye contact with their peer discussants. In addition, rather than enhancing
communication effectiveness, most occurrences of their eye contact, if not all,
served regulatory or adaptive purposes. Although intermediate-level candidates
were found to make eye contact with other discussants, the degree to which their
eye contact served attentive purposes was more limited compared with that of
their advanced counterparts.
Candidates’ gestures could mainly be distinguished in terms of frequency,
diversity and communication-conduciveness. Advanced candidates performed
satisfactorily on all of these measures, whereas candidates at the elementary
proficiency level were found to resort to gestures extremely rarely in
accompanying their verbal language. The intermediate-level
(Δχ2(17) = 425.68, p < 0.001, ΔCFI = 0.146). The standardised parameter estimates
and trait–method correlations revealed no method effect or bias concerning rating
methods. Thus, this rating scale, with nonverbal delivery included as a crucial
dimension, was validated from a statistical perspective.
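As a hedged illustration of the nested-model comparison behind a Δχ2 statistic of this kind, the sketch below performs a chi-square difference test. The model-level chi-squares (625.68 and 200.00) and degrees of freedom (37 and 20) are hypothetical placeholders chosen only so that their difference matches the reported Δχ2(17) = 425.68; they are not the study’s actual model fits:

```python
# Critical value of the chi-square distribution for df = 17 at alpha = 0.001,
# taken from a standard chi-square table.
CRITICAL_17_001 = 40.79

def chi_square_difference(chisq_restricted, df_restricted, chisq_full, df_full):
    """Chi-square difference test for nested CFA/MTMM models.

    The restricted model has fewer free parameters, hence more degrees of
    freedom and a chi-square at least as large as the full model's.
    """
    delta_chisq = chisq_restricted - chisq_full
    delta_df = df_restricted - df_full
    return delta_chisq, delta_df

# Hypothetical model fits whose difference reproduces the reported statistic.
delta_chisq, delta_df = chi_square_difference(625.68, 37, 200.00, 20)
significant = delta_chisq > CRITICAL_17_001  # True implies p < 0.001
print(delta_chisq, delta_df, significant)
```

A difference this far beyond the critical value indicates that constraining the restricted model significantly worsens fit, which is the logic underlying the reported comparison.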
The rating scale, especially its assessment dimension of Nonverbal Delivery, was
further validated at the micro-level with an MDA approach. Three randomly
selected candidates (pseudonymised as Tom, Linda and Diana) representing
different proficiency levels were examined with regard to their de facto
performance in nonverbal delivery. Tom, with a subscore of 1.5 on nonverbal
delivery, was found to be rather sedentary and passive in the group discussion
because only a limited number of nonverbal channels with ideational meanings
were instantiated. The majority of his nonverbal delivery occurrences remained
performative, or served as a likely regulation to adapt himself to the assessment
setting. In that sense, almost no interpersonal or textual meanings could be
detected from his nonverbal delivery; thus, Tom was reduced to stagnation, where
only the first stratum of nonverbal delivery employment could be taken into
account in his case.
In stark contrast, Diana, as a representative of the advanced proficiency level
who was assigned a full mark in nonverbal delivery, was found to be articulate
in eclectically resorting to a repertoire of nonverbal channels to accompany her
verbiage. At certain points, her nonverbal performance could also instantiate
intended meanings without any synchronised verbal language. Judging from the
perspective of metafunctions, she was found to be capable of realising a variety
of meaning potentials via nonverbal delivery. Although she seemed somewhat
aggressive in the group discussion, her frequent shifts in instantiating different
nonverbal channels with discrepant metafunctions would impress the other
discussants as an active speaker willing to negotiate meaning as well as an
attentive listener. Although Linda, whose subscore for nonverbal delivery was 3,
performed quite satisfactorily in terms of formal nonverbal channels, she was
found to be slightly passive and hesitant in the group discussion. In particular,
when the interpersonal meaning of her gestures was looked into, she seemed
self-contained and appeared to create a certain distancing effect on her peer
discussants.
The above profile of the three candidates’ performance on nonverbal delivery can
also be aligned with the descriptors of nonverbal delivery on the rating scale
and the subscores they were assigned. Therefore, the MDA approach further
validated the rating scale with regard to certain keywords to be observed in the
rating process, as well as a number of quantifiers that reflect the discriminant
bands of candidates’ nonverbal delivery.
per se is also anticipated to yield implications. Since much hope is pinned on
this product being routinely applied in group discussion in formative assessment,
certain washback effects (Alderson and Wall 1993; Cheng 2005; Green 2007)
should also be considered. This section dwells upon the possible implications
this rating scale might have for English teaching and textbook compilation, both
of which are the main sources from which EFL learners acquire the English language.
the classroom or somewhere in the corner. Although this way of seating may be
prevalent in certain EFL teaching contexts, it would be highly desirable in the
Chinese EFL context, given its large population of English learners.
Despite the significance of what has been revealed in this study and implications
yielded above, it has to be admitted that this research is not without caveats. The
following two points have to be highlighted when the limitations of the study are
considered.
First, as reviewed in the literature, nonverbal delivery can be highly specific
to the social context, which means there can be substantial differences in
nonverbal communication from one social context to another. It is therefore
likely that, similar to language transfer, EFL learners exhibit the same
nonverbal delivery performance as they would in their native language. Although
this point might be claimed as an excuse for EFL learners to keep a low profile
in their nonverbal delivery in certain social contexts, it should be borne in
mind that since EFL learners communicate and are assessed in English, they are
supposed to perform as expected in the target language. In order to minimise the
possible effects of L1 nonverbal delivery transfer, this study maintained a
homogeneous social context, in which all the data, ranging from learners’
video-recordings to the scoring results, were collected in the Chinese EFL
context. With regard to rater characteristics, the raters were homogeneous in
that they were all Chinese nationals. These findings thus rest on the expectation
that raters score candidates from the same social context; if raters from other
social contexts had been selected for this study, the scoring might have been
jeopardised because they might have been either more severe or more lenient with
the candidates in the Chinese EFL context.
Second, nonverbal delivery is also highly personality-oriented. It can be
observed that more extroverted learners might be more likely to resort to
nonverbal delivery channels. However, this study managed to offset this weakness
by “being lenient” in the descriptors to be observed. When the argument for
embedding nonverbal delivery into speaking assessment was built, a good number
of parameters were taken into account, whereas when the descriptors of nonverbal
delivery were formulated, not every fine-grained parameter, such as the duration
of gesture, was written into the rating scale descriptors. This is because, if
all the details of nonverbal delivery channels were considered, not only would
raters find it infeasible to observe so many points in the scoring process, but
they might also be required to be too tough on less extroverted candidates.
Therefore, it can be claimed that the corresponding descriptors only manifest
the most basic and salient presentations of expected nonverbal delivery.
The above research limitations can indeed provide insights into future research
directions, outlined as follows.
First, rater characteristics can be regarded as a variable to be further explored.
Should native speakers or speakers from other EFL contexts be designated as
raters to score the same performances against the proposed rating scale, there
might or might not be differences. If there is no discrepancy in the rating
results between native and non-native speakers, the possible effect of raters’
social contexts on the scoring results can be considered negligible. However,
should significant differences emerge, a word of caution would be in order,
limiting the applicability of the proposed rating scale to a homogeneous social
context only. In a recent study, Gui (2012) posits that Chinese and American
raters might hold different perceptions of nonverbal communication when scoring
contestants’ performance in public speaking. A follow-up study deriving from the
present research would thus further help validate the rating scale with regard
to its scope of utility.
Second, the argument for embedding nonverbal delivery into speaking assessment
can also be further consolidated by comparing different scoring contexts, in
which raters are provided either with the video-recording or with the
audio-recording only. If raters are deprived of the visual channel that would
otherwise enable them to view candidates’ nonverbal delivery, the rating
differences in candidates’ overall performance across a range of proficiency
levels might not be as significant as revealed in this study. In the context of
formative assessment, where more detailed feedback to learners and teaching
practitioners is required, blocking the visual channel in the scoring process
can be regarded as an impediment to comprehensive assessment and a potential
danger to test fairness.
9.5 Summary
This chapter, recapitulating the main findings of each research phase, draws the
whole research project to a conclusion. Proceeding from three research aims, this
study links the argument for embedding nonverbal delivery into speaking
assessment with the development and validation of a rating scale, so that the
role of nonverbal delivery in assessing communicative ability is given
increasing prominence. It is highlighted that the final product of this study,
namely a validated rating scale to be used for group discussion in the context
of formative assessment, would not only be of much practical utility but also
achieve positive washback effects on EFL teaching and textbook compilation. The
last two sections, respectively, clarify the limitations of this study
concerning candidate variability in nonverbal delivery performance, and point
out directions for exploring nonverbal delivery from the perspectives of rater
characteristics and of whether rating should be conducted via audio- and/or
video-recordings in formative assessment.
References
Alderson, J.C. 1993. Judgments in language testing. In A new decade of language testing
research: Selected papers from the 1990 language testing research colloquium, ed.
D. Douglas, and C. Chapelle, 46–50. Washington, DC: Teachers of English to Speakers of
Other Languages Inc.
Alderson, J.C., and D. Wall. 1993. Does washback exist? Applied Linguistics 14(2): 115–129.
Allwright, R. 1984. The importance of interaction in classroom language teaching. Applied
Linguistics 5: 156–171.
Bailey, K.M., and D. Nunan (eds.). 1996. Voices from the language classroom: Qualitative
research in second language education. New York: Cambridge University Press.
Cheng, L. 2005. Changing language teaching through language testing: A washback study.
Cambridge: Cambridge University Press.
Cunningsworth, A. 1995. Choosing your coursebook. Oxford: Heinemann.
Ellis, R. 1990. Instructed second language acquisition. Oxford: Blackwell.
Frank, C. 1999. Ethnographic eyes: A teacher’s guide to classroom observation. Westport:
Heinemann.
Green, A. 2007. Washback to learning outcomes: A comparative study of IELTS preparation and
university pre-sessional language courses. Assessment in Education 14(1): 75–97.
Gui, M. 2012. Exploring differences between Chinese and American EFL teachers’ evaluations of
speech performance. Language Assessment Quarterly 9(2): 186–203.
Lim, F.V. 2011. A systemic functional multimodal discourse analysis approach to pedagogic
discourse. Unpublished PhD thesis. Singapore: National University of Singapore.
Long, M.H., and C.J. Sato. 1983. Classroom foreigner talk discourse: Forms and functions of
teachers’ questions. In Classroom oriented research in second language acquisition, ed.
H.W. Seliger and M.H. Long, 268–285. Rowley, MA: Newbury House.
Appendix I
IELTS Speaking Rating Scale
(Band 8 and Band 9)
Score | General description | Delivery | Language use | Topic development
General description: … at this level is characterised by at least two of the following …
Delivery: (… is not significantly affected)
Language use: somewhat limited in the range of structures used. This may affect overall fluency, but it does not seriously interfere with the communication of the message
Appendix III
TEEP Speaking Rating Scale
Appropriateness
0 Unable to function in the spoken language.
1 Able to operate only in a very limited capacity: responses characterised by
sociocultural inappropriateness.
2 Signs of developing attempts at response to role, setting, etc., but
misunderstandings may occasionally arise through inappropriateness,
particularly of sociocultural convention.
3 Almost no errors in the sociocultural conventions of language; errors not
significant enough to be likely to cause social misunderstandings.
Grammatical accuracy
0 Unable to function in the spoken language; almost all grammatical patterns
are inaccurate, except for a few stock phrases.
1 Syntax is fragmented and there are frequent grammatical inaccuracies;
some patterns may be mastered but speech may be characterised by a
telegraphic style and/or confusion of structural elements.
2 Some grammatical inaccuracies; developing a control of major patterns,
but sometimes unable to sustain coherence in longer utterances.
3 Almost no grammatical inaccuracies; occasional imperfect control of a few
patterns.
Intelligibility
0 Severe and constant rhythm, intonation and pronunciation problems cause
almost complete unintelligibility.
1 Strong interference from L1 in rhythm, intonation and pronunciation;
understanding is difficult and achieved often only after frequent repetition.
2 Rhythm, intonation and pronunciation require concentrated listening, but
only occasional misunderstanding is caused or repetition required.
3 Articulation is reasonably comprehensible to native speakers; there may be
a marked “foreign accent” but almost no misunderstanding is caused and
repetition required only infrequently.
Fluency
0 Utterances halting, fragmentary and incoherent.
1 Utterances hesitant and often incomplete except in a few stock remarks
and responses. Sentences are, for the most part, disjointed and restricted in
length.
2 Signs of developing attempts at using cohesive devices, especially
conjunctions. Utterances may still be hesitant, but are gaining in coherence,
speed and length.
3 Utterances, while occasionally hesitant, are characterised by an evenness
and flow hindered, very occasionally, by groping, rephrasing and
circumlocutions. Inter-sentential connectors are used effectively as fillers.
Appendix IV: BEC Level 1 Rating Scale

0 NONSPEAKER: Insufficient sample to make an assessment or totally incomprehensible.
1 VERY LIMITED SPEAKER AT BEC 1 LEVEL: Has considerable difficulty communicating in everyday situations, even when listener is patient and supportive. Basic structures consistently distorted; lack of vocabulary makes communication on familiar topics consistently difficult. Very limited range of structures; little or no attempt at using cohesive devices; speech halting; pauses may be lengthy; utterances sometimes abandoned. Speech generally fragmented; no lengthy utterances attempted; turns not developed. Frequent pronunciation errors; intrusive first-language characteristics consistently hinder understanding. Stress and intonation patterns generally distorted. May not understand language and purpose of talk. Often depends on interlocutor/partner for initiating or sustaining utterances. Has difficulty in responding to topic-shifts; often seems unaware of them. Can use very basic conversational formulae but may interact inappropriately. Generally unable to repair communication problems himself/herself. Listening ability: often requires rephrasing.
2 Some features of 1 and some of 3.
3 BASIC SPEAKER AT BEC 1 LEVEL: Able to communicate in everyday situations if listener is patient and supportive. Most utterances are basic structures, with frequent errors of grammar, vocabulary and style. Range of vocabulary and style only partly adequate for familiar topics and situations. Limited range of structures; attempt at using cohesive devices; speech often halting, though some utterances flow smoothly. Most utterances short; turns rarely developed. Fairly frequent pronunciation errors; first-language characteristics noticeably hinder understanding. Strongly marked first-language interference in prosody. Generally understands language and purpose of task. Sometimes has to be drawn out; requires assistance from interlocutor/partner. Has some difficulty in responding to topic-shifts. Often inappropriate or ineffective in turn-taking or responding to interlocutor/partner. Has difficulty using basic repair strategies. Listening ability: sometimes requires rephrasing.
4 Some features of 3 and some of 5.
5 MODERATE SPEAKER AT BEC 1 LEVEL: Generally able to communicate in everyday situations with little strain on listener. Basic structures sufficiently accurate for everyday use; difficulty with more complex structures. Adequate range of vocabulary for familiar topics; some errors in style. Some range of structures; some use of cohesive devices, though not always successfully. Speech generally flows smoothly; some hesitation while searching for language. Often uses appropriately long utterances, though may leave turns undeveloped. Some pronunciation errors; first-language characteristics may hinder understanding. Fairly marked first-language interference in prosody. Deals with tasks reasonably effectively. Occasionally relies on assistance of interlocutor/partner in initiating or sustaining utterances. Responds to topic-shifts, but may require time to do so. Usually appropriate and effective in turn-taking and responding to interlocutor/partner. Generally uses appropriate repair strategies. Listening ability: occasionally requires rephrasing.
Analytic rating scale

Grammar and vocabulary
0 Impossible to understand or insufficient to assess.
1 Frequently difficult to understand. Basic structures consistently distorted; lack of vocabulary makes communication on familiar topics consistently difficult.
2 Some features of 1 and some of 3.
3 Meaning sometimes obscured. Most utterances are basic structures, with frequent errors of grammar, vocabulary and style; range of vocabulary and style only partly adequate for familiar topics and situations.
4 Some features of 3 and some of 5.
5 Meaning generally conveyed despite errors. Basic structures sufficiently accurate for everyday use; difficulty with more complex structures; adequate range of vocabulary for familiar topics; some errors in style.

Discourse management
0 (Almost) no linguistic resources.
1 Very limited range of linguistic resources: very limited range of structures; little or no attempt at using cohesive devices; speech halting; pauses may be lengthy; utterances sometimes abandoned; speech generally fragmented; no lengthy utterances attempted; turns not developed.
2 Some features of 1 and some of 3.
3 Limited range of linguistic resources: limited range of structures; some attempt at using cohesive devices; speech often halting, though some utterances flow smoothly; most utterances short; turns rarely developed.
4 Some features of 3 and some of 5.
5 Fair range of linguistic resources: some range of structures; some use of cohesive devices, though not always successfully; speech generally flows smoothly; some hesitation while searching for language; often uses appropriately long utterances, though may leave turns undeveloped.

Pronunciation
0 Impossible to understand or insufficient to assess.
1 Frequently difficult to understand: frequent pronunciation errors; very intrusive first-language characteristics consistently hinder understanding; stress and intonation patterns generally distorted.
2 Some features of 1 and some of 3.
3 Sometimes difficult to understand: fairly frequent pronunciation errors; first-language characteristics noticeably hinder understanding; strongly marked first-language interference in prosody.
4 Some features of 3 and some of 5.
5 Occasionally difficult to understand: some pronunciation errors; first-language characteristics may hinder understanding; fairly marked first-language interference in prosody.

Interactive communication
0 (Almost) no interaction with interlocutor/partner.
1 Frequently dependent in interaction: may not understand language and purpose of task; often depends on interlocutor/partner for …
2 Some features of 1 and some of 3.
3 Sometimes dependent in interaction: generally understands language and purpose of task; sometimes has to be drawn out; requires …
4 Some features of 3 and some of 5.
5 Fairly independent in interaction: deals with tasks reasonably effectively; occasionally relies on assistance of …
Dear Teachers,
Many thanks for participating in this questionnaire survey. It is related to a study on
“Nonverbal Delivery in Speaking Assessment: From an Argument to a Rating Scale
Development and Validation”. It is my honour to have invited you to share your
views on the features of good oral English proficiency in group discussion. It
will take you about 10–15 min to complete this questionnaire. Please carefully
read the following directions before you proceed to your responses.
******************************************************************
Directions: Please circle the number corresponding to your perception of
each statement. If you strongly agree with the statement, please circle the
number 5; if you agree with the statement, please circle the number 4; if you
find it hard to make a judgment, please circle the number 3; if you disagree
with the statement, please circle the number 2; if you strongly disagree with
the statement, please circle the number 1.
1. Pronunciation accuracy is important in assessing candidates’ oral English
proficiency.
1 2 3 4 5
2. Intelligibility in pronunciation to facilitate listener’s effort is important in
assessing candidates’ oral English proficiency.
1 2 3 4 5
3. Good pronunciation in oral English proficiency means sounding native-like.
1 2 3 4 5
4. Speaking smoothly and loudly can help clear communication.
1 2 3 4 5
5. Effective use of pitch patterns and pauses means effective control of
intonation.
1 2 3 4 5
6. Effective use of stress means effective control of intonation.
1 2 3 4 5
Dear Teachers,
Many thanks for participating in this questionnaire survey. It is related to a study on
“Nonverbal Delivery in Speaking Assessment: From an Argument to a Rating Scale
Development and Validation”. It is my honour to have invited you to share your
views on the features of good oral English proficiency in group discussion. It
will take you about 10–15 min to complete this questionnaire. Please carefully
read the following directions before you proceed to your responses.
尊敬的老師:
非常感謝您能參加此次問卷調查。此次問卷調查是有關“口語測試中之非言
語行為:論述的構建到評分量表的設計與驗證”之博士論文科研項目。我們很
榮幸能夠邀請到您,並由您向我們提供您對學生小組討論時評估其英語口語
能力特徵的看法。本次問卷大約會佔用您10至15分鐘的時間。勞煩您在填寫
以下問卷之前仔細閱讀填寫細則。
******************************************************************
Directions: Please circle the number corresponding to your perception of
each statement. If you strongly agree with the statement, please circle the
number 5; if you agree with the statement, please circle the number 4; if you
find it hard to make a judgment, please circle the number 3; if you disagree
with the statement, please circle the number 2; if you strongly disagree with
the statement, please circle the number 1.
以下是對英語小組討論時學生口語能力特徵的部分描述,請在相應的數字上
畫圈。如果您極為贊同這一描述,則請在數字5上面畫圈;如果您贊同這一描
述,則請在數字4上面畫圈;如果您對這一描述較難判斷,則請在數字3上面畫
圈;如果您不贊同這一描述,則請在數字2上面畫圈;如果您極為不贊同這一描
述,則請在數字1上面畫圈。
11. Choosing appropriate words and phrases is important in assessing the candi-
dates’ vocabulary.
選擇恰當的詞語及短語對評估學生的詞彙很重要。
1 2 3 4 5
12. Employing cohesive devices, such as those indicating cause and effect (be-
cause, therefore) and sequence (then), and discourse markers, such as well, I
mean, in group discussion is important in assessing the candidates’ oral
English proficiency.
在小組討論中運用銜接手段,如表明因果關係(because, therefore)和秩序關
係(then)和話
語標記語,如well及 I mean, 對評估學生英語口語能力很重要。
1 2 3 4 5
Directions: Please circle the number corresponding to your perception of each
statement. If you strongly disagree with the statement, please circle the number 5; if
you disagree with the statement, please circle the number 4; if you find it hard to
make a judgment, please circle the number 3; if you agree with the statement, please
circle the number 2; if you strongly agree with the statement, please circle the
number 1.
以下是對英語小組討論時學生口語能力特徵的部分描述,請在相應的數字上
畫圈。如果您極為不贊同這一描述,則請在數字5上面畫圈;如果您不贊同這一
描述,則請在數字4上面畫圈;如果您對這一描述較難判斷,則請在數字3上面畫
圈;如果您贊同這一描述,則請在數字2上面畫圈;如果您極為贊同這一描述,則
請在數字1上面畫圈。
1. Fulfilling language communicative functions, such as greeting and apology, is
important in assessing the candidates’ oral English proficiency.
能夠完成各種語言交流功能,比如問候和道歉,對評估學生英語口語能力很
重要。
1 2 3 4 5
2. Stating topic-related ideas with reasons and examples is important in assessing
the candidates’ oral English proficiency.
運用說理和舉例來闡述與話題有關的內容對評估學生英語口語能力很重
要。
1 2 3 4 5
3. Choosing appropriate language to fit different contexts and audience means
good oral English proficiency.
根據不同的場合和聽眾來選擇恰當的語言意味着較好的英語口語能力。
1 2 3 4 5
Appendix VI: Questionnaire for Teachers (Final Version)
4. Knowing how to use fillers, such as so, I mean and well, to compensate for
occasional hesitation and control speech means good oral English proficiency.
懂得運用填充語,如so, I mean和well以彌補偶爾的遲疑來控制話語意味着
較好的英語口語
能力。
1 2 3 4 5
***************************************************************
再次感謝您的合作和支持!
Appendix VII
Questionnaire for Learners (Trial Version)
Dear Students,
Many thanks for participating in this questionnaire survey. It is related to a study on
“Nonverbal Delivery in Speaking Assessment: From an Argument to a Rating Scale
Development and Validation”. It is my honour to have invited you to share your
views on the features of good oral English proficiency in group discussion. It
will take you about 10–15 min to complete this questionnaire. Please carefully
read the following directions before you proceed to your responses.
親愛的同學:
非常感謝您能參加此次問卷調查。此次問卷調查是有關“口語測試中之非言
語行為:論述的構建到評分量表的設計與驗證”之博士論文科研項目。我們很
榮幸能夠邀請到您,並由您向我們提供您對學生小組討論時評估其英語口語
能力特徵的看法。本次問卷大約會佔用您10至15分鐘的時間。勞煩您在填寫
以下問卷之前仔細閱讀填寫細則。
******************************************************************
Directions: Please circle the number corresponding to your perception of each
statement. If you strongly agree with the statement, please circle the number 5;
if you agree with the statement, please circle the number 4; if you find it hard
to make a judgment, please circle the number 3; if you disagree with the
statement, please circle the number 2; if you strongly disagree with the
statement, please circle the number 1.
以下是對英語小組討論時學生口語能力特徵的部分描述,請在相應的數字上
畫圈。如果您極為贊同這一描述,則請在數字5上面畫圈;如果您贊同這一描
述,則請在數字4上面畫圈;如果您對這一描述較難判斷,則請在數字3上面畫圈;
如果您不贊同這一描述,則請在數字2上面畫圈;如果您極為不贊同這一描述,
則請在數字1上面畫圈。
11. Choosing appropriate words and phrases is important in assessing the candi-
dates’ vocabulary.
選擇恰當的詞語及短語對評估學生的詞彙很重要。
1 2 3 4 5
12. Employing cohesive devices, such as those indicating cause and effect (because, therefore) and sequence (then), and discourse markers, such as well and I mean, in group discussion is important in assessing the candidates’ oral English proficiency.
在小組討論中運用銜接手段,如表明因果關係(because, therefore)和秩序關係(then),和話語標記語,如well及I mean,對評估學生英語口語能力很重要。
1 2 3 4 5
Directions: Please circle the number corresponding to your perception of each
statement. If you strongly disagree with the statement, please circle the number 5;
if you disagree with the statement, please circle the number 4; if you find it hard
to judge, please circle the number 3; if you agree with the statement, please circle
the number 2; if you strongly agree with the statement, please circle the number 1.
以下是對英語小組討論時學生口語能力特徵的部分描述,請在相應的數字上
畫圈。如果您極為不贊同這一描述,則請在數字5上面畫圈;如果您不贊同這一
描述,則請在數字4上面畫圈;如果您對這一描述較難判斷,則請在數字3上面畫
圈;如果您贊同這一描述,則請在數字2上面畫圈;如果您極為贊同這一描述,則
請在數字1上面畫圈。
1. Fulfilling language communicative functions, such as greeting and apology, is
important in assessing the candidates’ oral English proficiency.
能夠完成各種語言交流功能,比如問候和道歉,對評估學生英語口語能力很
重要。
1 2 3 4 5
2. Stating topic-related ideas with reasons and examples is important in assessing
the candidates’ oral English proficiency.
運用說理和舉例來闡述與話題有關的內容對評估學生英語口語能力很重
要。
1 2 3 4 5
3. Choosing appropriate language to fit different contexts and audience means
good oral English proficiency.
根據不同的場合和聽眾來選擇恰當的語言意味着較好的英語口語能力。
1 2 3 4 5
Appendix VIII: Questionnaire for Learners (Final Version)
4. Knowing how to use fillers, such as so, I mean and well, to compensate for occasional hesitation and control speech indicates good oral English proficiency.
懂得運用填充語,如so, I mean和well以彌補偶爾的遲疑來控制話語意味着較好的英語口語能力。
1 2 3 4 5
***************************************************************
Thank you again for your cooperation and support!
再次感謝您的合作和支持!
Appendix IX
Proposed Rating Scale (Tentative Version)
Discourse Management
Fluent         5 4 3 2 1    Disfluent
Coherent       5 4 3 2 1    Scattered
Developed      5 4 3 2 1    Underdeveloped
Nonverbal Delivery
Frequent       5 4 3 2 1    Infrequent
Durable        5 4 3 2 1    Brief
Appropriate    5 4 3 2 1    Inappropriate
Varied         5 4 3 2 1    Monotonous
Band Band descriptors for grammar and vocabulary
Vocabulary breadth and depth sufficient for expression, with occasional detectable
inaccuracy
Accompanying infrequent use of idiomatic chunks
3 Noticeable grammatical errors slightly reducing expressiveness
Effective and accurate use of simple structures, with less frequent use of complex
structures
Frequent error-free sentences
Vocabulary breadth sufficient for the topic, with less noticeable vocabulary depth
Rare use of idiomatic chunks
2 Noticeable grammatical errors seriously reducing expressiveness
Fairly accurate use of simple structures, with inaccuracy in complex structures
Frequently incomplete and choppy sentences
Vocabulary breadth insufficient for the topic
Inaccurate use of words causing confusion
1 Frequent grammatical errors, with no intention of self-correction
Detectable and repetitive formulaic expressions
Inaccuracy and inability to use basic structures
Topic development seriously limited by vocabulary scarcity
<sp3> This is a hard choice between Shanghai and hometown. And what do you
want to know about future? </sp3>
<sp1> If I know if I have special power, I want to know what air environment will
be. After some years later or some decades later, as you know that we are in face of
many environmental problems and some the local problems have graduated into the
international issues. Sometimes we may talk about what we will do if the end of the
earth really occurs. </sp1>
<sp3> I know you say in the movie. </sp3>
<sp2> Just just just to me. Two days before I dreamed of there in Shanghai have an
earthquake. So horrible, so horrible. </sp2>
<sp1> Really? So terrible. Maybe you when when you woke up, you will feel
lucky that it was just a dream. </sp1>
<sp3> Yeah, my major is environmental engineering. I think I can do something to
the environment, yes? </sp3>
<sp2> Yes, we must protect our environment. </sp2>
<sp3> I will do something to protect the river, to en…yeah to the air and to
something else. </sp3>
<sp2> And and what what do you want to know about your future? </sp2>
<sp3> What I want to know most is the condition of my parents’ health. They do
not have some serious illness these days but some small ones will come up just now
and then. Some days ago, my father told me that he feels it is a little bit hard for
him to go downstairs, so I worried about him very much. I asked her, him to to do
more exercise so he will en…her condition will be better, I think. En…everyone
will die but I don’t want them to suffer from a lot of pain before that day come.
That’s the thing I care about the most. What do you have anything else you want to
know? </sp3>
<sp2> I want to know what kind of person you will, you will marry. </sp2>
<sp1> We are also. </sp1>
<sp3> We too. </sp3>
<sp2> Is he taller or is he shorter? </sp2>
<sp1> Is he handsome? </sp1>
<sp3> Handsome? </sp3>
<sp2> Handsome? Yeah. </sp2>
<sp3> I hope he can be very kind and responsible, yes. If he is handsome and tall, it
can be better. </sp3>
<sp1> I hope that. </sp1>
<sp3> We all hope want a tall and handsome boyfriend, yes? </sp3>
<sp2> Yes, and I want to, I want to know what kind of job I will I will take I will
do in my future. </sp2>
<sp3> What do you want to do? </sp3>
<sp2> I want to be a university teacher. </sp2>
<sp3> I support you. </sp3>
<sp1> I want to be a white collar to earn a lot of money. </sp1>
<sp3> I want to be a psychologist but it is different, a little bit different from my
major. I want to know if I will fulfil my dream in the future. </sp3>
Appendix X: Transcriptions of the Three Selected Group Discussions
<sp2> Yes, because I want to see the singing star there, and that’s a good place for
shopping. </sp2>
<sp3> But I think it maybe so expensive, and it is so complicated to make a
passport. </sp3>
<sp2> Oh, yes, that’s a problem. </sp2>
<sp1> Yeah, the time is not enough. I think we should consider, not consider
abroad, because it’s too expensive. </sp1>
<sp2> In our country, China. </sp2>
<sp1> Yeah. </sp1>
<sp2> Do you have some idea? </sp2>
<sp3> I want to go Tibet. </sp3>
<sp1> Tibet? </sp1>
<sp3> Yeah, It’s my favourite space. And they are so cultural there, traditional
cultural. Em… and it’s mystery here. </sp3>
<sp2> Yes, Great, I think so. And I think there we can see some special animals,
like antelope, Tibet antelope and others. Do you think so? </sp2>
<sp1> But I think, Tibet is too far away, and the air pressures is not fit us. </sp1>
<sp3> Do you have some suggestions? </sp3>
<sp1> Em…let me think about. Maybe we, we can go to Guilin. </sp1>
<sp2> Guilin? Oh, that’s a good place. </sp2>
<sp1> Yeah, it’s very beautiful. The scenery attract me a lot. And in the TV
program, I see the shaped, strange shaped mountain and the beautiful river there.
And em…have a lot of, has a lot of legends there. I er…look forward there very
much. </sp1>
<sp3> But do you know as an attractive place, so in summer vacation there will be
so many people. </sp3>
<sp1> It’s a problem. </sp1>
<sp2> That’s a pity. I have never been to Guilin. And I’ve heard the saying that
Guilin’s scenery is the best of the world. </sp2>
<sp1> Yeah. </sp1>
<sp2> But I think maybe there are too many people there at that time. </sp2>
<sp3> So we should think about the close to Shanghai. </sp3>
<sp2> Some place close to Shanghai? </sp2>
<sp3> Yes. For example, Hangzhou or Suzhou? </sp3>
<sp1> Hangzhou, Suzhou is. </sp1>
<sp2> Hangzhou is my hometown! </sp2>
<sp1> They are very good place, but I think I have been there many, many times,
and I don’t want to go there again. </sp1>
<sp2> How about Suzhou? </sp2>
<sp1> Suzhou, em… </sp1>
<sp2> Can you introduce Suzhou? </sp2>
<sp3> Just some gardens, special. </sp3>
<sp1> Suzhou Garden is very famous. </sp1>
<sp3> Hanshan Temple. </sp3>
<sp2> Hanshan Temple! </sp2>
<sp2> Em…er…I think you are all right. But I’d better living in the city, because I
think the city is better for me. See you! </sp2>
<sp1><sp3> See you! </sp3></sp1>
</conversation>