Dubeau Testing Research Bucharest

STANAG 6001 OPI Testing
Julie J. Dubeau
Bucharest BILC 2008
Bill Who???
Are We All On the Same Page?
An Exploratory Study of OPI Ratings Across NATO Countries Using the NATO STANAG 6001 Scale*
*This research was completed in 2006 as part of an M.A. thesis in Applied Linguistics.
Presentation Outline
Context
Research Questions
Literature Review
Methodology
Results
Ratings
Raters
Scale
Conclusion
NATO Language Testing Context
Standardized Language Profile (SLP) based on the NATO STANDARDIZATION AGREEMENT (NATO STANAG) 6001 Language Proficiency Levels (Ed 1? Ed 2?)
26 NATO countries, 20 Partnership for Peace (PfP) countries & others
Interoperability Problem?
Language training is central within armed forces due to the increasing number of peace-support operations, and is considered to play an important role in achieving interoperability among the various players.
The single most important problem identified by almost all partners as an impediment to developing interoperability with the Alliance has been shortcomings in communications (EAPC (PARP) D, 1997, 1, p.10).
Overarching Research Question
Since no known study had investigated inter-rater reliability in this context, the main research question was:
How comparable or consistent are ratings across NATO raters and countries?
Research Questions
Research questions pertaining to the ratings (RQ1)
Research questions pertaining to rater training and background (RQ2)
Research questions pertaining to the rating process and to the scale (RQ3)
Research Questions
RQ1-Ratings:
How do ratings of the same oral proficiency interviews (OPIs) compare from rater to rater?
Would the use of plus levels increase rater agreement?
How do the ratings of the OPIs compare from country to country?
Are there differences in scores within the same country?
Research Questions
RQ2-Rater training and background:
Are there differences in ratings between raters who have received varying degrees of tester/rater training and STANAG training?
Did very experienced raters score more reliably than less experienced ones? Are experienced raters scoring as reliably as trained raters?
Are there differences in ratings between participants who test part-time versus full-time, are native or non-native speakers of English, or are from older versus newer NATO countries?
Research Questions
RQ3-Rating process and scale use:
Do differing rating practices affect ratings?
Do raters appear to use the scale in similar ways?
What are the raters' comments regarding the use and application of the scale?
Literature Review
Testing Constructs
What are we testing?
General proficiency & Why
Rating scales
Rater Variance
How do raters vary?
Rater/scale interaction
Rater training & background
Methodology
Design of study: Exploratory survey
2 Oral Proficiency Interviews (OPIs A & B)
Rater data questionnaire
Questionnaire accompanying each sample OPI
Participants: Countries recruited at BILC Seminar in Sofia 2005
103 raters from 18 countries and 2 NATO units
Analysis:
Rating comparisons
Original ratings
Plus ratings
Rater comparisons
Training
Background
Country to country comparisons
Within country dispersion
Rating process
Rating factors
Rater/scale interaction
Scale user-friendliness
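The country-to-country and within-country comparisons listed above can be illustrated with a minimal sketch. Everything in it is assumed for illustration only: the ratings are made up, the country labels are hypothetical, and coding a plus level as the base level plus 0.5 is just one possible convention, not necessarily the one used in the thesis.

```python
from statistics import mean, pstdev

# Hypothetical ratings of one OPI sample, grouped by made-up country labels.
# Assumption: a plus level is coded as base level + 0.5 (e.g. "1+" -> 1.5).
ratings_by_country = {
    "Country 1": ["1", "1+", "1"],
    "Country 2": ["1+", "2", "2"],
    "Country 3": ["1", "1", "2+"],
}


def to_number(rating: str) -> float:
    """Convert a label such as '2' or '2+' to a numeric score."""
    base = float(rating.rstrip("+"))
    return base + 0.5 if rating.endswith("+") else base


def summarize(groups):
    """Return {country: (mean, population standard deviation)}."""
    summary = {}
    for country, labels in groups.items():
        scores = [to_number(label) for label in labels]
        summary[country] = (mean(scores), pstdev(scores))
    return summary


summary = summarize(ratings_by_country)

# Mean over every rating from every country, used as the reference point.
overall_mean = mean(
    to_number(label)
    for labels in ratings_by_country.values()
    for label in labels
)

for country, (m, sd) in summary.items():
    print(f"{country}: mean={m:.2f}, spread={sd:.2f}, "
          f"diff from overall={m - overall_mean:+.2f}")
```

In this sketch, the difference between a country mean and the overall mean stands in for the country-to-country comparison, and the standard deviation stands in for within-country dispersion.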
Results RQ1- Summary
Ratings: To compare OPI ratings and to explore the efficacy of plus ratings.
Some rater-to-rater differences
Plus levels brought ratings closer to the mean
Some country-to-country differences
Greater within-country dispersion in some countries
[Figure: bar chart, "Stacked view of A": count of ratings (axis 0 to 60) by adjusted plus range (within level 1, within level 2, within level 3)]
View of OPI ratings sample A
Results Sample A (L1)

All Ratings (with +)
Levels                 Numbers   Percent
Within Level 1 range   70        68.0
Within Level 2 range   32        31.1
Within Level 3 range   1         1.0
Total                  103       100.0
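As a quick arithmetic check of the Sample A figures, the percentages follow from the counts and the total of 103 raters. The sketch below assumes that the Level 3 range holds the one rating not accounted for by the 70 and 32 shown; the variable names are illustrative only.

```python
# Counts for Sample A as given on the slide. The Level 3 count is an
# assumption derived from the total (103 - 70 - 32 = 1).
total = 103
counts = {
    "Within Level 1 range": 70,
    "Within Level 2 range": 32,
}
counts["Within Level 3 range"] = total - sum(counts.values())

# Share of all ratings falling in each range, rounded to one decimal place.
percentages = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, pct in percentages.items():
    print(f"{label}: {counts[label]} ({pct}%)")
```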
All Countries Means for Sample A
[Figure: overall country mean rating (axis 1.00 to 2.40) plotted by country number (1 to 20)]
All Ratings for Sample B (level 2)

Levels   Numbers   Percent
1        2         1.9
1+       1         1.0
2        47        45.6
2+       8         7.8
3        34        33.0
3+       2         1.9
4        2         1.9
Total    96        93.2
[Figure: bar chart, "Stacked view of B": count of ratings (axis 0 to 60) by adjusted plus range (within level 1, within level 2, within level 3, within level 4)]
View of OPI ratings sample B
All Countries Means for Sample B
[Figure: country mean rating for sample B plotted by country number]
Results RQ2- Summary
Raters: To investigate rater training and scale training and see how (or if) they impacted the ratings, and to explore how various background characteristics impacted the ratings.
Trained raters scored within the mean, especially for sample B
Experienced raters did not do as well as scale-trained raters
Full-time raters scored closer to mean
New NATO raters scored slightly closer to mean
NNS raters scored slightly closer to mean
Tester (Rater) Training

[Figure: bar chart of frequency (axis 10 to 70) by amount of tester training, with categories "none to little" and "substantial to lots"; bar labels 63.27% and 36.73%]
Years of Experience
[Figure: bar chart of frequency (axis 10 to 50) by years of experience, with categories "0 to 1 year", "2 to 3 years", "4 to 5 years" and "5 years +"; bar labels 49.5%, 19.8%, 15.84% and 14.85%]
STANAG Scale Training

[Figure: bar chart of percent (axis 10 to 60) by amount of STANAG scale training, with categories "none to little" and "substantial to lots"; bar labels 60.0% and 40.0%]
Old vs. New NATO Countries

Summary of Tester Training, by "Newer NATO member?"
Rows: Yes, No, Total. Columns: Little, Lots, Total.
Cell values in extracted order: 6, 14, 23, 20, 2, 36, 29, 30, 36, 28, 51, 58, 87
Old vs. New NATO Countries

Rating OPI B Correct?, by "Newer NATO member?"
Rows: Yes, No, Total. Columns: Yes, No, Other/Missing, Total.
Cell values in extracted order: 27, 14, 27, 20, 2, 36, 54, 37, 26, 55, 32, 92
Results Raters Background

Conducts testing full-time?
Yes: 34 (33.0%)
No: 67 (65.0%)
Full-time testers more reliable (accurate)
NNS (60%) raters better trained?
New raters better trained?
Results RQ3- Summary
Scale: To explore the ways in which raters used the various STANAG statements and rating factors to arrive at their ratings.
Rating process did not affect ratings significantly
3 main types of raters emerged:
Evidence-based
Intuitive
Extra-contextual
Results
An evidence-based rating for Sample B (level 2):
I compared the candidate's performance with the STANAG criteria (levels 2 and 3) and decided that he did not meet the requirements for level 3 with regard to flexibility and the use of structural devices. Errors were frequent not only in low frequency structures, but in some high frequency areas as well. (Rater 90 - rated 2)
Results
An intuitive rating for Sample A (level 1):
I would say that just about every single sentence in the interpretation of the level 2 speaking could be applied to this man. And because of that I would say that he is literally at the top of level 2. He is on the verge of level 3 literally. So I would automatically up him to a low 3. (Rater 1 - rated 3)
Results
An extra-contextual rating for Sample A (level 1):
Level 3 is the basic level needed for officers in (my country). I think the candidate could perform the tasks required of him. He could easily be bulldozed by native speakers in a meeting, but would hold his own with non-native speakers. He makes mistakes that very rarely distort meaning and are rarely disturbing. (Rater 95 - rated 2)
Implications
Training not equal in all countries
Scale interpretation
Plus levels useful
Different grids, speaking tests
Institutional perspectives
Limitations & Future Research

Participants may not have rated this way in their own countries
OPIs new to some participants
Future research could:
Get participants to test
Investigate rating grids
Look at other skills
Conclusion of Research
So, are we all on the same page?
YES! BUT
Plus levels were instrumental in bridging the gap
Training was found to be key to reliability
More in-country training should be the first step toward international benchmarking.
Thank You!
Are We All On the Same Page?
An Exploratory Study of OPI Ratings Across NATO Countries Using the NATO STANAG 6001 Scale
Dubeau.JJ@forces.gc.ca
The full thesis is available on the CDA website
http://www.cda-acd.forces.gc.ca
Or google Dubeau thesis
