
University of Ostrava

Czech Republic
26-31 March 2012




Different forms of a test

Item banking

Achievement monitoring

Classical Test Theory (CTT):
- Equating is applied only to different test forms
- The problem is often ignored (by assuming parallel test forms)
- Establishes equivalent scores on different test forms
- Does not create a common scale

Item Response Theory (IRT):
- Satisfies all equating needs
- Places all estimates of item and examinee parameters on a common scale
Test equating is a special procedure that establishes the relation
between examinee scores on different test forms and places them
onto the same scale.
As a result, a measure based on responses to one test can be
matched to a measure based on responses to another test, and the
conclusions drawn about an examinee are identical, regardless
of the test form that produced the measure.
Equating of different test forms is called
horizontal equating.


Purpose: to compare student achievement at different grade levels
Test forms are designed to differ in difficulty
Measures from the different tests should be
placed on the same linear continuum
This test equating procedure is called
vertical equating.
Item bank: a set of items from which test forms
that produce equivalent measures may be
constructed.
An item bank is composed of test items that
have been placed onto a common scale, so that
different subsets of these items produce
interchangeable measures for an examinee.
Once an item bank exists, no further equating
is needed.


Test equating and item banking are both designed to place
estimated parameters onto a common scale
In test equating the goal is to place person
measures from the multiple test forms onto the
same scale
In item banking the goal is to place item
calibrations on the same scale
Procedures are nearly identical when we use Rasch
measurement
Equating: a procedure that ensures that the
examinee measures obtained from different
subsets of items are interchangeable. When
two tests are equated, the resulting measures
are placed onto the same scale.
Scaling: a procedure that associates numbers
with the performance of examinees. Tests can
be scaled identically without having been
equated.

Equating in CTT applies only to comparing examinee test scores on
two different test forms
The problem can be ignored (by introducing
parallel test forms)
It implies only establishing a relation between
test scores on different test forms
It does not imply creation of a common scale



Linear equating

Equipercentile equating
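Linear equating is based on equating the standard score on
test X to the standard score on test Y:

$$\frac{x - \mu_x}{\sigma_x} = \frac{y - \mu_y}{\sigma_y}$$

Thus, $y = Ax + B$, where

$$A = \frac{\sigma_y}{\sigma_x}, \qquad B = \mu_y - \frac{\sigma_y}{\sigma_x}\,\mu_x,$$

and $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$ are the means and standard deviations of the scores on tests X and Y.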
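A minimal Python sketch of linear equating follows. It assumes that the means and standard deviations are estimated from two observed score samples; the function name and the score lists are hypothetical and serve only as an illustration.

```python
from statistics import mean, pstdev

def linear_equate(x, scores_x, scores_y):
    """Map a score x on test X to the test Y scale by matching standard scores."""
    mu_x, mu_y = mean(scores_x), mean(scores_y)
    sigma_x, sigma_y = pstdev(scores_x), pstdev(scores_y)
    a = sigma_y / sigma_x      # slope A
    b = mu_y - a * mu_x        # intercept B
    return a * x + b

# Hypothetical raw scores of two equivalent examinee groups
scores_x = [10, 12, 14, 16, 18]
scores_y = [20, 25, 30, 35, 40]

# The mean of X is mapped to the mean of Y
print(linear_equate(14, scores_x, scores_y))
```

The population standard deviation (`pstdev`) is used here; with the sample standard deviation the slope is the same whenever both groups have the same size.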

Equipercentile equating: scores on tests X and Y are considered to be
equivalent if their respective percentile ranks in any given group
are equal.
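The idea of matching percentile ranks can be sketched in Python as below. This is a simplified illustration, not a production method: it assumes the midpoint convention for percentile ranks and linear interpolation between observed Y scores, and all names and data are hypothetical.

```python
from bisect import bisect_left, bisect_right

def percentile_rank(score, scores):
    """Percent of scores below `score`, counting half of the ties."""
    s = sorted(scores)
    below = bisect_left(s, score)
    ties = bisect_right(s, score) - below
    return 100.0 * (below + 0.5 * ties) / len(s)

def equipercentile_equate(x, scores_x, scores_y):
    """Return the test Y score whose percentile rank equals that of x on test X."""
    target = percentile_rank(x, scores_x)
    ys = sorted(set(scores_y))
    ranks = [percentile_rank(y, scores_y) for y in ys]
    if target <= ranks[0]:
        return float(ys[0])
    if target >= ranks[-1]:
        return float(ys[-1])
    # Interpolate between the two Y scores whose ranks bracket the target
    hi = bisect_left(ranks, target)
    lo = hi - 1
    frac = (target - ranks[lo]) / (ranks[hi] - ranks[lo])
    return ys[lo] + frac * (ys[hi] - ys[lo])

# Hypothetical scores of the same group on both tests
scores_x = [1, 2, 3, 4, 5]
scores_y = [10, 20, 30, 40, 50]
print(equipercentile_equate(3, scores_x, scores_y))
```

Operational equipercentile equating usually smooths the score distributions first; that step is omitted here.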

Both methods require assumptions
about the identity of the test score
distributions and about the equivalence of
the examinee groups
Equating in CTT does not imply creation of a
common scale
Measuring the same trait: tests of different
content cannot be equated (but can be scaled in a
similar manner).
Invariance of equating results across samples of
examinees
Independence of equating results from the choice of
the reference test

Method of common items: linkage between two
test forms is accomplished by means of a set of
items that are common to both test forms

Method of common persons: linkage between
two test forms is accomplished by means of a
set of persons who respond to both test forms

Combined methods: linkage between two test
forms is accomplished by means of common
items and / or common persons plus common
raters

Internal anchor: each test form has one set of items that is shared
with other forms and another set of items that is unique to this form
External anchor: each test form has an additional set of items that
do not belong to the test forms themselves
Common-person design: all examinees respond to both test
forms.
There are two approaches to this design:

- same group / same time

- same group / different time

Linkage between the two test forms is accomplished by means of a
set of examinees who respond to all items.

Selecting an equating method
Parameter estimation
Transformation of parameters from
different test forms to the same scale
Evaluating the quality of the links between
test forms

Simultaneous calibration: all parameters are
estimated simultaneously in one run of the
estimation software. The data are automatically placed
on the same scale.
Separate calibration: parameters are estimated for
each test form separately. That is, the data are
calibrated in multiple runs of the estimation
software.
Separate calibration may be more difficult to
accomplish because the test developer needs to
transform measures to a common scale
Separate calibration of all test forms, followed by
transformation of the measures onto the common scale

Simultaneous calibration of all test forms, which
places all measures on the common scale

Separate calibration of all test forms with the
difficulty values of the common items anchored, so that
all parameters are placed consecutively on
the common scale
As a rule, separate calibration with a shift transformation is used with
the method of common items, which are called nodal items in this case
Each test form is calibrated separately. As a result, the estimates for each
test form lie on that form's own scale. The scales differ only in
their origins
This difference can be removed by calculating a
location shift
It is desirable to have at least 15-20% nodal items
(some of them can be deleted from the link later).

Choice of a common scale
Selection of nodal items
Calibration of all test forms
Calculating equating constants
Link quality evaluation
Transforming all parameters onto a common scale
$$t_{12} = \frac{1}{l}\sum_{i=1}^{l}\left(\delta_{i1} - \delta_{i2}\right)$$

$t_{12}$: shift constant between test form 1 and test form 2;
$\delta_{i1}$: difficulty estimate of item i in test form 1;
$\delta_{i2}$: difficulty estimate of item i in test form 2;
$l$: the number of common items.

Sometimes other formulas are applied: weighted
mean, dispersion shift, etc.

$\delta_{i2}' = \delta_{i2} + t_{12}$,
where $\delta_{i2}$ is the difficulty estimate for item i in test form 2;
$\delta_{i2}'$ is the difficulty estimate for the same item on the scale of test form 1,
i = 1, ..., k; k is the total number of items in test form 2;

$\theta_{n2}' = \theta_{n2} + t_{12}$,
where $\theta_{n2}$ is the ability estimate for examinee n who responded to the items of
test form 2;
$\theta_{n2}'$ is the ability estimate for the same examinee on the
scale of test form 1, n = 1, ..., N; N is the total number of
examinees who responded to the items of test form 2.

Shifted in this way, the parameter estimates of test form 2 are
placed on the scale of test form 1.
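The shift transformation can be sketched in Python as follows. The sketch assumes that the shift constant is the mean of (form 1 difficulty minus form 2 difficulty) over the nodal items, so that adding it to form 2 estimates expresses them on the form 1 scale. The function names and all numbers are hypothetical.

```python
def shift_constant(nodal_form1, nodal_form2):
    """Mean difference (form 1 minus form 2) of the nodal item difficulties."""
    if not nodal_form1 or len(nodal_form1) != len(nodal_form2):
        raise ValueError("need the same non-empty set of nodal items in both forms")
    diffs = [d1 - d2 for d1, d2 in zip(nodal_form1, nodal_form2)]
    return sum(diffs) / len(diffs)

def shift(estimates, t):
    """Add the shift constant to item difficulties or examinee abilities."""
    return [e + t for e in estimates]

# Hypothetical difficulties (logits) of three nodal items in each form
nodal_form1 = [-0.50, 0.10, 0.70]
nodal_form2 = [-0.20, 0.40, 1.00]
t12 = shift_constant(nodal_form1, nodal_form2)

# All form 2 item difficulties and examinee abilities move by the same constant
form2_items_on_form1 = shift([-1.00, -0.20, 0.40, 1.00, 1.50], t12)
form2_persons_on_form1 = shift([-0.80, 0.30, 1.20], t12)
print(round(t12, 3), [round(d, 2) for d in form2_items_on_form1])
```

After the shift, the nodal items have the same mean difficulty in both forms, which is the property the link relies on.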
Item-within-link (fit analysis of the linking items);

Item-between-link (stability of the item
calibrations between the two test forms):

$$U_i = \frac{\delta_{i2}' - \delta_{i1}}{\sigma_{i12}}, \qquad U_i \sim N(0,1),$$

where $\sigma_{i12}$ is defined by $\sigma_{i12}^2 = \sigma_{i1}^2 + \sigma_{i2}^2$;
$\sigma_{i1}$, $\sigma_{i2}$: standard errors of measurement for item i in the
calibrations of test forms 1 and 2;
$\delta_{i1}$: difficulty estimate for item i in test form 1;
$\delta_{i2}'$: difficulty estimate for the same item from test form 2, shifted onto
the scale of test form 1.
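A short Python sketch of the item-between-link statistic is given below. The input values are the five linking items tabulated in the worked example of this document. The function name is hypothetical, and the order of subtraction (shifted form 2 estimate minus form 1 estimate) is an assumption. The sketch prints magnitudes, because it is the size of the standardized difference, not its sign, that indicates an unstable item.

```python
from math import sqrt

def between_link_u(d_form1, se_form1, d_form2_shifted, se_form2):
    """Standardized difference between two calibrations of one linking item."""
    joint_se = sqrt(se_form1 ** 2 + se_form2 ** 2)
    return (d_form2_shifted - d_form1) / joint_se

# item: (form 1 difficulty, form 1 SE, shifted form 2 difficulty, form 2 SE)
link_items = {
    4: (-1.39, 0.09, -1.368, 0.09),
    6: (-0.93, 0.10, -0.838, 0.09),
    7: (-2.57, 0.10, -2.288, 0.10),
    14: (-0.44, 0.10, -0.618, 0.09),
    20: (0.88, 0.12, 0.662, 0.11),
}

u_values = {item: between_link_u(*vals) for item, vals in link_items.items()}
for item, u in u_values.items():
    print(f"item {item:2d}: |u| = {abs(u):.2f}")
```

Because the tabulated difficulties are rounded, the computed magnitudes can differ from the tabulated ones in the second decimal place.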

Simultaneous calibration: all parameters of all test forms are
estimated simultaneously

It is the simplest approach to equating test
forms or calibrating an item bank because it
requires no subsequent transformation of
the estimated measures or calibrations.
The data are automatically placed on the same
scale in one run of the estimation software

As a rule, separate calibration with anchoring is used with the method of common
items, which are called anchor items in this case
The common items are estimated once, during calibration of
the first test form
During calibration of another test form, the calibration
values of these items are treated as fixed (known)
and are not estimated. As a result, the remaining parameter
estimates are forced onto the same scale as the anchor
items
It is easy to anchor items in most estimation software



IAFILE=*
2 -0.29
4 -1.06
8 -0.49
11 -0.04
17 -0.28
37 -2.20
38 -1.34
*
The entry numbers of the anchor items and their difficulties are
specified (here in Winsteps IAFILE= syntax). These difficulty values are fixed and are not
estimated during calibration of the new test form
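When the anchor values come from an earlier calibration, a block like the one above can be generated rather than typed by hand. The following Python sketch only reproduces the layout of the fragment shown here (entry number, a space, the difficulty, enclosed between `IAFILE=*` and `*`). The function name is hypothetical, and any other formatting options of the estimation software are outside its scope.

```python
def format_iafile(anchors):
    """Render {item entry number: anchored difficulty} as an IAFILE=* block."""
    lines = ["IAFILE=*"]
    for entry, difficulty in sorted(anchors.items()):
        lines.append(f"{entry} {difficulty:.2f}")
    lines.append("*")
    return "\n".join(lines)

# Anchor items and difficulties from the fragment above
anchors = {2: -0.29, 4: -1.06, 8: -0.49, 11: -0.04,
           17: -0.28, 37: -2.20, 38: -1.34}
print(format_iafile(anchors))
```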
Choice of a common scale
Selection of anchor items
Calibration of the test form whose scale is accepted as the
common scale
Sequential calibration of the other test forms with the
difficulty values of the anchor items fixed
Item-within-link fit (fit analysis of the linking items)
If different equating procedures are used, the resulting scales will
differ and cannot be compared directly, because the procedures
select the origin of the scale in different ways.

There are studies (for example, Smith R.M. Applications of
Rasch Measurement. Chicago: MESA Press, 1992) in which all
three procedures are analyzed. The precision of the estimated
examinee and item parameters is approximately the same,
and the correlation between the measures is high.




Each test form has 26 dichotomous items
The two test forms have 6 common items: 4, 6, 7, 14, 20, 24
(23% of the total number of items)
The total number of examinees is 654 for test form 1 and 661 for test
form 2
Winsteps software was used for test calibration
The mean examinee measures are -1.07 and -0.72 logits for test
forms 1 and 2, respectively
The scale of the first test form was chosen as the common scale


        Test form 1               Test form 2
Item    Difficulty   Standard    Difficulty   Standard    Shifted difficulty
number  estimate     error       estimate     error       estimate             u_i
        $\delta_i$   $\sigma_i$  $\delta_i$   $\sigma_i$  $\delta_i'$
4       -1.39        0.09        -1.07        0.09        -1.368               -0.17
6       -0.93        0.1         -0.54        0.09        -0.838                0.69
7       -2.57        0.1         -1.99        0.1         -2.288                2.0
14      -0.44        0.1         -0.32        0.09        -0.618               -1.33
20       0.88        0.12         0.96        0.11         0.662               -1.34
Sum     -4.45                    -2.96                    -4.45
Mean    -0.89                    -0.592                   -0.89


Shift constant $t_{12} = -0.298$.
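The value of the shift constant can be checked by hand from the Sum row of the table. In the lines below, $\delta_{i1}$ and $\delta_{i2}$ denote the form 1 and form 2 difficulty estimates of the five tabulated linking items; this indexing is an assumption made for the check.

```latex
\begin{align*}
t_{12} &= \frac{1}{5}\sum_{i}\left(\delta_{i1} - \delta_{i2}\right)
        = \frac{-4.45 - (-2.96)}{5} = \frac{-1.49}{5} = -0.298,\\
\delta_{i}' &= \delta_{i2} + t_{12},
  \quad\text{e.g. for item 4: } -1.07 + (-0.298) = -1.368 .
\end{align*}
```

The shifted form 2 estimates then sum to -4.45, the same as the form 1 estimates, as the table shows.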

Simultaneous calibration implies creating a common response matrix for both test forms,
containing 1315 examinees and 46 distinct items.

The measures of all examinees and the difficulty values of all items are
placed on a common scale that is centered at the mean difficulty of
all 46 items
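The shape of such a common response matrix can be sketched in Python. The sketch assumes that the six common items carry the same numbers in both forms; the item labels, function names, and the all-zero placeholder responses are hypothetical. Items that an examinee was not given are stored as `None` (missing by design).

```python
COMMON = {4, 6, 7, 14, 20, 24}   # common item numbers from the example

def item_labels(form, n_items=26):
    """Label common items identically in both forms and unique items per form."""
    return [f"C{i:02d}" if i in COMMON else f"F{form}-{i:02d}"
            for i in range(1, n_items + 1)]

def stack_forms(rows1, rows2):
    """Stack two lists of {item label: response} dicts over the union of items."""
    all_rows = rows1 + rows2
    items = sorted({label for row in all_rows for label in row})
    matrix = [[row.get(label) for label in items] for row in all_rows]
    return items, matrix

# Placeholder responses (all zeros) used only to show the dimensions
rows1 = [dict.fromkeys(item_labels(1), 0) for _ in range(654)]
rows2 = [dict.fromkeys(item_labels(2), 0) for _ in range(661)]

items, matrix = stack_forms(rows1, rows2)
print(len(matrix), len(items))   # prints: 1315 46
```

Only the common items have responses from both groups; they are what ties the two blocks of the matrix to one scale.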
Separate calibration with anchoring:
Calibration of test form 1
Calibration of test form 2 with the difficulty values of the anchor
items fixed at their estimates from the first calibration
IAFILE=*
4 -1.39
6 -0.93
7 -2.57
14 -0.44
20 0.88
*
As a result, the examinee measures from both test forms are on the
scale of the first test form

Comparison of the examinee measures from the three
equating procedures revealed approximately
similar results: the correlation is close to 1

The choice of equating procedure is determined by
the design of the real data and the purpose of the research
