Professional Documents
Culture Documents
3
S~.FrF.MB ~ , 1951
LF~ J. CRONBACH
UNIVERSITY OF ILLINOIS
I. Historical Resum~
A n y research based on measurement must be concerned with the
accuracy or dependability or, as we usually call it, reliability of meas-
urement. A reliability coefficient demonstrates w h e t h e r the test de-
signer w a s correct in expecting a certain collection of items to yield
interpretable statements about individual differences (25).
E v e n those investigators who regard reliability as a pale shadow
of the more vital m a t t e r of validity cannot avoid considering the re-
liability of their measures. No validity coefficient and no factor analy-
sis can be interpreted without some appropriate estimate of the mag-
nitude of the e r r o r of measurement. The p r e f e r r e d w a y to find out
h o w accurate one's measures are is to make t w o independent measure-
ments and compare them. In practice, psychologists and educators
have often not had the opportunity to recapture their subjects f o r a
second test. Clinical tests, or those used for vocational guidance, a r e
generally worked into a crowded schedule, and there is always a de-
*The assistance of Dora Damrin and Willard Warrington is gratefully ac-
knowledged. Miss Damrin took major responsibility for the empirical studies re-
ported. This research was supported by the Bureau of Research and Service, Col-
lege of Education.
297
298 PSYCHOMETRIKA
°(
rtt(~,,,~ - - - - 1
,
- - ; (i--l,2,...n). (I)
n i
a - - - - i---- . (2)
n--1 Vt
Here Vt is the variance of test scores, and lz~ is the variance of item
scores after weighting. This formula reduces to (1) when all items
are scored 1 or zero. The variants reported by Dressel (10) for cer-
tain weighted scorings, such as Rights-minus-Wrongs, are also spe-
cial cases of (2), but f o r most data computation directly f r o m (2) is
simpler than b y Dressel's method. H o y t ' s derivation (20) arrives
at a formula identical to (2), although he d r a w s attention to its ap-
plication only to the case w h e r e items are scored 1 or 0. Following
the p a t t e r n of a n y of the other published derivations of (1) (19, 22),
making the same assumptions b u t imposing no limit on the scoring
pattern, will permit one to derive (2).
Since each w r i t e r offering a derivation used his own set of as-
sumptions, and in some cases criticized those used by his predeces-
sors, the precise meaning of the formula became obscured. The origi-
nal derivation unquestionably made much more stringent assumptions
than necessary, which made it seem as if the formula could properly
be applied only to r a r e tests which happened to fit these conditions.
It has generally been stated that a gives a lower bound to "the true
reliability"--whatever that means to t h a t particular writer. In this
paper, w e take formula (2) as given, and make no assumptions re-
garding it. Instead, we proceed in the opposite direction, examining
the properties of a and t h e r e b y arriving at an interpretation.
We introduce the symbol a partly as a convenience. "Kuder-
Richardson F o r m u l a 20" is an a w k w a r d handle f o r a tool t h a t we ex-
pect to become increasingly prominent in the test literature. A second
reason for the symbol is that a is one of a set of six analogous coeffi-
cients (to be designated /~, r , 8, etc.) which deal with such other
300 PSYCHOMETRIKA
TABLE 1
Formulas for Split-Half Coefficients
Formulas Assuming Equal
Entering Data* Covariances Between Formulas Assuming
Half-Tests aa ~ ab
ZAt" ZB~
Tab ag gb 4uaab~ab 2~¢~
OaZ -t- ~b2 "Jr 2aaab'rab 1 + zab
9.A§
o{ o"a o b
2(1 %2+%~)u,.
3All
ut2
4A¶ 4B (=--4A)
ff$ o'd o'd2 Ud2
1 - - - -u| 2 l - - - ~t 2
5A 5B
% .~ ~ 4 (%~ ~ %udr~) 2 (2%2 - - add
4aa~ -]- a d2 - - 4 % o d~'ad 4aa2--Ud2
TABLE 2
Ratio of Spearman-Brown Estimate to More E x a c t Split-Half Estimate of
Coefficient of Equivalence when S.D.'s are Unequal
Ratio of
Half-Test
S.D.'s Correlation Between Half-Tests
(greater/lesser)
.00 .20 .40 .60 .80 1.00
1 1 1 I 1 I 1
I.I 1.005 1.004 1.003 1.003 1.003 1.002
1.2 1.017 1.014 1.012 1.010 1.009 1.008
1.3 1.035 1.029 1.025 1.022 1.020 1.017
1.4 1.057 1.048 1.041 1.036 1.032 1.029
1.5 1.083 1.069 1.060 1.052 1.046 1.042
i V~ C~ C~ C,~ FIR S T
HALFi ~ _~ Ct
V~ C.
,,, ,,
C~ C~ V~
F~QUaE 1
Schematic Division of the Matrix of Item Variances and Covariances.
304 PSYCHOMETRIKA
C~b is the total covariance of all item pairs such t h a t one item is
within ~ and the other is within b ; it is the "between halves" covari-
anee.
Then
C~ = r~b~b ; (7)
C, ----C~ + C~ + C ~ ; (8)
V, ----V. + Vb + 2C.b ----~ V , + 2C, ; and (9)
V~=~Vi
~,
+2C~ and Vb~-~Vi..+2Cb.
~,,
(I0)
n- 1 Vt ~ -- 1
(vt 1 Vt
(13)
n 2C,
(i4)
-- 1 V,
- C~
Cij (15)
,,(,~- I)/2"
Therefore
(16)
v,
L E E J. CRONBACH 305
W e p r o c e e d n o w b y d e t e r m i n i n g t h e m e a n coefficient f r o m all
(2n') !/2 (n' l) 2 possible splits of t h e test. F r o m ( 1 2 ) ,
4 (Jab
'
rt~ - - (17)
V,
In a n y s p l i t , a p a r t i c u l a r C~; h a s a p r o b a b i l i t y of of f a l l i n g
2 ( n - - 1)
into the between-halves covariance C~b. Then over all splits,
(2n') ! n
E C,b - - - - E E C(j ; (i = 1 , 2 , - - - n - - 1 ;
2(n'!)~2(n~1) i s
1--i + 1,..-;n). (18)
But
n(n-- 1) -
~F. C~j :- C~j. (19)
~j 2
(2n') ! n ~--
Cab -- - - C~, (20)
2(n'!)~4
and
__ n2__
C~ ~ - - Cij. (21)
4
From (17),
r--,
4n 2 - n 2 Cij
"r,,- - C,t - - (22)
4V, V,
Therefore
@,t ~ a . (23)
F r o m ( 1 4 ) , w e c a n also w r i t e a in t h e f o r m
Z Z C,i
a-- ; (i,j=l,2,...n;i~:D. (24)
n-- 1 Vc
T h i s i m p o r t a n t r e l a t i o n s t a t e s a c l e a r m e a n i n g f o r a as n / ( n - - l ) t i m e s
the ratio of interitem covariance to total variance. The multiplier
n/(n--1) allows for the proportion of variance in any item which is
due to the same dements as the covariance.
306 PSYCHOMETRIKA
, as a s p e c i a l case o f t h e s p l i t - h a l f c o e f f i c i e n t . N o t only is a a
f u n c t i o n of all the split-half coefficients f o r a t e s t ; it can also be shown
to be a special case of the split-half coefficient.
i f we assume t h a t t h e t e s t is divided into equivalent halves such
t h a t Cry,, (i.e., C,~/n' 2) equals C~j, t h e a s s u m p t i o n s f o r f o r m u l a 2A
still hold. We m a y designate the split-half coefficient f o r this s p l i t t i n g
as "l'tt l .
4 C~b
r. - - ~ (12)
V,
Then
ru ~ -- ------- -- (25)
~-- Vt --V, V,
From (16),
rt~o ~ a. (26)
T h i s a m o u n t s to a p r o o f t h a t a is a n e x a c t d e t e r m i n a t i o n of t h e paral-
lel-form c o r r e l a t i o n w h e n we can assume t h a t t h e m e a n c o v a r i a n c e
between parallel items equals the m e a n c o v a r i a n c e b e t w e e n u n p a i r e d
items. This is t h e least r e s t r i c t i v e a s s u m p t i o n usable in " p r o v i n g "
the K u d e r - R i c h a r d s o n formula.
a as t h e e q u i v a l e n o e o f r a n d o m s a m p l e s o f i t e m s . T h e f o r e g o i n g
d e m o n s t r a t i o n s show t h a t a m e a s u r e s essentially the same t h i n g as
t h e split-half coefficient. I f all t h e splits f o r a t e s t w e r e made, t h e
m e a n of the coefficients obtained would be a . W h e n we m a k e only
one split, and m a k e t h a t split at r a n d o m , we obtain a value s o m e w h e r e
in t h e distribution of which a is t h e mean. I f split-half coefficients a r e
d i s t r i b u t e d m o r e o r less symmetrically, an obtained split-half coeffi-
cient will be h i g h e r t h a n a about as o f t e n as it is lower t h a n a. T h i s
a v e r a g e t h a t is a is based on t h e v e r y best splits a n d also on some
v e r y p o o r splits w h e r e t h e items going into t h e two halves a r e quite
unlike each other.
Suppose we h a v e a u n i v e r s e o f items f o r w h i c h t h e m e a n covari-
ance is t h e same as t h e m e a n covariance w i t h i n t h e given test. T h e n
suppose two tests a r e m a d e b y twice s a m p l i n g n items a t r a n d o m
f r o m t h i s u n i v e r s e w i t h o u t r e p l a c e m e n t , and a d m i n i s t e r e d a t t h e same
sitting. T h e i r c o r r e l a t i o n would be a coefficient o f equivalence. T h e
m e a n o f such coefficients would be t h e same as t h e c o m p u t e d a . a is
t h e r e f o r e an e s t i m a t e of t h e c o r r e l a t i o n expected b e t w e e n t w o t e s t s
d r a w n a t r a n d o m f r o m a pool o f items like t h e items in this test. I t e m s
LEE J . CRONBACH 307
are not selected at random for psychological tests where any differen-
tiation among the items' contents or difficulties permits a planned
selection. Two planned samplings m a y be expected to have higher
correlations t h a n two random samplings, as Kelley pointed out (25).
We shall show t h a t this difference is usually small.
TABLE 3
Summary of D a t a from Repeated Splittings of Mechanical Reasoning Test
(60 it,~-; a ~ .811)
Splits Where
All Splits 1.05 > %/% > .95
There are only 126 possible splits for the morale test, and it is
possible to compute all half-test standard deviations directly from the
item variances and covariances. Of the 126 splits, six were designated
in advance as Type II parallel splits, on the basis of content and an
item analysis of a supplementary sample of papers. Results based on
200 cases appear in Table 4.
TABLE 4
Summary o f D a t a from Repeated Splittinge of Morale Scale
(I0 items; a ~-" .715)
Splits Where
All Splits 1.1 > %/% > .9
The highest and lowest coefficients for the mechanical test differ
by only .08, a difference which would be important only when a very
precise estimate of reliability is needed. The range for the morale
scale is greater (.20), but the probability of obtaining one of the ex-
treme values in sampling is slight. Our findings agree with Clark, that
the variation from split to split is less than the variation expected
from sample to sample for the same split. The standard error of a
Spearman-Brown coefficient based on 200 cases using the same split is
.03 when rtt = .8, .04 when r~t -- .7. The former value compares with
a standard deviation of .02 for all random-split coefficients of the me-
chanical test. The standard error of .04 compares with a standard
deviation of .035 for the 126 coefficients of the morale test.
This bears on Kelley's comment on proposals to obtain a unique
estimate: "A determinate answer would result if the mean for all
possible spilts were gotten, but, even neglecting the labor involved,
this would seem to contravene the judgment of comparability." (25,
p. 79). As our tables show, the splittings where half-test standard
deviations are unequal, which "contravene the judgment of compar-
ability," have coefficients about like those which have equal standard
deviations.
Combining our findings with those of Clark and Cronbach we
have studies of seven tests which seem to show that the variation from
split to split is too small to be of practical importance. Brownell finds
appreciable variation, however, for the four tests he studied. The ap-
parent contradiction is explained by the fact that the former results
applied to tests having fairly large coefficients of equivalence (.70 or
over). Brownell worked with tests whose coefficients were much low-
er, and the larger range of r's does not represent any greater varia-
tion in z values at this lower level.
In Tables 3 and 4, the values obtained from deliberately equated
half-tests differ slightly, but only slightly, from those for random
splits. Where a is .715 for the morale scale, the mean of parallel splits
is .748--a difference of no practical importance. One parallel split
reaches .780, but this split could not have been defended a priori as
more logical than the other planned splits. In Table 3, we find that
neither Type I nor Type II splits averaged more than .01 higher than
a . Here, then, is evidence that the sort of judgment a tester might
make on typical items, knowing their content and difficulty, does not,
contrary to the earlier opinion of Kelley and Cronbach, permit him to
make more comparable half-tests than would be obtained by random
splitting. The data from Cronbach's earlier study agree with this. This
conclusion seems to apply to tests of any length (the morale scale has
312 PSYCttOMETRIKA
only ten items). Where items fall into obviously diverse subgroups in
either content or difficulty, as, say, in the California Test of Mental
Maturity, the tester's j u d g m e n t could provide a better-than-random
split. It is dubious w h e t h e r he could improve on a random division
w i t h i n subtests.
It should be noted t h a t in this empirical s t u d y no a t t e m p t was
made to divide items on the basis of r , , as Gulliksen (18, p. 207-210)
has recently suggested. Provided this is done on a large sample of
cases other t h a n those used to estimate r , , Gulliksen's plan m i g h t
indeed give parallel-split coefficients which are consistently at least a
few points higher t h a n a.
The failure of the data to support our expectation led to a f u r t h e r
study of the problem. We discovered t h a t even tests which seem to
be heterogeneous are often highly s a t u r a t e d w i t h the first factor
among the items. This forces us not only to extend the i n t e r p r e t a t i o n
of a , but also to reexamine certain theories of test design.
Factorial composition of the test variance. To make fully clear
the relations involved, our analytic procedure will be spelled out in
detail. We postulate t h a t the variance of a n y item can be divided
a m o n g k + 1 orthogonal factors (k common w i t h other items and one
unique). Of these, we shall r e f e r to the first, f l , as the general factor,
even though it is possible t h a t some items would have a zero load-
ing on this factor.* Then if f.-~ is the loading of common f a c t o r z on
item i ,
C~j : N' ~ ~,.(f~i flj + f,~ f,j + - " + fk~ fkj). (28)
C, - - Y.~,C,j : N ' ~.. ~. ~, ~j f~i f , + " " + N ' Y. Y~ ,n (rj f., fkj ;
V, ---- N" Y. o~, ( f ' , + . . . f'~, + f'~,, ) + ~ " Y. ~.. ,,, ~j L , f~j
+ . . . + 2N 2 F. ~ .~ uj f ~ f ~ j . (30)
4 i
*This factor may be a so-called primary or reference factor like Verbal, but
it is more likely to be a composite of several such elements which contribute to
every item.
LEE J . CRONBACH 313
5
(37)
~Pv~' 6n + 5"
L~ E I'5 , ----
_ _O. (38)
314 PSYCHOMETRIKA
Note that among the t e r m s making up the variance of any test, the
n u m b e r of terms representing the general f a c t o r is n times the num-
b e r representing item specific and e r r o r factors.
We have seen t h a t the general factor cumulates a v e r y large in-
fluence in the test. This is made even clearer b y F i g u r e 2, w h e r e we
I00
40
|0 - ERROR
tlJ
X 4o
4o
E
a. 2o
io
o I I.... I.............I....... I I I I I I I I I I i I I I I
o ~ ~ 4 4 4 s ? 4 s ~ ~ m ~4 M • ~ ~ m ts zo
NUMBER OF I'IIEMS
plot the trend in variance for a particular case of the above test struc-
ture. Here we set k - 6 , n ~ - - n , n 2 = n a - - n , = n ~ = n 6 - - n / 5 . Then
we assume that each-item h-~Ethe composl%Ton: 9% general factor, 9%
f r o m some one group factor, 827~ unique. F u r t h e r , the unique vari-
ance is divided by 70/12 between e r r o r and specific stable variance. It
is seen t h a t even with unreliable items such as these, which intercor-
relate only .09 or .18, the general factor quickly becomes the predomi-
nant portion of the variance. In the limit, as n becomes indefinitely
large, the general factor is 5,/6 of the variance, and each group factor
is 1/30 of the total variance.
LEE J . CRONBACH 315
Number
Per Cent of of Items Per Cent of Total Test Variance
Pattern Factors Variance in Containing (assuming equal item variances)
Any Item Factor
~-~-1 7~-----5 ~-~--2
5 n-~--100 n - - ~
/k 50 (if present) 10 1 0
~/,.../~ 50 74 48 21 0
.fu~ 41 1
>:.fu~ n 41 12 8 $ 0
21 (-,
;E
~, f., f.,),
1 2 2
n , ~ n(n--l) ~ , n(n--l)
2
Y. ~. ~ ' f.~ -- - - C,, (40)
, ~ n--1
a - - - - 1 . (41)
test authors, these scores correlate .668. Then, by (42), a , the com-
mon-factor concentration, is .80. Turning to the P r i m a r y Mental
Abilities Tests, we have a set of moderate positive correlations re-
ported when these were given to a group of eighth-graders (35). The
question m a y be asked: H o w much would a composite score on these
tests reflect common elements r a t h e r than a hodgepodge of elements
each specific to one subtest ? The intercorrelations suggest t h a t there
is one general factor ~mong the tests. Computing a on the assump-
tion of equal subtest variances, we get .77. The total score is loaded
to this extent with a general intellective factor. Our third illustra-
tion relates to f o u r Air Force scores related to carefulness. E a c h
score is the count of n u m b e r w r o n g on a plotting test. The f o u r scores
have r a t h e r small intercorrelations (15, p. 687), and each score has
such l o w reliability t h a t its use alone as a m e a s u r e of carefulness is
not advisable. The question therefore arises w h e t h e r the tests a r e
enough intereorrelated t h a t the general f a c t o r would cumulate in a
preponderant w a y in their total. The s u m of the six intercorrela-
tions is 1.76. Therefore a is .62. I.e., 62% of the variance in the
equally weighted composite is due to the common f a c t o r among the
tests.
F r o m this approach comes a suggestion f o r obtaining a superior
coefficient of equivalence for the " l u m p y " test. It w a s shown t h a t a
test containing distinct clusters of items might have a parallel-split
coefficient appreciably higher than a . If so, we should divide the test
into subtests, each containing w h a t appears to be a homogeneous
group of items. ~ is computed f o r each subtest separately b y (2).
Then ~ a gives the covariance of each cluster with the opposite clus-
ter in a parallel form, and the eovariance between subtests is an esti-
mate of the covariance of similar pairs '%etween forms." Hence
t j
r, t = ; (i=l,2,...n;j=l,2,...n), (43)
x 2 Vt
Parallel-£orms coefficient
~ r e a t e a t split-half c o e f £ i -
cient
Mean s p l i t - h a l f c o e f f i c i e n t about here-
Coeffic lent of precision
.
i Ch=e 1
P r i n c i p a l common f a c t o r ~
0 E C! error
{I
~,~¢..t, -- (44)
n + (l--n)a
6C
(~J 4c
o
2C
" ~ r t , t ...3o r~,t =.3o - ' "
0 I I I I I I I I I I !
0 20 40 80 80 I00 0 20 40 60 80 800 0 20 40 60 810 ~0o
FIGUE~ 4
Relation of ¢ij to Pl and pj for Several Levels of Correlation.
326 PSYCHOMETRIKA
TABLE 7
Variation in Certain Indices of Interitem Consistency with Changes in Item
Difficulty (Tetrachorie Correlation Held Constant)
p~ .50 .50 .50 .50 .50 .50 .50 .50 .50
pj -->.00 .10 .20 .40 .50 .60 .80 .90 --~1.00
rijte t .30 .30 .30 .30 .30 .30 .80 .30 .30
~j "-'~.00 .14 .17 .19 .19 .19 .17 .14 -'-.00
Hij ---~1.00 .42 .34 .23 .19 .23 .~4 .42 -->1.00
PSYCHOMETRIKA 327
TABLE 8
Comparison of ~ and u for Hypothetical 45-Item Tests With and Without
"Heterogeneity Due to Item Difficulty"
Distribution of Range of
Test Difficulties Pt p~ ~ Diff. a Diff.
A Normal .20 to .80 .50 .181 .011 .909 .005
A' Peaked .50 .50 .192 .914
B Normal .10to.90 .50 .176 .016 .906 .008
B' Peaked .50 .50 .192 .914
C Normal .50 to .90 .70 .170 .011
,902 .007
C' Peaked .70 .70 .181 .909
D Rectangular .10 to .90 .50 .153 .039
.892 .02°~
D' Peaked .50 .50 .192 .914
it is believed t h a t t h e t e s t d e s i g n e r should s e e k i n t e r i t e m c o n s i s t e n c y ,
a n d j u d g e t h e e f f e c t i v e n e s s of h i s e f f o r t s b y t h e coefficient a , t h e p u r e
scale should n o t b e v i e w e d a s a n ideal. I t s h o u l d b e r e m e m b e r e d t h a t
T u c k e r (36) a n d B r o g d e n (1) h a v e d e m o n s t r a t e d t h a t i n c r e a s e s in
i n t e r n a l c o n s i s t e n c y m a y lead to d e c r e a s e s in t h e p r o d u c t - m o m e n t
v a l i d i t y coefficient w h e n t h e s h a p e of t h e t e s t - s c o r e d i s t r i b u t i o n d i f -
fers from that of the criterion distribution.
1. F o r m u l a s f o r s p l i t - h a l f coefficients of e q u i v a l e n c e a r e c o m -
p a r e d , a n d those o f R u l o n a n d G u t t m a n a r e a d v o c a t e d f o r p r a c t i c a l
use rather than the Spearman-Brown formula.
2. a , t h e g e n e r a l f o r m u l a of w h i c h K u d e r - R i c h a r d s o n f o r m u l a
20 is a special case, is f o u n d to h a v e t h e f o l l o w i n g i m p o r t a n t m e a n -
ings:
(a) a is t h e m e a n o f all possible s p l i t - h a l f coefficients.
(b) a is the value expected when two ral,tdom samples of
items from a pool like those in the given test are correlated.
(c) a is a lower bound for the coefficientof precision (the
instantaneous accuracy of this test with these particular items), a
is also a lower bound for coefficientsof equivalence obtained by simul-
taneous a d m i n i s t r a t i o n o f t w o t e s t s h a v i n g m a t c h e d items. B u t f o r
r e a s o n a b l y l o n g t e s t s n o t divisible i n t o a f e w f a c t o r i a l l y - d i s t i n c t s u b -
t e s t s , a is n e a r l y e q u a l t o " p a r a l l e l - s p l i t " a n d " p a r a l l e l - f o r m s " coeffi-
c i e n t s of equivalence.*
(d) a e s t i m a t e s , a n d is a l o w e r b o u n d to, t h e p r o p o r t i o n o f
t e s t v a r i a n c e a t t r i b u t a b l e to c o m m o n f a c t o r s a m o n g t h e i t e m s . T h a t
is, it is a n i n d e x o f c o m m o n - f a c t o r c o n c e n t r a t i o n . T h i s i n d e x s e r v e s
p u r p o s e s c l a i m e d f o r indices o f h o m o g e n e i t y , a m a y b e a p p l i e d b y a
modified t e c h n i q u e to d e t e r m i n e t h e c o m m o n - f a c t o r c o n c e n t r a t i o n
among a battery of subtests.
*w. G. Madow suggests that the amount of disagreement between two ran-
dom or two planned samples of items from a larger population of items could be
anticipated from sampling theory. The person's score on a test is a sample mean,
intended to estimate the population mean or "true score" over all items. The vari-
ance of such a mean from one sample to another decreases rapidly as the sample
is enlarged by lengthening the test, whether samples are drawn at random or are
drawn after stratifying the universe as to difIlculty and content. The conditions
under which the radom splits correlate about as highly as parallel splits are those
in which stratified sampling has comparatively little advantage. Madows comment
has implications also for the preparation of comparable forms of tests and for
developing objective methods of selecting a sample of items to represent a larger
set of items so that the variance of the difference between the score based on t h e
sample and the score based on the universe of items is as small as possible.
332 PSYCHOMETRIKA